├── .gitattributes
├── architecture.png
├── first_place_solution_doc.docx
├── requirements.txt
├── .gitignore
└── README.md

/.gitattributes:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bakarys01/hydropower-kalam-winner/HEAD/architecture.png
--------------------------------------------------------------------------------
/first_place_solution_doc.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bakarys01/hydropower-kalam-winner/HEAD/first_place_solution_doc.docx
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
numpy
pandas
matplotlib
seaborn
scikit-learn
lightgbm
tqdm
bayesian-optimization
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Jupyter notebook checkpoints
.ipynb_checkpoints/

# Python cache
__pycache__/
*.pyc

# Virtual environments
venv/
.env/
env/

# Data files
*.csv
*.xlsx
*.xls
*.xlsm
*.xlsb

# Datasets folder
datasets/
datasets/*
output/
output/*

# Output files
*.pkl
*.h5
*.model

# Logs
logs/
*.log

# OS specific files
.DS_Store
Thumbs.db

# IDE specific files
.idea/
.vscode/
*.swp
*.swo
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# First Place Solution
**15 Apr 2025, 16:44 · 11 min read**

---

## 1. Overview and Objectives
**Purpose:** This document describes the winning solution for the *IBM SkillsBuild Hydropower Climate Optimisation Challenge*. The goal is to predict daily energy consumption for multiple consumer devices in Kalam, Pakistan. The solution applies comprehensive feature engineering, strategic data segmentation, and advanced ensemble modeling to handle the region's seasonal and climate-related complexities.

**Objectives and Expected Outcomes:**
1. **Accurate kWh Predictions**: Achieve low RMSE for daily consumption forecasts.
2. **Seasonality and Weather Integration**: Incorporate Pakistan's distinctive seasonal and weather factors.
3. **Robustness**: Effectively manage zero-consumption periods and device filtering.
4. **Efficient Inference Pipeline**: Offer a systematic path from raw data transformation to final predictions.

---

## 🔥 Summary
This repository contains the winning solution for the IBM SkillsBuild Hydropower Climate Optimisation Challenge. It provides a complete end-to-end pipeline: from data preprocessing and feature engineering, through model training and ensemble optimization, to inference and submission file generation.

**Key Results**
- **Private Leaderboard RMSE:** 4.312706649
- **Final Rank:** 🥇 1st Place

---

## 🧹 Data Preprocessing & Cleaning
- **Device Filtering:** Excluded consumer devices and users not present in the test set to avoid leakage and noise.
- **Aggregation:** Converted raw 5-minute readings into daily statistics (sum, mean, std, min, max) for kWh, voltage (red/blue/yellow), current, and power factor — see the sketch below.
- **Offline Period Detection:** Marked weeks with zero consumption across all days as "offline" and filtered them out to focus on meaningful usage patterns.
- **Final Dataset:** Saved as `output/filtred_online_days_enriched.csv`, containing only "online" days for the users in the test set.
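To make the aggregation step concrete, here is a minimal sketch. It is not the notebook's exact code: column names such as `device_id`, `timestamp`, `kwh`, and `voltage_red` are assumptions about the raw schema in `datasets/Data/Data.csv`; adapt them as needed.

```python
# Hedged sketch: collapse raw 5-minute readings into daily statistics.
# Column names are illustrative assumptions, not the dataset's real schema.
import pandas as pd

raw = pd.read_csv("datasets/Data/Data.csv", parse_dates=["timestamp"])

daily = (
    raw.assign(date=raw["timestamp"].dt.date)   # 5-minute stamps -> calendar day
       .groupby(["device_id", "date"])
       .agg({
           "kwh": ["sum", "mean", "std", "min", "max"],
           "voltage_red": ["mean", "min", "max"],
           "power_factor": ["mean"],
       })
)
# Flatten the MultiIndex columns: ('kwh', 'sum') -> 'kwh_sum', etc.
daily.columns = ["_".join(col) for col in daily.columns]
daily = daily.reset_index()
```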
---

## 🌦️ Feature Engineering (Comprehensive Climate Integration)
- **Climate Data Merge:** Joined daily aggregates of temperature, dewpoint, precipitation, snowfall, snow cover, and wind components.
- **Cyclical Temporal Features:** Sine/cosine encodings for day-of-week, day-of-year, month, week, and day-in-season.
- **Pakistan-Specific Features:** Flags for public holidays and Ramadan periods; season indicators tuned to Kalam's climate.
- **Advanced Trends:** Exponential moving averages, volatility, acceleration, and extreme-event indicators for temperature.
- **Heating/Cooling Metrics:** Heating degree days, cooling degree days, and temperature-dewpoint differentials.
- **Interactions:** Temperature × weekday interactions and weather × consumption effects.

---

## 🧪 Strategic Data Segmentation
The training data was divided into four temporal segments to capture distinct seasonal behaviors:

| Segment | Periods | Rationale |
| ------- | --------------------------------------- | ---------------------------------------------------- |
| **Data1** | Aug–Sep 2024 & Oct 2023 (late summer/early fall) | High consumption → model learns peak usage patterns |
| **Data2** | Nov–Dec 2023 & Jul 2024 (winter & mid-summer) | Moderate consumption periods |
| **Data3** | All other intervals | Mostly zero/minimal consumption |
| **Data4** | Entire dataset | Global model for overarching patterns |

---

## 🧠 Advanced Ensemble Modeling
1. **Seven LightGBM Configurations** per segment:
   - **Precise:** deep, conservative trees (max_depth=8).
   - **Feature-Selective:** aggressive feature sampling.
   - **Robust:** outlier-resistant (min_data_in_leaf=20).
   - **Deep Forest:** very deep trees with many estimators.
   - **Highly Regularized:** strong L1/L2 penalties.
   - **Fast Learner:** high learning rate for rapid convergence.
   - **Balanced:** tuned for the bias-variance tradeoff.
2. **Cross-Validation:** 5-fold CV to evaluate each base model.
3. **Bayesian Optimization:** Searched for optimal ensemble weights per fold instead of training a meta-model (see the sketch below).
4. **Multi-Level Blending:** Combined segment-specific ensembles with global weighting.
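To illustrate step 3, here is a hedged sketch of the weight search using the `bayesian-optimization` package from `requirements.txt`. It is not the notebook's exact code: the toy targets and predictions, the bounds, and the iteration counts are illustrative assumptions.

```python
# Hedged sketch: Bayesian search for ensemble weights that minimize RMSE.
# Toy data stands in for real out-of-fold targets and base-model predictions.
import numpy as np
from bayes_opt import BayesianOptimization

rng = np.random.default_rng(42)
y_true = rng.gamma(2.0, 3.0, size=500)                        # stand-in fold targets
preds = [y_true + rng.normal(0, s, 500) for s in (1, 2, 3)]   # stand-in base models

def neg_rmse(w1, w2, w3):
    """Blend three base models; weights are normalized to sum to 1."""
    w = np.array([w1, w2, w3])
    w = w / (w.sum() + 1e-9)
    blend = sum(wi * p for wi, p in zip(w, preds))
    return -np.sqrt(np.mean((y_true - blend) ** 2))           # maximize => minimize RMSE

optimizer = BayesianOptimization(
    f=neg_rmse,
    pbounds={"w1": (0, 1), "w2": (0, 1), "w3": (0, 1)},
    random_state=42,
)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max["params"])                                # best raw weights
```

In the full solution, this search would run per fold and per segment, over seven base models rather than three.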
---

## 🔗 Dataset Access via Kaggle
Due to GitHub file size limits, raw CSV/XLSX files are **not** included here. Please download them from Kaggle:

1. Visit:
   👉 [IBM SkillsBuild Hydropower Climate Optimisation (Updated)](https://www.kaggle.com/datasets/muhammadqasimshabbir/ibmskillsbuildhydropowerclimateoptimisationupdated)
2. Click **Download All**.
3. Unzip and place the files into this repo under:
```
datasets/
├── Data/
│   └── Data.csv
├── SampleSubmission.csv
└── Climate Data/
    └── Kalam Climate Data.xlsx
```

---

## 📂 Repository Structure
```
.
├── datasets/                    # Kaggle data (not committed)
├── output/                      # Generated data, models, plots
├── first_place_solution.ipynb   # Full notebook with code & docs
├── final_submission.csv         # Submission file matching leaderboard
├── requirements.txt             # Python dependencies
└── README.md                    # This documentation
```

---

## 🛠️ How to Run
1. Clone the repo.
2. Download and place the dataset as described above.
3. Create and activate a Conda/Python environment, then install the dependencies:
```bash
pip install -r requirements.txt
```
4. Open `first_place_solution.ipynb` in Jupyter/Colab.
5. Run the cells top to bottom; full training takes ~25–30 minutes.

---

## 📈 Performance Metrics
- **Public Leaderboard RMSE:** 5.454981323
- **Private Leaderboard RMSE:** 4.312706649
- **Ensemble Improvement:** Bayesian weighting improved RMSE by ~4.2%

---

## 🕒 Run Time
- **Data preprocessing & feature engineering:** ~3–5 minutes
- **Model training (4 segments × 7 configs × 5-fold CV):** ~20–25 minutes
- **Inference & submission export:** < 1 minute

---

## ❓ Additional Notes
- **Reproducibility:** `seed_everything(42)` ensures consistent results (a sketch is given at the end of this document).
- **Logging:** Comprehensive logging for each stage (ETL, training, inference).
- **Error Handling:** Fallback models in case of training errors; robust NaN-column filtering.

---

## 📞 Contact
Bakary Sidibé
✉️ bakarysidibe1995@gmail.com
🔗 [LinkedIn](https://www.linkedin.com/in/bakary-sidibe-256419111/)

---

*Thank you for reviewing this solution!*
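---

**Appendix: reproducibility helper.** A minimal sketch of what a `seed_everything` helper typically looks like; this is a hypothetical version, and the notebook's actual implementation may differ (the LightGBM models also take their own seed parameters).

```python
# Hypothetical seed_everything sketch; the notebook's real helper may differ.
import os
import random

import numpy as np

def seed_everything(seed: int = 42) -> None:
    """Pin the common sources of randomness used by the pipeline."""
    random.seed(seed)                          # Python's built-in RNG
    np.random.seed(seed)                       # NumPy's global RNG
    os.environ["PYTHONHASHSEED"] = str(seed)   # convention; hash randomization
                                               # is fixed at interpreter startup

seed_everything(42)
```
--------------------------------------------------------------------------------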