├── fig └── certificate.png ├── requirements.txt ├── .gitignore ├── LICENSE ├── dataset └── data_source.md ├── README.md └── cmi-piu-silver-medal-solution.ipynb /fig/certificate.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chenfeng-huang/Kaggle_Silver_Medal_Solutioun_CMI-PIU/HEAD/fig/certificate.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | pandas 3 | polars 4 | scikit-learn 5 | lightgbm 6 | xgboost 7 | catboost 8 | scipy 9 | matplotlib 10 | seaborn 11 | tqdm 12 | colorama 13 | ipython 14 | ipykernel 15 | pyarrow -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Ignore large dataset files 2 | 3 | dataset/train.csv 4 | dataset/test.csv 5 | dataset/series_train.parquet/ 6 | dataset/series_test.parquet/ 7 | dataset/sample_submission.csv 8 | 9 | # Ignore catboost info 10 | catboost_info/* -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Chenfeng Huang(AIrick_H) 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /dataset/data_source.md: -------------------------------------------------------------------------------- 1 | ## Data source and usage 2 | 3 | Data for this project comes from the Kaggle competition [Child Mind Institute — Problematic Internet Use](https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/data). 4 | 5 | ### What to download 6 | 7 | - `train.csv`, `test.csv`, `sample_submission.csv` 8 | - `series_train.parquet/` and `series_test.parquet/`, each containing per‑participant folders like `id=/part-0.parquet` 9 | 10 | ### Where to place the files 11 | 12 | Place all files under `dataset/` with the following layout: 13 | 14 | ``` 15 | dataset/ 16 | ├─ train.csv 17 | ├─ test.csv 18 | ├─ sample_submission.csv 19 | ├─ series_train.parquet/ 20 | │ └─ id=/part-0.parquet 21 | └─ series_test.parquet/ 22 | └─ id=/part-0.parquet 23 | ``` 24 | 25 | ### Git policy (large files) 26 | 27 | This repository ignores competition data files to respect size and licensing constraints: 28 | 29 | - Ignored: `dataset/train.csv`, `dataset/test.csv`, `dataset/sample_submission.csv`, `dataset/series_train.parquet/`, `dataset/series_test.parquet/` 30 | - Tracked: `dataset/data_source.md` 31 | 32 | See `.gitignore` for details. 33 | 34 | ### Competition data rules (summary) 35 | 36 | - The competition data consists of public and private test sets; which is which is not disclosed to participants. 
37 | - Access and use are allowed for non‑commercial purposes only (participation, research, education) during the competition; terms may change thereafter. 38 | - Phenotypic/tabular survey data are de‑identified. You must not redistribute the data, attempt re‑identification, or probe the test labels. Report any PII findings to organizers via Kaggle forums. 39 | - Keep data secure; do not share it with non‑participants. Notify Kaggle of any unauthorized access or transmission. 40 | - External data may be used only if it is publicly available, equally accessible to all participants, and free of charge, and all other competition rules still apply. 41 | 42 | ### Citation 43 | 44 | If you reference or use the dataset, include the following citation: 45 | 46 | “CMI 2024 Problematic Internet Use Detection Challenge” 47 | 48 | Adam Santorelli, Arianna Zuanazzi, Michael Leyden, Logan Lawler, Maggie Devkin, Yuki Kotani, and Gregory Kiar. Child Mind Institute — Problematic Internet Use. https://kaggle.com/competitions/child-mind-institute-problematic-internet-use, 2024. Kaggle. 
49 | 50 | BibTeX: 51 | 52 | ```bibtex 53 | @misc{child-mind-institute-problematic-internet-use, 54 | author = {Adam Santorelli and Arianna Zuanazzi and Michael Leyden and Logan Lawler and Maggie Devkin and Yuki Kotani and Gregory Kiar}, 55 | title = {Child Mind Institute — Problematic Internet Use}, 56 | year = {2024}, 57 | howpublished = {\url{https://kaggle.com/competitions/child-mind-institute-problematic-internet-use}}, 58 | note = {Kaggle} 59 | } 60 | ``` -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Child Mind Institute — Problematic Internet Use (CMI‑PIU) Solution 2 | ![Python](https://img.shields.io/badge/python-3.10+-blue.svg) 3 | 4 | 5 | 6 | This solution was developed for the Kaggle competition [Child Mind Institute — Problematic Internet Use](https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use), where participants predict the severity category `sii ∈ {0,1,2,3}` using demographics, clinical/questionnaire data, and wearable time‑series signals. The approach combines tabular features with per‑participant time‑series summary statistics and trains tree‑based regressors with out‑of‑fold validation and threshold optimization for the Quadratic Weighted Kappa (QWK) metric. 7 | 8 | Our work earned a Silver Medal. The repository contains an end‑to‑end notebook that reproduces training and inference and generates a valid `submission.csv`. 🥈 9 | 10 | For a detailed walkthrough of the solution and insights into the modeling process, see my [blog post](https://chenfenghuang.info/2024/12/20/Kaggle-CMI-PIU/). 
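As a rough illustration of the per-participant time-series summary statistics mentioned above, the sketch below flattens `DataFrame.describe()` into one fixed-length feature vector. The channel names `x` and `y`, the helper name `summarize_series`, and the synthetic values are assumptions made for illustration; the notebook reads each participant's parquet file instead.

```python
import numpy as np
import pandas as pd


def summarize_series(df: pd.DataFrame) -> np.ndarray:
    """Flatten DataFrame.describe() into a single feature vector (hypothetical helper)."""
    # describe() returns 8 rows per numeric column:
    # count, mean, std, min, 25%, 50%, 75%, max
    return df.describe().to_numpy().reshape(-1)


# Synthetic two-channel recording standing in for one participant (not competition data).
rng = np.random.default_rng(0)
series = pd.DataFrame({"x": rng.normal(size=100), "y": rng.normal(size=100)})

features = summarize_series(series)
# Generic column names in the same spirit as the notebook's stat_0, stat_1, ...
feature_names = [f"stat_{i}" for i in range(len(features))]
```

Because the flattening is row-major, the vector interleaves channels: both counts come first, then both means, and so on. Every participant must therefore expose the same channels in the same order for the columns to line up.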
11 | ![CMI‑PIU — Silver Medal](./fig/certificate.png) 12 | 13 | ## Competition Overview 14 | 15 | ### Competition Introduction 16 | 17 | The task is to build a predictive model for `sii` (four ordered categories) using a mixture of tabular data (demographics, physical health, and questionnaires such as SDS/PCIAT) and multivariate time‑series recorded per participant. The evaluation metric is **Quadratic Weighted Kappa (QWK)**, which measures agreement between predicted classes and ground truth while penalizing larger class discrepancies more heavily. 18 | 19 | ### Competition Background 20 | 21 | Problematic Internet Use is an emerging mental‑health concern. The dataset couples questionnaire and clinical information with wearable time‑series, enabling models that can leverage both static and dynamic signals. A practical solution must align columns across train/test, summarize long time‑series efficiently, and handle missing/categorical features robustly while optimizing directly for the ordered‑class QWK objective. 22 | 23 | ## Solution Overview 24 | 25 | This solution comprises three components: 26 | 27 | 1. **Time‑Series Feature Extraction**: Aggregate each participant’s parquet time‑series into compact summary statistics via `DataFrame.describe()` (e.g., count, mean, std, quantiles, max) per channel. 28 | 2. **Tabular Fusion and Preprocessing**: Merge time‑series stats with tabular features by `id`, fill and encode season‑type categoricals, and align train/test columns. 29 | 3. **Modeling and Thresholding**: Train tree‑based regressors with 5‑fold stratified CV on `sii`, ensemble predictions where helpful, and optimize three thresholds to discretize continuous predictions into {0,1,2,3} maximizing QWK. 30 | 31 | The final output is a two‑column `submission.csv` with `id,sii`. 
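The thresholding step in component 3 can be sketched as follows. This is a minimal example on synthetic predictions, assuming `scipy.optimize.minimize` with Nelder-Mead and scikit-learn's `cohen_kappa_score`, which the notebook also imports. The helper names `apply_thresholds` and `negative_qwk` are hypothetical, and unlike the notebook this sketch sorts the cut points before applying them.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import cohen_kappa_score


def apply_thresholds(preds, thresholds):
    """Map continuous predictions to classes 0..3 using three cut points."""
    # np.digitize counts how many (sorted) cut points each prediction has reached
    return np.digitize(preds, np.sort(thresholds))


def negative_qwk(thresholds, y_true, preds):
    """Objective for the optimizer: minimizing -QWK maximizes QWK."""
    rounded = apply_thresholds(preds, thresholds)
    return -cohen_kappa_score(y_true, rounded, weights="quadratic")


# Synthetic stand-in for out-of-fold regression outputs (not competition data).
rng = np.random.default_rng(42)
y_true = rng.integers(0, 4, size=500)
oof_preds = 0.6 * y_true + rng.normal(scale=0.5, size=500)  # deliberately compressed scale

x0 = [0.5, 1.5, 2.5]  # plain rounding boundaries as the starting point
result = minimize(negative_qwk, x0=x0, args=(y_true, oof_preds), method="Nelder-Mead")

baseline_qwk = -negative_qwk(x0, y_true, oof_preds)
tuned_qwk = -result.fun
tuned_classes = apply_thresholds(oof_preds, result.x)
```

The objective is piecewise constant in the thresholds, so a derivative-free method such as Nelder-Mead is a natural fit, although it may stop at a local plateau. Thresholds tuned on out-of-fold predictions are then reused unchanged on the averaged test predictions.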
32 | 33 | 34 | ## How to Reproduce 35 | 36 | ### Environment Setup 37 | 38 | - Python 3.10+ recommended 39 | - Install dependencies: 40 | 41 | ```bash 42 | pip install -r requirements.txt 43 | ``` 44 | 45 | Key libraries: `numpy`, `pandas`, `scikit-learn`, `lightgbm`, `xgboost`, `catboost`, `scipy`, `pyarrow`, `tqdm`. 46 | 47 | ### Data Layout 48 | 49 | Place the competition data under `dataset/`: 50 | 51 | ``` 52 | dataset/ 53 | ├─ train.csv 54 | ├─ test.csv 55 | ├─ sample_submission.csv 56 | ├─ series_train.parquet/ 57 | │ └─ id=/part-0.parquet 58 | └─ series_test.parquet/ 59 | └─ id=/part-0.parquet 60 | ``` 61 | 62 | ### Data Source and Citation 63 | 64 | - Dataset and rules: see `dataset/data_source.md` and the Kaggle data page: https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/data 65 | 66 | ```bibtex 67 | @misc{child-mind-institute-problematic-internet-use, 68 | author = {Adam Santorelli and Arianna Zuanazzi and Michael Leyden and Logan Lawler and Maggie Devkin and Yuki Kotani and Gregory Kiar}, 69 | title = {Child Mind Institute — Problematic Internet Use}, 70 | year = {2024}, 71 | howpublished = {\url{https://kaggle.com/competitions/child-mind-institute-problematic-internet-use}}, 72 | note = {Kaggle} 73 | } 74 | ``` 75 | 76 | ### Training and Inference 77 | 78 | - Launch Jupyter from the repository root and open `cmi-piu-silver-medal-solution.ipynb`. 79 | 80 | - Run all cells to train, validate, and generate `submission.csv` in the repository root. 81 | 82 | ## Author 83 | 84 | Maintainer: Chenfeng Huang - [Kaggle](https://www.kaggle.com/alrickh) 85 | 86 | For questions, please open an issue or discussion in this repository. 87 | 88 | ## License 89 | 90 | Distributed under the MIT License; see `LICENSE`. 
91 | 92 | -------------------------------------------------------------------------------- /cmi-piu-silver-medal-solution.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "492a992d", 7 | "metadata": { 8 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", 9 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5", 10 | "execution": { 11 | "iopub.execute_input": "2025-01-01T15:25:36.247831Z", 12 | "iopub.status.busy": "2025-01-01T15:25:36.247465Z", 13 | "iopub.status.idle": "2025-01-01T15:25:42.175421Z", 14 | "shell.execute_reply": "2025-01-01T15:25:42.174665Z" 15 | }, 16 | "papermill": { 17 | "duration": 5.935686, 18 | "end_time": "2025-01-01T15:25:42.177081", 19 | "exception": false, 20 | "start_time": "2025-01-01T15:25:36.241395", 21 | "status": "completed" 22 | }, 23 | "tags": [] 24 | }, 25 | "outputs": [], 26 | "source": [ 27 | "import os\n", 28 | "import warnings\n", 29 | "from concurrent.futures import ThreadPoolExecutor\n", 30 | "\n", 31 | "import numpy as np\n", 32 | "import pandas as pd\n", 33 | "from sklearn.model_selection import StratifiedKFold\n", 34 | "from sklearn.impute import SimpleImputer, KNNImputer\n", 35 | "from sklearn.pipeline import Pipeline\n", 36 | "\n", 37 | "from sklearn.base import clone\n", 38 | "from sklearn.metrics import cohen_kappa_score\n", 39 | "from lightgbm import LGBMRegressor\n", 40 | "from xgboost import XGBRegressor\n", 41 | "from catboost import CatBoostRegressor\n", 42 | "from sklearn.ensemble import VotingRegressor, RandomForestRegressor, GradientBoostingRegressor\n", 43 | "\n", 44 | "from scipy.optimize import minimize\n", 45 | "\n", 46 | "\n", 47 | "from tqdm import tqdm\n", 48 | "from IPython.display import clear_output\n", 49 | "\n", 50 | "warnings.filterwarnings('ignore') \n", 51 | "pd.options.display.max_columns = None \n", 52 | "\n" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 
| "id": "abb90b3b", 58 | "metadata": { 59 | "papermill": { 60 | "duration": 0.003826, 61 | "end_time": "2025-01-01T15:25:42.185401", 62 | "exception": false, 63 | "start_time": "2025-01-01T15:25:42.181575", 64 | "status": "completed" 65 | }, 66 | "tags": [] 67 | }, 68 | "source": [ 69 | "## Method 1" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 2, 75 | "id": "232dca9d", 76 | "metadata": { 77 | "execution": { 78 | "iopub.execute_input": "2025-01-01T15:25:42.194061Z", 79 | "iopub.status.busy": "2025-01-01T15:25:42.193611Z", 80 | "iopub.status.idle": "2025-01-01T15:25:42.264763Z", 81 | "shell.execute_reply": "2025-01-01T15:25:42.263859Z" 82 | }, 83 | "papermill": { 84 | "duration": 0.077324, 85 | "end_time": "2025-01-01T15:25:42.266462", 86 | "exception": false, 87 | "start_time": "2025-01-01T15:25:42.189138", 88 | "status": "completed" 89 | }, 90 | "tags": [] 91 | }, 92 | "outputs": [], 93 | "source": [ 94 | "train = pd.read_csv('dataset/train.csv')\n", 95 | "test = pd.read_csv('dataset/test.csv')\n", 96 | "sample = pd.read_csv('dataset/sample_submission.csv')\n", 97 | "def process_file(filename, dirname):\n", 98 | "    df = pd.read_parquet(os.path.join(dirname, filename, 'part-0.parquet'))\n", 99 | "    df.drop('step', axis=1, inplace=True)\n", 100 | "    return df.describe().values.reshape(-1), filename.split('=')[1]  # (flattened stats, participant id)\n", 101 | "def load_time_series(dirname) -> pd.DataFrame:\n", 102 | "    ids = os.listdir(dirname)  # one 'id=...' folder per participant\n", 103 | "\n", 104 | "    with ThreadPoolExecutor() as executor:\n", 105 | "        results = list(tqdm(executor.map(lambda fname: process_file(fname, dirname), ids), total=len(ids)))\n", 106 | "\n", 107 | "    stats, indexes = zip(*results)\n", 108 | "\n", 109 | "    df = pd.DataFrame(stats, columns=[f\"stat_{i}\" for i in range(len(stats[0]))])\n", 110 | "    df['id'] = indexes\n", 111 | "    return df" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 3, 117 | "id": "49aeba17", 118 | "metadata": { 119 | "execution": { 
120 | "iopub.execute_input": "2025-01-01T15:25:42.275593Z", 121 | "iopub.status.busy": "2025-01-01T15:25:42.275346Z", 122 | "iopub.status.idle": "2025-01-01T15:26:51.501872Z", 123 | "shell.execute_reply": "2025-01-01T15:26:51.500865Z" 124 | }, 125 | "papermill": { 126 | "duration": 69.232586, 127 | "end_time": "2025-01-01T15:26:51.503352", 128 | "exception": false, 129 | "start_time": "2025-01-01T15:25:42.270766", 130 | "status": "completed" 131 | }, 132 | "tags": [] 133 | }, 134 | "outputs": [ 135 | { 136 | "name": "stderr", 137 | "output_type": "stream", 138 | "text": [ 139 | "100%|██████████| 111/111 [00:02<00:00, 54.32it/s]\n", 140 | "100%|██████████| 2/2 [00:00<00:00, 22.82it/s]\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "train_ts = load_time_series(\"dataset/series_train.parquet\") \n", 146 | "test_ts = load_time_series(\"dataset/series_test.parquet\") \n", 147 | "\n", 148 | "time_series_cols = train_ts.columns.tolist()\n", 149 | "time_series_cols.remove(\"id\") \n", 150 | "\n", 151 | "train = pd.merge(train, train_ts, how=\"left\", on='id') \n", 152 | "test = pd.merge(test, test_ts, how=\"left\", on='id') \n", 153 | "\n", 154 | "train = train.drop('id', axis=1) \n", 155 | "test = test.drop('id', axis=1) " 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 4, 161 | "id": "100a40dc", 162 | "metadata": { 163 | "execution": { 164 | "iopub.execute_input": "2025-01-01T15:26:51.540481Z", 165 | "iopub.status.busy": "2025-01-01T15:26:51.540155Z", 166 | "iopub.status.idle": "2025-01-01T15:26:51.567867Z", 167 | "shell.execute_reply": "2025-01-01T15:26:51.566962Z" 168 | }, 169 | "papermill": { 170 | "duration": 0.047922, 171 | "end_time": "2025-01-01T15:26:51.569613", 172 | "exception": false, 173 | "start_time": "2025-01-01T15:26:51.521691", 174 | "status": "completed" 175 | }, 176 | "tags": [] 177 | }, 178 | "outputs": [], 179 | "source": [ 180 | "# Select Relevant Features and Handle Missing Values\n", 181 | "featuresCols = 
['Basic_Demos-Enroll_Season', 'Basic_Demos-Age', 'Basic_Demos-Sex',\n", 182 | " 'CGAS-Season', 'CGAS-CGAS_Score', 'Physical-Season', 'Physical-BMI',\n", 183 | " 'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference',\n", 184 | " 'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP',\n", 185 | " 'Fitness_Endurance-Season', 'Fitness_Endurance-Max_Stage',\n", 186 | " 'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec',\n", 187 | " 'FGC-Season', 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND',\n", 188 | " 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU',\n", 189 | " 'FGC-FGC_PU_Zone', 'FGC-FGC_SRL', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR',\n", 190 | " 'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone', 'BIA-Season',\n", 191 | " 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI',\n", 192 | " 'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW', 'BIA-BIA_FFM',\n", 193 | " 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num',\n", 194 | " 'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST', 'BIA-BIA_SMM',\n", 195 | " 'BIA-BIA_TBW', 'PAQ_A-Season', 'PAQ_A-PAQ_A_Total', 'PAQ_C-Season',\n", 196 | " 'PAQ_C-PAQ_C_Total', 'SDS-Season', 'SDS-SDS_Total_Raw',\n", 197 | " 'SDS-SDS_Total_T', 'PreInt_EduHx-Season',\n", 198 | " 'PreInt_EduHx-computerinternet_hoursday', 'sii']\n", 199 | "\n", 200 | "\n", 201 | "featuresCols += time_series_cols\n", 202 | "\n", 203 | "train = train[featuresCols]\n", 204 | "train = train.dropna(subset='sii')\n", 205 | "\n", 206 | "\n", 207 | "cat_c = ['Basic_Demos-Enroll_Season', 'CGAS-Season', 'Physical-Season',\n", 208 | " 'Fitness_Endurance-Season', 'FGC-Season', 'BIA-Season',\n", 209 | " 'PAQ_A-Season', 'PAQ_C-Season', 'SDS-Season', 'PreInt_EduHx-Season']\n", 210 | "\n", 211 | "def update(df):\n", 212 | "    global cat_c\n", 213 | "    for c in cat_c:\n", 214 | "        df[c] = df[c].fillna('Missing')\n", 215 | "        df[c] = df[c].astype('category')\n", 216 | "    return df\n", 217 | "\n", 218 | "train = 
update(train) \n", 219 | "test = update(test) " 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "id": "3373ab21", 225 | "metadata": { 226 | "papermill": { 227 | "duration": 0.017233, 228 | "end_time": "2025-01-01T15:26:51.604962", 229 | "exception": false, 230 | "start_time": "2025-01-01T15:26:51.587729", 231 | "status": "completed" 232 | }, 233 | "tags": [] 234 | }, 235 | "source": [ 236 | "### Feature Extraction " 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 5, 242 | "id": "9ef1228e", 243 | "metadata": { 244 | "execution": { 245 | "iopub.execute_input": "2025-01-01T15:26:51.640809Z", 246 | "iopub.status.busy": "2025-01-01T15:26:51.640510Z", 247 | "iopub.status.idle": "2025-01-01T15:26:51.673254Z", 248 | "shell.execute_reply": "2025-01-01T15:26:51.672562Z" 249 | }, 250 | "papermill": { 251 | "duration": 0.052186, 252 | "end_time": "2025-01-01T15:26:51.674572", 253 | "exception": false, 254 | "start_time": "2025-01-01T15:26:51.622386", 255 | "status": "completed" 256 | }, 257 | "tags": [] 258 | }, 259 | "outputs": [], 260 | "source": [ 261 | "def create_mapping(column, *datasets):\n", 262 | "    unique_values = pd.concat([d[column].astype(str) for d in datasets]).unique()\n", 263 | "    # to {feat0: 0, feat1: 1, feat2: 2, ...}, shared by every dataset passed in\n", 264 | "    return {value: idx for idx, value in enumerate(unique_values)}\n", 265 | "\n", 266 | "\n", 267 | "for col in cat_c:\n", 268 | "    mapping = create_mapping(col, train, test)\n", 269 | "    # one mapping for both frames, so a given season gets the same integer in train and test\n", 270 | "\n", 271 | "    train[col] = train[col].astype(str).map(mapping).astype(int)\n", 272 | "    test[col] = test[col].astype(str).map(mapping).astype(int)" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "id": "0dafe0b5", 278 | "metadata": { 279 | "papermill": { 280 | "duration": 0.017297, 281 | "end_time": "2025-01-01T15:26:51.709290", 282 | "exception": false, 283 | "start_time": "2025-01-01T15:26:51.691993", 284 | "status": "completed" 285 | }, 286 | "tags": [] 287 | }, 288 | "source": [ 289 | "### Training 
model" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 6, 295 | "id": "995fbf40", 296 | "metadata": { 297 | "execution": { 298 | "iopub.execute_input": "2025-01-01T15:26:51.745352Z", 299 | "iopub.status.busy": "2025-01-01T15:26:51.744956Z", 300 | "iopub.status.idle": "2025-01-01T15:26:51.749853Z", 301 | "shell.execute_reply": "2025-01-01T15:26:51.748964Z" 302 | }, 303 | "papermill": { 304 | "duration": 0.024415, 305 | "end_time": "2025-01-01T15:26:51.751107", 306 | "exception": false, 307 | "start_time": "2025-01-01T15:26:51.726692", 308 | "status": "completed" 309 | }, 310 | "tags": [] 311 | }, 312 | "outputs": [], 313 | "source": [ 314 | "def quadratic_weighted_kappa(y_true, y_pred):\n", 315 | " return cohen_kappa_score(y_true, y_pred, weights='quadratic') \n", 316 | "\n", 317 | "def threshold_Rounder(oof_non_rounded, thresholds):\n", 318 | " return np.where(oof_non_rounded < thresholds[0], 0,\n", 319 | " np.where(oof_non_rounded < thresholds[1], 1, \n", 320 | " np.where(oof_non_rounded < thresholds[2], 2, 3))) \n", 321 | "\n", 322 | "def evaluate_predictions(thresholds, y_true, oof_non_rounded):\n", 323 | " rounded_p = threshold_Rounder(oof_non_rounded, thresholds)\n", 324 | " return -quadratic_weighted_kappa(y_true, rounded_p)" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 7, 330 | "id": "73cdb3e9", 331 | "metadata": { 332 | "execution": { 333 | "iopub.execute_input": "2025-01-01T15:26:51.786688Z", 334 | "iopub.status.busy": "2025-01-01T15:26:51.786250Z", 335 | "iopub.status.idle": "2025-01-01T15:26:51.789479Z", 336 | "shell.execute_reply": "2025-01-01T15:26:51.788687Z" 337 | }, 338 | "papermill": { 339 | "duration": 0.022369, 340 | "end_time": "2025-01-01T15:26:51.790748", 341 | "exception": false, 342 | "start_time": "2025-01-01T15:26:51.768379", 343 | "status": "completed" 344 | }, 345 | "tags": [] 346 | }, 347 | "outputs": [], 348 | "source": [ 349 | "SEED = 42\n", 350 | "n_splits = 5" 351 | ] 352 
| }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 8, 356 | "id": "2fa26bbb", 357 | "metadata": { 358 | "execution": { 359 | "iopub.execute_input": "2025-01-01T15:26:51.825995Z", 360 | "iopub.status.busy": "2025-01-01T15:26:51.825783Z", 361 | "iopub.status.idle": "2025-01-01T15:26:51.833161Z", 362 | "shell.execute_reply": "2025-01-01T15:26:51.832516Z" 363 | }, 364 | "papermill": { 365 | "duration": 0.026285, 366 | "end_time": "2025-01-01T15:26:51.834277", 367 | "exception": false, 368 | "start_time": "2025-01-01T15:26:51.807992", 369 | "status": "completed" 370 | }, 371 | "tags": [] 372 | }, 373 | "outputs": [], 374 | "source": [ 375 | "def TrainML(model_class, test_data):\n", 376 | "    X = train.drop(['sii'], axis=1)\n", 377 | "    y = train['sii']\n", 378 | "\n", 379 | "    SKF = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)\n", 380 | "\n", 381 | "    train_S = []\n", 382 | "    test_S = []\n", 383 | "\n", 384 | "    oof_non_rounded = np.zeros(len(y), dtype=float)  # continuous out-of-fold predictions\n", 385 | "    oof_rounded = np.zeros(len(y), dtype=int)\n", 386 | "    test_preds = np.zeros((len(test_data), n_splits))  # one column of test predictions per fold\n", 387 | "\n", 388 | "    for fold, (train_idx, test_idx) in enumerate(tqdm(SKF.split(X, y), desc=\"Training Folds\", total=n_splits)):\n", 389 | "        X_train, X_val = X.iloc[train_idx], X.iloc[test_idx]\n", 390 | "        y_train, y_val = y.iloc[train_idx], y.iloc[test_idx]\n", 391 | "        model = clone(model_class)  # fresh, unfitted copy for each fold\n", 392 | "        model.fit(X_train, y_train)\n", 393 | "\n", 394 | "        y_train_pred = model.predict(X_train)\n", 395 | "        y_val_pred = model.predict(X_val)\n", 396 | "\n", 397 | "        oof_non_rounded[test_idx] = y_val_pred\n", 398 | "        y_val_pred_rounded = y_val_pred.round(0).astype(int)\n", 399 | "        oof_rounded[test_idx] = y_val_pred_rounded\n", 400 | "\n", 401 | "        train_kappa = quadratic_weighted_kappa(y_train, y_train_pred.round(0).astype(int))\n", 402 | "        val_kappa = quadratic_weighted_kappa(y_val, y_val_pred_rounded)\n", 403 | "\n", 404 | "        train_S.append(train_kappa)\n", 405 | "        test_S.append(val_kappa)\n", 406 | "\n", 407 | "        test_preds[:, fold] = model.predict(test_data)\n", 408 | "\n", 409 | "        clear_output(wait=True)\n", 410 | "\n", 411 | "\n", 412 | "    KappaOPtimizer = minimize(evaluate_predictions,  # tune the three cut points on out-of-fold predictions\n", 413 | "                              x0=[0.5, 1.5, 2.5], args=(y, oof_non_rounded),\n", 414 | "                              method='Nelder-Mead')\n", 415 | "    assert KappaOPtimizer.success, \"Optimization did not converge.\"\n", 416 | "\n", 417 | "    oof_tuned = threshold_Rounder(oof_non_rounded, KappaOPtimizer.x)\n", 418 | "    tKappa = quadratic_weighted_kappa(y, oof_tuned)\n", 419 | "\n", 420 | "    tpm = test_preds.mean(axis=1)  # average the per-fold test predictions\n", 421 | "    tpTuned = threshold_Rounder(tpm, KappaOPtimizer.x)\n", 422 | "\n", 423 | "    submission = pd.DataFrame({\n", 424 | "        'id': sample['id'],\n", 425 | "        'sii': tpTuned\n", 426 | "    })\n", 427 | "    return submission" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": 9, 433 | "id": "7f9f6e8c", 434 | "metadata": { 435 | "execution": { 436 | "iopub.execute_input": "2025-01-01T15:26:51.869751Z", 437 | "iopub.status.busy": "2025-01-01T15:26:51.869538Z", 438 | "iopub.status.idle": "2025-01-01T15:27:50.771711Z", 439 | "shell.execute_reply": "2025-01-01T15:27:50.770893Z" 440 | }, 441 | "papermill": { 442 | "duration": 58.921369, 443 | "end_time": "2025-01-01T15:27:50.773131", 444 | "exception": false, 445 | "start_time": "2025-01-01T15:26:51.851762", 446 | "status": "completed" 447 | }, 448 | "tags": [] 449 | }, 450 | "outputs": [ 451 | { 452 | "name": "stderr", 453 | "output_type": "stream", 454 | "text": [ 455 | "Training Folds: 100%|██████████| 5/5 [00:27<00:00, 5.49s/it]\n" 456 | ] 457 | }, 458 | { 459 | "data": { 460 | "text/html": [ 461 | "
" 587 | ], 588 | "text/plain": [ 589 | " id sii\n", 590 | "0 00008ff9 2\n", 591 | "1 000fd460 0\n", 592 | "2 00105258 0\n", 593 | "3 00115b9f 1\n", 594 | "4 0016bb22 1\n", 595 | "5 001f3379 1\n", 596 | "6 0038ba98 0\n", 597 | "7 0068a485 0\n", 598 | "8 0069fbed 1\n", 599 | "9 0083e397 1\n", 600 | "10 0087dd65 0\n", 601 | "11 00abe655 0\n", 602 | "12 00ae59c9 1\n", 603 | "13 00af6387 1\n", 604 | "14 00bd4359 1\n", 605 | "15 00c0cd71 2\n", 606 | "16 00d56d4b 0\n", 607 | "17 00d9913d 0\n", 608 | "18 00e6167c 0\n", 609 | "19 00ebc35d 1" 610 | ] 611 | }, 612 | "execution_count": 9, 613 | "metadata": {}, 614 | "output_type": "execute_result" 615 | } 616 | ], 617 | "source": [ 618 | "# LightGBM\n", 619 | "Params = {\n", 620 | " 'learning_rate': 0.046, \n", 621 | " 'max_depth': 12, \n", 622 | " 'num_leaves': 478, \n", 623 | " 'min_data_in_leaf': 13, \n", 624 | " 'feature_fraction': 0.893, \n", 625 | " 'bagging_fraction': 0.784, \n", 626 | " 'bagging_freq': 4, \n", 627 | " 'lambda_l1': 10,\n", 628 | " 'lambda_l2': 0.01, \n", 629 | "}\n", 630 | "\n", 631 | "XGB_Params = {\n", 632 | " 'learning_rate': 0.05, \n", 633 | " 'max_depth': 6, \n", 634 | " 'n_estimators': 200, \n", 635 | " 'subsample': 0.8, \n", 636 | " 'colsample_bytree': 0.8, \n", 637 | " 'reg_alpha': 1, \n", 638 | " 'reg_lambda': 5, \n", 639 | " 'random_state': SEED, \n", 640 | "}\n", 641 | "\n", 642 | "CatBoost_Params = {\n", 643 | " 'learning_rate': 0.05, \n", 644 | " 'depth': 6, \n", 645 | " 'iterations': 200, \n", 646 | " 'random_seed': SEED, \n", 647 | " 'cat_features': cat_c, \n", 648 | " 'verbose': 0, \n", 649 | " 'l2_leaf_reg': 10, \n", 650 | "}\n", 651 | "\n", 652 | "\n", 653 | "from collections import Counter\n", 654 | "\n", 655 | "class_counts = Counter(train['sii']) \n", 656 | "total_samples = len(train) \n", 657 | "# w = total_sample / class_sample\n", 658 | "class_weights = {cls: total_samples / count for cls, count in class_counts.items()} \n", 659 | "\n", 660 | "\n", 661 | "Params_with_weights = 
{\n", 662 | " **Params,\n", 663 | " 'class_weight': class_weights\n", 664 | "}\n", 665 | "\n", 666 | "\n", 667 | "Light = LGBMRegressor(**Params_with_weights, random_state=SEED, verbose=-1, n_estimators=300)\n", 668 | "XGB_Model = XGBRegressor(**XGB_Params) \n", 669 | "CatBoost_Model = CatBoostRegressor(**CatBoost_Params) \n", 670 | "\n", 671 | "voting_model = VotingRegressor(estimators=[\n", 672 | " ('lightgbm', Light),\n", 673 | " ('xgboost', XGB_Model),\n", 674 | " ('catboost', CatBoost_Model)\n", 675 | "])\n", 676 | "\n", 677 | "# Train the ensemble model\n", 678 | "Submission1 = TrainML(voting_model, test) \n", 679 | "\n", 680 | "Submission1" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "id": "76614c97", 686 | "metadata": { 687 | "papermill": { 688 | "duration": 0.017445, 689 | "end_time": "2025-01-01T15:27:50.809561", 690 | "exception": false, 691 | "start_time": "2025-01-01T15:27:50.792116", 692 | "status": "completed" 693 | }, 694 | "tags": [] 695 | }, 696 | "source": [ 697 | "## Method 2" 698 | ] 699 | }, 700 | { 701 | "cell_type": "code", 702 | "execution_count": 10, 703 | "id": "008ee316", 704 | "metadata": { 705 | "execution": { 706 | "iopub.execute_input": "2025-01-01T15:27:50.846386Z", 707 | "iopub.status.busy": "2025-01-01T15:27:50.846013Z", 708 | "iopub.status.idle": "2025-01-01T15:27:50.885071Z", 709 | "shell.execute_reply": "2025-01-01T15:27:50.884286Z" 710 | }, 711 | "papermill": { 712 | "duration": 0.059238, 713 | "end_time": "2025-01-01T15:27:50.886643", 714 | "exception": false, 715 | "start_time": "2025-01-01T15:27:50.827405", 716 | "status": "completed" 717 | }, 718 | "tags": [] 719 | }, 720 | "outputs": [], 721 | "source": [ 722 | "# Load data\n", 723 | "train = pd.read_csv('dataset/train.csv') \n", 724 | "test = pd.read_csv('dataset/test.csv') \n", 725 | "sample = pd.read_csv('dataset/sample_submission.csv') " 726 | ] 727 | }, 728 | { 729 | "cell_type": "code", 730 | "execution_count": 11, 731 | "id": "3af3d073", 732 | 
"metadata": { 733 | "execution": { 734 | "iopub.execute_input": "2025-01-01T15:27:50.923999Z", 735 | "iopub.status.busy": "2025-01-01T15:27:50.923707Z", 736 | "iopub.status.idle": "2025-01-01T15:28:59.977058Z", 737 | "shell.execute_reply": "2025-01-01T15:28:59.976147Z" 738 | }, 739 | "papermill": { 740 | "duration": 69.073418, 741 | "end_time": "2025-01-01T15:28:59.978281", 742 | "exception": false, 743 | "start_time": "2025-01-01T15:27:50.904863", 744 | "status": "completed" 745 | }, 746 | "tags": [] 747 | }, 748 | "outputs": [ 749 | { 750 | "name": "stderr", 751 | "output_type": "stream", 752 | "text": [ 753 | "100%|██████████| 111/111 [00:02<00:00, 53.38it/s]\n", 754 | "100%|██████████| 2/2 [00:00<00:00, 20.19it/s]\n" 755 | ] 756 | } 757 | ], 758 | "source": [ 759 | "# Merge and Drop Columns\n", 760 | "train_ts = load_time_series(\"dataset/series_train.parquet\") \n", 761 | "test_ts = load_time_series(\"dataset/series_test.parquet\") \n", 762 | "\n", 763 | "time_series_cols = train_ts.columns.tolist()\n", 764 | "time_series_cols.remove(\"id\") \n", 765 | "\n", 766 | "train = pd.merge(train, train_ts, how=\"left\", on='id') \n", 767 | "test = pd.merge(test, test_ts, how=\"left\", on='id') \n", 768 | "train = train.drop('id', axis=1) \n", 769 | "test = test.drop('id', axis=1) " 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": 12, 775 | "id": "19390bef", 776 | "metadata": { 777 | "execution": { 778 | "iopub.execute_input": "2025-01-01T15:29:00.045938Z", 779 | "iopub.status.busy": "2025-01-01T15:29:00.045693Z", 780 | "iopub.status.idle": "2025-01-01T15:29:14.637956Z", 781 | "shell.execute_reply": "2025-01-01T15:29:14.637202Z" 782 | }, 783 | "papermill": { 784 | "duration": 14.62708, 785 | "end_time": "2025-01-01T15:29:14.639577", 786 | "exception": false, 787 | "start_time": "2025-01-01T15:29:00.012497", 788 | "status": "completed" 789 | }, 790 | "tags": [] 791 | }, 792 | "outputs": [], 793 | "source": [ 794 | "imputer = 
KNNImputer(n_neighbors=5) \n", 795 | "\n", 796 | "numeric_cols = train.select_dtypes(include=['int32', 'int64', 'float64']).columns \n", 797 | "imputed_data = imputer.fit_transform(train[numeric_cols]) \n", 798 | "train_imputed = pd.DataFrame(imputed_data, columns=numeric_cols) \n", 799 | "train_imputed['sii'] = train_imputed['sii'].round().astype(int) \n", 800 | "for col in train.columns:\n", 801 | " if col not in numeric_cols:\n", 802 | " train_imputed[col] = train[col] \n", 803 | " \n", 804 | "train = train_imputed " 805 | ] 806 | }, 807 | { 808 | "cell_type": "code", 809 | "execution_count": 13, 810 | "id": "301ecede", 811 | "metadata": { 812 | "execution": { 813 | "iopub.execute_input": "2025-01-01T15:29:14.707069Z", 814 | "iopub.status.busy": "2025-01-01T15:29:14.706770Z", 815 | "iopub.status.idle": "2025-01-01T15:29:14.733969Z", 816 | "shell.execute_reply": "2025-01-01T15:29:14.733302Z" 817 | }, 818 | "papermill": { 819 | "duration": 0.062428, 820 | "end_time": "2025-01-01T15:29:14.735454", 821 | "exception": false, 822 | "start_time": "2025-01-01T15:29:14.673026", 823 | "status": "completed" 824 | }, 825 | "tags": [] 826 | }, 827 | "outputs": [], 828 | "source": [ 829 | "def feature_engineering(df):\n", 830 | "\n", 831 | " season_cols = [col for col in df.columns if 'Season' in col] \n", 832 | " df = df.drop(season_cols, axis=1) # Drop Season (too many missing values)\n", 833 | " df['BMI_Age'] = df['Physical-BMI'] * df['Basic_Demos-Age'] # BMI and age interactions \n", 834 | " df['Internet_Hours_Age'] = df['PreInt_EduHx-computerinternet_hoursday'] * df['Basic_Demos-Age'] # Internet hours and age interactions\n", 835 | " df['BMI_Internet_Hours'] = df['Physical-BMI'] * df['PreInt_EduHx-computerinternet_hoursday'] # BMI and Internet hours interactions\n", 836 | " df['BFP_BMI'] = df['BIA-BIA_Fat'] / df['BIA-BIA_BMI'] # Fat and BMI ratio\n", 837 | " df['FFMI_BFP'] = df['BIA-BIA_FFMI'] / df['BIA-BIA_Fat'] # FFMI and Fat ratio\n", 838 | " df['FMI_BFP'] = 
df['BIA-BIA_FMI'] / df['BIA-BIA_Fat'] # FMI and Fat ratio\n", 839 | " df['LST_TBW'] = df['BIA-BIA_LST'] / df['BIA-BIA_TBW'] # LST and TBW ratio\n", 840 | " df['BFP_BMR'] = df['BIA-BIA_Fat'] * df['BIA-BIA_BMR'] # Fat and BMR interactions\n", 841 | " df['BFP_DEE'] = df['BIA-BIA_Fat'] * df['BIA-BIA_DEE'] # Fat and DEE interactions\n", 842 | " df['BMR_Weight'] = df['BIA-BIA_BMR'] / df['Physical-Weight'] # BMR and Weight ratio\n", 843 | " df['DEE_Weight'] = df['BIA-BIA_DEE'] / df['Physical-Weight'] # DEE and Weight ratio\n", 844 | " df['SMM_Height'] = df['BIA-BIA_SMM'] / df['Physical-Height'] # SMM and Height ratio\n", 845 | " df['Muscle_to_Fat'] = df['BIA-BIA_SMM'] / df['BIA-BIA_FMI'] # Muscle and Fat ratio\n", 846 | " df['Hydration_Status'] = df['BIA-BIA_TBW'] / df['Physical-Weight'] # TBW and Weight ratio\n", 847 | " df['ICW_TBW'] = df['BIA-BIA_ICW'] / df['BIA-BIA_TBW'] # ICW and TBW ratio\n", 848 | " df['BMI_PHR'] = df['Physical-BMI'] * df['Physical-HeartRate'] # BMI and Heart rate interaction\n", 849 | " return df\n", 850 | "\n", 851 | "train = feature_engineering(train) \n", 852 | "train = train.dropna(thresh=10, axis=0) # Keep rows with at least 10 non-missing values\n", 853 | "test = feature_engineering(test) " 854 | ] 855 | }, 856 | { 857 | "cell_type": "code", 858 | "execution_count": 14, 859 | "id": "5f33cd5d", 860 | "metadata": { 861 | "execution": { 862 | "iopub.execute_input": "2025-01-01T15:29:14.801735Z", 863 | "iopub.status.busy": "2025-01-01T15:29:14.801456Z", 864 | "iopub.status.idle": "2025-01-01T15:29:14.810767Z", 865 | "shell.execute_reply": "2025-01-01T15:29:14.810060Z" 866 | }, 867 | "papermill": { 868 | "duration": 0.043658, 869 | "end_time": "2025-01-01T15:29:14.812076", 870 | "exception": false, 871 | "start_time": "2025-01-01T15:29:14.768418", 872 | "status": "completed" 873 | }, 874 | "tags": [] 875 | }, 876 | "outputs": [], 877 | "source": [ 878 | "featuresCols = ['Basic_Demos-Age', 'Basic_Demos-Sex',\n", 879 | " 'CGAS-CGAS_Score', 
'Physical-BMI',\n", 880 | " 'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference',\n", 881 | " 'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP',\n", 882 | " 'Fitness_Endurance-Max_Stage',\n", 883 | " 'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec',\n", 884 | " 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND',\n", 885 | " 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU',\n", 886 | " 'FGC-FGC_PU_Zone', 'FGC-FGC_SRL', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR',\n", 887 | " 'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone',\n", 888 | " 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI',\n", 889 | " 'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW', 'BIA-BIA_FFM',\n", 890 | " 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num',\n", 891 | " 'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST', 'BIA-BIA_SMM',\n", 892 | " 'BIA-BIA_TBW', 'PAQ_A-PAQ_A_Total',\n", 893 | " 'PAQ_C-PAQ_C_Total', 'SDS-SDS_Total_Raw',\n", 894 | " 'SDS-SDS_Total_T',\n", 895 | " 'PreInt_EduHx-computerinternet_hoursday', 'BMI_Age', 'Internet_Hours_Age', 'BMI_Internet_Hours',\n", 896 | " 'BFP_BMI', 'FFMI_BFP', 'FMI_BFP', 'LST_TBW', 'BFP_BMR', 'BFP_DEE', 'BMR_Weight', 'DEE_Weight', 'SMM_Height', 'Muscle_to_Fat', 'Hydration_Status', 'ICW_TBW', 'BMI_PHR',\n", 897 | " ]\n", 898 | "\n", 899 | "train = train[featuresCols + time_series_cols + ['sii']]\n", 900 | "train = train.dropna(subset='sii') \n", 901 | "\n", 902 | "\n", 903 | "test = test[featuresCols + time_series_cols]" 904 | ] 905 | }, 906 | { 907 | "cell_type": "code", 908 | "execution_count": 15, 909 | "id": "1c3192d2", 910 | "metadata": { 911 | "execution": { 912 | "iopub.execute_input": "2025-01-01T15:29:14.877874Z", 913 | "iopub.status.busy": "2025-01-01T15:29:14.877598Z", 914 | "iopub.status.idle": "2025-01-01T15:29:14.887045Z", 915 | "shell.execute_reply": "2025-01-01T15:29:14.886388Z" 916 | }, 917 | "papermill": { 918 | "duration": 0.044146, 919 | "end_time": 
"2025-01-01T15:29:14.888509", 920 | "exception": false, 921 | "start_time": "2025-01-01T15:29:14.844363", 922 | "status": "completed" 923 | }, 924 | "tags": [] 925 | }, 926 | "outputs": [], 927 | "source": [ 928 | "if np.any(np.isinf(train)):\n", 929 | " train = train.replace([np.inf, -np.inf], np.nan) " 930 | ] 931 | }, 932 | { 933 | "cell_type": "markdown", 934 | "id": "b311fd0f", 935 | "metadata": { 936 | "papermill": { 937 | "duration": 0.034398, 938 | "end_time": "2025-01-01T15:29:14.957111", 939 | "exception": false, 940 | "start_time": "2025-01-01T15:29:14.922713", 941 | "status": "completed" 942 | }, 943 | "tags": [] 944 | }, 945 | "source": [ 946 | "### Training model" 947 | ] 948 | }, 949 | { 950 | "cell_type": "code", 951 | "execution_count": 16, 952 | "id": "4514fd9d", 953 | "metadata": { 954 | "execution": { 955 | "iopub.execute_input": "2025-01-01T15:29:15.066031Z", 956 | "iopub.status.busy": "2025-01-01T15:29:15.065714Z", 957 | "iopub.status.idle": "2025-01-01T15:29:15.073359Z", 958 | "shell.execute_reply": "2025-01-01T15:29:15.072524Z" 959 | }, 960 | "papermill": { 961 | "duration": 0.04206, 962 | "end_time": "2025-01-01T15:29:15.074631", 963 | "exception": false, 964 | "start_time": "2025-01-01T15:29:15.032571", 965 | "status": "completed" 966 | }, 967 | "tags": [] 968 | }, 969 | "outputs": [], 970 | "source": [ 971 | "def TrainML(model_class, test_data):\n", 972 | " X = train.drop(['sii'], axis=1) \n", 973 | " y = train['sii'] \n", 974 | "\n", 975 | " SKF = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED) \n", 976 | " \n", 977 | " train_S = []\n", 978 | " test_S = []\n", 979 | " \n", 980 | " oof_non_rounded = np.zeros(len(y), dtype=float) \n", 981 | " oof_rounded = np.zeros(len(y), dtype=int) \n", 982 | " test_preds = np.zeros((len(test_data), n_splits)) \n", 983 | "\n", 984 | " for fold, (train_idx, test_idx) in enumerate(tqdm(SKF.split(X, y), desc=\"Training Folds\", total=n_splits)):\n", 985 | " X_train, X_val = 
X.iloc[train_idx], X.iloc[test_idx] \n", 986 | " y_train, y_val = y.iloc[train_idx], y.iloc[test_idx] \n", 987 | "\n", 988 | " model = clone(model_class) \n", 989 | " model.fit(X_train, y_train) \n", 990 | "\n", 991 | " y_train_pred = model.predict(X_train) \n", 992 | " y_val_pred = model.predict(X_val) \n", 993 | "\n", 994 | " oof_non_rounded[test_idx] = y_val_pred \n", 995 | " y_val_pred_rounded = y_val_pred.round(0).astype(int) \n", 996 | " oof_rounded[test_idx] = y_val_pred_rounded \n", 997 | "\n", 998 | " train_kappa = quadratic_weighted_kappa(y_train, y_train_pred.round(0).astype(int)) \n", 999 | " val_kappa = quadratic_weighted_kappa(y_val, y_val_pred_rounded) \n", 1000 | "\n", 1001 | " train_S.append(train_kappa) \n", 1002 | " test_S.append(val_kappa) \n", 1003 | " \n", 1004 | " test_preds[:, fold] = model.predict(test_data)\n", 1005 | " \n", 1006 | " clear_output(wait=True) \n", 1007 | "\n", 1008 | " # Tune the rounding thresholds to maximize the kappa score on the out-of-fold predictions\n", 1009 | " KappaOPtimizer = minimize(evaluate_predictions,\n", 1010 | " x0=[0.5, 1.5, 2.5], args=(y, oof_non_rounded), \n", 1011 | " method='Nelder-Mead')\n", 1012 | " assert KappaOPtimizer.success, \"Optimization did not converge.\" \n", 1013 | " \n", 1014 | " oof_tuned = threshold_Rounder(oof_non_rounded, KappaOPtimizer.x) \n", 1015 | " tKappa = quadratic_weighted_kappa(y, oof_tuned) \n", 1016 | "\n", 1017 | " tpm = test_preds.mean(axis=1) \n", 1018 | " tp_rounded = threshold_Rounder(tpm, KappaOPtimizer.x) \n", 1019 | "\n", 1020 | " return tp_rounded " 1021 | ] 1022 | }, 1023 | { 1024 | "cell_type": "code", 1025 | "execution_count": 17, 1026 | "id": "2f8d68aa", 1027 | "metadata": { 1028 | "execution": { 1029 | "iopub.execute_input": "2025-01-01T15:29:15.140584Z", 1030 | "iopub.status.busy": "2025-01-01T15:29:15.140266Z", 1031 | "iopub.status.idle": "2025-01-01T15:32:12.170806Z", 1032 | "shell.execute_reply": "2025-01-01T15:32:12.169857Z" 1033 | }, 1034 | "papermill": { 1035 | "duration": 177.065371, 1036 | "end_time": 
"2025-01-01T15:32:12.172492", 1037 | "exception": false, 1038 | "start_time": "2025-01-01T15:29:15.107121", 1039 | "status": "completed" 1040 | }, 1041 | "tags": [] 1042 | }, 1043 | "outputs": [ 1044 | { 1045 | "name": "stderr", 1046 | "output_type": "stream", 1047 | "text": [ 1048 | "Training Folds: 100%|██████████| 5/5 [00:59<00:00, 11.88s/it]\n" 1049 | ] 1050 | } 1051 | ], 1052 | "source": [ 1053 | "# Ensemble Model\n", 1054 | "imputer = SimpleImputer(strategy='median') \n", 1055 | "\n", 1056 | "ensemble = VotingRegressor(estimators=[\n", 1057 | " ('lgb', Pipeline(steps=[('imputer', imputer), ('regressor', LGBMRegressor(random_state=SEED))])), # LightGBM\n", 1058 | " ('xgb', Pipeline(steps=[('imputer', imputer), ('regressor', XGBRegressor(random_state=SEED))])), # XGBoost\n", 1059 | " ('cat', Pipeline(steps=[('imputer', imputer), ('regressor', CatBoostRegressor(random_state=SEED, silent=True))])), # CatBoost\n", 1060 | " ('rf', Pipeline(steps=[('imputer', imputer), ('regressor', RandomForestRegressor(random_state=SEED))])), # Random Forest\n", 1061 | " ('gb', Pipeline(steps=[('imputer', imputer), ('regressor', GradientBoostingRegressor(random_state=SEED))])) # Gradient Boosting\n", 1062 | "])\n", 1063 | "\n", 1064 | "Submission2 = TrainML(ensemble, test) " 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "code", 1069 | "execution_count": 18, 1070 | "id": "9f90de79", 1071 | "metadata": { 1072 | "execution": { 1073 | "iopub.execute_input": "2025-01-01T15:32:12.248388Z", 1074 | "iopub.status.busy": "2025-01-01T15:32:12.247952Z", 1075 | "iopub.status.idle": "2025-01-01T15:32:12.258275Z", 1076 | "shell.execute_reply": "2025-01-01T15:32:12.257317Z" 1077 | }, 1078 | "papermill": { 1079 | "duration": 0.050096, 1080 | "end_time": "2025-01-01T15:32:12.259945", 1081 | "exception": false, 1082 | "start_time": "2025-01-01T15:32:12.209849", 1083 | "status": "completed" 1084 | }, 1085 | "tags": [] 1086 | }, 1087 | "outputs": [ 1088 | { 1089 | "data": { 1090 | "text/html": [ 
1091 | "
 \n", 1216 | "
" 1217 | ], 1218 | "text/plain": [ 1219 | " id sii\n", 1220 | "0 00008ff9 2\n", 1221 | "1 000fd460 0\n", 1222 | "2 00105258 0\n", 1223 | "3 00115b9f 1\n", 1224 | "4 0016bb22 0\n", 1225 | "5 001f3379 1\n", 1226 | "6 0038ba98 0\n", 1227 | "7 0068a485 0\n", 1228 | "8 0069fbed 1\n", 1229 | "9 0083e397 0\n", 1230 | "10 0087dd65 0\n", 1231 | "11 00abe655 0\n", 1232 | "12 00ae59c9 1\n", 1233 | "13 00af6387 0\n", 1234 | "14 00bd4359 1\n", 1235 | "15 00c0cd71 1\n", 1236 | "16 00d56d4b 0\n", 1237 | "17 00d9913d 0\n", 1238 | "18 00e6167c 0\n", 1239 | "19 00ebc35d 0" 1240 | ] 1241 | }, 1242 | "execution_count": 18, 1243 | "metadata": {}, 1244 | "output_type": "execute_result" 1245 | } 1246 | ], 1247 | "source": [ 1248 | "Submission2 = pd.DataFrame({\n", 1249 | " 'id': sample['id'],\n", 1250 | " 'sii': Submission2\n", 1251 | "}) \n", 1252 | "\n", 1253 | "Submission2" 1254 | ] 1255 | }, 1256 | { 1257 | "cell_type": "code", 1258 | "execution_count": 19, 1259 | "id": "9b523b1b", 1260 | "metadata": {}, 1261 | "outputs": [], 1262 | "source": [ 1263 | "sub1 = Submission1 \n", 1264 | "sub2 = Submission2 \n", 1265 | "sub1 = sub1.sort_values(by='id').reset_index(drop=True) \n", 1266 | "sub2 = sub2.sort_values(by='id').reset_index(drop=True) \n", 1267 | "\n", 1268 | "combined = pd.DataFrame({\n", 1269 | " 'id': sub1['id'],\n", 1270 | " 'sii_1': sub1['sii'],\n", 1271 | " 'sii_2': sub2['sii']\n", 1272 | "}) \n", 1273 | "\n", 1274 | "def majority_vote(row):\n", 1275 | " \"\"\"\n", 1276 | " For each row of predictions, perform majority voting. 
\n", 1277 | " If there are multiple modes, take their average and round to the nearest integer.\n", 1278 | "\n", 1279 | " Parameters:\n", 1280 | " - row: A row of prediction values\n", 1281 | "\n", 1282 | " Returns:\n", 1283 | " - The final predicted 'sii' value\n", 1284 | " \"\"\"\n", 1285 | " return row.mode()[0] if len(row.mode()) == 1 else row.mean().round().astype(int)\n", 1286 | "\n", 1287 | "combined['final_sii'] = combined[['sii_1', 'sii_2']].apply(majority_vote, axis=1) \n", 1288 | "\n", 1289 | "final_submission = combined[['id', 'final_sii']].rename(columns={'final_sii': 'sii'}) \n", 1290 | "final_submission.to_csv('submission.csv', index=False) " 1291 | ] 1292 | } 1293 | ], 1294 | "metadata": { 1295 | "kaggle": { 1296 | "accelerator": "none", 1297 | "dataSources": [ 1298 | { 1299 | "databundleVersionId": 9643020, 1300 | "sourceId": 81933, 1301 | "sourceType": "competition" 1302 | } 1303 | ], 1304 | "isGpuEnabled": false, 1305 | "isInternetEnabled": false, 1306 | "language": "python", 1307 | "sourceType": "notebook" 1308 | }, 1309 | "kernelspec": { 1310 | "display_name": "CMI-PIU", 1311 | "language": "python", 1312 | "name": "python3" 1313 | }, 1314 | "language_info": { 1315 | "codemirror_mode": { 1316 | "name": "ipython", 1317 | "version": 3 1318 | }, 1319 | "file_extension": ".py", 1320 | "mimetype": "text/x-python", 1321 | "name": "python", 1322 | "nbconvert_exporter": "python", 1323 | "pygments_lexer": "ipython3", 1324 | "version": "3.8.20" 1325 | }, 1326 | "papermill": { 1327 | "default_parameters": {}, 1328 | "duration": 399.490884, 1329 | "end_time": "2025-01-01T15:32:13.524661", 1330 | "environment_variables": {}, 1331 | "exception": null, 1332 | "input_path": "__notebook__.ipynb", 1333 | "output_path": "__notebook__.ipynb", 1334 | "parameters": {}, 1335 | "start_time": "2025-01-01T15:25:34.033777", 1336 | "version": "2.6.0" 1337 | } 1338 | }, 1339 | "nbformat": 4, 1340 | "nbformat_minor": 5 1341 | } 1342 | 
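The final cell combines the two submissions with `majority_vote`. With only two voters the mode is unique only when both models agree, so every disagreement falls through to the rounded mean. The sketch below illustrates that behavior under stated assumptions: the `demo` frame and its values are made up, and the helper is a simplified restatement of the notebook's function rather than the author's exact code. NumPy rounds halves to the nearest even integer, so a 0/1 split resolves to 0 while a 1/2 split resolves to 2.

```python
import numpy as np
import pandas as pd


def majority_vote(row: pd.Series) -> int:
    """Return the unique mode, or the rounded mean when the votes are tied."""
    modes = row.mode()
    if len(modes) == 1:
        return int(modes.iloc[0])
    # np.round rounds exact halves to the nearest even integer
    # (0.5 -> 0, 1.5 -> 2, 2.5 -> 2), which decides every two-way tie.
    return int(np.round(row.mean()))


# Hypothetical predictions from two models; not taken from the competition data.
demo = pd.DataFrame({
    "sii_1": [0, 0, 1, 2, 0],
    "sii_2": [0, 1, 2, 3, 3],
})
demo["final_sii"] = demo[["sii_1", "sii_2"]].apply(majority_vote, axis=1)
print(demo)
```

If the asymmetric tie-breaking is not wanted, one option is to always prefer one model's prediction on disagreement, or to average the unrounded predictions before thresholding.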
--------------------------------------------------------------------------------