├── .gitignore ├── 01-forecasting-as-regression ├── 01-feature-engineering.ipynb ├── 02-feature-engineering-with-feature-engine.ipynb ├── 03-benchmark-models.ipynb └── 04-single-step-forecasting-with-machine-learning.ipynb ├── 02-multistep-forecasting ├── 01-recursive-forecasting.ipynb ├── 02-recursive-forecasting-with-future-known-features.ipynb ├── 03-recursive-forecasting-with-window-features.ipynb ├── 04-direct-forecasting-skforecast.ipynb ├── 05-direct-forecasting-sklearn.ipynb ├── 06-direct-forecasting-with-future-known-features.ipynb └── assignment │ ├── 01-assigment.ipynb │ ├── 01-solution.ipynb │ ├── 02-assignment.ipynb │ └── 02-solution.ipynb ├── 03-multiseries-forecasting ├── 01-independent-multi-time-series-forecasting.ipynb ├── 02-dependent-multi-time-series-forecasting.ipynb ├── exercise-1-solution.ipynb ├── exercise-1.ipynb ├── exercise-2-solution.ipynb ├── exercise-2.ipynb └── images │ ├── forecaster_multi_series_sample_weight.png │ └── forecaster_multivariate_train_matrix_diagram.png ├── 04-backtesting ├── 01-backtesting-without-refitting.ipynb ├── 02-backtesting-with-refitting-expanding-training-window.ipynb ├── 03-backtesting-with-refitting-rolling-training-window.ipynb ├── 04-backtesting-with-intermittent-refitting.ipynb ├── 05-backtesting-with-gap.ipynb ├── exercise-1-solution.ipynb ├── exercise-1.ipynb ├── exercise-2-solution.ipynb ├── exercise-2.ipynb └── images │ ├── backtesting_intermittent_refit.gif │ ├── backtesting_no_refit.gif │ ├── backtesting_refit.gif │ ├── backtesting_refit_fixed_train_size.gif │ └── backtesting_refit_gap.gif ├── 05-error-metrics ├── exercise-1-solution.ipynb └── exercise-1.ipynb ├── FWML-Logo.png ├── LICENSE ├── README.md └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | *.db 6 | 7 | # Jupyter Notebook 8 | .ipynb_checkpoints 9 | 10 | # pyenv 11 | .python-version 12 | 13 | # Environments 14 | .env 15 | .venv 16 | env/ 17 | venv/ 18 | ENV/ 19 | env.bak/ 20 | venv.bak/ 21 | 22 | # datasets 23 | *.csv 24 | *.zip 25 | *.data 26 | *.labels 27 | 28 | # files not to put in repo 29 | Make-toy-finance-dataset.ipynb 30 | tbf -------------------------------------------------------------------------------- /01-forecasting-as-regression/02-feature-engineering-with-feature-engine.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "fcea71b1", 6 | "metadata": {}, 7 | "source": [ 8 | "# Feature engineering for forecasting\n", 9 | "\n", 10 | "[Forecasting with Machine Learning - Course](https://www.trainindata.com/p/forecasting-with-machine-learning)\n", 11 | "\n", 12 | "In this notebook, we will create a table of predictive features and a target from a time series dataset, utilizing [Feature-engine](https://feature-engine.trainindata.com)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 2, 18 | "id": "42834613", 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "import matplotlib.pyplot as plt\n", 23 | "import pandas as pd\n", 24 | "\n", 25 | "from feature_engine.datetime import DatetimeFeatures\n", 26 | "from feature_engine.timeseries.forecasting import LagFeatures, WindowFeatures" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "id": "1c87901a", 32 | "metadata": {}, 33 | "source": [ 34 | "# Load data\n", 35 | "\n", 36 | "We will use the electricity 
demand dataset found [here](https://github.com/tidyverts/tsibbledata/tree/master/data-raw/vic_elec/VIC2015).\n",
    "\n",
    "**Citation:**\n",
    "\n",
    "Godahewa, Rakshitha, Bergmeir, Christoph, Webb, Geoff, Hyndman, Rob, & Montero-Manso, Pablo. (2021). Australian Electricity Demand Dataset (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4659727\n",
    "\n",
    "**Description of data:**\n",
    "\n",
    "A description of the data can be found [here](https://rdrr.io/cran/tsibbledata/man/vic_elec.html). The dataset contains the electricity demand in Victoria, Australia, at 30-minute intervals from 2002 to early 2015. It also includes the temperature in Melbourne at 30-minute intervals, as well as the public holiday dates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "0ab8e60d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
\n", 57 | "\n", 70 | "\n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | "
demand
date_time
2002-01-01 00:00:006919.366092
2002-01-01 01:00:007165.974188
2002-01-01 02:00:006406.542994
2002-01-01 03:00:005815.537828
2002-01-01 04:00:005497.732922
\n", 104 | "
" 105 | ], 106 | "text/plain": [ 107 | " demand\n", 108 | "date_time \n", 109 | "2002-01-01 00:00:00 6919.366092\n", 110 | "2002-01-01 01:00:00 7165.974188\n", 111 | "2002-01-01 02:00:00 6406.542994\n", 112 | "2002-01-01 03:00:00 5815.537828\n", 113 | "2002-01-01 04:00:00 5497.732922" 114 | ] 115 | }, 116 | "execution_count": 3, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "# Electricity demand.\n", 123 | "url = \"https://raw.githubusercontent.com/tidyverts/tsibbledata/master/data-raw/vic_elec/VIC2015/demand.csv\"\n", 124 | "df = pd.read_csv(url)\n", 125 | "\n", 126 | "df.drop(columns=[\"Industrial\"], inplace=True)\n", 127 | "\n", 128 | "# Convert the integer Date to an actual date with datetime type\n", 129 | "df[\"date\"] = df[\"Date\"].apply(\n", 130 | " lambda x: pd.Timestamp(\"1899-12-30\") + pd.Timedelta(x, unit=\"days\")\n", 131 | ")\n", 132 | "\n", 133 | "# Create a timestamp from the integer Period representing 30 minute intervals\n", 134 | "df[\"date_time\"] = df[\"date\"] + \\\n", 135 | " pd.to_timedelta((df[\"Period\"] - 1) * 30, unit=\"m\")\n", 136 | "\n", 137 | "df.dropna(inplace=True)\n", 138 | "\n", 139 | "# Rename columns\n", 140 | "df = df[[\"date_time\", \"OperationalLessIndustrial\"]]\n", 141 | "\n", 142 | "df.columns = [\"date_time\", \"demand\"]\n", 143 | "\n", 144 | "# Resample to hourly\n", 145 | "df = (\n", 146 | " df.set_index(\"date_time\")\n", 147 | " .resample(\"h\")\n", 148 | " .agg({\"demand\": \"sum\"})\n", 149 | ")\n", 150 | "\n", 151 | "df.head()" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "id": "a9b0a097", 157 | "metadata": {}, 158 | "source": [ 159 | "## Lag features\n", 160 | "\n", 161 | "We shift past values of the time series forward.\n", 162 | "\n", 163 | "With feature-engine, we can create all lags in one go." 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 7, 169 | "id": "c460a534", 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "data": { 174 | "text/html": [ 175 | "
\n", 176 | "\n", 189 | "\n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | "
demanddemand_lag_1_xdemand_lag_24_xdemand_lag_144_xdemand_lag_1_ydemand_lag_24_ydemand_lag_144_ydemand_window_3_meandemand_window_3_stddemand_window_24_meandemand_window_24_stddemand_lag_1demand_lag_24demand_lag_168
date_time
2002-01-01 00:00:006919.366092NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2002-01-01 01:00:007165.9741886919.366092NaNNaN6919.366092NaNNaNNaNNaNNaNNaN6919.366092NaNNaN
2002-01-01 02:00:006406.5429947165.974188NaNNaN7165.974188NaNNaNNaNNaNNaNNaN7165.974188NaNNaN
2002-01-01 03:00:005815.5378286406.542994NaNNaN6406.542994NaNNaN6830.627758387.414253NaNNaN6406.542994NaNNaN
2002-01-01 04:00:005497.7329225815.537828NaNNaN5815.537828NaNNaN6462.685003676.966421NaNNaN5815.537828NaNNaN
\n", 314 | "
" 315 | ], 316 | "text/plain": [ 317 | " demand demand_lag_1_x demand_lag_24_x \\\n", 318 | "date_time \n", 319 | "2002-01-01 00:00:00 6919.366092 NaN NaN \n", 320 | "2002-01-01 01:00:00 7165.974188 6919.366092 NaN \n", 321 | "2002-01-01 02:00:00 6406.542994 7165.974188 NaN \n", 322 | "2002-01-01 03:00:00 5815.537828 6406.542994 NaN \n", 323 | "2002-01-01 04:00:00 5497.732922 5815.537828 NaN \n", 324 | "\n", 325 | " demand_lag_144_x demand_lag_1_y demand_lag_24_y \\\n", 326 | "date_time \n", 327 | "2002-01-01 00:00:00 NaN NaN NaN \n", 328 | "2002-01-01 01:00:00 NaN 6919.366092 NaN \n", 329 | "2002-01-01 02:00:00 NaN 7165.974188 NaN \n", 330 | "2002-01-01 03:00:00 NaN 6406.542994 NaN \n", 331 | "2002-01-01 04:00:00 NaN 5815.537828 NaN \n", 332 | "\n", 333 | " demand_lag_144_y demand_window_3_mean \\\n", 334 | "date_time \n", 335 | "2002-01-01 00:00:00 NaN NaN \n", 336 | "2002-01-01 01:00:00 NaN NaN \n", 337 | "2002-01-01 02:00:00 NaN NaN \n", 338 | "2002-01-01 03:00:00 NaN 6830.627758 \n", 339 | "2002-01-01 04:00:00 NaN 6462.685003 \n", 340 | "\n", 341 | " demand_window_3_std demand_window_24_mean \\\n", 342 | "date_time \n", 343 | "2002-01-01 00:00:00 NaN NaN \n", 344 | "2002-01-01 01:00:00 NaN NaN \n", 345 | "2002-01-01 02:00:00 NaN NaN \n", 346 | "2002-01-01 03:00:00 387.414253 NaN \n", 347 | "2002-01-01 04:00:00 676.966421 NaN \n", 348 | "\n", 349 | " demand_window_24_std demand_lag_1 demand_lag_24 \\\n", 350 | "date_time \n", 351 | "2002-01-01 00:00:00 NaN NaN NaN \n", 352 | "2002-01-01 01:00:00 NaN 6919.366092 NaN \n", 353 | "2002-01-01 02:00:00 NaN 7165.974188 NaN \n", 354 | "2002-01-01 03:00:00 NaN 6406.542994 NaN \n", 355 | "2002-01-01 04:00:00 NaN 5815.537828 NaN \n", 356 | "\n", 357 | " demand_lag_168 \n", 358 | "date_time \n", 359 | "2002-01-01 00:00:00 NaN \n", 360 | "2002-01-01 01:00:00 NaN \n", 361 | "2002-01-01 02:00:00 NaN \n", 362 | "2002-01-01 03:00:00 NaN \n", 363 | "2002-01-01 04:00:00 NaN " 364 | ] 365 | }, 366 | "execution_count": 7, 367 | "metadata": {}, 368 | "output_type": "execute_result" 369 | } 370 | ], 371 | "source": [ 372 | "# We'll use the previous value, the value 24 hs before, \n", 373 | "# and the value at the same time the prior week.\n", 374 | "\n", 375 | "lag_f = LagFeatures(\n", 376 | " variables = \"demand\", # if none, it will make lags of all numerical variables\n", 377 | " periods=[1,24, 7*24],\n", 378 | ")\n", 379 | "\n", 380 | "df = lag_f.fit_transform(df)\n", 381 | "\n", 382 | "df.head()" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "id": "7a492968", 388 | "metadata": {}, 389 | "source": [ 390 | "## Window features\n", 391 | "\n", 392 | "We aggregate values within windows in the past.\n", 393 | "\n", 394 | "With Feature-engine, we can create many windows by using many functions, all in one go." 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": 6, 400 | "id": "82669807", 401 | "metadata": {}, 402 | "outputs": [ 403 | { 404 | "data": { 405 | "text/html": [ 406 | "
\n", 407 | "\n", 420 | "\n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | "
demanddemand_lag_1_xdemand_lag_24_xdemand_lag_144_xdemand_lag_1_ydemand_lag_24_ydemand_lag_144_ydemand_window_3_meandemand_window_3_stddemand_window_24_meandemand_window_24_std
date_time
2002-01-01 00:00:006919.366092NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2002-01-01 01:00:007165.9741886919.366092NaNNaN6919.366092NaNNaNNaNNaNNaNNaN
2002-01-01 02:00:006406.5429947165.974188NaNNaN7165.974188NaNNaNNaNNaNNaNNaN
2002-01-01 03:00:005815.5378286406.542994NaNNaN6406.542994NaNNaN6830.627758387.414253NaNNaN
2002-01-01 04:00:005497.7329225815.537828NaNNaN5815.537828NaNNaN6462.685003676.966421NaNNaN
\n", 524 | "
" 525 | ], 526 | "text/plain": [ 527 | " demand demand_lag_1_x demand_lag_24_x \\\n", 528 | "date_time \n", 529 | "2002-01-01 00:00:00 6919.366092 NaN NaN \n", 530 | "2002-01-01 01:00:00 7165.974188 6919.366092 NaN \n", 531 | "2002-01-01 02:00:00 6406.542994 7165.974188 NaN \n", 532 | "2002-01-01 03:00:00 5815.537828 6406.542994 NaN \n", 533 | "2002-01-01 04:00:00 5497.732922 5815.537828 NaN \n", 534 | "\n", 535 | " demand_lag_144_x demand_lag_1_y demand_lag_24_y \\\n", 536 | "date_time \n", 537 | "2002-01-01 00:00:00 NaN NaN NaN \n", 538 | "2002-01-01 01:00:00 NaN 6919.366092 NaN \n", 539 | "2002-01-01 02:00:00 NaN 7165.974188 NaN \n", 540 | "2002-01-01 03:00:00 NaN 6406.542994 NaN \n", 541 | "2002-01-01 04:00:00 NaN 5815.537828 NaN \n", 542 | "\n", 543 | " demand_lag_144_y demand_window_3_mean \\\n", 544 | "date_time \n", 545 | "2002-01-01 00:00:00 NaN NaN \n", 546 | "2002-01-01 01:00:00 NaN NaN \n", 547 | "2002-01-01 02:00:00 NaN NaN \n", 548 | "2002-01-01 03:00:00 NaN 6830.627758 \n", 549 | "2002-01-01 04:00:00 NaN 6462.685003 \n", 550 | "\n", 551 | " demand_window_3_std demand_window_24_mean \\\n", 552 | "date_time \n", 553 | "2002-01-01 00:00:00 NaN NaN \n", 554 | "2002-01-01 01:00:00 NaN NaN \n", 555 | "2002-01-01 02:00:00 NaN NaN \n", 556 | "2002-01-01 03:00:00 387.414253 NaN \n", 557 | "2002-01-01 04:00:00 676.966421 NaN \n", 558 | "\n", 559 | " demand_window_24_std \n", 560 | "date_time \n", 561 | "2002-01-01 00:00:00 NaN \n", 562 | "2002-01-01 01:00:00 NaN \n", 563 | "2002-01-01 02:00:00 NaN \n", 564 | "2002-01-01 03:00:00 NaN \n", 565 | "2002-01-01 04:00:00 NaN " 566 | ] 567 | }, 568 | "execution_count": 6, 569 | "metadata": {}, 570 | "output_type": "execute_result" 571 | } 572 | ], 573 | "source": [ 574 | "window_f = WindowFeatures(\n", 575 | " variables = \"demand\", # if none, it will make window features from all numerical variables\n", 576 | " window = [3, 24],\n", 577 | " functions = [\"mean\", \"std\"],\n", 578 | " missing_values=\"ignore\"\n", 579 | ")\n", 580 | "\n", 581 | "df = window_f.fit_transform(df)\n", 582 | "\n", 583 | "df.head()" 584 | ] 585 | }, 586 | { 587 | "cell_type": "markdown", 588 | "id": "29713ebf", 589 | "metadata": {}, 590 | "source": [ 591 | "## Datetime features\n", 592 | "\n", 593 | "With feature-engine, we can create many features automatically." 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": 8, 599 | "id": "b46f7c2e", 600 | "metadata": {}, 601 | "outputs": [ 602 | { 603 | "data": { 604 | "text/html": [ 605 | "
\n", 606 | "\n", 619 | "\n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | "
demanddemand_lag_1_xdemand_lag_24_xdemand_lag_144_xdemand_lag_1_ydemand_lag_24_ydemand_lag_144_ydemand_window_3_meandemand_window_3_stddemand_window_24_meandemand_window_24_stddemand_lag_1demand_lag_24demand_lag_168monthday_of_weekhour
date_time
2002-01-01 00:00:006919.366092NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN110
2002-01-01 01:00:007165.9741886919.366092NaNNaN6919.366092NaNNaNNaNNaNNaNNaN6919.366092NaNNaN111
2002-01-01 02:00:006406.5429947165.974188NaNNaN7165.974188NaNNaNNaNNaNNaNNaN7165.974188NaNNaN112
2002-01-01 03:00:005815.5378286406.542994NaNNaN6406.542994NaNNaN6830.627758387.414253NaNNaN6406.542994NaNNaN113
2002-01-01 04:00:005497.7329225815.537828NaNNaN5815.537828NaNNaN6462.685003676.966421NaNNaN5815.537828NaNNaN114
\n", 765 | "
" 766 | ], 767 | "text/plain": [ 768 | " demand demand_lag_1_x demand_lag_24_x \\\n", 769 | "date_time \n", 770 | "2002-01-01 00:00:00 6919.366092 NaN NaN \n", 771 | "2002-01-01 01:00:00 7165.974188 6919.366092 NaN \n", 772 | "2002-01-01 02:00:00 6406.542994 7165.974188 NaN \n", 773 | "2002-01-01 03:00:00 5815.537828 6406.542994 NaN \n", 774 | "2002-01-01 04:00:00 5497.732922 5815.537828 NaN \n", 775 | "\n", 776 | " demand_lag_144_x demand_lag_1_y demand_lag_24_y \\\n", 777 | "date_time \n", 778 | "2002-01-01 00:00:00 NaN NaN NaN \n", 779 | "2002-01-01 01:00:00 NaN 6919.366092 NaN \n", 780 | "2002-01-01 02:00:00 NaN 7165.974188 NaN \n", 781 | "2002-01-01 03:00:00 NaN 6406.542994 NaN \n", 782 | "2002-01-01 04:00:00 NaN 5815.537828 NaN \n", 783 | "\n", 784 | " demand_lag_144_y demand_window_3_mean \\\n", 785 | "date_time \n", 786 | "2002-01-01 00:00:00 NaN NaN \n", 787 | "2002-01-01 01:00:00 NaN NaN \n", 788 | "2002-01-01 02:00:00 NaN NaN \n", 789 | "2002-01-01 03:00:00 NaN 6830.627758 \n", 790 | "2002-01-01 04:00:00 NaN 6462.685003 \n", 791 | "\n", 792 | " demand_window_3_std demand_window_24_mean \\\n", 793 | "date_time \n", 794 | "2002-01-01 00:00:00 NaN NaN \n", 795 | "2002-01-01 01:00:00 NaN NaN \n", 796 | "2002-01-01 02:00:00 NaN NaN \n", 797 | "2002-01-01 03:00:00 387.414253 NaN \n", 798 | "2002-01-01 04:00:00 676.966421 NaN \n", 799 | "\n", 800 | " demand_window_24_std demand_lag_1 demand_lag_24 \\\n", 801 | "date_time \n", 802 | "2002-01-01 00:00:00 NaN NaN NaN \n", 803 | "2002-01-01 01:00:00 NaN 6919.366092 NaN \n", 804 | "2002-01-01 02:00:00 NaN 7165.974188 NaN \n", 805 | "2002-01-01 03:00:00 NaN 6406.542994 NaN \n", 806 | "2002-01-01 04:00:00 NaN 5815.537828 NaN \n", 807 | "\n", 808 | " demand_lag_168 month day_of_week hour \n", 809 | "date_time \n", 810 | "2002-01-01 00:00:00 NaN 1 1 0 \n", 811 | "2002-01-01 01:00:00 NaN 1 1 1 \n", 812 | "2002-01-01 02:00:00 NaN 1 1 2 \n", 813 | "2002-01-01 03:00:00 NaN 1 1 3 \n", 814 | "2002-01-01 04:00:00 NaN 1 1 4 " 815 | ] 816 | }, 817 | "execution_count": 8, 818 | "metadata": {}, 819 | "output_type": "execute_result" 820 | } 821 | ], 822 | "source": [ 823 | "date_f = DatetimeFeatures(\n", 824 | " variables=\"index\",\n", 825 | " features_to_extract=[\"month\", \"day_of_week\", \"hour\"]\n", 826 | ")\n", 827 | "\n", 828 | "df = date_f.fit_transform(df)\n", 829 | "\n", 830 | "df.head()" 831 | ] 832 | }, 833 | { 834 | "cell_type": "markdown", 835 | "id": "c185bfde", 836 | "metadata": {}, 837 | "source": [ 838 | "## Finalize tabularization\n", 839 | "\n", 840 | "Now we just separate our data into the table of features and the target variable." 841 | ] 842 | }, 843 | { 844 | "cell_type": "code", 845 | "execution_count": 9, 846 | "id": "c27e6d18", 847 | "metadata": {}, 848 | "outputs": [ 849 | { 850 | "data": { 851 | "text/html": [ 852 | "
\n", 853 | "\n", 866 | "\n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | "
demand_lag_1_xdemand_lag_24_xdemand_lag_144_xdemand_lag_1_ydemand_lag_24_ydemand_lag_144_ydemand_window_3_meandemand_window_3_stddemand_window_24_meandemand_window_24_stddemand_lag_1demand_lag_24demand_lag_168monthday_of_weekhour
date_time
2002-01-08 00:00:007406.0479106808.0089166579.2198807406.0479106808.0089166579.2198807003.111898369.9471687637.818532865.1853517406.0479106808.0089166919.366092110
2002-01-08 01:00:007077.0819047209.2857126990.8264207077.0819047209.2857126990.8264207053.973603364.1787347649.029907855.6557567077.0819047209.2857127165.974188111
2002-01-08 02:00:007445.3543106535.8183426382.9150187445.3543106535.8183426382.9150187309.494708202.2326187658.866098851.7287427445.3543106535.8183426406.542994112
2002-01-08 03:00:006800.5774786112.3826365896.9281386800.5774786112.3826365896.9281387107.671231323.4749937669.897729838.1570086800.5774786112.3826365815.537828113
2002-01-08 04:00:006340.9140866165.8820965853.9371406340.9140866165.8820965853.9371406862.281958554.7996347679.419873820.8117166340.9140866165.8820965497.732922114
\n", 1005 | "
" 1006 | ], 1007 | "text/plain": [ 1008 | " demand_lag_1_x demand_lag_24_x demand_lag_144_x \\\n", 1009 | "date_time \n", 1010 | "2002-01-08 00:00:00 7406.047910 6808.008916 6579.219880 \n", 1011 | "2002-01-08 01:00:00 7077.081904 7209.285712 6990.826420 \n", 1012 | "2002-01-08 02:00:00 7445.354310 6535.818342 6382.915018 \n", 1013 | "2002-01-08 03:00:00 6800.577478 6112.382636 5896.928138 \n", 1014 | "2002-01-08 04:00:00 6340.914086 6165.882096 5853.937140 \n", 1015 | "\n", 1016 | " demand_lag_1_y demand_lag_24_y demand_lag_144_y \\\n", 1017 | "date_time \n", 1018 | "2002-01-08 00:00:00 7406.047910 6808.008916 6579.219880 \n", 1019 | "2002-01-08 01:00:00 7077.081904 7209.285712 6990.826420 \n", 1020 | "2002-01-08 02:00:00 7445.354310 6535.818342 6382.915018 \n", 1021 | "2002-01-08 03:00:00 6800.577478 6112.382636 5896.928138 \n", 1022 | "2002-01-08 04:00:00 6340.914086 6165.882096 5853.937140 \n", 1023 | "\n", 1024 | " demand_window_3_mean demand_window_3_std \\\n", 1025 | "date_time \n", 1026 | "2002-01-08 00:00:00 7003.111898 369.947168 \n", 1027 | "2002-01-08 01:00:00 7053.973603 364.178734 \n", 1028 | "2002-01-08 02:00:00 7309.494708 202.232618 \n", 1029 | "2002-01-08 03:00:00 7107.671231 323.474993 \n", 1030 | "2002-01-08 04:00:00 6862.281958 554.799634 \n", 1031 | "\n", 1032 | " demand_window_24_mean demand_window_24_std \\\n", 1033 | "date_time \n", 1034 | "2002-01-08 00:00:00 7637.818532 865.185351 \n", 1035 | "2002-01-08 01:00:00 7649.029907 855.655756 \n", 1036 | "2002-01-08 02:00:00 7658.866098 851.728742 \n", 1037 | "2002-01-08 03:00:00 7669.897729 838.157008 \n", 1038 | "2002-01-08 04:00:00 7679.419873 820.811716 \n", 1039 | "\n", 1040 | " demand_lag_1 demand_lag_24 demand_lag_168 month \\\n", 1041 | "date_time \n", 1042 | "2002-01-08 00:00:00 7406.047910 6808.008916 6919.366092 1 \n", 1043 | "2002-01-08 01:00:00 7077.081904 7209.285712 7165.974188 1 \n", 1044 | "2002-01-08 02:00:00 7445.354310 6535.818342 6406.542994 1 \n", 1045 | "2002-01-08 03:00:00 6800.577478 6112.382636 5815.537828 1 \n", 1046 | "2002-01-08 04:00:00 6340.914086 6165.882096 5497.732922 1 \n", 1047 | "\n", 1048 | " day_of_week hour \n", 1049 | "date_time \n", 1050 | "2002-01-08 00:00:00 1 0 \n", 1051 | "2002-01-08 01:00:00 1 1 \n", 1052 | "2002-01-08 02:00:00 1 2 \n", 1053 | "2002-01-08 03:00:00 1 3 \n", 1054 | "2002-01-08 04:00:00 1 4 " 1055 | ] 1056 | }, 1057 | "execution_count": 9, 1058 | "metadata": {}, 1059 | "output_type": "execute_result" 1060 | } 1061 | ], 1062 | "source": [ 1063 | "df.dropna(inplace=True)\n", 1064 | "\n", 1065 | "y = df[\"demand\"]\n", 1066 | "X = df.drop(\"demand\", axis=1)\n", 1067 | "\n", 1068 | "# Predictors\n", 1069 | "\n", 1070 | "X.head()" 1071 | ] 1072 | }, 1073 | { 1074 | "cell_type": "code", 1075 | "execution_count": 10, 1076 | "id": "4ed38b9e", 1077 | "metadata": {}, 1078 | "outputs": [ 1079 | { 1080 | "data": { 1081 | "text/plain": [ 1082 | "date_time\n", 1083 | "2002-01-08 00:00:00 7077.081904\n", 1084 | "2002-01-08 01:00:00 7445.354310\n", 1085 | "2002-01-08 02:00:00 6800.577478\n", 1086 | "2002-01-08 03:00:00 6340.914086\n", 1087 | "2002-01-08 04:00:00 6277.978250\n", 1088 | "Freq: h, Name: demand, dtype: float64" 1089 | ] 1090 | }, 1091 | "execution_count": 10, 1092 | "metadata": {}, 1093 | "output_type": "execute_result" 1094 | } 1095 | ], 1096 | "source": [ 1097 | "# target\n", 1098 | "\n", 1099 | "y.head()" 1100 | ] 1101 | }, 1102 | { 1103 | "cell_type": "markdown", 1104 | "id": "2439ccc5", 1105 | "metadata": {}, 1106 | "source": [ 1107 | "That's it! 
We can now forecast the energy demand in the next hour as a regression.\n", 1108 | "\n", 1109 | "In this notebook, we only extracted features from the time series. We can add more features from external data sources. We will address that in coming notebooks." 1110 | ] 1111 | }, 1112 | { 1113 | "cell_type": "code", 1114 | "execution_count": null, 1115 | "id": "78a62f26", 1116 | "metadata": {}, 1117 | "outputs": [], 1118 | "source": [] 1119 | } 1120 | ], 1121 | "metadata": { 1122 | "kernelspec": { 1123 | "display_name": ".venv", 1124 | "language": "python", 1125 | "name": "python3" 1126 | }, 1127 | "language_info": { 1128 | "codemirror_mode": { 1129 | "name": "ipython", 1130 | "version": 3 1131 | }, 1132 | "file_extension": ".py", 1133 | "mimetype": "text/x-python", 1134 | "name": "python", 1135 | "nbconvert_exporter": "python", 1136 | "pygments_lexer": "ipython3", 1137 | "version": "3.12.6" 1138 | }, 1139 | "toc": { 1140 | "base_numbering": 1, 1141 | "nav_menu": {}, 1142 | "number_sections": true, 1143 | "sideBar": true, 1144 | "skip_h1_title": false, 1145 | "title_cell": "Table of Contents", 1146 | "title_sidebar": "Contents", 1147 | "toc_cell": false, 1148 | "toc_position": {}, 1149 | "toc_section_display": true, 1150 | "toc_window_display": true 1151 | } 1152 | }, 1153 | "nbformat": 4, 1154 | "nbformat_minor": 5 1155 | } 1156 | -------------------------------------------------------------------------------- /02-multistep-forecasting/assignment/02-assignment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "a9b2affd", 6 | "metadata": {}, 7 | "source": [ 8 | "# Direct forecasting\n", 9 | "\n", 10 | "[Forecasting with Machine Learning - Course](https://www.trainindata.com/p/forecasting-with-machine-learning)\n", 11 | "\n", 12 | "Load the retail sales data set located in Facebook's Prophet Github repository and use **direct forecasting** to predict future sales. \n", 13 | "\n", 14 | "- We want to forecast sales over the next 3 months. \n", 15 | "- Sales are recorded monthly. \n", 16 | "- We assume that we have all data to the month before the first point in the forecasting horizon.\n", 17 | "\n", 18 | "We will forecast using Scikit-learn in this exercise.\n", 19 | "\n", 20 | "Follow the guidelines below to accomplish this assignment.\n", 21 | "\n", 22 | "## Import required classes and functions" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "id": "6fde7533", 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "import pandas as pd" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "id": "b50b4411", 38 | "metadata": {}, 39 | "source": [ 40 | "## Load data" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "id": "cf8dd614", 47 | "metadata": {}, 48 | "outputs": [ 49 | { 50 | "data": { 51 | "text/html": [ 52 | "
\n", 53 | "\n", 66 | "\n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | "
y
ds
1992-01-01146376
1992-02-01147079
1992-03-01159336
1992-04-01163669
1992-05-01170068
\n",
       " ],
      "text/plain": [
       "                 y\n",
       "ds                \n",
       "1992-01-01  146376\n",
       "1992-02-01  147079\n",
       "1992-03-01  159336\n",
       "1992-04-01  163669\n",
       "1992-05-01  170068"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url = \"https://raw.githubusercontent.com/facebook/prophet/master/examples/example_retail_sales.csv\"\n",
    "df = pd.read_csv(url)\n",
    "df.to_csv(\"example_retail_sales.csv\", index=False)\n",
    "\n",
    "df = pd.read_csv(\n",
    "    \"example_retail_sales.csv\",\n",
    "    parse_dates=[\"ds\"],\n",
    "    index_col=[\"ds\"],\n",
    "    nrows=160,\n",
    ")\n",
    "\n",
    "df = df.asfreq(\"MS\")\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d03172ab",
   "metadata": {},
   "source": [
    "## Create the target variable\n",
    "\n",
    "In direct forecasting, we train a model per step. Hence, we need to create one target per step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d77bd8b7",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "1d003eaf",
   "metadata": {},
   "source": [
    "## Split data into train and test\n",
    "\n",
    "Leave data from 2004 onwards in the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f26c24e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "caa19336",
   "metadata": {},
   "source": [
    "## Set up regression model\n",
    "\n",
    "We will use Lasso in this assignment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c61e6218",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "9b69bbde",
   "metadata": {},
   "source": [
    "## Set up a feature engineering pipeline\n",
    "\n",
    "Set up transformers from feature-engine and/or scikit-learn in a pipeline, and test it to make sure the input feature table is the one you need for the forecasts.\n",
    "\n",
    "We will use feature-engine because we are great fans of the library.\n",
    "\n",
    "If you prefer pandas, as long as the input feature table is the one you expect, that is also a suitable alternative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00908e10",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "721fb6aa",
   "metadata": {},
   "source": [
    "## Test pipeline over test set\n",
    "\n",
    "Ensure that the returned input feature table is suitable to forecast from `2004-01-01` onwards."
218 ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84aee4f3",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "3132210a",
   "metadata": {},
   "source": [
    "## Train the direct forecaster\n",
    "\n",
    "Now that we know that the pipeline works, we can train the forecaster.\n",
    "\n",
    "You can take the feature table and targets returned up to here to train the Lasso.\n",
    "\n",
    "Or, as we will do, you can add the Lasso within the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a120faf",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "886f78c8",
   "metadata": {},
   "source": [
    "## Forecast 3 months of sales\n",
    "\n",
    "We'll start by forecasting 3 months of sales, starting at every point of the test set.\n",
    "\n",
    "This is the equivalent of backtesting without refitting. More info in section 6!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1cde1862",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "578c60c4",
   "metadata": {},
   "source": [
    "## Plot predictions vs actuals\n",
    "\n",
    "Pick the first row of predictions and plot them against the real sales."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "778c7bd5",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "528988f7",
   "metadata": {},
   "source": [
    "## Determine the RMSE\n",
    "\n",
    "Pick the first row of predictions and calculate the RMSE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67408586",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "914b7abb",
   "metadata": {},
   "source": [
    "## Forecast next 3 months of sales\n",
    "\n",
    "Predict the 3 months of sales immediately after the end of the test set.\n",
    "\n",
    "That is, starting on `2005-02-02`.",
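    "\n",
    "If the direct strategy feels abstract, here is a tiny self-contained toy example (our addition, not the official solution; all names are made up) of the core idea, one model per horizon step:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.linear_model import Lasso\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "X = rng.normal(size=(100, 3))  # stand-in feature table\n",
    "\n",
    "# One target per step: y[h] plays the role of the series shifted h steps ahead.\n",
    "y = {h: X @ np.array([1.0, 2.0, 3.0]) + h + rng.normal(size=100) for h in (1, 2, 3)}\n",
    "\n",
    "# Direct forecasting trains one model per step...\n",
    "models = {h: Lasso(alpha=0.1).fit(X, y[h]) for h in (1, 2, 3)}\n",
    "\n",
    "# ...and predicts every step from the same forecast-origin features.\n",
    "x_last = X[-1:]\n",
    "print([models[h].predict(x_last)[0] for h in (1, 2, 3)])\n",
    "```"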
316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "id": "9c4f46ac", 322 | "metadata": {}, 323 | "outputs": [], 324 | "source": [] 325 | } 326 | ], 327 | "metadata": { 328 | "kernelspec": { 329 | "display_name": "fwml", 330 | "language": "python", 331 | "name": "fwml" 332 | }, 333 | "language_info": { 334 | "codemirror_mode": { 335 | "name": "ipython", 336 | "version": 3 337 | }, 338 | "file_extension": ".py", 339 | "mimetype": "text/x-python", 340 | "name": "python", 341 | "nbconvert_exporter": "python", 342 | "pygments_lexer": "ipython3", 343 | "version": "3.10.5" 344 | }, 345 | "toc": { 346 | "base_numbering": 1, 347 | "nav_menu": {}, 348 | "number_sections": true, 349 | "sideBar": true, 350 | "skip_h1_title": false, 351 | "title_cell": "Table of Contents", 352 | "title_sidebar": "Contents", 353 | "toc_cell": false, 354 | "toc_position": {}, 355 | "toc_section_display": true, 356 | "toc_window_display": true 357 | } 358 | }, 359 | "nbformat": 4, 360 | "nbformat_minor": 5 361 | } 362 | -------------------------------------------------------------------------------- /03-multiseries-forecasting/exercise-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "327f3f9b-3f15-4bec-bafe-b75e5cf91740", 6 | "metadata": {}, 7 | "source": [ 8 | "# Exercise 1: Multiple independent time series" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "8f5b8ff1-e3f7-47af-b603-15d89d0c6fe5", 14 | "metadata": {}, 15 | "source": [ 16 | "[Forecasting for machine learning](https://www.trainindata.com/p/forecasting-with-machine-learning)\n", 17 | "\n", 18 | "In this notebook we have an exercise to do multiple independent time series forecasting. The solutions we show are only one way of answering these questions." 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "id": "91a8c715-0369-4928-907d-9492db4f37a1", 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "import numpy as np\n", 29 | "import pandas as pd\n", 30 | "import matplotlib.pyplot as plt" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "f6e9fcef-4dde-4107-bce6-216516626f86", 36 | "metadata": {}, 37 | "source": [ 38 | "# Data preparation" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "id": "d91ac0bd-372a-4ef0-9c0e-a0e0cf12560d", 44 | "metadata": {}, 45 | "source": [ 46 | "The dataset we shall use is the Quarterly overnight trips (in thousands) from 1998 Q1 to 2016 Q4 across\n", 47 | "Australia. The number of trips is split by `State`, `Region`, and `Purpose`. \n", 48 | "\n", 49 | "**In this exercise we are going to forecast the total number of trips for each Region (there are 76 regions therefore we will have 76 time series). 
We shall treat this as a multiple independent time series forecasting problem.**\n",
    "\n",
    "Source: Wang, E., D. Cook, and R.J. Hyndman (2020). A new tidy data structure to support\n",
    "exploration and modeling of temporal data, Journal of Computational and\n",
    "Graphical Statistics, 29:3, 466-478, doi:10.1080/10618600.2019.1695624."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "76ef5949-cbf1-4109-887d-104f71574c2a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "australia_tourism\n",
      "-----------------\n",
      "Quarterly overnight trips (in thousands) from 1998 Q1 to 2016 Q4 across\n",
      "Australia. The tourism regions are formed through the aggregation of Statistical\n",
      "Local Areas (SLAs) which are defined by the various State and Territory tourism\n",
      "authorities according to their research and marketing needs.\n",
      "Wang, E, D Cook, and RJ Hyndman (2020). A new tidy data structure to support\n",
      "exploration and modeling of temporal data, Journal of Computational and\n",
      "Graphical Statistics, 29:3, 466-478, doi:10.1080/10618600.2019.1695624.\n",
      "Shape of the dataset: (24320, 5)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "
\n", 83 | "\n", 96 | "\n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | "
date_timeRegionStatePurposeTrips
01998-01-01AdelaideSouth AustraliaBusiness135.077690
11998-04-01AdelaideSouth AustraliaBusiness109.987316
21998-07-01AdelaideSouth AustraliaBusiness166.034687
31998-10-01AdelaideSouth AustraliaBusiness127.160464
41999-01-01AdelaideSouth AustraliaBusiness137.448533
\n", 150 | "
" 151 | ], 152 | "text/plain": [ 153 | " date_time Region State Purpose Trips\n", 154 | "0 1998-01-01 Adelaide South Australia Business 135.077690\n", 155 | "1 1998-04-01 Adelaide South Australia Business 109.987316\n", 156 | "2 1998-07-01 Adelaide South Australia Business 166.034687\n", 157 | "3 1998-10-01 Adelaide South Australia Business 127.160464\n", 158 | "4 1999-01-01 Adelaide South Australia Business 137.448533" 159 | ] 160 | }, 161 | "execution_count": 2, 162 | "metadata": {}, 163 | "output_type": "execute_result" 164 | } 165 | ], 166 | "source": [ 167 | "from skforecast.datasets import fetch_dataset\n", 168 | "\n", 169 | "# Load the data\n", 170 | "data = fetch_dataset(name=\"australia_tourism\", raw=True)\n", 171 | "data.head()" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "id": "48a62223-83ab-4a7e-af1b-4fe91e1b31c7", 177 | "metadata": {}, 178 | "source": [ 179 | "Pre-process the data by performing the following:\n", 180 | "1) Convert the `date_time` column to datetime type\n", 181 | "2) Create a dataframe with one column per `Region` which gives the total number of Trips for each date.\n", 182 | "3) Ensure the index is `date_time` and resampled to quarterly start `QS`\n" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 3, 188 | "id": "6506a6d4-e680-4c12-9dd1-7e037c91a2ef", 189 | "metadata": {}, 190 | "outputs": [ 191 | { 192 | "data": { 193 | "text/html": [ 194 | "
\n", 195 | "\n", 208 | "\n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | "
RegionAdelaideAdelaide HillsAlice SpringsAustralia's Coral CoastAustralia's Golden OutbackAustralia's North WestAustralia's South WestBallaratBarklyBarossa...Sunshine CoastSydneyThe MurrayTropical North QueenslandUpper YarraWestern GrampiansWhitsundaysWilderness WestWimmeraYorke Peninsula
date_time
1998-01-01658.5538959.79863020.207638132.516409161.726948120.775450474.858729182.23934118.46520646.796083...742.6022992288.955629356.500087220.915346102.79102286.99659160.22664963.33509718.804743160.681637
1998-04-01449.85393526.06695256.356223172.615378164.973780158.404387411.622281137.5665397.51096949.428717...609.8833331814.459480312.291189253.09761674.85513684.939977106.19084842.60707652.482311104.324252
1998-07-01592.90459726.491072110.918441173.904335206.879934184.619035360.039657117.64276143.56562529.743302...615.3063311989.731939376.718698423.50673559.46540579.97488481.77100518.85121435.65755168.996468
1998-10-01524.24276027.25685940.868270207.002571198.509591138.878263462.620050136.07272429.35923978.193066...684.4302392150.913627336.367694283.69445135.238855116.235617105.60014350.45096527.204455103.340264
1999-01-01548.39410513.77297548.368038198.856638140.213443103.337122562.974629156.4562426.34199735.277910...842.1674181779.286905323.418472194.50990467.823457101.765635111.50497259.88800350.219851146.658290
\n", 382 | "

5 rows × 76 columns

\n", 383 | "
" 384 | ], 385 | "text/plain": [ 386 | "Region Adelaide Adelaide Hills Alice Springs \\\n", 387 | "date_time \n", 388 | "1998-01-01 658.553895 9.798630 20.207638 \n", 389 | "1998-04-01 449.853935 26.066952 56.356223 \n", 390 | "1998-07-01 592.904597 26.491072 110.918441 \n", 391 | "1998-10-01 524.242760 27.256859 40.868270 \n", 392 | "1999-01-01 548.394105 13.772975 48.368038 \n", 393 | "\n", 394 | "Region Australia's Coral Coast Australia's Golden Outback \\\n", 395 | "date_time \n", 396 | "1998-01-01 132.516409 161.726948 \n", 397 | "1998-04-01 172.615378 164.973780 \n", 398 | "1998-07-01 173.904335 206.879934 \n", 399 | "1998-10-01 207.002571 198.509591 \n", 400 | "1999-01-01 198.856638 140.213443 \n", 401 | "\n", 402 | "Region Australia's North West Australia's South West Ballarat \\\n", 403 | "date_time \n", 404 | "1998-01-01 120.775450 474.858729 182.239341 \n", 405 | "1998-04-01 158.404387 411.622281 137.566539 \n", 406 | "1998-07-01 184.619035 360.039657 117.642761 \n", 407 | "1998-10-01 138.878263 462.620050 136.072724 \n", 408 | "1999-01-01 103.337122 562.974629 156.456242 \n", 409 | "\n", 410 | "Region Barkly Barossa ... Sunshine Coast Sydney \\\n", 411 | "date_time ... \n", 412 | "1998-01-01 18.465206 46.796083 ... 742.602299 2288.955629 \n", 413 | "1998-04-01 7.510969 49.428717 ... 609.883333 1814.459480 \n", 414 | "1998-07-01 43.565625 29.743302 ... 615.306331 1989.731939 \n", 415 | "1998-10-01 29.359239 78.193066 ... 684.430239 2150.913627 \n", 416 | "1999-01-01 6.341997 35.277910 ... 842.167418 1779.286905 \n", 417 | "\n", 418 | "Region The Murray Tropical North Queensland Upper Yarra \\\n", 419 | "date_time \n", 420 | "1998-01-01 356.500087 220.915346 102.791022 \n", 421 | "1998-04-01 312.291189 253.097616 74.855136 \n", 422 | "1998-07-01 376.718698 423.506735 59.465405 \n", 423 | "1998-10-01 336.367694 283.694451 35.238855 \n", 424 | "1999-01-01 323.418472 194.509904 67.823457 \n", 425 | "\n", 426 | "Region Western Grampians Whitsundays Wilderness West Wimmera \\\n", 427 | "date_time \n", 428 | "1998-01-01 86.996591 60.226649 63.335097 18.804743 \n", 429 | "1998-04-01 84.939977 106.190848 42.607076 52.482311 \n", 430 | "1998-07-01 79.974884 81.771005 18.851214 35.657551 \n", 431 | "1998-10-01 116.235617 105.600143 50.450965 27.204455 \n", 432 | "1999-01-01 101.765635 111.504972 59.888003 50.219851 \n", 433 | "\n", 434 | "Region Yorke Peninsula \n", 435 | "date_time \n", 436 | "1998-01-01 160.681637 \n", 437 | "1998-04-01 104.324252 \n", 438 | "1998-07-01 68.996468 \n", 439 | "1998-10-01 103.340264 \n", 440 | "1999-01-01 146.658290 \n", 441 | "\n", 442 | "[5 rows x 76 columns]" 443 | ] 444 | }, 445 | "execution_count": 3, 446 | "metadata": {}, 447 | "output_type": "execute_result" 448 | } 449 | ], 450 | "source": [] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "id": "589f58e0-6106-430c-ab2b-557ea84bb2a5", 455 | "metadata": {}, 456 | "source": [ 457 | "Check for missing values." 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "id": "92bd6aaf-caac-4bb8-96a2-c0d59220f6e8", 464 | "metadata": {}, 465 | "outputs": [], 466 | "source": [] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "id": "453278bb-fa05-42d1-bc66-a17b05d501c9", 471 | "metadata": {}, 472 | "source": [ 473 | "Later we may want to use LightGBM, it does not support special JSON characters (e.g., `'`) in the column name. Let's remove these characters from the column names." 
474 ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "872210ab-580d-4420-9fbc-2e3014d5883b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "data = data.rename(columns=lambda x: re.sub(\"[^A-Za-z0-9_]+\", \"\", x))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44fb99c2-65b9-4665-9c55-528af407baee",
   "metadata": {},
   "source": [
    "Assign the names of the regions to a variable `Region`. We will use this later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57f6888c-a06c-4cd8-a358-039860bc5efd",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "eb03773c-a9bf-4fa8-84a4-099af6b87827",
   "metadata": {},
   "source": [
    "# Exploratory data analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d1ba5b6-b49d-4bab-882e-915f075b9493",
   "metadata": {},
   "source": [
    "Print the number of data points in the time series, the start time, and the end time of the time series."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36af3ab9-4267-471d-800b-dc9a3dcdb43e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "3ffa8a95-d0d5-48e4-ba5f-b92a4f54c0ef",
   "metadata": {},
   "source": [
    "Plot the time series summed over all regions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1629484-508f-419f-8647-1ef6626eab26",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "16971015-96a9-480e-9f9e-f21637d68e87",
   "metadata": {},
   "source": [
    "Plot a subsample of the time series from different regions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7193016-417f-4bec-87e0-dcce787ee7fe",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "143c6eab-8a90-4b45-9f63-db450d91c296",
   "metadata": {},
   "source": [
    "These series appear to show yearly seasonality, and some appear to be anti-correlated (i.e., some areas experience peaks whilst others experience troughs)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af0befd7-d612-4bfe-8cb6-fc2bfd2bc9f1",
   "metadata": {},
   "source": [
    "Create a quarter-of-the-year feature, which could help capture the yearly seasonality.",
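    "\n",
    "One possible approach (our addition; it assumes `data` carries the quarterly `DatetimeIndex` built above):\n",
    "\n",
    "```python\n",
    "# pandas exposes the quarter (1-4) directly on a DatetimeIndex.\n",
    "data[\"quarter\"] = data.index.quarter\n",
    "```"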
574 ]
  },
  {
   "cell_type": "markdown",
   "id": "bed95ade-4109-4b31-847b-28f5c9807692",
   "metadata": {},
   "source": [
    "# Forecasting"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a40647fe-99b4-474e-8a03-b7bf4f9eed29",
   "metadata": {},
   "source": [
    "Import the class needed for recursive forecasting of multiple independent time series from `skforecast`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e00293e1",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b2e3c25",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "613011ce-c45c-40b5-b46a-e6a8037d4f08",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "d692c732-e699-46cb-bab6-cd148e9af1f1",
   "metadata": {},
   "source": [
    "Import a transformer from `sklearn` to scale the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cadffd44-e241-4677-b640-9c8b9db2784f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "eaf0b4fb-9548-4da1-b356-e9ec377d68f9",
   "metadata": {},
   "source": [
    "Import a model of your choice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0495bdc3-5e22-483a-b0ab-2b3e0402fbee",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "66a304ca-77a4-48f6-99ef-f24d30417daa",
   "metadata": {},
   "source": [
    "Assign the names of the regions to a `target_cols` variable and any exogenous features to an `exog_cols` variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be5cfa65-d69b-427c-bccc-0021fa04ef02",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "9a24e307-22ef-44f1-8f11-b7746a70d3bb",
   "metadata": {},
   "source": [
    "Specify a forecast horizon and assign it to a variable `steps`. Try forecasting 8 quarters into the future.",
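    "\n",
    "For example (our addition):\n",
    "\n",
    "```python\n",
    "# 8 quarterly steps = 2 years ahead.\n",
    "steps = 8\n",
    "```"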
678 | ] 679 | }, 680 | { 681 | "cell_type": "code", 682 | "execution_count": null, 683 | "id": "2b37700d-9bfa-4c60-926a-b48a9e32c7ce", 684 | "metadata": {}, 685 | "outputs": [], 686 | "source": [] 687 | }, 688 | { 689 | "cell_type": "markdown", 690 | "id": "4109bb1b-b6a1-48f1-99ca-caca04c01a63", 691 | "metadata": {}, 692 | "source": [ 693 | "Create a dataframe for the future values of any exogenous features.\n", 694 | "\n", 695 | "Hint: `pd.DateOffset` and using `freq=\"QS\"` in `pd.date_range` might be helpful." 696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": null, 701 | "id": "c96dd685-09d8-42d0-949b-652acf45e56a", 702 | "metadata": {}, 703 | "outputs": [], 704 | "source": [] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "id": "dc854213-25d3-47f9-877b-af4a139ee94f", 709 | "metadata": {}, 710 | "source": [ 711 | "Define window features using the `RollingFeatures` class from skforecast. Try windows of 4 and 8 (1 and 2 years)." 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": null, 717 | "id": "b2d65f26-66b6-4bf0-9275-9d51f7561589", 718 | "metadata": {}, 719 | "outputs": [], 720 | "source": [] 721 | }, 722 | { 723 | "cell_type": "markdown", 724 | "id": "7023fc71-a772-462b-8456-f34aa91c71bf", 725 | "metadata": {}, 726 | "source": [ 727 | "Define a weight function (a function of the time axis) that linearly decreases the weight from 1 to 0 as we go back in time. This will give more weight to recent dates. Define it so there are no hard-coded dates in the function.\n", 728 | "\n", 729 | "Hint: Consider using `np.linspace`" 730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": null, 735 | "id": "2fceaa6a-3795-4ebc-89a4-cdd0933afe65", 736 | "metadata": {}, 737 | "outputs": [], 738 | "source": [] 739 | }, 740 | { 741 | "cell_type": "markdown", 742 | "id": "3ccac13b-ba92-4403-840b-0107271cb8df", 743 | "metadata": {}, 744 | "source": [ 745 | "Define a forecaster to predict all the time series. Pass your weight function and window features to the forecaster." 746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "execution_count": null, 751 | "id": "3a6dca60-9b60-4504-a7be-763172567716", 752 | "metadata": {}, 753 | "outputs": [], 754 | "source": [] 755 | }, 756 | { 757 | "cell_type": "markdown", 758 | "id": "a511fcdc-dea7-48a8-ac05-8089bb92cad2", 759 | "metadata": {}, 760 | "source": [ 761 | "Fit the forecaster." 762 | ] 763 | }, 764 | { 765 | "cell_type": "code", 766 | "execution_count": null, 767 | "id": "3d0ebf20-d667-4298-bb05-9c84770b3169", 768 | "metadata": {}, 769 | "outputs": [], 770 | "source": [] 771 | }, 772 | { 773 | "cell_type": "markdown", 774 | "id": "f03fa990-4113-4a4e-be3b-28f6695ec179", 775 | "metadata": {}, 776 | "source": [ 777 | "Make a forecast." 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": null, 783 | "id": "450a6090-e4ae-424f-91d9-840044ca4c16", 784 | "metadata": {}, 785 | "outputs": [], 786 | "source": [] 787 | }, 788 | { 789 | "cell_type": "markdown", 790 | "id": "dd7553db-6b01-48ad-aa43-58506fd4babb", 791 | "metadata": {}, 792 | "source": [ 793 | "Plot a random subset of the time series and the forecast."
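Putting the remaining set-up cells together: future exogenous values, window features, the weight function, and the forecaster itself, followed by fitting and forecasting. This is a sketch that reuses the names from the previous snippet and assumes a quarterly-start (`QS`) index:

```python
import numpy as np
import pandas as pd

from skforecast.preprocessing import RollingFeatures

# Future values of the exogenous feature: one row per forecast step,
# starting one quarter after the last observed date.
future_index = pd.date_range(
    start=data.index[-1] + pd.DateOffset(months=3), periods=steps, freq="QS"
)
exog_future = pd.DataFrame({"quarter": future_index.quarter}, index=future_index)

# Rolling means and standard deviations over 4 and 8 quarters (1 and 2 years).
window_features = RollingFeatures(
    stats=["mean", "std", "mean", "std"], window_sizes=[4, 4, 8, 8]
)

# Weights grow linearly from 0 (oldest date) to 1 (most recent date);
# nothing is hard-coded, so it adapts to any training window.
def custom_weights(index):
    return np.linspace(0, 1, len(index))

forecaster = ForecasterRecursiveMultiSeries(
    regressor=LGBMRegressor(random_state=42, verbose=-1),
    lags=8,
    window_features=window_features,
    transformer_series=StandardScaler(),
    weight_func=custom_weights,
)
forecaster.fit(series=data[target_cols], exog=data[exog_cols])
predictions = forecaster.predict(steps=steps, exog=exog_future)
```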
794 | ] 795 | }, 796 | { 797 | "cell_type": "code", 798 | "execution_count": null, 799 | "id": "067f8c6e-b570-4cb7-9be2-0fd35e8c2937", 800 | "metadata": {}, 801 | "outputs": [], 802 | "source": [] 803 | }, 804 | { 805 | "cell_type": "markdown", 806 | "id": "9861f6b8-a2ec-4f20-8197-ac7ae47973d5", 807 | "metadata": {}, 808 | "source": [ 809 | "# Feature importance" 810 | ] 811 | }, 812 | { 813 | "cell_type": "markdown", 814 | "id": "52146b7c-52a9-4f0f-a353-0c5965302fd7", 815 | "metadata": {}, 816 | "source": [ 817 | "Plot the 10 most important features." 818 | ] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "execution_count": null, 823 | "id": "4bac5303-30aa-4263-bda2-7eab421c3412", 824 | "metadata": {}, 825 | "outputs": [], 826 | "source": [] 827 | } 828 | ], 829 | "metadata": { 830 | "kernelspec": { 831 | "display_name": "Python 3 (ipykernel)", 832 | "language": "python", 833 | "name": "python3" 834 | }, 835 | "language_info": { 836 | "codemirror_mode": { 837 | "name": "ipython", 838 | "version": 3 839 | }, 840 | "file_extension": ".py", 841 | "mimetype": "text/x-python", 842 | "name": "python", 843 | "nbconvert_exporter": "python", 844 | "pygments_lexer": "ipython3", 845 | "version": "3.10.2" 846 | } 847 | }, 848 | "nbformat": 4, 849 | "nbformat_minor": 5 850 | } 851 | -------------------------------------------------------------------------------- /03-multiseries-forecasting/exercise-2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "6b76acb4-90e2-4b95-b211-65607b67dc78", 6 | "metadata": {}, 7 | "source": [ 8 | "# Exercise 2: Multiple dependent time series" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "8f5b8ff1-e3f7-47af-b603-15d89d0c6fe5", 14 | "metadata": {}, 15 | "source": [ 16 | "[Forecasting with Machine Learning](https://www.trainindata.com/p/forecasting-with-machine-learning)\n", 17 | "\n", 18 | "In this notebook we set out an exercise to do multiple dependent time series forecasting. The solutions we show are only one way of answering these questions." 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "id": "91a8c715-0369-4928-907d-9492db4f37a1", 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "import numpy as np\n", 29 | "import pandas as pd\n", 30 | "import matplotlib.pyplot as plt" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "f6e9fcef-4dde-4107-bce6-216516626f86", 36 | "metadata": {}, 37 | "source": [ 38 | "# Data preparation" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "id": "d91ac0bd-372a-4ef0-9c0e-a0e0cf12560d", 44 | "metadata": {}, 45 | "source": [ 46 | "The dataset we shall use contains the quarterly overnight trips (in thousands) from 1998 Q1 to 2016 Q4 across\n", 47 | "Australia. The number of trips is split by `State`, `Region`, and `Purpose`. \n", 48 | "\n", 49 | "**In this exercise we are going to forecast the total number of trips for each State (there are 8 states therefore we will have 8 time series). 
We shall treat this as a multivariate forecasting problem.**\n", 50 | "\n", 51 | "Source: A new tidy data structure to support\n", 52 | "exploration and modeling of temporal data, Journal of Computational and\n", 53 | "Graphical Statistics, 29:3, 466-478, doi:10.1080/10618600.2019.1695624.\n", 54 | "Shape of the dataset: (24320, 5)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "id": "76ef5949-cbf1-4109-887d-104f71574c2a", 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "from skforecast.datasets import fetch_dataset\n", 65 | "\n", 66 | "# Load the data\n", 67 | "data = fetch_dataset(name=\"australia_tourism\", raw=True)\n", 68 | "data.head()" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "id": "48a62223-83ab-4a7e-af1b-4fe91e1b31c7", 74 | "metadata": {}, 75 | "source": [ 76 | "Pre-process the data by performing the following:\n", 77 | "1) Convert the `date_time` column to datetime type\n", 78 | "2) Create a dataframe with one column per `State` which gives the total number of Trips for each date.\n", 79 | "3) Ensure the index is `date_time` and resampled to quarterly start `QS`\n" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "id": "6506a6d4-e680-4c12-9dd1-7e037c91a2ef", 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "id": "589f58e0-6106-430c-ab2b-557ea84bb2a5", 93 | "metadata": {}, 94 | "source": [ 95 | "Check for missing values." 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "id": "92bd6aaf-caac-4bb8-96a2-c0d59220f6e8", 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "id": "44fb99c2-65b9-4665-9c55-528af407baee", 109 | "metadata": {}, 110 | "source": [ 111 | "Assign the name of each state to a variable `states`. We will use this later." 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "id": "57f6888c-a06c-4cd8-a358-039860bc5efd", 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "id": "eb03773c-a9bf-4fa8-84a4-099af6b87827", 125 | "metadata": {}, 126 | "source": [ 127 | "# Exploratory data analysis" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "id": "7d1ba5b6-b49d-4bab-882e-915f075b9493", 133 | "metadata": {}, 134 | "source": [ 135 | "Print the number of data points in the time series, the start time, and the end time of the time series." 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "id": "36af3ab9-4267-471d-800b-dc9a3dcdb43e", 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "id": "3ffa8a95-d0d5-48e4-ba5f-b92a4f54c0ef", 149 | "metadata": {}, 150 | "source": [ 151 | "Plot the time series summed over all states." 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "id": "f1629484-508f-419f-8647-1ef6626eab26", 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "id": "16971015-96a9-480e-9f9e-f21637d68e87", 165 | "metadata": {}, 166 | "source": [ 167 | "Plot all of the time series."
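A sketch of one possible pre-processing pipeline plus the final plot, assuming the raw columns are named `date_time`, `State`, and `Trips` as in the dataset description:

```python
import pandas as pd
import matplotlib.pyplot as plt

data["date_time"] = pd.to_datetime(data["date_time"])

# One column per State holding the total trips per date.
data = pd.pivot_table(
    data=data, values="Trips", index="date_time", columns="State", aggfunc="sum"
)
data.columns.name = None
data = data.asfreq("QS").sort_index()  # quarterly-start frequency

# One panel per state.
data.plot(subplots=True, sharex=True, figsize=(8, 12))
plt.tight_layout()
```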
168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "id": "d7193016-417f-4bec-87e0-dcce787ee7fe", 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "id": "143c6eab-8a90-4b45-9f63-db450d91c296", 181 | "metadata": {}, 182 | "source": [ 183 | "It appears that there is yearly seasonality for these series and they appear to be anti-correlated (i.e., some areas experience peaks whilst others experience troughs)." 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "id": "af0befd7-d612-4bfe-8cb6-fc2bfd2bc9f1", 189 | "metadata": {}, 190 | "source": [ 191 | "Create a quarter of the year feature which could help with the yearly seasonality." 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "id": "6c2b84e0-07a4-46ef-ae53-6327625a7284", 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "id": "bed95ade-4109-4b31-847b-28f5c9807692", 205 | "metadata": {}, 206 | "source": [ 207 | "# Forecasting" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "id": "a40647fe-99b4-474e-8a03-b7bf4f9eed29", 213 | "metadata": {}, 214 | "source": [ 215 | "Import the class needed for recursive forecasting for multiple dependent time series from `skforecast`." 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "id": "613011ce-c45c-40b5-b46a-e6a8037d4f08", 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "id": "d692c732-e699-46cb-bab6-cd148e9af1f1", 229 | "metadata": {}, 230 | "source": [ 231 | "Import a transformer from sklearn to scale the data." 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "id": "cadffd44-e241-4677-b640-9c8b9db2784f", 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "id": "eaf0b4fb-9548-4da1-b356-e9ec377d68f9", 245 | "metadata": {}, 246 | "source": [ 247 | "Import a model of your choice." 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "id": "0495bdc3-5e22-483a-b0ab-2b3e0402fbee", 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "id": "66a304ca-77a4-48f6-99ef-f24d30417daa", 261 | "metadata": {}, 262 | "source": [ 263 | "Assign the names of the states to a `target_cols` variable and any exogenous features to an `exog_cols` variable." 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "id": "be5cfa65-d69b-427c-bccc-0021fa04ef02", 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "id": "9a24e307-22ef-44f1-8f11-b7746a70d3bb", 277 | "metadata": {}, 278 | "source": [ 279 | "Specify a forecast horizon and assign it to a variable `steps`. Try forecasting 8 quarters into the future." 
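A sketch for the import and configuration cells above, again assuming skforecast >= 0.14 and the `states` list created earlier; `Ridge` is just one reasonable model choice:

```python
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Direct forecaster for dependent (multivariate) series in skforecast >= 0.14.
from skforecast.direct import ForecasterDirectMultiVariate

target_cols = list(states)  # the 8 state columns
exog_cols = ["quarter"]

# Forecast 8 quarters (2 years) into the future.
steps = 8
```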
280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "id": "2b37700d-9bfa-4c60-926a-b48a9e32c7ce", 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "id": "4109bb1b-b6a1-48f1-99ca-caca04c01a63", 293 | "metadata": {}, 294 | "source": [ 295 | "Create a dataframe for the future values of any exogenous features.\n", 296 | "\n", 297 | "Hint: `pd.DateOffset` and using `freq=QS` in `pd.date_range` might be helpful " 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "id": "c96dd685-09d8-42d0-949b-652acf45e56a", 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "id": "52146b7c-52a9-4f0f-a353-0c5965302fd7", 311 | "metadata": {}, 312 | "source": [ 313 | "Forecast over each state using a for loop. Define a `ForecasterDirectMultiVariate` forecaster and experiment with the number of lags to use as a feature." 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "id": "59447171-bbdc-4a0a-82d1-fa809a9db068", 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "id": "95652d3b-4073-4080-b2ac-567739b08d3e", 327 | "metadata": {}, 328 | "source": [ 329 | "Plot the forecasts." 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "id": "9a448c27-2516-47c5-8598-f4430d9540ce", 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [] 339 | } 340 | ], 341 | "metadata": { 342 | "kernelspec": { 343 | "display_name": "Python 3 (ipykernel)", 344 | "language": "python", 345 | "name": "python3" 346 | }, 347 | "language_info": { 348 | "codemirror_mode": { 349 | "name": "ipython", 350 | "version": 3 351 | }, 352 | "file_extension": ".py", 353 | "mimetype": "text/x-python", 354 | "name": "python", 355 | "nbconvert_exporter": "python", 356 | "pygments_lexer": "ipython3", 357 | "version": "3.10.2" 358 | } 359 | }, 360 | "nbformat": 4, 361 | "nbformat_minor": 5 362 | } 363 | -------------------------------------------------------------------------------- /03-multiseries-forecasting/images/forecaster_multi_series_sample_weight.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/03-multiseries-forecasting/images/forecaster_multi_series_sample_weight.png -------------------------------------------------------------------------------- /03-multiseries-forecasting/images/forecaster_multivariate_train_matrix_diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/03-multiseries-forecasting/images/forecaster_multivariate_train_matrix_diagram.png -------------------------------------------------------------------------------- /04-backtesting/exercise-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "8f5b8ff1-e3f7-47af-b603-15d89d0c6fe5", 6 | "metadata": {}, 7 | "source": [ 8 | "In this notebook we set out an exercise to do backtesting to compare different models for a single time series. The solutions we show are only one way of answering these questions." 
9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "f6e9fcef-4dde-4107-bce6-216516626f86", 14 | "metadata": {}, 15 | "source": [ 16 | "# Data preparation" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "d91ac0bd-372a-4ef0-9c0e-a0e0cf12560d", 22 | "metadata": {}, 23 | "source": [ 24 | "The dataset we shall use is the daily web visits to cienciadedatos.net." 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "id": "76ef5949-cbf1-4109-887d-104f71574c2a", 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "from skforecast.datasets import fetch_dataset\n", 35 | "\n", 36 | "# Load the data\n", 37 | "data = fetch_dataset(name=\"website_visits\", raw=True)\n", 38 | "data.head()" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "id": "48a62223-83ab-4a7e-af1b-4fe91e1b31c7", 44 | "metadata": {}, 45 | "source": [ 46 | "Pre-process the data by performing the following:\n", 47 | "1) Convert the `date` column to datetime type\n", 48 | "2) Set the date as an index\n", 49 | "3) Set the frequency to daily\n", 50 | "4) Sort the time series by date\n", 51 | "\n", 52 | "Hint: Try the format `%d/%m/%y`" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "id": "47d63a91-4966-4f0c-a081-2c5821298a28", 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "id": "589f58e0-6106-430c-ab2b-557ea84bb2a5", 66 | "metadata": {}, 67 | "source": [ 68 | "Check for missing values." 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "id": "535fd22d-3e40-4796-aca8-05a029d36129", 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "id": "eb03773c-a9bf-4fa8-84a4-099af6b87827", 82 | "metadata": {}, 83 | "source": [ 84 | "# Exploratory data analysis" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "id": "7d1ba5b6-b49d-4bab-882e-915f075b9493", 90 | "metadata": {}, 91 | "source": [ 92 | "Print the number of data points in the time series, the start time, and the end time of the time series." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "id": "271cdc42-587b-4ee7-a433-71d8103ced5f", 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "id": "3ffa8a95-d0d5-48e4-ba5f-b92a4f54c0ef", 106 | "metadata": {}, 107 | "source": [ 108 | "Plot the time series." 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "id": "f1629484-508f-419f-8647-1ef6626eab26", 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "id": "af0befd7-d612-4bfe-8cb6-fc2bfd2bc9f1", 122 | "metadata": {}, 123 | "source": [ 124 | "Check if there is any weekly seasonality.\n", 125 | "\n", 126 | "Note: This could be done in a variety of ways with different degrees of complexity." 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "id": "6c2b84e0-07a4-46ef-ae53-6327625a7284", 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "id": "3fcb0da2-348b-4e8e-8aae-72eeb7db1b4c", 140 | "metadata": {}, 141 | "source": [ 142 | "Create one or more features that can help capture weekly seasonality from the datetime index."
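One lightweight way to check for weekly seasonality and derive a feature from it; a sketch that only assumes `data` has a daily `DatetimeIndex`:

```python
import matplotlib.pyplot as plt

# Mean visits per day of week: a pronounced pattern suggests weekly seasonality.
data.groupby(data.index.dayofweek).mean().plot(kind="bar")
plt.xlabel("day of week (0 = Monday)")
plt.show()

# Day of week as a feature for the models.
data["day_of_week"] = data.index.dayofweek
```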
143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "id": "1ab5c17b-cf5a-40e9-9c6a-61052d17f693", 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "id": "bed95ade-4109-4b31-847b-28f5c9807692", 156 | "metadata": {}, 157 | "source": [ 158 | "# Model definition" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "id": "a40647fe-99b4-474e-8a03-b7bf4f9eed29", 164 | "metadata": {}, 165 | "source": [ 166 | "Import the class needed for recursive forecasting with a single time series from `skforecast`." 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "id": "613011ce-c45c-40b5-b46a-e6a8037d4f08", 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "id": "d692c732-e699-46cb-bab6-cd148e9af1f1", 180 | "metadata": {}, 181 | "source": [ 182 | "Import a transformer from sklearn to scale the data." 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "id": "cadffd44-e241-4677-b640-9c8b9db2784f", 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "id": "eaf0b4fb-9548-4da1-b356-e9ec377d68f9", 196 | "metadata": {}, 197 | "source": [ 198 | "Import a linear model and LightGBM" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "id": "0495bdc3-5e22-483a-b0ab-2b3e0402fbee", 205 | "metadata": {}, 206 | "outputs": [], 207 | "source": [] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "id": "66a304ca-77a4-48f6-99ef-f24d30417daa", 212 | "metadata": {}, 213 | "source": [ 214 | "Define one forecaster for the linear model and one forecaster for the LightGBM model using `skforecast`. Use some recent lags and a lag of 7 to capture weekly seasonality." 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "id": "59447171-bbdc-4a0a-82d1-fa809a9db068", 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "id": "8b8d3162-9769-4b24-896a-f10a4914dabf", 228 | "metadata": {}, 229 | "source": [ 230 | "# Backtesting" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "id": "115435b1-725c-4dc1-9df1-782b19d1fc55", 236 | "metadata": {}, 237 | "source": [ 238 | "Import the relevant backtesting objects from `skforecast`." 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "id": "bd952f40-4636-4fa7-ba9a-25eeb0d797c8", 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "id": "1dd03f71-1e45-427d-85fe-9830dfbf686d", 252 | "metadata": {}, 253 | "source": [ 254 | "Use `backtesting_forecaster` to do backtesting with:\n", 255 | "- Intermittent refitting where the model is refit every 7 steps\n", 256 | "- An expanding training window\n", 257 | "- Using both the `mean_squared_error` and the `mean_absolute_percentage_error`\n", 258 | "- Use approximately the first 30 days for training\n", 259 | "- Use exogenous features\n", 260 | "\n", 261 | "Do this twice, once for the linear forecaster and once for the lightgbm forecaster." 
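A sketch of the backtesting call under skforecast >= 0.14, where folds are configured through `TimeSeriesFold`. One reading of "refit every 7 steps" is one-step-ahead folds with an intermittent refit every 7 folds; the forecaster name and the `visits` column are placeholders for whatever you defined above:

```python
from skforecast.model_selection import TimeSeriesFold, backtesting_forecaster

cv = TimeSeriesFold(
    steps=1,                 # one-step-ahead folds
    initial_train_size=30,   # roughly the first 30 days
    refit=7,                 # intermittent refitting: refit every 7 folds
    fixed_train_size=False,  # expanding training window
)

metrics_lgbm, preds_lgbm = backtesting_forecaster(
    forecaster=forecaster_lgbm,  # repeat with the linear forecaster
    y=data["visits"],            # hypothetical column name - check data.head()
    exog=data[["day_of_week"]],
    cv=cv,
    metric=["mean_squared_error", "mean_absolute_percentage_error"],
)
metrics_lgbm
```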
262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": null, 267 | "id": "2b6a89aa-38bd-4a59-b603-bfb3f98b4c19", 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "id": "4b66897e-b76a-42b3-8e37-59b72fd88a7a", 276 | "metadata": {}, 277 | "outputs": [], 278 | "source": [] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "id": "0a1678ed-b2a0-422d-9816-0a7ef2ec5a34", 283 | "metadata": {}, 284 | "source": [ 285 | "Compare the metrics for the linear model and lightgbm model. Which model did better?" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "id": "5344aa17-275c-4fdb-9338-aa2673d90b20", 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "id": "eb967325-4484-477c-a1f6-b2151079e55a", 299 | "metadata": {}, 300 | "source": [ 301 | "Plot the predictions made during backtesting alongside the actuals to get a better understanding of the errors." 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": null, 307 | "id": "e084d1db-ff9a-43cb-be4e-e0794486d80b", 308 | "metadata": {}, 309 | "outputs": [], 310 | "source": [] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "id": "4d0cdd6a-fde5-4f0d-ba7f-852457701063", 315 | "metadata": {}, 316 | "source": [ 317 | "Why do you think one model is doing better than the other? What might help one of the models to improve?" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "id": "b14949a5-0628-4041-aca0-ed033d967bc0", 323 | "metadata": {}, 324 | "source": [] 325 | } 326 | ], 327 | "metadata": { 328 | "kernelspec": { 329 | "display_name": "Python 3 (ipykernel)", 330 | "language": "python", 331 | "name": "python3" 332 | }, 333 | "language_info": { 334 | "codemirror_mode": { 335 | "name": "ipython", 336 | "version": 3 337 | }, 338 | "file_extension": ".py", 339 | "mimetype": "text/x-python", 340 | "name": "python", 341 | "nbconvert_exporter": "python", 342 | "pygments_lexer": "ipython3", 343 | "version": "3.10.2" 344 | } 345 | }, 346 | "nbformat": 4, 347 | "nbformat_minor": 5 348 | } 349 | -------------------------------------------------------------------------------- /04-backtesting/exercise-2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "0c0af68d-793e-40ce-b41a-85131ab6aeb3", 6 | "metadata": {}, 7 | "source": [ 8 | "# Exercise 2: Backtesting with multiple time series" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "8f5b8ff1-e3f7-47af-b603-15d89d0c6fe5", 14 | "metadata": {}, 15 | "source": [ 16 | "In this notebook we set out an exercise to do backtesting to compare different models for multiple time series. The solutions we show are only one way of answering these questions." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "91a8c715-0369-4928-907d-9492db4f37a1", 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import numpy as np\n", 27 | "import pandas as pd\n", 28 | "import matplotlib.pyplot as plt" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "f6e9fcef-4dde-4107-bce6-216516626f86", 34 | "metadata": {}, 35 | "source": [ 36 | "# Data preparation" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "d91ac0bd-372a-4ef0-9c0e-a0e0cf12560d", 42 | "metadata": {}, 43 | "source": [ 44 | "The dataset we shall use is the Quarterly overnight trips (in thousands) from 1998 Q1 to 2016 Q4 across\n", 45 | "Australia. The number of trips is split by `State`, `Region`, and `Purpose`. \n", 46 | "\n", 47 | "**In this exercise we are going to forecast the total number of trips for each Region (there are 76 regions therefore we will have 76 time series).**\n", 48 | "\n", 49 | "Source: A new tidy data structure to support\n", 50 | "exploration and modeling of temporal data, Journal of Computational and\n", 51 | "Graphical Statistics, 29:3, 466-478, doi:10.1080/10618600.2019.1695624.\n", 52 | "Shape of the dataset: (24320, 5)" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "id": "76ef5949-cbf1-4109-887d-104f71574c2a", 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "from skforecast.datasets import fetch_dataset\n", 63 | "\n", 64 | "# Load the data\n", 65 | "data = fetch_dataset(name=\"australia_tourism\", raw=True)\n", 66 | "data.head()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "48a62223-83ab-4a7e-af1b-4fe91e1b31c7", 72 | "metadata": {}, 73 | "source": [ 74 | "Pre-process the data by performing the following:\n", 75 | "1) Convert the `date_time` column to datetime type\n", 76 | "2) Create a dataframe with one column per `Region` which gives the total number of Trips for each date.\n", 77 | "3) Ensure the index is `date_time` and resampled to quarterly start `QS`\n" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "id": "6506a6d4-e680-4c12-9dd1-7e037c91a2ef", 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "id": "446f0153-9869-4f25-a84e-d040b9078635", 91 | "metadata": {}, 92 | "source": [ 93 | "Later we will use LightGBM. It does not support special JSON characters (e.g., `'`) in the column name. Let's remove these characters from the column names." 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "id": "e78e1110-9a99-4ff2-af9f-d78fd28bcc9e", 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "import re\n", 104 | "\n", 105 | "data = data.rename(columns=lambda x: re.sub(\"[^A-Za-z0-9_]+\", \"\", x))" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "id": "589f58e0-6106-430c-ab2b-557ea84bb2a5", 111 | "metadata": {}, 112 | "source": [ 113 | "Check for missing values." 
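A quick check, assuming the pivoted dataframe is called `data`:

```python
# Missing values per region, plus a single yes/no flag.
print(data.isna().sum().sort_values(ascending=False).head())
print("Any missing values:", data.isna().values.any())
```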
114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "id": "92bd6aaf-caac-4bb8-96a2-c0d59220f6e8", 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "id": "eb03773c-a9bf-4fa8-84a4-099af6b87827", 127 | "metadata": {}, 128 | "source": [ 129 | "# Exploratory data analysis" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "id": "7d1ba5b6-b49d-4bab-882e-915f075b9493", 135 | "metadata": {}, 136 | "source": [ 137 | "Print the number of data points in the time series, the start time, and the end time of the time series." 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "id": "36af3ab9-4267-471d-800b-dc9a3dcdb43e", 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "id": "3ffa8a95-d0d5-48e4-ba5f-b92a4f54c0ef", 151 | "metadata": {}, 152 | "source": [ 153 | "Plot the time series summed over all regions." 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "id": "f1629484-508f-419f-8647-1ef6626eab26", 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "id": "bdff7d75-4ca8-4b40-aebb-1ef24318b6e0", 167 | "metadata": {}, 168 | "source": [ 169 | "Plot a subsample of the time series from different regions." 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "id": "652caf88-93a8-4b82-abbb-0d07ccbdbdb7", 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "id": "f4befcc4-bf9a-4adf-bb81-8c4c136007a1", 183 | "metadata": {}, 184 | "source": [ 185 | "Create a quarter-of-the-year feature which could help capture the yearly seasonality." 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "id": "1ab5c17b-cf5a-40e9-9c6a-61052d17f693", 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "id": "bed95ade-4109-4b31-847b-28f5c9807692", 199 | "metadata": {}, 200 | "source": [ 201 | "# Model definition" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "id": "a40647fe-99b4-474e-8a03-b7bf4f9eed29", 207 | "metadata": {}, 208 | "source": [ 209 | "Import the classes needed for recursive forecasting for multiple time series and window features from `skforecast`." 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "id": "613011ce-c45c-40b5-b46a-e6a8037d4f08", 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "id": "d692c732-e699-46cb-bab6-cd148e9af1f1", 223 | "metadata": {}, 224 | "source": [ 225 | "Import a transformer from sklearn to scale the data."
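A sketch of the two import cells above (skforecast >= 0.14):

```python
# Recursive multi-series forecaster and rolling window features.
from skforecast.recursive import ForecasterRecursiveMultiSeries
from skforecast.preprocessing import RollingFeatures

# Scaler applied per series via the forecaster's `transformer_series` argument.
from sklearn.preprocessing import StandardScaler
```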
226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "id": "cadffd44-e241-4677-b640-9c8b9db2784f", 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "id": "eaf0b4fb-9548-4da1-b356-e9ec377d68f9", 239 | "metadata": {}, 240 | "source": [ 241 | "Import a linear model and LightGBM" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "id": "0495bdc3-5e22-483a-b0ab-2b3e0402fbee", 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "id": "dca4af0c-af99-4ced-b13d-5528ed7f9866", 255 | "metadata": {}, 256 | "source": [ 257 | "Define a function that extracts features from the target variable:\n", 258 | "- A rolling mean with window 4\n", 259 | "- A rolling standard deviation with window 4\n", 260 | "- A rolling mean with window 12\n", 261 | "- A rolling standard deviation with window 12\n", 262 | "\n", 263 | "Hint: https://skforecast.org/0.11.0/user_guides/custom-predictors.html" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "id": "7eb270e3-12ea-4cb8-82d2-09bf27a81afc", 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "id": "66a304ca-77a4-48f6-99ef-f24d30417daa", 277 | "metadata": {}, 278 | "source": [ 279 | "Define one forecaster for the linear model and one forecaster for the LightGBM model using `skforecast`. Pass the function in the previous cell as an argument. Use all lags of up to 8." 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "id": "59447171-bbdc-4a0a-82d1-fa809a9db068", 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "id": "8b8d3162-9769-4b24-896a-f10a4914dabf", 293 | "metadata": {}, 294 | "source": [ 295 | "# Backtesting" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "id": "115435b1-725c-4dc1-9df1-782b19d1fc55", 301 | "metadata": {}, 302 | "source": [ 303 | "Import the relevant backtesting objects from `skforecast`." 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "id": "bd952f40-4636-4fa7-ba9a-25eeb0d797c8", 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "id": "4841d21c-fc14-4aba-867b-53ae9def3c36", 317 | "metadata": {}, 318 | "source": [ 319 | "Assign the name of the target variables and exogenous variables to variables called `target_cols` and `exog_cols`." 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "id": "5dd68852-0c87-4b34-adc6-d11c4050a5ae", 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "id": "1dd03f71-1e45-427d-85fe-9830dfbf686d", 333 | "metadata": {}, 334 | "source": [ 335 | "Use the backtesting function to do backtesting with:\n", 336 | "- Refitting at every step\n", 337 | "- An expanding training window\n", 338 | "- Use the `mean_absolute_error` as the error metric\n", 339 | "- Use the first 13 datapoints (just over 3 years) as initial training size\n", 340 | "- Use exogenous features\n", 341 | "- Set the forecast horizon to be 4 steps (i.e., 1 year)\n", 342 | "\n", 343 | "Do this twice, once for the linear forecaster and once for the lightgbm forecaster." 
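The custom-predictors hint above targets the skforecast 0.11 API; with skforecast >= 0.14 (as pinned in `requirements.txt`) the same rolling features can be declared with `RollingFeatures`. A sketch, reusing the imports from the previous snippet; the variable names are suggestions:

```python
from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge

from skforecast.model_selection import TimeSeriesFold, backtesting_forecaster_multiseries

target_cols = [col for col in data.columns if col != "quarter"]
exog_cols = ["quarter"]

# Rolling mean and standard deviation over windows of 4 and 12 quarters.
window_features = RollingFeatures(
    stats=["mean", "std", "mean", "std"], window_sizes=[4, 4, 12, 12]
)

forecaster_linear = ForecasterRecursiveMultiSeries(
    regressor=Ridge(),
    lags=8,
    window_features=window_features,
    transformer_series=StandardScaler(),
)
forecaster_lgbm = ForecasterRecursiveMultiSeries(
    regressor=LGBMRegressor(random_state=42, verbose=-1),
    lags=8,
    window_features=window_features,
    transformer_series=StandardScaler(),
)

# Expanding window, refit at every fold, 13 quarters of initial training data,
# 4-step (1 year) forecast horizon.
cv = TimeSeriesFold(steps=4, initial_train_size=13, refit=True, fixed_train_size=False)

metrics_linear, preds_linear = backtesting_forecaster_multiseries(
    forecaster=forecaster_linear,  # repeat with forecaster_lgbm
    series=data[target_cols],
    exog=data[exog_cols],
    cv=cv,
    levels=target_cols,
    metric="mean_absolute_error",
)
```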
344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "id": "2b6a89aa-38bd-4a59-b603-bfb3f98b4c19", 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "id": "4b66897e-b76a-42b3-8e37-59b72fd88a7a", 358 | "metadata": {}, 359 | "outputs": [], 360 | "source": [] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "id": "0a1678ed-b2a0-422d-9816-0a7ef2ec5a34", 365 | "metadata": {}, 366 | "source": [ 367 | "Merge the two metrics dataframes into a single dataframe to make it easier to compare the errors between the linear model and the LightGBM model." 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "id": "46c73196-40d2-44d4-a89d-8cb677a6875a", 374 | "metadata": {}, 375 | "outputs": [], 376 | "source": [] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "id": "32ef4541-c15e-4a51-92f8-5c693c5f5074", 381 | "metadata": {}, 382 | "source": [ 383 | "How often did one model outperform the other model? Remember, the lower the error the better." 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": null, 389 | "id": "02780ced-856e-4eaf-bea1-fe3f6e6cb95f", 390 | "metadata": {}, 391 | "outputs": [], 392 | "source": [] 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "id": "4cde377b-d4f5-4a50-b358-566a51ebdfdc", 397 | "metadata": {}, 398 | "source": [ 399 | "It looks like Ridge was better for most time series here. Once again, for the LightGBM model to outperform, we might need additional feature engineering, hyperparameter tuning, handling of trends, and more data. There are no features that help group similar time series together, and these types of features are normally well utilised by gradient boosted trees. So in this scenario, a simple linear model works better." 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "id": "40f80613-124c-4591-b198-ca9819eae2c7", 405 | "metadata": {}, 406 | "source": [ 407 | "Compute the mean absolute error over all the predictions for both the linear model and the LightGBM model. Compare them." 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": null, 413 | "id": "5344aa17-275c-4fdb-9338-aa2673d90b20", 414 | "metadata": {}, 415 | "outputs": [], 416 | "source": [] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "id": "eb967325-4484-477c-a1f6-b2151079e55a", 421 | "metadata": {}, 422 | "source": [ 423 | "Plot the predictions made during backtesting alongside the actuals for a random subset of time series to get a better understanding of the errors." 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": null, 429 | "id": "e084d1db-ff9a-43cb-be4e-e0794486d80b", 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "id": "aeac094e-47d4-4d19-9c99-3ea50ed397ad", 437 | "metadata": {}, 438 | "source": [ 439 | "Plot the predictions made during backtesting alongside the actuals for the **highest error** subset of time series to get a better understanding of the errors."
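A plotting sketch, assuming the metric dataframe has the standard `levels` / `mean_absolute_error` columns returned by `backtesting_forecaster_multiseries`, and that `preds_linear` holds the backtest predictions (names carried over from the previous snippet):

```python
import matplotlib.pyplot as plt

# The four series with the largest backtest MAE for the linear model.
worst = (
    metrics_linear.set_index("levels")["mean_absolute_error"]
    .loc[lambda s: s.index.isin(target_cols)]  # drop aggregated rows, if present
    .sort_values(ascending=False)
    .head(4)
    .index
)

fig, axes = plt.subplots(len(worst), 1, figsize=(8, 10), sharex=True)
for ax, col in zip(axes, worst):
    data[col].plot(ax=ax, label="actuals")
    preds_linear[col].plot(ax=ax, label="backtest predictions")
    ax.set_title(col)
    ax.legend()
fig.tight_layout()
```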
440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": null, 445 | "id": "7dd01596-68e2-4484-a42b-2cfe5eb9103e", 446 | "metadata": {}, 447 | "outputs": [], 448 | "source": [] 449 | } 450 | ], 451 | "metadata": { 452 | "kernelspec": { 453 | "display_name": "Python 3 (ipykernel)", 454 | "language": "python", 455 | "name": "python3" 456 | }, 457 | "language_info": { 458 | "codemirror_mode": { 459 | "name": "ipython", 460 | "version": 3 461 | }, 462 | "file_extension": ".py", 463 | "mimetype": "text/x-python", 464 | "name": "python", 465 | "nbconvert_exporter": "python", 466 | "pygments_lexer": "ipython3", 467 | "version": "3.10.2" 468 | } 469 | }, 470 | "nbformat": 4, 471 | "nbformat_minor": 5 472 | } 473 | -------------------------------------------------------------------------------- /04-backtesting/images/backtesting_intermittent_refit.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/04-backtesting/images/backtesting_intermittent_refit.gif -------------------------------------------------------------------------------- /04-backtesting/images/backtesting_no_refit.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/04-backtesting/images/backtesting_no_refit.gif -------------------------------------------------------------------------------- /04-backtesting/images/backtesting_refit.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/04-backtesting/images/backtesting_refit.gif -------------------------------------------------------------------------------- /04-backtesting/images/backtesting_refit_fixed_train_size.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/04-backtesting/images/backtesting_refit_fixed_train_size.gif -------------------------------------------------------------------------------- /04-backtesting/images/backtesting_refit_gap.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/04-backtesting/images/backtesting_refit_gap.gif -------------------------------------------------------------------------------- /05-error-metrics/exercise-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Exercise 1: Error metrics" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In this notebook we set out exercises to implement error metrics and use them in skforecast. The solutions we show are only one way of answering these questions." 
15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import numpy as np\n", 24 | "import pandas as pd\n", 25 | "import matplotlib.pyplot as plt" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "# Data loading and preparation" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "Data has been obtained from [Store Item Demand Forecasting Challenge](https://www.kaggle.com/competitions/demand-forecasting-kernels-only/data), specifically `train.csv`. This dataset contains 913,000 sales transactions from 2013–01–01 to 2017–12–31 for 50 products (SKU) in 10 stores. The goal is to predict the next 7 days sales for 50 different items in one store using the available 5 years history." 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "from skforecast.datasets import fetch_dataset\n", 49 | "\n", 50 | "# Load the data\n", 51 | "data = fetch_dataset(name=\"store_sales\", raw=True)\n", 52 | "data.head()" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "`ForecasterRecursiveMultiSeries` requires that each time series is a column in the dataframe and that the index is time-like (datetime or timestamp). \n", 60 | "\n", 61 | "So now we process the data to get dataframes in the required format." 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "# Data preprocessing\n", 71 | "\n", 72 | "selected_store = 2\n", 73 | "selected_items = data.item.unique() # All items\n", 74 | "# selected_items = [1, 2, 3, 4, 5] # Selection of items to reduce computation time\n", 75 | "\n", 76 | "# Filter data to specific stores and products\n", 77 | "mask = (data[\"store\"] == selected_store) & (data[\"item\"].isin(selected_items))\n", 78 | "data = data[mask].copy()\n", 79 | "\n", 80 | "# Convert `date` column to datetime\n", 81 | "data[\"date\"] = pd.to_datetime(data[\"date\"], format=\"%Y-%m-%d\")\n", 82 | "\n", 83 | "# Convert to one column per time series\n", 84 | "data = pd.pivot_table(data=data, values=\"sales\", index=\"date\", columns=\"item\")\n", 85 | "\n", 86 | "# Reset column names\n", 87 | "data.columns.name = None\n", 88 | "data.columns = [f\"item_{col}\" for col in data.columns]\n", 89 | "\n", 90 | "# Explicitly set the frequency of the data to daily.\n", 91 | "# This would introduce missing values for missing days.\n", 92 | "data = data.asfreq(\"1D\")\n", 93 | "\n", 94 | "# Sort by time\n", 95 | "data = data.sort_index()\n", 96 | "\n", 97 | "data.head(4)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "# Check if any missing values introduced\n", 107 | "data.isnull().sum().any()" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "# Plot a subset of the time series\n", 117 | "fig, ax = plt.subplots(figsize=(8, 5))\n", 118 | "data.iloc[:, :4].plot(\n", 119 | " legend=True,\n", 120 | " subplots=True,\n", 121 | " sharex=True,\n", 122 | " title=\"Sales of store 2\",\n", 123 | " ax=ax,\n", 124 | ")\n", 125 | "fig.tight_layout()" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 
132 | "Let's add the day of the week to use as an exogenous feature." 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "data[\"day_of_week\"] = data.index.weekday" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "# Analysis" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "We want to build an intuition about the characteristics (trend, outliers, seasonality, intermittency, etc.) and the range of values that the time series have.\n", 156 | "This helps when deciding which error metrics to use.\n", 157 | "\n", 158 | "Compute a set of summary statistics and/or plots to do this." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "Plot the 5 largest and 5 smallest time series." 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "Check for any zero values." 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "# Implementing custom error metrics" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "Error metrics that we can consider using are:\n", 208 | "- To measure the bias: \n", 209 | " - Average forecast bias: we compute the forecast bias, which is a scale independent measure, for each time series and then average over all time series.\n", 210 | "- To measure the error:\n", 211 | " - Normalised RMSE or normalised deviation: these are pooled error metrics, which cannot currently be calculated natively in skforecast.\n", 212 | " - Average RMSSE or MASE: we can compute the MASE or RMSSE, a scale independent metric, for each time series individually and then average the results to give our final metric. This cannot currently be calculated in skforecast as it requires the custom error function to have\n", 213 | " access to the training set. \n", 214 | " - Average WAPE: we can compute the WAPE, a scale independent metric, for each time series individually and then average the results to give our final metric." 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "Create a function that takes `y_true` and `y_pred` as an argument and ouputs the \n", 222 | "weighted mean absolute percentage error (WAPE) metric defined as:\n", 223 | "\n", 224 | "$ WAPE = \\frac{\\sum_t{|e_t|}}{\\sum_t{|y_t|}} $\n", 225 | "\n", 226 | "Note: The data does have trend, this would suggest we should not use WAPE. However,\n", 227 | "as we are only forecasting 7 days into the future the magnitude of the trend in\n", 228 | "the forecast horizon is negligible. 
" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": null, 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "Create a function that takes `y_true` and `y_pred` as an argument and ouputs the \n", 243 | "forecast bias metric defined as the mean error divided by the mean of the time series\n", 244 | "in the forecast horizon:\n", 245 | "\n", 246 | "$ FB = \\frac{1/H\\sum_t{e_t}}{1/H\\sum_t{y_t}} = \\frac{\\sum_t{e_t}}{\\sum_t{y_t}}$" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "# Using a custom error metric" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "Define a model to do recursive multistep forecasting for multiple independent time series." 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "Perform backtesting with refitting at every backtest step and compute the forecast bias\n", 282 | "and WAPE for each time series." 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "from skforecast.model_selection import backtesting_forecaster_multiseries" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "Average the forecast bias and WAPE over all items to get a single value for each\n", 306 | "metric. Does your model over or under forecast?" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": null, 312 | "metadata": {}, 313 | "outputs": [], 314 | "source": [] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "Plot a sample of the largest and smallest WAPE time series and their backtest\n", 321 | "predictions." 
322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": null, 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [] 330 | } 331 | ], 332 | "metadata": { 333 | "kernelspec": { 334 | "display_name": "fcst-ml", 335 | "language": "python", 336 | "name": "python3" 337 | }, 338 | "language_info": { 339 | "codemirror_mode": { 340 | "name": "ipython", 341 | "version": 3 342 | }, 343 | "file_extension": ".py", 344 | "mimetype": "text/x-python", 345 | "name": "python", 346 | "nbconvert_exporter": "python", 347 | "pygments_lexer": "ipython3", 348 | "version": "3.10.2" 349 | } 350 | }, 351 | "nbformat": 4, 352 | "nbformat_minor": 2 353 | } 354 | -------------------------------------------------------------------------------- /FWML-Logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/FWML-Logo.png -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2017-2023, Soledad Galli and Kishan Manani, Forecasting with Machine Learning course: 4 | https://www.trainindata.com/p/forecasting-with-machine-learning 5 | 6 | 7 | Redistribution and use in source and binary forms, with or without 8 | modification, are permitted provided that the following conditions are met: 9 | 10 | 1. Redistributions of source code must retain the above copyright notice, this 11 | list of conditions and the following disclaimer. 12 | 13 | 2. Redistributions in binary form must reproduce the above copyright notice, 14 | this list of conditions and the following disclaimer in the documentation 15 | and/or other materials provided with the distribution. 16 | 17 | 3. Neither the name of the copyright holder nor the names of its 18 | contributors may be used to endorse or promote products derived from 19 | this software without specific prior written permission. 20 | 21 | 22 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 23 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 24 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 25 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 26 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 27 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 28 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 29 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 30 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 31 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
32 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![PythonVersion](https://img.shields.io/badge/python-3.9%20|%203.10%20|%203.11%20|%203.12-success) 2 | [![License https://github.com/trainindata/forecasting-with-machine-learning/blob/master/LICENSE](https://img.shields.io/badge/license-BSD-success.svg)](https://github.com/trainindata/forecasting-with-machine-learning/blob/master/LICENSE) 3 | [![Sponsorship https://www.trainindata.com/](https://img.shields.io/badge/Powered%20By-TrainInData-orange.svg)](https://www.trainindata.com/) 4 | 5 | ## Forecasting with Machine Learning - Code Repository 6 | 7 | Code repository for the online course [Forecasting with Machine Learning](https://www.trainindata.com/p/forecasting-with-machine-learning) 8 | 9 | **Course Launch: March, 2024** 10 | 11 | Actively maintained. 12 | 13 | [![Forecasting with Machine Learning](./FWML-Logo.png)](https://www.trainindata.com/p/forecasting-with-machine-learning) 14 | 15 | ## Table of Contents 16 | 17 | 1. **Time series as regression** 18 | 1. Time series forecasting 19 | 2. Forecasting as regression 20 | 3. Tabularizing time series 21 | 4. Endogenous and exogenous variables 22 | 23 | 2. **Forecasting strategies** 24 | 1. Single step forecasting 25 | 2. Multiple step forecasting 26 | 3. Direct forecasting 27 | 4. Recursive forecasting 28 | 29 | 3. **Multi series forecasting** 30 | 1. Global vs local forecasting 31 | 2. Multiseries forecasting models 32 | 3. Independent time series forecasting 33 | 4. Dependent time series forecasting 34 | 35 | 4. **Backtesting** 36 | 1. Backtesting with or without refit 37 | 2. Backtesting with rolling or expanding windows 38 | 3. Backtesting with intermittent refit 39 | 40 | 41 | ## Links 42 | 43 | - [Online Course](https://www.trainindata.com/p/forecasting-with-machine-learning) 44 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | feature-engine>=1.8.2 2 | matplotlib>=3.9.2 3 | numpy>=2.1.2 4 | pandas>=2.2.3 5 | scikit-learn>=1.5.2 6 | seaborn>=0.13.2 7 | skforecast>=0.14.0 --------------------------------------------------------------------------------