├── .gitignore ├── 01-forecasting-as-regression ├── 01-feature-engineering.ipynb ├── 02-feature-engineering-with-feature-engine.ipynb ├── 03-benchmark-models.ipynb └── 04-single-step-forecasting-with-machine-learning.ipynb ├── 02-multistep-forecasting ├── 01-recursive-forecasting.ipynb ├── 02-recursive-forecasting-with-future-known-features.ipynb ├── 03-recursive-forecasting-with-window-features.ipynb ├── 04-direct-forecasting-skforecast.ipynb ├── 05-direct-forecasting-sklearn.ipynb ├── 06-direct-forecasting-with-future-known-features.ipynb └── assignment │ ├── 01-assigment.ipynb │ ├── 01-solution.ipynb │ ├── 02-assignment.ipynb │ └── 02-solution.ipynb ├── 03-multiseries-forecasting ├── 01-independent-multi-time-series-forecasting.ipynb ├── 02-dependent-multi-time-series-forecasting.ipynb ├── exercise-1-solution.ipynb ├── exercise-1.ipynb ├── exercise-2-solution.ipynb ├── exercise-2.ipynb └── images │ ├── forecaster_multi_series_sample_weight.png │ └── forecaster_multivariate_train_matrix_diagram.png ├── 04-backtesting ├── 01-backtesting-without-refitting.ipynb ├── 02-backtesting-with-refitting-expanding-training-window.ipynb ├── 03-backtesting-with-refitting-rolling-training-window.ipynb ├── 04-backtesting-with-intermittent-refitting.ipynb ├── 05-backtesting-with-gap.ipynb ├── exercise-1-solution.ipynb ├── exercise-1.ipynb ├── exercise-2-solution.ipynb ├── exercise-2.ipynb └── images │ ├── backtesting_intermittent_refit.gif │ ├── backtesting_no_refit.gif │ ├── backtesting_refit.gif │ ├── backtesting_refit_fixed_train_size.gif │ └── backtesting_refit_gap.gif ├── 05-error-metrics ├── exercise-1-solution.ipynb └── exercise-1.ipynb ├── FWML-Logo.png ├── LICENSE ├── README.md └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | *.db 6 | 7 | # Jupyter Notebook 8 | .ipynb_checkpoints 9 | 10 | # pyenv 11 | .python-version 12 | 13 | # Environments 14 | .env 15 | .venv 16 | env/ 17 | venv/ 18 | ENV/ 19 | env.bak/ 20 | venv.bak/ 21 | 22 | # datasets 23 | *.csv 24 | *.zip 25 | *.data 26 | *.labels 27 | 28 | # files not to put in repo 29 | Make-toy-finance-dataset.ipynb 30 | tbf -------------------------------------------------------------------------------- /01-forecasting-as-regression/02-feature-engineering-with-feature-engine.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "fcea71b1", 6 | "metadata": {}, 7 | "source": [ 8 | "# Feature engineering for forecasting\n", 9 | "\n", 10 | "[Forecasting with Machine Learning - Course](https://www.trainindata.com/p/forecasting-with-machine-learning)\n", 11 | "\n", 12 | "In this notebook, we will create a table of predictive features and a target from a time series dataset, utilizing [Feature-engine](https://feature-engine.trainindata.com)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 2, 18 | "id": "42834613", 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "import matplotlib.pyplot as plt\n", 23 | "import pandas as pd\n", 24 | "\n", 25 | "from feature_engine.datetime import DatetimeFeatures\n", 26 | "from feature_engine.timeseries.forecasting import LagFeatures, WindowFeatures" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "id": "1c87901a", 32 | "metadata": {}, 33 | "source": [ 34 | "# Load data\n", 35 | "\n", 36 | "We will use the electricity 
demand dataset found [here](https://github.com/tidyverts/tsibbledata/tree/master/data-raw/vic_elec/VIC2015).\n",
    "\n",
    "**Citation:**\n",
    "\n",
    "Godahewa, Rakshitha, Bergmeir, Christoph, Webb, Geoff, Hyndman, Rob, & Montero-Manso, Pablo. (2021). Australian Electricity Demand Dataset (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4659727\n",
    "\n",
    "**Description of data:**\n",
    "\n",
    "A description of the data can be found [here](https://rdrr.io/cran/tsibbledata/man/vic_elec.html). The dataset contains the electricity demand in Victoria, Australia, at 30-minute intervals from 2002 to early 2015. It also includes the temperature in Melbourne at 30-minute intervals, as well as the public holiday dates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "0ab8e60d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
\n", 57 | "\n", 70 | "\n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | "
demand
date_time
2002-01-01 00:00:006919.366092
2002-01-01 01:00:007165.974188
2002-01-01 02:00:006406.542994
2002-01-01 03:00:005815.537828
2002-01-01 04:00:005497.732922
\n", 104 | "
" 105 | ], 106 | "text/plain": [ 107 | " demand\n", 108 | "date_time \n", 109 | "2002-01-01 00:00:00 6919.366092\n", 110 | "2002-01-01 01:00:00 7165.974188\n", 111 | "2002-01-01 02:00:00 6406.542994\n", 112 | "2002-01-01 03:00:00 5815.537828\n", 113 | "2002-01-01 04:00:00 5497.732922" 114 | ] 115 | }, 116 | "execution_count": 3, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "# Electricity demand.\n", 123 | "url = \"https://raw.githubusercontent.com/tidyverts/tsibbledata/master/data-raw/vic_elec/VIC2015/demand.csv\"\n", 124 | "df = pd.read_csv(url)\n", 125 | "\n", 126 | "df.drop(columns=[\"Industrial\"], inplace=True)\n", 127 | "\n", 128 | "# Convert the integer Date to an actual date with datetime type\n", 129 | "df[\"date\"] = df[\"Date\"].apply(\n", 130 | " lambda x: pd.Timestamp(\"1899-12-30\") + pd.Timedelta(x, unit=\"days\")\n", 131 | ")\n", 132 | "\n", 133 | "# Create a timestamp from the integer Period representing 30 minute intervals\n", 134 | "df[\"date_time\"] = df[\"date\"] + \\\n", 135 | " pd.to_timedelta((df[\"Period\"] - 1) * 30, unit=\"m\")\n", 136 | "\n", 137 | "df.dropna(inplace=True)\n", 138 | "\n", 139 | "# Rename columns\n", 140 | "df = df[[\"date_time\", \"OperationalLessIndustrial\"]]\n", 141 | "\n", 142 | "df.columns = [\"date_time\", \"demand\"]\n", 143 | "\n", 144 | "# Resample to hourly\n", 145 | "df = (\n", 146 | " df.set_index(\"date_time\")\n", 147 | " .resample(\"h\")\n", 148 | " .agg({\"demand\": \"sum\"})\n", 149 | ")\n", 150 | "\n", 151 | "df.head()" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "id": "a9b0a097", 157 | "metadata": {}, 158 | "source": [ 159 | "## Lag features\n", 160 | "\n", 161 | "We shift past values of the time series forward.\n", 162 | "\n", 163 | "With feature-engine, we can create all lags in one go." 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 7, 169 | "id": "c460a534", 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "data": { 174 | "text/html": [ 175 | "
\n", 176 | "\n", 189 | "\n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | "
demanddemand_lag_1_xdemand_lag_24_xdemand_lag_144_xdemand_lag_1_ydemand_lag_24_ydemand_lag_144_ydemand_window_3_meandemand_window_3_stddemand_window_24_meandemand_window_24_stddemand_lag_1demand_lag_24demand_lag_168
date_time
2002-01-01 00:00:006919.366092NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2002-01-01 01:00:007165.9741886919.366092NaNNaN6919.366092NaNNaNNaNNaNNaNNaN6919.366092NaNNaN
2002-01-01 02:00:006406.5429947165.974188NaNNaN7165.974188NaNNaNNaNNaNNaNNaN7165.974188NaNNaN
2002-01-01 03:00:005815.5378286406.542994NaNNaN6406.542994NaNNaN6830.627758387.414253NaNNaN6406.542994NaNNaN
2002-01-01 04:00:005497.7329225815.537828NaNNaN5815.537828NaNNaN6462.685003676.966421NaNNaN5815.537828NaNNaN
\n", 314 | "
" 315 | ], 316 | "text/plain": [ 317 | " demand demand_lag_1_x demand_lag_24_x \\\n", 318 | "date_time \n", 319 | "2002-01-01 00:00:00 6919.366092 NaN NaN \n", 320 | "2002-01-01 01:00:00 7165.974188 6919.366092 NaN \n", 321 | "2002-01-01 02:00:00 6406.542994 7165.974188 NaN \n", 322 | "2002-01-01 03:00:00 5815.537828 6406.542994 NaN \n", 323 | "2002-01-01 04:00:00 5497.732922 5815.537828 NaN \n", 324 | "\n", 325 | " demand_lag_144_x demand_lag_1_y demand_lag_24_y \\\n", 326 | "date_time \n", 327 | "2002-01-01 00:00:00 NaN NaN NaN \n", 328 | "2002-01-01 01:00:00 NaN 6919.366092 NaN \n", 329 | "2002-01-01 02:00:00 NaN 7165.974188 NaN \n", 330 | "2002-01-01 03:00:00 NaN 6406.542994 NaN \n", 331 | "2002-01-01 04:00:00 NaN 5815.537828 NaN \n", 332 | "\n", 333 | " demand_lag_144_y demand_window_3_mean \\\n", 334 | "date_time \n", 335 | "2002-01-01 00:00:00 NaN NaN \n", 336 | "2002-01-01 01:00:00 NaN NaN \n", 337 | "2002-01-01 02:00:00 NaN NaN \n", 338 | "2002-01-01 03:00:00 NaN 6830.627758 \n", 339 | "2002-01-01 04:00:00 NaN 6462.685003 \n", 340 | "\n", 341 | " demand_window_3_std demand_window_24_mean \\\n", 342 | "date_time \n", 343 | "2002-01-01 00:00:00 NaN NaN \n", 344 | "2002-01-01 01:00:00 NaN NaN \n", 345 | "2002-01-01 02:00:00 NaN NaN \n", 346 | "2002-01-01 03:00:00 387.414253 NaN \n", 347 | "2002-01-01 04:00:00 676.966421 NaN \n", 348 | "\n", 349 | " demand_window_24_std demand_lag_1 demand_lag_24 \\\n", 350 | "date_time \n", 351 | "2002-01-01 00:00:00 NaN NaN NaN \n", 352 | "2002-01-01 01:00:00 NaN 6919.366092 NaN \n", 353 | "2002-01-01 02:00:00 NaN 7165.974188 NaN \n", 354 | "2002-01-01 03:00:00 NaN 6406.542994 NaN \n", 355 | "2002-01-01 04:00:00 NaN 5815.537828 NaN \n", 356 | "\n", 357 | " demand_lag_168 \n", 358 | "date_time \n", 359 | "2002-01-01 00:00:00 NaN \n", 360 | "2002-01-01 01:00:00 NaN \n", 361 | "2002-01-01 02:00:00 NaN \n", 362 | "2002-01-01 03:00:00 NaN \n", 363 | "2002-01-01 04:00:00 NaN " 364 | ] 365 | }, 366 | "execution_count": 7, 367 | "metadata": {}, 368 | "output_type": "execute_result" 369 | } 370 | ], 371 | "source": [ 372 | "# We'll use the previous value, the value 24 hs before, \n", 373 | "# and the value at the same time the prior week.\n", 374 | "\n", 375 | "lag_f = LagFeatures(\n", 376 | " variables = \"demand\", # if none, it will make lags of all numerical variables\n", 377 | " periods=[1,24, 7*24],\n", 378 | ")\n", 379 | "\n", 380 | "df = lag_f.fit_transform(df)\n", 381 | "\n", 382 | "df.head()" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "id": "7a492968", 388 | "metadata": {}, 389 | "source": [ 390 | "## Window features\n", 391 | "\n", 392 | "We aggregate values within windows in the past.\n", 393 | "\n", 394 | "With Feature-engine, we can create many windows by using many functions, all in one go." 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": 6, 400 | "id": "82669807", 401 | "metadata": {}, 402 | "outputs": [ 403 | { 404 | "data": { 405 | "text/html": [ 406 | "
\n", 407 | "\n", 420 | "\n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | "
demanddemand_lag_1_xdemand_lag_24_xdemand_lag_144_xdemand_lag_1_ydemand_lag_24_ydemand_lag_144_ydemand_window_3_meandemand_window_3_stddemand_window_24_meandemand_window_24_std
date_time
2002-01-01 00:00:006919.366092NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2002-01-01 01:00:007165.9741886919.366092NaNNaN6919.366092NaNNaNNaNNaNNaNNaN
2002-01-01 02:00:006406.5429947165.974188NaNNaN7165.974188NaNNaNNaNNaNNaNNaN
2002-01-01 03:00:005815.5378286406.542994NaNNaN6406.542994NaNNaN6830.627758387.414253NaNNaN
2002-01-01 04:00:005497.7329225815.537828NaNNaN5815.537828NaNNaN6462.685003676.966421NaNNaN
\n", 524 | "
" 525 | ], 526 | "text/plain": [ 527 | " demand demand_lag_1_x demand_lag_24_x \\\n", 528 | "date_time \n", 529 | "2002-01-01 00:00:00 6919.366092 NaN NaN \n", 530 | "2002-01-01 01:00:00 7165.974188 6919.366092 NaN \n", 531 | "2002-01-01 02:00:00 6406.542994 7165.974188 NaN \n", 532 | "2002-01-01 03:00:00 5815.537828 6406.542994 NaN \n", 533 | "2002-01-01 04:00:00 5497.732922 5815.537828 NaN \n", 534 | "\n", 535 | " demand_lag_144_x demand_lag_1_y demand_lag_24_y \\\n", 536 | "date_time \n", 537 | "2002-01-01 00:00:00 NaN NaN NaN \n", 538 | "2002-01-01 01:00:00 NaN 6919.366092 NaN \n", 539 | "2002-01-01 02:00:00 NaN 7165.974188 NaN \n", 540 | "2002-01-01 03:00:00 NaN 6406.542994 NaN \n", 541 | "2002-01-01 04:00:00 NaN 5815.537828 NaN \n", 542 | "\n", 543 | " demand_lag_144_y demand_window_3_mean \\\n", 544 | "date_time \n", 545 | "2002-01-01 00:00:00 NaN NaN \n", 546 | "2002-01-01 01:00:00 NaN NaN \n", 547 | "2002-01-01 02:00:00 NaN NaN \n", 548 | "2002-01-01 03:00:00 NaN 6830.627758 \n", 549 | "2002-01-01 04:00:00 NaN 6462.685003 \n", 550 | "\n", 551 | " demand_window_3_std demand_window_24_mean \\\n", 552 | "date_time \n", 553 | "2002-01-01 00:00:00 NaN NaN \n", 554 | "2002-01-01 01:00:00 NaN NaN \n", 555 | "2002-01-01 02:00:00 NaN NaN \n", 556 | "2002-01-01 03:00:00 387.414253 NaN \n", 557 | "2002-01-01 04:00:00 676.966421 NaN \n", 558 | "\n", 559 | " demand_window_24_std \n", 560 | "date_time \n", 561 | "2002-01-01 00:00:00 NaN \n", 562 | "2002-01-01 01:00:00 NaN \n", 563 | "2002-01-01 02:00:00 NaN \n", 564 | "2002-01-01 03:00:00 NaN \n", 565 | "2002-01-01 04:00:00 NaN " 566 | ] 567 | }, 568 | "execution_count": 6, 569 | "metadata": {}, 570 | "output_type": "execute_result" 571 | } 572 | ], 573 | "source": [ 574 | "window_f = WindowFeatures(\n", 575 | " variables = \"demand\", # if none, it will make window features from all numerical variables\n", 576 | " window = [3, 24],\n", 577 | " functions = [\"mean\", \"std\"],\n", 578 | " missing_values=\"ignore\"\n", 579 | ")\n", 580 | "\n", 581 | "df = window_f.fit_transform(df)\n", 582 | "\n", 583 | "df.head()" 584 | ] 585 | }, 586 | { 587 | "cell_type": "markdown", 588 | "id": "29713ebf", 589 | "metadata": {}, 590 | "source": [ 591 | "## Datetime features\n", 592 | "\n", 593 | "With feature-engine, we can create many features automatically." 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": 8, 599 | "id": "b46f7c2e", 600 | "metadata": {}, 601 | "outputs": [ 602 | { 603 | "data": { 604 | "text/html": [ 605 | "
\n", 606 | "\n", 619 | "\n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | "
demanddemand_lag_1_xdemand_lag_24_xdemand_lag_144_xdemand_lag_1_ydemand_lag_24_ydemand_lag_144_ydemand_window_3_meandemand_window_3_stddemand_window_24_meandemand_window_24_stddemand_lag_1demand_lag_24demand_lag_168monthday_of_weekhour
date_time
2002-01-01 00:00:006919.366092NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN110
2002-01-01 01:00:007165.9741886919.366092NaNNaN6919.366092NaNNaNNaNNaNNaNNaN6919.366092NaNNaN111
2002-01-01 02:00:006406.5429947165.974188NaNNaN7165.974188NaNNaNNaNNaNNaNNaN7165.974188NaNNaN112
2002-01-01 03:00:005815.5378286406.542994NaNNaN6406.542994NaNNaN6830.627758387.414253NaNNaN6406.542994NaNNaN113
2002-01-01 04:00:005497.7329225815.537828NaNNaN5815.537828NaNNaN6462.685003676.966421NaNNaN5815.537828NaNNaN114
\n", 765 | "
" 766 | ], 767 | "text/plain": [ 768 | " demand demand_lag_1_x demand_lag_24_x \\\n", 769 | "date_time \n", 770 | "2002-01-01 00:00:00 6919.366092 NaN NaN \n", 771 | "2002-01-01 01:00:00 7165.974188 6919.366092 NaN \n", 772 | "2002-01-01 02:00:00 6406.542994 7165.974188 NaN \n", 773 | "2002-01-01 03:00:00 5815.537828 6406.542994 NaN \n", 774 | "2002-01-01 04:00:00 5497.732922 5815.537828 NaN \n", 775 | "\n", 776 | " demand_lag_144_x demand_lag_1_y demand_lag_24_y \\\n", 777 | "date_time \n", 778 | "2002-01-01 00:00:00 NaN NaN NaN \n", 779 | "2002-01-01 01:00:00 NaN 6919.366092 NaN \n", 780 | "2002-01-01 02:00:00 NaN 7165.974188 NaN \n", 781 | "2002-01-01 03:00:00 NaN 6406.542994 NaN \n", 782 | "2002-01-01 04:00:00 NaN 5815.537828 NaN \n", 783 | "\n", 784 | " demand_lag_144_y demand_window_3_mean \\\n", 785 | "date_time \n", 786 | "2002-01-01 00:00:00 NaN NaN \n", 787 | "2002-01-01 01:00:00 NaN NaN \n", 788 | "2002-01-01 02:00:00 NaN NaN \n", 789 | "2002-01-01 03:00:00 NaN 6830.627758 \n", 790 | "2002-01-01 04:00:00 NaN 6462.685003 \n", 791 | "\n", 792 | " demand_window_3_std demand_window_24_mean \\\n", 793 | "date_time \n", 794 | "2002-01-01 00:00:00 NaN NaN \n", 795 | "2002-01-01 01:00:00 NaN NaN \n", 796 | "2002-01-01 02:00:00 NaN NaN \n", 797 | "2002-01-01 03:00:00 387.414253 NaN \n", 798 | "2002-01-01 04:00:00 676.966421 NaN \n", 799 | "\n", 800 | " demand_window_24_std demand_lag_1 demand_lag_24 \\\n", 801 | "date_time \n", 802 | "2002-01-01 00:00:00 NaN NaN NaN \n", 803 | "2002-01-01 01:00:00 NaN 6919.366092 NaN \n", 804 | "2002-01-01 02:00:00 NaN 7165.974188 NaN \n", 805 | "2002-01-01 03:00:00 NaN 6406.542994 NaN \n", 806 | "2002-01-01 04:00:00 NaN 5815.537828 NaN \n", 807 | "\n", 808 | " demand_lag_168 month day_of_week hour \n", 809 | "date_time \n", 810 | "2002-01-01 00:00:00 NaN 1 1 0 \n", 811 | "2002-01-01 01:00:00 NaN 1 1 1 \n", 812 | "2002-01-01 02:00:00 NaN 1 1 2 \n", 813 | "2002-01-01 03:00:00 NaN 1 1 3 \n", 814 | "2002-01-01 04:00:00 NaN 1 1 4 " 815 | ] 816 | }, 817 | "execution_count": 8, 818 | "metadata": {}, 819 | "output_type": "execute_result" 820 | } 821 | ], 822 | "source": [ 823 | "date_f = DatetimeFeatures(\n", 824 | " variables=\"index\",\n", 825 | " features_to_extract=[\"month\", \"day_of_week\", \"hour\"]\n", 826 | ")\n", 827 | "\n", 828 | "df = date_f.fit_transform(df)\n", 829 | "\n", 830 | "df.head()" 831 | ] 832 | }, 833 | { 834 | "cell_type": "markdown", 835 | "id": "c185bfde", 836 | "metadata": {}, 837 | "source": [ 838 | "## Finalize tabularization\n", 839 | "\n", 840 | "Now we just separate our data into the table of features and the target variable." 841 | ] 842 | }, 843 | { 844 | "cell_type": "code", 845 | "execution_count": 9, 846 | "id": "c27e6d18", 847 | "metadata": {}, 848 | "outputs": [ 849 | { 850 | "data": { 851 | "text/html": [ 852 | "
\n", 853 | "\n", 866 | "\n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | "
demand_lag_1_xdemand_lag_24_xdemand_lag_144_xdemand_lag_1_ydemand_lag_24_ydemand_lag_144_ydemand_window_3_meandemand_window_3_stddemand_window_24_meandemand_window_24_stddemand_lag_1demand_lag_24demand_lag_168monthday_of_weekhour
date_time
2002-01-08 00:00:007406.0479106808.0089166579.2198807406.0479106808.0089166579.2198807003.111898369.9471687637.818532865.1853517406.0479106808.0089166919.366092110
2002-01-08 01:00:007077.0819047209.2857126990.8264207077.0819047209.2857126990.8264207053.973603364.1787347649.029907855.6557567077.0819047209.2857127165.974188111
2002-01-08 02:00:007445.3543106535.8183426382.9150187445.3543106535.8183426382.9150187309.494708202.2326187658.866098851.7287427445.3543106535.8183426406.542994112
2002-01-08 03:00:006800.5774786112.3826365896.9281386800.5774786112.3826365896.9281387107.671231323.4749937669.897729838.1570086800.5774786112.3826365815.537828113
2002-01-08 04:00:006340.9140866165.8820965853.9371406340.9140866165.8820965853.9371406862.281958554.7996347679.419873820.8117166340.9140866165.8820965497.732922114
\n", 1005 | "
" 1006 | ], 1007 | "text/plain": [ 1008 | " demand_lag_1_x demand_lag_24_x demand_lag_144_x \\\n", 1009 | "date_time \n", 1010 | "2002-01-08 00:00:00 7406.047910 6808.008916 6579.219880 \n", 1011 | "2002-01-08 01:00:00 7077.081904 7209.285712 6990.826420 \n", 1012 | "2002-01-08 02:00:00 7445.354310 6535.818342 6382.915018 \n", 1013 | "2002-01-08 03:00:00 6800.577478 6112.382636 5896.928138 \n", 1014 | "2002-01-08 04:00:00 6340.914086 6165.882096 5853.937140 \n", 1015 | "\n", 1016 | " demand_lag_1_y demand_lag_24_y demand_lag_144_y \\\n", 1017 | "date_time \n", 1018 | "2002-01-08 00:00:00 7406.047910 6808.008916 6579.219880 \n", 1019 | "2002-01-08 01:00:00 7077.081904 7209.285712 6990.826420 \n", 1020 | "2002-01-08 02:00:00 7445.354310 6535.818342 6382.915018 \n", 1021 | "2002-01-08 03:00:00 6800.577478 6112.382636 5896.928138 \n", 1022 | "2002-01-08 04:00:00 6340.914086 6165.882096 5853.937140 \n", 1023 | "\n", 1024 | " demand_window_3_mean demand_window_3_std \\\n", 1025 | "date_time \n", 1026 | "2002-01-08 00:00:00 7003.111898 369.947168 \n", 1027 | "2002-01-08 01:00:00 7053.973603 364.178734 \n", 1028 | "2002-01-08 02:00:00 7309.494708 202.232618 \n", 1029 | "2002-01-08 03:00:00 7107.671231 323.474993 \n", 1030 | "2002-01-08 04:00:00 6862.281958 554.799634 \n", 1031 | "\n", 1032 | " demand_window_24_mean demand_window_24_std \\\n", 1033 | "date_time \n", 1034 | "2002-01-08 00:00:00 7637.818532 865.185351 \n", 1035 | "2002-01-08 01:00:00 7649.029907 855.655756 \n", 1036 | "2002-01-08 02:00:00 7658.866098 851.728742 \n", 1037 | "2002-01-08 03:00:00 7669.897729 838.157008 \n", 1038 | "2002-01-08 04:00:00 7679.419873 820.811716 \n", 1039 | "\n", 1040 | " demand_lag_1 demand_lag_24 demand_lag_168 month \\\n", 1041 | "date_time \n", 1042 | "2002-01-08 00:00:00 7406.047910 6808.008916 6919.366092 1 \n", 1043 | "2002-01-08 01:00:00 7077.081904 7209.285712 7165.974188 1 \n", 1044 | "2002-01-08 02:00:00 7445.354310 6535.818342 6406.542994 1 \n", 1045 | "2002-01-08 03:00:00 6800.577478 6112.382636 5815.537828 1 \n", 1046 | "2002-01-08 04:00:00 6340.914086 6165.882096 5497.732922 1 \n", 1047 | "\n", 1048 | " day_of_week hour \n", 1049 | "date_time \n", 1050 | "2002-01-08 00:00:00 1 0 \n", 1051 | "2002-01-08 01:00:00 1 1 \n", 1052 | "2002-01-08 02:00:00 1 2 \n", 1053 | "2002-01-08 03:00:00 1 3 \n", 1054 | "2002-01-08 04:00:00 1 4 " 1055 | ] 1056 | }, 1057 | "execution_count": 9, 1058 | "metadata": {}, 1059 | "output_type": "execute_result" 1060 | } 1061 | ], 1062 | "source": [ 1063 | "df.dropna(inplace=True)\n", 1064 | "\n", 1065 | "y = df[\"demand\"]\n", 1066 | "X = df.drop(\"demand\", axis=1)\n", 1067 | "\n", 1068 | "# Predictors\n", 1069 | "\n", 1070 | "X.head()" 1071 | ] 1072 | }, 1073 | { 1074 | "cell_type": "code", 1075 | "execution_count": 10, 1076 | "id": "4ed38b9e", 1077 | "metadata": {}, 1078 | "outputs": [ 1079 | { 1080 | "data": { 1081 | "text/plain": [ 1082 | "date_time\n", 1083 | "2002-01-08 00:00:00 7077.081904\n", 1084 | "2002-01-08 01:00:00 7445.354310\n", 1085 | "2002-01-08 02:00:00 6800.577478\n", 1086 | "2002-01-08 03:00:00 6340.914086\n", 1087 | "2002-01-08 04:00:00 6277.978250\n", 1088 | "Freq: h, Name: demand, dtype: float64" 1089 | ] 1090 | }, 1091 | "execution_count": 10, 1092 | "metadata": {}, 1093 | "output_type": "execute_result" 1094 | } 1095 | ], 1096 | "source": [ 1097 | "# target\n", 1098 | "\n", 1099 | "y.head()" 1100 | ] 1101 | }, 1102 | { 1103 | "cell_type": "markdown", 1104 | "id": "2439ccc5", 1105 | "metadata": {}, 1106 | "source": [ 1107 | "That's it! 
We can now forecast the energy demand in the next hour as a regression.\n", 1108 | "\n", 1109 | "In this notebook, we only extracted features from the time series. We can add more features from external data sources. We will address that in coming notebooks." 1110 | ] 1111 | }, 1112 | { 1113 | "cell_type": "code", 1114 | "execution_count": null, 1115 | "id": "78a62f26", 1116 | "metadata": {}, 1117 | "outputs": [], 1118 | "source": [] 1119 | } 1120 | ], 1121 | "metadata": { 1122 | "kernelspec": { 1123 | "display_name": ".venv", 1124 | "language": "python", 1125 | "name": "python3" 1126 | }, 1127 | "language_info": { 1128 | "codemirror_mode": { 1129 | "name": "ipython", 1130 | "version": 3 1131 | }, 1132 | "file_extension": ".py", 1133 | "mimetype": "text/x-python", 1134 | "name": "python", 1135 | "nbconvert_exporter": "python", 1136 | "pygments_lexer": "ipython3", 1137 | "version": "3.12.6" 1138 | }, 1139 | "toc": { 1140 | "base_numbering": 1, 1141 | "nav_menu": {}, 1142 | "number_sections": true, 1143 | "sideBar": true, 1144 | "skip_h1_title": false, 1145 | "title_cell": "Table of Contents", 1146 | "title_sidebar": "Contents", 1147 | "toc_cell": false, 1148 | "toc_position": {}, 1149 | "toc_section_display": true, 1150 | "toc_window_display": true 1151 | } 1152 | }, 1153 | "nbformat": 4, 1154 | "nbformat_minor": 5 1155 | } 1156 | -------------------------------------------------------------------------------- /02-multistep-forecasting/assignment/02-assignment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "a9b2affd", 6 | "metadata": {}, 7 | "source": [ 8 | "# Direct forecasting\n", 9 | "\n", 10 | "[Forecasting with Machine Learning - Course](https://www.trainindata.com/p/forecasting-with-machine-learning)\n", 11 | "\n", 12 | "Load the retail sales data set located in Facebook's Prophet Github repository and use **direct forecasting** to predict future sales. \n", 13 | "\n", 14 | "- We want to forecast sales over the next 3 months. \n", 15 | "- Sales are recorded monthly. \n", 16 | "- We assume that we have all data to the month before the first point in the forecasting horizon.\n", 17 | "\n", 18 | "We will forecast using Scikit-learn in this exercise.\n", 19 | "\n", 20 | "Follow the guidelines below to accomplish this assignment.\n", 21 | "\n", 22 | "## Import required classes and functions" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "id": "6fde7533", 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "import pandas as pd" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "id": "b50b4411", 38 | "metadata": {}, 39 | "source": [ 40 | "## Load data" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "id": "cf8dd614", 47 | "metadata": {}, 48 | "outputs": [ 49 | { 50 | "data": { 51 | "text/html": [ 52 | "
\n", 53 | "\n", 66 | "\n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | "
y
ds
1992-01-01146376
1992-02-01147079
1992-03-01159336
1992-04-01163669
1992-05-01170068
\n",
       " ],
      "text/plain": [
       "                 y\n",
       "ds                \n",
       "1992-01-01  146376\n",
       "1992-02-01  147079\n",
       "1992-03-01  159336\n",
       "1992-04-01  163669\n",
       "1992-05-01  170068"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url = \"https://raw.githubusercontent.com/facebook/prophet/master/examples/example_retail_sales.csv\"\n",
    "df = pd.read_csv(url)\n",
    "df.to_csv(\"example_retail_sales.csv\", index=False)\n",
    "\n",
    "df = pd.read_csv(\n",
    "    \"example_retail_sales.csv\",\n",
    "    parse_dates=[\"ds\"],\n",
    "    index_col=[\"ds\"],\n",
    "    nrows=160,\n",
    ")\n",
    "\n",
    "df = df.asfreq(\"MS\")\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d03172ab",
   "metadata": {},
   "source": [
    "## Create the target variable\n",
    "\n",
    "In direct forecasting, we train a model per step. Hence, we need to create one target per step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d77bd8b7",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "1d003eaf",
   "metadata": {},
   "source": [
    "## Split data into train and test\n",
    "\n",
    "Leave data from 2004 onwards in the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f26c24e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "caa19336",
   "metadata": {},
   "source": [
    "## Set up regression model\n",
    "\n",
    "We will use Lasso in this assignment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c61e6218",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "9b69bbde",
   "metadata": {},
   "source": [
    "## Set up a feature engineering pipeline\n",
    "\n",
    "Set up transformers from feature-engine and/or scikit-learn in a pipeline, and test it to make sure the input feature table is the one you need for the forecasts.\n",
    "\n",
    "We will use feature-engine because we are great fans of the library.\n",
    "\n",
    "If you prefer pandas, as long as the input feature table is the one you expect, that is also a suitable alternative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00908e10",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "721fb6aa",
   "metadata": {},
   "source": [
    "## Test pipeline over test set\n",
    "\n",
    "Ensure that the returned input feature table is suitable to forecast from `2004-01-01` onwards."
218 ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84aee4f3",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "3132210a",
   "metadata": {},
   "source": [
    "## Train the direct forecaster\n",
    "\n",
    "Now that we know that the pipeline works, we can train the forecaster.\n",
    "\n",
    "You can take the feature table and targets returned up to here to train the Lasso.\n",
    "\n",
    "Or, as we will do, you can add the Lasso within the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a120faf",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "886f78c8",
   "metadata": {},
   "source": [
    "## Forecast 3 months of sales\n",
    "\n",
    "We'll start by forecasting 3 months of sales, starting at every point of the test set.\n",
    "\n",
    "This is the equivalent of backtesting without refitting. More info in section 6!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1cde1862",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "578c60c4",
   "metadata": {},
   "source": [
    "## Plot predictions vs actuals\n",
    "\n",
    "Pick the first row of predictions and plot them against the real sales."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "778c7bd5",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "528988f7",
   "metadata": {},
   "source": [
    "## Determine the RMSE\n",
    "\n",
    "Pick the first row of predictions and calculate the RMSE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67408586",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "914b7abb",
   "metadata": {},
   "source": [
    "## Forecast next 3 months of sales\n",
    "\n",
    "Predict the 3 months of sales immediately after the end of the test set.\n",
    "\n",
    "That is, starting on `2005-02-02`.",
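    "\n",
    "If the direct strategy feels abstract, here is a tiny self-contained toy example (our addition, not the official solution; all names are made up) of the core idea, one model per horizon step:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.linear_model import Lasso\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "X = rng.normal(size=(100, 3))  # stand-in feature table\n",
    "\n",
    "# One target per step: y[h] plays the role of the series shifted h steps ahead.\n",
    "y = {h: X @ np.array([1.0, 2.0, 3.0]) + h + rng.normal(size=100) for h in (1, 2, 3)}\n",
    "\n",
    "# Direct forecasting trains one model per step...\n",
    "models = {h: Lasso(alpha=0.1).fit(X, y[h]) for h in (1, 2, 3)}\n",
    "\n",
    "# ...and predicts every step from the same forecast-origin features.\n",
    "x_last = X[-1:]\n",
    "print([models[h].predict(x_last)[0] for h in (1, 2, 3)])\n",
    "```"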
316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "id": "9c4f46ac", 322 | "metadata": {}, 323 | "outputs": [], 324 | "source": [] 325 | } 326 | ], 327 | "metadata": { 328 | "kernelspec": { 329 | "display_name": "fwml", 330 | "language": "python", 331 | "name": "fwml" 332 | }, 333 | "language_info": { 334 | "codemirror_mode": { 335 | "name": "ipython", 336 | "version": 3 337 | }, 338 | "file_extension": ".py", 339 | "mimetype": "text/x-python", 340 | "name": "python", 341 | "nbconvert_exporter": "python", 342 | "pygments_lexer": "ipython3", 343 | "version": "3.10.5" 344 | }, 345 | "toc": { 346 | "base_numbering": 1, 347 | "nav_menu": {}, 348 | "number_sections": true, 349 | "sideBar": true, 350 | "skip_h1_title": false, 351 | "title_cell": "Table of Contents", 352 | "title_sidebar": "Contents", 353 | "toc_cell": false, 354 | "toc_position": {}, 355 | "toc_section_display": true, 356 | "toc_window_display": true 357 | } 358 | }, 359 | "nbformat": 4, 360 | "nbformat_minor": 5 361 | } 362 | -------------------------------------------------------------------------------- /03-multiseries-forecasting/exercise-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "327f3f9b-3f15-4bec-bafe-b75e5cf91740", 6 | "metadata": {}, 7 | "source": [ 8 | "# Exercise 1: Multiple independent time series" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "8f5b8ff1-e3f7-47af-b603-15d89d0c6fe5", 14 | "metadata": {}, 15 | "source": [ 16 | "[Forecasting for machine learning](https://www.trainindata.com/p/forecasting-with-machine-learning)\n", 17 | "\n", 18 | "In this notebook we have an exercise to do multiple independent time series forecasting. The solutions we show are only one way of answering these questions." 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "id": "91a8c715-0369-4928-907d-9492db4f37a1", 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "import numpy as np\n", 29 | "import pandas as pd\n", 30 | "import matplotlib.pyplot as plt" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "f6e9fcef-4dde-4107-bce6-216516626f86", 36 | "metadata": {}, 37 | "source": [ 38 | "# Data preparation" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "id": "d91ac0bd-372a-4ef0-9c0e-a0e0cf12560d", 44 | "metadata": {}, 45 | "source": [ 46 | "The dataset we shall use is the Quarterly overnight trips (in thousands) from 1998 Q1 to 2016 Q4 across\n", 47 | "Australia. The number of trips is split by `State`, `Region`, and `Purpose`. \n", 48 | "\n", 49 | "**In this exercise we are going to forecast the total number of trips for each Region (there are 76 regions therefore we will have 76 time series). 
We shall treat this as a multiple independent time series forecasting problem.**\n",
    "\n",
    "Source: Wang, E., D. Cook, and R.J. Hyndman (2020). A new tidy data structure to support\n",
    "exploration and modeling of temporal data, Journal of Computational and\n",
    "Graphical Statistics, 29:3, 466-478, doi:10.1080/10618600.2019.1695624."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "76ef5949-cbf1-4109-887d-104f71574c2a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "australia_tourism\n",
      "-----------------\n",
      "Quarterly overnight trips (in thousands) from 1998 Q1 to 2016 Q4 across\n",
      "Australia. The tourism regions are formed through the aggregation of Statistical\n",
      "Local Areas (SLAs) which are defined by the various State and Territory tourism\n",
      "authorities according to their research and marketing needs.\n",
      "Wang, E, D Cook, and RJ Hyndman (2020). A new tidy data structure to support\n",
      "exploration and modeling of temporal data, Journal of Computational and\n",
      "Graphical Statistics, 29:3, 466-478, doi:10.1080/10618600.2019.1695624.\n",
      "Shape of the dataset: (24320, 5)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "
\n", 83 | "\n", 96 | "\n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | "
date_timeRegionStatePurposeTrips
01998-01-01AdelaideSouth AustraliaBusiness135.077690
11998-04-01AdelaideSouth AustraliaBusiness109.987316
21998-07-01AdelaideSouth AustraliaBusiness166.034687
31998-10-01AdelaideSouth AustraliaBusiness127.160464
41999-01-01AdelaideSouth AustraliaBusiness137.448533
\n", 150 | "
" 151 | ], 152 | "text/plain": [ 153 | " date_time Region State Purpose Trips\n", 154 | "0 1998-01-01 Adelaide South Australia Business 135.077690\n", 155 | "1 1998-04-01 Adelaide South Australia Business 109.987316\n", 156 | "2 1998-07-01 Adelaide South Australia Business 166.034687\n", 157 | "3 1998-10-01 Adelaide South Australia Business 127.160464\n", 158 | "4 1999-01-01 Adelaide South Australia Business 137.448533" 159 | ] 160 | }, 161 | "execution_count": 2, 162 | "metadata": {}, 163 | "output_type": "execute_result" 164 | } 165 | ], 166 | "source": [ 167 | "from skforecast.datasets import fetch_dataset\n", 168 | "\n", 169 | "# Load the data\n", 170 | "data = fetch_dataset(name=\"australia_tourism\", raw=True)\n", 171 | "data.head()" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "id": "48a62223-83ab-4a7e-af1b-4fe91e1b31c7", 177 | "metadata": {}, 178 | "source": [ 179 | "Pre-process the data by performing the following:\n", 180 | "1) Convert the `date_time` column to datetime type\n", 181 | "2) Create a dataframe with one column per `Region` which gives the total number of Trips for each date.\n", 182 | "3) Ensure the index is `date_time` and resampled to quarterly start `QS`\n" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 3, 188 | "id": "6506a6d4-e680-4c12-9dd1-7e037c91a2ef", 189 | "metadata": {}, 190 | "outputs": [ 191 | { 192 | "data": { 193 | "text/html": [ 194 | "
\n", 195 | "\n", 208 | "\n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | "
RegionAdelaideAdelaide HillsAlice SpringsAustralia's Coral CoastAustralia's Golden OutbackAustralia's North WestAustralia's South WestBallaratBarklyBarossa...Sunshine CoastSydneyThe MurrayTropical North QueenslandUpper YarraWestern GrampiansWhitsundaysWilderness WestWimmeraYorke Peninsula
date_time
1998-01-01658.5538959.79863020.207638132.516409161.726948120.775450474.858729182.23934118.46520646.796083...742.6022992288.955629356.500087220.915346102.79102286.99659160.22664963.33509718.804743160.681637
1998-04-01449.85393526.06695256.356223172.615378164.973780158.404387411.622281137.5665397.51096949.428717...609.8833331814.459480312.291189253.09761674.85513684.939977106.19084842.60707652.482311104.324252
1998-07-01592.90459726.491072110.918441173.904335206.879934184.619035360.039657117.64276143.56562529.743302...615.3063311989.731939376.718698423.50673559.46540579.97488481.77100518.85121435.65755168.996468
1998-10-01524.24276027.25685940.868270207.002571198.509591138.878263462.620050136.07272429.35923978.193066...684.4302392150.913627336.367694283.69445135.238855116.235617105.60014350.45096527.204455103.340264
1999-01-01548.39410513.77297548.368038198.856638140.213443103.337122562.974629156.4562426.34199735.277910...842.1674181779.286905323.418472194.50990467.823457101.765635111.50497259.88800350.219851146.658290
\n", 382 | "

5 rows × 76 columns

\n", 383 | "
" 384 | ], 385 | "text/plain": [ 386 | "Region Adelaide Adelaide Hills Alice Springs \\\n", 387 | "date_time \n", 388 | "1998-01-01 658.553895 9.798630 20.207638 \n", 389 | "1998-04-01 449.853935 26.066952 56.356223 \n", 390 | "1998-07-01 592.904597 26.491072 110.918441 \n", 391 | "1998-10-01 524.242760 27.256859 40.868270 \n", 392 | "1999-01-01 548.394105 13.772975 48.368038 \n", 393 | "\n", 394 | "Region Australia's Coral Coast Australia's Golden Outback \\\n", 395 | "date_time \n", 396 | "1998-01-01 132.516409 161.726948 \n", 397 | "1998-04-01 172.615378 164.973780 \n", 398 | "1998-07-01 173.904335 206.879934 \n", 399 | "1998-10-01 207.002571 198.509591 \n", 400 | "1999-01-01 198.856638 140.213443 \n", 401 | "\n", 402 | "Region Australia's North West Australia's South West Ballarat \\\n", 403 | "date_time \n", 404 | "1998-01-01 120.775450 474.858729 182.239341 \n", 405 | "1998-04-01 158.404387 411.622281 137.566539 \n", 406 | "1998-07-01 184.619035 360.039657 117.642761 \n", 407 | "1998-10-01 138.878263 462.620050 136.072724 \n", 408 | "1999-01-01 103.337122 562.974629 156.456242 \n", 409 | "\n", 410 | "Region Barkly Barossa ... Sunshine Coast Sydney \\\n", 411 | "date_time ... \n", 412 | "1998-01-01 18.465206 46.796083 ... 742.602299 2288.955629 \n", 413 | "1998-04-01 7.510969 49.428717 ... 609.883333 1814.459480 \n", 414 | "1998-07-01 43.565625 29.743302 ... 615.306331 1989.731939 \n", 415 | "1998-10-01 29.359239 78.193066 ... 684.430239 2150.913627 \n", 416 | "1999-01-01 6.341997 35.277910 ... 842.167418 1779.286905 \n", 417 | "\n", 418 | "Region The Murray Tropical North Queensland Upper Yarra \\\n", 419 | "date_time \n", 420 | "1998-01-01 356.500087 220.915346 102.791022 \n", 421 | "1998-04-01 312.291189 253.097616 74.855136 \n", 422 | "1998-07-01 376.718698 423.506735 59.465405 \n", 423 | "1998-10-01 336.367694 283.694451 35.238855 \n", 424 | "1999-01-01 323.418472 194.509904 67.823457 \n", 425 | "\n", 426 | "Region Western Grampians Whitsundays Wilderness West Wimmera \\\n", 427 | "date_time \n", 428 | "1998-01-01 86.996591 60.226649 63.335097 18.804743 \n", 429 | "1998-04-01 84.939977 106.190848 42.607076 52.482311 \n", 430 | "1998-07-01 79.974884 81.771005 18.851214 35.657551 \n", 431 | "1998-10-01 116.235617 105.600143 50.450965 27.204455 \n", 432 | "1999-01-01 101.765635 111.504972 59.888003 50.219851 \n", 433 | "\n", 434 | "Region Yorke Peninsula \n", 435 | "date_time \n", 436 | "1998-01-01 160.681637 \n", 437 | "1998-04-01 104.324252 \n", 438 | "1998-07-01 68.996468 \n", 439 | "1998-10-01 103.340264 \n", 440 | "1999-01-01 146.658290 \n", 441 | "\n", 442 | "[5 rows x 76 columns]" 443 | ] 444 | }, 445 | "execution_count": 3, 446 | "metadata": {}, 447 | "output_type": "execute_result" 448 | } 449 | ], 450 | "source": [] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "id": "589f58e0-6106-430c-ab2b-557ea84bb2a5", 455 | "metadata": {}, 456 | "source": [ 457 | "Check for missing values." 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "id": "92bd6aaf-caac-4bb8-96a2-c0d59220f6e8", 464 | "metadata": {}, 465 | "outputs": [], 466 | "source": [] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "id": "453278bb-fa05-42d1-bc66-a17b05d501c9", 471 | "metadata": {}, 472 | "source": [ 473 | "Later we may want to use LightGBM, it does not support special JSON characters (e.g., `'`) in the column name. Let's remove these characters from the column names." 
474 ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "872210ab-580d-4420-9fbc-2e3014d5883b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "data = data.rename(columns=lambda x: re.sub(\"[^A-Za-z0-9_]+\", \"\", x))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44fb99c2-65b9-4665-9c55-528af407baee",
   "metadata": {},
   "source": [
    "Assign the names of the regions to a variable `Region`. We will use this later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57f6888c-a06c-4cd8-a358-039860bc5efd",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "eb03773c-a9bf-4fa8-84a4-099af6b87827",
   "metadata": {},
   "source": [
    "# Exploratory data analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d1ba5b6-b49d-4bab-882e-915f075b9493",
   "metadata": {},
   "source": [
    "Print the number of data points in the time series, the start time, and the end time of the time series."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36af3ab9-4267-471d-800b-dc9a3dcdb43e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "3ffa8a95-d0d5-48e4-ba5f-b92a4f54c0ef",
   "metadata": {},
   "source": [
    "Plot the time series summed over all regions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1629484-508f-419f-8647-1ef6626eab26",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "16971015-96a9-480e-9f9e-f21637d68e87",
   "metadata": {},
   "source": [
    "Plot a subsample of the time series from different regions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7193016-417f-4bec-87e0-dcce787ee7fe",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "143c6eab-8a90-4b45-9f63-db450d91c296",
   "metadata": {},
   "source": [
    "These series appear to show yearly seasonality, and some appear to be anti-correlated (i.e., some areas experience peaks whilst others experience troughs)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af0befd7-d612-4bfe-8cb6-fc2bfd2bc9f1",
   "metadata": {},
   "source": [
    "Create a quarter-of-the-year feature, which could help capture the yearly seasonality.",
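    "\n",
    "One possible approach (our addition; it assumes `data` carries the quarterly `DatetimeIndex` built above):\n",
    "\n",
    "```python\n",
    "# pandas exposes the quarter (1-4) directly on a DatetimeIndex.\n",
    "data[\"quarter\"] = data.index.quarter\n",
    "```"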
574 ]
  },
  {
   "cell_type": "markdown",
   "id": "bed95ade-4109-4b31-847b-28f5c9807692",
   "metadata": {},
   "source": [
    "# Forecasting"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a40647fe-99b4-474e-8a03-b7bf4f9eed29",
   "metadata": {},
   "source": [
    "Import the class needed for recursive forecasting of multiple independent time series from `skforecast`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e00293e1",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b2e3c25",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "613011ce-c45c-40b5-b46a-e6a8037d4f08",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "d692c732-e699-46cb-bab6-cd148e9af1f1",
   "metadata": {},
   "source": [
    "Import a transformer from `sklearn` to scale the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cadffd44-e241-4677-b640-9c8b9db2784f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "eaf0b4fb-9548-4da1-b356-e9ec377d68f9",
   "metadata": {},
   "source": [
    "Import a model of your choice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0495bdc3-5e22-483a-b0ab-2b3e0402fbee",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "66a304ca-77a4-48f6-99ef-f24d30417daa",
   "metadata": {},
   "source": [
    "Assign the names of the regions to a `target_cols` variable and any exogenous features to an `exog_cols` variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be5cfa65-d69b-427c-bccc-0021fa04ef02",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "9a24e307-22ef-44f1-8f11-b7746a70d3bb",
   "metadata": {},
   "source": [
    "Specify a forecast horizon and assign it to a variable `steps`. Try forecasting 8 quarters into the future.",
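    "\n",
    "For example (our addition):\n",
    "\n",
    "```python\n",
    "# 8 quarterly steps = 2 years ahead.\n",
    "steps = 8\n",
    "```"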
678 | ] 679 | }, 680 | { 681 | "cell_type": "code", 682 | "execution_count": null, 683 | "id": "2b37700d-9bfa-4c60-926a-b48a9e32c7ce", 684 | "metadata": {}, 685 | "outputs": [], 686 | "source": [] 687 | }, 688 | { 689 | "cell_type": "markdown", 690 | "id": "4109bb1b-b6a1-48f1-99ca-caca04c01a63", 691 | "metadata": {}, 692 | "source": [ 693 | "Create a dataframe for the future values of any exogenous features.\n", 694 | "\n", 695 | "Hint: `pd.DateOffset` and using `freq=\"QS\"` in `pd.date_range` might be helpful." 696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": null, 701 | "id": "c96dd685-09d8-42d0-949b-652acf45e56a", 702 | "metadata": {}, 703 | "outputs": [], 704 | "source": [] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "id": "dc854213-25d3-47f9-877b-af4a139ee94f", 709 | "metadata": {}, 710 | "source": [ 711 | "Define window features using the `RollingFeatures` class from skforecast. Try windows of 4 and 8 (1 and 2 years)." 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": null, 717 | "id": "b2d65f26-66b6-4bf0-9275-9d51f7561589", 718 | "metadata": {}, 719 | "outputs": [], 720 | "source": [] 721 | }, 722 | { 723 | "cell_type": "markdown", 724 | "id": "7023fc71-a772-462b-8456-f34aa91c71bf", 725 | "metadata": {}, 726 | "source": [ 727 | "Define a weight function (a function of the time axis) that linearly decreases the weight from 1 to 0 as we go back in time. This will give more weight to recent dates. Define it so there are no hard-coded dates in the function.\n", 728 | "\n", 729 | "Hint: Consider using `np.linspace`" 730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": null, 735 | "id": "2fceaa6a-3795-4ebc-89a4-cdd0933afe65", 736 | "metadata": {}, 737 | "outputs": [], 738 | "source": [] 739 | }, 740 | { 741 | "cell_type": "markdown", 742 | "id": "3ccac13b-ba92-4403-840b-0107271cb8df", 743 | "metadata": {}, 744 | "source": [ 745 | "Define a forecaster to predict all the time series. Pass your weight function and window features to the forecaster." 746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "execution_count": null, 751 | "id": "3a6dca60-9b60-4504-a7be-763172567716", 752 | "metadata": {}, 753 | "outputs": [], 754 | "source": [] 755 | }, 756 | { 757 | "cell_type": "markdown", 758 | "id": "a511fcdc-dea7-48a8-ac05-8089bb92cad2", 759 | "metadata": {}, 760 | "source": [ 761 | "Fit the forecaster." 762 | ] 763 | }, 764 | { 765 | "cell_type": "code", 766 | "execution_count": null, 767 | "id": "3d0ebf20-d667-4298-bb05-9c84770b3169", 768 | "metadata": {}, 769 | "outputs": [], 770 | "source": [] 771 | }, 772 | { 773 | "cell_type": "markdown", 774 | "id": "f03fa990-4113-4a4e-be3b-28f6695ec179", 775 | "metadata": {}, 776 | "source": [ 777 | "Make a forecast." 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": null, 783 | "id": "450a6090-e4ae-424f-91d9-840044ca4c16", 784 | "metadata": {}, 785 | "outputs": [], 786 | "source": [] 787 | }, 788 | { 789 | "cell_type": "markdown", 790 | "id": "dd7553db-6b01-48ad-aa43-58506fd4babb", 791 | "metadata": {}, 792 | "source": [ 793 | "Plot a random subset of the time series and the forecast."
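Putting the remaining set-up cells together: future exogenous values, window features, the weight function, and the forecaster itself, followed by fitting and forecasting. This is a sketch that reuses the names from the previous snippet and assumes a quarterly-start (`QS`) index:

```python
import numpy as np
import pandas as pd

from skforecast.preprocessing import RollingFeatures

# Future values of the exogenous feature: one row per forecast step,
# starting one quarter after the last observed date.
future_index = pd.date_range(
    start=data.index[-1] + pd.DateOffset(months=3), periods=steps, freq="QS"
)
exog_future = pd.DataFrame({"quarter": future_index.quarter}, index=future_index)

# Rolling means and standard deviations over 4 and 8 quarters (1 and 2 years).
window_features = RollingFeatures(
    stats=["mean", "std", "mean", "std"], window_sizes=[4, 4, 8, 8]
)

# Weights grow linearly from 0 (oldest date) to 1 (most recent date);
# nothing is hard-coded, so it adapts to any training window.
def custom_weights(index):
    return np.linspace(0, 1, len(index))

forecaster = ForecasterRecursiveMultiSeries(
    regressor=LGBMRegressor(random_state=42, verbose=-1),
    lags=8,
    window_features=window_features,
    transformer_series=StandardScaler(),
    weight_func=custom_weights,
)
forecaster.fit(series=data[target_cols], exog=data[exog_cols])
predictions = forecaster.predict(steps=steps, exog=exog_future)
```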
794 | ] 795 | }, 796 | { 797 | "cell_type": "code", 798 | "execution_count": null, 799 | "id": "067f8c6e-b570-4cb7-9be2-0fd35e8c2937", 800 | "metadata": {}, 801 | "outputs": [], 802 | "source": [] 803 | }, 804 | { 805 | "cell_type": "markdown", 806 | "id": "9861f6b8-a2ec-4f20-8197-ac7ae47973d5", 807 | "metadata": {}, 808 | "source": [ 809 | "# Feature importance" 810 | ] 811 | }, 812 | { 813 | "cell_type": "markdown", 814 | "id": "52146b7c-52a9-4f0f-a353-0c5965302fd7", 815 | "metadata": {}, 816 | "source": [ 817 | "Plot the 10 most important features." 818 | ] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "execution_count": null, 823 | "id": "4bac5303-30aa-4263-bda2-7eab421c3412", 824 | "metadata": {}, 825 | "outputs": [], 826 | "source": [] 827 | } 828 | ], 829 | "metadata": { 830 | "kernelspec": { 831 | "display_name": "Python 3 (ipykernel)", 832 | "language": "python", 833 | "name": "python3" 834 | }, 835 | "language_info": { 836 | "codemirror_mode": { 837 | "name": "ipython", 838 | "version": 3 839 | }, 840 | "file_extension": ".py", 841 | "mimetype": "text/x-python", 842 | "name": "python", 843 | "nbconvert_exporter": "python", 844 | "pygments_lexer": "ipython3", 845 | "version": "3.10.2" 846 | } 847 | }, 848 | "nbformat": 4, 849 | "nbformat_minor": 5 850 | } 851 | -------------------------------------------------------------------------------- /03-multiseries-forecasting/exercise-2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "6b76acb4-90e2-4b95-b211-65607b67dc78", 6 | "metadata": {}, 7 | "source": [ 8 | "# Exercise 2: Multiple dependent time series" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "8f5b8ff1-e3f7-47af-b603-15d89d0c6fe5", 14 | "metadata": {}, 15 | "source": [ 16 | "[Forecasting with Machine Learning](https://www.trainindata.com/p/forecasting-with-machine-learning)\n", 17 | "\n", 18 | "In this notebook we set out an exercise to do multiple dependent time series forecasting. The solutions we show are only one way of answering these questions." 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "id": "91a8c715-0369-4928-907d-9492db4f37a1", 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "import numpy as np\n", 29 | "import pandas as pd\n", 30 | "import matplotlib.pyplot as plt" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "f6e9fcef-4dde-4107-bce6-216516626f86", 36 | "metadata": {}, 37 | "source": [ 38 | "# Data preparation" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "id": "d91ac0bd-372a-4ef0-9c0e-a0e0cf12560d", 44 | "metadata": {}, 45 | "source": [ 46 | "The dataset we shall use contains the quarterly overnight trips (in thousands) from 1998 Q1 to 2016 Q4 across\n", 47 | "Australia. The number of trips is split by `State`, `Region`, and `Purpose`. \n", 48 | "\n", 49 | "**In this exercise we are going to forecast the total number of trips for each State (there are 8 states therefore we will have 8 time series). 
We shall treat this as a multivariate forecasting problem.**\n", 50 | "\n", 51 | "Source: A new tidy data structure to support\n", 52 | "exploration and modeling of temporal data, Journal of Computational and\n", 53 | "Graphical Statistics, 29:3, 466-478, doi:10.1080/10618600.2019.1695624.\n", 54 | "Shape of the dataset: (24320, 5)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "id": "76ef5949-cbf1-4109-887d-104f71574c2a", 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "from skforecast.datasets import fetch_dataset\n", 65 | "\n", 66 | "# Load the data\n", 67 | "data = fetch_dataset(name=\"australia_tourism\", raw=True)\n", 68 | "data.head()" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "id": "48a62223-83ab-4a7e-af1b-4fe91e1b31c7", 74 | "metadata": {}, 75 | "source": [ 76 | "Pre-process the data by performing the following:\n", 77 | "1) Convert the `date_time` column to datetime type\n", 78 | "2) Create a dataframe with one column per `State` which gives the total number of Trips for each date.\n", 79 | "3) Ensure the index is `date_time` and resampled to quarterly start `QS`\n" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "id": "6506a6d4-e680-4c12-9dd1-7e037c91a2ef", 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "id": "589f58e0-6106-430c-ab2b-557ea84bb2a5", 93 | "metadata": {}, 94 | "source": [ 95 | "Check for missing values." 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "id": "92bd6aaf-caac-4bb8-96a2-c0d59220f6e8", 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "id": "44fb99c2-65b9-4665-9c55-528af407baee", 109 | "metadata": {}, 110 | "source": [ 111 | "Assign the name of each state to a variable `states`. We will use this later." 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "id": "57f6888c-a06c-4cd8-a358-039860bc5efd", 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "id": "eb03773c-a9bf-4fa8-84a4-099af6b87827", 125 | "metadata": {}, 126 | "source": [ 127 | "# Exploratory data analysis" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "id": "7d1ba5b6-b49d-4bab-882e-915f075b9493", 133 | "metadata": {}, 134 | "source": [ 135 | "Print the number of data points in the time series, the start time, and the end time of the time series." 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "id": "36af3ab9-4267-471d-800b-dc9a3dcdb43e", 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "id": "3ffa8a95-d0d5-48e4-ba5f-b92a4f54c0ef", 149 | "metadata": {}, 150 | "source": [ 151 | "Plot the time series summed over all states." 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "id": "f1629484-508f-419f-8647-1ef6626eab26", 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "id": "16971015-96a9-480e-9f9e-f21637d68e87", 165 | "metadata": {}, 166 | "source": [ 167 | "Plot all of the time series."
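A sketch of one possible pre-processing pipeline plus the final plot, assuming the raw columns are named `date_time`, `State`, and `Trips` as in the dataset description:

```python
import pandas as pd
import matplotlib.pyplot as plt

data["date_time"] = pd.to_datetime(data["date_time"])

# One column per State holding the total trips per date.
data = pd.pivot_table(
    data=data, values="Trips", index="date_time", columns="State", aggfunc="sum"
)
data.columns.name = None
data = data.asfreq("QS").sort_index()  # quarterly-start frequency

# One panel per state.
data.plot(subplots=True, sharex=True, figsize=(8, 12))
plt.tight_layout()
```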
168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "id": "d7193016-417f-4bec-87e0-dcce787ee7fe", 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "id": "143c6eab-8a90-4b45-9f63-db450d91c296", 181 | "metadata": {}, 182 | "source": [ 183 | "It appears that there is yearly seasonality for these series and they appear to be anti-correlated (i.e., some areas experience peaks whilst others experience troughs)." 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "id": "af0befd7-d612-4bfe-8cb6-fc2bfd2bc9f1", 189 | "metadata": {}, 190 | "source": [ 191 | "Create a quarter of the year feature which could help with the yearly seasonality." 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "id": "6c2b84e0-07a4-46ef-ae53-6327625a7284", 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "id": "bed95ade-4109-4b31-847b-28f5c9807692", 205 | "metadata": {}, 206 | "source": [ 207 | "# Forecasting" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "id": "a40647fe-99b4-474e-8a03-b7bf4f9eed29", 213 | "metadata": {}, 214 | "source": [ 215 | "Import the class needed for recursive forecasting for multiple dependent time series from `skforecast`." 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "id": "613011ce-c45c-40b5-b46a-e6a8037d4f08", 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "id": "d692c732-e699-46cb-bab6-cd148e9af1f1", 229 | "metadata": {}, 230 | "source": [ 231 | "Import a transformer from sklearn to scale the data." 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "id": "cadffd44-e241-4677-b640-9c8b9db2784f", 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "id": "eaf0b4fb-9548-4da1-b356-e9ec377d68f9", 245 | "metadata": {}, 246 | "source": [ 247 | "Import a model of your choice." 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "id": "0495bdc3-5e22-483a-b0ab-2b3e0402fbee", 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "id": "66a304ca-77a4-48f6-99ef-f24d30417daa", 261 | "metadata": {}, 262 | "source": [ 263 | "Assign the names of the states to a `target_cols` variable and any exogenous features to an `exog_cols` variable." 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "id": "be5cfa65-d69b-427c-bccc-0021fa04ef02", 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "id": "9a24e307-22ef-44f1-8f11-b7746a70d3bb", 277 | "metadata": {}, 278 | "source": [ 279 | "Specify a forecast horizon and assign it to a variable `steps`. Try forecasting 8 quarters into the future." 
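A sketch for the import and configuration cells above, again assuming skforecast >= 0.14 and the `states` list created earlier; `Ridge` is just one reasonable model choice:

```python
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Direct forecaster for dependent (multivariate) series in skforecast >= 0.14.
from skforecast.direct import ForecasterDirectMultiVariate

target_cols = list(states)  # the 8 state columns
exog_cols = ["quarter"]

# Forecast 8 quarters (2 years) into the future.
steps = 8
```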
280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "id": "2b37700d-9bfa-4c60-926a-b48a9e32c7ce", 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "id": "4109bb1b-b6a1-48f1-99ca-caca04c01a63", 293 | "metadata": {}, 294 | "source": [ 295 | "Create a dataframe for the future values of any exogenous features.\n", 296 | "\n", 297 | "Hint: `pd.DateOffset` and using `freq=QS` in `pd.date_range` might be helpful " 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "id": "c96dd685-09d8-42d0-949b-652acf45e56a", 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "id": "52146b7c-52a9-4f0f-a353-0c5965302fd7", 311 | "metadata": {}, 312 | "source": [ 313 | "Forecast over each state using a for loop. Define a `ForecasterDirectMultiVariate` forecaster and experiment with the number of lags to use as a feature." 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "id": "59447171-bbdc-4a0a-82d1-fa809a9db068", 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "id": "95652d3b-4073-4080-b2ac-567739b08d3e", 327 | "metadata": {}, 328 | "source": [ 329 | "Plot the forecasts." 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "id": "9a448c27-2516-47c5-8598-f4430d9540ce", 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [] 339 | } 340 | ], 341 | "metadata": { 342 | "kernelspec": { 343 | "display_name": "Python 3 (ipykernel)", 344 | "language": "python", 345 | "name": "python3" 346 | }, 347 | "language_info": { 348 | "codemirror_mode": { 349 | "name": "ipython", 350 | "version": 3 351 | }, 352 | "file_extension": ".py", 353 | "mimetype": "text/x-python", 354 | "name": "python", 355 | "nbconvert_exporter": "python", 356 | "pygments_lexer": "ipython3", 357 | "version": "3.10.2" 358 | } 359 | }, 360 | "nbformat": 4, 361 | "nbformat_minor": 5 362 | } 363 | -------------------------------------------------------------------------------- /03-multiseries-forecasting/images/forecaster_multi_series_sample_weight.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/03-multiseries-forecasting/images/forecaster_multi_series_sample_weight.png -------------------------------------------------------------------------------- /03-multiseries-forecasting/images/forecaster_multivariate_train_matrix_diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/03-multiseries-forecasting/images/forecaster_multivariate_train_matrix_diagram.png -------------------------------------------------------------------------------- /04-backtesting/exercise-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "8f5b8ff1-e3f7-47af-b603-15d89d0c6fe5", 6 | "metadata": {}, 7 | "source": [ 8 | "In this notebook we set out an exercise to do backtesting to compare different models for a single time series. The solutions we show are only one way of answering these questions." 
9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "f6e9fcef-4dde-4107-bce6-216516626f86", 14 | "metadata": {}, 15 | "source": [ 16 | "# Data preparation" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "d91ac0bd-372a-4ef0-9c0e-a0e0cf12560d", 22 | "metadata": {}, 23 | "source": [ 24 | "The dataset we shall use is the daily web visits to cienciadedatos.net." 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "id": "76ef5949-cbf1-4109-887d-104f71574c2a", 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "from skforecast.datasets import fetch_dataset\n", 35 | "\n", 36 | "# Load the data\n", 37 | "data = fetch_dataset(name=\"website_visits\", raw=True)\n", 38 | "data.head()" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "id": "48a62223-83ab-4a7e-af1b-4fe91e1b31c7", 44 | "metadata": {}, 45 | "source": [ 46 | "Pre-process the data by performing the following:\n", 47 | "1) Convert the `date` column to datetime type\n", 48 | "2) Set the date as an index\n", 49 | "3) Set the frequency to daily\n", 50 | "4) Sort the time series by date\n", 51 | "\n", 52 | "Hint: Try the format `%d/%m/%y`" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "id": "47d63a91-4966-4f0c-a081-2c5821298a28", 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "id": "589f58e0-6106-430c-ab2b-557ea84bb2a5", 66 | "metadata": {}, 67 | "source": [ 68 | "Check for missing values." 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "id": "535fd22d-3e40-4796-aca8-05a029d36129", 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "id": "eb03773c-a9bf-4fa8-84a4-099af6b87827", 82 | "metadata": {}, 83 | "source": [ 84 | "# Exploratory data analysis" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "id": "7d1ba5b6-b49d-4bab-882e-915f075b9493", 90 | "metadata": {}, 91 | "source": [ 92 | "Print the number of data points in the time series, the start time, and the end time of the time series." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "id": "271cdc42-587b-4ee7-a433-71d8103ced5f", 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "id": "3ffa8a95-d0d5-48e4-ba5f-b92a4f54c0ef", 106 | "metadata": {}, 107 | "source": [ 108 | "Plot the time series." 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "id": "f1629484-508f-419f-8647-1ef6626eab26", 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "id": "af0befd7-d612-4bfe-8cb6-fc2bfd2bc9f1", 122 | "metadata": {}, 123 | "source": [ 124 | "Check if there is any weekly seasonality.\n", 125 | "\n", 126 | "Note: This could be done in a variety of ways with different degrees of complexity." 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "id": "6c2b84e0-07a4-46ef-ae53-6327625a7284", 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "id": "3fcb0da2-348b-4e8e-8aae-72eeb7db1b4c", 140 | "metadata": {}, 141 | "source": [ 142 | "Create one or more features that can help capture weekly seasonality from the datetime index."
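One lightweight way to check for weekly seasonality and derive a feature from it; a sketch that only assumes `data` has a daily `DatetimeIndex`:

```python
import matplotlib.pyplot as plt

# Mean visits per day of week: a pronounced pattern suggests weekly seasonality.
data.groupby(data.index.dayofweek).mean().plot(kind="bar")
plt.xlabel("day of week (0 = Monday)")
plt.show()

# Day of week as a feature for the models.
data["day_of_week"] = data.index.dayofweek
```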
143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "id": "1ab5c17b-cf5a-40e9-9c6a-61052d17f693", 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "id": "bed95ade-4109-4b31-847b-28f5c9807692", 156 | "metadata": {}, 157 | "source": [ 158 | "# Model definition" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "id": "a40647fe-99b4-474e-8a03-b7bf4f9eed29", 164 | "metadata": {}, 165 | "source": [ 166 | "Import the class needed for recursive forecasting with a single time series from `skforecast`." 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "id": "613011ce-c45c-40b5-b46a-e6a8037d4f08", 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "id": "d692c732-e699-46cb-bab6-cd148e9af1f1", 180 | "metadata": {}, 181 | "source": [ 182 | "Import a transformer from sklearn to scale the data." 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "id": "cadffd44-e241-4677-b640-9c8b9db2784f", 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "id": "eaf0b4fb-9548-4da1-b356-e9ec377d68f9", 196 | "metadata": {}, 197 | "source": [ 198 | "Import a linear model and LightGBM" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "id": "0495bdc3-5e22-483a-b0ab-2b3e0402fbee", 205 | "metadata": {}, 206 | "outputs": [], 207 | "source": [] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "id": "66a304ca-77a4-48f6-99ef-f24d30417daa", 212 | "metadata": {}, 213 | "source": [ 214 | "Define one forecaster for the linear model and one forecaster for the LightGBM model using `skforecast`. Use some recent lags and a lag of 7 to capture weekly seasonality." 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "id": "59447171-bbdc-4a0a-82d1-fa809a9db068", 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "id": "8b8d3162-9769-4b24-896a-f10a4914dabf", 228 | "metadata": {}, 229 | "source": [ 230 | "# Backtesting" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "id": "115435b1-725c-4dc1-9df1-782b19d1fc55", 236 | "metadata": {}, 237 | "source": [ 238 | "Import the relevant backtesting objects from `skforecast`." 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "id": "bd952f40-4636-4fa7-ba9a-25eeb0d797c8", 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "id": "1dd03f71-1e45-427d-85fe-9830dfbf686d", 252 | "metadata": {}, 253 | "source": [ 254 | "Use `backtesting_forecaster` to do backtesting with:\n", 255 | "- Intermittent refitting where the model is refit every 7 steps\n", 256 | "- An expanding training window\n", 257 | "- Using both the `mean_squared_error` and the `mean_absolute_percentage_error`\n", 258 | "- Use approximately the first 30 days for training\n", 259 | "- Use exogenous features\n", 260 | "\n", 261 | "Do this twice, once for the linear forecaster and once for the lightgbm forecaster." 
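A sketch of the backtesting call under skforecast >= 0.14, where folds are configured through `TimeSeriesFold`. One reading of "refit every 7 steps" is one-step-ahead folds with an intermittent refit every 7 folds; the forecaster name and the `visits` column are placeholders for whatever you defined above:

```python
from skforecast.model_selection import TimeSeriesFold, backtesting_forecaster

cv = TimeSeriesFold(
    steps=1,                 # one-step-ahead folds
    initial_train_size=30,   # roughly the first 30 days
    refit=7,                 # intermittent refitting: refit every 7 folds
    fixed_train_size=False,  # expanding training window
)

metrics_lgbm, preds_lgbm = backtesting_forecaster(
    forecaster=forecaster_lgbm,  # repeat with the linear forecaster
    y=data["visits"],            # hypothetical column name - check data.head()
    exog=data[["day_of_week"]],
    cv=cv,
    metric=["mean_squared_error", "mean_absolute_percentage_error"],
)
metrics_lgbm
```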
262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": null, 267 | "id": "2b6a89aa-38bd-4a59-b603-bfb3f98b4c19", 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "id": "4b66897e-b76a-42b3-8e37-59b72fd88a7a", 276 | "metadata": {}, 277 | "outputs": [], 278 | "source": [] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "id": "0a1678ed-b2a0-422d-9816-0a7ef2ec5a34", 283 | "metadata": {}, 284 | "source": [ 285 | "Compare the metrics for the linear model and lightgbm model. Which model did better?" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "id": "5344aa17-275c-4fdb-9338-aa2673d90b20", 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "id": "eb967325-4484-477c-a1f6-b2151079e55a", 299 | "metadata": {}, 300 | "source": [ 301 | "Plot the predictions made during backtesting alongside the actuals to get a better understanding of the errors." 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": null, 307 | "id": "e084d1db-ff9a-43cb-be4e-e0794486d80b", 308 | "metadata": {}, 309 | "outputs": [], 310 | "source": [] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "id": "4d0cdd6a-fde5-4f0d-ba7f-852457701063", 315 | "metadata": {}, 316 | "source": [ 317 | "Why do you think one model is doing better than the other? What might help one of the models to improve?" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "id": "b14949a5-0628-4041-aca0-ed033d967bc0", 323 | "metadata": {}, 324 | "source": [] 325 | } 326 | ], 327 | "metadata": { 328 | "kernelspec": { 329 | "display_name": "Python 3 (ipykernel)", 330 | "language": "python", 331 | "name": "python3" 332 | }, 333 | "language_info": { 334 | "codemirror_mode": { 335 | "name": "ipython", 336 | "version": 3 337 | }, 338 | "file_extension": ".py", 339 | "mimetype": "text/x-python", 340 | "name": "python", 341 | "nbconvert_exporter": "python", 342 | "pygments_lexer": "ipython3", 343 | "version": "3.10.2" 344 | } 345 | }, 346 | "nbformat": 4, 347 | "nbformat_minor": 5 348 | } 349 | -------------------------------------------------------------------------------- /04-backtesting/exercise-2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "0c0af68d-793e-40ce-b41a-85131ab6aeb3", 6 | "metadata": {}, 7 | "source": [ 8 | "# Exercise 2: Backtesting with multiple time series" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "8f5b8ff1-e3f7-47af-b603-15d89d0c6fe5", 14 | "metadata": {}, 15 | "source": [ 16 | "In this notebook we set out an exercise to do backtesting to compare different models for multiple time series. The solutions we show are only one way of answering these questions." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "91a8c715-0369-4928-907d-9492db4f37a1", 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import numpy as np\n", 27 | "import pandas as pd\n", 28 | "import matplotlib.pyplot as plt" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "f6e9fcef-4dde-4107-bce6-216516626f86", 34 | "metadata": {}, 35 | "source": [ 36 | "# Data preparation" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "d91ac0bd-372a-4ef0-9c0e-a0e0cf12560d", 42 | "metadata": {}, 43 | "source": [ 44 | "The dataset we shall use is the Quarterly overnight trips (in thousands) from 1998 Q1 to 2016 Q4 across\n", 45 | "Australia. The number of trips is split by `State`, `Region`, and `Purpose`. \n", 46 | "\n", 47 | "**In this exercise we are going to forecast the total number of trips for each Region (there are 76 regions therefore we will have 76 time series).**\n", 48 | "\n", 49 | "Source: A new tidy data structure to support\n", 50 | "exploration and modeling of temporal data, Journal of Computational and\n", 51 | "Graphical Statistics, 29:3, 466-478, doi:10.1080/10618600.2019.1695624.\n", 52 | "Shape of the dataset: (24320, 5)" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "id": "76ef5949-cbf1-4109-887d-104f71574c2a", 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "from skforecast.datasets import fetch_dataset\n", 63 | "\n", 64 | "# Load the data\n", 65 | "data = fetch_dataset(name=\"australia_tourism\", raw=True)\n", 66 | "data.head()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "48a62223-83ab-4a7e-af1b-4fe91e1b31c7", 72 | "metadata": {}, 73 | "source": [ 74 | "Pre-process the data by performing the following:\n", 75 | "1) Convert the `date_time` column to datetime type\n", 76 | "2) Create a dataframe with one column per `Region` which gives the total number of Trips for each date.\n", 77 | "3) Ensure the index is `date_time` and resampled to quarterly start `QS`\n" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "id": "6506a6d4-e680-4c12-9dd1-7e037c91a2ef", 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "id": "446f0153-9869-4f25-a84e-d040b9078635", 91 | "metadata": {}, 92 | "source": [ 93 | "Later we will use LightGBM. It does not support special JSON characters (e.g., `'`) in the column name. Let's remove these characters from the column names." 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "id": "e78e1110-9a99-4ff2-af9f-d78fd28bcc9e", 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "import re\n", 104 | "\n", 105 | "data = data.rename(columns=lambda x: re.sub(\"[^A-Za-z0-9_]+\", \"\", x))" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "id": "589f58e0-6106-430c-ab2b-557ea84bb2a5", 111 | "metadata": {}, 112 | "source": [ 113 | "Check for missing values." 
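A quick check, assuming the pivoted dataframe is called `data`:

```python
# Missing values per region, plus a single yes/no flag.
print(data.isna().sum().sort_values(ascending=False).head())
print("Any missing values:", data.isna().values.any())
```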
114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "id": "92bd6aaf-caac-4bb8-96a2-c0d59220f6e8", 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "id": "eb03773c-a9bf-4fa8-84a4-099af6b87827", 127 | "metadata": {}, 128 | "source": [ 129 | "# Exploratory data analysis" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "id": "7d1ba5b6-b49d-4bab-882e-915f075b9493", 135 | "metadata": {}, 136 | "source": [ 137 | "Print the number of data points in the time series, the start time, and the end time of the time series." 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "id": "36af3ab9-4267-471d-800b-dc9a3dcdb43e", 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "id": "3ffa8a95-d0d5-48e4-ba5f-b92a4f54c0ef", 151 | "metadata": {}, 152 | "source": [ 153 | "Plot the time series summed over all regions." 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "id": "f1629484-508f-419f-8647-1ef6626eab26", 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "id": "bdff7d75-4ca8-4b40-aebb-1ef24318b6e0", 167 | "metadata": {}, 168 | "source": [ 169 | "Plot a subsample of the time series from different regions." 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "id": "652caf88-93a8-4b82-abbb-0d07ccbdbdb7", 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "id": "f4befcc4-bf9a-4adf-bb81-8c4c136007a1", 183 | "metadata": {}, 184 | "source": [ 185 | "Create a quarter-of-the-year feature which could help capture the yearly seasonality." 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "id": "1ab5c17b-cf5a-40e9-9c6a-61052d17f693", 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "id": "bed95ade-4109-4b31-847b-28f5c9807692", 199 | "metadata": {}, 200 | "source": [ 201 | "# Model definition" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "id": "a40647fe-99b4-474e-8a03-b7bf4f9eed29", 207 | "metadata": {}, 208 | "source": [ 209 | "Import the classes needed for recursive forecasting for multiple time series and window features from `skforecast`." 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "id": "613011ce-c45c-40b5-b46a-e6a8037d4f08", 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "id": "d692c732-e699-46cb-bab6-cd148e9af1f1", 223 | "metadata": {}, 224 | "source": [ 225 | "Import a transformer from sklearn to scale the data."
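A sketch of the two import cells above (skforecast >= 0.14):

```python
# Recursive multi-series forecaster and rolling window features.
from skforecast.recursive import ForecasterRecursiveMultiSeries
from skforecast.preprocessing import RollingFeatures

# Scaler applied per series via the forecaster's `transformer_series` argument.
from sklearn.preprocessing import StandardScaler
```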
226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "id": "cadffd44-e241-4677-b640-9c8b9db2784f", 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "id": "eaf0b4fb-9548-4da1-b356-e9ec377d68f9", 239 | "metadata": {}, 240 | "source": [ 241 | "Import a linear model and LightGBM" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "id": "0495bdc3-5e22-483a-b0ab-2b3e0402fbee", 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "id": "dca4af0c-af99-4ced-b13d-5528ed7f9866", 255 | "metadata": {}, 256 | "source": [ 257 | "Define a function that extracts features from the target variable:\n", 258 | "- A rolling mean with window 4\n", 259 | "- A rolling standard deviation with window 4\n", 260 | "- A rolling mean with window 12\n", 261 | "- A rolling standard deviation with window 12\n", 262 | "\n", 263 | "Hint: https://skforecast.org/0.11.0/user_guides/custom-predictors.html" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "id": "7eb270e3-12ea-4cb8-82d2-09bf27a81afc", 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "id": "66a304ca-77a4-48f6-99ef-f24d30417daa", 277 | "metadata": {}, 278 | "source": [ 279 | "Define one forecaster for the linear model and one forecaster for the LightGBM model using `skforecast`. Pass the function in the previous cell as an argument. Use all lags of up to 8." 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "id": "59447171-bbdc-4a0a-82d1-fa809a9db068", 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "id": "8b8d3162-9769-4b24-896a-f10a4914dabf", 293 | "metadata": {}, 294 | "source": [ 295 | "# Backtesting" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "id": "115435b1-725c-4dc1-9df1-782b19d1fc55", 301 | "metadata": {}, 302 | "source": [ 303 | "Import the relevant backtesting objects from `skforecast`." 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "id": "bd952f40-4636-4fa7-ba9a-25eeb0d797c8", 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "id": "4841d21c-fc14-4aba-867b-53ae9def3c36", 317 | "metadata": {}, 318 | "source": [ 319 | "Assign the name of the target variables and exogenous variables to variables called `target_cols` and `exog_cols`." 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "id": "5dd68852-0c87-4b34-adc6-d11c4050a5ae", 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "id": "1dd03f71-1e45-427d-85fe-9830dfbf686d", 333 | "metadata": {}, 334 | "source": [ 335 | "Use the backtesting function to do backtesting with:\n", 336 | "- Refitting at every step\n", 337 | "- An expanding training window\n", 338 | "- Use the `mean_absolute_error` as the error metric\n", 339 | "- Use the first 13 datapoints (just over 3 years) as initial training size\n", 340 | "- Use exogenous features\n", 341 | "- Set the forecast horizon to be 4 steps (i.e., 1 year)\n", 342 | "\n", 343 | "Do this twice, once for the linear forecaster and once for the lightgbm forecaster." 
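The custom-predictors hint above targets the skforecast 0.11 API; with skforecast >= 0.14 (as pinned in `requirements.txt`) the same rolling features can be declared with `RollingFeatures`. A sketch, reusing the imports from the previous snippet; the variable names are suggestions:

```python
from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge

from skforecast.model_selection import TimeSeriesFold, backtesting_forecaster_multiseries

target_cols = [col for col in data.columns if col != "quarter"]
exog_cols = ["quarter"]

# Rolling mean and standard deviation over windows of 4 and 12 quarters.
window_features = RollingFeatures(
    stats=["mean", "std", "mean", "std"], window_sizes=[4, 4, 12, 12]
)

forecaster_linear = ForecasterRecursiveMultiSeries(
    regressor=Ridge(),
    lags=8,
    window_features=window_features,
    transformer_series=StandardScaler(),
)
forecaster_lgbm = ForecasterRecursiveMultiSeries(
    regressor=LGBMRegressor(random_state=42, verbose=-1),
    lags=8,
    window_features=window_features,
    transformer_series=StandardScaler(),
)

# Expanding window, refit at every fold, 13 quarters of initial training data,
# 4-step (1 year) forecast horizon.
cv = TimeSeriesFold(steps=4, initial_train_size=13, refit=True, fixed_train_size=False)

metrics_linear, preds_linear = backtesting_forecaster_multiseries(
    forecaster=forecaster_linear,  # repeat with forecaster_lgbm
    series=data[target_cols],
    exog=data[exog_cols],
    cv=cv,
    levels=target_cols,
    metric="mean_absolute_error",
)
```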
344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "id": "2b6a89aa-38bd-4a59-b603-bfb3f98b4c19", 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "id": "4b66897e-b76a-42b3-8e37-59b72fd88a7a", 358 | "metadata": {}, 359 | "outputs": [], 360 | "source": [] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "id": "0a1678ed-b2a0-422d-9816-0a7ef2ec5a34", 365 | "metadata": {}, 366 | "source": [ 367 | "Merge the two metrics dataframes into a single dataframe to make it easier to compare the errors between the linear model and the LightGBM model." 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "id": "46c73196-40d2-44d4-a89d-8cb677a6875a", 374 | "metadata": {}, 375 | "outputs": [], 376 | "source": [] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "id": "32ef4541-c15e-4a51-92f8-5c693c5f5074", 381 | "metadata": {}, 382 | "source": [ 383 | "How often did one model outperform the other model? Remember, the lower the error the better." 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": null, 389 | "id": "02780ced-856e-4eaf-bea1-fe3f6e6cb95f", 390 | "metadata": {}, 391 | "outputs": [], 392 | "source": [] 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "id": "4cde377b-d4f5-4a50-b358-566a51ebdfdc", 397 | "metadata": {}, 398 | "source": [ 399 | "It looks like Ridge was better for most time series here. Once again, for the LightGBM model to outperform, we might need additional feature engineering, hyperparameter tuning, handling of trends, and more data. There are no features that help group similar time series together, and these types of features are normally well utilised by gradient boosted trees. So in this scenario, a simple linear model works better." 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "id": "40f80613-124c-4591-b198-ca9819eae2c7", 405 | "metadata": {}, 406 | "source": [ 407 | "Compute the mean absolute error over all the predictions for both the linear model and the LightGBM model. Compare them." 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": null, 413 | "id": "5344aa17-275c-4fdb-9338-aa2673d90b20", 414 | "metadata": {}, 415 | "outputs": [], 416 | "source": [] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "id": "eb967325-4484-477c-a1f6-b2151079e55a", 421 | "metadata": {}, 422 | "source": [ 423 | "Plot the predictions made during backtesting alongside the actuals for a random subset of time series to get a better understanding of the errors." 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": null, 429 | "id": "e084d1db-ff9a-43cb-be4e-e0794486d80b", 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "id": "aeac094e-47d4-4d19-9c99-3ea50ed397ad", 437 | "metadata": {}, 438 | "source": [ 439 | "Plot the predictions made during backtesting alongside the actuals for the **highest error** subset of time series to get a better understanding of the errors."
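A plotting sketch, assuming the metric dataframe has the standard `levels` / `mean_absolute_error` columns returned by `backtesting_forecaster_multiseries`, and that `preds_linear` holds the backtest predictions (names carried over from the previous snippet):

```python
import matplotlib.pyplot as plt

# The four series with the largest backtest MAE for the linear model.
worst = (
    metrics_linear.set_index("levels")["mean_absolute_error"]
    .loc[lambda s: s.index.isin(target_cols)]  # drop aggregated rows, if present
    .sort_values(ascending=False)
    .head(4)
    .index
)

fig, axes = plt.subplots(len(worst), 1, figsize=(8, 10), sharex=True)
for ax, col in zip(axes, worst):
    data[col].plot(ax=ax, label="actuals")
    preds_linear[col].plot(ax=ax, label="backtest predictions")
    ax.set_title(col)
    ax.legend()
fig.tight_layout()
```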
440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": null, 445 | "id": "7dd01596-68e2-4484-a42b-2cfe5eb9103e", 446 | "metadata": {}, 447 | "outputs": [], 448 | "source": [] 449 | } 450 | ], 451 | "metadata": { 452 | "kernelspec": { 453 | "display_name": "Python 3 (ipykernel)", 454 | "language": "python", 455 | "name": "python3" 456 | }, 457 | "language_info": { 458 | "codemirror_mode": { 459 | "name": "ipython", 460 | "version": 3 461 | }, 462 | "file_extension": ".py", 463 | "mimetype": "text/x-python", 464 | "name": "python", 465 | "nbconvert_exporter": "python", 466 | "pygments_lexer": "ipython3", 467 | "version": "3.10.2" 468 | } 469 | }, 470 | "nbformat": 4, 471 | "nbformat_minor": 5 472 | } 473 | -------------------------------------------------------------------------------- /04-backtesting/images/backtesting_intermittent_refit.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/04-backtesting/images/backtesting_intermittent_refit.gif -------------------------------------------------------------------------------- /04-backtesting/images/backtesting_no_refit.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/04-backtesting/images/backtesting_no_refit.gif -------------------------------------------------------------------------------- /04-backtesting/images/backtesting_refit.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/04-backtesting/images/backtesting_refit.gif -------------------------------------------------------------------------------- /04-backtesting/images/backtesting_refit_fixed_train_size.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/04-backtesting/images/backtesting_refit_fixed_train_size.gif -------------------------------------------------------------------------------- /04-backtesting/images/backtesting_refit_gap.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/04-backtesting/images/backtesting_refit_gap.gif -------------------------------------------------------------------------------- /05-error-metrics/exercise-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Exercise 1: Error metrics" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In this notebook we set out exercises to implement error metrics and use them in skforecast. The solutions we show are only one way of answering these questions." 
15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import numpy as np\n", 24 | "import pandas as pd\n", 25 | "import matplotlib.pyplot as plt" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "# Data loading and preparation" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "Data has been obtained from [Store Item Demand Forecasting Challenge](https://www.kaggle.com/competitions/demand-forecasting-kernels-only/data), specifically `train.csv`. This dataset contains 913,000 sales transactions from 2013–01–01 to 2017–12–31 for 50 products (SKU) in 10 stores. The goal is to predict the next 7 days sales for 50 different items in one store using the available 5 years history." 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "from skforecast.datasets import fetch_dataset\n", 49 | "\n", 50 | "# Load the data\n", 51 | "data = fetch_dataset(name=\"store_sales\", raw=True)\n", 52 | "data.head()" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "`ForecasterRecursiveMultiSeries` requires that each time series is a column in the dataframe and that the index is time-like (datetime or timestamp). \n", 60 | "\n", 61 | "So now we process the data to get dataframes in the required format." 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "# Data preprocessing\n", 71 | "\n", 72 | "selected_store = 2\n", 73 | "selected_items = data.item.unique() # All items\n", 74 | "# selected_items = [1, 2, 3, 4, 5] # Selection of items to reduce computation time\n", 75 | "\n", 76 | "# Filter data to specific stores and products\n", 77 | "mask = (data[\"store\"] == selected_store) & (data[\"item\"].isin(selected_items))\n", 78 | "data = data[mask].copy()\n", 79 | "\n", 80 | "# Convert `date` column to datetime\n", 81 | "data[\"date\"] = pd.to_datetime(data[\"date\"], format=\"%Y-%m-%d\")\n", 82 | "\n", 83 | "# Convert to one column per time series\n", 84 | "data = pd.pivot_table(data=data, values=\"sales\", index=\"date\", columns=\"item\")\n", 85 | "\n", 86 | "# Reset column names\n", 87 | "data.columns.name = None\n", 88 | "data.columns = [f\"item_{col}\" for col in data.columns]\n", 89 | "\n", 90 | "# Explicitly set the frequency of the data to daily.\n", 91 | "# This would introduce missing values for missing days.\n", 92 | "data = data.asfreq(\"1D\")\n", 93 | "\n", 94 | "# Sort by time\n", 95 | "data = data.sort_index()\n", 96 | "\n", 97 | "data.head(4)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "# Check if any missing values introduced\n", 107 | "data.isnull().sum().any()" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "# Plot a subset of the time series\n", 117 | "fig, ax = plt.subplots(figsize=(8, 5))\n", 118 | "data.iloc[:, :4].plot(\n", 119 | " legend=True,\n", 120 | " subplots=True,\n", 121 | " sharex=True,\n", 122 | " title=\"Sales of store 2\",\n", 123 | " ax=ax,\n", 124 | ")\n", 125 | "fig.tight_layout()" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 
132 | "Let's add the day of the week to use as an exogenous feature." 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "data[\"day_of_week\"] = data.index.weekday" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "# Analysis" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "We want to build an intuition about the characteristics (trend, outliers, seasonality, intermittency, etc.) and the range of values that the time series have.\n", 156 | "This helps when deciding which error metrics to use.\n", 157 | "\n", 158 | "Compute a set of summary statistics and/or plots to do this." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "Plot the 5 largest and 5 smallest time series." 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "Check for any zero values." 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "# Implementing custom error metrics" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "Error metrics that we can consider using are:\n", 208 | "- To measure the bias: \n", 209 | " - Average forecast bias: we compute the forecast bias, which is a scale independent measure, for each time series and then average over all time series.\n", 210 | "- To measure the error:\n", 211 | " - Normalised RMSE or normalised deviation: these are pooled error metrics, which cannot currently be calculated natively in skforecast.\n", 212 | " - Average RMSSE or MASE: we can compute the MASE or RMSSE, a scale independent metric, for each time series individually and then average the results to give our final metric. This cannot currently be calculated in skforecast as it requires the custom error function to have\n", 213 | " access to the training set. \n", 214 | " - Average WAPE: we can compute the WAPE, a scale independent metric, for each time series individually and then average the results to give our final metric." 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "Create a function that takes `y_true` and `y_pred` as an argument and ouputs the \n", 222 | "weighted mean absolute percentage error (WAPE) metric defined as:\n", 223 | "\n", 224 | "$ WAPE = \\frac{\\sum_t{|e_t|}}{\\sum_t{|y_t|}} $\n", 225 | "\n", 226 | "Note: The data does have trend, this would suggest we should not use WAPE. However,\n", 227 | "as we are only forecasting 7 days into the future the magnitude of the trend in\n", 228 | "the forecast horizon is negligible. 
" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": null, 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "Create a function that takes `y_true` and `y_pred` as an argument and ouputs the \n", 243 | "forecast bias metric defined as the mean error divided by the mean of the time series\n", 244 | "in the forecast horizon:\n", 245 | "\n", 246 | "$ FB = \\frac{1/H\\sum_t{e_t}}{1/H\\sum_t{y_t}} = \\frac{\\sum_t{e_t}}{\\sum_t{y_t}}$" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "# Using a custom error metric" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "Define a model to do recursive multistep forecasting for multiple independent time series." 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "Perform backtesting with refitting at every backtest step and compute the forecast bias\n", 282 | "and WAPE for each time series." 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "from skforecast.model_selection import backtesting_forecaster_multiseries" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "Average the forecast bias and WAPE over all items to get a single value for each\n", 306 | "metric. Does your model over or under forecast?" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": null, 312 | "metadata": {}, 313 | "outputs": [], 314 | "source": [] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "Plot a sample of the largest and smallest WAPE time series and their backtest\n", 321 | "predictions." 
322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": null, 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [] 330 | } 331 | ], 332 | "metadata": { 333 | "kernelspec": { 334 | "display_name": "fcst-ml", 335 | "language": "python", 336 | "name": "python3" 337 | }, 338 | "language_info": { 339 | "codemirror_mode": { 340 | "name": "ipython", 341 | "version": 3 342 | }, 343 | "file_extension": ".py", 344 | "mimetype": "text/x-python", 345 | "name": "python", 346 | "nbconvert_exporter": "python", 347 | "pygments_lexer": "ipython3", 348 | "version": "3.10.2" 349 | } 350 | }, 351 | "nbformat": 4, 352 | "nbformat_minor": 2 353 | } 354 | -------------------------------------------------------------------------------- /FWML-Logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trainindata/forecasting-with-machine-learning/2e45285cb67a1d1afed4eebd80150493bdd7c0fd/FWML-Logo.png -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2017-2023, Soledad Galli and Kishan Manani, Forecasting with Machine Learning course: 4 | https://www.trainindata.com/p/forecasting-with-machine-learning 5 | 6 | 7 | Redistribution and use in source and binary forms, with or without 8 | modification, are permitted provided that the following conditions are met: 9 | 10 | 1. Redistributions of source code must retain the above copyright notice, this 11 | list of conditions and the following disclaimer. 12 | 13 | 2. Redistributions in binary form must reproduce the above copyright notice, 14 | this list of conditions and the following disclaimer in the documentation 15 | and/or other materials provided with the distribution. 16 | 17 | 3. Neither the name of the copyright holder nor the names of its 18 | contributors may be used to endorse or promote products derived from 19 | this software without specific prior written permission. 20 | 21 | 22 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 23 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 24 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 25 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 26 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 27 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 28 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 29 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 30 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 31 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
32 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![PythonVersion](https://img.shields.io/badge/python-3.9%20|%203.10%20|%203.11%20|%203.12-success) 2 | [![License https://github.com/trainindata/forecasting-with-machine-learning/blob/master/LICENSE](https://img.shields.io/badge/license-BSD-success.svg)](https://github.com/trainindata/forecasting-with-machine-learning/blob/master/LICENSE) 3 | [![Sponsorship https://www.trainindata.com/](https://img.shields.io/badge/Powered%20By-TrainInData-orange.svg)](https://www.trainindata.com/) 4 | 5 | ## Forecasting with Machine Learning - Code Repository 6 | 7 | Code repository for the online course [Forecasting with Machine Learning](https://www.trainindata.com/p/forecasting-with-machine-learning) 8 | 9 | **Course Launch: March, 2024** 10 | 11 | Actively maintained. 12 | 13 | [![Forecasting with Machine Learning](./FWML-Logo.png)](https://www.trainindata.com/p/forecasting-with-machine-learning) 14 | 15 | ## Table of Contents 16 | 17 | 1. **Time series as regression** 18 | 1. Time series forecasting 19 | 2. Forecasting as regression 20 | 3. Tabularizing time series 21 | 4. Endogenous and exogenous variables 22 | 23 | 2. **Forecasting strategies** 24 | 1. Single step forecasting 25 | 2. Multiple step forecasting 26 | 3. Direct forecasting 27 | 4. Recursive forecasting 28 | 29 | 3. **Multi series forecasting** 30 | 1. Global vs local forecasting 31 | 2. Multiseries forecasting models 32 | 3. Independent time series forecasting 33 | 4. Dependent time series forecasting 34 | 35 | 4. **Backtesting** 36 | 1. Backtesting with or without refit 37 | 2. Backtesting with rolling or expanding windows 38 | 3. Backtesting with intermittent refit 39 | 40 | 41 | ## Links 42 | 43 | - [Online Course](https://www.trainindata.com/p/forecasting-with-machine-learning) 44 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | feature-engine>=1.8.2 2 | matplotlib>=3.9.2 3 | numpy>=2.1.2 4 | pandas>=2.2.3 5 | scikit-learn>=1.5.2 6 | seaborn>=0.13.2 7 | skforecast>=0.14.0 --------------------------------------------------------------------------------