├── README.md ├── automl_env.yml ├── automl_env_linux.yml ├── automl_env_mac.yml ├── automl_setup.cmd ├── automl_setup_linux.sh ├── automl_setup_mac.sh ├── check_conda_version.py └── forecasting-energy-demand ├── auto-ml-forecasting-energy-demand.ipynb ├── forecasting_helper.py └── metrics_helper.py /README.md: -------------------------------------------------------------------------------- 1 | # Table of Contents 2 | 1. [Automated ML Introduction](#introduction) 3 | 1. [Setup using Compute Instances](#jupyter) 4 | 1. [Setup using a Local Conda environment](#localconda) 5 | 1. [Setup using Azure Databricks](#databricks) 6 | 7 | 8 | # Automated ML introduction 9 | Automated machine learning (automated ML) builds high quality machine learning models for you by automating model and hyperparameter selection. Bring a labelled dataset that you want to build a model for, and automated ML will give you a high quality machine learning model that you can use for predictions. 10 | 11 | 12 | If you are new to Data Science, automated ML will help you get jumpstarted by simplifying machine learning model building. It spares you from performing model selection and hyperparameter selection, and in one step creates a high quality trained model for you to use. 13 | 14 | If you are an experienced data scientist, automated ML will help increase your productivity by intelligently performing the model and hyperparameter selection for your training and generating high quality models much more quickly than manually specifying several combinations of the parameters and running training jobs. Automated ML provides visibility and access to all the training jobs and the performance characteristics of the models to help you further tune the pipeline if you desire. 15 | 16 | Below are the three execution environments supported by automated ML. 17 | 18 | 19 | 20 | ## Setup using Compute Instances - Jupyter based notebooks from an Azure Virtual Machine 21 | 22 | 1. 
Open the [ML Azure portal](https://ml.azure.com) 23 | 1. Select Compute 24 | 1. Select Compute Instances 25 | 1. Click New 26 | 1. Type a Compute Name, select a Virtual Machine type and select a Virtual Machine size 27 | 1. Click Create 28 | 29 | 30 | ## Setup using a Local Conda environment 31 | 32 | To run these notebooks on your own notebook server, use these installation instructions. 33 | The instructions below will install everything you need and then start a Jupyter notebook. 34 | 35 | ### 1. Install mini-conda from [here](https://conda.io/miniconda.html), choose 64-bit Python 3.7 or higher. 36 | - **Note**: if you already have conda installed, you can keep using it, but it should be version 4.7.8 or later (as shown by: conda -V), which is the minimum that check_conda_version.py enforces. If you have a previous version installed, you can update it using the command: conda update conda. 37 | There's no need to install mini-conda specifically. 38 | 39 | ### 2. Downloading the sample notebooks 40 | - Download the sample notebooks as zip and extract the contents to a local directory. 41 | 42 | ### 3. Set up a new conda environment 43 | The **automl_setup** script creates a new conda environment, installs the necessary packages, configures the widget and starts a jupyter notebook. It takes the conda environment name as an optional parameter. The default conda environment name is azure_automl. The exact command depends on the operating system. See the specific sections below for Windows, Mac and Linux. It can take about 10 minutes to execute. 44 | 45 | Packages installed by the **automl_setup** script: 46 |

python
nb_conda
matplotlib
numpy
cython
urllib3
scipy
scikit-learn
pandas
tensorflow
py-xgboost
azureml-sdk
azureml-widgets
pandas-ml
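
The setup scripts in this folder read up to three positional arguments: the conda environment name, the environment file, and an options value, where `nolaunch` skips starting Jupyter at the end. A minimal sketch of a Linux invocation follows, assuming it is run from the directory that contains the scripts; the file-existence guard is an addition for illustration and is not part of the original instructions.

```shell
# Positional arguments read by automl_setup_linux.sh:
#   $1 = conda environment name (defaults to azure_automl)
#   $2 = environment file       (defaults to automl_env_linux.yml)
#   $3 = options; "nolaunch" skips starting Jupyter at the end
CONDA_ENV_NAME="azure_automl"
AUTOML_ENV_FILE="automl_env_linux.yml"

if [ -f automl_setup_linux.sh ]; then
    bash automl_setup_linux.sh "$CONDA_ENV_NAME" "$AUTOML_ENV_FILE" nolaunch
else
    echo "automl_setup_linux.sh not found; run this from the extracted sample directory"
fi
```

On macOS the corresponding files are `automl_setup_mac.sh` and `automl_env_mac.yml`; on Windows, `automl_setup.cmd` takes the same three arguments and is meant to be run from an Anaconda Prompt.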

47 | 48 | For more details refer to [automl_env.yml](./automl_env.yml). 49 | 50 | ### 4. Running configuration.ipynb 51 | - Before running any samples, you need to run the configuration notebook. Click on the [configuration](../../configuration.ipynb) notebook. 52 | - Execute the cells in the notebook to Register Machine Learning Services Resource Provider and create a workspace. (*instructions in notebook*) 53 | 54 | ### 5. Running Samples 55 | - Please make sure you use the Python [conda env:azure_automl] kernel when trying the sample Notebooks. 56 | - Follow the instructions in the individual notebooks to explore various features in automated ML. 57 | 58 | ### 6. Starting jupyter notebook manually 59 | To start your Jupyter notebook manually, use: 60 | 61 | ``` 62 | conda activate azure_automl 63 | jupyter notebook 64 | ``` 65 | 66 | or on Mac or Linux: 67 | 68 | ``` 69 | source activate azure_automl 70 | jupyter notebook 71 | ``` 72 | 73 | 74 | ## Setup using Azure Databricks 75 | 76 | **NOTE**: Please create your Azure Databricks cluster as v7.1 (high concurrency preferred) with **Python 3** (dropdown). 77 | **NOTE**: You need at least contributor access to your Azure subscription to run the notebook. 78 | - You can find the detailed README instructions at [GitHub](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/azure-databricks/automl). 79 | - Download the sample notebook automl-databricks-local-01.ipynb from [GitHub](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/azure-databricks/automl) and import it into the Azure Databricks workspace. 80 | - Attach the notebook to the cluster. 81 | -------------------------------------------------------------------------------- /automl_env.yml: -------------------------------------------------------------------------------- 1 | name: azure_automl 2 | dependencies: 3 | # The python interpreter version. 
4 | # Currently Azure ML only supports 3.5.2 and later. 5 | - pip==20.2.4 6 | - python>=3.5.2,<3.8 7 | - nb_conda 8 | - boto3==1.15.18 9 | - matplotlib==2.1.0 10 | - numpy==1.18.5 11 | - cython 12 | - urllib3<1.24 13 | - scipy>=1.4.1,<=1.5.2 14 | - scikit-learn==0.22.1 15 | - pandas==0.25.1 16 | - py-xgboost<=0.90 17 | - conda-forge::fbprophet==0.5 18 | - holidays==0.9.11 19 | - pytorch::pytorch=1.4.0 20 | - cudatoolkit=10.1.243 21 | 22 | - pip: 23 | # Required packages for AzureML execution, history, and data preparation. 24 | - azureml-widgets~=1.21.0 25 | - pytorch-transformers==1.0.0 26 | - spacy==2.1.8 27 | - https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz 28 | - -r https://automlcesdkdataresources.blob.core.windows.net/validated-requirements/1.21.0/validated_win32_requirements.txt [--no-deps] 29 | -------------------------------------------------------------------------------- /automl_env_linux.yml: -------------------------------------------------------------------------------- 1 | name: azure_automl 2 | dependencies: 3 | # The python interpreter version. 4 | # Currently Azure ML only supports 3.5.2 and later. 5 | - pip==20.2.4 6 | - python>=3.5.2,<3.8 7 | - nb_conda 8 | - boto3==1.15.18 9 | - matplotlib==2.1.0 10 | - numpy==1.18.5 11 | - cython 12 | - urllib3<1.24 13 | - scipy>=1.4.1,<=1.5.2 14 | - scikit-learn==0.22.1 15 | - pandas==0.25.1 16 | - py-xgboost<=0.90 17 | - conda-forge::fbprophet==0.5 18 | - holidays==0.9.11 19 | - pytorch::pytorch=1.4.0 20 | - cudatoolkit=10.1.243 21 | 22 | - pip: 23 | # Required packages for AzureML execution, history, and data preparation. 
24 | - azureml-widgets~=1.21.0 25 | - pytorch-transformers==1.0.0 26 | - spacy==2.1.8 27 | - https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz 28 | - -r https://automlcesdkdataresources.blob.core.windows.net/validated-requirements/1.21.0/validated_linux_requirements.txt [--no-deps] 29 | 30 | -------------------------------------------------------------------------------- /automl_env_mac.yml: -------------------------------------------------------------------------------- 1 | name: azure_automl 2 | dependencies: 3 | # The python interpreter version. 4 | # Currently Azure ML only supports 3.5.2 and later. 5 | - pip==20.2.4 6 | - nomkl 7 | - python>=3.5.2,<3.8 8 | - nb_conda 9 | - boto3==1.15.18 10 | - matplotlib==2.1.0 11 | - numpy==1.18.5 12 | - cython 13 | - urllib3<1.24 14 | - scipy>=1.4.1,<=1.5.2 15 | - scikit-learn==0.22.1 16 | - pandas==0.25.1 17 | - py-xgboost<=0.90 18 | - conda-forge::fbprophet==0.5 19 | - holidays==0.9.11 20 | - pytorch::pytorch=1.4.0 21 | - cudatoolkit=9.0 22 | 23 | - pip: 24 | # Required packages for AzureML execution, history, and data preparation. 
25 | - azureml-widgets~=1.21.0 26 | - pytorch-transformers==1.0.0 27 | - spacy==2.1.8 28 | - https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz 29 | - -r https://automlcesdkdataresources.blob.core.windows.net/validated-requirements/1.21.0/validated_darwin_requirements.txt [--no-deps] 30 | -------------------------------------------------------------------------------- /automl_setup.cmd: -------------------------------------------------------------------------------- 1 | @echo off 2 | set conda_env_name=%1 3 | set automl_env_file=%2 4 | set options=%3 5 | set PIP_NO_WARN_SCRIPT_LOCATION=0 6 | 7 | IF "%conda_env_name%"=="" SET conda_env_name="azure_automl" 8 | IF "%automl_env_file%"=="" SET automl_env_file="automl_env.yml" 9 | SET check_conda_version_script="check_conda_version.py" 10 | 11 | IF NOT EXIST %automl_env_file% GOTO YmlMissing 12 | 13 | IF "%CONDA_EXE%"=="" GOTO CondaMissing 14 | 15 | IF NOT EXIST %check_conda_version_script% GOTO VersionCheckMissing 16 | 17 | python "%check_conda_version_script%" 18 | IF errorlevel 1 GOTO ErrorExit: 19 | 20 | SET replace_version_script="replace_latest_version.ps1" 21 | IF EXIST %replace_version_script% ( 22 | powershell -file %replace_version_script% %automl_env_file% 23 | ) 24 | 25 | call conda activate %conda_env_name% 2>nul: 26 | 27 | if not errorlevel 1 ( 28 | echo Upgrading existing conda environment %conda_env_name% 29 | call pip uninstall azureml-train-automl -y -q 30 | call conda env update --name %conda_env_name% --file %automl_env_file% 31 | if errorlevel 1 goto ErrorExit 32 | ) else ( 33 | call conda env create -f %automl_env_file% -n %conda_env_name% 34 | ) 35 | 36 | call conda activate %conda_env_name% 2>nul: 37 | if errorlevel 1 goto ErrorExit 38 | 39 | call python -m ipykernel install --user --name %conda_env_name% --display-name "Python (%conda_env_name%)" 40 | 41 | REM azureml.widgets is now installed as part of the pip install under the conda env. 
42 | REM Removing the old user install so that the notebooks will use the latest widget. 43 | call jupyter nbextension uninstall --user --py azureml.widgets 44 | 45 | echo. 46 | echo. 47 | echo *************************************** 48 | echo * AutoML setup completed successfully * 49 | echo *************************************** 50 | IF NOT "%options%"=="nolaunch" ( 51 | echo. 52 | echo Starting jupyter notebook - please run the configuration notebook 53 | echo. 54 | jupyter notebook --log-level=50 --notebook-dir='..\..' 55 | ) 56 | 57 | goto End 58 | 59 | :CondaMissing 60 | echo Please run this script from an Anaconda Prompt window. 61 | echo You can start an Anaconda Prompt window by 62 | echo typing Anaconda Prompt on the Start menu. 63 | echo If you don't see the Anaconda Prompt app, install Miniconda. 64 | echo If you are running an older version of Miniconda or Anaconda, 65 | echo you can upgrade using the command: conda update conda 66 | goto End 67 | 68 | :VersionCheckMissing 69 | echo File %check_conda_version_script% not found. 70 | goto End 71 | 72 | :YmlMissing 73 | echo File %automl_env_file% not found. 74 | 75 | :ErrorExit 76 | echo Install failed 77 | 78 | :End -------------------------------------------------------------------------------- /automl_setup_linux.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | CONDA_ENV_NAME=$1 4 | AUTOML_ENV_FILE=$2 5 | OPTIONS=$3 6 | PIP_NO_WARN_SCRIPT_LOCATION=0 7 | CHECK_CONDA_VERSION_SCRIPT="check_conda_version.py" 8 | 9 | if [ "$CONDA_ENV_NAME" == "" ] 10 | then 11 | CONDA_ENV_NAME="azure_automl" 12 | fi 13 | 14 | if [ "$AUTOML_ENV_FILE" == "" ] 15 | then 16 | AUTOML_ENV_FILE="automl_env_linux.yml" 17 | fi 18 | 19 | if [ ! -f $AUTOML_ENV_FILE ]; then 20 | echo "File $AUTOML_ENV_FILE not found" 21 | exit 1 22 | fi 23 | 24 | if [ ! 
-f $CHECK_CONDA_VERSION_SCRIPT ]; then 25 | echo "File $CHECK_CONDA_VERSION_SCRIPT not found" 26 | exit 1 27 | fi 28 | 29 | python "$CHECK_CONDA_VERSION_SCRIPT" 30 | if [ $? -ne 0 ]; then 31 | exit 1 32 | fi 33 | 34 | sed -i 's/AZUREML-SDK-VERSION/latest/' $AUTOML_ENV_FILE 35 | 36 | if source activate $CONDA_ENV_NAME 2> /dev/null 37 | then 38 | echo "Upgrading existing conda environment" $CONDA_ENV_NAME 39 | pip uninstall azureml-train-automl -y -q 40 | conda env update --name $CONDA_ENV_NAME --file $AUTOML_ENV_FILE && 41 | jupyter nbextension uninstall --user --py azureml.widgets 42 | else 43 | conda env create -f $AUTOML_ENV_FILE -n $CONDA_ENV_NAME && 44 | source activate $CONDA_ENV_NAME && 45 | python -m ipykernel install --user --name $CONDA_ENV_NAME --display-name "Python ($CONDA_ENV_NAME)" && 46 | jupyter nbextension uninstall --user --py azureml.widgets && 47 | echo "" && 48 | echo "" && 49 | echo "***************************************" && 50 | echo "* AutoML setup completed successfully *" && 51 | echo "***************************************" && 52 | if [ "$OPTIONS" != "nolaunch" ] 53 | then 54 | echo "" && 55 | echo "Starting jupyter notebook - please run the configuration notebook" && 56 | echo "" && 57 | jupyter notebook --log-level=50 --notebook-dir '../..' 58 | fi 59 | fi 60 | 61 | if [ $? -gt 0 ] 62 | then 63 | echo "Installation failed" 64 | fi 65 | 66 | 67 | -------------------------------------------------------------------------------- /automl_setup_mac.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | CONDA_ENV_NAME=$1 4 | AUTOML_ENV_FILE=$2 5 | OPTIONS=$3 6 | PIP_NO_WARN_SCRIPT_LOCATION=0 7 | CHECK_CONDA_VERSION_SCRIPT="check_conda_version.py" 8 | 9 | if [ "$CONDA_ENV_NAME" == "" ] 10 | then 11 | CONDA_ENV_NAME="azure_automl" 12 | fi 13 | 14 | if [ "$AUTOML_ENV_FILE" == "" ] 15 | then 16 | AUTOML_ENV_FILE="automl_env_mac.yml" 17 | fi 18 | 19 | if [ ! 
-f $AUTOML_ENV_FILE ]; then 20 | echo "File $AUTOML_ENV_FILE not found" 21 | exit 1 22 | fi 23 | 24 | if [ ! -f $CHECK_CONDA_VERSION_SCRIPT ]; then 25 | echo "File $CHECK_CONDA_VERSION_SCRIPT not found" 26 | exit 1 27 | fi 28 | 29 | python "$CHECK_CONDA_VERSION_SCRIPT" 30 | if [ $? -ne 0 ]; then 31 | exit 1 32 | fi 33 | 34 | sed -i '' 's/AZUREML-SDK-VERSION/latest/' $AUTOML_ENV_FILE 35 | 36 | if source activate $CONDA_ENV_NAME 2> /dev/null 37 | then 38 | echo "Upgrading existing conda environment" $CONDA_ENV_NAME 39 | pip uninstall azureml-train-automl -y -q 40 | conda env update --name $CONDA_ENV_NAME --file $AUTOML_ENV_FILE && 41 | jupyter nbextension uninstall --user --py azureml.widgets 42 | else 43 | conda env create -f $AUTOML_ENV_FILE -n $CONDA_ENV_NAME && 44 | source activate $CONDA_ENV_NAME && 45 | conda install lightgbm -c conda-forge -y && 46 | python -m ipykernel install --user --name $CONDA_ENV_NAME --display-name "Python ($CONDA_ENV_NAME)" && 47 | jupyter nbextension uninstall --user --py azureml.widgets && 48 | echo "" && 49 | echo "" && 50 | echo "***************************************" && 51 | echo "* AutoML setup completed successfully *" && 52 | echo "***************************************" && 53 | if [ "$OPTIONS" != "nolaunch" ] 54 | then 55 | echo "" && 56 | echo "Starting jupyter notebook - please run the configuration notebook" && 57 | echo "" && 58 | jupyter notebook --log-level=50 --notebook-dir '../..' 59 | fi 60 | fi 61 | 62 | if [ $? 
-gt 0 ] 63 | then 64 | echo "Installation failed" 65 | fi 66 | 67 | 68 | 69 | -------------------------------------------------------------------------------- /check_conda_version.py: -------------------------------------------------------------------------------- 1 | from distutils.version import LooseVersion 2 | import platform 3 | 4 | try: 5 | import conda 6 | except: 7 | print('Failed to import conda.') 8 | print('This setup is usually run from the base conda environment.') 9 | print('You can activate the base environment using the command "conda activate base"') 10 | exit(1) 11 | 12 | architecture = platform.architecture()[0] 13 | 14 | if architecture != "64bit": 15 | print('This setup requires 64bit Anaconda or Miniconda. Found: ' + architecture) 16 | exit(1) 17 | 18 | minimumVersion = "4.7.8" 19 | 20 | versionInvalid = (LooseVersion(conda.__version__) < LooseVersion(minimumVersion)) 21 | 22 | if versionInvalid: 23 | print('Setup requires conda version ' + minimumVersion + ' or higher.') 24 | print('You can use the command "conda update conda" to upgrade conda.') 25 | 26 | exit(versionInvalid) 27 | -------------------------------------------------------------------------------- /forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Copyright (c) Microsoft Corporation. All rights reserved.\n", 8 | "\n", 9 | "Licensed under the MIT License." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand/auto-ml-forecasting-energy-demand.png)" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "# Automated Machine Learning\n", 24 | "_**Forecasting using the Energy Demand Dataset**_\n", 25 | "\n", 26 | "## Contents\n", 27 | "1. [Introduction](#Introduction)\n", 28 | "1. [Setup](#Setup)\n", 29 | "1. [Data and Forecasting Configurations](#Data)\n", 30 | "1. [Train](#Train)\n", 31 | "\n", 32 | "Advanced Forecasting\n", 33 | "1. [Advanced Training](#advanced_training)\n", 34 | "1. [Advanced Results](#advanced_results)" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "## Introduction\n", 42 | "\n", 43 | "In this example we use the associated New York City energy demand dataset to showcase how you can use AutoML for a simple forecasting problem and explore the results. The goal is to predict the energy demand for the next 48 hours based on historic time-series data.\n", 44 | "\n", 45 | "If you are using an Azure Machine Learning Compute Instance, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) first, if you haven't already, to establish your connection to the AzureML Workspace.\n", 46 | "\n", 47 | "In this notebook you will learn how to:\n", 48 | "1. Create an Experiment using an existing Workspace\n", 49 | "1. Configure AutoML using 'AutoMLConfig'\n", 50 | "1. Train the model using AmlCompute\n", 51 | "1. Explore the engineered features and results\n", 52 | "1. Configure and remotely run AutoML for a time-series model with lag and rolling window features\n", 53 | "1. 
Run and explore the forecast" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "## Setup" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "import logging\n", 70 | "\n", 71 | "from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n", 72 | "from matplotlib import pyplot as plt\n", 73 | "import pandas as pd\n", 74 | "import numpy as np\n", 75 | "import warnings\n", 76 | "import os\n", 77 | "\n", 78 | "# Squash warning messages for cleaner output in the notebook\n", 79 | "warnings.showwarning = lambda *args, **kwargs: None\n", 80 | "\n", 81 | "import azureml.core\n", 82 | "from azureml.core import Experiment, Workspace, Dataset\n", 83 | "from azureml.train.automl import AutoMLConfig\n", 84 | "from datetime import datetime" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "This sample notebook may use features that are not available in previous versions of the Azure ML SDK." 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "print(\"This notebook was created using version 1.21.0 of the Azure ML SDK\")\n", 101 | "print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments." 
109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "ws = Workspace.from_config()\n", 118 | "\n", 119 | "# choose a name for the run history container in the workspace\n", 120 | "experiment_name = 'automl-forecasting-energydemand'\n", 121 | "\n", 122 | "# # project folder\n", 123 | "# project_folder = './sample_projects/automl-forecasting-energy-demand'\n", 124 | "\n", 125 | "experiment = Experiment(ws, experiment_name)\n", 126 | "\n", 127 | "output = {}\n", 128 | "output['Subscription ID'] = ws.subscription_id\n", 129 | "output['Workspace'] = ws.name\n", 130 | "output['Resource Group'] = ws.resource_group\n", 131 | "output['Location'] = ws.location\n", 132 | "output['Run History Name'] = experiment_name\n", 133 | "pd.set_option('display.max_colwidth', -1)\n", 134 | "outputDf = pd.DataFrame(data = output, index = [''])\n", 135 | "outputDf.T" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "## Create or Attach existing AmlCompute\n", 143 | "A compute target is required to execute a remote Automated ML run. \n", 144 | "\n", 145 | "[Azure Machine Learning Compute](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) is a managed-compute infrastructure that allows the user to easily create a single or multi-node compute. In this tutorial, you create AmlCompute as your training compute resource.\n", 146 | "\n", 147 | "#### Creation of AmlCompute takes approximately 5 minutes. \n", 148 | "If the AmlCompute with that name is already in your workspace this code will skip the creation process.\n", 149 | "As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. 
Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota." 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "from azureml.core.compute import ComputeTarget, AmlCompute\n", 159 | "from azureml.core.compute_target import ComputeTargetException\n", 160 | "\n", 161 | "# Choose a name for your cluster.\n", 162 | "amlcompute_cluster_name = \"energy-cluster\"\n", 163 | "\n", 164 | "# Verify that cluster does not exist already\n", 165 | "try:\n", 166 | "    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)\n", 167 | "    print('Found existing cluster, use it.')\n", 168 | "except ComputeTargetException:\n", 169 | "    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',\n", 170 | "                                                           max_nodes=6)\n", 171 | "    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)\n", 172 | "\n", 173 | "compute_target.wait_for_completion(show_output=True)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "# Data\n", 181 | "\n", 182 | "We will use energy consumption [data from New York City](http://mis.nyiso.com/public/P-58Blist.htm) for model training. The data is stored in a tabular format and includes energy demand and basic weather data at an hourly frequency. \n", 183 | "\n", 184 | "With Azure Machine Learning datasets you can keep a single copy of data in your storage, easily access data during model training, share data and collaborate with other users. Below, we will upload the dataset and create a [tabular dataset](https://docs.microsoft.com/bs-latn-ba/azure/machine-learning/service/how-to-create-register-datasets#dataset-types) to be used for training and prediction." 
185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "Let's set up what we know about the dataset.\n", 192 | "\n", 193 | "Target column is what we want to forecast.

\n", 194 | "Time column is the time axis along which to predict.\n", 195 | "\n", 196 | "The other columns, \"temp\" and \"precip\", are implicitly designated as features." 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "target_column_name = 'demand'\n", 206 | "time_column_name = 'timeStamp'" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "dataset = Dataset.Tabular.from_delimited_files(path = \"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/nyc_energy.csv\").with_timestamp_columns(fine_grain_timestamp=time_column_name) \n", 216 | "dataset.take(5).to_pandas_dataframe().reset_index(drop=True)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "The NYC Energy dataset is missing energy demand values for all datetimes later than August 10th, 2017 5AM. Below, we trim the rows containing these missing values from the end of the dataset." 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [ 232 | "# Cut off the end of the dataset due to the large number of NaN values after August 10th, 2017 5AM\n", 233 | "dataset = dataset.time_before(datetime(2017, 8, 10, 5))" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "## Split the data into train and test sets" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "The first split we make is into train and test sets. Note that we are splitting on time. Data before and including August 8th, 2017 5AM will be used for training, and data after will be used for testing." 
248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "# split into train based on time\n", 257 | "train = dataset.time_before(datetime(2017, 8, 8, 5), include_boundary=True)\n", 258 | "train.to_pandas_dataframe().reset_index(drop=True).sort_values(time_column_name).tail(5)" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "metadata": {}, 265 | "outputs": [], 266 | "source": [ 267 | "# split into test based on time\n", 268 | "test = dataset.time_between(datetime(2017, 8, 8, 6), datetime(2017, 8, 10, 5))\n", 269 | "test.to_pandas_dataframe().reset_index(drop=True).head(5)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": {}, 275 | "source": [ 276 | "### Setting the maximum forecast horizon\n", 277 | "\n", 278 | "The forecast horizon is the number of periods into the future that the model should predict. It is generally recommended that users set forecast horizons to fewer than 100 time periods (i.e., fewer than 100 hours in the NYC energy example). Furthermore, **AutoML's memory use and computation time increase in proportion to the length of the horizon**, so consider carefully how this value is set. If a long horizon forecast really is necessary, consider aggregating the series to a coarser time scale. \n", 279 | "\n", 280 | "Learn more about forecast horizons in our [Auto-train a time-series forecast model](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-auto-train-forecast#configure-and-run-experiment) guide.\n", 281 | "\n", 282 | "In this example, we set the horizon to 48 hours." 
283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "forecast_horizon = 48" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "## Forecasting Parameters\n", 299 | "To define forecasting parameters for your experiment training, you can leverage the ForecastingParameters class. The table below details the forecasting parameters we will be passing into our experiment.\n", 300 | "\n", 301 | "|Property|Description|\n", 302 | "|-|-|\n", 303 | "|**time_column_name**|The name of your time column.|\n", 304 | "|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|\n", 305 | "|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.|" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "## Train\n", 313 | "\n", 314 | "Instantiate an AutoMLConfig object. This config defines the settings and data used to run the experiment. We can provide extra configurations within 'automl_settings'; for this forecasting task, we add a ForecastingParameters object to hold all the additional forecasting parameters.\n", 315 | "\n", 316 | "|Property|Description|\n", 317 | "|-|-|\n", 318 | "|**task**|forecasting|\n", 319 | "|**primary_metric**|This is the metric that you want to optimize.
Forecasting supports the following primary metrics: spearman_correlation, normalized_root_mean_squared_error, r2_score,
normalized_mean_absolute_error|\n", 320 | "|**blocked_models**|Models in blocked_models won't be used by AutoML. All supported models can be found [here](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.constants.supportedmodels.forecasting?view=azure-ml-py).|\n", 321 | "|**experiment_timeout_hours**|Maximum amount of time in hours that the experiment can take before it terminates.|\n", 322 | "|**training_data**|The training data to be used within the experiment.|\n", 323 | "|**label_column_name**|The name of the label column.|\n", 324 | "|**compute_target**|The remote compute for training.|\n", 325 | "|**n_cross_validations**|Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way.|\n", 326 | "|**enable_early_stopping**|Flag to enable early termination if the score is not improving in the short term.|\n", 327 | "|**forecasting_parameters**|A class that holds all the forecasting-related parameters.|\n" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "This notebook uses the blocked_models parameter to exclude some models that take a longer time to train on this dataset. You can choose to remove models from the blocked_models list, but you may need to increase the experiment_timeout_hours parameter value to get results." 
335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "from azureml.automl.core.forecasting_parameters import ForecastingParameters\n", 344 | "forecasting_parameters = ForecastingParameters(\n", 345 | " time_column_name=time_column_name, forecast_horizon=forecast_horizon\n", 346 | ")\n", 347 | "\n", 348 | "automl_config = AutoMLConfig(task='forecasting', \n", 349 | " primary_metric='normalized_root_mean_squared_error',\n", 350 | " blocked_models = ['ExtremeRandomTrees', 'AutoArima', 'Prophet'], \n", 351 | " experiment_timeout_hours=0.3,\n", 352 | " training_data=train,\n", 353 | " label_column_name=target_column_name,\n", 354 | " compute_target=compute_target,\n", 355 | " enable_early_stopping=True,\n", 356 | " n_cross_validations=3, \n", 357 | " verbosity=logging.INFO,\n", 358 | " forecasting_parameters=forecasting_parameters)" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "Call the `submit` method on the experiment object and pass the run configuration. Depending on the data and the number of iterations this can run for a while.\n", 366 | "One may specify `show_output = True` to print currently running iterations to the console." 
367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [ 375 | "remote_run = experiment.submit(automl_config, show_output=False)" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "remote_run" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "metadata": {}, 391 | "outputs": [], 392 | "source": [ 393 | "remote_run.wait_for_completion()" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "## Retrieve the Best Model\n", 401 | "Below we select the best model from all the training iterations using get_output method." 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": null, 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [ 410 | "best_run, fitted_model = remote_run.get_output()\n", 411 | "fitted_model.steps" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "## Featurization\n", 419 | "You can access the engineered feature names generated in time-series featurization." 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "fitted_model.named_steps['timeseriestransformer'].get_engineered_feature_names()" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "### View featurization summary\n", 436 | "You can also see what featurization steps were performed on different raw features in the user data. 
For each raw feature in the user data, the following information is displayed:\n", 437 | "\n", 438 | "+ Raw feature name\n", 439 | "+ Number of engineered features formed out of this raw feature\n", 440 | "+ Type detected\n", 441 | "+ Whether the feature was dropped\n", 442 | "+ List of feature transformations for the raw feature" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": null, 448 | "metadata": {}, 449 | "outputs": [], 450 | "source": [ 451 | "# Get the featurization summary as a list of JSON\n", 452 | "featurization_summary = fitted_model.named_steps['timeseriestransformer'].get_featurization_summary()\n", 453 | "# View the featurization summary as a pandas dataframe\n", 454 | "pd.DataFrame.from_records(featurization_summary)" 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "metadata": {}, 460 | "source": [ 461 | "## Forecasting\n", 462 | "\n", 463 | "Now that we have retrieved the best pipeline/model, it can be used to make predictions on test data. First, we remove the target values from the test set:" 464 | ] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "execution_count": null, 469 | "metadata": {}, 470 | "outputs": [], 471 | "source": [ 472 | "X_test = test.to_pandas_dataframe().reset_index(drop=True)\n", 473 | "y_test = X_test.pop(target_column_name).values" 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "metadata": {}, 479 | "source": [ 480 | "### Forecast Function\n", 481 | "For forecasting, we will use the forecast function instead of the predict function. Using the predict method would return predictions for EVERY horizon the forecaster can predict at. This is useful when training and evaluating the performance of the forecaster at various horizons, but the level of detail is excessive for normal use. The forecast function can also handle more complicated scenarios; see the [forecast function notebook](../forecasting-forecast-function/auto-ml-forecasting-function.ipynb)." 
482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": null, 487 | "metadata": {}, 488 | "outputs": [], 489 | "source": [ 490 | "# The featurized data, aligned to y, will also be returned.\n", 491 | "# This contains the assumptions that were made in the forecast\n", 492 | "# and helps align the forecast to the original data\n", 493 | "y_predictions, X_trans = fitted_model.forecast(X_test)" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "### Evaluate\n", 501 | "To evaluate the accuracy of the forecast, we'll compare it against the actual values on a selection of metrics, including the mean absolute percentage error (MAPE). For more metrics that can be used for evaluation after training, please see [supported metrics](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#regressionforecasting-metrics), and [how to calculate residuals](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#residuals).\n", 502 | "\n", 503 | "It is good practice to always align the output explicitly to the input, as the count and order of the rows may have changed during transformations that span multiple rows." 
504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": null, 509 | "metadata": {}, 510 | "outputs": [], 511 | "source": [ 512 | "from forecasting_helper import align_outputs\n", 513 | "\n", 514 | "df_all = align_outputs(y_predictions, X_trans, X_test, y_test, target_column_name)" 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": null, 520 | "metadata": {}, 521 | "outputs": [], 522 | "source": [ 523 | "from azureml.automl.core.shared import constants\n", 524 | "from azureml.automl.runtime.shared.score import scoring\n", 525 | "from matplotlib import pyplot as plt\n", 526 | "\n", 527 | "# use automl metrics module\n", 528 | "scores = scoring.score_regression(\n", 529 | " y_test=df_all[target_column_name],\n", 530 | " y_pred=df_all['predicted'],\n", 531 | " metrics=list(constants.Metric.SCALAR_REGRESSION_SET))\n", 532 | "\n", 533 | "print(\"[Test data scores]\\n\")\n", 534 | "for key, value in scores.items(): \n", 535 | " print('{}: {:.3f}'.format(key, value))\n", 536 | " \n", 537 | "# Plot outputs\n", 538 | "%matplotlib inline\n", 539 | "test_pred = plt.scatter(df_all[target_column_name], df_all['predicted'], color='b')\n", 540 | "test_test = plt.scatter(df_all[target_column_name], df_all[target_column_name], color='g')\n", 541 | "plt.legend((test_pred, test_test), ('prediction', 'truth'), loc='upper left', fontsize=8)\n", 542 | "plt.show()" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "Looking at `X_trans` is also useful to see what featurization happened to the data." 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": {}, 556 | "outputs": [], 557 | "source": [ 558 | "X_trans" 559 | ] 560 | }, 561 | { 562 | "cell_type": "markdown", 563 | "metadata": {}, 564 | "source": [ 565 | "## Advanced Training \n", 566 | "We did not use lags in the previous model specification. 
In effect, the prediction was the result of a simple regression on date, time series identifier columns and any additional features. This is often a very good prediction as common time series patterns like seasonality and trends can be captured in this manner. Such a simple regression is horizon-less: it doesn't matter how far into the future we are predicting, because we are not using past data. In the previous example, the horizon was only used to split the data for cross-validation." 567 | ] 568 | }, 569 | { 570 | "cell_type": "markdown", 571 | "metadata": {}, 572 | "source": [ 573 | "### Using lags and rolling window features\n", 574 | "Now we will configure the target lags, that is, the previous values of the target variable, so the prediction is no longer horizon-less. We therefore must specify the `forecast_horizon` that the model will learn to forecast. The `target_lags` keyword specifies how far back we will construct the lags of the target variable, and the `target_rolling_window_size` specifies the size of the rolling window over which we will generate the `max`, `min` and `sum` features.\n", 575 | "\n", 576 | "This notebook uses the blocked_models parameter to exclude some models that take a longer time to train on this dataset. You can choose to remove models from the blocked_models list, but you may need to increase the experiment_timeout_hours parameter value to get results." 
577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": null, 582 | "metadata": {}, 583 | "outputs": [], 584 | "source": [ 585 | "advanced_forecasting_parameters = ForecastingParameters(\n", 586 | " time_column_name=time_column_name, forecast_horizon=forecast_horizon,\n", 587 | " target_lags=12, target_rolling_window_size=4\n", 588 | ")\n", 589 | "\n", 590 | "automl_config = AutoMLConfig(task='forecasting', \n", 591 | " primary_metric='normalized_root_mean_squared_error',\n", 592 | " blocked_models=['ElasticNet', 'ExtremeRandomTrees', 'GradientBoosting', 'XGBoostRegressor', 'AutoArima', 'Prophet'], # These models are blocked for tutorial purposes; remove this for real use cases.\n", 593 | " experiment_timeout_hours=0.3,\n", 594 | " training_data=train,\n", 595 | " label_column_name=target_column_name,\n", 596 | " compute_target=compute_target,\n", 597 | " enable_early_stopping=True,\n", 598 | " n_cross_validations=3, \n", 599 | " verbosity=logging.INFO,\n", 600 | " forecasting_parameters=advanced_forecasting_parameters)" 601 | ] 602 | }, 603 | { 604 | "cell_type": "markdown", 605 | "metadata": {}, 606 | "source": [ 607 | "We now start a new remote run, this time with lag and rolling window featurization. AutoML applies featurizations in the setup stage, prior to iterating over ML models. The full training set is featurized first, followed by featurization of each of the CV splits. Lag and rolling window features introduce additional complexity, so the run will take longer than in the previous example that lacked these featurizations." 
608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": null, 613 | "metadata": {}, 614 | "outputs": [], 615 | "source": [ 616 | "advanced_remote_run = experiment.submit(automl_config, show_output=False)" 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": null, 622 | "metadata": {}, 623 | "outputs": [], 624 | "source": [ 625 | "advanced_remote_run.wait_for_completion()" 626 | ] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": {}, 631 | "source": [ 632 | "### Retrieve the Best Model" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": null, 638 | "metadata": {}, 639 | "outputs": [], 640 | "source": [ 641 | "best_run_lags, fitted_model_lags = advanced_remote_run.get_output()" 642 | ] 643 | }, 644 | { 645 | "cell_type": "markdown", 646 | "metadata": {}, 647 | "source": [ 648 | "## Advanced Results\n", 649 | "We now forecast on the test set with the model that was trained using lag and rolling window features, and evaluate the result with the same metrics as before." 
650 | ] 651 | }, 652 | { 653 | "cell_type": "code", 654 | "execution_count": null, 655 | "metadata": {}, 656 | "outputs": [], 657 | "source": [ 658 | "# The featurized data, aligned to y, will also be returned.\n", 659 | "# This contains the assumptions that were made in the forecast\n", 660 | "# and helps align the forecast to the original data\n", 661 | "y_predictions, X_trans = fitted_model_lags.forecast(X_test)" 662 | ] 663 | }, 664 | { 665 | "cell_type": "code", 666 | "execution_count": null, 667 | "metadata": {}, 668 | "outputs": [], 669 | "source": [ 670 | "from forecasting_helper import align_outputs\n", 671 | "\n", 672 | "df_all = align_outputs(y_predictions, X_trans, X_test, y_test, target_column_name)" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": null, 678 | "metadata": {}, 679 | "outputs": [], 680 | "source": [ 681 | "from azureml.automl.core.shared import constants\n", 682 | "from azureml.automl.runtime.shared.score import scoring\n", 683 | "from matplotlib import pyplot as plt\n", 684 | "\n", 685 | "# use automl metrics module\n", 686 | "scores = scoring.score_regression(\n", 687 | " y_test=df_all[target_column_name],\n", 688 | " y_pred=df_all['predicted'],\n", 689 | " metrics=list(constants.Metric.SCALAR_REGRESSION_SET))\n", 690 | "\n", 691 | "print(\"[Test data scores]\\n\")\n", 692 | "for key, value in scores.items(): \n", 693 | " print('{}: {:.3f}'.format(key, value))\n", 694 | " \n", 695 | "# Plot outputs\n", 696 | "%matplotlib inline\n", 697 | "test_pred = plt.scatter(df_all[target_column_name], df_all['predicted'], color='b')\n", 698 | "test_test = plt.scatter(df_all[target_column_name], df_all[target_column_name], color='g')\n", 699 | "plt.legend((test_pred, test_test), ('prediction', 'truth'), loc='upper left', fontsize=8)\n", 700 | "plt.show()" 701 | ] 702 | } 703 | ], 704 | "metadata": { 705 | "authors": [ 706 | { 707 | "name": "jialiu" 708 | } 709 | ], 710 | "categories": [ 711 | "how-to-use-azureml", 712 | 
"automated-machine-learning" 713 | ], 714 | "kernelspec": { 715 | "display_name": "Python 3.6", 716 | "language": "python", 717 | "name": "python36" 718 | }, 719 | "language_info": { 720 | "codemirror_mode": { 721 | "name": "ipython", 722 | "version": 3 723 | }, 724 | "file_extension": ".py", 725 | "mimetype": "text/x-python", 726 | "name": "python", 727 | "nbconvert_exporter": "python", 728 | "pygments_lexer": "ipython3", 729 | "version": "3.6.8" 730 | } 731 | }, 732 | "nbformat": 4, 733 | "nbformat_minor": 2 734 | } -------------------------------------------------------------------------------- /forecasting-energy-demand/forecasting_helper.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from pandas.tseries.frequencies import to_offset 4 | 5 | 6 | def align_outputs(y_predicted, X_trans, X_test, y_test, target_column_name, 7 | predicted_column_name='predicted', 8 | horizon_colname='horizon_origin'): 9 | """ 10 | Demonstrates how to get the output aligned to the inputs 11 | using pandas indexes. Helps understand what happened if 12 | the output's shape differs from the input shape, or if 13 | the data got re-sorted by time and grain during forecasting. 
14 | 15 | Typical causes of misalignment are: 16 | * we predicted some periods that were missing in actuals -> drop from eval 17 | * model was asked to predict past max_horizon -> increase max horizon 18 | * data at start of X_test was needed for lags -> provide previous periods 19 | """ 20 | 21 | if (horizon_colname in X_trans): 22 | df_fcst = pd.DataFrame({predicted_column_name: y_predicted, 23 | horizon_colname: X_trans[horizon_colname]}) 24 | else: 25 | df_fcst = pd.DataFrame({predicted_column_name: y_predicted}) 26 | 27 | # y and X outputs are aligned by forecast() function contract 28 | df_fcst.index = X_trans.index 29 | 30 | # align original X_test to y_test 31 | X_test_full = X_test.copy() 32 | X_test_full[target_column_name] = y_test 33 | 34 | # X_test_full's index does not include origin, so reset for merge 35 | df_fcst.reset_index(inplace=True) 36 | X_test_full = X_test_full.reset_index().drop(columns='index') 37 | together = df_fcst.merge(X_test_full, how='right') 38 | 39 | # drop rows where prediction or actuals are nan 40 | # happens because of missing actuals 41 | # or at edges of time due to lags/rolling windows 42 | clean = together[together[[target_column_name, 43 | predicted_column_name]].notnull().all(axis=1)] 44 | return(clean) 45 | -------------------------------------------------------------------------------- /forecasting-energy-demand/metrics_helper.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | 4 | 5 | def APE(actual, pred): 6 | """ 7 | Calculate absolute percentage error. 8 | Returns a vector of APE values with same length as actual/pred. 9 | """ 10 | return 100 * np.abs((actual - pred) / actual) 11 | 12 | 13 | def MAPE(actual, pred): 14 | """ 15 | Calculate mean absolute percentage error. 
16 | Remove NA and values where actual is close to zero 17 | """ 18 | not_na = ~(np.isnan(actual) | np.isnan(pred)) 19 | not_zero = ~np.isclose(actual, 0.0) 20 | actual_safe = actual[not_na & not_zero] 21 | pred_safe = pred[not_na & not_zero] 22 | return np.mean(APE(actual_safe, pred_safe)) 23 | --------------------------------------------------------------------------------
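As a rough illustration of how the `MAPE` helper above treats missing and near-zero actuals, the sketch below repeats the two functions so that it runs on its own and applies them to a small array. The input values are hypothetical and chosen only to exercise both filters; they do not come from the energy demand dataset.

```python
import numpy as np


def APE(actual, pred):
    """Element-wise absolute percentage error (same formula as metrics_helper.py)."""
    return 100 * np.abs((actual - pred) / actual)


def MAPE(actual, pred):
    """Mean absolute percentage error, skipping NaNs and near-zero actuals."""
    not_na = ~(np.isnan(actual) | np.isnan(pred))
    not_zero = ~np.isclose(actual, 0.0)
    keep = not_na & not_zero
    return np.mean(APE(actual[keep], pred[keep]))


# Hypothetical values, for illustration only.
# The third row has a zero actual and the fourth a missing actual;
# both rows are excluded before averaging.
actual = np.array([100.0, 200.0, 0.0, np.nan, 50.0])
pred = np.array([110.0, 180.0, 5.0, 40.0, 50.0])

mape = MAPE(actual, pred)
print("MAPE: {:.2f}%".format(mape))
```

Only three rows survive the filters, with errors of 10%, 10% and 0%, so the result is their mean. Note that dropping near-zero actuals keeps the metric finite but also means those periods do not contribute to the score at all.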