├── 05-extract-insights-with-databricks └── README.md ├── assets ├── img │ ├── ds-process.png │ └── key-components-ml-workspace.png └── template.md ├── 06-intro-to-ml-with-python-and-azure-notebooks └── README.md ├── .vscode └── settings.json ├── 04-data-engineering-with-databricks └── README.md ├── README.md ├── 00-exam-training └── README.md ├── 03-get-started-with-ADSVM └── README.md ├── 01-explore-AI-solution-development └── README.md └── 02-build-AI-solutions-with-AMLS └── README.md /05-extract-insights-with-databricks/README.md: -------------------------------------------------------------------------------- 1 | # Perform data engineering with Azure Databricks -------------------------------------------------------------------------------- /assets/img/ds-process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pmbrull/azure-ds-examdp100-notes/HEAD/assets/img/ds-process.png -------------------------------------------------------------------------------- /06-intro-to-ml-with-python-and-azure-notebooks/README.md: -------------------------------------------------------------------------------- 1 | # Introduction to machine learning with Python and Azure Notebooks -------------------------------------------------------------------------------- /.vscode/settings.json: -------------------------------------------------------------------------------- 1 | { 2 | "python.pythonPath": "C:\\Users\\P.BrullBorras\\AppData\\Local\\Continuum\\miniconda3\\python.exe" 3 | } -------------------------------------------------------------------------------- /assets/img/key-components-ml-workspace.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pmbrull/azure-ds-examdp100-notes/HEAD/assets/img/key-components-ml-workspace.png -------------------------------------------------------------------------------- /04-data-engineering-with-databricks/README.md: -------------------------------------------------------------------------------- 1 | # Data Engineering with Databricks 2 | 3 | ## 4 | 5 |
6 | 7 | Show content 8 | 9 |

10 | 11 | ### Learning objectives -------------------------------------------------------------------------------- /assets/template.md: -------------------------------------------------------------------------------- 1 | ## Section 2 | 3 |

4 | 5 | Show content 6 | 7 |

8 | 9 | ### Learning Objectives 10 | 11 | 12 | 13 | ### Knowledge Check 14 | 15 | 1. Some question 16 | 17 | * bla 18 | 19 |

20 | 21 | Answer 22 | 23 |

24 | bla 25 |

26 |
27 | 28 | 29 |

30 |
</details>

---
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Azure Data Science exam DP-100 notes

Personal notes for the Azure Data Science exam DP-100. All content is based on Microsoft's Learning Path [docs](https://docs.microsoft.com/en-us/learn/certifications/exams/dp-100?source=learn).

Some useful links:
* [Exam skills measured](https://docs.microsoft.com/en-us/learn/certifications/exams/dp-100?source=learn)
* [Exam requirements infographic](https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE2PLKZ)
* [DP-100 Labs](https://github.com/MicrosoftLearning/DP-100-Designing-and-Implementing-a-Data-Science-Solutio)
* [Azure ML service example notebooks](https://github.com/Azure/MachineLearningNotebooks)

Study guide:
* [Data Concepts](https://medium.com/deep-ai/all-you-need-to-know-about-data-for-machine-learning-a80bc8555d58)
* [Study Guide](https://medium.com/deep-ai/study-guide-for-microsoft-azure-data-scientist-associate-certification-dp-100-c2e4611cb071)
--------------------------------------------------------------------------------
/00-exam-training/README.md:
--------------------------------------------------------------------------------
# DS DP-100 Exam training 01

## Azure Data Science Options

* Azure ML Studio -> drag and drop; understand it for the exam. No code is needed there. Covers training and deployment. A complete ML environment, ideal for learning and for beginner data scientists.
* Azure Databricks for Big Data - based on Spark. Massive scale with Spark. User-friendly portal. Dynamic scaling. Secure collaboration (secured workspace). Data science tools. You can use different languages in the same notebook.
    * Core artifacts: jobs, libraries, clusters, workspaces and notebooks.
* Azure Data Science Virtual Machine - a VM with almost all of the tools one would need for data science already preinstalled. You can deploy one directly to Azure and work from there. It's easy to customize for your needs, and it comes with some sample code. Microsoft merged it with the Deep Learning VM, and there are specific versions for geospatial data.
* SQL Server Machine Learning Services - we can use this to analyze data on SQL Server. Useful for on-premises data. It is an option when open-source Python and R do not scale, or when there are security and operationalization concerns.
* Spark on Azure HDInsight - massive scale with in-memory processing. Hortonworks distribution. Easy management as a PaaS. Integration with other Azure services.

> Databricks vs. HDInsight: Databricks is easier to collaborate in; it is built for collaboration and working in teams.

* Azure ML Service: the core of the course. Model management, training, selection, hyper-parameter tuning, feature selection and model evaluation. It lets you automate tuning and selection tasks. Everything is in Python.

## Azure Notebooks
Azure-based Jupyter Notebooks with a free tier. Ready-to-use projects teach how to use Azure data and AI services.

Jupyter notebooks can be integrated into VS Code, so you can use its Git integration.

Azure Notebooks only support Python, R and F#.

The advantage of Azure Notebooks is that most of the libraries are preinstalled, but you can still install more. You can upload data from your local machine and use a custom environment configuration.
By using VMs from your Azure subscription you can add processing power.

It is Azure ML service ready: from Azure Notebooks you can call Azure ML service.

## Azure ML Service

Brings the power of containerization and automation to data science. Pack the model and libraries into a container and run everything.

DS pipeline:
Environment setup -> Data preparation -> Experimentation -> Deployment

* Environment setup: create a workspace to store your work. Use Python or the Azure portal. An experiment within the workspace stores model training information. Use the IDE of your choice.
* Data Preparation: use Python libraries or the Azure Data Prep SDK.
* Experimentation: train models with the Python open source modules of your choice. Train locally or in Azure. Submit model training to Azure containers. Monitor model training. Register the final model.
* Deployment: to make a model available. Target deployment environments are: Docker images, Azure Container Instances, Azure Kubernetes Service, Azure IoT Edge, Field Programmable Gate Array (FPGA). For the deployment you'll need the following files:
    * A scoring script file that tells Azure ML service (AMLS) how to call the model
    * An environment file that specifies package dependencies
    * A configuration file that requests the required resources for the container.

### What is a Workspace

The top-level resource for AMLS. It serves as a hub for building and deploying models. You can create a workspace in the Azure portal, or you can create and access it using Python from an IDE of your choice.

All models must be registered in the workspace for future use. Together with the scoring scripts, you create an image for deployment.

The workspace stores experiment objects that are required for each model you create. Additionally, it saves your compute targets. You can track training runs.

### What is an Image

An image has three components:
* A model and scoring script or application
* An environment file that declares the dependencies that are needed by the model, scoring script or application
* A configuration file that describes the necessary resources to execute the model

### What is a Datastore

An abstraction over an Azure Storage account. Each workspace has a registered, default datastore that you can use right away, but you can register other Azure Blob or File storage containers as a datastore.

### What is a Pipeline

A ML pipeline is a tool to create and manage workflows during a DS process: data manipulation, model training and testing, and deployment phases. Each step of the process can run unattended in different compute targets, which makes it easier to allocate resources.

### What is a Compute Target

It is the compute resource used to run a training script or to host a service deployment. It's attached to a workspace. Other than the local machine, users of the workspace share compute targets.

### What is a deployed Web Service

For a deployed web service, you have the choice of Container Instances, AKS or FPGAs. With your model, script and associated files all set in the image, you can create a web service.

> OBS: The trainer said that it's better to create a new resource group for each ML workspace, as there are several resources involved and we don't want to end up with a mess.

> OBS2: In Azure Notebooks, change the Python kernel to 3.6, as the default is just set to Python 3! This can raise errors when importing the Azure ML libraries.
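Since the workspace is the entry point for everything above, a minimal sketch of creating one and reconnecting to it with the SDK may help; the workspace name, resource group and region below are placeholders, not values from these notes:

```python
from azureml.core import Workspace

# Minimal sketch: create a workspace (names and region are assumptions)
ws = Workspace.create(name='aml-workspace',
                      subscription_id='<subscription-id>',
                      resource_group='aml-rg',
                      create_resource_group=True,
                      location='westeurope')

# Persist the config locally so later sessions can simply reconnect
ws.write_config()
ws = Workspace.from_config()
```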
### Interact with ML Service

We can interact via Azure Notebooks (linking the subscription), a visual interface, Notebook VMs and automated ML.

You can run notebooks in the workspace without any kind of authentication, and they are stored in the workspace, which makes it useful for working in teams.

If you use JupyterLab, you can link a repo in Azure DevOps.

If you register a model with the same name multiple times, a new version of the model is registered, with the version number incremented.

> OBS: Scalability is enabled during training, but once the code is deployed it is flat. Also, it is only supported as an Azure App Service so you keep paying even if it is idle.
--------------------------------------------------------------------------------
/03-get-started-with-ADSVM/README.md:
--------------------------------------------------------------------------------
# Get started with Machine Learning with an Azure Data Science Virtual Machine

## Introduction to the Azure Data Science Virtual Machine (DSVM)

<details>
6 | 7 | Show content 8 | 9 |


### Learning objectives

* Learn about the types of Data Science Virtual Machines
* Learn what type of DSVM to use for each type of use case

### When to use an Azure DSVM?

An Azure DSVM makes it easy to maintain consistent data science environments as they evolve.

It also provides samples in Jupyter Notebooks, plus scripts for Python and R, to learn about Microsoft and Azure ML services:
* How to connect to cloud datastores with Azure ML and how to build models.
* Deep learning samples using Microsoft Cognitive Services.
* How to compare Microsoft R with open source R, and how to operationalize models with ML Services in SQL Server.

### Types of Azure DSVM

* **Windows vs. Linux**: Windows Server 2012 and 2016 vs. Ubuntu 16.04 LTS and CentOS 7.4
* **Deep Learning**: The Deep Learning DSVM comes preconfigured and preinstalled with many tools, and you can select high-speed GPU-based machines.
* **Geo AI DSVM**: a VM optimized for geospatial and location data. It has the ArcGIS Pro system integrated.

### Use cases for a DSVM

* **Collaborate as a team using DSVMs**: working with cloud-based resources that share the same configuration helps to ensure that all team members have a consistent development environment.
* **Address issues with DSVMs**: issues related to environment mismatches are reduced, for example when giving DSVMs to students in a class.
* **Use on-demand elastic capacity for large-scale projects**: data science environments can be replicated on demand, allowing high-powered computing resources to be spun up when needed.
* **Experiment and evaluate on a DSVM**: as they are easy to create, DSVMs can be used for demos and short experiments.
* **Learn about DSVMs and deep learning**: the flexibility of the underlying compute power (scaling up or switching to GPU) makes it easy to train all kinds of models.

### Knowledge Check

1. Which of the following is a reason to use an Azure Data Science Virtual Machine?

    * You want to create an Azure Databricks workspace.
    * You want to get a jump-start on data science work.
    * You want to deploy a web application to it.

48 | 49 | Answer 50 | 51 |

52 | The purpose of Data Science Virtual Machines is to give a data scientist the tools they need, pre-installed, and ready to go. 53 |

54 |
55 | 56 | 1. Which of the following is installed on a Data Science Virtual Machine? 57 | 58 | * Azure Data Warehouse 59 | * Jupyter Notebook 60 | * Azure Machine Learning Studio 61 | 62 |
63 | 64 | Answer 65 | 66 |

67 | Jupyter Notebook is installed on Data Science Virtual Machines and provides a great data science development tool. 68 |

69 |
70 | 71 |

72 |
73 | 74 | --- 75 | 76 | ## Explore the types of Azure Data Science Virtual Machines 77 | 78 |
79 | 80 | Show content 81 | 82 |


### Learning objectives

* Learn how to create Windows-based and Linux-based DSVMs
* Explore the Deep Learning Data Science Virtual Machines
* Work with Geo AI Data Science Virtual Machines

### Windows-Based DSVMs

You can use the Windows-based DSVM to jump-start your data science projects. You don't pay for the DSVM image, just usage fees.

The image comes with a bunch of features:
* Tutorials
* Support for Office
* SQL Server integrated with ML Services
* Preinstalled languages: R, Python, SQL, C#
* Data science tools such as the Azure ML SDK for Python, Anaconda, Jupyter...
* ML tools such as Azure Cognitive Services support, H2O, TensorFlow, Weka...

### Deep Learning Virtual Machine

Deep Learning Virtual Machines (DLVMs) use GPU-based hardware that provides increased mathematical calculation speed for faster model training. The image can be either Windows or Ubuntu.

The DLVM simplifies the tool selection process by including preconfigured tools for different situations.

### Geo AI Data Science VM with ArcGIS

Both Python and R work with ArcGIS Pro, and are preconfigured on the Geo AI Data Science VM.

The image includes a large set of tools, such as deep learning frameworks (Keras, Caffe2) and standalone Spark.

> OBS: Tools need to be compatible with GPUs.

It also comes bundled with IDEs such as Visual Studio or PyCharm.

Examples of Geo AI include:

* Real-time results of traffic conditions
* Driver availability in Uber or Lyft at any time
* Deep learning for disaster response
* Urban growth prediction

### Knowledge Check

1. You want to learn about how to use Azure services related to machine learning with as little fuss as possible installing and configuring software and locating demonstration scripts. Which Data Science Virtual Machine type would best suit these needs?

    * Deep Learning DSVM
    * Windows 2016 DSVM
    * Geo AI Data Science VM with ArcGIS DSVM

134 | 135 | Answer 136 | 137 |

138 | The Windows 2016 gives you the most popular data science tools installed and configured and includes many sample scripts for using Azure machine learning related services. 139 |

140 |
141 | 142 | 2. You need to train deep learning models to do image recognition using a lot of training data in the form of images. Which DSVM configuration would be best for the fastest model training? 143 | 144 | * Windows 2016 with standard CPUs. 145 | * Geo AI Data Science VM with ArcGIS DSVM 146 | * Deep Learning VM which is configured to use GPUs. 147 | 148 |
149 | 150 | Answer 151 | 152 |

153 | The DSVM includes all the software needed for training deep learning models and use graphic processor units (GPUs) which perform calculations much faster than standard CPUs. 154 |

155 |
156 | 157 |

158 |
159 | 160 | --- 161 | 162 | ## Provision and use an Azure Data Science Virtual Machine 163 | 164 |
165 | 166 | Show content 167 | 168 |

169 | 170 | This module is based on exercise, so it's best followed [here](https://docs.microsoft.com/en-us/learn/modules/provision-and-use-azure-dsvm/). 171 | 172 | ### Knowledge Check 173 | 174 | 1. What method did we use to log into a Windows-Based Data Science VM? 175 | 176 | * Remote Desktop Protocol (RDP) 177 | * HTTP 178 | * ODBC 179 | 180 |

181 | 182 | Answer 183 | 184 |

185 | RDP: A step by step walk through explains all the steps to connect to a Windows-based DSVM. 186 |

187 |
188 | 189 | 1. What development environment has pre-loaded sample code available? 190 | 191 | * PyCharm 192 | * Zeppelin Notebook 193 | * Jupyter Notebook 194 | 195 |
196 | 197 | Answer 198 | 199 |

200 | Jupyter: We showed that many sample notebooks are installed that demonstrate how to use Microsoft Machine Learning technologies. 201 |

202 |
203 | 204 | 1. What type of Jupyter Notebook cell is used to provide annotations? 205 | 206 | * Code cell 207 | * Markdown cell 208 | * Raw cell 209 | 210 |
211 | 212 | Answer 213 | 214 |

215 | Markdown support rich formatting and is ideal for adding comments and annotations to your notebooks. 216 |

217 |
218 | 219 |

220 |
221 | 222 | --- -------------------------------------------------------------------------------- /01-explore-AI-solution-development/README.md: -------------------------------------------------------------------------------- 1 | # Explore AI solution development with data science services in Azure 2 | 3 | ## Introduction to Data Science in Azure 4 | 5 |
6 | 7 | Show content 8 | 9 |


### Learning Objectives

* Learn the steps involved in the data science process
* Learn the machine learning modeling cycle
* Learn data cleansing and preparation
* Learn model feature engineering
* Learn model training and evaluation
* Learn about model deployment
* Discover the specialized roles in the data science process

### The Data Science process

![img](../assets/img/ds-process.png)

An iterative process that starts with a question arising from business needs and understanding.

### What is modeling?

Modeling is a cycle of data and business understanding. You start by exploring your assets, in this case data, with **Exploratory Data Analysis (EDA)**; from that point feature engineering starts, and finally a model is trained on top: an algorithm that learns information and provides a probabilistic prediction.

In the end, the model is evaluated to check where it is accurate and where it is failing, so that the behavior can be corrected.

### Choose a use case

Identify the problem (business understanding) -> Define the project goals -> Identify data sources

### Data preparation

Data cleansing and EDA are vital to the modeling process: they give insights into what data is or is not useful, and what needs to be corrected or taken into account. Understanding the data is one of the most vital steps in the data science cycle.

### Feature engineering

Ask what extra knowledge can be extracted by combining existing features into new ones.

### Model training

Split data -> Cross-validate data -> Obtain probabilistic prediction

### Model evaluation

**Hyperparameters** are parameters used in model training that cannot be learned by the training process. These parameters must be set before model training begins.

To evaluate the results you need to set up a metric to compare different runs, such as accuracy or MSE.

### Model deployment

Model deployment is the final stage of the data science process. It is often done by a developer, and is usually not part of the data scientist role.

### Specialized roles in the Data Science process

In the data science process, there are specialists in each of the steps:

Business Analyst or Domain Expert, Data Engineer, Developer and Data Scientist.

### Knowledge Check

1. Which of the following is not a specialized role in the Data Science Process?

    * Database Administrator
    * Data Scientist
    * Data Engineer

74 | 75 | Answer 76 | 77 |

78 | DBA 79 |

80 |
81 | 82 | 1. Model feature engineering refers to which of the following? 83 | 84 | * Selecting the best model to use for the experiment. 85 | * Determine which data elements will help in making a prediction and preparing these columns to be used in model training. 86 | * Exploring the data to understand it better. 87 | 88 |
89 | 90 | Answer 91 | 92 |

93 | Feature engineering involves the data scientist determining which data to use in model training and preparing the data so it can be used by the model. 94 |

95 |
96 | 97 | 1. The Model deployment involves. 98 | 99 | * Calling a model to score new data. 100 | * Training a model. 101 | * Copying a trained model and its code dependencies to an environment where it will be used to score new data. 102 | 103 |
104 | 105 | Answer 106 | 107 |

108 | Deploying a model makes it available for use. 109 |

110 |
111 | 112 |

113 |
114 | 115 | --- 116 | 117 | ## Choose the Data Science service in Azure you need 118 | 119 |
120 | 121 | Show content 122 | 123 |


### Learning Objectives

* Differentiate each of the Azure machine learning products.
* Identify key features of each product.
* Describe the use cases for each service.

### Machine Learning options on Azure

We have the following services:

* **Azure Machine Learning Studio**: GUI-based solution, best chosen for learning. It includes all the data science pipeline steps, from importing and playing around with data to different deployment options. Everything is based on a drag-and-drop method.
* **Azure Databricks**: great collaboration platform with a powerful notebook interface, job scheduling, AAD integration and granular security control. It allows you to create and modify Spark clusters.
* **Azure Data Science Virtual Machine**: preconfigured VMs with lots of popular ML tools preinstalled. You can connect directly to the machine via SSH or remote desktop. There are different types of machines:
    * Linux and Windows OS, where Windows supports scalability with ML in SQL Server and Linux does not.
    * Deep Learning VM, offering deep learning tools.
    * Geo AI DSVM, with specific tools for working with spatial data. Includes ArcGIS.
* **SQL Server Machine Learning Services**: an add-on which runs on SQL Server on-premises and supports scaling up and high performance of Python and R code. It includes several advantages:
    * Security, as the processing occurs closer to the data source.
    * Performance
    * Consistency
    * Efficiency, as you can use integrated tools such as Power BI to report on and analyze results.
* **Spark on HDInsight**: HDInsight is a PaaS offering of Apache Hadoop. It provides several benefits:
    * Easy and fast creation and modification of clusters on demand.
    * Usage of ETL tools in the cluster with MapReduce and Spark.
    * Compliance standards with Azure Virtual Network, encryption and integration with Azure AD.
    * Integration with other Azure services, such as ADLS or ADF.

    HDInsight Spark is an implementation of Apache Spark on Azure HDInsight.
* **Azure Machine Learning Service**: supports the whole data science pipeline, integrates and scales processing, and automates the following tasks:

    * Model management
    * Model training
    * Model selection
    * Hyper-parameter tuning
    * Feature selection
    * Model evaluation

    It supports open-source technologies such as Python and common data science tools. It makes it easier to containerize and deploy the model and to automate several tasks. The platform is designed to support three roles:

    * Data Engineer, to ingest and prepare data for analysis either locally or on Azure containers.
    * Data Scientist, to apply the modeling tools and processes. AMLS supports sklearn, TensorFlow, PyTorch, Microsoft Cognitive Toolkit and Apache MXNet.
    * Developer, to create an image of the built and trained model with all the needed components. An **image** contains:
    1. The model
    1. A scoring script or application which passes input to the model and returns the output of the model
    1. The required dependencies, such as Python scripts or packages needed by the model or scoring script.

    Images can be deployed as Docker images or field programmable gate array (FPGA) images, and can be deployed to a web service (running in Azure Container Instances, FPGA or Azure Kubernetes Service) or to an IoT module (IoT Edge).
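To make the image components concrete, here is a minimal sketch of the scoring script piece, using the init/run pattern that AMLS web services expect; the model name `classification_model` and the use of scikit-learn with joblib are assumptions for illustration:

```python
# score.py - minimal sketch of a scoring script (assumes a scikit-learn
# model registered under the placeholder name 'classification_model')
import json
import numpy as np
import joblib
from azureml.core.model import Model

def init():
    # Runs once when the service container starts: load the model into memory
    global model
    model_path = Model.get_model_path('classification_model')
    model = joblib.load(model_path)

def run(raw_data):
    # Runs per request: pass the input to the model and return its output
    data = np.array(json.loads(raw_data)['data'])
    predictions = model.predict(data)
    return json.dumps(predictions.tolist())
```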
172 | 173 | > OBS: Scalability is enabled during training, but once the code is deployed it is flat. Also, it is only supported as an Azure App Service so you keep paying even if it is idle. 174 | 175 | 176 | ### Knowledge Check 177 | 178 | 1. Azure Machine Learning service supports which programming language. 179 | 180 | * R 181 | * Julia 182 | * Python 183 | 184 |

185 | 186 | Answer 187 | 188 |

189 | Python is supported by Azure Machine Learning service. 190 |

191 |
192 | 193 | 1. Azure Databricks is built on which Big Data platform? 194 | 195 | * Azure SQL Data Warehouse 196 | * SQL Server 197 | * Apache Spark 198 | 199 |
200 | 201 | Answer 202 | 203 |

204 | Azure Databricks makes using Spark easier. 205 |

206 |
207 | 208 | 209 | 1. Which is not an operating system available for an Azure Data Science Virtual Machine? 210 | 211 | * Windows 212 | * Linux 213 | * Apple iOS 214 | 215 |
216 | 217 | Answer 218 | 219 |

220 | Data Science VMs running Apple iOS are not available. 221 |

222 |
223 | 224 | 225 |

226 |
227 | 228 | --- -------------------------------------------------------------------------------- /02-build-AI-solutions-with-AMLS/README.md: -------------------------------------------------------------------------------- 1 | # Build AI solutions with Azure Machine Learning service 2 | 3 | ## Introduction to Azure Machine Learning service 4 | 5 |
6 | 7 | Show content 8 | 9 |


### Learning Objectives

* Learn the difference between Azure Machine Learning Studio and Azure Machine Learning service
* See how Azure Machine Learning service fits into the data science process
* Learn the concepts related to an Azure Machine Learning service experiment
* Explore the Azure Machine Learning service pipeline
* Train a model using Azure Machine Learning service

### Azure Machine Learning Service within a data science process

Environment Set Up -> Data Preparation -> Experimentation -> Deployment

* **Environment setup**: the first step is creating a **Workspace**, where you store your ML work. An **Experiment** is created within the workspace to store information about runs for your model. You can have multiple experiments in one workspace. You can interact with the environment from different IDEs such as PyCharm or Azure Notebooks.
* **Data Preparation**: explore, analyze and visualize the sources. You can use any tool. Azure provides the `azureml.dataprep` SDK.
* **Experimentation**: an iterative process of training and testing. With AMLS you can run the model in Azure containers. You need to create and configure a compute target object used to provision compute resources.
* **Deployment**: create a Docker image that will get deployed to Azure Container Instances (you could also choose AKS, Azure IoT or FPGA).

### Create a machine learning experiment

![img](../assets/img/key-components-ml-workspace.png)

* **Workspace**: the top-level resource in AMLS, where you build and deploy your models. With a registered model and scoring scripts you can create an image for deployment. It stores experiment objects which save compute targets, track runs, and hold logs, metrics and outputs.
* **Image**: it has three key components:
    1. A model and scoring script or application
    1. An environment file that declares the dependencies.
    1. A configuration file with the necessary resources to execute the model.
* **Datastore**: an abstraction over an Azure Storage account. Each workspace has a default one, but you can add Blob or File storage containers.
* **Pipeline**: a tool to create and manage workflows during a data science process. Each step can run unattended in different compute targets, which makes it easier to allocate resources.
* **Compute target**: the resource used to run a training script or to host a service deployment. It is attached to a workspace.
* **Deployed Web service**: you can choose between ACI, AKS or FPGA. With the model, script and image files you can create a Web service.
* **IoT module**: it is a Docker container and has the same needs as a Web Service. It enables monitoring of a hosting device.

### Creating a pipeline

Some features of Azure ML pipelines are:
* Schedule tasks and executions,
* You can allocate different compute targets for different steps and coordinate multiple pipelines,
* You can reuse pipeline scripts and customize them,
* You can record and manage input, output, intermediate tasks and data.

### Knowledge Check

1. The Azure Machine Learning service SDK is which of the following?

    * A visual machine learning development portal.
    * A Python package containing functions to use the Azure ML service.
    * A special type of Azure virtual machine.

60 | 61 | Answer 62 | 63 |

64 | The modules provided by the Azure ML SDK provide the functions you need to work with the service in Python. 65 |

66 |
67 | 68 | 1. Which of the following is the underlying technology of the Azure Machine Learning service? 69 | 70 | * Spark 71 | * Hadoop 72 | * Containerization including Docker and Kubernetes 73 | 74 |
75 | 76 | Answer 77 | 78 |

79 | Containerization is a key technology used by the Azure ML service. 80 |

81 |
82 | 83 | 1. Which of the following is not a component of an Azure Machine Learning service workspace image? 84 | 85 | * An R package 86 | * An environment file that declares dependencies that are needed by the model, scoring script or application. 87 | * A model scoring script 88 | 89 |
90 | 91 | Answer 92 | 93 |

94 | R packages are not part of an Azure Machine Learning service workspace image. 95 |

96 |
97 | 98 | 1. Which of the following descriptions accurately describes Azure Machine Learning? 99 | 100 | * A Python library that you can use as an alternative to common machine learning frameworks like Scikit-Learn, PyTorch, and Tensorflow. 101 | * A cloud-based platform for operating machine learning solutions at scale. 102 | * An application for Microsoft Windows that enables you to create machine learning models by using a drag and drop interface. 103 | 104 |
105 | 106 | Answer 107 | 108 |

109 | Cloud based Platform: Azure Machine Learning enables you to manage machine learning model data preparation, training, validation, and deployment. It supports existing frameworks such as Scikit-Learn, PyTorch, and Tensorflow; and provides a cross-platform platform for operationalizing machine learning in the cloud. 110 |

111 |
112 | 113 | 1. Which edition of Azure Machine Learning workspace should you provision if you only plan to use the graphical Designer tool to train machine learning models? 114 | 115 | * Basic 116 | * Enterprise 117 | 118 |
119 | 120 | Answer 121 | 122 |

123 | The visual Designer tool is not available in Basic edition workspaces, so you must create an Enterprise workspace to use it. 124 |

125 |
126 | 127 | 1. You are using the Azure Machine Learning Python SDK to write code for an experiment. You must log metrics from each run of the experiment, and be able to retrieve them easily from each run. What should you do? 128 | 129 | * Add print statements to the experiment code to print the metrics. 130 | * Save the experiment data in the outputs folder. 131 | * Use the log* methods of the Run class to record named metrics. 132 | 133 |
134 | 135 | Answer 136 | 137 |

138 | To record metrics in an experiment run, use the Run.log* methods. 139 |

140 |
141 | 142 | 143 | 144 |

145 |
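Since the last knowledge check above points to the `Run.log*` methods, a minimal sketch of creating an experiment and logging a metric may help tie the concepts together; the workspace config and experiment name are placeholder assumptions:

```python
from azureml.core import Workspace, Experiment

# Minimal sketch: an inline experiment run that logs a named metric
ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name='training_experiment')

run = experiment.start_logging()
run.log('accuracy', 0.92)  # named metric, retrievable for each run
run.complete()

# Retrieve the logged metrics from the run
print(run.get_metrics())
```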
146 | 147 | --- 148 | 149 | ## Train a local ML model with Azure Machine Learning service 150 | 151 |
152 | 153 | Show content 154 | 155 |


### Learning Objectives

* Use an Estimator to run a model training script as an Azure Machine Learning experiment.
* Create reusable, parameterized training scripts.
* Register models, including metadata such as performance metrics.

> As this is a rather practical module, you can refer to the labs notebooks or directly to Azure's docs.

### What is HyperDrive

HyperDrive is a built-in service that automatically launches multiple experiments in parallel, each with a different parameter configuration. Azure Machine Learning then automatically finds the configuration that results in the best performance, as measured by the metric you choose. The service will terminate poorly performing training runs to minimize compute resource usage. (A minimal configuration sketch appears at the end of this module.)

### Azure Machine Learning estimators

In Azure Machine Learning, you can use a **Run Configuration** and a **Script Run Configuration** to run a script-based experiment that trains a machine learning model. However, these configurations may end up being really complex, so another abstraction layer is added: an **Estimator** encapsulates a run configuration and a script configuration in a single object.

There are default Estimators for frameworks such as Scikit-Learn, PyTorch and TensorFlow.

#### Writing a Script to Train a Model

After training a model, it should be saved in the **outputs** directory. For example, with scikit-learn:

```python
import os

from azureml.core import Run
import joblib

# Get the experiment run context
run = Run.get_context()

# Train and test...

# Save the trained model
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/model.pkl')

run.complete()
```

#### Using an Estimator

You can use a generic Estimator class to define a run configuration for a training script like this:

```python
from azureml.train.estimator import Estimator
from azureml.train.sklearn import SKLearn
from azureml.core import Experiment

# Create an estimator
estimator = Estimator(source_directory='experiment_folder',
                      entry_script='training_script.py',
                      compute_target='local',
                      conda_packages=['scikit-learn']
                      )

# Or use a framework specific estimator as
estimator = SKLearn(source_directory='experiment_folder',
                    entry_script='training_script.py',
                    compute_target='local'
                    )

# Create and run an experiment
experiment = Experiment(workspace = ws, name = 'training_experiment')
run = experiment.submit(config=estimator)
```

### Using script parameters

Used to increase the flexibility of script-based experiments.

These parameters are read as usual Python parameters in scripts.
So for example, after setting the `Run`: 227 | 228 | ```python 229 | # Set regularization hyperparameter 230 | parser = argparse.ArgumentParser() 231 | parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01) 232 | args = parser.parse_args() 233 | reg = args.reg 234 | ``` 235 | 236 | To use parameters in **Estimators**, add the `script_params` value as a dict: 237 | 238 | ```python 239 | # Create an estimator 240 | estimator = SKLearn(source_directory='experiment_folder', 241 | entry_script='training_script.py', 242 | script_params = {'--reg_rate': 0.1}, 243 | compute_target='local' 244 | ) 245 | ``` 246 | 247 | ### Registering models 248 | 249 | After running an experiment that trains a model you can use a reference to the Run object to retrieve its outputs, including the trained model. 250 | 251 | #### Retrieving Model Files 252 | 253 | From the `run` object we can get all the files that it generated with `run.get_file_names()` and download the models as (recall how we said that usually those were stored under `outputs/`) 254 | 255 | ```python 256 | run.download_file(name='outputs/model.pkl', output_file_path='model.pkl') 257 | ``` 258 | 259 | #### Registering a Model 260 | 261 | With `Model.register()` we can save different versions of our models: 262 | 263 | ```python 264 | from azureml.core import Model 265 | 266 | model = Model.register(workspace=ws, 267 | model_name='classification_model', 268 | model_path='model.pkl', # local path 269 | description='A classification model', 270 | tags={'dept': 'sales'}, 271 | model_framework=Model.Framework.SCIKITLEARN, 272 | model_framework_version='0.20.3') 273 | ``` 274 | 275 | Or the same by referencing the `run` object: 276 | 277 | ```python 278 | run.register_model( model_name='classification_model', 279 | model_path='outputs/model.pkl', # run outputs path 280 | description='A classification model', 281 | tags={'dept': 'sales'}, 282 | model_framework=Model.Framework.SCIKITLEARN, 283 | model_framework_version='0.20.3') 284 | ``` 285 | 286 | We can then view all the models we saved by using: 287 | 288 | ```python 289 | for model in Model.list(ws): 290 | # Get model name and auto-generated version 291 | print(model.name, 'version:', model.version) 292 | ``` 293 | 294 | ### Knowledge Check 295 | 296 | 1. An Experiment contains which of the following? 297 | 298 | * A composition of a series of runs 299 | * A Docker image 300 | * The data used for model training 301 | 302 | 303 |

304 | 305 | Answer 306 | 307 |

308 | A composition of a series of runs: Azure ML Studio provides a visual drag and drop machine learning development portal but that is a separate offering. 309 |

310 |

1. A run refers to which of the following?

    * Python code for a specific task, such as training a model or tuning hyperparameters. The run does the job of logging metrics and uploading the results to the Azure platform.
    * A set of containers managed by Kubernetes to run your models.
    * A Spark cluster.
322 | 323 | Answer 324 | 325 |

326 | Python code for a specific task such as training a model or tuning hyperparameters. Run does the job of logging metrics and uploading the results to Azure platform. 327 |

328 |
329 | 330 | 331 | 1. A hyperparameter is which of the following? 332 | 333 | * A model parameter that cannot be learned by the model training process. 334 | * A model feature derived from the source data. 335 | * A parameter that automatically and frequently changes value during a single model training run. 336 | 337 | 338 | 339 |
340 | 341 | Answer 342 | 343 |

344 | Hyperparameters control how the model training executes and must be set before model training. 345 |

346 |
347 | 348 | 349 | 1. Before you can train and run experiments in your code, you must do which of the following? 350 | 351 | * Create a virtual machine 352 | * Log out of the Azure portal 353 | * Write a model scoring script 354 | 355 | 356 |
357 | 358 | Answer 359 | 360 |

361 | Your Python script needs to connect to the Azure ML workspace before you can train and run experiments. 362 |

363 |
364 | 365 | 366 | 1. Which of the following is a technique for determining hyperparameter values? 367 | 368 | * grid searching 369 | * Bayesian sampling 370 | * hyper searching 371 | 372 | 373 | 374 |
375 | 376 | Answer 377 | 378 |


Grid searching is often used by data scientists to find the best hyperparameter value.

381 |
382 | 383 | 1. You have written a script that uses the Scikit-Learn framework to train a model. Which framework-specific estimator should you use to run the script as an experiment? 384 | 385 | * PyTorch 386 | * Tensorflow 387 | * SKLearn 388 | 389 | 390 |
391 | 392 | Answer 393 | 394 |

395 | To run a scikit-learn training script as an experiment, use the generic Estimator estimator or a SKLearn estimator. 396 |

397 |
398 | 399 | 400 | 1. You have run an experiment to train a model. You want the model to be stored in the workspace, and available to other experiments and published services. What should you do? 401 | 402 | * Register the model in the workspace. 403 | * Save the model as a file in a Compute Instance. 404 | * Save the experiment script as a notebook. 405 | 406 |
407 | 408 | Answer 409 | 410 |

411 | To store a model in the workspace, register it. 412 |

413 |
414 | 415 |

416 |
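As promised in the HyperDrive section of this module, a minimal configuration sketch follows; it assumes the `estimator` and `experiment` objects defined earlier, a script parameter named `--reg_rate`, and a logged metric named `accuracy`:

```python
from azureml.train.hyperdrive import (HyperDriveConfig, RandomParameterSampling,
                                      BanditPolicy, PrimaryMetricGoal, uniform)

# Randomly sample the regularization rate; stop runs that trail the best one
param_sampling = RandomParameterSampling({'--reg_rate': uniform(0.01, 1.0)})
early_termination = BanditPolicy(evaluation_interval=1, slack_factor=0.1)

hyperdrive_config = HyperDriveConfig(estimator=estimator,
                                     hyperparameter_sampling=param_sampling,
                                     policy=early_termination,
                                     primary_metric_name='accuracy',
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=20,
                                     max_concurrent_runs=4)

# Submit like any other experiment; a child run is created per configuration
run = experiment.submit(config=hyperdrive_config)
```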
417 | 418 | --- 419 | 420 | 421 | ## Working with Data in Azure Machine Learning 422 | 423 |
424 | 425 | Show content 426 | 427 |


### Learning objectives

* Create and use datastores
* Create and use datasets

### Introduction to datastores

Abstractions for cloud data sources. They hold the connection information and can be used both to read and to write. The different sources can be (sample from [here](https://docs.microsoft.com/en-us/azure/machine-learning/concept-data#access-data-in-storage)):

* Azure Storage (blob and file containers)
* Azure Data Lake Storage
* Azure SQL Database
* Azure Databricks file system (DBFS)

#### Using datastores

Each workspace has two built-in datastores (a blob container plus an Azure Storage file container) used as system storage by AMLS. You have limited use on top of those.

The good part of using external data sources - which is the usual case - is the ability to share data across multiple experiments, regardless of the compute context in which those experiments are running.

You can use the AMLS SDK to store / retrieve data from the datastores.

#### Registering a datastore

To register a datastore, you can either use the UI in AMLS or the SDK:

```python
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Register a new datastore
blob_ds = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='blob_data',
    container_name='data_container',
    account_name='az_store_acct',
    account_key='123456abcde789…'
)
```

#### Managing datastores

Again, managing can be done via the UI or the SDK:

```python
# list
for ds_name in ws.datastores:
    print(ds_name)

# get
blob_store = Datastore.get(ws, datastore_name='blob_data')

# get default
default_store = ws.get_default_datastore()

# set default
ws.set_default_datastore('blob_data')
```

### Use datastores

You can interact directly with a datastore via the SDK and *pass data references* to scripts that need to access data.

> OBS: For blobs to work correctly as a datastore and be accessible in the code to upload / download, the storage account should be Standard / Hot, not Premium!

#### Working directly with a datastore

```python
blob_ds.upload(src_dir='/files',
               target_path='/data/files',
               overwrite=True, show_progress=True)

blob_ds.download(target_path='downloads',
                 prefix='/data',
                 show_progress=True)
```

#### Using data references

When you want to use a datastore in an experiment script, you must pass a data reference to the script. There are the following access modes (a mount sketch follows this list):

* **Download**: contents are downloaded to the compute context.
* **Upload**: the files generated by the experiment are uploaded to the datastore after the run completes.
* **Mount**: when experiments run on a remote compute (not local), you can mount the path.
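As promised above, a minimal sketch of the mount mode, assuming a remote training cluster registered under the placeholder name `aml-cluster`:

```python
# Mount-mode sketch: 'aml-cluster' is a hypothetical remote compute target
data_ref = blob_ds.path('data/files').as_mount()
estimator = SKLearn(source_directory='experiment_folder',
                    entry_script='training_script.py',
                    compute_target='aml-cluster',
                    script_params={'--data_folder': data_ref})
```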

To pass the reference to an experiment script, define the `script_params`:

```python
data_ref = blob_ds.path('data/files').as_download(path_on_compute='training_data')
estimator = SKLearn(source_directory='experiment_folder',
                    entry_script='training_script.py',
                    compute_target='local',
                    script_params = {'--data_folder': data_ref})
```

`script_params` can then be retrieved via `argparse`.

### Introduction to datasets

Datasets are versioned, packaged data objects that can be easily consumed in experiments and pipelines. They are the recommended way to work with data.

Datasets can be based on files in a datastore or on URLs and other resources.

#### Types of dataset

* **Tabular**: useful when we work, for example, with pandas.
* **File**: for unstructured data. The dataset will present a list of paths that can be read as though from the file system. For example, for images in a CNN.

#### Creating and registering datasets

You can use the UI or the SDK to create datasets from files or paths (which can include `*` wildcards).

##### Creating and registering tabular datasets

```python
from azureml.core import Dataset

blob_ds = ws.get_default_datastore()
csv_paths = [(blob_ds, 'data/files/current_data.csv'),
             (blob_ds, 'data/files/archive/*.csv')]
tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
tab_ds = tab_ds.register(workspace=ws, name='csv_table')
```

##### Creating and registering file datasets

```python
from azureml.core import Dataset

blob_ds = ws.get_default_datastore()
file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
file_ds = file_ds.register(workspace=ws, name='img_files')
```

#### Retrieving a registered dataset

You can retrieve datasets through the `datasets` attribute of a `Workspace`, or by calling `get_by_name` or `get_by_id` on the `Dataset` class:

```python
import azureml.core
from azureml.core import Workspace, Dataset

# Load the workspace from the saved config file
ws = Workspace.from_config()

# Get a dataset from the workspace datasets collection
ds1 = ws.datasets['csv_table']

# Get a dataset by name from the datasets class
ds2 = Dataset.get_by_name(ws, 'img_files')
```

#### Dataset versioning

Useful to reproduce experiments with data in the same state. Use the `create_new_version` flag when registering a dataset:

```python
img_paths = [(blob_ds, 'data/files/images/*.jpg'),
             (blob_ds, 'data/files/images/*.png')]
file_ds = Dataset.File.from_files(path=img_paths)
file_ds = file_ds.register(workspace=ws, name='img_files', create_new_version=True)
```

To retrieve a specific version:

```python
img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version=2)
```

### Use datasets

You can read data directly from a dataset, or you can pass a dataset as a named input to a script configuration or estimator.

#### Working with a dataset directly

If you have a reference to a dataset, you can access its contents directly.

```python
df = tab_ds.to_pandas_dataframe()
```

When working with a file dataset, use `to_path()`:

```python
for file_path in file_ds.to_path():
    print(file_path)
```

#### Passing a dataset to an experiment script

When you need to access a dataset in an experiment script, you can pass the dataset as an input to a **ScriptRunConfig** or an **Estimator**:

```python
estimator = SKLearn( source_directory='experiment_folder',
                     entry_script='training_script.py',
                     compute_target='local',
                     inputs=[tab_ds.as_named_input('csv_data')],
                     pip_packages=['azureml-dataprep[pandas]'])
```

Since the script will need to work with a **Dataset** object, you must include either the full **azureml-sdk** package or the **azureml-dataprep** package with the **pandas** extra library in the script's compute environment.

Then, in the experiment:

```python
run = Run.get_context()
data = run.input_datasets['csv_data'].to_pandas_dataframe()
```

Finally, when passing a file dataset, you must specify the access mode:

```python
estimator = Estimator( source_directory='experiment_folder',
                       entry_script='training_script.py',
                       compute_target='local',
                       inputs=[img_ds.as_named_input('img_data').as_download(path_on_compute='data')],
                       pip_packages=['azureml-dataprep[pandas]'])
```

### Knowledge Check

1. You've uploaded some data files to a folder in a blob container, and registered the blob container as a datastore in your Azure Machine Learning workspace. You want to run a script as an experiment that loads the data files and trains a model. What should you do?

    * Save the experiment script in the same blob folder as the data files.
    * Create a data reference for the datastore location and pass it to the script as a parameter.
    * Create global variables for the Azure Storage account name and key in the experiment script.

658 | 659 | Answer 660 | 661 |

662 | To access a path in a datastore in an experiment script, you must create a data reference and pass it to the script as a parameter. The script can then read data from the data reference parameter just like a local file path. 663 |

664 |
665 | 666 | 1. You've registered a dataset in your workspace. You want to use the dataset in an experiment script that is run using an estimator. What should you do? 667 | 668 | * Pass the dataset as a named input to the estimator. 669 | * Create a data reference for the datastore location where the dataset data is stored, and pass it to the script as a parameter. 670 | * Use the dataset to save the data as a CSV file in the experiment script folder before running the experiment. 671 | 672 |
673 | 674 | Answer 675 | 676 |

677 | To access a dataset in an experiment script, pass the dataset as a named input to the estimator. 678 |

679 |
680 | 681 |

682 |
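An end-to-end recap sketch tying this module together, assuming the workspace's default datastore and a local folder `./data` containing CSV files (both placeholders):

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
blob_ds = ws.get_default_datastore()

# Upload local CSVs to the datastore, then register them as a tabular dataset
blob_ds.upload(src_dir='./data', target_path='/data/files', overwrite=True)
tab_ds = Dataset.Tabular.from_delimited_files(path=(blob_ds, 'data/files/*.csv'))
tab_ds = tab_ds.register(workspace=ws, name='csv_table', create_new_version=True)

# Quick check: load a few rows
print(tab_ds.to_pandas_dataframe().head())
```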
683 | 684 | --- 685 | 686 | 687 | ## Working with Compute Contexts in Azure Machine Learning 688 | 689 |
690 | 691 | Show content 692 | 693 |


### Learning objectives

* Create and use environments.
* Create and use compute targets.

### Introduction to environments

Python code runs in the context of a virtual environment that defines the version of the Python runtime to be used, as well as the installed packages available to the code.

#### Environments in Azure Machine Learning

In general, AML handles environment creation, package installation and environment registration for you - usually through the creation of Docker containers. You just need to specify the packages you want. You can also manage the environments yourself if needed.

Environments are encapsulated by the **Environment** class, which you can use to create environments and specify the runtime configuration for an experiment.

#### Creating environments

* **Creating an environment from a specification file**: based on conda or pip. For example, a file named **conda.yml**:

```
name: py_env
dependencies:
  - numpy
  - pandas
  - scikit-learn
  - pip:
    - azureml-defaults
```

Then, create the environment with the SDK:

```python
from azureml.core import Environment

env = Environment.from_conda_specification(name='training_environment',
                                           file_path='./conda.yml')
```

* **Creating an environment from an existing Conda environment**: if you already have a Conda environment defined on the workstation, you can reuse it in AML:

```python
from azureml.core import Environment

env = Environment.from_existing_conda_environment(name='training_environment',
                                                  conda_environment_name='py_env')
```

* **Creating an environment by specifying packages**: using a **CondaDependencies** object:

```python
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

env = Environment('training_environment')
deps = CondaDependencies.create(conda_packages=['scikit-learn','pandas','numpy'],
                                pip_packages=['azureml-defaults'])
env.python.conda_dependencies = deps
```

#### Registering and reusing environments

After you've created an environment, you can register it in your workspace and reuse it for future experiments that have the same Python dependencies.

Register it via `env.register(workspace=ws)` and get the registered environments in a workspace using `Environment.list(workspace=ws)`.

#### Retrieving and using an environment

You can retrieve an environment and assign it to an **Estimator** or a **ScriptRunConfig**:

```python
from azureml.core import Environment, Estimator

training_env = Environment.get(workspace=ws, name='training_environment')
estimator = Estimator(source_directory='experiment_folder',
                      entry_script='training_script.py',
                      compute_target='local',
                      environment_definition=training_env)
```

> OBS: When an experiment based on the estimator is run, Azure Machine Learning will look for an existing environment that matches the definition, and if none is found a new environment will be created based on the registered environment specification.

### Introduction to compute targets

Compute targets are physical or virtual computers on which experiments are run.
You can assign experiments to specific compute targets. This means that you can test on cheaper ones and run individual processes on GPUs if needed.

You pay by use, since compute targets:

* Start on-demand and stop automatically when no longer required.
* Scale automatically based on workload processing needs (for model training).

#### Types of compute

* **Local compute**: great for test and development. The experiment will run where the code is initiated, e.g., your own computer or a VM with Jupyter on top.
* **Training clusters**: multi-node clusters of VMs that automatically scale up or down to meet demand for training workloads. Useful when working with large data or when parallel processing is needed.
* **Inference clusters**: to deploy trained models as production services. They use containerization to enable rapid initialization of compute for on-demand inferencing.
* **Attached compute**: you can attach another Azure-based compute environment to AML, such as another VM or a Databricks cluster. These can be used for certain types of workload.

More info [here](https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target).

### Create compute targets

Can be done via the UI or the SDK. The UI is the most common.

#### Creating a managed compute target with the SDK

Managed targets are managed by AML, e.g., a training cluster.

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute

# Load the workspace from the saved config file
ws = Workspace.from_config()

# Specify a name for the compute (unique within the workspace)
compute_name = 'aml-cluster'

# Define compute configuration
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',
                                                       min_nodes=0, max_nodes=4,
                                                       vm_priority='dedicated')

# Create the compute
aml_cluster = ComputeTarget.create(ws, compute_name, compute_config)
aml_cluster.wait_for_completion(show_output=True)
```

> Priority can be **dedicated**, to use the cluster exclusively, or **low priority**, for a lower cost but with the possibility of being preempted.

#### Attaching an unmanaged compute target with the SDK

Unmanaged instances are defined and managed outside of AML, e.g., a VM or a Databricks cluster.

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, DatabricksCompute

# Load the workspace from the saved config file
ws = Workspace.from_config()

# Specify a name for the compute (unique within the workspace)
compute_name = 'db_cluster'

# Define configuration for existing Azure Databricks cluster
db_workspace_name = 'db_workspace'
db_resource_group = 'db_resource_group'
db_access_token = '1234-abc-5678-defg-90...'
db_config = DatabricksCompute.attach_configuration(resource_group=db_resource_group,
                                                   workspace_name=db_workspace_name,
                                                   access_token=db_access_token)

# Attach the compute
databricks_compute = ComputeTarget.attach(ws, compute_name, db_config)
databricks_compute.wait_for_completion(True)
```

#### Checking for an existing compute target

You can check whether a compute target already exists, and create it only if it doesn't:

```python
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

compute_name = "aml-cluster"

# Check if the compute target exists
try:
    aml_cluster = ComputeTarget(workspace=ws, name=compute_name)
    print('Found existing cluster.')
except ComputeTargetException:
    # If not, create it
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',
                                                           max_nodes=4)
    aml_cluster = ComputeTarget.create(ws, compute_name, compute_config)

aml_cluster.wait_for_completion(show_output=True)
```

More info [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-training-targets).

### Use compute targets

You can use them to run specific workloads:

```python
from azureml.core import Environment
from azureml.train.estimator import Estimator

compute_name = 'aml-cluster'

training_env = Environment.get(workspace=ws, name='training_environment')

estimator = Estimator(source_directory='experiment_folder',
                      entry_script='training_script.py',
                      environment_definition=training_env,
                      compute_target=compute_name)
```

> OBS: When an experiment for the estimator is submitted, the run will be queued while the compute target is started and the specified environment is deployed to it; then the run will be processed on the compute environment.

Instead of working by name, you can also pass a **ComputeTarget** object:

```python
from azureml.core import Environment
from azureml.core.compute import ComputeTarget
from azureml.train.estimator import Estimator

compute_name = 'aml-cluster'
training_cluster = ComputeTarget(workspace=ws, name=compute_name)

training_env = Environment.get(workspace=ws, name='training_environment')

estimator = Estimator(source_directory='experiment_folder',
                      entry_script='training_script.py',
                      environment_definition=training_env,
                      compute_target=training_cluster)
```

### Knowledge Check

1. You're using the Azure Machine Learning Python SDK to run experiments. You need to create an environment from a Conda configuration (.yml) file. Which method of the Environment class should you use?

* create
* create_from_conda_specification
* create_from_existing_conda_environment

**Answer**: Use the create_from_conda_specification method to create an environment from a configuration file. The create method requires you to explicitly specify conda and pip packages, and the create_from_existing_conda_environment method requires an existing environment on the computer.
2. You must create a compute target for training experiments that require a graphical processing unit (GPU). You want to be able to scale the compute so that multiple nodes are started automatically as required. Which kind of compute target should you create?

* Compute Instance
* Training Cluster
* Inference Cluster
**Answer**: Use a training cluster to create multiple nodes of GPU-enabled VMs that are started automatically as needed.
---

## Orchestrating machine learning with pipelines

### Learning objectives

* Create an Azure Machine Learning pipeline.
* Publish an Azure Machine Learning pipeline.
* Schedule an Azure Machine Learning pipeline.

### Introduction to pipelines

A pipeline is a workflow of machine learning tasks in which each task is implemented as a step. Steps can be sequential or parallel, and you can choose a specific compute target for each of them to run on.

A pipeline can be executed as a process by running the pipeline as an experiment.

Pipelines can be triggered via a scheduler or through a REST endpoint.

#### Pipeline steps

There are different types of steps:

* **PythonScriptStep**: runs a specific Python script.
* **EstimatorStep**: runs an estimator.
* **DataTransferStep**: uses Azure Data Factory to copy data between data stores.
* **DatabricksStep**: runs a notebook, script or compiled JAR on Databricks.
* **AdlaStep**: runs a U-SQL job in Azure Data Lake Analytics.

You can find the full list [here](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps?view=azure-ml-py).

#### Defining steps in a pipeline

First, you define the steps and then assemble the pipeline based on those:

```python
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep

# Step to run a Python script
step1 = PythonScriptStep(name = 'prepare data',
                         source_directory = 'scripts',
                         script_name = 'data_prep.py',
                         compute_target = 'aml-cluster',
                         runconfig = run_config)

# Step to run an estimator
step2 = EstimatorStep(name = 'train model',
                      estimator = sk_estimator,
                      compute_target = 'aml-cluster')

from azureml.pipeline.core import Pipeline
from azureml.core import Experiment

# Construct the pipeline
train_pipeline = Pipeline(workspace = ws, steps = [step1, step2])

# Create an experiment and run the pipeline
experiment = Experiment(workspace = ws, name = 'training-pipeline')
pipeline_run = experiment.submit(train_pipeline)
```

### Pass data between pipeline steps

It is not unusual to have steps that depend on the results of previous steps.

#### The PipelineData object

The **PipelineData** object is a special kind of **DataReference** that:

* References a location in a datastore.
* Creates a data dependency between pipeline steps.

It is an intermediary store between two subsequent steps: `step1 -> PipelineData -> step2`.

#### PipelineData step inputs and outputs

To use a **PipelineData** object you must:

1. Define a named **PipelineData** object that references a location in a datastore.
2. Configure the input / output of the steps that use it.
3. Pass the **PipelineData** object as a script parameter in steps that run scripts (and add the `argparse` handling in those scripts, as we do with usual data references - see the sketch below).
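
For the script side, a minimal sketch of the `argparse` handling (the output file name here is illustrative):

```python
# data_prep.py (sketch): receive the PipelineData location via '--folder'
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--folder', type=str, dest='folder')
args = parser.parse_args()

# Write the prepared data into the PipelineData folder
os.makedirs(args.folder, exist_ok=True)
output_path = os.path.join(args.folder, 'prepped.csv')
# ... prepare the data and save it to output_path ...
```

The pipeline-side definition of both steps: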
```python
from azureml.core import Dataset
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep

# Get a dataset for the initial data
raw_ds = Dataset.get_by_name(ws, 'raw_dataset')

# Define a PipelineData object to pass data between steps
data_store = ws.get_default_datastore()
prepped_data = PipelineData('prepped', datastore=data_store)

# Step to run a Python script
step1 = PythonScriptStep(name = 'prepare data',
                         source_directory = 'scripts',
                         script_name = 'data_prep.py',
                         compute_target = 'aml-cluster',
                         runconfig = run_config,
                         # Specify dataset as initial input
                         inputs=[raw_ds.as_named_input('raw_data')],
                         # Specify PipelineData as output
                         outputs=[prepped_data],
                         # Also pass as data reference to script
                         arguments = ['--folder', prepped_data])

# Step to run an estimator
step2 = EstimatorStep(name = 'train model',
                      estimator = sk_estimator,
                      compute_target = 'aml-cluster',
                      # Specify PipelineData as input
                      inputs=[prepped_data],
                      # Pass as data reference to estimator script
                      estimator_entry_script_arguments=['--folder', prepped_data])
```

### Reuse pipeline steps

AML includes caching and reuse features to reduce the time it takes to run some steps.

#### Managing step output reuse

By default, the step output from a previous pipeline run is reused without rerunning the step. This is useful if the scripts, source directories and step settings have not changed at all; otherwise it may lead to stale results.

To control reuse for an individual step, use the `allow_reuse` parameter:

```python
step1 = PythonScriptStep(name = 'prepare data',
                         ...
                         # Disable step reuse
                         allow_reuse = False)
```

#### Forcing all steps to run

You can force all steps to run regardless of individual reuse settings by setting the `regenerate_outputs` parameter at submission time:

```python
pipeline_run = experiment.submit(train_pipeline, regenerate_outputs=True)
```

### Publish pipelines

After you have created a pipeline, you can publish it to create a REST endpoint through which the pipeline can be run on demand.

```python
published_pipeline = pipeline.publish(name='training_pipeline',
                                      description='Model training pipeline',
                                      version='1.0')
```

You can also publish the pipeline on a successful run:

```python
# Get the most recent run of the pipeline
pipeline_experiment = ws.experiments.get('training-pipeline')
run = list(pipeline_experiment.get_runs())[0]

# Publish the pipeline from the run
published_pipeline = run.publish_pipeline(name='training_pipeline',
                                          description='Model training pipeline',
                                          version='1.0')
```

To get the endpoint:

```python
rest_endpoint = published_pipeline.endpoint
print(rest_endpoint)
```

#### Using a published pipeline

To use the endpoint, you need an authorization header with a bearer token from a service principal (or an authenticated user session) that has permission to run the pipeline.
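
As a minimal sketch (assuming an interactive login for simplicity; a service principal works the same way through `ServicePrincipalAuthentication`), the header can be built like this:

```python
from azureml.core.authentication import InteractiveLoginAuthentication

# Get an authorization header for the currently signed-in user
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()
```

With the header in place, the endpoint can be called: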
```python
import requests

response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "run_training_pipeline"})
run_id = response.json()["Id"]
print(run_id)
```

### Use pipeline parameters

To define parameters for a pipeline, create a **PipelineParameter** object for each parameter, and specify each parameter in at least one step.

```python
from azureml.pipeline.core.graph import PipelineParameter

reg_param = PipelineParameter(name='reg_rate', default_value=0.01)

...

step2 = EstimatorStep(name = 'train model',
                      estimator = sk_estimator,
                      compute_target = 'aml-cluster',
                      inputs=[prepped],
                      estimator_entry_script_arguments=['--folder', prepped,
                                                        '--reg', reg_param])
```

> OBS: You must define parameters for a pipeline before publishing it.

#### Running a pipeline with a parameter

After publishing a pipeline with a parameter, you can specify it in the JSON payload of the REST call:

```python
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "run_training_pipeline",
                               "ParameterAssignments": {"reg_rate": 0.1}})
```

### Schedule pipelines

#### Scheduling a pipeline for periodic intervals

To schedule a pipeline to run at periodic intervals, you must define a **ScheduleRecurrence** that determines the run frequency, and use it to create a **Schedule**.

```python
from azureml.pipeline.core import ScheduleRecurrence, Schedule

daily = ScheduleRecurrence(frequency='Day', interval=1)
pipeline_schedule = Schedule.create(ws, name='Daily Training',
                                    description='trains model every day',
                                    pipeline_id=published_pipeline.id,
                                    experiment_name='Training_Pipeline',
                                    # daily schedule
                                    recurrence=daily)
```

#### Triggering a pipeline run on data changes

You can also monitor a specified path on a datastore; any change there triggers a new run.

```python
from azureml.core import Datastore
from azureml.pipeline.core import Schedule

training_datastore = Datastore(workspace=ws, name='blob_data')
pipeline_schedule = Schedule.create(ws, name='Reactive Training',
                                    description='trains model on data change',
                                    pipeline_id=published_pipeline_id,
                                    experiment_name='Training_Pipeline',
                                    datastore=training_datastore,
                                    path_on_datastore='data/training')
```

### Knowledge Check

1. You're creating a pipeline that includes two steps. Step 1 preprocesses some data, and step 2 uses the preprocessed data to train a model. What type of object should you use to pass data from step 1 to step 2 and create a dependency between these steps?

* Datastore
* PipelineData
* Data Reference

**Answer**: To pass data between steps in a pipeline, use a PipelineData object.
2. You've published a pipeline that you want to run every week. You plan to use the Schedule.create method to create the schedule. What kind of object must you create first to configure how frequently the pipeline runs?

* Datastore
* PipelineParameter
* ScheduleRecurrence
**Answer**: You need a ScheduleRecurrence object to create a schedule that runs at a regular interval.
---

## Deploying machine learning models with Azure Machine Learning

### Learning objectives

* Deploy a model as a real-time inferencing service.
* Consume a real-time inferencing service.
* Troubleshoot service deployment.

### Deploying a model as a real-time service

You can deploy a model as a real-time web service to several kinds of compute target:

* Local compute
* Azure ML compute instance
* Azure Container Instance (ACI)
* AKS
* Azure Function
* IoT module

AML uses containers for model packaging and deployment.

#### 1. Register a trained model

After a successful training run, you first need to register the model.

To register from a local file:

```python
from azureml.core import Model

classification_model = Model.register(workspace=ws,
                                      model_name='classification_model',
                                      model_path='model.pkl', # local path
                                      description='A classification model')
```

Or register from the **Run** used to train the model:

```python
run.register_model(model_name='classification_model',
                   model_path='outputs/model.pkl', # run outputs path
                   description='A classification model')
```

#### 2. Define an Inference Configuration

The model will be deployed as a service that consists of:

* A script to load the model and return predictions for submitted data.
* An environment in which the script will be run.

##### Creating an Entry Script (or scoring script)

It is a Python (.py) file that must contain two functions:

* `init()`: called when the service is initialized.
* `run(raw_data)`: called when new data is submitted to the service.

```python
import json
import joblib
import numpy as np
from azureml.core.model import Model

# Called when the service is loaded
def init():
    global model
    # Get the path to the registered model file and load it
    model_path = Model.get_model_path('classification_model')
    model = joblib.load(model_path)

# Called when a request is received
def run(raw_data):
    # Get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])
    # Get a prediction from the model
    predictions = model.predict(data)
    # Return the predictions as any JSON serializable format
    return predictions.tolist()
```

##### Creating an Environment

You can use **CondaDependencies**:

```python
from azureml.core.conda_dependencies import CondaDependencies

# Add the dependencies for your model
myenv = CondaDependencies()
myenv.add_conda_package("scikit-learn")

# Save the environment config as a .yml file
env_file = 'service_files/env.yml'
with open(env_file,"w") as f:
    f.write(myenv.serialize_to_string())
print("Saved dependency info in", env_file)
```

##### Combining the Script and Environment in an InferenceConfig

```python
from azureml.core.model import InferenceConfig

classifier_inference_config = InferenceConfig(runtime= "python",
                                              source_directory = 'service_files',
                                              entry_script="score.py",
                                              conda_file="env.yml")
```
#### 3. Define a Deployment Configuration

Now, select the compute target to deploy to.

> OBS: if deploying to AKS, create the cluster and a compute target for it before deploying.

```python
from azureml.core.compute import ComputeTarget, AksCompute

cluster_name = 'aks-cluster'
compute_config = AksCompute.provisioning_configuration(location='eastus')
production_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
production_cluster.wait_for_completion(show_output=True)
```

With the compute target created, define the deployment config:

```python
from azureml.core.webservice import AksWebservice

classifier_deploy_config = AksWebservice.deploy_configuration(cpu_cores = 1,
                                                              memory_gb = 1)
```

The code to configure an ACI deployment is similar, except that you do not need to explicitly create an ACI compute target, and you must use the deploy_configuration method from the **azureml.core.webservice.AciWebservice** namespace. Similarly, you can use the **azureml.core.webservice.LocalWebservice** namespace to configure a local Docker-based service.
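
For example, a minimal ACI sketch might look like this:

```python
from azureml.core.webservice import AciWebservice

# ACI deployment configuration - no pre-created compute target is needed
classifier_deploy_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                              memory_gb=1)
```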
#### 4. Deploy the Model

```python
from azureml.core.model import Model

model = ws.models['classification_model']
service = Model.deploy(workspace=ws,
                       name = 'classifier-service',
                       models = [model],
                       inference_config = classifier_inference_config,
                       deployment_config = classifier_deploy_config,
                       deployment_target = production_cluster)
service.wait_for_deployment(show_output = True)
```

For ACI or local services, you can omit the deployment_target parameter (or set it to None).

### Consuming a real-time inferencing service

#### Using the Azure Machine Learning SDK

For testing, you can use the AML SDK:

```python
import json

# An array of new data cases
x_new = [[0.1,2.3,4.1,2.0],
         [0.2,1.8,3.9,2.1]]

# Convert the array to a serializable list in a JSON document
json_data = json.dumps({"data": x_new})

# Call the web service, passing the input data
response = service.run(input_data = json_data)

# Get the predictions
predictions = json.loads(response)

# Print the predicted class for each case.
for i in range(len(x_new)):
    print(x_new[i], predictions[i])
```

#### Using a REST Endpoint

You can retrieve the service endpoint via the UI or the SDK:

```python
endpoint = service.scoring_uri
print(endpoint)
```

```python
import requests
import json

# An array of new data cases
x_new = [[0.1,2.3,4.1,2.0],
         [0.2,1.8,3.9,2.1]]

# Convert the array to a serializable list in a JSON document
json_data = json.dumps({"data": x_new})

# Set the content type in the request headers
request_headers = { 'Content-Type':'application/json' }

# Call the service
response = requests.post(url = endpoint,
                         data = json_data,
                         headers = request_headers)

# Get the predictions from the JSON response
predictions = json.loads(response.json())

# Print the predicted class for each case.
for i in range(len(x_new)):
    print(x_new[i], predictions[i])
```

#### Authentication

There are two kinds of authentication:

* **Key**: requests are authenticated by specifying the key associated with the service.
* **Token**: requests are authenticated by providing a JSON Web Token (JWT).

> OBS: By default, authentication is disabled for ACI services, and set to key-based authentication for AKS services (for which primary and secondary keys are automatically generated). You can optionally configure an AKS service to use token-based authentication (which is not supported for ACI services).

You can retrieve the keys for a **WebService** as:

```python
primary_key, secondary_key = service.get_keys()
```

To use a token, the application needs to use service-principal authentication to verify its identity through Azure Active Directory and call the **get_token** method to create a time-limited token.

```python
import requests
import json

# An array of new data cases
x_new = [[0.1,2.3,4.1,2.0],
         [0.2,1.8,3.9,2.1]]

# Convert the array to a serializable list in a JSON document
json_data = json.dumps({"data": x_new})

# Set the content type and authorization in the request headers
request_headers = { "Content-Type":"application/json",
                    "Authorization":"Bearer " + key_or_token }

# Call the service
response = requests.post(url = endpoint,
                         data = json_data,
                         headers = request_headers)

# Get the predictions from the JSON response
predictions = json.loads(response.json())

# Print the predicted class for each case.
for i in range(len(x_new)):
    print(x_new[i], predictions[i])
```

### Troubleshooting service deployment

#### Check the Service State

```python
from azureml.core.webservice import AksWebservice

# Get the deployed service
service = AksWebservice(name='classifier-service', workspace=ws)

# Check its state
print(service.state)
```

> OBS: To view the state of a service, you must use the compute-specific service type (for example AksWebservice) and not a generic WebService object.

#### Review Service Logs

```python
print(service.get_logs())
```

#### Deploy to a Local Container

A quick check for runtime errors can be done by deploying to a local container:

```python
from azureml.core.webservice import LocalWebservice

deployment_config = LocalWebservice.deploy_configuration(port=8890)
service = Model.deploy(ws, 'test-svc', [model], inference_config, deployment_config)
```

You can then test the locally deployed service using the SDK (`service.run(input_data = json_data)`) and troubleshoot runtime issues by making changes to the scoring file and reloading the service without redeploying (this can ONLY be done with a local service):

```python
service.reload()
print(service.run(input_data = json_data))
```
### Knowledge Check

1. You've trained a model using the Python SDK for Azure Machine Learning. You want to deploy the model as a containerized real-time service with high scalability and security. What kind of compute should you create to host the service?

* An Azure Kubernetes Services (AKS) inferencing cluster.
* A compute instance with GPUs.
* A training cluster with multiple nodes.

**Answer**: You should use an AKS cluster to deploy a model as a scalable, secure, containerized service.
2. You're deploying a model as a real-time inferencing service. What functions must the entry script for the service include?

* main() and score()
* base() and train()
* init() and run()
**Answer**: You must implement init and run functions in the entry (scoring) script.
---

## Automate machine learning model selection with Azure Machine Learning

### Learning objectives

> OBS: Azure Machine Learning includes support for automated machine learning through a visual interface in Azure Machine Learning studio for Enterprise edition workspaces only. The SDK is enabled in both Basic and Enterprise editions.

* Use Azure Machine Learning's automated machine learning capabilities to determine the best performing algorithm for your data.
* Use automated machine learning to preprocess data for training.
* Run an automated machine learning experiment.

### Automated machine learning tasks and algorithms

You can automate classification, regression and time series forecasting.

There is a long [list](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-define-task-type) of supported algorithms for each task. By default, automated ML will randomly select from the full range of algorithms, but you can block individual algorithms.

### Preprocessing and featurization

As well as trying different algorithms, automated ML can also apply preprocessing to the data to improve performance:

* **Scaling and Normalization**: applied automatically to prevent any large numeric feature from dominating training.
* **Optional Featurization**: you can choose to apply preprocessing such as:
  * Missing value imputation
  * Categorical encoding
  * Dropping high-cardinality features (such as IDs)
  * Feature engineering (e.g., deriving individual date parts from DateTime features)

More information [here](https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml#preprocessing).

### Running automated machine learning experiments

You can use the UI (Enterprise) or the SDK.

#### Configuring an Automated Machine Learning Experiment

With the SDK you have greater flexibility, and you can set experiment options using the **AutoMLConfig** class:

```python
from azureml.core.runconfig import RunConfiguration
from azureml.train.automl import AutoMLConfig

automl_run_config = RunConfiguration(framework='python')
automl_config = AutoMLConfig(name='Automated ML Experiment',
                             task='classification',
                             primary_metric = 'AUC_weighted',
                             compute_target=aml_compute,
                             training_data = train_dataset,
                             validation_data = test_dataset,
                             label_column_name='Label',
                             featurization='auto',
                             iterations=12,
                             max_concurrent_iterations=4)
```

#### Specifying Data for Training

With the UI, you can just select the training **dataset**. With the SDK, you can submit the data in the following ways:

* Specify a dataset or dataframe of training data that includes features and the label to be predicted.
* Specify a dataset, dataframe, or numpy array of X values containing the training features, with a corresponding y array of label values.

In both cases, you can optionally specify a validation dataset that will be used to validate the model. If it is not provided, cross-validation will be applied.
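
As a sketch, the number of folds can be controlled with the `n_cross_validations` setting (the values here are illustrative):

```python
from azureml.train.automl import AutoMLConfig

# No validation_data supplied, so automated ML cross-validates with 5 folds
automl_config = AutoMLConfig(task='classification',
                             training_data=train_dataset,
                             label_column_name='Label',
                             primary_metric='AUC_weighted',
                             n_cross_validations=5,
                             compute_target=aml_compute)
```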
#### Specifying the Primary Metric

This is one of the most important settings. You can get all the valid metrics for a particular task as follows:

```python
from azureml.train.automl.utilities import get_primary_metrics

get_primary_metrics('classification')
```

You can find a full list of primary metrics [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml).

#### Submitting an Automated Machine Learning Experiment

Automated ML experiments are submitted like any other experiment with the SDK:

```python
from azureml.core.experiment import Experiment

automl_experiment = Experiment(ws, 'automl_experiment')
automl_run = automl_experiment.submit(automl_config)
```

You can monitor the runs in AML Studio or in the Jupyter Notebooks **RunDetails** widget.

#### Retrieving the Best Run and its Model

```python
best_run, fitted_model = automl_run.get_output()
best_run_metrics = best_run.get_metrics()
for metric_name in best_run_metrics:
    metric = best_run_metrics[metric_name]
    print(metric_name, metric)
```

#### Exploring Preprocessing Steps

Automated ML uses scikit-learn pipelines to encapsulate the preprocessing steps. You can view those steps in the fitted model obtained from the best run as shown below:

```python
for step_ in fitted_model.named_steps:
    print(step_)
```

### Knowledge Check

1. You are using automated machine learning to train a model that predicts the species of an iris based on its petal and sepal measurements. Which kind of task should you specify for automated machine learning?

* Regression
* Forecasting
* Classification

**Answer**: Predicting a class requires a classification task.
2. You have submitted an automated machine learning run using the Python SDK for Azure Machine Learning. When the run completes, which method of the run object should you use to retrieve the best model?

* get_output()
* load_model()
* get_metrics()
**Answer**: The get_output method of an automated machine learning run returns the best model and the child run that trained it.
---