├── .github └── PULL_REQUEST_TEMPLATE.md ├── .gitignore ├── CleanupGuide └── README.md ├── Cloud9 ├── README.md └── images │ └── setup-cloud9-terminal.png ├── Introduction └── README.md ├── LICENSE ├── NOTICE ├── NotebookCreation ├── README.md └── images │ ├── console-services.png │ ├── git-info.png │ ├── notebook-instances.png │ ├── notebook-settings.png │ ├── open-notebook.png │ └── role-popup.png ├── README.md ├── Simplify-Workflows └── README.md ├── StudioCreation └── README.md ├── TensorFlow └── README.md ├── contributing ├── CODE_OF_CONDUCT.md └── CONTRIBUTING.md ├── data └── raw_data.csv ├── images ├── cells.png ├── clawfoot_bathtub.jpg ├── overview.png └── region-selection.png ├── modules ├── Distributed_Training_CLI_Console.md ├── Video_Game_Sales_CLI_Console.md └── images │ ├── distrib-dev-environment.png │ ├── distrib-endpoint-config.png │ ├── distrib-endpoint.png │ ├── distrib-model.png │ ├── videogames-cloud9.png │ ├── videogames-endpoint-config-v1.png │ ├── videogames-endpoint-config.png │ ├── videogames-endpoint-v1.png │ ├── videogames-endpoint.png │ ├── videogames-first-cell.png │ ├── videogames-model-v1.png │ ├── videogames-model.png │ └── videogames-next-cell.png └── notebooks ├── Image-classification-transfer-learning.ipynb ├── data_distribution_types.ipynb ├── sentiment-analysis.ipynb ├── tf-2-workflow-smpipelines.ipynb ├── tf-distributed-training.ipynb ├── videogame-sales-cli-console.ipynb └── videogame-sales.ipynb /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | *Issue #, if available:* 2 | 3 | *Description of changes:* 4 | 5 | 6 | By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. 7 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | # Eclipse 3 | .classpath 4 | .project 5 | .settings/ 6 | 7 | # Intellij 8 | .idea/ 9 | *.iml 10 | *.iws 11 | 12 | # Maven 13 | log/ 14 | target/ 15 | 16 | # VIM 17 | *.swp 18 | 19 | # Mac 20 | .DS_Store 21 | .DS_Store? 22 | 23 | # Windows 24 | Desktop.ini 25 | *.lnk 26 | *.cab 27 | *.msi 28 | *.msm 29 | *.msp 30 | $RECYCLE.BIN/ 31 | Thumbs.db 32 | ehthumbs.db 33 | -------------------------------------------------------------------------------- /CleanupGuide/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Cleanup Guide 3 | 4 | To avoid charges for resources you no longer need when you're done with this workshop, you can delete them or, in the case of your notebook instance, stop them. Below is a list of the resources you should check. Please go through them in order to make sure you do not incur charges unnecessarily: 5 | 6 | 7 | - **Endpoints**: These are the clusters of one or more instances serving inferences from your models. If you did not delete them from within a notebook, you can delete them via the SageMaker console. To do so: 8 | 9 | - Click the **Endpoints** link in the left panel. 10 | 11 | - Then, for each endpoint, click the radio button next to it, then select **Delete** from the **Actions** drop down menu. 12 | 13 | - You can follow a similar procedure to delete the related Models and Endpoint configurations. 14 | 15 | 16 | - **SageMaker Studio**: Follow these instructions if you created a SageMaker Studio domain, rather than a SageMaker Notebook Instance. 
SageMaker Studio charges mainly accrue for resources that are running in the SageMaker Studio domain. Accordingly, you have two options: 17 | 18 | - To **shut down** resources in the domain: Follow the instructions for **Shut Down Resources** at https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-run-and-manage-shut-down.html. 19 | 20 | - To **delete** the SageMaker Studio domain: If you do not expect to use the domain, you may completely delete it. Follow the instructions at https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-delete-domain.html. 21 | 22 | 23 | - **Notebook instance**: Follow these instructions if you created a SageMaker notebook instance, rather than a SageMaker Studio domain. You have two options if you do not want to keep the notebook instance running. If you would like to save it for later, you can stop rather than deleting it. 24 | 25 | - To **stop** a notebook instance: Click the **Notebook instances** link in the left pane of the SageMaker console home page. Next, click the **Stop** link under the 'Actions' column to the left of your notebook instance's name. After the notebook instance is stopped, you can start it again by clicking the **Start** link. Keep in mind that if you stop rather than delete it, you will be charged for the storage associated with it. 26 | 27 | - To **delete** a notebook instance: First stop it per the instruction above. Next, click the radio button next to your notebook instance, then select **Delete** from the **Actions** drop down menu. 28 | 29 | 30 | - **S3 Bucket**: If you retain the S3 bucket created for this workshop, you will be charged for storage. To avoid these charges if you no longer wish to use the bucket, you may delete it. To delete the bucket, go to the S3 service console, and locate your bucket's name in the bucket table. Next, click in the bucket table row for your bucket to highlight the table row. At the top of the table, the **Delete Bucket** button should now be enabled, so click it and then click the **Confirm** button in the resulting pop-up to complete the deletion. 31 | 32 | 33 | 34 | -------------------------------------------------------------------------------- /Cloud9/README.md: -------------------------------------------------------------------------------- 1 | # AWS Cloud9 IDE 2 | 3 | Either [**AWS CloudShell**](https://aws.amazon.com/cloudshell/) or [**AWS Cloud9**](https://aws.amazon.com/cloud9/) can be used to easily run Bash scripts in a workshop setting. AWS CloudShell is a browser-based shell that makes it easy to securely manage, explore, and interact with your AWS resources. To run Bash scripts for workshops using CloudShell, simply create raw text script files on your local computer, and then follow the instruction steps for [uploading and running script files](https://docs.aws.amazon.com/cloudshell/latest/userguide/getting-started.html). 4 | 5 | AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser. It includes a code editor, debugger, and terminal. Cloud9 comes prepackaged with essential tools for popular programming languages and the AWS Command Line Interface (CLI) pre-installed so you don’t need to install files or configure your laptop for this workshop. Your Cloud9 environment will have access to the same AWS resources as the user with which you logged into the AWS Management Console. 6 | 7 | If you choose to use Cloud9, take a moment now and setup your Cloud9 development environment. 
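Whichever environment you choose, running a workshop Bash script comes down to creating or uploading the plain-text script file and then executing it from the terminal. As a minimal sketch (the script name below is only a placeholder for whichever script a module asks you to run):

```
chmod +x example-script.sh   # make the uploaded script executable
./example-script.sh          # run it from the terminal prompt
```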
8 | 9 | ## Step-by-step Instructions 10 | 11 | - Go to the AWS Management Console, click **Services** then select **Cloud9** under Developer Tools. 12 | 13 | 14 | - Click **Create environment**. 15 | 16 | 17 | - Enter `Development` into **Name** and optionally provide a **Description**. 18 | 19 | 20 | - Click **Next step**. 21 | 22 | 23 | - You may leave **Environment settings** at their defaults of launching a new **t2.micro** EC2 instance which will be paused after **30 minutes** of inactivity. 24 | 25 | 26 | - Click **Next step**. 27 | 28 | 29 | - Review the environment settings and click **Create environment**. It will take several minutes for your environment to be provisioned and prepared. 30 | 31 | 32 | - Once ready, your IDE will open to a welcome screen. The central panel of the IDE has two parts: a text/code editor in the upper half, and a terminal window in the lower half. Below the welcome screen in the editor, you should see a terminal prompt similar to the following (you may need to scroll down below the welcome screen to see it): 33 | 34 | ![Terminal](./images/setup-cloud9-terminal.png) 35 | 36 | - You can run AWS CLI commands in here just like you would on your local computer. Verify that your user is logged in by running `aws sts get-caller-identity` as follows at the terminal prompt: 37 | 38 | ``` 39 | aws sts get-caller-identity 40 | ``` 41 | 42 | - You’ll see output indicating your account and user information: 43 | 44 | ``` 45 | Admin:~/environment $ aws sts get-caller-identity 46 | 47 | { 48 | "Account": "123456789012", 49 | "UserId": "AKIAI44QH8DHBEXAMPLE", 50 | "Arn": "arn:aws:iam::123456789012:user/Alice" 51 | } 52 | ``` 53 | 54 | 55 | - To create a new text/code file, just click the **+** symbol in the tabs section of the editor part of the IDE. You can do that now, and close the wecome screen by clicking the **x** symbol in the welcome screen tab. 56 | 57 | 58 | - Keep your AWS Cloud9 IDE opened in a browser tab throughout this workshop as we’ll use it for activities like using the AWS CLI and running Bash scripts. 59 | 60 | ## Tips 61 | 62 | Keep an open scratch pad in Cloud9 or a text editor on your local computer for notes. When the step-by-step directions tell you to note something such as an ID or Amazon Resource Name (ARN), copy and paste that into the scratch pad. 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | -------------------------------------------------------------------------------- /Cloud9/images/setup-cloud9-terminal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/Cloud9/images/setup-cloud9-terminal.png -------------------------------------------------------------------------------- /Introduction/README.md: -------------------------------------------------------------------------------- 1 | # Introduction to SageMaker 2 | 3 | In this workshop, we'll work though several examples that demonstrate Amazon SageMaker's core modular components including SageMaker Studio (or notebook instances), hosted training, and hosted model endpoints. Examples are divided into modules. The examples show how Amazon SageMaker can be applied in three of the common categories of machine learning: working with structured data, computer vision, and natural language processing. 
4 | 5 | We'll make use of some of Amazon SageMaker's built-in algorithms, specifically an AWS-optimized version of XGBoost and a deep learning-based image classification algorithm. Built-in algorithms enable you to avoid spending time against algorithm/neural net design, provide conveniences such as reduced need for model tuning, and are meant to handle the scalability and reliability issues that arise when working with large datasets. As a contrast, in one module we'll use a script defining our own custom deep learning model instead of a built-in algorithm. Whether you define your own custom models or use built-in algorithms, all of Amazon SageMaker's features may be used in a similar way, in any combination. 6 | 7 | To summarize, here are some of the key components and features of Amazon SageMaker demonstrated in this workshop: 8 | 9 | - Using **SageMaker Studio** (or Notebook Instances) for Exploratory Data Analysis and prototyping. 10 | - **Choosing different instance types** for different use cases (CPU vs. GPU, model training vs. deployment, etc.). 11 | - **Hosted Training** for large scale model training. 12 | - **Built-in algorithms** designed for web scale and rapid prototyping of data science projects without the need to write a lot of code. 13 | - **Script Mode**, which enables you to use your own custom model definitions and scripts similar to those outside SageMaker, with prebuilt machine learning containers. 14 | - **Hosted Endpoints** for real-time predictions. 15 | - **Batch Transform** for asynchronous, large scale batch inference. 16 | 17 | 18 | 19 | ## Modules 20 | 21 | This workshop is divided into multiple modules. After completing **Preliminaries**, complete the module **Creating a Notebook Environment** next. You can complete the remaining modules in any order. 22 | 23 | - Preliminaries 24 | 25 | - Creating a Notebook Environment 26 | 27 | - Structured Data Use Case: Videogame Sales 28 | 29 | - Computer Vision Use Case: Image Classification 30 | 31 | - Natural Language Processing Use Case: Sentiment Analysis 32 | 33 | - Extra Credit: Automated Workflow for Boston Housing Price Predictions 34 | 35 | 36 | ## Preliminaries 37 | 38 | - Be sure you have completed all of the Prerequisites listed in the [**main README**](../README.md). 39 | 40 | - If you are new to using Jupyter notebooks, read the next section, otherwise you may now skip ahead to the next module. 41 | 42 | ### Jupyter Notebooks: A Brief Overview 43 | 44 | Jupyter is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. With respect to code, it can be thought of as a web-based IDE that executes code on the server it is running on instead of locally. 45 | 46 | There are two main types of "cells" in a notebook: code cells, and "markdown" cells with explanatory text. You will be running the code cells. These are distinguished by having "In" next to them in the left margin next to the cell, and a greyish background. Markdown cells lack "In" and have a white background. In the screenshot below, the upper cell is a markdown cell, while the lower cell is a code cell: 47 | 48 | ![Cells](../images/cells.png) 49 | 50 | To run a code cell, simply click in it, then either click the **Run Cell** button in the notebook's toolbar, or use Control+Enter from your computer's keyboard. 
It may take a few seconds to a few minutes for a code cell to run. You can determine whether a cell is running by examining the `In[]:` indicator in the left margin next to each cell: a cell will show `In [*]:` when running, and `In [a number]:` when complete. 51 | 52 | Please run each code cell in order, and **only once**, to avoid repeated operations. For example, running the same training job cell twice might create two training jobs, possibly exceeding your service limits. 53 | 54 | 55 | ## Creating a Notebook Environment 56 | 57 | SageMaker provides hosted Jupyter notebooks that require no setup, so you can begin processing your training data sets immediately. With a few clicks in the SageMaker console, you can create a fully managed notebook environment, pre-loaded with useful libraries for machine learning. You need only add your data. You have two different options for this workshop. Follow the choice specified by your workshop instructor if you're in a live workshop, or make your own choice otherwise: 58 | 59 | - **SageMaker Studio**: An IDE for machine learning. To create a SageMaker Studio domain for this workshop, follow the instructions at [**Creating an Amazon SageMaker Studio domain**](../StudioCreation), then return here to continue with the next module of the workshop. 60 | 61 | - **SageMaker Notebook Instance**: A managed instance with preinstalled data science tools (though not as fully managed as SageMaker Studio). To create a SageMaker notebook instance for this workshop, follow the instructions at [**Creating a Notebook Instance**](../NotebookCreation), then return here to continue with the next module of the workshop. 62 | 63 | 64 | ## Structured Data Use Case: Videogame Sales 65 | 66 | In this module, we'll use Amazon SageMaker's built-in version of XGBoost to make predictions based on structured data related to the videogame industry. Assuming you have cloned this repository into your notebook environment (which you should do if you haven't), open the `notebooks` directory of the repository and click on the `videogame-sales.ipynb` notebook to open it. Make sure you are using the `Python 3 (Data Science)` kernel if you're using SageMaker Studio. 67 | 68 | When you're finished, return here to move on to the next module. 69 | 70 | 71 | ## Computer Vision Use Case: Image Classification 72 | 73 | This module uses Amazon SageMaker's built-in Image Classification algorithm. Assuming you have cloned this repository into your notebook environment (which you should do if you haven't), open the `notebooks` directory of the repository and click on the `Image-classification-transfer-learning.ipynb` notebook to open it. Make sure you are using the `Python 3 (Data Science)` kernel if you're using SageMaker Studio. 74 | 75 | When you're finished, return here to move on to the next module. 76 | 77 | 78 | ## Natural Language Processing Use Case: Sentiment Analysis 79 | 80 | In contrast to the previous modules, which used some of Amazon SageMaker's built-in algorithms, in this module we'll use a deep learning framework within Amazon SageMaker with our own script defining a custom model. Assuming you have cloned this repository into your notebook environment (which you should do if you haven't), open the `notebooks` directory of the repository and click on the `sentiment-analysis.ipynb` notebook to open it. Make sure you are using the `Python 3 (Data Science)` kernel if you're using SageMaker Studio. 
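As a reminder, the modules above assume this repository has been cloned into your notebook environment. If it isn't there yet, one way to get it is to open a terminal (for example, **File** -> **New** -> **Terminal** in SageMaker Studio) and run:

```
git clone https://github.com/awslabs/amazon-sagemaker-workshop.git
```

The `notebooks` directory referenced in each module is at the top level of the cloned repository.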
81 | 82 | When you're finished, return here and go on to the Extra Credit module or Cleanup Guide. 83 | 84 | 85 | ## Extra Credit: Automated Workflow for Boston Housing Price Predictions 86 | 87 | For extra credit, this module dives deeper into how to create a complete, automated workflow in Amazon SageMaker for your custom models. In particular, we'll preprocess data with SageMaker Processing, prototype training and inference code with Local Mode, use Automatic Model Tuning, deploy the tuned model to a real time endpoint, and examine how SageMaker Pipelines can automate setting up this workflow for a production environment. Assuming you have cloned this repository into your notebook environment (which you should do if you haven't), open the `notebooks` directory of the repository and click on the `tf-2-workflow-smpipelines.ipynb` notebook to open it. NOTE: if you are using SageMaker Studio, skip the Local Mode sections of the example. 88 | 89 | 90 | ## Cleanup 91 | 92 | If you are using your own AWS account rather than one provided at an AWS-run event: To avoid charges for endpoints and other resources you might not need after the workshop, please refer to the [**Cleanup Guide**](../CleanupGuide). 93 | 94 | 95 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. 
For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 203 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | Amazon Sagemaker Workshop 2 | Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 
3 | 
--------------------------------------------------------------------------------
/NotebookCreation/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # Creating a Notebook Instance
3 | 
4 | We'll start by creating an Amazon S3 bucket that may be used in certain modules of the workshop. Next, we'll create a SageMaker notebook instance for running the Jupyter notebooks used in this workshop.
5 | 
6 | ## 1. Create an S3 Bucket
7 | 
8 | SageMaker typically uses S3 as storage for data and model artifacts. In this step you'll create an S3 bucket for this purpose. To begin, sign in to the AWS Management Console, https://console.aws.amazon.com/.
9 | 
10 | ### High-Level Instructions
11 | 
12 | Use the console or AWS CLI to create an Amazon S3 bucket (see step-by-step instructions below if you are unfamiliar with this process). Keep in mind that your bucket's name must be globally unique across all regions and customers. We recommend using a name like `smworkshop-firstname-lastname`. If you get an error that your bucket name already exists, try adding additional numbers or characters until you find an unused name.
13 | 
14 | 
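If you prefer the AWS CLI (for example, from CloudShell or Cloud9) over the console steps below, a single command along these lines creates the bucket; the bucket name and region shown here are placeholders to replace with your own choices:

```
aws s3 mb s3://smworkshop-firstname-lastname --region us-east-1
```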
15 | <details><summary>Step-by-step instructions (expand for details)</summary>

16 | 17 | 1. In the AWS Management Console, choose **Services** then select **S3** under Storage. 18 | 19 | 1. Choose **+Create Bucket** 20 | 21 | 1. Provide a globally unique name for your bucket such as `smworkshop-firstname-lastname`. 22 | 23 | 1. Select the Region you've chosen to use for this workshop from the dropdown. 24 | 25 | 1. Choose **Create** in the lower left of the dialog without selecting a bucket to copy settings from. 26 | 27 |
</details>
28 | 29 | ## 2. Launching the Notebook Instance 30 | 31 | 1. Make sure you are on the AWS Management Console home page. As shown below, in the **Search for services** search box, type **SageMaker**. The search result list will populate with Amazon SageMaker, which you should now click. This will bring you to the Amazon SageMaker console homepage. 32 | 33 | ![Services in Console](./images/console-services.png) 34 | 35 | 2. In the upper-right corner of the AWS Management Console, confirm you are in the desired AWS region. Select N. Virginia, Oregon, Ohio, or Ireland (or any other region where SageMaker is available). 36 | 37 | 3. To create a new notebook instance, click the **Notebook instances** link on the left side, and click the **Create notebook instance** button in the upper right corner of the browser window. 38 | 39 | ![Notebook Instances](./images/notebook-instances.png) 40 | 41 | 4. Type smworkshop-[First Name]-[Last Name] into the **Notebook instance name** text box, and select ml.m5.xlarge for the **Notebook instance type**. 42 | 43 | ![Notebook Settings](./images/notebook-settings.png) 44 | 45 | 5. In the **Permissions and encryption** section, choose **Create a new role** in the **IAM role** drop down menu. Leave the defaults in the pop-up modal, as shown below. Click **Create role**. 46 | 47 | ![Create IAM role](./images/role-popup.png) 48 | 49 | 6. As shown below, go to the **Git repositories** section, click the **Repository** drop down menu, select **Clone a Git repository to this notebook instance only**, and enter the following for **Git repository URL**: `https://github.com/awslabs/amazon-sagemaker-workshop.git` 50 | 51 | ![Enter Git info](./images/git-info.png) 52 | 53 | 7. Click **Create notebook instance** at the bottom. 54 | 55 | ### 3. Accessing the Notebook Instance 56 | 57 | 1. Wait for the server status to change to **InService**. This will take several minutes, possibly up to ten but likely much less. 58 | 59 | ![Access Notebook](./images/open-notebook.png) 60 | 61 | 2. Click **Open Jupyter**. You will now see the Jupyter homepage for your notebook instance. 
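If you'd rather check on the instance from the AWS CLI than keep refreshing the console, a command like the following (with your own notebook instance name substituted) returns the current status, which should read `InService` once the instance is ready:

```
aws sagemaker describe-notebook-instance \
    --notebook-instance-name smworkshop-FirstName-LastName \
    --query NotebookInstanceStatus
```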
62 | 63 | 64 | -------------------------------------------------------------------------------- /NotebookCreation/images/console-services.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/NotebookCreation/images/console-services.png -------------------------------------------------------------------------------- /NotebookCreation/images/git-info.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/NotebookCreation/images/git-info.png -------------------------------------------------------------------------------- /NotebookCreation/images/notebook-instances.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/NotebookCreation/images/notebook-instances.png -------------------------------------------------------------------------------- /NotebookCreation/images/notebook-settings.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/NotebookCreation/images/notebook-settings.png -------------------------------------------------------------------------------- /NotebookCreation/images/open-notebook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/NotebookCreation/images/open-notebook.png -------------------------------------------------------------------------------- /NotebookCreation/images/role-popup.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/NotebookCreation/images/role-popup.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Amazon SageMaker Workshops 2 | 3 | Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. This repository contains a collection of 2-hour workshops covering many features of Amazon SageMaker. They are suitable for self-service or live, guided events. 4 | 5 | ![Overview](./images/overview.png) 6 | 7 | **BEFORE ATTEMPTING ANY WORKSHOP: please review the Prerequisites below and complete any actions that are required, especially those in the Permissions section.** 8 | 9 | 10 | # Workshops 11 | 12 | - [**Introduction to Amazon SageMaker**](Introduction) - This 100-200 level workshop demonstrates some of the key features of Amazon SageMaker. It does so via a set of straightforward examples for common use cases including: working with structured (tabular) data, natural language processing (sentiment analysis), and computer vision (image classification). 
Content includes how to (1) do exploratory data analysis in Amazon SageMaker notebook environments such as SageMaker Studio or SageMaker Notebook Instances; (2) run Amazon SageMaker training jobs with your own custom models or built-in algorithms; and (3) get predictions using hosted model endpoints and batch transform jobs. 13 | 14 | - [**TensorFlow in Amazon SageMaker**](TensorFlow) - In this 400 level workshop for experienced TensorFlow users, various aspects of TensorFlow usage in Amazon SageMaker will be demonstrated. In particular, TensorFlow will be applied to a natural language processing use case, a structured data use case, and a computer vision use case. Relevant SageMaker features that will be demonstrated include: prototyping training and inference code with SageMaker Local Mode; SageMaker Pipelines for workflow orchestration; hosted training jobs for full-scale training; distributed training on a single multi-GPU instance or multiple instances; Automatic Model Tuning; batch and real time inference options. 15 | 16 | - [**Simplify Workflows with Scripts, the CLI and Console**](Simplify-Workflows) - (**NOTE**: for CI/CD in Amazon SageMaker and workflow orchestration, first consider [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines); an example is the Structured Data module of [**TensorFlow in Amazon SageMaker**](TensorFlow)). The focus of this 200+ level workshop is on simplifying Amazon SageMaker workflows, or doing ad hoc jobs, with a roll-your-own solution using scripts, the AWS CLI, and the Amazon SageMaker console. All of these are alternatives to using Jupyter notebooks as an interface to Amazon SageMaker. You'll apply Amazon SageMaker built-in algorithms to a structured data example and a distributed training example showing different ways to set up nodes in a training cluster. 17 | 18 | 19 | # Prerequisites 20 | 21 | ## AWS Account 22 | 23 | **Permissions**: In order to complete this workshop you'll need an AWS Account, and an AWS IAM user in that account with at least full permissions to the following AWS services: 24 | 25 | - AWS IAM 26 | - Amazon S3 27 | - Amazon SageMaker 28 | - AWS CloudShell or AWS Cloud9 29 | - Amazon EC2: including P3, C5, and M5 instance types; to check your limits, see [Viewing Your Current Limits](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html). If you do not have at least the default limits specified in [the Amazon SageMaker Limits table](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html), please file a limit increase request via the AWS console. 30 | 31 | **Use Your Own Account**: The code and instructions in this workshop assume only one student is using a given AWS account at a time. If you try sharing an account with another student, you'll run into naming conflicts for certain resources. You can work around these by appending a unique suffix to the resources that fail to create due to conflicts, but the instructions do not provide details on the changes required to make this work. Use a personal account or create a new AWS account for this workshop rather than using an organization’s account to ensure you have full access to the necessary services and to ensure you do not leave behind any resources from the workshop. 32 | 33 | **Costs**: Some, but NOT all, of the resources you will launch as part of this workshop are eligible for the AWS free tier if your account is less than 12 months old. See the [AWS Free Tier page](https://aws.amazon.com/free/) for more details. 
An example of a resource that is **not** covered by the free tier is the Amazon SageMaker notebook instance type used in some workshops. To avoid charges for endpoints and other resources you might not need after you've finished a workshop, please refer to the [**Cleanup Guide**](./CleanupGuide). 34 | 35 | 36 | ## AWS Region 37 | 38 | Amazon SageMaker is not available in all AWS Regions at this time. Accordingly, we recommend running this workshop in one of the following supported AWS Regions: N. Virginia, Oregon, Ohio, Ireland or Sydney. 39 | 40 | Once you've chosen a region, you should create all of the resources for this workshop there, including a new Amazon S3 bucket and a new SageMaker notebook instance. Make sure you select your region from the dropdown in the upper right corner of the AWS Console before getting started. 41 | 42 | ![Region selection screenshot](./images/region-selection.png) 43 | 44 | 45 | ## Browser 46 | 47 | We recommend you use the latest version of Chrome or Firefox to complete this workshop. 48 | 49 | 50 | ## AWS Command Line Interface 51 | 52 | To complete certain workshop modules, you'll need the AWS Command Line Interface (CLI) and a Bash environment. You'll use the AWS CLI to interface with Amazon SageMaker and other AWS services. Do NOT attempt to use a locally installed AWS CLI during a live workshop because there is insufficient time during a live workshop to resolve related issues with your laptop etc. 53 | 54 | To avoid problems that can arise configuring the CLI on your machine during a live workshop, either [**AWS CloudShell**](https://aws.amazon.com/cloudshell/) or [**AWS Cloud9**](https://aws.amazon.com/cloud9/) can be used. AWS CloudShell is a browser-based shell that makes it easy to securely manage, explore, and interact with your AWS resources. To run Bash scripts for workshops using CloudShell, simply create raw text script files on your local computer, and then follow the instruction steps for [uploading and running script files](https://docs.aws.amazon.com/cloudshell/latest/userguide/getting-started.html). 55 | 56 | AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser. It has the AWS CLI pre-installed so you don’t need to install files or configure your laptop to use the AWS CLI. For Cloud9 setup directions, see [**Cloud9 Setup**](Cloud9). 57 | 58 | 59 | ## Text Editor 60 | 61 | For any workshop module that requires use of the AWS Command Line Interface (see above), you also will need a **plain text** editor for writing Bash scripts. Any editor that inserts Windows or other special characters potentially will cause scripts to fail. AWS Cloud9 includes a text editor, while for AWS CloudShell you'll need to use your own separate text editor of your choice to create script files (or enter commands one at a time). 62 | 63 | 64 | # License & Contributing 65 | 66 | The contents of this workshop are licensed under the [Apache 2.0 License](./LICENSE). 67 | If you are interested in contributing to this project, please see the [Contributing Guidelines](./contributing/CONTRIBUTING.md). In connection with contributing, also review the [Code of Conduct](./contributing/CODE_OF_CONDUCT.md). 
68 | 69 | 70 | -------------------------------------------------------------------------------- /Simplify-Workflows/README.md: -------------------------------------------------------------------------------- 1 | # Simplify Workflows with Scripts, the CLI and Console 2 | 3 | ## Modules 4 | 5 | This workshop is divided into multiple modules. After completing **Preliminaries**, complete the module **Creating a Notebook Instance** next. You can complete the remaining modules in any order. 6 | 7 | - Preliminaries 8 | 9 | - Creating a Notebook Instance 10 | 11 | - Videogame Sales with the CLI and Console 12 | 13 | - Distributed Training with Built-in Algorithms, the CLI and Console 14 | 15 | 16 | ## Preliminaries 17 | 18 | - Be sure you have completed all of the Prerequisites listed in the [**main README**](../README.md). This workshop makes use of the AWS CLI and requires the use of a Bash environment for scripting. AWS CloudShell or AWS Cloud9 can be used to run the CLI and scripts; if you haven't done so already, please complete the [**CloudShell (or Cloud9) Setup**](../Cloud9). 19 | 20 | - If you are new to using Jupyter notebooks, read the next section, otherwise you may now skip ahead to the next section. 21 | 22 | 23 | ### Jupyter Notebooks: A Brief Overview 24 | 25 | Jupyter is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. With respect to code, it can be thought of as a web-based IDE that executes code on the server it is running on instead of locally. 26 | 27 | There are two main types of "cells" in a notebook: code cells, and "markdown" cells with explanatory text. You will be running the code cells. These are distinguished by having "In" next to them in the left margin next to the cell, and a greyish background. Markdown cells lack "In" and have a white background. In the screenshot below, the upper cell is a markdown cell, while the lower cell is a code cell: 28 | 29 | ![Cells](../images/cells.png) 30 | 31 | To run a code cell, simply click in it, then either click the **Run Cell** button in the notebook's toolbar, or use Control+Enter from your computer's keyboard. It may take a few seconds to a few minutes for a code cell to run. You can determine whether a cell is running by examining the `In[]:` indicator in the left margin next to each cell: a cell will show `In [*]:` when running, and `In [a number]:` when complete. 32 | 33 | Please run each code cell in order, and **only once**, to avoid repeated operations. For example, running the same training job cell twice might create two training jobs, possibly exceeding your service limits. 34 | 35 | 36 | ## Creating a Notebook Environment 37 | 38 | SageMaker provides hosted Jupyter notebooks that require no setup, so you can begin processing your training data sets immediately. With a few clicks in the SageMaker console, you can create a fully managed notebook environment, pre-loaded with useful libraries for machine learning. You need only add your data. You have two different options for this workshop. Follow the choice specified by your workshop instructor if you're in a live workshop, or make your own choice otherwise: 39 | 40 | - **SageMaker Studio**: An IDE for machine learning. 
To create a SageMaker Studio domain for this workshop, follow the instructions at [**Creating an Amazon SageMaker Studio domain**](../StudioCreation), then return here to continue with the next module of the workshop.
41 | 
42 | - **SageMaker Notebook Instance**: A managed instance with preinstalled data science tools (though not as fully managed as SageMaker Studio). To create a SageMaker notebook instance for this workshop, follow the instructions at [**Creating a Notebook Instance**](../NotebookCreation), then return here to continue with the next module of the workshop.
43 | 
44 | 
45 | ## Videogame Sales with the CLI and Console
46 | 
47 | Please go to the following link for this module: [**Videogame Sales with the CLI and Console**](../modules/Video_Game_Sales_CLI_Console.md). Be sure to use the **downloaded** version of the applicable Jupyter notebook from this workshop repository.
48 | 
49 | When you're finished, return here to move on to the next module.
50 | 
51 | 
52 | ## Distributed Training with Built-in Algorithms, the CLI and Console
53 | 
54 | Please go to the following link for this module: [**Distributed Training with Built-in Algorithms, the CLI and Console**](../modules/Distributed_Training_CLI_Console.md). Be sure to use the **downloaded** version of the applicable Jupyter notebook from this workshop repository.
55 | 
56 | When you're finished, return here and go on to the Cleanup Guide.
57 | 
58 | 
59 | ## Cleanup
60 | 
61 | To avoid charges for endpoints and other resources you might not need after the workshop, please refer to the [**Cleanup Guide**](../CleanupGuide).
62 | 
63 | 
64 | 
65 | 
66 | 
67 | 
68 | 
69 | 
70 | 
--------------------------------------------------------------------------------
/StudioCreation/README.md:
--------------------------------------------------------------------------------
1 | # Creating an Amazon SageMaker Studio domain
2 | 
3 | SageMaker Studio is an IDE for machine learning. To create a SageMaker Studio domain for this workshop, follow these steps:
4 | 
5 | - Go to https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html
6 | - Follow the directions for the first several steps. When you reach the "For **Execution role**" step, be sure to carefully read the bullet point below first.
7 | - For **Execution role**, choose **Create a new role**, leave the defaults in the pop-up, and click **Create role**.
8 | - When creation is complete (this may take about five minutes), click **Open Studio** in the line for your default user.
9 | - Open a terminal within Studio (File menu -> New -> Terminal).
10 | - Clone this workshop's repository with the following command:
11 | 
12 | ```
13 | git clone https://github.com/awslabs/amazon-sagemaker-workshop.git
14 | ```
15 | 
16 | - Return to the main workshop instructions page and continue with the rest of the workshop.
17 | 
--------------------------------------------------------------------------------
/TensorFlow/README.md:
--------------------------------------------------------------------------------
1 | # TensorFlow in Amazon SageMaker
2 | 
3 | This workshop demonstrates various aspects of how to work with custom TensorFlow models in Amazon SageMaker. We'll examine how TensorFlow can be applied in SageMaker to a natural language processing use case, a structured data use case, and a computer vision use case.
4 | 
5 | Here are some of the key features of SageMaker demonstrated in this workshop:
6 | 
7 | - **Data Processing**
8 |   - **SageMaker Processing** for data preprocessing tasks within SageMaker.
9 | 10 | - **Prototyping and Working with Code** 11 | - **Script Mode**, which enables you to use your own custom model definitions and training scripts similar to those outside Amazon SageMaker, with prebuilt TensorFlow containers. 12 | - **Git integration** for Script Mode, which allows you to specify a training script in a Git repository so your code is version controlled and you don't have to download code locally. 13 | - **Local Mode Training** for rapid prototyping and to confirm your code is working before moving on to full scale model training. 14 | - **Local Mode Endpoints** to test your models and inference code before deploying with TensorFlow Serving in SageMaker hosted endpoints for production. 15 | 16 | - **Training and Tuning Models** 17 | - **Hosted Training** for large scale model training. 18 | - **Automatic Model Tuning** to find the best model hyperparameters using automation. 19 | - **Distributed Training with TensorFlow's native MirroredStrategy** to perform training with multiple GPUs on a *single* instance. 20 | - **Distributed Training on a SageMaker-managed cluster** of *multiple* instances using either the SageMaker Distributed Training feature, or parameters servers and Horovod. 21 | 22 | - **Inference** 23 | - **Hosted Endpoints** for real time predictions with TensorFlow Serving. 24 | - **Batch Transform Jobs** for asynchronous, large scale batch inference. 25 | - **Model evaluation or batch inference** with SageMaker Processing. 26 | 27 | - **Workflow Automation** 28 | - **SageMaker Pipelines** for creating an end-to-end automated pipeline from data preprocessing to model training to hosted model deployment. 29 | 30 | 31 | ## Modules 32 | 33 | This workshop is divided into multiple modules that should be completed in order. After **Preliminaries**, complete the module **Creating a Notebook Environment**. The next two modules, NLP and Structured Data, should be completed in order to show how to build a workflow from relatively simple to more complex. 34 | 35 | - Preliminaries 36 | 37 | - Creating a Notebook Environment 38 | 39 | - Natural Language Processing (NLP) Use Case: Sentiment Analysis 40 | 41 | - Structured Data Use Case: End-to-End Workflow for Boston Housing Price Predictions 42 | 43 | - Computer Vision Use Case: Image Classification 44 | 45 | 46 | ## Preliminaries 47 | 48 | - Be sure you have completed all of the Prerequisites listed in the [**main README**](../README.md). 49 | 50 | - If you are new to using Jupyter notebooks, read the next section, otherwise you may now skip ahead to the next module. 51 | 52 | 53 | ### Jupyter Notebooks: A Brief Overview 54 | 55 | Jupyter is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. With respect to code, it can be thought of as a web-based IDE that executes code on the server it is running on instead of locally. 56 | 57 | There are two main types of "cells" in a notebook: code cells, and "markdown" cells with explanatory text. You will be running the code cells. These are distinguished by having "In" next to them in the left margin next to the cell, and a greyish background. Markdown cells lack "In" and have a white background. 
In the screenshot below, the upper cell is a markdown cell, while the lower cell is a code cell: 58 | 59 | ![Cells](../images/cells.png) 60 | 61 | To run a code cell, simply click in it, then either click the **Run Cell** button in the notebook's toolbar, or use Control+Enter from your computer's keyboard. It may take a few seconds to a few minutes for a code cell to run. You can determine whether a cell is running by examining the `In[]:` indicator in the left margin next to each cell: a cell will show `In [*]:` when running, and `In [a number]:` when complete. 62 | 63 | Please run each code cell in order, and **only once**, to avoid repeated operations. For example, running the same training job cell twice might create two training jobs, possibly exceeding your service limits. 64 | 65 | 66 | ## Creating a Notebook Environment 67 | 68 | SageMaker provides hosted Jupyter notebooks that require no setup, so you can begin processing your training data sets immediately. With a few clicks in the SageMaker console, you can create a fully managed notebook environment, pre-loaded with useful libraries for machine learning. You need only add your data. You have two different options for this workshop. Follow the choice specified by your workshop instructor if you're in a live workshop, or make your own choice otherwise: 69 | 70 | - **SageMaker Studio**: An IDE for machine learning. To create a SageMaker Studio domain for this workshop, follow the instructions at [**Creating an Amazon SageMaker Studio domain**](../StudioCreation), then return here to continue with the next module of the workshop. 71 | 72 | - **SageMaker Notebook Instance**: A managed instance with preinstalled data science tools (though not as fully managed as SageMaker Studio). To create a SageMaker notebook instance for this workshop, follow the instructions at [**Creating a Notebook Instance**](../NotebookCreation), then return here to continue with the next module of the workshop. 73 | 74 | 75 | ## Natural Language Processing Use Case: Sentiment Analysis 76 | 77 | In this module, we'll train a custom sentiment analysis model by providing our own Python training script for use with Amazon SageMaker's prebuilt TensorFlow 2 container. Assuming you have cloned this repository into your notebook environment (which you should do if you haven't), open the `notebooks` directory of the repository and click on the `sentiment-analysis.ipynb` notebook to open it. 78 | 79 | When you're finished, return here to move on to the next module. 80 | 81 | 82 | ## Structured Data Use Case: End-to-End Workflow for Boston Housing Price Predictions 83 | 84 | We'll focus on a relatively complete TensorFlow 2 workflow in this module to predict prices based on the Boston Housing dataset. In particular, we'll preprocess data with SageMaker Processing, prototype training and inference code with Local Mode, use Automatic Model Tuning, deploy the tuned model to a real time endpoint, and examine how SageMaker Pipelines can automate setting up this workflow for a production environment. Assuming you have cloned this repository into your notebook environment (which you should do if you haven't), open the `notebooks` directory of the repository and click on the `tf-2-workflow-smpipelines.ipynb` notebook to open it. 85 | 86 | When you're finished, return here to move on to the next module. 87 | 88 | 89 | ## Computer Vision Use Case: Image Classification 90 | 91 | This module applies TensorFlow within Amazon SageMaker to an image classification use case. 
Currently we recommend using the example [TensorFlow2 and SMDataParallel](https://github.com/aws/amazon-sagemaker-examples/tree/master/training/distributed_training/tensorflow/data_parallel/mnist). This example applies the SageMaker Distributed Training feature with data parallelism to train a model on multiple instances. (Model parallelism is another possibility.) 92 | 93 | Alternatively, there is a TensorFlow 1.x example for the parameter server method and Horovod. This example also uses the pre/post-processing script feature of the SageMaker TensorFlow Serving container to transform data for inference, without having to build separate containers and infrastructure to do this job. Assuming you have cloned this repository into your notebook environment (which you should do if you haven't), open the `notebooks` directory of the repository and click on the `tf-distributed-training.ipynb` notebook to open it. 94 | When you're finished, return here and go on to the Cleanup Guide. 95 | 96 | 97 | ## Cleanup 98 | 99 | To avoid charges for endpoints and other resources you might not need after the workshop, please refer to the [**Cleanup Guide**](../CleanupGuide). 100 | 101 | 102 | -------------------------------------------------------------------------------- /contributing/CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /contributing/CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check [existing open](https://github.com/awslabs/amazon-sagemaker-workshop/issues), or [recently closed](https://github.com/awslabs/amazon-sagemaker-workshop/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. 
You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels ((enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/awslabs/amazon-sagemaker-workshop/labels/help%20wanted) issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](https://github.com/awslabs/amazon-sagemaker-workshop/blob/master/LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 
62 | -------------------------------------------------------------------------------- /images/cells.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/images/cells.png -------------------------------------------------------------------------------- /images/clawfoot_bathtub.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/images/clawfoot_bathtub.jpg -------------------------------------------------------------------------------- /images/overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/images/overview.png -------------------------------------------------------------------------------- /images/region-selection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/images/region-selection.png -------------------------------------------------------------------------------- /modules/Distributed_Training_CLI_Console.md: -------------------------------------------------------------------------------- 1 | # Distributed Training with SageMaker's Built-in Algorithms 2 | 3 | ## Introduction 4 | 5 | Amazon SageMaker provides high-performance, scalable machine learning algorithms optimized for speed, scale, and accuracy, and designed to run on extremely large training datasets. Based on the type of learning that you are undertaking, you can choose from either supervised algorithms (e.g. linear/logistic regression or classification), or unsupervised algorithms (e.g. k-means clustering). Besides general purpose algorithms, Amazon SageMaker's set of built-in algorithms also includes specific-purpose algorithms suited for tasks in domains such as natural language processing and computer vision. 6 | 7 | Amazon SageMaker's built-in algorithms are re-envisioned from the ground up, specifically for large training data sets. Most algorithms available elsewhere rely on being able to load files or the entire data set into memory, which doesn’t work for very large datasets. Even algorithms that don’t do this need all of the data downloaded before training starts, instead of streaming the data in and processing it as it comes in. And lastly, large training data sets can cause some algorithms available elsewhere to give up - in some cases, with training data sets as small as a few gigabytes. 8 | 9 | To summarize, there are many reasons to use Amazon SageMaker’s built-in algorithms instead of "bringing your own" (BYO) algorithm if the built-in algorithms fit your use case: 10 | 11 | - With BYO, time/cost must be spent on design (e.g. of a neural net). 12 | - With BYO, you must solve the problems of scalability and reliability for large data sets. 13 | - Built-in algorithms take care of these concerns. 14 | - Built-in algorithms also provide many conveniences such as reduced need for external hyperparameter optimization, efficient data loading, etc. 15 | - Faster training, and faster inference with smaller models produced by some built-in algorithms. 
16 | - Even if you’re doing BYO, built-in algorithms may be helpful at some point in the machine learning pipeline, such as the PCA built-in algorithm for dimensionality reduction. 17 | 18 | ## Parallelized Data Distribution 19 | 20 | Amazon SageMaker makes it easy to train machine learning models across a cluster containing a large number of machines. This a non-trivial process, but Amazon SageMaker's built-in algorithms and pre-built machine learning containers (for TensorFlow, PyTorch, XGBoost, Scikit-learn, and MXNet) hide most of the complexity from you. Nevertheless, there are decisions about how to structure data that will have implications regarding how the distributed training is carried out. 21 | 22 | In this module, we will learn about how to take full advantage of distributed training clusters when using one of Amazon SageMaker's built-in algorithms. This module also shows how to use SageMaker's built-in algorithms via hosted Jupyter notebooks, the AWS CLI, and the Amazon SageMaker console. 23 | 24 | 1. **Exploratory Data Analysis**: For this part of the module, we'll be using an Amazon SageMaker notebook instance to explore and visualize a data set. Be sure you have downloaded this GitHub repository as specified in **Preliminaries** before you start. Next, in your notebook instance, click the **New** button on the right and select **Folder**. 25 | 26 | 2. Click the checkbox next to your new folder, click the **Rename** button above in the menu bar, and give the folder a name such as 'distributed-data'. 27 | 28 | 3. Click the folder to enter it. 29 | 30 | 4. To upload the notebook for this module, click the **Upload** button on the right. Then in the file selection popup, select the file 'data_distribution_types.ipynb' from the notebooks subdirectory in the folder on your computer where you downloaded this GitHub repository. Click the blue **Upload** button that appears to the right of the notebook's file name. 31 | 32 | 5. You are now ready to begin the notebook: click the notebook's file name to open it. 33 | 34 | 6. In the ```bucket = ''``` code line, paste the name of the S3 bucket you created in **Creating a Notebook Instance** to replace ``````. The code line should now read similar to ```bucket = 'smworkshop-john-smith'```. Do NOT paste the entire path (s3://.......), just the bucket name. 35 | 36 | - In this workshop, you also will be accessing a S3 bucket that holds data from one of the AWS Public Data Sets. 37 | 38 | - If you followed the [**Creating a Notebook Instance**](../NotebookCreation) module to create your notebook instance, you should be able to access this S3 bucket. Otherwise, if you are using your own notebook instance created elsewhere, you may need to modify the associated IAM role to add permissions for `s3:ListBucket` for `arn:aws:s3:::gdelt-open-data`, and `s3:GetObject` for `arn:aws:s3:::gdelt-open-data/*`. 39 | 40 | 7. Follow the directions in the notebook. When it is time to set up a training job, return from the notebook to these instructions. 41 | 42 | 8. **First Training Job**: Now that we have our data in S3, we can begin training. We'll use Amazon SageMaker's built-in Linear Learner algorithm. Since the focus of this module is data distribution to a training cluster, we'll fit two models in order to compare data distribution types. To understand the different types, please read the following: 43 | 44 | - In the first job, we'll use `FullyReplicated` for our `train` channel. 
This will pass every file in our input S3 location to every machine (in this case we're using 5 machines). 45 | 46 | - In the second job, we'll use `ShardedByS3Key` for the `train` channel (note that we'll keep `FullyReplicated` for the validation channel). So, for the training data, we'll pass each S3 object to a separate machine. Since we have 5 files (one for each year), we'll train on 5 machines, meaning each machine will get a year's worth of records. 47 | 48 | - We'll be using the AWS CLI and Bash scripts to run the training jobs. Besides [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines), using the AWS CLI and scripts is another way to automate machine learning pipelines and repetitive tasks. Either [**AWS CloudShell**](https://aws.amazon.com/cloudshell/) or [**AWS Cloud9**](https://aws.amazon.com/cloud9/) can be used to easily run Bash scripts in a workshop setting. AWS CloudShell is a browser-based shell that makes it easy to securely manage, explore, and interact with your AWS resources. To run Bash scripts for workshops using CloudShell, simply create raw text script files on your local computer, and then follow the instruction steps for [uploading and running script files](https://docs.aws.amazon.com/cloudshell/latest/userguide/getting-started.html). If you prefer to use Cloud9 but you haven't done so already, please set up and open your Cloud9 environment now as described in [**Cloud9 Setup**](../Cloud9). 49 | 50 | Below is a screenshot of what the first script should look like as you create it and run the related commands. Step-by-step instructions follow, ALONG WITH CODE TO COPY AND PASTE. 51 | 52 | ![Cloud9](./images/distrib-dev-environment.png) 53 | 54 | 9. Create a text file named `replicated.sh`. If you haven't done so already, open a terminal/command window that supports Bash to enter commands. In the terminal window, change to the directory in which you created the file (if you're not already there), then run the following command: 55 | 56 | ``` 57 | chmod +x replicated.sh 58 | ``` 59 | 60 | 10. Paste the bash script below into the `replicated.sh` file, and then change the text in the angle brackets (< >) as follows. Do NOT put quotes around the values you insert, or retain the brackets. 61 | 62 | - arn_role: To get the value for this variable, go to the Amazon SageMaker console, click **Notebook instances** in the left pane, then in the 'Notebook instances' table, click the name of the instance you created for this workshop. In the **Notebook instance settings** section, look for the 'IAM role ARN' value, and copy its text. It should look like the following: `arn:aws:iam::1234567890:role/service-role/AmazonSageMaker-ExecutionRole-20171211T211964`. 63 | 64 | - training_image: select one of the following, depending on the AWS Region where you are running this workshop. 65 | - N. Virginia: 382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest 66 | - Oregon: 174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest 67 | - Ohio: 404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest 68 | - Ireland: 438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest 69 | 70 | - bucket: the name of the S3 bucket you used in your notebook. It should look like: `s3://smworkshop-john-smith`. 71 | 72 | - region: the region code for the region where you are running this workshop, either `us-east-1` for N. Virginia, `us-west-2` for Oregon, `us-east-2` for Ohio, or `eu-west-1` for Ireland. 
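If you prefer to look up some of these values from the terminal rather than the console, the commands below are one way to do it. This is a minimal sketch, assuming the AWS CLI is already configured in your CloudShell or Cloud9 environment; the role-name filter and bucket pattern shown are only examples based on the names used in this workshop, so adjust them to your own.

```
# Confirm which AWS account and region the CLI will use
aws sts get-caller-identity --query Account --output text
echo "${AWS_REGION:-$(aws configure get region)}"

# Find the full ARN of your SageMaker execution role (for the arn_role variable)
aws iam list-roles \
    --query "Roles[?contains(RoleName, 'AmazonSageMaker-ExecutionRole')].Arn" --output text

# Confirm the name of the S3 bucket you created for this workshop (for the bucket variable)
aws s3 ls | grep smworkshop
```

With those values in hand, the script to paste into `replicated.sh` follows.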
73 | 74 | ``` 75 | # Fill in the values of these four variables 76 | arn_role= 77 | training_image= 78 | bucket= 79 | region= 80 | 81 | prefix=/sagemaker/data_distribution_types 82 | training_job_name=linear-replicated-`date '+%Y-%m-%d-%H-%M-%S'` 83 | 84 | training_data=$bucket$prefix/train 85 | eval_data=$bucket$prefix/validation 86 | train_source={S3DataSource={S3DataType=S3Prefix,S3DataDistributionType=FullyReplicated,S3Uri=$training_data}} 87 | eval_source={S3DataSource={S3DataType=S3Prefix,S3DataDistributionType=FullyReplicated,S3Uri=$eval_data}} 88 | 89 | aws --region $region \ 90 | sagemaker create-training-job \ 91 | --role-arn $arn_role \ 92 | --training-job-name $training_job_name \ 93 | --algorithm-specification TrainingImage=$training_image,TrainingInputMode=File \ 94 | --resource-config InstanceCount=5,InstanceType=ml.c4.2xlarge,VolumeSizeInGB=10 \ 95 | --input-data-config ChannelName=train,DataSource=$train_source,CompressionType=None,RecordWrapperType=None ChannelName=validation,DataSource=$eval_source,CompressionType=None,RecordWrapperType=None \ 96 | --output-data-config S3OutputPath=$bucket$prefix \ 97 | --hyper-parameters feature_dim=25,mini_batch_size=500,predictor_type=regressor,epochs=2,num_models=32,loss=absolute_loss \ 98 | --stopping-condition MaxRuntimeInSeconds=1800 99 | 100 | ``` 101 | 102 | 11. Save your file, then in your terminal window, run the following command to start the training job. Total job duration may last up to about 10 minutes, including time for setting up the training cluster. In case the training job encounters problems and is stuck, you can set a stopping condition that times out, in this case after a half hour. Now, since you can run another job concurrently with this one, move onto the next step after you start this job. 103 | 104 | ``` 105 | ./replicated.sh 106 | ``` 107 | 108 | 12. **Second Training Job**: For our next training job with the `ShardedByS3Key` distribution type, please create a text file named `sharded.sh`. then run the following command in your terminal window: 109 | 110 | ``` 111 | chmod +x sharded.sh 112 | ``` 113 | 114 | 13. Paste the bash script below into the `sharded.sh` file, and then change the text in the angle brackets (< >) as follows. Do NOT put quotes around the values you insert, or retain the brackets. All four values to change are the same as the values you changed for the previous script; they are noted again below for your ease of reference. 115 | 116 | - arn_role: same as for the previous script. It should look like the following: `arn:aws:iam::1234567890:role/service-role/AmazonSageMaker-ExecutionRole-20171211T211964`. 117 | 118 | - training_image: same as the previous script; the image depends on the AWS Region where you are running this workshop. They are shown again here for convenience: 119 | - N. Virginia: 382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest 120 | - Oregon: 174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest 121 | - Ohio: 404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest 122 | - Ireland: 438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest 123 | 124 | - bucket: same as for the previous script. It should look like: `s3://smworkshop-john-smith`. 125 | 126 | - region: the region code for the region where you are running this workshop, either `us-east-1` for N. Virginia, `us-west-2` for Oregon, `us-east-2` for Ohio, or `eu-west-1` for Ireland. 
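The full script to paste into `sharded.sh` appears immediately below. Once you have started both training jobs, you can also track their status from the terminal instead of (or in addition to) the SageMaker console. This is a minimal sketch, assuming the same shell you use to run the scripts; set `region` to your workshop region first, and replace the angle-bracket placeholder with one of the job names returned by the list command.

```
region=us-west-2   # example value; use your workshop region

# List the most recently created training jobs and their statuses
aws --region $region sagemaker list-training-jobs \
    --sort-by CreationTime --sort-order Descending --max-results 5 \
    --query 'TrainingJobSummaries[].[TrainingJobName,TrainingJobStatus]' --output table

# Inspect a single job in more detail
aws --region $region sagemaker describe-training-job \
    --training-job-name <training-job-name> \
    --query '{Status:TrainingJobStatus,Secondary:SecondaryStatus,TrainingSeconds:TrainingTimeInSeconds}'
```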
127 | 128 | ``` 129 | # Fill in the values of these four variables 130 | arn_role= 131 | training_image= 132 | bucket= 133 | region= 134 | 135 | prefix=/sagemaker/data_distribution_types 136 | training_job_name=linear-sharded-`date '+%Y-%m-%d-%H-%M-%S'` 137 | 138 | training_data=$bucket$prefix/train 139 | eval_data=$bucket$prefix/validation 140 | train_source={S3DataSource={S3DataType=S3Prefix,S3DataDistributionType=ShardedByS3Key,S3Uri=$training_data}} 141 | eval_source={S3DataSource={S3DataType=S3Prefix,S3DataDistributionType=FullyReplicated,S3Uri=$eval_data}} 142 | 143 | aws --region $region \ 144 | sagemaker create-training-job \ 145 | --role-arn $arn_role \ 146 | --training-job-name $training_job_name \ 147 | --algorithm-specification TrainingImage=$training_image,TrainingInputMode=File \ 148 | --resource-config InstanceCount=5,InstanceType=ml.c4.2xlarge,VolumeSizeInGB=10 \ 149 | --input-data-config ChannelName=train,DataSource=$train_source,CompressionType=None,RecordWrapperType=None ChannelName=validation,DataSource=$eval_source,CompressionType=None,RecordWrapperType=None \ 150 | --output-data-config S3OutputPath=$bucket$prefix \ 151 | --hyper-parameters feature_dim=25,mini_batch_size=500,predictor_type=regressor,epochs=2,num_models=32,loss=absolute_loss \ 152 | --stopping-condition MaxRuntimeInSeconds=1800 153 | 154 | ``` 155 | 156 | 14. Save your file, then in your terminal window, run the following command to start your second training job now, there is no need to wait for the first training job to complete: 157 | 158 | ``` 159 | ./sharded.sh 160 | ``` 161 | 162 | 15. In the Amazon SageMaker console, click **Jobs** in the left panel to check the status of the training jobs, which run concurrently. When they are complete, their **Status** column will change from InProgress to Complete. As a reminder, duration of these jobs can last up to about 10 minutes, including time for setting up the training cluster, as shown in the **Duration** column of the **Jobs** table. 163 | 164 | - To check the actual training time (not including cluster setup) for each job when both are complete, click the training job name in the jobs table, then examine the **Training duration** listed at the top right under **Job Settings**. **Training duration** does not include the time related to cluster setup. As we can see, and might expect, the sharded distribution type trained substantially faster than the fully replicated type. This is a key differentiator to consider when preparing data and picking the distribution type. 165 | 166 | 16. **Amazon SageMaker Model Creation**: Now that we've trained our machine learning models, we'll want to make predictions by setting up a hosted endpoint for them. The first step in doing that is to create a SageMaker model object that wraps the actual model artifact from training. To create the model object, we will point to the model.tar.gz that came from training and the inference code container, then create the hosting model object. We'll do this twice, once for each model we trained earlier. Here are the steps to do this via the Amazon SageMaker console (see screenshot below for an example of all relevant fields filled in for the Oregon AWS Region): 167 | 168 | - In the left pane of the SageMaker console home page, right click the **Models** link and open it in another tab of your browser. Click the **Create Model** button at the upper right above the 'Models' table. 169 | 170 | - For the 'Model name' field under **Model Settings**, enter `distributed-replicated`. 
171 | 172 | - For the 'Location of inference code image' field under **Primary Container**, enter the name of the same Docker image you specified previously for the region where you're running this workshop. For ease of reference, here are the image names again: 173 | 174 | - N. Virginia: 382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest 175 | - Oregon: 174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest 176 | - Ohio: 404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest 177 | - Ireland: 438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest 178 | 179 | - For the 'Location of model artifacts' field under **Primary Container**, enter the path to the output of your replicated training job. To find the path, go back to your first browser tab, click **Jobs** in the left pane, then find and click the replicated job name, which will look like `linear-replicated-`. Scroll down to the **Outputs** section, then copy the path under 'S3 model artifact'. Paste the path in the field; it should look like `s3:///sagemaker/data_distribution_types/linear-replicated-2018-03-11-18-13-13/output/model.tar.gz`. 180 | 181 | - Click **Create model** at the bottom of the page. 182 | 183 | - Repeat the above steps for the sharded training job model, except: for 'Model name', enter `distributed-sharded`, and for 'Location of model artifacts', enter the path for the sharded training job model artifact. 184 | 185 | ![Model](./images/distrib-model.png) 186 | 187 | 17. **Endpoint Configuration**: Once we've set up our models, we can configure what our hosting endpoints should be. Here we specify the EC2 instance type to use for hosting, the initial number of instances, and our hosting model name. Again, we'll do this twice, once for each model we trained earlier. Here are the steps to do this via the SageMaker console (see screenshot below for an example of all relevant fields filled in for the Oregon AWS Region): 188 | 189 | - In the left pane of the SageMaker console, click **Endpoint configuration**. Click the **Create endpoint configuration** button at the upper right above the 'Endpoint configuration' table. 190 | 191 | - For the 'Endpoint configuration name' field under **New endpoint configuration**, enter `distributed-replicated`. 192 | 193 | - Under **Production variants**, click **Add model**. From the **Add model** popup, select the `distributed-replicated` model you created earlier, and click **Save**. Then click **Create endpoint configuration** at the bottom of the page. 194 | 195 | - Repeat the above steps for the sharded training job model, except: for 'Endpoint configuration name', enter `distributed-sharded`, and for the **Add model** popup, select the `distributed-sharded` model. 196 | 197 | ![Endpoint Configuration](./images/distrib-endpoint-config.png) 198 | 199 | 18. **Endpoint Creation**: Now that we've specified how our endpoints should be configured, we can create them. For this final step in the process of setting up endpoints, we'll once again use the SageMaker console to do so (see screenshot below for an example of all relevant fields filled in for the Oregon AWS Region): 200 | 201 | - In the left pane of the Amazon SageMaker console, click **Endpoints**. Click the **Create endpoint** button at the upper right above the 'Endpoints' table. 202 | 203 | - For the 'Endpoint name' field under **Endpoint**, enter `distributed-replicated`.
204 | 205 | - Under **Attach endpoint configuration**, leave 'Use an existing endpoint configuration' selected, then under **Endpoint configuration**, select `distributed-replicated` from the table, then click **Select endpoint configuration** at the bottom of the table. Then click **Create endpoint** at the bottom of the page. 206 | 207 | - Repeat the above steps, except: for 'Endpoint name', enter `distributed-sharded`, and for the **Endpoint configuration** table, select the `distributed-sharded` endpoint configuration. 208 | 209 | - In the **Endpoints** table, refer to the 'Status' column, and wait for both endpoints to change from 'Creating' to 'InService' before proceeding to the next step. It will take several minutes for endpoint creation, possibly as long as ten minutes. 210 | 211 | ![Endpoint](./images/distrib-endpoint.png) 212 | 213 | 19. **Evaluate**: To compare predictions from our two models, let's return to the notebook we used earlier. When you are finished, return here and proceed to the next section. 214 | 215 | ### Conclusion & Extensions 216 | 217 | In this module, we ran a regression on a relatively artificial example, and we skipped some pre-processing steps along the way (like potentially transforming or winsorizing our target variable, looking for interactions in our features, etc.). But the main point was to highlight the difference in training time and accuracy of a model trained through two different distribution methods. 218 | 219 | Overall, sharding data into separate files and sending them to separate training nodes will run faster, but may produce lower accuracy than a model that replicates the data across all nodes. Naturally, this can be influenced by training the sharded model longer, with more epochs. And it should be noted that we trained with a very small number of epochs to highlight this difference. 220 | 221 | Different algorithms can be expected to show variation in which distribution mechanism is most effective at achieving optimal compute spend per point of model accuracy. The message remains the same, though: finding the right distribution type is another experiment to run when optimizing model training times. 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | -------------------------------------------------------------------------------- /modules/Video_Game_Sales_CLI_Console.md: -------------------------------------------------------------------------------- 1 | ## Videogame Sales with the CLI and Console 2 | 3 | 4 | In this module, we'll work our way through an example that demonstrates how to use a built-in algorithm in SageMaker. More specifically, we'll use Amazon SageMaker's version of XGBoost, a popular and efficient open-source implementation of the gradient boosted trees algorithm. 5 | 6 | A gradient boosted trees algorithm attempts to predict a target variable by combining the estimates of a set of simpler, weaker models. XGBoost has done remarkably well in machine learning competitions because it robustly handles a wide variety of data types, relationships, and distributions. It is often a useful go-to algorithm when working with structured data, such as data that might be found in relational databases and flat files. 7 | 8 | This module also shows how to use Amazon SageMaker's built-in algorithms via hosted Jupyter notebooks, the AWS CLI, and the Amazon SageMaker console. To proceed, follow these steps: 9 | 10 | 1.
**Exploratory Data Analysis**: For this part of the module, we'll be using a SageMaker notebook instance to explore and visualize a data set. Be sure you have downloaded this GitHub repository as specified in **Preliminaries** before you start. Next, in your notebook instance, click the **New** button on the right and select **Folder**. 11 | 12 | 2. Click the checkbox next to your new folder, click the **Rename** button above in the menu bar, and give the folder a name such as 'videogame-sales-cli-console'. 13 | 14 | 3. Click the folder to enter it. 15 | 16 | 4. To upload the notebook for this module, click the **Upload** button on the right. Then in the file selection popup, select the file 'videogame-sales-cli-console.ipynb' from the notebooks subdirectory in the folder on your computer where you downloaded this GitHub repository. DO NOT USE THE DIFFERENT NOTEBOOK 'videogame-sales' (it does not use the CLI or console). Click the blue **Upload** button that appears to the right of the notebook's file name. 17 | 18 | 5. You are now ready to begin the notebook: click the notebook's file name to open it. 19 | 20 | 6. In the ```bucket = ''``` code line, paste the name of the S3 bucket you created in **Creating a Notebook Instance** to replace ``````. The code line should now read similar to ```bucket = 'smworkshop-john-smith'```. Do NOT paste the entire path (s3://.......), just the bucket name. 21 | 22 | 7. Follow the directions in the notebook. Begin by reading the **Background** section, then run the cells in the **Setup** and **Data** sections; the first cell to run is pictured below. When it is time to set up a training job, return from the notebook to these instructions. 23 | 24 | ![First Cell](./images/videogames-first-cell.png) 25 | 26 | 8. **Training Job**: The next few steps will be performed outside the notebook, but leave the notebook open for now because you'll be returning to it later. Now that we have our data in S3, we can begin training a model. We'll use SageMaker's built-in version of the XGBoost algorithm, and the AWS CLI to run the training job. XGBoost has many tunable hyperparameters. Some of these hyperparameters are listed below; initially we'll only use a few of them. Many of the hyperparameters are used to prevent overfitting, which occurs when a model fits the training data so closely that it fails to generalize to new observations. 27 | 28 | - `max_depth`: Maximum depth of a tree. As a cautionary note, a value too small could underfit the data, while increasing it will make the model more complex and thus more likely to overfit the data (in other words, the classic bias-variance tradeoff). 29 | - `eta`: Step size shrinkage used in updates to prevent overfitting. 30 | - `eval_metric`: Evaluation metric(s) for validation data. For data sets such as this one with imbalanced classes, we'll use the AUC metric. 31 | - `scale_pos_weight`: Controls the balance of positive and negative weights, again useful for data sets having imbalanced classes. 32 | 33 | 9. We'll be using the AWS CLI and a Bash script to run the training job. Besides [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines), using the AWS CLI and scripts is another way to automate machine learning pipelines and repetitive tasks. Either [**AWS CloudShell**](https://aws.amazon.com/cloudshell/) or [**AWS Cloud9**](https://aws.amazon.com/cloud9/) can be used to easily run Bash scripts in a workshop setting. AWS CloudShell is a browser-based shell that makes it easy to securely manage, explore, and interact with your AWS resources.
To run Bash scripts for workshops using CloudShell, simply create raw text script files on your local computer, and then follow the instruction steps for [uploading and running script files](https://docs.aws.amazon.com/cloudshell/latest/userguide/getting-started.html). If you prefer to use Cloud9 but you haven't done so already, please set up and open your Cloud9 environment now as described in [**Cloud9 Setup**](../Cloud9). 34 | 35 | Below is a screenshot of what the first script should look like as you create it and run the related commands. Step-by-step instructions follow, ALONG WITH CODE TO COPY AND PASTE. 36 | 37 | ![Cloud9](./images/videogames-cloud9.png) 38 | 39 | 10. Create a text file named `videogames.sh`. If you haven't done so already, open a terminal/command window that supports Bash to enter commands. In the terminal window, change to the directory in which you created the file (if you're not already there), then run the following command: 40 | 41 | ``` 42 | chmod +x videogames.sh 43 | ``` 44 | 45 | 11. Paste the bash script below into the `videogames.sh` file, and then change the text in the angle brackets (< >) as follows. Do NOT put quotes around the values you insert, or retain the brackets. 46 | 47 | - arn_role: To get the value for this variable, go to the SageMaker console, click **Notebook instances** in the left pane, then in the 'Notebook instances' table, click the name of the instance you created for this workshop. In the **Notebook instance settings** section, look for the 'IAM role ARN' value, and copy its text. It should look like the following: `arn:aws:iam::1234567890:role/service-role/AmazonSageMaker-ExecutionRole-20171211T211964`. 48 | 49 | - training_image: select one of the following, depending on the AWS Region where you are running this workshop. 50 | - N. Virginia: 811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest 51 | - Oregon: 433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest 52 | - Ohio: 825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest 53 | - Ireland: 685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest 54 | - Sydney: 544295431143.dkr.ecr.ap-southeast-2.amazonaws.com/xgboost:latest 55 | 56 | - bucket: the name of the S3 bucket you used in your notebook. It should look like: `s3://smworkshop-john-smith`. 57 | 58 | - region: the region code for the region where you are running this workshop, either `us-east-1` for N. Virginia, `us-west-2` for Oregon, `us-east-2` for Ohio, `eu-west-1` for Ireland, or `ap-southeast-2` for Sydney. 
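The full script to paste into `videogames.sh` follows. One note for later: when you create the SageMaker model object in step 14, you will need the S3 path of the trained model artifact. As an alternative to locating it in the console, you can query it from the terminal once the training job is complete. This is a minimal sketch, assuming the same AWS CLI environment used to run the script; replace the example region and the angle-bracket job name placeholder with your own values.

```
# Print the S3 location of the model.tar.gz produced by a completed training job
aws --region us-west-2 sagemaker describe-training-job \
    --training-job-name <videogames-xgboost-job-name> \
    --query 'ModelArtifacts.S3ModelArtifacts' --output text
```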
59 | 60 | ``` 61 | # Fill in the values of these four variables 62 | arn_role= 63 | training_image= 64 | bucket= 65 | region= 66 | 67 | prefix=/sagemaker/videogames-xgboost 68 | training_job_name=videogames-xgboost-`date '+%Y-%m-%d-%H-%M-%S'` 69 | 70 | training_data=$bucket$prefix/train 71 | eval_data=$bucket$prefix/validation 72 | train_source={S3DataSource={S3DataType=S3Prefix,S3DataDistributionType=FullyReplicated,S3Uri=$training_data}} 73 | eval_source={S3DataSource={S3DataType=S3Prefix,S3DataDistributionType=FullyReplicated,S3Uri=$eval_data}} 74 | 75 | aws --region $region \ 76 | sagemaker create-training-job \ 77 | --role-arn $arn_role \ 78 | --training-job-name $training_job_name \ 79 | --algorithm-specification TrainingImage=$training_image,TrainingInputMode=File \ 80 | --resource-config InstanceCount=1,InstanceType=ml.c4.2xlarge,VolumeSizeInGB=10 \ 81 | --input-data-config ChannelName=train,DataSource=$train_source,CompressionType=None,ContentType=libsvm ChannelName=validation,DataSource=$eval_source,CompressionType=None,ContentType=libsvm \ 82 | --output-data-config S3OutputPath=$bucket$prefix \ 83 | --hyper-parameters max_depth=3,eta=0.1,eval_metric=auc,scale_pos_weight=2.0,subsample=0.5,objective=binary:logistic,num_round=100 \ 84 | --stopping-condition MaxRuntimeInSeconds=1800 85 | 86 | ``` 87 | 88 | 12. In your terminal window, run the following command to start the training job. Total job duration may last up to about 5 minutes, including time for setting up the training cluster. In case the training job encounters problems and is stuck, you can set a stopping condition that times out, in this case after a half hour. 89 | 90 | ``` 91 | ./videogames.sh 92 | ``` 93 | 94 | 13. In the SageMaker console, click **Jobs** in the left panel to check the status of the training job. When the job is complete, its **Status** column will change from InProgress to Complete. As a reminder, duration of this job can last up to about 5 minutes, including time for setting up the training cluster. 95 | 96 | - To check the actual training time (not including cluster setup) for a job when it is complete, click the training job name in the jobs table, then examine the **Training time** listed at the top right under **Job Settings**. 97 | 98 | 14. **SageMaker Model Creation**: Now that we've trained our machine learning model, we'll want to make predictions by setting up a hosted endpoint for it. The first step in doing that is to create a SageMaker model object that wraps the actual model artifact from training. To create the model object, we will point to the model.tar.gz that came from training and the inference code container, then create the hosting model object. Here are the steps to do this via the SageMaker console (see screenshot below for an example of all relevant fields filled in for the Oregon AWS Region): 99 | 100 | - In the left pane of the SageMaker console home page, right click the **Models** link and open it in another tab of your browser. Click the **Create Model** button at the upper right above the 'Models' table. 101 | 102 | - For the 'Model name' field under **Model Settings**, enter `videogames-xgboost`. 103 | 104 | - For the 'Location of inference code image' field under **Primary Container**, enter the name of the same Docker image you specified previously for the region where you're running this workshop. For ease of reference, here are the image names again: 105 | 106 | - N. 
Virginia: 811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest 107 | - Oregon: 433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest 108 | - Ohio: 825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest 109 | - Ireland: 685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest 110 | - Sydney: 544295431143.dkr.ecr.ap-southeast-2.amazonaws.com/xgboost:latest 111 | 112 | - For the 'Location of model artifacts' field under **Primary Container**, enter the path to the output of your training job. To find the path, go back to your first browser tab, click **Jobs** in the left pane, then find and click the job name, which will look like `videogames-xgboost-`. Scroll down to the **Outputs** section, then copy the path under 'S3 model artifact'. Paste the path in the field; it should look like `s3://smworkshop-john-smith/sagemaker/videogames_xgboost/videogames-xgboost-2018-04-17-20-40-13/output/model.tar.gz`. 113 | 114 | - Click **Create model** at the bottom of the page. 115 | 116 | ![Model](./images/videogames-model-v1.png) 117 | 118 | 15. **Endpoint Configuration**: Once we've set up our model, we can configure what our hosting endpoint should be. Here we specify the EC2 instance type to use for hosting, the initial number of instances, and our hosting model name. Here are the steps to do this via the SageMaker console (see screenshot below for an example of all relevant fields filled in for the Oregon AWS Region): 119 | 120 | - In the left pane of the SageMaker console, click **Endpoint configuration**. Click the **Create endpoint configuration** button at the upper right above the 'Endpoint configuration' table. 121 | 122 | - For the 'Endpoint configuration name' field under **New endpoint configuration**, enter `videogames-xgboost`. 123 | 124 | - Under **Production variants**, click **Add model**. From the **Add model** popup, select the `videogames-xgboost` model you created earlier, and click **Save**. Then click **Create endpoint configuration** at the bottom of the page. 125 | 126 | ![Endpoint Configuration](./images/videogames-endpoint-config-v1.png) 127 | 128 | 16. **Endpoint Creation**: Now that we've specified how our endpoint should be configured, we can create it. For this final step in the process of setting up an endpoint, we'll once again use the SageMaker console to do so (see screenshot below for an example of all relevant fields filled in for the Oregon AWS Region): 129 | 130 | - In the left pane of the SageMaker console, click **Endpoints**. Click the **Create endpoint** button at the upper right above the 'Endpoints' table. 131 | 132 | - For the 'Endpoint name' field under **Endpoint**, enter `videogames-xgboost`. 133 | 134 | - Under **Attach endpoint configuration**, leave 'Use an existing endpoint configuration' selected, then under **Endpoint configuration**, select `videogames-xgboost` from the table, then click **Select endpoint configuration** at the bottom of the table. Then click **Create endpoint** at the bottom of the page. 135 | 136 | - In the **Endpoints** table, refer to the 'Status' column, and wait for the endpoint status to change from 'Creating' to 'InService' before proceeding to the next step. It will take several minutes for endpoint creation, possibly as long as ten minutes. 137 | 138 | ![Endpoint](./images/videogames-endpoint-v1.png) 139 | 140 | 17. **Evaluate**: To evaluate predictions from our model, we'll use the notebook uploaded to your Amazon SageMaker notebook instance in steps 1 to 5.
(NOTE: IF YOU DID NOT PERFORM STEPS 1 TO 5 EARLIER, DO SO NOW.) Next: 141 | 142 | [a] Run the first two cells of the notebook if you haven't done so already, in order to make sure you have imported the required libraries. 143 | 144 | [b] Go to the **Evaluation** section of the notebook, and run the remaining notebook cells, the first of which is pictured below. When you are finished, return here and proceed to the next section. 145 | 146 | ![Next Cell](./images/videogames-next-cell.png) 147 | 148 | ### Conclusion & Extensions 149 | 150 | This XGBoost model is just the starting point for predicting whether a game will be a hit based on reviews and other features. There are several possible avenues for improving the model's performance. First, of course, would be to collect more data and, if possible, fill in the existing missing fields with actual information. Another possibility is further hyperparameter tuning with Amazon SageMaker's Automatic Model Tuning feature, which automates the tuning process. And, although ensemble learners often do well with imbalanced data sets, it could be worth exploring techniques for mitigating imbalances such as downsampling, synthetic data augmentation, and other approaches. 151 | 152 | -------------------------------------------------------------------------------- /modules/images/distrib-dev-environment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/modules/images/distrib-dev-environment.png -------------------------------------------------------------------------------- /modules/images/distrib-endpoint-config.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/modules/images/distrib-endpoint-config.png -------------------------------------------------------------------------------- /modules/images/distrib-endpoint.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/modules/images/distrib-endpoint.png -------------------------------------------------------------------------------- /modules/images/distrib-model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/modules/images/distrib-model.png -------------------------------------------------------------------------------- /modules/images/videogames-cloud9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/modules/images/videogames-cloud9.png -------------------------------------------------------------------------------- /modules/images/videogames-endpoint-config-v1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/modules/images/videogames-endpoint-config-v1.png -------------------------------------------------------------------------------- /modules/images/videogames-endpoint-config.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/modules/images/videogames-endpoint-config.png -------------------------------------------------------------------------------- /modules/images/videogames-endpoint-v1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/modules/images/videogames-endpoint-v1.png -------------------------------------------------------------------------------- /modules/images/videogames-endpoint.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/modules/images/videogames-endpoint.png -------------------------------------------------------------------------------- /modules/images/videogames-first-cell.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/modules/images/videogames-first-cell.png -------------------------------------------------------------------------------- /modules/images/videogames-model-v1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/modules/images/videogames-model-v1.png -------------------------------------------------------------------------------- /modules/images/videogames-model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/modules/images/videogames-model.png -------------------------------------------------------------------------------- /modules/images/videogames-next-cell.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/0cf3d661a4065becde6b974ee6fff99c9640b850/modules/images/videogames-next-cell.png -------------------------------------------------------------------------------- /notebooks/Image-classification-transfer-learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Computer Vision with SageMaker's Image Classification Algorithm\n", 8 | "\n", 9 | "1. [Introduction](#Introduction)\n", 10 | "2. [Prerequisites and Preprocessing](#Prequisites-and-Preprocessing)\n", 11 | "3. [Fine-tuning the Image classification model](#Fine-tuning-the-Image-classification-model)\n", 12 | "4. [Training parameters](#Training-parameters)\n", 13 | "5. [Start the training](#Start-the-training)\n", 14 | "6. [Inference](#Inference)\n" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## Introduction\n", 22 | "\n", 23 | "Image classification is a fundamental computer vision task that involves predicting the overall label/class of an image. Modern computer vision techniques use neural net models for these kinds of tasks. 
Although neural nets can achieve high accuracy for image classification, they can be quite difficult to use directly. Amazon SageMaker's built-in image classification algorithm makes such neural nets much easier to use. Simply provide your dataset and specify a few parameters, and you can train and deploy a custom model. \n", 24 | "\n", 25 | "This notebook is an end-to-end example of image classification to \"fine-tune\" a pretrained model. Fine-tuning, a form of \"transfer learning\" from one classification task to another, typically results in substantial time and cost savings compared to training from scratch. We'll use SageMaker's built-in image classification algorithm in transfer learning mode to fine-tune a model previously trained on the well-known public ImageNet dataset. This fine-tuned model will be used to classify a new dataset different from ImageNet. In particular, the pretrained model will be fine-tuned with the [Caltech-256 dataset](http://www.vision.caltech.edu/Image_Datasets/Caltech256/). \n", 26 | "\n", 27 | "SageMaker's built-in image classification algorithm has an option for training from scratch as well as transfer learning. Using the built-in algorithm's transfer learning mode frees you from having to modify the underlying neural net architecture, which otherwise would be necessary if you used the neural net directly from a code base. There are many other conveniences provided by this built-in algorithm, such as the ability to automatically train faster on a cluster of many instances without requiring you to manage cluster setup and teardown. \n", 28 | "\n", 29 | "To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on." 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "## Prerequisites and Preprocessing\n", 37 | "\n", 38 | "### Permissions and environment variables\n", 39 | "\n", 40 | "Here we set up the linkage and authentication for AWS services. There are three parts to this:\n", 41 | "\n", 42 | "* The IAM role used to give learning and hosting access to your data. This will be obtained from the role used to start the notebook.\n", 43 | "* The S3 bucket for training and model data.\n", 44 | "* The Amazon SageMaker image classification algorithm Docker image which you can use out of the box, without modifications." 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "%%time\n", 54 | "import boto3\n", 55 | "import sagemaker\n", 56 | "from sagemaker import get_execution_role\n", 57 | "\n", 58 | "role = get_execution_role()\n", 59 | "print(role)\n", 60 | "\n", 61 | "sess = sagemaker.Session()\n", 62 | "bucket = sess.default_bucket()\n", 63 | "prefix = 'ic-transfer-learning'" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "region = sess.boto_region_name\n", 73 | "image_name = sagemaker.image_uris.retrieve(region=region, framework='image-classification')\n", 74 | "print(image_name)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "## Fine-tuning the Image Classification model\n", 82 | "\n", 83 | "The Caltech-256 dataset consists of images from 256 object categories plus a clutter category. It has a total of roughly 30,000 images, with a minimum of 80 images and a maximum of about 800 images per category. 
\n", 84 | "\n", 85 | "The image classification algorithm accepts two input formats. The first is a [recordio format](https://mxnet.incubator.apache.org/faq/recordio.html), and the other is an [lst format](https://mxnet.incubator.apache.org/faq/recordio.html?highlight=im2rec). In this example, we will use the recordio format." 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "import os\n", 95 | "import urllib.request\n", 96 | "import boto3\n", 97 | "\n", 98 | "def download(url):\n", 99 | " filename = url.split(\"/\")[-1]\n", 100 | " if not os.path.exists(filename):\n", 101 | " urllib.request.urlretrieve(url, filename)\n", 102 | "\n", 103 | " \n", 104 | "def upload_to_s3(channel, file):\n", 105 | " s3 = boto3.resource('s3')\n", 106 | " key = channel + '/' + file\n", 107 | " with open(file, \"rb\") as data:\n", 108 | " s3.Bucket(bucket).put_object(Key=key, Body=data)\n", 109 | "\n", 110 | "\n", 111 | "# Download the Caltech-256 training and validation sets (RecordIO format)\n", 112 | "download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')\n", 113 | "download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')\n", 114 | "upload_to_s3('validation', 'caltech-256-60-val.rec')\n", 115 | "upload_to_s3('train', 'caltech-256-60-train.rec')" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "Next, we'll copy the data to the S3 locations (under our bucket and prefix) that SageMaker will read from during model training." 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "# Two channels: train and validation\n", 132 | "s3train = 's3://{}/{}/train/'.format(bucket, prefix)\n", 133 | "s3validation = 's3://{}/{}/validation/'.format(bucket, prefix)\n", 134 | "\n", 135 | "# upload the .rec files to the train and validation channels\n", 136 | "!aws s3 cp caltech-256-60-train.rec $s3train --quiet\n", 137 | "!aws s3 cp caltech-256-60-val.rec $s3validation --quiet" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "Once we have the data available in S3 in the correct format for training, the next step is to actually train the model using the data. Before training the model, we need to set up the training parameters. The next section will explain the parameters in detail and dive into how to set up the training job." 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "## Training\n", 152 | "\n", 153 | "Now that we are done with the data setup, we are almost ready to train our image classification model. To begin, let's create a ``sagemaker.estimator.Estimator`` object. This Estimator will launch the training job.\n", 154 | "\n", 155 | "### Training parameters\n", 156 | "\n", 157 | "There are two kinds of parameters to set for training. The first kind is the parameters for the training job itself, such as the amount and type of hardware to use and the S3 output location. For this example, these include:\n", 158 | "\n", 159 | "* **Instance count**: This is the number of instances on which to run the training. When the number of instances is greater than one, the image classification algorithm will run in a distributed cluster automatically without requiring you to manage cluster setup. \n", 160 | "* **Instance type**: This indicates the type of machine on which to run the training. 
Typically, we use GPU instances for computer vision models such as this one.\n", 161 | "* **Output path**: This is the S3 folder in which the training output will be stored." 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)\n", 171 | "\n", 172 | "ic = sagemaker.estimator.Estimator(\n", 173 | " image_uri=image_name,\n", 174 | " role=role,\n", 175 | " instance_count=1,\n", 176 | " instance_type='ml.p3.8xlarge',\n", 177 | " volume_size = 50,\n", 178 | " sagemaker_session=sess,\n", 179 | " output_path=s3_output_location )" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "Apart from the above set of training job parameters, the second set of parameters are hyperparameters that are specific to the algorithm. These include:\n", 187 | "\n", 188 | "* **num_layers**: The number of layers (depth) for the network. We use 18 in this example, but other values such as 50 or 152 can be used to achieve greater accuracy at the cost of longer training time.\n", 189 | "* **use_pretrained_model**: Set to 1 to use a pretrained model for transfer learning.\n", 190 | "* **image_shape**: The input image dimensions, 'num_channels, height, width', for the network. It should be no larger than the actual image size. The number of channels should be the same as in the actual images.\n", 191 | "* **num_classes**: This is the number of output classes for the new dataset. ImageNet has 1000 classes, but the number of output classes for our pretrained network can be changed with fine-tuning. For this Caltech dataset, we use 257 because it has 256 object categories + 1 clutter class.\n", 192 | "* **num_training_samples**: This is the total number of training samples. It is set to 15420 for the Caltech dataset due to the current split between training and validation data.\n", 193 | "* **mini_batch_size**: The number of training samples used for each mini batch. In distributed training for multiple training instances (we just use one here), the number of training samples used per batch would be N * mini_batch_size, where N is the number of hosts on which training is run.\n", 194 | "* **epochs**: Number of training epochs, i.e. passes over the complete training data.\n", 195 | "* **learning_rate**: Learning rate for training.\n", 196 | "* **precision_dtype**: Training datatype precision (default: float32). If set to 'float16', the training will be done in mixed_precision mode and will be faster than float32 mode, at the cost of slightly less accuracy. " 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": { 203 | "isConfigCell": true 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "ic.set_hyperparameters(\n", 208 | " num_layers=18,\n", 209 | " use_pretrained_model=1,\n", 210 | " image_shape = \"3,224,224\",\n", 211 | " num_classes=257,\n", 212 | " num_training_samples=15420,\n", 213 | " mini_batch_size=128,\n", 214 | " epochs=2,\n", 215 | " learning_rate=0.01,\n", 216 | " precision_dtype='float32' )" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "## Input data specification\n", 224 | "\n", 225 | "The next step is to set the data type and channels used for training. The channel definitions inform SageMaker about where to find both the training and validation datasets in S3." 
226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "train_data = sagemaker.inputs.TrainingInput(s3_data=s3train, content_type='application/x-recordio')\n", 235 | "validation_data = sagemaker.inputs.TrainingInput(s3_data=s3validation, content_type='application/x-recordio')\n", 236 | "\n", 237 | "data_channels = {'train': train_data, 'validation': validation_data}" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "## Start the training job\n", 245 | "\n", 246 | "Now we can start the training job by calling the `fit` method of the Estimator object." 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [ 255 | "ic.fit(inputs=data_channels, logs=True)" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "# Inference\n", 263 | "\n", 264 | "***\n", 265 | "\n", 266 | "A trained model does nothing on its own. We now want to use the model to perform inference, i.e., get predictions from the model. For this example, that means predicting the Caltech-256 class of a given image. To deploy the trained model, we simply use the `deploy` method of the Estimator. This will create a SageMaker endpoint that can return predictions in real time, for example for use with a consumer-facing app that must have low latency responses to user requests. SageMaker can also perform offline, asynchronous batch inference with its Batch Transform feature. " 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": null, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "ic_classifier = ic.deploy(initial_instance_count = 1,\n", 276 | " instance_type = 'ml.m5.xlarge')" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "### Download a test image" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [ 292 | "!wget -O test.jpg https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/master/images/clawfoot_bathtub.jpg\n", 293 | "file_name = 'test.jpg'\n", 294 | "# display the test image\n", 295 | "from IPython.display import Image\n", 296 | "Image(file_name) " 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "### Evaluation\n", 304 | "\n", 305 | "Let's now use the SageMaker endpoint hosting the trained model to predict the Caltech-256 class of the test image. The model outputs class probabilities. Typically, one selects the class with the maximum probability as the final predicted class.\n", 306 | "\n", 307 | "**Note:** Although the network is likely to predict the correct class (bathtub), this is not guaranteed, because model training is a stochastic process. To limit training time and cost, we have trained the model for only a couple of epochs. If the model is trained for more epochs (say 20), the predictions are likely to be more accurate." 
308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "metadata": {}, 314 | "outputs": [], 315 | "source": [ 316 | "import json\n", 317 | "import numpy as np\n", 318 | "from sagemaker.serializers import IdentitySerializer\n", 319 | "\n", 320 | "with open(file_name, 'rb') as f:\n", 321 | " payload = f.read()\n", 322 | " payload = bytearray(payload)\n", 323 | " \n", 324 | "ic_classifier.serializer = IdentitySerializer(content_type='application/x-image')\n", 325 | "\n", 326 | "result = json.loads(ic_classifier.predict(payload))\n", 327 | "# output the probabilities for all classes, then find the class with maximum probability and print its index\n", 328 | "index = np.argmax(result)\n", 329 | "object_categories = ['ak47', 'american-flag', 'backpack', 'baseball-bat', 'baseball-glove', 'basketball-hoop', 'bat', 'bathtub', 'bear', 'beer-mug', 'billiards', 'binoculars', 'birdbath', 'blimp', 'bonsai-101', 'boom-box', 'bowling-ball', 'bowling-pin', 'boxing-glove', 'brain-101', 'breadmaker', 'buddha-101', 'bulldozer', 'butterfly', 'cactus', 'cake', 'calculator', 'camel', 'cannon', 'canoe', 'car-tire', 'cartman', 'cd', 'centipede', 'cereal-box', 'chandelier-101', 'chess-board', 'chimp', 'chopsticks', 'cockroach', 'coffee-mug', 'coffin', 'coin', 'comet', 'computer-keyboard', 'computer-monitor', 'computer-mouse', 'conch', 'cormorant', 'covered-wagon', 'cowboy-hat', 'crab-101', 'desk-globe', 'diamond-ring', 'dice', 'dog', 'dolphin-101', 'doorknob', 'drinking-straw', 'duck', 'dumb-bell', 'eiffel-tower', 'electric-guitar-101', 'elephant-101', 'elk', 'ewer-101', 'eyeglasses', 'fern', 'fighter-jet', 'fire-extinguisher', 'fire-hydrant', 'fire-truck', 'fireworks', 'flashlight', 'floppy-disk', 'football-helmet', 'french-horn', 'fried-egg', 'frisbee', 'frog', 'frying-pan', 'galaxy', 'gas-pump', 'giraffe', 'goat', 'golden-gate-bridge', 'goldfish', 'golf-ball', 'goose', 'gorilla', 'grand-piano-101', 'grapes', 'grasshopper', 'guitar-pick', 'hamburger', 'hammock', 'harmonica', 'harp', 'harpsichord', 'hawksbill-101', 'head-phones', 'helicopter-101', 'hibiscus', 'homer-simpson', 'horse', 'horseshoe-crab', 'hot-air-balloon', 'hot-dog', 'hot-tub', 'hourglass', 'house-fly', 'human-skeleton', 'hummingbird', 'ibis-101', 'ice-cream-cone', 'iguana', 'ipod', 'iris', 'jesus-christ', 'joy-stick', 'kangaroo-101', 'kayak', 'ketch-101', 'killer-whale', 'knife', 'ladder', 'laptop-101', 'lathe', 'leopards-101', 'license-plate', 'lightbulb', 'light-house', 'lightning', 'llama-101', 'mailbox', 'mandolin', 'mars', 'mattress', 'megaphone', 'menorah-101', 'microscope', 'microwave', 'minaret', 'minotaur', 'motorbikes-101', 'mountain-bike', 'mushroom', 'mussels', 'necktie', 'octopus', 'ostrich', 'owl', 'palm-pilot', 'palm-tree', 'paperclip', 'paper-shredder', 'pci-card', 'penguin', 'people', 'pez-dispenser', 'photocopier', 'picnic-table', 'playing-card', 'porcupine', 'pram', 'praying-mantis', 'pyramid', 'raccoon', 'radio-telescope', 'rainbow', 'refrigerator', 'revolver-101', 'rifle', 'rotary-phone', 'roulette-wheel', 'saddle', 'saturn', 'school-bus', 'scorpion-101', 'screwdriver', 'segway', 'self-propelled-lawn-mower', 'sextant', 'sheet-music', 'skateboard', 'skunk', 'skyscraper', 'smokestack', 'snail', 'snake', 'sneaker', 'snowmobile', 'soccer-ball', 'socks', 'soda-can', 'spaghetti', 'speed-boat', 'spider', 'spoon', 'stained-glass', 'starfish-101', 'steering-wheel', 'stirrups', 'sunflower-101', 'superman', 'sushi', 'swan', 'swiss-army-knife', 'sword', 'syringe', 'tambourine', 'teapot', 'teddy-bear', 
'teepee', 'telephone-box', 'tennis-ball', 'tennis-court', 'tennis-racket', 'theodolite', 'toaster', 'tomato', 'tombstone', 'top-hat', 'touring-bike', 'tower-pisa', 'traffic-light', 'treadmill', 'triceratops', 'tricycle', 'trilobite-101', 'tripod', 't-shirt', 'tuning-fork', 'tweezer', 'umbrella-101', 'unicorn', 'vcr', 'video-projector', 'washing-machine', 'watch-101', 'waterfall', 'watermelon', 'welding-mask', 'wheelbarrow', 'windmill', 'wine-bottle', 'xylophone', 'yarmulke', 'yo-yo', 'zebra', 'airplanes-101', 'car-side-101', 'faces-easy-101', 'greyhound', 'tennis-shoes', 'toad', 'clutter']\n", 330 | "print(\"Result: label - \" + object_categories[index] + \", probability - \" + str(result[index]))" 331 | ] 332 | }, 333 | { 334 | "cell_type": "markdown", 335 | "metadata": {}, 336 | "source": [ 337 | "### Clean up\n", 338 | "\n", 339 | "When we're done with the endpoint, we can just delete it and the backing instance will be released. Run the following cell to delete the endpoint." 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": null, 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [ 348 | "sess.delete_endpoint(ic_classifier.endpoint_name)" 349 | ] 350 | } 351 | ], 352 | "metadata": { 353 | "kernelspec": { 354 | "display_name": "conda_mxnet_p36", 355 | "language": "python", 356 | "name": "conda_mxnet_p36" 357 | }, 358 | "language_info": { 359 | "codemirror_mode": { 360 | "name": "ipython", 361 | "version": 3 362 | }, 363 | "file_extension": ".py", 364 | "mimetype": "text/x-python", 365 | "name": "python", 366 | "nbconvert_exporter": "python", 367 | "pygments_lexer": "ipython3", 368 | "version": "3.6.10" 369 | }, 370 | "notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." 371 | }, 372 | "nbformat": 4, 373 | "nbformat_minor": 2 374 | } 375 | -------------------------------------------------------------------------------- /notebooks/data_distribution_types.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Taking Full Advantage of Parallelism With Data Distribution\n", 8 | "_**Using Amazon SageMaker's Managed, Distributed Training with Different Data Distribution Methods**_\n", 9 | "\n", 10 | "---\n", 11 | "\n", 12 | "---\n", 13 | "\n", 14 | "## Contents\n", 15 | "\n", 16 | "1. [Background](#Background)\n", 17 | "1. [Setup](#Setup)\n", 18 | "1. [Data](#Data)\n", 19 | " 1. [Scaling](#Scaling)\n", 20 | "1. [Training-Hosting](#Training-Hosting)\n", 21 | "1. [Evaluate](#Evaluate)\n", 22 | "\n", 23 | "\n", 24 | "This notebook contains two parts of the Parallelized Data Distribution module of the Built-in Algorithms workshop. 
More specifically, it covers exploratory data analysis and evaluation of the trained models; training and deployment of the models is covered in the workshop lab guide.\n", 25 | "\n", 26 | "---\n", 27 | "# Setup\n", 28 | "\n", 29 | "_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n", 30 | "\n", 31 | "Let's start by specifying the S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. In the first line of code, replace the placeholder text with your bucket name, such as `smworkshop-john-smith`, with NO path. Be sure to keep the quotes around the name. The IAM role for S3 data access is pulled in from the SageMaker Notebook Instance. After you have replaced the bucket name, go ahead and run the cell by clicking the 'Run cell' button in the toolbar above, or using Control + Enter from your keyboard. Do NOT 'Run All' cells because we will be leaving the notebook for training and returning later for evaluation." 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": { 38 | "isConfigCell": true 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "bucket = '' # replace with bucket name in quotes only, NO URL path\n", 43 | "prefix = 'sagemaker/data_distribution_types'\n", 44 | "\n", 45 | "import boto3\n", 46 | "import re\n", 47 | "from sagemaker import get_execution_role\n", 48 | "\n", 49 | "role = get_execution_role()" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "Next, we'll import the Python libraries we'll need for the remainder of the exercise." 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "import boto3\n", 66 | "import pandas as pd\n", 67 | "import numpy as np\n", 68 | "import matplotlib.pyplot as plt\n", 69 | "from IPython.display import display\n", 70 | "import io\n", 71 | "import time\n", 72 | "import copy\n", 73 | "import json\n", 74 | "import sys\n", 75 | "import sagemaker.amazon.common as smac\n", 76 | "import os\n", 77 | "import warnings\n", 78 | "warnings.simplefilter(\"ignore\")" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "---\n", 86 | "\n", 87 | "## Data\n", 88 | "\n", 89 | "The [dataset](https://aws.amazon.com/public-datasets/gdelt/) we'll use for this notebook is from the [Global Database of Events, Language and Tone (GDELT) Project](https://www.gdeltproject.org/). This information is freely available on S3 as part of the [AWS Public Datasets](https://aws.amazon.com/public-datasets/) program.\n", 90 | "\n", 91 | "The data are stored as multiple files on S3, in two different formats: historical, which covers 1979 to 2013, and daily updates, which cover 2013 onward. For this example, we'll stick to historical. Let's bring in the 1979 data for interactive exploration. We'll write a simple function so that later we can use it to download multiple files. Among other actions, the function downloads the data from the public S3 bucket, and reads it into a Pandas dataframe. \n", 92 | "\n", 93 | "**NOTE**: If running the following code cell causes an exception, you might need to add permissions to your notebook environment's IAM role to read the `gdelt-open-data` S3 bucket. See https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_manage_modify.html. 
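As a quick sanity check, you can confirm that the notebook role can read this public bucket before running the download function. The snippet below is a hypothetical convenience check (the `s3_check` client name is arbitrary), not a required workshop step.\n", "\n", "```python\n", "# Hypothetical access check: list a few objects under the public GDELT events prefix.\n", "# An AccessDenied error here means the notebook role's permissions need updating as described above.\n", "import boto3\n", "s3_check = boto3.client('s3')\n", "resp = s3_check.list_objects_v2(Bucket='gdelt-open-data', Prefix='events/1979', MaxKeys=5)\n", "print([obj['Key'] for obj in resp.get('Contents', [])])\n", "```\n", "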
" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "def get_gdelt(filename):\n", 103 | " s3 = boto3.resource('s3')\n", 104 | " s3.Bucket('gdelt-open-data').download_file('events/' + filename, '.gdelt.csv')\n", 105 | " df = pd.read_csv('.gdelt.csv', sep='\\t')\n", 106 | " header = pd.read_csv('https://www.gdeltproject.org/data/lookups/CSV.header.historical.txt', sep='\\t')\n", 107 | " df.columns = header.columns\n", 108 | " return df" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "data = get_gdelt('1979.csv')\n", 118 | "data" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "As we can see, there are 57 columns (you may not be able to see all 57 as they may be truncated for display purposes). Some of the columns are sparsely populated, cryptically named, and in a format that's not particularly friendly for machine learning. So, for our use case, we'll strip down to a few core attributes. We'll use:\n", 126 | "\n", 127 | "- `EventCode`: This is the raw CAMEO action code describing the action that Actor1 performed upon Actor2. More detail can be found [here](https://www.gdeltproject.org/data/documentation/CAMEO.Manual.1.1b3.pdf)\n", 128 | "- `NumArticles`: This is the total number of source documents containing one or more mentions of this event. This can be used as a method of assessing the “importance” of an event; the more discussion of that event, the more likely it is to be significant\n", 129 | "- `AvgTone`: This is the average “tone” of all documents containing one or more mentions of this event. The score ranges from -100 (extremely negative) to +100 (extremely positive). Common values range between -10 and +10, with 0 indicating neutral.\n", 130 | "- `Actor1Geo_Lat`: This is the centroid latitude of the Actor1 landmark for mapping.\n", 131 | "- `Actor1Geo_Long`: This is the centroid longitude of the Actor1 landmark for mapping.\n", 132 | "- `Actor2Geo_Lat`: This is the centroid latitude of the Actor2 landmark for mapping.\n", 133 | "- `Actor2Geo_Long`: This is the centroid longitude of the Actor2 landmark for mapping." 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "data = data[['EventCode', 'NumArticles', 'AvgTone', 'Actor1Geo_Lat', 'Actor1Geo_Long', 'Actor2Geo_Lat', 'Actor2Geo_Long']]\n", 143 | "data['EventCode'] = data['EventCode'].astype(object)\n", 144 | "\n", 145 | "for column in data.select_dtypes(include=['object']).columns:\n", 146 | " display(pd.crosstab(index=data[column], columns='% observations', normalize='columns'))\n", 147 | "\n", 148 | "display(data.describe())\n", 149 | "hist = data.hist(bins=30, sharey=True, figsize=(10, 10))\n", 150 | "plt.show()" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "We can see:\n", 158 | "- `EventCode` is pretty unevenly distributed, with some events making up 7%+ of the observations and others being a thousandth of a percent.\n", 159 | "- `AvgTone` seems to be reasonably smoothly distributed, while `NumArticles` has a long tail, and `Actor` geo features have suspiciously large spikes near 0.\n", 160 | "\n", 161 | "Let's remove the (0, 0) lat-longs, one hot encode `EventCode`, and prepare our data for a machine learning model. 
For this example, we'll keep things straightforward and try to predict `AvgTone`, using the other variables in our dataset as features.\n", 162 | "\n", 163 | "One more issue remains. As we noticed above, some occurrences of `EventCode` are very rare, and may be unlikely to occur in every single year. This means if we one hot encode individual years at a time, our feature matrix may change shape over the years, which will not work. Therefore, we'll limit all years to the most common `EventCodes` from the year we currently have. Let's get this list." 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "events = pd.crosstab(index=data['EventCode'], columns='count').sort_values(by='count', ascending=False).index[:20]" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "### Scaling\n", 180 | "\n", 181 | "Now that we've explored our data and are ready to prepare for modeling, we can start developing a few simple functions to help us scale this to GDELT datasets from other years." 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": null, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "def write_to_s3(bucket, prefix, channel, file_prefix, X, y):\n", 191 | " buf = io.BytesIO()\n", 192 | " smac.write_numpy_to_dense_tensor(buf, X.astype('float32'), y.astype('float32'))\n", 193 | " buf.seek(0)\n", 194 | " boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, channel, file_prefix + '.data')).upload_fileobj(buf)\n", 195 | "\n", 196 | "def transform_gdelt(df, events=None):\n", 197 | " df = df[['AvgTone', 'EventCode', 'NumArticles', 'Actor1Geo_Lat', 'Actor1Geo_Long', 'Actor2Geo_Lat', 'Actor2Geo_Long']]\n", 198 | " df['EventCode'] = df['EventCode'].astype(object)\n", 199 | " if events is not None:\n", 200 | " df = df[np.in1d(df['EventCode'], events)]\n", 201 | " return pd.get_dummies(df[((df['Actor1Geo_Lat'] == 0) & (df['Actor1Geo_Long'] == 0) != True) &\n", 202 | " ((df['Actor2Geo_Lat'] == 0) & (df['Actor2Geo_Long'] == 0) != True)])\n", 203 | " \n", 204 | "def prepare_gdelt(bucket, prefix, file_prefix, events=None, random_state=1729):\n", 205 | " df = get_gdelt(file_prefix + '.csv')\n", 206 | " model_data = transform_gdelt(df, events)\n", 207 | " train_data, validation_data = np.split(model_data.sample(frac=1, random_state=random_state).to_numpy(), \n", 208 | " [int(0.9 * len(model_data))])\n", 209 | " write_to_s3(bucket, prefix, 'train', file_prefix, train_data[:, 1:], train_data[:, 0])\n", 210 | " write_to_s3(bucket, prefix, 'validation', file_prefix, validation_data[:, 1:], validation_data[:, 0])" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "for year in range(1979, 1984):\n", 220 | " prepare_gdelt(bucket, prefix, str(year), events)" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "---\n", 228 | "\n", 229 | "## Training-Hosting\n", 230 | "\n", 231 | "Next, we'll set up training jobs and deploy the resulting models. For those steps, please return to the workshop lab guide after the previous code cell completes running, which may take a few minutes. You'll return to this notebook for the final evaluation steps. 
If you are running the **Distributed Training** workshop module from GitHub, the link back is https://github.com/awslabs/amazon-sagemaker-workshop/tree/master/Built-in-Algorithms." 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "---\n", 239 | "\n", 240 | "## Evaluate\n", 241 | "\n", 242 | "To compare predictions from our two models, let's bring in some new data from a year the model was not trained or validated on." 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "test_data = transform_gdelt(get_gdelt('1984.csv'), events).to_numpy()\n", 252 | "test_X = test_data[:, 1:]\n", 253 | "test_y = test_data[:, 0]" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "Now we'll need a function to convert these numpy matrices to CSVs so they can be passed to our endpoint in HTTP POST requests." 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "def np2csv(arr):\n", 270 | " csv = io.BytesIO()\n", 271 | " np.savetxt(csv, arr, delimiter=',', fmt='%g')\n", 272 | " return csv.getvalue().decode().rstrip()" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "Next, because POST requests to our endpoint are limited to ~6MB, we'll set up a small function to split our test data up into mini-batches that are each about 5MB, loop through and invoke our endpoint to get predictions for those mini-batches, and gather them into a single array. " 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "metadata": {}, 286 | "outputs": [], 287 | "source": [ 288 | "def predict_batches(data, endpoint):\n", 289 | " rows = 5. * 1024. * 1024. / sys.getsizeof(np2csv(data[0, :]))\n", 290 | " split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))\n", 291 | " predictions = []\n", 292 | " runtime = boto3.Session().client('runtime.sagemaker')\n", 293 | " for array in split_array:\n", 294 | " payload = np2csv(array)\n", 295 | " response = runtime.invoke_endpoint(EndpointName=endpoint,\n", 296 | " ContentType='text/csv',\n", 297 | " Body=payload)\n", 298 | " result = json.loads(response['Body'].read().decode())\n", 299 | " predictions += [r['score'] for r in result['predictions']]\n", 300 | " return np.array(predictions)" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "metadata": {}, 306 | "source": [ 307 | "We're almost ready to get the predictions. First, however, we need to bring in the names of the endpoints we created in the previous section of the module. If you followed the naming convention in the lab guide, the names are already filled in for the code cell below. Otherwise, replace the names with the ones you chose." 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "metadata": {}, 314 | "outputs": [], 315 | "source": [ 316 | "sharded_endpoint = 'distributed-sharded'\n", 317 | "replicated_endpoint = 'distributed-replicated'" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": {}, 323 | "source": [ 324 | "Now we'll compare the two models' accuracy using mean squared error (MSE)." 
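, "\n", "Before invoking the endpoints, you can optionally confirm that both are in service. This is a hypothetical convenience check; it assumes only the endpoint names defined in the cell above.\n", "\n", "```python\n", "# Optional check: verify both endpoints report 'InService' before requesting predictions.\n", "sm_client = boto3.Session().client('sagemaker')\n", "for endpoint_name in [sharded_endpoint, replicated_endpoint]:\n", "    status = sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']\n", "    print(endpoint_name, status)\n", "```"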
325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": null, 330 | "metadata": {}, 331 | "outputs": [], 332 | "source": [ 333 | "sharded_predictions = predict_batches(test_X, sharded_endpoint)\n", 334 | "replicated_predictions = predict_batches(test_X, replicated_endpoint)\n", 335 | "\n", 336 | "print('Sharded MSE =', np.mean((test_y - sharded_predictions) ** 2))\n", 337 | "print('Replicated MSE =', np.mean((test_y - replicated_predictions) ** 2))" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": {}, 343 | "source": [ 344 | "We can see that the fully replicated distribution type performs just slightly better in terms of fit. However, this difference is small compared to the overall speedup achieved by dividing the data set into multiple S3 objects and distributing them across machines.\n", 345 | "\n", 346 | "This concludes the notebook portion of this module. Please return to the workshop lab guide. If you are running the **Distributed Training** workshop module from GitHub, the link back is https://github.com/awslabs/amazon-sagemaker-workshop/tree/master/Built-in-Algorithms." 347 | ] 348 | } 349 | ], 350 | "metadata": { 351 | "kernelspec": { 352 | "display_name": "conda_python3", 353 | "language": "python", 354 | "name": "conda_python3" 355 | }, 356 | "language_info": { 357 | "codemirror_mode": { 358 | "name": "ipython", 359 | "version": 3 360 | }, 361 | "file_extension": ".py", 362 | "mimetype": "text/x-python", 363 | "name": "python", 364 | "nbconvert_exporter": "python", 365 | "pygments_lexer": "ipython3", 366 | "version": "3.6.10" 367 | }, 368 | "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." 369 | }, 370 | "nbformat": 4, 371 | "nbformat_minor": 2 372 | } 373 | -------------------------------------------------------------------------------- /notebooks/sentiment-analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Sentiment Analysis with TensorFlow 2\n", 8 | "\n", 9 | "Amazon SageMaker provides both (1) built-in algorithms and (2) an easy path to train your own custom models. Although the built-in algorithms cover many domains (computer vision, natural language processing etc.) and are easy to use (just provide your data), sometimes training a custom model is the preferred approach. This notebook will focus on training a custom model using TensorFlow 2. \n", 10 | "\n", 11 | "Sentiment analysis is a very common text analytics task that determines whether a text sample is positive or negative about its subject. There are several different algorithms for performing this task, including older statistical algorithms and newer deep learning algorithms. With respect to deep learning, a 1D Convolutional Neural Net (CNN) is sometimes used for this purpose. 
In this notebook we'll use a CNN built with TensorFlow 2 to perform sentiment analysis in Amazon SageMaker on the IMDb dataset, which consists of movie reviews labeled as having positive or negative sentiment. Several aspects of Amazon SageMaker will be demonstrated:\n", 12 | "\n", 13 | "- How to use a SageMaker prebuilt TensorFlow 2 container with a custom model training script similar to one you would use outside SageMaker. This feature is known as Script Mode. \n", 14 | "- Hosted training: for full-scale training on a complete dataset on a separate, larger and more powerful SageMaker-managed GPU instance. \n", 15 | "- Distributed training: using multiple GPUs to speed up training. \n", 16 | "- Batch Transform: for offline, asynchronous predictions on large batches of data. \n", 17 | "- Instance type choices: many different kinds of CPU and GPU instances are available in SageMaker, and are applicable to different use cases.\n", 18 | "\n", 19 | "### ***Prerequisites***\n", 20 | "\n", 21 | "In SageMaker Studio, for kernel select **Python 3 (Data Science)**; for a SageMaker Notebook Instance, select the kernel **conda_python3**.\n", 22 | "\n", 23 | "# Prepare the dataset\n", 24 | "\n", 25 | "We'll begin by importing some necessary libraries." 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "%matplotlib inline\n", 35 | "\n", 36 | "import logging\n", 37 | "logging.getLogger(\"tensorflow\").setLevel(logging.ERROR)\n", 38 | "import numpy as np\n", 39 | "import os\n", 40 | "import sys\n", 41 | "\n", 42 | "!{sys.executable} -m pip install tensorflow --quiet" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "Now we'll load the reviews dataset. Each review is represented as an array of numbers, where each number represents an indexed word. We'll then pad (or truncate) the reviews so they all have the same specified maximum length." 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "from tensorflow.keras.preprocessing import sequence\n", 59 | "from tensorflow.keras.datasets import imdb\n", 60 | "\n", 61 | "max_features = 20000\n", 62 | "maxlen = 400\n", 63 | "\n", 64 | "(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)\n", 65 | "print(len(x_train), 'train sequences')\n", 66 | "print(len(x_test), 'test sequences')\n", 67 | "\n", 68 | "x_train = sequence.pad_sequences(x_train, maxlen=maxlen)\n", 69 | "x_test = sequence.pad_sequences(x_test, maxlen=maxlen)\n", 70 | "print('x_train shape:', x_train.shape)\n", 71 | "print('x_test shape:', x_test.shape)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "Next, we'll save the padded data to files, locally for now, and later to Amazon S3." 
79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "data_dir = os.path.join(os.getcwd(), 'data')\n", 88 | "os.makedirs(data_dir, exist_ok=True)\n", 89 | "\n", 90 | "train_dir = os.path.join(os.getcwd(), 'data/train')\n", 91 | "os.makedirs(train_dir, exist_ok=True)\n", 92 | "\n", 93 | "test_dir = os.path.join(os.getcwd(), 'data/test')\n", 94 | "os.makedirs(test_dir, exist_ok=True)\n", 95 | "\n", 96 | "csv_test_dir = os.path.join(os.getcwd(), 'data/csv-test')\n", 97 | "os.makedirs(csv_test_dir, exist_ok=True)\n", 98 | "\n", 99 | "np.save(os.path.join(train_dir, 'x_train.npy'), x_train)\n", 100 | "np.save(os.path.join(train_dir, 'y_train.npy'), y_train)\n", 101 | "np.save(os.path.join(test_dir, 'x_test.npy'), x_test)\n", 102 | "np.save(os.path.join(test_dir, 'y_test.npy'), y_test)\n", 103 | "np.savetxt(os.path.join(csv_test_dir, 'csv-test.csv'), np.array(x_test[:100], dtype=np.int32), fmt='%d', delimiter=\",\")" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "# SageMaker Training\n", 111 | "\n", 112 | "With our dataset prepared, we're now ready to set up a SageMaker hosted training job. The core concept of SageMaker hosted training is to use more powerful compute resources separate from the less powerful, lower-cost notebook instance that you use for prototyping. Hosted training spins up one or more instances (i.e. a cluster) for training, and then tears the cluster down when training is complete, with billing per second of cluster uptime. In general, hosted training is preferred for doing actual large-scale training on more powerful instances, especially for distributed training on a single large instance with multiple GPUs, or multiple instances each having multiple GPUs. \n", 113 | "\n", 114 | "### Git Configuration\n", 115 | "\n", 116 | "To begin, we need a training script that can be used to train the model in Amazon SageMaker. In this example, we'll use Git integration. That is, you can specify a training script that is stored in a GitHub, AWS CodeCommit or another Git repository as the entry point so that you don't have to download the scripts locally. For this purpose, the source directory and dependencies should be in the same repository.\n", 117 | "\n", 118 | "To use Git integration, pass a dict `git_config` as a parameter when you create an Amazon SageMaker Estimator object. In the `git_config` parameter, you specify the fields `repo`, `branch` and `commit` to locate the specific repo you want to use. If you do not specify `commit` in `git_config`, the latest commit of the specified repo and branch will be used by default. Also, if authentication is required to access the repo, you can specify fields `2FA_enabled`, `username`, `password` and `token` accordingly.\n", 119 | "\n", 120 | "The script that we will use in this example is stored in a public GitHub repo so we don't need authentication to access it. 
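If your own training script lived in a private repository instead, the same `git_config` dict could also carry the authentication fields mentioned above. The following is only a hypothetical sketch; the repo URL and token value are placeholders and are not used in this workshop.\n", "\n", "```python\n", "# Hypothetical private-repo configuration (placeholders only; not needed for this example's public repo)\n", "private_git_config = {'repo': 'https://github.com/your-org/your-private-repo.git',\n", "                      'branch': 'main',\n", "                      '2FA_enabled': True,\n", "                      'token': 'your-personal-access-token'}\n", "```\n", "\n", "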
Let's specify the `git_config` argument here:" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "git_config = {'repo': 'https://github.com/aws-samples/amazon-sagemaker-script-mode', \n", 130 | " 'branch': 'master'}" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "### Upload data to S3\n", 138 | "\n", 139 | "Before starting hosted training, the data must be present in storage that can be accessed by SageMaker. The storage options are: Amazon S3 (object storage service), Amazon EFS (elastic NFS file system service), and Amazon FSx for Lustre (high-performance file system service). For this example, we'll upload the data to S3. " 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "import sagemaker\n", 149 | "\n", 150 | "s3_prefix = 'tf-keras-sentiment'\n", 151 | "\n", 152 | "traindata_s3_prefix = '{}/data/train'.format(s3_prefix)\n", 153 | "testdata_s3_prefix = '{}/data/test'.format(s3_prefix)\n", 154 | "\n", 155 | "train_s3 = sagemaker.Session().upload_data(path='./data/train/', key_prefix=traindata_s3_prefix)\n", 156 | "test_s3 = sagemaker.Session().upload_data(path='./data/test/', key_prefix=testdata_s3_prefix)\n", 157 | "\n", 158 | "inputs = {'train':train_s3, 'test': test_s3}\n", 159 | "print(inputs)" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "### Estimator setup\n", 167 | "\n", 168 | "With the training data now in S3, we're ready to set up an Estimator object for hosted training. Most of the Estimator parameters are self-explanatory; further discussion of the instance type selection is below. The parameters most likely to change between different training jobs, the algorithm hyperparameters, are passed in as a dictionary." 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "from sagemaker.tensorflow import TensorFlow\n", 178 | "\n", 179 | "model_dir = '/opt/ml/model'\n", 180 | "train_instance_type = 'ml.p3.8xlarge'\n", 181 | "hyperparameters = {'epochs': 10, 'batch_size': 256, 'learning_rate': 0.01}\n", 182 | "\n", 183 | "estimator = TensorFlow(\n", 184 | " git_config=git_config,\n", 185 | " source_dir='tf-sentiment-script-mode',\n", 186 | " entry_point='sentiment.py',\n", 187 | " model_dir=model_dir,\n", 188 | " instance_type=train_instance_type,\n", 189 | " instance_count=1,\n", 190 | " hyperparameters=hyperparameters,\n", 191 | " role=sagemaker.get_execution_role(),\n", 192 | " base_job_name='tf-sentiment',\n", 193 | " framework_version='2.1',\n", 194 | " py_version='py3',\n", 195 | " script_mode=True)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "### Distributed training on a single multi-GPU instance\n", 203 | "\n", 204 | "The SageMaker instance type selected above, p3.8xlarge, contains four GPUs based on NVIDIA's V100 Tensor Core architecture. This presents an opportunity to do distributed training within a single multi-GPU instance, utilizing all four GPUs to reduce total training time compared to using a single GPU. Although using multiple instances also is a possibility, using a single multi-GPU instance may be more performant because it avoids extra network traffic necessary to coordinate multiple instances. 
For larger datasets and more complex models, using multiple instances may be a necessity; however, that is not the case here.\n", 205 | "\n", 206 | "To utilize all four GPUs on the instance, you don't need to do anything special in Amazon SageMaker: TensorFlow 2 itself will handle the details under the hood. TensorFlow 2 includes several native distribution strategies, including MirroredStrategy, which is well-suited for training a model using multiple GPUs on a single instance. To enable MirroredStrategy, we simply add the following lines of code to the training script before defining and compiling the model (this has already been done for this example):\n", 207 | "\n", 208 | "```python\n", 209 | "def get_model(learning_rate):\n", 210 | "\n", 211 | " mirrored_strategy = tf.distribute.MirroredStrategy()\n", 212 | " \n", 213 | " with mirrored_strategy.scope():\n", 214 | " embedding_layer = tf.keras.layers.Embedding(max_features,\n", 215 | " embedding_dims,\n", 216 | " input_length=maxlen)\n", 217 | " ....\n", 218 | " model.compile(loss='binary_crossentropy',\n", 219 | " optimizer=optimizer,\n", 220 | " metrics=['accuracy'])\n", 221 | " \n", 222 | " return model\n", 223 | "```\n", 224 | "\n", 225 | "Additionally, the batch size is increased in the Estimator hyperparameters to account for the fact that batches are divided among multiple GPUs. If you are interested in reviewing the rest of the training code, it is at the GitHub repository referenced above in the `git_config` variable. " 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "### Start the hosted training job\n", 233 | "\n", 234 | "We simply call `fit` to start the actual hosted training. The training job should take around 5 minutes, including the time needed to spin up the training instance. At the end of hosted training, you'll see from the logs below the code cell that validation accuracy approaches 90%, and that the number of billable seconds is in the neighborhood of 180. " 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "estimator.fit(inputs)" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "The validation accuracy appears to have plateaued, so the model might be overfitting: it might be less able to generalize to data it has not yet seen. This is the case even though we are employing dropout as a regularization technique to reduce the possibility of overfitting. (See the training script at the GitHub repository referenced above.) For a production model, further experimentation would be necessary.\n", 251 | "\n", 252 | "TensorFlow 2's tf.keras API provides a convenient way to capture the history of model training. When the model was saved after training, the history was saved alongside it. To retrieve the history, we first download the trained model from the S3 bucket where SageMaker stored it. Models trained by SageMaker are always accessible in this way to be run anywhere. 
Next, we can unzip it to gain access to the history data structure, and then simply load the history as JSON:" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "import json \n", 262 | "\n", 263 | "!aws s3 cp {estimator.model_data} ./model/model.tar.gz\n", 264 | "!tar -xzf ./model/model.tar.gz -C ./model\n", 265 | "\n", 266 | "with open('./model/history.p', \"r\") as f:\n", 267 | " history = json.load(f)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "Now we can plot the training curves based on the history, with separate graphs for model accuracy and model loss. We can see that training converged relatively smoothly to higher model accuracy and correspondingly lower model loss." 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "import matplotlib.pyplot as plt\n", 284 | "\n", 285 | "def plot_training_curves(history): \n", 286 | " fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharex=True)\n", 287 | " ax = axes[0]\n", 288 | " ax.plot(history['accuracy'], label='train')\n", 289 | " ax.set(title='model accuracy', ylabel='accuracy', xlabel='epoch')\n", 290 | " ax.legend()\n", 291 | " ax = axes[1]\n", 292 | " ax.plot(history['loss'], label='train')\n", 293 | " ax.set(title='model loss', ylabel='loss', xlabel='epoch')\n", 294 | " ax.legend()\n", 295 | " fig.tight_layout()\n", 296 | " \n", 297 | "plot_training_curves(history)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "# Batch Prediction\n", 305 | "\n", 306 | "\n", 307 | "If our use case requires individual predictions in near real-time, SageMaker hosted endpoints can be created. Hosted endpoints also can be used for pseudo-batch prediction, but the process is more involved than simply using SageMaker's Batch Transform feature, which is designed for large-scale, asynchronous batch inference.\n", 308 | "\n", 309 | "To use Batch Transform, we first upload to S3 some input test data to be transformed. The data can be in any format accepted by your model; in this case, it is CSV." 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "metadata": {}, 316 | "outputs": [], 317 | "source": [ 318 | "csvtestdata_s3_prefix = '{}/data/csv-test'.format(s3_prefix)\n", 319 | "csvtest_s3 = sagemaker.Session().upload_data(path='./data/csv-test/', key_prefix=csvtestdata_s3_prefix)\n", 320 | "print(csvtest_s3)" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "A Transformer object must be set up to describe the Batch Transform job, including the amount and type of inference hardware to be used. Then the actual transform job itself is started with a call to the `transform` method of the Transformer. When setting up Batch Transform jobs, hardware considerations are important, just as they are for training:\n", 328 | "\n", 329 | "- `instance_count`: Batch Transform can spin up a cluster of multiple instances; at the end of the job, the cluster is torn down automatically. Since this dataset is small, we'll just use one instance.\n", 330 | "- `instance_type`: When doing inference for smaller models, such as this one, often CPU-based instance types can be used instead of GPU instance types, allowing significant cost savings. 
Note, however, that the choice of specific CPU instance type can significantly affect inference speed: although we could use a general purpose instance here such as a m5.xlarge, if we use a compute-optimized c5.xlarge instance, the total batch inference time is cut in half." 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": null, 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [ 339 | "transformer = estimator.transformer(instance_count=1, \n", 340 | " instance_type='ml.c5.xlarge')\n", 341 | "\n", 342 | "transformer.transform(csvtest_s3, content_type='text/csv')\n", 343 | "print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)\n", 344 | "transformer.wait()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "We can now download the batch predictions from S3 to the local filesystem on the notebook instance; the predictions are contained in a file with a .out extension, and are embedded in JSON. Next we'll load the JSON and examine the predictions, which are confidence scores from 0.0 to 1.0 where numbers close to 1.0 indicate positive sentiment, while numbers close to 0.0 indicate negative sentiment." 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [ 360 | "import json\n", 361 | "\n", 362 | "batch_output = transformer.output_path\n", 363 | "!mkdir -p batch_data/output\n", 364 | "!aws s3 cp --recursive $batch_output/ batch_data/output/\n", 365 | "\n", 366 | "with open('batch_data/output/csv-test.csv.out', 'r') as f:\n", 367 | " jstr = json.load(f)\n", 368 | " results = [float('%.3f'%(item)) for sublist in jstr['predictions'] for item in sublist]\n", 369 | " print(results)" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [ 376 | "Now let's look at the text of some actual reviews to see the predictions in action. First, we have to convert the integers representing the words back to the words themselves by using a reversed dictionary. Next we can decode the reviews, taking into account that the first 3 indices were reserved for \"padding\", \"start of sequence\", and \"unknown\", and removing a string of unknown tokens from the start of the review." 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": null, 382 | "metadata": {}, 383 | "outputs": [], 384 | "source": [ 385 | "import re\n", 386 | "\n", 387 | "regex = re.compile(r'^[\\?\\s]+')\n", 388 | "\n", 389 | "word_index = imdb.get_word_index()\n", 390 | "reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])\n", 391 | "first_decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_test[0]])\n", 392 | "regex.sub('', first_decoded_review)" 393 | ] 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "metadata": {}, 398 | "source": [ 399 | "Overall, this review looks fairly negative. 
Let's compare the actual label with the prediction:" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": {}, 406 | "outputs": [], 407 | "source": [ 408 | "def get_sentiment(score):\n", 409 | " return 'positive' if score > 0.5 else 'negative' \n", 410 | "\n", 411 | "print('Labeled sentiment for this review is {}, predicted sentiment is {}'.format(get_sentiment(y_test[0]), \n", 412 | " get_sentiment(results[0])))" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "metadata": {}, 418 | "source": [ 419 | "Training deep learning models is a stochastic process, so your results may vary -- there is no guarantee that the predicted result will match the actual label. However, it is likely that the sentiment prediction agrees with the label for this review. Let's now examine another review:" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "second_decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_test[5]])\n", 429 | "regex.sub('', second_decoded_review)" 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": null, 435 | "metadata": {}, 436 | "outputs": [], 437 | "source": [ 438 | "print('Labeled sentiment for this review is {}, predicted sentiment is {}'.format(get_sentiment(y_test[5]), \n", 439 | " get_sentiment(results[5])))" 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": {}, 445 | "source": [ 446 | "Again, it is likely (but not guaranteed) that the prediction agreed with the label for the test data. Note that there is no need to clean up any Batch Transform resources: after the transform job is complete, the cluster used to make inferences is torn down. Now that we've reviewed some sample predictions as a sanity check, this brief example is complete. " 447 | ] 448 | } 449 | ], 450 | "metadata": { 451 | "instance_type": "ml.t3.medium", 452 | "kernelspec": { 453 | "display_name": "Python 3 (Data Science)", 454 | "language": "python", 455 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/datascience-1.0" 456 | }, 457 | "language_info": { 458 | "codemirror_mode": { 459 | "name": "ipython", 460 | "version": 3 461 | }, 462 | "file_extension": ".py", 463 | "mimetype": "text/x-python", 464 | "name": "python", 465 | "nbconvert_exporter": "python", 466 | "pygments_lexer": "ipython3", 467 | "version": "3.7.10" 468 | } 469 | }, 470 | "nbformat": 4, 471 | "nbformat_minor": 4 472 | } 473 | -------------------------------------------------------------------------------- /notebooks/videogame-sales-cli-console.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Videogame Sales Prediction with the AWS CLI and Console\n", 8 | "_**Using XGBoost to Predict Whether Sales will Exceed the \"Hit\" Threshold**_\n", 9 | "\n", 10 | "---\n", 11 | "\n", 12 | "---\n", 13 | "\n", 14 | "## Contents\n", 15 | "\n", 16 | "1. [Background](#Background)\n", 17 | "1. [Setup](#Setup)\n", 18 | "1. [Data](#Data)\n", 19 | "1. [Training-Hosting](#Training-Hosting)\n", 20 | "1. [Evaluation](#Evaluation)\n", 21 | "1. [Extensions](#Extensions)\n", 22 | "\n", 23 | "\n", 24 | "## Background\n", 25 | "\n", 26 | "Word of mouth in the form of user reviews, critic reviews, social media comments, etc. 
often can provide insights about whether a product ultimately will be a success. In the video game industry in particular, reviews and ratings can have a large impact on a game's success. However, not all games with bad reviews fail, and not all games with good reviews turn out to be hits. To predict hit games, machine learning algorithms potentially can take advantage of various relevant data attributes in addition to reviews. \n", 27 | "\n", 28 | "For this notebook, we will work with the data set Video Game Sales with Ratings. This [Metacritic](http://www.metacritic.com/browse/games/release-date/available) data includes attributes for user reviews and critic reviews, as well as sales, ESRB ratings, and other information. Both user reviews and critic reviews are in the form of ratings scores, on a scale of 0 to 10 or 0 to 100. Although this is convenient, a significant issue with the data set is that it is relatively small. \n", 29 | "\n", 30 | "Dealing with a small data set such as this one is a common problem in machine learning. This problem often is compounded by imbalances between the classes in the small data set. In such situations, using an ensemble learner can be a good choice. This notebook will focus on using XGBoost, a popular ensemble learner, to build a classifier to determine whether a game will be a hit. \n", 31 | "\n", 32 | "This notebook contains two parts of the Video Game Sales module of the SageMaker workshop. More specifically, it covers exploratory data analysis and evaluation of the trained models; training and deployment of the models is covered in the workshop lab guide.\n", 33 | "\n", 34 | "## Setup\n", 35 | "\n", 36 | "Let's start by specifying the S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. The IAM role for S3 data access is pulled in from the SageMaker Notebook Instance. \n", 37 | "\n", 38 | "In the first line of code, replace the placeholder text with your bucket name, such as `smworkshop-john-smith`, with NO path. Be sure to keep the quotes around the name. After you have replaced the bucket name, go ahead and run the cell by clicking the 'Run cell' button in the toolbar above, or using Control + Enter from your keyboard. Do NOT 'Run All' cells because we will be leaving the notebook for training and returning later for evaluation." 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": { 45 | "isConfigCell": true 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "bucket = '' # replace with bucket name in quotes only, NO URL path\n", 50 | "prefix = 'sagemaker/videogames-xgboost'\n", 51 | " \n", 52 | "import sagemaker\n", 53 | "\n", 54 | "role = sagemaker.get_execution_role()" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "Next, we'll import the Python libraries we'll need." 
62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "import numpy as np \n", 71 | "import pandas as pd \n", 72 | "import matplotlib.pyplot as plt \n", 73 | "from IPython.display import Image \n", 74 | "from IPython.display import display \n", 75 | "from sklearn.datasets import dump_svmlight_file \n", 76 | "from time import gmtime, strftime \n", 77 | "import sys \n", 78 | "import math \n", 79 | "import json\n", 80 | "import boto3" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "---\n", 88 | "## Data\n", 89 | "\n", 90 | "Before proceeding further, let's download the data set to your notebook instance. It will then appear in the same directory as this notebook. Then we'll take an initial look at the data." 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "!wget -O raw_data.csv https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/master/data/raw_data.csv\n", 100 | "\n", 101 | "data = pd.read_csv('./raw_data.csv')\n", 102 | "pd.set_option('display.max_rows', 20) \n", 103 | "data" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "Before proceeding further, we need to decide upon a target to predict. Video game development budgets can run into the tens of millions of dollars, so it is critical for game publishers to publish \"hit\" games to recoup their costs and make a profit. As a proxy for what constitutes a \"hit\" game, we will set a target of greater than 1 million units in global sales." 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "data['y'] = (data['Global_Sales'] > 1)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "With our target now defined, let's take a look at the imbalance between the \"hit\" and \"not a hit\" classes:" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "plt.bar(['not a hit', 'hit'], data['y'].value_counts())\n", 136 | "plt.show()" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "Not surprisingly, only a small fraction of games can be considered \"hits\" under our metric. Next, we'll choose features that have predictive power for our target. We'll begin by plotting review scores versus global sales to check our hunch that such scores have an impact on sales. Logarithmic scale is used for clarity." 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "viz = data.filter(['User_Score','Critic_Score', 'Global_Sales'], axis=1)\n", 153 | "viz['User_Score'] = pd.Series(viz['User_Score'].apply(pd.to_numeric, errors='coerce'))\n", 154 | "viz['User_Score'] = viz['User_Score'].mask(np.isnan(viz[\"User_Score\"]), viz['Critic_Score'] / 10.0)\n", 155 | "viz.plot(kind='scatter', logx=True, logy=True, x='User_Score', y='Global_Sales')\n", 156 | "plt.axis([2, 10, 1, 100])\n", 157 | "plt.show()" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "Our intuition about the relationship between review scores and sales seems justified. 
We also note in passing that other relevant features can be extracted from the data set. For example, the ESRB rating has an impact since games with an \"E\" for everyone rating typically reach a wider audience than games with an age-restricted \"M\" for mature rating, though depending on another feature, the genre (such as shooter or action), M-rated games also can be huge hits. Our model hopefully will learn these relationships and others. \n", 165 | "\n", 166 | "Next, looking at the columns of features of this data set, we can identify several that should be excluded. For example, there are five columns that specify sales numbers: these numbers are directly related to the target we're trying to predict, so these columns should be dropped. Other features may be irrelevant, such as the name of the game." 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "data = data.drop(['Name', 'Year_of_Release', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales', 'Critic_Count', 'User_Count', 'Developer'], axis=1)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "With the number of columns reduced, now is a good time to check how many columns are missing data:" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "data.isnull().sum()" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "As noted in Kaggle's overview of this data set, many review ratings are missing. Unfortunately, since those are crucial features that we are relying on for our predictions, and there is no reliable way of imputing so many of them, we'll need to drop rows missing those features." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "data = data.dropna()" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "Now we need to resolve a problem we see in the User_Score column: it contains some 'tbd' string values, so it obviously is not numeric. User_Score is more properly a numeric rather than categorical feature, so we'll need to convert it from string type to numeric, and temporarily fill in NaNs for the tbds. Next, we must decide what to do with these new NaNs in the User_Score column. We've already thrown out a large number of rows, so if we can salvage these rows, we should. As a first approximation, we'll take the value in the Critic_Score column and divide by 10 since the user scores tend to track the critic scores (though on a scale of 0 to 10 instead of 0 to 100). " 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [ 223 | "data['User_Score'] = data['User_Score'].apply(pd.to_numeric, errors='coerce')\n", 224 | "data['User_Score'] = data['User_Score'].mask(np.isnan(data[\"User_Score\"]), data['Critic_Score'] / 10.0)" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "Let's do some final preprocessing of the data, including converting the categorical features into numeric using the one-hot encoding method." 
232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "if data['y'].dtype == bool:\n", 241 | " data['y'] = data['y'].apply(lambda y: 'yes' if y == True else 'no')\n", 242 | "model_data = pd.get_dummies(data)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "To help prevent overfitting the model, we'll randomly split the data into three groups. Specifically, the model will be trained on 70% of the data. It will then be evaluated on 20% of the data to give us an estimate of the accuracy we hope to have on \"new\" data. As a final testing dataset, the remaining 10% will be held out until the end." 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9 * len(model_data))]) " 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "Amazon SageMaker's built-in version of XGBoost supports input data in both CSV and libSVM data format. We'll use libSVM here, with features and the target variable provided as separate arguments. To avoid any misalignment issues due to random reordering, this split is done after the previous split in the above cell. As a last step before training, we'll copy the resulting files to S3 as input for Amazon SageMaker's hosted training." 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": {}, 272 | "outputs": [], 273 | "source": [ 274 | "dump_svmlight_file(X=train_data.drop(['y_no', 'y_yes'], axis=1), y=train_data['y_yes'], f='train.libsvm')\n", 275 | "dump_svmlight_file(X=validation_data.drop(['y_no', 'y_yes'], axis=1), y=validation_data['y_yes'], f='validation.libsvm')\n", 276 | "dump_svmlight_file(X=test_data.drop(['y_no', 'y_yes'], axis=1), y=test_data['y_yes'], f='test.libsvm')\n", 277 | "\n", 278 | "boto3.Session().resource('s3').Bucket(bucket).Object(prefix + '/train/train.libsvm').upload_file('train.libsvm')\n", 279 | "boto3.Session().resource('s3').Bucket(bucket).Object(prefix + '/validation/validation.libsvm').upload_file('validation.libsvm')" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "---\n", 287 | "## Training-Hosting\n", 288 | "\n", 289 | "Next, we'll set up training jobs and deploy the resulting models. For those steps, please return to the workshop lab guide after the previous code cell completes running, which may take a few minutes. You'll return to this notebook for the final evaluation steps. If you are running this from the Simplify Workflows workshop module from GitHub, the link back is https://github.com/awslabs/amazon-sagemaker-workshop/blob/master/modules/Video_Game_Sales_CLI_Console.md." 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "---\n", 297 | "\n", 298 | "## Evaluation\n", 299 | "\n", 300 | "Now that we have our hosted endpoint, we can generate predictions from it. More specifically, let's generate predictions from our test data set to understand how well our model generalizes to data it has not seen yet. First, however, we need to bring in the name of the endpoint we created in the previous section of the module. 
If you followed the naming convention in the lab guide, the name is already filled in for the code cell below. Otherwise, replace the name with the one you chose." 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "endpoint_name = 'videogames-xgboost'" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "There are many ways to compare the performance of a machine learning model. We'll start simply by comparing actual to predicted values of whether the game was a \"hit\" (`1`) or not (`0`). Then we'll produce a confusion matrix, which shows how many test data points were predicted by the model in each category versus how many test data points actually belonged in each category." 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": null, 322 | "metadata": {}, 323 | "outputs": [], 324 | "source": [ 325 | "runtime = boto3.client('runtime.sagemaker')" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": null, 331 | "metadata": {}, 332 | "outputs": [], 333 | "source": [ 334 | "def do_predict(data, endpoint_name, content_type):\n", 335 | " payload = '\\n'.join(data)\n", 336 | " response = runtime.invoke_endpoint(EndpointName=endpoint_name, \n", 337 | " ContentType=content_type, \n", 338 | " Body=payload)\n", 339 | " result = response['Body'].read()\n", 340 | " result = result.decode(\"utf-8\")\n", 341 | " result = result.split(',')\n", 342 | " preds = [float((num)) for num in result]\n", 343 | " preds = [round(num) for num in preds]\n", 344 | " return preds\n", 345 | "\n", 346 | "def batch_predict(data, batch_size, endpoint_name, content_type):\n", 347 | " items = len(data)\n", 348 | " arrs = []\n", 349 | " \n", 350 | " for offset in range(0, items, batch_size):\n", 351 | " if offset+batch_size < items:\n", 352 | " results = do_predict(data[offset:(offset+batch_size)], endpoint_name, content_type)\n", 353 | " arrs.extend(results)\n", 354 | " else:\n", 355 | " arrs.extend(do_predict(data[offset:items], endpoint_name, content_type))\n", 356 | " sys.stdout.write('.')\n", 357 | " return(arrs)" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": null, 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "%%time\n", 367 | "import json\n", 368 | "\n", 369 | "with open('test.libsvm', 'r') as f:\n", 370 | " payload = f.read().strip()\n", 371 | "\n", 372 | "labels = [int(line.split(' ')[0]) for line in payload.split('\\n')]\n", 373 | "test_data = [line for line in payload.split('\\n')]\n", 374 | "preds = batch_predict(test_data, 100, endpoint_name, 'text/x-libsvm')\n", 375 | "\n", 376 | "print ('\\nerror rate=%f' % ( sum(1 for i in range(len(preds)) if preds[i]!=labels[i]) /float(len(preds))))" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": null, 382 | "metadata": {}, 383 | "outputs": [], 384 | "source": [ 385 | "pd.crosstab(index=np.array(labels), columns=np.array(preds))" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": {}, 391 | "source": [ 392 | "Of the 132 games in the test set that actually are \"hits\" by our metric, the model correctly identified over 70, while the overall error rate is 13%. The amount of false negatives versus true positives can be shifted substantially in favor of true positives by increasing the hyperparameter scale_pos_weight. 
Of course, this increase comes at the expense of reduced accuracy/increased error rate and more false positives. How to make this trade-off ultimately is a business decision based on the relative costs of false positives, false negatives, etc.\n", 393 | "\n", 394 | "This concludes the notebook portion of this module. Please return to the workshop lab guide. If you are running this from the Simplify Workflows workshop module from GitHub, the link back is https://github.com/awslabs/amazon-sagemaker-workshop/blob/master/modules/Video_Game_Sales_CLI_Console.md." 395 | ] 396 | } 397 | ], 398 | "metadata": { 399 | "kernelspec": { 400 | "display_name": "conda_python3", 401 | "language": "python", 402 | "name": "conda_python3" 403 | }, 404 | "language_info": { 405 | "codemirror_mode": { 406 | "name": "ipython", 407 | "version": 3 408 | }, 409 | "file_extension": ".py", 410 | "mimetype": "text/x-python", 411 | "name": "python", 412 | "nbconvert_exporter": "python", 413 | "pygments_lexer": "ipython3", 414 | "version": "3.6.10" 415 | }, 416 | "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." 417 | }, 418 | "nbformat": 4, 419 | "nbformat_minor": 2 420 | } 421 | -------------------------------------------------------------------------------- /notebooks/videogame-sales.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Predicting Product Success When Review Data Is Available\n", 8 | "_**Using XGBoost to Predict Whether Sales will Exceed the \"Hit\" Threshold**_\n", 9 | "\n", 10 | "---\n", 11 | "\n", 12 | "---\n", 13 | "\n", 14 | "## Contents\n", 15 | "\n", 16 | "1. [Background](#Background)\n", 17 | "1. [Setup](#Setup)\n", 18 | "1. [Data](#Data)\n", 19 | "1. [Train](#Train)\n", 20 | "1. [Host](#Host)\n", 21 | "1. [Evaluation](#Evaluation)\n", 22 | "1. [Extensions](#Extensions)\n", 23 | "\n", 24 | "\n", 25 | "## Background\n", 26 | "\n", 27 | "Word of mouth in the form of user reviews, critic reviews, social media comments, etc. often can provide insights about whether a product ultimately will be a success. In the video game industry in particular, reviews and ratings can have a large impact on a game's success. However, not all games with bad reviews fail, and not all games with good reviews turn out to be hits. To predict hit games, machine learning algorithms potentially can take advantage of various relevant data attributes in addition to reviews. \n", 28 | "\n", 29 | "For this notebook, we will work with the dataset [Video Game Sales with Ratings](https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings) from Kaggle. This dataset includes data from [Metacritic](http://www.metacritic.com/browse/games/release-date/available) and other sources, with attributes for user reviews as well as critic reviews, sales, ESRB ratings, among others. Both user reviews and critic reviews are in the form of ratings scores, on a scale of 0 to 10 or 0 to 100. 
Although this is convenient, a significant issue with the dataset is that it is relatively small. \n", 30 | "\n", 31 | "Dealing with a small dataset such as this one is a common problem in machine learning. This problem often is compounded by imbalances between the classes in the small dataset. In such situations, using an ensemble learner can be a good choice. This notebook will focus on using XGBoost, a popular ensemble learner, to build a classifier to determine whether a game will be a hit. \n", 32 | "\n", 33 | "## Setup\n", 34 | "\n", 35 | "\n", 36 | "***Prerequisite: check the upper right hand corner of the UI to make sure your notebook kernel is either (1) for SageMaker Studio, Python 3 (Data Science), or (2) for a SageMaker Notebook Instance, conda_python3.*** \n", 37 | "\n", 38 | "Let's start by:\n", 39 | "\n", 40 | "- Importing various Python libraries we'll need.\n", 41 | "- Specifying a S3 bucket and bucket prefix to use for training and model data.\n", 42 | "- Defining an IAM role for access to Amazon S3, where we will store training data.\n", 43 | "\n", 44 | "Run the cell by clicking either (1) the play symbol that appears to the left of `In[]` when you hover over it, or (2) the 'Run cell' button in the toolbar above, or (3) using Control + Enter from your keyboard. " 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "isConfigCell": true 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "import numpy as np \n", 56 | "import pandas as pd \n", 57 | "import matplotlib.pyplot as plt \n", 58 | "from IPython.display import Image \n", 59 | "from IPython.display import display \n", 60 | "from sklearn.datasets import dump_svmlight_file \n", 61 | "from time import gmtime, strftime \n", 62 | "import sys \n", 63 | "import math \n", 64 | "import json\n", 65 | "import boto3\n", 66 | "import sagemaker\n", 67 | "\n", 68 | "bucket = sagemaker.Session().default_bucket()\n", 69 | "prefix = 'sagemaker/videogames-xgboost'\n", 70 | "role = sagemaker.get_execution_role()\n", 71 | "\n", 72 | "print('Bucket:\\n{}'.format(bucket))" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "---\n", 80 | "## Data\n", 81 | "\n", 82 | "Before proceeding further, let's download the dataset. It will appear in the same directory as this notebook. Then we'll take an initial look at the data." 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "!wget -O raw_data.csv https://raw.githubusercontent.com/awslabs/amazon-sagemaker-workshop/master/data/raw_data.csv\n", 92 | "\n", 93 | "data = pd.read_csv('./raw_data.csv')\n", 94 | "pd.set_option('display.max_rows', 20) \n", 95 | "data" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "Before proceeding further, we need to decide upon a target to predict. Video game development budgets can run into the tens of millions of dollars, so it is critical for game publishers to publish \"hit\" games to recoup their costs and make a profit. As a proxy for what constitutes a \"hit\" game, we will set a target of greater than 1 million units in global sales." 
103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "data['y'] = (data['Global_Sales'] > 1)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "With our target now defined, let's take a look at the imbalance between the \"hit\" and \"not a hit\" classes:" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "plt.bar(['not a hit', 'hit'], data['y'].value_counts())\n", 128 | "plt.show()" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "Not surprisingly, only a small fraction of games can be considered \"hits\" under our metric. Next, we'll choose features that have predictive power for our target. We'll begin by plotting review scores versus global sales to check our hunch that such scores have an impact on sales. Logarithmic scale is used for clarity." 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "viz = data.filter(['User_Score','Critic_Score', 'Global_Sales'], axis=1)\n", 145 | "viz['User_Score'] = pd.Series(viz['User_Score'].apply(pd.to_numeric, errors='coerce'))\n", 146 | "viz['User_Score'] = viz['User_Score'].mask(np.isnan(viz[\"User_Score\"]), viz['Critic_Score'] / 10.0)\n", 147 | "viz.plot(kind='scatter', logx=True, logy=True, x='User_Score', y='Global_Sales')\n", 148 | "plt.axis([2, 10, 1, 100])\n", 149 | "plt.show()" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "Our intuition about the relationship between review scores and sales seems justified. We also note in passing that other relevant features can be extracted from the dataset. For example, the ESRB rating has an impact since games with an \"E\" for everyone rating typically reach a wider audience than games with an age-restricted \"M\" for mature rating, though depending on another feature, the genre (such as shooter or action), M-rated games also can be huge hits. Our model hopefully will learn these relationships and others. \n", 157 | "\n", 158 | "Next, looking at the columns of features of this dataset, we can identify several that should be excluded. For example, there are five columns that specify sales numbers: these numbers are directly related to the target we're trying to predict, so these columns should be dropped. Other features may be irrelevant, such as the name of the game." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "data = data.drop(['Name', 'Year_of_Release', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales', 'Critic_Count', 'User_Count', 'Developer'], axis=1)" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "With the number of columns reduced, now is a good time to check how many columns are missing data:" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [ 183 | "data.isnull().sum()" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "As noted in Kaggle's overview of this dataset, many review ratings are missing. 
Unfortunately, since those are crucial features that we are relying on for our predictions, and there is no reliable way of imputing so many of them, we'll need to drop rows missing those features." 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "data = data.dropna()" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [ 206 | "Now we need to resolve a problem we see in the User_Score column: it contains some 'tbd' string values, so it obviously is not numeric. User_Score is more properly a numeric rather than categorical feature, so we'll need to convert it from string type to numeric, and temporarily fill in NaNs for the tbds. Next, we must decide what to do with these new NaNs in the User_Score column. We've already thrown out a large number of rows, so if we can salvage these rows, we should. As a first approximation, we'll take the value in the Critic_Score column and divide by 10 since the user scores tend to track the critic scores (though on a scale of 0 to 10 instead of 0 to 100). " 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "data['User_Score'] = data['User_Score'].apply(pd.to_numeric, errors='coerce')\n", 216 | "data['User_Score'] = data['User_Score'].mask(np.isnan(data[\"User_Score\"]), data['Critic_Score'] / 10.0)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "Let's do some final preprocessing of the data, including converting the categorical features into numeric using the one-hot encoding method." 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [ 232 | "if data['y'].dtype == bool:\n", 233 | " data['y'] = data['y'].apply(lambda y: 'yes' if y == True else 'no')\n", 234 | "model_data = pd.get_dummies(data)" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "To help prevent overfitting the model, we'll randomly split the data into three groups. Specifically, the model will be trained on 70% of the data. It will then be evaluated on 20% of the data to give us an estimate of the accuracy we hope to have on \"new\" data. As a final testing dataset, the remaining 10% will be held out until the end." 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9 * len(model_data))]) " 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "Amazon SageMaker's built-in version of XGBoost supports input data in both the CSV format and the more compact libSVM format. We'll use the libSVM format here, with features and the target variable provided as separate arguments. To avoid any misalignment issues due to random reordering, this split is done after the previous split in the above cell. As a last step before training, we'll copy the resulting files to S3 as input for Amazon SageMaker's hosted training." 
258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "dump_svmlight_file(X=train_data.drop(['y_no', 'y_yes'], axis=1), y=train_data['y_yes'], f='train.libsvm')\n", 267 | "dump_svmlight_file(X=validation_data.drop(['y_no', 'y_yes'], axis=1), y=validation_data['y_yes'], f='validation.libsvm')\n", 268 | "dump_svmlight_file(X=test_data.drop(['y_no', 'y_yes'], axis=1), y=test_data['y_yes'], f='test.libsvm')\n", 269 | "\n", 270 | "boto3.Session().resource('s3').Bucket(bucket).Object(prefix + '/train/train.libsvm').upload_file('train.libsvm')\n", 271 | "boto3.Session().resource('s3').Bucket(bucket).Object(prefix + '/validation/validation.libsvm').upload_file('validation.libsvm')\n", 272 | "\n", 273 | "s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='libsvm')\n", 274 | "s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='libsvm')" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "---\n", 282 | "## Train\n", 283 | "\n", 284 | "Our data is now ready to be used to train a XGBoost model. The XGBoost algorithm has many tunable hyperparameters. Some of these hyperparameters are listed below; initially we'll only use a few of them. \n", 285 | "\n", 286 | "- `max_depth`: Maximum depth of a tree. As a cautionary note, a value too small could underfit the data, while increasing it will make the model more complex and thus more likely to overfit the data (in other words, the classic bias-variance tradeoff).\n", 287 | "- `eta`: Step size shrinkage used in updates to prevent overfitting. \n", 288 | "- `eval_metric`: Evaluation metric(s) for validation data. For data sets such as this one with imbalanced classes, we'll use the AUC metric.\n", 289 | "- `scale_pos_weight`: Controls the balance of positive and negative weights, again useful for data sets having imbalanced classes.\n", 290 | "\n", 291 | "First we'll set up the parameters for an Amazon SageMaker Estimator object, and the hyperparameters for the algorithm itself. The Estimator object from the Amazon SageMaker Python SDK is a convenient way to set up training jobs with a minimal amount of code." 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "from sagemaker.amazon.amazon_estimator import get_image_uri\n", 301 | "\n", 302 | "container = sagemaker.image_uris.retrieve( 'xgboost', boto3.Session().region_name, 'latest')\n", 303 | "\n", 304 | "xgb = sagemaker.estimator.Estimator(container,\n", 305 | " role, \n", 306 | " base_job_name='videogames-xgboost',\n", 307 | " instance_count=1, \n", 308 | " instance_type='ml.c5.xlarge',\n", 309 | " output_path='s3://{}/{}/output'.format(bucket, prefix),\n", 310 | " sagemaker_session=sagemaker.Session())\n", 311 | "\n", 312 | "xgb.set_hyperparameters(max_depth=3,\n", 313 | " eta=0.1,\n", 314 | " subsample=0.5,\n", 315 | " eval_metric='auc',\n", 316 | " objective='binary:logistic',\n", 317 | " scale_pos_weight=2.0,\n", 318 | " num_round=100)\n" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "Next, we'll run the hosted training job itself. 
The hardware used for the training job is separate from your notebook instance and is managed by Amazon SageMaker, which performs the heavy lifting such as setting up a training cluster and tearing it down when the job is done. A single line of code starts the training job." 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": null, 331 | "metadata": {}, 332 | "outputs": [], 333 | "source": [ 334 | "xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "---\n", 342 | "## Host\n", 343 | "\n", 344 | "Now that we've trained the XGBoost algorithm on our data, we can deploy the trained model to an Amazon SageMaker hosted endpoint with one simple line of code." 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [ 353 | "xgb_predictor = xgb.deploy(initial_instance_count=1,\n", 354 | " instance_type='ml.m5.xlarge')" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "metadata": {}, 360 | "source": [ 361 | "---\n", 362 | "\n", 363 | "## Evaluation\n", 364 | "\n", 365 | "Now that we have our hosted endpoint, we can generate predictions from it. More specifically, let's generate predictions from our test data set to understand how well our model generalizes to data it has not seen yet.\n", 366 | "\n", 367 | "There are many ways to compare the performance of a machine learning model. We'll start simply by comparing actual to predicted values of whether the game was a \"hit\" (`1`) or not (`0`). Then we'll produce a confusion matrix, which shows how many test data points were predicted by the model in each category versus how many test data points actually belonged in each category." 
368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "metadata": {}, 374 | "outputs": [], 375 | "source": [ 376 | "from sagemaker.deserializers import BytesDeserializer\n", 377 | "from sagemaker.serializers import LibSVMSerializer\n", 378 | "\n", 379 | "## xgb_predictor.content_type = 'text/x-libsvm'\n", 380 | "xgb_predictor.deserializer = BytesDeserializer()\n", 381 | "xgb_predictor.serializer = LibSVMSerializer()\n", 382 | "\n", 383 | "def do_predict(data):\n", 384 | " payload = '\\n'.join(data)\n", 385 | " response = xgb_predictor.predict(payload).decode('utf-8')\n", 386 | " result = response.split(',')\n", 387 | " preds = [float((num)) for num in result]\n", 388 | " preds = [round(num) for num in preds]\n", 389 | " return preds\n", 390 | "\n", 391 | "def batch_predict(data, batch_size):\n", 392 | " items = len(data)\n", 393 | " arrs = []\n", 394 | " \n", 395 | " for offset in range(0, items, batch_size):\n", 396 | " if offset+batch_size < items:\n", 397 | " results = do_predict(data[offset:(offset+batch_size)])\n", 398 | " arrs.extend(results)\n", 399 | " else:\n", 400 | " arrs.extend(do_predict(data[offset:items]))\n", 401 | " sys.stdout.write('.')\n", 402 | " return(arrs)" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": null, 408 | "metadata": {}, 409 | "outputs": [], 410 | "source": [ 411 | "%%time\n", 412 | "import json\n", 413 | "\n", 414 | "with open('test.libsvm', 'r') as f:\n", 415 | " payload = f.read().strip()\n", 416 | "\n", 417 | "labels = [int(line.split(' ')[0]) for line in payload.split('\\n')]\n", 418 | "test_data = [line for line in payload.split('\\n')]\n", 419 | "preds = batch_predict(test_data, 100)\n", 420 | "\n", 421 | "print ('\\nerror rate=%f' % ( sum(1 for i in range(len(preds)) if preds[i]!=labels[i]) /float(len(preds))))" 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": null, 427 | "metadata": {}, 428 | "outputs": [], 429 | "source": [ 430 | "pd.crosstab(index=np.array(labels), columns=np.array(preds))" 431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "metadata": {}, 436 | "source": [ 437 | "Of the 132 games in the test set that actually are \"hits\" by our metric, the model correctly identified over 70, while the overall error rate is 13%. The amount of false negatives versus true positives can be shifted substantially in favor of true positives by increasing the hyperparameter scale_pos_weight. Of course, this increase comes at the expense of reduced accuracy/increased error rate and more false positives. How to make this trade-off ultimately is a business decision based on the relative costs of false positives, false negatives, etc." 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "---\n", 445 | "## Extensions\n", 446 | "\n", 447 | "This XGBoost model is just the starting point for predicting whether a game will be a hit based on reviews and other attributes. There are several possible avenues for improving the model's performance. First, of course, would be to collect more data and, if possible, fill in the existing missing fields with actual information. Another possibility is further hyperparameter tuning using Amazon SageMaker's Automatic Model Tuning feature. 
Examples of using this feature can be found in the [hyperparameter tuning directory of the SageMaker Examples GitHub repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/hyperparameter_tuning) and the **SageMaker Examples** tab of Amazon SageMaker notebook instances. And, although ensemble learners often do well with imbalanced data sets, it could be worth exploring techniques for mitigating imbalances such as downsampling, synthetic data augmentation, and other approaches. \n", 448 | "\n", 449 | "---\n", 450 | "## Cleanup\n", 451 | "\n", 452 | "If you are finished with this notebook, please run the cell below. This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on." 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": null, 458 | "metadata": {}, 459 | "outputs": [], 460 | "source": [ 461 | "sagemaker.Session().delete_endpoint(xgb_predictor.endpoint_name)" 462 | ] 463 | } 464 | ], 465 | "metadata": { 466 | "instance_type": "ml.t3.medium", 467 | "kernelspec": { 468 | "display_name": "conda_python3", 469 | "language": "python", 470 | "name": "conda_python3" 471 | }, 472 | "language_info": { 473 | "codemirror_mode": { 474 | "name": "ipython", 475 | "version": 3 476 | }, 477 | "file_extension": ".py", 478 | "mimetype": "text/x-python", 479 | "name": "python", 480 | "nbconvert_exporter": "python", 481 | "pygments_lexer": "ipython3", 482 | "version": "3.6.10" 483 | }, 484 | "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." 485 | }, 486 | "nbformat": 4, 487 | "nbformat_minor": 4 488 | } 489 | --------------------------------------------------------------------------------