├── .gitignore
├── README.md
├── aws-ai-ml-infos.md
├── git-integration.md
├── img
│   ├── .DS_Store
│   ├── ai4good.png
│   ├── git00.png
│   ├── git02.png
│   ├── git03.png
│   ├── git04.png
│   ├── git05.png
│   ├── git06.png
│   ├── git07.png
│   ├── git08.png
│   ├── git09.png
│   ├── git10.png
│   ├── git11.png
│   ├── new-terminal.png
│   ├── sagemaker01.png
│   ├── sagemaker02.png
│   ├── sagemaker03.png
│   ├── sagemaker04.png
│   ├── sagemaker05.png
│   ├── sagemaker06.png
│   ├── sagemaker07.png
│   ├── sagemaker08.png
│   ├── sagemaker09.png
│   ├── sagemaker10.png
│   ├── support_center01.png
│   ├── support_center02.png
│   ├── support_center03.png
│   └── support_center04.png
├── mxnet
│   ├── input.html
│   ├── mxnet_mnist.ipynb
│   └── src
│       ├── mnist_mxnet.py
│       └── requirements.txt
├── pytorch
│   ├── input.html
│   ├── pytorch_mnist.ipynb
│   └── src
│       ├── mnist_pytorch.py
│       └── requirements.txt
├── quota-increase.md
└── tensorflow
    ├── src
    │   ├── mnist-2.py
    │   ├── mnist.py
    │   ├── mnist_keras_tf2.py
    │   ├── requirements.txt
    │   └── setup.py
    ├── tensorflow_mnist.ipynb
    └── tensorflow_script_mode_training_and_serving.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | model/
2 | data/
3 | .ipynb_checkpoints/
4 | .DS_Store
5 | src/.DS_Store
6 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Deep Berlin AI For Good Hackathon 2020
2 | [](https://deep-berlin.ai/hackathon2020/)
3 |
4 |
5 | # Train Your Model with AWS
6 |
7 | ## Preparation
8 | Ideally, set up your AWS account 3-4 days before the hackathon and increase the SageMaker instance limits as described below. This is especially important if you want to use a larger number of GPU-based cloud instances to run your training jobs.
9 |
10 | ### Step 1: Create and Activate Your AWS Account
11 | * https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/
12 |
13 | ### Step 2: Create an S3 Bucket
14 | * How to [Create an S3 Bucket](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-bucket.html) (a scripted alternative is sketched below)
15 |
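If you prefer to script this step, the bucket can also be created programmatically. A minimal sketch using `boto3` (the bucket name and region below are placeholders -- pick your own):

```
import boto3

region = 'eu-central-1'                      # the region you will also use for SageMaker
bucket_name = 'my-ai4good-hackathon-bucket'  # S3 bucket names must be globally unique

s3 = boto3.client('s3', region_name=region)

# us-east-1 must not be passed as a LocationConstraint;
# every other region needs the CreateBucketConfiguration block.
if region == 'us-east-1':
    s3.create_bucket(Bucket=bucket_name)
else:
    s3.create_bucket(Bucket=bucket_name,
                     CreateBucketConfiguration={'LocationConstraint': region})
```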
16 | ### Step 3: Choose a Region
17 | * Review the [Regions](https://docs.aws.amazon.com/general/latest/gr/rande.html#sagemaker_region) supported by Amazon SageMaker.
18 |
19 | ### Step 4: Manage/Increase SageMaker Instance Limits (see detailed instructions [here](quota-increase.md))
20 | [](quota-increase.md)
21 | * Review the [Default Limits](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html#limits_sagemaker) for Amazon SageMaker Service Limits
22 | * Request a [Limit Increase](quota-increase.md) if needed via the [AWS Support Center](https://console.aws.amazon.com/support/home#/)
23 | * Review the [SageMaker Instance Pricing](https://aws.amazon.com/sagemaker/pricing/instance-types/) for an overview of supported instance types
24 |
25 | ## Let's get started!
26 |
27 | ### Step 5: Apply Your AWS Credits (If Applicable)
28 | * https://aws.amazon.com/awscredits/
29 |
30 | ### Step 6: Create a SageMaker Notebook Instance
31 | * Navigate to Amazon SageMaker (https://console.aws.amazon.com/sagemaker/) and make sure you're in the region of your choice (shown in the top-right menu of your screen)
32 |
33 | * Navigate to `Notebook Instances` in the left-side menu
34 | 
35 |
36 | * Click on `Create notebook instance`
37 | 
38 |
39 | * Choose a name for your notebook instance
40 | * Select an [Instance Type](https://aws.amazon.com/sagemaker/pricing/instance-types/)
41 | * You might want to increase the volume size of the locally attached disk (e.g., to 250 GB or 500 GB)
42 |
43 | 
44 |
45 | * Select `Create a New IAM Role`
46 | * Select `Specific S3 Bucket` and type in the name of the S3 bucket you created earlier
47 | * Click `Create role`
48 |
49 | 
50 |
51 | 
52 |
53 | * In `Git repositories` you can choose to clone a public Git repo to this notebook instance
54 | * Just provide the public Git repo URL
55 |
56 | 
57 |
58 | 
59 |
60 | * You can leave everything else at the default settings and hit `Create notebook instance`
61 |
62 | * Your notebook instance is now being created (this can take 2-3 minutes)
63 |
64 | 
65 |
66 | * Once the instance shows Status `InService` you can connect to it via `Open Jupyter` or `Open JupyterLab`.
67 | 
68 |
69 | * This opens up your notebook environment with the cloned Git repo ready.
70 | 
71 |
72 | Note: You can also connect a private GitHub or GitLab repo; just follow [these](git-integration.md) instructions.
73 |
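Alternatively, the notebook instance can be created programmatically with `boto3`. A minimal sketch (the instance name, role ARN, and repository URL are placeholders for the values you chose above):

```
import boto3

sm = boto3.client('sagemaker')

sm.create_notebook_instance(
    NotebookInstanceName='ai4good-notebook',
    InstanceType='ml.t3.medium',                                # see the instance pricing page
    RoleArn='arn:aws:iam::123456789012:role/MySageMakerRole',   # the IAM role created above
    VolumeSizeInGB=250,                                         # size of the attached volume
    DefaultCodeRepository='https://github.com/data-science-on-aws/ai4good-hackathon.git'
)
```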
74 | ### Step 7: Clone any other GitHub Repo
75 |
76 | If you want to clone another Git repo:
77 |
78 | * Open a New Terminal
79 | 
80 |
81 | * Clone the Git repo using the following command (we are using this repo as an example again):
82 |
83 | ```
84 | cd ~/SageMaker
85 | git clone https://github.com/data-science-on-aws/ai4good-hackathon.git
86 | ```
87 |
88 | ### Step 8: Train a Model Using Your SageMaker Notebook Instance
89 | * [**Sample TensorFlow Notebook**](tensorflow/) using Distributed TensorFlow and SageMaker.
90 | * [**Sample PyTorch Notebook**](pytorch/) using Distributed PyTorch and SageMaker.
91 | * [**Sample MXNet Notebook**](mxnet/) using Distributed MXNet and SageMaker.
92 | * To adapt a custom training script to SageMaker, please follow [these instructions](https://sagemaker.readthedocs.io/en/stable/using_tf.html#adapting-your-local-tensorflow-script) (a minimal sketch of the pattern is shown below).
93 |
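As a reference, the sample training scripts in this repo all follow the same pattern: hyperparameters arrive as command-line arguments, and the data/model paths come from SageMaker environment variables. A minimal sketch of that pattern (trimmed down from the sample scripts):

```
import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters are passed to the script as command-line arguments by SageMaker.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--learning-rate', type=float, default=0.1)

    # SageMaker tells the script where to read data from and where to save the model.
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

    args = parser.parse_args()

    # ... load training data from args.train, train for args.epochs epochs,
    # and save the final model artifacts to args.model_dir ...
```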
94 |
95 |
96 | ## More Resources
97 | * [Deep Berlin AI For Good Hackathon 2020](https://deep-berlin.ai/hackathon2020/)
98 | * [Official Hackathon Github Repo with Datasets](https://github.com/deepberlin1/aiforgood2020)
99 | * [More Amazon SageMaker Sample Notebooks](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk)
100 |
--------------------------------------------------------------------------------
/aws-ai-ml-infos.md:
--------------------------------------------------------------------------------
1 | # Deep Learning Stack
2 |
3 | The Amazon [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) is a higher-level Python interface to SageMaker and a convenient way to orchestrate training and inference. A minimal estimator sketch follows the list below.
4 |
5 | * [TensorFlow Estimators](https://sagemaker.readthedocs.io/en/stable/using_tf.html)
6 | * [Pytorch estimators](https://github.com/aws/sagemaker-python-sdk#pytorch-sagemaker-estimators)
7 | * [Bring your own docker containers](https://sagemaker.readthedocs.io/en/stable/overview.html#byo-docker-containers-with-sagemaker-estimators)
8 |
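For example, launching a training job with one of these estimators looks roughly like this. A minimal sketch, assuming a training script `train.py` in a local `src/` folder and an S3 input path of your own:

```
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()

estimator = TensorFlow(entry_point='train.py',        # your training script
                       source_dir='./src',
                       role=role,
                       train_instance_count=2,         # scale out by raising this
                       train_instance_type='ml.p3.2xlarge',
                       framework_version='1.15.2',
                       py_version='py3',
                       script_mode=True,
                       hyperparameters={'epochs': 10})

estimator.fit({'train': 's3://my-bucket/train'})       # S3 URI is a placeholder
```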
9 | The [Sagemaker Examples Repository](https://github.com/awslabs/amazon-sagemaker-examples) has a variety of examples that could be of interest:
10 | * [Hyperparameter Tuning with Tensor Flow](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/hyperparameter_tuning/tensorflow_mnist)
11 | * [Using Sagemaker Debugger with a Custom Pytorch Container](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/pytorch_custom_container/pytorch_byoc_smdebug.ipynb)
12 |
13 | More links:
14 | * [Hyperparameter Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex.html) – search for optimal hyperparameter settings in parallel (a minimal sketch follows this list)
15 | * [SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html) – detect training issues, e.g., vanishing gradients
16 | * [Experiments](https://github.com/aws/sagemaker-experiments) – track different training/tuning jobs
17 | * [Managed Spot Training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html) – save costs during training
18 |
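For example, hyperparameter tuning wraps an estimator like the one sketched above and searches a range of values across parallel training jobs. A minimal sketch (the metric name and regex are assumptions -- they must match what your training script actually logs):

```
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

tuner = HyperparameterTuner(estimator=estimator,              # e.g. the estimator sketched above
                            objective_metric_name='validation:accuracy',
                            objective_type='Maximize',
                            hyperparameter_ranges={
                                'learning-rate': ContinuousParameter(0.001, 0.1)
                            },
                            metric_definitions=[
                                {'Name': 'validation:accuracy',
                                 'Regex': 'Validation-accuracy=([0-9.]+)'}
                            ],
                            max_jobs=9,                       # total training jobs to run
                            max_parallel_jobs=3)              # jobs run concurrently

tuner.fit({'train': 's3://my-bucket/train'})                  # S3 URI is a placeholder
```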
19 | # Vision Stack
20 |
21 | ## Frameworks Layer:
22 | The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) is a higher-level Python interface to SageMaker and a convenient way to orchestrate training and inference. The [SageMaker Examples Repository](https://github.com/awslabs/amazon-sagemaker-examples) has a variety of examples that could be of interest:
23 | * [Hyperparameter Tuning with TensorFlow](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/hyperparameter_tuning/tensorflow_mnist)
24 | * [Using Sagemaker Debugger with a Custom Pytorch Container](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/pytorch_custom_container/pytorch_byoc_smdebug.ipynb)
25 |
26 | ## ML Services:
27 | SageMaker has built-in algorithms for [object detection](https://docs.aws.amazon.com/sagemaker/latest/dg/object-detection.html) and [semantic segmentation](https://docs.aws.amazon.com/sagemaker/latest/dg/semantic-segmentation.html). [SageMaker Ground Truth](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html) enables a combination of labeling/annotation workflows with active learning. Examples demonstrating all these features can be found in the [SageMaker examples repository](https://github.com/awslabs/amazon-sagemaker-examples).
28 |
29 | ## AI Services:
30 | [Rekognition](https://docs.aws.amazon.com/rekognition/latest/dg/what-is.html) supports a variety of image and video use cases, including object detection, face and celebrity recognition, and content moderation. For cases where you need domain-specific object labels (e.g., road signs, company logos), we offer [Rekognition Custom Labels](https://docs.aws.amazon.com/rekognition/latest/customlabels-dg/what-is.html), where you can upload training data to customize the detected objects.
31 | [Textract](https://docs.aws.amazon.com/textract/latest/dg/what-is.html) enables high-accuracy text extraction/OCR from documents and images (PDF, JPEG, etc.).
32 |
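For example, label detection on an image stored in S3 is a single Rekognition API call. A minimal sketch with `boto3` (bucket and key are placeholders):

```
import boto3

rekognition = boto3.client('rekognition')

response = rekognition.detect_labels(
    Image={'S3Object': {'Bucket': 'my-bucket', 'Name': 'photos/street-scene.jpg'}},
    MaxLabels=10,
    MinConfidence=80.0)

for label in response['Labels']:
    print(label['Name'], label['Confidence'])
```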
33 | # Partnerships:
34 | [AWS Data Exchange](https://aws.amazon.com/data-exchange/) provides startups a way to monetize the expensive datasets that need to be created for vision use cases.
35 |
--------------------------------------------------------------------------------
/git-integration.md:
--------------------------------------------------------------------------------
1 | # Amazon SageMaker <> Git Integration
2 |
3 | See also:
4 | * https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-git-repo.html
5 | * https://aws.amazon.com/blogs/machine-learning/git-integration-now-available-for-amazon-sagemaker-python-sdk/
6 | * https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-notebooks-now-support-git-integration-for-increased-persistence-collaboration-and-reproducibility/
7 |
8 | ## 1. Go to the Amazon SageMaker console, and click on Git repositories > Add repository
9 | 
10 |
11 | ## 2. Select GitHub/Other Git-based repo
12 | 
13 |
14 | ### Provide the following information:
15 | * Amazon SageMaker repository name >> any repo name
16 | * Git Repository URL >> URL to your repo
17 | * Git credentials >> Create secret
18 |
19 | 
20 |
21 | * For the password, generate a GitHub Personal Access Token with the relevant permissions:
22 |
23 | 
24 |
25 | * Add repository:
26 |
27 | 
28 |
29 | ## 3. Create Amazon SageMaker Jupyter instance & attach Git
30 |
31 | 
32 |
33 | * Create notebook instance
34 |
35 | 
36 |
37 | * Select previously created Git repo
38 |
39 | 
40 |
41 | * Create notebook instance
42 |
43 | 
44 |
45 | ## 4. Log in to the Jupyter instance
46 |
47 | 
48 | 
49 |
--------------------------------------------------------------------------------
/img/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/.DS_Store
--------------------------------------------------------------------------------
/img/ai4good.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/ai4good.png
--------------------------------------------------------------------------------
/img/git00.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/git00.png
--------------------------------------------------------------------------------
/img/git02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/git02.png
--------------------------------------------------------------------------------
/img/git03.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/git03.png
--------------------------------------------------------------------------------
/img/git04.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/git04.png
--------------------------------------------------------------------------------
/img/git05.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/git05.png
--------------------------------------------------------------------------------
/img/git06.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/git06.png
--------------------------------------------------------------------------------
/img/git07.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/git07.png
--------------------------------------------------------------------------------
/img/git08.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/git08.png
--------------------------------------------------------------------------------
/img/git09.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/git09.png
--------------------------------------------------------------------------------
/img/git10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/git10.png
--------------------------------------------------------------------------------
/img/git11.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/git11.png
--------------------------------------------------------------------------------
/img/new-terminal.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/new-terminal.png
--------------------------------------------------------------------------------
/img/sagemaker01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/sagemaker01.png
--------------------------------------------------------------------------------
/img/sagemaker02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/sagemaker02.png
--------------------------------------------------------------------------------
/img/sagemaker03.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/sagemaker03.png
--------------------------------------------------------------------------------
/img/sagemaker04.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/sagemaker04.png
--------------------------------------------------------------------------------
/img/sagemaker05.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/sagemaker05.png
--------------------------------------------------------------------------------
/img/sagemaker06.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/sagemaker06.png
--------------------------------------------------------------------------------
/img/sagemaker07.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/sagemaker07.png
--------------------------------------------------------------------------------
/img/sagemaker08.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/sagemaker08.png
--------------------------------------------------------------------------------
/img/sagemaker09.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/sagemaker09.png
--------------------------------------------------------------------------------
/img/sagemaker10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/sagemaker10.png
--------------------------------------------------------------------------------
/img/support_center01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/support_center01.png
--------------------------------------------------------------------------------
/img/support_center02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/support_center02.png
--------------------------------------------------------------------------------
/img/support_center03.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/support_center03.png
--------------------------------------------------------------------------------
/img/support_center04.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-science-on-aws/ai4good-hackathon/1c65d0ca4cb1f25d6f9ac1f132b240a39ae12179/img/support_center04.png
--------------------------------------------------------------------------------
/mxnet/input.html:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/mxnet/mxnet_mnist.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Training and hosting SageMaker Models using the Apache MXNet Module API\n",
8 | "\n",
9 | "The **SageMaker Python SDK** makes it easy to train and deploy MXNet models. In this example, we train a simple neural network using the Apache MXNet [Module API](https://mxnet.apache.org/api/python/module/module.html) and the MNIST dataset. The MNIST dataset is widely used for handwritten digit classification, and consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). The task at hand is to train a model using the 60,000 training images and subsequently test its classification accuracy on the 10,000 test images."
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "## Setup\n",
17 | "\n",
18 | "First we need to define a few variables that will be needed later in the example."
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "### Install Dependencies\n",
26 | "_Note: Ignore Warnings and Errors Below_"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": null,
32 | "metadata": {},
33 | "outputs": [],
34 | "source": [
35 | "!pip install sagemaker --upgrade --ignore-installed --no-cache --user"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "### Restart the Kernel to Recognize New Dependencies Above"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": null,
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "from IPython.display import display_html\n",
52 | "display_html(\"\", raw=True)"
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": null,
58 | "metadata": {},
59 | "outputs": [],
60 | "source": [
61 | "!pip list"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "## Create the SageMaker Session"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": null,
74 | "metadata": {
75 | "isConfigCell": true
76 | },
77 | "outputs": [],
78 | "source": [
79 | "import sagemaker\n",
80 | "from sagemaker import get_execution_role\n",
81 | "\n",
82 | "sagemaker_session = sagemaker.Session()"
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "## Setup the Service Execution Role and Region\n",
90 | "Get IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `sagemaker.get_execution_role()` with a the appropriate full IAM role arn string(s)."
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": null,
96 | "metadata": {},
97 | "outputs": [],
98 | "source": [
99 | "# S3 bucket for saving code and model artifacts.\n",
100 | "# Feel free to specify a different bucket here if you wish.\n",
101 | "bucket = sagemaker_session.default_bucket()\n",
102 | "\n",
103 | "# Location to save your custom code in tar.gz format.\n",
104 | "custom_code_upload_location = 's3://{}/customcode/mxnet'.format(bucket)\n",
105 | "\n",
106 | "# Location where results of model training are saved.\n",
107 | "model_output_path = 's3://{}/sagemaker/mxnet-mnist/training-runs'.format(bucket)"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "role = get_execution_role()\n",
117 | "print('RoleARN: {}\\n'.format(role))\n",
118 | "\n",
119 | "region = sagemaker_session.boto_session.region_name\n",
120 | "print('Region: {}'.format(region))"
121 | ]
122 | },
123 | {
124 | "cell_type": "markdown",
125 | "metadata": {},
126 | "source": [
127 | "## Training Data\n",
128 | "\n",
129 | "The MNIST dataset has been loaded to the public S3 buckets ``sagemaker-sample-data-`` under the prefix ``mxnet/mnist``."
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": null,
135 | "metadata": {
136 | "scrolled": false
137 | },
138 | "outputs": [],
139 | "source": [
140 | "import boto3\n",
141 | "\n",
142 | "train_data_location = 's3://sagemaker-sample-data-{}/mxnet/mnist/train'.format(region)\n",
143 | "test_data_location = 's3://sagemaker-sample-data-{}/mxnet/mnist/test'.format(region)"
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
150 | "### Copy the Training Data to Your Notebook Disk"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {},
157 | "outputs": [],
158 | "source": [
159 | "local_data_path = './data'"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": null,
165 | "metadata": {},
166 | "outputs": [],
167 | "source": [
168 | "!aws --region {region} s3 cp --recursive {train_data_location} {local_data_path}"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "metadata": {},
175 | "outputs": [],
176 | "source": [
177 | "!ls {local_data_path}"
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "## Train "
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "### The training script\n",
192 | "\n",
193 | "The ``mnist_mxnet.py`` script provides all the code we need for training and hosting a SageMaker model. The script also checkpoints the model at the end of every epoch and saves the model graph, params and optimizer state in the folder `/opt/ml/checkpoints`. If the folder path does not exist then it will skip checkpointing. The script we will use is adaptated from Apache MXNet [MNIST tutorial (https://mxnet.incubator.apache.org/tutorials/python/mnist.html).\n",
194 | "\n"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": null,
200 | "metadata": {},
201 | "outputs": [],
202 | "source": [
203 | "!cat ./src/mnist_mxnet.py"
204 | ]
205 | },
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {},
209 | "source": [
210 | "You can add custom Python modules to the `src/requirements.txt` file. They will automatically be installed - and made available to your training script."
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": null,
216 | "metadata": {},
217 | "outputs": [],
218 | "source": [
219 | "!cat ./src/requirements.txt"
220 | ]
221 | },
222 | {
223 | "cell_type": "markdown",
224 | "metadata": {},
225 | "source": [
226 | "### Train with SageMaker's `MXNet` Estimator"
227 | ]
228 | },
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {},
232 | "source": [
233 | "The SageMaker ```MXNet``` estimator allows us to run single machine or distributed training in SageMaker, using CPU or GPU-based instances.\n",
234 | "\n",
235 | "When we create the estimator, we pass in the filename of our training script, the name of our IAM execution role, and the S3 locations we defined in the setup section. We also provide a few other parameters. ``train_instance_count`` and ``train_instance_type`` determine the number and type of SageMaker instances that will be used for the training job. The ``hyperparameters`` parameter is a ``dict`` of values that will be passed to your training script -- you can see how to access these values in the ``mnist.py`` script above.\n",
236 | "\n",
237 | "For this example, we will choose one ``ml.m4.xlarge`` instance."
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "from sagemaker.mxnet import MXNet\n",
247 | "\n",
248 | "mnist_estimator = MXNet(entry_point='mnist_mxnet.py',\n",
249 | " source_dir='./src',\n",
250 | " role=role,\n",
251 | " output_path=model_output_path,\n",
252 | " code_location=custom_code_upload_location,\n",
253 | " train_instance_count=1,\n",
254 | " train_instance_type='ml.m4.xlarge',\n",
255 | " framework_version='1.4.1',\n",
256 | " py_version='py3',\n",
257 | " enable_sagemaker_metrics=True,\n",
258 | " distributions={'parameter_server': {'enabled': True}},\n",
259 | " hyperparameters={'learning-rate': 0.1},\n",
260 | " # Assuming the logline from the MXNet training job is as follows:\n",
261 | " # Epoch[9] Train-accuracy=0.976817\n",
262 | " # Epoch[9] Time cost=1.200\n",
263 | " # Epoch[9] Validation-accuracy=0.968800 \n",
264 | " metric_definitions=[\n",
265 | " {'Name':'test:accuracy', 'Regex':'Validation-accuracy=(.*?)'},\n",
266 | " ])"
267 | ]
268 | },
269 | {
270 | "cell_type": "markdown",
271 | "metadata": {},
272 | "source": [
273 | "### `fit` the Model (Approx. 15 mins)\n",
274 | "\n",
275 | "After we've constructed our MXNet object, we can fit it using data stored in S3. Below we run SageMaker training on two input channels: **train** and **test**.\n",
276 | "\n",
277 | "During training, SageMaker makes this data stored in S3 available in the local filesystem where the mnist script is running. The ```mnist.py``` script simply loads the train and test data from disk."
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": null,
283 | "metadata": {},
284 | "outputs": [],
285 | "source": [
286 | "mnist_estimator.fit({'train': train_data_location, 'test': test_data_location}, wait=False)\n",
287 | "\n",
288 | "training_job_name = mnist_estimator.latest_training_job.name\n",
289 | "print('training_job_name: {}'.format(training_job_name))"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": null,
295 | "metadata": {},
296 | "outputs": [],
297 | "source": [
298 | "from sagemaker.mxnet import MXNet\n",
299 | "\n",
300 | "mnist_estimator = MXNet.attach(training_job_name=training_job_name)"
301 | ]
302 | },
303 | {
304 | "cell_type": "markdown",
305 | "metadata": {},
306 | "source": [
307 | "## Option 1: Perform Batch Predictions Directly in the Notebook"
308 | ]
309 | },
310 | {
311 | "cell_type": "code",
312 | "execution_count": null,
313 | "metadata": {},
314 | "outputs": [],
315 | "source": [
316 | "print(model_output_path)"
317 | ]
318 | },
319 | {
320 | "cell_type": "code",
321 | "execution_count": null,
322 | "metadata": {},
323 | "outputs": [],
324 | "source": [
325 | "!aws --region {region} s3 ls --recursive s3://sagemaker-us-east-1-835319576252/sagemaker # {model_output_path}/{training_job_name}"
326 | ]
327 | },
328 | {
329 | "cell_type": "code",
330 | "execution_count": null,
331 | "metadata": {},
332 | "outputs": [],
333 | "source": [
334 | "!aws --region {region} s3 cp {model_output_path}/{training_job_name}/output/model.tar.gz ./model/model.tar.gz"
335 | ]
336 | },
337 | {
338 | "cell_type": "code",
339 | "execution_count": null,
340 | "metadata": {},
341 | "outputs": [],
342 | "source": [
343 | "!ls ./model"
344 | ]
345 | },
346 | {
347 | "cell_type": "code",
348 | "execution_count": null,
349 | "metadata": {},
350 | "outputs": [],
351 | "source": [
352 | "!tar -xzvf ./model/model.tar.gz -C ./model"
353 | ]
354 | },
355 | {
356 | "cell_type": "code",
357 | "execution_count": null,
358 | "metadata": {},
359 | "outputs": [],
360 | "source": [
361 | "# TODO: Perform Batch Predictions Directly in Notebook"
362 | ]
363 | },
364 | {
365 | "cell_type": "markdown",
366 | "metadata": {},
367 | "source": [
368 | "## Option 2: Create a SageMaker Endpoint and Perform REST-based Predictions\n",
369 | "\n",
370 | "After training, we use the ``MXNet estimator`` object to build and deploy an ``MXNetPredictor``. This creates a Sagemaker **Endpoint** -- a hosted prediction service that we can use to perform inference. \n",
371 | "\n",
372 | "The arguments to the ``deploy`` function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances, and then deploy the Endpoint to a fleet of CPU-based instances. Here we will deploy the model to a single ``ml.m4.xlarge`` instance."
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": null,
378 | "metadata": {},
379 | "outputs": [],
380 | "source": [
381 | "predictor = mnist_estimator.deploy(initial_instance_count=1,\n",
382 | " instance_type='ml.m4.xlarge')"
383 | ]
384 | },
385 | {
386 | "cell_type": "markdown",
387 | "metadata": {},
388 | "source": [
389 | "The request handling behavior of the Endpoint is determined by the ``mnist_mxnet.py`` script. In this case, the script doesn't include any request handling functions, so the Endpoint will use the default handlers provided by SageMaker. These default handlers allow us to perform inference on input data encoded as a multi-dimensional JSON array.\n",
390 | "\n",
391 | "### Making an Inference Request\n",
392 | "\n",
393 | "Now that our Endpoint is deployed and we have a ``predictor`` object, we can use it to classify handwritten digits.\n",
394 | "\n",
395 | "To see inference in action, draw a digit in the image box below. The pixel data from your drawing will be loaded into a ``data`` variable in this notebook. \n",
396 | "\n",
397 | "*Note: after drawing the image, you'll need to move to the next notebook cell.*"
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": null,
403 | "metadata": {},
404 | "outputs": [],
405 | "source": [
406 | "from IPython.display import HTML\n",
407 | "HTML(open(\"input.html\").read())"
408 | ]
409 | },
410 | {
411 | "cell_type": "markdown",
412 | "metadata": {},
413 | "source": [
414 | "Now we can use the ``predictor`` object to classify the handwritten digit:"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": null,
420 | "metadata": {
421 | "scrolled": true
422 | },
423 | "outputs": [],
424 | "source": [
425 | "response = predictor.predict(data)\n",
426 | "print('Raw prediction result:')\n",
427 | "response = response[0]\n",
428 | "print(response)\n",
429 | "\n",
430 | "labeled_predictions = list(zip(range(10), response))\n",
431 | "print('Labeled predictions: ')\n",
432 | "print(labeled_predictions)\n",
433 | "\n",
434 | "labeled_predictions.sort(key=lambda label_and_prob: 1.0 - label_and_prob[1])\n",
435 | "print('Most likely answer: {}'.format(labeled_predictions[0]))"
436 | ]
437 | },
438 | {
439 | "cell_type": "markdown",
440 | "metadata": {
441 | "collapsed": true
442 | },
443 | "source": [
444 | "### (Optional) Delete the Endpoint\n",
445 | "\n",
446 | "After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it."
447 | ]
448 | },
449 | {
450 | "cell_type": "code",
451 | "execution_count": null,
452 | "metadata": {},
453 | "outputs": [],
454 | "source": [
455 | "print(\"Endpoint name: \" + predictor.endpoint)"
456 | ]
457 | },
458 | {
459 | "cell_type": "code",
460 | "execution_count": null,
461 | "metadata": {},
462 | "outputs": [],
463 | "source": [
464 | "import sagemaker\n",
465 | "\n",
466 | "# predictor.delete_endpoint()"
467 | ]
468 | },
469 | {
470 | "cell_type": "code",
471 | "execution_count": null,
472 | "metadata": {},
473 | "outputs": [],
474 | "source": []
475 | }
476 | ],
477 | "metadata": {
478 | "kernelspec": {
479 | "display_name": "conda_python3",
480 | "language": "python",
481 | "name": "conda_python3"
482 | },
483 | "language_info": {
484 | "codemirror_mode": {
485 | "name": "ipython",
486 | "version": 3
487 | },
488 | "file_extension": ".py",
489 | "mimetype": "text/x-python",
490 | "name": "python",
491 | "nbconvert_exporter": "python",
492 | "pygments_lexer": "ipython3",
493 | "version": "3.6.5"
494 | },
495 | "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
496 | },
497 | "nbformat": 4,
498 | "nbformat_minor": 2
499 | }
500 |
--------------------------------------------------------------------------------
/mxnet/src/mnist_mxnet.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import gzip
3 | import json
4 | import logging
5 | import os
6 | import struct
7 |
8 | import mxnet as mx
9 | import numpy as np
10 |
11 |
12 | def load_data(path):
13 | with gzip.open(find_file(path, "labels.gz")) as flbl:
14 | struct.unpack(">II", flbl.read(8))
15 | labels = np.fromstring(flbl.read(), dtype=np.int8)
16 | with gzip.open(find_file(path, "images.gz")) as fimg:
17 | _, _, rows, cols = struct.unpack(">IIII", fimg.read(16))
18 | images = np.fromstring(fimg.read(), dtype=np.uint8).reshape(len(labels), rows, cols)
19 | images = images.reshape(images.shape[0], 1, 28, 28).astype(np.float32) / 255
20 | return labels, images
21 |
22 |
23 | def find_file(root_path, file_name):
24 | for root, dirs, files in os.walk(root_path):
25 | if file_name in files:
26 | return os.path.join(root, file_name)
27 |
28 |
29 | def build_graph():
30 | data = mx.sym.var('data')
31 | data = mx.sym.flatten(data=data)
32 | fc1 = mx.sym.FullyConnected(data=data, num_hidden=128)
33 | act1 = mx.sym.Activation(data=fc1, act_type="relu")
34 | fc2 = mx.sym.FullyConnected(data=act1, num_hidden=64)
35 | act2 = mx.sym.Activation(data=fc2, act_type="relu")
36 | fc3 = mx.sym.FullyConnected(data=act2, num_hidden=10)
37 | return mx.sym.SoftmaxOutput(data=fc3, name='softmax')
38 |
39 |
40 | def get_training_context(num_gpus):
41 | if num_gpus:
42 | return [mx.gpu(i) for i in range(num_gpus)]
43 | else:
44 | return mx.cpu()
45 |
46 |
47 | def train(batch_size, epochs, learning_rate, num_gpus, training_channel, testing_channel,
48 | hosts, current_host, model_dir):
49 | (train_labels, train_images) = load_data(training_channel)
50 | (test_labels, test_images) = load_data(testing_channel)
51 | CHECKPOINTS_DIR = '/opt/ml/checkpoints'
52 | checkpoints_enabled = os.path.exists(CHECKPOINTS_DIR)
53 |
54 | # Data parallel training - shard the data so each host
55 | # only trains on a subset of the total data.
56 | shard_size = len(train_images) // len(hosts)
57 | for i, host in enumerate(hosts):
58 | if host == current_host:
59 | start = shard_size * i
60 | end = start + shard_size
61 | break
62 |
63 | train_iter = mx.io.NDArrayIter(train_images[start:end], train_labels[start:end], batch_size,
64 | shuffle=True)
65 | val_iter = mx.io.NDArrayIter(test_images, test_labels, batch_size)
66 |
67 | logging.getLogger().setLevel(logging.DEBUG)
68 |
69 | kvstore = 'local' if len(hosts) == 1 else 'dist_sync'
70 |
71 | mlp_model = mx.mod.Module(symbol=build_graph(),
72 | context=get_training_context(num_gpus))
73 |
74 | checkpoint_callback = None
75 | if checkpoints_enabled:
76 | # Create a checkpoint callback that checkpoints the model params and the optimizer state after every epoch at the given path.
77 | checkpoint_callback = mx.callback.module_checkpoint(mlp_model,
78 | CHECKPOINTS_DIR + "/mnist",
79 | period=1,
80 | save_optimizer_states=True)
81 | mlp_model.fit(train_iter,
82 | eval_data=val_iter,
83 | kvstore=kvstore,
84 | optimizer='sgd',
85 | optimizer_params={'learning_rate': learning_rate},
86 | eval_metric='acc',
87 | epoch_end_callback = checkpoint_callback,
88 | batch_end_callback=mx.callback.Speedometer(batch_size, 100),
89 | num_epoch=epochs)
90 |
91 | if current_host == hosts[0]:
92 | save(model_dir, mlp_model)
93 |
94 |
95 | def save(model_dir, model):
96 | model.symbol.save(os.path.join(model_dir, 'model-symbol.json'))
97 | model.save_params(os.path.join(model_dir, 'model-0000.params'))
98 |
99 | signature = [{'name': data_desc.name, 'shape': [dim for dim in data_desc.shape]}
100 | for data_desc in model.data_shapes]
101 | with open(os.path.join(model_dir, 'model-shapes.json'), 'w') as f:
102 | json.dump(signature, f)
103 |
104 |
105 | def parse_args():
106 | parser = argparse.ArgumentParser()
107 |
108 | parser.add_argument('--batch-size', type=int, default=100)
109 | parser.add_argument('--epochs', type=int, default=10)
110 | parser.add_argument('--learning-rate', type=float, default=0.1)
111 |
112 | parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
113 | parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
114 | parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST'])
115 |
116 | parser.add_argument('--current-host', type=str, default=os.environ['SM_CURRENT_HOST'])
117 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ['SM_HOSTS']))
118 |
119 | return parser.parse_args()
120 |
121 | if __name__ == '__main__':
122 | args = parse_args()
123 | num_gpus = int(os.environ['SM_NUM_GPUS'])
124 |
125 | train(args.batch_size, args.epochs, args.learning_rate, num_gpus, args.train, args.test,
126 | args.hosts, args.current_host, args.model_dir)
127 |
--------------------------------------------------------------------------------
/mxnet/src/requirements.txt:
--------------------------------------------------------------------------------
1 | # Python dependencies go here
--------------------------------------------------------------------------------
/pytorch/input.html:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/pytorch/pytorch_mnist.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "collapsed": true
7 | },
8 | "source": [
9 | "# PyTorch Training and Serving in SageMaker \"Script Mode\"\n",
10 | "\n",
11 | "Script mode is a training script format for PyTorch that lets you execute any PyTorch training script in SageMaker with minimal modification. The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native PyTorch support sets up training-related environment variables and executes your training script. In this tutorial, we use the SageMaker Python SDK to launch a training job and deploy the trained model.\n",
12 | "\n",
13 | "Script mode supports training with a Python script, a Python module, or a shell script. In this example, we use a Python script to train a classification model on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). In this example, we will show how easily you can train a SageMaker using PyTorch scripts with SageMaker Python SDK. In addition, this notebook demonstrates how to perform real time inference with the [SageMaker PyTorch Serving container](https://github.com/aws/sagemaker-pytorch-serving-container). The PyTorch Serving container is the default inference method for script mode. For full documentation on deploying PyTorch models, please visit [here](https://github.com/aws/sagemaker-python-sdk/blob/master/doc/using_pytorch.rst#deploy-pytorch-models)."
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "## Contents\n",
21 | "\n",
22 | "1. [Background](#Background)\n",
23 | "1. [Setup](#Setup)\n",
24 | "1. [Data](#Data)\n",
25 | "1. [Train](#Train)\n",
26 | "1. [Host](#Host)\n",
27 | "\n",
28 | "---\n",
29 | "\n",
30 | "## Background\n",
31 | "\n",
32 | "MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial will show how to train and test an MNIST model on SageMaker using PyTorch.\n",
33 | "\n",
34 | "For more information about the PyTorch in SageMaker, please visit [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers) and [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) github repositories.\n",
35 | "\n",
36 | "---\n",
37 | "\n",
38 | "## Setup\n",
39 | "\n",
40 | "_This notebook was created and tested on an ml.m4.xlarge notebook instance._"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "### Install SageMaker Python SDK"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "metadata": {
54 | "scrolled": true
55 | },
56 | "outputs": [],
57 | "source": [
58 | "!pip install sagemaker --upgrade --ignore-installed --no-cache --user"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "metadata": {
65 | "scrolled": true
66 | },
67 | "outputs": [],
68 | "source": [
69 | "!pip install torch==1.3.1 torchvision==0.4.2 --upgrade --ignore-installed --no-cache --user"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "Forcing `pillow==6.2.1` due to https://discuss.pytorch.org/t/cannot-import-name-pillow-version-from-pil/66096"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": null,
82 | "metadata": {
83 | "scrolled": true
84 | },
85 | "outputs": [],
86 | "source": [
87 | "!pip uninstall -y pillow"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "!pip install pillow==6.2.1 --upgrade --ignore-installed --no-cache --user"
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 | "### Restart the Kernel to Recognize New Dependencies Above"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": null,
109 | "metadata": {},
110 | "outputs": [],
111 | "source": [
112 | "from IPython.display import display_html\n",
113 | "display_html(\"\", raw=True)"
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": null,
119 | "metadata": {
120 | "scrolled": false
121 | },
122 | "outputs": [],
123 | "source": [
124 | "!pip3 list"
125 | ]
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "## Create the SageMaker Session"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": null,
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "import os\n",
141 | "import sagemaker\n",
142 | "from sagemaker import get_execution_role\n",
143 | "\n",
144 | "sagemaker_session = sagemaker.Session()"
145 | ]
146 | },
147 | {
148 | "cell_type": "markdown",
149 | "metadata": {},
150 | "source": [
151 | "## Setup the Service Execution Role and Region\n",
152 | "Get IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `sagemaker.get_execution_role()` with a the appropriate full IAM role arn string(s)."
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": null,
158 | "metadata": {},
159 | "outputs": [],
160 | "source": [
161 | "role = get_execution_role()\n",
162 | "print('RoleARN: {}\\n'.format(role))\n",
163 | "\n",
164 | "region = sagemaker_session.boto_session.region_name\n",
165 | "print('Region: {}'.format(region))"
166 | ]
167 | },
168 | {
169 | "cell_type": "markdown",
170 | "metadata": {},
171 | "source": [
172 | "## Training Data"
173 | ]
174 | },
175 | {
176 | "cell_type": "markdown",
177 | "metadata": {},
178 | "source": [
179 | "### Copy the Training Data to Your Notebook Disk"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": null,
185 | "metadata": {},
186 | "outputs": [],
187 | "source": [
188 | "local_data_path = './data'"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": null,
194 | "metadata": {},
195 | "outputs": [],
196 | "source": [
197 | "from torchvision import datasets, transforms\n",
198 | "\n",
199 | "normalization_mean = 0.1307\n",
200 | "normalization_std = 0.3081\n",
201 | "\n",
202 | "# download the dataset\n",
203 | "# this will not only download data to ./mnist folder, but also load and transform (normalize) them\n",
204 | "datasets.MNIST(local_data_path, download=True, transform=transforms.Compose([\n",
205 | " transforms.ToTensor(),\n",
206 | " transforms.Normalize((normalization_mean,), (normalization_std,))\n",
207 | "]))"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": null,
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "!ls -R {local_data_path}"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "### Upload the Data to S3 for Distributed Training Across Many Workers\n",
224 | "We are going to use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value inputs identifies the location -- we will use later when we start the training job.\n",
225 | "\n",
226 | "This is S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting."
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": null,
232 | "metadata": {},
233 | "outputs": [],
234 | "source": [
235 | "bucket = sagemaker_session.default_bucket()\n",
236 | "data_prefix = 'sagemaker/pytorch-mnist/data'"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": null,
242 | "metadata": {},
243 | "outputs": [],
244 | "source": [
245 | "training_data_uri = sagemaker_session.upload_data(path=local_data_path, bucket=bucket, key_prefix=data_prefix)\n",
246 | "print('Input spec (S3 path): {}'.format(training_data_uri))"
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": null,
252 | "metadata": {},
253 | "outputs": [],
254 | "source": [
255 | "!aws s3 ls --recursive {training_data_uri}"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "## Train\n",
263 | "### Training Script\n",
264 | "The `pytorch_mnist.py` script provides all the code we need for training and hosting a SageMaker model (`model_fn` function to load a model).\n",
265 | "The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:\n",
266 | "\n",
267 | "* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to.\n",
268 | " These artifacts are uploaded to S3 for model hosting.\n",
269 | "* `SM_NUM_GPUS`: The number of gpus available in the current container.\n",
270 | "* `SM_CURRENT_HOST`: The name of the current container on the container network.\n",
271 | "* `SM_HOSTS`: JSON encoded list containing all the hosts .\n",
272 | "\n",
273 | "Supposing one input channel, 'training', was used in the call to the PyTorch estimator's `fit()` method, the following will be set, following the format `SM_CHANNEL_[channel_name]`:\n",
274 | "\n",
275 | "* `SM_CHANNEL_TRAINING`: A string representing the path to the directory containing data in the 'training' channel.\n",
276 | "\n",
277 | "For more information about training environment variables, please visit [SageMaker Containers](https://github.com/aws/sagemaker-containers).\n",
278 | "\n",
279 | "A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance.\n",
280 | "\n",
281 | "Because the SageMaker imports the training script, you should put your training code in a main guard (``if __name__=='__main__':``) if you are using the same script to host your model as we do in this example, so that SageMaker does not inadvertently run your training code at the wrong point in execution.\n",
282 | "\n",
283 | "For example, the script run by this notebook:"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": null,
289 | "metadata": {
290 | "scrolled": false
291 | },
292 | "outputs": [],
293 | "source": [
294 | "!ls ./src/mnist_pytorch.py"
295 | ]
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {},
300 | "source": [
301 | "You can add custom Python modules to the `src/requirements.txt` file. They will automatically be installed - and made available to your training script."
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": null,
307 | "metadata": {},
308 | "outputs": [],
309 | "source": [
310 | "!cat ./src/requirements.txt"
311 | ]
312 | },
313 | {
314 | "cell_type": "markdown",
315 | "metadata": {},
316 | "source": [
317 | "### Train with SageMaker `PyTorch` Estimator\n",
318 | "\n",
319 | "The `PyTorch` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, the training instance type, and hyperparameters. In this case we are going to run our training job on two(2) `ml.p3.2xlarge` instances. Alternatively, you can specify `ml.c4.xlarge` instances. This example can be ran on one or multiple, cpu or gpu instances ([full list of available instances](https://aws.amazon.com/sagemaker/pricing/instance-types/)). The hyperparameters parameter is a dict of values that will be passed to your training script -- you can see how to access these values in the `mnist.py` script above."
320 | ]
321 | },
322 | {
323 | "cell_type": "markdown",
324 | "metadata": {},
325 | "source": [
326 | "After we've constructed our `PyTorch` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem of each worker, so our training script can simply read the data from disk."
327 | ]
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "### `fit` the Model (Approx. 15 mins)\n",
334 | "\n",
335 | "To start a training job, we call `estimator.fit(training_data_uri)`."
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": null,
341 | "metadata": {
342 | "scrolled": false
343 | },
344 | "outputs": [],
345 | "source": [
346 | "from sagemaker.pytorch import PyTorch\n",
347 | "import time\n",
348 | "\n",
349 | "model_output_path = 's3://{}/sagemaker/pytorch-mnist/training-runs'.format(bucket)\n",
350 | "\n",
351 | "mnist_estimator = PyTorch(\n",
352 | " entry_point='mnist_pytorch.py',\n",
353 | " source_dir='./src',\n",
354 | " output_path=model_output_path,\n",
355 | " role=role,\n",
356 | " framework_version='1.3.1',\n",
357 | " train_instance_count=1,\n",
358 | " train_instance_type='ml.c5.2xlarge',\n",
359 | " enable_sagemaker_metrics=True,\n",
360 | " hyperparameters={\n",
361 | " 'epochs': 5,\n",
362 | " 'backend': 'gloo'\n",
363 | " },\n",
364 | " # Assuming the logline from the PyTorch training job is as follows:\n",
365 | " # Test set: Average loss: 0.3230, Accuracy: 9103/10000 (91%)\n",
366 | " metric_definitions=[\n",
367 | " {'Name':'test:loss', 'Regex':'Test set: Average loss: (.*?),'},\n",
368 | " {'Name':'test:accuracy', 'Regex':'(.*?)%;'}\n",
369 | " ]\n",
370 | ")\n",
371 | "\n",
372 | "mnist_estimator.fit(inputs={'training': training_data_uri},\n",
373 | " wait=False)\n",
374 | "\n",
375 | "training_job_name = mnist_estimator.latest_training_job.name\n",
376 | "\n",
377 | "print('training_job_name: {}'.format(training_job_name))"
378 | ]
379 | },
380 | {
381 | "cell_type": "markdown",
382 | "metadata": {},
383 | "source": [
384 | "Attach to a training job to monitor the logs.\n",
385 | "\n",
386 | "_Note: Each instance in the training job (2 in this example) will appear as a different color in the logs. 1 color per instance._"
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": null,
392 | "metadata": {
393 | "scrolled": false
394 | },
395 | "outputs": [],
396 | "source": [
397 | "mnist_estimator = PyTorch.attach(training_job_name=training_job_name)"
398 | ]
399 | },
400 | {
401 | "cell_type": "markdown",
402 | "metadata": {},
403 | "source": [
404 | "## Option 1: Perform Batch Predictions Directly in the Notebook"
405 | ]
406 | },
407 | {
408 | "cell_type": "markdown",
409 | "metadata": {},
410 | "source": [
411 | "Use PyTorch Core to load the model from `model_output_path`"
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "execution_count": null,
417 | "metadata": {},
418 | "outputs": [],
419 | "source": [
420 | "!aws --region {region} s3 ls --recursive {model_output_path}/{training_job_name}/output/"
421 | ]
422 | },
423 | {
424 | "cell_type": "code",
425 | "execution_count": null,
426 | "metadata": {},
427 | "outputs": [],
428 | "source": [
429 | "!aws --region {region} s3 cp {model_output_path}/{training_job_name}/output/model.tar.gz ./model/model.tar.gz"
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": null,
435 | "metadata": {},
436 | "outputs": [],
437 | "source": [
438 | "!ls ./model"
439 | ]
440 | },
441 | {
442 | "cell_type": "code",
443 | "execution_count": null,
444 | "metadata": {},
445 | "outputs": [],
446 | "source": [
447 | "!tar -xzvf ./model/model.tar.gz -C ./model"
448 | ]
449 | },
450 | {
451 | "cell_type": "code",
452 | "execution_count": null,
453 | "metadata": {},
454 | "outputs": [],
455 | "source": [
456 | "# Based on https://github.com/pytorch/examples/blob/master/mnist/main.py\n",
457 | "import torch.nn as nn\n",
458 | "import torch.nn.functional as F\n",
459 | "\n",
460 | "class Net(nn.Module):\n",
461 | " def __init__(self):\n",
462 | " super(Net, self).__init__()\n",
463 | " self.conv1 = nn.Conv2d(1, 10, kernel_size=5)\n",
464 | " self.conv2 = nn.Conv2d(10, 20, kernel_size=5)\n",
465 | " self.conv2_drop = nn.Dropout2d()\n",
466 | " self.fc1 = nn.Linear(320, 50)\n",
467 | " self.fc2 = nn.Linear(50, 10)\n",
468 | "\n",
469 | " def forward(self, x):\n",
470 | " x = F.relu(F.max_pool2d(self.conv1(x), 2))\n",
471 | " x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))\n",
472 | " x = x.view(-1, 320)\n",
473 | " x = F.relu(self.fc1(x))\n",
474 | " x = F.dropout(x, training=self.training)\n",
475 | " x = self.fc2(x)\n",
476 | " return F.log_softmax(x, dim=1)"
477 | ]
478 | },
479 | {
480 | "cell_type": "code",
481 | "execution_count": null,
482 | "metadata": {},
483 | "outputs": [],
484 | "source": [
485 | "import torch\n",
486 | "\n",
487 | "loaded_model = Net().to('cpu')\n",
488 | "# single-machine multi-gpu case or single-machine or multi-machine cpu case\n",
489 | "loaded_model = torch.nn.DataParallel(loaded_model)\n",
490 | "print(loaded_model)"
491 | ]
492 | },
493 | {
494 | "cell_type": "code",
495 | "execution_count": null,
496 | "metadata": {},
497 | "outputs": [],
498 | "source": [
499 | "loaded_model.load_state_dict(torch.load('./model/model.pth', map_location='cpu'))"
500 | ]
501 | },
502 | {
503 | "cell_type": "code",
504 | "execution_count": null,
505 | "metadata": {},
506 | "outputs": [],
507 | "source": [
508 | "test_loader = torch.utils.data.DataLoader(\n",
509 | " datasets.MNIST('./data', train=False,\n",
510 | " transform=transforms.Compose([\n",
511 | " transforms.ToTensor(),\n",
512 | " transforms.Normalize((0.1307,), (0.3081,))\n",
513 | " ])),\n",
514 | " batch_size=256, \n",
515 | " shuffle=True\n",
516 | ")\n",
517 | "\n",
518 | "single_loaded_img = test_loader.dataset.data[0]\n",
519 | "single_loaded_img = single_loaded_img.to('cpu')\n",
520 | "single_loaded_img = single_loaded_img[None, None]\n",
521 | "single_loaded_img = single_loaded_img.type('torch.FloatTensor') # instead of DoubleTensor\n",
522 | "\n",
523 | "print(single_loaded_img.numpy())"
524 | ]
525 | },
526 | {
527 | "cell_type": "code",
528 | "execution_count": null,
529 | "metadata": {},
530 | "outputs": [],
531 | "source": [
532 | "from matplotlib import pyplot as plt\n",
533 | "\n",
534 | "plt.imshow(single_loaded_img.numpy().reshape(28, 28), cmap='Greys')"
535 | ]
536 | },
537 | {
538 | "cell_type": "code",
539 | "execution_count": null,
540 | "metadata": {},
541 | "outputs": [],
542 | "source": [
543 | "result = loaded_model(single_loaded_img)\n",
544 | "prediction = result.max(1, keepdim=True)[1][0][0].numpy()\n",
545 | "print(prediction)"
546 | ]
547 | },
548 | {
549 | "cell_type": "markdown",
550 | "metadata": {},
551 | "source": [
552 | "## Option 2: Create a SageMaker Endpoint and Perform REST-based Predictions"
553 | ]
554 | },
555 | {
556 | "cell_type": "markdown",
557 | "metadata": {},
558 | "source": [
559 | "### Deploy the Trained Model to a SageMaker Endpoint (Approx. 10 mins)\n",
560 | "\n",
561 | "After training, we use the `PyTorch` estimator object to build and deploy a `PyTorchPredictor`. This creates a Sagemaker Endpoint -- a hosted prediction service that we can use to perform inference.\n",
562 | "\n",
563 | "As mentioned above we have implementation of `model_fn` in the `pytorch_mnist.py` script that is required. We are going to use default implementations of `input_fn`, `predict_fn`, `output_fn` and `transform_fm` defined in [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers).\n",
564 | "\n",
565 | "The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances, and then deploy the Endpoint to a fleet of CPU-based instances, but you need to make sure that you return or save your model as a cpu model similar to what we did in `mnist.py`."
566 | ]
567 | },
568 | {
569 | "cell_type": "code",
570 | "execution_count": null,
571 | "metadata": {},
572 | "outputs": [],
573 | "source": [
574 | "predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.c5.2xlarge')"
575 | ]
576 | },
577 | {
578 | "cell_type": "markdown",
579 | "metadata": {},
580 | "source": [
581 | "### Invoke the Endpoint\n",
582 | "\n",
583 | "We can now use this predictor to classify hand-written digits. Drawing into the image box loads the pixel data into a `data` variable in this notebook, which we can then pass to the `predictor`."
584 | ]
585 | },
586 | {
587 | "cell_type": "code",
588 | "execution_count": null,
589 | "metadata": {
590 | "scrolled": true
591 | },
592 | "outputs": [],
593 | "source": [
594 | "from IPython.display import HTML\n",
595 | "HTML(open(\"input.html\").read())"
596 | ]
597 | },
598 | {
599 | "cell_type": "markdown",
600 | "metadata": {},
601 | "source": [
602 | "The value of `data` is retrieved from the HTML above."
603 | ]
604 | },
605 | {
606 | "cell_type": "code",
607 | "execution_count": null,
608 | "metadata": {},
609 | "outputs": [],
610 | "source": [
611 | "print(data)"
612 | ]
613 | },
614 | {
615 | "cell_type": "code",
616 | "execution_count": null,
617 | "metadata": {},
618 | "outputs": [],
619 | "source": [
620 | "import numpy as np\n",
621 | "\n",
622 | "image = np.array([data], dtype=np.float32)\n",
623 | "response = predictor.predict(image)\n",
624 | "prediction = response.argmax(axis=1)[0]\n",
625 | "print(prediction)"
626 | ]
627 | },
628 | {
629 | "cell_type": "markdown",
630 | "metadata": {},
631 | "source": [
632 | "### (Optional) Cleanup Endpoint\n",
633 | "\n",
634 | "After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it"
635 | ]
636 | },
637 | {
638 | "cell_type": "code",
639 | "execution_count": null,
640 | "metadata": {},
641 | "outputs": [],
642 | "source": [
643 | "# sagemaker.Session().delete_endpoint(predictor.endpoint)"
644 | ]
645 | },
646 | {
647 | "cell_type": "code",
648 | "execution_count": null,
649 | "metadata": {},
650 | "outputs": [],
651 | "source": []
652 | }
653 | ],
654 | "metadata": {
655 | "kernelspec": {
656 | "display_name": "conda_python3",
657 | "language": "python",
658 | "name": "conda_python3"
659 | },
660 | "language_info": {
661 | "codemirror_mode": {
662 | "name": "ipython",
663 | "version": 3
664 | },
665 | "file_extension": ".py",
666 | "mimetype": "text/x-python",
667 | "name": "python",
668 | "nbconvert_exporter": "python",
669 | "pygments_lexer": "ipython3",
670 | "version": "3.6.5"
671 | },
672 | "notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
673 | },
674 | "nbformat": 4,
675 | "nbformat_minor": 1
676 | }
677 |
--------------------------------------------------------------------------------
/pytorch/src/mnist_pytorch.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import json
3 | import logging
4 | import os
5 | import sys
6 | import torch
7 | import torch.distributed as dist
8 | import torch.nn as nn
9 | import torch.nn.functional as F
10 | import torch.optim as optim
11 | import torch.utils.data
12 | import torch.utils.data.distributed
13 | from torchvision import datasets, transforms
14 |
15 | logger = logging.getLogger(__name__)
16 | logger.setLevel(logging.DEBUG)
17 | logger.addHandler(logging.StreamHandler(sys.stdout))
18 |
19 |
20 | # Based on https://github.com/pytorch/examples/blob/master/mnist/main.py
21 | class Net(nn.Module):
22 | def __init__(self):
23 | super(Net, self).__init__()
24 | self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
25 | self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
26 | self.conv2_drop = nn.Dropout2d()
27 | self.fc1 = nn.Linear(320, 50)
28 | self.fc2 = nn.Linear(50, 10)
29 |
30 | def forward(self, x):
31 | x = F.relu(F.max_pool2d(self.conv1(x), 2))
32 | x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
33 | x = x.view(-1, 320)
34 | x = F.relu(self.fc1(x))
35 | x = F.dropout(x, training=self.training)
36 | x = self.fc2(x)
37 | return F.log_softmax(x, dim=1)
38 |
39 |
40 | def _get_train_data_loader(batch_size, training_dir, is_distributed, **kwargs):
41 | logger.info("Get train data loader")
42 | dataset = datasets.MNIST(training_dir, train=True, transform=transforms.Compose([
43 | transforms.ToTensor(),
44 | transforms.Normalize((0.1307,), (0.3081,))
45 | ]))
46 | train_sampler = torch.utils.data.distributed.DistributedSampler(dataset) if is_distributed else None
47 | return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=train_sampler is None,
48 | sampler=train_sampler, **kwargs)
49 |
50 |
51 | def _get_test_data_loader(test_batch_size, training_dir, **kwargs):
52 | logger.info("Get test data loader")
53 | return torch.utils.data.DataLoader(
54 | datasets.MNIST(training_dir, train=False, transform=transforms.Compose([
55 | transforms.ToTensor(),
56 | transforms.Normalize((0.1307,), (0.3081,))
57 | ])),
58 | batch_size=test_batch_size, shuffle=True, **kwargs)
59 |
60 |
61 | def _average_gradients(model):
62 | # Gradient averaging.
63 | size = float(dist.get_world_size())
64 | for param in model.parameters():
65 | dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)
66 | param.grad.data /= size
67 |
68 |
69 | def train(args):
70 | is_distributed = len(args.hosts) > 1 and args.backend is not None
71 | logger.debug("Distributed training - {}".format(is_distributed))
72 | use_cuda = args.num_gpus > 0
73 | logger.debug("Number of gpus available - {}".format(args.num_gpus))
74 | kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
75 | device = torch.device("cuda" if use_cuda else "cpu")
76 |
77 | if is_distributed:
78 | # Initialize the distributed environment.
79 | world_size = len(args.hosts)
80 | os.environ['WORLD_SIZE'] = str(world_size)
81 | host_rank = args.hosts.index(args.current_host)
82 | os.environ['RANK'] = str(host_rank)
83 | dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
84 | logger.info('Initialized the distributed environment: \'{}\' backend on {} nodes. '.format(
85 | args.backend, dist.get_world_size()) + 'Current host rank is {}. Number of gpus: {}'.format(
86 | dist.get_rank(), args.num_gpus))
87 |
88 | # set the seed for generating random numbers
89 | torch.manual_seed(args.seed)
90 | if use_cuda:
91 | torch.cuda.manual_seed(args.seed)
92 |
93 | train_loader = _get_train_data_loader(args.batch_size, args.data_dir, is_distributed, **kwargs)
94 | test_loader = _get_test_data_loader(args.test_batch_size, args.data_dir, **kwargs)
95 |
96 | logger.debug("Processes {}/{} ({:.0f}%) of train data".format(
97 | len(train_loader.sampler), len(train_loader.dataset),
98 | 100. * len(train_loader.sampler) / len(train_loader.dataset)
99 | ))
100 |
101 | logger.debug("Processes {}/{} ({:.0f}%) of test data".format(
102 | len(test_loader.sampler), len(test_loader.dataset),
103 | 100. * len(test_loader.sampler) / len(test_loader.dataset)
104 | ))
105 |
106 | model = Net().to(device)
107 | if is_distributed and use_cuda:
108 | # multi-machine multi-gpu case
109 | model = torch.nn.parallel.DistributedDataParallel(model)
110 | else:
111 | # single-machine multi-gpu case or single-machine or multi-machine cpu case
112 | model = torch.nn.DataParallel(model)
113 |
114 | optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
115 |
116 | for epoch in range(1, args.epochs + 1):
117 | model.train()
118 | for batch_idx, (data, target) in enumerate(train_loader, 1):
119 | data, target = data.to(device), target.to(device)
120 | optimizer.zero_grad()
121 | output = model(data)
122 | loss = F.nll_loss(output, target)
123 | loss.backward()
124 | if is_distributed and not use_cuda:
125 | # average gradients manually for multi-machine cpu case only
126 | _average_gradients(model)
127 | optimizer.step()
128 | if batch_idx % args.log_interval == 0:
129 | logger.info('Train Epoch: {} [{}/{} ({:.0f}%)] Loss: {:.6f}'.format(
130 | epoch, batch_idx * len(data), len(train_loader.sampler),
131 | 100. * batch_idx / len(train_loader), loss.item()))
132 | test(model, test_loader, device)
133 | save_model(model, args.model_dir)
134 |
135 |
136 | def test(model, test_loader, device):
137 | model.eval()
138 | test_loss = 0
139 | correct = 0
140 | with torch.no_grad():
141 | for data, target in test_loader:
142 | data, target = data.to(device), target.to(device)
143 | output = model(data)
144 | test_loss += F.nll_loss(output, target, size_average=False).item() # sum up batch loss
145 | pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
146 | correct += pred.eq(target.view_as(pred)).sum().item()
147 |
148 | test_loss /= len(test_loader.dataset)
149 | logger.info('Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
150 | test_loss, correct, len(test_loader.dataset),
151 | 100. * correct / len(test_loader.dataset)))
152 |
153 |
154 | def model_fn(model_dir):
155 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
156 | model = torch.nn.DataParallel(Net())
157 | with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
158 | model.load_state_dict(torch.load(f))
159 | return model.to(device)
160 |
161 |
162 | def save_model(model, model_dir):
163 | logger.info("Saving the model.")
164 | path = os.path.join(model_dir, 'model.pth')
165 | # recommended way from http://pytorch.org/docs/master/notes/serialization.html
166 | torch.save(model.cpu().state_dict(), path)
167 |
168 |
169 | if __name__ == '__main__':
170 |
171 | parser = argparse.ArgumentParser()
172 |
173 | # Data and model checkpoints directories
174 | parser.add_argument('--batch-size', type=int, default=64, metavar='N',
175 | help='input batch size for training (default: 64)')
176 | parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
177 | help='input batch size for testing (default: 1000)')
178 | parser.add_argument('--epochs', type=int, default=10, metavar='N',
179 | help='number of epochs to train (default: 10)')
180 | parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
181 | help='learning rate (default: 0.01)')
182 | parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
183 | help='SGD momentum (default: 0.5)')
184 | parser.add_argument('--seed', type=int, default=1, metavar='S',
185 | help='random seed (default: 1)')
186 | parser.add_argument('--log-interval', type=int, default=100, metavar='N',
187 | help='how many batches to wait before logging training status')
188 | parser.add_argument('--backend', type=str, default=None,
189 | help='backend for distributed training (tcp, gloo on cpu and gloo, nccl on gpu)')
190 |
191 | # Container environment
192 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ['SM_HOSTS']))
193 | parser.add_argument('--current-host', type=str, default=os.environ['SM_CURRENT_HOST'])
194 | parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
195 | parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
196 | parser.add_argument('--num-gpus', type=int, default=os.environ['SM_NUM_GPUS'])
197 |
198 | train(parser.parse_args())
199 |
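200 | # -----------------------------------------------------------------------------
201 | # Local smoke test (a sketch, not part of the SageMaker workflow): inside a
202 | # SageMaker training container the SM_* environment variables read above are
203 | # set automatically. To run the script outside SageMaker, export hypothetical
204 | # values yourself, assuming ./data already contains MNIST in the torchvision
205 | # layout:
206 | #
207 | #   SM_HOSTS='["algo-1"]' SM_CURRENT_HOST=algo-1 SM_MODEL_DIR=./model \
208 | #   SM_CHANNEL_TRAINING=./data SM_NUM_GPUS=0 \
209 | #   python mnist_pytorch.py --epochs 1 --batch-size 64
210 | # -----------------------------------------------------------------------------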
--------------------------------------------------------------------------------
/pytorch/src/requirements.txt:
--------------------------------------------------------------------------------
1 | # Python dependencies go here
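2 | #
3 | # Hypothetical examples (none are required by this notebook). Anything listed
4 | # here is pip-installed on the training instance before the entry point runs:
5 | #
6 | # torchvision==0.4.2
7 | # matplotlib==3.1.2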
--------------------------------------------------------------------------------
/quota-increase.md:
--------------------------------------------------------------------------------
1 | # How to increase your AWS Service Quotas for Amazon SageMaker
2 |
3 | Your AWS account has default quotas, formerly referred to as limits, for each AWS service.
4 | Unless otherwise noted, each quota is Region-specific.
5 |
6 | ## The default service quotas for Amazon SageMaker are listed [here](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html#limits_sagemaker).
7 |
8 | Please note that the service quotas may vary between different Amazon SageMaker services as described in the link shared above:
9 |
10 | - Amazon SageMaker Notebooks
11 | - Amazon SageMaker Automatic Model Tuning
12 | - Amazon SageMaker Processing
13 | - Amazon SageMaker Training and Managed Spot Training
14 | - Amazon SageMaker Hosting
15 | - Amazon SageMaker Batch Transform
16 | - Amazon SageMaker Ground Truth
17 |
18 | For model training you should look at the quota for **Amazon SageMaker Training and Managed Spot Training**.
19 |
20 | ## Which instance type do I need?
21 |
22 | Here's a list of [supported Amazon SageMaker instance types](https://aws.amazon.com/sagemaker/pricing/instance-types/) (ml.xxx) comparing the different instances by vCPU, GPU, Mem (GiB), GPU Mem (GiB) and Network Performance.
23 |
24 | You can find the GPU-based instance types under "Accelerated Computing – Current Generation".
25 |
26 |
27 | ## Request quota increase
28 |
29 | You can request a quota increase using the AWS Support Center as follows:
30 |
31 | ### 1. Open the [AWS Support Center](https://console.aws.amazon.com/support/home#/) (sign in if necessary)
32 | ### 2. Choose ***Create case***
33 | 
34 |
35 | ### 3. Choose ***Service limit increase***.
36 | 
37 |
38 | ### 4. Fill in your request details
39 | 
40 | * Select the AWS Region you're working in
41 | * Select the relevant service, e.g. SageMaker Training
42 | * Select the limit to increase
43 | (e.g. limits per instance type like ml.p3.8xlarge, or number of instances overall)
44 | * Set required new limit value
45 |
46 | ### 5. Add a case description and contact options, and hit ***submit***!
47 | 
48 |
49 |
50 |
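51 | ## (Optional) Request the increase programmatically
52 |
53 | The console workflow above is the documented path. As a sketch of an alternative, the AWS Service Quotas API exposes the same quotas; the `boto3` snippet below assumes your credentials and region are already configured, and the quota code shown in the commented-out request is a placeholder that you would look up from the listing first.
54 |
55 | ```python
56 | import boto3
57 |
58 | sq = boto3.client('service-quotas')
59 |
60 | # List the default SageMaker quotas to find the QuotaCode for the
61 | # instance type you need (paginate with NextToken for the full list).
62 | response = sq.list_aws_default_service_quotas(ServiceCode='sagemaker')
63 | for quota in response['Quotas']:
64 |     print(quota['QuotaCode'], quota['QuotaName'], quota['Value'])
65 |
66 | # Then request the new value (placeholder quota code and value shown):
67 | # sq.request_service_quota_increase(ServiceCode='sagemaker',
68 | #                                   QuotaCode='L-XXXXXXXX',
69 | #                                   DesiredValue=2.0)
70 | ```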
--------------------------------------------------------------------------------
/tensorflow/src/mnist-2.py:
--------------------------------------------------------------------------------
1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License"). You
4 | # may not use this file except in compliance with the License. A copy of
5 | # the License is located at
6 | #
7 | # http://aws.amazon.com/apache2.0/
8 | #
9 | # or in the "license" file accompanying this file. This file is
10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
11 | # ANY KIND, either express or implied. See the License for the specific
12 | # language governing permissions and limitations under the License.
13 |
14 | import tensorflow as tf
15 | import argparse
16 | import os
17 | import numpy as np
18 | import json
19 |
20 |
21 | def model(x_train, y_train, x_test, y_test):
22 | """Generate a simple model"""
23 | model = tf.keras.models.Sequential([
24 | tf.keras.layers.Flatten(),
25 | tf.keras.layers.Dense(1024, activation=tf.nn.relu),
26 | tf.keras.layers.Dropout(0.4),
27 | tf.keras.layers.Dense(10, activation=tf.nn.softmax)
28 | ])
29 |
30 | model.compile(optimizer='adam',
31 | loss='sparse_categorical_crossentropy',
32 | metrics=['accuracy'])
33 | model.fit(x_train, y_train)
34 | model.evaluate(x_test, y_test)
35 |
36 | return model
37 |
38 |
39 | def _load_training_data(base_dir):
40 | """Load MNIST training data"""
41 | x_train = np.load(os.path.join(base_dir, 'train_data.npy'))
42 | y_train = np.load(os.path.join(base_dir, 'train_labels.npy'))
43 | return x_train, y_train
44 |
45 |
46 | def _load_testing_data(base_dir):
47 | """Load MNIST testing data"""
48 | x_test = np.load(os.path.join(base_dir, 'eval_data.npy'))
49 | y_test = np.load(os.path.join(base_dir, 'eval_labels.npy'))
50 | return x_test, y_test
51 |
52 |
53 | def _parse_args():
54 | parser = argparse.ArgumentParser()
55 |
56 | # Data, model, and output directories
57 |     # model_dir is always passed in from SageMaker. By default this is an S3 path under the default bucket.
58 | parser.add_argument('--model_dir', type=str)
59 | parser.add_argument('--sm-model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
60 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
61 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ.get('SM_HOSTS')))
62 | parser.add_argument('--current-host', type=str, default=os.environ.get('SM_CURRENT_HOST'))
63 |
64 | return parser.parse_known_args()
65 |
66 |
67 | if __name__ == "__main__":
68 | args, unknown = _parse_args()
69 |
70 | train_data, train_labels = _load_training_data(args.train)
71 | eval_data, eval_labels = _load_testing_data(args.train)
72 |
73 | mnist_classifier = model(train_data, train_labels, eval_data, eval_labels)
74 |
75 | if args.current_host == args.hosts[0]:
76 |         # save the model under version subdirectory '000000001' (SM_MODEL_DIR is uploaded to S3 by SageMaker)
77 |         mnist_classifier.save(os.path.join(args.sm_model_dir, '000000001'))
78 |
--------------------------------------------------------------------------------
/tensorflow/src/mnist.py:
--------------------------------------------------------------------------------
1 | # Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License"). You
4 | # may not use this file except in compliance with the License. A copy of
5 | # the License is located at
6 | #
7 | # http://aws.amazon.com/apache2.0/
8 | #
9 | # or in the "license" file accompanying this file. This file is
10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
11 | # ANY KIND, either express or implied. See the License for the specific
12 | # language governing permissions and limitations under the License.
13 | """Convolutional Neural Network Estimator for MNIST, built with tf.layers."""
14 |
15 | from __future__ import absolute_import
16 | from __future__ import division
17 | from __future__ import print_function
18 |
19 | import numpy as np
20 | import tensorflow as tf
21 | import os
22 | import json
23 | import argparse
24 | from tensorflow.python.platform import tf_logging
25 | import logging as _logging
26 | import sys as _sys
27 |
28 |
29 | def cnn_model_fn(features, labels, mode):
30 | """Model function for CNN."""
31 | # Input Layer
32 | # Reshape X to 4-D tensor: [batch_size, width, height, channels]
33 | # MNIST images are 28x28 pixels, and have one color channel
34 | input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])
35 |
36 | # Convolutional Layer #1
37 | # Computes 32 features using a 5x5 filter with ReLU activation.
38 | # Padding is added to preserve width and height.
39 | # Input Tensor Shape: [batch_size, 28, 28, 1]
40 | # Output Tensor Shape: [batch_size, 28, 28, 32]
41 | conv1 = tf.layers.conv2d(
42 | inputs=input_layer,
43 | filters=32,
44 | kernel_size=[5, 5],
45 | padding="same",
46 | activation=tf.nn.relu)
47 |
48 | # Pooling Layer #1
49 | # First max pooling layer with a 2x2 filter and stride of 2
50 | # Input Tensor Shape: [batch_size, 28, 28, 32]
51 | # Output Tensor Shape: [batch_size, 14, 14, 32]
52 | pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
53 |
54 | # Convolutional Layer #2
55 | # Computes 64 features using a 5x5 filter.
56 | # Padding is added to preserve width and height.
57 | # Input Tensor Shape: [batch_size, 14, 14, 32]
58 | # Output Tensor Shape: [batch_size, 14, 14, 64]
59 | conv2 = tf.layers.conv2d(
60 | inputs=pool1,
61 | filters=64,
62 | kernel_size=[5, 5],
63 | padding="same",
64 | activation=tf.nn.relu)
65 |
66 | # Pooling Layer #2
67 | # Second max pooling layer with a 2x2 filter and stride of 2
68 | # Input Tensor Shape: [batch_size, 14, 14, 64]
69 | # Output Tensor Shape: [batch_size, 7, 7, 64]
70 | pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
71 |
72 | # Flatten tensor into a batch of vectors
73 | # Input Tensor Shape: [batch_size, 7, 7, 64]
74 | # Output Tensor Shape: [batch_size, 7 * 7 * 64]
75 | pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
76 |
77 | # Dense Layer
78 | # Densely connected layer with 1024 neurons
79 | # Input Tensor Shape: [batch_size, 7 * 7 * 64]
80 | # Output Tensor Shape: [batch_size, 1024]
81 | dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)
82 |
83 | # Add dropout operation; 0.6 probability that element will be kept
84 | dropout = tf.layers.dropout(
85 | inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN)
86 |
87 | # Logits layer
88 | # Input Tensor Shape: [batch_size, 1024]
89 | # Output Tensor Shape: [batch_size, 10]
90 | logits = tf.layers.dense(inputs=dropout, units=10)
91 |
92 | predictions = {
93 | # Generate predictions (for PREDICT and EVAL mode)
94 | "classes": tf.argmax(input=logits, axis=1),
95 | # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
96 | # `logging_hook`.
97 | "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
98 | }
99 | if mode == tf.estimator.ModeKeys.PREDICT:
100 | return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
101 |
102 | # Calculate Loss (for both TRAIN and EVAL modes)
103 | loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
104 |
105 | # Configure the Training Op (for TRAIN mode)
106 | if mode == tf.estimator.ModeKeys.TRAIN:
107 | optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
108 | train_op = optimizer.minimize(
109 | loss=loss,
110 | global_step=tf.train.get_global_step())
111 | return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
112 |
113 | # Add evaluation metrics (for EVAL mode)
114 | eval_metric_ops = {
115 | "accuracy": tf.metrics.accuracy(
116 | labels=labels, predictions=predictions["classes"])}
117 | return tf.estimator.EstimatorSpec(
118 | mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)
119 |
120 | def _load_training_data(base_dir):
121 | x_train = np.load(os.path.join(base_dir, 'train_data.npy'))
122 | y_train = np.load(os.path.join(base_dir, 'train_labels.npy'))
123 | return x_train, y_train
124 |
125 | def _load_testing_data(base_dir):
126 | x_test = np.load(os.path.join(base_dir, 'eval_data.npy'))
127 | y_test = np.load(os.path.join(base_dir, 'eval_labels.npy'))
128 | return x_test, y_test
129 |
130 | def _parse_args():
131 |
132 | parser = argparse.ArgumentParser()
133 |
134 | # Data, model, and output directories
135 |     # model_dir is always passed in from SageMaker. By default this is an S3 path under the default bucket.
136 | parser.add_argument('--model_dir', type=str)
137 | parser.add_argument('--sm-model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
138 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
139 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ.get('SM_HOSTS')))
140 | parser.add_argument('--current-host', type=str, default=os.environ.get('SM_CURRENT_HOST'))
141 |
142 | return parser.parse_known_args()
143 |
144 | def serving_input_fn():
145 | inputs = {'x': tf.placeholder(tf.float32, [None, 784])}
146 | return tf.estimator.export.ServingInputReceiver(inputs, inputs)
147 |
148 | if __name__ == "__main__":
149 | args, unknown = _parse_args()
150 |
151 | train_data, train_labels = _load_training_data(args.train)
152 | eval_data, eval_labels = _load_testing_data(args.train)
153 |
154 | # Create the Estimator
155 | mnist_classifier = tf.estimator.Estimator(
156 | model_fn=cnn_model_fn, model_dir=args.model_dir)
157 |
158 | # Set up logging for predictions
159 | # Log the values in the "Softmax" tensor with label "probabilities"
160 | tensors_to_log = {"probabilities": "softmax_tensor"}
161 | logging_hook = tf.train.LoggingTensorHook(
162 | tensors=tensors_to_log, every_n_iter=50)
163 |
164 | # Train the model
165 | train_input_fn = tf.estimator.inputs.numpy_input_fn(
166 | x={"x": train_data},
167 | y=train_labels,
168 | batch_size=100,
169 | num_epochs=None,
170 | shuffle=True)
171 |
172 | # Evaluate the model and print results
173 | eval_input_fn = tf.estimator.inputs.numpy_input_fn(
174 | x={"x": eval_data},
175 | y=eval_labels,
176 | num_epochs=1,
177 | shuffle=False)
178 |
179 | train_spec = tf.estimator.TrainSpec(train_input_fn, max_steps=20000)
180 | eval_spec = tf.estimator.EvalSpec(eval_input_fn)
181 | tf.estimator.train_and_evaluate(mnist_classifier, train_spec, eval_spec)
182 |
183 | if args.current_host == args.hosts[0]:
184 | mnist_classifier.export_savedmodel(args.sm_model_dir, serving_input_fn)
185 |
--------------------------------------------------------------------------------
/tensorflow/src/mnist_keras_tf2.py:
--------------------------------------------------------------------------------
1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License"). You
4 | # may not use this file except in compliance with the License. A copy of
5 | # the License is located at
6 | #
7 | # http://aws.amazon.com/apache2.0/
8 | #
9 | # or in the "license" file accompanying this file. This file is
10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
11 | # ANY KIND, either express or implied. See the License for the specific
12 | # language governing permissions and limitations under the License.
13 |
14 | import tensorflow as tf
15 | from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint
16 | import argparse
17 | import os
18 | import numpy as np
19 | import json
20 | from datetime import datetime
21 |
22 | class SyncToS3(tf.keras.callbacks.Callback):
23 | def __init__(self, logdir, s3logdir):
24 | super(SyncToS3, self).__init__()
25 | self.logdir = logdir
26 | self.s3logdir = s3logdir
27 |
28 | # Explicitly sync to S3 upon completion
29 | def on_epoch_end(self, batch, logs={}):
30 | os.system('aws s3 sync ' + self.logdir + ' ' + self.s3logdir)
31 | # ' >/dev/null 2>&1'
32 |
33 | def model(x_train, y_train, x_test, y_test, args):
34 | """Generate a simple model"""
35 | model = tf.keras.models.Sequential([
36 | tf.keras.layers.Flatten(),
37 | tf.keras.layers.Dense(1024, activation=tf.nn.relu),
38 | tf.keras.layers.Dropout(0.4),
39 | tf.keras.layers.Dense(10, activation=tf.nn.softmax)
40 | ])
41 |
42 | model.compile(optimizer=args.optimizer,
43 | loss='sparse_categorical_crossentropy',
44 | metrics=['accuracy'])
45 |
46 | callbacks = []
47 | logdir = args.output_data_dir + '/' + datetime.now().strftime("%Y%m%d-%H%M%S")
48 | callbacks.append(ModelCheckpoint(args.output_data_dir + '/checkpoint-{epoch}.h5'))
49 | callbacks.append(TensorBoard(log_dir=logdir, profile_batch=0))
50 | callbacks.append(SyncToS3(logdir=logdir, s3logdir=args.model_dir))
51 |
52 | model.fit(x=x_train,
53 | y=y_train,
54 | callbacks=callbacks,
55 | epochs=args.epochs)
56 |
57 | score = model.evaluate(x=x_test,
58 | y=y_test)
59 | print('Test loss :', score[0])
60 | print('Test accuracy:', score[1])
61 |
62 | return model
63 |
64 |
65 | def _load_training_data(base_dir):
66 | """Load MNIST training data"""
67 | print(base_dir)
68 | x_train = np.load(os.path.join(base_dir, 'train_data.npy'))
69 | y_train = np.load(os.path.join(base_dir, 'train_labels.npy'))
70 | return x_train, y_train
71 |
72 |
73 | def _load_testing_data(base_dir):
74 | """Load MNIST testing data"""
75 | print(base_dir)
76 | x_test = np.load(os.path.join(base_dir, 'eval_data.npy'))
77 | y_test = np.load(os.path.join(base_dir, 'eval_labels.npy'))
78 | return x_test, y_test
79 |
80 |
81 | def _parse_args():
82 | parser = argparse.ArgumentParser()
83 |
84 | # Hyper-parameters
85 | parser.add_argument('--epochs', type=int, default=5)
86 | parser.add_argument('--optimizer', type=str, default='adam')
87 |
88 | # SageMaker parameters
89 |     # model_dir is always passed in from SageMaker. (By default, it is an S3 path under the default bucket.)
90 | parser.add_argument('--model_dir', type=str)
91 | parser.add_argument('--sm-model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
92 | parser.add_argument('--model_output_dir', type=str, default=os.environ['SM_MODEL_DIR'])
93 | parser.add_argument('--output_data_dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
94 |
95 | # Data directories and other options
96 | parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
97 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ['SM_HOSTS']))
98 | parser.add_argument('--current-host', type=str, default=os.environ['SM_CURRENT_HOST'])
99 |
100 | return parser.parse_known_args()
101 |
102 |
103 | if __name__ == "__main__":
104 |
105 | args, unknown = _parse_args()
106 |
107 | train_data, train_labels = _load_training_data(args.train)
108 | eval_data, eval_labels = _load_testing_data(args.train)
109 |
110 | mnist_classifier = model(train_data,
111 | train_labels,
112 | eval_data,
113 | eval_labels,
114 | args)
115 |
116 | print('current_host: {}'.format(args.current_host))
117 | print('hosts[0]: {}'.format(args.hosts[0]))
118 | if args.current_host == args.hosts[0]:
119 |         # save the model under version subdirectory '000000001' (both paths are uploaded to S3 by SageMaker)
120 |         mnist_classifier.save(os.path.join(args.sm_model_dir, '000000001'))
121 |         mnist_classifier.save(os.path.join('/opt/ml/model/', '000000001'))
122 |
123 | # TODO: Copy .h5 file to /opt/ml/model/ (backed by S3)
124 | # import shutil
125 | # shutil.copyfile('./sm_tensorflow_mnist.h5', '/opt/ml/model/000000001/sm_tensorflow_mnist.h5')
126 | # shutil.copyfile('./opt_tensorflow_mnist.h5', '/opt/ml/model/000000001/opt_tensorflow_mnist.h5')
--------------------------------------------------------------------------------
/tensorflow/src/requirements.txt:
--------------------------------------------------------------------------------
1 | # Python dependencies go here...
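2 | #
3 | # Hypothetical examples (none are required by this notebook). Entries here are
4 | # pip-installed at training time; the accompanying src/setup.py is what makes
5 | # script mode pick this file up (see the note in the notebook):
6 | #
7 | # pandas==0.25.3
8 | # matplotlib==3.1.2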
--------------------------------------------------------------------------------
/tensorflow/src/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 |
3 | setup(name='mnist_keras_tf2',
4 | version='1.0',
5 | description='SageMaker Example for MNIST Keras TensorFlow 2.x',
6 | author='cfregly',
7 | author_email='chris@fregly.com',
8 | url='https://github.com/data-science-on-aws',
9 | packages=find_packages(exclude=('tests', 'docs')))
--------------------------------------------------------------------------------
/tensorflow/tensorflow_mnist.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# TensorFlow Training and Serving in SageMaker \"Script Mode\"\n",
8 | "\n",
9 | "Script mode is a training script format for TensorFlow that lets you execute any TensorFlow training script in SageMaker with minimal modification. The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native TensorFlow support sets up training-related environment variables and executes your training script. In this tutorial, we use the SageMaker Python SDK to launch a training job and deploy the trained model.\n",
10 | "\n",
11 | "Script mode supports training with a Python script, a Python module, or a shell script. In this example, we use a Python script to train a classification model on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). In this example, we will show how easily you can train a SageMaker using TensorFlow 1.x and TensorFlow 2.x scripts with SageMaker Python SDK. In addition, this notebook demonstrates how to perform real time inference with the [SageMaker TensorFlow Serving container](https://github.com/aws/sagemaker-tensorflow-serving-container). The TensorFlow Serving container is the default inference method for script mode. For full documentation on the TensorFlow Serving container, please visit [here](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst).\n"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "## Set up the environment\n",
19 | "\n",
20 | "Let's start by setting up the environment:"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "### Install TensorFlow and SageMaker\n",
28 | "_Note: Ignore Warnings and Errors Below_"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": null,
34 | "metadata": {},
35 | "outputs": [],
36 | "source": [
37 | "!pip install --upgrade pip\n",
38 | "!pip install tensorflow==2.1.0"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "!pip3 install sagemaker --upgrade --ignore-installed --no-cache --user"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "### Restart the Kernel to Recognize New Dependencies Above"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "from IPython.display import display_html\n",
64 | "display_html(\"\", raw=True)"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "## Create the SageMaker Session"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": null,
77 | "metadata": {},
78 | "outputs": [],
79 | "source": [
80 | "import os\n",
81 | "import sagemaker\n",
82 | "from sagemaker import get_execution_role\n",
83 | "\n",
84 | "sagemaker_session = sagemaker.Session()"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "## Setup the Service Execution Role and Region\n",
92 | "Get IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `sagemaker.get_execution_role()` with a the appropriate full IAM role arn string(s)."
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": null,
98 | "metadata": {},
99 | "outputs": [],
100 | "source": [
101 | "role = get_execution_role()\n",
102 | "print('RoleARN: {}\\n'.format(role))\n",
103 | "\n",
104 | "region = sagemaker_session.boto_session.region_name\n",
105 | "print('Region: {}'.format(region))"
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "## Training Data\n",
113 | "\n",
114 | "The MNIST dataset has been loaded to the public S3 buckets ``sagemaker-sample-data-`` under the prefix ``tensorflow/mnist``."
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": [
123 | "original_training_data_uri = 's3://sagemaker-sample-data-{}/tensorflow/mnist'.format(region)\n",
124 | "print(original_training_data_uri)"
125 | ]
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "### Copy the Training Data to Your Notebook Disk"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": null,
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "local_data_path = './data'"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "metadata": {
147 | "scrolled": true
148 | },
149 | "outputs": [],
150 | "source": [
151 | "!aws --region {region} s3 cp --recursive {original_training_data_uri} {local_data_path}"
152 | ]
153 | },
154 | {
155 | "cell_type": "markdown",
156 | "metadata": {},
157 | "source": [
158 | "There are four ``.npy`` file under this prefix:\n",
159 | "* ``train_data.npy``\n",
160 | "* ``eval_data.npy``\n",
161 | "* ``train_labels.npy``\n",
162 | "* ``eval_labels.npy``"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": null,
168 | "metadata": {
169 | "scrolled": true
170 | },
171 | "outputs": [],
172 | "source": [
173 | "!ls {local_data_path}"
174 | ]
175 | },
176 | {
177 | "cell_type": "markdown",
178 | "metadata": {},
179 | "source": [
180 | "### Upload the Data to S3 for Distributed Training Across Many Workers\n",
181 | "We are going to use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value inputs identifies the location -- we will use later when we start the training job.\n",
182 | "\n",
183 | "This is S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting."
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": null,
189 | "metadata": {},
190 | "outputs": [],
191 | "source": [
192 | "# Change to your S3 bucket name\n",
193 | "bucket = 'ai4good-hackathon'\n",
194 | "\n",
195 | "data_prefix = 'sagemaker/tensorflow-mnist/data'"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": null,
201 | "metadata": {},
202 | "outputs": [],
203 | "source": [
204 | "training_data_uri = sagemaker_session.upload_data(path=local_data_path, bucket=bucket, key_prefix=data_prefix)\n",
205 | "print(training_data_uri)"
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": null,
211 | "metadata": {},
212 | "outputs": [],
213 | "source": [
214 | "!aws s3 ls --recursive {training_data_uri}"
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "## Train\n",
222 | "https://sagemaker.readthedocs.io/en/stable/using_tf.html#distributed-training\n",
223 | "\n",
224 | "This tutorial's training script was adapted from TensorFlow's official [CNN MNIST example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py). We have modified it to handle the ``model_dir`` parameter passed in by SageMaker. This is an S3 path which can be used for data sharing during distributed training and checkpointing and/or model persistence. We have also added an argument-parsing function to handle processing training-related variables.\n",
225 | "\n",
226 | "At the end of the training job we have added a step to export the trained model to the path stored in the environment variable ``SM_MODEL_DIR``, which always points to ``/opt/ml/model``. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at end of training."
227 | ]
228 | },
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {},
232 | "source": [
233 | "### Training Script"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": null,
239 | "metadata": {},
240 | "outputs": [],
241 | "source": [
242 | "!ls ./src/mnist_keras_tf2.py"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "metadata": {},
248 | "source": [
249 | "You can add custom Python modules to the `src/requirements.txt` file. They will automatically be installed - and made available to your training script."
250 | ]
251 | },
252 | {
253 | "cell_type": "code",
254 | "execution_count": null,
255 | "metadata": {},
256 | "outputs": [],
257 | "source": [
258 | "!cat ./src/requirements.txt"
259 | ]
260 | },
261 | {
262 | "cell_type": "markdown",
263 | "metadata": {},
264 | "source": [
265 | "### Train with SageMaker `TensorFlow` Estimator\n",
266 | "\n",
267 | "https://sagemaker.readthedocs.io/en/stable/using_tf.html#distributed-training\n",
268 | "\n",
269 | "The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to a S3 location and creating a SageMaker training job. Let's call out a couple important parameters here:\n",
270 | "\n",
271 | "* `py_version` is set to `'py3'`\n",
272 | "* `script_mode` is set to `True`\n",
273 | "* `distributions` is used to configure the distributed training setup. It's required only if you are doing distributed training either across a cluster of instances or across multiple GPUs. Here we are using parameter servers as the distributed training schema. SageMaker training jobs run on homogeneous clusters. To make parameter server more performant in the SageMaker setup, we run a parameter server on every instance in the cluster, so there is no need to specify the number of parameter servers to launch. You can find the full documentation on how to configure `distributions` [here](https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#distributed-training).\n",
274 | "\n",
275 | "Notes: \n",
276 | "* This example uses two(2) `ml.p3.2xlarge` instances. You will likely need to request a SageMaker instance limit increase from Support before continuing.\n",
277 | "\n",
278 | "* Alternatively, you can specify `ml.c5.2xlarge` or choose only one(1) `ml.p3.2xlarge`.\n",
279 | "\n",
280 | "* To recognize the `requirements.txt`, we must include `src/setup.py` per [this](https://github.com/aws/sagemaker-python-sdk/issues/911) GitHub issue."
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": null,
286 | "metadata": {},
287 | "outputs": [],
288 | "source": [
289 | "from sagemaker.tensorflow import TensorFlow\n",
290 | "\n",
291 | "model_output_path = 's3://{}/sagemaker/tensorflow-mnist/training-runs'.format(bucket)\n",
292 | "\n",
293 | "mnist_estimator = TensorFlow(entry_point='mnist_keras_tf2.py',\n",
294 | " source_dir='./src',\n",
295 | " output_path=model_output_path,\n",
296 | " role=role,\n",
297 | " train_instance_count=1,\n",
298 | " train_instance_type='ml.c5.2xlarge',\n",
299 | " framework_version='2.0.0',\n",
300 | " py_version='py3',\n",
301 | " enable_sagemaker_metrics=True,\n",
302 | " script_mode=True,\n",
303 | " distributions={'parameter_server': {'enabled': True}},\n",
304 | " tags = [{'Key' : 'Project', 'Value' : 'tensorflow-mnist'},\n",
305 | " {'Key' : 'TensorBoard', 'Value' : 'dist'}],\n",
306 | " # Assuming the loglines from the TensorFlow training job are as follows:\n",
307 | " # Test loss : 0.0635053280624561\n",
308 | " # Test accuracy: 0.9821\n",
309 | " metric_definitions=[\n",
310 | " {'Name': 'test:loss', 'Regex': 'Test loss : ([0-9\\\\.]+)'},\n",
311 | " {'Name': 'test:accuracy', 'Regex': 'Test accuracy: ([0-9\\\\.]+)'},\n",
312 | " ])"
313 | ]
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "### `fit` the Model (Approx. 15 mins)\n",
320 | "\n",
321 | "To start a training job, we call `estimator.fit(training_data_uri)`.\n",
322 | "\n",
323 | "An S3 location is used here as the input. `fit` creates a default channel named `'training'`, which points to this S3 location. In the training script we can then access the training data from the location stored in `SM_CHANNEL_TRAINING`. `fit` accepts other parameters, as well. See the API doc [here](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.fit) for details."
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": null,
329 | "metadata": {},
330 | "outputs": [],
331 | "source": [
332 | "mnist_estimator.fit(inputs=training_data_uri, wait=False)\n",
333 | "\n",
334 | "training_job_name = mnist_estimator.latest_training_job.name\n",
335 | "print('training_job_name: {}'.format(training_job_name))"
336 | ]
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {},
341 | "source": [
342 | "After some time, or in a separate Python notebook, we can attach to the running job using the `training_job_name`."
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": null,
348 | "metadata": {
349 | "scrolled": true
350 | },
351 | "outputs": [],
352 | "source": [
353 | "from sagemaker.tensorflow import TensorFlow\n",
354 | "\n",
355 | "mnist_estimator = TensorFlow.attach(training_job_name=training_job_name)"
356 | ]
357 | },
358 | {
359 | "cell_type": "code",
360 | "execution_count": null,
361 | "metadata": {},
362 | "outputs": [],
363 | "source": [
364 | "!aws --region {region} s3 ls --recursive {model_output_path}/{training_job_name}"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": null,
370 | "metadata": {},
371 | "outputs": [],
372 | "source": [
373 | "print(model_output_path)"
374 | ]
375 | },
376 | {
377 | "cell_type": "markdown",
378 | "metadata": {},
379 | "source": [
380 | "## Option 1: Predict Directly in the Notebook"
381 | ]
382 | },
383 | {
384 | "cell_type": "markdown",
385 | "metadata": {},
386 | "source": [
387 | "Use TensorFlow Core to load the model from `model_output_path`"
388 | ]
389 | },
390 | {
391 | "cell_type": "code",
392 | "execution_count": null,
393 | "metadata": {},
394 | "outputs": [],
395 | "source": [
396 | "!aws --region {region} s3 ls --recursive {model_output_path}/{training_job_name}/output/"
397 | ]
398 | },
399 | {
400 | "cell_type": "code",
401 | "execution_count": null,
402 | "metadata": {},
403 | "outputs": [],
404 | "source": [
405 | "!aws --region {region} s3 cp {model_output_path}/{training_job_name}/output/model.tar.gz ./model/model.tar.gz"
406 | ]
407 | },
408 | {
409 | "cell_type": "code",
410 | "execution_count": null,
411 | "metadata": {},
412 | "outputs": [],
413 | "source": [
414 | "saved_model_path = './model/000000001'"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": null,
420 | "metadata": {},
421 | "outputs": [],
422 | "source": [
423 | "!rm -rf {saved_model_path}"
424 | ]
425 | },
426 | {
427 | "cell_type": "code",
428 | "execution_count": null,
429 | "metadata": {},
430 | "outputs": [],
431 | "source": [
432 | "!ls ./model"
433 | ]
434 | },
435 | {
436 | "cell_type": "code",
437 | "execution_count": null,
438 | "metadata": {
439 | "scrolled": true
440 | },
441 | "outputs": [],
442 | "source": [
443 | "!tar -xzvf ./model/model.tar.gz -C ./model"
444 | ]
445 | },
446 | {
447 | "cell_type": "markdown",
448 | "metadata": {},
449 | "source": [
450 | "Load the model and list the prediction signatures"
451 | ]
452 | },
453 | {
454 | "cell_type": "code",
455 | "execution_count": null,
456 | "metadata": {},
457 | "outputs": [],
458 | "source": [
459 | "import tensorflow as tf\n",
460 | "\n",
461 | "loaded_model = tf.saved_model.load(export_dir='./model/000000001/')\n",
462 | "\n",
463 | "loaded_model.signatures"
464 | ]
465 | },
466 | {
467 | "cell_type": "markdown",
468 | "metadata": {},
469 | "source": [
470 | "Show the prediction signature details"
471 | ]
472 | },
473 | {
474 | "cell_type": "code",
475 | "execution_count": null,
476 | "metadata": {
477 | "scrolled": true
478 | },
479 | "outputs": [],
480 | "source": [
481 | "from tensorflow.python.tools import saved_model_cli\n",
482 | "\n",
483 | "parser = saved_model_cli.create_parser()\n",
484 | "args = parser.parse_args(['show', '--dir', saved_model_path, '--tag_set', 'serve', '--signature_def', 'serving_default'])\n",
485 | "saved_model_cli.show(args)"
486 | ]
487 | },
488 | {
489 | "cell_type": "markdown",
490 | "metadata": {},
491 | "source": [
492 | "### Perform inference with the signature"
493 | ]
494 | },
495 | {
496 | "cell_type": "code",
497 | "execution_count": null,
498 | "metadata": {},
499 | "outputs": [],
500 | "source": [
501 | "import numpy as np\n",
502 | "\n",
503 | "train_data = np.load('{}/train_data.npy'.format(local_data_path))\n",
504 | "train_labels = np.load('{}/train_labels.npy'.format(local_data_path))\n",
505 | "\n",
506 | "eval_data = np.load('{}/eval_data.npy'.format(local_data_path))\n",
507 | "eval_labels = np.load('{}/eval_labels.npy'.format(local_data_path))"
508 | ]
509 | },
510 | {
511 | "cell_type": "code",
512 | "execution_count": null,
513 | "metadata": {},
514 | "outputs": [],
515 | "source": [
516 | "import numpy as np\n",
517 | "from matplotlib import pyplot as plt\n",
518 | "\n",
519 | "image_idx = 4444 # random idx number\n",
520 | "\n",
521 | "plt.imshow(eval_data[image_idx].reshape(28, 28), cmap='Greys')"
522 | ]
523 | },
524 | {
525 | "cell_type": "code",
526 | "execution_count": null,
527 | "metadata": {},
528 | "outputs": [],
529 | "source": [
530 | "predict_fn = loaded_model.signatures[\"serving_default\"]\n",
531 | "\n",
532 | "result = predict_fn(input_1=tf.constant(eval_data[image_idx].reshape(1, 28, 28, 1)))\n",
533 | "prediction = result['output_1'].numpy().argmax()\n",
534 | "\n",
535 | "label = eval_labels[image_idx]\n",
536 | "print('Prediction is {}, Label is {}, Matched: {}'.format(prediction, label, prediction == label))"
537 | ]
538 | },
539 | {
540 | "cell_type": "markdown",
541 | "metadata": {},
542 | "source": [
543 | "## Option 2: Create a SageMaker Endpoint and Perform REST-based Predictions"
544 | ]
545 | },
546 | {
547 | "cell_type": "markdown",
548 | "metadata": {},
549 | "source": [
550 | "### Deploy the Trained Model to a SageMaker Endpoint (Approx. 10 mins)\n",
551 | "\n",
552 | "After training, we use the `TensorFlow` estimator object to build and deploy a `TensorFlowPredictor`. This creates a Sagemaker Endpoint -- a hosted prediction service that we can use to perform inference.\n",
553 | "\n",
554 | "As mentioned above we have implementation of `model_fn` in the `tensorflow_mnist.py` script that is required. We are going to use default implementations of `input_fn`, `predict_fn`, `output_fn` and `transform_fm` defined in [sagemaker-tensorflow-containers](https://github.com/aws/sagemaker-tensorflow-containers).\n",
555 | "\n",
556 | "The `deploy()` method creates a SageMaker model, which is then deployed to an endpoint to serve prediction requests in real time. We will use the TensorFlow Serving container for the endpoint, because we trained with script mode. This serving container runs an implementation of a web server that is compatible with SageMaker hosting protocol. The [Using your own inference code](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-main.html) document explains how SageMaker runs inference containers.\n",
557 | "\n",
558 | "The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances, and then deploy the Endpoint to a fleet of CPU-based instances, but you need to make sure that you return or save your model as a cpu model similar to what we did in `mnist.py`. Here we will deploy the model to a single `ml.p3.2xlarge` instance. Alternatively, you can use a `ml.c5.2xlarge` instance."
559 | ]
560 | },
561 | {
562 | "cell_type": "code",
563 | "execution_count": null,
564 | "metadata": {},
565 | "outputs": [],
566 | "source": [
567 | "predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.c5.2xlarge')"
568 | ]
569 | },
570 | {
571 | "cell_type": "markdown",
572 | "metadata": {},
573 | "source": [
574 | "### Invoke the Endpoint\n",
575 | "\n",
576 | "Let's download the training data and use that as input for inference."
577 | ]
578 | },
579 | {
580 | "cell_type": "code",
581 | "execution_count": null,
582 | "metadata": {},
583 | "outputs": [],
584 | "source": [
585 | "import numpy as np\n",
586 | "\n",
587 | "train_data = np.load('{}/train_data.npy'.format(local_data_path))\n",
588 | "train_labels = np.load('{}/train_labels.npy'.format(local_data_path))"
589 | ]
590 | },
591 | {
592 | "cell_type": "markdown",
593 | "metadata": {},
594 | "source": [
595 | "The formats of the input and the output data correspond directly to the request and response formats of the `Predict` method in the [TensorFlow Serving REST API](https://www.tensorflow.org/serving/api_rest). SageMaker's TensforFlow Serving endpoints can also accept additional input formats that are not part of the TensorFlow REST API, including the simplified JSON format, line-delimited JSON objects (\"jsons\" or \"jsonlines\"), and CSV data.\n",
596 | "\n",
597 | "In this example we are using a `numpy` array as input, which will be serialized into the simplified JSON format. In addition, TensorFlow Serving can also process multiple items at once as you can see in the following code. You can find the complete documentation on how to make predictions against a TensorFlow Serving SageMaker endpoint [here](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint)."
598 | ]
599 | },
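{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "For reference, the serialized request and response bodies in the simplified JSON format look roughly like the sketch below (the field names come from the TensorFlow Serving REST API linked above; the numeric values are purely illustrative):\n",
  "\n",
  "```python\n",
  "# Sketch of the payloads exchanged with the endpoint (illustrative values only)\n",
  "request_body = {\"instances\": train_data[:1].tolist()}   # one flattened 28x28 image\n",
  "response_body = {\"predictions\": [[0.01, 0.02, 0.90, 0.01, 0.00,\n",
  "                                   0.01, 0.02, 0.01, 0.01, 0.01]]}  # one score per digit class\n",
  "```"
 ]
},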
600 | {
601 | "cell_type": "markdown",
602 | "metadata": {},
603 | "source": [
604 | "Examine the prediction result from the TensorFlow 2.0 model."
605 | ]
606 | },
607 | {
608 | "cell_type": "code",
609 | "execution_count": null,
610 | "metadata": {},
611 | "outputs": [],
612 | "source": [
613 | "predictions = predictor.predict(train_data[:50])\n",
614 | "for i in range(0, 50):\n",
615 | " prediction = predictions['predictions'][i]\n",
616 | " label = train_labels[i]\n",
617 | " print('Prediction is {}, Label is {}, Matched: {}'.format(prediction, label, prediction == label))"
618 | ]
619 | },
620 | {
621 | "cell_type": "markdown",
622 | "metadata": {},
623 | "source": [
624 | "### (Optional) Cleanup Endpoint\n",
625 | "\n",
626 | "Let's delete the endpoint we just created to prevent incurring any extra costs."
627 | ]
628 | },
629 | {
630 | "cell_type": "code",
631 | "execution_count": null,
632 | "metadata": {},
633 | "outputs": [],
634 | "source": [
635 | "# sagemaker.Session().delete_endpoint(predictor.endpoint)"
636 | ]
637 | },
638 | {
639 | "cell_type": "code",
640 | "execution_count": null,
641 | "metadata": {},
642 | "outputs": [],
643 | "source": []
644 | }
645 | ],
646 | "metadata": {
647 | "kernelspec": {
648 | "display_name": "conda_python3",
649 | "language": "python",
650 | "name": "conda_python3"
651 | },
652 | "language_info": {
653 | "codemirror_mode": {
654 | "name": "ipython",
655 | "version": 3
656 | },
657 | "file_extension": ".py",
658 | "mimetype": "text/x-python",
659 | "name": "python",
660 | "nbconvert_exporter": "python",
661 | "pygments_lexer": "ipython3",
662 | "version": "3.6.5"
663 | },
664 | "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
665 | },
666 | "nbformat": 4,
667 | "nbformat_minor": 4
668 | }
669 |
--------------------------------------------------------------------------------
/tensorflow/tensorflow_script_mode_training_and_serving.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# TensorFlow script mode training and serving\n",
8 | "\n",
9 | "Script mode is a training script format for TensorFlow that lets you execute any TensorFlow training script in SageMaker with minimal modification. The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native TensorFlow support sets up training-related environment variables and executes your training script. In this tutorial, we use the SageMaker Python SDK to launch a training job and deploy the trained model.\n",
10 | "\n",
11 | "Script mode supports training with a Python script, a Python module, or a shell script. In this example, we use a Python script to train a classification model on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). In this example, we will show how easily you can train a SageMaker using TensorFlow 1.x and TensorFlow 2.0 scripts with SageMaker Python SDK. In addition, this notebook demonstrates how to perform real time inference with the [SageMaker TensorFlow Serving container](https://github.com/aws/sagemaker-tensorflow-serving-container). The TensorFlow Serving container is the default inference method for script mode. For full documentation on the TensorFlow Serving container, please visit [here](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst).\n"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Set up the environment\n",
19 | "\n",
20 | "Let's start by setting up the environment:"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": null,
26 | "metadata": {},
27 | "outputs": [],
28 | "source": [
29 | "import os\n",
30 | "import sagemaker\n",
31 | "from sagemaker import get_execution_role\n",
32 | "\n",
33 | "sagemaker_session = sagemaker.Session()\n",
34 | "\n",
35 | "role = get_execution_role()\n",
36 | "region = sagemaker_session.boto_session.region_name"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "## Training Data\n",
44 | "\n",
45 | "The MNIST dataset has been loaded to the public S3 buckets ``sagemaker-sample-data-`` under the prefix ``tensorflow/mnist``. There are four ``.npy`` file under this prefix:\n",
46 | "* ``train_data.npy``\n",
47 | "* ``eval_data.npy``\n",
48 | "* ``train_labels.npy``\n",
49 | "* ``eval_labels.npy``"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": null,
55 | "metadata": {},
56 | "outputs": [],
57 | "source": [
58 | "training_data_uri = 's3://sagemaker-sample-data-{}/tensorflow/mnist'.format(region)\n",
59 | "print(training_data_uri)"
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "# Construct a script for distributed training\n",
67 | "\n",
68 | "This tutorial's training script was adapted from TensorFlow's official [CNN MNIST example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py). We have modified it to handle the ``model_dir`` parameter passed in by SageMaker. This is an S3 path which can be used for data sharing during distributed training and checkpointing and/or model persistence. We have also added an argument-parsing function to handle processing training-related variables.\n",
69 | "\n",
70 | "At the end of the training job we have added a step to export the trained model to the path stored in the environment variable ``SM_MODEL_DIR``, which always points to ``/opt/ml/model``. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at end of training.\n",
71 | "\n",
72 | "Here is the entire script:"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "# TensorFlow 1.15.x script\n",
82 | "!pygmentize 'src/mnist.py'\n",
83 | "\n",
84 | "# TensorFlow 2.1.0 script\n",
85 | "!pygmentize 'src/mnist-2.py'"
86 | ]
87 | },
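{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "For orientation, the argument-parsing and model-export pieces described above typically look something like the sketch below. This is not the full training script and the argument names are illustrative; `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` are the environment variables SageMaker sets inside the training container:\n",
  "\n",
  "```python\n",
  "import argparse\n",
  "import os\n",
  "\n",
  "def parse_args():\n",
  "    parser = argparse.ArgumentParser()\n",
  "    # S3 path passed by SageMaker, used for checkpointing / data sharing in distributed training\n",
  "    parser.add_argument('--model_dir', type=str)\n",
  "    # Local directories injected by the training container\n",
  "    parser.add_argument('--sm-model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))\n",
  "    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))\n",
  "    return parser.parse_known_args()\n",
  "\n",
  "# After training, export the model under SM_MODEL_DIR so SageMaker uploads it to S3,\n",
  "# e.g. for a Keras model:\n",
  "# model.save(os.path.join(os.environ.get('SM_MODEL_DIR', '/opt/ml/model'), '000000001'))\n",
  "```"
 ]
},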
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "# Create a training job using the `TensorFlow` estimator\n",
93 | "\n",
94 | "The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to a S3 location and creating a SageMaker training job. Let's call out a couple important parameters here:\n",
95 | "\n",
96 | "* `py_version` is set to `'py3'` to indicate that we are using script mode since legacy mode supports only Python 2. Though Python 2 will be deprecated soon, you can use script mode with Python 2 by setting `py_version` to `'py2'` and `script_mode` to `True`.\n",
97 | "\n",
98 | "* `distributions` is used to configure the distributed training setup. It's required only if you are doing distributed training either across a cluster of instances or across multiple GPUs. Here we are using parameter servers as the distributed training schema. SageMaker training jobs run on homogeneous clusters. To make parameter server more performant in the SageMaker setup, we run a parameter server on every instance in the cluster, so there is no need to specify the number of parameter servers to launch. Script mode also supports distributed training with [Horovod](https://github.com/horovod/horovod). You can find the full documentation on how to configure `distributions` [here](https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#distributed-training). \n",
99 | "\n"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": null,
105 | "metadata": {},
106 | "outputs": [],
107 | "source": [
108 | "from sagemaker.tensorflow import TensorFlow\n",
109 | "\n",
110 | "\n",
111 | "mnist_estimator = TensorFlow(entry_point='src/mnist.py',\n",
112 | " role=role,\n",
113 | " train_instance_count=2,\n",
114 | " train_instance_type='ml.p3.2xlarge',\n",
115 | " framework_version='1.15.2',\n",
116 | " py_version='py3',\n",
117 | " distributions={'parameter_server': {'enabled': True}})"
118 | ]
119 | },
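{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "Here we used parameter servers. If you preferred Horovod instead, the `distributions` argument would look roughly like the sketch below (based on the SDK documentation linked above; the estimator name and `processes_per_host` value are illustrative, and the training script itself would also need Horovod-specific changes):\n",
  "\n",
  "```python\n",
  "# Sketch only: MPI-based Horovod configuration instead of parameter servers\n",
  "hvd_estimator = TensorFlow(entry_point='src/mnist.py',\n",
  "                           role=role,\n",
  "                           train_instance_count=2,\n",
  "                           train_instance_type='ml.p3.2xlarge',\n",
  "                           framework_version='1.15.2',\n",
  "                           py_version='py3',\n",
  "                           distributions={'mpi': {'enabled': True,\n",
  "                                                  'processes_per_host': 1}})\n",
  "```"
 ]
},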
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "You can also initiate an estimator to train with TensorFlow 2.1 script. The only things that you will need to change are the script name and ``framework_version``"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {},
131 | "outputs": [],
132 | "source": [
133 | "mnist_estimator2 = TensorFlow(entry_point='src/mnist-2.py',\n",
134 | " role=role,\n",
135 | " train_instance_count=2,\n",
136 | " train_instance_type='ml.p3.2xlarge',\n",
137 | " framework_version='2.1.0',\n",
138 | " py_version='py3',\n",
139 | " distributions={'parameter_server': {'enabled': True}})"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "## Calling ``fit``\n",
147 | "\n",
148 | "To start a training job, we call `estimator.fit(training_data_uri)`.\n",
149 | "\n",
150 | "An S3 location is used here as the input. `fit` creates a default channel named `'training'`, which points to this S3 location. In the training script we can then access the training data from the location stored in `SM_CHANNEL_TRAINING`. `fit` accepts a couple other types of input as well. See the API doc [here](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.fit) for details.\n",
151 | "\n",
152 | "When training starts, the TensorFlow container executes mnist.py, passing `hyperparameters` and `model_dir` from the estimator as script arguments. Because we didn't define either in this example, no hyperparameters are passed, and `model_dir` defaults to `s3:///`, so the script execution is as follows:\n",
153 | "```bash\n",
154 | "python mnist.py --model_dir s3:///\n",
155 | "```\n",
156 | "When training is complete, the training job will upload the saved model for TensorFlow serving."
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": null,
162 | "metadata": {},
163 | "outputs": [],
164 | "source": [
165 | "mnist_estimator.fit(training_data_uri)"
166 | ]
167 | },
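{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As noted above, `fit` also accepts other input forms. For example, you can pass a dictionary that maps channel names to S3 locations; the sketch below is commented out so it does not launch a second training job, and `'training'` is the default channel name that becomes `SM_CHANNEL_TRAINING` inside the script:\n",
  "\n",
  "```python\n",
  "# Equivalent call using an explicit channel dictionary instead of a bare S3 URI\n",
  "# mnist_estimator.fit({'training': training_data_uri})\n",
  "```"
 ]
},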
168 | {
169 | "cell_type": "markdown",
170 | "metadata": {},
171 | "source": [
172 | "Calling fit to train a model with TensorFlow 2.1 script."
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "metadata": {},
179 | "outputs": [],
180 | "source": [
181 | "mnist_estimator2.fit(training_data_uri)"
182 | ]
183 | },
184 | {
185 | "cell_type": "markdown",
186 | "metadata": {},
187 | "source": [
188 | "# Deploy the trained model to an endpoint\n",
189 | "\n",
190 | "The `deploy()` method creates a SageMaker model, which is then deployed to an endpoint to serve prediction requests in real time. We will use the TensorFlow Serving container for the endpoint, because we trained with script mode. This serving container runs an implementation of a web server that is compatible with SageMaker hosting protocol. The [Using your own inference code]() document explains how SageMaker runs inference containers."
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": [
197 | "Deploy the trained TensorFlow 1.15 model to an endpoint."
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": null,
203 | "metadata": {},
204 | "outputs": [],
205 | "source": [
206 | "predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.c5.large')"
207 | ]
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | "Deploy the trained TensorFlow 2.1 model to an endpoint."
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": null,
219 | "metadata": {},
220 | "outputs": [],
221 | "source": [
222 | "predictor2 = mnist_estimator2.deploy(initial_instance_count=1, instance_type='ml.c5.large')"
223 | ]
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "metadata": {},
228 | "source": [
229 | "# Invoke the endpoint\n",
230 | "\n",
231 | "Let's download the training data and use that as input for inference."
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": null,
237 | "metadata": {},
238 | "outputs": [],
239 | "source": [
240 | "import numpy as np\n",
241 | "\n",
242 | "!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/train_data.npy train_data.npy\n",
243 | "!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/train_labels.npy train_labels.npy\n",
244 | "\n",
245 | "train_data = np.load('train_data.npy')\n",
246 | "train_labels = np.load('train_labels.npy')"
247 | ]
248 | },
249 | {
250 | "cell_type": "markdown",
251 | "metadata": {},
252 | "source": [
253 | "The formats of the input and the output data correspond directly to the request and response formats of the `Predict` method in the [TensorFlow Serving REST API](https://www.tensorflow.org/serving/api_rest). SageMaker's TensforFlow Serving endpoints can also accept additional input formats that are not part of the TensorFlow REST API, including the simplified JSON format, line-delimited JSON objects (\"jsons\" or \"jsonlines\"), and CSV data.\n",
254 | "\n",
255 | "In this example we are using a `numpy` array as input, which will be serialized into the simplified JSON format. In addtion, TensorFlow serving can also process multiple items at once as you can see in the following code. You can find the complete documentation on how to make predictions against a TensorFlow serving SageMaker endpoint [here](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint)."
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": null,
261 | "metadata": {},
262 | "outputs": [],
263 | "source": [
264 | "predictions = predictor.predict(train_data[:50])\n",
265 | "for i in range(0, 50):\n",
266 | " prediction = predictions['predictions'][i]['classes']\n",
267 | " label = train_labels[i]\n",
268 | " print('prediction is {}, label is {}, matched: {}'.format(prediction, label, prediction == label))"
269 | ]
270 | },
271 | {
272 | "cell_type": "markdown",
273 | "metadata": {},
274 | "source": [
275 | "Examine the prediction result from the TensorFlow 2.1 model."
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": null,
281 | "metadata": {},
282 | "outputs": [],
283 | "source": [
284 | "predictions2 = predictor2.predict(train_data[:50])\n",
285 | "for i in range(0, 50):\n",
286 | " prediction = predictions2['predictions'][i]\n",
287 | " label = train_labels[i]\n",
288 | " print('prediction is {}, label is {}, matched: {}'.format(prediction, label, prediction == label))"
289 | ]
290 | },
291 | {
292 | "cell_type": "markdown",
293 | "metadata": {},
294 | "source": [
295 | "# (Optional) Delete the endpoint\n",
296 | "\n",
297 | "Let's delete the endpoint we just created to prevent incurring any extra costs."
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": null,
303 | "metadata": {},
304 | "outputs": [],
305 | "source": [
306 | "# sagemaker.Session().delete_endpoint(predictor.endpoint)"
307 | ]
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "metadata": {},
312 | "source": [
313 | "Delete the TensorFlow 2.1 endpoint as well."
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "metadata": {},
320 | "outputs": [],
321 | "source": [
322 | "# sagemaker.Session().delete_endpoint(predictor2.endpoint)"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": null,
328 | "metadata": {},
329 | "outputs": [],
330 | "source": []
331 | }
332 | ],
333 | "metadata": {
334 | "kernelspec": {
335 | "display_name": "conda_tensorflow_p36",
336 | "language": "python",
337 | "name": "conda_tensorflow_p36"
338 | },
339 | "language_info": {
340 | "codemirror_mode": {
341 | "name": "ipython",
342 | "version": 3
343 | },
344 | "file_extension": ".py",
345 | "mimetype": "text/x-python",
346 | "name": "python",
347 | "nbconvert_exporter": "python",
348 | "pygments_lexer": "ipython3",
349 | "version": "3.6.6"
350 | },
351 | "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
352 | },
353 | "nbformat": 4,
354 | "nbformat_minor": 4
355 | }
356 |
--------------------------------------------------------------------------------