├── .github └── PULL_REQUEST_TEMPLATE.md ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE ├── README.md ├── daemon.json ├── deploy-pretrained-model ├── BERT │ ├── Deploy_BERT.ipynb │ └── code │ │ ├── inference_code.py │ │ └── requirements.txt └── GPT2 │ ├── Deploy_GPT2.ipynb │ └── code │ ├── inference_code.py │ └── requirements.txt ├── hugging-face-lambda-step ├── .ipynb_checkpoints │ ├── iam_helper-checkpoint.py │ └── sm-pipelines-hugging-face-lambda-step-checkpoint.ipynb ├── iam_helper.py ├── scripts │ ├── .ipynb_checkpoints │ │ ├── evaluate-checkpoint.py │ │ ├── preprocessing-checkpoint.py │ │ └── train-checkpoint.py │ ├── evaluate.py │ ├── preprocessing.py │ └── train.py └── sm-pipelines-hugging-face-lambda-step.ipynb ├── k-means-clustering └── k-means-clustering.ipynb ├── lightgbm-byo └── lightgbm-byo.ipynb ├── local_mode_setup.sh ├── r-churn └── r_autopilot_churn.ipynb ├── r-in-sagemaker-processing └── r_in_sagemaker_processing.ipynb ├── r-workflow └── r_workflow.ipynb ├── tf-2-word-embeddings ├── code │ ├── model_def.py │ └── train.py └── tf-2-word-embeddings.ipynb ├── tf-2-workflow-smpipelines ├── tf-2-workflow-smpipelines.ipynb └── train_model │ ├── model_def.py │ └── train.py ├── tf-2-workflow ├── tf-2-workflow.ipynb └── train_model │ ├── model_def.py │ └── train.py ├── tf-batch-inference-script ├── code │ ├── inference.py │ ├── model_def.py │ ├── requirements.txt │ ├── train.py │ └── utilities.py ├── generate_cifar10_tfrecords.py ├── sample-img │ ├── 1000_dog.png │ ├── 1001_airplane.png │ ├── 1003_deer.png │ ├── 1004_ship.png │ ├── 1005_automobile.png │ ├── 1008_truck.png │ ├── 1009_frog.png │ ├── 1014_cat.png │ ├── 1037_horse.png │ └── 1038_bird.png └── tf-batch-inference-script.ipynb ├── tf-distribution-options ├── code │ ├── inference.py │ ├── model_def.py │ ├── requirements.txt │ ├── train_hvd.py │ ├── train_ps.py │ └── utilities.py ├── generate_cifar10_tfrecords.py ├── sample-img │ ├── 1000_dog.png │ ├── 1001_airplane.png │ ├── 1003_deer.png │ ├── 1004_ship.png │ ├── 1005_automobile.png │ ├── 1008_truck.png │ ├── 1009_frog.png │ ├── 1014_cat.png │ ├── 1037_horse.png │ └── 1038_bird.png └── tf-distributed-training.ipynb ├── tf-eager-script-mode ├── tf-boston-housing.ipynb └── train_model │ ├── model_def.py │ └── train.py ├── tf-horovod-inference-pipeline ├── generate_cifar10_tfrecords.py ├── image-transformer-container │ ├── Dockerfile │ ├── app │ │ └── main.py │ ├── ecr_policy.json │ └── entrypoint.sh ├── sample-img │ ├── 1000_dog.png │ ├── 1001_airplane.png │ ├── 1003_deer.png │ ├── 1004_ship.png │ ├── 1005_automobile.png │ ├── 1008_truck.png │ ├── 1009_frog.png │ ├── 1014_cat.png │ ├── 1037_horse.png │ └── 1038_bird.png ├── tf-horovod-inference-pipeline.ipynb └── train.py └── tf-sentiment-script-mode ├── sentiment-analysis.ipynb └── sentiment.py /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | *Issue #, if available:* 2 | 3 | *Description of changes:* 4 | 5 | 6 | By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. 7 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check [existing open](https://github.com/aws-samples/amazon-sagemaker-script-mode/issues), or [recently closed](https://github.com/aws-samples/amazon-sagemaker-script-mode/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/aws-samples/amazon-sagemaker-script-mode/labels/help%20wanted) issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 62 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 
48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | TensorFlow Eager Execution with SageMaker's Script Mode 2 | Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Amazon SageMaker Script Mode Examples 2 | 3 | This repository contains examples and related resources regarding Amazon SageMaker Script Mode and SageMaker Processing. With Script Mode, you can use training scripts similar to those you would use outside SageMaker with SageMaker's prebuilt containers for various frameworks such TensorFlow and PyTorch. Similarly, in SageMaker Processing, you can supply ordinary data preprocessing scripts for almost any language or technology you wish to use, such as the R programming language. 4 | 5 | Currently this repository has resources for **Hugging Face**, **TensorFlow**, **R**, **Bring Your Own** (BYO models, plus Script Mode-style experience with your own containers), and **Miscellaneous** (Script Mode-style experience for SageMaker Processing etc.). There also is an **Older Resources** section with examples of older framework versions; these examples are for reference only, and are not maintained. 6 | 7 | For those new to SageMaker, there is a set of 2-hour workshops covering the basics at [**Amazon SageMaker Workshops**](https://github.com/awslabs/amazon-sagemaker-workshop). 
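For a sense of what Script Mode looks like in code, the minimal sketch below launches a training job with the SageMaker Python SDK by pointing a prebuilt TensorFlow container at an ordinary training script. The script name, S3 paths, framework versions, and hyperparameters are illustrative placeholders rather than files in this repository; see the individual examples below for complete, runnable notebooks.

```python
# Minimal Script Mode sketch (SageMaker Python SDK v2). All names and paths below are
# placeholders for illustration only -- adapt them to your own script and data.
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()

# The prebuilt TensorFlow container runs your own script as the training entry point.
estimator = TensorFlow(
    entry_point="train.py",           # an ordinary training script, unchanged from local use
    source_dir="train_model",         # optional: extra modules and a requirements.txt
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="2.3.1",
    py_version="py37",
    hyperparameters={"epochs": 10, "batch_size": 64},  # passed to the script as CLI arguments
)

# Each input channel appears inside the container as SM_CHANNEL_<NAME> (e.g., SM_CHANNEL_TRAIN).
estimator.fit({"train": "s3://my-bucket/train", "test": "s3://my-bucket/test"})
```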
8 | 9 | - **Hugging Face Resources:** 10 | 11 | - [**Hugging Face automated model training and deployment in SageMaker Pipelines**](hugging-face-lambda-step): This example uses the SageMaker prebuilt Hugging Face (PyTorch) container in an end-to-end demo with model training and deployment within SageMaker Pipelines. A lightweight model deployment is performed by a SageMaker Pipeline Lambda step. **PREREQUISITES:** either clone this repository, or from the *hugging-face-lambda-step* directory, upload all files and folders; then run the notebook `sm-pipelines-hugging-face-lambda-step.ipynb`. 12 | 13 | - **TensorFlow Resources:** 14 | 15 | - [**TensorFlow 2 Sentiment Analysis**](tf-sentiment-script-mode): SageMaker's prebuilt TensorFlow 2 container is used in this example to train a custom sentiment analysis model. Distributed hosted training in SageMaker is performed on a multi-GPU instance, using the native TensorFlow `MirroredStrategy`. Additionally, SageMaker Batch Transform is used for asynchronous, large scale inference/batch scoring. **PREREQUISITES:** From the *tf-sentiment-script-mode* directory, upload ONLY the Jupyter notebook `sentiment-analysis.ipynb`. 16 | 17 | - [**TensorFlow 2 Workflow with SageMaker Pipelines**](tf-2-workflow-smpipelines): This example shows a complete workflow for TensorFlow 2, starting with prototyping followed by automation with [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines). To begin, SageMaker Processing is used to transform the dataset. Next, Local Mode training and Local Mode endpoints are demonstrated for prototyping training and inference code, respectively. Automatic Model Tuning is used to automate the hyperparameter tuning process. Finally, the workflow is automated with SageMaker Pipelines. **PREREQUISITES:** If you wish to run the Local Mode sections of the example, use a SageMaker Notebook Instance rather than SageMaker Studio. From the *tf-2-workflow-smpipelines* directory, upload ONLY the Jupyter notebook `tf-2-workflow-smpipelines.ipynb`. 18 | 19 | - [**TensorFlow 2 Loading Pretrained Embeddings for Classification Tasks**](tf-2-word-embeddings): In this example, TensorFlow 2 is used with Script Mode for a text classification task. An important aspect of the example is showing how to load pretrained embeddings in Script Mode. This illustrates one aspect of the flexibility of SageMaker Script Mode for setting up training jobs: in addition to data, you can pass in arbitrary files needed for training (not just embeddings). **PREREQUISITES:** (1) be sure to upload all files in the *tf-2-word-embeddings* directory (including subdirectory *code*) to the directory where you will run the related Jupyter notebook. 20 | 21 | - **R Resources:** 22 | 23 | - [**R in SageMaker Processing**](r-in-sagemaker-processing): SageMaker Script Mode is directed toward making the model training process easier. However, an experience similar to Script Mode also is available for SageMaker Processing: you can bring in your data processing scripts and easily run them on managed infrastructure either with BYO containers or prebuilt containers for frameworks such as Spark and Scikit-learn. In this example, R is used to perform operations on a dataset and generate a plot within SageMaker Processing. The job results including the plot image are retrieved and displayed, demonstrating how R can be easily used within a SageMaker workflow. 
**PREREQUISITES:** From the *r-in-sagemaker-processing* directory, upload the Jupyter notebook `r-in-sagemaker_processing.ipynb`. 24 | 25 | - [**R Complete Workflow**](r-workflow): This example shows a complete workflow for R, starting with prototyping and moving to model tuning and inference, followed by automation with [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines). **PREREQUISITES:** Use a R kernel; and from the *r-workflow* directory, upload the Jupyter notebook `r-workflow.ipynb`. 26 | 27 | - [**R Churn Example**](r-churn): Often it is helpful to benchmark possible model quality using AutoML, even if you intend to manage training of your own custom model later using Script Mode. This example shows how to use R to access SageMaker Autopilot for this purpose. **PREREQUISITES:** Use a R kernel; and from the *r-churn* directory, upload the Jupyter notebook `r-churn.ipynb`. 28 | 29 | - **Bring Your Own (BYO) Resources:** 30 | 31 | - [**lightGBM BYO**](lightgbm-byo): In this repository, most samples use Amazon SageMaker prebuilt framework containers for TensorFlow and other frameworks. For this example, however, we'll show how to BYO container to create a Script Mode-style experience similar to a prebuilt SageMaker framework container, using lightGBM, a popular gradient boosting framework. **PREREQUISITES:** From the *lightgbm-byo* directory, upload the Jupyter notebook `lightgbm-byo.ipynb`. 32 | 33 | - [**Deploy Pretrained Models**](deploy-pretrained-model): In addition to the ease of use of the SageMaker Python SDK for model training in Script Mode, the SDK also enables you to easily BYO model. In this example, the SageMaker prebuilt PyTorch container is used to demonstrate how you can quickly take a pretrained or locally trained model and deploy it in a SageMaker hosted endpoint. There are examples for both OpenAI's GPT-2 and BERT. **PREREQUISITES:** From the *deploy-pretrained-model* directory, upload the entire BERT or GPT2 folder's contents, depending on which model you select. Run either `Deploy_BERT.pynb` or `Deploy_GPT2.ipynb`. 34 | 35 | 36 | - **Miscellaneous Resources:** 37 | 38 | - [**K-means clustering**](k-means-clustering): Most of the samples in this repository involve supervised learning tasks in Amazon SageMaker Script Mode. For this example, by contrast, we'll undertake an unsupervised learning task, and do so with the Amazon SageMaker K-means built-in algorithm rather than Script Mode. The SageMaker built-in algorithms were developed for large-scale training tasks and may offer a simpler user experience depending on the use case. **PREREQUISITES:** From the *k-means-clustering* directory, upload the Jupyter notebook `k-means-clustering.ipynb`. 39 | 40 | 41 | - **Older Resources:** ***(reference only, not maintained)*** 42 | 43 | - [**TensorFlow 2 Workflow with the AWS Step Functions Data Science SDK**](tf-2-workflow): **NOTE**: This example has been superseded by the **TensorFlow 2 Workflow with SageMaker Pipelines** example above. This example shows a complete workflow for TensorFlow 2 with automation by the AWS Step Functions Data Science SDK, an older alternative to [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines). To begin, SageMaker Processing is used to transform the dataset. Next, Local Mode training and Local Mode endpoints are demonstrated for prototyping training and inference code, respectively. Automatic Model Tuning is used to automate the hyperparameter tuning process. 
**PREREQUISITES:** From the *tf-2-workflow* directory, upload ONLY the Jupyter notebook `tf-2-workflow.ipynb`. 44 | 45 | - [**TensorFlow 1.x (tf.keras) Highly Performant Batch Inference & Training**](tf-batch-inference-script): The focus of this example is highly performant batch inference using TensorFlow Serving, along with Horovod distributed training. To transform the input image data for inference, a preprocessing script is used with the Amazon SageMaker TensorFlow Serving container. **PREREQUISITES:** be sure to upload all files in the *tf-batch-inference-script* directory (including the subdirectory code and files) to the directory where you will run the related Jupyter notebook. 46 | 47 | - [**TensorFlow 1.x (tf.keras) with Horovod & Inference Pipeline**](tf-horovod-inference-pipeline): Script Mode with TensorFlow is used for a computer vision task, in a demonstration of Horovod distributed training and doing batch inference in conjunction with an Inference Pipeline for transforming image data before inputting it to the model container. This is an alternative to the previous example, which uses a preprocessing script with the Amazon SageMaker TensorFlow Serving Container rather than an Inference Pipeline. **PREREQUISITES:** be sure to upload all files in the *tf-horovod-inference-pipeline* directory (including the subdirectory code and files) to the directory where you will run the related Jupyter notebook. 48 | 49 | 50 | - [**TensorFlow 1.x (tf.keras) Distributed Training Options**](tf-distribution-options): **NOTE**: Besides the options listed here for TensorFlow 1.x, there are additional options for TensorFlow 2, including [A] built-in [**SageMaker Distributed Training**](https://aws.amazon.com/sagemaker/distributed-training/) for both data and model parallelism, and [B] native distribution strategies such as MirroredStrategy as demonstrated in the **TensorFlow 2 Sentiment Analysis** example above. This TensorFlow 1.x example demonstrates two other distributed training options for SageMaker's Script Mode: (1) parameter servers, and (2) Horovod. **PREREQUISITES:** From the *tf-distribution-options* directory, upload ONLY the Jupyter notebook `tf-distributed-training.ipynb`. 51 | 52 | - [**TensorFlow 1.x (tf.keras) Eager Execution**](tf-eager-script-mode): **NOTE**: This TensorFlow 1.x example has been superseded by the **TensorFlow 2 Workflow** example above. This example shows how to use Script Mode with Eager Execution mode in TensorFlow 1.x, a more intuitive and dynamic alternative to the original graph mode of TensorFlow. It is the default mode of TensorFlow 2. Local Mode and Automatic Model Tuning also are demonstrated. **PREREQUISITES:** From the *tf-eager-script-mode* directory, upload ONLY the Jupyter notebook `tf-boston-housing.ipynb`. 53 | 54 | 55 | 56 | ## License 57 | 58 | The contents of this repository are licensed under the Apache 2.0 License except where otherwise noted. 
59 | -------------------------------------------------------------------------------- /daemon.json: -------------------------------------------------------------------------------- 1 | 2 | { 3 | "default-runtime": "nvidia", 4 | "runtimes": { 5 | "nvidia": { 6 | "path": "/usr/bin/nvidia-container-runtime", 7 | "runtimeArgs": [] 8 | } 9 | } 10 | } 11 | -------------------------------------------------------------------------------- /deploy-pretrained-model/BERT/Deploy_BERT.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Hosting a Pretrained Model on SageMaker\n", 8 | " \n", 9 | "Amazon SageMaker is a service to accelerate the entire machine learning lifecycle. It includes components for building, training and deploying machine learning models. Each SageMaker component is modular, so you're welcome to only use the features needed for your use case. One of the most popular features of SageMaker is [model hosting](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html). Using SageMaker Hosting you can deploy your model as a scalable, highly available, multi-process API endpoint with a few lines of code. In this notebook, we will demonstrate how to host a pretrained model (BERT) in Amazon SageMaker to extract embeddings from text.\n", 10 | "\n", 11 | "SageMaker provides prebuilt containers that can be used for training, hosting, or data processing. The inference containers include a web serving stack, so you don't need to install and configure one. We will be using the SageMaker [PyTorch container](https://github.com/aws/deep-learning-containers), but you may use the [TensorFlow container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md), or bring your own container if needed. \n", 12 | "\n", 13 | "This notebook will walk you through how to deploy a pretrained Hugging Face model as a scalable, highly available, production ready API in under 15 minutes." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "## Retrieve Model Artifacts\n", 21 | "\n", 22 | "First we will download the model artifacts for the pretrained [BERT](https://arxiv.org/abs/1810.04805) model. BERT is a popular natural language processing (NLP) model that extracts meaning and context from text." 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": { 29 | "scrolled": true 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "!pip install transformers==3.3.1 sagemaker==2.15.0 --quiet" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "import os\n", 43 | "from transformers import BertTokenizer, BertModel\n", 44 | "\n", 45 | "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", 46 | "model = BertModel.from_pretrained(\"bert-base-uncased\")\n", 47 | "\n", 48 | "model_path = 'model/'\n", 49 | "code_path = 'code/'\n", 50 | "\n", 51 | "if not os.path.exists(model_path):\n", 52 | " os.mkdir(model_path)\n", 53 | " \n", 54 | "model.save_pretrained(save_directory=model_path)\n", 55 | "tokenizer.save_pretrained(save_directory=model_path)" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## Write the Inference Script\n", 63 | "\n", 64 | "Since we are bringing a model to SageMaker, we must create an inference script. 
The script will run inside our PyTorch container. Our script should include a function for model loading, and optionally functions generating predicitions, and input/output processing. The PyTorch container provides default implementations for generating a prediction and input/output processing. By including these functions in your script you are overriding the default functions. You can find additional [details here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#serve-a-pytorch-model).\n", 65 | "\n", 66 | "In the next cell we'll see our inference script. You will notice that it uses the [transformers library from HuggingFace](https://huggingface.co/transformers/). This Python library is not installed in the container by default, so we will have to add that in the next section." 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "!pygmentize code/inference_code.py" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "## Package Model\n", 83 | "\n", 84 | "For hosting, SageMaker requires that the deployment package be structed in a compatible format. It expects all files to be packaged in a tar archive named \"model.tar.gz\" with gzip compression. To install additional libraries at container startup, we can add a [requirements.txt](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries) text file that specifies the libraries to be installed using [pip](https://pypi.org/project/pip/). Within the archive, the PyTorch container expects all inference code and requirements.txt file to be inside the code/ directory. See the [guide here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#for-versions-1-2-and-higher) for a thorough explanation of the required directory structure. " 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "import tarfile\n", 94 | "\n", 95 | "zipped_model_path = os.path.join(model_path, \"model.tar.gz\")\n", 96 | "\n", 97 | "with tarfile.open(zipped_model_path, \"w:gz\") as tar:\n", 98 | " tar.add(model_path)\n", 99 | " tar.add(code_path)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "## Deploy Model\n", 107 | "\n", 108 | "Now that we have our deployment package, we can use the [SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/index.html) to deploy our API endpoint with two lines of code. We need to specify an IAM role for the SageMaker endpoint to use. Minimally, it will need read access to the default SageMaker bucket (usually named sagemaker-{region}-{your account number}) so it can read the deployment package. When we call deploy(), the SDK will save our deployment archive to S3 for the SageMaker endpoint to use. We will use the helper function [get_execution_role](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html?highlight=get_execution_role#sagemaker.session.get_execution_role) to retrieve our current IAM role so we can pass it to the SageMaker endpoint. 
Minimally it will require read access to the model artifacts in S3 and the [ECR repository](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) where the container image is stored by AWS.\n", 109 | "\n", 110 | "\n", 111 | "You may notice that we specify our PyTorch version and Python version when creating the PyTorchModel object. The SageMaker SDK uses these parameters to determine which PyTorch container to use. \n", 112 | "\n", 113 | "We'll choose an m5 instance for our endpoint to ensure we have sufficient memory to serve our model. " 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "from sagemaker.pytorch import PyTorchModel\n", 123 | "from sagemaker import get_execution_role\n", 124 | "\n", 125 | "endpoint_name = 'bert-base'\n", 126 | "\n", 127 | "model = PyTorchModel(entry_point='inference_code.py', \n", 128 | " model_data=zipped_model_path, \n", 129 | " role=get_execution_role(), \n", 130 | " framework_version='1.5', \n", 131 | " py_version='py3')\n", 132 | "\n", 133 | "predictor = model.deploy(initial_instance_count=1, \n", 134 | " instance_type='ml.m5.xlarge', \n", 135 | " endpoint_name=endpoint_name)" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "## Get Predictions\n", 143 | "\n", 144 | "Now that our API endpoint is deployed, we can send it text to get predictions from our BERT model. You can use the SageMaker SDK or the [SageMaker Runtime API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) to invoke the endpoint. " 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "import boto3\n", 154 | "\n", 155 | "sm = boto3.client('sagemaker-runtime')\n", 156 | "\n", 157 | "prompt = \"The best part of Amazon SageMaker is that it makes machine learning easy.\"\n", 158 | "\n", 159 | "response = sm.invoke_endpoint(EndpointName=endpoint_name, \n", 160 | " Body=prompt.encode(encoding='UTF-8'),\n", 161 | " ContentType='text/csv')\n", 162 | "\n", 163 | "response['Body'].read()" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "## Conclusion\n", 171 | "\n", 172 | "You have successfully created a scalable, high available, RESTful API that is backed by a BERT model! It can be used for downstreaming NLP tasks like text classification. If you are still interested in learning more, check out some of the more advanced features of SageMaker Hosting, like [model monitoring](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html) to detect concept drift, [autoscaling](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html) to dynamically adjust the number of instances, or [VPC config](https://docs.aws.amazon.com/sagemaker/latest/dg/host-vpc.html) to control network access to/from your endpoint.\n", 173 | "\n", 174 | "You can also look in to the [ezsmdeploy SDK](https://aws.amazon.com/blogs/opensource/deploy-machine-learning-models-to-amazon-sagemaker-using-the-ezsmdeploy-python-package-and-a-few-lines-of-code/) that automates most of this process." 
175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [] 183 | } 184 | ], 185 | "metadata": { 186 | "kernelspec": { 187 | "display_name": "conda_pytorch_latest_p36", 188 | "language": "python", 189 | "name": "conda_pytorch_latest_p36" 190 | }, 191 | "language_info": { 192 | "codemirror_mode": { 193 | "name": "ipython", 194 | "version": 3 195 | }, 196 | "file_extension": ".py", 197 | "mimetype": "text/x-python", 198 | "name": "python", 199 | "nbconvert_exporter": "python", 200 | "pygments_lexer": "ipython3", 201 | "version": "3.6.10" 202 | } 203 | }, 204 | "nbformat": 4, 205 | "nbformat_minor": 4 206 | } 207 | -------------------------------------------------------------------------------- /deploy-pretrained-model/BERT/code/inference_code.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | from transformers import BertTokenizer, BertModel 4 | 5 | def model_fn(model_dir): 6 | """ 7 | Load the model for inference 8 | """ 9 | 10 | model_path = os.path.join(model_dir, 'model/') 11 | 12 | # Load BERT tokenizer from disk. 13 | tokenizer = BertTokenizer.from_pretrained(model_path) 14 | 15 | # Load BERT model from disk. 16 | model = BertModel.from_pretrained(model_path) 17 | 18 | model_dict = {'model': model, 'tokenizer':tokenizer} 19 | 20 | return model_dict 21 | 22 | def predict_fn(input_data, model): 23 | """ 24 | Apply model to the incoming request 25 | """ 26 | 27 | tokenizer = model['tokenizer'] 28 | bert_model = model['model'] 29 | 30 | encoded_input = tokenizer(input_data, return_tensors='pt') 31 | 32 | return bert_model(**encoded_input) 33 | 34 | def input_fn(request_body, request_content_type): 35 | """ 36 | Deserialize and prepare the prediction input 37 | """ 38 | 39 | if request_content_type == "application/json": 40 | request = json.loads(request_body) 41 | else: 42 | request = request_body 43 | 44 | return request 45 | 46 | def output_fn(prediction, response_content_type): 47 | """ 48 | Serialize and prepare the prediction output 49 | """ 50 | 51 | if response_content_type == "application/json": 52 | response = str(prediction) 53 | else: 54 | response = str(prediction) 55 | 56 | return response -------------------------------------------------------------------------------- /deploy-pretrained-model/BERT/code/requirements.txt: -------------------------------------------------------------------------------- 1 | transformers==3.3.1 -------------------------------------------------------------------------------- /deploy-pretrained-model/GPT2/Deploy_GPT2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Hosting a Pretrained Model on SageMaker\n", 8 | " \n", 9 | "Amazon SageMaker is a service to accelerate the entire machine learning lifecycle. It includes components for building, training and deploying machine learning models. Each SageMaker component is modular, so you're welcome to only use the features needed for your use case. One of the most popular features of SageMaker is [model hosting](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html). Using SageMaker Hosting you can deploy your model as a scalable, highly available, multi-process API endpoint with a few lines of code. 
In this notebook, we will demonstrate how to host a pretrained model (GPT-2) in Amazon SageMaker.\n", 10 | "\n", 11 | "SageMaker provides prebuilt containers that can be used for training, hosting, or data processing. The inference containers include a web serving stack, so you don't need to install and configure one. We will be using the SageMaker [PyTorch container](https://github.com/aws/deep-learning-containers), but you may use the [TensorFlow container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md), or bring your own container if needed. \n", 12 | "\n", 13 | "This notebook will walk you through how to deploy a pretrained Hugging Face model as a scalable, highly available, production ready API in under 15 minutes." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "## Retrieve Model Artifacts\n", 21 | "\n", 22 | "First we will download the model artifacts for the pretrained [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) model. GPT-2 is a popular text generation model that was developed by OpenAI. Given a text prompt it can generate synthetic text that may follow." 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": { 29 | "scrolled": true 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "!pip install transformers==3.3.1 sagemaker==2.15.0 --quiet" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "import os\n", 43 | "from transformers import GPT2Tokenizer, GPT2LMHeadModel\n", 44 | "\n", 45 | "tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n", 46 | "model = GPT2LMHeadModel.from_pretrained('gpt2')\n", 47 | "\n", 48 | "model_path = 'model/'\n", 49 | "code_path = 'code/'\n", 50 | "\n", 51 | "if not os.path.exists(model_path):\n", 52 | " os.mkdir(model_path)\n", 53 | " \n", 54 | "model.save_pretrained(save_directory=model_path)\n", 55 | "tokenizer.save_vocabulary(save_directory=model_path)" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## Write the Inference Script\n", 63 | "\n", 64 | "Since we are bringing a model to SageMaker, we must create an inference script. The script will run inside our PyTorch container. Our script should include a function for model loading, and optionally functions generating predicitions, and input/output processing. The PyTorch container provides default implementations for generating a prediction and input/output processing. By including these functions in your script you are overriding the default functions. You can find additional [details here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#serve-a-pytorch-model).\n", 65 | "\n", 66 | "In the next cell we'll see our inference script. You will notice that it uses the [transformers library from Hugging Face](https://huggingface.co/transformers/). This Python library is not installed in the container by default, so we will have to add that in the next section." 
67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "!pygmentize code/inference_code.py" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "## Package Model\n", 83 | "\n", 84 | "For hosting, SageMaker requires that the deployment package be structed in a compatible format. It expects all files to be packaged in a tar archive named \"model.tar.gz\" with gzip compression. To install additional libraries at container startup, we can add a [requirements.txt](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries) text file that specifies the libraries to be installed using [pip](https://pypi.org/project/pip/). Within the archive, the PyTorch container expects all inference code and requirements.txt file to be inside the code/ directory. See the [guide here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#for-versions-1-2-and-higher) for a thorough explanation of the required directory structure. " 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "import tarfile\n", 94 | "\n", 95 | "zipped_model_path = os.path.join(model_path, \"model.tar.gz\")\n", 96 | "\n", 97 | "with tarfile.open(zipped_model_path, \"w:gz\") as tar:\n", 98 | " tar.add(model_path)\n", 99 | " tar.add(code_path)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "## Deploy Model\n", 107 | "\n", 108 | "Now that we have our deployment package, we can use the [SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/index.html) to deploy our API endpoint with two lines of code. We need to specify an IAM role for the SageMaker endpoint to use. Minimally, it will need read access to the default SageMaker bucket (usually named sagemaker-{region}-{your account number}) so it can read the deployment package. When we call deploy(), the SDK will save our deployment archive to S3 for the SageMaker endpoint to use. We will use the helper function [get_execution_role](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html?highlight=get_execution_role#sagemaker.session.get_execution_role) to retrieve our current IAM role so we can pass it to the SageMaker endpoint. You may specify another IAM role here. Minimally it will require read access to the model artifacts in S3 and the [ECR repository](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) where the container image is stored by AWS.\n", 109 | "\n", 110 | "You may notice that we specify our PyTorch version and Python version when creating the PyTorchModel object. The SageMaker SDK uses these parameters to determine which PyTorch container to use. \n", 111 | "\n", 112 | "The full size [GPT-2 model has 1.2 billion parameters](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). Even though we are using the small version of the model, our endpoint will need to fit millions of parameters in to memory. We'll choose an m5 instance for our endpoint to ensure we have sufficient memory to serve our model. 
" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "from sagemaker.pytorch import PyTorchModel\n", 122 | "from sagemaker import get_execution_role\n", 123 | "\n", 124 | "endpoint_name = 'GPT2'\n", 125 | "\n", 126 | "model = PyTorchModel(entry_point='inference_code.py', \n", 127 | " model_data=zipped_model_path, \n", 128 | " role=get_execution_role(),\n", 129 | " framework_version='1.5', \n", 130 | " py_version='py3')\n", 131 | "\n", 132 | "predictor = model.deploy(initial_instance_count=1, \n", 133 | " instance_type='ml.m5.xlarge', \n", 134 | " endpoint_name=endpoint_name)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "## Get Predictions\n", 142 | "\n", 143 | "Now that our RESTful API endpoint is deployed, we can send it text to get predictions from our GPT-2 model. You can use the SageMaker Python SDK or the [SageMaker Runtime API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) to invoke the endpoint. " 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "import boto3\n", 153 | "import json\n", 154 | "\n", 155 | "sm = boto3.client('sagemaker-runtime')\n", 156 | "\n", 157 | "prompt = \"Working with SageMaker makes machine learning \"\n", 158 | "\n", 159 | "response = sm.invoke_endpoint(EndpointName=endpoint_name, \n", 160 | " Body=json.dumps(prompt),\n", 161 | " ContentType='text/csv')\n", 162 | "\n", 163 | "response['Body'].read().decode('utf-8')" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "## Conclusion\n", 171 | "\n", 172 | "You have successfully created a scalable, high available, RESTful API that is backed by a GPT-2 model! If you are still interested in learning more, check out some of the more advanced features of SageMaker Hosting, like [model monitoring](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html) to detect concept drift, [autoscaling](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html) to dynamically adjust the number of instances, or [VPC config](https://docs.aws.amazon.com/sagemaker/latest/dg/host-vpc.html) to control network access to/from your endpoint.\n", 173 | "\n", 174 | "You can also look in to the [ezsmdeploy SDK](https://aws.amazon.com/blogs/opensource/deploy-machine-learning-models-to-amazon-sagemaker-using-the-ezsmdeploy-python-package-and-a-few-lines-of-code/) that automates most of this process." 
175 | ] 176 | } 177 | ], 178 | "metadata": { 179 | "kernelspec": { 180 | "display_name": "conda_pytorch_latest_p36", 181 | "language": "python", 182 | "name": "conda_pytorch_latest_p36" 183 | }, 184 | "language_info": { 185 | "codemirror_mode": { 186 | "name": "ipython", 187 | "version": 3 188 | }, 189 | "file_extension": ".py", 190 | "mimetype": "text/x-python", 191 | "name": "python", 192 | "nbconvert_exporter": "python", 193 | "pygments_lexer": "ipython3", 194 | "version": "3.6.10" 195 | } 196 | }, 197 | "nbformat": 4, 198 | "nbformat_minor": 4 199 | } 200 | -------------------------------------------------------------------------------- /deploy-pretrained-model/GPT2/code/inference_code.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | from transformers import GPT2Tokenizer, TextGenerationPipeline, GPT2LMHeadModel 4 | 5 | def model_fn(model_dir): 6 | """ 7 | Load the model for inference 8 | """ 9 | 10 | # Load GPT2 tokenizer from disk. 11 | vocab_path = os.path.join(model_dir, 'model/vocab.json') 12 | merges_path = os.path.join(model_dir, 'model/merges.txt') 13 | 14 | tokenizer = GPT2Tokenizer(vocab_file=vocab_path, 15 | merges_file=merges_path) 16 | 17 | # Load GPT2 model from disk. 18 | model_path = os.path.join(model_dir, 'model/') 19 | model = GPT2LMHeadModel.from_pretrained(model_path) 20 | 21 | return TextGenerationPipeline(model=model, tokenizer=tokenizer) 22 | 23 | def predict_fn(input_data, model): 24 | """ 25 | Apply model to the incoming request 26 | """ 27 | 28 | return model.__call__(input_data) 29 | 30 | def input_fn(request_body, request_content_type): 31 | """ 32 | Deserialize and prepare the prediction input 33 | """ 34 | 35 | if request_content_type == "application/json": 36 | request = json.loads(request_body) 37 | else: 38 | request = request_body 39 | 40 | return request 41 | 42 | def output_fn(prediction, response_content_type): 43 | """ 44 | Serialize and prepare the prediction output 45 | """ 46 | 47 | return str(prediction) -------------------------------------------------------------------------------- /deploy-pretrained-model/GPT2/code/requirements.txt: -------------------------------------------------------------------------------- 1 | transformers==3.3.1 -------------------------------------------------------------------------------- /hugging-face-lambda-step/.ipynb_checkpoints/iam_helper-checkpoint.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import json 3 | 4 | iam = boto3.client('iam') 5 | 6 | def create_lambda_role(role_name): 7 | try: 8 | response = iam.create_role( 9 | RoleName = role_name, 10 | AssumeRolePolicyDocument = json.dumps({ 11 | "Version": "2012-10-17", 12 | "Statement": [ 13 | { 14 | "Effect": "Allow", 15 | "Principal": { 16 | "Service": "lambda.amazonaws.com" 17 | }, 18 | "Action": "sts:AssumeRole" 19 | } 20 | ] 21 | }), 22 | Description='Role for Lambda to call ECS Fargate task' 23 | ) 24 | 25 | role_arn = response['Role']['Arn'] 26 | 27 | response = iam.attach_role_policy( 28 | RoleName=role_name, 29 | PolicyArn='arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole' 30 | ) 31 | 32 | response = iam.attach_role_policy( 33 | PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess', 34 | RoleName=role_name 35 | ) 36 | 37 | return role_arn 38 | 39 | except iam.exceptions.EntityAlreadyExistsException: 40 | print(f'Using ARN from existing role: {role_name}') 41 | response = 
iam.get_role(RoleName=role_name) 42 | return response['Role']['Arn'] -------------------------------------------------------------------------------- /hugging-face-lambda-step/iam_helper.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import json 3 | 4 | iam = boto3.client('iam') 5 | 6 | def create_lambda_role(role_name): 7 | try: 8 | response = iam.create_role( 9 | RoleName = role_name, 10 | AssumeRolePolicyDocument = json.dumps({ 11 | "Version": "2012-10-17", 12 | "Statement": [ 13 | { 14 | "Effect": "Allow", 15 | "Principal": { 16 | "Service": "lambda.amazonaws.com" 17 | }, 18 | "Action": "sts:AssumeRole" 19 | } 20 | ] 21 | }), 22 | Description='Role for Lambda to call ECS Fargate task' 23 | ) 24 | 25 | role_arn = response['Role']['Arn'] 26 | 27 | response = iam.attach_role_policy( 28 | RoleName=role_name, 29 | PolicyArn='arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole' 30 | ) 31 | 32 | response = iam.attach_role_policy( 33 | PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess', 34 | RoleName=role_name 35 | ) 36 | 37 | return role_arn 38 | 39 | except iam.exceptions.EntityAlreadyExistsException: 40 | print(f'Using ARN from existing role: {role_name}') 41 | response = iam.get_role(RoleName=role_name) 42 | return response['Role']['Arn'] -------------------------------------------------------------------------------- /hugging-face-lambda-step/scripts/.ipynb_checkpoints/evaluate-checkpoint.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """Evaluation script for measuring mean squared error.""" 4 | 5 | import subprocess 6 | import sys 7 | 8 | def install(package): 9 | subprocess.check_call([sys.executable, "-m", "pip", "install", package]) 10 | 11 | import json 12 | import logging 13 | import pathlib 14 | import pickle 15 | import tarfile 16 | import os 17 | 18 | import numpy as np 19 | import pandas as pd 20 | 21 | from transformers import AutoModelForSequenceClassification, Trainer 22 | from datasets import load_from_disk 23 | 24 | logger = logging.getLogger() 25 | logger.setLevel(logging.INFO) 26 | logger.addHandler(logging.StreamHandler()) 27 | 28 | if __name__ == "__main__": 29 | logger.debug("Starting evaluation.") 30 | model_path = "/opt/ml/processing/model/model.tar.gz" 31 | with tarfile.open(model_path) as tar: 32 | tar.extractall(path="./hf_model") 33 | 34 | logger.debug(os.listdir('./hf_model')) 35 | 36 | # test_dir = "/opt/ml/processing/test/" 37 | # test_dataset = load_from_disk(test_dir) 38 | 39 | # model = AutoModelForSequenceClassification.from_pretrained('./hf_model') 40 | 41 | # trainer = Trainer(model=model) 42 | 43 | # eval_result = trainer.evaluate(eval_dataset=test_dataset) 44 | 45 | with open('./hf_model/evaluation.json') as f: 46 | eval_result = json.load(f) 47 | 48 | logger.debug(eval_result) 49 | output_dir = "/opt/ml/processing/evaluation" 50 | pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True) 51 | 52 | evaluation_path = f"{output_dir}/evaluation.json" 53 | with open(evaluation_path, "w") as f: 54 | f.write(json.dumps(eval_result)) 55 | -------------------------------------------------------------------------------- /hugging-face-lambda-step/scripts/.ipynb_checkpoints/preprocessing-checkpoint.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import numpy as np 4 | import os 5 | import pandas as pd 6 | import subprocess 7 | import sys 8 
| 9 | def install(package): 10 | subprocess.check_call([sys.executable, "-m", "pip", "install", package]) 11 | 12 | if __name__=='__main__': 13 | 14 | install('torch') 15 | install('transformers') 16 | install('datasets[s3]') 17 | 18 | from datasets import load_dataset 19 | from transformers import AutoTokenizer 20 | 21 | # tokenizer used in preprocessing 22 | tokenizer_name = 'distilbert-base-uncased' 23 | 24 | # dataset used 25 | dataset_name = 'imdb' 26 | 27 | # load dataset 28 | dataset = load_dataset(dataset_name) 29 | 30 | # download tokenizer 31 | tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) 32 | 33 | # tokenizer helper function 34 | def tokenize(batch): 35 | return tokenizer(batch['text'], padding='max_length', truncation=True) 36 | 37 | # load dataset 38 | train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test']) 39 | test_dataset = test_dataset.shuffle().select(range(1000)) # smaller the size for test dataset to 1k 40 | 41 | # tokenize dataset 42 | train_dataset = train_dataset.map(tokenize, batched=True) 43 | test_dataset = test_dataset.map(tokenize, batched=True) 44 | 45 | # set format for pytorch 46 | train_dataset = train_dataset.rename_column("label", "labels") 47 | train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels']) 48 | 49 | test_dataset = test_dataset.rename_column("label", "labels") 50 | test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels']) 51 | 52 | train_dataset.save_to_disk('/opt/ml/processing/train') 53 | test_dataset.save_to_disk('/opt/ml/processing/test') -------------------------------------------------------------------------------- /hugging-face-lambda-step/scripts/.ipynb_checkpoints/train-checkpoint.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer 3 | from sklearn.metrics import accuracy_score, precision_recall_fscore_support 4 | from datasets import load_from_disk 5 | import random 6 | import logging 7 | import sys 8 | import argparse 9 | import os 10 | import torch 11 | 12 | import pathlib 13 | import json 14 | 15 | if __name__ == "__main__": 16 | 17 | parser = argparse.ArgumentParser() 18 | 19 | # hyperparameters sent by the client are passed as command-line arguments to the script. 
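# (Illustrative aside, not part of the original script: on the client side these flags are
# generated from the estimator's hyperparameters dictionary, e.g. roughly
#     HuggingFace(entry_point='train.py',
#                 hyperparameters={'epochs': 1,
#                                  'train_batch_size': 32,
#                                  'model_name': 'distilbert-base-uncased'}, ...)
# the exact values and the remaining estimator arguments are assumptions, not taken from the notebook.)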
20 | parser.add_argument("--epochs", type=int, default=3) 21 | parser.add_argument("--train_batch_size", type=int, default=32) 22 | parser.add_argument("--eval_batch_size", type=int, default=64) 23 | parser.add_argument("--warmup_steps", type=int, default=500) 24 | parser.add_argument("--model_name", type=str) 25 | parser.add_argument("--learning_rate", type=str, default=5e-5) 26 | 27 | # Data, model, and output directories 28 | parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"]) 29 | parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"]) 30 | parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"]) 31 | parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"]) 32 | parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"]) 33 | 34 | args, _ = parser.parse_known_args() 35 | 36 | # Set up logging 37 | logger = logging.getLogger(__name__) 38 | 39 | logging.basicConfig( 40 | level=logging.getLevelName("INFO"), 41 | handlers=[logging.StreamHandler(sys.stdout)], 42 | format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", 43 | ) 44 | 45 | # load datasets 46 | train_dataset = load_from_disk(args.training_dir) 47 | test_dataset = load_from_disk(args.test_dir) 48 | 49 | logger.info(f" loaded train_dataset length is: {len(train_dataset)}") 50 | logger.info(f" loaded test_dataset length is: {len(test_dataset)}") 51 | 52 | # compute metrics function for binary classification 53 | def compute_metrics(pred): 54 | labels = pred.label_ids 55 | preds = pred.predictions.argmax(-1) 56 | precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary") 57 | acc = accuracy_score(labels, preds) 58 | return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall} 59 | 60 | # download model from model hub 61 | model = AutoModelForSequenceClassification.from_pretrained(args.model_name) 62 | tokenizer = AutoTokenizer.from_pretrained(args.model_name) 63 | 64 | # define training args 65 | training_args = TrainingArguments( 66 | output_dir=args.model_dir, 67 | num_train_epochs=args.epochs, 68 | per_device_train_batch_size=args.train_batch_size, 69 | per_device_eval_batch_size=args.eval_batch_size, 70 | warmup_steps=args.warmup_steps, 71 | evaluation_strategy="epoch", 72 | logging_dir=f"{args.output_data_dir}/logs", 73 | learning_rate=float(args.learning_rate), 74 | ) 75 | 76 | # create Trainer instance 77 | trainer = Trainer( 78 | model=model, 79 | args=training_args, 80 | compute_metrics=compute_metrics, 81 | train_dataset=train_dataset, 82 | eval_dataset=test_dataset, 83 | tokenizer=tokenizer, 84 | ) 85 | 86 | # train model 87 | trainer.train() 88 | 89 | # evaluate model 90 | eval_result = trainer.evaluate(eval_dataset=test_dataset) 91 | 92 | # # writes eval result to file which can be accessed later in s3 ouput 93 | # with open(os.path.join(args.output_data_dir, "eval_results.txt"), "w") as writer: 94 | # print(f"***** Eval results *****") 95 | # for key, value in sorted(eval_result.items()): 96 | # writer.write(f"{key} = {value}\n") 97 | 98 | evaluation_path = "/opt/ml/model/evaluation.json" 99 | with open(evaluation_path, "w+") as f: 100 | f.write(json.dumps(eval_result)) 101 | 102 | # Saves the model to s3 103 | trainer.save_model(args.model_dir) 104 | 105 | 106 | -------------------------------------------------------------------------------- /hugging-face-lambda-step/scripts/evaluate.py: 
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """Evaluation script for measuring mean squared error.""" 4 | 5 | import subprocess 6 | import sys 7 | 8 | def install(package): 9 | subprocess.check_call([sys.executable, "-m", "pip", "install", package]) 10 | 11 | import json 12 | import logging 13 | import pathlib 14 | import pickle 15 | import tarfile 16 | import os 17 | 18 | import numpy as np 19 | import pandas as pd 20 | 21 | from transformers import AutoModelForSequenceClassification, Trainer 22 | from datasets import load_from_disk 23 | 24 | logger = logging.getLogger() 25 | logger.setLevel(logging.INFO) 26 | logger.addHandler(logging.StreamHandler()) 27 | 28 | if __name__ == "__main__": 29 | logger.debug("Starting evaluation.") 30 | model_path = "/opt/ml/processing/model/model.tar.gz" 31 | with tarfile.open(model_path) as tar: 32 | tar.extractall(path="./hf_model") 33 | 34 | logger.debug(os.listdir('./hf_model')) 35 | 36 | # test_dir = "/opt/ml/processing/test/" 37 | # test_dataset = load_from_disk(test_dir) 38 | 39 | # model = AutoModelForSequenceClassification.from_pretrained('./hf_model') 40 | 41 | # trainer = Trainer(model=model) 42 | 43 | # eval_result = trainer.evaluate(eval_dataset=test_dataset) 44 | 45 | with open('./hf_model/evaluation.json') as f: 46 | eval_result = json.load(f) 47 | 48 | logger.debug(eval_result) 49 | output_dir = "/opt/ml/processing/evaluation" 50 | pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True) 51 | 52 | evaluation_path = f"{output_dir}/evaluation.json" 53 | with open(evaluation_path, "w") as f: 54 | f.write(json.dumps(eval_result)) 55 | -------------------------------------------------------------------------------- /hugging-face-lambda-step/scripts/preprocessing.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import numpy as np 4 | import os 5 | import pandas as pd 6 | import subprocess 7 | import sys 8 | 9 | def install(package): 10 | subprocess.check_call([sys.executable, "-m", "pip", "install", package]) 11 | 12 | if __name__=='__main__': 13 | 14 | install('torch') 15 | install('transformers') 16 | install('datasets[s3]') 17 | 18 | from datasets import load_dataset 19 | from transformers import AutoTokenizer 20 | 21 | # tokenizer used in preprocessing 22 | tokenizer_name = 'distilbert-base-uncased' 23 | 24 | # dataset used 25 | dataset_name = 'imdb' 26 | 27 | # load dataset 28 | dataset = load_dataset(dataset_name) 29 | 30 | # download tokenizer 31 | tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) 32 | 33 | # tokenizer helper function 34 | def tokenize(batch): 35 | return tokenizer(batch['text'], padding='max_length', truncation=True) 36 | 37 | # load dataset 38 | train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test']) 39 | test_dataset = test_dataset.shuffle().select(range(1000)) # smaller the size for test dataset to 1k 40 | 41 | # tokenize dataset 42 | train_dataset = train_dataset.map(tokenize, batched=True) 43 | test_dataset = test_dataset.map(tokenize, batched=True) 44 | 45 | # set format for pytorch 46 | train_dataset = train_dataset.rename_column("label", "labels") 47 | train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels']) 48 | 49 | test_dataset = test_dataset.rename_column("label", "labels") 50 | test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels']) 51 | 52 | 
train_dataset.save_to_disk('/opt/ml/processing/train') 53 | test_dataset.save_to_disk('/opt/ml/processing/test') -------------------------------------------------------------------------------- /hugging-face-lambda-step/scripts/train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer 3 | from sklearn.metrics import accuracy_score, precision_recall_fscore_support 4 | from datasets import load_from_disk 5 | import random 6 | import logging 7 | import sys 8 | import argparse 9 | import os 10 | import torch 11 | 12 | import pathlib 13 | import json 14 | 15 | if __name__ == "__main__": 16 | 17 | parser = argparse.ArgumentParser() 18 | 19 | # hyperparameters sent by the client are passed as command-line arguments to the script. 20 | parser.add_argument("--epochs", type=int, default=3) 21 | parser.add_argument("--train_batch_size", type=int, default=32) 22 | parser.add_argument("--eval_batch_size", type=int, default=64) 23 | parser.add_argument("--warmup_steps", type=int, default=500) 24 | parser.add_argument("--model_name", type=str) 25 | parser.add_argument("--learning_rate", type=str, default=5e-5) 26 | 27 | # Data, model, and output directories 28 | parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"]) 29 | parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"]) 30 | parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"]) 31 | parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"]) 32 | parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"]) 33 | 34 | args, _ = parser.parse_known_args() 35 | 36 | # Set up logging 37 | logger = logging.getLogger(__name__) 38 | 39 | logging.basicConfig( 40 | level=logging.getLevelName("INFO"), 41 | handlers=[logging.StreamHandler(sys.stdout)], 42 | format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", 43 | ) 44 | 45 | # load datasets 46 | train_dataset = load_from_disk(args.training_dir) 47 | test_dataset = load_from_disk(args.test_dir) 48 | 49 | logger.info(f" loaded train_dataset length is: {len(train_dataset)}") 50 | logger.info(f" loaded test_dataset length is: {len(test_dataset)}") 51 | 52 | # compute metrics function for binary classification 53 | def compute_metrics(pred): 54 | labels = pred.label_ids 55 | preds = pred.predictions.argmax(-1) 56 | precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary") 57 | acc = accuracy_score(labels, preds) 58 | return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall} 59 | 60 | # download model from model hub 61 | model = AutoModelForSequenceClassification.from_pretrained(args.model_name) 62 | tokenizer = AutoTokenizer.from_pretrained(args.model_name) 63 | 64 | # define training args 65 | training_args = TrainingArguments( 66 | output_dir=args.model_dir, 67 | num_train_epochs=args.epochs, 68 | per_device_train_batch_size=args.train_batch_size, 69 | per_device_eval_batch_size=args.eval_batch_size, 70 | warmup_steps=args.warmup_steps, 71 | evaluation_strategy="epoch", 72 | logging_dir=f"{args.output_data_dir}/logs", 73 | learning_rate=float(args.learning_rate), 74 | ) 75 | 76 | # create Trainer instance 77 | trainer = Trainer( 78 | model=model, 79 | args=training_args, 80 | compute_metrics=compute_metrics, 81 | train_dataset=train_dataset, 82 | 
eval_dataset=test_dataset, 83 | tokenizer=tokenizer, 84 | ) 85 | 86 | # train model 87 | trainer.train() 88 | 89 | # evaluate model 90 | eval_result = trainer.evaluate(eval_dataset=test_dataset) 91 | 92 | # # writes eval result to file which can be accessed later in s3 ouput 93 | # with open(os.path.join(args.output_data_dir, "eval_results.txt"), "w") as writer: 94 | # print(f"***** Eval results *****") 95 | # for key, value in sorted(eval_result.items()): 96 | # writer.write(f"{key} = {value}\n") 97 | 98 | evaluation_path = "/opt/ml/model/evaluation.json" 99 | with open(evaluation_path, "w+") as f: 100 | f.write(json.dumps(eval_result)) 101 | 102 | # Saves the model to s3 103 | trainer.save_model(args.model_dir) 104 | 105 | 106 | -------------------------------------------------------------------------------- /local_mode_setup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Do we have GPU support? 4 | nvidia-smi > /dev/null 2>&1 5 | if [ $? -eq 0 ]; then 6 | # check if we have nvidia-docker 7 | NVIDIA_DOCKER=`rpm -qa | grep -c nvidia-docker2` 8 | if [ $NVIDIA_DOCKER -eq 0 ]; then 9 | # Install nvidia-docker2 10 | DOCKER_VERSION=`yum list docker | tail -1 | awk '{print $2}' | head -c 2` 11 | 12 | if [ $DOCKER_VERSION -eq 17 ]; then 13 | DOCKER_PKG_VERSION='17.09.1ce-1.111.amzn1' 14 | NVIDIA_DOCKER_PKG_VERSION='2.0.3-1.docker17.09.1.ce.amzn1' 15 | else 16 | DOCKER_PKG_VERSION='18.06.1ce-3.17.amzn1' 17 | NVIDIA_DOCKER_PKG_VERSION='2.0.3-1.docker18.06.1.ce.amzn1' 18 | fi 19 | 20 | sudo yum -y remove docker 21 | sudo yum -y install docker-$DOCKER_PKG_VERSION 22 | 23 | sudo /etc/init.d/docker start 24 | 25 | curl -s -L https://nvidia.github.io/nvidia-docker/amzn1/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo 26 | sudo yum install -y nvidia-docker2-$NVIDIA_DOCKER_PKG_VERSION 27 | sudo cp daemon.json /etc/docker/daemon.json 28 | sudo pkill -SIGHUP dockerd 29 | echo "installed nvidia-docker2" 30 | else 31 | echo "nvidia-docker2 already installed. We are good to go!" 32 | fi 33 | fi 34 | 35 | # This is common for both GPU and CPU instances 36 | 37 | # check if we have docker-compose 38 | docker-compose version >/dev/null 2>&1 39 | if [ $? -ne 0 ]; then 40 | # install docker compose 41 | pip install docker-compose 42 | fi 43 | 44 | # check if we need to configure our docker interface 45 | SAGEMAKER_NETWORK=`docker network ls | grep -c sagemaker-local` 46 | if [ $SAGEMAKER_NETWORK -eq 0 ]; then 47 | docker network create --driver bridge sagemaker-local 48 | fi 49 | 50 | # Notebook instance Docker networking fixes 51 | RUNNING_ON_NOTEBOOK_INSTANCE=`sudo iptables -S OUTPUT -t nat | grep -c 169.254.0.2` 52 | 53 | # Get the Docker Network CIDR and IP for the sagemaker-local docker interface. 54 | SAGEMAKER_INTERFACE=br-`docker network ls | grep sagemaker-local | cut -d' ' -f1` 55 | DOCKER_NET=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f1` 56 | DOCKER_IP=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f12` 57 | 58 | # check if both IPTables and the Route Table are OK. 
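# (Added clarification: "OK" here means (a) the nat PREROUTING chain already holds the DNAT
# rule, added further down, that redirects instance-metadata traffic (169.254.169.254:80) from
# the sagemaker-local bridge to the local proxy at 169.254.0.2:9081, and (b) the "agent"
# routing table already routes the Docker network over that bridge; the two greps below just
# count whether those entries exist.)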
59 | IPTABLES_PATCHED=`sudo iptables -S PREROUTING -t nat | grep -c $SAGEMAKER_INTERFACE` 60 | ROUTE_TABLE_PATCHED=`sudo ip route show table agent | grep -c $SAGEMAKER_INTERFACE` 61 | 62 | if [ $RUNNING_ON_NOTEBOOK_INSTANCE -gt 0 ]; then 63 | 64 | if [ $ROUTE_TABLE_PATCHED -eq 0 ]; then 65 | # fix routing 66 | sudo ip route add $DOCKER_NET via $DOCKER_IP dev $SAGEMAKER_INTERFACE table agent 67 | else 68 | echo "SageMaker instance route table setup is ok. We are good to go." 69 | fi 70 | 71 | if [ $IPTABLES_PATCHED -eq 0 ]; then 72 | sudo iptables -t nat -A PREROUTING -i $SAGEMAKER_INTERFACE -d 169.254.169.254/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 169.254.0.2:9081 73 | echo "iptables for Docker setup done" 74 | else 75 | echo "SageMaker instance routing for Docker is ok. We are good to go!" 76 | fi 77 | fi 78 | -------------------------------------------------------------------------------- /r-churn/r_autopilot_churn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Customer Churn Prediction with Amazon SageMaker Autopilot\n", 8 | "_**Using AutoPilot to Predict Mobile Customer Departure**_\n", 9 | "\n", 10 | "---\n", 11 | "\n", 12 | "---\n", 13 | "\n", 14 | "\n", 15 | "## Introduction\n", 16 | "\n", 17 | "Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (hence the name) or with human guidance, without code through SageMaker Studio, or using the AWS SDKs. This notebook, as a first glimpse, will use the AWS SDKs to simply create and deploy a machine learning model.\n", 18 | "\n", 19 | "Losing customers is costly for any business. Identifying unhappy customers early on gives you a chance to offer them incentives to stay. This notebook describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction. ML models rarely give perfect predictions though, so this notebook is also about how to incorporate the relative costs of prediction mistakes when determining the financial outcome of using ML.\n", 20 | "\n", 21 | "We use an example of churn that is familiar to all of us–leaving a mobile phone operator. Seems like I can always find fault with my provider du jour! And if my provider knows that I’m thinking of leaving, it can offer timely incentives–I can always use a phone upgrade or perhaps have a new feature activated–and I might just stick around. Incentives are often much more cost effective than losing and reacquiring a customer.\n", 22 | "\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "### Reticulating the Amazon SageMaker Python SDK\n", 30 | "\n", 31 | "First, load the `reticulate` library and import the `sagemaker` Python module. Once the module is loaded, use the `$` notation in R instead of the `.` notation in Python to use available classes. 
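For example (added for clarity), the Python call `sagemaker.Session()` is written `sagemaker$Session()` in R, which is exactly the form used in the Setup cell below.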
" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "# Turn warnings off globally\n", 41 | "options(warn=-1)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "# Install reticulate library and import sagemaker\n", 51 | "library(reticulate)\n", 52 | "sagemaker <- import('sagemaker')" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "---\n", 60 | "\n", 61 | "## Setup\n", 62 | "\n", 63 | "_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n", 64 | "\n", 65 | "Let's start by specifying:\n", 66 | "\n", 67 | "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.\n", 68 | "- The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with a the appropriate full IAM role arn string(s)." 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "session <- sagemaker$Session()\n", 78 | "bucket <- session$default_bucket()\n", 79 | "prefix <- 'data/r-churn'\n", 80 | "role_arn <- sagemaker$get_execution_role()\n", 81 | "\n", 82 | "bucket\n", 83 | "role_arn" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "---\n", 91 | "## Data\n", 92 | "\n", 93 | "Mobile operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct an ML model of one mobile operator’s churn using a process called training. After training the model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have the model predict whether this customer is going to churn. Of course, we expect the model to make mistakes–after all, predicting the future is tricky business! But I’ll also show how to deal with prediction errors.\n", 94 | "\n", 95 | "The dataset we will use is synthetically generated, but indictive of the types of features you'd see in this use case." 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "session$download_data(path = './', \n", 105 | " bucket = 'sagemaker-sample-files', \n", 106 | " key_prefix = 'datasets/tabular/synthetic/churn.txt')" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "Before you run Autopilot on the dataset, first perform a check of the dataset to make sure that it has no obvious errors. The Autopilot process can take long time, and it's generally a good practice to inspect the dataset before you start a job. This particular dataset is small, so you can inspect it in the notebook instance itself. If you have a larger dataset that will not fit in a notebook instance memory, inspect the dataset offline using a big data analytics tool like Apache Spark. [Deequ](https://github.com/awslabs/deequ) is a library built on top of Apache Spark that can be helpful for performing checks on large datasets. 
\n", 114 | "\n", 115 | "Read the data into a data frame and take a look." 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "library(readr)\n", 125 | "\n", 126 | "churn <- read_csv(file = 'churn.txt')\n", 127 | "head(churn)" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "By modern standards, it’s a relatively small dataset, with only 5,000 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes are:\n", 135 | "\n", 136 | "- `State`: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ\n", 137 | "- `Account Length`: the number of days that this account has been active\n", 138 | "- `Area Code`: the three-digit area code of the corresponding customer’s phone number\n", 139 | "- `Phone`: the remaining seven-digit phone number\n", 140 | "- `Int’l Plan`: whether the customer has an international calling plan: yes/no\n", 141 | "- `VMail Plan`: whether the customer has a voice mail feature: yes/no\n", 142 | "- `VMail Message`: presumably the average number of voice mail messages per month\n", 143 | "- `Day Mins`: the total number of calling minutes used during the day\n", 144 | "- `Day Calls`: the total number of calls placed during the day\n", 145 | "- `Day Charge`: the billed cost of daytime calls\n", 146 | "- `Eve Mins, Eve Calls, Eve Charge`: the billed cost for calls placed during the evening\n", 147 | "- `Night Mins`, `Night Calls`, `Night Charge`: the billed cost for calls placed during nighttime\n", 148 | "- `Intl Mins`, `Intl Calls`, `Intl Charge`: the billed cost for international calls\n", 149 | "- `CustServ Calls`: the number of calls placed to Customer Service\n", 150 | "- `Churn?`: whether the customer left the service: true/false\n", 151 | "\n", 152 | "The last attribute, `Churn?`, is known as the target attribute–the attribute that we want the ML model to predict." 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "### Upload the dataset to S3" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "Now we'll upload the data to a S3 bucket in our own AWS account so Autopilot can access it." 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "write_csv(churn, 'churn.csv', col_names = TRUE)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "s3_train <- session$upload_data(path = 'churn.csv', \n", 185 | " bucket = bucket, \n", 186 | " key_prefix = prefix)\n", 187 | "\n", 188 | "s3_train" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "---\n", 196 | "## Launching a SageMaker Autopilot Job\n", 197 | "\n", 198 | "After uploading the dataset to Amazon S3, you can launch Autopilot to find the best ML pipeline to train a model on this dataset. \n", 199 | "\n", 200 | "Currently Autopilot supports only tabular datasets in CSV format. Either all files should have a header row, or the first file of the dataset, when sorted in alphabetical/lexical order by name, is expected to have a header row." 
201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "We'll launch an Autopilot job via the Studio UI (it also is possible to do so via API). To do so:\n", 208 | "\n", 209 | "- Go to the tilted triangle icon in the left toolbar and click it, then select **Experiments and trials**.\n", 210 | "- Click the **Create Autopilot Experiment** button.\n", 211 | "- For **Experiment name**, enter a name such as `automl-churn-` with a date suffix. e.g. `automl-churn-10-14-21`\n", 212 | "- Skip to **CONNECT YOUR DATA**, then find the **S3 bucket name** using autocomplete by typing `sagemaker-` and matching to the bucket name printed below the previous code cell. Similarly, find the **Dataset file name** the same way, it should be `data/r-churn/churn.csv`\n", 213 | "- Skip to **Target**, and select `Churn?` from the drop down menu.\n", 214 | "- Skip to **Output data location**, select the radio button for **Enter S3 bucket location**, and then enter a string such as `s3:///data/r-churn/output` where you replace `your-bucket-name` with the bucket name you've used previously.\n", 215 | "- Go to **Auto deploy** and switch it to off. \n", 216 | "- Click the down arrow for **Advanced Settings**, go to **Max candidates** and enter 20. (This is to keep the runtime of the job within reasonable limits for a workshop setting.) \n", 217 | "- Click **Create Experiment**." 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "### Tracking SageMaker Autopilot job progress\n", 225 | "SageMaker Autopilot job consists of the following high-level steps : \n", 226 | "\n", 227 | "* Analyzing Data, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets.\n", 228 | "* Feature Engineering, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level.\n", 229 | "* Model Tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline). \n", 230 | "\n", 231 | "Although we can use code/API calls to track job progress, we'll use the SageMaker Studio UI to do so. After you create the job via the Studio UI above, the tab will convert to an Autopilot Job tracking tab. You'll be able to see the progress of the job in that tab.\n", 232 | "\n", 233 | "If you close the tab you can always get back to it. To do so, go to the tilted triangle icon in the left toolbar and click it, then select **Experiments and trials**. Next, right-click the name of your AutoML job, which should start with \"automl-churn-\", and select **Describe AutoML Job**. A new Studio tab will open details about your job, and a summary when it completes, with the ability to sort models by metric and deploy with a single click. " 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "### Model Explainability\n", 241 | "\n", 242 | "Autopilot also generates an explainability report using SageMaker Clarify. 
The report and related artifacts are uploaded to S3, but you also can access the report in SageMaker Studio.\n", 243 | "\n", 244 | "To do so:\n", 245 | "- Go to the tilted triangle icon in the left toolbar and click it, then select **Experiments and trials**.\n", 246 | "- In the list of experiments, click on ***Unassigned trial components***.\n", 247 | "- Double-click the trial component with the name of the form, `automl-churn--documentation`.\n", 248 | "- A new tab will open named `Describe Trial Component`; in it you will see a graph of feature importance by aggregated SHAP values. Of the 20 original input features, Clarify plots the 10 features with the greatest feature attribution." 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "Clarify uses a model-agnostic feature attribution approach, which you can used to understand why a model made a prediction after training and to provide per-instance explanation during inference. The implementation includes a scalable and efficient implementation of SHAP." 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "It also is possible to visualize the the local explanations for single examples in your dataset. You can simply load the local explanations stored in the Autopilot explainability output path, and visualize the explanation (i.e., the impact that the single features have on the prediction of your model) for any example. Typically for an example you would plot a bar chart with SHAP values for each feature. The larger the bar, the more impact the feature has on the target feature. Bars with positive values are associated with higher predictions in the target variable, and bars with negative values are associated with lower predictions in the target variable." 263 | ] 264 | } 265 | ], 266 | "metadata": { 267 | "celltoolbar": "Tags", 268 | "instance_type": "ml.t3.medium", 269 | "kernelspec": { 270 | "display_name": "R (custom-r/latest)", 271 | "language": "python", 272 | "name": "ir__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:894087409521:image/custom-r" 273 | }, 274 | "language_info": { 275 | "codemirror_mode": "r", 276 | "file_extension": ".r", 277 | "mimetype": "text/x-r-source", 278 | "name": "R", 279 | "pygments_lexer": "r", 280 | "version": "4.0.0" 281 | }, 282 | "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." 283 | }, 284 | "nbformat": 4, 285 | "nbformat_minor": 4 286 | } 287 | -------------------------------------------------------------------------------- /r-in-sagemaker-processing/r_in_sagemaker_processing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Using R in SageMaker Processing\n", 8 | "\n", 9 | "Amazon SageMaker Processing is a capability of Amazon SageMaker that lets you easily run your preprocessing, postprocessing and model evaluation workloads on fully managed infrastructure. 
In this example, we'll see how to use SageMaker Processing with the R programming language.\n", 10 | "\n", 11 | "The workflow for using R with SageMaker Processing involves the following steps:\n", 12 | "\n", 13 | "- Writing a R script.\n", 14 | "- Building a Docker container.\n", 15 | "- Creating a SageMaker Processing job.\n", 16 | "- Retrieving and viewing job results. " 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "## The R script\n", 24 | "\n", 25 | "To use R with SageMaker Processing, first prepare a R script similar to one you would use outside SageMaker. Below is the R script we'll be using. It performs operations on data and also saves a .png of a plot for retrieval and display later after the Processing job is complete. This enables you to perform any kind of analysis and feature engineering at scale with R, and also create visualizations for display anywhere. " 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "%%writefile preprocessing.R\n", 35 | "\n", 36 | "library(readr)\n", 37 | "library(dplyr)\n", 38 | "library(ggplot2)\n", 39 | "library(forcats)\n", 40 | "\n", 41 | "input_dir <- \"/opt/ml/processing/input/\"\n", 42 | "filename <- Sys.glob(paste(input_dir, \"*.csv\", sep=\"\"))\n", 43 | "df <- read_csv(filename)\n", 44 | "\n", 45 | "plot_data <- df %>%\n", 46 | " group_by(state) %>%\n", 47 | " count()\n", 48 | "\n", 49 | "write_csv(plot_data, \"/opt/ml/processing/csv/plot_data.csv\")\n", 50 | "\n", 51 | "plot <- plot_data %>% \n", 52 | " ggplot()+\n", 53 | " geom_col(aes(fct_reorder(state, n), \n", 54 | " n, \n", 55 | " fill = n))+\n", 56 | " coord_flip()+\n", 57 | " labs(\n", 58 | " title = \"Number of people by state\",\n", 59 | " subtitle = \"From US-500 dataset\",\n", 60 | " x = \"State\",\n", 61 | " y = \"Number of people\"\n", 62 | " )+ \n", 63 | " theme_bw()\n", 64 | "\n", 65 | "ggsave(\"/opt/ml/processing/images/census_plot.png\", width = 10, height = 8, dpi = 100)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "## Building a Docker container\n", 73 | "\n", 74 | "Next, there is a one-time step to create a R container. For subsequent SageMaker Processing jobs, you can just reuse this container (unless you need to add further dependencies, in which case you can just add them to the Dockerfile and rebuild). To start, set up a local directory for Docker-related files." 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "!mkdir docker" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "A simple Dockerfile can be used to build a Docker container for SageMaker Processing. For this example, we'll use a parent Docker image from the Rocker Project, which provides a set of convenient R Docker images. There is no need to include your R script in the container itself because SageMaker Processing will ingest it for you. This gives you the flexibility to modify the script as needed without having to rebuild the Docker image every time you modify it. 
" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "%%writefile docker/Dockerfile\n", 100 | "\n", 101 | "FROM rocker/tidyverse:latest\n", 102 | "\n", 103 | "# tidyverse has all the packages we need, otherwise we could install more as follows\n", 104 | "# RUN install2.r --error \\\n", 105 | "# jsonlite \\\n", 106 | "# tseries\n", 107 | "\n", 108 | "ENTRYPOINT [\"Rscript\"]" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "The Dockerfile is now used to build the Docker image. We'll also create an Amazon Elastic Container Registry (ECR) repository, and push the image to ECR so it can be accessed by SageMaker." 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "import boto3\n", 125 | "\n", 126 | "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", 127 | "region = boto3.session.Session().region_name\n", 128 | "\n", 129 | "ecr_repository = 'r-in-sagemaker-processing'\n", 130 | "tag = ':latest'\n", 131 | "\n", 132 | "uri_suffix = 'amazonaws.com'\n", 133 | "processing_repository_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)\n", 134 | "\n", 135 | "# Create ECR repository and push Docker image\n", 136 | "!docker build -t $ecr_repository docker\n", 137 | "!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)\n", 138 | "!aws ecr create-repository --repository-name $ecr_repository\n", 139 | "!docker tag {ecr_repository + tag} $processing_repository_uri\n", 140 | "!docker push $processing_repository_uri" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "## Creating a SageMaker Processing job\n", 148 | "\n", 149 | "With our Docker image in ECR, we now prepare for the SageMaker Processing job by specifying Amazon S3 buckets for output and input, and downloading the raw dataset." 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "import sagemaker\n", 159 | "from sagemaker import get_execution_role\n", 160 | "\n", 161 | "role = get_execution_role()\n", 162 | "session = sagemaker.Session()\n", 163 | "s3_output = session.default_bucket()\n", 164 | "s3_prefix = 'R-in-Processing'\n", 165 | "s3_source = 'sagemaker-workshop-pdx'\n", 166 | "session.download_data(path='./data', bucket=s3_source, key_prefix='R-in-Processing/us-500.csv')" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "Before setting up the SageMaker Processing job, the raw dataset is uploaded to S3 so it is accessible to SageMaker Processing. " 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "rawdata_s3_prefix = '{}/data/raw'.format(s3_prefix)\n", 183 | "raw_s3 = session.upload_data(path='./data', key_prefix=rawdata_s3_prefix)\n", 184 | "print(raw_s3)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "The `ScriptProcessor` class of the SageMaker SDK lets you run a command inside a Docker container. We'll use this to run our own script using the `Rscript` command. 
In the `ScriptProcessor` you also can specify the type and number of instances to be used in the SageMaker Processing job." 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "from sagemaker.processing import ScriptProcessor\n", 201 | "\n", 202 | "script_processor = ScriptProcessor(command=['Rscript'],\n", 203 | " image_uri=processing_repository_uri,\n", 204 | " role=role,\n", 205 | " instance_count=1,\n", 206 | " instance_type='ml.c5.xlarge')" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "We can now start the SageMaker Processing job. The main aspects of the code below are specifying the input and output locations, and the name of our R preprocessing script. " 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", 223 | "from time import gmtime, strftime \n", 224 | "\n", 225 | "processing_job_name = \"R-in-Processing-{}\".format(strftime(\"%d-%H-%M-%S\", gmtime()))\n", 226 | "output_destination = 's3://{}/{}/data'.format(s3_output, s3_prefix)\n", 227 | "\n", 228 | "script_processor.run(code='preprocessing.R',\n", 229 | " job_name=processing_job_name,\n", 230 | " inputs=[ProcessingInput(\n", 231 | " source=raw_s3,\n", 232 | " destination='/opt/ml/processing/input')],\n", 233 | " outputs=[ProcessingOutput(output_name='csv',\n", 234 | " destination='{}/csv'.format(output_destination),\n", 235 | " source='/opt/ml/processing/csv'),\n", 236 | " ProcessingOutput(output_name='images',\n", 237 | " destination='{}/images'.format(output_destination),\n", 238 | " source='/opt/ml/processing/images')])\n", 239 | "\n", 240 | "preprocessing_job_description = script_processor.jobs[-1].describe()" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "## Retrieving and viewing job results\n", 248 | "\n", 249 | "From the SageMaker Processing job description, we can look up the S3 URIs of the output, including the output plot .png file." 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "output_config = preprocessing_job_description['ProcessingOutputConfig']\n", 259 | "for output in output_config['Outputs']:\n", 260 | " if output['OutputName'] == 'csv':\n", 261 | " preprocessed_csv_data = output['S3Output']['S3Uri']\n", 262 | " if output['OutputName'] == 'images':\n", 263 | " preprocessed_images = output['S3Output']['S3Uri']" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "Now we can display the plot produced by the SageMaker Processing job. A similar workflow applies to retrieving and working with any other output from a job, such as the transformed data itself. 
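(Added sketch, not in the original notebook: for instance, the CSV written to the 'csv' output could be pulled down the same way; the local path is arbitrary, and the key prefix follows from the output destination configured above.)

    csv_key_prefix = '{}/data/csv'.format(s3_prefix)
    session.download_data(path='./output_csv', bucket=s3_output, key_prefix=csv_key_prefix)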
" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": null, 276 | "metadata": {}, 277 | "outputs": [], 278 | "source": [ 279 | "from PIL import Image\n", 280 | "from IPython.display import display\n", 281 | "\n", 282 | "plot_key = 'census_plot.png'\n", 283 | "plot_in_s3 = '{}/{}'.format(preprocessed_images, plot_key)\n", 284 | "!aws s3 cp {plot_in_s3} .\n", 285 | "im = Image.open(plot_key)\n", 286 | "display(im)" 287 | ] 288 | } 289 | ], 290 | "metadata": { 291 | "kernelspec": { 292 | "display_name": "conda_python3", 293 | "language": "python", 294 | "name": "conda_python3" 295 | }, 296 | "language_info": { 297 | "codemirror_mode": { 298 | "name": "ipython", 299 | "version": 3 300 | }, 301 | "file_extension": ".py", 302 | "mimetype": "text/x-python", 303 | "name": "python", 304 | "nbconvert_exporter": "python", 305 | "pygments_lexer": "ipython3", 306 | "version": "3.6.10" 307 | } 308 | }, 309 | "nbformat": 4, 310 | "nbformat_minor": 4 311 | } 312 | -------------------------------------------------------------------------------- /tf-2-word-embeddings/code/model_def.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | import tensorflow as tf 4 | 5 | 6 | def get_embeddings(embedding_dir): 7 | 8 | embeddings = np.load(os.path.join(embedding_dir, 'embedding.npy')) 9 | print('embeddings shape: ', embeddings.shape) 10 | 11 | return embeddings 12 | 13 | 14 | def get_model(embedding_dir, NUM_WORDS, WORD_INDEX_LENGTH, LABELS_INDEX_LENGTH, EMBEDDING_DIM, MAX_SEQUENCE_LENGTH): 15 | 16 | embedding_matrix = get_embeddings(embedding_dir) 17 | 18 | # trainable = False to keep the embeddings frozen 19 | embedding_layer = tf.keras.layers.Embedding(NUM_WORDS, 20 | EMBEDDING_DIM, 21 | embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), 22 | input_length=MAX_SEQUENCE_LENGTH, 23 | trainable=False) 24 | 25 | sequence_input = tf.keras.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') 26 | embedded_sequences = embedding_layer(sequence_input) 27 | x = tf.keras.layers.Conv1D(128, 5, activation='relu')(embedded_sequences) 28 | x = tf.keras.layers.MaxPooling1D(5)(x) 29 | x = tf.keras.layers.Conv1D(128, 5, activation='relu')(x) 30 | x = tf.keras.layers.MaxPooling1D(5)(x) 31 | x = tf.keras.layers.Conv1D(128, 5, activation='relu')(x) 32 | x = tf.keras.layers.GlobalMaxPooling1D()(x) 33 | x = tf.keras.layers.Dense(128, activation='relu')(x) 34 | preds = tf.keras.layers.Dense(LABELS_INDEX_LENGTH, activation='softmax')(x) 35 | 36 | return tf.keras.Model(sequence_input, preds) 37 | 38 | -------------------------------------------------------------------------------- /tf-2-word-embeddings/code/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import sys 4 | import numpy as np 5 | import tensorflow as tf 6 | 7 | from model_def import get_model 8 | 9 | 10 | def parse_args(): 11 | 12 | parser = argparse.ArgumentParser() 13 | 14 | # hyperparameters sent by the client are passed as command-line arguments to the script 15 | parser.add_argument('--epochs', type=int, default=1) 16 | parser.add_argument('--batch_size', type=int, default=64) 17 | 18 | parser.add_argument('--num_words', type=int) 19 | parser.add_argument('--word_index_len', type=int) 20 | parser.add_argument('--labels_index_len', type=int) 21 | parser.add_argument('--embedding_dim', type=int) 22 | parser.add_argument('--max_sequence_len', type=int) 23 | 24 | # data 
directories 25 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 26 | parser.add_argument('--val', type=str, default=os.environ.get('SM_CHANNEL_VAL')) 27 | 28 | # embedding directory 29 | parser.add_argument('--embedding', type=str, default=os.environ.get('SM_CHANNEL_EMBEDDING')) 30 | 31 | # model directory: we will use the default set by SageMaker, /opt/ml/model 32 | parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR')) 33 | 34 | return parser.parse_known_args() 35 | 36 | 37 | def get_train_data(train_dir): 38 | 39 | x_train = np.load(os.path.join(train_dir, 'x_train.npy')) 40 | y_train = np.load(os.path.join(train_dir, 'y_train.npy')) 41 | print('x train', x_train.shape,'y train', y_train.shape) 42 | 43 | return x_train, y_train 44 | 45 | 46 | def get_val_data(val_dir): 47 | 48 | x_val = np.load(os.path.join(val_dir, 'x_val.npy')) 49 | y_val = np.load(os.path.join(val_dir, 'y_val.npy')) 50 | print('x val', x_val.shape,'y val', y_val.shape) 51 | 52 | return x_val, y_val 53 | 54 | 55 | if __name__ == "__main__": 56 | 57 | args, _ = parse_args() 58 | 59 | x_train, y_train = get_train_data(args.train) 60 | x_val, y_val = get_val_data(args.val) 61 | 62 | model = get_model(args.embedding, 63 | args.num_words, 64 | args.word_index_len, 65 | args.labels_index_len, 66 | args.embedding_dim, 67 | args.max_sequence_len) 68 | 69 | model.compile(loss='categorical_crossentropy', 70 | optimizer='rmsprop', 71 | metrics=['acc']) 72 | 73 | model.fit(x_train, y_train, 74 | batch_size=args.batch_size, 75 | epochs=args.epochs, 76 | validation_data=(x_val, y_val)) 77 | 78 | # create a TensorFlow SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 79 | model.save(args.model_dir + '/1') 80 | 81 | -------------------------------------------------------------------------------- /tf-2-word-embeddings/tf-2-word-embeddings.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Loading Word Embeddings in SageMaker for Text Classification with TensorFlow 2\n", 8 | "\n", 9 | "In this notebook, two aspects of Amazon SageMaker will be demonstrated. First, we'll use SageMaker Script Mode with a prebuilt TensorFlow 2 framework container, which enables you to use a training script similar to one you would use outside SageMaker. Second, we'll see how to use the concept of SageMaker input channels to load word embeddings into the container for training. The word embeddings will be used with a Convolutional Neural Net (CNN) in TensorFlow 2 to perform text classification. \n", 10 | "\n", 11 | "We'll begin with some necessary imports." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import os\n", 21 | "import sys\n", 22 | "import numpy as np\n", 23 | "import tensorflow as tf\n", 24 | "\n", 25 | "from tensorflow.keras.preprocessing.text import Tokenizer\n", 26 | "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", 27 | "from tensorflow.keras.utils import to_categorical" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "# Prepare Dataset and Embeddings" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "Initially, we download the 20 Newsgroups dataset. 
" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "!mkdir ./20_newsgroup\n", 51 | "!wget -O ./20_newsgroup/news20.tar.gz http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz\n", 52 | "!tar -xvzf ./20_newsgroup/news20.tar.gz" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "The next step is to download the GloVe word embeddings that we will load in the neural net." 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "!mkdir ./glove.6B\n", 69 | "!wget https://nlp.stanford.edu/data/glove.6B.zip\n", 70 | "!unzip glove.6B.zip -d ./glove.6B" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "We have to map the GloVe embedding vectors into an index." 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "BASE_DIR = ''\n", 87 | "GLOVE_DIR = os.path.join(BASE_DIR, 'glove.6B')\n", 88 | "TEXT_DATA_DIR = os.path.join(BASE_DIR, '20_newsgroup')\n", 89 | "MAX_SEQUENCE_LENGTH = 1000\n", 90 | "MAX_NUM_WORDS = 20000\n", 91 | "EMBEDDING_DIM = 100\n", 92 | "VALIDATION_SPLIT = 0.2\n", 93 | "\n", 94 | "embeddings_index = {}\n", 95 | "with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as f:\n", 96 | " for line in f:\n", 97 | " values = line.split()\n", 98 | " word = values[0]\n", 99 | " coefs = np.asarray(values[1:], dtype='float32')\n", 100 | " embeddings_index[word] = coefs\n", 101 | "\n", 102 | "print('Found %s word vectors.' % len(embeddings_index))" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "The 20 Newsgroups text also must be preprocessed. For example, the labels for each sample must be extracted and mapped to a numeric index." 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "texts = [] # list of text samples\n", 119 | "labels_index = {} # dictionary mapping label name to numeric id\n", 120 | "labels = [] # list of label ids\n", 121 | "for name in sorted(os.listdir(TEXT_DATA_DIR)):\n", 122 | " path = os.path.join(TEXT_DATA_DIR, name)\n", 123 | " if os.path.isdir(path):\n", 124 | " label_id = len(labels_index)\n", 125 | " labels_index[name] = label_id\n", 126 | " for fname in sorted(os.listdir(path)):\n", 127 | " if fname.isdigit():\n", 128 | " fpath = os.path.join(path, fname)\n", 129 | " args = {} if sys.version_info < (3,) else {'encoding': 'latin-1'}\n", 130 | " with open(fpath, **args) as f:\n", 131 | " t = f.read()\n", 132 | " i = t.find('\\n\\n') # skip header\n", 133 | " if 0 < i:\n", 134 | " t = t[i:]\n", 135 | " texts.append(t)\n", 136 | " labels.append(label_id)\n", 137 | "\n", 138 | "print('Found %s texts.' % len(texts))" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "We can use Keras text preprocessing functions to tokenize the text, limit the sequence length of the samples, and pad shorter sequences as necessary. Additionally, the preprocessed dataset must be split into training and validation sets." 
146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)\n", 155 | "tokenizer.fit_on_texts(texts)\n", 156 | "sequences = tokenizer.texts_to_sequences(texts)\n", 157 | "\n", 158 | "word_index = tokenizer.word_index\n", 159 | "print('Found %s unique tokens.' % len(word_index))\n", 160 | "\n", 161 | "data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)\n", 162 | "\n", 163 | "labels = to_categorical(np.asarray(labels))\n", 164 | "print('Shape of data tensor:', data.shape)\n", 165 | "print('Shape of label tensor:', labels.shape)\n", 166 | "\n", 167 | "# split the data into a training set and a validation set\n", 168 | "indices = np.arange(data.shape[0])\n", 169 | "np.random.shuffle(indices)\n", 170 | "data = data[indices]\n", 171 | "labels = labels[indices]\n", 172 | "num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])\n", 173 | "\n", 174 | "x_train = data[:-num_validation_samples]\n", 175 | "y_train = labels[:-num_validation_samples]\n", 176 | "x_val = data[-num_validation_samples:]\n", 177 | "y_val = labels[-num_validation_samples:]" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "After the dataset text preprocessing is complete, we can now map the 20 Newsgroup vocabulary words to their GloVe embedding vectors for use in an embedding matrix. This matrix will be loaded in an Embedding layer of the neural net." 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "num_words = min(MAX_NUM_WORDS, len(word_index)) + 1\n", 194 | "embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))\n", 195 | "for word, i in word_index.items():\n", 196 | " if i > MAX_NUM_WORDS:\n", 197 | " continue\n", 198 | " embedding_vector = embeddings_index.get(word)\n", 199 | " if embedding_vector is not None:\n", 200 | " # words not found in embedding index will be all-zeros.\n", 201 | " embedding_matrix[i] = embedding_vector\n", 202 | "\n", 203 | "print('Number of words:', num_words)\n", 204 | "print('Shape of embeddings:', embedding_matrix.shape)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "Now the data AND embeddings are saved to file to prepare for training.\n", 212 | "\n", 213 | "Note that we will not be loading the original, unprocessed set of embeddings into the training container — instead, to save loading time, we just save the embedding matrix, which at 16MB is much smaller than the original set of embeddings at 892MB. Depending on how large of a set of embeddings you need for other use cases, you might save further space by saving the embeddings with joblib (more efficient than the original Python pickle), and/or save the embeddings with half precision (fp16) instead of full precision and then restore them to full precision after they are loaded." 
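(Added sketch, not in the original notebook, showing the half-precision idea in concrete terms; the file name is illustrative.)

    np.save('embedding_fp16.npy', embedding_matrix.astype(np.float16))  # roughly half the size on disk
    embedding_matrix_restored = np.load('embedding_fp16.npy').astype(np.float32)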
214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "data_dir = os.path.join(os.getcwd(), 'data')\n", 223 | "os.makedirs(data_dir, exist_ok=True)\n", 224 | "\n", 225 | "train_dir = os.path.join(os.getcwd(), 'data/train')\n", 226 | "os.makedirs(train_dir, exist_ok=True)\n", 227 | "\n", 228 | "val_dir = os.path.join(os.getcwd(), 'data/val')\n", 229 | "os.makedirs(val_dir, exist_ok=True)\n", 230 | "\n", 231 | "embedding_dir = os.path.join(os.getcwd(), 'data/embedding')\n", 232 | "os.makedirs(embedding_dir, exist_ok=True)\n", 233 | "\n", 234 | "np.save(os.path.join(train_dir, 'x_train.npy'), x_train)\n", 235 | "np.save(os.path.join(train_dir, 'y_train.npy'), y_train)\n", 236 | "np.save(os.path.join(val_dir, 'x_val.npy'), x_val)\n", 237 | "np.save(os.path.join(val_dir, 'y_val.npy'), y_val)\n", 238 | "np.save(os.path.join(embedding_dir, 'embedding.npy'), embedding_matrix)" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "# SageMaker Hosted Training\n", 246 | "\n", 247 | "Now that we've prepared our embedding matrix, we can move on to use SageMaker's hosted training functionality. SageMaker hosted training is preferred for doing actual training in place of local notebook prototyping, especially for large-scale, distributed training. Before starting hosted training, the data must be uploaded to S3. The word embedding matrix also will be uploaded. We'll do that now, and confirm the upload was successful." 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "s3_prefix = 'tf-20-newsgroups'\n", 257 | "\n", 258 | "traindata_s3_prefix = '{}/data/train'.format(s3_prefix)\n", 259 | "valdata_s3_prefix = '{}/data/val'.format(s3_prefix)\n", 260 | "embeddingdata_s3_prefix = '{}/data/embedding'.format(s3_prefix)\n", 261 | "\n", 262 | "train_s3 = sagemaker.Session().upload_data(path='./data/train/', key_prefix=traindata_s3_prefix)\n", 263 | "val_s3 = sagemaker.Session().upload_data(path='./data/val/', key_prefix=valdata_s3_prefix)\n", 264 | "embedding_s3 = sagemaker.Session().upload_data(path='./data/embedding/', key_prefix=embeddingdata_s3_prefix)\n", 265 | "\n", 266 | "inputs = {'train':train_s3, 'val': val_s3, 'embedding': embedding_s3}\n", 267 | "print(inputs)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "We're now ready to set up an Estimator object for hosted training. Hyperparameters are passed in as a dictionary. Importantly, for the case of a model such as this one that takes word embeddings as an input, various aspects of the embeddings can be passed in with the dictionary so the embedding layer can be constructed in a flexible manner and not hardcoded. This allows easier tuning without having to make code modifications. 
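(Added note, not in the original notebook: each entry in the hyperparameters dictionary reaches the training container as a command-line flag parsed by code/train.py, and each key of the `inputs` dictionary above becomes an input channel that the script locates through an environment variable, typically mounted under /opt/ml/input/data/<channel>.)

    # inside the training container (sketch) -- this is exactly what code/train.py reads
    embedding_dir = os.environ.get('SM_CHANNEL_EMBEDDING')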
" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "import sagemaker\n", 284 | "from sagemaker.tensorflow import TensorFlow\n", 285 | "\n", 286 | "train_instance_type = 'ml.p3.2xlarge'\n", 287 | "hyperparameters = {'epochs': 20, \n", 288 | " 'batch_size': 128, \n", 289 | " 'num_words': num_words,\n", 290 | " 'word_index_len': len(word_index),\n", 291 | " 'labels_index_len': len(labels_index),\n", 292 | " 'embedding_dim': EMBEDDING_DIM,\n", 293 | " 'max_sequence_len': MAX_SEQUENCE_LENGTH\n", 294 | " }\n", 295 | "\n", 296 | "estimator = TensorFlow(entry_point='train.py',\n", 297 | " source_dir='code',\n", 298 | " model_dir=model_dir,\n", 299 | " instance_type=train_instance_type,\n", 300 | " instance_count=1,\n", 301 | " hyperparameters=hyperparameters,\n", 302 | " role=sagemaker.get_execution_role(),\n", 303 | " base_job_name='tf-20-newsgroups',\n", 304 | " framework_version='2.1',\n", 305 | " py_version='py3',\n", 306 | " script_mode=True)" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "To start the training job, simply call the `fit` method of the `Estimator` object. The `inputs` parameter is the dictionary we created above, which defines three channels. Besides the usual channels for the training and validation datasets, there is a channel for the embedding matrix. This illustrates one aspect of the flexibility of SageMaker for setting up training jobs: in addition to data, you can pass in arbitrary files needed for training. " 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "estimator.fit(inputs)" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "# SageMaker hosted endpoint\n", 330 | "\n", 331 | "If we wish to deploy the model to production, the next step is to create a SageMaker hosted endpoint. The endpoint will retrieve the TensorFlow SavedModel created during training and deploy it within a TensorFlow Serving container. This all can be accomplished with one line of code, an invocation of the Estimator's deploy method." 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.xlarge')" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "We can now compare the predictions generated by the endpoint with a sample of the validation data. The results are shown as integer labels from 0 to 19 corresponding to the 20 different newsgroups." 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "results = predictor.predict(x_val[:10])['predictions'] \n", 357 | "\n", 358 | "print('predictions: \\t{}'.format(np.argmax(results, axis=1)))\n", 359 | "print('target values: \\t{}'.format(np.argmax(y_val[:10], axis=1)))" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "When you're finished with your review of this notebook, you can delete the prediction endpoint to release the instance(s) associated with it." 
367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [ 375 | "sagemaker.Session().delete_endpoint(predictor.endpoint_name)" 376 | ] 377 | } 378 | ], 379 | "metadata": { 380 | "kernelspec": { 381 | "display_name": "conda_tensorflow2_p36", 382 | "language": "python", 383 | "name": "conda_tensorflow2_p36" 384 | }, 385 | "language_info": { 386 | "codemirror_mode": { 387 | "name": "ipython", 388 | "version": 3 389 | }, 390 | "file_extension": ".py", 391 | "mimetype": "text/x-python", 392 | "name": "python", 393 | "nbconvert_exporter": "python", 394 | "pygments_lexer": "ipython3", 395 | "version": "3.6.13" 396 | } 397 | }, 398 | "nbformat": 4, 399 | "nbformat_minor": 2 400 | } 401 | -------------------------------------------------------------------------------- /tf-2-workflow-smpipelines/train_model/model_def.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | def get_model(): 5 | 6 | inputs = tf.keras.Input(shape=(13,)) 7 | hidden_1 = tf.keras.layers.Dense(13, activation='tanh')(inputs) 8 | hidden_2 = tf.keras.layers.Dense(6, activation='sigmoid')(hidden_1) 9 | outputs = tf.keras.layers.Dense(1)(hidden_2) 10 | return tf.keras.Model(inputs=inputs, outputs=outputs) 11 | -------------------------------------------------------------------------------- /tf-2-workflow-smpipelines/train_model/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import tensorflow as tf 5 | 6 | from model_def import get_model 7 | 8 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 9 | 10 | 11 | def parse_args(): 12 | 13 | parser = argparse.ArgumentParser() 14 | 15 | # hyperparameters sent by the client are passed as command-line arguments to the script 16 | parser.add_argument('--epochs', type=int, default=1) 17 | parser.add_argument('--batch_size', type=int, default=64) 18 | parser.add_argument('--learning_rate', type=float, default=0.1) 19 | 20 | # data directories 21 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 22 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 23 | 24 | # model directory 25 | parser.add_argument('--sm-model-dir', type=str, default=os.environ.get('SM_MODEL_DIR')) 26 | 27 | return parser.parse_known_args() 28 | 29 | 30 | def get_train_data(train_dir): 31 | 32 | x_train = np.load(os.path.join(train_dir, 'x_train.npy')) 33 | y_train = np.load(os.path.join(train_dir, 'y_train.npy')) 34 | print('x train', x_train.shape,'y train', y_train.shape) 35 | 36 | return x_train, y_train 37 | 38 | 39 | def get_test_data(test_dir): 40 | 41 | x_test = np.load(os.path.join(test_dir, 'x_test.npy')) 42 | y_test = np.load(os.path.join(test_dir, 'y_test.npy')) 43 | print('x test', x_test.shape,'y test', y_test.shape) 44 | 45 | return x_test, y_test 46 | 47 | 48 | if __name__ == "__main__": 49 | 50 | args, _ = parse_args() 51 | 52 | print('Training data location: {}'.format(args.train)) 53 | print('Test data location: {}'.format(args.test)) 54 | x_train, y_train = get_train_data(args.train) 55 | x_test, y_test = get_test_data(args.test) 56 | 57 | device = '/cpu:0' 58 | print(device) 59 | batch_size = args.batch_size 60 | epochs = args.epochs 61 | learning_rate = args.learning_rate 62 | print('batch_size = {}, epochs = {}, learning rate = {}'.format(batch_size, epochs, learning_rate)) 63 | 64 | 
with tf.device(device): 65 | 66 | model = get_model() 67 | optimizer = tf.keras.optimizers.SGD(learning_rate) 68 | model.compile(optimizer=optimizer, loss='mse') 69 | model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, 70 | validation_data=(x_test, y_test)) 71 | 72 | # evaluate on test set 73 | scores = model.evaluate(x_test, y_test, batch_size, verbose=2) 74 | print("\nTest MSE :", scores) 75 | 76 | # save model 77 | model.save(args.sm_model_dir + '/1') 78 | 79 | 80 | -------------------------------------------------------------------------------- /tf-2-workflow/train_model/model_def.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | def get_model(): 5 | 6 | inputs = tf.keras.Input(shape=(13,)) 7 | hidden_1 = tf.keras.layers.Dense(13, activation='tanh')(inputs) 8 | hidden_2 = tf.keras.layers.Dense(6, activation='sigmoid')(hidden_1) 9 | outputs = tf.keras.layers.Dense(1)(hidden_2) 10 | return tf.keras.Model(inputs=inputs, outputs=outputs) 11 | -------------------------------------------------------------------------------- /tf-2-workflow/train_model/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import tensorflow as tf 5 | 6 | from model_def import get_model 7 | 8 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 9 | 10 | 11 | def parse_args(): 12 | 13 | parser = argparse.ArgumentParser() 14 | 15 | # hyperparameters sent by the client are passed as command-line arguments to the script 16 | parser.add_argument('--epochs', type=int, default=1) 17 | parser.add_argument('--batch_size', type=int, default=64) 18 | parser.add_argument('--learning_rate', type=float, default=0.1) 19 | 20 | # data directories 21 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 22 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 23 | 24 | # model directory: we will use the default set by SageMaker, /opt/ml/model 25 | parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR')) 26 | 27 | return parser.parse_known_args() 28 | 29 | 30 | def get_train_data(train_dir): 31 | 32 | x_train = np.load(os.path.join(train_dir, 'x_train.npy')) 33 | y_train = np.load(os.path.join(train_dir, 'y_train.npy')) 34 | print('x train', x_train.shape,'y train', y_train.shape) 35 | 36 | return x_train, y_train 37 | 38 | 39 | def get_test_data(test_dir): 40 | 41 | x_test = np.load(os.path.join(test_dir, 'x_test.npy')) 42 | y_test = np.load(os.path.join(test_dir, 'y_test.npy')) 43 | print('x test', x_test.shape,'y test', y_test.shape) 44 | 45 | return x_test, y_test 46 | 47 | 48 | if __name__ == "__main__": 49 | 50 | args, _ = parse_args() 51 | 52 | print('Training data location: {}'.format(args.train)) 53 | print('Test data location: {}'.format(args.test)) 54 | x_train, y_train = get_train_data(args.train) 55 | x_test, y_test = get_test_data(args.test) 56 | 57 | device = '/cpu:0' 58 | print(device) 59 | batch_size = args.batch_size 60 | epochs = args.epochs 61 | learning_rate = args.learning_rate 62 | print('batch_size = {}, epochs = {}, learning rate = {}'.format(batch_size, epochs, learning_rate)) 63 | 64 | with tf.device(device): 65 | 66 | model = get_model() 67 | optimizer = tf.keras.optimizers.SGD(learning_rate) 68 | model.compile(optimizer=optimizer, loss='mse') 69 | model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, 70 | 
validation_data=(x_test, y_test)) 71 | 72 | # evaluate on test set 73 | scores = model.evaluate(x_test, y_test, batch_size, verbose=2) 74 | print("\nTest MSE :", scores) 75 | 76 | # save model 77 | model.save(args.model_dir + '/1') 78 | 79 | -------------------------------------------------------------------------------- /tf-batch-inference-script/code/inference.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). You 4 | # may not use this file except in compliance with the License. A copy of 5 | # the License is located at 6 | # 7 | # http://aws.amazon.com/apache2.0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is 10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 11 | # ANY KIND, either express or implied. See the License for the specific 12 | # language governing permissions and limitations under the License. 13 | 14 | import io 15 | import json 16 | import numpy as np 17 | from collections import namedtuple 18 | from PIL import Image 19 | 20 | Context = namedtuple('Context', 21 | 'model_name, model_version, method, rest_uri, grpc_uri, ' 22 | 'custom_attributes, request_content_type, accept_header') 23 | 24 | 25 | def input_handler(data, context): 26 | """ Pre-process request input before it is sent to TensorFlow Serving REST API 27 | 28 | Args: 29 | data (obj): the request data, in format of dict or string 30 | context (Context): an object containing request and configuration details 31 | 32 | Returns: 33 | (dict): a JSON-serializable dict that contains request body and headers 34 | """ 35 | 36 | if context.request_content_type == 'application/x-image': 37 | 38 | image_as_bytes = io.BytesIO(data.read()) 39 | image = Image.open(image_as_bytes) 40 | instance = np.expand_dims(image, axis=0) 41 | return json.dumps({"instances": instance.tolist()}) 42 | 43 | else: 44 | _return_error(415, 'Unsupported content type "{}"'.format(context.request_content_type or 'Unknown')) 45 | 46 | 47 | def output_handler(data, context): 48 | """Post-process TensorFlow Serving output before it is returned to the client. 
49 | 50 | Args: 51 | data (obj): the TensorFlow serving response 52 | context (Context): an object containing request and configuration details 53 | 54 | Returns: 55 | (bytes, string): data to return to client, response content type 56 | """ 57 | if data.status_code != 200: 58 | raise Exception(data.content.decode('utf-8')) 59 | response_content_type = context.accept_header 60 | prediction = data.content 61 | return prediction, response_content_type 62 | 63 | 64 | def _return_error(code, message): 65 | raise ValueError('Error: {}, {}'.format(str(code), message)) 66 | -------------------------------------------------------------------------------- /tf-batch-inference-script/code/model_def.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from tensorflow.keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization 3 | from tensorflow.keras.models import Sequential 4 | from tensorflow.keras.optimizers import Adam, SGD, RMSprop 5 | 6 | HEIGHT = 32 7 | WIDTH = 32 8 | DEPTH = 3 9 | NUM_CLASSES = 10 10 | 11 | def get_model(learning_rate, weight_decay, optimizer, momentum, size, mpi=False, hvd=False): 12 | 13 | model = Sequential() 14 | model.add(Conv2D(32, (3, 3), padding='same', input_shape=(HEIGHT, WIDTH, DEPTH))) 15 | model.add(BatchNormalization()) 16 | model.add(Activation('relu')) 17 | model.add(Conv2D(32, (3, 3))) 18 | model.add(BatchNormalization()) 19 | model.add(Activation('relu')) 20 | model.add(MaxPooling2D(pool_size=(2, 2))) 21 | model.add(Dropout(0.2)) 22 | 23 | model.add(Conv2D(64, (3, 3), padding='same')) 24 | model.add(BatchNormalization()) 25 | model.add(Activation('relu')) 26 | model.add(Conv2D(64, (3, 3))) 27 | model.add(BatchNormalization()) 28 | model.add(Activation('relu')) 29 | model.add(MaxPooling2D(pool_size=(2, 2))) 30 | model.add(Dropout(0.3)) 31 | 32 | model.add(Conv2D(128, (3, 3), padding='same')) 33 | model.add(BatchNormalization()) 34 | model.add(Activation('relu')) 35 | model.add(Conv2D(128, (3, 3))) 36 | model.add(BatchNormalization()) 37 | model.add(Activation('relu')) 38 | model.add(MaxPooling2D(pool_size=(2, 2))) 39 | model.add(Dropout(0.4)) 40 | 41 | model.add(Flatten()) 42 | model.add(Dense(512)) 43 | model.add(Activation('relu')) 44 | model.add(Dropout(0.5)) 45 | model.add(Dense(NUM_CLASSES)) 46 | model.add(Activation('softmax')) 47 | 48 | if mpi: 49 | size = hvd.size() 50 | 51 | if optimizer.lower() == 'sgd': 52 | opt = SGD(lr=learning_rate * size, decay=weight_decay, momentum=momentum) 53 | elif optimizer.lower() == 'rmsprop': 54 | opt = RMSprop(lr=learning_rate * size, decay=weight_decay) 55 | else: 56 | opt = Adam(lr=learning_rate * size, decay=weight_decay) 57 | 58 | if mpi: 59 | opt = hvd.DistributedOptimizer(opt) 60 | 61 | model.compile(loss='categorical_crossentropy', 62 | optimizer=opt, 63 | metrics=['accuracy']) 64 | 65 | return model 66 | 67 | -------------------------------------------------------------------------------- /tf-batch-inference-script/code/requirements.txt: -------------------------------------------------------------------------------- 1 | Pillow 2 | numpy 3 | -------------------------------------------------------------------------------- /tf-batch-inference-script/code/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import codecs 3 | import json 4 | import logging 5 | import numpy as np 6 | import os 7 | import re 8 | 9 | import tensorflow as tf 10 | import 
tensorflow.keras.backend as K 11 | from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint 12 | 13 | from model_def import get_model, HEIGHT, WIDTH, DEPTH, NUM_CLASSES 14 | from utilities import process_input 15 | 16 | 17 | logging.getLogger().setLevel(logging.INFO) 18 | tf.logging.set_verbosity(tf.logging.ERROR) 19 | 20 | 21 | # Copy inference pre/post-processing script so it will be included in the model package 22 | os.system('mkdir /opt/ml/model/code') 23 | os.system('cp inference.py /opt/ml/model/code') 24 | os.system('cp requirements.txt /opt/ml/model/code') 25 | 26 | 27 | class CustomTensorBoardCallback(TensorBoard): 28 | 29 | def on_batch_end(self, batch, logs=None): 30 | pass 31 | 32 | 33 | def save_history(path, history): 34 | 35 | history_for_json = {} 36 | # transform float values that aren't json-serializable 37 | for key in list(history.history.keys()): 38 | if type(history.history[key]) == np.ndarray: 39 | history_for_json[key] = history.history[key].tolist() 40 | elif type(history.history[key]) == list: 41 | if type(history.history[key][0]) == np.float32 or type(history.history[key][0]) == np.float64: 42 | history_for_json[key] = list(map(float, history.history[key])) 43 | 44 | with codecs.open(path, 'w', encoding='utf-8') as f: 45 | json.dump(history_for_json, f, separators=(',', ':'), sort_keys=True, indent=4) 46 | 47 | 48 | def save_model(model, output): 49 | 50 | # create a TensorFlow SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 51 | tf.contrib.saved_model.save_keras_model(model, args.model_dir) 52 | logging.info("Model successfully saved at: {}".format(output)) 53 | return 54 | 55 | 56 | def main(args): 57 | 58 | mpi = False 59 | if 'sourcedir.tar.gz' in args.tensorboard_dir: 60 | tensorboard_dir = re.sub('source/sourcedir.tar.gz', 'model', args.tensorboard_dir) 61 | else: 62 | tensorboard_dir = args.tensorboard_dir 63 | logging.info("Writing TensorBoard logs to {}".format(tensorboard_dir)) 64 | 65 | if 'sagemaker_mpi_enabled' in args.fw_params: 66 | if args.fw_params['sagemaker_mpi_enabled']: 67 | import horovod.tensorflow.keras as hvd 68 | mpi = True 69 | hvd.init() 70 | config = tf.ConfigProto() 71 | config.gpu_options.allow_growth = True 72 | config.gpu_options.visible_device_list = str(hvd.local_rank()) 73 | K.set_session(tf.Session(config=config)) 74 | else: 75 | hvd = None 76 | 77 | logging.info("Running with MPI={}".format(mpi)) 78 | logging.info("getting data") 79 | train_dataset = process_input(args.epochs, args.batch_size, args.train, 'train', args.data_config) 80 | eval_dataset = process_input(args.epochs, args.batch_size, args.eval, 'eval', args.data_config) 81 | validation_dataset = process_input(args.epochs, args.batch_size, args.validation, 'validation', args.data_config) 82 | 83 | logging.info("configuring model") 84 | model = get_model(args.learning_rate, args.weight_decay, args.optimizer, args.momentum, 1, mpi, hvd) 85 | callbacks = [] 86 | if mpi: 87 | callbacks.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0)) 88 | callbacks.append(hvd.callbacks.MetricAverageCallback()) 89 | callbacks.append(hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1)) 90 | callbacks.append(tf.keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1)) 91 | if hvd.rank() == 0: 92 | callbacks.append(ModelCheckpoint(args.output_data_dir + '/checkpoint-{epoch}.h5')) 93 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 94 | else: 95 | callbacks.append(ModelCheckpoint(args.output_data_dir + 
'/checkpoint-{epoch}.h5')) 96 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 97 | 98 | logging.info("Starting training") 99 | size = 1 100 | if mpi: 101 | size = hvd.size() 102 | 103 | history = model.fit(x=train_dataset[0], 104 | y=train_dataset[1], 105 | steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size) // size, 106 | epochs=args.epochs, 107 | validation_data=validation_dataset, 108 | validation_steps=(num_examples_per_epoch('validation') // args.batch_size) // size, 109 | callbacks=callbacks) 110 | 111 | score = model.evaluate(eval_dataset[0], 112 | eval_dataset[1], 113 | steps=num_examples_per_epoch('eval') // args.batch_size, 114 | verbose=0) 115 | 116 | logging.info('Test loss:{}'.format(score[0])) 117 | logging.info('Test accuracy:{}'.format(score[1])) 118 | 119 | if mpi: 120 | if hvd.rank() == 0: 121 | save_history(args.model_dir + "/hvd_history.p", history) 122 | return save_model(model, args.model_output_dir) 123 | else: 124 | save_history(args.model_dir + "/hvd_history.p", history) 125 | return save_model(model, args.model_output_dir) 126 | 127 | 128 | def num_examples_per_epoch(subset='train'): 129 | if subset == 'train': 130 | return 40000 131 | elif subset == 'validation': 132 | return 10000 133 | elif subset == 'eval': 134 | return 10000 135 | else: 136 | raise ValueError('Invalid data subset "%s"' % subset) 137 | 138 | 139 | if __name__ == '__main__': 140 | 141 | parser = argparse.ArgumentParser() 142 | 143 | parser.add_argument('--train',type=str,required=False,default=os.environ.get('SM_CHANNEL_TRAIN')) 144 | parser.add_argument('--validation',type=str,required=False,default=os.environ.get('SM_CHANNEL_VALIDATION')) 145 | parser.add_argument('--eval',type=str,required=False,default=os.environ.get('SM_CHANNEL_EVAL')) 146 | parser.add_argument('--model_dir',type=str,required=True,help='The directory where the model will be stored.') 147 | parser.add_argument('--model_output_dir',type=str,default=os.environ.get('SM_MODEL_DIR')) 148 | parser.add_argument('--output_data_dir',type=str,default=os.environ.get('SM_OUTPUT_DATA_DIR')) 149 | parser.add_argument('--output-dir',type=str,default=os.environ.get('SM_OUTPUT_DIR')) 150 | parser.add_argument('--tensorboard-dir',type=str,default=os.environ.get('SM_MODULE_DIR')) 151 | parser.add_argument('--weight-decay',type=float,default=2e-4,help='Weight decay for convolutions.') 152 | parser.add_argument('--learning-rate',type=float,default=0.001,help='Initial learning rate.') 153 | parser.add_argument('--epochs',type=int,default=10) 154 | parser.add_argument('--batch-size',type=int,default=128) 155 | parser.add_argument('--data-config',type=json.loads,default=os.environ.get('SM_INPUT_DATA_CONFIG')) 156 | parser.add_argument('--fw-params',type=json.loads,default=os.environ.get('SM_FRAMEWORK_PARAMS')) 157 | parser.add_argument('--optimizer',type=str,default='adam') 158 | parser.add_argument('--momentum',type=float,default='0.9') 159 | 160 | args = parser.parse_args() 161 | 162 | main(args) 163 | -------------------------------------------------------------------------------- /tf-batch-inference-script/code/utilities.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import os 3 | import tensorflow as tf 4 | 5 | from model_def import HEIGHT, WIDTH, DEPTH, NUM_CLASSES 6 | 7 | 8 | NUM_DATA_BATCHES = 5 9 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES 10 | 11 | 12 | def _get_filenames(channel_name, channel): 13 | if channel_name in 
['train', 'validation', 'eval']: 14 | return [os.path.join(channel, channel_name + '.tfrecords')] 15 | else: 16 | raise ValueError('Invalid data subset "%s"' % channel_name) 17 | 18 | 19 | def _train_preprocess_fn(image): 20 | 21 | # Resize the image to add four extra pixels on each side. 22 | image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8) 23 | 24 | # Randomly crop a [HEIGHT, WIDTH] section of the image. 25 | image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH]) 26 | 27 | # Randomly flip the image horizontally. 28 | image = tf.image.random_flip_left_right(image) 29 | 30 | return image 31 | 32 | 33 | def _dataset_parser(value): 34 | 35 | featdef = { 36 | 'image': tf.FixedLenFeature([], tf.string), 37 | 'label': tf.FixedLenFeature([], tf.int64), 38 | } 39 | 40 | example = tf.parse_single_example(value, featdef) 41 | image = tf.decode_raw(example['image'], tf.uint8) 42 | image.set_shape([DEPTH * HEIGHT * WIDTH]) 43 | 44 | # Reshape from [depth * height * width] to [depth, height, width]. 45 | image = tf.cast( 46 | tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]), 47 | tf.float32) 48 | label = tf.cast(example['label'], tf.int32) 49 | image = _train_preprocess_fn(image) 50 | return image, tf.one_hot(label, NUM_CLASSES) 51 | 52 | 53 | def process_input(epochs, batch_size, channel, channel_name, data_config): 54 | 55 | mode = data_config[channel_name]['TrainingInputMode'] 56 | filenames = _get_filenames(channel_name, channel) 57 | # Repeat infinitely. 58 | logging.info("Running {} in {} mode".format(channel_name, mode)) 59 | if mode == 'Pipe': 60 | from sagemaker_tensorflow import PipeModeDataset 61 | dataset = PipeModeDataset(channel=channel_name, record_format='TFRecord') 62 | else: 63 | dataset = tf.data.TFRecordDataset(filenames) 64 | 65 | dataset = dataset.repeat(epochs) 66 | dataset = dataset.prefetch(10) 67 | 68 | # Parse records. 69 | dataset = dataset.map( 70 | _dataset_parser, num_parallel_calls=10) 71 | 72 | # Potentially shuffle records. 73 | if channel_name == 'train': 74 | # Ensure that the capacity is sufficiently large to provide good random 75 | # shuffling. 76 | buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size 77 | dataset = dataset.shuffle(buffer_size=buffer_size) 78 | 79 | # Batch it up. 80 | dataset = dataset.batch(batch_size, drop_remainder=True) 81 | iterator = dataset.make_one_shot_iterator() 82 | image_batch, label_batch = iterator.get_next() 83 | 84 | return image_batch, label_batch 85 | 86 | 87 | -------------------------------------------------------------------------------- /tf-batch-inference-script/generate_cifar10_tfrecords.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # ============================================================================== 15 | """Read CIFAR-10 data from pickled numpy arrays and writes TFRecords. 16 | 17 | Generates tf.train.Example protos and writes them to TFRecord files from the 18 | python version of the CIFAR-10 dataset downloaded from 19 | https://www.cs.toronto.edu/~kriz/cifar.html. 20 | """ 21 | 22 | from __future__ import absolute_import 23 | from __future__ import division 24 | from __future__ import print_function 25 | 26 | import argparse 27 | import os 28 | import sys 29 | 30 | import tarfile 31 | from six.moves import cPickle as pickle 32 | from six.moves import xrange # pylint: disable=redefined-builtin 33 | import tensorflow as tf 34 | 35 | tf.logging.set_verbosity(tf.logging.ERROR) 36 | if type(tf.contrib) != type(tf): tf.contrib._warning = None 37 | 38 | CIFAR_FILENAME = 'cifar-10-python.tar.gz' 39 | CIFAR_DOWNLOAD_URL = 'https://www.cs.toronto.edu/~kriz/' + CIFAR_FILENAME 40 | CIFAR_LOCAL_FOLDER = 'cifar-10-batches-py' 41 | 42 | 43 | def download_and_extract(data_dir): 44 | # download CIFAR-10 if not already downloaded. 45 | tf.contrib.learn.datasets.base.maybe_download(CIFAR_FILENAME, data_dir, 46 | CIFAR_DOWNLOAD_URL) 47 | tarfile.open(os.path.join(data_dir, CIFAR_FILENAME), 48 | 'r:gz').extractall(data_dir) 49 | 50 | 51 | def _int64_feature(value): 52 | return tf.train.Feature(int64_list=tf.train.Int64List(value=[value])) 53 | 54 | 55 | def _bytes_feature(value): 56 | return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value])) 57 | 58 | 59 | def _get_file_names(): 60 | """Returns the file names expected to exist in the input_dir.""" 61 | file_names = {} 62 | file_names['train'] = ['data_batch_%d' % i for i in xrange(1, 5)] 63 | file_names['validation'] = ['data_batch_5'] 64 | file_names['eval'] = ['test_batch'] 65 | return file_names 66 | 67 | 68 | def read_pickle_from_file(filename): 69 | with tf.gfile.Open(filename, 'rb') as f: 70 | if sys.version_info >= (3, 0): 71 | data_dict = pickle.load(f, encoding='bytes') 72 | else: 73 | data_dict = pickle.load(f) 74 | return data_dict 75 | 76 | 77 | def convert_to_tfrecord(input_files, output_file): 78 | """Converts a file to TFRecords.""" 79 | print('Generating %s' % output_file) 80 | with tf.python_io.TFRecordWriter(output_file) as record_writer: 81 | for input_file in input_files: 82 | data_dict = read_pickle_from_file(input_file) 83 | data = data_dict[b'data'] 84 | labels = data_dict[b'labels'] 85 | 86 | num_entries_in_batch = len(labels) 87 | for i in range(num_entries_in_batch): 88 | example = tf.train.Example(features=tf.train.Features( 89 | feature={ 90 | 'image': _bytes_feature(data[i].tobytes()), 91 | 'label': _int64_feature(labels[i]) 92 | })) 93 | record_writer.write(example.SerializeToString()) 94 | 95 | 96 | def main(data_dir): 97 | print('Download from {} and extract.'.format(CIFAR_DOWNLOAD_URL)) 98 | download_and_extract(data_dir) 99 | file_names = _get_file_names() 100 | input_dir = os.path.join(data_dir, CIFAR_LOCAL_FOLDER) 101 | for mode, files in file_names.items(): 102 | input_files = [os.path.join(input_dir, f) for f in files] 103 | output_file = os.path.join(data_dir+'/'+mode, mode + '.tfrecords') 104 | if not os.path.exists(data_dir+'/'+mode): 105 | os.makedirs(data_dir+'/'+mode) 106 | try: 107 | os.remove(output_file) 108 | except OSError: 109 | pass 110 | # Convert to tf.train.Example and write the to TFRecords. 
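# Note: each subset is written to <data_dir>/<mode>/<mode>.tfrecords (train, validation, eval),
# which is the per-channel file layout that the training scripts' _get_filenames() helper expects.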
111 | convert_to_tfrecord(input_files, output_file) 112 | print('Done!') 113 | import shutil 114 | shutil.rmtree(data_dir+'/cifar-10-batches-py') 115 | os.remove(data_dir+'/cifar-10-python.tar.gz') 116 | 117 | 118 | if __name__ == '__main__': 119 | parser = argparse.ArgumentParser() 120 | parser.add_argument( 121 | '--data-dir', 122 | type=str, 123 | default='', 124 | help='Directory to download and extract CIFAR-10 to.') 125 | 126 | args = parser.parse_args() 127 | main(args.data_dir) 128 | -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1000_dog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1000_dog.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1001_airplane.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1001_airplane.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1003_deer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1003_deer.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1004_ship.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1004_ship.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1005_automobile.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1005_automobile.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1008_truck.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1008_truck.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1009_frog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1009_frog.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1014_cat.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1014_cat.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1037_horse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1037_horse.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1038_bird.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1038_bird.png -------------------------------------------------------------------------------- /tf-distribution-options/code/inference.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). You 4 | # may not use this file except in compliance with the License. A copy of 5 | # the License is located at 6 | # 7 | # http://aws.amazon.com/apache2.0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is 10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 11 | # ANY KIND, either express or implied. See the License for the specific 12 | # language governing permissions and limitations under the License. 13 | 14 | import io 15 | import json 16 | import numpy as np 17 | from collections import namedtuple 18 | from PIL import Image 19 | 20 | Context = namedtuple('Context', 21 | 'model_name, model_version, method, rest_uri, grpc_uri, ' 22 | 'custom_attributes, request_content_type, accept_header') 23 | 24 | 25 | def input_handler(data, context): 26 | """ Pre-process request input before it is sent to TensorFlow Serving REST API 27 | 28 | Args: 29 | data (obj): the request data, in format of dict or string 30 | context (Context): an object containing request and configuration details 31 | 32 | Returns: 33 | (dict): a JSON-serializable dict that contains request body and headers 34 | """ 35 | 36 | if context.request_content_type == 'application/x-image': 37 | 38 | image_as_bytes = io.BytesIO(data.read()) 39 | image = Image.open(image_as_bytes) 40 | instance = np.expand_dims(image, axis=0) 41 | return json.dumps({"instances": instance.tolist()}) 42 | 43 | else: 44 | _return_error(415, 'Unsupported content type "{}"'.format(context.request_content_type or 'Unknown')) 45 | 46 | 47 | def output_handler(data, context): 48 | """Post-process TensorFlow Serving output before it is returned to the client. 
49 | 50 | Args: 51 | data (obj): the TensorFlow serving response 52 | context (Context): an object containing request and configuration details 53 | 54 | Returns: 55 | (bytes, string): data to return to client, response content type 56 | """ 57 | if data.status_code != 200: 58 | raise Exception(data.content.decode('utf-8')) 59 | response_content_type = context.accept_header 60 | prediction = data.content 61 | return prediction, response_content_type 62 | 63 | 64 | def _return_error(code, message): 65 | raise ValueError('Error: {}, {}'.format(str(code), message)) 66 | -------------------------------------------------------------------------------- /tf-distribution-options/code/model_def.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from tensorflow.keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization 3 | from tensorflow.keras.models import Sequential 4 | from tensorflow.keras.optimizers import Adam, SGD, RMSprop 5 | 6 | HEIGHT = 32 7 | WIDTH = 32 8 | DEPTH = 3 9 | NUM_CLASSES = 10 10 | 11 | def get_model(learning_rate, weight_decay, optimizer, momentum, size, mpi=False, hvd=False): 12 | 13 | model = Sequential() 14 | model.add(Conv2D(32, (3, 3), padding='same', input_shape=(HEIGHT, WIDTH, DEPTH))) 15 | model.add(BatchNormalization()) 16 | model.add(Activation('relu')) 17 | model.add(Conv2D(32, (3, 3))) 18 | model.add(BatchNormalization()) 19 | model.add(Activation('relu')) 20 | model.add(MaxPooling2D(pool_size=(2, 2))) 21 | model.add(Dropout(0.2)) 22 | 23 | model.add(Conv2D(64, (3, 3), padding='same')) 24 | model.add(BatchNormalization()) 25 | model.add(Activation('relu')) 26 | model.add(Conv2D(64, (3, 3))) 27 | model.add(BatchNormalization()) 28 | model.add(Activation('relu')) 29 | model.add(MaxPooling2D(pool_size=(2, 2))) 30 | model.add(Dropout(0.3)) 31 | 32 | model.add(Conv2D(128, (3, 3), padding='same')) 33 | model.add(BatchNormalization()) 34 | model.add(Activation('relu')) 35 | model.add(Conv2D(128, (3, 3))) 36 | model.add(BatchNormalization()) 37 | model.add(Activation('relu')) 38 | model.add(MaxPooling2D(pool_size=(2, 2))) 39 | model.add(Dropout(0.4)) 40 | 41 | model.add(Flatten()) 42 | model.add(Dense(512)) 43 | model.add(Activation('relu')) 44 | model.add(Dropout(0.5)) 45 | model.add(Dense(NUM_CLASSES)) 46 | model.add(Activation('softmax')) 47 | 48 | if mpi: 49 | size = hvd.size() 50 | 51 | if optimizer.lower() == 'sgd': 52 | opt = SGD(lr=learning_rate * size, decay=weight_decay, momentum=momentum) 53 | elif optimizer.lower() == 'rmsprop': 54 | opt = RMSprop(lr=learning_rate * size, decay=weight_decay) 55 | else: 56 | opt = Adam(lr=learning_rate * size, decay=weight_decay) 57 | 58 | if mpi: 59 | opt = hvd.DistributedOptimizer(opt) 60 | 61 | model.compile(loss='categorical_crossentropy', 62 | optimizer=opt, 63 | metrics=['accuracy']) 64 | 65 | return model 66 | 67 | -------------------------------------------------------------------------------- /tf-distribution-options/code/requirements.txt: -------------------------------------------------------------------------------- 1 | Pillow 2 | numpy 3 | -------------------------------------------------------------------------------- /tf-distribution-options/code/train_hvd.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import codecs 3 | import json 4 | import logging 5 | import numpy as np 6 | import os 7 | import re 8 | 9 | import tensorflow as tf 10 | import 
tensorflow.keras.backend as K 11 | from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint 12 | 13 | from model_def import get_model, HEIGHT, WIDTH, DEPTH, NUM_CLASSES 14 | from utilities import process_input 15 | 16 | 17 | logging.getLogger().setLevel(logging.INFO) 18 | tf.logging.set_verbosity(tf.logging.ERROR) 19 | 20 | 21 | # Copy inference pre/post-processing script so it will be included in the model package 22 | os.system('mkdir /opt/ml/model/code') 23 | os.system('cp inference.py /opt/ml/model/code') 24 | os.system('cp requirements.txt /opt/ml/model/code') 25 | 26 | 27 | class CustomTensorBoardCallback(TensorBoard): 28 | 29 | def on_batch_end(self, batch, logs=None): 30 | pass 31 | 32 | 33 | def save_history(path, history): 34 | 35 | history_for_json = {} 36 | # transform float values that aren't json-serializable 37 | for key in list(history.history.keys()): 38 | if type(history.history[key]) == np.ndarray: 39 | history_for_json[key] = history.history[key].tolist() 40 | elif type(history.history[key]) == list: 41 | if type(history.history[key][0]) == np.float32 or type(history.history[key][0]) == np.float64: 42 | history_for_json[key] = list(map(float, history.history[key])) 43 | 44 | with codecs.open(path, 'w', encoding='utf-8') as f: 45 | json.dump(history_for_json, f, separators=(',', ':'), sort_keys=True, indent=4) 46 | 47 | 48 | def save_model(model, output): 49 | 50 | # create a TensorFlow SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 51 | tf.contrib.saved_model.save_keras_model(model, args.model_dir) 52 | logging.info("Model successfully saved at: {}".format(output)) 53 | return 54 | 55 | 56 | def main(args): 57 | 58 | mpi = False 59 | if 'sourcedir.tar.gz' in args.tensorboard_dir: 60 | tensorboard_dir = re.sub('source/sourcedir.tar.gz', 'model', args.tensorboard_dir) 61 | else: 62 | tensorboard_dir = args.tensorboard_dir 63 | logging.info("Writing TensorBoard logs to {}".format(tensorboard_dir)) 64 | 65 | if 'sagemaker_mpi_enabled' in args.fw_params: 66 | if args.fw_params['sagemaker_mpi_enabled']: 67 | import horovod.tensorflow.keras as hvd 68 | mpi = True 69 | # Horovod: initialize Horovod.
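# When MPI is enabled, SageMaker launches multiple copies of this script (typically one per GPU).
# After hvd.init(), hvd.size() is the total number of workers across all hosts and hvd.local_rank()
# identifies this process's GPU on the current host. Since every worker consumes batch_size records
# per step, the effective global batch size is batch_size * hvd.size(); get_model() scales the base
# learning rate by the same factor to compensate.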
70 | hvd.init() 71 | 72 | # Horovod: pin GPU to be used to process local rank (one GPU per process) 73 | config = tf.ConfigProto() 74 | config.gpu_options.allow_growth = True 75 | config.gpu_options.visible_device_list = str(hvd.local_rank()) 76 | K.set_session(tf.Session(config=config)) 77 | else: 78 | hvd = None 79 | 80 | logging.info("Running with MPI={}".format(mpi)) 81 | logging.info("getting data") 82 | train_dataset = process_input(args.epochs, args.batch_size, args.train, 'train', args.data_config) 83 | eval_dataset = process_input(args.epochs, args.batch_size, args.eval, 'eval', args.data_config) 84 | validation_dataset = process_input(args.epochs, args.batch_size, args.validation, 'validation', args.data_config) 85 | 86 | logging.info("configuring model") 87 | model = get_model(args.learning_rate, args.weight_decay, args.optimizer, args.momentum, 1, mpi, hvd) 88 | callbacks = [] 89 | if mpi: 90 | callbacks.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0)) 91 | callbacks.append(hvd.callbacks.MetricAverageCallback()) 92 | callbacks.append(hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1)) 93 | callbacks.append(tf.keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1)) 94 | if hvd.rank() == 0: 95 | callbacks.append(ModelCheckpoint(args.output_data_dir + '/checkpoint-{epoch}.h5')) 96 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 97 | else: 98 | callbacks.append(ModelCheckpoint(args.output_data_dir + '/checkpoint-{epoch}.h5')) 99 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 100 | 101 | logging.info("Starting training") 102 | size = 1 103 | if mpi: 104 | size = hvd.size() 105 | 106 | history = model.fit(x=train_dataset[0], 107 | y=train_dataset[1], 108 | steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size) // size, 109 | epochs=args.epochs, 110 | validation_data=validation_dataset, 111 | validation_steps=(num_examples_per_epoch('validation') // args.batch_size) // size, 112 | callbacks=callbacks) 113 | 114 | score = model.evaluate(eval_dataset[0], 115 | eval_dataset[1], 116 | steps=num_examples_per_epoch('eval') // args.batch_size, 117 | verbose=0) 118 | 119 | logging.info('Test loss:{}'.format(score[0])) 120 | logging.info('Test accuracy:{}'.format(score[1])) 121 | 122 | # Horovod: Save model and history only on worker 0 (i.e. 
master) 123 | if mpi: 124 | if hvd.rank() == 0: 125 | save_history(args.model_dir + "/hvd_history.p", history) 126 | return save_model(model, args.model_output_dir) 127 | else: 128 | save_history(args.model_dir + "/hvd_history.p", history) 129 | return save_model(model, args.model_output_dir) 130 | 131 | 132 | def num_examples_per_epoch(subset='train'): 133 | if subset == 'train': 134 | return 40000 135 | elif subset == 'validation': 136 | return 10000 137 | elif subset == 'eval': 138 | return 10000 139 | else: 140 | raise ValueError('Invalid data subset "%s"' % subset) 141 | 142 | 143 | if __name__ == '__main__': 144 | 145 | parser = argparse.ArgumentParser() 146 | 147 | parser.add_argument('--train',type=str,required=False,default=os.environ.get('SM_CHANNEL_TRAIN')) 148 | parser.add_argument('--validation',type=str,required=False,default=os.environ.get('SM_CHANNEL_VALIDATION')) 149 | parser.add_argument('--eval',type=str,required=False,default=os.environ.get('SM_CHANNEL_EVAL')) 150 | parser.add_argument('--model_dir',type=str,required=True,help='The directory where the model will be stored.') 151 | parser.add_argument('--model_output_dir',type=str,default=os.environ.get('SM_MODEL_DIR')) 152 | parser.add_argument('--output_data_dir',type=str,default=os.environ.get('SM_OUTPUT_DATA_DIR')) 153 | parser.add_argument('--output-dir',type=str,default=os.environ.get('SM_OUTPUT_DIR')) 154 | parser.add_argument('--tensorboard-dir',type=str,default=os.environ.get('SM_MODULE_DIR')) 155 | parser.add_argument('--weight-decay',type=float,default=2e-4,help='Weight decay for convolutions.') 156 | parser.add_argument('--learning-rate',type=float,default=0.001,help='Initial learning rate.') 157 | parser.add_argument('--epochs',type=int,default=10) 158 | parser.add_argument('--batch-size',type=int,default=128) 159 | parser.add_argument('--data-config',type=json.loads,default=os.environ.get('SM_INPUT_DATA_CONFIG')) 160 | parser.add_argument('--fw-params',type=json.loads,default=os.environ.get('SM_FRAMEWORK_PARAMS')) 161 | parser.add_argument('--optimizer',type=str,default='adam') 162 | parser.add_argument('--momentum',type=float,default='0.9') 163 | 164 | args = parser.parse_args() 165 | 166 | main(args) 167 | -------------------------------------------------------------------------------- /tf-distribution-options/code/train_ps.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import codecs 3 | import json 4 | import logging 5 | import numpy as np 6 | import os 7 | import re 8 | 9 | import tensorflow as tf 10 | import tensorflow.keras.backend as K 11 | from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint 12 | 13 | from model_def import get_model, HEIGHT, WIDTH, DEPTH, NUM_CLASSES 14 | from utilities import process_input 15 | 16 | 17 | logging.getLogger().setLevel(logging.INFO) 18 | tf.logging.set_verbosity(tf.logging.ERROR) 19 | 20 | 21 | # Copy inference pre/post-processing script so it will be included in the model package 22 | os.system('mkdir /opt/ml/model/code') 23 | os.system('cp inference.py /opt/ml/model/code') 24 | os.system('cp requirements.txt /opt/ml/model/code') 25 | 26 | 27 | class CustomTensorBoardCallback(TensorBoard): 28 | def on_batch_end(self, batch, logs=None): 29 | pass 30 | 31 | 32 | def save_history(path, history): 33 | 34 | history_for_json = {} 35 | # transform float values that aren't json-serializable 36 | for key in list(history.history.keys()): 37 | if type(history.history[key]) == np.ndarray: 38 | 
history_for_json[key] = history.history[key].tolist() 39 | elif type(history.history[key]) == list: 40 | if type(history.history[key][0]) == np.float32 or type(history.history[key][0]) == np.float64: 41 | history_for_json[key] = list(map(float, history.history[key])) 42 | 43 | with codecs.open(path, 'w', encoding='utf-8') as f: 44 | json.dump(history_for_json, f, separators=(',', ':'), sort_keys=True, indent=4) 45 | 46 | 47 | def save_model(model, output): 48 | 49 | # create a TensorFlow SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 50 | tf.contrib.saved_model.save_keras_model(model, args.model_dir) 51 | logging.info("Model successfully saved at: {}".format(output)) 52 | return 53 | 54 | 55 | def main(args): 56 | 57 | if 'sourcedir.tar.gz' in args.tensorboard_dir: 58 | tensorboard_dir = re.sub('source/sourcedir.tar.gz', 'model', args.tensorboard_dir) 59 | else: 60 | tensorboard_dir = args.tensorboard_dir 61 | 62 | logging.info("Writing TensorBoard logs to {}".format(tensorboard_dir)) 63 | 64 | logging.info("getting data") 65 | train_dataset = process_input(args.epochs, args.batch_size, args.train, 'train', args.data_config) 66 | eval_dataset = process_input(args.epochs, args.batch_size, args.eval, 'eval', args.data_config) 67 | validation_dataset = process_input(args.epochs, args.batch_size, args.validation, 'validation', args.data_config) 68 | 69 | logging.info("configuring model") 70 | logging.info("Hosts: "+ os.environ.get('SM_HOSTS')) 71 | 72 | size = len(args.hosts) 73 | 74 | # size (the number of hosts) is passed to get_model(), which scales the learning rate by it 75 | model = get_model(args.learning_rate, args.weight_decay, args.optimizer, args.momentum, size) 76 | callbacks = [] 77 | if args.current_host == args.hosts[0]: 78 | callbacks.append(ModelCheckpoint(args.output_data_dir + '/checkpoint-{epoch}.h5')) 79 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 80 | 81 | logging.info("Starting training") 82 | 83 | history = model.fit(x=train_dataset[0], 84 | y=train_dataset[1], 85 | steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size) // size, 86 | epochs=args.epochs, 87 | validation_data=validation_dataset, 88 | validation_steps=(num_examples_per_epoch('validation') // args.batch_size) // size, callbacks=callbacks) 89 | 90 | score = model.evaluate(eval_dataset[0], 91 | eval_dataset[1], 92 | steps=num_examples_per_epoch('eval') // args.batch_size, 93 | verbose=0) 94 | 95 | logging.info('Test loss:{}'.format(score[0])) 96 | logging.info('Test accuracy:{}'.format(score[1])) 97 | 98 | # PS: Save model and history only on worker 0 99 | if args.current_host == args.hosts[0]: 100 | save_history(args.model_dir + "/ps_history.p", history) 101 | save_model(model, args.model_dir) 102 | 103 | 104 | def num_examples_per_epoch(subset='train'): 105 | 106 | if subset == 'train': 107 | return 40000 108 | elif subset == 'validation': 109 | return 10000 110 | elif subset == 'eval': 111 | return 10000 112 | else: 113 | raise ValueError('Invalid data subset "%s"' % subset) 114 | 115 | 116 | if __name__ == '__main__': 117 | 118 | parser = argparse.ArgumentParser() 119 | 120 | parser.add_argument('--hosts',type=list,default=json.loads(os.environ.get('SM_HOSTS'))) 121 | parser.add_argument('--current-host',type=str,default=os.environ.get('SM_CURRENT_HOST')) 122 | parser.add_argument('--train',type=str,required=False,default=os.environ.get('SM_CHANNEL_TRAIN')) 123 | parser.add_argument('--validation',type=str,required=False,default=os.environ.get('SM_CHANNEL_VALIDATION')) 124 |
parser.add_argument('--eval',type=str,required=False,default=os.environ.get('SM_CHANNEL_EVAL')) 125 | parser.add_argument('--model_dir',type=str,required=True,help='The directory where the model will be stored.') 126 | parser.add_argument('--model_output_dir',type=str,default=os.environ.get('SM_MODEL_DIR')) 127 | parser.add_argument('--output_data_dir',type=str,default=os.environ.get('SM_OUTPUT_DATA_DIR')) 128 | parser.add_argument('--output-dir',type=str,default=os.environ.get('SM_OUTPUT_DIR')) 129 | parser.add_argument('--tensorboard-dir',type=str,default=os.environ.get('SM_MODULE_DIR')) 130 | parser.add_argument('--weight-decay',type=float,default=2e-4,help='Weight decay for convolutions.') 131 | parser.add_argument('--learning-rate',type=float,default=0.001,help='Initial learning rate.') 132 | parser.add_argument('--epochs',type=int,default=10) 133 | parser.add_argument('--batch-size',type=int,default=128) 134 | parser.add_argument('--data-config',type=json.loads,default=os.environ.get('SM_INPUT_DATA_CONFIG')) 135 | parser.add_argument('--fw-params',type=json.loads,default=os.environ.get('SM_FRAMEWORK_PARAMS')) 136 | parser.add_argument('--optimizer',type=str,default='adam') 137 | parser.add_argument('--momentum',type=float,default='0.9') 138 | 139 | args = parser.parse_args() 140 | 141 | main(args) 142 | -------------------------------------------------------------------------------- /tf-distribution-options/code/utilities.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import os 3 | import tensorflow as tf 4 | 5 | from model_def import HEIGHT, WIDTH, DEPTH, NUM_CLASSES 6 | 7 | 8 | NUM_DATA_BATCHES = 5 9 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES 10 | 11 | 12 | def _get_filenames(channel_name, channel): 13 | if channel_name in ['train', 'validation', 'eval']: 14 | return [os.path.join(channel, channel_name + '.tfrecords')] 15 | else: 16 | raise ValueError('Invalid data subset "%s"' % channel_name) 17 | 18 | 19 | def _train_preprocess_fn(image): 20 | 21 | # Resize the image to add four extra pixels on each side. 22 | image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8) 23 | 24 | # Randomly crop a [HEIGHT, WIDTH] section of the image. 25 | image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH]) 26 | 27 | # Randomly flip the image horizontally. 28 | image = tf.image.random_flip_left_right(image) 29 | 30 | return image 31 | 32 | 33 | def _dataset_parser(value): 34 | 35 | featdef = { 36 | 'image': tf.FixedLenFeature([], tf.string), 37 | 'label': tf.FixedLenFeature([], tf.int64), 38 | } 39 | 40 | example = tf.parse_single_example(value, featdef) 41 | image = tf.decode_raw(example['image'], tf.uint8) 42 | image.set_shape([DEPTH * HEIGHT * WIDTH]) 43 | 44 | # Reshape from [depth * height * width] to [depth, height, width]. 45 | image = tf.cast( 46 | tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]), 47 | tf.float32) 48 | label = tf.cast(example['label'], tf.int32) 49 | image = _train_preprocess_fn(image) 50 | return image, tf.one_hot(label, NUM_CLASSES) 51 | 52 | 53 | def process_input(epochs, batch_size, channel, channel_name, data_config): 54 | 55 | mode = data_config[channel_name]['TrainingInputMode'] 56 | filenames = _get_filenames(channel_name, channel) 57 | # Repeat infinitely. 
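# In Pipe mode the records are streamed from S3 via PipeModeDataset rather than read from local
# files, so filenames is only used by the File mode branch below.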
58 | logging.info("Running {} in {} mode".format(channel_name, mode)) 59 | if mode == 'Pipe': 60 | from sagemaker_tensorflow import PipeModeDataset 61 | dataset = PipeModeDataset(channel=channel_name, record_format='TFRecord') 62 | else: 63 | dataset = tf.data.TFRecordDataset(filenames) 64 | 65 | dataset = dataset.repeat(epochs) 66 | dataset = dataset.prefetch(10) 67 | 68 | # Parse records. 69 | dataset = dataset.map( 70 | _dataset_parser, num_parallel_calls=10) 71 | 72 | # Potentially shuffle records. 73 | if channel_name == 'train': 74 | # Ensure that the capacity is sufficiently large to provide good random 75 | # shuffling. 76 | buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size 77 | dataset = dataset.shuffle(buffer_size=buffer_size) 78 | 79 | # Batch it up. 80 | dataset = dataset.batch(batch_size, drop_remainder=True) 81 | iterator = dataset.make_one_shot_iterator() 82 | image_batch, label_batch = iterator.get_next() 83 | 84 | return image_batch, label_batch 85 | 86 | 87 | -------------------------------------------------------------------------------- /tf-distribution-options/generate_cifar10_tfrecords.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """Read CIFAR-10 data from pickled numpy arrays and writes TFRecords. 16 | 17 | Generates tf.train.Example protos and writes them to TFRecord files from the 18 | python version of the CIFAR-10 dataset downloaded from 19 | https://www.cs.toronto.edu/~kriz/cifar.html. 20 | """ 21 | 22 | from __future__ import absolute_import 23 | from __future__ import division 24 | from __future__ import print_function 25 | 26 | import argparse 27 | import os 28 | import sys 29 | 30 | import tarfile 31 | from six.moves import cPickle as pickle 32 | from six.moves import xrange # pylint: disable=redefined-builtin 33 | import tensorflow as tf 34 | 35 | tf.logging.set_verbosity(tf.logging.ERROR) 36 | if type(tf.contrib) != type(tf): tf.contrib._warning = None 37 | 38 | CIFAR_FILENAME = 'cifar-10-python.tar.gz' 39 | CIFAR_DOWNLOAD_URL = 'https://www.cs.toronto.edu/~kriz/' + CIFAR_FILENAME 40 | CIFAR_LOCAL_FOLDER = 'cifar-10-batches-py' 41 | 42 | 43 | def download_and_extract(data_dir): 44 | # download CIFAR-10 if not already downloaded. 
45 | tf.contrib.learn.datasets.base.maybe_download(CIFAR_FILENAME, data_dir, 46 | CIFAR_DOWNLOAD_URL) 47 | tarfile.open(os.path.join(data_dir, CIFAR_FILENAME), 48 | 'r:gz').extractall(data_dir) 49 | 50 | 51 | def _int64_feature(value): 52 | return tf.train.Feature(int64_list=tf.train.Int64List(value=[value])) 53 | 54 | 55 | def _bytes_feature(value): 56 | return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value])) 57 | 58 | 59 | def _get_file_names(): 60 | """Returns the file names expected to exist in the input_dir.""" 61 | file_names = {} 62 | file_names['train'] = ['data_batch_%d' % i for i in xrange(1, 5)] 63 | file_names['validation'] = ['data_batch_5'] 64 | file_names['eval'] = ['test_batch'] 65 | return file_names 66 | 67 | 68 | def read_pickle_from_file(filename): 69 | with tf.gfile.Open(filename, 'rb') as f: 70 | if sys.version_info >= (3, 0): 71 | data_dict = pickle.load(f, encoding='bytes') 72 | else: 73 | data_dict = pickle.load(f) 74 | return data_dict 75 | 76 | 77 | def convert_to_tfrecord(input_files, output_file): 78 | """Converts a file to TFRecords.""" 79 | print('Generating %s' % output_file) 80 | with tf.python_io.TFRecordWriter(output_file) as record_writer: 81 | for input_file in input_files: 82 | data_dict = read_pickle_from_file(input_file) 83 | data = data_dict[b'data'] 84 | labels = data_dict[b'labels'] 85 | 86 | num_entries_in_batch = len(labels) 87 | for i in range(num_entries_in_batch): 88 | example = tf.train.Example(features=tf.train.Features( 89 | feature={ 90 | 'image': _bytes_feature(data[i].tobytes()), 91 | 'label': _int64_feature(labels[i]) 92 | })) 93 | record_writer.write(example.SerializeToString()) 94 | 95 | 96 | def main(data_dir): 97 | print('Download from {} and extract.'.format(CIFAR_DOWNLOAD_URL)) 98 | download_and_extract(data_dir) 99 | file_names = _get_file_names() 100 | input_dir = os.path.join(data_dir, CIFAR_LOCAL_FOLDER) 101 | for mode, files in file_names.items(): 102 | input_files = [os.path.join(input_dir, f) for f in files] 103 | output_file = os.path.join(data_dir+'/'+mode, mode + '.tfrecords') 104 | if not os.path.exists(data_dir+'/'+mode): 105 | os.makedirs(data_dir+'/'+mode) 106 | try: 107 | os.remove(output_file) 108 | except OSError: 109 | pass 110 | # Convert to tf.train.Example and write the to TFRecords. 
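        # Each serialized Example holds the raw 32x32x3 image as a uint8 byte
        # string under 'image' and the class id as an int64 under 'label',
        # matching what _dataset_parser in the training scripts decodes with
        # tf.decode_raw and turns into a one-hot label.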
111 | convert_to_tfrecord(input_files, output_file) 112 | print('Done!') 113 | import shutil 114 | shutil.rmtree(data_dir+'/cifar-10-batches-py') 115 | os.remove(data_dir+'/cifar-10-python.tar.gz') 116 | 117 | 118 | if __name__ == '__main__': 119 | parser = argparse.ArgumentParser() 120 | parser.add_argument( 121 | '--data-dir', 122 | type=str, 123 | default='', 124 | help='Directory to download and extract CIFAR-10 to.') 125 | 126 | args = parser.parse_args() 127 | main(args.data_dir) 128 | -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1000_dog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1000_dog.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1001_airplane.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1001_airplane.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1003_deer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1003_deer.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1004_ship.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1004_ship.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1005_automobile.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1005_automobile.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1008_truck.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1008_truck.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1009_frog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1009_frog.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1014_cat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1014_cat.png 
-------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1037_horse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1037_horse.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1038_bird.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1038_bird.png -------------------------------------------------------------------------------- /tf-eager-script-mode/train_model/model_def.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | def get_model(): 5 | 6 | inputs = tf.keras.Input(shape=(13,)) 7 | hidden_1 = tf.keras.layers.Dense(13, activation='tanh')(inputs) 8 | hidden_2 = tf.keras.layers.Dense(6, activation='sigmoid')(hidden_1) 9 | outputs = tf.keras.layers.Dense(1)(hidden_2) 10 | return tf.keras.Model(inputs=inputs, outputs=outputs) 11 | -------------------------------------------------------------------------------- /tf-eager-script-mode/train_model/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import tensorflow as tf 5 | from tensorflow.contrib.eager.python import tfe 6 | 7 | from model_def import get_model 8 | 9 | 10 | tf.enable_eager_execution() 11 | tf.set_random_seed(0) 12 | np.random.seed(0) 13 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 14 | 15 | 16 | def parse_args(): 17 | 18 | parser = argparse.ArgumentParser() 19 | 20 | # hyperparameters sent by the client are passed as command-line arguments to the script 21 | parser.add_argument('--epochs', type=int, default=1) 22 | parser.add_argument('--batch_size', type=int, default=64) 23 | parser.add_argument('--learning_rate', type=float, default=0.1) 24 | 25 | # data directories 26 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 27 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 28 | 29 | # model directory: we will use the default set by SageMaker, /opt/ml/model 30 | parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR')) 31 | 32 | return parser.parse_known_args() 33 | 34 | 35 | def get_train_data(train_dir): 36 | 37 | x_train = np.load(os.path.join(train_dir, 'x_train.npy')) 38 | y_train = np.load(os.path.join(train_dir, 'y_train.npy')) 39 | print('x train', x_train.shape,'y train', y_train.shape) 40 | 41 | return x_train, y_train 42 | 43 | 44 | def get_test_data(test_dir): 45 | 46 | x_test = np.load(os.path.join(test_dir, 'x_test.npy')) 47 | y_test = np.load(os.path.join(test_dir, 'y_test.npy')) 48 | print('x test', x_test.shape,'y test', y_test.shape) 49 | 50 | return x_test, y_test 51 | 52 | 53 | if __name__ == "__main__": 54 | 55 | args, _ = parse_args() 56 | 57 | x_train, y_train = get_train_data(args.train) 58 | x_test, y_test = get_test_data(args.test) 59 | 60 | device = '/cpu:0' 61 | print(device) 62 | batch_size = args.batch_size 63 | epochs = args.epochs 64 | learning_rate = args.learning_rate 65 | print('batch_size = {}, epochs = {}, learning rate = 
{}'.format(batch_size, epochs, learning_rate)) 66 | 67 | with tf.device(device): 68 | 69 | model = get_model() 70 | optimizer = tf.train.GradientDescentOptimizer(learning_rate) 71 | model.compile(optimizer=optimizer, loss='mse') 72 | model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, 73 | validation_data=(x_test, y_test)) 74 | 75 | # evaluate on test set 76 | scores = model.evaluate(x_test, y_test, batch_size, verbose=2) 77 | print("Test MSE :", scores) 78 | 79 | # save checkpoint for locally loading in notebook 80 | saver = tfe.Saver(model.variables) 81 | saver.save(args.model_dir + '/weights.ckpt') 82 | # create a separate SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 83 | tf.contrib.saved_model.save_keras_model(model, args.model_dir) 84 | 85 | 86 | -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/generate_cifar10_tfrecords.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """Read CIFAR-10 data from pickled numpy arrays and writes TFRecords. 16 | 17 | Generates tf.train.Example protos and writes them to TFRecord files from the 18 | python version of the CIFAR-10 dataset downloaded from 19 | https://www.cs.toronto.edu/~kriz/cifar.html. 20 | """ 21 | 22 | from __future__ import absolute_import 23 | from __future__ import division 24 | from __future__ import print_function 25 | 26 | import argparse 27 | import os 28 | import sys 29 | 30 | import tarfile 31 | from six.moves import cPickle as pickle 32 | from six.moves import xrange # pylint: disable=redefined-builtin 33 | import tensorflow as tf 34 | 35 | tf.logging.set_verbosity(tf.logging.ERROR) 36 | if type(tf.contrib) != type(tf): tf.contrib._warning = None 37 | 38 | CIFAR_FILENAME = 'cifar-10-python.tar.gz' 39 | CIFAR_DOWNLOAD_URL = 'https://www.cs.toronto.edu/~kriz/' + CIFAR_FILENAME 40 | CIFAR_LOCAL_FOLDER = 'cifar-10-batches-py' 41 | 42 | 43 | def download_and_extract(data_dir): 44 | # download CIFAR-10 if not already downloaded. 
45 | tf.contrib.learn.datasets.base.maybe_download(CIFAR_FILENAME, data_dir, 46 | CIFAR_DOWNLOAD_URL) 47 | tarfile.open(os.path.join(data_dir, CIFAR_FILENAME), 48 | 'r:gz').extractall(data_dir) 49 | 50 | 51 | def _int64_feature(value): 52 | return tf.train.Feature(int64_list=tf.train.Int64List(value=[value])) 53 | 54 | 55 | def _bytes_feature(value): 56 | return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value])) 57 | 58 | 59 | def _get_file_names(): 60 | """Returns the file names expected to exist in the input_dir.""" 61 | file_names = {} 62 | file_names['train'] = ['data_batch_%d' % i for i in xrange(1, 5)] 63 | file_names['validation'] = ['data_batch_5'] 64 | file_names['eval'] = ['test_batch'] 65 | return file_names 66 | 67 | 68 | def read_pickle_from_file(filename): 69 | with tf.gfile.Open(filename, 'rb') as f: 70 | if sys.version_info >= (3, 0): 71 | data_dict = pickle.load(f, encoding='bytes') 72 | else: 73 | data_dict = pickle.load(f) 74 | return data_dict 75 | 76 | 77 | def convert_to_tfrecord(input_files, output_file): 78 | """Converts a file to TFRecords.""" 79 | print('Generating %s' % output_file) 80 | with tf.python_io.TFRecordWriter(output_file) as record_writer: 81 | for input_file in input_files: 82 | data_dict = read_pickle_from_file(input_file) 83 | data = data_dict[b'data'] 84 | labels = data_dict[b'labels'] 85 | 86 | num_entries_in_batch = len(labels) 87 | for i in range(num_entries_in_batch): 88 | example = tf.train.Example(features=tf.train.Features( 89 | feature={ 90 | 'image': _bytes_feature(data[i].tobytes()), 91 | 'label': _int64_feature(labels[i]) 92 | })) 93 | record_writer.write(example.SerializeToString()) 94 | 95 | 96 | def main(data_dir): 97 | print('Download from {} and extract.'.format(CIFAR_DOWNLOAD_URL)) 98 | download_and_extract(data_dir) 99 | file_names = _get_file_names() 100 | input_dir = os.path.join(data_dir, CIFAR_LOCAL_FOLDER) 101 | for mode, files in file_names.items(): 102 | input_files = [os.path.join(input_dir, f) for f in files] 103 | output_file = os.path.join(data_dir+'/'+mode, mode + '.tfrecords') 104 | if not os.path.exists(data_dir+'/'+mode): 105 | os.makedirs(data_dir+'/'+mode) 106 | try: 107 | os.remove(output_file) 108 | except OSError: 109 | pass 110 | # Convert to tf.train.Example and write the to TFRecords. 
111 | convert_to_tfrecord(input_files, output_file) 112 | print('Done!') 113 | import shutil 114 | shutil.rmtree(data_dir+'/cifar-10-batches-py') 115 | os.remove(data_dir+'/cifar-10-python.tar.gz') 116 | 117 | 118 | if __name__ == '__main__': 119 | parser = argparse.ArgumentParser() 120 | parser.add_argument( 121 | '--data-dir', 122 | type=str, 123 | default='', 124 | help='Directory to download and extract CIFAR-10 to.') 125 | 126 | args = parser.parse_args() 127 | main(args.data_dir) 128 | -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/image-transformer-container/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.6 2 | 3 | LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true 4 | 5 | RUN pip install flask gunicorn numpy Pillow 6 | 7 | # Add flask app directory 8 | COPY ./app /app 9 | WORKDIR /app 10 | 11 | # Copy entrypoint file and make it executable 12 | COPY entrypoint.sh /entrypoint.sh 13 | RUN chmod +x /entrypoint.sh 14 | 15 | ENTRYPOINT ["/entrypoint.sh"] 16 | -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/image-transformer-container/app/main.py: -------------------------------------------------------------------------------- 1 | import io 2 | import json 3 | import numpy as np 4 | import struct 5 | 6 | from flask import Flask, Response, request 7 | from PIL import Image 8 | 9 | 10 | app = Flask(__name__) 11 | 12 | 13 | def read_image(image_from_s3): 14 | 15 | image_as_bytes = io.BytesIO(image_from_s3) 16 | image = Image.open(image_as_bytes) 17 | instance = np.expand_dims(image, axis=0) 18 | 19 | return instance.tolist() 20 | 21 | 22 | @app.route("/invocations", methods=['POST']) 23 | def invocations(): 24 | 25 | try: 26 | image_for_JSON = read_image(request.data) 27 | # TensorFlow Serving's REST API requires a JSON-formatted request 28 | response = Response(json.dumps({"instances": image_for_JSON})) 29 | response.headers['Content-Type'] = "application/json" 30 | return response 31 | except ValueError as err: 32 | return str(err), 400 33 | 34 | 35 | @app.route("/ping", methods=['GET']) 36 | def ping(): 37 | 38 | return "", 200 39 | -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/image-transformer-container/ecr_policy.json: -------------------------------------------------------------------------------- 1 | { 2 | "Version": "2008-10-17", 3 | "Statement": [ 4 | { 5 | "Sid": "allowSageMakerToPull", 6 | "Effect": "Allow", 7 | "Principal": { 8 | "Service": "sagemaker.amazonaws.com" 9 | }, 10 | "Action": [ 11 | "ecr:GetDownloadUrlForLayer", 12 | "ecr:BatchGetImage", 13 | "ecr:BatchCheckLayerAvailability" 14 | ] 15 | } 16 | ] 17 | } 18 | -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/image-transformer-container/entrypoint.sh: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env bash 2 | set -e 3 | 4 | # Get the listen port from the SM env variable, otherwise default to 8080 5 | LISTEN_PORT=${SAGEMAKER_BIND_TO_PORT:-8080} 6 | 7 | # Set the number of gunicorn worker processes 8 | GUNICORN_WORKER_COUNT=$(nproc) 9 | 10 | PYTHONUNBUFFERED=1 11 | 12 | # Start flask app 13 | exec gunicorn -w $GUNICORN_WORKER_COUNT -b 0.0.0.0:$LISTEN_PORT main:app 14 | -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1000_dog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1000_dog.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1001_airplane.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1001_airplane.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1003_deer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1003_deer.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1004_ship.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1004_ship.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1005_automobile.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1005_automobile.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1008_truck.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1008_truck.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1009_frog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1009_frog.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1014_cat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1014_cat.png 
-------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1037_horse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1037_horse.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1038_bird.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1038_bird.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/train.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). 4 | # You may not use this file except in compliance with the License. 5 | # A copy of the License is located at 6 | # 7 | # https://aws.amazon.com/apache-2-0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is distributed 10 | # on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either 11 | # express or implied. See the License for the specific language governing 12 | # permissions and limitations under the License. 13 | 14 | 15 | import argparse 16 | import json 17 | import logging 18 | import os 19 | import re 20 | 21 | import tensorflow as tf 22 | import tensorflow.keras.backend as K 23 | from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint 24 | from tensorflow.keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization 25 | from tensorflow.keras.models import Sequential 26 | from tensorflow.keras.optimizers import Adam, SGD, RMSprop 27 | 28 | logging.getLogger().setLevel(logging.INFO) 29 | tf.logging.set_verbosity(tf.logging.ERROR) 30 | HEIGHT = 32 31 | WIDTH = 32 32 | DEPTH = 3 33 | NUM_CLASSES = 10 34 | NUM_DATA_BATCHES = 5 35 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES 36 | INPUT_TENSOR_NAME = 'inputs_input' # needs to match the name of the first layer + "_input" 37 | 38 | 39 | def keras_model_fn(learning_rate, weight_decay, optimizer, momentum, mpi=False, hvd=False): 40 | 41 | model = Sequential() 42 | model.add(Conv2D(32, (3, 3), padding='same', name='inputs', input_shape=(HEIGHT, WIDTH, DEPTH))) 43 | model.add(BatchNormalization()) 44 | model.add(Activation('relu')) 45 | model.add(Conv2D(32, (3, 3))) 46 | model.add(BatchNormalization()) 47 | model.add(Activation('relu')) 48 | model.add(MaxPooling2D(pool_size=(2, 2))) 49 | model.add(Dropout(0.2)) 50 | 51 | model.add(Conv2D(64, (3, 3), padding='same')) 52 | model.add(BatchNormalization()) 53 | model.add(Activation('relu')) 54 | model.add(Conv2D(64, (3, 3))) 55 | model.add(BatchNormalization()) 56 | model.add(Activation('relu')) 57 | model.add(MaxPooling2D(pool_size=(2, 2))) 58 | model.add(Dropout(0.3)) 59 | 60 | model.add(Conv2D(128, (3, 3), padding='same')) 61 | model.add(BatchNormalization()) 62 | model.add(Activation('relu')) 63 | model.add(Conv2D(128, (3, 3))) 64 | model.add(BatchNormalization()) 65 | model.add(Activation('relu')) 66 | model.add(MaxPooling2D(pool_size=(2, 2))) 67 | 
model.add(Dropout(0.4)) 68 | 69 | model.add(Flatten()) 70 | model.add(Dense(512)) 71 | model.add(Activation('relu')) 72 | model.add(Dropout(0.5)) 73 | model.add(Dense(NUM_CLASSES)) 74 | model.add(Activation('softmax')) 75 | 76 | size = 1 77 | if mpi: 78 | size = hvd.size() 79 | 80 | if optimizer.lower() == 'sgd': 81 | opt = SGD(lr=learning_rate * size, decay=weight_decay, momentum=momentum) 82 | elif optimizer.lower() == 'rmsprop': 83 | opt = RMSprop(lr=learning_rate * size, decay=weight_decay) 84 | else: 85 | opt = Adam(lr=learning_rate * size, decay=weight_decay) 86 | 87 | if mpi: 88 | opt = hvd.DistributedOptimizer(opt) 89 | 90 | model.compile(loss='categorical_crossentropy', 91 | optimizer=opt, 92 | metrics=['accuracy']) 93 | return model 94 | 95 | 96 | class CustomTensorBoardCallback(TensorBoard): 97 | def on_batch_end(self, batch, logs=None): 98 | pass 99 | 100 | 101 | def get_filenames(channel_name, channel): 102 | if channel_name in ['train', 'validation', 'eval']: 103 | return [os.path.join(channel, channel_name + '.tfrecords')] 104 | else: 105 | raise ValueError('Invalid data subset "%s"' % channel_name) 106 | 107 | 108 | def train_input_fn(): 109 | return _input(args.epochs, args.batch_size, args.train, 'train') 110 | 111 | 112 | def eval_input_fn(): 113 | return _input(args.epochs, args.batch_size, args.eval, 'eval') 114 | 115 | 116 | def validation_input_fn(): 117 | return _input(args.epochs, args.batch_size, args.validation, 'validation') 118 | 119 | 120 | def _input(epochs, batch_size, channel, channel_name): 121 | mode = args.data_config[channel_name]['TrainingInputMode'] 122 | """Uses the tf.data input pipeline for CIFAR-10 dataset. 123 | Args: 124 | mode: Standard names for model modes (tf.estimators.ModeKeys). 125 | batch_size: The number of samples per batch of input requested. 126 | """ 127 | filenames = get_filenames(channel_name, channel) 128 | # Repeat infinitely. 129 | logging.info("Running {} in {} mode".format(channel_name, mode)) 130 | if mode == 'Pipe': 131 | from sagemaker_tensorflow import PipeModeDataset 132 | dataset = PipeModeDataset(channel=channel_name, record_format='TFRecord') 133 | else: 134 | dataset = tf.data.TFRecordDataset(filenames) 135 | 136 | dataset = dataset.repeat(epochs) 137 | dataset = dataset.prefetch(10) 138 | 139 | # Parse records. 140 | dataset = dataset.map( 141 | _dataset_parser, num_parallel_calls=10) 142 | 143 | # Potentially shuffle records. 144 | if channel_name == 'train': 145 | # Ensure that the capacity is sufficiently large to provide good random 146 | # shuffling. 147 | buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size 148 | dataset = dataset.shuffle(buffer_size=buffer_size) 149 | 150 | # Batch it up. 151 | dataset = dataset.batch(batch_size, drop_remainder=True) 152 | iterator = dataset.make_one_shot_iterator() 153 | image_batch, label_batch = iterator.get_next() 154 | 155 | return {INPUT_TENSOR_NAME: image_batch}, label_batch 156 | 157 | 158 | def _train_preprocess_fn(image): 159 | """Preprocess a single training image of layout [height, width, depth].""" 160 | # Resize the image to add four extra pixels on each side. 161 | image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8) 162 | 163 | # Randomly crop a [HEIGHT, WIDTH] section of the image. 164 | image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH]) 165 | 166 | # Randomly flip the image horizontally. 
167 | image = tf.image.random_flip_left_right(image) 168 | 169 | return image 170 | 171 | 172 | def _dataset_parser(value): 173 | """Parse a CIFAR-10 record from value.""" 174 | featdef = { 175 | 'image': tf.FixedLenFeature([], tf.string), 176 | 'label': tf.FixedLenFeature([], tf.int64), 177 | } 178 | 179 | example = tf.parse_single_example(value, featdef) 180 | image = tf.decode_raw(example['image'], tf.uint8) 181 | image.set_shape([DEPTH * HEIGHT * WIDTH]) 182 | 183 | # Reshape from [depth * height * width] to [depth, height, width]. 184 | image = tf.cast( 185 | tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]), 186 | tf.float32) 187 | label = tf.cast(example['label'], tf.int32) 188 | image = _train_preprocess_fn(image) 189 | return image, tf.one_hot(label, NUM_CLASSES) 190 | 191 | 192 | def save_model(model, output): 193 | 194 | # create a TensorFlow SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 195 | tf.contrib.saved_model.save_keras_model(model, args.model_dir) 196 | logging.info("Model successfully saved at: {}".format(output)) 197 | return 198 | 199 | 200 | def main(args): 201 | 202 | mpi = False 203 | if 'sourcedir.tar.gz' in args.tensorboard_dir: 204 | tensorboard_dir = re.sub('source/sourcedir.tar.gz', 'model', args.tensorboard_dir) 205 | else: 206 | tensorboard_dir = args.tensorboard_dir 207 | logging.info("Writing TensorBoard logs to {}".format(tensorboard_dir)) 208 | if 'sagemaker_mpi_enabled' in args.fw_params: 209 | if args.fw_params['sagemaker_mpi_enabled']: 210 | import horovod.tensorflow.keras as hvd 211 | mpi = True 212 | # Horovod: initialize Horovod. 213 | hvd.init() 214 | 215 | # Horovod: pin GPU to be used to process local rank (one GPU per process) 216 | config = tf.ConfigProto() 217 | config.gpu_options.allow_growth = True 218 | config.gpu_options.visible_device_list = str(hvd.local_rank()) 219 | K.set_session(tf.Session(config=config)) 220 | else: 221 | hvd = None 222 | 223 | logging.info("Running with MPI={}".format(mpi)) 224 | logging.info("getting data") 225 | train_dataset = train_input_fn() 226 | eval_dataset = eval_input_fn() 227 | validation_dataset = validation_input_fn() 228 | 229 | logging.info("configuring model") 230 | model = keras_model_fn(args.learning_rate, args.weight_decay, args.optimizer, args.momentum, mpi, hvd) 231 | callbacks = [] 232 | if mpi: 233 | callbacks.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0)) 234 | callbacks.append(hvd.callbacks.MetricAverageCallback()) 235 | callbacks.append(hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1)) 236 | callbacks.append(tf.keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1)) 237 | if hvd.rank() == 0: 238 | callbacks.append(ModelCheckpoint(args.output_dir + '/checkpoint-{epoch}.h5')) 239 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 240 | else: 241 | callbacks.append(ModelCheckpoint(args.output_dir + '/checkpoint-{epoch}.h5')) 242 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 243 | logging.info("Starting training") 244 | size = 1 245 | if mpi: 246 | size = hvd.size() 247 | model.fit(x=train_dataset[0], y=train_dataset[1], 248 | steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size) // size, 249 | epochs=args.epochs, validation_data=validation_dataset, 250 | validation_steps=(num_examples_per_epoch('validation') // args.batch_size) // size, callbacks=callbacks) 251 | 252 | score = model.evaluate(eval_dataset[0], eval_dataset[1], 
steps=num_examples_per_epoch('eval') // args.batch_size, 253 | verbose=0) 254 | 255 | logging.info('Test loss:{}'.format(score[0])) 256 | logging.info('Test accuracy:{}'.format(score[1])) 257 | 258 | # Horovod: Save model only on worker 0 (i.e. master) 259 | if mpi: 260 | if hvd.rank() == 0: 261 | return save_model(model, args.model_output_dir) 262 | else: 263 | return save_model(model, args.model_output_dir) 264 | 265 | 266 | def num_examples_per_epoch(subset='train'): 267 | if subset == 'train': 268 | return 40000 269 | elif subset == 'validation': 270 | return 10000 271 | elif subset == 'eval': 272 | return 10000 273 | else: 274 | raise ValueError('Invalid data subset "%s"' % subset) 275 | 276 | 277 | if __name__ == '__main__': 278 | 279 | parser = argparse.ArgumentParser() 280 | parser.add_argument( 281 | '--train', 282 | type=str, 283 | required=False, 284 | default=os.environ.get('SM_CHANNEL_TRAIN'), 285 | help='The directory where the CIFAR-10 input data is stored.') 286 | parser.add_argument( 287 | '--validation', 288 | type=str, 289 | required=False, 290 | default=os.environ.get('SM_CHANNEL_VALIDATION'), 291 | help='The directory where the CIFAR-10 input data is stored.') 292 | parser.add_argument( 293 | '--eval', 294 | type=str, 295 | required=False, 296 | default=os.environ.get('SM_CHANNEL_EVAL'), 297 | help='The directory where the CIFAR-10 input data is stored.') 298 | parser.add_argument( 299 | '--model_dir', 300 | type=str, 301 | required=True, 302 | help='The directory where the model will be stored.') 303 | parser.add_argument( 304 | '--model_output_dir', 305 | type=str, 306 | default=os.environ.get('SM_MODEL_DIR')) 307 | parser.add_argument( 308 | '--output-dir', 309 | type=str, 310 | default=os.environ.get('SM_OUTPUT_DIR')) 311 | parser.add_argument( 312 | '--tensorboard-dir', 313 | type=str, 314 | default=os.environ.get('SM_MODULE_DIR')) 315 | parser.add_argument( 316 | '--weight-decay', 317 | type=float, 318 | default=2e-4, 319 | help='Weight decay for convolutions.') 320 | parser.add_argument( 321 | '--learning-rate', 322 | type=float, 323 | default=0.001, 324 | help="""\ 325 | This is the inital learning rate value. The learning rate will decrease 326 | during training. 
For more details check the model_fn implementation in 327 | this file.\ 328 | """) 329 | parser.add_argument( 330 | '--epochs', 331 | type=int, 332 | default=10, 333 | help='The number of steps to use for training.') 334 | parser.add_argument( 335 | '--batch-size', 336 | type=int, 337 | default=128, 338 | help='Batch size for training.') 339 | parser.add_argument( 340 | '--data-config', 341 | type=json.loads, 342 | default=os.environ.get('SM_INPUT_DATA_CONFIG') 343 | ) 344 | parser.add_argument( 345 | '--fw-params', 346 | type=json.loads, 347 | default=os.environ.get('SM_FRAMEWORK_PARAMS') 348 | ) 349 | parser.add_argument( 350 | '--optimizer', 351 | type=str, 352 | default='adam' 353 | ) 354 | parser.add_argument( 355 | '--momentum', 356 | type=float, 357 | default='0.9' 358 | ) 359 | args = parser.parse_args() 360 | 361 | main(args) 362 | -------------------------------------------------------------------------------- /tf-sentiment-script-mode/sentiment.py: -------------------------------------------------------------------------------- 1 | import logging 2 | logging.getLogger("tensorflow").setLevel(logging.ERROR) 3 | import argparse 4 | import codecs 5 | import json 6 | import numpy as np 7 | import os 8 | import tensorflow as tf 9 | 10 | max_features = 20000 11 | maxlen = 400 12 | embedding_dims = 300 13 | filters = 256 14 | kernel_size = 3 15 | hidden_dims = 256 16 | 17 | def parse_args(): 18 | 19 | parser = argparse.ArgumentParser() 20 | 21 | # hyperparameters sent by the client are passed as command-line arguments to the script 22 | parser.add_argument('--epochs', type=int, default=1) 23 | parser.add_argument('--batch_size', type=int, default=64) 24 | parser.add_argument('--learning_rate', type=float, default=0.01) 25 | 26 | # data directories 27 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 28 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 29 | 30 | # model directory: we will use the default set by SageMaker, /opt/ml/model 31 | parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR')) 32 | 33 | return parser.parse_known_args() 34 | 35 | 36 | def save_history(path, history): 37 | 38 | history_for_json = {} 39 | # transform float values that aren't json-serializable 40 | for key in list(history.history.keys()): 41 | if type(history.history[key]) == np.ndarray: 42 | history_for_json[key] = history.history[key].tolist() 43 | elif type(history.history[key]) == list: 44 | if type(history.history[key][0]) == np.float32 or type(history.history[key][0]) == np.float64: 45 | history_for_json[key] = list(map(float, history.history[key])) 46 | 47 | with codecs.open(path, 'w', encoding='utf-8') as f: 48 | json.dump(history_for_json, f, separators=(',', ':'), sort_keys=True, indent=4) 49 | 50 | 51 | def get_train_data(train_dir): 52 | 53 | x_train = np.load(os.path.join(train_dir, 'x_train.npy')) 54 | y_train = np.load(os.path.join(train_dir, 'y_train.npy')) 55 | print('x train', x_train.shape,'y train', y_train.shape) 56 | 57 | return x_train, y_train 58 | 59 | 60 | def get_test_data(test_dir): 61 | 62 | x_test = np.load(os.path.join(test_dir, 'x_test.npy')) 63 | y_test = np.load(os.path.join(test_dir, 'y_test.npy')) 64 | print('x test', x_test.shape,'y test', y_test.shape) 65 | 66 | return x_test, y_test 67 | 68 | 69 | def get_model(learning_rate): 70 | 71 | mirrored_strategy = tf.distribute.MirroredStrategy() 72 | 73 | with mirrored_strategy.scope(): 74 | embedding_layer = 
tf.keras.layers.Embedding(max_features, 75 | embedding_dims, 76 | input_length=maxlen) 77 | 78 | sequence_input = tf.keras.Input(shape=(maxlen,), dtype='int32') 79 | embedded_sequences = embedding_layer(sequence_input) 80 | x = tf.keras.layers.Dropout(0.2)(embedded_sequences) 81 | x = tf.keras.layers.Conv1D(filters, kernel_size, padding='valid', activation='relu', strides=1)(x) 82 | x = tf.keras.layers.MaxPooling1D()(x) 83 | x = tf.keras.layers.GlobalMaxPooling1D()(x) 84 | x = tf.keras.layers.Dense(hidden_dims, activation='relu')(x) 85 | x = tf.keras.layers.Dropout(0.2)(x) 86 | preds = tf.keras.layers.Dense(1, activation='sigmoid')(x) 87 | 88 | model = tf.keras.Model(sequence_input, preds) 89 | optimizer = tf.keras.optimizers.Adam(learning_rate) 90 | model.compile(loss='binary_crossentropy', 91 | optimizer=optimizer, 92 | metrics=['accuracy']) 93 | 94 | return model 95 | 96 | 97 | if __name__ == "__main__": 98 | 99 | args, _ = parse_args() 100 | 101 | x_train, y_train = get_train_data(args.train) 102 | x_test, y_test = get_test_data(args.test) 103 | 104 | model = get_model(args.learning_rate) 105 | 106 | history = model.fit(x_train, y_train, 107 | batch_size=args.batch_size, 108 | epochs=args.epochs, 109 | validation_data=(x_test, y_test)) 110 | 111 | save_history(args.model_dir + "/history.p", history) 112 | 113 | # create a TensorFlow SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 114 | model.save(args.model_dir + '/1') 115 | 116 | 117 | --------------------------------------------------------------------------------
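The scripts above are written for SageMaker Script Mode: a framework estimator from the SageMaker Python SDK packages the entry point, passes hyperparameters as command-line arguments, and exposes the input channels through the SM_CHANNEL_* environment variables read in parse_args(). The following is a rough sketch of how an entry point such as tf-sentiment-script-mode/sentiment.py is typically launched. The bucket paths, IAM role, instance type, framework and Python versions, and hyperparameter values below are illustrative assumptions rather than values taken from this repository; the notebooks next to each script hold the configurations actually used.

import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()  # assumes a SageMaker notebook or Studio environment

estimator = TensorFlow(
    entry_point='sentiment.py',
    source_dir='tf-sentiment-script-mode',
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',      # assumed GPU instance, since the script uses MirroredStrategy
    framework_version='2.1',            # assumed TensorFlow 2.x container version
    py_version='py3',
    hyperparameters={'epochs': 3, 'batch_size': 128, 'learning_rate': 0.01},
)

# Channel names map to SM_CHANNEL_TRAIN and SM_CHANNEL_TEST inside the script.
estimator.fit({'train': 's3://my-example-bucket/sentiment/train',
               'test': 's3://my-example-bucket/sentiment/test'})

After training completes, estimator.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge') would host the SavedModel written to args.model_dir + '/1' behind a TensorFlow Serving endpoint, in line with the comment at the end of sentiment.py.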