├── .github └── PULL_REQUEST_TEMPLATE.md ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE ├── README.md ├── daemon.json ├── deploy-pretrained-model ├── BERT │ ├── Deploy_BERT.ipynb │ └── code │ │ ├── inference_code.py │ │ └── requirements.txt └── GPT2 │ ├── Deploy_GPT2.ipynb │ └── code │ ├── inference_code.py │ └── requirements.txt ├── hugging-face-lambda-step ├── .ipynb_checkpoints │ ├── iam_helper-checkpoint.py │ └── sm-pipelines-hugging-face-lambda-step-checkpoint.ipynb ├── iam_helper.py ├── scripts │ ├── .ipynb_checkpoints │ │ ├── evaluate-checkpoint.py │ │ ├── preprocessing-checkpoint.py │ │ └── train-checkpoint.py │ ├── evaluate.py │ ├── preprocessing.py │ └── train.py └── sm-pipelines-hugging-face-lambda-step.ipynb ├── k-means-clustering └── k-means-clustering.ipynb ├── lightgbm-byo └── lightgbm-byo.ipynb ├── local_mode_setup.sh ├── r-churn └── r_autopilot_churn.ipynb ├── r-in-sagemaker-processing └── r_in_sagemaker_processing.ipynb ├── r-workflow └── r_workflow.ipynb ├── tf-2-word-embeddings ├── code │ ├── model_def.py │ └── train.py └── tf-2-word-embeddings.ipynb ├── tf-2-workflow-smpipelines ├── tf-2-workflow-smpipelines.ipynb └── train_model │ ├── model_def.py │ └── train.py ├── tf-2-workflow ├── tf-2-workflow.ipynb └── train_model │ ├── model_def.py │ └── train.py ├── tf-batch-inference-script ├── code │ ├── inference.py │ ├── model_def.py │ ├── requirements.txt │ ├── train.py │ └── utilities.py ├── generate_cifar10_tfrecords.py ├── sample-img │ ├── 1000_dog.png │ ├── 1001_airplane.png │ ├── 1003_deer.png │ ├── 1004_ship.png │ ├── 1005_automobile.png │ ├── 1008_truck.png │ ├── 1009_frog.png │ ├── 1014_cat.png │ ├── 1037_horse.png │ └── 1038_bird.png └── tf-batch-inference-script.ipynb ├── tf-distribution-options ├── code │ ├── inference.py │ ├── model_def.py │ ├── requirements.txt │ ├── train_hvd.py │ ├── train_ps.py │ └── utilities.py ├── generate_cifar10_tfrecords.py ├── sample-img │ ├── 1000_dog.png │ ├── 1001_airplane.png │ ├── 1003_deer.png │ ├── 1004_ship.png │ ├── 1005_automobile.png │ ├── 1008_truck.png │ ├── 1009_frog.png │ ├── 1014_cat.png │ ├── 1037_horse.png │ └── 1038_bird.png └── tf-distributed-training.ipynb ├── tf-eager-script-mode ├── tf-boston-housing.ipynb └── train_model │ ├── model_def.py │ └── train.py ├── tf-horovod-inference-pipeline ├── generate_cifar10_tfrecords.py ├── image-transformer-container │ ├── Dockerfile │ ├── app │ │ └── main.py │ ├── ecr_policy.json │ └── entrypoint.sh ├── sample-img │ ├── 1000_dog.png │ ├── 1001_airplane.png │ ├── 1003_deer.png │ ├── 1004_ship.png │ ├── 1005_automobile.png │ ├── 1008_truck.png │ ├── 1009_frog.png │ ├── 1014_cat.png │ ├── 1037_horse.png │ └── 1038_bird.png ├── tf-horovod-inference-pipeline.ipynb └── train.py └── tf-sentiment-script-mode ├── sentiment-analysis.ipynb └── sentiment.py /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | *Issue #, if available:* 2 | 3 | *Description of changes:* 4 | 5 | 6 | By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. 7 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check [existing open](https://github.com/aws-samples/amazon-sagemaker-script-mode/issues), or [recently closed](https://github.com/aws-samples/amazon-sagemaker-script-mode/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/aws-samples/amazon-sagemaker-script-mode/labels/help%20wanted) issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 62 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 
48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | TensorFlow Eager Execution with SageMaker's Script Mode 2 | Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Amazon SageMaker Script Mode Examples 2 | 3 | This repository contains examples and related resources regarding Amazon SageMaker Script Mode and SageMaker Processing. With Script Mode, you can use training scripts similar to those you would use outside SageMaker with SageMaker's prebuilt containers for various frameworks such TensorFlow and PyTorch. Similarly, in SageMaker Processing, you can supply ordinary data preprocessing scripts for almost any language or technology you wish to use, such as the R programming language. 4 | 5 | Currently this repository has resources for **Hugging Face**, **TensorFlow**, **R**, **Bring Your Own** (BYO models, plus Script Mode-style experience with your own containers), and **Miscellaneous** (Script Mode-style experience for SageMaker Processing etc.). There also is an **Older Resources** section with examples of older framework versions; these examples are for reference only, and are not maintained. 6 | 7 | For those new to SageMaker, there is a set of 2-hour workshops covering the basics at [**Amazon SageMaker Workshops**](https://github.com/awslabs/amazon-sagemaker-workshop). 
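For a sense of what Script Mode looks like in code, the minimal sketch below launches a training job with the SageMaker Python SDK by pointing a prebuilt TensorFlow container at an ordinary training script. The script name, S3 paths, framework versions, and hyperparameters are illustrative placeholders rather than files in this repository; see the individual examples below for complete, runnable notebooks.

```python
# Minimal Script Mode sketch (SageMaker Python SDK v2). All names and paths below are
# placeholders for illustration only -- adapt them to your own script and data.
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()

# The prebuilt TensorFlow container runs your own script as the training entry point.
estimator = TensorFlow(
    entry_point="train.py",           # an ordinary training script, unchanged from local use
    source_dir="train_model",         # optional: extra modules and a requirements.txt
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="2.3.1",
    py_version="py37",
    hyperparameters={"epochs": 10, "batch_size": 64},  # passed to the script as CLI arguments
)

# Each input channel appears inside the container as SM_CHANNEL_<NAME> (e.g., SM_CHANNEL_TRAIN).
estimator.fit({"train": "s3://my-bucket/train", "test": "s3://my-bucket/test"})
```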
8 | 9 | - **Hugging Face Resources:** 10 | 11 | - [**Hugging Face automated model training and deployment in SageMaker Pipelines**](hugging-face-lambda-step): This example uses the SageMaker prebuilt Hugging Face (PyTorch) container in an end-to-end demo with model training and deployment within SageMaker Pipelines. A lightweight model deployment is performed by a SageMaker Pipeline Lambda step. **PREREQUISITES:** either clone this repository, or from the *hugging-face-lambda-step* directory, upload all files and folders; then run the notebook `sm-pipelines-hugging-face-lambda-step.ipynb`. 12 | 13 | - **TensorFlow Resources:** 14 | 15 | - [**TensorFlow 2 Sentiment Analysis**](tf-sentiment-script-mode): SageMaker's prebuilt TensorFlow 2 container is used in this example to train a custom sentiment analysis model. Distributed hosted training in SageMaker is performed on a multi-GPU instance, using the native TensorFlow `MirroredStrategy`. Additionally, SageMaker Batch Transform is used for asynchronous, large scale inference/batch scoring. **PREREQUISITES:** From the *tf-sentiment-script-mode* directory, upload ONLY the Jupyter notebook `sentiment-analysis.ipynb`. 16 | 17 | - [**TensorFlow 2 Workflow with SageMaker Pipelines**](tf-2-workflow-smpipelines): This example shows a complete workflow for TensorFlow 2, starting with prototyping followed by automation with [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines). To begin, SageMaker Processing is used to transform the dataset. Next, Local Mode training and Local Mode endpoints are demonstrated for prototyping training and inference code, respectively. Automatic Model Tuning is used to automate the hyperparameter tuning process. Finally, the workflow is automated with SageMaker Pipelines. **PREREQUISITES:** If you wish to run the Local Mode sections of the example, use a SageMaker Notebook Instance rather than SageMaker Studio. From the *tf-2-workflow-smpipelines* directory, upload ONLY the Jupyter notebook `tf-2-workflow-smpipelines.ipynb`. 18 | 19 | - [**TensorFlow 2 Loading Pretrained Embeddings for Classification Tasks**](tf-2-word-embeddings): In this example, TensorFlow 2 is used with Script Mode for a text classification task. An important aspect of the example is showing how to load pretrained embeddings in Script Mode. This illustrates one aspect of the flexibility of SageMaker Script Mode for setting up training jobs: in addition to data, you can pass in arbitrary files needed for training (not just embeddings). **PREREQUISITES:** (1) be sure to upload all files in the *tf-2-word-embeddings* directory (including subdirectory *code*) to the directory where you will run the related Jupyter notebook. 20 | 21 | - **R Resources:** 22 | 23 | - [**R in SageMaker Processing**](r-in-sagemaker-processing): SageMaker Script Mode is directed toward making the model training process easier. However, an experience similar to Script Mode also is available for SageMaker Processing: you can bring in your data processing scripts and easily run them on managed infrastructure either with BYO containers or prebuilt containers for frameworks such as Spark and Scikit-learn. In this example, R is used to perform operations on a dataset and generate a plot within SageMaker Processing. The job results including the plot image are retrieved and displayed, demonstrating how R can be easily used within a SageMaker workflow. 
**PREREQUISITES:** From the *r-in-sagemaker-processing* directory, upload the Jupyter notebook `r-in-sagemaker_processing.ipynb`. 24 | 25 | - [**R Complete Workflow**](r-workflow): This example shows a complete workflow for R, starting with prototyping and moving to model tuning and inference, followed by automation with [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines). **PREREQUISITES:** Use a R kernel; and from the *r-workflow* directory, upload the Jupyter notebook `r-workflow.ipynb`. 26 | 27 | - [**R Churn Example**](r-churn): Often it is helpful to benchmark possible model quality using AutoML, even if you intend to manage training of your own custom model later using Script Mode. This example shows how to use R to access SageMaker Autopilot for this purpose. **PREREQUISITES:** Use a R kernel; and from the *r-churn* directory, upload the Jupyter notebook `r-churn.ipynb`. 28 | 29 | - **Bring Your Own (BYO) Resources:** 30 | 31 | - [**lightGBM BYO**](lightgbm-byo): In this repository, most samples use Amazon SageMaker prebuilt framework containers for TensorFlow and other frameworks. For this example, however, we'll show how to BYO container to create a Script Mode-style experience similar to a prebuilt SageMaker framework container, using lightGBM, a popular gradient boosting framework. **PREREQUISITES:** From the *lightgbm-byo* directory, upload the Jupyter notebook `lightgbm-byo.ipynb`. 32 | 33 | - [**Deploy Pretrained Models**](deploy-pretrained-model): In addition to the ease of use of the SageMaker Python SDK for model training in Script Mode, the SDK also enables you to easily BYO model. In this example, the SageMaker prebuilt PyTorch container is used to demonstrate how you can quickly take a pretrained or locally trained model and deploy it in a SageMaker hosted endpoint. There are examples for both OpenAI's GPT-2 and BERT. **PREREQUISITES:** From the *deploy-pretrained-model* directory, upload the entire BERT or GPT2 folder's contents, depending on which model you select. Run either `Deploy_BERT.pynb` or `Deploy_GPT2.ipynb`. 34 | 35 | 36 | - **Miscellaneous Resources:** 37 | 38 | - [**K-means clustering**](k-means-clustering): Most of the samples in this repository involve supervised learning tasks in Amazon SageMaker Script Mode. For this example, by contrast, we'll undertake an unsupervised learning task, and do so with the Amazon SageMaker K-means built-in algorithm rather than Script Mode. The SageMaker built-in algorithms were developed for large-scale training tasks and may offer a simpler user experience depending on the use case. **PREREQUISITES:** From the *k-means-clustering* directory, upload the Jupyter notebook `k-means-clustering.ipynb`. 39 | 40 | 41 | - **Older Resources:** ***(reference only, not maintained)*** 42 | 43 | - [**TensorFlow 2 Workflow with the AWS Step Functions Data Science SDK**](tf-2-workflow): **NOTE**: This example has been superseded by the **TensorFlow 2 Workflow with SageMaker Pipelines** example above. This example shows a complete workflow for TensorFlow 2 with automation by the AWS Step Functions Data Science SDK, an older alternative to [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines). To begin, SageMaker Processing is used to transform the dataset. Next, Local Mode training and Local Mode endpoints are demonstrated for prototyping training and inference code, respectively. Automatic Model Tuning is used to automate the hyperparameter tuning process. 
**PREREQUISITES:** From the *tf-2-workflow* directory, upload ONLY the Jupyter notebook `tf-2-workflow.ipynb`. 44 | 45 | - [**TensorFlow 1.x (tf.keras) Highly Performant Batch Inference & Training**](tf-batch-inference-script): The focus of this example is highly performant batch inference using TensorFlow Serving, along with Horovod distributed training. To transform the input image data for inference, a preprocessing script is used with the Amazon SageMaker TensorFlow Serving container. **PREREQUISITES:** be sure to upload all files in the *tf-batch-inference-script* directory (including the subdirectory code and files) to the directory where you will run the related Jupyter notebook. 46 | 47 | - [**TensorFlow 1.x (tf.keras) with Horovod & Inference Pipeline**](tf-horovod-inference-pipeline): Script Mode with TensorFlow is used for a computer vision task, in a demonstration of Horovod distributed training and doing batch inference in conjunction with an Inference Pipeline for transforming image data before inputting it to the model container. This is an alternative to the previous example, which uses a preprocessing script with the Amazon SageMaker TensorFlow Serving Container rather than an Inference Pipeline. **PREREQUISITES:** be sure to upload all files in the *tf-horovod-inference-pipeline* directory (including the subdirectory code and files) to the directory where you will run the related Jupyter notebook. 48 | 49 | 50 | - [**TensorFlow 1.x (tf.keras) Distributed Training Options**](tf-distribution-options): **NOTE**: Besides the options listed here for TensorFlow 1.x, there are additional options for TensorFlow 2, including [A] built-in [**SageMaker Distributed Training**](https://aws.amazon.com/sagemaker/distributed-training/) for both data and model parallelism, and [B] native distribution strategies such as MirroredStrategy as demonstrated in the **TensorFlow 2 Sentiment Analysis** example above. This TensorFlow 1.x example demonstrates two other distributed training options for SageMaker's Script Mode: (1) parameter servers, and (2) Horovod. **PREREQUISITES:** From the *tf-distribution-options* directory, upload ONLY the Jupyter notebook `tf-distributed-training.ipynb`. 51 | 52 | - [**TensorFlow 1.x (tf.keras) Eager Execution**](tf-eager-script-mode): **NOTE**: This TensorFlow 1.x example has been superseded by the **TensorFlow 2 Workflow** example above. This example shows how to use Script Mode with Eager Execution mode in TensorFlow 1.x, a more intuitive and dynamic alternative to the original graph mode of TensorFlow. It is the default mode of TensorFlow 2. Local Mode and Automatic Model Tuning also are demonstrated. **PREREQUISITES:** From the *tf-eager-script-mode* directory, upload ONLY the Jupyter notebook `tf-boston-housing.ipynb`. 53 | 54 | 55 | 56 | ## License 57 | 58 | The contents of this repository are licensed under the Apache 2.0 License except where otherwise noted. 
59 | -------------------------------------------------------------------------------- /daemon.json: -------------------------------------------------------------------------------- 1 | 2 | { 3 | "default-runtime": "nvidia", 4 | "runtimes": { 5 | "nvidia": { 6 | "path": "/usr/bin/nvidia-container-runtime", 7 | "runtimeArgs": [] 8 | } 9 | } 10 | } 11 | -------------------------------------------------------------------------------- /deploy-pretrained-model/BERT/Deploy_BERT.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Hosting a Pretrained Model on SageMaker\n", 8 | " \n", 9 | "Amazon SageMaker is a service to accelerate the entire machine learning lifecycle. It includes components for building, training and deploying machine learning models. Each SageMaker component is modular, so you're welcome to only use the features needed for your use case. One of the most popular features of SageMaker is [model hosting](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html). Using SageMaker Hosting you can deploy your model as a scalable, highly available, multi-process API endpoint with a few lines of code. In this notebook, we will demonstrate how to host a pretrained model (BERT) in Amazon SageMaker to extract embeddings from text.\n", 10 | "\n", 11 | "SageMaker provides prebuilt containers that can be used for training, hosting, or data processing. The inference containers include a web serving stack, so you don't need to install and configure one. We will be using the SageMaker [PyTorch container](https://github.com/aws/deep-learning-containers), but you may use the [TensorFlow container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md), or bring your own container if needed. \n", 12 | "\n", 13 | "This notebook will walk you through how to deploy a pretrained Hugging Face model as a scalable, highly available, production ready API in under 15 minutes." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "## Retrieve Model Artifacts\n", 21 | "\n", 22 | "First we will download the model artifacts for the pretrained [BERT](https://arxiv.org/abs/1810.04805) model. BERT is a popular natural language processing (NLP) model that extracts meaning and context from text." 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": { 29 | "scrolled": true 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "!pip install transformers==3.3.1 sagemaker==2.15.0 --quiet" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "import os\n", 43 | "from transformers import BertTokenizer, BertModel\n", 44 | "\n", 45 | "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", 46 | "model = BertModel.from_pretrained(\"bert-base-uncased\")\n", 47 | "\n", 48 | "model_path = 'model/'\n", 49 | "code_path = 'code/'\n", 50 | "\n", 51 | "if not os.path.exists(model_path):\n", 52 | " os.mkdir(model_path)\n", 53 | " \n", 54 | "model.save_pretrained(save_directory=model_path)\n", 55 | "tokenizer.save_pretrained(save_directory=model_path)" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## Write the Inference Script\n", 63 | "\n", 64 | "Since we are bringing a model to SageMaker, we must create an inference script. 
The script will run inside our PyTorch container. Our script should include a function for model loading, and optionally functions generating predicitions, and input/output processing. The PyTorch container provides default implementations for generating a prediction and input/output processing. By including these functions in your script you are overriding the default functions. You can find additional [details here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#serve-a-pytorch-model).\n", 65 | "\n", 66 | "In the next cell we'll see our inference script. You will notice that it uses the [transformers library from HuggingFace](https://huggingface.co/transformers/). This Python library is not installed in the container by default, so we will have to add that in the next section." 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "!pygmentize code/inference_code.py" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "## Package Model\n", 83 | "\n", 84 | "For hosting, SageMaker requires that the deployment package be structed in a compatible format. It expects all files to be packaged in a tar archive named \"model.tar.gz\" with gzip compression. To install additional libraries at container startup, we can add a [requirements.txt](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries) text file that specifies the libraries to be installed using [pip](https://pypi.org/project/pip/). Within the archive, the PyTorch container expects all inference code and requirements.txt file to be inside the code/ directory. See the [guide here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#for-versions-1-2-and-higher) for a thorough explanation of the required directory structure. " 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "import tarfile\n", 94 | "\n", 95 | "zipped_model_path = os.path.join(model_path, \"model.tar.gz\")\n", 96 | "\n", 97 | "with tarfile.open(zipped_model_path, \"w:gz\") as tar:\n", 98 | " tar.add(model_path)\n", 99 | " tar.add(code_path)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "## Deploy Model\n", 107 | "\n", 108 | "Now that we have our deployment package, we can use the [SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/index.html) to deploy our API endpoint with two lines of code. We need to specify an IAM role for the SageMaker endpoint to use. Minimally, it will need read access to the default SageMaker bucket (usually named sagemaker-{region}-{your account number}) so it can read the deployment package. When we call deploy(), the SDK will save our deployment archive to S3 for the SageMaker endpoint to use. We will use the helper function [get_execution_role](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html?highlight=get_execution_role#sagemaker.session.get_execution_role) to retrieve our current IAM role so we can pass it to the SageMaker endpoint. 
Minimally it will require read access to the model artifacts in S3 and the [ECR repository](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) where the container image is stored by AWS.\n", 109 | "\n", 110 | "\n", 111 | "You may notice that we specify our PyTorch version and Python version when creating the PyTorchModel object. The SageMaker SDK uses these parameters to determine which PyTorch container to use. \n", 112 | "\n", 113 | "We'll choose an m5 instance for our endpoint to ensure we have sufficient memory to serve our model. " 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "from sagemaker.pytorch import PyTorchModel\n", 123 | "from sagemaker import get_execution_role\n", 124 | "\n", 125 | "endpoint_name = 'bert-base'\n", 126 | "\n", 127 | "model = PyTorchModel(entry_point='inference_code.py', \n", 128 | " model_data=zipped_model_path, \n", 129 | " role=get_execution_role(), \n", 130 | " framework_version='1.5', \n", 131 | " py_version='py3')\n", 132 | "\n", 133 | "predictor = model.deploy(initial_instance_count=1, \n", 134 | " instance_type='ml.m5.xlarge', \n", 135 | " endpoint_name=endpoint_name)" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "## Get Predictions\n", 143 | "\n", 144 | "Now that our API endpoint is deployed, we can send it text to get predictions from our BERT model. You can use the SageMaker SDK or the [SageMaker Runtime API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) to invoke the endpoint. " 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "import boto3\n", 154 | "\n", 155 | "sm = boto3.client('sagemaker-runtime')\n", 156 | "\n", 157 | "prompt = \"The best part of Amazon SageMaker is that it makes machine learning easy.\"\n", 158 | "\n", 159 | "response = sm.invoke_endpoint(EndpointName=endpoint_name, \n", 160 | " Body=prompt.encode(encoding='UTF-8'),\n", 161 | " ContentType='text/csv')\n", 162 | "\n", 163 | "response['Body'].read()" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "## Conclusion\n", 171 | "\n", 172 | "You have successfully created a scalable, high available, RESTful API that is backed by a BERT model! It can be used for downstreaming NLP tasks like text classification. If you are still interested in learning more, check out some of the more advanced features of SageMaker Hosting, like [model monitoring](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html) to detect concept drift, [autoscaling](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html) to dynamically adjust the number of instances, or [VPC config](https://docs.aws.amazon.com/sagemaker/latest/dg/host-vpc.html) to control network access to/from your endpoint.\n", 173 | "\n", 174 | "You can also look in to the [ezsmdeploy SDK](https://aws.amazon.com/blogs/opensource/deploy-machine-learning-models-to-amazon-sagemaker-using-the-ezsmdeploy-python-package-and-a-few-lines-of-code/) that automates most of this process." 
175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [] 183 | } 184 | ], 185 | "metadata": { 186 | "kernelspec": { 187 | "display_name": "conda_pytorch_latest_p36", 188 | "language": "python", 189 | "name": "conda_pytorch_latest_p36" 190 | }, 191 | "language_info": { 192 | "codemirror_mode": { 193 | "name": "ipython", 194 | "version": 3 195 | }, 196 | "file_extension": ".py", 197 | "mimetype": "text/x-python", 198 | "name": "python", 199 | "nbconvert_exporter": "python", 200 | "pygments_lexer": "ipython3", 201 | "version": "3.6.10" 202 | } 203 | }, 204 | "nbformat": 4, 205 | "nbformat_minor": 4 206 | } 207 | -------------------------------------------------------------------------------- /deploy-pretrained-model/BERT/code/inference_code.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | from transformers import BertTokenizer, BertModel 4 | 5 | def model_fn(model_dir): 6 | """ 7 | Load the model for inference 8 | """ 9 | 10 | model_path = os.path.join(model_dir, 'model/') 11 | 12 | # Load BERT tokenizer from disk. 13 | tokenizer = BertTokenizer.from_pretrained(model_path) 14 | 15 | # Load BERT model from disk. 16 | model = BertModel.from_pretrained(model_path) 17 | 18 | model_dict = {'model': model, 'tokenizer':tokenizer} 19 | 20 | return model_dict 21 | 22 | def predict_fn(input_data, model): 23 | """ 24 | Apply model to the incoming request 25 | """ 26 | 27 | tokenizer = model['tokenizer'] 28 | bert_model = model['model'] 29 | 30 | encoded_input = tokenizer(input_data, return_tensors='pt') 31 | 32 | return bert_model(**encoded_input) 33 | 34 | def input_fn(request_body, request_content_type): 35 | """ 36 | Deserialize and prepare the prediction input 37 | """ 38 | 39 | if request_content_type == "application/json": 40 | request = json.loads(request_body) 41 | else: 42 | request = request_body 43 | 44 | return request 45 | 46 | def output_fn(prediction, response_content_type): 47 | """ 48 | Serialize and prepare the prediction output 49 | """ 50 | 51 | if response_content_type == "application/json": 52 | response = str(prediction) 53 | else: 54 | response = str(prediction) 55 | 56 | return response -------------------------------------------------------------------------------- /deploy-pretrained-model/BERT/code/requirements.txt: -------------------------------------------------------------------------------- 1 | transformers==3.3.1 -------------------------------------------------------------------------------- /deploy-pretrained-model/GPT2/Deploy_GPT2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Hosting a Pretrained Model on SageMaker\n", 8 | " \n", 9 | "Amazon SageMaker is a service to accelerate the entire machine learning lifecycle. It includes components for building, training and deploying machine learning models. Each SageMaker component is modular, so you're welcome to only use the features needed for your use case. One of the most popular features of SageMaker is [model hosting](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html). Using SageMaker Hosting you can deploy your model as a scalable, highly available, multi-process API endpoint with a few lines of code. 
In this notebook, we will demonstrate how to host a pretrained model (GPT-2) in Amazon SageMaker.\n", 10 | "\n", 11 | "SageMaker provides prebuilt containers that can be used for training, hosting, or data processing. The inference containers include a web serving stack, so you don't need to install and configure one. We will be using the SageMaker [PyTorch container](https://github.com/aws/deep-learning-containers), but you may use the [TensorFlow container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md), or bring your own container if needed. \n", 12 | "\n", 13 | "This notebook will walk you through how to deploy a pretrained Hugging Face model as a scalable, highly available, production ready API in under 15 minutes." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "## Retrieve Model Artifacts\n", 21 | "\n", 22 | "First we will download the model artifacts for the pretrained [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) model. GPT-2 is a popular text generation model that was developed by OpenAI. Given a text prompt it can generate synthetic text that may follow." 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": { 29 | "scrolled": true 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "!pip install transformers==3.3.1 sagemaker==2.15.0 --quiet" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "import os\n", 43 | "from transformers import GPT2Tokenizer, GPT2LMHeadModel\n", 44 | "\n", 45 | "tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n", 46 | "model = GPT2LMHeadModel.from_pretrained('gpt2')\n", 47 | "\n", 48 | "model_path = 'model/'\n", 49 | "code_path = 'code/'\n", 50 | "\n", 51 | "if not os.path.exists(model_path):\n", 52 | " os.mkdir(model_path)\n", 53 | " \n", 54 | "model.save_pretrained(save_directory=model_path)\n", 55 | "tokenizer.save_vocabulary(save_directory=model_path)" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## Write the Inference Script\n", 63 | "\n", 64 | "Since we are bringing a model to SageMaker, we must create an inference script. The script will run inside our PyTorch container. Our script should include a function for model loading, and optionally functions generating predicitions, and input/output processing. The PyTorch container provides default implementations for generating a prediction and input/output processing. By including these functions in your script you are overriding the default functions. You can find additional [details here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#serve-a-pytorch-model).\n", 65 | "\n", 66 | "In the next cell we'll see our inference script. You will notice that it uses the [transformers library from Hugging Face](https://huggingface.co/transformers/). This Python library is not installed in the container by default, so we will have to add that in the next section." 
67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "!pygmentize code/inference_code.py" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "## Package Model\n", 83 | "\n", 84 | "For hosting, SageMaker requires that the deployment package be structed in a compatible format. It expects all files to be packaged in a tar archive named \"model.tar.gz\" with gzip compression. To install additional libraries at container startup, we can add a [requirements.txt](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries) text file that specifies the libraries to be installed using [pip](https://pypi.org/project/pip/). Within the archive, the PyTorch container expects all inference code and requirements.txt file to be inside the code/ directory. See the [guide here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#for-versions-1-2-and-higher) for a thorough explanation of the required directory structure. " 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "import tarfile\n", 94 | "\n", 95 | "zipped_model_path = os.path.join(model_path, \"model.tar.gz\")\n", 96 | "\n", 97 | "with tarfile.open(zipped_model_path, \"w:gz\") as tar:\n", 98 | " tar.add(model_path)\n", 99 | " tar.add(code_path)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "## Deploy Model\n", 107 | "\n", 108 | "Now that we have our deployment package, we can use the [SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/index.html) to deploy our API endpoint with two lines of code. We need to specify an IAM role for the SageMaker endpoint to use. Minimally, it will need read access to the default SageMaker bucket (usually named sagemaker-{region}-{your account number}) so it can read the deployment package. When we call deploy(), the SDK will save our deployment archive to S3 for the SageMaker endpoint to use. We will use the helper function [get_execution_role](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html?highlight=get_execution_role#sagemaker.session.get_execution_role) to retrieve our current IAM role so we can pass it to the SageMaker endpoint. You may specify another IAM role here. Minimally it will require read access to the model artifacts in S3 and the [ECR repository](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) where the container image is stored by AWS.\n", 109 | "\n", 110 | "You may notice that we specify our PyTorch version and Python version when creating the PyTorchModel object. The SageMaker SDK uses these parameters to determine which PyTorch container to use. \n", 111 | "\n", 112 | "The full size [GPT-2 model has 1.2 billion parameters](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). Even though we are using the small version of the model, our endpoint will need to fit millions of parameters in to memory. We'll choose an m5 instance for our endpoint to ensure we have sufficient memory to serve our model. 
" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "from sagemaker.pytorch import PyTorchModel\n", 122 | "from sagemaker import get_execution_role\n", 123 | "\n", 124 | "endpoint_name = 'GPT2'\n", 125 | "\n", 126 | "model = PyTorchModel(entry_point='inference_code.py', \n", 127 | " model_data=zipped_model_path, \n", 128 | " role=get_execution_role(),\n", 129 | " framework_version='1.5', \n", 130 | " py_version='py3')\n", 131 | "\n", 132 | "predictor = model.deploy(initial_instance_count=1, \n", 133 | " instance_type='ml.m5.xlarge', \n", 134 | " endpoint_name=endpoint_name)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "## Get Predictions\n", 142 | "\n", 143 | "Now that our RESTful API endpoint is deployed, we can send it text to get predictions from our GPT-2 model. You can use the SageMaker Python SDK or the [SageMaker Runtime API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) to invoke the endpoint. " 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "import boto3\n", 153 | "import json\n", 154 | "\n", 155 | "sm = boto3.client('sagemaker-runtime')\n", 156 | "\n", 157 | "prompt = \"Working with SageMaker makes machine learning \"\n", 158 | "\n", 159 | "response = sm.invoke_endpoint(EndpointName=endpoint_name, \n", 160 | " Body=json.dumps(prompt),\n", 161 | " ContentType='text/csv')\n", 162 | "\n", 163 | "response['Body'].read().decode('utf-8')" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "## Conclusion\n", 171 | "\n", 172 | "You have successfully created a scalable, high available, RESTful API that is backed by a GPT-2 model! If you are still interested in learning more, check out some of the more advanced features of SageMaker Hosting, like [model monitoring](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html) to detect concept drift, [autoscaling](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html) to dynamically adjust the number of instances, or [VPC config](https://docs.aws.amazon.com/sagemaker/latest/dg/host-vpc.html) to control network access to/from your endpoint.\n", 173 | "\n", 174 | "You can also look in to the [ezsmdeploy SDK](https://aws.amazon.com/blogs/opensource/deploy-machine-learning-models-to-amazon-sagemaker-using-the-ezsmdeploy-python-package-and-a-few-lines-of-code/) that automates most of this process." 
175 | ] 176 | } 177 | ], 178 | "metadata": { 179 | "kernelspec": { 180 | "display_name": "conda_pytorch_latest_p36", 181 | "language": "python", 182 | "name": "conda_pytorch_latest_p36" 183 | }, 184 | "language_info": { 185 | "codemirror_mode": { 186 | "name": "ipython", 187 | "version": 3 188 | }, 189 | "file_extension": ".py", 190 | "mimetype": "text/x-python", 191 | "name": "python", 192 | "nbconvert_exporter": "python", 193 | "pygments_lexer": "ipython3", 194 | "version": "3.6.10" 195 | } 196 | }, 197 | "nbformat": 4, 198 | "nbformat_minor": 4 199 | } 200 | -------------------------------------------------------------------------------- /deploy-pretrained-model/GPT2/code/inference_code.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | from transformers import GPT2Tokenizer, TextGenerationPipeline, GPT2LMHeadModel 4 | 5 | def model_fn(model_dir): 6 | """ 7 | Load the model for inference 8 | """ 9 | 10 | # Load GPT2 tokenizer from disk. 11 | vocab_path = os.path.join(model_dir, 'model/vocab.json') 12 | merges_path = os.path.join(model_dir, 'model/merges.txt') 13 | 14 | tokenizer = GPT2Tokenizer(vocab_file=vocab_path, 15 | merges_file=merges_path) 16 | 17 | # Load GPT2 model from disk. 18 | model_path = os.path.join(model_dir, 'model/') 19 | model = GPT2LMHeadModel.from_pretrained(model_path) 20 | 21 | return TextGenerationPipeline(model=model, tokenizer=tokenizer) 22 | 23 | def predict_fn(input_data, model): 24 | """ 25 | Apply model to the incoming request 26 | """ 27 | 28 | return model.__call__(input_data) 29 | 30 | def input_fn(request_body, request_content_type): 31 | """ 32 | Deserialize and prepare the prediction input 33 | """ 34 | 35 | if request_content_type == "application/json": 36 | request = json.loads(request_body) 37 | else: 38 | request = request_body 39 | 40 | return request 41 | 42 | def output_fn(prediction, response_content_type): 43 | """ 44 | Serialize and prepare the prediction output 45 | """ 46 | 47 | return str(prediction) -------------------------------------------------------------------------------- /deploy-pretrained-model/GPT2/code/requirements.txt: -------------------------------------------------------------------------------- 1 | transformers==3.3.1 -------------------------------------------------------------------------------- /hugging-face-lambda-step/.ipynb_checkpoints/iam_helper-checkpoint.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import json 3 | 4 | iam = boto3.client('iam') 5 | 6 | def create_lambda_role(role_name): 7 | try: 8 | response = iam.create_role( 9 | RoleName = role_name, 10 | AssumeRolePolicyDocument = json.dumps({ 11 | "Version": "2012-10-17", 12 | "Statement": [ 13 | { 14 | "Effect": "Allow", 15 | "Principal": { 16 | "Service": "lambda.amazonaws.com" 17 | }, 18 | "Action": "sts:AssumeRole" 19 | } 20 | ] 21 | }), 22 | Description='Role for Lambda to call ECS Fargate task' 23 | ) 24 | 25 | role_arn = response['Role']['Arn'] 26 | 27 | response = iam.attach_role_policy( 28 | RoleName=role_name, 29 | PolicyArn='arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole' 30 | ) 31 | 32 | response = iam.attach_role_policy( 33 | PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess', 34 | RoleName=role_name 35 | ) 36 | 37 | return role_arn 38 | 39 | except iam.exceptions.EntityAlreadyExistsException: 40 | print(f'Using ARN from existing role: {role_name}') 41 | response = 
iam.get_role(RoleName=role_name) 42 | return response['Role']['Arn'] -------------------------------------------------------------------------------- /hugging-face-lambda-step/iam_helper.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import json 3 | 4 | iam = boto3.client('iam') 5 | 6 | def create_lambda_role(role_name): 7 | try: 8 | response = iam.create_role( 9 | RoleName = role_name, 10 | AssumeRolePolicyDocument = json.dumps({ 11 | "Version": "2012-10-17", 12 | "Statement": [ 13 | { 14 | "Effect": "Allow", 15 | "Principal": { 16 | "Service": "lambda.amazonaws.com" 17 | }, 18 | "Action": "sts:AssumeRole" 19 | } 20 | ] 21 | }), 22 | Description='Role for Lambda to call ECS Fargate task' 23 | ) 24 | 25 | role_arn = response['Role']['Arn'] 26 | 27 | response = iam.attach_role_policy( 28 | RoleName=role_name, 29 | PolicyArn='arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole' 30 | ) 31 | 32 | response = iam.attach_role_policy( 33 | PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess', 34 | RoleName=role_name 35 | ) 36 | 37 | return role_arn 38 | 39 | except iam.exceptions.EntityAlreadyExistsException: 40 | print(f'Using ARN from existing role: {role_name}') 41 | response = iam.get_role(RoleName=role_name) 42 | return response['Role']['Arn'] -------------------------------------------------------------------------------- /hugging-face-lambda-step/scripts/.ipynb_checkpoints/evaluate-checkpoint.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """Evaluation script for measuring mean squared error.""" 4 | 5 | import subprocess 6 | import sys 7 | 8 | def install(package): 9 | subprocess.check_call([sys.executable, "-m", "pip", "install", package]) 10 | 11 | import json 12 | import logging 13 | import pathlib 14 | import pickle 15 | import tarfile 16 | import os 17 | 18 | import numpy as np 19 | import pandas as pd 20 | 21 | from transformers import AutoModelForSequenceClassification, Trainer 22 | from datasets import load_from_disk 23 | 24 | logger = logging.getLogger() 25 | logger.setLevel(logging.INFO) 26 | logger.addHandler(logging.StreamHandler()) 27 | 28 | if __name__ == "__main__": 29 | logger.debug("Starting evaluation.") 30 | model_path = "/opt/ml/processing/model/model.tar.gz" 31 | with tarfile.open(model_path) as tar: 32 | tar.extractall(path="./hf_model") 33 | 34 | logger.debug(os.listdir('./hf_model')) 35 | 36 | # test_dir = "/opt/ml/processing/test/" 37 | # test_dataset = load_from_disk(test_dir) 38 | 39 | # model = AutoModelForSequenceClassification.from_pretrained('./hf_model') 40 | 41 | # trainer = Trainer(model=model) 42 | 43 | # eval_result = trainer.evaluate(eval_dataset=test_dataset) 44 | 45 | with open('./hf_model/evaluation.json') as f: 46 | eval_result = json.load(f) 47 | 48 | logger.debug(eval_result) 49 | output_dir = "/opt/ml/processing/evaluation" 50 | pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True) 51 | 52 | evaluation_path = f"{output_dir}/evaluation.json" 53 | with open(evaluation_path, "w") as f: 54 | f.write(json.dumps(eval_result)) 55 | -------------------------------------------------------------------------------- /hugging-face-lambda-step/scripts/.ipynb_checkpoints/preprocessing-checkpoint.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import numpy as np 4 | import os 5 | import pandas as pd 6 | import subprocess 7 | import sys 8 
| 9 | def install(package): 10 | subprocess.check_call([sys.executable, "-m", "pip", "install", package]) 11 | 12 | if __name__=='__main__': 13 | 14 | install('torch') 15 | install('transformers') 16 | install('datasets[s3]') 17 | 18 | from datasets import load_dataset 19 | from transformers import AutoTokenizer 20 | 21 | # tokenizer used in preprocessing 22 | tokenizer_name = 'distilbert-base-uncased' 23 | 24 | # dataset used 25 | dataset_name = 'imdb' 26 | 27 | # load dataset 28 | dataset = load_dataset(dataset_name) 29 | 30 | # download tokenizer 31 | tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) 32 | 33 | # tokenizer helper function 34 | def tokenize(batch): 35 | return tokenizer(batch['text'], padding='max_length', truncation=True) 36 | 37 | # load dataset 38 | train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test']) 39 | test_dataset = test_dataset.shuffle().select(range(1000)) # smaller the size for test dataset to 1k 40 | 41 | # tokenize dataset 42 | train_dataset = train_dataset.map(tokenize, batched=True) 43 | test_dataset = test_dataset.map(tokenize, batched=True) 44 | 45 | # set format for pytorch 46 | train_dataset = train_dataset.rename_column("label", "labels") 47 | train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels']) 48 | 49 | test_dataset = test_dataset.rename_column("label", "labels") 50 | test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels']) 51 | 52 | train_dataset.save_to_disk('/opt/ml/processing/train') 53 | test_dataset.save_to_disk('/opt/ml/processing/test') -------------------------------------------------------------------------------- /hugging-face-lambda-step/scripts/.ipynb_checkpoints/train-checkpoint.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer 3 | from sklearn.metrics import accuracy_score, precision_recall_fscore_support 4 | from datasets import load_from_disk 5 | import random 6 | import logging 7 | import sys 8 | import argparse 9 | import os 10 | import torch 11 | 12 | import pathlib 13 | import json 14 | 15 | if __name__ == "__main__": 16 | 17 | parser = argparse.ArgumentParser() 18 | 19 | # hyperparameters sent by the client are passed as command-line arguments to the script. 
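# (Illustrative aside, not part of the original script: on the client side these flags are
# generated from the estimator's hyperparameters dictionary, e.g. roughly
#     HuggingFace(entry_point='train.py',
#                 hyperparameters={'epochs': 1,
#                                  'train_batch_size': 32,
#                                  'model_name': 'distilbert-base-uncased'}, ...)
# the exact values and the remaining estimator arguments are assumptions, not taken from the notebook.)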
20 | parser.add_argument("--epochs", type=int, default=3) 21 | parser.add_argument("--train_batch_size", type=int, default=32) 22 | parser.add_argument("--eval_batch_size", type=int, default=64) 23 | parser.add_argument("--warmup_steps", type=int, default=500) 24 | parser.add_argument("--model_name", type=str) 25 | parser.add_argument("--learning_rate", type=str, default=5e-5) 26 | 27 | # Data, model, and output directories 28 | parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"]) 29 | parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"]) 30 | parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"]) 31 | parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"]) 32 | parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"]) 33 | 34 | args, _ = parser.parse_known_args() 35 | 36 | # Set up logging 37 | logger = logging.getLogger(__name__) 38 | 39 | logging.basicConfig( 40 | level=logging.getLevelName("INFO"), 41 | handlers=[logging.StreamHandler(sys.stdout)], 42 | format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", 43 | ) 44 | 45 | # load datasets 46 | train_dataset = load_from_disk(args.training_dir) 47 | test_dataset = load_from_disk(args.test_dir) 48 | 49 | logger.info(f" loaded train_dataset length is: {len(train_dataset)}") 50 | logger.info(f" loaded test_dataset length is: {len(test_dataset)}") 51 | 52 | # compute metrics function for binary classification 53 | def compute_metrics(pred): 54 | labels = pred.label_ids 55 | preds = pred.predictions.argmax(-1) 56 | precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary") 57 | acc = accuracy_score(labels, preds) 58 | return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall} 59 | 60 | # download model from model hub 61 | model = AutoModelForSequenceClassification.from_pretrained(args.model_name) 62 | tokenizer = AutoTokenizer.from_pretrained(args.model_name) 63 | 64 | # define training args 65 | training_args = TrainingArguments( 66 | output_dir=args.model_dir, 67 | num_train_epochs=args.epochs, 68 | per_device_train_batch_size=args.train_batch_size, 69 | per_device_eval_batch_size=args.eval_batch_size, 70 | warmup_steps=args.warmup_steps, 71 | evaluation_strategy="epoch", 72 | logging_dir=f"{args.output_data_dir}/logs", 73 | learning_rate=float(args.learning_rate), 74 | ) 75 | 76 | # create Trainer instance 77 | trainer = Trainer( 78 | model=model, 79 | args=training_args, 80 | compute_metrics=compute_metrics, 81 | train_dataset=train_dataset, 82 | eval_dataset=test_dataset, 83 | tokenizer=tokenizer, 84 | ) 85 | 86 | # train model 87 | trainer.train() 88 | 89 | # evaluate model 90 | eval_result = trainer.evaluate(eval_dataset=test_dataset) 91 | 92 | # # writes eval result to file which can be accessed later in s3 ouput 93 | # with open(os.path.join(args.output_data_dir, "eval_results.txt"), "w") as writer: 94 | # print(f"***** Eval results *****") 95 | # for key, value in sorted(eval_result.items()): 96 | # writer.write(f"{key} = {value}\n") 97 | 98 | evaluation_path = "/opt/ml/model/evaluation.json" 99 | with open(evaluation_path, "w+") as f: 100 | f.write(json.dumps(eval_result)) 101 | 102 | # Saves the model to s3 103 | trainer.save_model(args.model_dir) 104 | 105 | 106 | -------------------------------------------------------------------------------- /hugging-face-lambda-step/scripts/evaluate.py: 
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """Evaluation script for measuring mean squared error.""" 4 | 5 | import subprocess 6 | import sys 7 | 8 | def install(package): 9 | subprocess.check_call([sys.executable, "-m", "pip", "install", package]) 10 | 11 | import json 12 | import logging 13 | import pathlib 14 | import pickle 15 | import tarfile 16 | import os 17 | 18 | import numpy as np 19 | import pandas as pd 20 | 21 | from transformers import AutoModelForSequenceClassification, Trainer 22 | from datasets import load_from_disk 23 | 24 | logger = logging.getLogger() 25 | logger.setLevel(logging.INFO) 26 | logger.addHandler(logging.StreamHandler()) 27 | 28 | if __name__ == "__main__": 29 | logger.debug("Starting evaluation.") 30 | model_path = "/opt/ml/processing/model/model.tar.gz" 31 | with tarfile.open(model_path) as tar: 32 | tar.extractall(path="./hf_model") 33 | 34 | logger.debug(os.listdir('./hf_model')) 35 | 36 | # test_dir = "/opt/ml/processing/test/" 37 | # test_dataset = load_from_disk(test_dir) 38 | 39 | # model = AutoModelForSequenceClassification.from_pretrained('./hf_model') 40 | 41 | # trainer = Trainer(model=model) 42 | 43 | # eval_result = trainer.evaluate(eval_dataset=test_dataset) 44 | 45 | with open('./hf_model/evaluation.json') as f: 46 | eval_result = json.load(f) 47 | 48 | logger.debug(eval_result) 49 | output_dir = "/opt/ml/processing/evaluation" 50 | pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True) 51 | 52 | evaluation_path = f"{output_dir}/evaluation.json" 53 | with open(evaluation_path, "w") as f: 54 | f.write(json.dumps(eval_result)) 55 | -------------------------------------------------------------------------------- /hugging-face-lambda-step/scripts/preprocessing.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import numpy as np 4 | import os 5 | import pandas as pd 6 | import subprocess 7 | import sys 8 | 9 | def install(package): 10 | subprocess.check_call([sys.executable, "-m", "pip", "install", package]) 11 | 12 | if __name__=='__main__': 13 | 14 | install('torch') 15 | install('transformers') 16 | install('datasets[s3]') 17 | 18 | from datasets import load_dataset 19 | from transformers import AutoTokenizer 20 | 21 | # tokenizer used in preprocessing 22 | tokenizer_name = 'distilbert-base-uncased' 23 | 24 | # dataset used 25 | dataset_name = 'imdb' 26 | 27 | # load dataset 28 | dataset = load_dataset(dataset_name) 29 | 30 | # download tokenizer 31 | tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) 32 | 33 | # tokenizer helper function 34 | def tokenize(batch): 35 | return tokenizer(batch['text'], padding='max_length', truncation=True) 36 | 37 | # load dataset 38 | train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test']) 39 | test_dataset = test_dataset.shuffle().select(range(1000)) # smaller the size for test dataset to 1k 40 | 41 | # tokenize dataset 42 | train_dataset = train_dataset.map(tokenize, batched=True) 43 | test_dataset = test_dataset.map(tokenize, batched=True) 44 | 45 | # set format for pytorch 46 | train_dataset = train_dataset.rename_column("label", "labels") 47 | train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels']) 48 | 49 | test_dataset = test_dataset.rename_column("label", "labels") 50 | test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels']) 51 | 52 | 
train_dataset.save_to_disk('/opt/ml/processing/train') 53 | test_dataset.save_to_disk('/opt/ml/processing/test') -------------------------------------------------------------------------------- /hugging-face-lambda-step/scripts/train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer 3 | from sklearn.metrics import accuracy_score, precision_recall_fscore_support 4 | from datasets import load_from_disk 5 | import random 6 | import logging 7 | import sys 8 | import argparse 9 | import os 10 | import torch 11 | 12 | import pathlib 13 | import json 14 | 15 | if __name__ == "__main__": 16 | 17 | parser = argparse.ArgumentParser() 18 | 19 | # hyperparameters sent by the client are passed as command-line arguments to the script. 20 | parser.add_argument("--epochs", type=int, default=3) 21 | parser.add_argument("--train_batch_size", type=int, default=32) 22 | parser.add_argument("--eval_batch_size", type=int, default=64) 23 | parser.add_argument("--warmup_steps", type=int, default=500) 24 | parser.add_argument("--model_name", type=str) 25 | parser.add_argument("--learning_rate", type=str, default=5e-5) 26 | 27 | # Data, model, and output directories 28 | parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"]) 29 | parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"]) 30 | parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"]) 31 | parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"]) 32 | parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"]) 33 | 34 | args, _ = parser.parse_known_args() 35 | 36 | # Set up logging 37 | logger = logging.getLogger(__name__) 38 | 39 | logging.basicConfig( 40 | level=logging.getLevelName("INFO"), 41 | handlers=[logging.StreamHandler(sys.stdout)], 42 | format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", 43 | ) 44 | 45 | # load datasets 46 | train_dataset = load_from_disk(args.training_dir) 47 | test_dataset = load_from_disk(args.test_dir) 48 | 49 | logger.info(f" loaded train_dataset length is: {len(train_dataset)}") 50 | logger.info(f" loaded test_dataset length is: {len(test_dataset)}") 51 | 52 | # compute metrics function for binary classification 53 | def compute_metrics(pred): 54 | labels = pred.label_ids 55 | preds = pred.predictions.argmax(-1) 56 | precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary") 57 | acc = accuracy_score(labels, preds) 58 | return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall} 59 | 60 | # download model from model hub 61 | model = AutoModelForSequenceClassification.from_pretrained(args.model_name) 62 | tokenizer = AutoTokenizer.from_pretrained(args.model_name) 63 | 64 | # define training args 65 | training_args = TrainingArguments( 66 | output_dir=args.model_dir, 67 | num_train_epochs=args.epochs, 68 | per_device_train_batch_size=args.train_batch_size, 69 | per_device_eval_batch_size=args.eval_batch_size, 70 | warmup_steps=args.warmup_steps, 71 | evaluation_strategy="epoch", 72 | logging_dir=f"{args.output_data_dir}/logs", 73 | learning_rate=float(args.learning_rate), 74 | ) 75 | 76 | # create Trainer instance 77 | trainer = Trainer( 78 | model=model, 79 | args=training_args, 80 | compute_metrics=compute_metrics, 81 | train_dataset=train_dataset, 82 | 
eval_dataset=test_dataset, 83 | tokenizer=tokenizer, 84 | ) 85 | 86 | # train model 87 | trainer.train() 88 | 89 | # evaluate model 90 | eval_result = trainer.evaluate(eval_dataset=test_dataset) 91 | 92 | # # writes eval result to file which can be accessed later in s3 ouput 93 | # with open(os.path.join(args.output_data_dir, "eval_results.txt"), "w") as writer: 94 | # print(f"***** Eval results *****") 95 | # for key, value in sorted(eval_result.items()): 96 | # writer.write(f"{key} = {value}\n") 97 | 98 | evaluation_path = "/opt/ml/model/evaluation.json" 99 | with open(evaluation_path, "w+") as f: 100 | f.write(json.dumps(eval_result)) 101 | 102 | # Saves the model to s3 103 | trainer.save_model(args.model_dir) 104 | 105 | 106 | -------------------------------------------------------------------------------- /local_mode_setup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Do we have GPU support? 4 | nvidia-smi > /dev/null 2>&1 5 | if [ $? -eq 0 ]; then 6 | # check if we have nvidia-docker 7 | NVIDIA_DOCKER=`rpm -qa | grep -c nvidia-docker2` 8 | if [ $NVIDIA_DOCKER -eq 0 ]; then 9 | # Install nvidia-docker2 10 | DOCKER_VERSION=`yum list docker | tail -1 | awk '{print $2}' | head -c 2` 11 | 12 | if [ $DOCKER_VERSION -eq 17 ]; then 13 | DOCKER_PKG_VERSION='17.09.1ce-1.111.amzn1' 14 | NVIDIA_DOCKER_PKG_VERSION='2.0.3-1.docker17.09.1.ce.amzn1' 15 | else 16 | DOCKER_PKG_VERSION='18.06.1ce-3.17.amzn1' 17 | NVIDIA_DOCKER_PKG_VERSION='2.0.3-1.docker18.06.1.ce.amzn1' 18 | fi 19 | 20 | sudo yum -y remove docker 21 | sudo yum -y install docker-$DOCKER_PKG_VERSION 22 | 23 | sudo /etc/init.d/docker start 24 | 25 | curl -s -L https://nvidia.github.io/nvidia-docker/amzn1/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo 26 | sudo yum install -y nvidia-docker2-$NVIDIA_DOCKER_PKG_VERSION 27 | sudo cp daemon.json /etc/docker/daemon.json 28 | sudo pkill -SIGHUP dockerd 29 | echo "installed nvidia-docker2" 30 | else 31 | echo "nvidia-docker2 already installed. We are good to go!" 32 | fi 33 | fi 34 | 35 | # This is common for both GPU and CPU instances 36 | 37 | # check if we have docker-compose 38 | docker-compose version >/dev/null 2>&1 39 | if [ $? -ne 0 ]; then 40 | # install docker compose 41 | pip install docker-compose 42 | fi 43 | 44 | # check if we need to configure our docker interface 45 | SAGEMAKER_NETWORK=`docker network ls | grep -c sagemaker-local` 46 | if [ $SAGEMAKER_NETWORK -eq 0 ]; then 47 | docker network create --driver bridge sagemaker-local 48 | fi 49 | 50 | # Notebook instance Docker networking fixes 51 | RUNNING_ON_NOTEBOOK_INSTANCE=`sudo iptables -S OUTPUT -t nat | grep -c 169.254.0.2` 52 | 53 | # Get the Docker Network CIDR and IP for the sagemaker-local docker interface. 54 | SAGEMAKER_INTERFACE=br-`docker network ls | grep sagemaker-local | cut -d' ' -f1` 55 | DOCKER_NET=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f1` 56 | DOCKER_IP=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f12` 57 | 58 | # check if both IPTables and the Route Table are OK. 
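# (Added clarification: "OK" here means (a) the nat PREROUTING chain already holds the DNAT
# rule, added further down, that redirects instance-metadata traffic (169.254.169.254:80) from
# the sagemaker-local bridge to the local proxy at 169.254.0.2:9081, and (b) the "agent"
# routing table already routes the Docker network over that bridge; the two greps below just
# count whether those entries exist.)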
59 | IPTABLES_PATCHED=`sudo iptables -S PREROUTING -t nat | grep -c $SAGEMAKER_INTERFACE` 60 | ROUTE_TABLE_PATCHED=`sudo ip route show table agent | grep -c $SAGEMAKER_INTERFACE` 61 | 62 | if [ $RUNNING_ON_NOTEBOOK_INSTANCE -gt 0 ]; then 63 | 64 | if [ $ROUTE_TABLE_PATCHED -eq 0 ]; then 65 | # fix routing 66 | sudo ip route add $DOCKER_NET via $DOCKER_IP dev $SAGEMAKER_INTERFACE table agent 67 | else 68 | echo "SageMaker instance route table setup is ok. We are good to go." 69 | fi 70 | 71 | if [ $IPTABLES_PATCHED -eq 0 ]; then 72 | sudo iptables -t nat -A PREROUTING -i $SAGEMAKER_INTERFACE -d 169.254.169.254/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 169.254.0.2:9081 73 | echo "iptables for Docker setup done" 74 | else 75 | echo "SageMaker instance routing for Docker is ok. We are good to go!" 76 | fi 77 | fi 78 | -------------------------------------------------------------------------------- /r-churn/r_autopilot_churn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Customer Churn Prediction with Amazon SageMaker Autopilot\n", 8 | "_**Using AutoPilot to Predict Mobile Customer Departure**_\n", 9 | "\n", 10 | "---\n", 11 | "\n", 12 | "---\n", 13 | "\n", 14 | "\n", 15 | "## Introduction\n", 16 | "\n", 17 | "Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (hence the name) or with human guidance, without code through SageMaker Studio, or using the AWS SDKs. This notebook, as a first glimpse, will use the AWS SDKs to simply create and deploy a machine learning model.\n", 18 | "\n", 19 | "Losing customers is costly for any business. Identifying unhappy customers early on gives you a chance to offer them incentives to stay. This notebook describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction. ML models rarely give perfect predictions though, so this notebook is also about how to incorporate the relative costs of prediction mistakes when determining the financial outcome of using ML.\n", 20 | "\n", 21 | "We use an example of churn that is familiar to all of us–leaving a mobile phone operator. Seems like I can always find fault with my provider du jour! And if my provider knows that I’m thinking of leaving, it can offer timely incentives–I can always use a phone upgrade or perhaps have a new feature activated–and I might just stick around. Incentives are often much more cost effective than losing and reacquiring a customer.\n", 22 | "\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "### Reticulating the Amazon SageMaker Python SDK\n", 30 | "\n", 31 | "First, load the `reticulate` library and import the `sagemaker` Python module. Once the module is loaded, use the `$` notation in R instead of the `.` notation in Python to use available classes. 
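For example (added for clarity), the Python call `sagemaker.Session()` is written `sagemaker$Session()` in R, which is exactly the form used in the Setup cell below.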
" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "# Turn warnings off globally\n", 41 | "options(warn=-1)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "# Install reticulate library and import sagemaker\n", 51 | "library(reticulate)\n", 52 | "sagemaker <- import('sagemaker')" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "---\n", 60 | "\n", 61 | "## Setup\n", 62 | "\n", 63 | "_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n", 64 | "\n", 65 | "Let's start by specifying:\n", 66 | "\n", 67 | "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.\n", 68 | "- The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with a the appropriate full IAM role arn string(s)." 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "session <- sagemaker$Session()\n", 78 | "bucket <- session$default_bucket()\n", 79 | "prefix <- 'data/r-churn'\n", 80 | "role_arn <- sagemaker$get_execution_role()\n", 81 | "\n", 82 | "bucket\n", 83 | "role_arn" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "---\n", 91 | "## Data\n", 92 | "\n", 93 | "Mobile operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct an ML model of one mobile operator’s churn using a process called training. After training the model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have the model predict whether this customer is going to churn. Of course, we expect the model to make mistakes–after all, predicting the future is tricky business! But I’ll also show how to deal with prediction errors.\n", 94 | "\n", 95 | "The dataset we will use is synthetically generated, but indictive of the types of features you'd see in this use case." 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "session$download_data(path = './', \n", 105 | " bucket = 'sagemaker-sample-files', \n", 106 | " key_prefix = 'datasets/tabular/synthetic/churn.txt')" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "Before you run Autopilot on the dataset, first perform a check of the dataset to make sure that it has no obvious errors. The Autopilot process can take long time, and it's generally a good practice to inspect the dataset before you start a job. This particular dataset is small, so you can inspect it in the notebook instance itself. If you have a larger dataset that will not fit in a notebook instance memory, inspect the dataset offline using a big data analytics tool like Apache Spark. [Deequ](https://github.com/awslabs/deequ) is a library built on top of Apache Spark that can be helpful for performing checks on large datasets. 
\n", 114 | "\n", 115 | "Read the data into a data frame and take a look." 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "library(readr)\n", 125 | "\n", 126 | "churn <- read_csv(file = 'churn.txt')\n", 127 | "head(churn)" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "By modern standards, it’s a relatively small dataset, with only 5,000 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes are:\n", 135 | "\n", 136 | "- `State`: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ\n", 137 | "- `Account Length`: the number of days that this account has been active\n", 138 | "- `Area Code`: the three-digit area code of the corresponding customer’s phone number\n", 139 | "- `Phone`: the remaining seven-digit phone number\n", 140 | "- `Int’l Plan`: whether the customer has an international calling plan: yes/no\n", 141 | "- `VMail Plan`: whether the customer has a voice mail feature: yes/no\n", 142 | "- `VMail Message`: presumably the average number of voice mail messages per month\n", 143 | "- `Day Mins`: the total number of calling minutes used during the day\n", 144 | "- `Day Calls`: the total number of calls placed during the day\n", 145 | "- `Day Charge`: the billed cost of daytime calls\n", 146 | "- `Eve Mins, Eve Calls, Eve Charge`: the billed cost for calls placed during the evening\n", 147 | "- `Night Mins`, `Night Calls`, `Night Charge`: the billed cost for calls placed during nighttime\n", 148 | "- `Intl Mins`, `Intl Calls`, `Intl Charge`: the billed cost for international calls\n", 149 | "- `CustServ Calls`: the number of calls placed to Customer Service\n", 150 | "- `Churn?`: whether the customer left the service: true/false\n", 151 | "\n", 152 | "The last attribute, `Churn?`, is known as the target attribute–the attribute that we want the ML model to predict." 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "### Upload the dataset to S3" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "Now we'll upload the data to a S3 bucket in our own AWS account so Autopilot can access it." 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "write_csv(churn, 'churn.csv', col_names = TRUE)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "s3_train <- session$upload_data(path = 'churn.csv', \n", 185 | " bucket = bucket, \n", 186 | " key_prefix = prefix)\n", 187 | "\n", 188 | "s3_train" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "---\n", 196 | "## Launching a SageMaker Autopilot Job\n", 197 | "\n", 198 | "After uploading the dataset to Amazon S3, you can launch Autopilot to find the best ML pipeline to train a model on this dataset. \n", 199 | "\n", 200 | "Currently Autopilot supports only tabular datasets in CSV format. Either all files should have a header row, or the first file of the dataset, when sorted in alphabetical/lexical order by name, is expected to have a header row." 
201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "We'll launch an Autopilot job via the Studio UI (it also is possible to do so via API). To do so:\n", 208 | "\n", 209 | "- Go to the tilted triangle icon in the left toolbar and click it, then select **Experiments and trials**.\n", 210 | "- Click the **Create Autopilot Experiment** button.\n", 211 | "- For **Experiment name**, enter a name such as `automl-churn-` with a date suffix. e.g. `automl-churn-10-14-21`\n", 212 | "- Skip to **CONNECT YOUR DATA**, then find the **S3 bucket name** using autocomplete by typing `sagemaker-` and matching to the bucket name printed below the previous code cell. Similarly, find the **Dataset file name** the same way, it should be `data/r-churn/churn.csv`\n", 213 | "- Skip to **Target**, and select `Churn?` from the drop down menu.\n", 214 | "- Skip to **Output data location**, select the radio button for **Enter S3 bucket location**, and then enter a string such as `s3:///data/r-churn/output` where you replace `your-bucket-name` with the bucket name you've used previously.\n", 215 | "- Go to **Auto deploy** and switch it to off. \n", 216 | "- Click the down arrow for **Advanced Settings**, go to **Max candidates** and enter 20. (This is to keep the runtime of the job within reasonable limits for a workshop setting.) \n", 217 | "- Click **Create Experiment**." 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "### Tracking SageMaker Autopilot job progress\n", 225 | "SageMaker Autopilot job consists of the following high-level steps : \n", 226 | "\n", 227 | "* Analyzing Data, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets.\n", 228 | "* Feature Engineering, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level.\n", 229 | "* Model Tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline). \n", 230 | "\n", 231 | "Although we can use code/API calls to track job progress, we'll use the SageMaker Studio UI to do so. After you create the job via the Studio UI above, the tab will convert to an Autopilot Job tracking tab. You'll be able to see the progress of the job in that tab.\n", 232 | "\n", 233 | "If you close the tab you can always get back to it. To do so, go to the tilted triangle icon in the left toolbar and click it, then select **Experiments and trials**. Next, right-click the name of your AutoML job, which should start with \"automl-churn-\", and select **Describe AutoML Job**. A new Studio tab will open details about your job, and a summary when it completes, with the ability to sort models by metric and deploy with a single click. " 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "### Model Explainability\n", 241 | "\n", 242 | "Autopilot also generates an explainability report using SageMaker Clarify. 
The report and related artifacts are uploaded to S3, but you also can access the report in SageMaker Studio.\n", 243 | "\n", 244 | "To do so:\n", 245 | "- Go to the tilted triangle icon in the left toolbar and click it, then select **Experiments and trials**.\n", 246 | "- In the list of experiments, click on ***Unassigned trial components***.\n", 247 | "- Double-click the trial component with the name of the form, `automl-churn--documentation`.\n", 248 | "- A new tab will open named `Describe Trial Component`; in it you will see a graph of feature importance by aggregated SHAP values. Of the 20 original input features, Clarify plots the 10 features with the greatest feature attribution." 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "Clarify uses a model-agnostic feature attribution approach, which you can used to understand why a model made a prediction after training and to provide per-instance explanation during inference. The implementation includes a scalable and efficient implementation of SHAP." 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "It also is possible to visualize the the local explanations for single examples in your dataset. You can simply load the local explanations stored in the Autopilot explainability output path, and visualize the explanation (i.e., the impact that the single features have on the prediction of your model) for any example. Typically for an example you would plot a bar chart with SHAP values for each feature. The larger the bar, the more impact the feature has on the target feature. Bars with positive values are associated with higher predictions in the target variable, and bars with negative values are associated with lower predictions in the target variable." 263 | ] 264 | } 265 | ], 266 | "metadata": { 267 | "celltoolbar": "Tags", 268 | "instance_type": "ml.t3.medium", 269 | "kernelspec": { 270 | "display_name": "R (custom-r/latest)", 271 | "language": "python", 272 | "name": "ir__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:894087409521:image/custom-r" 273 | }, 274 | "language_info": { 275 | "codemirror_mode": "r", 276 | "file_extension": ".r", 277 | "mimetype": "text/x-r-source", 278 | "name": "R", 279 | "pygments_lexer": "r", 280 | "version": "4.0.0" 281 | }, 282 | "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." 283 | }, 284 | "nbformat": 4, 285 | "nbformat_minor": 4 286 | } 287 | -------------------------------------------------------------------------------- /r-in-sagemaker-processing/r_in_sagemaker_processing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Using R in SageMaker Processing\n", 8 | "\n", 9 | "Amazon SageMaker Processing is a capability of Amazon SageMaker that lets you easily run your preprocessing, postprocessing and model evaluation workloads on fully managed infrastructure. 
In this example, we'll see how to use SageMaker Processing with the R programming language.\n", 10 | "\n", 11 | "The workflow for using R with SageMaker Processing involves the following steps:\n", 12 | "\n", 13 | "- Writing a R script.\n", 14 | "- Building a Docker container.\n", 15 | "- Creating a SageMaker Processing job.\n", 16 | "- Retrieving and viewing job results. " 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "## The R script\n", 24 | "\n", 25 | "To use R with SageMaker Processing, first prepare a R script similar to one you would use outside SageMaker. Below is the R script we'll be using. It performs operations on data and also saves a .png of a plot for retrieval and display later after the Processing job is complete. This enables you to perform any kind of analysis and feature engineering at scale with R, and also create visualizations for display anywhere. " 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "%%writefile preprocessing.R\n", 35 | "\n", 36 | "library(readr)\n", 37 | "library(dplyr)\n", 38 | "library(ggplot2)\n", 39 | "library(forcats)\n", 40 | "\n", 41 | "input_dir <- \"/opt/ml/processing/input/\"\n", 42 | "filename <- Sys.glob(paste(input_dir, \"*.csv\", sep=\"\"))\n", 43 | "df <- read_csv(filename)\n", 44 | "\n", 45 | "plot_data <- df %>%\n", 46 | " group_by(state) %>%\n", 47 | " count()\n", 48 | "\n", 49 | "write_csv(plot_data, \"/opt/ml/processing/csv/plot_data.csv\")\n", 50 | "\n", 51 | "plot <- plot_data %>% \n", 52 | " ggplot()+\n", 53 | " geom_col(aes(fct_reorder(state, n), \n", 54 | " n, \n", 55 | " fill = n))+\n", 56 | " coord_flip()+\n", 57 | " labs(\n", 58 | " title = \"Number of people by state\",\n", 59 | " subtitle = \"From US-500 dataset\",\n", 60 | " x = \"State\",\n", 61 | " y = \"Number of people\"\n", 62 | " )+ \n", 63 | " theme_bw()\n", 64 | "\n", 65 | "ggsave(\"/opt/ml/processing/images/census_plot.png\", width = 10, height = 8, dpi = 100)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "## Building a Docker container\n", 73 | "\n", 74 | "Next, there is a one-time step to create a R container. For subsequent SageMaker Processing jobs, you can just reuse this container (unless you need to add further dependencies, in which case you can just add them to the Dockerfile and rebuild). To start, set up a local directory for Docker-related files." 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "!mkdir docker" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "A simple Dockerfile can be used to build a Docker container for SageMaker Processing. For this example, we'll use a parent Docker image from the Rocker Project, which provides a set of convenient R Docker images. There is no need to include your R script in the container itself because SageMaker Processing will ingest it for you. This gives you the flexibility to modify the script as needed without having to rebuild the Docker image every time you modify it. 
" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "%%writefile docker/Dockerfile\n", 100 | "\n", 101 | "FROM rocker/tidyverse:latest\n", 102 | "\n", 103 | "# tidyverse has all the packages we need, otherwise we could install more as follows\n", 104 | "# RUN install2.r --error \\\n", 105 | "# jsonlite \\\n", 106 | "# tseries\n", 107 | "\n", 108 | "ENTRYPOINT [\"Rscript\"]" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "The Dockerfile is now used to build the Docker image. We'll also create an Amazon Elastic Container Registry (ECR) repository, and push the image to ECR so it can be accessed by SageMaker." 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "import boto3\n", 125 | "\n", 126 | "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", 127 | "region = boto3.session.Session().region_name\n", 128 | "\n", 129 | "ecr_repository = 'r-in-sagemaker-processing'\n", 130 | "tag = ':latest'\n", 131 | "\n", 132 | "uri_suffix = 'amazonaws.com'\n", 133 | "processing_repository_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)\n", 134 | "\n", 135 | "# Create ECR repository and push Docker image\n", 136 | "!docker build -t $ecr_repository docker\n", 137 | "!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)\n", 138 | "!aws ecr create-repository --repository-name $ecr_repository\n", 139 | "!docker tag {ecr_repository + tag} $processing_repository_uri\n", 140 | "!docker push $processing_repository_uri" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "## Creating a SageMaker Processing job\n", 148 | "\n", 149 | "With our Docker image in ECR, we now prepare for the SageMaker Processing job by specifying Amazon S3 buckets for output and input, and downloading the raw dataset." 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "import sagemaker\n", 159 | "from sagemaker import get_execution_role\n", 160 | "\n", 161 | "role = get_execution_role()\n", 162 | "session = sagemaker.Session()\n", 163 | "s3_output = session.default_bucket()\n", 164 | "s3_prefix = 'R-in-Processing'\n", 165 | "s3_source = 'sagemaker-workshop-pdx'\n", 166 | "session.download_data(path='./data', bucket=s3_source, key_prefix='R-in-Processing/us-500.csv')" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "Before setting up the SageMaker Processing job, the raw dataset is uploaded to S3 so it is accessible to SageMaker Processing. " 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "rawdata_s3_prefix = '{}/data/raw'.format(s3_prefix)\n", 183 | "raw_s3 = session.upload_data(path='./data', key_prefix=rawdata_s3_prefix)\n", 184 | "print(raw_s3)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "The `ScriptProcessor` class of the SageMaker SDK lets you run a command inside a Docker container. We'll use this to run our own script using the `Rscript` command. 
In the `ScriptProcessor` you also can specify the type and number of instances to be used in the SageMaker Processing job." 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "from sagemaker.processing import ScriptProcessor\n", 201 | "\n", 202 | "script_processor = ScriptProcessor(command=['Rscript'],\n", 203 | " image_uri=processing_repository_uri,\n", 204 | " role=role,\n", 205 | " instance_count=1,\n", 206 | " instance_type='ml.c5.xlarge')" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "We can now start the SageMaker Processing job. The main aspects of the code below are specifying the input and output locations, and the name of our R preprocessing script. " 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", 223 | "from time import gmtime, strftime \n", 224 | "\n", 225 | "processing_job_name = \"R-in-Processing-{}\".format(strftime(\"%d-%H-%M-%S\", gmtime()))\n", 226 | "output_destination = 's3://{}/{}/data'.format(s3_output, s3_prefix)\n", 227 | "\n", 228 | "script_processor.run(code='preprocessing.R',\n", 229 | " job_name=processing_job_name,\n", 230 | " inputs=[ProcessingInput(\n", 231 | " source=raw_s3,\n", 232 | " destination='/opt/ml/processing/input')],\n", 233 | " outputs=[ProcessingOutput(output_name='csv',\n", 234 | " destination='{}/csv'.format(output_destination),\n", 235 | " source='/opt/ml/processing/csv'),\n", 236 | " ProcessingOutput(output_name='images',\n", 237 | " destination='{}/images'.format(output_destination),\n", 238 | " source='/opt/ml/processing/images')])\n", 239 | "\n", 240 | "preprocessing_job_description = script_processor.jobs[-1].describe()" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "## Retrieving and viewing job results\n", 248 | "\n", 249 | "From the SageMaker Processing job description, we can look up the S3 URIs of the output, including the output plot .png file." 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "output_config = preprocessing_job_description['ProcessingOutputConfig']\n", 259 | "for output in output_config['Outputs']:\n", 260 | " if output['OutputName'] == 'csv':\n", 261 | " preprocessed_csv_data = output['S3Output']['S3Uri']\n", 262 | " if output['OutputName'] == 'images':\n", 263 | " preprocessed_images = output['S3Output']['S3Uri']" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "Now we can display the plot produced by the SageMaker Processing job. A similar workflow applies to retrieving and working with any other output from a job, such as the transformed data itself. 
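(Added sketch, not in the original notebook: for instance, the CSV written to the 'csv' output could be pulled down the same way; the local path is arbitrary, and the key prefix follows from the output destination configured above.)

    csv_key_prefix = '{}/data/csv'.format(s3_prefix)
    session.download_data(path='./output_csv', bucket=s3_output, key_prefix=csv_key_prefix)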
" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": null, 276 | "metadata": {}, 277 | "outputs": [], 278 | "source": [ 279 | "from PIL import Image\n", 280 | "from IPython.display import display\n", 281 | "\n", 282 | "plot_key = 'census_plot.png'\n", 283 | "plot_in_s3 = '{}/{}'.format(preprocessed_images, plot_key)\n", 284 | "!aws s3 cp {plot_in_s3} .\n", 285 | "im = Image.open(plot_key)\n", 286 | "display(im)" 287 | ] 288 | } 289 | ], 290 | "metadata": { 291 | "kernelspec": { 292 | "display_name": "conda_python3", 293 | "language": "python", 294 | "name": "conda_python3" 295 | }, 296 | "language_info": { 297 | "codemirror_mode": { 298 | "name": "ipython", 299 | "version": 3 300 | }, 301 | "file_extension": ".py", 302 | "mimetype": "text/x-python", 303 | "name": "python", 304 | "nbconvert_exporter": "python", 305 | "pygments_lexer": "ipython3", 306 | "version": "3.6.10" 307 | } 308 | }, 309 | "nbformat": 4, 310 | "nbformat_minor": 4 311 | } 312 | -------------------------------------------------------------------------------- /tf-2-word-embeddings/code/model_def.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | import tensorflow as tf 4 | 5 | 6 | def get_embeddings(embedding_dir): 7 | 8 | embeddings = np.load(os.path.join(embedding_dir, 'embedding.npy')) 9 | print('embeddings shape: ', embeddings.shape) 10 | 11 | return embeddings 12 | 13 | 14 | def get_model(embedding_dir, NUM_WORDS, WORD_INDEX_LENGTH, LABELS_INDEX_LENGTH, EMBEDDING_DIM, MAX_SEQUENCE_LENGTH): 15 | 16 | embedding_matrix = get_embeddings(embedding_dir) 17 | 18 | # trainable = False to keep the embeddings frozen 19 | embedding_layer = tf.keras.layers.Embedding(NUM_WORDS, 20 | EMBEDDING_DIM, 21 | embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), 22 | input_length=MAX_SEQUENCE_LENGTH, 23 | trainable=False) 24 | 25 | sequence_input = tf.keras.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') 26 | embedded_sequences = embedding_layer(sequence_input) 27 | x = tf.keras.layers.Conv1D(128, 5, activation='relu')(embedded_sequences) 28 | x = tf.keras.layers.MaxPooling1D(5)(x) 29 | x = tf.keras.layers.Conv1D(128, 5, activation='relu')(x) 30 | x = tf.keras.layers.MaxPooling1D(5)(x) 31 | x = tf.keras.layers.Conv1D(128, 5, activation='relu')(x) 32 | x = tf.keras.layers.GlobalMaxPooling1D()(x) 33 | x = tf.keras.layers.Dense(128, activation='relu')(x) 34 | preds = tf.keras.layers.Dense(LABELS_INDEX_LENGTH, activation='softmax')(x) 35 | 36 | return tf.keras.Model(sequence_input, preds) 37 | 38 | -------------------------------------------------------------------------------- /tf-2-word-embeddings/code/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import sys 4 | import numpy as np 5 | import tensorflow as tf 6 | 7 | from model_def import get_model 8 | 9 | 10 | def parse_args(): 11 | 12 | parser = argparse.ArgumentParser() 13 | 14 | # hyperparameters sent by the client are passed as command-line arguments to the script 15 | parser.add_argument('--epochs', type=int, default=1) 16 | parser.add_argument('--batch_size', type=int, default=64) 17 | 18 | parser.add_argument('--num_words', type=int) 19 | parser.add_argument('--word_index_len', type=int) 20 | parser.add_argument('--labels_index_len', type=int) 21 | parser.add_argument('--embedding_dim', type=int) 22 | parser.add_argument('--max_sequence_len', type=int) 23 | 24 | # data 
directories 25 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 26 | parser.add_argument('--val', type=str, default=os.environ.get('SM_CHANNEL_VAL')) 27 | 28 | # embedding directory 29 | parser.add_argument('--embedding', type=str, default=os.environ.get('SM_CHANNEL_EMBEDDING')) 30 | 31 | # model directory: we will use the default set by SageMaker, /opt/ml/model 32 | parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR')) 33 | 34 | return parser.parse_known_args() 35 | 36 | 37 | def get_train_data(train_dir): 38 | 39 | x_train = np.load(os.path.join(train_dir, 'x_train.npy')) 40 | y_train = np.load(os.path.join(train_dir, 'y_train.npy')) 41 | print('x train', x_train.shape,'y train', y_train.shape) 42 | 43 | return x_train, y_train 44 | 45 | 46 | def get_val_data(val_dir): 47 | 48 | x_val = np.load(os.path.join(val_dir, 'x_val.npy')) 49 | y_val = np.load(os.path.join(val_dir, 'y_val.npy')) 50 | print('x val', x_val.shape,'y val', y_val.shape) 51 | 52 | return x_val, y_val 53 | 54 | 55 | if __name__ == "__main__": 56 | 57 | args, _ = parse_args() 58 | 59 | x_train, y_train = get_train_data(args.train) 60 | x_val, y_val = get_val_data(args.val) 61 | 62 | model = get_model(args.embedding, 63 | args.num_words, 64 | args.word_index_len, 65 | args.labels_index_len, 66 | args.embedding_dim, 67 | args.max_sequence_len) 68 | 69 | model.compile(loss='categorical_crossentropy', 70 | optimizer='rmsprop', 71 | metrics=['acc']) 72 | 73 | model.fit(x_train, y_train, 74 | batch_size=args.batch_size, 75 | epochs=args.epochs, 76 | validation_data=(x_val, y_val)) 77 | 78 | # create a TensorFlow SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 79 | model.save(args.model_dir + '/1') 80 | 81 | -------------------------------------------------------------------------------- /tf-2-word-embeddings/tf-2-word-embeddings.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Loading Word Embeddings in SageMaker for Text Classification with TensorFlow 2\n", 8 | "\n", 9 | "In this notebook, two aspects of Amazon SageMaker will be demonstrated. First, we'll use SageMaker Script Mode with a prebuilt TensorFlow 2 framework container, which enables you to use a training script similar to one you would use outside SageMaker. Second, we'll see how to use the concept of SageMaker input channels to load word embeddings into the container for training. The word embeddings will be used with a Convolutional Neural Net (CNN) in TensorFlow 2 to perform text classification. \n", 10 | "\n", 11 | "We'll begin with some necessary imports." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import os\n", 21 | "import sys\n", 22 | "import numpy as np\n", 23 | "import tensorflow as tf\n", 24 | "\n", 25 | "from tensorflow.keras.preprocessing.text import Tokenizer\n", 26 | "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", 27 | "from tensorflow.keras.utils import to_categorical" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "# Prepare Dataset and Embeddings" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "Initially, we download the 20 Newsgroups dataset. 
" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "!mkdir ./20_newsgroup\n", 51 | "!wget -O ./20_newsgroup/news20.tar.gz http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz\n", 52 | "!tar -xvzf ./20_newsgroup/news20.tar.gz" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "The next step is to download the GloVe word embeddings that we will load in the neural net." 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "!mkdir ./glove.6B\n", 69 | "!wget https://nlp.stanford.edu/data/glove.6B.zip\n", 70 | "!unzip glove.6B.zip -d ./glove.6B" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "We have to map the GloVe embedding vectors into an index." 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "BASE_DIR = ''\n", 87 | "GLOVE_DIR = os.path.join(BASE_DIR, 'glove.6B')\n", 88 | "TEXT_DATA_DIR = os.path.join(BASE_DIR, '20_newsgroup')\n", 89 | "MAX_SEQUENCE_LENGTH = 1000\n", 90 | "MAX_NUM_WORDS = 20000\n", 91 | "EMBEDDING_DIM = 100\n", 92 | "VALIDATION_SPLIT = 0.2\n", 93 | "\n", 94 | "embeddings_index = {}\n", 95 | "with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as f:\n", 96 | " for line in f:\n", 97 | " values = line.split()\n", 98 | " word = values[0]\n", 99 | " coefs = np.asarray(values[1:], dtype='float32')\n", 100 | " embeddings_index[word] = coefs\n", 101 | "\n", 102 | "print('Found %s word vectors.' % len(embeddings_index))" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "The 20 Newsgroups text also must be preprocessed. For example, the labels for each sample must be extracted and mapped to a numeric index." 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "texts = [] # list of text samples\n", 119 | "labels_index = {} # dictionary mapping label name to numeric id\n", 120 | "labels = [] # list of label ids\n", 121 | "for name in sorted(os.listdir(TEXT_DATA_DIR)):\n", 122 | " path = os.path.join(TEXT_DATA_DIR, name)\n", 123 | " if os.path.isdir(path):\n", 124 | " label_id = len(labels_index)\n", 125 | " labels_index[name] = label_id\n", 126 | " for fname in sorted(os.listdir(path)):\n", 127 | " if fname.isdigit():\n", 128 | " fpath = os.path.join(path, fname)\n", 129 | " args = {} if sys.version_info < (3,) else {'encoding': 'latin-1'}\n", 130 | " with open(fpath, **args) as f:\n", 131 | " t = f.read()\n", 132 | " i = t.find('\\n\\n') # skip header\n", 133 | " if 0 < i:\n", 134 | " t = t[i:]\n", 135 | " texts.append(t)\n", 136 | " labels.append(label_id)\n", 137 | "\n", 138 | "print('Found %s texts.' % len(texts))" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "We can use Keras text preprocessing functions to tokenize the text, limit the sequence length of the samples, and pad shorter sequences as necessary. Additionally, the preprocessed dataset must be split into training and validation sets." 
146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)\n", 155 | "tokenizer.fit_on_texts(texts)\n", 156 | "sequences = tokenizer.texts_to_sequences(texts)\n", 157 | "\n", 158 | "word_index = tokenizer.word_index\n", 159 | "print('Found %s unique tokens.' % len(word_index))\n", 160 | "\n", 161 | "data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)\n", 162 | "\n", 163 | "labels = to_categorical(np.asarray(labels))\n", 164 | "print('Shape of data tensor:', data.shape)\n", 165 | "print('Shape of label tensor:', labels.shape)\n", 166 | "\n", 167 | "# split the data into a training set and a validation set\n", 168 | "indices = np.arange(data.shape[0])\n", 169 | "np.random.shuffle(indices)\n", 170 | "data = data[indices]\n", 171 | "labels = labels[indices]\n", 172 | "num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])\n", 173 | "\n", 174 | "x_train = data[:-num_validation_samples]\n", 175 | "y_train = labels[:-num_validation_samples]\n", 176 | "x_val = data[-num_validation_samples:]\n", 177 | "y_val = labels[-num_validation_samples:]" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "After the dataset text preprocessing is complete, we can now map the 20 Newsgroup vocabulary words to their GloVe embedding vectors for use in an embedding matrix. This matrix will be loaded in an Embedding layer of the neural net." 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "num_words = min(MAX_NUM_WORDS, len(word_index)) + 1\n", 194 | "embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))\n", 195 | "for word, i in word_index.items():\n", 196 | " if i > MAX_NUM_WORDS:\n", 197 | " continue\n", 198 | " embedding_vector = embeddings_index.get(word)\n", 199 | " if embedding_vector is not None:\n", 200 | " # words not found in embedding index will be all-zeros.\n", 201 | " embedding_matrix[i] = embedding_vector\n", 202 | "\n", 203 | "print('Number of words:', num_words)\n", 204 | "print('Shape of embeddings:', embedding_matrix.shape)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "Now the data AND embeddings are saved to file to prepare for training.\n", 212 | "\n", 213 | "Note that we will not be loading the original, unprocessed set of embeddings into the training container — instead, to save loading time, we just save the embedding matrix, which at 16MB is much smaller than the original set of embeddings at 892MB. Depending on how large of a set of embeddings you need for other use cases, you might save further space by saving the embeddings with joblib (more efficient than the original Python pickle), and/or save the embeddings with half precision (fp16) instead of full precision and then restore them to full precision after they are loaded." 
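(Added sketch, not in the original notebook, showing the half-precision idea in concrete terms; the file name is illustrative.)

    np.save('embedding_fp16.npy', embedding_matrix.astype(np.float16))  # roughly half the size on disk
    embedding_matrix_restored = np.load('embedding_fp16.npy').astype(np.float32)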
214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "data_dir = os.path.join(os.getcwd(), 'data')\n", 223 | "os.makedirs(data_dir, exist_ok=True)\n", 224 | "\n", 225 | "train_dir = os.path.join(os.getcwd(), 'data/train')\n", 226 | "os.makedirs(train_dir, exist_ok=True)\n", 227 | "\n", 228 | "val_dir = os.path.join(os.getcwd(), 'data/val')\n", 229 | "os.makedirs(val_dir, exist_ok=True)\n", 230 | "\n", 231 | "embedding_dir = os.path.join(os.getcwd(), 'data/embedding')\n", 232 | "os.makedirs(embedding_dir, exist_ok=True)\n", 233 | "\n", 234 | "np.save(os.path.join(train_dir, 'x_train.npy'), x_train)\n", 235 | "np.save(os.path.join(train_dir, 'y_train.npy'), y_train)\n", 236 | "np.save(os.path.join(val_dir, 'x_val.npy'), x_val)\n", 237 | "np.save(os.path.join(val_dir, 'y_val.npy'), y_val)\n", 238 | "np.save(os.path.join(embedding_dir, 'embedding.npy'), embedding_matrix)" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "# SageMaker Hosted Training\n", 246 | "\n", 247 | "Now that we've prepared our embedding matrix, we can move on to use SageMaker's hosted training functionality. SageMaker hosted training is preferred for doing actual training in place of local notebook prototyping, especially for large-scale, distributed training. Before starting hosted training, the data must be uploaded to S3. The word embedding matrix also will be uploaded. We'll do that now, and confirm the upload was successful." 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "s3_prefix = 'tf-20-newsgroups'\n", 257 | "\n", 258 | "traindata_s3_prefix = '{}/data/train'.format(s3_prefix)\n", 259 | "valdata_s3_prefix = '{}/data/val'.format(s3_prefix)\n", 260 | "embeddingdata_s3_prefix = '{}/data/embedding'.format(s3_prefix)\n", 261 | "\n", 262 | "train_s3 = sagemaker.Session().upload_data(path='./data/train/', key_prefix=traindata_s3_prefix)\n", 263 | "val_s3 = sagemaker.Session().upload_data(path='./data/val/', key_prefix=valdata_s3_prefix)\n", 264 | "embedding_s3 = sagemaker.Session().upload_data(path='./data/embedding/', key_prefix=embeddingdata_s3_prefix)\n", 265 | "\n", 266 | "inputs = {'train':train_s3, 'val': val_s3, 'embedding': embedding_s3}\n", 267 | "print(inputs)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "We're now ready to set up an Estimator object for hosted training. Hyperparameters are passed in as a dictionary. Importantly, for the case of a model such as this one that takes word embeddings as an input, various aspects of the embeddings can be passed in with the dictionary so the embedding layer can be constructed in a flexible manner and not hardcoded. This allows easier tuning without having to make code modifications. 
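(Added note, not in the original notebook: each entry in the hyperparameters dictionary reaches the training container as a command-line flag parsed by code/train.py, and each key of the `inputs` dictionary above becomes an input channel that the script locates through an environment variable, typically mounted under /opt/ml/input/data/<channel>.)

    # inside the training container (sketch) -- this is exactly what code/train.py reads
    embedding_dir = os.environ.get('SM_CHANNEL_EMBEDDING')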
" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "import sagemaker\n", 284 | "from sagemaker.tensorflow import TensorFlow\n", 285 | "\n", 286 | "train_instance_type = 'ml.p3.2xlarge'\n", 287 | "hyperparameters = {'epochs': 20, \n", 288 | " 'batch_size': 128, \n", 289 | " 'num_words': num_words,\n", 290 | " 'word_index_len': len(word_index),\n", 291 | " 'labels_index_len': len(labels_index),\n", 292 | " 'embedding_dim': EMBEDDING_DIM,\n", 293 | " 'max_sequence_len': MAX_SEQUENCE_LENGTH\n", 294 | " }\n", 295 | "\n", 296 | "estimator = TensorFlow(entry_point='train.py',\n", 297 | " source_dir='code',\n", 298 | " model_dir=model_dir,\n", 299 | " instance_type=train_instance_type,\n", 300 | " instance_count=1,\n", 301 | " hyperparameters=hyperparameters,\n", 302 | " role=sagemaker.get_execution_role(),\n", 303 | " base_job_name='tf-20-newsgroups',\n", 304 | " framework_version='2.1',\n", 305 | " py_version='py3',\n", 306 | " script_mode=True)" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "To start the training job, simply call the `fit` method of the `Estimator` object. The `inputs` parameter is the dictionary we created above, which defines three channels. Besides the usual channels for the training and validation datasets, there is a channel for the embedding matrix. This illustrates one aspect of the flexibility of SageMaker for setting up training jobs: in addition to data, you can pass in arbitrary files needed for training. " 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "estimator.fit(inputs)" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "# SageMaker hosted endpoint\n", 330 | "\n", 331 | "If we wish to deploy the model to production, the next step is to create a SageMaker hosted endpoint. The endpoint will retrieve the TensorFlow SavedModel created during training and deploy it within a TensorFlow Serving container. This all can be accomplished with one line of code, an invocation of the Estimator's deploy method." 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.xlarge')" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "We can now compare the predictions generated by the endpoint with a sample of the validation data. The results are shown as integer labels from 0 to 19 corresponding to the 20 different newsgroups." 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "results = predictor.predict(x_val[:10])['predictions'] \n", 357 | "\n", 358 | "print('predictions: \\t{}'.format(np.argmax(results, axis=1)))\n", 359 | "print('target values: \\t{}'.format(np.argmax(y_val[:10], axis=1)))" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "When you're finished with your review of this notebook, you can delete the prediction endpoint to release the instance(s) associated with it." 
367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [ 375 | "sagemaker.Session().delete_endpoint(predictor.endpoint_name)" 376 | ] 377 | } 378 | ], 379 | "metadata": { 380 | "kernelspec": { 381 | "display_name": "conda_tensorflow2_p36", 382 | "language": "python", 383 | "name": "conda_tensorflow2_p36" 384 | }, 385 | "language_info": { 386 | "codemirror_mode": { 387 | "name": "ipython", 388 | "version": 3 389 | }, 390 | "file_extension": ".py", 391 | "mimetype": "text/x-python", 392 | "name": "python", 393 | "nbconvert_exporter": "python", 394 | "pygments_lexer": "ipython3", 395 | "version": "3.6.13" 396 | } 397 | }, 398 | "nbformat": 4, 399 | "nbformat_minor": 2 400 | } 401 | -------------------------------------------------------------------------------- /tf-2-workflow-smpipelines/train_model/model_def.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | def get_model(): 5 | 6 | inputs = tf.keras.Input(shape=(13,)) 7 | hidden_1 = tf.keras.layers.Dense(13, activation='tanh')(inputs) 8 | hidden_2 = tf.keras.layers.Dense(6, activation='sigmoid')(hidden_1) 9 | outputs = tf.keras.layers.Dense(1)(hidden_2) 10 | return tf.keras.Model(inputs=inputs, outputs=outputs) 11 | -------------------------------------------------------------------------------- /tf-2-workflow-smpipelines/train_model/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import tensorflow as tf 5 | 6 | from model_def import get_model 7 | 8 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 9 | 10 | 11 | def parse_args(): 12 | 13 | parser = argparse.ArgumentParser() 14 | 15 | # hyperparameters sent by the client are passed as command-line arguments to the script 16 | parser.add_argument('--epochs', type=int, default=1) 17 | parser.add_argument('--batch_size', type=int, default=64) 18 | parser.add_argument('--learning_rate', type=float, default=0.1) 19 | 20 | # data directories 21 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 22 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 23 | 24 | # model directory 25 | parser.add_argument('--sm-model-dir', type=str, default=os.environ.get('SM_MODEL_DIR')) 26 | 27 | return parser.parse_known_args() 28 | 29 | 30 | def get_train_data(train_dir): 31 | 32 | x_train = np.load(os.path.join(train_dir, 'x_train.npy')) 33 | y_train = np.load(os.path.join(train_dir, 'y_train.npy')) 34 | print('x train', x_train.shape,'y train', y_train.shape) 35 | 36 | return x_train, y_train 37 | 38 | 39 | def get_test_data(test_dir): 40 | 41 | x_test = np.load(os.path.join(test_dir, 'x_test.npy')) 42 | y_test = np.load(os.path.join(test_dir, 'y_test.npy')) 43 | print('x test', x_test.shape,'y test', y_test.shape) 44 | 45 | return x_test, y_test 46 | 47 | 48 | if __name__ == "__main__": 49 | 50 | args, _ = parse_args() 51 | 52 | print('Training data location: {}'.format(args.train)) 53 | print('Test data location: {}'.format(args.test)) 54 | x_train, y_train = get_train_data(args.train) 55 | x_test, y_test = get_test_data(args.test) 56 | 57 | device = '/cpu:0' 58 | print(device) 59 | batch_size = args.batch_size 60 | epochs = args.epochs 61 | learning_rate = args.learning_rate 62 | print('batch_size = {}, epochs = {}, learning rate = {}'.format(batch_size, epochs, learning_rate)) 63 | 64 | 
with tf.device(device): 65 | 66 | model = get_model() 67 | optimizer = tf.keras.optimizers.SGD(learning_rate) 68 | model.compile(optimizer=optimizer, loss='mse') 69 | model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, 70 | validation_data=(x_test, y_test)) 71 | 72 | # evaluate on test set 73 | scores = model.evaluate(x_test, y_test, batch_size, verbose=2) 74 | print("\nTest MSE :", scores) 75 | 76 | # save model 77 | model.save(args.sm_model_dir + '/1') 78 | 79 | 80 | -------------------------------------------------------------------------------- /tf-2-workflow/train_model/model_def.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | def get_model(): 5 | 6 | inputs = tf.keras.Input(shape=(13,)) 7 | hidden_1 = tf.keras.layers.Dense(13, activation='tanh')(inputs) 8 | hidden_2 = tf.keras.layers.Dense(6, activation='sigmoid')(hidden_1) 9 | outputs = tf.keras.layers.Dense(1)(hidden_2) 10 | return tf.keras.Model(inputs=inputs, outputs=outputs) 11 | -------------------------------------------------------------------------------- /tf-2-workflow/train_model/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import tensorflow as tf 5 | 6 | from model_def import get_model 7 | 8 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 9 | 10 | 11 | def parse_args(): 12 | 13 | parser = argparse.ArgumentParser() 14 | 15 | # hyperparameters sent by the client are passed as command-line arguments to the script 16 | parser.add_argument('--epochs', type=int, default=1) 17 | parser.add_argument('--batch_size', type=int, default=64) 18 | parser.add_argument('--learning_rate', type=float, default=0.1) 19 | 20 | # data directories 21 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 22 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 23 | 24 | # model directory: we will use the default set by SageMaker, /opt/ml/model 25 | parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR')) 26 | 27 | return parser.parse_known_args() 28 | 29 | 30 | def get_train_data(train_dir): 31 | 32 | x_train = np.load(os.path.join(train_dir, 'x_train.npy')) 33 | y_train = np.load(os.path.join(train_dir, 'y_train.npy')) 34 | print('x train', x_train.shape,'y train', y_train.shape) 35 | 36 | return x_train, y_train 37 | 38 | 39 | def get_test_data(test_dir): 40 | 41 | x_test = np.load(os.path.join(test_dir, 'x_test.npy')) 42 | y_test = np.load(os.path.join(test_dir, 'y_test.npy')) 43 | print('x test', x_test.shape,'y test', y_test.shape) 44 | 45 | return x_test, y_test 46 | 47 | 48 | if __name__ == "__main__": 49 | 50 | args, _ = parse_args() 51 | 52 | print('Training data location: {}'.format(args.train)) 53 | print('Test data location: {}'.format(args.test)) 54 | x_train, y_train = get_train_data(args.train) 55 | x_test, y_test = get_test_data(args.test) 56 | 57 | device = '/cpu:0' 58 | print(device) 59 | batch_size = args.batch_size 60 | epochs = args.epochs 61 | learning_rate = args.learning_rate 62 | print('batch_size = {}, epochs = {}, learning rate = {}'.format(batch_size, epochs, learning_rate)) 63 | 64 | with tf.device(device): 65 | 66 | model = get_model() 67 | optimizer = tf.keras.optimizers.SGD(learning_rate) 68 | model.compile(optimizer=optimizer, loss='mse') 69 | model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, 70 | 
validation_data=(x_test, y_test)) 71 | 72 | # evaluate on test set 73 | scores = model.evaluate(x_test, y_test, batch_size, verbose=2) 74 | print("\nTest MSE :", scores) 75 | 76 | # save model 77 | model.save(args.model_dir + '/1') 78 | 79 | -------------------------------------------------------------------------------- /tf-batch-inference-script/code/inference.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). You 4 | # may not use this file except in compliance with the License. A copy of 5 | # the License is located at 6 | # 7 | # http://aws.amazon.com/apache2.0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is 10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 11 | # ANY KIND, either express or implied. See the License for the specific 12 | # language governing permissions and limitations under the License. 13 | 14 | import io 15 | import json 16 | import numpy as np 17 | from collections import namedtuple 18 | from PIL import Image 19 | 20 | Context = namedtuple('Context', 21 | 'model_name, model_version, method, rest_uri, grpc_uri, ' 22 | 'custom_attributes, request_content_type, accept_header') 23 | 24 | 25 | def input_handler(data, context): 26 | """ Pre-process request input before it is sent to TensorFlow Serving REST API 27 | 28 | Args: 29 | data (obj): the request data, in format of dict or string 30 | context (Context): an object containing request and configuration details 31 | 32 | Returns: 33 | (dict): a JSON-serializable dict that contains request body and headers 34 | """ 35 | 36 | if context.request_content_type == 'application/x-image': 37 | 38 | image_as_bytes = io.BytesIO(data.read()) 39 | image = Image.open(image_as_bytes) 40 | instance = np.expand_dims(image, axis=0) 41 | return json.dumps({"instances": instance.tolist()}) 42 | 43 | else: 44 | _return_error(415, 'Unsupported content type "{}"'.format(context.request_content_type or 'Unknown')) 45 | 46 | 47 | def output_handler(data, context): 48 | """Post-process TensorFlow Serving output before it is returned to the client. 
49 | 50 | Args: 51 | data (obj): the TensorFlow serving response 52 | context (Context): an object containing request and configuration details 53 | 54 | Returns: 55 | (bytes, string): data to return to client, response content type 56 | """ 57 | if data.status_code != 200: 58 | raise Exception(data.content.decode('utf-8')) 59 | response_content_type = context.accept_header 60 | prediction = data.content 61 | return prediction, response_content_type 62 | 63 | 64 | def _return_error(code, message): 65 | raise ValueError('Error: {}, {}'.format(str(code), message)) 66 | -------------------------------------------------------------------------------- /tf-batch-inference-script/code/model_def.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from tensorflow.keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization 3 | from tensorflow.keras.models import Sequential 4 | from tensorflow.keras.optimizers import Adam, SGD, RMSprop 5 | 6 | HEIGHT = 32 7 | WIDTH = 32 8 | DEPTH = 3 9 | NUM_CLASSES = 10 10 | 11 | def get_model(learning_rate, weight_decay, optimizer, momentum, size, mpi=False, hvd=False): 12 | 13 | model = Sequential() 14 | model.add(Conv2D(32, (3, 3), padding='same', input_shape=(HEIGHT, WIDTH, DEPTH))) 15 | model.add(BatchNormalization()) 16 | model.add(Activation('relu')) 17 | model.add(Conv2D(32, (3, 3))) 18 | model.add(BatchNormalization()) 19 | model.add(Activation('relu')) 20 | model.add(MaxPooling2D(pool_size=(2, 2))) 21 | model.add(Dropout(0.2)) 22 | 23 | model.add(Conv2D(64, (3, 3), padding='same')) 24 | model.add(BatchNormalization()) 25 | model.add(Activation('relu')) 26 | model.add(Conv2D(64, (3, 3))) 27 | model.add(BatchNormalization()) 28 | model.add(Activation('relu')) 29 | model.add(MaxPooling2D(pool_size=(2, 2))) 30 | model.add(Dropout(0.3)) 31 | 32 | model.add(Conv2D(128, (3, 3), padding='same')) 33 | model.add(BatchNormalization()) 34 | model.add(Activation('relu')) 35 | model.add(Conv2D(128, (3, 3))) 36 | model.add(BatchNormalization()) 37 | model.add(Activation('relu')) 38 | model.add(MaxPooling2D(pool_size=(2, 2))) 39 | model.add(Dropout(0.4)) 40 | 41 | model.add(Flatten()) 42 | model.add(Dense(512)) 43 | model.add(Activation('relu')) 44 | model.add(Dropout(0.5)) 45 | model.add(Dense(NUM_CLASSES)) 46 | model.add(Activation('softmax')) 47 | 48 | if mpi: 49 | size = hvd.size() 50 | 51 | if optimizer.lower() == 'sgd': 52 | opt = SGD(lr=learning_rate * size, decay=weight_decay, momentum=momentum) 53 | elif optimizer.lower() == 'rmsprop': 54 | opt = RMSprop(lr=learning_rate * size, decay=weight_decay) 55 | else: 56 | opt = Adam(lr=learning_rate * size, decay=weight_decay) 57 | 58 | if mpi: 59 | opt = hvd.DistributedOptimizer(opt) 60 | 61 | model.compile(loss='categorical_crossentropy', 62 | optimizer=opt, 63 | metrics=['accuracy']) 64 | 65 | return model 66 | 67 | -------------------------------------------------------------------------------- /tf-batch-inference-script/code/requirements.txt: -------------------------------------------------------------------------------- 1 | Pillow 2 | numpy 3 | -------------------------------------------------------------------------------- /tf-batch-inference-script/code/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import codecs 3 | import json 4 | import logging 5 | import numpy as np 6 | import os 7 | import re 8 | 9 | import tensorflow as tf 10 | import 
tensorflow.keras.backend as K 11 | from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint 12 | 13 | from model_def import get_model, HEIGHT, WIDTH, DEPTH, NUM_CLASSES 14 | from utilities import process_input 15 | 16 | 17 | logging.getLogger().setLevel(logging.INFO) 18 | tf.logging.set_verbosity(tf.logging.ERROR) 19 | 20 | 21 | # Copy inference pre/post-processing script so it will be included in the model package 22 | os.system('mkdir /opt/ml/model/code') 23 | os.system('cp inference.py /opt/ml/model/code') 24 | os.system('cp requirements.txt /opt/ml/model/code') 25 | 26 | 27 | class CustomTensorBoardCallback(TensorBoard): 28 | 29 | def on_batch_end(self, batch, logs=None): 30 | pass 31 | 32 | 33 | def save_history(path, history): 34 | 35 | history_for_json = {} 36 | # transform float values that aren't json-serializable 37 | for key in list(history.history.keys()): 38 | if type(history.history[key]) == np.ndarray: 39 | history_for_json[key] = history.history[key].tolist() 40 | elif type(history.history[key]) == list: 41 | if type(history.history[key][0]) == np.float32 or type(history.history[key][0]) == np.float64: 42 | history_for_json[key] = list(map(float, history.history[key])) 43 | 44 | with codecs.open(path, 'w', encoding='utf-8') as f: 45 | json.dump(history_for_json, f, separators=(',', ':'), sort_keys=True, indent=4) 46 | 47 | 48 | def save_model(model, output): 49 | 50 | # create a TensorFlow SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 51 | tf.contrib.saved_model.save_keras_model(model, args.model_dir) 52 | logging.info("Model successfully saved at: {}".format(output)) 53 | return 54 | 55 | 56 | def main(args): 57 | 58 | mpi = False 59 | if 'sourcedir.tar.gz' in args.tensorboard_dir: 60 | tensorboard_dir = re.sub('source/sourcedir.tar.gz', 'model', args.tensorboard_dir) 61 | else: 62 | tensorboard_dir = args.tensorboard_dir 63 | logging.info("Writing TensorBoard logs to {}".format(tensorboard_dir)) 64 | 65 | if 'sagemaker_mpi_enabled' in args.fw_params: 66 | if args.fw_params['sagemaker_mpi_enabled']: 67 | import horovod.tensorflow.keras as hvd 68 | mpi = True 69 | hvd.init() 70 | config = tf.ConfigProto() 71 | config.gpu_options.allow_growth = True 72 | config.gpu_options.visible_device_list = str(hvd.local_rank()) 73 | K.set_session(tf.Session(config=config)) 74 | else: 75 | hvd = None 76 | 77 | logging.info("Running with MPI={}".format(mpi)) 78 | logging.info("getting data") 79 | train_dataset = process_input(args.epochs, args.batch_size, args.train, 'train', args.data_config) 80 | eval_dataset = process_input(args.epochs, args.batch_size, args.eval, 'eval', args.data_config) 81 | validation_dataset = process_input(args.epochs, args.batch_size, args.validation, 'validation', args.data_config) 82 | 83 | logging.info("configuring model") 84 | model = get_model(args.learning_rate, args.weight_decay, args.optimizer, args.momentum, 1, mpi, hvd) 85 | callbacks = [] 86 | if mpi: 87 | callbacks.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0)) 88 | callbacks.append(hvd.callbacks.MetricAverageCallback()) 89 | callbacks.append(hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1)) 90 | callbacks.append(tf.keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1)) 91 | if hvd.rank() == 0: 92 | callbacks.append(ModelCheckpoint(args.output_data_dir + '/checkpoint-{epoch}.h5')) 93 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 94 | else: 95 | callbacks.append(ModelCheckpoint(args.output_data_dir + 
'/checkpoint-{epoch}.h5')) 96 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 97 | 98 | logging.info("Starting training") 99 | size = 1 100 | if mpi: 101 | size = hvd.size() 102 | 103 | history = model.fit(x=train_dataset[0], 104 | y=train_dataset[1], 105 | steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size) // size, 106 | epochs=args.epochs, 107 | validation_data=validation_dataset, 108 | validation_steps=(num_examples_per_epoch('validation') // args.batch_size) // size, 109 | callbacks=callbacks) 110 | 111 | score = model.evaluate(eval_dataset[0], 112 | eval_dataset[1], 113 | steps=num_examples_per_epoch('eval') // args.batch_size, 114 | verbose=0) 115 | 116 | logging.info('Test loss:{}'.format(score[0])) 117 | logging.info('Test accuracy:{}'.format(score[1])) 118 | 119 | if mpi: 120 | if hvd.rank() == 0: 121 | save_history(args.model_dir + "/hvd_history.p", history) 122 | return save_model(model, args.model_output_dir) 123 | else: 124 | save_history(args.model_dir + "/hvd_history.p", history) 125 | return save_model(model, args.model_output_dir) 126 | 127 | 128 | def num_examples_per_epoch(subset='train'): 129 | if subset == 'train': 130 | return 40000 131 | elif subset == 'validation': 132 | return 10000 133 | elif subset == 'eval': 134 | return 10000 135 | else: 136 | raise ValueError('Invalid data subset "%s"' % subset) 137 | 138 | 139 | if __name__ == '__main__': 140 | 141 | parser = argparse.ArgumentParser() 142 | 143 | parser.add_argument('--train',type=str,required=False,default=os.environ.get('SM_CHANNEL_TRAIN')) 144 | parser.add_argument('--validation',type=str,required=False,default=os.environ.get('SM_CHANNEL_VALIDATION')) 145 | parser.add_argument('--eval',type=str,required=False,default=os.environ.get('SM_CHANNEL_EVAL')) 146 | parser.add_argument('--model_dir',type=str,required=True,help='The directory where the model will be stored.') 147 | parser.add_argument('--model_output_dir',type=str,default=os.environ.get('SM_MODEL_DIR')) 148 | parser.add_argument('--output_data_dir',type=str,default=os.environ.get('SM_OUTPUT_DATA_DIR')) 149 | parser.add_argument('--output-dir',type=str,default=os.environ.get('SM_OUTPUT_DIR')) 150 | parser.add_argument('--tensorboard-dir',type=str,default=os.environ.get('SM_MODULE_DIR')) 151 | parser.add_argument('--weight-decay',type=float,default=2e-4,help='Weight decay for convolutions.') 152 | parser.add_argument('--learning-rate',type=float,default=0.001,help='Initial learning rate.') 153 | parser.add_argument('--epochs',type=int,default=10) 154 | parser.add_argument('--batch-size',type=int,default=128) 155 | parser.add_argument('--data-config',type=json.loads,default=os.environ.get('SM_INPUT_DATA_CONFIG')) 156 | parser.add_argument('--fw-params',type=json.loads,default=os.environ.get('SM_FRAMEWORK_PARAMS')) 157 | parser.add_argument('--optimizer',type=str,default='adam') 158 | parser.add_argument('--momentum',type=float,default='0.9') 159 | 160 | args = parser.parse_args() 161 | 162 | main(args) 163 | -------------------------------------------------------------------------------- /tf-batch-inference-script/code/utilities.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import os 3 | import tensorflow as tf 4 | 5 | from model_def import HEIGHT, WIDTH, DEPTH, NUM_CLASSES 6 | 7 | 8 | NUM_DATA_BATCHES = 5 9 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES 10 | 11 | 12 | def _get_filenames(channel_name, channel): 13 | if channel_name in 
['train', 'validation', 'eval']: 14 | return [os.path.join(channel, channel_name + '.tfrecords')] 15 | else: 16 | raise ValueError('Invalid data subset "%s"' % channel_name) 17 | 18 | 19 | def _train_preprocess_fn(image): 20 | 21 | # Resize the image to add four extra pixels on each side. 22 | image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8) 23 | 24 | # Randomly crop a [HEIGHT, WIDTH] section of the image. 25 | image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH]) 26 | 27 | # Randomly flip the image horizontally. 28 | image = tf.image.random_flip_left_right(image) 29 | 30 | return image 31 | 32 | 33 | def _dataset_parser(value): 34 | 35 | featdef = { 36 | 'image': tf.FixedLenFeature([], tf.string), 37 | 'label': tf.FixedLenFeature([], tf.int64), 38 | } 39 | 40 | example = tf.parse_single_example(value, featdef) 41 | image = tf.decode_raw(example['image'], tf.uint8) 42 | image.set_shape([DEPTH * HEIGHT * WIDTH]) 43 | 44 | # Reshape from [depth * height * width] to [depth, height, width]. 45 | image = tf.cast( 46 | tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]), 47 | tf.float32) 48 | label = tf.cast(example['label'], tf.int32) 49 | image = _train_preprocess_fn(image) 50 | return image, tf.one_hot(label, NUM_CLASSES) 51 | 52 | 53 | def process_input(epochs, batch_size, channel, channel_name, data_config): 54 | 55 | mode = data_config[channel_name]['TrainingInputMode'] 56 | filenames = _get_filenames(channel_name, channel) 57 | # Repeat infinitely. 58 | logging.info("Running {} in {} mode".format(channel_name, mode)) 59 | if mode == 'Pipe': 60 | from sagemaker_tensorflow import PipeModeDataset 61 | dataset = PipeModeDataset(channel=channel_name, record_format='TFRecord') 62 | else: 63 | dataset = tf.data.TFRecordDataset(filenames) 64 | 65 | dataset = dataset.repeat(epochs) 66 | dataset = dataset.prefetch(10) 67 | 68 | # Parse records. 69 | dataset = dataset.map( 70 | _dataset_parser, num_parallel_calls=10) 71 | 72 | # Potentially shuffle records. 73 | if channel_name == 'train': 74 | # Ensure that the capacity is sufficiently large to provide good random 75 | # shuffling. 76 | buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size 77 | dataset = dataset.shuffle(buffer_size=buffer_size) 78 | 79 | # Batch it up. 80 | dataset = dataset.batch(batch_size, drop_remainder=True) 81 | iterator = dataset.make_one_shot_iterator() 82 | image_batch, label_batch = iterator.get_next() 83 | 84 | return image_batch, label_batch 85 | 86 | 87 | -------------------------------------------------------------------------------- /tf-batch-inference-script/generate_cifar10_tfrecords.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # ============================================================================== 15 | """Read CIFAR-10 data from pickled numpy arrays and writes TFRecords. 16 | 17 | Generates tf.train.Example protos and writes them to TFRecord files from the 18 | python version of the CIFAR-10 dataset downloaded from 19 | https://www.cs.toronto.edu/~kriz/cifar.html. 20 | """ 21 | 22 | from __future__ import absolute_import 23 | from __future__ import division 24 | from __future__ import print_function 25 | 26 | import argparse 27 | import os 28 | import sys 29 | 30 | import tarfile 31 | from six.moves import cPickle as pickle 32 | from six.moves import xrange # pylint: disable=redefined-builtin 33 | import tensorflow as tf 34 | 35 | tf.logging.set_verbosity(tf.logging.ERROR) 36 | if type(tf.contrib) != type(tf): tf.contrib._warning = None 37 | 38 | CIFAR_FILENAME = 'cifar-10-python.tar.gz' 39 | CIFAR_DOWNLOAD_URL = 'https://www.cs.toronto.edu/~kriz/' + CIFAR_FILENAME 40 | CIFAR_LOCAL_FOLDER = 'cifar-10-batches-py' 41 | 42 | 43 | def download_and_extract(data_dir): 44 | # download CIFAR-10 if not already downloaded. 45 | tf.contrib.learn.datasets.base.maybe_download(CIFAR_FILENAME, data_dir, 46 | CIFAR_DOWNLOAD_URL) 47 | tarfile.open(os.path.join(data_dir, CIFAR_FILENAME), 48 | 'r:gz').extractall(data_dir) 49 | 50 | 51 | def _int64_feature(value): 52 | return tf.train.Feature(int64_list=tf.train.Int64List(value=[value])) 53 | 54 | 55 | def _bytes_feature(value): 56 | return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value])) 57 | 58 | 59 | def _get_file_names(): 60 | """Returns the file names expected to exist in the input_dir.""" 61 | file_names = {} 62 | file_names['train'] = ['data_batch_%d' % i for i in xrange(1, 5)] 63 | file_names['validation'] = ['data_batch_5'] 64 | file_names['eval'] = ['test_batch'] 65 | return file_names 66 | 67 | 68 | def read_pickle_from_file(filename): 69 | with tf.gfile.Open(filename, 'rb') as f: 70 | if sys.version_info >= (3, 0): 71 | data_dict = pickle.load(f, encoding='bytes') 72 | else: 73 | data_dict = pickle.load(f) 74 | return data_dict 75 | 76 | 77 | def convert_to_tfrecord(input_files, output_file): 78 | """Converts a file to TFRecords.""" 79 | print('Generating %s' % output_file) 80 | with tf.python_io.TFRecordWriter(output_file) as record_writer: 81 | for input_file in input_files: 82 | data_dict = read_pickle_from_file(input_file) 83 | data = data_dict[b'data'] 84 | labels = data_dict[b'labels'] 85 | 86 | num_entries_in_batch = len(labels) 87 | for i in range(num_entries_in_batch): 88 | example = tf.train.Example(features=tf.train.Features( 89 | feature={ 90 | 'image': _bytes_feature(data[i].tobytes()), 91 | 'label': _int64_feature(labels[i]) 92 | })) 93 | record_writer.write(example.SerializeToString()) 94 | 95 | 96 | def main(data_dir): 97 | print('Download from {} and extract.'.format(CIFAR_DOWNLOAD_URL)) 98 | download_and_extract(data_dir) 99 | file_names = _get_file_names() 100 | input_dir = os.path.join(data_dir, CIFAR_LOCAL_FOLDER) 101 | for mode, files in file_names.items(): 102 | input_files = [os.path.join(input_dir, f) for f in files] 103 | output_file = os.path.join(data_dir+'/'+mode, mode + '.tfrecords') 104 | if not os.path.exists(data_dir+'/'+mode): 105 | os.makedirs(data_dir+'/'+mode) 106 | try: 107 | os.remove(output_file) 108 | except OSError: 109 | pass 110 | # Convert to tf.train.Example and write the to TFRecords. 
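# Note: each subset is written to <data_dir>/<mode>/<mode>.tfrecords (train, validation, eval),
# which is the per-channel file layout that the training scripts' _get_filenames() helper expects.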
111 | convert_to_tfrecord(input_files, output_file) 112 | print('Done!') 113 | import shutil 114 | shutil.rmtree(data_dir+'/cifar-10-batches-py') 115 | os.remove(data_dir+'/cifar-10-python.tar.gz') 116 | 117 | 118 | if __name__ == '__main__': 119 | parser = argparse.ArgumentParser() 120 | parser.add_argument( 121 | '--data-dir', 122 | type=str, 123 | default='', 124 | help='Directory to download and extract CIFAR-10 to.') 125 | 126 | args = parser.parse_args() 127 | main(args.data_dir) 128 | -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1000_dog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1000_dog.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1001_airplane.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1001_airplane.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1003_deer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1003_deer.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1004_ship.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1004_ship.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1005_automobile.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1005_automobile.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1008_truck.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1008_truck.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1009_frog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1009_frog.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1014_cat.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1014_cat.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1037_horse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1037_horse.png -------------------------------------------------------------------------------- /tf-batch-inference-script/sample-img/1038_bird.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-batch-inference-script/sample-img/1038_bird.png -------------------------------------------------------------------------------- /tf-distribution-options/code/inference.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). You 4 | # may not use this file except in compliance with the License. A copy of 5 | # the License is located at 6 | # 7 | # http://aws.amazon.com/apache2.0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is 10 | # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 11 | # ANY KIND, either express or implied. See the License for the specific 12 | # language governing permissions and limitations under the License. 13 | 14 | import io 15 | import json 16 | import numpy as np 17 | from collections import namedtuple 18 | from PIL import Image 19 | 20 | Context = namedtuple('Context', 21 | 'model_name, model_version, method, rest_uri, grpc_uri, ' 22 | 'custom_attributes, request_content_type, accept_header') 23 | 24 | 25 | def input_handler(data, context): 26 | """ Pre-process request input before it is sent to TensorFlow Serving REST API 27 | 28 | Args: 29 | data (obj): the request data, in format of dict or string 30 | context (Context): an object containing request and configuration details 31 | 32 | Returns: 33 | (dict): a JSON-serializable dict that contains request body and headers 34 | """ 35 | 36 | if context.request_content_type == 'application/x-image': 37 | 38 | image_as_bytes = io.BytesIO(data.read()) 39 | image = Image.open(image_as_bytes) 40 | instance = np.expand_dims(image, axis=0) 41 | return json.dumps({"instances": instance.tolist()}) 42 | 43 | else: 44 | _return_error(415, 'Unsupported content type "{}"'.format(context.request_content_type or 'Unknown')) 45 | 46 | 47 | def output_handler(data, context): 48 | """Post-process TensorFlow Serving output before it is returned to the client. 
49 | 50 | Args: 51 | data (obj): the TensorFlow serving response 52 | context (Context): an object containing request and configuration details 53 | 54 | Returns: 55 | (bytes, string): data to return to client, response content type 56 | """ 57 | if data.status_code != 200: 58 | raise Exception(data.content.decode('utf-8')) 59 | response_content_type = context.accept_header 60 | prediction = data.content 61 | return prediction, response_content_type 62 | 63 | 64 | def _return_error(code, message): 65 | raise ValueError('Error: {}, {}'.format(str(code), message)) 66 | -------------------------------------------------------------------------------- /tf-distribution-options/code/model_def.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from tensorflow.keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization 3 | from tensorflow.keras.models import Sequential 4 | from tensorflow.keras.optimizers import Adam, SGD, RMSprop 5 | 6 | HEIGHT = 32 7 | WIDTH = 32 8 | DEPTH = 3 9 | NUM_CLASSES = 10 10 | 11 | def get_model(learning_rate, weight_decay, optimizer, momentum, size, mpi=False, hvd=False): 12 | 13 | model = Sequential() 14 | model.add(Conv2D(32, (3, 3), padding='same', input_shape=(HEIGHT, WIDTH, DEPTH))) 15 | model.add(BatchNormalization()) 16 | model.add(Activation('relu')) 17 | model.add(Conv2D(32, (3, 3))) 18 | model.add(BatchNormalization()) 19 | model.add(Activation('relu')) 20 | model.add(MaxPooling2D(pool_size=(2, 2))) 21 | model.add(Dropout(0.2)) 22 | 23 | model.add(Conv2D(64, (3, 3), padding='same')) 24 | model.add(BatchNormalization()) 25 | model.add(Activation('relu')) 26 | model.add(Conv2D(64, (3, 3))) 27 | model.add(BatchNormalization()) 28 | model.add(Activation('relu')) 29 | model.add(MaxPooling2D(pool_size=(2, 2))) 30 | model.add(Dropout(0.3)) 31 | 32 | model.add(Conv2D(128, (3, 3), padding='same')) 33 | model.add(BatchNormalization()) 34 | model.add(Activation('relu')) 35 | model.add(Conv2D(128, (3, 3))) 36 | model.add(BatchNormalization()) 37 | model.add(Activation('relu')) 38 | model.add(MaxPooling2D(pool_size=(2, 2))) 39 | model.add(Dropout(0.4)) 40 | 41 | model.add(Flatten()) 42 | model.add(Dense(512)) 43 | model.add(Activation('relu')) 44 | model.add(Dropout(0.5)) 45 | model.add(Dense(NUM_CLASSES)) 46 | model.add(Activation('softmax')) 47 | 48 | if mpi: 49 | size = hvd.size() 50 | 51 | if optimizer.lower() == 'sgd': 52 | opt = SGD(lr=learning_rate * size, decay=weight_decay, momentum=momentum) 53 | elif optimizer.lower() == 'rmsprop': 54 | opt = RMSprop(lr=learning_rate * size, decay=weight_decay) 55 | else: 56 | opt = Adam(lr=learning_rate * size, decay=weight_decay) 57 | 58 | if mpi: 59 | opt = hvd.DistributedOptimizer(opt) 60 | 61 | model.compile(loss='categorical_crossentropy', 62 | optimizer=opt, 63 | metrics=['accuracy']) 64 | 65 | return model 66 | 67 | -------------------------------------------------------------------------------- /tf-distribution-options/code/requirements.txt: -------------------------------------------------------------------------------- 1 | Pillow 2 | numpy 3 | -------------------------------------------------------------------------------- /tf-distribution-options/code/train_hvd.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import codecs 3 | import json 4 | import logging 5 | import numpy as np 6 | import os 7 | import re 8 | 9 | import tensorflow as tf 10 | import 
tensorflow.keras.backend as K 11 | from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint 12 | 13 | from model_def import get_model, HEIGHT, WIDTH, DEPTH, NUM_CLASSES 14 | from utilities import process_input 15 | 16 | 17 | logging.getLogger().setLevel(logging.INFO) 18 | tf.logging.set_verbosity(tf.logging.ERROR) 19 | 20 | 21 | # Copy inference pre/post-processing script so it will be included in the model package 22 | os.system('mkdir /opt/ml/model/code') 23 | os.system('cp inference.py /opt/ml/model/code') 24 | os.system('cp requirements.txt /opt/ml/model/code') 25 | 26 | 27 | class CustomTensorBoardCallback(TensorBoard): 28 | 29 | def on_batch_end(self, batch, logs=None): 30 | pass 31 | 32 | 33 | def save_history(path, history): 34 | 35 | history_for_json = {} 36 | # transform float values that aren't json-serializable 37 | for key in list(history.history.keys()): 38 | if type(history.history[key]) == np.ndarray: 39 | history_for_json[key] = history.history[key].tolist() 40 | elif type(history.history[key]) == list: 41 | if type(history.history[key][0]) == np.float32 or type(history.history[key][0]) == np.float64: 42 | history_for_json[key] = list(map(float, history.history[key])) 43 | 44 | with codecs.open(path, 'w', encoding='utf-8') as f: 45 | json.dump(history_for_json, f, separators=(',', ':'), sort_keys=True, indent=4) 46 | 47 | 48 | def save_model(model, output): 49 | 50 | # create a TensorFlow SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 51 | tf.contrib.saved_model.save_keras_model(model, args.model_dir) 52 | logging.info("Model successfully saved at: {}".format(output)) 53 | return 54 | 55 | 56 | def main(args): 57 | 58 | mpi = False 59 | if 'sourcedir.tar.gz' in args.tensorboard_dir: 60 | tensorboard_dir = re.sub('source/sourcedir.tar.gz', 'model', args.tensorboard_dir) 61 | else: 62 | tensorboard_dir = args.tensorboard_dir 63 | logging.info("Writing TensorBoard logs to {}".format(tensorboard_dir)) 64 | 65 | if 'sagemaker_mpi_enabled' in args.fw_params: 66 | if args.fw_params['sagemaker_mpi_enabled']: 67 | import horovod.tensorflow.keras as hvd 68 | mpi = True 69 | # Horovod: initialize Horovod.
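# When MPI is enabled, SageMaker launches multiple copies of this script (typically one per GPU).
# After hvd.init(), hvd.size() is the total number of workers across all hosts and hvd.local_rank()
# identifies this process's GPU on the current host. Since every worker consumes batch_size records
# per step, the effective global batch size is batch_size * hvd.size(); get_model() scales the base
# learning rate by the same factor to compensate.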
70 | hvd.init() 71 | 72 | # Horovod: pin GPU to be used to process local rank (one GPU per process) 73 | config = tf.ConfigProto() 74 | config.gpu_options.allow_growth = True 75 | config.gpu_options.visible_device_list = str(hvd.local_rank()) 76 | K.set_session(tf.Session(config=config)) 77 | else: 78 | hvd = None 79 | 80 | logging.info("Running with MPI={}".format(mpi)) 81 | logging.info("getting data") 82 | train_dataset = process_input(args.epochs, args.batch_size, args.train, 'train', args.data_config) 83 | eval_dataset = process_input(args.epochs, args.batch_size, args.eval, 'eval', args.data_config) 84 | validation_dataset = process_input(args.epochs, args.batch_size, args.validation, 'validation', args.data_config) 85 | 86 | logging.info("configuring model") 87 | model = get_model(args.learning_rate, args.weight_decay, args.optimizer, args.momentum, 1, mpi, hvd) 88 | callbacks = [] 89 | if mpi: 90 | callbacks.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0)) 91 | callbacks.append(hvd.callbacks.MetricAverageCallback()) 92 | callbacks.append(hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1)) 93 | callbacks.append(tf.keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1)) 94 | if hvd.rank() == 0: 95 | callbacks.append(ModelCheckpoint(args.output_data_dir + '/checkpoint-{epoch}.h5')) 96 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 97 | else: 98 | callbacks.append(ModelCheckpoint(args.output_data_dir + '/checkpoint-{epoch}.h5')) 99 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 100 | 101 | logging.info("Starting training") 102 | size = 1 103 | if mpi: 104 | size = hvd.size() 105 | 106 | history = model.fit(x=train_dataset[0], 107 | y=train_dataset[1], 108 | steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size) // size, 109 | epochs=args.epochs, 110 | validation_data=validation_dataset, 111 | validation_steps=(num_examples_per_epoch('validation') // args.batch_size) // size, 112 | callbacks=callbacks) 113 | 114 | score = model.evaluate(eval_dataset[0], 115 | eval_dataset[1], 116 | steps=num_examples_per_epoch('eval') // args.batch_size, 117 | verbose=0) 118 | 119 | logging.info('Test loss:{}'.format(score[0])) 120 | logging.info('Test accuracy:{}'.format(score[1])) 121 | 122 | # Horovod: Save model and history only on worker 0 (i.e. 
master) 123 | if mpi: 124 | if hvd.rank() == 0: 125 | save_history(args.model_dir + "/hvd_history.p", history) 126 | return save_model(model, args.model_output_dir) 127 | else: 128 | save_history(args.model_dir + "/hvd_history.p", history) 129 | return save_model(model, args.model_output_dir) 130 | 131 | 132 | def num_examples_per_epoch(subset='train'): 133 | if subset == 'train': 134 | return 40000 135 | elif subset == 'validation': 136 | return 10000 137 | elif subset == 'eval': 138 | return 10000 139 | else: 140 | raise ValueError('Invalid data subset "%s"' % subset) 141 | 142 | 143 | if __name__ == '__main__': 144 | 145 | parser = argparse.ArgumentParser() 146 | 147 | parser.add_argument('--train',type=str,required=False,default=os.environ.get('SM_CHANNEL_TRAIN')) 148 | parser.add_argument('--validation',type=str,required=False,default=os.environ.get('SM_CHANNEL_VALIDATION')) 149 | parser.add_argument('--eval',type=str,required=False,default=os.environ.get('SM_CHANNEL_EVAL')) 150 | parser.add_argument('--model_dir',type=str,required=True,help='The directory where the model will be stored.') 151 | parser.add_argument('--model_output_dir',type=str,default=os.environ.get('SM_MODEL_DIR')) 152 | parser.add_argument('--output_data_dir',type=str,default=os.environ.get('SM_OUTPUT_DATA_DIR')) 153 | parser.add_argument('--output-dir',type=str,default=os.environ.get('SM_OUTPUT_DIR')) 154 | parser.add_argument('--tensorboard-dir',type=str,default=os.environ.get('SM_MODULE_DIR')) 155 | parser.add_argument('--weight-decay',type=float,default=2e-4,help='Weight decay for convolutions.') 156 | parser.add_argument('--learning-rate',type=float,default=0.001,help='Initial learning rate.') 157 | parser.add_argument('--epochs',type=int,default=10) 158 | parser.add_argument('--batch-size',type=int,default=128) 159 | parser.add_argument('--data-config',type=json.loads,default=os.environ.get('SM_INPUT_DATA_CONFIG')) 160 | parser.add_argument('--fw-params',type=json.loads,default=os.environ.get('SM_FRAMEWORK_PARAMS')) 161 | parser.add_argument('--optimizer',type=str,default='adam') 162 | parser.add_argument('--momentum',type=float,default='0.9') 163 | 164 | args = parser.parse_args() 165 | 166 | main(args) 167 | -------------------------------------------------------------------------------- /tf-distribution-options/code/train_ps.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import codecs 3 | import json 4 | import logging 5 | import numpy as np 6 | import os 7 | import re 8 | 9 | import tensorflow as tf 10 | import tensorflow.keras.backend as K 11 | from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint 12 | 13 | from model_def import get_model, HEIGHT, WIDTH, DEPTH, NUM_CLASSES 14 | from utilities import process_input 15 | 16 | 17 | logging.getLogger().setLevel(logging.INFO) 18 | tf.logging.set_verbosity(tf.logging.ERROR) 19 | 20 | 21 | # Copy inference pre/post-processing script so it will be included in the model package 22 | os.system('mkdir /opt/ml/model/code') 23 | os.system('cp inference.py /opt/ml/model/code') 24 | os.system('cp requirements.txt /opt/ml/model/code') 25 | 26 | 27 | class CustomTensorBoardCallback(TensorBoard): 28 | def on_batch_end(self, batch, logs=None): 29 | pass 30 | 31 | 32 | def save_history(path, history): 33 | 34 | history_for_json = {} 35 | # transform float values that aren't json-serializable 36 | for key in list(history.history.keys()): 37 | if type(history.history[key]) == np.ndarray: 38 | 
history_for_json[key] = history.history[key].tolist() 39 | elif type(history.history[key]) == list: 40 | if type(history.history[key][0]) == np.float32 or type(history.history[key][0]) == np.float64: 41 | history_for_json[key] = list(map(float, history.history[key])) 42 | 43 | with codecs.open(path, 'w', encoding='utf-8') as f: 44 | json.dump(history_for_json, f, separators=(',', ':'), sort_keys=True, indent=4) 45 | 46 | 47 | def save_model(model, output): 48 | 49 | # create a TensorFlow SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 50 | tf.contrib.saved_model.save_keras_model(model, args.model_dir) 51 | logging.info("Model successfully saved at: {}".format(output)) 52 | return 53 | 54 | 55 | def main(args): 56 | 57 | if 'sourcedir.tar.gz' in args.tensorboard_dir: 58 | tensorboard_dir = re.sub('source/sourcedir.tar.gz', 'model', args.tensorboard_dir) 59 | else: 60 | tensorboard_dir = args.tensorboard_dir 61 | 62 | logging.info("Writing TensorBoard logs to {}".format(tensorboard_dir)) 63 | 64 | logging.info("getting data") 65 | train_dataset = process_input(args.epochs, args.batch_size, args.train, 'train', args.data_config) 66 | eval_dataset = process_input(args.epochs, args.batch_size, args.eval, 'eval', args.data_config) 67 | validation_dataset = process_input(args.epochs, args.batch_size, args.validation, 'validation', args.data_config) 68 | 69 | logging.info("configuring model") 70 | logging.info("Hosts: "+ os.environ.get('SM_HOSTS')) 71 | 72 | size = len(args.hosts) 73 | 74 | # size (the number of hosts) is passed to get_model(), which scales the learning rate by it 75 | model = get_model(args.learning_rate, args.weight_decay, args.optimizer, args.momentum, size) 76 | callbacks = [] 77 | if args.current_host == args.hosts[0]: 78 | callbacks.append(ModelCheckpoint(args.output_data_dir + '/checkpoint-{epoch}.h5')) 79 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 80 | 81 | logging.info("Starting training") 82 | 83 | history = model.fit(x=train_dataset[0], 84 | y=train_dataset[1], 85 | steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size) // size, 86 | epochs=args.epochs, 87 | validation_data=validation_dataset, 88 | validation_steps=(num_examples_per_epoch('validation') // args.batch_size) // size, callbacks=callbacks) 89 | 90 | score = model.evaluate(eval_dataset[0], 91 | eval_dataset[1], 92 | steps=num_examples_per_epoch('eval') // args.batch_size, 93 | verbose=0) 94 | 95 | logging.info('Test loss:{}'.format(score[0])) 96 | logging.info('Test accuracy:{}'.format(score[1])) 97 | 98 | # PS: Save model and history only on worker 0 99 | if args.current_host == args.hosts[0]: 100 | save_history(args.model_dir + "/ps_history.p", history) 101 | save_model(model, args.model_dir) 102 | 103 | 104 | def num_examples_per_epoch(subset='train'): 105 | 106 | if subset == 'train': 107 | return 40000 108 | elif subset == 'validation': 109 | return 10000 110 | elif subset == 'eval': 111 | return 10000 112 | else: 113 | raise ValueError('Invalid data subset "%s"' % subset) 114 | 115 | 116 | if __name__ == '__main__': 117 | 118 | parser = argparse.ArgumentParser() 119 | 120 | parser.add_argument('--hosts',type=list,default=json.loads(os.environ.get('SM_HOSTS'))) 121 | parser.add_argument('--current-host',type=str,default=os.environ.get('SM_CURRENT_HOST')) 122 | parser.add_argument('--train',type=str,required=False,default=os.environ.get('SM_CHANNEL_TRAIN')) 123 | parser.add_argument('--validation',type=str,required=False,default=os.environ.get('SM_CHANNEL_VALIDATION')) 124 |
parser.add_argument('--eval',type=str,required=False,default=os.environ.get('SM_CHANNEL_EVAL')) 125 | parser.add_argument('--model_dir',type=str,required=True,help='The directory where the model will be stored.') 126 | parser.add_argument('--model_output_dir',type=str,default=os.environ.get('SM_MODEL_DIR')) 127 | parser.add_argument('--output_data_dir',type=str,default=os.environ.get('SM_OUTPUT_DATA_DIR')) 128 | parser.add_argument('--output-dir',type=str,default=os.environ.get('SM_OUTPUT_DIR')) 129 | parser.add_argument('--tensorboard-dir',type=str,default=os.environ.get('SM_MODULE_DIR')) 130 | parser.add_argument('--weight-decay',type=float,default=2e-4,help='Weight decay for convolutions.') 131 | parser.add_argument('--learning-rate',type=float,default=0.001,help='Initial learning rate.') 132 | parser.add_argument('--epochs',type=int,default=10) 133 | parser.add_argument('--batch-size',type=int,default=128) 134 | parser.add_argument('--data-config',type=json.loads,default=os.environ.get('SM_INPUT_DATA_CONFIG')) 135 | parser.add_argument('--fw-params',type=json.loads,default=os.environ.get('SM_FRAMEWORK_PARAMS')) 136 | parser.add_argument('--optimizer',type=str,default='adam') 137 | parser.add_argument('--momentum',type=float,default='0.9') 138 | 139 | args = parser.parse_args() 140 | 141 | main(args) 142 | -------------------------------------------------------------------------------- /tf-distribution-options/code/utilities.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import os 3 | import tensorflow as tf 4 | 5 | from model_def import HEIGHT, WIDTH, DEPTH, NUM_CLASSES 6 | 7 | 8 | NUM_DATA_BATCHES = 5 9 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES 10 | 11 | 12 | def _get_filenames(channel_name, channel): 13 | if channel_name in ['train', 'validation', 'eval']: 14 | return [os.path.join(channel, channel_name + '.tfrecords')] 15 | else: 16 | raise ValueError('Invalid data subset "%s"' % channel_name) 17 | 18 | 19 | def _train_preprocess_fn(image): 20 | 21 | # Resize the image to add four extra pixels on each side. 22 | image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8) 23 | 24 | # Randomly crop a [HEIGHT, WIDTH] section of the image. 25 | image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH]) 26 | 27 | # Randomly flip the image horizontally. 28 | image = tf.image.random_flip_left_right(image) 29 | 30 | return image 31 | 32 | 33 | def _dataset_parser(value): 34 | 35 | featdef = { 36 | 'image': tf.FixedLenFeature([], tf.string), 37 | 'label': tf.FixedLenFeature([], tf.int64), 38 | } 39 | 40 | example = tf.parse_single_example(value, featdef) 41 | image = tf.decode_raw(example['image'], tf.uint8) 42 | image.set_shape([DEPTH * HEIGHT * WIDTH]) 43 | 44 | # Reshape from [depth * height * width] to [depth, height, width]. 45 | image = tf.cast( 46 | tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]), 47 | tf.float32) 48 | label = tf.cast(example['label'], tf.int32) 49 | image = _train_preprocess_fn(image) 50 | return image, tf.one_hot(label, NUM_CLASSES) 51 | 52 | 53 | def process_input(epochs, batch_size, channel, channel_name, data_config): 54 | 55 | mode = data_config[channel_name]['TrainingInputMode'] 56 | filenames = _get_filenames(channel_name, channel) 57 | # Repeat infinitely. 
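# In Pipe mode the records are streamed from S3 via PipeModeDataset rather than read from local
# files, so filenames is only used by the File mode branch below.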
58 | logging.info("Running {} in {} mode".format(channel_name, mode)) 59 | if mode == 'Pipe': 60 | from sagemaker_tensorflow import PipeModeDataset 61 | dataset = PipeModeDataset(channel=channel_name, record_format='TFRecord') 62 | else: 63 | dataset = tf.data.TFRecordDataset(filenames) 64 | 65 | dataset = dataset.repeat(epochs) 66 | dataset = dataset.prefetch(10) 67 | 68 | # Parse records. 69 | dataset = dataset.map( 70 | _dataset_parser, num_parallel_calls=10) 71 | 72 | # Potentially shuffle records. 73 | if channel_name == 'train': 74 | # Ensure that the capacity is sufficiently large to provide good random 75 | # shuffling. 76 | buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size 77 | dataset = dataset.shuffle(buffer_size=buffer_size) 78 | 79 | # Batch it up. 80 | dataset = dataset.batch(batch_size, drop_remainder=True) 81 | iterator = dataset.make_one_shot_iterator() 82 | image_batch, label_batch = iterator.get_next() 83 | 84 | return image_batch, label_batch 85 | 86 | 87 | -------------------------------------------------------------------------------- /tf-distribution-options/generate_cifar10_tfrecords.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """Read CIFAR-10 data from pickled numpy arrays and writes TFRecords. 16 | 17 | Generates tf.train.Example protos and writes them to TFRecord files from the 18 | python version of the CIFAR-10 dataset downloaded from 19 | https://www.cs.toronto.edu/~kriz/cifar.html. 20 | """ 21 | 22 | from __future__ import absolute_import 23 | from __future__ import division 24 | from __future__ import print_function 25 | 26 | import argparse 27 | import os 28 | import sys 29 | 30 | import tarfile 31 | from six.moves import cPickle as pickle 32 | from six.moves import xrange # pylint: disable=redefined-builtin 33 | import tensorflow as tf 34 | 35 | tf.logging.set_verbosity(tf.logging.ERROR) 36 | if type(tf.contrib) != type(tf): tf.contrib._warning = None 37 | 38 | CIFAR_FILENAME = 'cifar-10-python.tar.gz' 39 | CIFAR_DOWNLOAD_URL = 'https://www.cs.toronto.edu/~kriz/' + CIFAR_FILENAME 40 | CIFAR_LOCAL_FOLDER = 'cifar-10-batches-py' 41 | 42 | 43 | def download_and_extract(data_dir): 44 | # download CIFAR-10 if not already downloaded. 
45 | tf.contrib.learn.datasets.base.maybe_download(CIFAR_FILENAME, data_dir, 46 | CIFAR_DOWNLOAD_URL) 47 | tarfile.open(os.path.join(data_dir, CIFAR_FILENAME), 48 | 'r:gz').extractall(data_dir) 49 | 50 | 51 | def _int64_feature(value): 52 | return tf.train.Feature(int64_list=tf.train.Int64List(value=[value])) 53 | 54 | 55 | def _bytes_feature(value): 56 | return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value])) 57 | 58 | 59 | def _get_file_names(): 60 | """Returns the file names expected to exist in the input_dir.""" 61 | file_names = {} 62 | file_names['train'] = ['data_batch_%d' % i for i in xrange(1, 5)] 63 | file_names['validation'] = ['data_batch_5'] 64 | file_names['eval'] = ['test_batch'] 65 | return file_names 66 | 67 | 68 | def read_pickle_from_file(filename): 69 | with tf.gfile.Open(filename, 'rb') as f: 70 | if sys.version_info >= (3, 0): 71 | data_dict = pickle.load(f, encoding='bytes') 72 | else: 73 | data_dict = pickle.load(f) 74 | return data_dict 75 | 76 | 77 | def convert_to_tfrecord(input_files, output_file): 78 | """Converts a file to TFRecords.""" 79 | print('Generating %s' % output_file) 80 | with tf.python_io.TFRecordWriter(output_file) as record_writer: 81 | for input_file in input_files: 82 | data_dict = read_pickle_from_file(input_file) 83 | data = data_dict[b'data'] 84 | labels = data_dict[b'labels'] 85 | 86 | num_entries_in_batch = len(labels) 87 | for i in range(num_entries_in_batch): 88 | example = tf.train.Example(features=tf.train.Features( 89 | feature={ 90 | 'image': _bytes_feature(data[i].tobytes()), 91 | 'label': _int64_feature(labels[i]) 92 | })) 93 | record_writer.write(example.SerializeToString()) 94 | 95 | 96 | def main(data_dir): 97 | print('Download from {} and extract.'.format(CIFAR_DOWNLOAD_URL)) 98 | download_and_extract(data_dir) 99 | file_names = _get_file_names() 100 | input_dir = os.path.join(data_dir, CIFAR_LOCAL_FOLDER) 101 | for mode, files in file_names.items(): 102 | input_files = [os.path.join(input_dir, f) for f in files] 103 | output_file = os.path.join(data_dir+'/'+mode, mode + '.tfrecords') 104 | if not os.path.exists(data_dir+'/'+mode): 105 | os.makedirs(data_dir+'/'+mode) 106 | try: 107 | os.remove(output_file) 108 | except OSError: 109 | pass 110 | # Convert to tf.train.Example and write the to TFRecords. 
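        # Each serialized Example holds the raw 32x32x3 image as a uint8 byte
        # string under 'image' and the class id as an int64 under 'label',
        # matching what _dataset_parser in the training scripts decodes with
        # tf.decode_raw and turns into a one-hot label.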
111 | convert_to_tfrecord(input_files, output_file) 112 | print('Done!') 113 | import shutil 114 | shutil.rmtree(data_dir+'/cifar-10-batches-py') 115 | os.remove(data_dir+'/cifar-10-python.tar.gz') 116 | 117 | 118 | if __name__ == '__main__': 119 | parser = argparse.ArgumentParser() 120 | parser.add_argument( 121 | '--data-dir', 122 | type=str, 123 | default='', 124 | help='Directory to download and extract CIFAR-10 to.') 125 | 126 | args = parser.parse_args() 127 | main(args.data_dir) 128 | -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1000_dog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1000_dog.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1001_airplane.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1001_airplane.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1003_deer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1003_deer.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1004_ship.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1004_ship.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1005_automobile.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1005_automobile.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1008_truck.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1008_truck.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1009_frog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1009_frog.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1014_cat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1014_cat.png 
-------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1037_horse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1037_horse.png -------------------------------------------------------------------------------- /tf-distribution-options/sample-img/1038_bird.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-distribution-options/sample-img/1038_bird.png -------------------------------------------------------------------------------- /tf-eager-script-mode/train_model/model_def.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | def get_model(): 5 | 6 | inputs = tf.keras.Input(shape=(13,)) 7 | hidden_1 = tf.keras.layers.Dense(13, activation='tanh')(inputs) 8 | hidden_2 = tf.keras.layers.Dense(6, activation='sigmoid')(hidden_1) 9 | outputs = tf.keras.layers.Dense(1)(hidden_2) 10 | return tf.keras.Model(inputs=inputs, outputs=outputs) 11 | -------------------------------------------------------------------------------- /tf-eager-script-mode/train_model/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import os 4 | import tensorflow as tf 5 | from tensorflow.contrib.eager.python import tfe 6 | 7 | from model_def import get_model 8 | 9 | 10 | tf.enable_eager_execution() 11 | tf.set_random_seed(0) 12 | np.random.seed(0) 13 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 14 | 15 | 16 | def parse_args(): 17 | 18 | parser = argparse.ArgumentParser() 19 | 20 | # hyperparameters sent by the client are passed as command-line arguments to the script 21 | parser.add_argument('--epochs', type=int, default=1) 22 | parser.add_argument('--batch_size', type=int, default=64) 23 | parser.add_argument('--learning_rate', type=float, default=0.1) 24 | 25 | # data directories 26 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 27 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 28 | 29 | # model directory: we will use the default set by SageMaker, /opt/ml/model 30 | parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR')) 31 | 32 | return parser.parse_known_args() 33 | 34 | 35 | def get_train_data(train_dir): 36 | 37 | x_train = np.load(os.path.join(train_dir, 'x_train.npy')) 38 | y_train = np.load(os.path.join(train_dir, 'y_train.npy')) 39 | print('x train', x_train.shape,'y train', y_train.shape) 40 | 41 | return x_train, y_train 42 | 43 | 44 | def get_test_data(test_dir): 45 | 46 | x_test = np.load(os.path.join(test_dir, 'x_test.npy')) 47 | y_test = np.load(os.path.join(test_dir, 'y_test.npy')) 48 | print('x test', x_test.shape,'y test', y_test.shape) 49 | 50 | return x_test, y_test 51 | 52 | 53 | if __name__ == "__main__": 54 | 55 | args, _ = parse_args() 56 | 57 | x_train, y_train = get_train_data(args.train) 58 | x_test, y_test = get_test_data(args.test) 59 | 60 | device = '/cpu:0' 61 | print(device) 62 | batch_size = args.batch_size 63 | epochs = args.epochs 64 | learning_rate = args.learning_rate 65 | print('batch_size = {}, epochs = {}, learning rate = 
{}'.format(batch_size, epochs, learning_rate)) 66 | 67 | with tf.device(device): 68 | 69 | model = get_model() 70 | optimizer = tf.train.GradientDescentOptimizer(learning_rate) 71 | model.compile(optimizer=optimizer, loss='mse') 72 | model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, 73 | validation_data=(x_test, y_test)) 74 | 75 | # evaluate on test set 76 | scores = model.evaluate(x_test, y_test, batch_size, verbose=2) 77 | print("Test MSE :", scores) 78 | 79 | # save checkpoint for locally loading in notebook 80 | saver = tfe.Saver(model.variables) 81 | saver.save(args.model_dir + '/weights.ckpt') 82 | # create a separate SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 83 | tf.contrib.saved_model.save_keras_model(model, args.model_dir) 84 | 85 | 86 | -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/generate_cifar10_tfrecords.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """Read CIFAR-10 data from pickled numpy arrays and writes TFRecords. 16 | 17 | Generates tf.train.Example protos and writes them to TFRecord files from the 18 | python version of the CIFAR-10 dataset downloaded from 19 | https://www.cs.toronto.edu/~kriz/cifar.html. 20 | """ 21 | 22 | from __future__ import absolute_import 23 | from __future__ import division 24 | from __future__ import print_function 25 | 26 | import argparse 27 | import os 28 | import sys 29 | 30 | import tarfile 31 | from six.moves import cPickle as pickle 32 | from six.moves import xrange # pylint: disable=redefined-builtin 33 | import tensorflow as tf 34 | 35 | tf.logging.set_verbosity(tf.logging.ERROR) 36 | if type(tf.contrib) != type(tf): tf.contrib._warning = None 37 | 38 | CIFAR_FILENAME = 'cifar-10-python.tar.gz' 39 | CIFAR_DOWNLOAD_URL = 'https://www.cs.toronto.edu/~kriz/' + CIFAR_FILENAME 40 | CIFAR_LOCAL_FOLDER = 'cifar-10-batches-py' 41 | 42 | 43 | def download_and_extract(data_dir): 44 | # download CIFAR-10 if not already downloaded. 
45 | tf.contrib.learn.datasets.base.maybe_download(CIFAR_FILENAME, data_dir, 46 | CIFAR_DOWNLOAD_URL) 47 | tarfile.open(os.path.join(data_dir, CIFAR_FILENAME), 48 | 'r:gz').extractall(data_dir) 49 | 50 | 51 | def _int64_feature(value): 52 | return tf.train.Feature(int64_list=tf.train.Int64List(value=[value])) 53 | 54 | 55 | def _bytes_feature(value): 56 | return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value])) 57 | 58 | 59 | def _get_file_names(): 60 | """Returns the file names expected to exist in the input_dir.""" 61 | file_names = {} 62 | file_names['train'] = ['data_batch_%d' % i for i in xrange(1, 5)] 63 | file_names['validation'] = ['data_batch_5'] 64 | file_names['eval'] = ['test_batch'] 65 | return file_names 66 | 67 | 68 | def read_pickle_from_file(filename): 69 | with tf.gfile.Open(filename, 'rb') as f: 70 | if sys.version_info >= (3, 0): 71 | data_dict = pickle.load(f, encoding='bytes') 72 | else: 73 | data_dict = pickle.load(f) 74 | return data_dict 75 | 76 | 77 | def convert_to_tfrecord(input_files, output_file): 78 | """Converts a file to TFRecords.""" 79 | print('Generating %s' % output_file) 80 | with tf.python_io.TFRecordWriter(output_file) as record_writer: 81 | for input_file in input_files: 82 | data_dict = read_pickle_from_file(input_file) 83 | data = data_dict[b'data'] 84 | labels = data_dict[b'labels'] 85 | 86 | num_entries_in_batch = len(labels) 87 | for i in range(num_entries_in_batch): 88 | example = tf.train.Example(features=tf.train.Features( 89 | feature={ 90 | 'image': _bytes_feature(data[i].tobytes()), 91 | 'label': _int64_feature(labels[i]) 92 | })) 93 | record_writer.write(example.SerializeToString()) 94 | 95 | 96 | def main(data_dir): 97 | print('Download from {} and extract.'.format(CIFAR_DOWNLOAD_URL)) 98 | download_and_extract(data_dir) 99 | file_names = _get_file_names() 100 | input_dir = os.path.join(data_dir, CIFAR_LOCAL_FOLDER) 101 | for mode, files in file_names.items(): 102 | input_files = [os.path.join(input_dir, f) for f in files] 103 | output_file = os.path.join(data_dir+'/'+mode, mode + '.tfrecords') 104 | if not os.path.exists(data_dir+'/'+mode): 105 | os.makedirs(data_dir+'/'+mode) 106 | try: 107 | os.remove(output_file) 108 | except OSError: 109 | pass 110 | # Convert to tf.train.Example and write the to TFRecords. 
111 | convert_to_tfrecord(input_files, output_file) 112 | print('Done!') 113 | import shutil 114 | shutil.rmtree(data_dir+'/cifar-10-batches-py') 115 | os.remove(data_dir+'/cifar-10-python.tar.gz') 116 | 117 | 118 | if __name__ == '__main__': 119 | parser = argparse.ArgumentParser() 120 | parser.add_argument( 121 | '--data-dir', 122 | type=str, 123 | default='', 124 | help='Directory to download and extract CIFAR-10 to.') 125 | 126 | args = parser.parse_args() 127 | main(args.data_dir) 128 | -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/image-transformer-container/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.6 2 | 3 | LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true 4 | 5 | RUN pip install flask gunicorn numpy Pillow 6 | 7 | # Add flask app directory 8 | COPY ./app /app 9 | WORKDIR /app 10 | 11 | # Copy entrypoint file and make it executable 12 | COPY entrypoint.sh /entrypoint.sh 13 | RUN chmod +x /entrypoint.sh 14 | 15 | ENTRYPOINT ["/entrypoint.sh"] 16 | -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/image-transformer-container/app/main.py: -------------------------------------------------------------------------------- 1 | import io 2 | import json 3 | import numpy as np 4 | import struct 5 | 6 | from flask import Flask, Response, request 7 | from PIL import Image 8 | 9 | 10 | app = Flask(__name__) 11 | 12 | 13 | def read_image(image_from_s3): 14 | 15 | image_as_bytes = io.BytesIO(image_from_s3) 16 | image = Image.open(image_as_bytes) 17 | instance = np.expand_dims(image, axis=0) 18 | 19 | return instance.tolist() 20 | 21 | 22 | @app.route("/invocations", methods=['POST']) 23 | def invocations(): 24 | 25 | try: 26 | image_for_JSON = read_image(request.data) 27 | # TensorFlow Serving's REST API requires a JSON-formatted request 28 | response = Response(json.dumps({"instances": image_for_JSON})) 29 | response.headers['Content-Type'] = "application/json" 30 | return response 31 | except ValueError as err: 32 | return str(err), 400 33 | 34 | 35 | @app.route("/ping", methods=['GET']) 36 | def ping(): 37 | 38 | return "", 200 39 | -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/image-transformer-container/ecr_policy.json: -------------------------------------------------------------------------------- 1 | { 2 | "Version": "2008-10-17", 3 | "Statement": [ 4 | { 5 | "Sid": "allowSageMakerToPull", 6 | "Effect": "Allow", 7 | "Principal": { 8 | "Service": "sagemaker.amazonaws.com" 9 | }, 10 | "Action": [ 11 | "ecr:GetDownloadUrlForLayer", 12 | "ecr:BatchGetImage", 13 | "ecr:BatchCheckLayerAvailability" 14 | ] 15 | } 16 | ] 17 | } 18 | -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/image-transformer-container/entrypoint.sh: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env bash 2 | set -e 3 | 4 | # Get the listen port from the SM env variable, otherwise default to 8080 5 | LISTEN_PORT=${SAGEMAKER_BIND_TO_PORT:-8080} 6 | 7 | # Set the number of gunicorn worker processes 8 | GUNICORN_WORKER_COUNT=$(nproc) 9 | 10 | PYTHONUNBUFFERED=1 11 | 12 | # Start flask app 13 | exec gunicorn -w $GUNICORN_WORKER_COUNT -b 0.0.0.0:$LISTEN_PORT main:app 14 | -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1000_dog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1000_dog.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1001_airplane.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1001_airplane.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1003_deer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1003_deer.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1004_ship.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1004_ship.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1005_automobile.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1005_automobile.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1008_truck.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1008_truck.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1009_frog.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1009_frog.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1014_cat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1014_cat.png 
-------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1037_horse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1037_horse.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/sample-img/1038_bird.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/54be9ca995bf33d87ccfede258f1c639e07c19fc/tf-horovod-inference-pipeline/sample-img/1038_bird.png -------------------------------------------------------------------------------- /tf-horovod-inference-pipeline/train.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"). 4 | # You may not use this file except in compliance with the License. 5 | # A copy of the License is located at 6 | # 7 | # https://aws.amazon.com/apache-2-0/ 8 | # 9 | # or in the "license" file accompanying this file. This file is distributed 10 | # on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either 11 | # express or implied. See the License for the specific language governing 12 | # permissions and limitations under the License. 13 | 14 | 15 | import argparse 16 | import json 17 | import logging 18 | import os 19 | import re 20 | 21 | import tensorflow as tf 22 | import tensorflow.keras.backend as K 23 | from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint 24 | from tensorflow.keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization 25 | from tensorflow.keras.models import Sequential 26 | from tensorflow.keras.optimizers import Adam, SGD, RMSprop 27 | 28 | logging.getLogger().setLevel(logging.INFO) 29 | tf.logging.set_verbosity(tf.logging.ERROR) 30 | HEIGHT = 32 31 | WIDTH = 32 32 | DEPTH = 3 33 | NUM_CLASSES = 10 34 | NUM_DATA_BATCHES = 5 35 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES 36 | INPUT_TENSOR_NAME = 'inputs_input' # needs to match the name of the first layer + "_input" 37 | 38 | 39 | def keras_model_fn(learning_rate, weight_decay, optimizer, momentum, mpi=False, hvd=False): 40 | 41 | model = Sequential() 42 | model.add(Conv2D(32, (3, 3), padding='same', name='inputs', input_shape=(HEIGHT, WIDTH, DEPTH))) 43 | model.add(BatchNormalization()) 44 | model.add(Activation('relu')) 45 | model.add(Conv2D(32, (3, 3))) 46 | model.add(BatchNormalization()) 47 | model.add(Activation('relu')) 48 | model.add(MaxPooling2D(pool_size=(2, 2))) 49 | model.add(Dropout(0.2)) 50 | 51 | model.add(Conv2D(64, (3, 3), padding='same')) 52 | model.add(BatchNormalization()) 53 | model.add(Activation('relu')) 54 | model.add(Conv2D(64, (3, 3))) 55 | model.add(BatchNormalization()) 56 | model.add(Activation('relu')) 57 | model.add(MaxPooling2D(pool_size=(2, 2))) 58 | model.add(Dropout(0.3)) 59 | 60 | model.add(Conv2D(128, (3, 3), padding='same')) 61 | model.add(BatchNormalization()) 62 | model.add(Activation('relu')) 63 | model.add(Conv2D(128, (3, 3))) 64 | model.add(BatchNormalization()) 65 | model.add(Activation('relu')) 66 | model.add(MaxPooling2D(pool_size=(2, 2))) 67 | 
model.add(Dropout(0.4)) 68 | 69 | model.add(Flatten()) 70 | model.add(Dense(512)) 71 | model.add(Activation('relu')) 72 | model.add(Dropout(0.5)) 73 | model.add(Dense(NUM_CLASSES)) 74 | model.add(Activation('softmax')) 75 | 76 | size = 1 77 | if mpi: 78 | size = hvd.size() 79 | 80 | if optimizer.lower() == 'sgd': 81 | opt = SGD(lr=learning_rate * size, decay=weight_decay, momentum=momentum) 82 | elif optimizer.lower() == 'rmsprop': 83 | opt = RMSprop(lr=learning_rate * size, decay=weight_decay) 84 | else: 85 | opt = Adam(lr=learning_rate * size, decay=weight_decay) 86 | 87 | if mpi: 88 | opt = hvd.DistributedOptimizer(opt) 89 | 90 | model.compile(loss='categorical_crossentropy', 91 | optimizer=opt, 92 | metrics=['accuracy']) 93 | return model 94 | 95 | 96 | class CustomTensorBoardCallback(TensorBoard): 97 | def on_batch_end(self, batch, logs=None): 98 | pass 99 | 100 | 101 | def get_filenames(channel_name, channel): 102 | if channel_name in ['train', 'validation', 'eval']: 103 | return [os.path.join(channel, channel_name + '.tfrecords')] 104 | else: 105 | raise ValueError('Invalid data subset "%s"' % channel_name) 106 | 107 | 108 | def train_input_fn(): 109 | return _input(args.epochs, args.batch_size, args.train, 'train') 110 | 111 | 112 | def eval_input_fn(): 113 | return _input(args.epochs, args.batch_size, args.eval, 'eval') 114 | 115 | 116 | def validation_input_fn(): 117 | return _input(args.epochs, args.batch_size, args.validation, 'validation') 118 | 119 | 120 | def _input(epochs, batch_size, channel, channel_name): 121 | mode = args.data_config[channel_name]['TrainingInputMode'] 122 | """Uses the tf.data input pipeline for CIFAR-10 dataset. 123 | Args: 124 | mode: Standard names for model modes (tf.estimators.ModeKeys). 125 | batch_size: The number of samples per batch of input requested. 126 | """ 127 | filenames = get_filenames(channel_name, channel) 128 | # Repeat infinitely. 129 | logging.info("Running {} in {} mode".format(channel_name, mode)) 130 | if mode == 'Pipe': 131 | from sagemaker_tensorflow import PipeModeDataset 132 | dataset = PipeModeDataset(channel=channel_name, record_format='TFRecord') 133 | else: 134 | dataset = tf.data.TFRecordDataset(filenames) 135 | 136 | dataset = dataset.repeat(epochs) 137 | dataset = dataset.prefetch(10) 138 | 139 | # Parse records. 140 | dataset = dataset.map( 141 | _dataset_parser, num_parallel_calls=10) 142 | 143 | # Potentially shuffle records. 144 | if channel_name == 'train': 145 | # Ensure that the capacity is sufficiently large to provide good random 146 | # shuffling. 147 | buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size 148 | dataset = dataset.shuffle(buffer_size=buffer_size) 149 | 150 | # Batch it up. 151 | dataset = dataset.batch(batch_size, drop_remainder=True) 152 | iterator = dataset.make_one_shot_iterator() 153 | image_batch, label_batch = iterator.get_next() 154 | 155 | return {INPUT_TENSOR_NAME: image_batch}, label_batch 156 | 157 | 158 | def _train_preprocess_fn(image): 159 | """Preprocess a single training image of layout [height, width, depth].""" 160 | # Resize the image to add four extra pixels on each side. 161 | image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8) 162 | 163 | # Randomly crop a [HEIGHT, WIDTH] section of the image. 164 | image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH]) 165 | 166 | # Randomly flip the image horizontally. 
167 | image = tf.image.random_flip_left_right(image) 168 | 169 | return image 170 | 171 | 172 | def _dataset_parser(value): 173 | """Parse a CIFAR-10 record from value.""" 174 | featdef = { 175 | 'image': tf.FixedLenFeature([], tf.string), 176 | 'label': tf.FixedLenFeature([], tf.int64), 177 | } 178 | 179 | example = tf.parse_single_example(value, featdef) 180 | image = tf.decode_raw(example['image'], tf.uint8) 181 | image.set_shape([DEPTH * HEIGHT * WIDTH]) 182 | 183 | # Reshape from [depth * height * width] to [depth, height, width]. 184 | image = tf.cast( 185 | tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]), 186 | tf.float32) 187 | label = tf.cast(example['label'], tf.int32) 188 | image = _train_preprocess_fn(image) 189 | return image, tf.one_hot(label, NUM_CLASSES) 190 | 191 | 192 | def save_model(model, output): 193 | 194 | # create a TensorFlow SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 195 | tf.contrib.saved_model.save_keras_model(model, args.model_dir) 196 | logging.info("Model successfully saved at: {}".format(output)) 197 | return 198 | 199 | 200 | def main(args): 201 | 202 | mpi = False 203 | if 'sourcedir.tar.gz' in args.tensorboard_dir: 204 | tensorboard_dir = re.sub('source/sourcedir.tar.gz', 'model', args.tensorboard_dir) 205 | else: 206 | tensorboard_dir = args.tensorboard_dir 207 | logging.info("Writing TensorBoard logs to {}".format(tensorboard_dir)) 208 | if 'sagemaker_mpi_enabled' in args.fw_params: 209 | if args.fw_params['sagemaker_mpi_enabled']: 210 | import horovod.tensorflow.keras as hvd 211 | mpi = True 212 | # Horovod: initialize Horovod. 213 | hvd.init() 214 | 215 | # Horovod: pin GPU to be used to process local rank (one GPU per process) 216 | config = tf.ConfigProto() 217 | config.gpu_options.allow_growth = True 218 | config.gpu_options.visible_device_list = str(hvd.local_rank()) 219 | K.set_session(tf.Session(config=config)) 220 | else: 221 | hvd = None 222 | 223 | logging.info("Running with MPI={}".format(mpi)) 224 | logging.info("getting data") 225 | train_dataset = train_input_fn() 226 | eval_dataset = eval_input_fn() 227 | validation_dataset = validation_input_fn() 228 | 229 | logging.info("configuring model") 230 | model = keras_model_fn(args.learning_rate, args.weight_decay, args.optimizer, args.momentum, mpi, hvd) 231 | callbacks = [] 232 | if mpi: 233 | callbacks.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0)) 234 | callbacks.append(hvd.callbacks.MetricAverageCallback()) 235 | callbacks.append(hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1)) 236 | callbacks.append(tf.keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1)) 237 | if hvd.rank() == 0: 238 | callbacks.append(ModelCheckpoint(args.output_dir + '/checkpoint-{epoch}.h5')) 239 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 240 | else: 241 | callbacks.append(ModelCheckpoint(args.output_dir + '/checkpoint-{epoch}.h5')) 242 | callbacks.append(CustomTensorBoardCallback(log_dir=tensorboard_dir)) 243 | logging.info("Starting training") 244 | size = 1 245 | if mpi: 246 | size = hvd.size() 247 | model.fit(x=train_dataset[0], y=train_dataset[1], 248 | steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size) // size, 249 | epochs=args.epochs, validation_data=validation_dataset, 250 | validation_steps=(num_examples_per_epoch('validation') // args.batch_size) // size, callbacks=callbacks) 251 | 252 | score = model.evaluate(eval_dataset[0], eval_dataset[1], 
steps=num_examples_per_epoch('eval') // args.batch_size, 253 | verbose=0) 254 | 255 | logging.info('Test loss:{}'.format(score[0])) 256 | logging.info('Test accuracy:{}'.format(score[1])) 257 | 258 | # Horovod: Save model only on worker 0 (i.e. master) 259 | if mpi: 260 | if hvd.rank() == 0: 261 | return save_model(model, args.model_output_dir) 262 | else: 263 | return save_model(model, args.model_output_dir) 264 | 265 | 266 | def num_examples_per_epoch(subset='train'): 267 | if subset == 'train': 268 | return 40000 269 | elif subset == 'validation': 270 | return 10000 271 | elif subset == 'eval': 272 | return 10000 273 | else: 274 | raise ValueError('Invalid data subset "%s"' % subset) 275 | 276 | 277 | if __name__ == '__main__': 278 | 279 | parser = argparse.ArgumentParser() 280 | parser.add_argument( 281 | '--train', 282 | type=str, 283 | required=False, 284 | default=os.environ.get('SM_CHANNEL_TRAIN'), 285 | help='The directory where the CIFAR-10 input data is stored.') 286 | parser.add_argument( 287 | '--validation', 288 | type=str, 289 | required=False, 290 | default=os.environ.get('SM_CHANNEL_VALIDATION'), 291 | help='The directory where the CIFAR-10 input data is stored.') 292 | parser.add_argument( 293 | '--eval', 294 | type=str, 295 | required=False, 296 | default=os.environ.get('SM_CHANNEL_EVAL'), 297 | help='The directory where the CIFAR-10 input data is stored.') 298 | parser.add_argument( 299 | '--model_dir', 300 | type=str, 301 | required=True, 302 | help='The directory where the model will be stored.') 303 | parser.add_argument( 304 | '--model_output_dir', 305 | type=str, 306 | default=os.environ.get('SM_MODEL_DIR')) 307 | parser.add_argument( 308 | '--output-dir', 309 | type=str, 310 | default=os.environ.get('SM_OUTPUT_DIR')) 311 | parser.add_argument( 312 | '--tensorboard-dir', 313 | type=str, 314 | default=os.environ.get('SM_MODULE_DIR')) 315 | parser.add_argument( 316 | '--weight-decay', 317 | type=float, 318 | default=2e-4, 319 | help='Weight decay for convolutions.') 320 | parser.add_argument( 321 | '--learning-rate', 322 | type=float, 323 | default=0.001, 324 | help="""\ 325 | This is the inital learning rate value. The learning rate will decrease 326 | during training. 
For more details check the model_fn implementation in 327 | this file.\ 328 | """) 329 | parser.add_argument( 330 | '--epochs', 331 | type=int, 332 | default=10, 333 | help='The number of steps to use for training.') 334 | parser.add_argument( 335 | '--batch-size', 336 | type=int, 337 | default=128, 338 | help='Batch size for training.') 339 | parser.add_argument( 340 | '--data-config', 341 | type=json.loads, 342 | default=os.environ.get('SM_INPUT_DATA_CONFIG') 343 | ) 344 | parser.add_argument( 345 | '--fw-params', 346 | type=json.loads, 347 | default=os.environ.get('SM_FRAMEWORK_PARAMS') 348 | ) 349 | parser.add_argument( 350 | '--optimizer', 351 | type=str, 352 | default='adam' 353 | ) 354 | parser.add_argument( 355 | '--momentum', 356 | type=float, 357 | default='0.9' 358 | ) 359 | args = parser.parse_args() 360 | 361 | main(args) 362 | -------------------------------------------------------------------------------- /tf-sentiment-script-mode/sentiment.py: -------------------------------------------------------------------------------- 1 | import logging 2 | logging.getLogger("tensorflow").setLevel(logging.ERROR) 3 | import argparse 4 | import codecs 5 | import json 6 | import numpy as np 7 | import os 8 | import tensorflow as tf 9 | 10 | max_features = 20000 11 | maxlen = 400 12 | embedding_dims = 300 13 | filters = 256 14 | kernel_size = 3 15 | hidden_dims = 256 16 | 17 | def parse_args(): 18 | 19 | parser = argparse.ArgumentParser() 20 | 21 | # hyperparameters sent by the client are passed as command-line arguments to the script 22 | parser.add_argument('--epochs', type=int, default=1) 23 | parser.add_argument('--batch_size', type=int, default=64) 24 | parser.add_argument('--learning_rate', type=float, default=0.01) 25 | 26 | # data directories 27 | parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) 28 | parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) 29 | 30 | # model directory: we will use the default set by SageMaker, /opt/ml/model 31 | parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR')) 32 | 33 | return parser.parse_known_args() 34 | 35 | 36 | def save_history(path, history): 37 | 38 | history_for_json = {} 39 | # transform float values that aren't json-serializable 40 | for key in list(history.history.keys()): 41 | if type(history.history[key]) == np.ndarray: 42 | history_for_json[key] = history.history[key].tolist() 43 | elif type(history.history[key]) == list: 44 | if type(history.history[key][0]) == np.float32 or type(history.history[key][0]) == np.float64: 45 | history_for_json[key] = list(map(float, history.history[key])) 46 | 47 | with codecs.open(path, 'w', encoding='utf-8') as f: 48 | json.dump(history_for_json, f, separators=(',', ':'), sort_keys=True, indent=4) 49 | 50 | 51 | def get_train_data(train_dir): 52 | 53 | x_train = np.load(os.path.join(train_dir, 'x_train.npy')) 54 | y_train = np.load(os.path.join(train_dir, 'y_train.npy')) 55 | print('x train', x_train.shape,'y train', y_train.shape) 56 | 57 | return x_train, y_train 58 | 59 | 60 | def get_test_data(test_dir): 61 | 62 | x_test = np.load(os.path.join(test_dir, 'x_test.npy')) 63 | y_test = np.load(os.path.join(test_dir, 'y_test.npy')) 64 | print('x test', x_test.shape,'y test', y_test.shape) 65 | 66 | return x_test, y_test 67 | 68 | 69 | def get_model(learning_rate): 70 | 71 | mirrored_strategy = tf.distribute.MirroredStrategy() 72 | 73 | with mirrored_strategy.scope(): 74 | embedding_layer = 
tf.keras.layers.Embedding(max_features, 75 | embedding_dims, 76 | input_length=maxlen) 77 | 78 | sequence_input = tf.keras.Input(shape=(maxlen,), dtype='int32') 79 | embedded_sequences = embedding_layer(sequence_input) 80 | x = tf.keras.layers.Dropout(0.2)(embedded_sequences) 81 | x = tf.keras.layers.Conv1D(filters, kernel_size, padding='valid', activation='relu', strides=1)(x) 82 | x = tf.keras.layers.MaxPooling1D()(x) 83 | x = tf.keras.layers.GlobalMaxPooling1D()(x) 84 | x = tf.keras.layers.Dense(hidden_dims, activation='relu')(x) 85 | x = tf.keras.layers.Dropout(0.2)(x) 86 | preds = tf.keras.layers.Dense(1, activation='sigmoid')(x) 87 | 88 | model = tf.keras.Model(sequence_input, preds) 89 | optimizer = tf.keras.optimizers.Adam(learning_rate) 90 | model.compile(loss='binary_crossentropy', 91 | optimizer=optimizer, 92 | metrics=['accuracy']) 93 | 94 | return model 95 | 96 | 97 | if __name__ == "__main__": 98 | 99 | args, _ = parse_args() 100 | 101 | x_train, y_train = get_train_data(args.train) 102 | x_test, y_test = get_test_data(args.test) 103 | 104 | model = get_model(args.learning_rate) 105 | 106 | history = model.fit(x_train, y_train, 107 | batch_size=args.batch_size, 108 | epochs=args.epochs, 109 | validation_data=(x_test, y_test)) 110 | 111 | save_history(args.model_dir + "/history.p", history) 112 | 113 | # create a TensorFlow SavedModel for deployment to a SageMaker endpoint with TensorFlow Serving 114 | model.save(args.model_dir + '/1') 115 | 116 | 117 | --------------------------------------------------------------------------------
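The scripts above are written for SageMaker Script Mode: a framework estimator from the SageMaker Python SDK packages the entry point, passes hyperparameters as command-line arguments, and exposes the input channels through the SM_CHANNEL_* environment variables read in parse_args(). The following is a rough sketch of how an entry point such as tf-sentiment-script-mode/sentiment.py is typically launched. The bucket paths, IAM role, instance type, framework and Python versions, and hyperparameter values below are illustrative assumptions rather than values taken from this repository; the notebooks next to each script hold the configurations actually used.

import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()  # assumes a SageMaker notebook or Studio environment

estimator = TensorFlow(
    entry_point='sentiment.py',
    source_dir='tf-sentiment-script-mode',
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',      # assumed GPU instance, since the script uses MirroredStrategy
    framework_version='2.1',            # assumed TensorFlow 2.x container version
    py_version='py3',
    hyperparameters={'epochs': 3, 'batch_size': 128, 'learning_rate': 0.01},
)

# Channel names map to SM_CHANNEL_TRAIN and SM_CHANNEL_TEST inside the script.
estimator.fit({'train': 's3://my-example-bucket/sentiment/train',
               'test': 's3://my-example-bucket/sentiment/test'})

After training completes, estimator.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge') would host the SavedModel written to args.model_dir + '/1' behind a TensorFlow Serving endpoint, in line with the comment at the end of sentiment.py.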