├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── aws_summit_2022_inf1_bert_compile_and_deploy.ipynb ├── aws_summit_2022_inf1_bert_compile_and_deploy_walkthrough.ipynb ├── code ├── inference.py └── requirements.txt └── images ├── accessevent.png ├── accesssm.png ├── accessstudio.png ├── email.png ├── kernel.png ├── kernelselect.png ├── launch.png ├── login.png ├── menuleft.png ├── openconsole.png ├── opennb.png ├── otp.png ├── signin.png ├── teamdashboard.png └── terminal.png /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. 
As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 62 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AWS Summit San Francisco 2022 2 | ## CMP314 Optimizing NLP models with Amazon EC2 Inf1 instances in Amazon Sagemaker 3 | 4 | Welcome to the AWS Summit San Francisco 2022 Inferentia Workshop! 5 | During this workshop, we will create two endpoints with one HuggingFace model each. We will use them for the task of paraphrase detection which is an NLP classification problem. These two endpoints will have the following configurations: a) CPU-based endpoint, where we will be deploying the model with no changes; and b) Inf1 instance based endpoint, where we will prepare and compile the model using SageMaker Neo before deploying. Finally, we will perform a latency and throughput performance comparison of both endpoints. 6 | 7 | ## Event engine access during the event 8 | #### Follow these instructions to connect to the AWS Console during the event 9 | 10 | 1. Access event engine: Open a browser and type the URL shared during the workshop. 11 | 12 | 2. Click on “Agree” (make sure the event passcode is displayed) 13 | 14 | 15 | 3. 
Click on “Email One-Time Password (OTP)” 16 | 17 | 18 | 4. Complete with your email and click “Send Passcode” 19 | 20 | 21 | 5. Retrieve the OTP passcode from your email. Copy and paste it in the “Passcode 9 digit code” field. Press “Sign In” 22 | 23 | 24 | 6. Once logged in, you will see the "team Dashboard". Click on “AWS Console” 25 | 26 | 27 | 7. Then click on “Open AWS Console" 28 | 29 | 30 | 8. Inside the console look for the search box and Type “Sagemaker”, then click on the "Amazon Sagemaker" Service. If you prefer you can navigate directly to the Sagemaker console. 31 | 32 | 33 | 9. Once you see the Sagemaker dashboard click on “Studio” on the “Sagemaker Domain” menu on the left. 34 | 35 | 36 | 10. Click on “Launch App” button and then on “Studio” 37 | 38 | 39 | 11. Once inside Sagemaker Studio”, go to “File/New/Terminal” to open a terminal: 40 | 41 | 42 | 12. Type the following command to clone this repo: 43 | `git clone https://github.com/aws-samples/aws-inferentia-huggingface-workshop.git` 44 | 45 | 13. Once the repo is cloned, open the Jupiter notebook named [aws_summit_2022_inf1_bert_compile_and_deploy.ipynb](https://github.com/aws-samples/aws-inferentia-huggingface-workshop/blob/main/aws_summit_2022_inf1_bert_compile_and_deploy.ipynb). To do this, find the file browser on the left, and click on “aws-inferentia-huggingface-workshop” then double click on the file name **aws_summit_2022_inf1_bert_compile_and_deploy.ipynb**. 46 | 47 | 48 | 14. You will see the following pop up. 49 | 50 | 51 | 15. Make sure you select the Python 3 (Pytorch 1.8 Python 3.6 CPU Optimized) Kernel when prompted 52 | 53 | 54 | # Now start the workshop and have fun ! 55 | 56 | 57 | 58 | 59 | ## Licence 60 | 61 | This code is released under the MIT-0 license 62 | Please refer to the license for applicable terms. 63 | -------------------------------------------------------------------------------- /aws_summit_2022_inf1_bert_compile_and_deploy.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# AWS Summit San Francisco 2022\n", 8 | "## Using AWS Inferentia to optimize HuggingFace model inference" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "Welcome to the AWS Summit San Francisco 2022 Inferentia Workshop! \n" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "# Table of contents\n", 23 | "1. [Introduction](#introduction)\n", 24 | " 1. [Setting up the environment](#setenv)\n", 25 | "3. [Get model from HuggingFace Model Hub](#getmodel)\n", 26 | " 1. [Get the Tokenizer](#gettoken)\n", 27 | " 2. [Download models and prepare them for inference](#trace)\n", 28 | "4. [Deploy default model to a CPU-based endpoint](#deploycpu)\n", 29 | " 1. [Perform a test CPU based inference](#testcpu)\n", 30 | "5. [Compile and deploy the model on an Inferentia instance](#compiledeploy)\n", 31 | " 1. [Review changes to the inference code](#reviewchanges)\n", 32 | " 2. [Create and compile Pytorch model for the inf1 instance](#pytorchmodel)\n", 33 | " 3. [Deploy compiled model into the inf1 instance](#deployinf1)\n", 34 | " 4. [Perform a test inf1 based inference](#testinf1)\n", 35 | "6. [Benchmark and comparison](#benchmark)\n", 36 | " 1. [Benchmark CPU based endpoint](#benchcpu)\n", 37 | " 2. [Benchmark Inferentia based endpoint](#benchinf1)\n", 38 | "7. 
[Comparison and conclusions](#conclusions)\n", 39 | "8. [Cleanup](#cleanup)" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "---" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "# 1. Introduction " 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "During this workshop, we will create two endpoints with one HuggingFace model each. We will use them for the task of paraphrase detection which is an NLP classification problem. \n", 61 | "These two endpoints will have the following configurations: a) CPU-based endpoint, where we will be deploying the model with no changes; and b) Inf1 instance based endpoint, where we will prepare and compile the model using SageMaker Neo before deploying. \n", 62 | "Finally, we will perform a latency and throughput performance comparison of both endpoints. " 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "[AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) is Amazon's first ML chips designed to accelerate deep learning workloads and is part of a long-term strategy to deliver on this vision. AWS Inferentia is designed to provide high performance inference in the cloud, to drive down the total cost of inference, and to make it easy for developers to integrate machine learning into their business applications. AWS Inferentia chips deliver up 2.3x higher throughput and up to 70% lower cost per inference than comparable current generation GPU-based Amazon EC2 instances, as we will confirm in the example notebook.\n", 70 | "\n", 71 | "[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is a software development kit (SDK) for running machine learning inference using AWS Inferentia chips. It consists of a compiler, run-time, and profiling tools that enable developers to run high-performance and low latency inference using AWS Inferentia-based Amazon EC2 Inf1 instances. Using Neuron, you can bring your models that have been trained on any popular framework (PyTorch, TensorFlow, MXNet), and run them optimally on Inferentia. There is excellent support for Vision and NLP models especially, and on top of that we have released great features to help you make the most efficient use of the hardware, such as [dynamic batching](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/appnotes/perf/torch-neuron-dataparallel-app-note.html#dynamic-batching-description) or [Data Parallel](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/api-torch-neuron-dataparallel-api.html) inferencing.\n", 72 | "\n", 73 | "[SageMaker Neo](https://aws.amazon.com/sagemaker/neo/) saves you the effort of DIY model compilation, extending familiar SageMaker SDK API's to enable easy compilation for a [wide range](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_OutputConfig.html#API_OutputConfig_Contents) of platforms. This includes CPU and GPU-based instances, but also Inf1 instances; in this case, SageMaker Neo uses the Neuron SDK to compile your model.\n" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "---" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "### Setting up the environment \n", 88 | "First, make sure you are using the Python 3 (Pytorch 1.8 Python 3.6 CPU Optimized) Kernel. 
And that you are working in the us-west-2 region unless instructed otherwise.\n", 89 | "\n", 90 | "Then, install ipywidgets library and restart the kernel to be able to use it." 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "%%capture\n", 100 | "import IPython\n", 101 | "import sys\n", 102 | "\n", 103 | "!{sys.executable} -m pip install ipywidgets\n", 104 | "IPython.Application.instance().kernel.do_shutdown(True) # has to restart kernel so changes are used" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "STOP! Restart the Kernel, comment the cell above and continue." 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "We will then install required Python packages. Also, we will create a default Amazon Sagemaker session, get the Amazon Sagemaker role and default Amazon S3 bucket." 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "%%capture\n", 128 | "!pip install -U transformers\n", 129 | "!pip install -U sagemaker\n", 130 | "!pip install -U torch" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "import sys\n", 140 | "import transformers\n", 141 | "import sagemaker\n", 142 | "import torch\n", 143 | "import boto3\n", 144 | "\n", 145 | "sagemaker_session = sagemaker.Session()\n", 146 | "role = sagemaker.get_execution_role()\n", 147 | "sess_bucket = sagemaker_session.default_bucket()" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "---" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "## 2. Get model from HuggingFace Model Hub " 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "For this workshop, we will use [Prompsit/paraphrase-bert-en](https://huggingface.co/Prompsit/paraphrase-bert-en) transformer model from HuggingFace Model Hub. It has been fine-tuned from a pretrained model called \"bert-base-uncased\". The model works comparing a pair of sentences, it determines the semantic similarity between them. If the two sentences convey the same meaning it is labelled as paraphrase, otherwise it is labeled as non-paraphrase. \n", 169 | "So it allows to evaluate paraphrases for a given phrase, answering the following question: Is \"phrase B\" a paraphrase of \"phrase A\"? and the resulting probabilities correspond to classes:\n", 170 | "\n", 171 | " 0: Not a paraphrase\n", 172 | " 1: It's a paraphrase\n", 173 | "\n", 174 | "This model doesn't expect to find punctuation marks or long pieces of text.\n" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "### Get the Tokenizer \n", 182 | "As a first step, we need to get the tokenizer. A tokenizer breaks a stream of text into tokens, and it is in charge of preparing the inputs for a model. We need it to create a sample input to interact with the model, and will get it from HuggingFace through the `transformers` library. It is important to set the `return_dict` parameter to `False` when instantiating the model. 
In `transformers` v4.x, this parameter is `True` by default and it enables the return of dict-like python objects containing the model outputs, instead of the standard tuples. Neuron compilation does not support dictionary-based model ouputs, and compilation would fail if we didn't explictly set it to `False`." 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "tokenizer = transformers.AutoTokenizer.from_pretrained(\"Prompsit/paraphrase-bert-en\")\n", 192 | "\n", 193 | "model = transformers.AutoModelForSequenceClassification.from_pretrained(\n", 194 | " \"Prompsit/paraphrase-bert-en\", return_dict=False\n", 195 | ")" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "### Download models and prepare them for inference \n", 203 | "We will download the model and create two files with different formats. The first one is the model itself with no changes. This one will be uploaded and used in the CPU based endpoint as it is. The second image is a traced Pytorch image of the model so we can compile it before deploying it to the inf1 instance.\n", 204 | "\n", 205 | "PyTorch models must be saved as a definition file (.pt or .pth) with input datatype of float32.\n", 206 | "To save the model, we will use torch.jit.trace followed by torch.save. This will save an object to a file ( a python pickle: pickle_module=pickle). \n", 207 | "\n", 208 | "Next, we will convert the saved model to a compressed tar file and upload it to an S3 bucket.\n", 209 | "As a final step, we will create a sample input to `jit.trace` of the model with PyTorch. We need this to have SageMaker Neo compile the model artifact.\n" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "from pathlib import Path\n", 219 | "\n", 220 | "# Create directory for model artifacts\n", 221 | "Path(\"normal_model/\").mkdir(exist_ok=True)\n", 222 | "Path(\"traced_model/\").mkdir(exist_ok=True)\n", 223 | "\n", 224 | "# Prepare sample input for jit model tracing\n", 225 | "seq_0 = \"Welcome to AWS Summit San Francisco 2022! Thank you for attending the workshop on using Huggingface transformers on Inferentia instances.\"\n", 226 | "seq_1 = seq_0\n", 227 | "max_length = 512\n", 228 | "\n", 229 | "tokenized_sequence_pair = tokenizer.encode_plus(\n", 230 | " seq_0, seq_1, max_length=max_length, padding=\"max_length\", truncation=True, return_tensors=\"pt\"\n", 231 | ")\n", 232 | "\n", 233 | "example = tokenized_sequence_pair[\"input_ids\"], tokenized_sequence_pair[\"attention_mask\"]\n", 234 | "\n", 235 | "traced_model = torch.jit.trace(model.eval(), example)\n", 236 | "\n", 237 | "model.save_pretrained('normal_model/')\n", 238 | "traced_model.save(\"traced_model/model.pth\") # The `.pth` extension is required." 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": {}, 245 | "outputs": [], 246 | "source": [ 247 | "!tar -czvf normal_model.tar.gz -C normal_model . && mv normal_model.tar.gz normal_model/\n", 248 | "!tar -czvf traced_model.tar.gz -C traced_model . 
&& mv traced_model.tar.gz traced_model/" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "We upload the traced model `tar.gz` file to Amazon S3, where the compilation job will download it from" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "normal_model_url = sagemaker_session.upload_data(\n", 265 | " path=\"normal_model/normal_model.tar.gz\",\n", 266 | " key_prefix=\"neuron-experiments/bert-seq-classification/normal-model\",\n", 267 | ")\n", 268 | "\n", 269 | "traced_model_url = sagemaker_session.upload_data(\n", 270 | " path=\"traced_model/traced_model.tar.gz\",\n", 271 | " key_prefix=\"neuron-experiments/bert-seq-classification/traced-model\",\n", 272 | ")" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "---" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "## 3. Deploy default model to a CPU-based endpoint " 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "As a first step, we create model from the Hugging Face Model Class.\n", 294 | "We will be passing the `normal_model_url` as the `model_data` parameter to the `HuggingFaceModel` API. \n", 295 | "Notice that we are passing `inference.py` as the entry point script; also, the packages defined in the requirements file within the `source_dir` will automatically be installed in the endpoint instance. In this case we will use the `transformers` library that is compatible Inferentia instances (v. 4.15.0)" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": null, 301 | "metadata": {}, 302 | "outputs": [], 303 | "source": [ 304 | "from sagemaker.huggingface import HuggingFaceModel\n", 305 | "from sagemaker.predictor import Predictor\n", 306 | "from datetime import datetime\n", 307 | "\n", 308 | "prefix = \"neuron-experiments/bert-seq-classification\"\n", 309 | "flavour = \"normal\"\n", 310 | "date_string = datetime.now().strftime(\"%Y%m-%d%H-%M%S\")\n", 311 | "\n", 312 | "normal_sm_model = HuggingFaceModel(\n", 313 | " model_data=normal_model_url,\n", 314 | " predictor_cls=Predictor,\n", 315 | " transformers_version=\"4.12.3\",\n", 316 | " pytorch_version='1.9.1',\n", 317 | " role=role,\n", 318 | " entry_point=\"inference.py\",\n", 319 | " source_dir=\"code\",\n", 320 | " py_version=\"py38\",\n", 321 | " name=f\"{flavour}-distilbert-{date_string}\",\n", 322 | " env={\"SAGEMAKER_CONTAINER_LOG_LEVEL\": \"10\"},\n", 323 | ")" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "Then, we create the endpoint and deploy the model for inference. This process will take about 4 minutes to complete. As you can see, one line of code will create a [real time endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) for real time inference that you can integrate to your applications. These endpoints are fully managed and support autoscaling. 
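For reference, autoscaling is configured separately through Application Auto Scaling rather than at deploy time. The snippet below is an optional sketch, not part of the workshop: it assumes the endpoint keeps the SDK's default `AllTraffic` variant name, uses a placeholder `endpoint_name`, and picks an illustrative target of 100 invocations per instance.

```python
import boto3

# Placeholder: use the endpoint name chosen in the deploy() call below.
endpoint_name = "<your-endpoint-name>"
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

autoscaling = boto3.client("application-autoscaling")

# Register the production variant as a scalable target (1-2 instances here).
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=2,
)

# Track invocations per instance and scale out when the target is exceeded.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```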
" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": null, 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [ 339 | "%%time\n", 340 | "from sagemaker.serializers import JSONSerializer\n", 341 | "from sagemaker.deserializers import JSONDeserializer\n", 342 | "\n", 343 | "hardware = \"c5\"\n", 344 | "\n", 345 | "normal_predictor = normal_sm_model.deploy(\n", 346 | " instance_type=\"ml.c5.xlarge\",\n", 347 | " initial_instance_count=1,\n", 348 | " endpoint_name=f\"paraphrase-bert-en-{hardware}-{date_string}\",\n", 349 | " serializer=JSONSerializer(),\n", 350 | " deserializer=JSONDeserializer(),\n", 351 | ")" 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | "### Perform a test inference on CPU" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "We will perform a quick test to see if the endpoint is responding as expected. We will send sample sequences." 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": null, 371 | "metadata": {}, 372 | "outputs": [], 373 | "source": [ 374 | "# Predict with model endpoint\n", 375 | "client = boto3.client('sagemaker')\n", 376 | "\n", 377 | "#let's make sure it is up und running first\n", 378 | "status = \"\"\n", 379 | "while status != 'InService':\n", 380 | " endpoint_response = client.describe_endpoint(EndpointName=f\"paraphrase-bert-en-{hardware}-{date_string}\")\n", 381 | " status = endpoint_response['EndpointStatus']\n", 382 | "\n", 383 | "\n", 384 | "# Send a payload to the endpoint and recieve the inference\n", 385 | "payload = seq_0, seq_1\n", 386 | "normal_predictor.predict(payload)" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | "source": [ 393 | "---" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "## 4. Compile and deploy the model on an Inferentia instance " 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "In this section we will cover the compilation and deployment of the model into the inf1 instance. We will also review the changes in the inference code." 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "### Review inference code " 415 | ] 416 | }, 417 | { 418 | "cell_type": "markdown", 419 | "metadata": {}, 420 | "source": [ 421 | "If you open `inference.py` you will see a few functions: \n", 422 | "a) `model_fn` which receives the model directory and is responsible for loading and returning the model.\n", 423 | "b) `input_fn` and `output_fn` functions that are in charge of pre-processing/checking content types of input and output to the endpoint.\n", 424 | "And c) `predict_fn`, receives the outputs of `model_fn` and `input_fn` and defines how the model will run inference (it recieves the loaded model and the deserialized/pre-processed input data).\n", 425 | "All of this code runs inside the endpoint once it is created." 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": null, 431 | "metadata": {}, 432 | "outputs": [], 433 | "source": [ 434 | "!pygmentize code/inference.py" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": {}, 440 | "source": [ 441 | "In this case, notice that we will load the corresponding model depending on where the function is deployed. 
`model_fn` will return a tuple containing both the model and its corresponding tokenizer. Both the model and the input data will be sent `.to(device)`, which can be a CPU or GPU.\n", 442 | "\n", 443 | "Also, notice the `predict_fn`. In this function we recieve the string for inference, convert it to the format the model accepts, ask the model for the inference, recieve the inference and format it in clear text as a return string. In real life you might not need to do this interpretation since your application might be fine receiving the predicted class and use it directly." 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "### Create and compile Pytorch model for the inf1 instance " 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "We will now create a new `Huggingface` model that will use the `inference.py` file described above as its entry point script. " 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [ 466 | "from sagemaker.huggingface import HuggingFaceModel\n", 467 | "from sagemaker.predictor import Predictor\n", 468 | "from datetime import datetime\n", 469 | "from sagemaker.serializers import JSONSerializer\n", 470 | "from sagemaker.deserializers import JSONDeserializer\n", 471 | "\n", 472 | "date_string = datetime.now().strftime(\"%Y%m-%d%H-%M%S\")\n", 473 | "hardware = \"inf1\"\n", 474 | "compilation_job_name = f\"paraphrase-bert-en-{hardware}-\" + date_string\n", 475 | "output_model_path = f\"s3://{sess_bucket}/{prefix}/neo-compilations19/{hardware}-model\"\n", 476 | "\n", 477 | "compiled_inf1_model = HuggingFaceModel(\n", 478 | " model_data=traced_model_url,\n", 479 | " predictor_cls=Predictor,\n", 480 | " transformers_version=\"4.12.3\",\n", 481 | " pytorch_version='1.9.1',\n", 482 | " role=role,\n", 483 | " entry_point=\"inference.py\",\n", 484 | " source_dir=\"code\",\n", 485 | " py_version=\"py37\",\n", 486 | " name=f\"distilbert-{date_string}\",\n", 487 | " env={\"SAGEMAKER_CONTAINER_LOG_LEVEL\": \"10\"},\n", 488 | ")" 489 | ] 490 | }, 491 | { 492 | "cell_type": "markdown", 493 | "metadata": {}, 494 | "source": [ 495 | "We are ready to compile the model! 
Two additional notes:\n", 496 | "* HuggingFace models should be compiled to `dtype` `int64`\n", 497 | "* the format for `compiler_options` differs from the standard Python `dict` that you can use when compiling for \"normal\" instance types; for inferentia, you must provide a JSON string with CLI arguments, which correspond to the ones supported by the [Neuron Compiler](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-cc/command-line-reference.html) (read more about `compiler_options` [here](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_OutputConfig.html#API_OutputConfig_Contents))\n" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "#### Model compilation" 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "metadata": {}, 510 | "source": [ 511 | "Let's compile the model (this will take around 10 minutes to complete):" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": null, 517 | "metadata": {}, 518 | "outputs": [], 519 | "source": [ 520 | "%%time\n", 521 | "import json\n", 522 | "\n", 523 | "compiled_inf1_model = compiled_inf1_model.compile(\n", 524 | " target_instance_family=f\"ml_{hardware}\",\n", 525 | " input_shape={\"input_ids\": [1, 512], \"attention_mask\": [1, 512]},\n", 526 | " job_name=compilation_job_name,\n", 527 | " role=role,\n", 528 | " framework=\"pytorch\",\n", 529 | " framework_version=\"1.9.1\",\n", 530 | " output_path=output_model_path,\n", 531 | " compiler_options=json.dumps(\"--dtype int64\"),\n", 532 | " compile_max_run=900,\n", 533 | ")" 534 | ] 535 | }, 536 | { 537 | "cell_type": "markdown", 538 | "metadata": {}, 539 | "source": [ 540 | "#### Compiler logs and artifacts\n", 541 | "Open a new browser tab and navigate to the Sagemaker Console. Under the Images menu on the left you will find the menu Inference and inside \"Compilation Jobs\". Here is where you will find the job that was executed in the previous cell. Look for the job name to get its details. If you scroll down you will find a section called \"Monitor\" you can access the compiler logs hosted in Cloudwatch. Look for the successful completion of the job in a line similar to the following:" 542 | ] 543 | }, 544 | { 545 | "cell_type": "raw", 546 | "metadata": {}, 547 | "source": [ 548 | "localhost compiler-container-Primary[4736]: Compiler status PASS\n", 549 | "and localhost compiler-container-Primary[4736]: INFO:Neuron:Neuron successfully compiled 1 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 100.0%" 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "metadata": {}, 555 | "source": [ 556 | "Also, in the Output section, you will find a link to the S3 compiled model artifact. Click on it so see where it was stored." 557 | ] 558 | }, 559 | { 560 | "cell_type": "code", 561 | "execution_count": null, 562 | "metadata": {}, 563 | "outputs": [], 564 | "source": [ 565 | "print(\"Compilation job name: {} \\nOutput model path in S3: {}\".format(compilation_job_name, output_model_path))" 566 | ] 567 | }, 568 | { 569 | "cell_type": "markdown", 570 | "metadata": {}, 571 | "source": [ 572 | "### Deploy compiled model into the inf1 instance " 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "After successful compilation, we deploy the new model to an inf1.xlarge instance based [real time endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html). 
As you can see, the one line of code procedure is similar to the one used to create the CPU-based endpoint. " 580 | ] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "execution_count": null, 585 | "metadata": {}, 586 | "outputs": [], 587 | "source": [ 588 | "%%time\n", 589 | "\n", 590 | "compiled_inf1_predictor = compiled_inf1_model.deploy(\n", 591 | " instance_type=\"ml.inf1.xlarge\",\n", 592 | " initial_instance_count=1,\n", 593 | " endpoint_name=f\"paraphrase-bert-en-{hardware}-{date_string}\",\n", 594 | " serializer=JSONSerializer(),\n", 595 | " deserializer=JSONDeserializer(),\n", 596 | " wait=False\n", 597 | ")" 598 | ] 599 | }, 600 | { 601 | "cell_type": "markdown", 602 | "metadata": {}, 603 | "source": [ 604 | "### Perform a test inference " 605 | ] 606 | }, 607 | { 608 | "cell_type": "markdown", 609 | "metadata": {}, 610 | "source": [ 611 | "As a final test, we first make sure the endpoint is up and running in an `InService` state, and then perform a simple inference by sending two sequences of text and waiting for the response." 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "metadata": {}, 618 | "outputs": [], 619 | "source": [ 620 | "# Predict with model endpoint\n", 621 | "client = boto3.client('sagemaker')\n", 622 | "\n", 623 | "# let's make sure it is up and running first\n", 624 | "status = \"\"\n", 625 | "while status != 'InService':\n", 626 | " endpoint_response = client.describe_endpoint(EndpointName=f\"paraphrase-bert-en-{hardware}-{date_string}\")\n", 627 | " status = endpoint_response['EndpointStatus']\n", 628 | "\n", 629 | "\n", 630 | "# Send a payload to the endpoint and receive the inference\n", 631 | "payload = seq_0, seq_1\n", 632 | "compiled_inf1_predictor.predict(payload)" 633 | ] 634 | }, 635 | { 636 | "cell_type": "markdown", 637 | "metadata": {}, 638 | "source": [ 639 | "---" 640 | ] 641 | }, 642 | { 643 | "cell_type": "markdown", 644 | "metadata": {}, 645 | "source": [ 646 | "## 5. Benchmark and comparison " 647 | ] 648 | }, 649 | { 650 | "cell_type": "markdown", 651 | "metadata": {}, 652 | "source": [ 653 | "Now that we have both endpoints online, we will perform a benchmark using Python's `threading` module. In each benchmark, we start 5 threads that each make a fixed number of requests to the model endpoint (100 against the CPU endpoint, 300 against the Inferentia endpoint). We measure the inference latency for each request, and we also measure the total time to finish the task, so that we can get an estimate of the request throughput/second."
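The throughput figure is simply total requests divided by the wall-clock time measured around the `join` calls, i.e. TPS ≈ (num_threads × num_preds) / total_time. As a worked example (the exact numbers will vary per run), if the 5 threads finish their combined 500 CPU-endpoint requests in roughly 110 seconds, the estimate is 500 / 110 ≈ 4.5 requests per second, which is in line with the CPU result reported below.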
654 | ] 655 | }, 656 | { 657 | "cell_type": "markdown", 658 | "metadata": {}, 659 | "source": [ 660 | "### Benchmark CPU based endpoint " 661 | ] 662 | }, 663 | { 664 | "cell_type": "code", 665 | "execution_count": null, 666 | "metadata": {}, 667 | "outputs": [], 668 | "source": [ 669 | "%%time\n", 670 | "# Run the benchmark \n", 671 | "\n", 672 | "import threading\n", 673 | "import time\n", 674 | "\n", 675 | "num_preds = 100\n", 676 | "num_threads = 5\n", 677 | "\n", 678 | "times = []\n", 679 | "\n", 680 | "\n", 681 | "def predict():\n", 682 | " thread_id = threading.get_ident()\n", 683 | " print(f\"Thread {thread_id} started\")\n", 684 | "\n", 685 | " for i in range(num_preds):\n", 686 | " tick = time.time()\n", 687 | " response = normal_predictor.predict(payload)\n", 688 | " tock = time.time()\n", 689 | " times.append((thread_id, tock - tick))\n", 690 | "\n", 691 | "\n", 692 | "threads = []\n", 693 | "[threads.append(threading.Thread(target=predict, daemon=False)) for i in range(num_threads)]\n", 694 | "[t.start() for t in threads]\n", 695 | "\n", 696 | "# Wait for threads, get an estimate of total time\n", 697 | "start = time.time()\n", 698 | "[t.join() for t in threads]\n", 699 | "end = time.time() - start" 700 | ] 701 | }, 702 | { 703 | "cell_type": "code", 704 | "execution_count": null, 705 | "metadata": {}, 706 | "outputs": [], 707 | "source": [ 708 | "# Display results \n", 709 | "from matplotlib.pyplot import hist, title, show, xlim\n", 710 | "import numpy as np\n", 711 | "\n", 712 | "TPS_CPU = (num_preds * num_threads) / end\n", 713 | "\n", 714 | "t_CPU = [duration for thread__id, duration in times]\n", 715 | "latency_percentiles = np.percentile(t_CPU, q=[50, 90, 95, 99])\n", 716 | "latency_CPU = latency_percentiles[2]*1000\n", 717 | "\n", 718 | "hist(t_CPU, bins=100)\n", 719 | "title(\"Request latency histogram on CPU\")\n", 720 | "show()\n", 721 | "\n", 722 | "print(\"==== Default HuggingFace model on CPU benchmark ====\\n\")\n", 723 | "print(f\"95 % of requests take less than {latency_CPU} ms\")\n", 724 | "print(f\"Rough request throughput/second is {TPS_CPU}\")" 725 | ] 726 | }, 727 | { 728 | "cell_type": "markdown", 729 | "metadata": {}, 730 | "source": [ 731 | "We can see that request latency is in the 1-1.2 second range, and throughput is ~4.5 TPS." 
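If you want more than the single p95 figure, the percentiles computed above can be printed directly. This optional snippet reuses `t_CPU` and the `np` import from the previous cell:

```python
# Optional: report p50/p90/p95/p99 latency instead of only p95.
for q, v in zip([50, 90, 95, 99], np.percentile(t_CPU, q=[50, 90, 95, 99])):
    print(f"p{q} latency: {v * 1000:.1f} ms")
```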
732 | ] 733 | }, 734 | { 735 | "cell_type": "markdown", 736 | "metadata": {}, 737 | "source": [ 738 | "### Benchmark Inferentia based endpoint " 739 | ] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "execution_count": null, 744 | "metadata": {}, 745 | "outputs": [], 746 | "source": [ 747 | "%%time\n", 748 | "# Run benchmark \n", 749 | "\n", 750 | "import threading\n", 751 | "import time\n", 752 | "\n", 753 | "\n", 754 | "num_preds = 300\n", 755 | "num_threads = 5\n", 756 | "\n", 757 | "times = []\n", 758 | "\n", 759 | "\n", 760 | "def predict():\n", 761 | " thread_id = threading.get_ident()\n", 762 | " print(f\"Thread {thread_id} started\")\n", 763 | "\n", 764 | " for i in range(num_preds):\n", 765 | " tick = time.time()\n", 766 | " response = compiled_inf1_predictor.predict(payload)\n", 767 | " tock = time.time()\n", 768 | " times.append((thread_id, tock - tick))\n", 769 | "\n", 770 | "\n", 771 | "threads = []\n", 772 | "[threads.append(threading.Thread(target=predict, daemon=False)) for i in range(num_threads)]\n", 773 | "[t.start() for t in threads]\n", 774 | "\n", 775 | "# Make a rough estimate of total time, wait for threads\n", 776 | "start = time.time()\n", 777 | "[t.join() for t in threads]\n", 778 | "end = time.time() - start" 779 | ] 780 | }, 781 | { 782 | "cell_type": "code", 783 | "execution_count": null, 784 | "metadata": {}, 785 | "outputs": [], 786 | "source": [ 787 | "# Display results \n", 788 | "from matplotlib.pyplot import hist, title, show, xlim\n", 789 | "import numpy as np\n", 790 | "\n", 791 | "TPS_inf1 = (num_preds * num_threads) / end\n", 792 | "\n", 793 | "t_inf1 = [duration for thread_id, duration in times]\n", 794 | "latency_percentiles = np.percentile(t_inf1, q=[50, 90, 95, 99])\n", 795 | "latency_inf1 = latency_percentiles[2]*1000\n", 796 | "\n", 797 | "hist(t_inf1, bins=100)\n", 798 | "title(\"Request latency histogram on Inferentia\")\n", 799 | "show()\n", 800 | "\n", 801 | "print(\"==== Default HuggingFace model on inf1 benchmark ====\\n\")\n", 802 | "print(f\"95 % of requests take less than {latency_inf1} ms\")\n", 803 | "print(f\"Rough request throughput/second is {TPS_inf1}\")\n", 804 | "\n", 805 | "\n" 806 | ] 807 | }, 808 | { 809 | "cell_type": "markdown", 810 | "metadata": {}, 811 | "source": [ 812 | "We can see that request latency is in the 0.02-0.05 second range (roughly 20-50 ms), and throughput is ~157 TPS." 813 | ] 814 | }, 815 | { 816 | "cell_type": "markdown", 817 | "metadata": {}, 818 | "source": [ 819 | "---" 820 | ] 821 | }, 822 | { 823 | "cell_type": "markdown", 824 | "metadata": {}, 825 | "source": [ 826 | "# 6. Conclusion " 827 | ] 828 | }, 829 | { 830 | "cell_type": "code", 831 | "execution_count": null, 832 | "metadata": {}, 833 | "outputs": [], 834 | "source": [ 835 | "print(\"Using inf1 instances, latency dropped to {:.2f} ms from {:.2f} ms on the CPU endpoint.\".format(latency_inf1, latency_CPU)) \n", 836 | "print(\"Also, the average throughput increased to {:.2f} TPS from {:.2f} TPS on the CPU.\".format(TPS_inf1, TPS_CPU))" 837 | ] 838 | }, 839 | { 840 | "cell_type": "markdown", 841 | "metadata": {}, 842 | "source": [ 843 | "The performance increase obtained from inf1 instances, paired with the cost reduction and the use of familiar SageMaker SDK APIs, delivers these benefits with little development effort and a gentle learning curve. 
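To make the cost side of this comparison concrete, you can combine the measured throughput with the hourly instance price. The sketch below is illustrative only: the prices are placeholders (check the SageMaker pricing page for your region and instance types), and `TPS_CPU` / `TPS_inf1` come from the benchmark cells above.

```python
# Placeholder on-demand prices in USD per hour -- replace with the current
# SageMaker prices for ml.c5.xlarge and ml.inf1.xlarge in your region.
price_per_hour_cpu = 0.20
price_per_hour_inf1 = 0.30

# Cost of serving one million requests at the measured sustained throughput.
cost_1m_cpu = price_per_hour_cpu / (TPS_CPU * 3600) * 1_000_000
cost_1m_inf1 = price_per_hour_inf1 / (TPS_inf1 * 3600) * 1_000_000
print(f"~${cost_1m_cpu:.2f} per 1M inferences on CPU vs ~${cost_1m_inf1:.2f} on inf1")
```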
" 844 | ] 845 | }, 846 | { 847 | "cell_type": "markdown", 848 | "metadata": {}, 849 | "source": [ 850 | "* To learn more about how to deploy Hugging Face modes through Sagemaker on to Inf1, please watch their latest [Webinar](https://www.youtube.com/watch?v=3fulTyMXhWQ), and read their latest [blog post](https://huggingface.co/blog/bert-inferentia-sagemaker). \n", 851 | "* For more information about Inferentia, please see the AWS EC2 Inf1 [website](https://aws.amazon.com/ec2/instance-types/inf1/) or check out other Tutorials available online [here] (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-intro/tutorials.html).\n", 852 | "* You can learn more about Inferentia performance on the [Neuron Inference Performance](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/benchmark/index.html) pages\n" 853 | ] 854 | }, 855 | { 856 | "cell_type": "markdown", 857 | "metadata": {}, 858 | "source": [ 859 | "---" 860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "metadata": {}, 865 | "source": [ 866 | "# 7. Clean up \n", 867 | "Delete the models and release the endpoints." 868 | ] 869 | }, 870 | { 871 | "cell_type": "code", 872 | "execution_count": null, 873 | "metadata": {}, 874 | "outputs": [], 875 | "source": [ 876 | "normal_predictor.delete_model()\n", 877 | "normal_predictor.delete_endpoint()\n", 878 | "compiled_inf1_predictor.delete_model()\n", 879 | "compiled_inf1_predictor.delete_endpoint()" 880 | ] 881 | } 882 | ], 883 | "metadata": { 884 | "instance_type": "ml.m5.large", 885 | "kernelspec": { 886 | "display_name": "Python 3 (PyTorch 1.8 Python 3.6 CPU Optimized)", 887 | "language": "python", 888 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/1.8.1-cpu-py36" 889 | }, 890 | "language_info": { 891 | "codemirror_mode": { 892 | "name": "ipython", 893 | "version": 3 894 | }, 895 | "file_extension": ".py", 896 | "mimetype": "text/x-python", 897 | "name": "python", 898 | "nbconvert_exporter": "python", 899 | "pygments_lexer": "ipython3", 900 | "version": "3.6.13" 901 | } 902 | }, 903 | "nbformat": 4, 904 | "nbformat_minor": 5 905 | } 906 | -------------------------------------------------------------------------------- /aws_summit_2022_inf1_bert_compile_and_deploy_walkthrough.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# AWS Summit Atlanta 2022\n", 8 | "## Using AWS Inferentia to optimize HuggingFace model inference" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "Welcome to the AWS Summit Atlanta 2022 Inferentia Workshop Walkthrough ! \n" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "# Table of contents\n", 23 | "1. [Introduction](#introduction)\n", 24 | " 1. [Setting up the environment](#setenv)\n", 25 | "3. [Get model from HuggingFace Model Hub](#getmodel)\n", 26 | " 1. [Get the Tokenizer](#gettoken)\n", 27 | " 2. [Download models and prepare them for inference](#trace)\n", 28 | "4. [Deploy default model to a GPU-based endpoint](#deploycpu)\n", 29 | " 1. [Perform a test GPU based inference](#testcpu)\n", 30 | "5. [Compile and deploy the model on an Inferentia instance](#compiledeploy)\n", 31 | " 1. [Review changes to the inference code](#reviewchanges)\n", 32 | " 2. [Create and compile Pytorch model for the inf1 instance](#pytorchmodel)\n", 33 | " 3. 
[Deploy compiled model into the inf1 instance](#deployinf1)\n", 34 | " 4. [Perform a test inf1 based inference](#testinf1)\n", 35 | "6. [Benchmark and comparison](#benchmark)\n", 36 | " 1. [Benchmark GPU based endpoint](#benchcpu)\n", 37 | " 2. [Benchmark Inferentia based endpoint](#benchinf1)\n", 38 | "7. [Comparison and conclusions](#conclusions)\n", 39 | "8. [Cleanup](#cleanup)" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "---" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "# 1. Introduction " 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "During this workshop, we will create two endpoints with one HuggingFace model each. We will use them for the task of paraphrase detection which is an NLP classification problem. \n", 61 | "These two endpoints will have the following configurations: a) GPU-based endpoint, where we will be deploying the model with no changes; and b) Inf1 instance based endpoint, where we will prepare and compile the model using SageMaker Neo before deploying. \n", 62 | "Finally, we will perform a latency and throughput performance comparison of both endpoints. " 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "[AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) is Amazon's first ML chips designed to accelerate deep learning workloads and is part of a long-term strategy to deliver on this vision. AWS Inferentia is designed to provide high performance inference in the cloud, to drive down the total cost of inference, and to make it easy for developers to integrate machine learning into their business applications. AWS Inferentia chips deliver up 2.3x higher throughput and up to 70% lower cost per inference than comparable current generation GPU-based Amazon EC2 instances, as we will confirm in the example notebook.\n", 70 | "\n", 71 | "[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is a software development kit (SDK) for running machine learning inference using AWS Inferentia chips. It consists of a compiler, run-time, and profiling tools that enable developers to run high-performance and low latency inference using AWS Inferentia-based Amazon EC2 Inf1 instances. Using Neuron, you can bring your models that have been trained on any popular framework (PyTorch, TensorFlow, MXNet), and run them optimally on Inferentia. There is excellent support for Vision and NLP models especially, and on top of that we have released great features to help you make the most efficient use of the hardware, such as [dynamic batching](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/appnotes/perf/torch-neuron-dataparallel-app-note.html#dynamic-batching-description) or [Data Parallel](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/api-torch-neuron-dataparallel-api.html) inferencing.\n", 72 | "\n", 73 | "[SageMaker Neo](https://aws.amazon.com/sagemaker/neo/) saves you the effort of DIY model compilation, extending familiar SageMaker SDK API's to enable easy compilation for a [wide range](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_OutputConfig.html#API_OutputConfig_Contents) of platforms. 
This includes GPU and GPU-based instances, but also Inf1 instances; in this case, SageMaker Neo uses the Neuron SDK to compile your model.\n" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "---" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "### Setting up the environment \n", 88 | "First, make sure you are using the Python 3 (Pytorch 1.8 Python 3.6 GPU Optimized) Kernel. And that you are working in the us-west-2 region unless instructed otherwise.\n", 89 | "\n", 90 | "Then, install ipywidgets library and restart the kernel to be able to use it." 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 1, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "%%capture\n", 100 | "import IPython\n", 101 | "import sys\n", 102 | "\n", 103 | "#!{sys.executable} -m pip install ipywidgets\n", 104 | "#IPython.Application.instance().kernel.do_shutdown(True) # has to restart kernel so changes are used" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "STOP! Restart the Kernel, comment the cell above and continue." 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "We will then install required Python packages. Also, we will create a default Amazon Sagemaker session, get the Amazon Sagemaker role and default Amazon S3 bucket." 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 2, 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "%%capture\n", 128 | "!pip install -U transformers\n", 129 | "!pip install -U sagemaker\n", 130 | "!pip install -U torch" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 3, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "import sys\n", 140 | "import transformers\n", 141 | "import sagemaker\n", 142 | "import torch\n", 143 | "import boto3\n", 144 | "\n", 145 | "sagemaker_session = sagemaker.Session()\n", 146 | "role = sagemaker.get_execution_role()\n", 147 | "sess_bucket = sagemaker_session.default_bucket()" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "---" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "## 2. Get model from HuggingFace Model Hub " 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "For this workshop, we will use [Prompsit/paraphrase-bert-en](https://huggingface.co/Prompsit/paraphrase-bert-en) transformer model from HuggingFace Model Hub. It has been fine-tuned from a pretrained model called \"bert-base-uncased\". The model works comparing a pair of sentences, it determines the semantic similarity between them. If the two sentences convey the same meaning it is labelled as paraphrase, otherwise it is labeled as non-paraphrase. \n", 169 | "So it allows to evaluate paraphrases for a given phrase, answering the following question: Is \"phrase B\" a paraphrase of \"phrase A\"? and the resulting probabilities correspond to classes:\n", 170 | "\n", 171 | " 0: Not a paraphrase\n", 172 | " 1: It's a paraphrase\n", 173 | "\n", 174 | "This model doesn't expect to find punctuation marks or long pieces of text.\n" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "### Get the Tokenizer \n", 182 | "As a first step, we need to get the tokenizer. 
A tokenizer breaks a stream of text into tokens, and it is in charge of preparing the inputs for a model. We need it to create a sample input to interact with the model, and will get it from HuggingFace through the `transformers` library. It is important to set the `return_dict` parameter to `False` when instantiating the model. In `transformers` v4.x, this parameter is `True` by default and it enables the return of dict-like python objects containing the model outputs, instead of the standard tuples. Neuron compilation does not support dictionary-based model ouputs, and compilation would fail if we didn't explictly set it to `False`." 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 4, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "tokenizer = transformers.AutoTokenizer.from_pretrained(\"Prompsit/paraphrase-bert-en\")\n", 192 | "\n", 193 | "model = transformers.AutoModelForSequenceClassification.from_pretrained(\n", 194 | " \"Prompsit/paraphrase-bert-en\", return_dict=False\n", 195 | ")" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "### Download models and prepare them for inference \n", 203 | "We will download the model and create two files with different formats. The first one is the model itself with no changes. This one will be uploaded and used in the GPU based endpoint as it is. The second image is a traced Pytorch image of the model so we can compile it before deploying it to the inf1 instance.\n", 204 | "\n", 205 | "PyTorch models must be saved as a definition file (.pt or .pth) with input datatype of float32.\n", 206 | "To save the model, we will use torch.jit.trace followed by torch.save. This will save an object to a file ( a python pickle: pickle_module=pickle). \n", 207 | "\n", 208 | "Next, we will convert the saved model to a compressed tar file and upload it to an S3 bucket.\n", 209 | "As a final step, we will create a sample input to `jit.trace` of the model with PyTorch. We need this to have SageMaker Neo compile the model artifact.\n" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 5, 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "from pathlib import Path\n", 219 | "\n", 220 | "# Create directory for model artifacts\n", 221 | "Path(\"normal_model/\").mkdir(exist_ok=True)\n", 222 | "Path(\"traced_model/\").mkdir(exist_ok=True)\n", 223 | "\n", 224 | "# Prepare sample input for jit model tracing\n", 225 | "seq_0 = \"Welcome to AWS Summit San Francisco 2022! Thank you for attending the workshop on using Huggingface transformers on Inferentia instances.\"\n", 226 | "seq_1 = seq_0\n", 227 | "max_length = 512\n", 228 | "\n", 229 | "tokenized_sequence_pair = tokenizer.encode_plus(\n", 230 | " seq_0, seq_1, max_length=max_length, padding=\"max_length\", truncation=True, return_tensors=\"pt\"\n", 231 | ")\n", 232 | "\n", 233 | "example = tokenized_sequence_pair[\"input_ids\"], tokenized_sequence_pair[\"attention_mask\"]\n", 234 | "\n", 235 | "traced_model = torch.jit.trace(model.eval(), example)\n", 236 | "\n", 237 | "model.save_pretrained('normal_model/')\n", 238 | "traced_model.save(\"traced_model/model.pth\") # The `.pth` extension is required." 
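# Optional sanity check (an illustrative addition, not part of the original workshop code):
# run the sample input through both the original and the traced model and confirm the
# logits match before packaging the artifacts.
with torch.no_grad():
    original_logits = model(*example)[0]
    traced_logits = traced_model(*example)[0]
print("Traced model matches original:", torch.allclose(original_logits, traced_logits, atol=1e-5))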
239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 6, 244 | "metadata": {}, 245 | "outputs": [ 246 | { 247 | "name": "stdout", 248 | "output_type": "stream", 249 | "text": [ 250 | "./\n", 251 | "./pytorch_model.bin\n", 252 | "./config.json\n", 253 | "./normal_model.tar.gz\n", 254 | "./\n", 255 | "./model.pth\n", 256 | "./traced_model.tar.gz\n" 257 | ] 258 | } 259 | ], 260 | "source": [ 261 | "!tar -czvf normal_model.tar.gz -C normal_model . && mv normal_model.tar.gz normal_model/\n", 262 | "!tar -czvf traced_model.tar.gz -C traced_model . && mv traced_model.tar.gz traced_model/" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "We upload the traced model `tar.gz` file to Amazon S3, where the compilation job will download it from" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 7, 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "normal_model_url = sagemaker_session.upload_data(\n", 279 | " path=\"normal_model/normal_model.tar.gz\",\n", 280 | " key_prefix=\"neuron-experiments/bert-seq-classification/normal-model\",\n", 281 | ")\n", 282 | "\n", 283 | "traced_model_url = sagemaker_session.upload_data(\n", 284 | " path=\"traced_model/traced_model.tar.gz\",\n", 285 | " key_prefix=\"neuron-experiments/bert-seq-classification/traced-model\",\n", 286 | ")" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "---" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "## 3. Deploy default model to a GPU-based endpoint " 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "metadata": {}, 306 | "source": [ 307 | "As a first step, we create model from the Hugging Face Model Class.\n", 308 | "We will be passing the `normal_model_url` as the `model_data` parameter to the `HuggingFaceModel` API. \n", 309 | "Notice that we are passing `inference.py` as the entry point script; also, the packages defined in the requirements file within the `source_dir` will automatically be installed in the endpoint instance. In this case we will use the `transformers` library that is compatible Inferentia instances (v. 4.15.0)" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 8, 315 | "metadata": {}, 316 | "outputs": [], 317 | "source": [ 318 | "from sagemaker.huggingface import HuggingFaceModel\n", 319 | "from sagemaker.predictor import Predictor\n", 320 | "from datetime import datetime\n", 321 | "\n", 322 | "prefix = \"neuron-experiments/bert-seq-classification\"\n", 323 | "flavour = \"normal\"\n", 324 | "date_string = datetime.now().strftime(\"%Y%m-%d%H-%M%S\")\n", 325 | "\n", 326 | "normal_sm_model = HuggingFaceModel(\n", 327 | " model_data=normal_model_url,\n", 328 | " predictor_cls=Predictor,\n", 329 | " transformers_version=\"4.12.3\",\n", 330 | " pytorch_version='1.9.1',\n", 331 | " role=role,\n", 332 | " entry_point=\"inference.py\",\n", 333 | " source_dir=\"code\",\n", 334 | " py_version=\"py38\",\n", 335 | " name=f\"{flavour}-distilbert-{date_string}\",\n", 336 | " env={\"SAGEMAKER_CONTAINER_LOG_LEVEL\": \"10\"},\n", 337 | ")" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": {}, 343 | "source": [ 344 | "Then, we create the endpoint and deploy the model for inference. This process will take about 4 minutes to complete. 
As you can see, one line of code will create a [real time endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) for real time inference that you can integrate into your applications. These endpoints are fully managed and support autoscaling. " 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 9, 350 | "metadata": {}, 351 | "outputs": [ 352 | { 353 | "name": "stdout", 354 | "output_type": "stream", 355 | "text": [ 356 | "-----------!CPU times: user 46.7 s, sys: 9.44 s, total: 56.1 s\n", 357 | "Wall time: 6min 23s\n" 358 | ] 359 | } 360 | ], 361 | "source": [ 362 | "%%time\n", 363 | "from sagemaker.serializers import JSONSerializer\n", 364 | "from sagemaker.deserializers import JSONDeserializer\n", 365 | "\n", 366 | "hardware = \"g4dn\"\n", 367 | "\n", 368 | "normal_predictor = normal_sm_model.deploy(\n", 369 | " instance_type=\"ml.g4dn.xlarge\",\n", 370 | " initial_instance_count=1,\n", 371 | " endpoint_name=f\"paraphrase-bert-en-{hardware}-{date_string}\",\n", 372 | " serializer=JSONSerializer(),\n", 373 | " deserializer=JSONDeserializer(),\n", 374 | ")" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 380 | "source": [ 381 | "### Perform a test inference on GPU" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "We will perform a quick test to see if the endpoint is responding as expected. We will send sample sequences." 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 10, 394 | "metadata": {}, 395 | "outputs": [ 396 | { 397 | "data": { 398 | "text/plain": [ 399 | "['\"BERT predicts that \\\\\"Welcome to AWS Summit San Francisco 2022! Thank you for attending the workshop on using Huggingface transformers on Inferentia instances.\\\\\" and \\\\\"Welcome to AWS Summit San Francisco 2022! Thank you for attending the workshop on using Huggingface transformers on Inferentia instances.\\\\\" are paraphrase\"',\n", 400 | " 'application/json']" 401 | ] 402 | }, 403 | "execution_count": 10, 404 | "metadata": {}, 405 | "output_type": "execute_result" 406 | } 407 | ], 408 | "source": [ 409 | "# Predict with model endpoint\n", 410 | "client = boto3.client('sagemaker')\n", 411 | "\n", 412 | "# Let's make sure it is up and running first\n", 413 | "status = \"\"\n", 414 | "while status != 'InService':\n", 415 | " endpoint_response = client.describe_endpoint(EndpointName=f\"paraphrase-bert-en-{hardware}-{date_string}\")\n", 416 | " status = endpoint_response['EndpointStatus']\n", 417 | "\n", 418 | "\n", 419 | "# Send a payload to the endpoint and receive the inference\n", 420 | "payload = seq_0, seq_1\n", 421 | "normal_predictor.predict(payload)" 422 | ] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "---" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "## 4. Compile and deploy the model on an Inferentia instance " 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "metadata": {}, 441 | "source": [ 442 | "In this section we will cover the compilation and deployment of the model to the inf1 instance. We will also review the changes in the inference code."
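Since these are standard SageMaker real-time endpoints, an application does not need the SageMaker Python SDK to call them. The snippet below is an illustrative sketch (not part of the original notebook) that invokes the GPU endpoint we just tested with the low-level `sagemaker-runtime` client; the endpoint name is a placeholder and should be replaced with the name printed at deployment time.

```python
import json
import boto3

# Illustrative only: call the deployed real-time endpoint without the SageMaker SDK.
runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="paraphrase-bert-en-g4dn-<date-string>",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps([seq_0, seq_1]),  # same two sequences used in the test above
)
print(response["Body"].read().decode("utf-8"))
```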
443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "### Review inference code " 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "metadata": {}, 455 | "source": [ 456 | "If you open `inference.py` you will see a few functions: \n", 457 | "a) `model_fn` which receives the model directory and is responsible for loading and returning the model.\n", 458 | "b) `input_fn` and `output_fn` functions that are in charge of pre-processing/checking content types of input and output to the endpoint.\n", 459 | "And c) `predict_fn`, receives the outputs of `model_fn` and `input_fn` and defines how the model will run inference (it recieves the loaded model and the deserialized/pre-processed input data).\n", 460 | "All of this code runs inside the endpoint once it is created." 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 11, 466 | "metadata": {}, 467 | "outputs": [ 468 | { 469 | "name": "stdout", 470 | "output_type": "stream", 471 | "text": [ 472 | "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mos\u001b[39;49;00m\n", 473 | "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mjson\u001b[39;49;00m\n", 474 | "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtorch\u001b[39;49;00m\n", 475 | "\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mtransformers\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m AutoTokenizer, AutoModelForSequenceClassification, AutoConfig\n", 476 | "\n", 477 | "JSON_CONTENT_TYPE = \u001b[33m'\u001b[39;49;00m\u001b[33mapplication/json\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\n", 478 | "device = torch.device(\u001b[33m'\u001b[39;49;00m\u001b[33mcuda\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m \u001b[34mif\u001b[39;49;00m torch.cuda.is_available() \u001b[34melse\u001b[39;49;00m \u001b[33m'\u001b[39;49;00m\u001b[33mcpu\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\n", 479 | "\n", 480 | "\u001b[34mdef\u001b[39;49;00m \u001b[32mmodel_fn\u001b[39;49;00m(model_dir):\n", 481 | " tokenizer_init = AutoTokenizer.from_pretrained(\u001b[33m'\u001b[39;49;00m\u001b[33mPrompsit/paraphrase-bert-en\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\n", 482 | " compiled_model = os.path.exists(\u001b[33mf\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\u001b[33m{\u001b[39;49;00mmodel_dir\u001b[33m}\u001b[39;49;00m\u001b[33m/model_neuron.pt\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\n", 483 | " \u001b[34mif\u001b[39;49;00m compiled_model:\n", 484 | " \u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtorch_neuron\u001b[39;49;00m\n", 485 | " os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mNEURONCORE_GROUP_SIZES\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m] = \u001b[33m\"\u001b[39;49;00m\u001b[33m1\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", 486 | " model = torch.jit.load(\u001b[33mf\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\u001b[33m{\u001b[39;49;00mmodel_dir\u001b[33m}\u001b[39;49;00m\u001b[33m/model_neuron.pt\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\n", 487 | " \u001b[34melse\u001b[39;49;00m: \n", 488 | " model = AutoModelForSequenceClassification.from_pretrained(model_dir).to(device)\n", 489 | " \n", 490 | " \u001b[34mreturn\u001b[39;49;00m (model, tokenizer_init)\n", 491 | "\n", 492 | "\n", 493 | "\u001b[34mdef\u001b[39;49;00m \u001b[32minput_fn\u001b[39;49;00m(serialized_input_data, content_type=JSON_CONTENT_TYPE):\n", 494 | " \u001b[34mif\u001b[39;49;00m content_type == JSON_CONTENT_TYPE:\n", 495 | " input_data = json.loads(serialized_input_data)\n", 496 | " \u001b[34mreturn\u001b[39;49;00m 
input_data\n", 497 | " \u001b[34melse\u001b[39;49;00m:\n", 498 | " \u001b[34mraise\u001b[39;49;00m \u001b[36mException\u001b[39;49;00m(\u001b[33m'\u001b[39;49;00m\u001b[33mRequested unsupported ContentType in Accept: \u001b[39;49;00m\u001b[33m'\u001b[39;49;00m + content_type)\n", 499 | " \u001b[34mreturn\u001b[39;49;00m\n", 500 | " \n", 501 | "\n", 502 | "\u001b[34mdef\u001b[39;49;00m \u001b[32mpredict_fn\u001b[39;49;00m(input_data, models):\n", 503 | "\n", 504 | " model_bert, tokenizer = models\n", 505 | " sequence_0 = input_data[\u001b[34m0\u001b[39;49;00m] \n", 506 | " sequence_1 = input_data[\u001b[34m1\u001b[39;49;00m]\n", 507 | " \n", 508 | " max_length = \u001b[34m512\u001b[39;49;00m\n", 509 | " tokenized_sequence_pair = tokenizer.encode_plus(sequence_0,\n", 510 | " sequence_1,\n", 511 | " max_length=max_length,\n", 512 | " padding=\u001b[33m'\u001b[39;49;00m\u001b[33mmax_length\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m,\n", 513 | " truncation=\u001b[34mTrue\u001b[39;49;00m,\n", 514 | " return_tensors=\u001b[33m'\u001b[39;49;00m\u001b[33mpt\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m).to(device)\n", 515 | " \n", 516 | " \u001b[37m# Convert example inputs to a format that is compatible with TorchScript tracing\u001b[39;49;00m\n", 517 | " example_inputs = tokenized_sequence_pair[\u001b[33m'\u001b[39;49;00m\u001b[33minput_ids\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m], tokenized_sequence_pair[\u001b[33m'\u001b[39;49;00m\u001b[33mattention_mask\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m]\n", 518 | " \n", 519 | " \u001b[34mwith\u001b[39;49;00m torch.no_grad():\n", 520 | " paraphrase_classification_logits = model_bert(*example_inputs)\n", 521 | " \n", 522 | " classes = [\u001b[33m'\u001b[39;49;00m\u001b[33mparaphrase\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m,\u001b[33m'\u001b[39;49;00m\u001b[33mnot paraphrase\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m]\n", 523 | " paraphrase_prediction = paraphrase_classification_logits[\u001b[34m0\u001b[39;49;00m][\u001b[34m0\u001b[39;49;00m].argmax().item()\n", 524 | " out_str = \u001b[33m'\u001b[39;49;00m\u001b[33mBERT predicts that \u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33m and \u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33m are \u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m.format(sequence_0, sequence_1, classes[paraphrase_prediction])\n", 525 | " \n", 526 | " \u001b[34mreturn\u001b[39;49;00m out_str\n", 527 | "\n", 528 | "\n", 529 | "\u001b[34mdef\u001b[39;49;00m \u001b[32moutput_fn\u001b[39;49;00m(prediction_output, accept=JSON_CONTENT_TYPE):\n", 530 | " \u001b[34mif\u001b[39;49;00m accept == JSON_CONTENT_TYPE:\n", 531 | " \u001b[34mreturn\u001b[39;49;00m json.dumps(prediction_output), accept\n", 532 | " \n", 533 | " \u001b[34mraise\u001b[39;49;00m \u001b[36mException\u001b[39;49;00m(\u001b[33m'\u001b[39;49;00m\u001b[33mRequested unsupported ContentType in Accept: \u001b[39;49;00m\u001b[33m'\u001b[39;49;00m + accept)\n", 534 | " \n" 535 | ] 536 | } 537 | ], 538 | "source": [ 539 | "!pygmentize code/inference.py" 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": {}, 545 | "source": [ 546 | "In this case, notice that we will load the corresponding model depending on where the function is deployed. `model_fn` will return a tuple containing both the model and its corresponding tokenizer. 
Both the model and the input data will be sent `.to(device)`, which can be a GPU or a CPU.\n", 547 | "\n", 548 | "Also, notice the `predict_fn`. In this function we receive the sequences for inference, convert them to the format the model accepts, run the inference, and format the result in clear text as a return string. In real life you might not need this interpretation step, since your application might be fine receiving the predicted class and using it directly." 549 | ] 550 | }, 551 | { 552 | "cell_type": "markdown", 553 | "metadata": {}, 554 | "source": [ 555 | "### Create and compile Pytorch model for the inf1 instance " 556 | ] 557 | }, 558 | { 559 | "cell_type": "markdown", 560 | "metadata": {}, 561 | "source": [ 562 | "We will now create a new `Huggingface` model that will use the `inference.py` file described above as its entry point script. " 563 | ] 564 | }, 565 | { 566 | "cell_type": "code", 567 | "execution_count": 12, 568 | "metadata": {}, 569 | "outputs": [], 570 | "source": [ 571 | "from sagemaker.huggingface import HuggingFaceModel\n", 572 | "from sagemaker.predictor import Predictor\n", 573 | "from datetime import datetime\n", 574 | "from sagemaker.serializers import JSONSerializer\n", 575 | "from sagemaker.deserializers import JSONDeserializer\n", 576 | "\n", 577 | "date_string = datetime.now().strftime(\"%Y%m-%d%H-%M%S\")\n", 578 | "hardware = \"inf1\"\n", 579 | "compilation_job_name = f\"paraphrase-bert-en-{hardware}-\" + date_string\n", 580 | "output_model_path = f\"s3://{sess_bucket}/{prefix}/neo-compilations19/{hardware}-model\"\n", 581 | "\n", 582 | "compiled_inf1_model = HuggingFaceModel(\n", 583 | " model_data=traced_model_url,\n", 584 | " predictor_cls=Predictor,\n", 585 | " transformers_version=\"4.12.3\",\n", 586 | " pytorch_version='1.9.1',\n", 587 | " role=role,\n", 588 | " entry_point=\"inference.py\",\n", 589 | " source_dir=\"code\",\n", 590 | " py_version=\"py37\",\n", 591 | " name=f\"distilbert-{date_string}\",\n", 592 | " env={\"SAGEMAKER_CONTAINER_LOG_LEVEL\": \"10\"},\n", 593 | ")" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "We are ready to compile the model! 
Two additional notes:\n", 601 | "* HuggingFace models should be compiled to `dtype` `int64`\n", 602 | "* the format for `compiler_options` differs from the standard Python `dict` that you can use when compiling for \"normal\" instance types; for inferentia, you must provide a JSON string with CLI arguments, which correspond to the ones supported by the [Neuron Compiler](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-cc/command-line-reference.html) (read more about `compiler_options` [here](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_OutputConfig.html#API_OutputConfig_Contents))\n" 603 | ] 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "metadata": {}, 608 | "source": [ 609 | "#### Model compilation" 610 | ] 611 | }, 612 | { 613 | "cell_type": "markdown", 614 | "metadata": {}, 615 | "source": [ 616 | "Let's compile the model (this will take around 10-15 minutes to complete):" 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": 13, 622 | "metadata": {}, 623 | "outputs": [ 624 | { 625 | "name": "stdout", 626 | "output_type": "stream", 627 | "text": [ 628 | "??????????????????????????????????.....................................................................................................................!CPU times: user 509 ms, sys: 69.3 ms, total: 578 ms\n", 629 | "Wall time: 12min 56s\n" 630 | ] 631 | } 632 | ], 633 | "source": [ 634 | "%%time\n", 635 | "import json\n", 636 | "\n", 637 | "compiled_inf1_model = compiled_inf1_model.compile(\n", 638 | " target_instance_family=f\"ml_{hardware}\",\n", 639 | " input_shape={\"input_ids\": [1, 512], \"attention_mask\": [1, 512]},\n", 640 | " job_name=compilation_job_name,\n", 641 | " role=role,\n", 642 | " framework=\"pytorch\",\n", 643 | " framework_version=\"1.9.1\",\n", 644 | " output_path=output_model_path,\n", 645 | " compiler_options=json.dumps(\"--dtype int64\"),\n", 646 | " compile_max_run=900,\n", 647 | ")" 648 | ] 649 | }, 650 | { 651 | "cell_type": "markdown", 652 | "metadata": {}, 653 | "source": [ 654 | "#### Compiler logs and artifacts\n", 655 | "Open a new browser tab and navigate to the SageMaker console. In the navigation menu on the left, expand \"Inference\" and select \"Compilation jobs\". This is where you will find the job that was executed in the previous cell. Click on the job name to see its details. If you scroll down to the \"Monitor\" section, you can access the compiler logs hosted in CloudWatch. Look for the successful completion of the job in lines similar to the following:" 656 | ] 657 | }, 658 | { 659 | "cell_type": "raw", 660 | "metadata": {}, 661 | "source": [ 662 | "localhost compiler-container-Primary[4736]: Compiler status PASS\n", 663 | "and localhost compiler-container-Primary[4736]: INFO:Neuron:Neuron successfully compiled 1 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 100.0%" 664 | ] 665 | }, 666 | { 667 | "cell_type": "markdown", 668 | "metadata": {}, 669 | "source": [ 670 | "Also, in the Output section, you will find a link to the S3 compiled model artifact. Click on it to see where it was stored."
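If you prefer not to leave the notebook, the same information can be retrieved programmatically. The snippet below is an illustrative sketch (not part of the original walkthrough) that queries the compilation job using the `compilation_job_name` variable defined earlier:

```python
import boto3

# Illustrative only: check the Neo compilation job status and the location of the
# compiled artifact without opening the SageMaker console.
sm_client = boto3.client("sagemaker")

job = sm_client.describe_compilation_job(CompilationJobName=compilation_job_name)
print("Compilation job status:", job["CompilationJobStatus"])
print("Compiled model artifact:", job["ModelArtifacts"]["S3ModelArtifacts"])
```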
671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": 14, 676 | "metadata": {}, 677 | "outputs": [ 678 | { 679 | "name": "stdout", 680 | "output_type": "stream", 681 | "text": [ 682 | "Compilation job name: paraphrase-bert-en-inf1-202205-1816-2816 \n", 683 | "Output model path in S3: s3://sagemaker-us-east-1-563487891580/neuron-experiments/bert-seq-classification/neo-compilations19/inf1-model\n" 684 | ] 685 | } 686 | ], 687 | "source": [ 688 | "print(\"Compilation job name: {} \\nOutput model path in S3: {}\".format(compilation_job_name, output_model_path))" 689 | ] 690 | }, 691 | { 692 | "cell_type": "markdown", 693 | "metadata": {}, 694 | "source": [ 695 | "### Deploy compiled model into the inf1 instance " 696 | ] 697 | }, 698 | { 699 | "cell_type": "markdown", 700 | "metadata": {}, 701 | "source": [ 702 | "After successful compilation, we deploy the new model to an ml.inf1.xlarge instance based [real time endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html). As you can see, the one-line deployment call is the same as for the GPU based endpoint; because we pass `wait=False`, the call returns immediately while the endpoint is created in the background. " 703 | ] 704 | }, 705 | { 706 | "cell_type": "code", 707 | "execution_count": 15, 708 | "metadata": {}, 709 | "outputs": [ 710 | { 711 | "name": "stdout", 712 | "output_type": "stream", 713 | "text": [ 714 | "CPU times: user 13.2 s, sys: 2.42 s, total: 15.6 s\n", 715 | "Wall time: 15.4 s\n" 716 | ] 717 | } 718 | ], 719 | "source": [ 720 | "%%time\n", 721 | "\n", 722 | "compiled_inf1_predictor = compiled_inf1_model.deploy(\n", 723 | " instance_type=\"ml.inf1.xlarge\",\n", 724 | " initial_instance_count=1,\n", 725 | " endpoint_name=f\"paraphrase-bert-en-{hardware}-{date_string}\",\n", 726 | " serializer=JSONSerializer(),\n", 727 | " deserializer=JSONDeserializer(),\n", 728 | " wait=False\n", 729 | ")" 730 | ] 731 | }, 732 | { 733 | "cell_type": "markdown", 734 | "metadata": {}, 735 | "source": [ 736 | "### Perform a test inference " 737 | ] 738 | }, 739 | { 740 | "cell_type": "markdown", 741 | "metadata": {}, 742 | "source": [ 743 | "As a final test, we first make sure the endpoint is up and running in an `InService` state, and then perform a simple inference by sending two sequences of text and waiting for the response." 744 | ] 745 | }, 746 | { 747 | "cell_type": "code", 748 | "execution_count": 18, 749 | "metadata": {}, 750 | "outputs": [ 751 | { 752 | "data": { 753 | "text/plain": [ 754 | "'BERT predicts that \"Welcome to AWS Summit San Francisco 2022! Thank you for attending the workshop on using Huggingface transformers on Inferentia instances.\" and \"Welcome to AWS Summit San Francisco 2022! 
Thank you for attending the workshop on using Huggingface transformers on Inferentia instances.\" are paraphrase'" 755 | ] 756 | }, 757 | "execution_count": 18, 758 | "metadata": {}, 759 | "output_type": "execute_result" 760 | } 761 | ], 762 | "source": [ 763 | "# Predict with model endpoint\n", 764 | "client = boto3.client('sagemaker')\n", 765 | "\n", 766 | "# Let's make sure it is up and running first\n", 767 | "status = \"\"\n", 768 | "while status != 'InService':\n", 769 | " endpoint_response = client.describe_endpoint(EndpointName=f\"paraphrase-bert-en-{hardware}-{date_string}\")\n", 770 | " status = endpoint_response['EndpointStatus']\n", 771 | "\n", 772 | "\n", 773 | "# Send a payload to the endpoint and receive the inference\n", 774 | "payload = seq_0, seq_1\n", 775 | "compiled_inf1_predictor.predict(payload)" 776 | ] 777 | }, 778 | { 779 | "cell_type": "markdown", 780 | "metadata": {}, 781 | "source": [ 782 | "---" 783 | ] 784 | }, 785 | { 786 | "cell_type": "markdown", 787 | "metadata": {}, 788 | "source": [ 789 | "## 5. Benchmark and comparison " 790 | ] 791 | }, 792 | { 793 | "cell_type": "markdown", 794 | "metadata": {}, 795 | "source": [ 796 | "Now that we have both endpoints online, we will perform a benchmark using Python's `threading` module. In each benchmark, we start 5 threads that each make a fixed number of requests to the model endpoint (100 per thread for the GPU endpoint and 300 per thread for the Inferentia endpoint). We measure the inference latency for each request, and we also measure the total time to finish the task, so that we can get an estimate of the request throughput/second." 797 | ] 798 | }, 799 | { 800 | "cell_type": "markdown", 801 | "metadata": {}, 802 | "source": [ 803 | "### Benchmark GPU based endpoint " 804 | ] 805 | }, 806 | { 807 | "cell_type": "code", 808 | "execution_count": 19, 809 | "metadata": {}, 810 | "outputs": [ 811 | { 812 | "name": "stdout", 813 | "output_type": "stream", 814 | "text": [ 815 | "Thread 140033736738560 started\n", 816 | "Thread 140034523002624 started\n", 817 | "Thread 140034497824512 started\n", 818 | "Thread 140034472646400 started\n", 819 | "Thread 140034513561344 started\n", 820 | "CPU times: user 991 ms, sys: 37.7 ms, total: 1.03 s\n", 821 | "Wall time: 16.9 s\n" 822 | ] 823 | } 824 | ], 825 | "source": [ 826 | "%%time\n", 827 | "# Run the benchmark \n", 828 | "\n", 829 | "import threading\n", 830 | "import time\n", 831 | "\n", 832 | "num_preds = 100\n", 833 | "num_threads = 5\n", 834 | "\n", 835 | "times = []\n", 836 | "\n", 837 | "\n", 838 | "def predict():\n", 839 | " thread_id = threading.get_ident()\n", 840 | " print(f\"Thread {thread_id} started\")\n", 841 | "\n", 842 | " for i in range(num_preds):\n", 843 | " tick = time.time()\n", 844 | " response = normal_predictor.predict(payload)\n", 845 | " tock = time.time()\n", 846 | " times.append((thread_id, tock - tick))\n", 847 | "\n", 848 | "\n", 849 | "threads = []\n", 850 | "[threads.append(threading.Thread(target=predict, daemon=False)) for i in range(num_threads)]\n", 851 | "[t.start() for t in threads]\n", 852 | "\n", 853 | "# Wait for threads, get an estimate of total time\n", 854 | "start = time.time()\n", 855 | "[t.join() for t in threads]\n", 856 | "end = time.time() - start" 857 | ] 858 | }, 859 | { 860 | "cell_type": "code", 861 | "execution_count": 20, 862 | "metadata": {}, 863 | "outputs": [ 864 | { 865 | "data": { 866 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXcAAAEICAYAAACktLTqAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAZZ0lEQVR4nO3dfZRcdZ3n8feHhMDIgwnQmwkk0IABYRwI2kZ2xmAUkACDgT0OwvoQFCe4wllcnXUjzFFmRs9hVHDh4OAEYcLzwwyCcQAhkwVcdwhDR2N4CA8JJCYxJM1zAAcNfPeP++twaaq6q+tWV3f/8nmdU6dv/X734Vu3637q9u9WVykiMDOzvGw33AWYmVnrOdzNzDLkcDczy5DD3cwsQw53M7MMOdzNzDLkcLeWk9QpKSSNHe5ampXqf1edvk9KuqvdNZkNhsN9BJC0WtJvJb0s6WlJCyTtPNx11SPpHkmfb9G6Zkpa14p1tUtEXBsRHx1ovvR7/GY7ahrpJI2T9HVJj0l6RdJ6SXdI+mhpnvJxsLF8HKS+o/qs8zRJP2/3YxktHO4jxwkRsTMwDTgM+NrwlmMj2Sj8q+ifgdnAZ4AJwL7ARcDxfebrPQ7eC3QBf9XOInPicB9hIuJp4E6KkAdA0uGS/k3SC5J+JWlmqW9fSfdK2ixpkaRLJF2T+t52Vlw+A5K0naR5klZJelbSTZJ2S307Sromtb8g6QFJEyV9C5gBXJLOsC4Z6DFJ+qykFanGJyWdkdp3Au4A9kzrelnSngPU1TvkM0fSryU9I+nc0rbGSDonLbtZ0lJJUyR9X9IFfepaKOl/9FP6UZKeSI//+5KUltt6xqjC9yRtkvSSpAclvUfSXOCTwFfT4/pJmv+g9JfPC5IelvSxUj27S/pJWs8Dkr5ZPjNNj/tMSU8AT6S2iyStTcsslTSjNP95kv4p/R43p9oOkPS1VO/a8plzjd9bf7UuSPvktrTu+yXtX2c9RwFHA7Mj4v6I+F26/TQizq61TESsp3huvKef34/1JyJ8G+YbsBo4Kk1PBh4ELkr39wKeBY6jeDE+Ot3vSP33ARcCOwBHAJuBa1LfTGBdP9s6G1iStrkD8A/A9anvDOAnwDuAMcD7gF1T3z3A5/t5PJ1AAGPT/eOB/QEBHwJeBd7bT4391dW77suAPwAOBV4DDkr9/zPtvwPT9g4FdgemA78Btkvz7ZHqmFjnMQTwL8B4YG+gB5iV+k4Dfp6mjwGWpvkEHARMSn0LgG+W1rk9sBI4BxgHfCT9vg5M/Tek2zuAg4G1vdsp1bQI2A34g9T2qfT4xgJfAZ4Gdkx95wH/kWocC1wFPAWcm2r5C+CpOo9/oFoXUDwPp6d1XwvcUGdd5wP3DPI4mAI8DPxt377S/Ft/D77V2J/DXYBvW5+4L6eDJ4DFwPjU97+Aq/vMfycwJ4XOFmCnUt91NB7uK4AjS32TgN+ng/VzwL8Bh9So9x4GEe41+m8Fzu6nxv7q6l335FL/vwOnpOnHKM4Qa213BXB0mj4LuL2fxxDAB0v3bwLmpemtoZJC73HgcNILR2mZBbw13GdQhO92pbbrKUJ4THqMB5b6vsnbw/0jAzyXngcOTdPnAYtKfSek59mYdH+XtM7xNdZTt9bSY/thqe844NE6Nf2QUvBTvDi9ALwI/EeN4+AFYA3w97z5IrYah/ugbh6WGTlOjIhdKMLu3RRnlgD7AH+e/jR+QdILwAcpAm9P4PmIeKW0njWD2OY+wC2l9a4AXgcmAldTvIjcIOk3kr4taftmHpikYyUtkfRc2s5xpcc32Lp6PV2afhXovQA9BVhVZ71XUpzpkn5ePUDp9baxVUT8H+AS4PvAJknzJe1aZ317Amsj4o1S2xqKv846KF681pb6ytM12yT9ZRryejHtq3fy1n27sTT9W+CZiHi9dJ9aj2uAWnsNuH+SZymerwBExHMRMZ7ir8Ed+sx7YkSMj4h9IuKLEdFb4xaKvybKtqd4QbQaHO4jTETcS3FW9N3UtJbizH186bZTRJwPbAAmpLHrXnuXpl+h+BMfKMajKUKk11rg2D7r3jEi1kfE7yPiryPiYOBPgD+juBgGxdleQyTtANycHs/EdFDfTjGEUW9ddetqYJNrKYaAarkGmC3pUIrhk1sbfRz9iYiLI+J9FEMpB1AMDcHbH9tvgCmSysfd3sB6imGfLRRDUb2m1Npc70QaX/8qcDIwIe3bF3lz31bRX62DtRh4v6TJA85Z368p/mor25fBncxsUxzuI9P/Bo5OIXQNcIKkY9LFwh1VXCidHBFrgG7gr1W81eyDFH9693oc2FHS8ems+69465nSD4BvSdoHQFKHpNlp+sOS/ji9ILxEcYbUexa3EdivwccyLm2zB9gi6VigfBFvI7C7pHc2UlcDfgj8raSp6WLnIZJ2B4iIdcADFGfsN5fOCpsm6f2SPpD27ysUY9z19tP9FGe4X5W0vYoL4ydQDFm8DvwIOE/SOyS9mzdfTOvZheIFoQcYK+nrQL2/Ggarbq2DXVFE3AXcDdya9tW4tL8OH8RqbgS+JOnd6ffaRTF0OOh6thUO9xEoInooLn59PSLWUryF7ByKg3gtxZlh7+/uvwIfAJ4DvpGW613Pi8AXKQJvPUX4lN89cxGwELhL0maKi5gfSH1/SPH2tZcohkXu5c1hjIuAj0t6XtLFAzyWzcB/pxizfj7Vu7DU/yjFWO6TaRhmzwHqGsiFaVt3pdovp7jw2utK4I8ZeEimUbtSXNx9nuIs8lngO6nvcuDg9LhujYjfUQTkscAzFGPKn0n7AIrrAO+kGO64mmK/vNbPtu8EfkrxIr6G4oWl1lDOoDVQ62CdRHGB+hqKMfWnKN5NdEyDy18G/CPFRf4XKZ7n50bET5usJ3tKFyYsE5LOA94VEZ8aaN5tkaQjKAJmnxjhT35Jfwf8YUTMGe5abPTxmbttM9JQwNkU7/IYccGehhwOScMO04HTgVuGuy4bnRzutk2QdBDFcMAkimsaI9EuFOPur1CMMV8A/HhYK7JRy8MyZmYZ8pm7mVmGRsSHD+2xxx7R2dk53GWYmY0qS5cufSYiOmr1jYhw7+zspLu7e7jLMDMbVSTV/ScuD8uYmWXI4W5mliGHu5lZhhzuZmYZcribmWXI4W5mliGHu5lZhhzuZmYZcribmWVoRPyHqg2fznm3bZ1eff7xw1iJmbWSz9zNzDLkcDczy9CA4S5piqS7JT0i6WFJZ6f23SQtkvRE+jkhtUvSxZJWSlou6b1D/SDMzOytGjlz3wJ8JSIOpvi28jMlHQzMAxZHxFRgcboPxRfqTk23ucClLa/azMz6NWC4R8SGiPhFmt4MrAD2AmZTfJM86eeJaXo2cFUUlgDjJU1qdeFmZlbfoMbcJXUChwH3AxMjYkPqehqYmKb3AtaWFluX2vqua66kbkndPT09g63bzMz60XC4S9oZuBn4UkS8VO5L3yQ/qC9jjYj5EdEVEV0dHTW/SMTMzJrUULhL2p4i2K+NiB+l5o29wy3p56bUvh6YUlp8cmozM7M2aeTdMg
IuB1ZExIWlroXAnDQ9B/hxqf0z6V0zhwMvloZvzMysDRr5D9U/BT4NPChpWWo7BzgfuEnS6cAa4OTUdztwHLASeBX4bCsLNjOzgQ0Y7hHxc0B1uo+sMX8AZ1asy8zMKvB/qJqZZcjhbmaWIYe7mVmGHO5mZhlyuJuZZcjhbmaWIYe7mVmGHO5mZhlyuJuZZcjhbmaWIYe7mVmGHO5mZhlyuJuZZcjhbmaWIYe7mVmGHO5mZhlq5Gv2rpC0SdJDpbYbJS1Lt9W939AkqVPSb0t9PxjC2s3MrI5GvmZvAXAJcFVvQ0R8onda0gXAi6X5V0XEtBbVZ2ZmTWjka/Z+JqmzVl/68uyTgY+0uC4zM6ug6pj7DGBjRDxRattX0i8l3StpRr0FJc2V1C2pu6enp2IZZmZWVjXcTwWuL93fAOwdEYcBXwauk7RrrQUjYn5EdEVEV0dHR8UyzMysrOlwlzQW+C/Ajb1tEfFaRDybppcCq4ADqhZpZmaDU+XM/Sjg0YhY19sgqUPSmDS9HzAVeLJaiWZmNliNvBXyeuA+4EBJ6ySdnrpO4a1DMgBHAMvTWyP/GfhCRDzXwnrNzKwBjbxb5tQ67afVaLsZuLl6WWZmVoX/Q9XMLEMOdzOzDDnczcwy5HA3M8uQw93MLEMOdzOzDDnczcwy5HA3M8uQw93MLEMOdzOzDDnczcwy5HA3M8uQw93MLEMOdzOzDDnczcwy5HA3M8tQI9/EdIWkTZIeKrWdJ2m9pGXpdlyp72uSVkp6TNIxQ1W4mZnV18iZ+wJgVo3270XEtHS7HUDSwRRfv/dHaZm/7/1OVTMza58Bwz0ifgY0+j2os4EbIuK1iHgKWAlMr1CfmZk1ocqY+1mSlqdhmwmpbS9gbWmedantbSTNldQtqbunp6dCGWZm1lez4X4psD8wDdgAXDDYFUTE/Ijoioiujo6OJsswM7Namgr3iNgYEa9HxBvAZbw59LIemFKadXJqMzOzNmoq3CVNKt09Ceh9J81C4BRJO0jaF5gK/Hu1Es3MbLDGDjSDpOuBmcAektYB3wBmSpoGBLAaOAMgIh6WdBPwCLAFODMiXh+Sys3MrK4Bwz0iTq3RfHk/838L+FaVoszMrBr/h6qZWYYc7mZmGXK4m5llyOFuZpYhh7uZWYYGfLeM5adz3m3DXYKZDTGfuZuZZcjhbmaWIYe7mVmGHO5mZhlyuJuZZcjhbmaWIYe7mVmGHO5mZhlyuJuZZcjhbmaWoQHDXdIVkjZJeqjU9h1Jj0paLukWSeNTe6ek30palm4/GMLazcysjkbO3BcAs/q0LQLeExGHAI8DXyv1rYqIaen2hdaUaWZmgzFguEfEz4Dn+rTdFRFb0t0lwOQhqM3MzJrUijH3zwF3lO7vK+mXku6VNKPeQpLmSuqW1N3T09OCMszMrFelcJd0LrAFuDY1bQD2jojDgC8D10natdayETE/Iroioqujo6NKGWZm1kfT4S7pNODPgE9GRABExGsR8WyaXgqsAg5oQZ1mZjYITYW7pFnAV4GPRcSrpfYOSWPS9H7AVODJVhRqZmaNG/CbmCRdD8wE9pC0DvgGxbtjdgAWSQJYkt4ZcwTwN5J+D7wBfCEinqu5YjMzGzIDhntEnFqj+fI6894M3Fy1KDMzq8b/oWpmliGHu5lZhhzuZmYZcribmWXI4W5mliGHu5lZhhzuZmYZcribmWXI4W5mliGHu5lZhhzuZmYZcribmWXI4W5mliGHu5lZhhzuZmYZcribmWWooXCXdIWkTZIeKrXtJmmRpCfSzwmpXZIulrRS0nJJ7x2q4s3MrLZGz9wXALP6tM0DFkfEVGBxug9wLMV3p04F5gKXVi/TzMwGo6Fwj4ifAX2/C3U2cGWavhI4sdR+VRSWAOMlTWpBrWZm1qAqY+4TI2JDmn4amJim9wLWluZbl9reQtJcSd2Sunt6eiqUYWZmfbXkgmpEBBCDXGZ+RHRFRFdHR0cryjAzs6RKuG/sHW5JPzel9vXAlNJ8k1ObmZm1SZVwXwjMSdNzgB+X2j+T3jVzOPBiafjGzMzaYGwjM0m6HpgJ7CFpHfAN4HzgJkmnA2uAk9PstwPHASuBV4HPtrhmMzMbQEPhHhGn1uk6ssa8AZxZpSgzM6vG/6FqZpYhh7uZWYYc7mZmGXK4m5llyOFuZpYhh7uZWYYc7mZmGXK4m5llyOFuZpYhh7uZWYYc7mZmGWros2Vs29A577at06vPP34YKzGzqnzmbmaWIYe7mVmGHO5mZhlyuJuZZajpC6qSDgRuLDXtB3wdGA/8BdCT2s+JiNub3Y6ZmQ1e0+EeEY8B0wAkjaH4EuxbKL5W73sR8d1WFGhmZoPXqmGZI4FVEbGmReszM7MKWhXupwDXl+6fJWm5pCskTai1gKS5kroldff09NSaxczMmlQ53CWNAz4G/FNquhTYn2LIZgNwQa3lImJ+RHRFRFdHR0fVMszMrKQVZ+7HAr+IiI0AEbExIl6PiDeAy4DpLdiGmZkNQivC/VRKQzKSJpX6TgIeasE2zMxsECp9toyknYCjgTNKzd+WNA0IYHWfPjMza4NK4R4RrwC792n7dKWKzMysMv+HqplZhhzuZmYZcribmWXI4W5mliGHu5lZhhzuZmYZcribmWXI4W5mliGHu5lZhhzuZmYZcribmWXI4W5mliGHu5lZhhzuZmYZcribmWXI4W5mlqFKX9YBIGk1sBl4HdgSEV2SdgNuBDopvo3p5Ih4vuq2zMysMa06c/9wREyLiK50fx6wOCKmAovTfTMza5OhGpaZDVyZpq8EThyi7ZiZWQ2tCPcA7pK0VNLc1DYxIjak6aeBiS3YjpmZNajymDvwwYhYL+k/AYskPVrujIiQFH0XSi8EcwH23nvvFpRhZma9Kp+5R8T69HMTcAswHdgoaRJA+rmpxnLzI6IrIro6OjqqlmFmZiWVwl3STpJ26Z0GPgo8BCwE5qTZ5gA/rrIdMzMbnKrDMhOBWyT1ruu6iPippAeAmySdDqwBTq64HTMzG4RK4R4RTwKH1mh/FjiyyrrNzKx5/g9VM7MMOdzNzDLkcDczy1Ar3uduo0DnvNuGuwQzayOfuZuZZcjhbmaWIYe7mVmGHO5mZhlyuJuZZcjhbmaWIYe7mVmGHO5mZhlyuJuZZcjhbmaWIYe7mVmGHO5mZhlyuJuZZajpcJc0RdLdkh6R9LCks1P7eZLWS1qWbse1rlwzM2tElY/83QJ8JSJ+kb4ke6mkRanvexHx3erlmZlZM5oO94jYAGxI05slrQD2alVhZmbWvJaMuUvqBA4D7k9NZ0laLukKSRPqLDNXUrek7p6enlaUYWZmSeVwl7QzcDPwpYh4CbgU2B+YRnFmf0Gt5SJifkR0RURXR0dH1TLMzKykUrhL2p4i2K+NiB8BRMTGiHg9It4ALgOmVy/TzMwGo8q7ZQRcDqyIiAtL7ZNKs50EPNR8eWZm1owq75b5U+DTwIOSlqW2c4BTJU0DAlgNnFFhG2Zm1oQq75b5OaAaXbc3X46ZmbWC/0PVzCxDDnczs
ww53M3MMuRwNzPLkMPdzCxDDnczsww53M3MMuRwNzPLkMPdzCxDVT5+wDLWOe+2rdOrzz9+GCsxs2b4zN3MLEMOdzOzDDnczcwy5HA3M8uQw93MLEN+t4y1nN9pYzb8hizcJc0CLgLGAD+MiPOHaltWWzlkzWzbMiThLmkM8H3gaGAd8ICkhRHxyFBsb1vT35mxA93MYOjO3KcDKyPiSQBJNwCzgSEJ99E4DNCqmtsd5oOte6jnb2Q9Vde1rRiNx9Fo1Y59rYho/UqljwOzIuLz6f6ngQ9ExFmleeYCc9PdA4HHWl7IwPYAnhmG7bbCaK19tNYNrn04jNa6oT217xMRHbU6hu2CakTMB+YP1/YBJHVHRNdw1tCs0Vr7aK0bXPtwGK11w/DXPlRvhVwPTCndn5zazMysDYYq3B8ApkraV9I44BRg4RBty8zM+hiSYZmI2CLpLOBOirdCXhERDw/Ftioa1mGhikZr7aO1bnDtw2G01g3DPew8FBdUzcxsePnjB8zMMuRwNzPLULbhLmmWpMckrZQ0r0b/EZJ+IWlLel9+3/5dJa2TdEl7Kt663abrlrS3pLskrZD0iKTOthVO5dq/LenhVPvFktS+yhuq/ctpny6XtFjSPqW+OZKeSLc5o6FuSdMk3Zf2+XJJn2hn3VVqL/WP1GO0v+dK+47RiMjuRnERdxWwHzAO+BVwcJ95OoFDgKuAj9dYx0XAdcAlo6Vu4B7g6DS9M/CO0VA78CfA/0vrGAPcB8wcYbV/uHd/Av8NuDFN7wY8mX5OSNMTRkHdBwBT0/SewAZg/GjY56X+kXqM1q27ncdormfuWz/+ICJ+B/R+/MFWEbE6IpYDb/RdWNL7gInAXe0otqTpuiUdDIyNiEVpvpcj4tU21Q3V9nkAO1IcLDsA2wMbh77krRqp/e7S/lxC8b8bAMcAiyLiuYh4HlgEzBrpdUfE4xHxRJr+DbAJqPmfjkOkyj4f6cdozbrbfYzmGu57AWtL99eltgFJ2g64APjLIahrIE3XTXEm9oKkH0n6paTvpA9wa5ema4+I+4C7Kc4eNwB3RsSKlldY32BrPx24o8llW6lK3VtJmk7xwrqqpdX1r+naR9kxWt7nbT1G/Xnub/dF4PaIWNfmYd+qxgIzgMOAXwM3AqcBlw9jTQ2R9C7gIN48M1skaUZE/N9hLKsmSZ8CuoAPDXctg1GvbkmTgKuBORHxtr9iR4IatY+KY7RG3W09RnMN9yoff/CfgRmSvkgxJjZO0ssR8bYLJ0OgSt3rgGXx5idx3gocTvvCvUrtJwFLIuJlAEl3UPwe2hXuDdUu6SjgXOBDEfFaadmZfZa9Z0iqfLsqdSNpV+A24NyIWDLEtfZVpfYRf4zWqbu9x2i7LkS080bxovUksC9vXvT4ozrzLqDGBdXUdxrtvVjTdN0UF3p+BXSk+/8InDlKav8E8K9pHdsDi4ETRlLtFGdbq0gXIUvtuwFPUVxMnZCmdxsFdY9L+/lL7drPraq9zzwj7hjtZ5+39Rht+y+1jb+E44DH004+N7X9DfCxNP1+ilfSV4BngYeH+4lTtW6KL0dZDjyYAnTcaKg9Pen/AVhB8Zn/F47A58u/UlzkXZZuC0vLfg5YmW6fHQ11A58Cfl9qXwZMGw2191nHSDxG+3uutO0Y9ccPmJllKNd3y5iZbdMc7mZmGXK4m5llyOFuZpYhh7uZWYYc7mZmGXK4m5ll6P8DHuOamgmr0TwAAAAASUVORK5CYII=\n", 867 | "text/plain": [ 868 | "
" 869 | ] 870 | }, 871 | "metadata": { 872 | "needs_background": "light" 873 | }, 874 | "output_type": "display_data" 875 | }, 876 | { 877 | "name": "stdout", 878 | "output_type": "stream", 879 | "text": [ 880 | "==== Default HuggingFace model on GPU benchmark ====\n", 881 | "\n", 882 | "95 % of requests take less than 170.16630172729492 ms\n", 883 | "Rough request throughput/second is 29.591531706537907\n" 884 | ] 885 | } 886 | ], 887 | "source": [ 888 | "# Display results \n", 889 | "from matplotlib.pyplot import hist, title, show, xlim\n", 890 | "import numpy as np\n", 891 | "\n", 892 | "TPS_GPU = (num_preds * num_threads) / end\n", 893 | "\n", 894 | "t_GPU = [duration for thread__id, duration in times]\n", 895 | "latency_percentiles = np.percentile(t_GPU, q=[50, 90, 95, 99])\n", 896 | "latency_GPU = latency_percentiles[2]*1000\n", 897 | "\n", 898 | "hist(t_GPU, bins=100)\n", 899 | "title(\"Request latency histogram on GPU\")\n", 900 | "show()\n", 901 | "\n", 902 | "print(\"==== Default HuggingFace model on GPU benchmark ====\\n\")\n", 903 | "print(f\"95 % of requests take less than {latency_GPU} ms\")\n", 904 | "print(f\"Rough request throughput/second is {TPS_GPU}\")" 905 | ] 906 | }, 907 | { 908 | "cell_type": "markdown", 909 | "metadata": {}, 910 | "source": [ 911 | "We can see that request latency is in the 1-1.2 second range, and throughput is ~4.5 TPS." 912 | ] 913 | }, 914 | { 915 | "cell_type": "markdown", 916 | "metadata": {}, 917 | "source": [ 918 | "### Benchmark Inferentia based endpoint " 919 | ] 920 | }, 921 | { 922 | "cell_type": "code", 923 | "execution_count": 21, 924 | "metadata": {}, 925 | "outputs": [ 926 | { 927 | "name": "stdout", 928 | "output_type": "stream", 929 | "text": [ 930 | "Thread 140033736738560 startedThread 140034523002624 started\n", 931 | "\n", 932 | "Thread 140034472646400 started\n", 933 | "Thread 140034513561344 started\n", 934 | "Thread 140034490779392 started\n", 935 | "CPU times: user 2.16 s, sys: 91.3 ms, total: 2.25 s\n", 936 | "Wall time: 13 s\n" 937 | ] 938 | } 939 | ], 940 | "source": [ 941 | "%%time\n", 942 | "# Run benchmark \n", 943 | "\n", 944 | "import threading\n", 945 | "import time\n", 946 | "\n", 947 | "\n", 948 | "num_preds = 300\n", 949 | "num_threads = 5\n", 950 | "\n", 951 | "times = []\n", 952 | "\n", 953 | "\n", 954 | "def predict():\n", 955 | " thread_id = threading.get_ident()\n", 956 | " print(f\"Thread {thread_id} started\")\n", 957 | "\n", 958 | " for i in range(num_preds):\n", 959 | " tick = time.time()\n", 960 | " response = compiled_inf1_predictor.predict(payload)\n", 961 | " tock = time.time()\n", 962 | " times.append((thread_id, tock - tick))\n", 963 | "\n", 964 | "\n", 965 | "threads = []\n", 966 | "[threads.append(threading.Thread(target=predict, daemon=False)) for i in range(num_threads)]\n", 967 | "[t.start() for t in threads]\n", 968 | "\n", 969 | "# Make a rough estimate of total time, wait for threads\n", 970 | "start = time.time()\n", 971 | "[t.join() for t in threads]\n", 972 | "end = time.time() - start" 973 | ] 974 | }, 975 | { 976 | "cell_type": "code", 977 | "execution_count": 22, 978 | "metadata": {}, 979 | "outputs": [ 980 | { 981 | "data": { 982 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYQAAAEICAYAAABfz4NwAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAaQElEQVR4nO3deZhldX3n8ffHbjYRaJYSsbulieKCRiP2II5LGIlsLs3zjPrgJKExOJ1EjBpNFHUeMS4ZTVQio9EQYVh0UAYXWsVoD0p8MiNIY9ygVTq4dLcNXbIJ4tb6nT/Or+BSVHXdqltddYu8X89znz7n9/vdc77n1L3nc885t6pTVUiS9ID5LkCSNBwMBEkSYCBIkhoDQZIEGAiSpMZAkCQBBoL6lGRFkkqyeL5rmalW/yMm6fv9JJ+f65r+vUjy1CTXJ7kzyYlDUM+1SY6a7zqGjYGwkyT5fpKftTfAjUnOS/Kg+a5rMkmuSPKSWVrWUUk2z8ay5kpVfbiqjplqXPs5vnUuahpmM/iA8GbgvVX1oKr65E4s7T4m+plV1WOr6oq5rGMhMBB2rudW1YOA3wGeCLxufsvRMFvIZ199OBi4diZPvJ/vl+FSVT52wgP4PvB7PfN/A3ymZ/5I4P8BtwFfB47q6TsE+GfgDmAd8F7gQ63vKGDzZOuiC/nTgX8DbgYuBvZrfbsDH2rttwFXAwcCbwN+DfwcuJPuk9z47VkBFLC4zb8Y2NBqvAH449a+J/Az4DdtWXcCD52irrFlrwZ+CPwYeEPPuhcBr2/PvQO4BlgOvA9417g61wJ/PsnPpIA/Aa5v2/8+IK3vFOBf2nSAM4FtwE+AbwKPA9YAvwJ+2bbrU238Y4Ar2jKvBZ7Xs879gU+15VwNvHVsPT01ndZq+l5rew+wqT3nGuDpPePfBPzv9nO8o9X2SLoPG9va847ZwetyR7We1/bJZ9qyrwIePslyxr8eJn1u+7n9pr0u7gR2A/YBzgG2AlvaflnU87P4v+1ncHPr2w14Z3t93AR8ANij9z0BvLrtg63Ai1vfZD+z73PPe+YI4Mttn2yle7/tOt/HkHk5bs13AffXx7gX3LL2xn1Pm1/aXugn0B0on9XmR1r/l4F3tzfBM9obrN9AeAVwZVvnbsA/ABe1vj+mOzg9kO4g+yRg79Z3BfCSHWzP+APAs4GH0x08fxe4Czh8BzXuqK6xZf8jsAfwBOAXwGNa/1+2/feotr4n0B1ojwB+BDygjTug1XHgJNtQwKeBJcDDgFHguNZ3CvcEwrF0B+IlbX2PAQ5qfecBb+1Z5i7ARrrA2hV4Zvt5Par1f6Q9HggcRnfAHh8I64D9uOcA9wdt+xbTHeRuBHZvfW+iC+5jW/8FwPeAN7Ra/istWCbY/qlqPY/udXhEW/aHgY/0+XrY4XO57wekT7TXwJ7Ag4GvcM+HilOA7cCftWXtQRcOa9t+2ovudfzfe15v2+kuS+1C9766C9h3op/ZBO+ZJ9F9QFvctmsD8Mr5PobMy3Frvgu4vz7aC+7O9oYr4HJgSet7LXDhuPGfo/uE/LD24t6zp+9/0X8gbACO7uk7iO4T0mLgj+jOSh4/Qb1XMI1AmKD/k8ArdlDjjuoaW/aynv6vACe16e8AqyZZ7wbgWW36ZcBlO9iGAp7WM38xcHqbPoV7AuGZwHfbQeIB45Zxr4ML8HS6A/YDetouojtwL2rb+KievonOEJ45xWvpVuAJbfpNwLqevue219nYp+u92jKXTLCcSWvt2bYP9vSdAHy7n9fDVM/l3q/RA+kCf4+e/hcBX+z5Wfywpy/AT+k5WwGewj1nVEfRnX0s7unfBhw50c9sfD0TbNsrgU9M9R6/Pz68h7BznVhVe9G9YB9N9wkWuuupL0hy29gDeBrdQfKhwK1V9dOe5fxgGus8GPhEz3I30F0OOhC4kC54PpLkR0n+JskuM9mwJMcnuTLJLW09J/Rs33TrGnNjz/RdwNhN+OV0lx0mcj7dJ2ravxdOUfpk67hbVX2B7rLB+4BtSc5Osvcky3sosKmqftPT9gO6s8ARusDb1NPXOz1hW5K/SLIhye1tX+3DvfftTT3TPwN+XFW/7plnou2aotYxU+6fHej3uQfTfZLf2vN6+Ae6M4UxvftkhO4M65qe8f/U2sfcXFXbZ1J7kkcm+XT78sdPgL9mx6/l+y0DYQ5U1T/TfUp5Z2vaRHeGsKTnsWdVvZ3uGua+SfbsWcTDeqZ/SvfmACDJIu79xtgEHD9u2btX1Zaq+lVV/VVVHQb8R+A5wMljZfa7PUl2Az7WtufAqloCXEb3SW6yZU1aVx+r3ER3eWoiHwJWJXkC3aWdT/a7HTtSVWdV1ZPoLvM8ku6yFdx3234ELE/S+156GN118VG6s71lPX3LJ1rd2ESSpwOvAV5Id8ljCXA79+zbQeyo1rm0ie4M4YCe18LeVfXYnjG9+/nHdEH32J7x+1T3hY1+TPXafj/wbeDQqtqb7pLabOzvBcdAmDt/BzyrHbg+BDw3ybFJFiXZvX1Vc1lV/QBYD/xVkl2TPI3ussCY7wK7J3l2+3T/3+iuyY/5APC2JAcDJBlJsqpN/6ckv91C5Cd0lzPGPi3eBPxWn9uya1vnKLA9yfFA71c2bwL2T7JPP3X14YPAW5Icms7jk+wPUFWb6W7WXgh8rKp+tqMF9SPJf0jy5LZ/f0p3zX6y/XQV3afR1yTZpX23/bl0189/DXwceFOSByZ5NPcE8GT2oguRUWBxkjcCk52dTNektc7S8vtSVVuBzwPvSrJ3kgckeXiS351k/G/o7i+dmeTBAEmWJjm2z1VO9drei+79cGf7Gf1pv9tyf2MgzJGqGqW7AfjGqtoErKL7JDJK94npL7nn5/FfgCcDtwBntOeNLed24KV0B8ktdAes3u/8v4fu5tvnk9xBdyP3ya3vIcAldC/+DXTfZLqw53nPT3JrkrOm2JY7gJfTXYO/tdW7tqf/23TXpm9op/gPnaKuqby7revzrfZz6G40jjkf+G2mvlzUr73pDkC30l1SuRn429Z3DnBY265PVtUv6Q6qx9N9kv174OS2D6C7r7EP3eWUC+n2yy92sO7P0V0O+W5b98+Z+DLTtPVR61w6me6DxXV0+/kSukumk3kt3Q3xK9tlnf9D9yWDftzrZzZB/1/QvYbvoPu5f7TP5d7vjH3lTkMsyZuAR1TVH0w19t+jJM+gO+s6uIb8BZ3kHcBDqmr1fNcijecZgha0dlnnFXTfcBm6MEjy6HaJK0mOAE6l+8qlNHQMBC1YSR5D98tEB9HdoxlGe9HdR/gp3aWIdwGXzmtF0iS8ZCRJAjxDkCQ1Q/1How444IBasWLFfJchSQvKNddc8+OqGpl65L0NdSCsWLGC9evXz3cZkrSgJJnOXze4m5eMJEmAgSBJagwESRJgIEiSGgNBkgQYCJKkxkCQJAEGgiSpMRAkScCQ/6byoFac/pm7p7//9mfPYyWSNPw8Q5AkAQaCJKkxECRJgIEgSWoMBEkSYCBIkhoDQZIE9BEISc5Nsi3Jtyboe3WSSn
JAm0+Ss5JsTPKNJIf3jF2d5Pr2WD27myFJGlQ/ZwjnAceNb0yyHDgG+GFP8/HAoe2xBnh/G7sfcAbwZOAI4Iwk+w5SuCRpdk0ZCFX1JeCWCbrOBF4DVE/bKuCC6lwJLElyEHAssK6qbqmqW4F1TBAykqT5M6N7CElWAVuq6uvjupYCm3rmN7e2ydonWvaaJOuTrB8dHZ1JeZKkGZh2ICR5IPB64I2zXw5U1dlVtbKqVo6MjOyMVUiSJjCTM4SHA4cAX0/yfWAZ8NUkDwG2AMt7xi5rbZO1S5KGxLQDoaq+WVUPrqoVVbWC7vLP4VV1I7AWOLl92+hI4Paq2gp8Djgmyb7tZvIxrU2SNCT6+drpRcCXgUcl2Zzk1B0Mvwy4AdgI/CPwUoCqugV4C3B1e7y5tUmShsSU/x9CVb1oiv4VPdMFnDbJuHOBc6dZnyRpjvibypIkwECQJDUGgiQJMBAkSY2BIEkCDARJUmMgSJIAA0GS1BgIkiTAQJAkNQaCJAkwECRJjYEgSQIMBElSYyBIkgADQZLUGAiSJMBAkCQ1BoIkCegjEJKcm2Rbkm/1tP1tkm8n+UaSTyRZ0tP3uiQbk3wnybE97ce1to1JTp/1LZEkDaSfM4TzgOPGta0DHldVjwe+C7wOIMlhwEnAY9tz/j7JoiSLgPcBxwOHAS9qYyVJQ2LKQKiqLwG3jGv7fFVtb7NXAsva9CrgI1X1i6r6HrAROKI9NlbVDVX1S+AjbawkaUjMxj2EPwI+26aXApt6+ja3tsna7yPJmiTrk6wfHR2dhfIkSf0YKBCSvAHYDnx4dsqBqjq7qlZW1cqRkZHZWqwkaQqLZ/rEJKcAzwGOrqpqzVuA5T3DlrU2dtAuSRoCMzpDSHIc8BrgeVV1V0/XWuCkJLslOQQ4FPgKcDVwaJJDkuxKd+N57WClS5Jm05RnCEkuAo4CDkiyGTiD7ltFuwHrkgBcWVV/UlXXJrkYuI7uUtJpVfXrtpyXAZ8DFgHnVtW1O2F7JEkzNGUgVNWLJmg+Zwfj3wa8bYL2y4DLplWdJGnO+JvKkiTAQJAkNQaCJAkwECRJjYEgSQIMBElSYyBIkgADQZLUGAiSJMBAkCQ1BoIkCTAQJEmNgSBJAgwESVJjIEiSAANBktQYCJIkwECQJDUGgiQJ6CMQkpybZFuSb/W07ZdkXZLr27/7tvYkOSvJxiTfSHJ4z3NWt/HXJ1m9czZHkjRT/ZwhnAccN67tdODyqjoUuLzNAxwPHNoea4D3QxcgwBnAk4EjgDPGQkSSNBymDISq+hJwy7jmVcD5bfp84MSe9guqcyWwJMlBwLHAuqq6papuBdZx35CRJM2jmd5DOLCqtrbpG4ED2/RSYFPPuM2tbbL2+0iyJsn6JOtHR0dnWJ4kaboGvqlcVQXULNQytryzq2plVa0cGRmZrcVKkqYw00C4qV0Kov27rbVvAZb3jFvW2iZrlyQNiZkGwlpg7JtCq4FLe9pPbt82OhK4vV1a+hxwTJJ9283kY1qbJGlILJ5qQJKLgKOAA5Jspvu20NuBi5OcCvwAeGEbfhlwArARuAt4MUBV3ZLkLcDVbdybq2r8jWpJ0jyaMhCq6kWTdB09wdgCTptkOecC506rOknSnPE3lSVJgIEgSWoMBEkSYCBIkhoDQZIEGAiSpMZAkCQBBoIkqTEQJEmAgSBJagwESRJgIEiSGgNBkgQYCJKkxkCQJAEGgiSpMRAkSYCBIElqDARJEjBgICT58yTXJvlWkouS7J7kkCRXJdmY5KNJdm1jd2vzG1v/ilnZAknSrJhxICRZCrwcWFlVjwMWAScB7wDOrKpHALcCp7annArc2trPbOMkSUNi0EtGi4E9kiwGHghsBZ4JXNL6zwdObNOr2jyt/+gkGXD9kqRZMuNAqKotwDuBH9IFwe3ANcBtVbW9DdsMLG3TS4FN7bnb2/j9xy83yZok65OsHx0dnWl5kqRpGuSS0b50n/oPAR4K7AkcN2hBVXV2Va2sqpUjIyODLk6S1KdBLhn9HvC9qhqtql8BHweeCixpl5AAlgFb2vQWYDlA698HuHmA9UuSZtEggfBD4MgkD2z3Ao4GrgO+CDy/jVkNXNqm17Z5Wv8XqqoGWL8kaRYNcg/hKrqbw18FvtmWdTbwWuBVSTbS3SM4pz3lHGD/1v4q4PQB6pYkzbLFUw+ZXFWdAZwxrvkG4IgJxv4ceMEg65Mk7Tz+prIkCTAQJEmNgSBJAgwESVJjIEiSAANBktQYCJIkwECQJDUGgiQJMBAkSY2BIEkCDARJUmMgSJIAA0GS1BgIkiTAQJAkNQaCJAkwECRJjYEgSQIGDIQkS5JckuTbSTYkeUqS/ZKsS3J9+3ffNjZJzkqyMck3khw+O5sgSZoNg54hvAf4p6p6NPAEYANwOnB5VR0KXN7mAY4HDm2PNcD7B1y3JGkWzTgQkuwDPAM4B6CqfllVtwGrgPPbsPOBE9v0KuCC6lwJLEly0EzXL0maXYOcIRwCjAL/M8m/Jvlgkj2BA6tqaxtzI3Bgm14KbOp5/ubWdi9J1iRZn2T96OjoAOVJkqZjkEBYDBwOvL+qngj8lHsuDwFQVQXUdBZaVWdX1cqqWjkyMjJAeZKk6RgkEDYDm6vqqjZ/CV1A3DR2Kaj9u631bwGW9zx/WWuTJA2BGQdCVd0IbEryqNZ0NHAdsBZY3dpWA5e26bXAye3bRkcCt/dcWpIkzbPFAz7/z4APJ9kVuAF4MV3IXJzkVOAHwAvb2MuAE4CNwF1trCRpSAwUCFX1NWDlBF1HTzC2gNMGWZ8kaefxN5UlSYCBIElqDARJEmAgSJIaA0GSBBgIkqTGQJAkAQaCJKkxECRJgIEgSWoMBEkSYCBIkhoDQZIEGAiSpMZAkCQBBoIkqTEQJEmAgSBJagwESRIwC4GQZFGSf03y6TZ/SJKrkmxM8tEku7b23dr8xta/YtB1S5Jmz2ycIbwC2NAz/w7gzKp6BHArcGprPxW4tbWf2cZJkobEQIGQZBnwbOCDbT7AM4FL2pDzgRPb9Ko2T+s/uo2XJA2BQc8Q/g54DfCbNr8/cFtVbW/zm4GlbXopsAmg9d/ext9LkjVJ1idZPzo6OmB5kqR+zTgQkjwH2FZV18xiPVTV2VW1sqpWjoyMzOaiJUk7sHiA5z4VeF6SE4Ddgb2B9wBLkixuZwHLgC1t/BZgObA5yWJgH+DmAdYvSZpFMz5DqKrXVdWyqloBnAR8oap+H/gi8Pw2bDVwaZte2+Zp/V+oqprp+iVJs2tn/B7Ca4FXJdlId4/gnNZ+DrB/a38VcPpOWLckaYYGuWR0t6q6AriiTd8AHDHBmJ8DL5iN9UmSZp+/qSxJAgwESVJjIEiSAANBktQYCJIkwECQJDUGgiQJMBAkSY2BIEkCDARJUmMgSJIAA0GS1BgIkiTAQJAkNQaCJAkwECRJjYEgSQIMBElSYyBIkoABAiHJ8iRfTHJdkmuTvKK175dkXZLr27/7tvYkOSvJxiTfSHL4bG2EJGlwg5whbAdeXVWHAUcCpyU5DDgduLyqD
gUub/MAxwOHtsca4P0DrFuSNMtmHAhVtbWqvtqm7wA2AEuBVcD5bdj5wIltehVwQXWuBJYkOWim65ckza5ZuYeQZAXwROAq4MCq2tq6bgQObNNLgU09T9vc2sYva02S9UnWj46OzkZ5kqQ+DBwISR4EfAx4ZVX9pLevqgqo6Syvqs6uqpVVtXJkZGTQ8iRJfRooEJLsQhcGH66qj7fmm8YuBbV/t7X2LcDynqcva22SpCEwyLeMApwDbKiqd/d0rQVWt+nVwKU97Se3bxsdCdzec2lJkjTPFg/w3KcCfwh8M8nXWtvrgbcDFyc5FfgB8MLWdxlwArARuAt48QDrliTNshkHQlX9C5BJuo+eYHwBp810fZKkncvfVJYkAQaCJKkxECRJgIEgSWoMBEkSYCBIkhoDQZIEGAiSpMZAkCQBBoIkqTEQJEmAgSBJagwESRJgIEiSGgNBkgQYCJKkxkCQJAEGgiSpMRAkScA8BEKS45J8J8nGJKfP9folSRNbPJcrS7IIeB/wLGAzcHWStVV13c5e94rTP3P39Pff/uydvTpJWnDm+gzhCGBjVd1QVb8EPgKsmuMaJEkTmNMzBGApsKlnfjPw5N4BSdYAa9rsnUm+M811HAD8eEcD8o5pLnFuTFn3kLLuuWXdc2ch1gxd3QfP5IlzHQhTqqqzgbNn+vwk66tq5SyWNCese25Z99xaiHUvxJrh7rpXzOS5c33JaAuwvGd+WWuTJM2zuQ6Eq4FDkxySZFfgJGDtHNcgSZrAnF4yqqrtSV4GfA5YBJxbVdfO8mpmfLlpnln33LLuubUQ616INcMgl9yrajYLkSQtUP6msiQJMBAkSc2CDYSp/gRGkt2SfLT1X5VkxTyUeR991H1KktEkX2uPl8xHneNqOjfJtiTfmqQ/Sc5q2/SNJIfPdY0T6aPuo5Lc3rOv3zjXNU4kyfIkX0xyXZJrk7xigjFDtc/7rHno9neS3ZN8JcnXW91/NcGYoTuW9Fn39I8lVbXgHnQ3pP8N+C1gV+DrwGHjxrwU+ECbPgn46AKp+xTgvfNd67iangEcDnxrkv4TgM8CAY4Erprvmvus+yjg0/Nd5wR1HQQc3qb3Ar47wetkqPZ5nzUP3f5u++9BbXoX4CrgyHFjhvFY0k/d0z6WLNQzhH7+BMYq4Pw2fQlwdJLMYY0TWZB/uqOqvgTcsoMhq4ALqnMlsCTJQXNT3eT6qHsoVdXWqvpqm74D2ED3W/69hmqf91nz0Gn77842u0t7jP+mzdAdS/qse9oWaiBM9Ccwxr/47h5TVduB24H956S6yfVTN8B/bpcBLkmyfIL+YdPvdg2jp7TT7s8meex8FzNeuzzxRLpPgL2Gdp/voGYYwv2dZFGSrwHbgHVVNem+HqJjST91wzSPJQs1EO7PPgWsqKrHA+u455OJZt9XgYOr6gnA/wA+Ob/l3FuSBwEfA15ZVT+Z73r6MUXNQ7m/q+rXVfU7dH854Ygkj5vnkvrSR93TPpYs1EDo509g3D0myWJgH+DmOaluclPWXVU3V9Uv2uwHgSfNUW2DWJB/kqSqfjJ22l1VlwG7JDlgnssCIMkudAfWD1fVxycYMnT7fKqah3l/A1TVbcAXgePGdQ3jseRuk9U9k2PJQg2Efv4ExlpgdZt+PvCFanda5tGUdY+7Dvw8umuxw24tcHL75suRwO1VtXW+i5pKkoeMXQtOcgTd+2He3+itpnOADVX17kmGDdU+76fmYdzfSUaSLGnTe9D9Xy3fHjds6I4l/dQ9k2PJ0P21037UJH8CI8mbgfVVtZbuxXlhko10NxZPmr+KO33W/fIkzwO209V9yrwV3CS5iO4bIgck2QycQXcTi6r6AHAZ3bdeNgJ3AS+en0rvrY+6nw/8aZLtwM+Ak+b7jd48FfhD4JvtGjHA64GHwdDu835qHsb9fRBwfrr/vOsBwMVV9elhP5bQX93TPpb4pyskScDCvWQkSZplBoIkCTAQJEmNgSBJAgwESVJjIEiSAANBktT8f6IWEKx0414SAAAAAElFTkSuQmCC\n", 983 | "text/plain": [ 984 | "
" 985 | ] 986 | }, 987 | "metadata": { 988 | "needs_background": "light" 989 | }, 990 | "output_type": "display_data" 991 | }, 992 | { 993 | "name": "stdout", 994 | "output_type": "stream", 995 | "text": [ 996 | "==== Default HuggingFace model on inf1 benchmark ====\n", 997 | "\n", 998 | "95 % of requests take less than 51.53084993362427 ms\n", 999 | "Rough request throughput/second is 115.47261022471368\n" 1000 | ] 1001 | } 1002 | ], 1003 | "source": [ 1004 | "# Display results \n", 1005 | "from matplotlib.pyplot import hist, title, show, xlim\n", 1006 | "import numpy as np\n", 1007 | "\n", 1008 | "TPS_inf1 = (num_preds * num_threads) / end\n", 1009 | "\n", 1010 | "t_inf1 = [duration for thread__id, duration in times]\n", 1011 | "latency_percentiles = np.percentile(t_inf1, q=[50, 90, 95, 99])\n", 1012 | "latency_inf1 = latency_percentiles[2]*1000\n", 1013 | "\n", 1014 | "hist(t_inf1, bins=100)\n", 1015 | "title(\"Request latency histogram on Inferentia\")\n", 1016 | "show()\n", 1017 | "\n", 1018 | "print(\"==== Default HuggingFace model on inf1 benchmark ====\\n\")\n", 1019 | "print(f\"95 % of requests take less than {latency_inf1} ms\")\n", 1020 | "print(f\"Rough request throughput/second is {TPS_inf1}\")\n", 1021 | "\n", 1022 | "\n" 1023 | ] 1024 | }, 1025 | { 1026 | "cell_type": "markdown", 1027 | "metadata": {}, 1028 | "source": [ 1029 | "We can see that request latency is in the 0.02-0.05 millisecond range, and throughput is ~157 TPS." 1030 | ] 1031 | }, 1032 | { 1033 | "cell_type": "markdown", 1034 | "metadata": {}, 1035 | "source": [ 1036 | "---" 1037 | ] 1038 | }, 1039 | { 1040 | "cell_type": "markdown", 1041 | "metadata": {}, 1042 | "source": [ 1043 | "# 6. Conclusion " 1044 | ] 1045 | }, 1046 | { 1047 | "cell_type": "code", 1048 | "execution_count": 23, 1049 | "metadata": {}, 1050 | "outputs": [ 1051 | { 1052 | "name": "stdout", 1053 | "output_type": "stream", 1054 | "text": [ 1055 | "Using inf1 instances latency dropped to a 51.53 millisecond range from 170.17 ms on a GPU endpoint.\n", 1056 | "Also, The average throughput increased to 115.47 TPS from 29.59 TPS on the GPU.\n" 1057 | ] 1058 | } 1059 | ], 1060 | "source": [ 1061 | "print(\"Using inf1 instances latency dropped to a {:.2f} millisecond range from {:.2f} ms on a GPU endpoint.\".format(latency_inf1, latency_GPU)) \n", 1062 | "print(\"Also, The average throughput increased to {:.2f} TPS from {:.2f} TPS on the GPU.\".format( TPS_inf1, TPS_GPU) )" 1063 | ] 1064 | }, 1065 | { 1066 | "cell_type": "markdown", 1067 | "metadata": {}, 1068 | "source": [ 1069 | "This increase in performance obtained from using inf1 instances, paired with the cost reduction and the use of known SageMaker SDK APIs, enables new benefits with little development effort and a gentle learning curve. " 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "markdown", 1074 | "metadata": {}, 1075 | "source": [ 1076 | "* To learn more about how to deploy Hugging Face modes through Sagemaker on to Inf1, please watch their latest [Webinar](https://www.youtube.com/watch?v=3fulTyMXhWQ), and read their latest [blog post](https://huggingface.co/blog/bert-inferentia-sagemaker). 
\n", 1077 | "* For more information about Inferentia, please see the AWS EC2 Inf1 [website](https://aws.amazon.com/ec2/instance-types/inf1/) or check out other Tutorials available online [here] (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-intro/tutorials.html).\n", 1078 | "* You can learn more about Inferentia performance on the [Neuron Inference Performance](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/benchmark/index.html) pages\n" 1079 | ] 1080 | }, 1081 | { 1082 | "cell_type": "markdown", 1083 | "metadata": {}, 1084 | "source": [ 1085 | "---" 1086 | ] 1087 | }, 1088 | { 1089 | "cell_type": "markdown", 1090 | "metadata": {}, 1091 | "source": [ 1092 | "# 7. Clean up \n", 1093 | "Delete the models and release the endpoints." 1094 | ] 1095 | }, 1096 | { 1097 | "cell_type": "code", 1098 | "execution_count": null, 1099 | "metadata": {}, 1100 | "outputs": [], 1101 | "source": [ 1102 | "normal_predictor.delete_model()\n", 1103 | "normal_predictor.delete_endpoint()\n", 1104 | "compiled_inf1_predictor.delete_model()\n", 1105 | "compiled_inf1_predictor.delete_endpoint()" 1106 | ] 1107 | } 1108 | ], 1109 | "metadata": { 1110 | "instance_type": "ml.m5.large", 1111 | "kernelspec": { 1112 | "display_name": "Python 3 (PyTorch 1.8 Python 3.6 CPU Optimized)", 1113 | "language": "python", 1114 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/1.8.1-cpu-py36" 1115 | }, 1116 | "language_info": { 1117 | "codemirror_mode": { 1118 | "name": "ipython", 1119 | "version": 3 1120 | }, 1121 | "file_extension": ".py", 1122 | "mimetype": "text/x-python", 1123 | "name": "python", 1124 | "nbconvert_exporter": "python", 1125 | "pygments_lexer": "ipython3", 1126 | "version": "3.6.13" 1127 | } 1128 | }, 1129 | "nbformat": 4, 1130 | "nbformat_minor": 5 1131 | } 1132 | -------------------------------------------------------------------------------- /code/inference.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import torch 4 | from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig 5 | 6 | JSON_CONTENT_TYPE = 'application/json' 7 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 8 | 9 | def model_fn(model_dir): 10 | tokenizer_init = AutoTokenizer.from_pretrained('Prompsit/paraphrase-bert-en') 11 | compiled_model = os.path.exists(f'{model_dir}/model.pth') 12 | if compiled_model: 13 | import torch_neuron 14 | os.environ["NEURONCORE_GROUP_SIZES"] = "1" 15 | model = torch.jit.load(f'{model_dir}/model.pth') 16 | else: 17 | model = AutoModelForSequenceClassification.from_pretrained(model_dir).to(device) 18 | 19 | return (model, tokenizer_init) 20 | 21 | 22 | def input_fn(serialized_input_data, content_type=JSON_CONTENT_TYPE): 23 | if content_type == JSON_CONTENT_TYPE: 24 | input_data = json.loads(serialized_input_data) 25 | return input_data 26 | else: 27 | raise Exception('Requested unsupported ContentType in Accept: ' + content_type) 28 | return 29 | 30 | 31 | def predict_fn(input_data, models): 32 | 33 | model_bert, tokenizer = models 34 | sequence_0 = input_data[0] 35 | sequence_1 = input_data[1] 36 | 37 | max_length = 512 38 | tokenized_sequence_pair = tokenizer.encode_plus(sequence_0, 39 | sequence_1, 40 | max_length=max_length, 41 | padding='max_length', 42 | truncation=True, 43 | return_tensors='pt').to(device) 44 | 45 | # Convert example inputs to a format that is compatible with TorchScript tracing 46 | example_inputs 
= tokenized_sequence_pair['input_ids'], tokenized_sequence_pair['attention_mask'] 47 | 48 | with torch.no_grad(): 49 | paraphrase_classification_logits = model_bert(*example_inputs) 50 | 51 | classes = ['paraphrase','not paraphrase'] 52 | paraphrase_prediction = paraphrase_classification_logits[0][0].argmax().item() 53 | out_str = 'BERT predicts that "{}" and "{}" are {}'.format(sequence_0, sequence_1, classes[paraphrase_prediction]) 54 | 55 | return out_str 56 | 57 | 58 | def output_fn(prediction_output, accept=JSON_CONTENT_TYPE): 59 | if accept == JSON_CONTENT_TYPE: 60 | return json.dumps(prediction_output), accept 61 | 62 | raise Exception('Requested unsupported ContentType in Accept: ' + accept) 63 | 64 | -------------------------------------------------------------------------------- /code/requirements.txt: -------------------------------------------------------------------------------- 1 | transformers==4.15.0 2 | -------------------------------------------------------------------------------- /images/accessevent.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/accessevent.png -------------------------------------------------------------------------------- /images/accesssm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/accesssm.png -------------------------------------------------------------------------------- /images/accessstudio.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/accessstudio.png -------------------------------------------------------------------------------- /images/email.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/email.png -------------------------------------------------------------------------------- /images/kernel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/kernel.png -------------------------------------------------------------------------------- /images/kernelselect.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/kernelselect.png -------------------------------------------------------------------------------- /images/launch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/launch.png -------------------------------------------------------------------------------- /images/login.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/login.png 
-------------------------------------------------------------------------------- /images/menuleft.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/menuleft.png -------------------------------------------------------------------------------- /images/openconsole.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/openconsole.png -------------------------------------------------------------------------------- /images/opennb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/opennb.png -------------------------------------------------------------------------------- /images/otp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/otp.png -------------------------------------------------------------------------------- /images/signin.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/signin.png -------------------------------------------------------------------------------- /images/teamdashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/teamdashboard.png -------------------------------------------------------------------------------- /images/terminal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/aws-inferentia-huggingface-workshop/a91b574b15d209df91aaf3e72094a0ee034ed4a6/images/terminal.png --------------------------------------------------------------------------------