├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── TUTORIALS.md ├── blogs └── 01_LLama3-8B_Inferentia_EKS_vLLM │ ├── Dockerfile │ ├── Readme.md │ ├── cluster-config.yaml │ ├── deployment.yaml │ └── nodegroup-config.yaml ├── tutorials ├── 01_EmbeddingsFromTextWithBert │ ├── 01_TextFeatureExtractionForSimilarity.ipynb │ ├── code │ │ └── inference.py │ └── frank_chap01.txt ├── 02_ObjectTrackingSageMakerGStreamer │ ├── 01_Yolov7SageMakerInferentia.ipynb │ ├── 02_CVPipeline.ipynb │ ├── README.md │ ├── code_01 │ │ └── inference.py │ ├── code_02 │ │ └── pipeline.py │ ├── container_01 │ │ └── Dockerfile │ ├── container_02 │ │ └── Dockerfile │ └── libs │ │ ├── cvpipeline.py │ │ ├── smcvpipeline.py │ │ └── tracker.py ├── 03_QuestionAnsweringMachine │ ├── 01_QuestionAnsweringWithT5SSM.ipynb │ ├── src │ │ └── question_answering.py │ └── train.csv.gz ├── 04_ImageGenerationWithStableDiffusion │ ├── SDInf2HFOptimumNeuron.ipynb │ └── SDOnInf2AndHFOptimumNeuron_SMSSH.ipynb ├── 05_FastQuestionAnsweringWithBertQA │ └── BertQAInferentia1.ipynb ├── 06_FinetuneLLMs │ ├── 01_Finetune_LLMs.ipynb │ └── 02_Deploy_Llama2-7B.ipynb ├── 07_DeployToInferentiaWithTGI │ └── inf2-tgi-demo.ipynb └── 08_TextClassificationWithNaturalLanguageInference │ └── NLI_with_BART_inf2.ipynb └── workshops ├── 01_FineTuneSpamClassifier ├── README.md ├── docs │ ├── imgs │ │ └── 01_activities.png │ └── optimum_neuron_models.md └── notebooks │ ├── 01_DatasetPreparation.ipynb │ ├── 02_ModelFineTuning.ipynb │ ├── 03_ModelInference.ipynb │ ├── 03_ModelInferenceInf1.ipynb │ ├── requirements.txt │ └── src │ ├── compile.py │ ├── dump_model_table.py │ ├── requirements.txt │ └── train.py ├── 02_DomainAdaptation ├── README.md ├── docs │ └── imgs │ │ ├── 6-orpo-curve.png │ │ ├── 6-orpo-intro.png │ │ └── model_alignment_techniques.png └── notebooks │ ├── 01_ModelAdaptationWithOrpo.ipynb │ └── 02_DeployModel.ipynb └── 03_NKIWorkshop ├── README.md └── notebooks ├── 0-setup.ipynb ├── 1-intergrate-prebuild-kernel.ipynb ├── 2-custom-operators.ipynb └── 3-neuron-profile.ipynb /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. 
Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT No Attribution 2 | 3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so. 10 | 11 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 12 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 13 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR 14 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 15 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 16 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 17 | 18 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # How to reduce costs and improve performance of your Machine Learning (ML) workloads? 2 | 3 | In this repo you'll learn how to use [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/) and [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) with [Amazon SageMaker](https://aws.amazon.com/sagemaker/) and [Hugging Face Optimum Neuron](https://huggingface.co/docs/optimum-neuron/index) to optimize your ML workloads! Here you'll find workshops, tutorials, blog post content, and more that you can use to learn from and to inspire your own solutions. A minimal example of what model compilation with Optimum Neuron looks like is included at the end of this README. 4 | 5 | 6 | The content you find here is focused on particular use cases. If you're looking for standalone model samples for inference and training, please check this other repo: https://github.com/aws-neuron/aws-neuron-samples. 7 | 8 | ### Workshops 9 | 10 | |Title|| 11 | |:-|:-| 12 | |[Fine-tune and deploy LLM from Hugging Face on AWS Trainium and AWS Inferentia](workshops/01_FineTuneSpamClassifier)|Learn how to create a spam classifier that can be easily integrated into your own application| 13 | |[Adapting LLMs for domain-aware applications with AWS Trainium post-training](workshops/02_DomainAdaptation)|Learn how to adapt a pre-trained model to your own business needs and add a conversational interface that your customers can interact with| 14 | |[Building Custom Accelerator Kernels with AWS Neuron Kernel Interface (NKI)](workshops/03_NKIWorkshop)|Learn how to use the Neuron Kernel Interface (NKI) to write kernels for Neuron accelerators| 15 | 16 | 17 | These workshops are supported by **AWS Workshop Studio**. 18 | 19 | ### Tutorials 20 | 21 | |Description| 22 | |:-| 23 | |[inf1 - Extract embeddings from raw text](tutorials/01_EmbeddingsFromTextWithBert)| 24 | |[inf1 - Track objects in video streaming using CV](tutorials/02_ObjectTrackingSageMakerGStreamer)| 25 | |[inf1 - Create a closed question Q&A model](tutorials/03_QuestionAnsweringMachine)| 26 | |[inf2 - Generate images using Stable Diffusion](tutorials/04_ImageGenerationWithStableDiffusion)| 27 | |[inf1 - Answer questions given a context](tutorials/05_FastQuestionAnsweringWithBertQA)| 28 | |[trn1 - Fine-tune an LLM using distributed training](tutorials/06_FinetuneLLMs)| 29 | |[inf2 - Deploy an LLM to HF TGI](tutorials/07_DeployToInferentiaWithTGI)| 30 | |[inf2 - Porting BART for Multi-Genre Natural Language Inference](tutorials/08_TextClassificationWithNaturalLanguageInference)| 31 | 32 | ### Blog post content 33 | |Description| 34 | |:-| 35 | |[Llama3-8B Deployment on AWS Inferentia 2 with Amazon EKS and vLLM](blogs/01_LLama3-8B_Inferentia_EKS_vLLM/)| 36 | 37 | ## Contributing 38 | If you have questions, comments, or suggestions, please feel free to open an issue in this repo. 39 | 40 | Also, please refer to the [CONTRIBUTING](CONTRIBUTING.md) document for further details on contributing to this repository. 
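## Appendix: compiling a model with Optimum Neuron

The workshops and tutorials above use [Hugging Face Optimum Neuron](https://huggingface.co/docs/optimum-neuron/index) to compile models for AWS Trainium and AWS Inferentia. The snippet below is a minimal sketch of that workflow, assuming you are on an inf2/trn1 instance with the Neuron SDK and `optimum-neuron` installed; the model ID is only an example and the exact export arguments can vary between Optimum Neuron versions.

```python
# Minimal sketch: export a Hugging Face model to a Neuron-compiled artifact and run it.
# Assumes an Inferentia2/Trainium instance with the Neuron SDK and optimum-neuron installed.
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example model, not from this repo

# export=True compiles the model for the fixed input shapes given below
model = NeuronModelForSequenceClassification.from_pretrained(
    model_id, export=True, batch_size=1, sequence_length=128
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer(
    "AWS Inferentia reduces inference costs.",
    return_tensors="pt", padding="max_length", max_length=128, truncation=True,
)
print(model(**inputs).logits)

# Save the compiled artifacts so the model can be reloaded without recompiling
model.save_pretrained("./neuron-model")
```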
41 | -------------------------------------------------------------------------------- /TUTORIALS.md: -------------------------------------------------------------------------------- 1 | # Applied AI/ML Specialized Hardware 2 | 3 | Specialized hardware refers to ML (Machine Learning) model accelerators for inference or training, such as [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/), [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/), [SIMD acceleration in CPUs](https://en.wikipedia.org/wiki/SIMD) and GPUs. In this repo you'll find reference implementations of different use cases (applications) for Computer Vision, Natural Language Processing, etc. that make use of hardware acceleration to reduce model execution latency and increase throughput. 4 | 5 | Each **use case** is phrased as a question that is answered by the reference implementation linked to it. 6 | 7 | If you're looking for technical samples that show how to run specific models on Trainium (trn1) and Inferentia (inf1 & inf2), go to [AWS Neuron Samples](https://github.com/aws-neuron/aws-neuron-samples). 8 | 9 | ## Tutorials/Reference implementations 10 | |Use Case|Description| 11 | |-|-| 12 | |[How to track people in video files?](tutorials/02_ObjectTrackingSageMakerGStreamer/)|CV/ML pipeline to process video files in batch with SageMaker+Inferentia, GStreamer and Yolov7+ByteTrack| 13 | |[How to measure the similarity between two sentences?](tutorials/01_EmbeddingsFromTextWithBert/)|Compute the semantic similarity of two or more sentences by extracting their embeddings with SageMaker+Inferentia and a Hugging Face BERT (cased) model| 14 | |[How to create a mechanism to answer questions from a FAQ?](tutorials/03_QuestionAnsweringMachine/)|Fine-tune a T5-SSM model (on SageMaker & Trainium) to build a Q&A mechanism, more powerful than a classic chatbot, that answers FAQ questions sent by your customers| 15 | |[How to generate images based on a text input?](tutorials/04_ImageGenerationWithStableDiffusion/)|Deploy an SDXL model to Inferentia 2 + SageMaker using HF Optimum Neuron| 16 | |[How to create a really fast question answering mechanism?](tutorials/05_FastQuestionAnsweringWithBertQA/)|Deploy a BertQA model to Inferentia 1 and SageMaker to build a fast and cheap Q&A mechanism| 17 | |[How to classify pieces of text via Natural Language Inference?](tutorials/08_TextClassificationWithNaturalLanguageInference)|Classify texts on custom-selected topics with BART and inf2 instances| 18 | 19 | ## Contributing 20 | If you have a question about a business challenge that could be answered by an accelerated AI/ML solution, like the content in this repo, you can contribute. Either open an issue with your question or, if you have the skills, implement a solution (tutorial, workshop, etc.) using Jupyter notebooks (for SageMaker Studio or Notebook Instances) and create a pull request. We appreciate your help. 21 | 22 | Please refer to the [CONTRIBUTING](CONTRIBUTING.md) document for further details on contributing to this repository. 
23 | -------------------------------------------------------------------------------- /blogs/01_LLama3-8B_Inferentia_EKS_vLLM/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.19.1-ubuntu20.04 2 | 3 | # Clone the vllm repository 4 | RUN git clone https://github.com/vllm-project/vllm.git 5 | 6 | # Set the working directory 7 | WORKDIR /vllm 8 | RUN git checkout v0.5.0 9 | 10 | # Set the environment variable 11 | ENV VLLM_TARGET_DEVICE=neuron 12 | 13 | # Install the dependencies 14 | RUN pip install -U -r requirements-neuron.txt 15 | RUN pip install -e . 16 | 17 | # Modify the arg_utils.py file 18 | RUN sed -i "/parser.add_argument('--block-size',/ {N;N;N;N;N;s/\[8, 16, 32\]/[8, 16, 32, 128, 256, 512, 1024, 2048, 4096, 8192]/}" vllm/engine/arg_utils.py 19 | 20 | # Install ray 21 | RUN pip install ray 22 | RUN pip install pynvml 23 | 24 | # Set the entry point 25 | ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"] -------------------------------------------------------------------------------- /blogs/01_LLama3-8B_Inferentia_EKS_vLLM/Readme.md: -------------------------------------------------------------------------------- 1 | # Llama3-8B Deployment on AWS Inferentia 2 with Amazon EKS and vLLM 2 | 3 | This repository contains the necessary files and configurations to deploy the Llama3-8B model on AWS Inferentia 2 instances using Amazon EKS (Elastic Kubernetes Service) and vLLM. 4 | 5 | ## Files in this Directory 6 | 7 | 1. `Dockerfile`: Defines the container image for running vLLM with Llama3-8B on Inferentia 2. 8 | 9 | 2. `cluster-config.yaml`: Configuration file for creating the Amazon EKS cluster. 10 | 11 | 3. `deployment.yaml`: Kubernetes deployment configuration for the Llama3-8B model. 12 | 13 | 4. `nodegroup-config.yaml`: Configuration for the Inferentia 2 node group in the EKS cluster. 14 | 15 | ## Overview 16 | 17 | This project demonstrates how to: 18 | 19 | - Set up an Amazon EKS cluster 20 | - Configure Inferentia 2 node groups 21 | - Build and push a custom Docker image for vLLM 22 | - Deploy Llama3-8B model using vLLM on Inferentia 2 instances 23 | - Configure Kubernetes probes for health checking 24 | - Scale the deployment 25 | 26 | ## Prerequisites 27 | 28 | - AWS CLI 29 | - eksctl 30 | - kubectl 31 | - docker 32 | 33 | ## Getting Started 34 | 35 | 1. Create the EKS cluster using the `cluster-config.yaml` file. 36 | 2. Set up the Inferentia 2 node group using the `nodegroup-config.yaml` file. 37 | 3. Build and push the Docker image using the provided `Dockerfile`. 38 | 4. Deploy the Llama3-8B model using the `deployment.yaml` file. 39 | 40 | ## Important Notes 41 | 42 | - The deployment uses 8 Neuron cores per replica for optimal performance. 43 | - Initial startup time for model compilation is around 25 minutes. 44 | - Proper monitoring and scaling strategies are crucial for production use. 45 | 46 | For detailed instructions and explanations, please refer to the accompanying blog post. 
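## Quick test of the deployed endpoint

The snippet below is a minimal sketch (not part of the original blog assets) showing one way to smoke-test the OpenAI-compatible API that vLLM exposes once the pods pass their health checks. It assumes you forward port 8000 from the deployment to your machine, for example with `kubectl port-forward deployment/neuronx-vllm-deployment 8000:8000`, and that the Python `requests` package is installed; adjust the base URL if you expose the service differently (e.g. via a Kubernetes Service or load balancer).

```python
import requests

BASE_URL = "http://localhost:8000"  # assumes kubectl port-forward to the vLLM deployment

# The Kubernetes probes target /health; confirm the server is healthy before sending requests.
requests.get(f"{BASE_URL}/health", timeout=5).raise_for_status()

payload = {
    "model": "meta-llama/Meta-Llama-3-8B",  # must match the --model argument in deployment.yaml
    "prompt": "AWS Inferentia2 accelerates",
    "max_tokens": 64,
    "temperature": 0.8,
}
response = requests.post(f"{BASE_URL}/v1/completions", json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```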
47 | 48 | ## Authors 49 | 50 | - Dmitri Laptev - Senior GenAI Solutions Architect at AWS 51 | - Maurits de Groot - Solutions Architect at AWS 52 | - Ziwen Ning - Software Development Engineer at AWS 53 | - Jianying Lang - Principal Solutions Architect at AWS Worldwide Specialist Organization (WWSO) 54 | -------------------------------------------------------------------------------- /blogs/01_LLama3-8B_Inferentia_EKS_vLLM/cluster-config.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: eksctl.io/v1alpha5 2 | kind: ClusterConfig 3 | 4 | metadata: 5 | name: neuron-cluster 6 | region: us-east-1 7 | version: "1.30" 8 | 9 | addons: 10 | - name: vpc-cni 11 | version: latest 12 | 13 | cloudWatch: 14 | clusterLogging: 15 | enableTypes: ["*"] 16 | 17 | iam: 18 | withOIDC: true 19 | -------------------------------------------------------------------------------- /blogs/01_LLama3-8B_Inferentia_EKS_vLLM/deployment.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: neuronx-vllm-deployment 5 | labels: 6 | app: neuronx-vllm 7 | spec: 8 | replicas: 3 9 | selector: 10 | matchLabels: 11 | app: neuronx-vllm 12 | template: 13 | metadata: 14 | labels: 15 | app: neuronx-vllm 16 | spec: 17 | schedulerName: my-scheduler 18 | containers: 19 | - name: neuronx-vllm 20 | image: .dkr.ecr.us-east-1.amazonaws.com/vllm-neuron:latest 21 | resources: 22 | limits: 23 | cpu: 32 24 | memory: "64G" 25 | aws.amazon.com/neuroncore: "8" 26 | requests: 27 | cpu: 32 28 | memory: "64G" 29 | aws.amazon.com/neuroncore: "8" 30 | ports: 31 | - containerPort: 8000 32 | env: 33 | - name: HF_TOKEN 34 | value: 35 | - name: FI_EFA_FORK_SAFE 36 | value: "1" 37 | args: 38 | - "--model" 39 | - "meta-llama/Meta-Llama-3-8B" 40 | - "--tensor-parallel-size" 41 | - "8" 42 | - "--max-num-seqs" 43 | - "8" 44 | - "--max-model-len" 45 | - "8192" 46 | - "--block-size" 47 | - "8192" 48 | readinessProbe: 49 | httpGet: 50 | path: /health 51 | port: 8000 52 | initialDelaySeconds: 1800 53 | periodSeconds: 10 54 | timeoutSeconds: 5 55 | failureThreshold: 5 56 | 57 | livenessProbe: 58 | httpGet: 59 | path: /health 60 | port: 8000 61 | initialDelaySeconds: 1800 62 | periodSeconds: 15 63 | timeoutSeconds: 5 64 | failureThreshold: 5 65 | 66 | startupProbe: 67 | httpGet: 68 | path: /health 69 | port: 8000 70 | initialDelaySeconds: 1800 71 | periodSeconds: 10 72 | timeoutSeconds: 5 73 | failureThreshold: 180 -------------------------------------------------------------------------------- /blogs/01_LLama3-8B_Inferentia_EKS_vLLM/nodegroup-config.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: eksctl.io/v1alpha5 2 | kind: ClusterConfig 3 | 4 | metadata: 5 | name: genai 6 | region: us-east-1 7 | version: "1.30" 8 | 9 | managedNodeGroups: 10 | - name: neuron-group 11 | instanceType: inf2.48xlarge 12 | desiredCapacity: 1 13 | minSize: 1 14 | maxSize: 1 15 | volumeSize: 500 16 | ami: ami-0077f86889fb430bf 17 | amiFamily: AmazonLinux2 18 | iam: 19 | attachPolicyARNs: 20 | - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy 21 | - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly 22 | - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore 23 | - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess 24 | 25 | overrideBootstrapCommand: | 26 | #!/bin/bash 27 | 28 | /etc/eks/bootstrap.sh genai 29 | 
-------------------------------------------------------------------------------- /tutorials/01_EmbeddingsFromTextWithBert/01_TextFeatureExtractionForSimilarity.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "16d74ffb", 6 | "metadata": {}, 7 | "source": [ 8 | "# Measure document similarities by extraction features from text inputs\n", 9 | "\n", 10 | "Create a mechanism to extract features (embeddings) from text inputs. With the embeddings you can then compute the distance between two or more sentences. This is useful if you're building a search mechanism or trying to see how **\"semantically\"** two sentences are close.\n", 11 | "\n", 12 | "For that purpose you'll use a **[Bert base](https://huggingface.co/bert-base-cased-finetuned-mrpc)** model, accelerated by an inf1 instance ([AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/)), running on SageMaker.\n", 13 | "\n", 14 | "For maximum performance and flexibility, you'll prepare the model with \"Neuron Core Pipeline\" and \"Dynamic Batch Size\" enabled. The first technique will shard the model across multiple cores to improve throughput. The second technique will allow you to send requests with different batch sizes. [Read more about these feature here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/pipeline_tutorial/neuroncore_pipeline_pytorch.html).\n", 15 | "\n", 16 | "The text samples used in this notebook were extracted from: https://www.gutenberg.org/cache/epub/84/pg84-images.html#chap01" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "fc1cb598", 22 | "metadata": {}, 23 | "source": [ 24 | "## 1) Compile a pre-trained model\n", 25 | "When you deploy a model to a SageMaker Endpoint/inf1 instance (AWS Inferentia), you first need to compile the model with NeuronSDK. We'll use a sample provided by the official AWS Neuron SDK + Inferentia Samples.\n", 26 | "\n", 27 | "- Clone the repo: https://github.com/aws-neuron/aws-neuron-samples\n", 28 | "- Load the jupyter notebook for BertBaseCased: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuron/inference/bertbasecased/\n", 29 | "- Start running the notebook, but enable Dynamic Batch and also Neuron Core Pipelines for 4 Neuron Cores, in model compilation section, as following:\n", 30 | "\n", 31 | "```python\n", 32 | "import os\n", 33 | "import torch\n", 34 | "import torch.neuron\n", 35 | "\n", 36 | "save_dir='model'\n", 37 | "neuron_model = torch.neuron.trace(\n", 38 | " model, example_inputs=example_inputs_paraphrase,\n", 39 | " dynamic_batch_size=True,\n", 40 | " compiler_args['--neuron-core-pipeline', '4']\n", 41 | ")\n", 42 | "model.config.update({\"traced_sequence_length\": max_length})\n", 43 | "\n", 44 | "## Export 1/compiled model; 2/ tokenizer and 3/ model configs\n", 45 | "model_neuron.save(os.path.join(save_dir,\"model_neuron.pt\"))\n", 46 | "tokenizer.save_pretrained(save_dir)\n", 47 | "model.config.save_pretrained(save_dir)\n", 48 | "\n", 49 | "```" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "id": "61189f98", 55 | "metadata": {}, 56 | "source": [ 57 | "## 2) Pack and upload the model to S3\n", 58 | "After compiling the model with the instructions above, **COPY** the entire **save_dir** to the same directory of this Notebook." 
59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "id": "d60a3908", 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "import io\n", 69 | "import tarfile\n", 70 | "import sagemaker\n", 71 | "\n", 72 | "save_dir='model'\n", 73 | "sess = sagemaker.Session()\n", 74 | "sagemaker_session_bucket = sess.default_bucket()\n", 75 | "with io.BytesIO() as file:\n", 76 | " with tarfile.open(fileobj=file, mode=\"w:gz\") as tar:\n", 77 | " tar.add(save_dir, \".\")\n", 78 | " tar.list()\n", 79 | " file.seek(0)\n", 80 | " s3_uri = sess.upload_string_as_file_body(\n", 81 | " file.read(), sagemaker_session_bucket, \"model/bert/model.tar.gz\"\n", 82 | " )\n", 83 | "print(s3_uri)" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "id": "7ef38d42", 89 | "metadata": {}, 90 | "source": [ 91 | "## 3) Inference script used by SageMaker endpoint to load and execute the model\n", 92 | "This script is responsible for loading the model and expose a webservice for us to invoke and get predictions (embeddings)" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "id": "81760e79", 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "!pygmentize code/inference.py" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "id": "7799a6eb", 108 | "metadata": {}, 109 | "source": [ 110 | "## 4) Deploy our model to a SageMaker endpoint" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "id": "cbc462b8", 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "import sagemaker\n", 121 | "sess = sagemaker.Session()\n", 122 | "\n", 123 | "# sagemaker session bucket -> used for uploading data, models and logs\n", 124 | "# sagemaker will automatically create this bucket if it not exists\n", 125 | "sagemaker_session_bucket = sess.default_bucket()\n", 126 | "\n", 127 | "role = sagemaker.get_execution_role()\n", 128 | "\n", 129 | "print(f\"sagemaker role arn: {role}\")\n", 130 | "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", 131 | "print(f\"sagemaker session region: {sess.boto_region_name}\")" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "id": "8bd55d2b", 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "from sagemaker.huggingface.model import HuggingFaceModel\n", 142 | "\n", 143 | "# create Hugging Face Model Class\n", 144 | "huggingface_model = HuggingFaceModel(\n", 145 | " model_data=s3_uri, # path to your model and script\n", 146 | " role=role, # iam role with permissions to create an Endpoint\n", 147 | " transformers_version=\"4.12\", # transformers version used\n", 148 | " pytorch_version=\"1.9\", # pytorch version used\n", 149 | " py_version='py37', # python version used\n", 150 | " sagemaker_session=sess,\n", 151 | " model_server_workers=4, # keep 4 workers\n", 152 | " entry_point=\"code/inference.py\",\n", 153 | " # for production it is important to define vpc_config and use a vpc_endpoint\n", 154 | " #vpc_config={\n", 155 | " # 'Subnets': ['subnet-a320a8ca', 'subnet-56d5072d'],\n", 156 | " # 'SecurityGroupIds': ['sg-0d8c231d83c1caaa6', 'sg-5504723c']\n", 157 | " #} \n", 158 | ")\n", 159 | "\n", 160 | "# Let SageMaker know that we've already compiled the model via neuron-cc\n", 161 | "huggingface_model._is_compiled_model = True\n", 162 | "\n", 163 | "# deploy the endpoint endpoint\n", 164 | "predictor = huggingface_model.deploy(\n", 165 | " initial_instance_count=1, # number of instances\n", 166 | " 
instance_type=\"ml.inf1.6xlarge\" # AWS Inferentia Instance\n", 167 | ")" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "id": "380e48a5", 173 | "metadata": {}, 174 | "source": [ 175 | "## 5) Run a simple test" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "id": "63e60205", 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "from sagemaker.serializers import JSONSerializer\n", 186 | "from sagemaker.deserializers import NumpyDeserializer\n", 187 | "predictor.serializer = JSONSerializer()\n", 188 | "predictor.deserializer = NumpyDeserializer()" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "id": "72afc7de", 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "with open('frank_chap01.txt') as f:\n", 199 | " data = {'inputs': [l.strip() for l in f.readlines()]}\n", 200 | "num_sentences = len(data['inputs'])\n", 201 | "print(f\"Number of sentences: {num_sentences}\")\n", 202 | "embeddings = predictor.predict(data)\n", 203 | "print(embeddings.shape)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "id": "79e975c7", 209 | "metadata": {}, 210 | "source": [ 211 | "### 5.1) Simple benchmark to identify the best batch_size with 1 client only" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 51, 217 | "id": "ff730002", 218 | "metadata": {}, 219 | "outputs": [ 220 | { 221 | "name": "stdout", 222 | "output_type": "stream", 223 | "text": [ 224 | "Batch size: 1 Elapsed time: 14.544463157653809ms Latency p/s 14.544463157653809ms\n", 225 | "Batch size: 2 Elapsed time: 23.25267791748047ms Latency p/s 11.626338958740234ms\n", 226 | "Batch size: 3 Elapsed time: 31.86509609222412ms Latency p/s 10.621698697408041ms\n", 227 | "Batch size: 4 Elapsed time: 39.96927738189697ms Latency p/s 9.992319345474243ms\n", 228 | "Batch size: 5 Elapsed time: 48.52888584136963ms Latency p/s 9.705777168273926ms\n", 229 | "Batch size: 6 Elapsed time: 57.08444118499756ms Latency p/s 9.514073530832926ms\n", 230 | "Batch size: 7 Elapsed time: 65.29092788696289ms Latency p/s 9.32727541242327ms\n", 231 | "Batch size: 8 Elapsed time: 74.49376583099365ms Latency p/s 9.311720728874207ms\n", 232 | "Batch size: 9 Elapsed time: 82.37555027008057ms Latency p/s 9.15283891889784ms\n", 233 | "Batch size: 10 Elapsed time: 90.54069519042969ms Latency p/s 9.054069519042969ms\n", 234 | "Batch size: 11 Elapsed time: 99.27759170532227ms Latency p/s 9.025235609574752ms\n" 235 | ] 236 | } 237 | ], 238 | "source": [ 239 | "import time\n", 240 | "import copy\n", 241 | "iterations=10\n", 242 | "for batch_size in range(1,num_sentences+1):\n", 243 | " d = copy.deepcopy(data)\n", 244 | " d['inputs'] = d['inputs'][:batch_size]\n", 245 | " t=time.time()\n", 246 | " for i in range(iterations):\n", 247 | " predictor.predict(d)\n", 248 | " elapsed = (time.time()-t)/iterations*1000\n", 249 | " print(f\"Batch size: {batch_size} Elapsed time: {elapsed}ms Latency p/s {elapsed/batch_size}ms\")" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "id": "658ad164", 255 | "metadata": {}, 256 | "source": [ 257 | "### 5.2) Now Invoke the endpoint in parallel to evaluate throughput" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 38, 263 | "id": "518a7206", 264 | "metadata": {}, 265 | "outputs": [ 266 | { 267 | "name": "stdout", 268 | "output_type": "stream", 269 | "text": [ 270 | "Elapsed time: 24082.525491714478ms to process 11264 sentences with 5 workers. 
Latency p/s: 2.1380083000456747ms\n" 271 | ] 272 | } 273 | ], 274 | "source": [ 275 | "import time\n", 276 | "from concurrent.futures import ThreadPoolExecutor\n", 277 | "\n", 278 | "# custom task that will sleep for a variable amount of time\n", 279 | "def task(data):\n", 280 | " predictor.predict(data)\n", 281 | "\n", 282 | "num_workers = 5\n", 283 | "d = copy.deepcopy(data)\n", 284 | "documents_1k = [d for i in range(1024)]\n", 285 | "total_docs = len(documents_1k) * len(data['inputs'])\n", 286 | "\n", 287 | "# start the thread pool\n", 288 | "t=time.time()\n", 289 | "with ThreadPoolExecutor(num_workers) as executor:\n", 290 | " # execute tasks concurrently and process results in order \n", 291 | " executor.map(task, documents_1k)\n", 292 | "elapsed = (time.time()-t)*1000\n", 293 | "print(f\"Elapsed time: {elapsed}ms to process {total_docs} sentences with {num_workers} workers. Latency p/s: {elapsed/total_docs}ms\")" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "id": "6ea67331", 299 | "metadata": {}, 300 | "source": [ 301 | "### 5.3) Finally a similarity test" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 50, 307 | "id": "8bbedec9", 308 | "metadata": {}, 309 | "outputs": [ 310 | { 311 | "name": "stdout", 312 | "output_type": "stream", 313 | "text": [ 314 | "Cosine Similarity: [[0.9238203]]\n" 315 | ] 316 | } 317 | ], 318 | "source": [ 319 | "from sklearn.metrics.pairwise import cosine_similarity\n", 320 | "sentence_1=\"I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion.\"\n", 321 | "sentence_2=\"I watched C-beams glitter in the dark near the Tannhäuser Gate. All those moments will be lost in time, like tears in rain. Time to die.\"\n", 322 | "embeddings_1,embeddings_2 = predictor.predict({'inputs':[sentence_1, sentence_2]})\n", 323 | "print(f'Cosine Similarity: {cosine_similarity([embeddings_1],[embeddings_2])}')" 324 | ] 325 | } 326 | ], 327 | "metadata": { 328 | "kernelspec": { 329 | "display_name": "conda_pytorch_p36", 330 | "language": "python", 331 | "name": "conda_pytorch_p36" 332 | }, 333 | "language_info": { 334 | "codemirror_mode": { 335 | "name": "ipython", 336 | "version": 3 337 | }, 338 | "file_extension": ".py", 339 | "mimetype": "text/x-python", 340 | "name": "python", 341 | "nbconvert_exporter": "python", 342 | "pygments_lexer": "ipython3", 343 | "version": "3.6.13" 344 | } 345 | }, 346 | "nbformat": 4, 347 | "nbformat_minor": 5 348 | } 349 | -------------------------------------------------------------------------------- /tutorials/01_EmbeddingsFromTextWithBert/code/inference.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | import os 4 | os.environ['NEURON_RT_NUM_CORES'] = '4' 5 | import json 6 | import torch 7 | import torch.neuron 8 | from typing import List 9 | import torch.nn.functional as F 10 | from transformers import AutoConfig, AutoTokenizer 11 | 12 | def compute_embeddings(features, sentences): 13 | attention_mask = sentences['attention_mask'] 14 | mask = attention_mask.unsqueeze(-1).expand(features.size()).float() 15 | masked_embeddings = features * mask 16 | summed = torch.sum(masked_embeddings, 1) 17 | summed_mask = torch.clamp(mask.sum(1), min=1e-9) 18 | 19 | return (summed / summed_mask).numpy() 20 | 21 | def model_fn(model_dir): 22 | # load tokenizer and neuron model from model_dir 23 | tokenizer = AutoTokenizer.from_pretrained(model_dir) 24 | model = torch.jit.load(os.path.join(model_dir, "model_neuron.pt")) 25 | model_config = AutoConfig.from_pretrained(model_dir) 26 | 27 | return model, tokenizer, model_config 28 | 29 | def predict_fn(data, model_tokenizer_model_config): 30 | # destruct model and tokenizer 31 | model, tokenizer, model_config = model_tokenizer_model_config 32 | encoded_input = tokenizer.batch_encode_plus( 33 | data['inputs'], 34 | return_tensors="pt", 35 | max_length=model_config.traced_sequence_length, 36 | padding="max_length", 37 | truncation=True, 38 | ) 39 | # convert for neuron model 40 | sentences_inputs = encoded_input['input_ids'], encoded_input['attention_mask'], encoded_input['token_type_ids'] 41 | 42 | with torch.no_grad(): 43 | model_output = model(*sentences_inputs)[0] 44 | 45 | # Perform pooling & return numpy 46 | return compute_embeddings(model_output, encoded_input) 47 | -------------------------------------------------------------------------------- /tutorials/01_EmbeddingsFromTextWithBert/frank_chap01.txt: -------------------------------------------------------------------------------- 1 | I am by birth a Genevese, and my family is one of the most distinguished of that republic. My ancestors had been for many years counsellors and syndics, and my father had filled several public situations with honour and reputation. He was respected by all who knew him for his integrity and indefatigable attention to public business. He passed his younger days perpetually occupied by the affairs of his country; a variety of circumstances had prevented his marrying early, nor was it until the decline of life that he became a husband and the father of a family. 2 | As the circumstances of his marriage illustrate his character, I cannot refrain from relating them. One of his most intimate friends was a merchant who, from a flourishing state, fell, through numerous mischances, into poverty. This man, whose name was Beaufort, was of a proud and unbending disposition and could not bear to live in poverty and oblivion in the same country where he had formerly been distinguished for his rank and magnificence. Having paid his debts, therefore, in the most honourable manner, he retreated with his daughter to the town of Lucerne, where he lived unknown and in wretchedness. My father loved Beaufort with the truest friendship and was deeply grieved by his retreat in these unfortunate circumstances. He bitterly deplored the false pride which led his friend to a conduct so little worthy of the affection that united them. He lost no time in endeavouring to seek him out, with the hope of persuading him to begin the world again through his credit and assistance. 
3 | Beaufort had taken effectual measures to conceal himself, and it was ten months before my father discovered his abode. Overjoyed at this discovery, he hastened to the house, which was situated in a mean street near the Reuss. But when he entered, misery and despair alone welcomed him. Beaufort had saved but a very small sum of money from the wreck of his fortunes, but it was sufficient to provide him with sustenance for some months, and in the meantime he hoped to procure some respectable employment in a merchant’s house. The interval was, consequently, spent in inaction; his grief only became more deep and rankling when he had leisure for reflection, and at length it took so fast hold of his mind that at the end of three months he lay on a bed of sickness, incapable of any exertion. 4 | His daughter attended him with the greatest tenderness, but she saw with despair that their little fund was rapidly decreasing and that there was no other prospect of support. But Caroline Beaufort possessed a mind of an uncommon mould, and her courage rose to support her in her adversity. She procured plain work; she plaited straw and by various means contrived to earn a pittance scarcely sufficient to support life. 5 | Several months passed in this manner. Her father grew worse; her time was more entirely occupied in attending him; her means of subsistence decreased; and in the tenth month her father died in her arms, leaving her an orphan and a beggar. This last blow overcame her, and she knelt by Beaufort’s coffin weeping bitterly, when my father entered the chamber. He came like a protecting spirit to the poor girl, who committed herself to his care; and after the interment of his friend he conducted her to Geneva and placed her under the protection of a relation. Two years after this event Caroline became his wife. 6 | There was a considerable difference between the ages of my parents, but this circumstance seemed to unite them only closer in bonds of devoted affection. There was a sense of justice in my father’s upright mind which rendered it necessary that he should approve highly to love strongly. Perhaps during former years he had suffered from the late-discovered unworthiness of one beloved and so was disposed to set a greater value on tried worth. There was a show of gratitude and worship in his attachment to my mother, differing wholly from the doting fondness of age, for it was inspired by reverence for her virtues and a desire to be the means of, in some degree, recompensing her for the sorrows she had endured, but which gave inexpressible grace to his behaviour to her. Everything was made to yield to her wishes and her convenience. He strove to shelter her, as a fair exotic is sheltered by the gardener, from every rougher wind and to surround her with all that could tend to excite pleasurable emotion in her soft and benevolent mind. Her health, and even the tranquillity of her hitherto constant spirit, had been shaken by what she had gone through. During the two years that had elapsed previous to their marriage my father had gradually relinquished all his public functions; and immediately after their union they sought the pleasant climate of Italy, and the change of scene and interest attendant on a tour through that land of wonders, as a restorative for her weakened frame. 7 | From Italy they visited Germany and France. I, their eldest child, was born at Naples, and as an infant accompanied them in their rambles. I remained for several years their only child. 
Much as they were attached to each other, they seemed to draw inexhaustible stores of affection from a very mine of love to bestow them upon me. My mother’s tender caresses and my father’s smile of benevolent pleasure while regarding me are my first recollections. I was their plaything and their idol, and something better—their child, the innocent and helpless creature bestowed on them by Heaven, whom to bring up to good, and whose future lot it was in their hands to direct to happiness or misery, according as they fulfilled their duties towards me. With this deep consciousness of what they owed towards the being to which they had given life, added to the active spirit of tenderness that animated both, it may be imagined that while during every hour of my infant life I received a lesson of patience, of charity, and of self-control, I was so guided by a silken cord that all seemed but one train of enjoyment to me. 8 | For a long time I was their only care. My mother had much desired to have a daughter, but I continued their single offspring. When I was about five years old, while making an excursion beyond the frontiers of Italy, they passed a week on the shores of the Lake of Como. Their benevolent disposition often made them enter the cottages of the poor. This, to my mother, was more than a duty; it was a necessity, a passion—remembering what she had suffered, and how she had been relieved—for her to act in her turn the guardian angel to the afflicted. During one of their walks a poor cot in the foldings of a vale attracted their notice as being singularly disconsolate, while the number of half-clothed children gathered about it spoke of penury in its worst shape. One day, when my father had gone by himself to Milan, my mother, accompanied by me, visited this abode. She found a peasant and his wife, hard working, bent down by care and labour, distributing a scanty meal to five hungry babes. Among these there was one which attracted my mother far above all the rest. She appeared of a different stock. The four others were dark-eyed, hardy little vagrants; this child was thin and very fair. Her hair was the brightest living gold, and despite the poverty of her clothing, seemed to set a crown of distinction on her head. Her brow was clear and ample, her blue eyes cloudless, and her lips and the moulding of her face so expressive of sensibility and sweetness that none could behold her without looking on her as of a distinct species, a being heaven-sent, and bearing a celestial stamp in all her features. 9 | The peasant woman, perceiving that my mother fixed eyes of wonder and admiration on this lovely girl, eagerly communicated her history. She was not her child, but the daughter of a Milanese nobleman. Her mother was a German and had died on giving her birth. The infant had been placed with these good people to nurse: they were better off then. They had not been long married, and their eldest child was but just born. The father of their charge was one of those Italians nursed in the memory of the antique glory of Italy—one among the schiavi ognor frementi, who exerted himself to obtain the liberty of his country. He became the victim of its weakness. Whether he had died or still lingered in the dungeons of Austria was not known. His property was confiscated; his child became an orphan and a beggar. She continued with her foster parents and bloomed in their rude abode, fairer than a garden rose among dark-leaved brambles. 
10 | When my father returned from Milan, he found playing with me in the hall of our villa a child fairer than pictured cherub—a creature who seemed to shed radiance from her looks and whose form and motions were lighter than the chamois of the hills. The apparition was soon explained. With his permission my mother prevailed on her rustic guardians to yield their charge to her. They were fond of the sweet orphan. Her presence had seemed a blessing to them, but it would be unfair to her to keep her in poverty and want when Providence afforded her such powerful protection. They consulted their village priest, and the result was that Elizabeth Lavenza became the inmate of my parents’ house—my more than sister—the beautiful and adored companion of all my occupations and my pleasures. 11 | Everyone loved Elizabeth. The passionate and almost reverential attachment with which all regarded her became, while I shared it, my pride and my delight. On the evening previous to her being brought to my home, my mother had said playfully, “I have a pretty present for my Victor—tomorrow he shall have it.” And when, on the morrow, she presented Elizabeth to me as her promised gift, I, with childish seriousness, interpreted her words literally and looked upon Elizabeth as mine—mine to protect, love, and cherish. All praises bestowed on her I received as made to a possession of my own. We called each other familiarly by the name of cousin. No word, no expression could body forth the kind of relation in which she stood to me—my more than sister, since till death she was to be mine only. -------------------------------------------------------------------------------- /tutorials/02_ObjectTrackingSageMakerGStreamer/01_Yolov7SageMakerInferentia.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "d89c3d67", 6 | "metadata": {}, 7 | "source": [ 8 | "# Deploy Yolov7 to SageMaker + Inferentia\n", 9 | "\n", 10 | "\n", 11 | "We'll create a SageMaker real-time endpoint with a Yolov7 model capable of detecting people and predicting the pose of each person. For that purpose, we need to get the model and prepare it to be deployed to AWS Inferentia." 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "id": "58b1061d", 17 | "metadata": {}, 18 | "source": [ 19 | "## 1) Install dependencies" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "id": "9e6e0ad9", 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "# with this library we can build docker images and push them to ECR\n", 30 | "%pip install sagemaker-studio-image-build" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "efec75fd", 36 | "metadata": {}, 37 | "source": [ 38 | "## 2) Compile a pre-trained model\n", 39 | "When you deploy a model to a SageMaker Endpoint/inf1 instance (AWS Inferentia), you first need compile the model with NeuronSDK. 
We'll use a sample provided by the official AWS Neuron SDK + Inferentia Samples.\n", 40 | "\n", 41 | "- Clone the repo: https://github.com/aws-neuron/aws-neuron-samples\n", 42 | "- Load the jupyter notebook for Yolov7: https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuron/inference/yolov7\n", 43 | "- Start running the notebook, but enable Dynamic Batch and also Neuron Core Pipelines for 4 Neuron Cores,in model compilation section, as following:\n", 44 | "\n", 45 | "```python\n", 46 | "import torch\n", 47 | "import torch.neuron\n", 48 | "\n", 49 | "model_neuron = torch.neuron.trace(\n", 50 | " model, example_inputs=x,\n", 51 | " dynamic_batch_size=True,\n", 52 | " compiler_args['--neuron-core-pipeline', '4']\n", 53 | ")\n", 54 | "\n", 55 | "## Export to saved model\n", 56 | "model_neuron.save(\"yolov7_neuron.pt\")\n", 57 | "```" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "id": "275e2ea1", 63 | "metadata": {}, 64 | "source": [ 65 | "## 3) Pack and upload the model to S3\n", 66 | "After compiling the model with the instructions above, **copy** the model to the same directory of this Notebook" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "id": "2d5a89a8", 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "import os\n", 77 | "import io\n", 78 | "import tarfile\n", 79 | "import sagemaker\n", 80 | "\n", 81 | "sagemaker_session = sagemaker.Session()\n", 82 | "bucket = sagemaker_session.default_bucket()\n", 83 | "image_name='pytorch-inference-neuron'\n", 84 | "image_tag=\"1.10.2h-neuron-py37-sdk1.19.0-ubuntu18.04\"\n", 85 | "model_s3_path=\"models/yolov7-pose/model.tar.gz\"" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "id": "70a2c9cd", 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "with io.BytesIO() as tar_file:\n", 96 | " with tarfile.open(fileobj=tar_file, mode='w:gz') as tar:\n", 97 | " tar.add('yolov7_neuron.pt', 'model.pt')\n", 98 | " tar.list()\n", 99 | " tar_file.seek(0)\n", 100 | " s3_uri = sagemaker_session.upload_string_as_file_body(\n", 101 | " tar_file.read(), bucket=bucket, key=model_s3_path\n", 102 | " )\n", 103 | " print(s3_uri)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "id": "715815db", 109 | "metadata": {}, 110 | "source": [ 111 | "## 3) Build a custom docker container with additional libraries\n", 112 | "**YOU DON\"T NEED TO RUN** this section if you already did that before" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "id": "93b49329", 118 | "metadata": {}, 119 | "source": [ 120 | "We'll extend a pythorch-inference container to apply a patch that allow us to pass CustomAttributes to our code and also to install required libraries like libJPEG Turbo." 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "id": "d68a0f10", 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "!pygmentize container_01/Dockerfile" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "id": "59a6698f", 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "!sm-docker build container_01/ --repository $image_name:$image_tag" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "id": "b35e5738", 146 | "metadata": {}, 147 | "source": [ 148 | "## 4) Inference Code executed by SageMaker Endpoint\n", 149 | "We need to create a custom inference file to pass to SageMaker. 
This code has the mechanisms to invoke the model and also pre/post process the input jpeg image & predictions.\n", 150 | "\n", 151 | "- **input_fn()**: Will receive the bytes of a .jpeg file. This file needs to be a mosaic, composed of multiple frames in just one image. By using **CustomAttributes** we share some metadata about the mosaic to the endpoint. With tile_width and tile_height we can compute how many images does the mosaic have, parse it and build a batch.\n", 152 | "- **output_fn()**: Gets the predictions and converts them to a numpy blob" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "id": "6a28879e", 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "!pygmentize code_01/inference.py" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "id": "80aae4ec", 168 | "metadata": {}, 169 | "source": [ 170 | "## 5) Deploy our model to SageMaker" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "id": "d3598f0c", 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "import boto3\n", 181 | "import logging\n", 182 | "from sagemaker.pytorch.model import PyTorchModel\n", 183 | "from sagemaker.predictor import Predictor\n", 184 | "\n", 185 | "sagemaker_session = sagemaker.Session()\n", 186 | "\n", 187 | "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", 188 | "region_name = sagemaker_session.boto_session.region_name\n", 189 | "bucket = sagemaker_session.default_bucket()\n", 190 | "s3_uri=f\"s3://{bucket}/{model_s3_path}\"\n", 191 | "role=sagemaker.get_execution_role()\n", 192 | "print(f\"Bucket: {bucket}\\nAWS AccountID: {account_id}\\nRegion: {region_name}\")\n", 193 | "\n", 194 | "# https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers\n", 195 | "image_uri=f\"{account_id}.dkr.ecr.{region_name}.amazonaws.com/{image_name}:{image_tag}\"\n", 196 | "\n", 197 | "print(image_uri)\n", 198 | "sagemaker_model = PyTorchModel(\n", 199 | " image_uri=image_uri,\n", 200 | " model_data=s3_uri, \n", 201 | " role=role, \n", 202 | " name=\"yolov7-pose-inferentia\",\n", 203 | " sagemaker_session=sagemaker_session,\n", 204 | " entry_point=\"code_01/inference.py\",\n", 205 | " container_log_level=logging.DEBUG,\n", 206 | " model_server_workers=4, # keep 4 workers\n", 207 | " framework_version=\"1.10.0\",\n", 208 | " # for production it is important to define vpc_config and use a vpc_endpoint\n", 209 | " #vpc_config={\n", 210 | " # 'Subnets': ['', ''],\n", 211 | " # 'SecurityGroupIds': ['', '']\n", 212 | " #}\n", 213 | ")\n", 214 | "sagemaker_model._is_compiled_model = True" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "id": "bad27169", 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [ 224 | "predictor = sagemaker_model.deploy(\n", 225 | " endpoint_name=\"yolov7-pose-inferentia\",\n", 226 | " instance_type=\"ml.inf1.6xlarge\",\n", 227 | " initial_instance_count=1\n", 228 | ")" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "id": "dd2e18d4", 234 | "metadata": {}, 235 | "source": [ 236 | "## 6) Test the endpoint" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": null, 242 | "id": "fb939242", 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [ 246 | "%matplotlib inline\n", 247 | "import os\n", 248 | "import cv2\n", 249 | "import numpy as np\n", 250 | "import urllib.request\n", 251 | "import matplotlib.pyplot as plt\n", 252 | 
"\n", 253 | "if not os.path.isfile('zidane.jpg'):\n", 254 | " urllib.request.urlretrieve(\n", 255 | " 'https://raw.githubusercontent.com/ultralytics/yolov5/master/data/images/zidane.jpg',\n", 256 | " 'zidane.jpg'\n", 257 | " )\n", 258 | " \n", 259 | "if not os.path.isfile('mosaic4.jpg'):\n", 260 | " img = cv2.imread('zidane.jpg')\n", 261 | " h,w,c = img.shape\n", 262 | " factor = 960/w\n", 263 | " new_h,new_w=int(h*factor),int(w*factor)\n", 264 | " img = cv2.resize(img, (new_w,new_h))\n", 265 | " mosaic = np.zeros((new_h*2, new_w*2, c), dtype=np.uint8)\n", 266 | " for i in range(2):\n", 267 | " for j in range(2):\n", 268 | " ph, pw = i*new_h, j*new_w\n", 269 | " mosaic[ph:ph+new_h, pw:pw+new_w] = img[:]\n", 270 | " cv2.imwrite('mosaic4.jpg', mosaic)\n", 271 | "plt.figure(figsize=(15,10))\n", 272 | "plt.imshow(cv2.cvtColor(cv2.imread('mosaic4.jpg'), cv2.COLOR_BGR2RGB))" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": null, 278 | "id": "4759e9f7", 279 | "metadata": {}, 280 | "outputs": [], 281 | "source": [ 282 | "import json\n", 283 | "import time\n", 284 | "import sagemaker\n", 285 | "import numpy as np\n", 286 | "from sagemaker.predictor import Predictor\n", 287 | "from sagemaker.serializers import DataSerializer\n", 288 | "from sagemaker.deserializers import NumpyDeserializer\n", 289 | "\n", 290 | "sagemaker_session = sagemaker.Session()\n", 291 | "\n", 292 | "predictor = Predictor(endpoint_name=\"yolov7-pose-inferentia\", sagemaker_session=sagemaker_session)\n", 293 | "predictor.serializer = DataSerializer(content_type='image/jpeg')\n", 294 | "predictor.deserializer = NumpyDeserializer()\n", 295 | "\n", 296 | "mosaic_size=2\n", 297 | "custom_attributes={\n", 298 | " 'CustomAttributes': json.dumps({ \n", 299 | " \"tile_width\": 960, \n", 300 | " \"tile_height\": 540,\n", 301 | " \"conf_thres\": 0.15,\n", 302 | " \"iou_thres\": 0.45\n", 303 | " })\n", 304 | "}\n", 305 | "data = open(f'mosaic{mosaic_size*mosaic_size}.jpg', 'rb').read()\n", 306 | "t = time.time()\n", 307 | "y = predictor.predict(data, initial_args=custom_attributes)\n", 308 | "elapsed = (time.time()-t) * 1000\n", 309 | "print(f\"Elapsed: {elapsed}, Latency per image: {elapsed / (mosaic_size ** 2)}\")\n", 310 | "y.shape" 311 | ] 312 | } 313 | ], 314 | "metadata": { 315 | "kernelspec": { 316 | "display_name": "conda_python3", 317 | "language": "python", 318 | "name": "conda_python3" 319 | }, 320 | "language_info": { 321 | "codemirror_mode": { 322 | "name": "ipython", 323 | "version": 3 324 | }, 325 | "file_extension": ".py", 326 | "mimetype": "text/x-python", 327 | "name": "python", 328 | "nbconvert_exporter": "python", 329 | "pygments_lexer": "ipython3", 330 | "version": "3.6.13" 331 | } 332 | }, 333 | "nbformat": 4, 334 | "nbformat_minor": 5 335 | } 336 | -------------------------------------------------------------------------------- /tutorials/02_ObjectTrackingSageMakerGStreamer/02_CVPipeline.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "876b4937", 6 | "metadata": {}, 7 | "source": [ 8 | "# CV/ML Pipeline to extract highlights from videos using ML Models\n", 9 | "\n", 10 | "With this notebook you can create an end-to-end CV/ML Pipeline using [GStreamer](gstreamer.freedesktop.org/) and run ML models to extract information from the frames. We'll use a Person detection + Pose estimation model based on Yolov7 to identify and track people in video files. 
With Gstreamer we can combine multiple feeds/cameras and create a mosaic of images. This helps us to accelerate the process.\n", 11 | "\n", 12 | "First, deploy a pre-trained **Yolov7** to a SageMaker endpoint. Follow the instructions in [this notebook](01_Yolov7SageMakerInferentia.ipynb). Then, you can run this notebook." 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "feef436c", 18 | "metadata": {}, 19 | "source": [ 20 | "## 1) Install dependencies" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "id": "95e37a31", 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "# with this library we can build docker images and push them to ECR\n", 31 | "%pip install sagemaker-studio-image-build" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "id": "f9b932b2", 37 | "metadata": {}, 38 | "source": [ 39 | "## 2) Initialize some variables" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "id": "fa5cbb95", 46 | "metadata": { 47 | "scrolled": true 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "import os\n", 52 | "import io\n", 53 | "import boto3\n", 54 | "import tarfile\n", 55 | "import sagemaker\n", 56 | "\n", 57 | "sagemaker_session = sagemaker.Session()\n", 58 | "bucket = sagemaker_session.default_bucket()\n", 59 | "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", 60 | "region_name = sagemaker_session.boto_session.region_name\n", 61 | "\n", 62 | "image_name='gstreamer'\n", 63 | "image_tag=\"py3-1.0\"\n", 64 | "image_uri=f\"{account_id}.dkr.ecr.{region_name}.amazonaws.com/{image_name}:{image_tag}\"\n", 65 | "print(f'Custom docker image: {image_uri}')" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "id": "0279f53b", 71 | "metadata": {}, 72 | "source": [ 73 | "## 3) Build a custom docker container with additional libraries\n", 74 | "**YOU DON\"T NEED TO RUN** this section if you already did that before" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "id": "c8fca5e5", 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "!pygmentize container_02/Dockerfile" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "id": "8298ebba", 90 | "metadata": { 91 | "scrolled": true 92 | }, 93 | "source": [ 94 | "### 3.1) Build and push the container image" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "id": "242b1d6d", 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "!sm-docker build container_02/ --repository $image_name:$image_tag" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "id": "260a05e0", 110 | "metadata": {}, 111 | "source": [ 112 | "## 4) Create an application for processing our videos\n", 113 | "This application will run inside a container executed by SageMaker Processing Jobs" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "id": "ddc49c24", 119 | "metadata": {}, 120 | "source": [ 121 | "### 4.1) Tracker object that makes use of ByteTrack\n", 122 | "Source: https://github.com/ifzhang/ByteTrack \n", 123 | "This class assigns ids to detected objects and keeps track of them across multiple frames" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "id": "44a95c38", 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "!pygmentize libs/tracker.py" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "id": "7e0038a7", 139 | "metadata": {}, 140 | "source": [ 141 | "### 4.2) CV 
Pipeline that wraps a GStreamer pipeline\n", 142 | "Extend this class to create your own GStreamer pipeline solution" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "id": "213faf51", 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "!pygmentize libs/cvpipeline.py" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "id": "9062a172", 158 | "metadata": {}, 159 | "source": [ 160 | "### 4.3) SageMaker CV Pipeline\n", 161 | "Extends a CVPipeline and invokes a SageMaker Endpoint for each frame" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "id": "39230ed6", 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "!pygmentize libs/smcvpipeline.py" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "id": "31ee6845", 177 | "metadata": {}, 178 | "source": [ 179 | "### 4.4) Main application\n", 180 | "This script will parse all the parameters passed through SageMaker Processing jobs api and invoke the Gstreamer pipeline" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "id": "2156a6e7", 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "!pygmentize code_02/pipeline.py" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "id": "0c2ff4c8", 196 | "metadata": {}, 197 | "source": [ 198 | "### 4.5) Clone the correct version of ByteTrack\n", 199 | "This library is required when object tracking is enabled" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "id": "875e59b8", 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "import os\n", 210 | "if not os.path.isdir('libs/bytetrack'):\n", 211 | " !git clone https://github.com/ifzhang/ByteTrack libs/bytetrack && \\\n", 212 | " cd libs/bytetrack && git checkout d1bf019" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "id": "e4249e25", 218 | "metadata": {}, 219 | "source": [ 220 | "## 5) Kick-off a SageMaker Processing job to process all our video files" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": null, 226 | "id": "fd0d8a5c", 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "import sagemaker\n", 231 | "from sagemaker.processing import ScriptProcessor\n", 232 | "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", 233 | "from sagemaker.network import NetworkConfig\n", 234 | "\n", 235 | "sagemaker_session = sagemaker.Session()\n", 236 | "bucket = sagemaker_session.default_bucket()\n", 237 | "print(f\"s3://{bucket}/samples/\")" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "id": "d34408e2", 243 | "metadata": {}, 244 | "source": [ 245 | "### 5.1) Upload your .mp4 files to S3\n", 246 | "If you don't have a video now and just want to run some tests, go to https://pixabay.com/videos/ or any other website which has video of people.\n", 247 | "\n", 248 | "Download the **.mp4** as 720p (1280x720) files and upload them to the S3 path printed in the last cell (above).\n", 249 | "\n", 250 | "Run the following command, then to make sure you uploaded the files:\n", 251 | "```bash\n", 252 | "aws s3 ls s3:///samples/ \n", 253 | "```" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "id": "bf519d08", 259 | "metadata": {}, 260 | "source": [ 261 | "### 5.2) Finally run the Processing Job" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": null, 267 | "id": "9ca2f694", 268 | "metadata": {}, 269 
| "outputs": [], 270 | "source": [ 271 | "import time\n", 272 | "script_processor = ScriptProcessor(\n", 273 | " base_job_name=f'cv-pipeline-{int(time.time()*1000)}',\n", 274 | " image_uri=image_uri,\n", 275 | " role=sagemaker.get_execution_role(),\n", 276 | " instance_type='ml.c5.xlarge',\n", 277 | " instance_count=1,\n", 278 | " max_runtime_in_seconds=60 * 30,\n", 279 | " command=[\"/home/ec2-user/entrypoint.sh\", \"python3\"],\n", 280 | " # for production it is important to define vpc_config and use a vpc_endpoint\n", 281 | " #vpc_config={\n", 282 | " # 'Subnets': ['', ''],\n", 283 | " # 'SecurityGroupIds': ['', '']\n", 284 | " #}\n", 285 | ")\n", 286 | "\n", 287 | "script_processor.run(\n", 288 | " code='code_02/pipeline.py',\n", 289 | " inputs=[\n", 290 | " # always keep this input in the first place to avoid\n", 291 | " # issues with the pipe name\n", 292 | " ProcessingInput(\n", 293 | " source=f's3://{bucket}/samples',\n", 294 | " destination='/opt/ml/processing/input/data', \n", 295 | " s3_input_mode='Pipe',\n", 296 | " s3_data_distribution_type='ShardedByS3Key'\n", 297 | " ),\n", 298 | " ProcessingInput(\n", 299 | " source='libs',\n", 300 | " destination='/opt/ml/processing/input/libs',\n", 301 | " s3_input_mode='File'\n", 302 | " ) \n", 303 | " ],\n", 304 | " outputs=[ProcessingOutput(\n", 305 | " source='/opt/ml/processing/output/predictions',\n", 306 | " destination=f's3://{bucket}/predictions/',\n", 307 | " s3_upload_mode='Continuous'\n", 308 | " )],\n", 309 | " arguments=[\n", 310 | " '--input-shape', '1280 720',\n", 311 | " '--endpoint-name', \"yolov7-pose-inferentia\",\n", 312 | " '--region-name', 'us-east-1'\n", 313 | " ]\n", 314 | ")" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "id": "2836fa31", 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [] 324 | } 325 | ], 326 | "metadata": { 327 | "kernelspec": { 328 | "display_name": "conda_python3", 329 | "language": "python", 330 | "name": "conda_python3" 331 | }, 332 | "language_info": { 333 | "codemirror_mode": { 334 | "name": "ipython", 335 | "version": 3 336 | }, 337 | "file_extension": ".py", 338 | "mimetype": "text/x-python", 339 | "name": "python", 340 | "nbconvert_exporter": "python", 341 | "pygments_lexer": "ipython3", 342 | "version": "3.6.13" 343 | } 344 | }, 345 | "nbformat": 4, 346 | "nbformat_minor": 5 347 | } 348 | -------------------------------------------------------------------------------- /tutorials/02_ObjectTrackingSageMakerGStreamer/README.md: -------------------------------------------------------------------------------- 1 | # Object Tracking for video files with SageMaker and GStreamer 2 | 3 | Process multiple **video files** with ML models, using [SageMaker Processing Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) and [GStreamer](https://gstreamer.freedesktop.org/) in batch mode. 4 | 5 | In this tutorial you'll learn how to create a people pathing mechanism that tracks people on video files in batch mode. The output is a set of Numpy files with the predictions of each frame, which contain: 6 | - bouding boxes for each person; 7 | - keypoints for pose estimation; 8 | - id of each detecte person across the frames; 9 | 10 | 11 | ### Notebooks: 12 | - [01_Yolov7SageMakerInferentia](01_Yolov7SageMakerInferentia.ipynb): First deploy a real-time endpoint on SageMaker on an Inferentia (inf1) instance, with the Object Detection & Pose estimation model. 
13 | - [02_CVPipeline](02_CVPipeline.ipynb): Launch a SageMaker Processing Job with a Python script that defines a GStreamer pipeline that processes multiple files at once by sending each frame to the endpoint and saving the predictions as Numpy files. 14 | 15 | 16 | ### Activities 17 | - First upload some **.mp4** files to an S3 bucket. 18 | - Run notebook 01: 1/ follow the instructions there to compile a Yolov7 for Inferentia; 2/ deploy the compiled model to an endpoint 19 | - Run notebook 02: Prepare a Python application that will be executed by SageMaker to read the .mp4 files and get the predictions. 20 | -------------------------------------------------------------------------------- /tutorials/02_ObjectTrackingSageMakerGStreamer/code_01/inference.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import os 5 | os.environ['NEURON_RT_NUM_CORES'] = '4' 6 | import io 7 | import cv2 8 | import json 9 | import time 10 | import torch 11 | import torch.neuron 12 | import numpy as np 13 | 14 | from turbojpeg import TurboJPEG 15 | 16 | class Detector(object): 17 | '''Main class responsible for pre/post processing + model invocation''' 18 | def __init__(self, model_path): 19 | 20 | self.model = torch.jit.load(model_path).eval() 21 | self.jpeg = TurboJPEG() 22 | 23 | print(f'Model loaded') 24 | 25 | def xywh2xyxy(self, x): 26 | # Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] 27 | # where xy1=top-left, xy2=bottom-right 28 | y = np.copy(x) 29 | y[:, 0] = x[:, 0] - x[:, 2] / 2 # top left x 30 | y[:, 1] = x[:, 1] - x[:, 3] / 2 # top left y 31 | y[:, 2] = x[:, 0] + x[:, 2] / 2 # bottom right x 32 | y[:, 3] = x[:, 1] + x[:, 3] / 2 # bottom right y 33 | return y 34 | 35 | # non maximum suppression. 
Inspired by torchvision.nms 36 | def nms(self, bboxes, scores, iou_threshold=0.45): 37 | x1 = bboxes[:, 0] 38 | y1 = bboxes[:, 1] 39 | x2 = bboxes[:, 2] 40 | y2 = bboxes[:, 3] 41 | areas = (x2 - x1 + 1) * (y2 - y1 + 1) 42 | order = scores.ravel().argsort()[::-1] 43 | keep = [] 44 | while order.size > 0: 45 | i = order[0] 46 | keep.append(i) 47 | xx1 = np.maximum(x1[i], x1[order[1:]]) 48 | yy1 = np.maximum(y1[i], y1[order[1:]]) 49 | xx2 = np.minimum(x2[i], x2[order[1:]]) 50 | yy2 = np.minimum(y2[i], y2[order[1:]]) 51 | w = np.maximum(0.0, xx2 - xx1 + 1) 52 | h = np.maximum(0.0, yy2 - yy1 + 1) 53 | inter = w * h 54 | iou = inter / (areas[i] + areas[order[1:]] - inter) 55 | inds = np.where(iou <= iou_threshold)[0] 56 | order = order[inds + 1] 57 | bboxes = bboxes[keep] 58 | scores = scores[keep] 59 | return bboxes, scores, keep 60 | 61 | def non_max_suppression_kpt(self, prediction, conf_thres=0.25, iou_thres=0.45, classes=None, agnostic=False, 62 | labels=(), kpt_label=False, nc=None, nkpt=None): 63 | """Runs Non-Maximum Suppression (NMS) on inference results 64 | 65 | Returns: 66 | list of detections, on (n,6) tensor per image [xyxy, conf, cls, keypoints] 67 | """ 68 | if nc is None: 69 | nc = prediction.shape[2] - 5 if not kpt_label else prediction.shape[2] - 56 # number of classes 70 | xc = prediction[..., 4] > conf_thres # candidates 71 | 72 | # Settings 73 | min_wh, max_wh = 2, 4096 # (pixels) minimum and maximum box width and height 74 | max_det = 300 # maximum number of detections per image 75 | max_nms = 30000 # maximum number of boxes 76 | time_limit = 10.0 # seconds to quit after 77 | 78 | t = time.time() 79 | output = [np.zeros((0,57))] * prediction.shape[0] 80 | for xi, x in enumerate(prediction): # image index, image inference 81 | # Apply constraints 82 | x = x[xc[xi]] # confidence 83 | 84 | # Cat apriori labels if autolabelling 85 | if labels and len(labels[xi]): 86 | l = labels[xi] 87 | v = np.zeros((len(l), nc + 5)) 88 | v[:, :4] = l[:, 1:5] # box 89 | v[:, 4] = 1.0 # conf 90 | v[range(len(l)), l[:, 0].long() + 5] = 1.0 # cls 91 | x = np.concatenate((x, v), axis=0) 92 | 93 | # If none remain process next image 94 | if not x.shape[0]: 95 | continue 96 | 97 | # Compute conf 98 | x[:, 5:5+nc] *= x[:, 4:5] # conf = obj_conf * cls_conf 99 | 100 | # Box (center x, center y, width, height) to (x1, y1, x2, y2) 101 | box = self.xywh2xyxy(x[:, :4]) 102 | 103 | if not kpt_label: 104 | conf = x[:, 5:].max(axis=1, keepdims=True) 105 | j = np.argmax(x[:, 5:], axis=1).reshape(x[:, 5:].shape[0],-1) 106 | x = np.concatenate((box, conf, j), axis=1)[conf.ravel() > conf_thres] 107 | else: 108 | kpts = x[:, 6:] 109 | conf = x[:, 5:6].max(axis=1, keepdims=True) 110 | j = np.argmax(x[:, 5:6], axis=1).reshape(x[:, 5:6].shape[0],-1) 111 | x = np.concatenate((box, conf, j, kpts), axis=1)[conf.ravel() > conf_thres] 112 | 113 | # Filter by class 114 | if classes is not None: 115 | x = x[(x[:, 5:6] == classes).any(1)] 116 | 117 | # Check shape 118 | n = x.shape[0] # number of boxes 119 | if not n: # no boxes 120 | continue 121 | elif n > max_nms: # excess boxes 122 | x = x[x[:, 4].argsort()[::-1][:max_nms]] # sort by confidence 123 | 124 | # Batched NMS 125 | c = x[:, 5:6] * (0 if agnostic else max_wh) # classes 126 | boxes, scores = x[:, :4] + c, x[:, 4] # boxes (offset by class), scores 127 | 128 | boxes,scores,i = self.nms(boxes, scores, iou_thres) # NMS 129 | 130 | #if len(i) > max_det: # limit detections 131 | # i = i[:max_det] 132 | # boxes = boxes[:max_det] 133 | # scores = scores[:max_det] 134 | 
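# keep only the detections that survived NMS for this image; with kpt_label=True each row of x is
# [x1, y1, x2, y2, conf, cls] followed by 17 keypoints as (x, y, conf), i.e. 57 values per detection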
135 | output[xi] = x[i] 136 | if (time.time() - t) > time_limit: 137 | print(f'WARNING: NMS time limit {time_limit}s exceeded') 138 | break # time limit exceeded 139 | return output 140 | 141 | def predict(self,x): 142 | with torch.no_grad(): 143 | return self.model(x).numpy()#torch.from_numpy(x)).numpy() 144 | 145 | def preprocess(self, img, img_size=960): 146 | '''Make the image squared and prepare the tensor as [B,C,H,W]''' 147 | h,w,c = img.shape 148 | img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) 149 | if h!=w: 150 | max_size=max(h,w) 151 | img_sqr = np.zeros((max_size, max_size,c), dtype=np.uint8) 152 | img_sqr[0:h,0:w],img = img[:],img_sqr 153 | x = cv2.resize(img, (img_size, img_size), interpolation=cv2.INTER_LINEAR) 154 | return x 155 | #x = np.expand_dims((x.transpose(2,0,1) / 255.0).astype(np.float32), axis=0) 156 | #return np.ascontiguousarray(x) 157 | 158 | def postprocess(self, output, tensor_shape, img_shape, conf_thres=0.15, iou_thres=0.45, nc=1, nkpt=17): 159 | '''Run NMS to filter bboxes & return detections with keypoints''' 160 | detections = self.non_max_suppression_kpt( 161 | output, conf_thres,iou_thres, nc=nc, nkpt=nkpt, kpt_label=True) 162 | 163 | # targets in the format 164 | # [det_index, int(class_id), [x1,y1,x2,y2], conf, [x0,y0,conf0...x16,y16,conf16]] 165 | targets = [] 166 | for i,det in enumerate(detections): 167 | bboxes,scores,classes,keypoints = det[:, :4],det[:, 4], det[:, 5],det[:,6:] 168 | bboxes = bboxes.clip(0,tensor_shape[0]) 169 | # rescale bboxes and poses 170 | # fix the distortion provoked by preprocess 171 | tw,th,ih,iw = *tensor_shape, *img_shape 172 | bboxes = bboxes / [tw,th,tw,th] * [iw,ih,iw,ih] 173 | keypoints = (keypoints / ([tw,th,1]*nkpt)) * ([ih,iw,1]*nkpt) 174 | dets = [] 175 | for index, (box, conf, cls, pose) in enumerate(zip(bboxes,scores,classes,keypoints)): 176 | dets.append([index, int(cls), box.astype(np.int32), conf, pose]) 177 | if len(dets)>0: targets.append(dets) 178 | return targets 179 | 180 | def mosaic2batch(self, data, tile_width=960, tile_height=540, img_size=960): 181 | mosaic = self.jpeg.decode(data) 182 | h,w,c = mosaic.shape 183 | 184 | max_size=max(tile_width, tile_height) 185 | min_size=min(tile_width, tile_height) 186 | num_pixels = max_size*max_size*3 187 | batch_size = h//tile_height * w//tile_width 188 | batch = torch.zeros(max_size*max_size*c * batch_size, dtype=torch.float32) 189 | ttl_pixels=0 190 | # build a batch out of the tiles 191 | for row in range(h//tile_height): 192 | for col in range(w//tile_width): 193 | pw,ph=col*tile_width,row*tile_height 194 | tile = mosaic[ph:ph+tile_height, pw:pw+tile_width] 195 | 196 | tile = self.preprocess(tile, img_size) 197 | 198 | batch[ttl_pixels:ttl_pixels + num_pixels] = torch.from_numpy(tile).ravel() 199 | ttl_pixels = ttl_pixels + num_pixels 200 | 201 | batch = batch.reshape(-1,max_size,max_size,c) 202 | batch = (batch / 255.0).float() # to FP32 203 | batch = batch.permute(0,3,1,2) # NHWC --> NCHW 204 | 205 | return batch 206 | 207 | ## SAGEMAKER FUNCTIONS ## 208 | # The following functions are invoked by SageMaker to load the model, 209 | # receive the payload, invoke the model and prepare the output 210 | def model_fn(model_dir): 211 | return Detector(os.path.join(model_dir, 'model.pt')) 212 | 213 | def input_fn(data, content_type, context=None): 214 | if content_type != 'image/jpeg': 215 | raise Exception(f'Invalid data type. 
Expected image/jpeg, got {content_type}') 216 | 217 | try: 218 | custom_attributes = context.get_request_header(0,'X-Amzn-SageMaker-Custom-Attributes') 219 | params = json.loads(custom_attributes) 220 | return data, params 221 | except Exception as e: 222 | raise Exception(f"You need to pass Custom Attributes") 223 | 224 | def output_fn(predictions, accept, context=None): 225 | if accept!='application/x-npy': 226 | raise Exception(f'Invalid data type. Expected application/x-npy, got {accept}') 227 | 228 | with io.BytesIO() as b: 229 | data = [] 230 | for i,objs in enumerate(predictions): 231 | for obj_id, obj_cls, bbox, conf, pose_kpts in objs: 232 | data.append(np.hstack([ 233 | [i, obj_id, obj_cls], 234 | bbox.astype(np.float32), 235 | pose_kpts 236 | ])) 237 | np.save(b, np.vstack(data)) 238 | b.seek(0) 239 | return b.read() 240 | 241 | def predict_fn(data, detector, context=None): 242 | mosaic,params = data 243 | # adjust img_size accordinly with the input shape of your model 244 | img_size=960 245 | tile_width=params.get('tile_width', 960) 246 | tile_height=params.get('tile_height', 540) 247 | conf_thres=params.get('conf_thres', 0.15) 248 | iou_thres=params.get('iou_thres', 0.45) 249 | 250 | x = detector.mosaic2batch(mosaic, tile_width, tile_height, img_size) 251 | out = detector.predict(x) 252 | detections = detector.postprocess(out, x.shape[2:], (tile_height, tile_width), conf_thres, iou_thres) 253 | return detections 254 | -------------------------------------------------------------------------------- /tutorials/02_ObjectTrackingSageMakerGStreamer/code_02/pipeline.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | import os 4 | import sys 5 | import time 6 | import argparse 7 | sys.path.append("/opt/ml/processing/input/libs/bytetrack") 8 | sys.path.append("/opt/ml/processing/input/libs") 9 | from smcvpipeline import SageMakerCVPipeline 10 | 11 | if __name__ == '__main__': 12 | parser = argparse.ArgumentParser( 13 | prog = 'CV Pipeline for ML', 14 | description = 'Video streaming processing with ML') 15 | 16 | parser.add_argument('-k','--enable-tracking', type=bool, help="Enable object tracking", default=True) 17 | parser.add_argument('-j','--jpeg-quality', type=int, help="Quality of the jpeg mosaic sent to SM endpoint", default=90) 18 | parser.add_argument('-e','--endpoint-name', type=str, help="SageMaker Endpoint Name", required=True) 19 | parser.add_argument('-r','--region-name', type=str, help="Region Name", default="us-east-1") 20 | parser.add_argument('-p','--preds-per-output-file', type=int, help="Number of predictions per output file", default=150) 21 | parser.add_argument('-w','--num-workers', type=int, help="Number of workers that will invoke the model", default=5) 22 | parser.add_argument('-n','--cams-per-row', type=int, help="Number of cams per row", default=2) 23 | parser.add_argument('-m','--max-cams-per-batch', type=int, help="Max number of cams per batch", default=4) 24 | parser.add_argument('-i','--input-shape', type=int, help="Resized resolution of the feeds", nargs=2, default=[1280, 720]) 25 | parser.add_argument('-t','--tile-size', type=int, help="Shape of each tile in the mosaic", nargs=2, default=[960, 540]) 26 | parser.add_argument('-c','--conf-thres', type=float, help="Confidence threshold of the object", default=0.15) 27 | parser.add_argument('-o','--iou-thres', type=float, help="Confidence threshold of the IoU 
", default=0.45) 28 | 29 | args = parser.parse_args() 30 | print(args) 31 | 32 | cams_per_row=args.cams_per_row 33 | max_cams_per_batch=args.max_cams_per_batch 34 | raw_width,raw_height=args.input_shape 35 | input_dir = "/opt/ml/processing/input" 36 | output_dir = "/opt/ml/processing/output" 37 | failure_file = output_dir + "/failure" 38 | pipeline = None 39 | if not os.path.isdir(output_dir): os.makedirs(output_dir) 40 | 41 | # Tracking requires sequential processing, that's why we can have only 1 active worker 42 | if args.enable_tracking and args.num_workers > 1: 43 | print(f"Tracking enabled. Setting num_workers to 1. Current: {args.num_workers}") 44 | args.num_workers = 1 45 | 46 | try: 47 | # list the pipes in the input dir 48 | # parse the manifest file and extract all file names 49 | file_names = [f.strip() for f in open(f'{input_dir}/data/input-1-manifest', 'r').readlines()[1:]] # skip first line 50 | num_batches = ((len(file_names)-1)//max_cams_per_batch) + 1 51 | print(f"Num files: {len(file_names)}, Num batches: {num_batches}") 52 | 53 | for batch in range(num_batches): 54 | start = batch * max_cams_per_batch 55 | end = start + min(max_cams_per_batch,len(file_names[start:])) 56 | 57 | mosaic,sources = [],[] 58 | for i,s3_path in enumerate(file_names[start:end]): 59 | # convert the s3 path to the expected by awss3src 60 | s3_path = s3_path.replace('s3://', f's3://{args.region_name}/') 61 | 62 | xoff,yoff = raw_width * (i%cams_per_row), raw_height * (i//cams_per_row) 63 | mosaic.append(f"sink_{i}::xpos={xoff} sink_{i}::ypos={yoff}") 64 | sources.append(f"\n awss3src uri={s3_path} ! decodebin ! videoconvert ! video/x-raw,format=(string)BGR ") 65 | sources.append(f"! videoscale method=0 add-borders=false ! video/x-raw,width={raw_width},height={raw_height} ! queue2 max-size-buffers=1000 ! comp.sink_{i}") 66 | mosaic,sources = " ".join(mosaic), "".join(sources) 67 | pre_pipeline = f""" 68 | compositor name=comp {mosaic} 69 | ! videoconvert ! video/x-raw,format=BGR ! fakesink name=input {sources} 70 | """ 71 | print(pre_pipeline) 72 | params = ( 73 | pre_pipeline, args.endpoint_name, args.region_name, max_cams_per_batch, 74 | output_dir, args.tile_size, args.conf_thres, args.iou_thres, 75 | args.num_workers, args.preds_per_output_file, args.jpeg_quality, 76 | args.enable_tracking 77 | ) 78 | pipeline = SageMakerCVPipeline(*params) 79 | t = time.time() 80 | pipeline.start() 81 | pipeline.join() 82 | print(f"Total time: {time.time()-t}") 83 | except Exception as e: 84 | print(f"ERROR: {sys.exc_info()[0]} {e}") 85 | with open(failure_file, 'w') as f: 86 | f.write(str(e)) 87 | raise e 88 | finally: 89 | if not pipeline is None and pipeline.is_running(): # should not happen 90 | print('Stopping pipeline...') 91 | pipeline.stop() 92 | pipeline.join() 93 | -------------------------------------------------------------------------------- /tutorials/02_ObjectTrackingSageMakerGStreamer/container_01/Dockerfile: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | ARG REGION_NAME=us-east-1 4 | ARG ACCOUNT_ID=763104351884 5 | FROM $ACCOUNT_ID.dkr.ecr.$REGION_NAME.amazonaws.com/pytorch-inference-neuron:1.10.2-neuron-py37-sdk1.19.0-ubuntu18.04 6 | RUN echo '\ 7 | --- /opt/conda/lib/python3.7/site-packages/sagemaker_inference/transformer.py 2022-08-23 17:26:42.000000000 +0000\n\ 8 | +++ /opt/conda/lib/python3.7/site-packages/sagemaker_inference/transformer_.py 2022-12-07 13:15:09.753360938 +0000\n\ 9 | @@ -250,9 +250,9 @@\n\ 10 | (response_data, content_type)\n\ 11 | \n\ 12 | """\n\ 13 | - data = self._run_handler_function(self._input_fn, *(input_data, content_type))\n\ 14 | - prediction = self._run_handler_function(self._predict_fn, *(data, model))\n\ 15 | - result = self._run_handler_function(self._output_fn, *(prediction, accept))\n\ 16 | + data = self._run_handler_function(self._input_fn, *(input_data, content_type, context))\n\ 17 | + prediction = self._run_handler_function(self._predict_fn, *(data, model, context))\n\ 18 | + result = self._run_handler_function(self._output_fn, *(prediction, accept, context))\n\ 19 | return result\n\ 20 | \n\ 21 | def _run_handler_function(self, func, *argv):' > /tmp/transformer.py.patch 22 | 23 | RUN patch /opt/conda/lib/python3.7/site-packages/sagemaker_inference/transformer.py /tmp/transformer.py.patch 24 | RUN wget --quiet --output-document=/tmp/libjpeg.deb https://netactuate.dl.sourceforge.net/project/libjpeg-turbo/2.1.4/libjpeg-turbo-official_2.1.4_amd64.deb && dpkg -i /tmp/libjpeg.deb && rm -f /tmp/libjpeg.deb 25 | RUN pip3 install PyTurboJPEG 26 | -------------------------------------------------------------------------------- /tutorials/02_ObjectTrackingSageMakerGStreamer/container_02/Dockerfile: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | FROM ubuntu:22.04 4 | 5 | ARG DEBIAN_FRONTEND=noninteractive 6 | 7 | ENV TZ UTC 8 | ENV LANG=C.UTF-8 9 | ENV LC_ALL=C.UTF-8 10 | ENV PYTHONDONTWRITEBYTECODE=1 11 | ENV PYTHONUNBUFFERED=1 12 | ENV PYTHONIOENCODING=UTF-8 13 | 14 | # install required packages 15 | RUN apt-get update && \ 16 | apt-get dist-upgrade -y && \ 17 | apt-get install -y --no-install-recommends \ 18 | libchromaprint1 libgles2 libgmp10 libgme0 libxkbcommon0 \ 19 | libbs2b0 libgsl27 libwavpack1 libxslt1.1 gstreamer1.0-libav \ 20 | libavformat58 libwayland-server0 libwayland-client0 libharfbuzz-icu0 librtmp1 \ 21 | libtheora0 libxtst6 libmpg123-0 libxcomposite1 libgl1 \ 22 | libmjpegutils-2.1-0 python3-numpy libsrt1.4-openssl libmpeg2-4 libvulkan1 \ 23 | libxdamage1 libjpeg8 mjpegtools libpng16-16 wayland-protocols \ 24 | libcap2 libofa0 udev libgstreamer-plugins-base1.0-dev libcups2 \ 25 | libgstreamer1.0-dev libopenexr25 libmfx1 libde265-0 libgirepository1.0-dev \ 26 | libfdk-aac2 libavcodec58 git libunwind8 xdg-dbus-proxy \ 27 | libtwolame0 mesa-utils libtag1v5 libaa1 libgles1 \ 28 | ffmpeg liborc-0.4-0 libgraphene-1.0-dev libwebpdemux2 libsoup2.4-1 \ 29 | build-essential libsm6 libglu1 libwebrtc-audio-processing1 liba52-0.7.4 \ 30 | libva2 libwayland-cursor0 libcurl3-gnutls libvisual-0.4-0 libbz2-1.0 \ 31 | libvpx7 libdv4 libatspi2.0-0 liblilv-0-0 gstreamer1.0-plugins-bad \ 32 | gstreamer1.0-plugins-good gstreamer1.0-plugins-ugly libdvdnav4 libssl3 libgsm1 \ 33 | libwoff1 libwebp7 x264 libopenblas-dev libxrandr-dev \ 34 | freeglut3-dev intel-media-va-driver-non-free ladspa-sdk gfortran libvo-aacenc0 \ 35 | python3-opencv libcaca0 python3-opengl libsbc1 libatk1.0-0 \ 36 | libsoundtouch1 libsndfile1 python3 libgudev-1.0-0 liblcms2-2 \ 37 | libzvbi0 libatk-bridge2.0-0 libass9 libgbm1 libglib2.0-0 \ 38 | libaom3 bubblewrap libdw1 libseccomp2 libepoxy0 \ 39 | libavutil56 glibc-tools libmodplug1 libshout3 libwebpmux3 \ 40 | libfaad2 iso-codes libgcrypt20 xvfb libspandsp2 \ 41 | libvorbis0a libfaac0 libmpcdec6 libopus0 libsrtp2-1 \ 42 | libx264-163 gstreamer1.0-rtsp python3-pip ca-certificates libva-wayland2 \ 43 | gcc libwildmidi2 libpango-1.0-0 libflite1 libdca0 \ 44 | libopenjp2-7 libzbar0 libspeex1 libkate1 pkg-config \ 45 | libx264-dev libopencore-amrwb0 gstreamer1.0-tools libxv1 gstreamer1.0-plugins-base \ 46 | libcairo2-dev python3-gst-1.0 wget cmake libwayland-egl1 \ 47 | libavfilter7 libegl1 libdvdread8 libvo-amrwbenc0 libogg0 \ 48 | librsvg2-2 libopencore-amrnb0 libx265-199 libatk-adaptor sudo \ 49 | libmp3lame0 python3-dev libssl-dev \ 50 | && rm -rf /var/lib/apt/lists/* 51 | 52 | # download and install libjpeg turbo 53 | RUN wget -qO libjpeg-turbo.deb \ 54 | https://deac-ams.dl.sourceforge.net/project/libjpeg-turbo/2.1.4/libjpeg-turbo-official_2.1.4_amd64.deb && \ 55 | dpkg -i libjpeg-turbo.deb && \ 56 | rm -f libjpeg-turbo.deb 57 | 58 | # download and install aws plugins for gstreamer 59 | RUN wget -qO sh.rustup.rs https://sh.rustup.rs && \ 60 | bash sh.rustup.rs -q -y --profile default && \ 61 | . "$HOME/.cargo/env" && \ 62 | rm -f sh.rustup.rs && \ 63 | cargo install cargo-c && \ 64 | git clone -b gstreamer-1.21.1 https://gitlab.freedesktop.org/gstreamer/gst-plugins-rs.git && \ 65 | cd gst-plugins-rs && \ 66 | cargo cbuild -p gst-plugin-aws --libdir=/usr/lib/x86_64-linux-gnu && \ 67 | cargo cinstall -p gst-plugin-aws --libdir=/usr/lib/x86_64-linux-gnu && \ 68 | cd .. 
&& rm -rf gst-plugins-rs 69 | 70 | # create a user 71 | RUN mkdir -p /opt/ml/processing/output /opt/ml/processing/code 72 | RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers 73 | RUN groupadd --gid 500 --non-unique ec2-user 74 | RUN adduser --uid 500 --disabled-password --gecos '' --ingroup ec2-user ec2-user 75 | RUN usermod -a -G sudo,video,ec2-user ec2-user 76 | ENV PATH="$PATH:/home/ec2-user/.local/bin" 77 | RUN chown -R ec2-user:ec2-user /opt/ml 78 | 79 | USER ec2-user 80 | WORKDIR /opt/ml/processing/code 81 | # install some required python packages 82 | RUN pip3 install --upgrade pip 83 | RUN pip3 install pycairo PyGObject PyTurboJPEG boto3 Cython 84 | # torch is required for ByteTrack 85 | RUN pip3 install torch torchvision thop loguru scikit-learn lap cython_bbox 86 | 87 | RUN echo "#!/bin/sh\n/usr/bin/xvfb-run -a \$@\n" > /home/ec2-user/entrypoint.sh && chmod +x /home/ec2-user/entrypoint.sh 88 | 89 | 90 | ENTRYPOINT [ "/home/ec2-user/entrypoint.sh"] 91 | -------------------------------------------------------------------------------- /tutorials/02_ObjectTrackingSageMakerGStreamer/libs/cvpipeline.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | import gi 4 | import threading 5 | import numpy as np 6 | gi.require_version('Gst', '1.0') 7 | from gi.repository import Gst 8 | 9 | class CVPipeline(threading.Thread): 10 | '''Base class for a gstreamer pipeline''' 11 | def __init__(self, pipeline): 12 | threading.Thread.__init__(self) 13 | self.running = False 14 | self.gst_pipeline = pipeline 15 | 16 | def stop(self): 17 | self.running = False 18 | 19 | def is_running(self): 20 | return self.running 21 | 22 | def run(self): 23 | '''Invoked as a thread to execute the gstreamer loop''' 24 | self.running = True 25 | 26 | Gst.init(None) 27 | self.pipeline = Gst.parse_launch(self.gst_pipeline) 28 | self.pipeline.get_by_name('input').get_static_pad('sink').add_probe( 29 | Gst.PadProbeType.BUFFER, 30 | self.__on_frame_probe__ 31 | ) 32 | self.pipeline.set_state(Gst.State.PLAYING) 33 | self.bus = self.pipeline.get_bus() 34 | try: 35 | while self.running: 36 | msg = self.bus.timed_pop_filtered( 37 | Gst.SECOND, 38 | Gst.MessageType.EOS | Gst.MessageType.ERROR 39 | ) 40 | if msg: 41 | text = msg.get_structure().to_string() if msg.get_structure() else '' 42 | msg_type = Gst.message_type_get_name(msg.type) 43 | print(f'{msg.src.name}: [{msg_type}] {text}') 44 | self.stop() 45 | finally: 46 | self.pipeline.set_state(Gst.State.NULL) 47 | 48 | def __on_frame_probe__(self, pad, info): 49 | '''Handler that reads a buffer from gstreamer and loads a numpy rgb frame''' 50 | buf = info.get_buffer() 51 | caps = pad.get_current_caps() 52 | caps_structure = caps.get_structure(0) 53 | height, width = caps_structure.get_value('height'), caps_structure.get_value('width') 54 | pixel_bytes = 3 55 | is_mapped, map_info = buf.map(Gst.MapFlags.READ) 56 | if is_mapped: 57 | try: 58 | image_array = np.ndarray( 59 | (height, width, pixel_bytes), dtype=np.uint8, buffer=map_info.data 60 | ).copy() 61 | self.process_frame(image_array, buf.pts) 62 | finally: 63 | buf.unmap(map_info) 64 | 65 | return Gst.PadProbeReturn.OK 66 | 67 | def process_frame(self, frame, timestamp): 68 | pass 69 | -------------------------------------------------------------------------------- /tutorials/02_ObjectTrackingSageMakerGStreamer/libs/smcvpipeline.py: 
-------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | import io 4 | import os 5 | import time 6 | import json 7 | import boto3 8 | import queue 9 | import tracker 10 | import threading 11 | import cvpipeline 12 | import numpy as np 13 | from turbojpeg import TurboJPEG 14 | 15 | class SageMakerCVPipeline(cvpipeline.CVPipeline): 16 | def __init__(self, pipeline, endpoint_name, region_name, max_cams_per_batch, output_dir, 17 | tile_size=(960,540), conf_thres=0.15, iou_thres=0.45, max_workers=5, 18 | preds_per_file=100, jpeg_quality=90, enable_tracking=False): 19 | super().__init__(pipeline) 20 | 21 | self.endpoint_name = endpoint_name 22 | self.jpeg = TurboJPEG() 23 | self.jpeg_quality = jpeg_quality 24 | self.frames = queue.Queue() 25 | self.sm_client = boto3.client('sagemaker-runtime', region_name=region_name) 26 | self.endpoint_name = endpoint_name 27 | self.output_dir = output_dir 28 | self.tile_size = tile_size 29 | self.params = json.dumps({ 30 | "tile_width": tile_size[0],#960, 31 | "tile_height": tile_size[1],#540, 32 | "conf_thres": conf_thres, #0.15, 33 | "iou_thres": iou_thres, #0.45 34 | }) 35 | self.cache = [] 36 | self.workers = [threading.Thread(target=self.__worker__, args=(i,)) for i in range(max_workers)] 37 | self.trackers = [tracker.Tracker() for i in range(max_cams_per_batch)] if enable_tracking else None 38 | self.preds_per_file = preds_per_file 39 | self.cache_lock = threading.Lock() 40 | self.cache_counter = 0 41 | if not os.path.isdir(self.output_dir): os.mkdir(self.output_dir) 42 | 43 | def run(self): 44 | self.running = True 45 | # initialize workers 46 | for w in self.workers: w.start() 47 | # run gstreamer main loop 48 | super(SageMakerCVPipeline, self).run() 49 | # wait for all workers to finalize 50 | for w in self.workers: w.join() 51 | # dump to disk the last pending predictions 52 | self.dump_cache(True) 53 | 54 | def dump_cache(self, flush=False): 55 | '''Saves predictions to disk as compressed numpy files''' 56 | if flush or len(self.cache) >= self.preds_per_file: 57 | dump = False 58 | self.cache_lock.acquire() 59 | # check if there are predictions to be flushed 60 | if len(self.cache) > 0: 61 | cache,self.cache,pred_file_id,dump = self.cache,[],self.cache_counter,True 62 | self.cache_counter += 1 63 | self.cache_lock.release() 64 | if dump: 65 | # ok. there are predictions, save to a file 66 | print(f'Dumping {len(cache)}... 
') 67 | np.savez(os.path.join(self.output_dir, f'pred_{pred_file_id:05d}.npz'), cache) 68 | 69 | def __worker__(self, worker_id): 70 | '''A worker will keep listening to a queue for frames to process''' 71 | while self.running: 72 | if self.frames.empty(): 73 | time.sleep(0.1) 74 | else: 75 | # alright, there is a new frame 76 | frame,timestamp = self.frames.get() 77 | with io.BytesIO() as resp: 78 | # invoke the endpoint and keep the predictions 79 | resp.write(self.sm_client.invoke_endpoint( 80 | EndpointName=self.endpoint_name, 81 | Body=frame, 82 | ContentType="image/jpeg", 83 | Accept="application/x-npy", 84 | CustomAttributes=self.params 85 | )['Body'].read()) 86 | resp.seek(0) 87 | # resp format: [cam_id, obj_id, obj_cls, conf, bbox(x1,y1,x2,y2) pose(x1,y1,conf1,...,x17,y17,conf17)] 88 | preds = np.load(resp).astype(np.object) 89 | data = [timestamp, preds, []] 90 | if not self.trackers is None: 91 | dets = [] 92 | for pred in preds: 93 | cam_id,conf,bbox = pred[0],pred[3],pred[4:8] 94 | dets.append(np.hstack((bbox, [conf]))) 95 | dets = np.array(dets).astype(np.object) 96 | data[2].append(self.trackers[int(cam_id)].step(dets, self.tile_size)) 97 | 98 | self.cache_lock.acquire() 99 | self.cache.append(data) 100 | self.cache_lock.release() 101 | self.dump_cache() 102 | 103 | def process_frame(self, frame, timestamp): 104 | '''Concrete implementation of frame processing''' 105 | # a mosaic will be encoded as jpeg outside gstreamer for max performance 106 | frame = self.jpeg.encode(frame, quality=self.jpeg_quality) 107 | self.frames.put((frame,timestamp)) 108 | -------------------------------------------------------------------------------- /tutorials/02_ObjectTrackingSageMakerGStreamer/libs/tracker.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | import cv2 4 | import sys 5 | import numpy as np 6 | from yolox.tracker.byte_tracker import BYTETracker 7 | 8 | # Helper class that emulates argparse 9 | class AllMyFields: 10 | def __init__(self, dictionary): 11 | for k, v in dictionary.items(): 12 | setattr(self, k, v) 13 | 14 | class Tracker(object): 15 | def __init__(self, frame_rate=25, track_tresh=0.25, track_buffer=30, match_tresh=0.5, min_box_area=10): 16 | self.args = AllMyFields({ 17 | 'track_thresh': track_tresh, 18 | 'track_buffer': track_buffer, 19 | 'match_thresh': match_tresh, 20 | 'mot20': False, 21 | 'min_box_area': min_box_area 22 | }) 23 | self.tracker = BYTETracker(self.args, frame_rate=frame_rate) 24 | self.online_targets = None 25 | 26 | def render(self, frame, objects): 27 | '''Render BBoxes & ID to an image''' 28 | for obj_id,xyxy,score in objects: 29 | x1,y1,x2,y2 = xyxy 30 | cv2.rectangle(frame, (x1,y1), (x2,y2), (255,255,0), 3) 31 | cv2.putText(frame, f'{obj_id}', (x1+50,y1+50), cv2.FONT_HERSHEY_SIMPLEX, 32 | 2, (0, 0, 255), 2, cv2.LINE_AA) 33 | 34 | def step(self, detections, img_size=(960,540)): 35 | '''Update the tracker based on predictions 36 | Detections[ [x1,y1,x2,y2,conf] ] 37 | ''' 38 | self.online_targets = self.tracker.update(detections, [img_size[1], img_size[0]], [img_size[1], img_size[0]]) 39 | results = [] 40 | for t in self.online_targets: 41 | tlwh = t.tlwh 42 | tid = t.track_id 43 | vertical = tlwh[2] / tlwh[3] > 1.6 44 | if tlwh[2] * tlwh[3] > self.args.min_box_area and not vertical: 45 | x1,y1,bw,bh = tlwh.astype(np.int32) 46 | xyxy = [x1,y1,x1+bw,y1+bh] 47 | # obj_id, bbox, conf 48 | results.append((tid, xyxy, t.score)) 49 | return results 50 | -------------------------------------------------------------------------------- /tutorials/03_QuestionAnsweringMachine/01_QuestionAnsweringWithT5SSM.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f6cb03f7", 6 | "metadata": {}, 7 | "source": [ 8 | "# FAQ Bot - Q&A model, trained using pairs of questions and answers\n", 9 | "\n", 10 | "Fine tune a large language model with a list of questions and answers. This approach is called Closed Book Q&A because the model doesn't require context and is capable of answering variations of the questions you provide in your dataset.\n", 11 | "\n", 12 | "This is an evolution of classic ChatBots because LLMs like T5 can disambiguate and generalize better than the old technologies we find in these ChatBot services.\n", 13 | "\n", 14 | "For that purpose you'll use a **[T5 SMALL SSM ~80MParams](https://huggingface.co/google/t5-small-ssm)** model, accelerated by a trn1 instance ([AWS Trainium](https://aws.amazon.com/machine-learning/trainium/)), running on [Amazon SageMaker](https://aws.amazon.com/sagemaker/).\n", 15 | "\n", 16 | "You can set the hyperparameter **--model_name** to change the model size. This solution works well with: \n", 17 | " - t5-small-ssm\n", 18 | " - t5-large-ssm\n", 19 | " \n", 20 | "If you need to fine tune **t5-3b-ssm, t5-11b-ssm or t5-xxl-ssm**, you need **FSDP**, which is out of the scope of this tutorial.\n", 21 | "\n", 22 | "You can see the results of the predictions at the end of this notebook. You'll notice the questions sent to the model are not in the training dataset. 
They are just variations of the questions used to fine tune the model.\n", 23 | "\n", 24 | "The dataset is the content of all **AWS FAQ** pages, downloaded from: https://aws.amazon.com/faqs/\n", 25 | "\n", 26 | "This notebook was tested with **Python 3.8+**\n", 27 | "\n", 28 | ">**If you have never before done a SageMaker training job with Trn1, you'll need to request a service quota increase. This can take a few hours, so it's best to make the request early so you don't have to wait.**\n", 29 | "\n", 30 | "You can edit this URL to go directly to the page to request the increase:\n", 31 | "\n", 32 | "`https://<region>.console.aws.amazon.com/servicequotas/home/services/sagemaker/quotas/L-79A1FE57`" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "id": "5cb78690", 38 | "metadata": {}, 39 | "source": [ 40 | "## 1) Install some dependencies\n", 41 | "You need a more recent version of the **sagemaker** Python library. After this install you'll need to restart the kernel." 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "id": "103ae05a", 48 | "metadata": { 49 | "scrolled": true 50 | }, 51 | "outputs": [], 52 | "source": [ 53 | "# add --force-reinstall if it fails to resolve dependencies\n", 54 | "%pip install -U sagemaker" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "id": "d4a951ec", 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "import sagemaker\n", 65 | "print(sagemaker.__version__)\n", 66 | "if not sagemaker.__version__ >= \"2.146.0\": print(\"You need to upgrade or restart the kernel if you already upgraded\")\n", 67 | "\n", 68 | "sess = sagemaker.Session()\n", 69 | "role = sagemaker.get_execution_role()\n", 70 | "bucket = sess.default_bucket()\n", 71 | "region = sess.boto_region_name\n", 72 | "\n", 73 | "print(f\"sagemaker role arn: {role}\")\n", 74 | "print(f\"sagemaker bucket: {bucket}\")\n", 75 | "print(f\"sagemaker session region: {region}\")" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "id": "ea60dba6", 81 | "metadata": {}, 82 | "source": [ 83 | "## 2) Visualize and upload the dataset\n", 84 | "Take note of the S3 URI here in case you get interrupted; no need to re-upload later." 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 5, 90 | "id": "855c1822", 91 | "metadata": { 92 | "scrolled": true 93 | }, 94 | "outputs": [ 95 | { 96 | "data": { 97 | "text/html": [ 98 | "
\n", 99 | "\n", 112 | "\n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | "
servicequestionanswers
0/ec2/autoscaling/faqs/What is Amazon EC2 Auto Scaling?Amazon EC2 Auto Scaling is a fully managed ser...
1/ec2/autoscaling/faqs/When should I use Amazon EC2 Auto Scaling vs. ...You should use AWS Auto Scaling to manage scal...
2/ec2/autoscaling/faqs/How is Predictive Scaling Policy different fro...Predictive Scaling Policy brings the similar p...
3/ec2/autoscaling/faqs/What are the benefits of using Amazon EC2 Auto...Amazon EC2 Auto Scaling helps to maintain your...
4/ec2/autoscaling/faqs/What is fleet management and how is it differe...If your application runs on Amazon EC2 instanc...
\n", 154 | "
" 155 | ], 156 | "text/plain": [ 157 | " service question \\\n", 158 | "0 /ec2/autoscaling/faqs/ What is Amazon EC2 Auto Scaling? \n", 159 | "1 /ec2/autoscaling/faqs/ When should I use Amazon EC2 Auto Scaling vs. ... \n", 160 | "2 /ec2/autoscaling/faqs/ How is Predictive Scaling Policy different fro... \n", 161 | "3 /ec2/autoscaling/faqs/ What are the benefits of using Amazon EC2 Auto... \n", 162 | "4 /ec2/autoscaling/faqs/ What is fleet management and how is it differe... \n", 163 | "\n", 164 | " answers \n", 165 | "0 Amazon EC2 Auto Scaling is a fully managed ser... \n", 166 | "1 You should use AWS Auto Scaling to manage scal... \n", 167 | "2 Predictive Scaling Policy brings the similar p... \n", 168 | "3 Amazon EC2 Auto Scaling helps to maintain your... \n", 169 | "4 If your application runs on Amazon EC2 instanc... " 170 | ] 171 | }, 172 | "execution_count": 5, 173 | "metadata": {}, 174 | "output_type": "execute_result" 175 | } 176 | ], 177 | "source": [ 178 | "import pandas as pd\n", 179 | "df = pd.read_csv('train.csv.gz', compression='gzip', sep=';')\n", 180 | "df.head()" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "id": "df8e1b39", 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "s3_uri = sess.upload_data(path='train.csv.gz', key_prefix='datasets/aws-faq/train')\n", 191 | "print(s3_uri)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "id": "92dd77a3", 197 | "metadata": {}, 198 | "source": [ 199 | "## 3) Prepare the train/inference script" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "id": "b28590cf", 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "import os\n", 210 | "if not os.path.isdir('src'): os.mkdir('src')" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "id": "8c266667", 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "## requirements.txt will be used by SageMaker to install\n", 221 | "## additional Python packages" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "id": "d3b08830", 228 | "metadata": {}, 229 | "outputs": [], 230 | "source": [ 231 | "%%writefile src/requirements.txt\n", 232 | "torchvision\n", 233 | "transformers==4.27.4" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "id": "27761ac8", 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "!pygmentize src/question_answering.py" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "id": "81e8b7c0", 249 | "metadata": {}, 250 | "source": [ 251 | "## 4) Kick-off our fine tuning job on Amazon SageMaker\n", 252 | "We need to create a SageMaker Estimator first and then invoke **.fit**. \n", 253 | "\n", 254 | "Please, notice we're passing the parameter **checkpoint_s3_uri**. This is important because NeuronSDK will spend some time compiling the model before fine tuning it. The compiler saves the model to cache files and, with this param, the files will be uploaded to **S3**. So, next time we run a job, NeuronSDK can just load back the cache files and start training immediately.\n", 255 | "\n", 256 | "When training for the first time, the training job takes ~9 hours to process all 60 Epochs on an **trn1.32xlarge**.\n", 257 | "\n", 258 | "If you need to wait for a quota increase like I did. When you come back, run cell 2 to setup the sagemaker session and S3 uris, etc. Then run the below to get the process started." 
259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "id": "3d7f8c86", 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "from sagemaker.pytorch import PyTorch\n", 269 | "\n", 270 | "# https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers\n", 271 | "image_name=\"pytorch-training-neuronx\"\n", 272 | "# We need SDK2.9+ to deal with T5s\n", 273 | "image_tag=\"1.13.0-neuronx-py38-sdk2.9.1-ubuntu20.04\"\n", 274 | "\n", 275 | "estimator = PyTorch(\n", 276 | " entry_point=\"question_answering.py\", # Specify your train script\n", 277 | " source_dir=\"src\",\n", 278 | " role=role,\n", 279 | " sagemaker_session=sess,\n", 280 | " instance_count=1,\n", 281 | " instance_type='ml.trn1.32xlarge', \n", 282 | " disable_profiler=True,\n", 283 | " output_path=f\"s3://{bucket}/output\",\n", 284 | " image_uri=f\"763104351884.dkr.ecr.{region}.amazonaws.com/{image_name}:{image_tag}\",\n", 285 | " \n", 286 | " # Parameters required to enable checkpointing\n", 287 | " # This is necessary for caching XLA HLO files and reducing training time next time \n", 288 | " checkpoint_s3_uri=f\"s3://{bucket}/checkpoints\",\n", 289 | " volume_size = 512,\n", 290 | " distribution={\n", 291 | " \"torch_distributed\": {\n", 292 | " \"enabled\": True\n", 293 | " }\n", 294 | " },\n", 295 | " hyperparameters={\n", 296 | " \"model-name\": \"t5-small-ssm\",\n", 297 | " \"lr\": 5e-5,\n", 298 | " \"num-epochs\": 60\n", 299 | " },\n", 300 | " metric_definitions=[\n", 301 | " {'Name': 'train:loss', 'Regex': 'loss:(\S+);'}\n", 302 | " ]\n", 303 | ")\n", 304 | "estimator.framework_version = '1.13.1' # workaround when using image_uri" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "id": "1ab4e4ab", 311 | "metadata": { 312 | "scrolled": true 313 | }, 314 | "outputs": [], 315 | "source": [ 316 | "estimator.fit({\"train\": s3_uri})" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "id": "e4f082b4", 322 | "metadata": {}, 323 | "source": [ 324 | "## 5) Deploy our model to a SageMaker endpoint\n", 325 | "Here, we're using a pre-defined HuggingFace model class+container to just load our fine-tuned model on a CPU-based instance: c6i.4xlarge (an Intel Xeon based machine).\n", 326 | "\n", 327 | ">If you're picking this up later, uncomment line 4, fill in the path to your model artifacts, comment line 9 out, and uncomment line 10."
328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "id": "d90af272", 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "# uncomment and modify this if you're picking this back up later and your training was successful.\n", 338 | "# you'll need to get the model S3 URI from SageMaker -> Training -> Training Jobs -> <your training job> -> Output -> S3 model artifact\n", 339 | "\n", 340 | "# pre_trained_model = YOUR_S3_PATH\n", 341 | "from sagemaker.huggingface.model import HuggingFaceModel\n", 342 | "\n", 343 | "# create Hugging Face Model Class\n", 344 | "huggingface_model = HuggingFaceModel(\n", 345 | " model_data=estimator.model_data, # path to your model and script\n", 346 | " # model_data=pre_trained_model, # path to your model and script\n", 347 | " role=role, # iam role with permissions to create an Endpoint\n", 348 | " transformers_version=\"4.26.0\", # transformers version used\n", 349 | " pytorch_version=\"1.13.1\", # pytorch version used\n", 350 | " py_version='py39', # python version used\n", 351 | " sagemaker_session=sess,\n", 352 | " \n", 353 | " # for production it is important to define vpc_config and use a vpc_endpoint\n", 354 | " #vpc_config={\n", 355 | " # 'Subnets': ['subnet-A-REPLACE', 'subnet-B-REPLACE'],\n", 356 | " # 'SecurityGroupIds': ['sg-A-REPLACE', 'sg-B-REPLACE']\n", 357 | " #} \n", 358 | ")" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "id": "23f9c74f", 365 | "metadata": {}, 366 | "outputs": [], 367 | "source": [ 368 | "predictor = huggingface_model.deploy(\n", 369 | " initial_instance_count=1,\n", 370 | " instance_type=\"ml.c6i.4xlarge\",\n", 371 | ")" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "id": "a6f12446", 377 | "metadata": {}, 378 | "source": [ 379 | "## 6) Run a quick test" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 17, 385 | "id": "8e393b0f", 386 | "metadata": {}, 387 | "outputs": [ 388 | { 389 | "name": "stdout", 390 | "output_type": "stream", 391 | "text": [ 392 | "Q: What is SageMaker?\n", 393 | "A: SageMaker is a new ML (ML) service that makes it easy to build, train, and deploy notebook data inference, and deploy and tune models of data. SageMaker helps you build, train, and manage your ML models, and deploy model data to build your models up and down\n", 394 | "\n", 395 | "Q: What is EC2 AutoScaling?\n", 396 | "A: Amazon-based EC2 instancess let you reduce your applications on multiple factors by allowing you to scale your application requirements and costs across multiple instances. Amazoning EC2 instances as a result of optimization in your applications, reducing the number of compute EC and the number of available instances to optimize your\n", 397 | "\n", 398 | "Q: What are the benefits of autoscaling?\n", 399 | "A: You can use autoscaling to help you optimize the capacity of your applications by allowing you to take advantage of your application across multiple applications. Autoscaling allows you to easily scale the number of your applications across multiple devices, and optimize your fleet up or down to 40%. 
You can also use auto\n", 400 | "\n", 401 | "CPU times: user 5.16 ms, sys: 0 ns, total: 5.16 ms\n", 402 | "Wall time: 1.66 s\n" 403 | ] 404 | } 405 | ], 406 | "source": [ 407 | "%%time\n", 408 | "questions = [\n", 409 | " \"What is SageMaker?\",\n", 410 | " \"What is EC2 AutoScaling?\",\n", 411 | " \"What are the benefits of autoscaling?\"\n", 412 | "]\n", 413 | "resp = predictor.predict({'inputs': questions})\n", 414 | "for q,a in zip(questions, resp['answers']):\n", 415 | " print(f\"Q: {q}\\nA: {a}\\n\")" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "id": "10cd7c8a", 421 | "metadata": {}, 422 | "source": [ 423 | "## 7) Clean up\n", 424 | "This will delete the model and the endpoint you created" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "id": "b3f1afb2", 431 | "metadata": {}, 432 | "outputs": [], 433 | "source": [ 434 | "predictor.delete_model()\n", 435 | "predictor.delete_endpoint()" 436 | ] 437 | } 438 | ], 439 | "metadata": { 440 | "kernelspec": { 441 | "display_name": "conda_pytorch_p39", 442 | "language": "python", 443 | "name": "conda_pytorch_p39" 444 | }, 445 | "language_info": { 446 | "codemirror_mode": { 447 | "name": "ipython", 448 | "version": 3 449 | }, 450 | "file_extension": ".py", 451 | "mimetype": "text/x-python", 452 | "name": "python", 453 | "nbconvert_exporter": "python", 454 | "pygments_lexer": "ipython3", 455 | "version": "3.9.15" 456 | } 457 | }, 458 | "nbformat": 4, 459 | "nbformat_minor": 5 460 | } 461 | -------------------------------------------------------------------------------- /tutorials/03_QuestionAnsweringMachine/src/question_answering.py: -------------------------------------------------------------------------------- 1 | import os 2 | import csv 3 | import glob 4 | import time 5 | import json 6 | import gzip 7 | import torch 8 | import argparse 9 | 10 | from transformers import AutoModelForSeq2SeqLM, AutoTokenizer 11 | from torch.utils.data import IterableDataset, DataLoader 12 | 13 | max_sentence_len="" 14 | max_new_tokens="" 15 | 16 | class QnADataset(IterableDataset): 17 | '''Dataset that streams batches instead of loading the whole file into memory''' 18 | def __init__(self, files_path, max_sentence_len=256, shuffle=True, tokenizer=None): 19 | super(QnADataset).__init__() 20 | self.files = glob.glob(os.path.join(files_path, "*.csv.gz")) 21 | if len(self.files) == 0: raise Exception("No .csv files found") 22 | print(f"{len(self.files)} csv files found") 23 | self.reader = None 24 | self.shuffle = shuffle 25 | self.tokenizer = tokenizer 26 | self.max_sentence_len = max_sentence_len 27 | 28 | def batch_generator(self): 29 | for file_path in self.files: 30 | with gzip.open(file_path, 'rt') as csvfile: 31 | data = csv.reader(csvfile, delimiter = ";") 32 | next(data) # skip header 33 | for i,row in enumerate(data): 34 | e = self.tokenizer(row[1], max_length=self.max_sentence_len, padding='max_length', truncation=True, return_tensors="pt") 35 | e['labels'] = self.tokenizer(row[2], max_length=self.max_sentence_len, padding='max_length', truncation=True, return_tensors="pt").input_ids 36 | yield i,e 37 | def __iter__(self): 38 | return self.batch_generator() 39 | 40 | def collate_fn(data): 41 | # rebuild all samples of a given batch into a dictionary HF way 42 | batch = {} 43 | for j,sample in data: 44 | for k,v in sample.items(): 45 | if batch.get(k) is None: batch[k] = [] 46 | batch[k].append(torch.LongTensor(v)) 47 | batch = {k:torch.vstack(batch[k]) for k in batch.keys()} 48 | return batch 49 
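# train() runs data-parallel fine-tuning on the Trainium cores via torch-xla: every worker streams
# batches from the gzipped CSV, computes the seq2seq loss, and xm.optimizer_step() all-reduces the
# gradients across cores. The epoch with the lowest average loss is checkpointed, and at the end the
# best weights are exported with save_pretrained() so the HuggingFace inference container can load them.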
| 50 | def train(args, world_size, device): 51 | print("Starting training job") 52 | os.makedirs(args.checkpoints_path, exist_ok=True) 53 | 54 | model = AutoModelForSeq2SeqLM.from_pretrained(f"google/{args.model_name}") 55 | model.to(device) 56 | 57 | tokenizer = AutoTokenizer.from_pretrained(f"google/{args.model_name}") 58 | optimizer = AdamW(model.parameters(), lr=args.lr * xm.xrt_world_size()) 59 | 60 | train_dataset = QnADataset(args.train, args.max_sentence_len, True, tokenizer) 61 | train_dataloader = DataLoader(train_dataset, collate_fn=collate_fn, batch_size=args.batch_size) 62 | train_dataloader = pl.MpDeviceLoader(train_dataloader, device) 63 | 64 | best_path = os.path.join(args.checkpoints_path, args.model_name, 'best.pt') 65 | best_loss = float("inf") 66 | for epoch in range(args.num_epochs): 67 | model.train() 68 | epoch_loss = 0.0 69 | num_batches = 0 70 | epoch_time = time.time() 71 | for step, batch in enumerate(train_dataloader): 72 | outputs = model(**batch) 73 | optimizer.zero_grad() 74 | loss = outputs.loss 75 | loss.backward() 76 | epoch_loss += outputs.logits.shape[0] * loss.detach().to('cpu') 77 | num_batches += 1 78 | 79 | # gather gradient updates from all cores and apply them 80 | xm.optimizer_step(optimizer) 81 | elapsed = time.time()-epoch_time 82 | epoch_loss /= num_batches*args.batch_size 83 | xm.master_print(f"epoch:{epoch}; elapsed_time(sec):{elapsed:0.2f}; loss:{epoch_loss};") 84 | if epoch_loss < best_loss: 85 | best_loss = epoch_loss 86 | xm.save({'state_dict': model.state_dict(), 'loss': best_loss}, best_path) 87 | 88 | best_model = torch.load(best_path) 89 | best_loss = best_model["loss"] 90 | print(f'Saving best model. Loss: {best_loss}') 91 | model.load_state_dict(best_model['state_dict']) 92 | model.to('cpu') 93 | model.eval() 94 | model.save_pretrained(args.model_path) 95 | tokenizer.save_pretrained(args.model_path) 96 | 97 | def input_fn(request_body, request_content_type): 98 | if request_content_type == "application/json": 99 | inputs = json.loads(request_body) 100 | return inputs['inputs'] 101 | 102 | raise Exception(f"Unsupported content type: {request_content_type}") 103 | 104 | def output_fn(prediction, content_type): 105 | if content_type == "application/json": 106 | return json.dumps({'answer': prediction}) 107 | raise Exception(f"Unsupported accept: {content_type}") 108 | 109 | 110 | def model_fn(model_dir): 111 | model = AutoModelForSeq2SeqLM.from_pretrained(model_dir) 112 | model.eval() 113 | tokenizer = AutoTokenizer.from_pretrained(model_dir) 114 | 115 | return model,tokenizer 116 | 117 | def predict_fn(input_object, model_tokenizer): 118 | global max_sentence_len,max_new_tokens 119 | model,tokenizer = model_tokenizer 120 | 121 | input_ids = tokenizer(input_object, max_length=max_sentence_len, padding='max_length', truncation=True, return_tensors="pt").input_ids 122 | gen_output = model.generate(input_ids, max_new_tokens=max_new_tokens) 123 | return [tokenizer.decode(o, skip_special_tokens=True) for o in gen_output] 124 | 125 | if __name__=='__main__': 126 | parser = argparse.ArgumentParser( 127 | prog = 'Train script for Trainium', 128 | description = 'Hyperparameters for the training process') 129 | 130 | # t5-xxl-ssm" # requires split ~46GB 131 | parser.add_argument('--num-epochs', type=int, help="Number of epochs", default=2) 132 | parser.add_argument('--batch-size', type=int, help="Batch size", default=4) 133 | parser.add_argument('--max-sentence-len', type=int, help="Maximum sentence length", default=128) 134 | 
parser.add_argument('--max-new-tokens', type=int, help="Maximum number of generated tokens", default=64) 135 | parser.add_argument('--model-name', type=str, help="Name of the model", default="t5-large-ssm") 136 | parser.add_argument('--lr', type=float, help="Learning rate", default=5e-5) 137 | 138 | parser.add_argument('--model-path', type=str, help="Path where we'll save the model", default=os.environ["SM_MODEL_DIR"]) 139 | parser.add_argument('--checkpoints-path', type=str, help="Path where we'll save the best model and cache", default='/opt/ml/checkpoints') 140 | parser.add_argument('--train', type=str, help="Path to train data", default=os.environ["SM_CHANNEL_TRAIN"]) 141 | 142 | args = parser.parse_args() 143 | print(args) 144 | import torch_xla.core.xla_model as xm 145 | import torch_xla.distributed.xla_backend 146 | import torch_xla.test.test_utils as test_utils 147 | import torch_xla.distributed.parallel_loader as pl 148 | 149 | from torch.optim import AdamW 150 | 151 | cache_dir = os.path.join(args.checkpoints_path, args.model_name) 152 | os.environ['TOKENIZERS_PARALLELISM'] = 'false' 153 | os.environ['NEURON_CC_FLAGS']=f"--cache_dir={cache_dir} --retry_failed_compilation" 154 | os.environ['XLA_USE_BF16'] = '1' 155 | 156 | device = 'xla' 157 | # Initialize XLA process group for torchrun 158 | torch.distributed.init_process_group(device) 159 | world_size = xm.xrt_world_size() 160 | 161 | print(f"Device: {device} World size: {world_size}") 162 | train(args, world_size, device) 163 | 164 | # define the max_seq len 165 | with open(__file__, "r") as f: 166 | data = f.read() 167 | data = data.replace("\"\"", f"{args.max_sentence_len}") 168 | data = data.replace("\"\"", f"{args.max_new_tokens}") 169 | 170 | code_path = os.path.join(args.model_path, 'code') 171 | if not os.path.isdir(code_path): os.makedirs(code_path, exist_ok=True) 172 | # save a copy of the inference file to the correct dir 173 | with open(os.path.join(code_path, 'inference.py'), "w") as f: 174 | f.write(data) 175 | -------------------------------------------------------------------------------- /tutorials/03_QuestionAnsweringMachine/train.csv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/ml-specialized-hardware/d73aadcdd1b966d23e5191882f707c0cc01cbe23/tutorials/03_QuestionAnsweringMachine/train.csv.gz -------------------------------------------------------------------------------- /tutorials/07_DeployToInferentiaWithTGI/inf2-tgi-demo.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "0ea34d80-1592-48b0-905b-845f29437577", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "!pip install -U sagemaker==2.232.2" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "id": "9604ec03-8aeb-4c31-a688-62163172c277", 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "name": "stdout", 21 | "output_type": "stream", 22 | "text": [ 23 | "sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml\n", 24 | "sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml\n" 25 | ] 26 | } 27 | ], 28 | "source": [ 29 | "import sagemaker\n", 30 | "\n", 31 | "sess = sagemaker.Session()\n", 32 | "session_bucket = sess.default_bucket()\n", 33 | "role = sagemaker.get_execution_role()" 34 | ] 35 | }, 36 | { 37 | 
"cell_type": "code", 38 | "execution_count": 39, 39 | "id": "fe07bddd-7ee0-40e1-af62-ac0c0cf530bf", 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "# Import the necessary libraries for using Hugging Face models and SageMaker\n", 44 | "from sagemaker.huggingface import HuggingFaceModel\n", 45 | "\n", 46 | "# Define the instance type that will be used for inference\n", 47 | "# ml.inf2.24xlarge is based on AWS Inferentia2 hardware, optimized for high-performance machine learning inference\n", 48 | "instance_type = \"ml.inf2.24xlarge\"\n", 49 | "\n", 50 | "# Set the health check timeout and volume size for the SageMaker model endpoint\n", 51 | "health_check_timeout = 2400 # The maximum time (in seconds) SageMaker waits for the model to be ready\n", 52 | "volume_size = 128 # Storage size in GB allocated to the model\n", 53 | "\n", 54 | "# Define the environment configuration for the Hugging Face model\n", 55 | "config = {\n", 56 | " \"HF_MODEL_ID\": \"meta-llama/Meta-Llama-3.1-8B\", # Hugging Face model ID\n", 57 | " \"HF_NUM_CORES\": \"8\", # Number of Neuron cores to use for inference\n", 58 | " \"HF_AUTO_CAST_TYPE\": \"bf16\", # Enable automatic casting to bf16 (half precision for faster inference)\n", 59 | " \"MAX_BATCH_SIZE\": \"4\", # Maximum batch size to process in one forward pass\n", 60 | " \"MAX_INPUT_LENGTH\": \"4095\", # Maximum input sequence length (tokens) allowed for inference\n", 61 | " \"MAX_TOTAL_TOKENS\": \"4096\", # Maximum total number of tokens (input + output)\n", 62 | " \"HF_TOKEN\": \"\" # Token to authenticate with Hugging Face Hub (ensure to keep this secure)\n", 63 | "}\n", 64 | "\n", 65 | "# Set the URI for the Hugging Face TGI (Text Generation Inference) image\n", 66 | "# This image is designed for optimized inference using AWS Neuron SDK (for Inferentia)\n", 67 | "tgi_image = \"763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.25-neuronx-py310-ubuntu22.04\"\n", 68 | "\n", 69 | "# Create the HuggingFaceModel object with the specified role, image, and environment configuration\n", 70 | "model = HuggingFaceModel(\n", 71 | " role=role, # IAM role that grants SageMaker permissions\n", 72 | " image_uri=tgi_image, # URI for the Hugging Face inference image\n", 73 | " env=config # Pass the environment variables defined in the config\n", 74 | ")\n", 75 | "\n", 76 | "# In this case, we are deploying a precompiled model, stored at https://huggingface.co/aws-neuron/optimum-neuron-cache \n", 77 | "# If the model you need to deploy the model that is not precompiled, you can export your own neuron model\n", 78 | "# as explained in https://huggingface.co/docs/optimum-neuron/main/en/guides/export_model#exporting-neuron-models-using-neuronx-tgi\n", 79 | "\n", 80 | "# Mark the model as precompiled\n", 81 | "model._is_compiled_model = True\n" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 19, 87 | "id": "62f5ed8d-77e6-45a8-8ba4-c35566631a5d", 88 | "metadata": {}, 89 | "outputs": [ 90 | { 91 | "name": "stdout", 92 | "output_type": "stream", 93 | "text": [ 94 | "------------------------!" 
95 | ] 96 | } 97 | ], 98 | "source": [ 99 | "predictor = model.deploy(\n", 100 | " initial_instance_count=1,\n", 101 | " instance_type=instance_type,\n", 102 | " container_startup_health_check_timeout=health_check_timeout,\n", 103 | " volume_size=volume_size\n", 104 | ")" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 38, 110 | "id": "0a8ddbde-4dc7-4b85-bd01-1311b78987b7", 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "data": { 115 | "text/plain": [ 116 | "[{'generated_text': 'What are the pros and cons of different energy sources? Is there a link between electricity usage and climate change? How can we tackle energy poverty, the issue of clean air at home, or the challenge of providing electricity access in refugee camps? Why should we care about these issues? And how can we better communicate these issues to diverse audiences?\\nThese are key issues for the energy sector – both at home and abroad. This degree will equip you to address them from the perspective of economics, innovation and policy – and prepare you for an exciting career.\\nOur innovative'}]" 117 | ] 118 | }, 119 | "execution_count": 38, 120 | "metadata": {}, 121 | "output_type": "execute_result" 122 | } 123 | ], 124 | "source": [ 125 | "data = {\n", 126 | " \"inputs\": \"What are the pros and cons of different energy sources?\",\n", 127 | " \"temperature\": 0.7,\n", 128 | " \"max_tokens\": 100,\n", 129 | " \"top_p\": 0.9,\n", 130 | " \"n\": 1,\n", 131 | "}\n", 132 | "\n", 133 | "predictor.predict(data)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 40, 139 | "id": "38015e6b-82be-4e9a-bc1e-b52e95b21937", 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "#clean-up\n", 144 | "\n", 145 | "predictor.delete_endpoint()" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "id": "7a2bb14b-a5a2-46c7-80c9-bbc4d301d20b", 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [] 155 | } 156 | ], 157 | "metadata": { 158 | "kernelspec": { 159 | "display_name": "Python 3 (ipykernel)", 160 | "language": "python", 161 | "name": "python3" 162 | }, 163 | "language_info": { 164 | "codemirror_mode": { 165 | "name": "ipython", 166 | "version": 3 167 | }, 168 | "file_extension": ".py", 169 | "mimetype": "text/x-python", 170 | "name": "python", 171 | "nbconvert_exporter": "python", 172 | "pygments_lexer": "ipython3", 173 | "version": "3.10.14" 174 | } 175 | }, 176 | "nbformat": 4, 177 | "nbformat_minor": 5 178 | } 179 | -------------------------------------------------------------------------------- /tutorials/08_TextClassificationWithNaturalLanguageInference/NLI_with_BART_inf2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 0. 
Import libraries / Setup" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This notebook was tested with the neuron **sdk 2.21.1** (in Python 3.10.12).\n", 15 | "It requires the following packages:\n", 16 | "```\n", 17 | "torch==2.5.1\n", 18 | "torch-neuronx==2.5.1.2.4.0\n", 19 | "torch-xla==2.5.1\n", 20 | "torchvision==0.20.1\n", 21 | "libneuronxla==2.1.714.0\n", 22 | "neuronx-cc==2.16.372.0+4a9b2326\n", 23 | "```\n", 24 | "Normally those should be already installed when you setup the system for said sdk version.\n", 25 | "Then you need to install the following:\n", 26 | "```\n", 27 | "huggingface-hub==0.28.1\n", 28 | "transformers==4.48.2\n", 29 | "```" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": { 36 | "editable": true, 37 | "scrolled": true, 38 | "slideshow": { 39 | "slide_type": "" 40 | }, 41 | "tags": [] 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "import sys\n", 46 | "\n", 47 | "!{sys.executable} -m pip install transformers==4.48.2 huggingface-hub==0.28.1" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "import transformers\n", 57 | "import torch_neuronx\n", 58 | "import os\n", 59 | "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "# 1. Load model pretrained on MNLI" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "from transformers import BartForSequenceClassification, BartTokenizer\n", 76 | "tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli', export=True)\n", 77 | "model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli', export=True)\n", 78 | "model_cpu = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')\n", 79 | "model_dir = \"Bart\"\n", 80 | "os.makedirs(model_dir, exist_ok=True)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "## 1.1 Test loaded model" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "# pose sequence as a NLI premise and label (politics) as a hypothesis\n", 97 | "premise = 'What is your favorite team, Madrid or Barca?'\n", 98 | "hypothesis = 'This text is about sports.'\n", 99 | "max_length = 128\n", 100 | "\n", 101 | "# run through model pre-trained on MNLI\n", 102 | "encoded_input = tokenizer.encode_plus(premise, hypothesis, return_tensors='pt', truncation='only_first', padding=\"max_length\", max_length=max_length)\n", 103 | "logits = model(encoded_input[\"input_ids\"], encoded_input[\"attention_mask\"], use_cache=False)[0]\n", 104 | "\n", 105 | "# we throw away \"neutral\" (dim 1) and take the probability of\n", 106 | "# \"entailment\" (2) as the probability of the label being true \n", 107 | "entail_contradiction_logits = logits[:,[0,2]]\n", 108 | "probs = entail_contradiction_logits.softmax(dim=1)\n", 109 | "true_prob = probs[:,1].item() * 100\n", 110 | "print(f'Probability that the label is true: {true_prob:0.2f}%')" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "## 1.2 Test tracing the model as it comes" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": { 124 | 
"scrolled": true 125 | }, 126 | "outputs": [], 127 | "source": [ 128 | "neuron_encoder = torch_neuronx.trace(\n", 129 | " model, \n", 130 | " encoded_input[\"input_ids\"],\n", 131 | " compiler_args='--target inf2 --model-type transformer --auto-cast all',\n", 132 | " compiler_workdir='./enc_dir')" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "The step above fails because between the encoder and decoder, the arguments are passes as a dictionary with tuples as values. The compiler doesn't work well with this setup, so the idea is to split the model in two parts, encoder and decoder compile them independently and then put them back into the original model structure.\n", 140 | "\n", 141 | "Given this model is around 400M params (1.5GB), it fits into just 1 core when quantized to bf16. After that, both encoder and decoder will be accelerated on inf2." 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "# 2. Prepare model for compilation" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "dim_enc=model.config.max_position_embeddings\n", 158 | "dim_dec=model.config.d_model\n", 159 | "print(f'Dim enc: {dim_enc}; Dim dec: {dim_dec}')\n", 160 | "max_dec_len = 1024" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "import torch\n", 177 | "import torch.nn.functional as F\n", 178 | "from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPastAndCrossAttentions\n", 179 | "\n", 180 | "# Define one function for the encoder part\n", 181 | "def enc_f(self, input_ids, attention_mask, **kwargs):\n", 182 | " if hasattr(self, 'forward_neuron'):\n", 183 | " out = self.forward_neuron(input_ids, attention_mask)\n", 184 | " else:\n", 185 | " out = self.forward_(input_ids, attention_mask=attention_mask, return_dict=True)\n", 186 | " return BaseModelOutput(**out)\n", 187 | "\n", 188 | "\n", 189 | "# Define one function for the decoder part\n", 190 | "def dec_f(self, input_ids, encoder_hidden_states, encoder_attention_mask, **kwargs): \n", 191 | " out = None\n", 192 | " \n", 193 | " if input_ids.shape[1] > self.max_length:\n", 194 | " raise Exception(f\"The decoded sequence is not supported. 
Max: {self.max_length}\")\n", 195 | "\n", 196 | " if hasattr(self, 'forward_neuron'):\n", 197 | " out = self.forward_neuron(input_ids,\n", 198 | " encoder_hidden_states,\n", 199 | " encoder_attention_mask)\n", 200 | " else:\n", 201 | " out = self.forward_(input_ids=input_ids,\n", 202 | " encoder_hidden_states=encoder_hidden_states,\n", 203 | " encoder_attention_mask=encoder_attention_mask,\n", 204 | " return_dict=True,\n", 205 | " use_cache=False,\n", 206 | " output_attentions=False)\n", 207 | " \n", 208 | " # Ensure the output is compatible with BaseModelOutputWithPastAndCrossAttentions\n", 209 | " if 'cross_attentions' not in out:\n", 210 | " out['cross_attentions'] = None\n", 211 | " if 'hidden_states' not in out:\n", 212 | " out['hidden_states'] = None\n", 213 | " if 'attentions' not in out:\n", 214 | " out['attentions'] = None\n", 215 | " \n", 216 | " return BaseModelOutputWithPastAndCrossAttentions(**out)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "import types\n", 226 | "\n", 227 | "# Backup the original forward methods\n", 228 | "if not hasattr(model.model.encoder, 'forward_'): \n", 229 | " model.model.encoder.forward_ = model.model.encoder.forward\n", 230 | "if not hasattr(model.model.decoder, 'forward_'): \n", 231 | " model.model.decoder.forward_ = model.model.decoder.forward\n", 232 | "\n", 233 | "# Replace the forward methods with the custom ones\n", 234 | "model.model.encoder.forward = types.MethodType(enc_f, model.model.encoder)\n", 235 | "model.model.decoder.forward = types.MethodType(dec_f, model.model.decoder)\n", 236 | "\n", 237 | "# Set the max_length attribute for the decoder\n", 238 | "model.model.decoder.max_length = max_dec_len # or any other appropriate value" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": {}, 245 | "outputs": [], 246 | "source": [ 247 | "# Run only the encoder to prepare the sample input for the decoder\n", 248 | "encoder_inputs = encoded_input[\"input_ids\"], encoded_input[\"attention_mask\"]\n", 249 | "encoder_outputs = model.model.encoder(encoded_input[\"input_ids\"], encoded_input[\"attention_mask\"])" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "## 2.1 Trace Encoder" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "import os\n", 266 | "import torch\n", 267 | "\n", 268 | "model_filename=f\"{model_dir}/BART-large-nli-encoder.pt\"\n", 269 | "\n", 270 | "if not os.path.isfile(model_filename):\n", 271 | " if hasattr(model.model.encoder, 'forward_neuron'): del model.model.encoder.forward_neuron\n", 272 | " neuron_encoder = torch_neuronx.trace(\n", 273 | " model.model.encoder, \n", 274 | " encoder_inputs,\n", 275 | " compiler_args='--target inf2 --model-type transformer --auto-cast all',\n", 276 | " compiler_workdir='./enc_dir')\n", 277 | " # neuron_encoder_dynamic_batch = torch_neuronx.dynamic_batch(neuron_encoder)\n", 278 | " neuron_encoder.save(model_filename)\n", 279 | " model.model.encoder.forward_neuron = neuron_encoder\n", 280 | "else:\n", 281 | " model.model.encoder.forward_neuron = torch.jit.load(model_filename)\n", 282 | "\n" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "## 2.2 Trace Decoder" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 
null, 295 | "metadata": {}, 296 | "outputs": [], 297 | "source": [ 298 | "model_filename=f\"{model_dir}/BART-large-nli-decoder.pt\"\n", 299 | "\n", 300 | "if not os.path.isfile(model_filename):\n", 301 | " inp = encoded_input[\"input_ids\"], encoder_outputs[0], encoded_input[\"attention_mask\"]\n", 302 | " if hasattr(model.model.decoder, 'forward_neuron'): del model.model.decoder.forward_neuron\n", 303 | " neuron_decoder = torch_neuronx.trace(\n", 304 | " model.model.decoder,\n", 305 | " inp,\n", 306 | " compiler_args='--target inf2 --model-type transformer --auto-cast all',\n", 307 | " compiler_workdir='./dec_dir')\n", 308 | " # neuron_decoder_dynamic_batch = torch_neuronx.dynamic_batch(neuron_decoder)\n", 309 | " neuron_decoder.save(model_filename)\n", 310 | " model.model.decoder.forward_neuron = neuron_decoder\n", 311 | "else:\n", 312 | " model.model.decoder.forward_neuron = torch.jit.load(model_filename)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": { 318 | "scrolled": true 319 | }, 320 | "source": [ 321 | "# 3. Test" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": null, 327 | "metadata": { 328 | "scrolled": true 329 | }, 330 | "outputs": [], 331 | "source": [ 332 | "# pass sequence as a NLI premise and label (politics) as a hypothesis\n", 333 | "premise = 'how do you like the potatoes?'\n", 334 | "hypothesis = 'This text is about cooking.'\n", 335 | "\n", 336 | "# run through model pre-trained on MNLI\n", 337 | "max_length=128\n", 338 | "x = tokenizer.encode_plus(premise, hypothesis, return_tensors='pt', truncation='only_first', padding=\"max_length\", max_length=max_length, return_attention_mask=True)\n", 339 | "y = model(x[\"input_ids\"],x[\"attention_mask\"])\n", 340 | "logits = y[0]\n", 341 | "\n", 342 | "# we throw away \"neutral\" (dim 1) and take the probability of\n", 343 | "# \"entailment\" (2) as the probability of the label being true \n", 344 | "entail_contradiction_logits = logits[:,[0,2]]\n", 345 | "probs = entail_contradiction_logits.softmax(dim=1)\n", 346 | "true_prob = probs[:,1].item() * 100\n", 347 | "print(f'Probability that the label is true: {true_prob:0.2f}%')\n" 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "### Now we can test the inference latency in the Inf2 chips:" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "%%timeit -r 10\n", 364 | "\n", 365 | "model(x[\"input_ids\"], x[\"attention_mask\"])" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "### And compare it with the model hosted in the CPU:" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "%%timeit -r 10\n", 382 | "model_cpu(x[\"input_ids\"], x[\"attention_mask\"])" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "### Finally we can compare the output of CPU model vs the Inf2" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": null, 395 | "metadata": {}, 396 | "outputs": [], 397 | "source": [ 398 | "y = model_cpu(x[\"input_ids\"],x[\"attention_mask\"])\n", 399 | "logits = y[0]\n", 400 | "# we throw away \"neutral\" (dim 1) and take the probability of\n", 401 | "# \"entailment\" (2) as the probability of the label being true \n", 402 | 
"entail_contradiction_logits = logits[:,[0,2]]\n", 403 | "probs = entail_contradiction_logits.softmax(dim=1)\n", 404 | "true_prob = probs[:,1].item() * 100\n", 405 | "print(f'Probability that the label is true: {true_prob:0.2f}%')\n" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "the value should be very similar to the one 3 cells above." 413 | ] 414 | } 415 | ], 416 | "metadata": { 417 | "kernelspec": { 418 | "display_name": "Python 3 (ipykernel)", 419 | "language": "python", 420 | "name": "python3" 421 | }, 422 | "language_info": { 423 | "codemirror_mode": { 424 | "name": "ipython", 425 | "version": 3 426 | }, 427 | "file_extension": ".py", 428 | "mimetype": "text/x-python", 429 | "name": "python", 430 | "nbconvert_exporter": "python", 431 | "pygments_lexer": "ipython3", 432 | "version": "3.10.12" 433 | } 434 | }, 435 | "nbformat": 4, 436 | "nbformat_minor": 4 437 | } 438 | 439 | -------------------------------------------------------------------------------- /workshops/01_FineTuneSpamClassifier/README.md: -------------------------------------------------------------------------------- 1 | # How to reduce costs and improve performance of your Machine Learning (ML) workloads? 2 | ## AWS Machine Learning Purpose-built Accelerators Tutorial 3 | 4 | In this workshop you'll learn how to use [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/) and [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) with [Amazon SageMaker](https://aws.amazon.com/sagemaker/) and [Hugging Face Optimum Neuron](https://huggingface.co/docs/optimum-neuron/index), to optimize your ML workloads! You'll also learn a new methodology to map/qualify/implement end2end solutions for different business challenges. A **top-down** approach that starts with the **use case/business challenge** identification/mapping and ends with a trained model deployed as an API, which can be then integrated to your application. 5 | 6 | Supposing you have a **business challenge** to address, which requires custom ML models. You need to prepare a dataset, train/deploy your models and finally integrate these models to your application (eventually automate this whole process). And, in the end, you expect to have a cost-optimized solution that fits into your budget. 7 | 8 | The picture bellow shows the steps of the proposed methodology you need to follow in order to successfuly apply it to your own business problem: 9 |

10 | 11 |

12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
1) Use case identification:The first step of the process is to identify your use case. We prepared a table with a list of common use cases, framed as questions. The idea is to find the Task we'll use to address the problem.
1.1) Task mapping:After identifying the use case/business challenge, using the **use cases table** or your own judgment, now it is time to prepare a model for that given Task
2) Model selection:There is a second table which lists all the current supported models and the Tasks it can implement. Use that table to select your model
3) Model building:Now, you can make use of the available notebooks to run: 1/ Data Preparation; 2/ Model fine-tuning and 3/ Model deploying. If you already have a pre-trained model, you can skip steps 1 and 2
4) App integration:In the previous step you deployed your model and it is now exposed as an API. Just integrate your application to this API and start using your model
20 | 21 | ## 1) Use case mapping 22 | 23 | The following table lists common use cases (framed as questions) and their associated tasks. Use this table as a reference to identify which **Task** is the best option to address your problem. Frame your **use case/business challenge** as a question and try to find the most similar option in the table. Then, use the task associated with the mapped use case, in the second column, and follow the next steps. 24 | 25 | **IMPORTANT:** If you don't find a use case (question) that resonates with your own use case, try to identify which **Task** is more appropriate for your scenario (using the tasks table). Also, please open an issue with the description of your use case + a framed question so that we can improve this table. 26 | 27 | |Use case question|Task| 28 | |:-|:-| 29 | |How to create an auto-complete mechanism for my application?|CausalLM| 30 | |How to create a chatbot to answer my customers' questions from an FAQ?|QuestionAnswering| 31 | |How can I summarize a long document into a few paragraphs?|CausalLM| 32 | |How can I create a spam classifier for my emails?|SequenceClassification| 33 | |How to check whether a given text is a positive or a negative comment?|SequenceClassification| 34 | |How do I translate documents from multiple languages to Dutch?|CausalLM| 35 | |How to complete a sentence, given only its initial words?|CausalLM| 36 | |How to classify pictures of products into different classes?|ImageClassification| 37 | |How to create an Alexa-like mechanism that detects specific keywords?|AudioClassification| 38 | |How to create subtitles for audiobooks?|Text-To-Speech| 39 | |Given two sentences, how to make sure the second sentence is related to the first?|NextSentencePrediction| 40 | 41 | ### 1.1) Available Tasks 42 | 43 | |Task|Description| 44 | |:-|:-| 45 | |SequenceClassification|Text classification - binary or multi-class| 46 | |MultipleChoice|Given a context and multiple options, the model predicts which one is correct| 47 | |TokenClassification|Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER)| 48 | |MaskedLM|Predicts the term that replaces a mask in the input text| 49 | |QuestionAnswering|Answers questions based on a context or on knowledge acquired via training| 50 | |CausalLM|Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens| 51 | |ConditionalGeneration|Fills in a masked span conditioned on the rest of the sentence| 52 | |NextSentencePrediction|NSP consists of giving the model two sentences, sentence A and sentence B. We then ask, ‘hey Model, does sentence B come after sentence A?’, and the model answers either IsNextSentence or NotNextSentence.| 53 | |MaskedImageModeling|Predicts masks of the objects in a given picture| 54 | |ImageClassification|Classifies (binary or multiclass) an image into different classes of objects| 55 | 56 | ## 2) HF Optimum Neuron - Supported Models 57 | 58 | [Click here to see the current supported models for training and inference in Hugging Face Optimum Neuron](docs/optimum_neuron_models.md) 59 | 60 | ## 3) Model Building 61 | Here you can find notebooks you can run on [Amazon SageMaker Studio](https://aws.amazon.com/sagemaker/studio/) to prepare a model that addresses a task associated with your own use case.
They implement a solution for the following use case: **How can I create a spam detection mechanism?**. The required task is **SequenceClassification**. In the end we'll have a Binary Text classification model which receives a given email as input and return 0=NOT SPAM and 1=SPAM. 62 | 63 | - The first notebook downloads a public dataset named **Deysi/spam-detection-dataset**. The dataset has already samples labelade as **spam** or **not spam**. 64 | - The second notebook is configured to train a **bert-base-uncased** for **SequenceClassification**. You'll notice there are variables you can configure to define the model and the task, then you define some hyperparameters and kick-off the training job using Amazon SageMaker. 65 | - The third notebook shows how to compile a pre-trained model to AWS Inferentia and deploy it to a SageMaker real-time Endpoint which will exposes the model as a simple API (WebService). 66 | 67 | **ATTENTION:** if you already have a trained model, compatible with the models listed in the table linked in section 2, then just use the third notebook (you don't need the first two in this case). 68 | 69 | |Notebook|Description| 70 | |-|-| 71 | |[01 - Data Preparation](notebooks/01_DatasetPreparation.ipynb)|How to load and prepare a dataset for fine-tuning a model| 72 | |[02 - Model Fine-tuning](notebooks/02_ModelFineTuning.ipynb)|How to kick-off a fine-tuning job using the dataset prepared in the previous notebook| 73 | |[03 - Model Deployment](notebooks/03_ModelInference.ipynb)|How to compile and deploy a pre-trained model to Inferentia| 74 | 75 | ## 4) App Integration 76 | 77 | If you followed the steps in the previous sections, you have a running SageMaker real-time endpoint with your model. Now you can make use of [AWS SDK for SageMaker runtime](https://aws.amazon.com/developer/tools/) which offers libraries available for the most common programming languages. If your application is Python based, you can also make use of [Amazon SageMaker Inference API](https://sagemaker.readthedocs.io/en/stable/api/inference/index.html). 78 | -------------------------------------------------------------------------------- /workshops/01_FineTuneSpamClassifier/docs/imgs/01_activities.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/ml-specialized-hardware/d73aadcdd1b966d23e5191882f707c0cc01cbe23/workshops/01_FineTuneSpamClassifier/docs/imgs/01_activities.png -------------------------------------------------------------------------------- /workshops/01_FineTuneSpamClassifier/notebooks/requirements.txt: -------------------------------------------------------------------------------- 1 | datasets 2 | transformers==4.43.2 3 | optimum-neuron==0.0.25 -------------------------------------------------------------------------------- /workshops/01_FineTuneSpamClassifier/notebooks/src/compile.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import os 5 | os.environ['NEURON_RT_NUM_CORES'] = '1' 6 | import sys 7 | import glob 8 | import json 9 | import torch 10 | import shutil 11 | import tarfile 12 | import logging 13 | import argparse 14 | import traceback 15 | import optimum.neuron 16 | from transformers import AutoTokenizer 17 | 18 | def model_fn(model_dir, context=None): 19 | task = os.environ.get("TASK") 20 | if task is None: raise Exception("Invalid TASK. 
You need to invoke the compilation job once to set TASK variable") 21 | 22 | NeuronModel = eval(f"optimum.neuron.NeuronModelFor{task}") 23 | tokenizer = AutoTokenizer.from_pretrained(model_dir) 24 | model = NeuronModel.from_pretrained(model_dir) 25 | return model,tokenizer 26 | 27 | def input_fn(input_data, content_type, context=None): 28 | if content_type == 'application/json': 29 | req = json.loads(input_data) 30 | prompt = req.get('prompt') 31 | if prompt is None or len(prompt) < 3: 32 | raise("Invalid prompt. Provide an input like: {'prompt': 'text text text'}") 33 | return prompt 34 | else: 35 | raise Exception(f"Unsupported mime type: {content_type}. Supported: application/json") 36 | 37 | def predict_fn(input_object, model_tokenizer, context=None): 38 | model,tokenizer = model_tokenizer 39 | inputs = tokenizer(input_object, truncation=True, return_tensors="pt") 40 | logits = model(**inputs).logits 41 | idx = logits.argmax(1, keepdim=True) 42 | conf = torch.gather(logits, 1, idx) 43 | return torch.cat([idx,conf], 1) 44 | 45 | if __name__ == "__main__": 46 | parser = argparse.ArgumentParser() 47 | 48 | # hyperparameters sent by the client are passed as command-line arguments to the script. 49 | parser.add_argument("--task", type=str, default="") 50 | parser.add_argument("--dynamic_batch_size", type=bool, default=False) 51 | parser.add_argument("--input_shapes", type=str, required=True) 52 | parser.add_argument("--is_model_compressed", type=bool, default=True) 53 | 54 | parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"]) 55 | parser.add_argument("--checkpoint_dir", type=str, default=os.environ["SM_CHANNEL_CHECKPOINT"]) 56 | 57 | args, _ = parser.parse_known_args() 58 | 59 | # Set up logging 60 | logging.basicConfig( 61 | level=logging.getLevelName("DEBUG"), 62 | handlers=[logging.StreamHandler(sys.stdout)], 63 | format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", 64 | ) 65 | logger = logging.getLogger(__name__) 66 | logger.info(args) 67 | 68 | NeuronModel = eval(f"optimum.neuron.NeuronModel{'For' + args.task if len(args.task) > 0 else ''}") 69 | logger.info(f"Checkpoint files: {os.listdir(args.checkpoint_dir)}") 70 | 71 | model_path = args.checkpoint_dir 72 | if args.is_model_compressed: 73 | logger.info("Decompressing model file...") 74 | with tarfile.open(os.path.join(args.checkpoint_dir, "model.tar.gz"), 'r:gz') as tar: 75 | tar.extractall(os.path.join(args.checkpoint_dir, "model")) 76 | model_path = os.path.join(args.checkpoint_dir, "model") 77 | logger.info(f"Done! Model path: {model_path}") 78 | logger.info(f"Model path files: {os.listdir(model_path)}") 79 | 80 | input_shapes = json.loads(args.input_shapes) 81 | model = NeuronModel.from_pretrained(model_path, export=True, dynamic_batch_size=args.dynamic_batch_size, **input_shapes) 82 | model.save_pretrained(args.model_dir) 83 | 84 | code_path = os.path.join(args.model_dir, 'code') 85 | os.makedirs(code_path, exist_ok=True) 86 | 87 | shutil.copy(__file__, os.path.join(code_path, "inference.py")) 88 | shutil.copy('requirements.txt', os.path.join(code_path, 'requirements.txt')) 89 | -------------------------------------------------------------------------------- /workshops/01_FineTuneSpamClassifier/notebooks/src/dump_model_table.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import os 5 | import re 6 | import sys 7 | import argparse 8 | import pandas as pd 9 | from optimum.neuron import version 10 | from optimum.exporters.tasks import TasksManager 11 | from optimum.exporters.neuron.model_configs import * 12 | from optimum.neuron.distributed.parallelizers_manager import ParallelizersManager 13 | from optimum.neuron.utils.training_utils import ( 14 | _SUPPORTED_MODEL_NAMES, 15 | _SUPPORTED_MODEL_TYPES, 16 | _generate_supported_model_class_names 17 | ) 18 | 19 | def training_models(): 20 | # retrieve supported models for Tensor Parallelism 21 | tp_support = list(ParallelizersManager._MODEL_TYPE_TO_PARALLEL_MODEL_CLASS.keys()) 22 | 23 | # build compability table for training 24 | data_training = {'Model': []} 25 | for m in _SUPPORTED_MODEL_TYPES: 26 | if type(m) != str: m = m[0] 27 | if m=='gpt-2': m='gpt2' # fix the name 28 | model_id = len(data_training['Model']) 29 | model_link = f'{m}' 30 | data_training['Model'].append(f"{model_link} [TP]" if m in tp_support else model_link) 31 | tasks = [re.sub(r'.+For(.+)', r'\1', t) for t in set(_generate_supported_model_class_names(m)) if not t.endswith('Model')] 32 | for t in tasks: 33 | if data_training.get(t) is None: data_training[t] = [''] * len(_SUPPORTED_MODEL_TYPES) 34 | data_training[t][model_id] = f'doc' 35 | df_training = pd.DataFrame.from_dict(data_training).set_index('Model') 36 | return df_training.to_markdown() 37 | 38 | def inference_models(): 39 | # retrieve supported models for Tensor Parallelism 40 | tp_support = list(ParallelizersManager._MODEL_TYPE_TO_PARALLEL_MODEL_CLASS.keys()) 41 | 42 | # build compability table for inference 43 | meta = [(k,list(v['neuron'].keys())) for k,v in TasksManager._SUPPORTED_MODEL_TYPE.items() if v.get('neuron') is not None] 44 | data_inference = {'Model': []} 45 | for m,t in meta: 46 | model_id = len(data_inference['Model']) 47 | model_link = f'{m}' 48 | data_inference['Model'].append(f"{model_link} [TP]" if m in tp_support else model_link) 49 | for task in t: 50 | if data_inference.get(task) is None: data_inference[task] = [''] * len(meta) 51 | data_inference[task][model_id] = f'doc' 52 | 53 | df_inference = pd.DataFrame.from_dict(data_inference).set_index('Model') 54 | return df_inference.to_markdown() 55 | 56 | if __name__ == "__main__": 57 | parser = argparse.ArgumentParser() 58 | 59 | # input parameters of this script 60 | parser.add_argument("--output_file", type=str, required=True) 61 | 62 | try: 63 | args, _ = parser.parse_known_args() 64 | print(f"Dumping the metadata file to: {args.output_file}") 65 | with open(args.output_file, 'w') as f: 66 | f.write("# HF Optimum Neuron - Supported Models\n") 67 | f.write(f"**version: {version.__version__}** \n") 68 | f.write("Models marked with [TP] support **Tensor Parallelism** for training and inference\n") 69 | f.write("## Models/tasks for training\n") 70 | f.write(f"{training_models()}\n") 71 | f.write("## Models/tasks for inference\n") 72 | f.write(f"{inference_models()}\n") 73 | except Exception as e: 74 | print(traceback.format_exc()) 75 | sys.exit(1) 76 | 77 | finally: 78 | print("Done! 
", sys.exc_info()) 79 | sys.exit(0) 80 | -------------------------------------------------------------------------------- /workshops/01_FineTuneSpamClassifier/notebooks/src/requirements.txt: -------------------------------------------------------------------------------- 1 | --extra-index-url https://pip.repos.neuron.amazonaws.com 2 | evaluate==0.4.3 3 | optimum-neuron==0.0.25 -------------------------------------------------------------------------------- /workshops/01_FineTuneSpamClassifier/notebooks/src/train.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import os 5 | import sys 6 | import torch 7 | import random 8 | import argparse 9 | import evaluate 10 | import importlib 11 | import traceback 12 | import subprocess 13 | import transformers 14 | 15 | from huggingface_hub import login 16 | from datasets import load_from_disk 17 | from transformers import AutoTokenizer 18 | from optimum.neuron import NeuronTrainer as Trainer 19 | from optimum.neuron import NeuronTrainingArguments as TrainingArguments 20 | 21 | if __name__ == "__main__": 22 | parser = argparse.ArgumentParser() 23 | 24 | # hyperparameters sent by the client are passed as command-line arguments to the script. 25 | parser.add_argument("--epochs", type=int, default=3) 26 | parser.add_argument("--max_sen_len", type=int, default=256) 27 | parser.add_argument("--train_batch_size", type=int, default=32) 28 | parser.add_argument("--eval_batch_size", type=int, default=64) 29 | parser.add_argument("--warmup_steps", type=int, default=500) 30 | parser.add_argument("--tensor_parallel_size", type=int, default=1) 31 | parser.add_argument("--model_id", type=str, required=True) 32 | parser.add_argument("--zero_1", type=bool, default=False) 33 | parser.add_argument("--task", type=str, default="") 34 | parser.add_argument("--collator", type=str, default="DefaultDataCollator") 35 | parser.add_argument("--learning_rate", type=float, default=5e-5) 36 | parser.add_argument("--weight_decay", type=float, default=0.01) 37 | parser.add_argument("--bf16", type=bool, default=True) 38 | 39 | # hugging face hub 40 | parser.add_argument("--hf_token", type=str, default=None) 41 | 42 | # Data, model, and output directories 43 | parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"]) 44 | parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"]) 45 | parser.add_argument("--n_neurons", type=str, default=os.environ["SM_NUM_NEURONS"]) 46 | parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"]) 47 | parser.add_argument("--eval_dir", type=str, default=os.environ.get("SM_CHANNEL_EVAL", None)) 48 | 49 | parser.add_argument('--checkpoints-path', type=str, help="Path where we'll save the cache", default='/opt/ml/checkpoints') 50 | 51 | args, _ = parser.parse_known_args() 52 | os.makedirs(args.checkpoints_path, exist_ok=True) 53 | 54 | if not args.hf_token is None and len(args.hf_token) > 0: 55 | print("HF token defined. 
Logging in...") 56 | login(token=args.hf_token) 57 | 58 | cmd = f"optimum-cli neuron cache set {os.environ['CUSTOM_CACHE_REPO']}" 59 | subprocess.check_call(cmd.split(' ')) 60 | 61 | Collator = eval(f"transformers.{args.collator}") 62 | AutoModel = eval(f"transformers.AutoModel{'For' + args.task if len(args.task) > 0 else ''}") 63 | 64 | train_dataset=load_from_disk(args.training_dir) 65 | eval_dataset=load_from_disk(args.eval_dir) if not args.eval_dir is None else None 66 | 67 | tokenizer = AutoTokenizer.from_pretrained(args.model_id) 68 | tokenizer.pad_token = tokenizer.eos_token 69 | tokenizer.model_max_length = args.max_sen_len 70 | 71 | data_collator = Collator(return_tensors="pt") 72 | model = AutoModel.from_pretrained(args.model_id, trust_remote_code=True) # TODO: add a hyperparameter with model params 73 | 74 | training_args = TrainingArguments( 75 | evaluation_strategy="epoch" if not args.eval_dir is None else "no", 76 | learning_rate=args.learning_rate, 77 | weight_decay=args.weight_decay, 78 | bf16=args.bf16, 79 | num_train_epochs=args.epochs, 80 | output_dir=args.checkpoints_path, 81 | overwrite_output_dir=True, 82 | tensor_parallel_size=args.tensor_parallel_size, 83 | zero_1=args.zero_1, 84 | 85 | per_device_train_batch_size=args.train_batch_size, 86 | per_device_eval_batch_size=args.eval_batch_size if not args.eval_dir is None else None, 87 | logging_dir=f"{args.output_data_dir}/logs", 88 | logging_strategy="steps", 89 | logging_steps=500, 90 | save_steps=2000, 91 | save_strategy="steps", 92 | save_total_limit=1, 93 | ) 94 | trainer = Trainer( 95 | model=model, 96 | args=training_args, 97 | train_dataset=train_dataset, 98 | eval_dataset=eval_dataset, 99 | data_collator=data_collator, 100 | ) 101 | trainer.train() 102 | # save artifacts that will be uploaded to S3 103 | trainer.save_model(args.model_dir) 104 | tokenizer.save_pretrained(args.model_dir) 105 | -------------------------------------------------------------------------------- /workshops/02_DomainAdaptation/README.md: -------------------------------------------------------------------------------- 1 | # Adapting LLMs for domain-aware applications with AWS Trainium post-training 2 | 3 | ## Introduction 4 | 5 | Large language models are typically trained on a broad corpus of data from various domains, making them highly capable of handling diverse tasks and topics. However, when these models are deployed in specific domains or applications, their performance may not be optimal due to the domain-specific language, terminology, and context. Domain adaptation aims to fine-tune or adapt the pre-trained LLM to a particular domain or task, improving its performance and enabling better understanding of domain-specific data. 6 | 7 | # Scenarios and Use Cases for LLM Domain Adaptation 8 | LLM domain adaptation is useful in various scenarios where domain-specific knowledge or language patterns are crucial. 
Some common use cases include: 9 | 10 | - Specialized industries (e.g., healthcare, finance, legal, engineering) 11 | - Domain-specific applications (e.g., chatbots for customer service, virtual assistants for specific tasks) 12 | - Text summarization or generation for specific domains 13 | - Question-answering systems for domain-specific knowledge bases 14 | - Sentiment analysis or text classification in domain-specific contexts 15 | - Machine translation for domain-specific terminologies 16 | 17 | # Techniques for LLM Domain Adaptation 18 | 19 | **Supervised Fine-Tuning (SFT)**: In this approach, the language model is fine-tuned on a labeled dataset specific to the target domain or task. The model learns to generate outputs similar to the labeled examples in the dataset. 20 | - Use case: Fine-tuning a language model for legal document summarization, where you have a dataset of legal documents and their corresponding summaries. 21 | 22 | **Reinforcement Learning from Human Feedback (RLHF)**: This technique involves providing human feedback (e.g., ratings, comparisons, or corrections) to the language model during the fine-tuning process. The model is trained to generate outputs that align with the human feedback, effectively shaping its behavior according to human preferences. RLHF can be particularly useful when you want to imbue the language model with specific traits, such as factuality, safety, or ethical behavior. 23 | - Use case: Training a language model for customer service applications, where it needs to provide helpful, polite, and factual responses. 24 | 25 | **Direct Preference Optimization (DPO)**: DPO is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. 26 | - Use case: Pre-training a language model on corrupted scientific paper abstracts to improve its performance on tasks related to scientific literature, such as question answering or text summarization. 27 | 28 | **Odds Ratio Preference Optimization (ORPO)**: ORPO is a simple and novel monolithic alignment method technique which efficiently penalizes the model from learning undesired generation styles during SFT. 29 | - Use case: Adapting a general-purpose language model to the financial domain by training it with ORPO on a corpus of financial reports and news articles. 30 | 31 |
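As a sketch of the idea behind ORPO (notation follows the ORPO paper, not any code in this repository), the method augments the usual supervised fine-tuning loss with a log-odds-ratio penalty computed over pairs of preferred ($y_w$) and rejected ($y_l$) responses:

$$
\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\big[\mathcal{L}_{SFT}(x, y_w) + \lambda \cdot \mathcal{L}_{OR}(x, y_w, y_l)\big]
$$

$$
\mathcal{L}_{OR} = -\log \sigma\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right),
\qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
$$

Here $\lambda$ weights the preference term relative to the plain SFT loss, which is why a single training pass can both adapt the model to the domain and align it with preferences, without a separate reference model.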

32 | model alignment techniques 33 |

34 | 35 | These techniques can be used individually or in combination, depending on the specific requirements and constraints of the domain adaptation task. The choice of technique often depends on factors such as the availability of labeled data, the desired traits or behaviors of the adapted model, and the computational resources available. 36 | 37 | It's worth noting that domain adaptation is an active area of research, and new techniques and approaches are constantly emerging. Additionally, the effectiveness of these techniques can vary depending on the specific domain, task, and language model used. 38 | 39 | # This workshop 40 | 41 | For this particular workshop, we'll use ORPO, given its effienciency while managing the resources required to adapt our LLMs. It uses less memory to achieve similar results to DPO, resulting in a cheaper setup. 42 | 43 | Duration: Approximately 60 minutes 44 | 45 | ## ORPO 46 | [ORPO](https://arxiv.org/html/2403.07691v2) is a fine-tuning technique that streamlines the process of adapting LLMs to specific tasks. It addresses a limitation of the traditional two-stage approach. While SFT effectively adapts the model to a desired domain, it can inadvertently increase the probability of generating undesirable responses alongside preferred ones. 47 | 48 | ![orpo intro](./docs/imgs/6-orpo-intro.png) 49 | 50 | Here’s a breakdown of the issue: 51 | - Supervised Fine-Tuning (SFT): Trains the LLM on task-specific data, improving its performance in that domain. 52 | - Drawback: During SFT, the probability of generating undesirable responses along with preferred ones also increases, as shown in the image. 53 | 54 | ![orpo intro](./docs/imgs/6-orpo-curve.png) 55 | 56 | Preference alignment is then employed to address this issue. It aims to: 57 | 58 | - Increase the likelihood of generating preferred responses. 59 | - Decrease the likelihood of generating rejected responses. 60 | 61 | Traditionally, preference alignment is achieved through techniques like Reinforcement Learning with Human Feedback (RLHF) or Direct Preference Optimization (DPO). However, these methods require a separate reference model, increasing computational complexity. 62 | 63 | ORPO elegantly solves this problem by combining SFT and preference alignment into a single objective function. It modifies the standard language modeling loss by incorporating an odds ratio (OR) term. This term: 64 | 65 | - Weakly penalizes rejected responses. 66 | - Strongly rewards preferred responses. 67 | 68 | By simultaneously optimizing for both objectives, ORPO allows the LLM to learn the target task while aligning its outputs with human preferences. 69 | 70 | [For more details, you can check this blog post](https://huggingface.co/blog/mlabonne/orpo-llama-3). 71 | 72 | In this workshop we'll use an implementation of ORPO provided by Hugging Face Optimum Neuron. 73 | 74 | ## HF Optimum Neuron 75 | 76 | 🤗 Optimum Neuron is the interface between the 🤗 Transformers library and AWS Accelerators including AWS Trainium and AWS Inferentia. It provides a set of tools enabling easy model loading, training and inference on single- and multi-Accelerator settings for different downstream tasks. 77 | 78 | With Optimum Neuron you can bring your transformers training code and with minimal changes execute it on AWS Trainium. Here you can see an example of a training code compatible with Optimum Neuron. 
79 | 80 | ```python 81 | - from transformers import Trainer, TrainingArguments 82 | + from optimum.neuron import NeuronTrainer as Trainer 83 | + from optimum.neuron import NeuronTrainingArguments as TrainingArguments 84 | 85 | from transformers import TrainingArguments 86 | from optimum.neuron import NeuronTrainer as Trainer 87 | 88 | def parse_args(): 89 | ... 90 | 91 | def training_function(args): 92 | 93 | # load dataset from disk and tokenizer 94 | train_dataset = load_from_disk(os.path.join(args.dataset_path, "train")) 95 | ... 96 | 97 | # Download the model from huggingface.co/models 98 | model = AutoModelForSequenceClassification.from_pretrained( 99 | args.model_id, num_labels=num_labels, label2id=label2id, id2label=id2label 100 | ) 101 | 102 | training_args = TrainingArguments( 103 | ... 104 | ) 105 | 106 | # Create Trainer instance 107 | trainer = Trainer( 108 | model=model, 109 | args=training_args, 110 | train_dataset=train_dataset, 111 | eval_dataset=eval_dataset, 112 | compute_metrics=compute_metrics, 113 | ) 114 | 115 | # Start training 116 | trainer.train() 117 | ``` 118 | 119 | For more information about HF Optimum Neuron, please [check the official documentation](https://huggingface.co/docs/optimum-neuron/index). 120 | -------------------------------------------------------------------------------- /workshops/02_DomainAdaptation/docs/imgs/6-orpo-curve.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/ml-specialized-hardware/d73aadcdd1b966d23e5191882f707c0cc01cbe23/workshops/02_DomainAdaptation/docs/imgs/6-orpo-curve.png -------------------------------------------------------------------------------- /workshops/02_DomainAdaptation/docs/imgs/6-orpo-intro.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/ml-specialized-hardware/d73aadcdd1b966d23e5191882f707c0cc01cbe23/workshops/02_DomainAdaptation/docs/imgs/6-orpo-intro.png -------------------------------------------------------------------------------- /workshops/02_DomainAdaptation/docs/imgs/model_alignment_techniques.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/ml-specialized-hardware/d73aadcdd1b966d23e5191882f707c0cc01cbe23/workshops/02_DomainAdaptation/docs/imgs/model_alignment_techniques.png -------------------------------------------------------------------------------- /workshops/03_NKIWorkshop/README.md: -------------------------------------------------------------------------------- 1 | # Building Custom Accelerator Kernels with AWS Neuron Kernel Interface (NKI) 2 | 3 | ## Introduction 4 | 5 | The Neuron Kernel Interface (NKI) is a domain-specific language and runtime that allows developers to write custom kernels optimized for AWS Neuron devices (Trainium/Inferentia). NKI enables direct access to the hardware's compute and memory resources, giving you fine-grained control over how your workloads execute on Neuron accelerators. 6 | 7 | With NKI, you can create highly optimized implementations of custom operators, extend ML frameworks with new functionality, and maximize performance for your unique machine learning workloads. This interface bridges the gap between high-level ML frameworks and the underlying Neuron hardware, providing a programming environment similar to CUDA but specifically designed for AWS Neuron devices. 
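To make this concrete, here is a minimal sketch of a NKI kernel (adapted from the tensor-addition example in the public NKI documentation, and assuming inputs small enough to fit in a single on-chip tile). It follows the load/compute/store pattern described in the NKI Programming Model section below:

```python
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl

@nki.jit
def tensor_add_kernel(a_input, b_input):
    # Output tensor allocated in device memory (HBM)
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)

    # 1) Load: move the inputs from HBM into on-chip memory (SBUF)
    a_tile = nl.load(a_input)
    b_tile = nl.load(b_input)

    # 2) Compute: element-wise addition on data already on-chip
    c_tile = a_tile + b_tile

    # 3) Store: write the result back to device memory
    nl.store(c_output, value=c_tile)
    return c_output
```

The same kernel can be invoked standalone (baremetal or simulation) or from PyTorch via torch-neuronx, which is exactly what the first notebook of this workshop walks through.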
8 | 9 | ## Scenarios and Use Cases for NKI 10 | NKI allows developers to harness the full computational power of AWS Neuron devices by writing kernels that explicitly manage computation and memory operations. This level of control is essential for specialized workloads that require custom optimization beyond what's possible with standard operators provided by ML frameworks. 11 | 12 | ## NKI Programming Model 13 | 14 | NKI follows a three-phase programming model that gives developers explicit control over the execution of their kernels: 15 | 16 | 1. Load - Move data from device memory (HBM) to on-chip memory (SBUF) 17 | * Explicitly define which data to bring into fast on-chip memory 18 | * Control memory access patterns to optimize bandwidth utilization 19 | * Apply data transformations during loading if needed 20 | 21 | 2. Compute - Perform operations using on-chip memory 22 | * Execute arithmetic operations on data in on-chip memory 23 | * Leverage vector and matrix operations for efficient computation 24 | * Utilize specialized hardware units for operations like matrix multiplication 25 | 26 | 3. Store - Move results from on-chip memory back to device memory 27 | * Control when and how results are written back to device memory 28 | * Optimize for memory bandwidth by storing results efficiently 29 | * Apply masks or conditions to selectively update memory 30 | 31 | This programming model is based on the architecture of Neuron devices, which feature a large HBM (High Bandwidth Memory) for storing model weights and activations, and a smaller but faster on-chip memory (SBUF) for active computations. By explicitly managing data movement between these memory tiers, developers can optimize for both performance and energy efficiency. 32 | 33 | ## This Workshop 34 | 35 | This hands-on workshop will teach you how to build, optimize, and integrate custom kernels for AWS Neuron devices using NKI. You'll learn the fundamentals of kernel development, how to integrate kernels with PyTorch, and how to analyze performance with Neuron Profile. 36 | 37 | Duration: Approximately 90 minutes 38 | ## Workshop Outline 39 | 40 | 1. Environment Setup 41 | * Configuring your Trn1 or Inf2 instance 42 | * Installing required packages 43 | * Verifying your setup 44 | 45 | 2. Implementing Your First NKI Kernel 46 | * Understanding the NKI programming model 47 | * Writing a simple tensor addition kernel 48 | * Running kernels in baremetal mode and with PyTorch 49 | 50 | 3. Integrating Prebuilt Kernels 51 | * Using optimized kernels from the neuronxcc.nki.kernels namespace 52 | * Comparing custom Flash Attention implementation with standard attention 53 | * Understanding performance benefits of optimized kernels 54 | 55 | 4. Creating Custom Operators 56 | * Inserting NKI kernels as custom operators in PyTorch 57 | * Implementing forward and backward passes for training 58 | * Supporting autograd for custom operators 59 | 60 | 5. Performance Analysis with Neuron Profile 61 | * Installing and using Neuron Profile 62 | * Capturing execution traces 63 | * Analyzing kernel performance metrics 64 | * Identifying optimization opportunities 65 | 66 | 67 | Each section builds upon the previous one, gradually introducing more advanced concepts while providing hands-on experience with real code examples. 68 | ## Prerequisites 69 | 70 | All you need is an AWS Trn1 or Inf2 instance with the Neuron SDK installed. In the first notebook, you'll set up your environment and verify that NKI is properly installed and configured. 
Then you'll implement your first kernel - a simple tensor addition operation - and learn how to run it both in baremetal mode and through PyTorch. This will establish the foundation for the more advanced topics covered in subsequent notebooks. 71 | 72 | Let's begin with the environment setup and your first NKI kernel! 73 | -------------------------------------------------------------------------------- /workshops/03_NKIWorkshop/notebooks/0-setup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 0 - setup\n", 8 | "In this guide, we will implement a simple “Hello World” style NKI kernel and run it on a NeuronDevice (Trainium/Inferentia2 or beyond device). We will showcase how to invoke a NKI kernel standalone through NKI baremetal mode and also through ML frameworks (PyTorch). Before diving into kernel implementation, let’s make sure you have the correct environment setup for running NKI kernels." 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "## Environment Setup\n", 16 | "You need a [Trn1](https://aws.amazon.com/ec2/instance-types/trn1/) or [Inf2](https://aws.amazon.com/ec2/instance-types/inf2/) instance set up on AWS to run NKI kernels on a NeuronDevice. Once logged into the instance, follow steps below to ensure you have all the required packages installed in your Python environment.\n", 17 | "\n", 18 | "NKI is shipped as part of the Neuron compiler package. To make sure you have the latest compiler package, see [Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/index.html) for an installation guide.\n", 19 | "\n", 20 | "You can verify that NKI is available in your compiler installation by running the following command:" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "import neuronxcc.nki" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "This attempts to import the NKI package. It will error out if NKI is not included in your Neuron compiler version or if the Neuron compiler is not installed. The import might take about a minute the first time you run it. Whenever possible, we recommend using local instance NVMe volumes instead of EBS for executable code.\n", 37 | "\n", 38 | "If you intend to run NKI kernels without any ML framework for quick prototyping, you will also need NumPy installed.\n", 39 | "\n", 40 | "To call NKI kernels from PyTorch, you also need to have torch_neuronx installed. For an installation guide, see PyTorch Neuron Setup. You can verify that you have torch_neuronx installed by running the following command:" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "import torch_neuronx" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "## Implementing your first NKI kernel\n", 57 | "In current NKI release, all input and output tensors must be passed into the kernel as device memory (HBM) tensors on a NeuronDevice. The body of the kernel typically consists of three main phases:\n", 58 | "\n", 59 | "1. Load the inputs from device memory to on-chip memory (SBUF).\n", 60 | "2. Perform the desired computation.\n", 61 | "3. 
Store the outputs from on-chip memory to device memory.\n", 62 | "\n", 63 | "For more details on the above terms, see [NKI Programming Model](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/programming_model.html)." 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "from neuronxcc import nki\n", 73 | "import neuronxcc.nki.language as nl\n", 74 | "\n", 75 | "@nki.jit\n", 76 | "def nki_tensor_add_kernel(a_input, b_input):\n", 77 | " \"\"\"\n", 78 | " NKI kernel to compute element-wise addition of two input tensors\n", 79 | " \"\"\"\n", 80 | " \n", 81 | " # Check all input/output tensor shapes are the same for element-wise operation\n", 82 | " assert a_input.shape == b_input.shape\n", 83 | "\n", 84 | " # Check size of the first dimension does not exceed on-chip memory tile size limit,\n", 85 | " # so that we don't need to tile the input to keep this example simple\n", 86 | " assert a_input.shape[0] <= nl.tile_size.pmax\n", 87 | "\n", 88 | " # Load the inputs from device memory to on-chip memory\n", 89 | " a_tile = nl.load(a_input)\n", 90 | " b_tile = nl.load(b_input)\n", 91 | "\n", 92 | " # Specify the computation (in our case: a + b)\n", 93 | " c_tile = nl.add(a_tile, b_tile)\n", 94 | "\n", 95 | " # Create a HBM tensor as the kernel output\n", 96 | " c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)\n", 97 | "\n", 98 | " # Store the result to c_output from on-chip memory to device memory\n", 99 | " nl.store(c_output, value=c_tile)\n", 100 | "\n", 101 | " # Return kernel output as function output\n", 102 | " return c_output" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "## Running the kernel\n", 110 | "Next, we will cover unique ways to run the above NKI kernel on a NeuronDevice:\n", 111 | "\n", 112 | "1. NKI baremetal: run NKI kernel with no ML framework involvement\n", 113 | "2. PyTorch: run NKI kernel as a PyTorch operator\n", 114 | "3. JAX: run NKI kernel as a JAX operator (not used in this workshop)\n", 115 | "\n", 116 | "All three run modes can call the same kernel function decorated with the `nki.jit` decorator as discussed above:\n", 117 | "```\n", 118 | "@nki.jit\n", 119 | "def nki_tensor_add_kernel(a_input, b_input):\n", 120 | "```\n", 121 | "The `nki.jit` decorator automatically chooses the correct run mode by checking the incoming tensor type:\n", 122 | "\n", 123 | "1. NumPy arrays as input: run in NKI baremetal mode\n", 124 | "2. PyTorch tensors as input: run in PyTorch mode\n", 125 | "3. JAX tensors: run in JAX mode\n", 126 | "\n", 127 | "See [nki.jit](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.jit.html) API doc for more details." 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "### NKI baremetal\n", 135 | "\n", 136 | "Baremetal mode expects input tensors of the NKI kernel to be NumPy arrays. The kernel also converts its NKI output tensors to NumPy arrays. To invoke the kernel, we first initialize the two input tensors `a` and `b` as NumPy arrays. 
Finally, we call the NKI kernel just like any other Python function:" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "import numpy as np\n", 146 | "\n", 147 | "a = np.ones((4, 3), dtype=np.float16)\n", 148 | "b = np.ones((4, 3), dtype=np.float16)\n", 149 | "\n", 150 | "# Run NKI kernel on a NeuronDevice\n", 151 | "c = nki_tensor_add_kernel(a, b)\n", 152 | "\n", 153 | "print(c)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "> Alternatively, we can decorate the kernel with [nki.baremetal](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.baremetal.html) or pass the `mode` parameter to the `nki.jit` decorator, `@nki.jit(mode='baremetal')`, to bypass the dynamic mode detection. See [nki.baremetal](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.baremetal.html) API doc for more available input arguments for the baremetal mode." 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "### PyTorch\n", 168 | "\n", 169 | "To run the above `nki_tensor_add_kernel` kernel using PyTorch, we initialize the input and output tensors as PyTorch `device` tensors instead." 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [ 178 | "import torch\n", 179 | "from torch_xla.core import xla_model as xm\n", 180 | "\n", 181 | "nki_tensor_add_kernel_pytorch = nki.jit(nki_tensor_add_kernel, mode='torchxla')\n", 182 | "\n", 183 | "device = xm.xla_device()\n", 184 | "\n", 185 | "a = torch.ones((4, 3), dtype=torch.float16).to(device=device)\n", 186 | "b = torch.ones((4, 3), dtype=torch.float16).to(device=device)\n", 187 | "\n", 188 | "c = nki_tensor_add_kernel_pytorch(a, b)\n", 189 | "\n", 190 | "print(c) # an implicit XLA barrier/mark-step (triggers XLA compilation)" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "> Alternatively, we can pass the `mode='torchxla'` parameter into the `nki.jit` decorator to bypass the dynamic mode detection." 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "## Release the NeuronCore for the next notebook\n", 205 | "\n", 206 | "Before moving to the next notebook we need to release the NeuronCore. 
If we don't do this, the next notebook will not be able to acquire the NeuronCore resources - you can also stop the kernel via the GUI" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "import IPython\n", 216 | "IPython.Application.instance().kernel.do_shutdown(True)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [] 225 | } 226 | ], 227 | "metadata": { 228 | "kernelspec": { 229 | "display_name": "Python 3 (ipykernel)", 230 | "language": "python", 231 | "name": "python3" 232 | }, 233 | "language_info": { 234 | "codemirror_mode": { 235 | "name": "ipython", 236 | "version": 3 237 | }, 238 | "file_extension": ".py", 239 | "mimetype": "text/x-python", 240 | "name": "python", 241 | "nbconvert_exporter": "python", 242 | "pygments_lexer": "ipython3", 243 | "version": "3.10.12" 244 | } 245 | }, 246 | "nbformat": 4, 247 | "nbformat_minor": 4 248 | } 249 | -------------------------------------------------------------------------------- /workshops/03_NKIWorkshop/notebooks/2-custom-operators.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 2-Custom Operators\n", 8 | "This notebook demonstrates how to insert a NKI kernel as a custom operator into a PyTorch program.\n", 9 | "\n", 10 | "## Using NKI kernels\n", 11 | "To use a NKI kernel, you need to call the decorated NKI function.\n", 12 | "\n", 13 | "Let’s examine a guiding example below where we randomly initialize two inputs, add them together, and then multiply the result by the two input tensors element-wise. This effectively calculates: `a * b * (a + b)`.\n", 14 | "\n", 15 | "We define a common NKI kernel for addition. For more information on the kernel, see [SPMD Tensor Addition](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/tutorials/spmd_tensor_addition.html)."
16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "import neuronxcc.nki as nki\n", 25 | "import neuronxcc.nki.language as nl\n", 26 | "\n", 27 | "@nki.jit\n", 28 | "def nki_tensor_add_kernel_(a_input, b_input):\n", 29 | " \"\"\"NKI kernel to compute element-wise addition of two input tensors\n", 30 | " \n", 31 | " This kernel assumes strict input/output sizes can be uniformly tiled to [128,512]\n", 32 | "\n", 33 | " Args:\n", 34 | " a_input: a first input tensor\n", 35 | " b_input: a second input tensor\n", 36 | "\n", 37 | " Returns:\n", 38 | " c_output: an output tensor\n", 39 | " \"\"\"\n", 40 | "\n", 41 | " # Create output tensor shared between all SPMD instances as result tensor\n", 42 | " c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)\n", 43 | "\n", 44 | " # Calculate tile offsets based on current 'program'\n", 45 | " offset_i_x = nl.program_id(0) * 128\n", 46 | " offset_i_y = nl.program_id(1) * 512\n", 47 | "\n", 48 | " # Generate tensor indices to index tensors a and b\n", 49 | " ix_, iy_ = nl.mgrid[0:128, 0:512]\n", 50 | " ix = offset_i_x + ix_\n", 51 | " iy = offset_i_y + iy_\n", 52 | "\n", 53 | " # Load input data from device memory (HBM) to on-chip memory (SBUF)\n", 54 | " # We refer to an indexed portion of a tensor as an intermediate tensor\n", 55 | " a_tile = nl.load(a_input[ix, iy])\n", 56 | " b_tile = nl.load(b_input[ix, iy])\n", 57 | "\n", 58 | " # compute a + b\n", 59 | " c_tile = a_tile + b_tile\n", 60 | "\n", 61 | " # store the addition results back to device memory (c_output)\n", 62 | " nl.store(c_output[ix, iy], value=c_tile)\n", 63 | "\n", 64 | " # Transfer the ownership of `c_output` to the caller\n", 65 | " return c_output" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "## PyTorch\n", 73 | "We can perform `(a + b) * a * b` using native PyTorch code." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "import torch\n", 83 | "from torch_xla.core import xla_model as xm\n", 84 | "\n", 85 | "device = xm.xla_device()\n", 86 | "\n", 87 | "a = torch.randn(256, 1024, dtype=torch.float32).to(device)\n", 88 | "b = torch.randn(256, 1024, dtype=torch.float32).to(device)\n", 89 | "c = a + b\n", 90 | "out = a * b * c\n", 91 | "\n", 92 | "print(out)" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "Now let’s replace the tensor addition (`c = a + b`) with a NKI kernel. To do this we replace the `+` operator with a call to the NKI kernel caller (`nki_tensor_add`), and everything else works as before." 
100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "def nki_tensor_add(a_input, b_input):\n", 109 | " \"\"\"NKI kernel caller to compute element-wise addition of two input tensors\n", 110 | "\n", 111 | " This kernel caller lifts tile-size restriction, by applying the kernel on tiles of the inputs/outputs\n", 112 | "\n", 113 | " Args:\n", 114 | " a_input: a first input tensor, of shape [N*128, M*512]\n", 115 | " b_input: a second input tensor, of shape [N*128, M*512]\n", 116 | "\n", 117 | " Returns:\n", 118 | " a tensor of shape [N*128, M*512], the result of a_input + b_input\n", 119 | " \"\"\"\n", 120 | "\n", 121 | " # The SPMD launch grid denotes the number of kernel instances.\n", 122 | " # In this case, we use a 2D grid where the size of each invocation is 128x512\n", 123 | " grid_x = a_input.shape[0] // 128\n", 124 | " grid_y = a_input.shape[1] // 512\n", 125 | "\n", 126 | " return nki_tensor_add_kernel_[grid_x, grid_y](a_input, b_input)\n", 127 | "\n", 128 | "device = xm.xla_device()\n", 129 | "a = torch.randn(256, 1024, dtype=torch.float32).to(device)\n", 130 | "b = torch.randn(256, 1024, dtype=torch.float32).to(device)\n", 131 | "c = nki_tensor_add(a, b) # calling a NKI kernel, instead of the built-in torch op\n", 132 | "out = a * b * c\n", 133 | "print(out)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "To understand what happens under the hood when we compile the above code, we can print HLO IR graph generated by XLA by setting the `NEURON_FRAMEWORK_DEBUG` environment variable. For example, you may add the following lines to your code:" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "import os\n", 150 | "os.environ['NEURON_FRAMEWORK_DEBUG'] = \"1\"" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "A `.pbtxt` file is then written in your run directory that has the corresponding human-readable HLO IR.\n", 158 | "\n", 159 | "Let’s examine the XLA output of this example. In line #5 we can identify that the tensor addition is now mapped to an HLO `custom-call` instruction, with `AwsNeuronCustomNativeKernel` as `custom_call_target`. 
The output of that `custom-call` is then consumed by the next instruction in line #6 as usual.\n", 160 | "\n", 161 | "```python\n", 162 | "ENTRY %SyncTensorsGraph.22 (p0.2: f32[256,1024], p1.2: f32[256,1024]) -> (f32[256,1024]) {\n", 163 | " %p1.2 = f32[256,1024]{1,0} parameter(1), frontend_attributes={neff_input_name=\"input1\"}\n", 164 | " %p0.2 = f32[256,1024]{1,0} parameter(0), frontend_attributes={neff_input_name=\"input0\"}\n", 165 | " %multiply = f32[256,1024]{1,0} multiply(f32[256,1024]{1,0} %p1.2, f32[256,1024]{1,0} %p0.2)\n", 166 | " %custom-call.2 = f32[256,1024]{1,0} custom-call(f32[256,1024]{1,0} %p1.2, f32[256,1024]{1,0} %p0.2), custom_call_target=\"AwsNeuronCustomNativeKernel\", api_version=API_VERSION_UNSPECIFIED, backend_config=\"...\")\n", 167 | " %multiply.1 = f32[256,1024]{1,0} multiply(f32[256,1024]{1,0} %multiply, f32[256,1024]{1,0} %custom-call.2)\n", 168 | " ROOT %tuple = (f32[256,1024]{1,0}) tuple(f32[256,1024]{1,0} %multiply.1), frontend_attributes={neff_output_names=\"output0\"}\n", 169 | "}\n", 170 | "```\n", 171 | "\n", 172 | "The Neuron compiler replaces the above custom-call with the corresponding NKI kernel implementation while optimizing the rest of the compute graph as usual. At the end of the compilation process, a single compiled binary NEFF file is generated representing the entire graph including the NKI kernel. For more information about NEFF files, see [Neuron Compiler](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/index.html)." 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "## Using NKI in training graphs\n", 180 | "\n", 181 | "If you are using NKI to implement a new operator in a training graph, you might need to make the new operator interplay with the `autograd` engine in the framework. To do this, in PyTorch, you can subclass the framework’s base operator class and implement both the `forward()` and `backward()` methods. The `autograd` engine then uses the `backward()` method when performing auto-differentiation. See Extending [torch.autograd](https://pytorch.org/docs/stable/notes/extending.html) in the PyTorch Docs for instructions on doing this in PyTorch.\n", 182 | "\n", 183 | "Let’s reuse the `nki_tensor_add` kernel from before and demonstrate how to train a simple compute graph `(a+b)*a*b` in PyTorch.\n", 184 | "\n", 185 | "## PyTorch\n", 186 | "\n", 187 | "We define a `NkiAddFunc` class, which leverages the `nki_tensor_add` kernel in its `forward()` function. The gradients of both input tensors in `y = a + b` are ones, so the `backward()` function propagates the `dy` gradients from the previous backward function." 
188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "import torch\n", 197 | "import torch_xla.core.xla_model as xm\n", 198 | "device = xm.xla_device()\n", 199 | "\n", 200 | "class NkiAddFunc(torch.autograd.Function):\n", 201 | " @staticmethod\n", 202 | " def forward(ctx, a, b):\n", 203 | " return nki_tensor_add(a, b)\n", 204 | "\n", 205 | " @staticmethod\n", 206 | " def backward(ctx, dy, *args):\n", 207 | " # gradients for a and b\n", 208 | " return dy, dy\n", 209 | "\n", 210 | "# now, let's define the compute graph\n", 211 | "a = torch.randn(256, 1024, dtype=torch.float32).to(device).detach().requires_grad_()\n", 212 | "b = torch.randn(256, 1024, dtype=torch.float32).to(device).detach().requires_grad_()\n", 213 | "c = NkiAddFunc.apply(a, b)\n", 214 | "out = a * b * c\n", 215 | "\n", 216 | "# here we define a (dummy) loss-function, in prep for backward propagation\n", 217 | "loss = out.sum()\n", 218 | "\n", 219 | "# lastly, let's invoke the auto-grad engine\n", 220 | "loss.backward()\n", 221 | "\n", 222 | "xm.mark_step()" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "## Release the NeuronCore for the next notebook\n", 230 | "\n", 231 | "Before moving to the next notebook we need to release the NeuronCore. If we don't do this, the next notebook will not be able to acquire the NeuronCore resources - you can also stop the kernel via the GUI" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "import IPython\n", 241 | "IPython.Application.instance().kernel.do_shutdown(True)" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [] 250 | } 251 | ], 252 | "metadata": { 253 | "kernelspec": { 254 | "display_name": "Python 3 (ipykernel)", 255 | "language": "python", 256 | "name": "python3" 257 | }, 258 | "language_info": { 259 | "codemirror_mode": { 260 | "name": "ipython", 261 | "version": 3 262 | }, 263 | "file_extension": ".py", 264 | "mimetype": "text/x-python", 265 | "name": "python", 266 | "nbconvert_exporter": "python", 267 | "pygments_lexer": "ipython3", 268 | "version": "3.10.12" 269 | } 270 | }, 271 | "nbformat": 4, 272 | "nbformat_minor": 4 273 | } 274 | -------------------------------------------------------------------------------- /workshops/03_NKIWorkshop/notebooks/3-neuron-profile.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 3-Neuron Profile \n", 8 | "In this tutorial, we use Neuron Profile to view the execution trace of a NKI kernel captured on a NeuronCore. In doing so, we learn about:\n", 9 | "\n", 10 | "- Installation and usage of Neuron Profile.\n", 11 | "\n", 12 | "- Inspecting a detailed execution timeline of compute engine instructions and DMA engine activities generated from your NKI kernel.\n", 13 | "\n", 14 | "As background, [Neuron Profile](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-profile-user-guide.html) is the tool you need to visualize where time is being spent during kernel execution on NeuronDevices, which is crucial for identifying performance bottlenecks and optimization opportunities in your kernel. 
Neuron Profile produces runtime execution data for every instruction executed on each compute engine and also every data movement activity completed by DMA engines. Neuron Profile also reports key performance metrics such as compute engine and memory bandwidth utilization, which allows developers to quickly find out the achieved hardware efficiency of their kernel. Profiling typically has near zero overhead thanks to the dedicated on-chip profiling hardware in NeuronDevices.\n", 15 | "\n", 16 | "## Profile a NKI Kernel\n", 17 | "\n", 18 | "### Install Neuron Profile\n", 19 | "Make sure you have the latest version of the `aws-neuronx-tools`, which includes updated profiling support for NKI kernels. Neuron Profile is included within this package and is installed to `/opt/aws/neuron/bin`.\n", 20 | "\n", 21 | "The `aws-neuronx-tools` package comes pre-installed on [Neuron DLAMIs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html). For detailed installation instructions see [Neuron Profile User Guide: Installation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-profile-user-guide.html#installation).\n", 22 | "\n", 23 | "### Profile using `neuron-profile capture`\n", 24 | "\n", 25 | "To profile a NKI kernel the required steps are (1) enable `NEURON_FRAMEWORK_DEBUG` to tell the compiler to save the `NEFF` file, (2) execute the NKI kernel to generate the `NEFF`, and (3) run `neuron-profile capture` to generate a `NTFF` profile. Each step is described in more detail below.\n", 26 | "\n", 27 | "We will profile a NKI kernel which computes the element-wise exponential of an input tensor of any 2D shape. The rest of this tutorial will use a performance profile generated from this kernel as an example. Full code of `prof-kernel.py`:" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "%%writefile prof-kernel.py\n", 37 | "\"\"\"\n", 38 | "Example kernel used to demmonstrate Neuron Profile.\n", 39 | "\"\"\"\n", 40 | "import torch\n", 41 | "from neuronxcc import nki\n", 42 | "import neuronxcc.nki.language as nl\n", 43 | "import math\n", 44 | "import os\n", 45 | "os.environ[\"NEURON_FRAMEWORK_DEBUG\"] = \"1\"\n", 46 | "os.environ[\"NEURON_CC_FLAGS\"]= \" --disable-dge \"\n", 47 | "\n", 48 | "@nki.jit\n", 49 | "def tensor_exp_kernel_(in_tensor):\n", 50 | " \"\"\"NKI kernel to compute elementwise exponential of an input tensor\n", 51 | "\n", 52 | " Args:\n", 53 | " in_tensor: an input tensor of ANY 2D shape (up to SBUF size)\n", 54 | " Returns:\n", 55 | " out_tensor: an output tensor of ANY 2D shape (up to SBUF size)\n", 56 | " \"\"\"\n", 57 | " out_tensor = nl.ndarray(in_tensor.shape, dtype=in_tensor.dtype,\n", 58 | " buffer=nl.shared_hbm)\n", 59 | "\n", 60 | " sz_p, sz_f = in_tensor.shape\n", 61 | "\n", 62 | " i_f = nl.arange(sz_f)[None, :]\n", 63 | "\n", 64 | " for p in nl.affine_range(math.ceil(sz_p / nl.tile_size.pmax)):\n", 65 | " # Generate tensor indices for the input/output tensors\n", 66 | " # pad index to pmax, for simplicity\n", 67 | " i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]\n", 68 | "\n", 69 | " # Load input data from external memory to on-chip memory\n", 70 | " # only read up to sz_p\n", 71 | " in_tile = nl.load(in_tensor[i_p, i_f], mask=(i_p\n", 117 | "Use the flag `--disable-dge` to temporarily disable a new compiler feature which is interfering with DMA debugging information display in neuron-profile. 
This is highly recommended to improve NKI performance debugging experience until we release a software fix for this issue.\n", 118 | "\n", 119 | "\n", 120 | "2. Compile your NKI kernel to create a NEFF in your current directory:" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "!python3 prof-kernel.py" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "
\n", 137 | "Find your NEFF named similarly to `MODULE_0_SyncTensorsGraph.13_12659246067793504316.neff`.\n", 138 | "
\n", 139 | "\n", 140 | "3. Profile the NEFF. This profiling step executes the NEFF on the NeuronDevice and records a raw execution trace into an Neuron Trace File Format (NTFF) artifact." 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "!neuron-profile capture -n -s profile.ntff --profile-nth-exec=2" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "This will save your NTFF profile to `profile_exec_2.ntff`.\n", 157 | "\n", 158 | "
\n", 159 | "The `--profile-nth-exec=2` option will profile your NEFF twice on the NeuronDevice and output a NTFF profile for the second iteration. This is recommended to avoid one-time warmup delays which can be seen in the first iteration of execution.\n", 160 | "
\n", 161 | "\n", 162 | "In [View Neuron Profile UI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/neuron_profile_for_nki.html#nki-view-neuron-profile-ui), we will view the profile in a user-friendly format using the Neuron Profile UI.\n", 163 | "\n", 164 | "### Profile using nki.benchmark\n", 165 | "\n", 166 | "You may also use the [nki.benchmark](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.benchmark.html) API to generate a NEFF and NTFF programmatically. One caveat is [nki.benchmark](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.benchmark.html) runs your NEFF without an ML framework in [nki.baremetal](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.baremetal.html) mode, so the input tensors to the kernel must be NumPy arrays instead of framework tensors such as `torch.Tensor`.\n", 167 | "\n", 168 | "Below is an example NKI kernel decorated by [nki.benchmark](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.benchmark.html). Full code of `prof-kernel-benchmark.py`:" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "%%writefile prof-kernel-benchmark.py\n", 178 | "\"\"\"\n", 179 | "Example kernel used to demmonstrate Neuron Profile with nki.benchmark.\n", 180 | "\"\"\"\n", 181 | "from neuronxcc import nki\n", 182 | "from neuronxcc.nki.typing import tensor\n", 183 | "import neuronxcc.nki.language as nl\n", 184 | "import math\n", 185 | "\n", 186 | "\n", 187 | "@nki.benchmark(save_neff_name='file.neff', save_trace_name='profile.ntff')\n", 188 | "def tensor_exp_kernel_(in_tensor):\n", 189 | " \"\"\"NKI kernel to compute elementwise exponential of an input tensor\n", 190 | " Args:\n", 191 | " in_tensor: an input tensor of ANY 2D shape (up to SBUF size)\n", 192 | " Returns:\n", 193 | " out_tensor: an output tensor of ANY 2D shape (up to SBUF size)\n", 194 | " \"\"\"\n", 195 | " out_tensor = nl.ndarray(in_tensor.shape, dtype=in_tensor.dtype,\n", 196 | " buffer=nl.shared_hbm)\n", 197 | "\n", 198 | " sz_p, sz_f = in_tensor.shape\n", 199 | " i_f = nl.arange(sz_f)[None, :]\n", 200 | " for p in nl.affine_range(math.ceil(sz_p / nl.tile_size.pmax)):\n", 201 | " # Generate tensor indices for the input/output tensors\n", 202 | " # pad index to pmax, for simplicity\n", 203 | " i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]\n", 204 | " # Load input data from external memory to on-chip memory\n", 205 | " # only read up to sz_p\n", 206 | " in_tile = nl.load(in_tensor[i_p, i_f], mask=(i_p