├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── TUTORIALS.md
├── blogs
    └── 01_LLama3-8B_Inferentia_EKS_vLLM
    │   ├── Dockerfile
    │   ├── Readme.md
    │   ├── cluster-config.yaml
    │   ├── deployment.yaml
    │   └── nodegroup-config.yaml
├── tutorials
    ├── 01_EmbeddingsFromTextWithBert
    │   ├── 01_TextFeatureExtractionForSimilarity.ipynb
    │   ├── code
    │   │   └── inference.py
    │   └── frank_chap01.txt
    ├── 02_ObjectTrackingSageMakerGStreamer
    │   ├── 01_Yolov7SageMakerInferentia.ipynb
    │   ├── 02_CVPipeline.ipynb
    │   ├── README.md
    │   ├── code_01
    │   │   └── inference.py
    │   ├── code_02
    │   │   └── pipeline.py
    │   ├── container_01
    │   │   └── Dockerfile
    │   ├── container_02
    │   │   └── Dockerfile
    │   └── libs
    │   │   ├── cvpipeline.py
    │   │   ├── smcvpipeline.py
    │   │   └── tracker.py
    ├── 03_QuestionAnsweringMachine
    │   ├── 01_QuestionAnsweringWithT5SSM.ipynb
    │   ├── src
    │   │   └── question_answering.py
    │   └── train.csv.gz
    ├── 04_ImageGenerationWithStableDiffusion
    │   ├── SDInf2HFOptimumNeuron.ipynb
    │   └── SDOnInf2AndHFOptimumNeuron_SMSSH.ipynb
    ├── 05_FastQuestionAnsweringWithBertQA
    │   └── BertQAInferentia1.ipynb
    ├── 06_FinetuneLLMs
    │   ├── 01_Finetune_LLMs.ipynb
    │   └── 02_Deploy_Llama2-7B.ipynb
    ├── 07_DeployToInferentiaWithTGI
    │   └── inf2-tgi-demo.ipynb
    └── 08_TextClassificationWithNaturalLanguageInference
    │   └── NLI_with_BART_inf2.ipynb
└── workshops
    ├── 01_FineTuneSpamClassifier
        ├── README.md
        ├── docs
        │   ├── imgs
        │   │   └── 01_activities.png
        │   └── optimum_neuron_models.md
        └── notebooks
        │   ├── 01_DatasetPreparation.ipynb
        │   ├── 02_ModelFineTuning.ipynb
        │   ├── 03_ModelInference.ipynb
        │   ├── 03_ModelInferenceInf1.ipynb
        │   ├── requirements.txt
        │   └── src
        │       ├── compile.py
        │       ├── dump_model_table.py
        │       ├── requirements.txt
        │       └── train.py
    ├── 02_DomainAdaptation
        ├── README.md
        ├── docs
        │   └── imgs
        │   │   ├── 6-orpo-curve.png
        │   │   ├── 6-orpo-intro.png
        │   │   └── model_alignment_techniques.png
        └── notebooks
        │   ├── 01_ModelAdaptationWithOrpo.ipynb
        │   └── 02_DeployModel.ipynb
    └── 03_NKIWorkshop
        ├── README.md
        └── notebooks
            ├── 0-setup.ipynb
            ├── 1-intergrate-prebuild-kernel.ipynb
            ├── 2-custom-operators.ipynb
            └── 3-neuron-profile.ipynb


/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 | 


--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
 1 | # Contributing Guidelines
 2 | 
 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
 4 | documentation, we greatly value feedback and contributions from our community.
 5 | 
 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
 7 | information to effectively respond to your bug report or contribution.
 8 | 
 9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 | 
39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 | 
42 | 
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 | 
46 | 
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 | 
52 | 
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.
55 | 
56 | 
57 | ## Licensing
58 | 
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT No Attribution
 2 | 
 3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
 6 | this software and associated documentation files (the "Software"), to deal in
 7 | the Software without restriction, including without limitation the rights to
 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
 9 | the Software, and to permit persons to whom the Software is furnished to do so.
10 | 
11 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
12 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
13 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
14 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
15 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
16 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17 | 
18 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # How to reduce costs and improve performance of your Machine Learning (ML) workloads?
 2 | 
 3 | In this repo you'll learn how to use [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/) and [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) with [Amazon SageMaker](https://aws.amazon.com/sagemaker/) and [Hugging Face Optimum Neuron](https://huggingface.co/docs/optimum-neuron/index), to optimize your ML workloads! Here you find workshops, tutorials, blog post content, etc. you can use to learn and inspire your own solution.
 4 | 
 5 | 
 6 | The content you find here is focused on particular use cases. If you're looking for standalone model samples for inference and training, please check this other repo: https://github.com/aws-neuron/aws-neuron-samples. 
 7 | 
 8 | ### Workshops
 9 | 
10 | |Title||
11 | |:-|:-|
12 | |[Fine-tune and deploy LLM from Hugging Face on AWS Trainium and AWS Inferentia](workshops/01_FineTuneSpamClassifier)|Learn how to create a spam classifier that can be easily integrated to your own application|
13 | |[Adapting LLMs for domain-aware applications with AWS Trainium post-training](workshops/02_DomainAdaptation)|Learn how to adapt a pre-trained model to your own business needs and add a conversational interface your customers can interact with|
14 | |[Building Custom Accelerator Kernels with AWS Neuron Kernel Interface (NKI)](workshops/03_NKIWorkshop)|Learn how to use the Neuron Kernel Interface (NKI) to write kernels for Neuron accelerators|
15 | 
16 | 
17 | These workshops are supported by **AWS Workshop Studio**
18 | 
19 | ### Tutorials
20 | 
21 | |Description|
22 | |:-|
23 | |[inf1 - Extract embeddings from raw text](tutorials/01_EmbeddingsFromTextWithBert)|
24 | |[inf1 - Track objects in video streaming using CV](tutorials/02_ObjectTrackingSageMakerGStreamer)|
25 | |[inf1 - Create a closed question Q&A model](tutorials/03_QuestionAnsweringMachine)|
26 | |[ind2 - Generate images using SD](tutorials/04_ImageGenerationWithStableDiffusion)|
27 | |[inf1 - Answer questions given a context](tutorials/05_FastQuestionAnsweringWithBertQA)|
28 | |[trn1 - Fine-tune a LLM using distributed training](tutorials/06_FinetuneLLMs)|
29 | |[inf2 - Deploy a LLM to HF TGI](tutorials/07_DeployToInferentiaWithTGI)|
30 | |[inf2 - Porting BART for Multi-Genre Natural Language Inference](tutorials/08_TextClassificationWithNaturalLanguageInference)|
31 | 
32 | ### Blog posts content
33 | |Description|
34 | |:-|
35 | |[Llama3-8B Deployment on AWS Inferentia 2 with Amazon EKS and vLLM](blogs/01_LLama3-8B_Inferentia_EKS_vLLM/)|
36 | 
37 | ## Contributing
38 | If you have questions, comments, suggestions, etc. please feel free to cut tickets in this repo.
39 | 
40 | Also, please refer to the [CONTRIBUTING](CONTRIBUTING.md) document for further details on contributing to this repository.
41 | 


--------------------------------------------------------------------------------
/TUTORIALS.md:
--------------------------------------------------------------------------------
 1 | # Applied AI/ML Specialized Hardware
 2 | 
 3 | Specialized Hardware is a ML (Machine Learning) model accelerator for Inference or Training, like [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/), [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/), [SIMD accel in CPUs](https://en.wikipedia.org/wiki/SIMD) and GPUs. In this repo you'll find reference implementations of different use cases (applications) for Computer Vision, Natural Language Processing, etc. that make use of hardware acceleration to reduce model execution latency and increase throughput.
 4 | 
 5 | **Use cases** are represented by questions which can be answered by the reference implementation linked to it.
 6 | 
 7 | If you're looking for technical samples that show how to run specific models on Trainium (trn1) and Inferentia (inf1 & inf2), go to [AWS Neuron Samples](https://github.com/aws-neuron/aws-neuron-samples)
 8 | 
 9 | ## Tutorials/Reference implementations
10 | |Use Case|Description|
11 | |-|-|
12 | |[How to track people in video files?](tutorials/01_ObjectTrackingSageMakerGStreamer/)|CV/ML Pipeline to process video files in batch with SageMaker+Inferentia, GStreamer and Yolov7+ByteTrack|
13 | |[How to measure the similarity between two sentences?](tutorials/02_EmbeddingsFromTextWithBert/)|Compute the semantic similarity of two or more sentences by extracting their embeddings with SageMaker+Inferentia and HF Bert Case|
14 | |[How to create a mechanism to answer questions from a FAQ?](tutorials/03_QuestionAnsweringMachine/)|Fine tune a T5-ssm model (on SageMaker & Trainium) to build a Q&A mechanism, more powerful than a classic chatbot, to answer questions from a FAQ, sent by your customers|
15 | |[How to generate images based on a text input?](tutorials/04_ImageGenerationWithSDXL/)|Deploy an SDXL model to inferentia 2 + SageMaker using HF Optimum Neuron|
16 | |[How to create a really fast question answering mechanism?](tutorials/05_FastQuestionAnsweringWithBertQA/)|Deploy a BertQA model to Inferentia1 and SageMaker to build a fast and cheap Q&A mechanism|
17 | |[How to classify pieces of text via Natural Language Inference?](tutorials/08_TextClassificationWithNaturalLanguageInference)|Classify texts on custom selected topics with BART and inf2 instances|
18 | 
19 | ## Contributing
20 | If you have a question related to a business challenge that must be answered by an accelerated AI/ML solution, like the content in this repo, then you can contribute. You can just open an issue with your question or if you have the skills, implement a solution (tutorial, workshop, etc.) using Jupyter notebooks (for SageMaker Studio or Notebook Instances) and create a pull request. We appreciate your help.
21 | 
22 | Please refer to the [CONTRIBUTING](CONTRIBUTING.md) document for further details on contributing to this repository.
23 | 


--------------------------------------------------------------------------------
/blogs/01_LLama3-8B_Inferentia_EKS_vLLM/Dockerfile:
--------------------------------------------------------------------------------
 1 | FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.19.1-ubuntu20.04
 2 | 
 3 | # Clone the vllm repository
 4 | RUN git clone https://github.com/vllm-project/vllm.git
 5 | 
 6 | # Set the working directory
 7 | WORKDIR /vllm
 8 | RUN git checkout v0.5.0
 9 | 
10 | # Set the environment variable
11 | ENV VLLM_TARGET_DEVICE=neuron
12 | 
13 | # Install the dependencies
14 | RUN pip install -U -r requirements-neuron.txt
15 | RUN pip install -e .
16 | 
17 | # Modify the arg_utils.py file
18 | RUN sed -i "/parser.add_argument('--block-size',/ {N;N;N;N;N;s/\[8, 16, 32\]/[8, 16, 32, 128, 256, 512, 1024, 2048, 4096, 8192]/}" vllm/engine/arg_utils.py
19 | 
20 | # Install ray
21 | RUN pip install ray
22 | RUN pip install pynvml
23 | 
24 | # Set the entry point
25 | ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]


--------------------------------------------------------------------------------
/blogs/01_LLama3-8B_Inferentia_EKS_vLLM/Readme.md:
--------------------------------------------------------------------------------
 1 | # Llama3-8B Deployment on AWS Inferentia 2 with Amazon EKS and vLLM
 2 | 
 3 | This repository contains the necessary files and configurations to deploy the Llama3-8B model on AWS Inferentia 2 instances using Amazon EKS (Elastic Kubernetes Service) and vLLM.
 4 | 
 5 | ## Files in this Directory
 6 | 
 7 | 1. `Dockerfile`: Defines the container image for running vLLM with Llama3-8B on Inferentia 2.
 8 | 
 9 | 2. `cluster-config.yaml`: Configuration file for creating the Amazon EKS cluster.
10 | 
11 | 3. `deployment.yaml`: Kubernetes deployment configuration for the Llama3-8B model.
12 | 
13 | 4. `nodegroup-config.yaml`: Configuration for the Inferentia 2 node group in the EKS cluster.
14 | 
15 | ## Overview
16 | 
17 | This project demonstrates how to:
18 | 
19 | - Set up an Amazon EKS cluster
20 | - Configure Inferentia 2 node groups
21 | - Build and push a custom Docker image for vLLM
22 | - Deploy Llama3-8B model using vLLM on Inferentia 2 instances
23 | - Configure Kubernetes probes for health checking
24 | - Scale the deployment
25 | 
26 | ## Prerequisites
27 | 
28 | - AWS CLI
29 | - eksctl
30 | - kubectl
31 | - docker
32 | 
33 | ## Getting Started
34 | 
35 | 1. Create the EKS cluster using the `cluster-config.yaml` file.
36 | 2. Set up the Inferentia 2 node group using the `nodegroup-config.yaml` file.
37 | 3. Build and push the Docker image using the provided `Dockerfile`.
38 | 4. Deploy the Llama3-8B model using the `deployment.yaml` file.
39 | 
40 | ## Important Notes
41 | 
42 | - The deployment uses 8 Neuron cores per replica for optimal performance.
43 | - Initial startup time for model compilation is around 25 minutes.
44 | - Proper monitoring and scaling strategies are crucial for production use.
45 | 
46 | For detailed instructions and explanations, please refer to the accompanying blog post.
47 | 
48 | ## Authors
49 | 
50 | - Dmitri Laptev - Senior GenAI Solutions Architect at AWS
51 | - Maurits de Groot - Solutions Architect at AWS
52 | - Ziwen Ning - Software Development Engineer at AWS
53 | - Jianying Lang - Principal Solutions Architect at AWS Worldwide Specialist Organization (WWSO)
54 | 


--------------------------------------------------------------------------------
/blogs/01_LLama3-8B_Inferentia_EKS_vLLM/cluster-config.yaml:
--------------------------------------------------------------------------------
 1 | apiVersion: eksctl.io/v1alpha5
 2 | kind: ClusterConfig
 3 | 
 4 | metadata:
 5 |   name: neuron-cluster
 6 |   region: us-east-1
 7 |   version: "1.30"
 8 | 
 9 | addons:
10 | - name: vpc-cni
11 |   version: latest
12 | 
13 | cloudWatch:
14 |   clusterLogging:
15 |     enableTypes: ["*"]
16 |     
17 | iam:
18 |   withOIDC: true
19 | 


--------------------------------------------------------------------------------
/blogs/01_LLama3-8B_Inferentia_EKS_vLLM/deployment.yaml:
--------------------------------------------------------------------------------
 1 | apiVersion: apps/v1
 2 | kind: Deployment
 3 | metadata:
 4 |   name: neuronx-vllm-deployment
 5 |   labels:
 6 |     app: neuronx-vllm
 7 | spec:
 8 |   replicas: 3
 9 |   selector:
10 |     matchLabels:
11 |       app: neuronx-vllm
12 |   template:
13 |     metadata:
14 |       labels:
15 |         app: neuronx-vllm
16 |     spec:
17 |       schedulerName: my-scheduler
18 |       containers:
19 |       - name: neuronx-vllm
20 |         image: <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/vllm-neuron:latest
21 |         resources:
22 |           limits:
23 |             cpu: 32
24 |             memory: "64G"
25 |             aws.amazon.com/neuroncore: "8"
26 |           requests:
27 |             cpu: 32
28 |             memory: "64G"
29 |             aws.amazon.com/neuroncore: "8"
30 |         ports:
31 |         - containerPort: 8000
32 |         env:
33 |         - name: HF_TOKEN
34 |           value: <HF_TOKEN>
35 |         - name: FI_EFA_FORK_SAFE
36 |           value: "1"
37 |         args:
38 |         - "--model"
39 |         - "meta-llama/Meta-Llama-3-8B"
40 |         - "--tensor-parallel-size"
41 |         - "8"
42 |         - "--max-num-seqs"
43 |         - "8"
44 |         - "--max-model-len"
45 |         - "8192"
46 |         - "--block-size"
47 |         - "8192"
48 |         readinessProbe:
49 |           httpGet:
50 |             path: /health
51 |             port: 8000
52 |           initialDelaySeconds: 1800
53 |           periodSeconds: 10
54 |           timeoutSeconds: 5
55 |           failureThreshold: 5
56 | 
57 |         livenessProbe:
58 |           httpGet:
59 |             path: /health
60 |             port: 8000
61 |           initialDelaySeconds: 1800
62 |           periodSeconds: 15
63 |           timeoutSeconds: 5
64 |           failureThreshold: 5
65 | 
66 |         startupProbe:
67 |           httpGet:
68 |             path: /health
69 |             port: 8000
70 |           initialDelaySeconds: 1800
71 |           periodSeconds: 10
72 |           timeoutSeconds: 5
73 |           failureThreshold: 180 


--------------------------------------------------------------------------------
/blogs/01_LLama3-8B_Inferentia_EKS_vLLM/nodegroup-config.yaml:
--------------------------------------------------------------------------------
 1 | apiVersion: eksctl.io/v1alpha5
 2 | kind: ClusterConfig
 3 | 
 4 | metadata:
 5 |   name: genai
 6 |   region: us-east-1
 7 |   version: "1.30"
 8 |     
 9 | managedNodeGroups:
10 |   - name: neuron-group
11 |     instanceType: inf2.48xlarge
12 |     desiredCapacity: 1
13 |     minSize: 1
14 |     maxSize: 1
15 |     volumeSize: 500
16 |     ami: ami-0077f86889fb430bf
17 |     amiFamily: AmazonLinux2
18 |     iam:
19 |       attachPolicyARNs:
20 |       - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
21 |       - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
22 |       - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
23 |       - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
24 | 
25 |     overrideBootstrapCommand: |
26 |       #!/bin/bash
27 | 
28 |       /etc/eks/bootstrap.sh genai
29 | 


--------------------------------------------------------------------------------
/tutorials/01_EmbeddingsFromTextWithBert/01_TextFeatureExtractionForSimilarity.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "16d74ffb",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "# Measure document similarities by extraction features from text inputs\n",
  9 |     "\n",
 10 |     "Create a mechanism to extract features (embeddings) from text inputs. With the embeddings you can then compute the distance between two or more sentences. This is useful if you're building a search mechanism or trying to see how **\"semantically\"** two sentences are close.\n",
 11 |     "\n",
 12 |     "For that purpose you'll use a **[Bert base](https://huggingface.co/bert-base-cased-finetuned-mrpc)** model, accelerated by an inf1 instance ([AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/)), running on SageMaker.\n",
 13 |     "\n",
 14 |     "For maximum performance and flexibility, you'll prepare the model with \"Neuron Core Pipeline\" and \"Dynamic Batch Size\" enabled. The first technique will shard the model across multiple cores to improve throughput. The second technique will allow you to send requests with different batch sizes. [Read more about these feature here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/pipeline_tutorial/neuroncore_pipeline_pytorch.html).\n",
 15 |     "\n",
 16 |     "The text samples used in this notebook were extracted from: https://www.gutenberg.org/cache/epub/84/pg84-images.html#chap01"
 17 |    ]
 18 |   },
 19 |   {
 20 |    "cell_type": "markdown",
 21 |    "id": "fc1cb598",
 22 |    "metadata": {},
 23 |    "source": [
 24 |     "## 1) Compile a pre-trained model\n",
 25 |     "When you deploy a model to a SageMaker Endpoint/inf1 instance (AWS Inferentia), you first need to compile the model with NeuronSDK. We'll use a sample provided by the official AWS Neuron SDK + Inferentia Samples.\n",
 26 |     "\n",
 27 |     "- Clone the repo: https://github.com/aws-neuron/aws-neuron-samples\n",
 28 |     "- Load the jupyter notebook for BertBaseCased: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuron/inference/bertbasecased/\n",
 29 |     "- Start running the notebook, but enable Dynamic Batch and also Neuron Core Pipelines for 4 Neuron Cores, in model compilation section, as following:\n",
 30 |     "\n",
 31 |     "```python\n",
 32 |     "import os\n",
 33 |     "import torch\n",
 34 |     "import torch.neuron\n",
 35 |     "\n",
 36 |     "save_dir='model'\n",
 37 |     "neuron_model = torch.neuron.trace(\n",
 38 |     "    model, example_inputs=example_inputs_paraphrase,\n",
 39 |     "    dynamic_batch_size=True,\n",
 40 |     "    compiler_args['--neuron-core-pipeline', '4']\n",
 41 |     ")\n",
 42 |     "model.config.update({\"traced_sequence_length\": max_length})\n",
 43 |     "\n",
 44 |     "## Export 1/compiled model; 2/ tokenizer and 3/ model configs\n",
 45 |     "model_neuron.save(os.path.join(save_dir,\"model_neuron.pt\"))\n",
 46 |     "tokenizer.save_pretrained(save_dir)\n",
 47 |     "model.config.save_pretrained(save_dir)\n",
 48 |     "\n",
 49 |     "```"
 50 |    ]
 51 |   },
 52 |   {
 53 |    "cell_type": "markdown",
 54 |    "id": "61189f98",
 55 |    "metadata": {},
 56 |    "source": [
 57 |     "## 2) Pack and upload the model to S3\n",
 58 |     "After compiling the model with the instructions above, **COPY** the entire **save_dir** to the same directory of this Notebook."
 59 |    ]
 60 |   },
 61 |   {
 62 |    "cell_type": "code",
 63 |    "execution_count": null,
 64 |    "id": "d60a3908",
 65 |    "metadata": {},
 66 |    "outputs": [],
 67 |    "source": [
 68 |     "import io\n",
 69 |     "import tarfile\n",
 70 |     "import sagemaker\n",
 71 |     "\n",
 72 |     "save_dir='model'\n",
 73 |     "sess = sagemaker.Session()\n",
 74 |     "sagemaker_session_bucket = sess.default_bucket()\n",
 75 |     "with io.BytesIO() as file:\n",
 76 |     "    with tarfile.open(fileobj=file, mode=\"w:gz\") as tar:\n",
 77 |     "        tar.add(save_dir, \".\")\n",
 78 |     "        tar.list()\n",
 79 |     "    file.seek(0)\n",
 80 |     "    s3_uri = sess.upload_string_as_file_body(\n",
 81 |     "        file.read(), sagemaker_session_bucket, \"model/bert/model.tar.gz\"\n",
 82 |     "    )\n",
 83 |     "print(s3_uri)"
 84 |    ]
 85 |   },
 86 |   {
 87 |    "cell_type": "markdown",
 88 |    "id": "7ef38d42",
 89 |    "metadata": {},
 90 |    "source": [
 91 |     "## 3) Inference script used by SageMaker endpoint to load and execute the model\n",
 92 |     "This script is responsible for loading the model and expose a webservice for us to invoke and get predictions (embeddings)"
 93 |    ]
 94 |   },
 95 |   {
 96 |    "cell_type": "code",
 97 |    "execution_count": null,
 98 |    "id": "81760e79",
 99 |    "metadata": {},
100 |    "outputs": [],
101 |    "source": [
102 |     "!pygmentize code/inference.py"
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "markdown",
107 |    "id": "7799a6eb",
108 |    "metadata": {},
109 |    "source": [
110 |     "## 4) Deploy our model to a SageMaker endpoint"
111 |    ]
112 |   },
113 |   {
114 |    "cell_type": "code",
115 |    "execution_count": null,
116 |    "id": "cbc462b8",
117 |    "metadata": {},
118 |    "outputs": [],
119 |    "source": [
120 |     "import sagemaker\n",
121 |     "sess = sagemaker.Session()\n",
122 |     "\n",
123 |     "# sagemaker session bucket -> used for uploading data, models and logs\n",
124 |     "# sagemaker will automatically create this bucket if it not exists\n",
125 |     "sagemaker_session_bucket = sess.default_bucket()\n",
126 |     "\n",
127 |     "role = sagemaker.get_execution_role()\n",
128 |     "\n",
129 |     "print(f\"sagemaker role arn: {role}\")\n",
130 |     "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n",
131 |     "print(f\"sagemaker session region: {sess.boto_region_name}\")"
132 |    ]
133 |   },
134 |   {
135 |    "cell_type": "code",
136 |    "execution_count": null,
137 |    "id": "8bd55d2b",
138 |    "metadata": {},
139 |    "outputs": [],
140 |    "source": [
141 |     "from sagemaker.huggingface.model import HuggingFaceModel\n",
142 |     "\n",
143 |     "# create Hugging Face Model Class\n",
144 |     "huggingface_model = HuggingFaceModel(\n",
145 |     "   model_data=s3_uri,       # path to your model and script\n",
146 |     "   role=role,                    # iam role with permissions to create an Endpoint\n",
147 |     "   transformers_version=\"4.12\",  # transformers version used\n",
148 |     "   pytorch_version=\"1.9\",        # pytorch version used\n",
149 |     "   py_version='py37',            # python version used\n",
150 |     "   sagemaker_session=sess,\n",
151 |     "   model_server_workers=4, # keep 4 workers\n",
152 |     "   entry_point=\"code/inference.py\",\n",
153 |     "   # for production it is important to define vpc_config and use a vpc_endpoint\n",
154 |     "   #vpc_config={\n",
155 |     "   #    'Subnets': ['subnet-a320a8ca', 'subnet-56d5072d'],\n",
156 |     "   #    'SecurityGroupIds': ['sg-0d8c231d83c1caaa6', 'sg-5504723c']\n",
157 |     "   #}    \n",
158 |     ")\n",
159 |     "\n",
160 |     "# Let SageMaker know that we've already compiled the model via neuron-cc\n",
161 |     "huggingface_model._is_compiled_model = True\n",
162 |     "\n",
163 |     "# deploy the endpoint endpoint\n",
164 |     "predictor = huggingface_model.deploy(\n",
165 |     "    initial_instance_count=1,      # number of instances\n",
166 |     "    instance_type=\"ml.inf1.6xlarge\" # AWS Inferentia Instance\n",
167 |     ")"
168 |    ]
169 |   },
170 |   {
171 |    "cell_type": "markdown",
172 |    "id": "380e48a5",
173 |    "metadata": {},
174 |    "source": [
175 |     "## 5) Run a simple test"
176 |    ]
177 |   },
178 |   {
179 |    "cell_type": "code",
180 |    "execution_count": null,
181 |    "id": "63e60205",
182 |    "metadata": {},
183 |    "outputs": [],
184 |    "source": [
185 |     "from sagemaker.serializers import JSONSerializer\n",
186 |     "from sagemaker.deserializers import NumpyDeserializer\n",
187 |     "predictor.serializer = JSONSerializer()\n",
188 |     "predictor.deserializer = NumpyDeserializer()"
189 |    ]
190 |   },
191 |   {
192 |    "cell_type": "code",
193 |    "execution_count": null,
194 |    "id": "72afc7de",
195 |    "metadata": {},
196 |    "outputs": [],
197 |    "source": [
198 |     "with open('frank_chap01.txt') as f:\n",
199 |     "    data = {'inputs': [l.strip() for l in f.readlines()]}\n",
200 |     "num_sentences = len(data['inputs'])\n",
201 |     "print(f\"Number of sentences: {num_sentences}\")\n",
202 |     "embeddings = predictor.predict(data)\n",
203 |     "print(embeddings.shape)"
204 |    ]
205 |   },
206 |   {
207 |    "cell_type": "markdown",
208 |    "id": "79e975c7",
209 |    "metadata": {},
210 |    "source": [
211 |     "### 5.1) Simple benchmark to identify the best batch_size with 1 client only"
212 |    ]
213 |   },
214 |   {
215 |    "cell_type": "code",
216 |    "execution_count": 51,
217 |    "id": "ff730002",
218 |    "metadata": {},
219 |    "outputs": [
220 |     {
221 |      "name": "stdout",
222 |      "output_type": "stream",
223 |      "text": [
224 |       "Batch size: 1 Elapsed time: 14.544463157653809ms Latency p/s 14.544463157653809ms\n",
225 |       "Batch size: 2 Elapsed time: 23.25267791748047ms Latency p/s 11.626338958740234ms\n",
226 |       "Batch size: 3 Elapsed time: 31.86509609222412ms Latency p/s 10.621698697408041ms\n",
227 |       "Batch size: 4 Elapsed time: 39.96927738189697ms Latency p/s 9.992319345474243ms\n",
228 |       "Batch size: 5 Elapsed time: 48.52888584136963ms Latency p/s 9.705777168273926ms\n",
229 |       "Batch size: 6 Elapsed time: 57.08444118499756ms Latency p/s 9.514073530832926ms\n",
230 |       "Batch size: 7 Elapsed time: 65.29092788696289ms Latency p/s 9.32727541242327ms\n",
231 |       "Batch size: 8 Elapsed time: 74.49376583099365ms Latency p/s 9.311720728874207ms\n",
232 |       "Batch size: 9 Elapsed time: 82.37555027008057ms Latency p/s 9.15283891889784ms\n",
233 |       "Batch size: 10 Elapsed time: 90.54069519042969ms Latency p/s 9.054069519042969ms\n",
234 |       "Batch size: 11 Elapsed time: 99.27759170532227ms Latency p/s 9.025235609574752ms\n"
235 |      ]
236 |     }
237 |    ],
238 |    "source": [
239 |     "import time\n",
240 |     "import copy\n",
241 |     "iterations=10\n",
242 |     "for batch_size in range(1,num_sentences+1):\n",
243 |     "    d = copy.deepcopy(data)\n",
244 |     "    d['inputs'] = d['inputs'][:batch_size]\n",
245 |     "    t=time.time()\n",
246 |     "    for i in range(iterations):\n",
247 |     "        predictor.predict(d)\n",
248 |     "    elapsed = (time.time()-t)/iterations*1000\n",
249 |     "    print(f\"Batch size: {batch_size} Elapsed time: {elapsed}ms Latency p/s {elapsed/batch_size}ms\")"
250 |    ]
251 |   },
252 |   {
253 |    "cell_type": "markdown",
254 |    "id": "658ad164",
255 |    "metadata": {},
256 |    "source": [
257 |     "### 5.2) Now Invoke the endpoint in parallel to evaluate throughput"
258 |    ]
259 |   },
260 |   {
261 |    "cell_type": "code",
262 |    "execution_count": 38,
263 |    "id": "518a7206",
264 |    "metadata": {},
265 |    "outputs": [
266 |     {
267 |      "name": "stdout",
268 |      "output_type": "stream",
269 |      "text": [
270 |       "Elapsed time: 24082.525491714478ms to process 11264 sentences with 5 workers. Latency p/s: 2.1380083000456747ms\n"
271 |      ]
272 |     }
273 |    ],
274 |    "source": [
275 |     "import time\n",
276 |     "from concurrent.futures import ThreadPoolExecutor\n",
277 |     "\n",
278 |     "# custom task that will sleep for a variable amount of time\n",
279 |     "def task(data):\n",
280 |     "    predictor.predict(data)\n",
281 |     "\n",
282 |     "num_workers = 5\n",
283 |     "d = copy.deepcopy(data)\n",
284 |     "documents_1k = [d for i in range(1024)]\n",
285 |     "total_docs = len(documents_1k) * len(data['inputs'])\n",
286 |     "\n",
287 |     "# start the thread pool\n",
288 |     "t=time.time()\n",
289 |     "with ThreadPoolExecutor(num_workers) as executor:\n",
290 |     "    # execute tasks concurrently and process results in order    \n",
291 |     "    executor.map(task, documents_1k)\n",
292 |     "elapsed = (time.time()-t)*1000\n",
293 |     "print(f\"Elapsed time: {elapsed}ms to process {total_docs} sentences with {num_workers} workers. Latency p/s: {elapsed/total_docs}ms\")"
294 |    ]
295 |   },
296 |   {
297 |    "cell_type": "markdown",
298 |    "id": "6ea67331",
299 |    "metadata": {},
300 |    "source": [
301 |     "### 5.3) Finally a similarity test"
302 |    ]
303 |   },
304 |   {
305 |    "cell_type": "code",
306 |    "execution_count": 50,
307 |    "id": "8bbedec9",
308 |    "metadata": {},
309 |    "outputs": [
310 |     {
311 |      "name": "stdout",
312 |      "output_type": "stream",
313 |      "text": [
314 |       "Cosine Similarity: [[0.9238203]]\n"
315 |      ]
316 |     }
317 |    ],
318 |    "source": [
319 |     "from sklearn.metrics.pairwise import cosine_similarity\n",
320 |     "sentence_1=\"I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion.\"\n",
321 |     "sentence_2=\"I watched C-beams glitter in the dark near the Tannhäuser Gate. All those moments will be lost in time, like tears in rain. Time to die.\"\n",
322 |     "embeddings_1,embeddings_2 = predictor.predict({'inputs':[sentence_1, sentence_2]})\n",
323 |     "print(f'Cosine Similarity: {cosine_similarity([embeddings_1],[embeddings_2])}')"
324 |    ]
325 |   }
326 |  ],
327 |  "metadata": {
328 |   "kernelspec": {
329 |    "display_name": "conda_pytorch_p36",
330 |    "language": "python",
331 |    "name": "conda_pytorch_p36"
332 |   },
333 |   "language_info": {
334 |    "codemirror_mode": {
335 |     "name": "ipython",
336 |     "version": 3
337 |    },
338 |    "file_extension": ".py",
339 |    "mimetype": "text/x-python",
340 |    "name": "python",
341 |    "nbconvert_exporter": "python",
342 |    "pygments_lexer": "ipython3",
343 |    "version": "3.6.13"
344 |   }
345 |  },
346 |  "nbformat": 4,
347 |  "nbformat_minor": 5
348 | }
349 | 


--------------------------------------------------------------------------------
/tutorials/01_EmbeddingsFromTextWithBert/code/inference.py:
--------------------------------------------------------------------------------
 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 2 | # SPDX-License-Identifier: MIT-0
 3 | import os
 4 | os.environ['NEURON_RT_NUM_CORES'] = '4'
 5 | import json
 6 | import torch
 7 | import torch.neuron
 8 | from typing import List
 9 | import torch.nn.functional as F
10 | from transformers import AutoConfig, AutoTokenizer
11 | 
12 | def compute_embeddings(features, sentences):
13 |     attention_mask = sentences['attention_mask']
14 |     mask = attention_mask.unsqueeze(-1).expand(features.size()).float()
15 |     masked_embeddings = features * mask
16 |     summed = torch.sum(masked_embeddings, 1)
17 |     summed_mask = torch.clamp(mask.sum(1), min=1e-9)
18 |     
19 |     return (summed / summed_mask).numpy()
20 | 
21 | def model_fn(model_dir):
22 |     # load tokenizer and neuron model from model_dir
23 |     tokenizer = AutoTokenizer.from_pretrained(model_dir)
24 |     model = torch.jit.load(os.path.join(model_dir, "model_neuron.pt"))
25 |     model_config = AutoConfig.from_pretrained(model_dir)
26 | 
27 |     return model, tokenizer, model_config
28 | 
29 | def predict_fn(data, model_tokenizer_model_config):
30 |     # destruct model and tokenizer
31 |     model, tokenizer, model_config = model_tokenizer_model_config
32 |     encoded_input = tokenizer.batch_encode_plus(
33 |         data['inputs'],
34 |         return_tensors="pt",
35 |         max_length=model_config.traced_sequence_length,
36 |         padding="max_length",
37 |         truncation=True,
38 |     )
39 |     # convert for neuron model
40 |     sentences_inputs = encoded_input['input_ids'], encoded_input['attention_mask'], encoded_input['token_type_ids']
41 | 
42 |     with torch.no_grad():
43 |         model_output = model(*sentences_inputs)[0]
44 | 
45 |     # Perform pooling & return numpy
46 |     return compute_embeddings(model_output, encoded_input)
47 | 


--------------------------------------------------------------------------------
/tutorials/01_EmbeddingsFromTextWithBert/frank_chap01.txt:
--------------------------------------------------------------------------------
 1 | I am by birth a Genevese, and my family is one of the most distinguished of that republic. My ancestors had been for many years counsellors and syndics, and my father had filled several public situations with honour and reputation. He was respected by all who knew him for his integrity and indefatigable attention to public business. He passed his younger days perpetually occupied by the affairs of his country; a variety of circumstances had prevented his marrying early, nor was it until the decline of life that he became a husband and the father of a family. 
 2 | As the circumstances of his marriage illustrate his character, I cannot refrain from relating them. One of his most intimate friends was a merchant who, from a flourishing state, fell, through numerous mischances, into poverty. This man, whose name was Beaufort, was of a proud and unbending disposition and could not bear to live in poverty and oblivion in the same country where he had formerly been distinguished for his rank and magnificence. Having paid his debts, therefore, in the most honourable manner, he retreated with his daughter to the town of Lucerne, where he lived unknown and in wretchedness. My father loved Beaufort with the truest friendship and was deeply grieved by his retreat in these unfortunate circumstances. He bitterly deplored the false pride which led his friend to a conduct so little worthy of the affection that united them. He lost no time in endeavouring to seek him out, with the hope of persuading him to begin the world again through his credit and assistance.
 3 | Beaufort had taken effectual measures to conceal himself, and it was ten months before my father discovered his abode. Overjoyed at this discovery, he hastened to the house, which was situated in a mean street near the Reuss. But when he entered, misery and despair alone welcomed him. Beaufort had saved but a very small sum of money from the wreck of his fortunes, but it was sufficient to provide him with sustenance for some months, and in the meantime he hoped to procure some respectable employment in a merchant’s house. The interval was, consequently, spent in inaction; his grief only became more deep and rankling when he had leisure for reflection, and at length it took so fast hold of his mind that at the end of three months he lay on a bed of sickness, incapable of any exertion.
 4 | His daughter attended him with the greatest tenderness, but she saw with despair that their little fund was rapidly decreasing and that there was no other prospect of support. But Caroline Beaufort possessed a mind of an uncommon mould, and her courage rose to support her in her adversity. She procured plain work; she plaited straw and by various means contrived to earn a pittance scarcely sufficient to support life.
 5 | Several months passed in this manner. Her father grew worse; her time was more entirely occupied in attending him; her means of subsistence decreased; and in the tenth month her father died in her arms, leaving her an orphan and a beggar. This last blow overcame her, and she knelt by Beaufort’s coffin weeping bitterly, when my father entered the chamber. He came like a protecting spirit to the poor girl, who committed herself to his care; and after the interment of his friend he conducted her to Geneva and placed her under the protection of a relation. Two years after this event Caroline became his wife.
 6 | There was a considerable difference between the ages of my parents, but this circumstance seemed to unite them only closer in bonds of devoted affection. There was a sense of justice in my father’s upright mind which rendered it necessary that he should approve highly to love strongly. Perhaps during former years he had suffered from the late-discovered unworthiness of one beloved and so was disposed to set a greater value on tried worth. There was a show of gratitude and worship in his attachment to my mother, differing wholly from the doting fondness of age, for it was inspired by reverence for her virtues and a desire to be the means of, in some degree, recompensing her for the sorrows she had endured, but which gave inexpressible grace to his behaviour to her. Everything was made to yield to her wishes and her convenience. He strove to shelter her, as a fair exotic is sheltered by the gardener, from every rougher wind and to surround her with all that could tend to excite pleasurable emotion in her soft and benevolent mind. Her health, and even the tranquillity of her hitherto constant spirit, had been shaken by what she had gone through. During the two years that had elapsed previous to their marriage my father had gradually relinquished all his public functions; and immediately after their union they sought the pleasant climate of Italy, and the change of scene and interest attendant on a tour through that land of wonders, as a restorative for her weakened frame.
 7 | From Italy they visited Germany and France. I, their eldest child, was born at Naples, and as an infant accompanied them in their rambles. I remained for several years their only child. Much as they were attached to each other, they seemed to draw inexhaustible stores of affection from a very mine of love to bestow them upon me. My mother’s tender caresses and my father’s smile of benevolent pleasure while regarding me are my first recollections. I was their plaything and their idol, and something better—their child, the innocent and helpless creature bestowed on them by Heaven, whom to bring up to good, and whose future lot it was in their hands to direct to happiness or misery, according as they fulfilled their duties towards me. With this deep consciousness of what they owed towards the being to which they had given life, added to the active spirit of tenderness that animated both, it may be imagined that while during every hour of my infant life I received a lesson of patience, of charity, and of self-control, I was so guided by a silken cord that all seemed but one train of enjoyment to me.
 8 | For a long time I was their only care. My mother had much desired to have a daughter, but I continued their single offspring. When I was about five years old, while making an excursion beyond the frontiers of Italy, they passed a week on the shores of the Lake of Como. Their benevolent disposition often made them enter the cottages of the poor. This, to my mother, was more than a duty; it was a necessity, a passion—remembering what she had suffered, and how she had been relieved—for her to act in her turn the guardian angel to the afflicted. During one of their walks a poor cot in the foldings of a vale attracted their notice as being singularly disconsolate, while the number of half-clothed children gathered about it spoke of penury in its worst shape. One day, when my father had gone by himself to Milan, my mother, accompanied by me, visited this abode. She found a peasant and his wife, hard working, bent down by care and labour, distributing a scanty meal to five hungry babes. Among these there was one which attracted my mother far above all the rest. She appeared of a different stock. The four others were dark-eyed, hardy little vagrants; this child was thin and very fair. Her hair was the brightest living gold, and despite the poverty of her clothing, seemed to set a crown of distinction on her head. Her brow was clear and ample, her blue eyes cloudless, and her lips and the moulding of her face so expressive of sensibility and sweetness that none could behold her without looking on her as of a distinct species, a being heaven-sent, and bearing a celestial stamp in all her features.
 9 | The peasant woman, perceiving that my mother fixed eyes of wonder and admiration on this lovely girl, eagerly communicated her history. She was not her child, but the daughter of a Milanese nobleman. Her mother was a German and had died on giving her birth. The infant had been placed with these good people to nurse: they were better off then. They had not been long married, and their eldest child was but just born. The father of their charge was one of those Italians nursed in the memory of the antique glory of Italy—one among the schiavi ognor frementi, who exerted himself to obtain the liberty of his country. He became the victim of its weakness. Whether he had died or still lingered in the dungeons of Austria was not known. His property was confiscated; his child became an orphan and a beggar. She continued with her foster parents and bloomed in their rude abode, fairer than a garden rose among dark-leaved brambles.
10 | When my father returned from Milan, he found playing with me in the hall of our villa a child fairer than pictured cherub—a creature who seemed to shed radiance from her looks and whose form and motions were lighter than the chamois of the hills. The apparition was soon explained. With his permission my mother prevailed on her rustic guardians to yield their charge to her. They were fond of the sweet orphan. Her presence had seemed a blessing to them, but it would be unfair to her to keep her in poverty and want when Providence afforded her such powerful protection. They consulted their village priest, and the result was that Elizabeth Lavenza became the inmate of my parents’ house—my more than sister—the beautiful and adored companion of all my occupations and my pleasures.
11 | Everyone loved Elizabeth. The passionate and almost reverential attachment with which all regarded her became, while I shared it, my pride and my delight. On the evening previous to her being brought to my home, my mother had said playfully, “I have a pretty present for my Victor—tomorrow he shall have it.” And when, on the morrow, she presented Elizabeth to me as her promised gift, I, with childish seriousness, interpreted her words literally and looked upon Elizabeth as mine—mine to protect, love, and cherish. All praises bestowed on her I received as made to a possession of my own. We called each other familiarly by the name of cousin. No word, no expression could body forth the kind of relation in which she stood to me—my more than sister, since till death she was to be mine only. 


--------------------------------------------------------------------------------
/tutorials/02_ObjectTrackingSageMakerGStreamer/01_Yolov7SageMakerInferentia.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "d89c3d67",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "# Deploy Yolov7 to SageMaker + Inferentia\n",
  9 |     "\n",
 10 |     "\n",
 11 |     "We'll create a SageMaker real-time endpoint with a Yolov7 model capable of detecting people and predicting the pose of each person. For that purpose, we need to get the model and prepare it to be deployed to AWS Inferentia."
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "markdown",
 16 |    "id": "58b1061d",
 17 |    "metadata": {},
 18 |    "source": [
 19 |     "## 1) Install dependencies"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "code",
 24 |    "execution_count": null,
 25 |    "id": "9e6e0ad9",
 26 |    "metadata": {},
 27 |    "outputs": [],
 28 |    "source": [
 29 |     "# with this library we can build docker images and push them to ECR\n",
 30 |     "%pip install sagemaker-studio-image-build"
 31 |    ]
 32 |   },
 33 |   {
 34 |    "cell_type": "markdown",
 35 |    "id": "efec75fd",
 36 |    "metadata": {},
 37 |    "source": [
 38 |     "## 2) Compile a pre-trained model\n",
 39 |     "When you deploy a model to a SageMaker Endpoint/inf1 instance (AWS Inferentia), you first need compile the model with NeuronSDK. We'll use a sample provided by the official AWS Neuron SDK + Inferentia Samples.\n",
 40 |     "\n",
 41 |     "- Clone the repo: https://github.com/aws-neuron/aws-neuron-samples\n",
 42 |     "- Load the jupyter notebook for Yolov7: https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuron/inference/yolov7\n",
 43 |     "- Start running the notebook, but enable Dynamic Batch and also Neuron Core Pipelines for 4 Neuron Cores,in model compilation section, as following:\n",
 44 |     "\n",
 45 |     "```python\n",
 46 |     "import torch\n",
 47 |     "import torch.neuron\n",
 48 |     "\n",
 49 |     "model_neuron = torch.neuron.trace(\n",
 50 |     "    model, example_inputs=x,\n",
 51 |     "    dynamic_batch_size=True,\n",
 52 |     "    compiler_args['--neuron-core-pipeline', '4']\n",
 53 |     ")\n",
 54 |     "\n",
 55 |     "## Export to saved model\n",
 56 |     "model_neuron.save(\"yolov7_neuron.pt\")\n",
 57 |     "```"
 58 |    ]
 59 |   },
 60 |   {
 61 |    "cell_type": "markdown",
 62 |    "id": "275e2ea1",
 63 |    "metadata": {},
 64 |    "source": [
 65 |     "## 3) Pack and upload the model to S3\n",
 66 |     "After compiling the model with the instructions above, **copy** the model to the same directory of this Notebook"
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "code",
 71 |    "execution_count": null,
 72 |    "id": "2d5a89a8",
 73 |    "metadata": {},
 74 |    "outputs": [],
 75 |    "source": [
 76 |     "import os\n",
 77 |     "import io\n",
 78 |     "import tarfile\n",
 79 |     "import sagemaker\n",
 80 |     "\n",
 81 |     "sagemaker_session = sagemaker.Session()\n",
 82 |     "bucket = sagemaker_session.default_bucket()\n",
 83 |     "image_name='pytorch-inference-neuron'\n",
 84 |     "image_tag=\"1.10.2h-neuron-py37-sdk1.19.0-ubuntu18.04\"\n",
 85 |     "model_s3_path=\"models/yolov7-pose/model.tar.gz\""
 86 |    ]
 87 |   },
 88 |   {
 89 |    "cell_type": "code",
 90 |    "execution_count": null,
 91 |    "id": "70a2c9cd",
 92 |    "metadata": {},
 93 |    "outputs": [],
 94 |    "source": [
 95 |     "with io.BytesIO() as tar_file:\n",
 96 |     "    with tarfile.open(fileobj=tar_file, mode='w:gz') as tar:\n",
 97 |     "        tar.add('yolov7_neuron.pt', 'model.pt')\n",
 98 |     "        tar.list()\n",
 99 |     "    tar_file.seek(0)\n",
100 |     "    s3_uri = sagemaker_session.upload_string_as_file_body(\n",
101 |     "        tar_file.read(), bucket=bucket, key=model_s3_path\n",
102 |     "    )\n",
103 |     "    print(s3_uri)"
104 |    ]
105 |   },
106 |   {
107 |    "cell_type": "markdown",
108 |    "id": "715815db",
109 |    "metadata": {},
110 |    "source": [
111 |     "## 3) Build a custom docker container with additional libraries\n",
112 |     "**YOU DON\"T NEED TO RUN** this section if you already did that before"
113 |    ]
114 |   },
115 |   {
116 |    "cell_type": "markdown",
117 |    "id": "93b49329",
118 |    "metadata": {},
119 |    "source": [
120 |     "We'll extend a pythorch-inference container to apply a patch that allow us to pass CustomAttributes to our code and also to install required libraries like libJPEG Turbo."
121 |    ]
122 |   },
123 |   {
124 |    "cell_type": "code",
125 |    "execution_count": null,
126 |    "id": "d68a0f10",
127 |    "metadata": {},
128 |    "outputs": [],
129 |    "source": [
130 |     "!pygmentize container_01/Dockerfile"
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "code",
135 |    "execution_count": null,
136 |    "id": "59a6698f",
137 |    "metadata": {},
138 |    "outputs": [],
139 |    "source": [
140 |     "!sm-docker build container_01/ --repository $image_name:$image_tag"
141 |    ]
142 |   },
143 |   {
144 |    "cell_type": "markdown",
145 |    "id": "b35e5738",
146 |    "metadata": {},
147 |    "source": [
148 |     "## 4) Inference Code executed by SageMaker Endpoint\n",
149 |     "We need to create a custom inference file to pass to SageMaker. This code has the mechanisms to invoke the model and also pre/post process the input jpeg image & predictions.\n",
150 |     "\n",
151 |     "- **input_fn()**: Will receive the bytes of a .jpeg file. This file needs to be a mosaic, composed of multiple frames in just one image. By using **CustomAttributes** we share some metadata about the mosaic to the endpoint. With tile_width and tile_height we can compute how many images does the mosaic have, parse it and build a batch.\n",
152 |     "- **output_fn()**: Gets the predictions and converts them to a numpy blob"
153 |    ]
154 |   },
155 |   {
156 |    "cell_type": "code",
157 |    "execution_count": null,
158 |    "id": "6a28879e",
159 |    "metadata": {},
160 |    "outputs": [],
161 |    "source": [
162 |     "!pygmentize code_01/inference.py"
163 |    ]
164 |   },
165 |   {
166 |    "cell_type": "markdown",
167 |    "id": "80aae4ec",
168 |    "metadata": {},
169 |    "source": [
170 |     "## 5) Deploy our model to SageMaker"
171 |    ]
172 |   },
173 |   {
174 |    "cell_type": "code",
175 |    "execution_count": null,
176 |    "id": "d3598f0c",
177 |    "metadata": {},
178 |    "outputs": [],
179 |    "source": [
180 |     "import boto3\n",
181 |     "import logging\n",
182 |     "from sagemaker.pytorch.model import PyTorchModel\n",
183 |     "from sagemaker.predictor import Predictor\n",
184 |     "\n",
185 |     "sagemaker_session = sagemaker.Session()\n",
186 |     "\n",
187 |     "account_id = boto3.client('sts').get_caller_identity().get('Account')\n",
188 |     "region_name = sagemaker_session.boto_session.region_name\n",
189 |     "bucket = sagemaker_session.default_bucket()\n",
190 |     "s3_uri=f\"s3://{bucket}/{model_s3_path}\"\n",
191 |     "role=sagemaker.get_execution_role()\n",
192 |     "print(f\"Bucket: {bucket}\\nAWS AccountID: {account_id}\\nRegion: {region_name}\")\n",
193 |     "\n",
194 |     "# https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers\n",
195 |     "image_uri=f\"{account_id}.dkr.ecr.{region_name}.amazonaws.com/{image_name}:{image_tag}\"\n",
196 |     "\n",
197 |     "print(image_uri)\n",
198 |     "sagemaker_model = PyTorchModel(\n",
199 |     "    image_uri=image_uri,\n",
200 |     "    model_data=s3_uri,    \n",
201 |     "    role=role,    \n",
202 |     "    name=\"yolov7-pose-inferentia\",\n",
203 |     "    sagemaker_session=sagemaker_session,\n",
204 |     "    entry_point=\"code_01/inference.py\",\n",
205 |     "    container_log_level=logging.DEBUG,\n",
206 |     "    model_server_workers=4, # keep 4 workers\n",
207 |     "    framework_version=\"1.10.0\",\n",
208 |     "    # for production it is important to define vpc_config and use a vpc_endpoint\n",
209 |     "    #vpc_config={\n",
210 |     "    #    'Subnets': ['<SUBNET1>', '<SUBNET2>'],\n",
211 |     "    #    'SecurityGroupIds': ['<SECURITYGROUP1>', '<DEFAULTSECURITYGROUP>']\n",
212 |     "    #}\n",
213 |     ")\n",
214 |     "sagemaker_model._is_compiled_model = True"
215 |    ]
216 |   },
217 |   {
218 |    "cell_type": "code",
219 |    "execution_count": null,
220 |    "id": "bad27169",
221 |    "metadata": {},
222 |    "outputs": [],
223 |    "source": [
224 |     "predictor = sagemaker_model.deploy(\n",
225 |     "    endpoint_name=\"yolov7-pose-inferentia\",\n",
226 |     "    instance_type=\"ml.inf1.6xlarge\",\n",
227 |     "    initial_instance_count=1\n",
228 |     ")"
229 |    ]
230 |   },
231 |   {
232 |    "cell_type": "markdown",
233 |    "id": "dd2e18d4",
234 |    "metadata": {},
235 |    "source": [
236 |     "## 6) Test the endpoint"
237 |    ]
238 |   },
239 |   {
240 |    "cell_type": "code",
241 |    "execution_count": null,
242 |    "id": "fb939242",
243 |    "metadata": {},
244 |    "outputs": [],
245 |    "source": [
246 |     "%matplotlib inline\n",
247 |     "import os\n",
248 |     "import cv2\n",
249 |     "import numpy as np\n",
250 |     "import urllib.request\n",
251 |     "import matplotlib.pyplot as plt\n",
252 |     "\n",
253 |     "if not os.path.isfile('zidane.jpg'):\n",
254 |     "    urllib.request.urlretrieve(\n",
255 |     "        'https://raw.githubusercontent.com/ultralytics/yolov5/master/data/images/zidane.jpg',\n",
256 |     "        'zidane.jpg'\n",
257 |     "    )\n",
258 |     "    \n",
259 |     "if not os.path.isfile('mosaic4.jpg'):\n",
260 |     "    img = cv2.imread('zidane.jpg')\n",
261 |     "    h,w,c = img.shape\n",
262 |     "    factor = 960/w\n",
263 |     "    new_h,new_w=int(h*factor),int(w*factor)\n",
264 |     "    img = cv2.resize(img, (new_w,new_h))\n",
265 |     "    mosaic = np.zeros((new_h*2, new_w*2, c), dtype=np.uint8)\n",
266 |     "    for i in range(2):\n",
267 |     "        for j in range(2):\n",
268 |     "            ph, pw = i*new_h, j*new_w\n",
269 |     "            mosaic[ph:ph+new_h, pw:pw+new_w] = img[:]\n",
270 |     "    cv2.imwrite('mosaic4.jpg', mosaic)\n",
271 |     "plt.figure(figsize=(15,10))\n",
272 |     "plt.imshow(cv2.cvtColor(cv2.imread('mosaic4.jpg'), cv2.COLOR_BGR2RGB))"
273 |    ]
274 |   },
275 |   {
276 |    "cell_type": "code",
277 |    "execution_count": null,
278 |    "id": "4759e9f7",
279 |    "metadata": {},
280 |    "outputs": [],
281 |    "source": [
282 |     "import json\n",
283 |     "import time\n",
284 |     "import sagemaker\n",
285 |     "import numpy as np\n",
286 |     "from sagemaker.predictor import Predictor\n",
287 |     "from sagemaker.serializers import DataSerializer\n",
288 |     "from sagemaker.deserializers import NumpyDeserializer\n",
289 |     "\n",
290 |     "sagemaker_session = sagemaker.Session()\n",
291 |     "\n",
292 |     "predictor = Predictor(endpoint_name=\"yolov7-pose-inferentia\", sagemaker_session=sagemaker_session)\n",
293 |     "predictor.serializer = DataSerializer(content_type='image/jpeg')\n",
294 |     "predictor.deserializer = NumpyDeserializer()\n",
295 |     "\n",
296 |     "mosaic_size=2\n",
297 |     "custom_attributes={\n",
298 |     "    'CustomAttributes': json.dumps({        \n",
299 |     "        \"tile_width\": 960, \n",
300 |     "        \"tile_height\": 540,\n",
301 |     "        \"conf_thres\": 0.15,\n",
302 |     "        \"iou_thres\": 0.45\n",
303 |     "    })\n",
304 |     "}\n",
305 |     "data = open(f'mosaic{mosaic_size*mosaic_size}.jpg', 'rb').read()\n",
306 |     "t = time.time()\n",
307 |     "y = predictor.predict(data, initial_args=custom_attributes)\n",
308 |     "elapsed = (time.time()-t) * 1000\n",
309 |     "print(f\"Elapsed: {elapsed}, Latency per image: {elapsed / (mosaic_size ** 2)}\")\n",
310 |     "y.shape"
311 |    ]
312 |   }
313 |  ],
314 |  "metadata": {
315 |   "kernelspec": {
316 |    "display_name": "conda_python3",
317 |    "language": "python",
318 |    "name": "conda_python3"
319 |   },
320 |   "language_info": {
321 |    "codemirror_mode": {
322 |     "name": "ipython",
323 |     "version": 3
324 |    },
325 |    "file_extension": ".py",
326 |    "mimetype": "text/x-python",
327 |    "name": "python",
328 |    "nbconvert_exporter": "python",
329 |    "pygments_lexer": "ipython3",
330 |    "version": "3.6.13"
331 |   }
332 |  },
333 |  "nbformat": 4,
334 |  "nbformat_minor": 5
335 | }
336 | 


--------------------------------------------------------------------------------
/tutorials/02_ObjectTrackingSageMakerGStreamer/02_CVPipeline.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "876b4937",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "# CV/ML Pipeline to extract highlights from videos using ML Models\n",
  9 |     "\n",
 10 |     "With this notebook you can create an end-to-end CV/ML Pipeline using [GStreamer](gstreamer.freedesktop.org/) and run ML models to extract information from the frames. We'll use a Person detection + Pose estimation model based on Yolov7 to identify and track people in video files. With Gstreamer we can combine multiple feeds/cameras and create a mosaic of images. This helps us to accelerate the process.\n",
 11 |     "\n",
 12 |     "First, deploy a pre-trained **Yolov7** to a SageMaker endpoint. Follow the instructions in [this notebook](01_Yolov7SageMakerInferentia.ipynb). Then, you can run this notebook."
 13 |    ]
 14 |   },
 15 |   {
 16 |    "cell_type": "markdown",
 17 |    "id": "feef436c",
 18 |    "metadata": {},
 19 |    "source": [
 20 |     "## 1) Install dependencies"
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "code",
 25 |    "execution_count": null,
 26 |    "id": "95e37a31",
 27 |    "metadata": {},
 28 |    "outputs": [],
 29 |    "source": [
 30 |     "# with this library we can build docker images and push them to ECR\n",
 31 |     "%pip install sagemaker-studio-image-build"
 32 |    ]
 33 |   },
 34 |   {
 35 |    "cell_type": "markdown",
 36 |    "id": "f9b932b2",
 37 |    "metadata": {},
 38 |    "source": [
 39 |     "## 2) Initialize some variables"
 40 |    ]
 41 |   },
 42 |   {
 43 |    "cell_type": "code",
 44 |    "execution_count": null,
 45 |    "id": "fa5cbb95",
 46 |    "metadata": {
 47 |     "scrolled": true
 48 |    },
 49 |    "outputs": [],
 50 |    "source": [
 51 |     "import os\n",
 52 |     "import io\n",
 53 |     "import boto3\n",
 54 |     "import tarfile\n",
 55 |     "import sagemaker\n",
 56 |     "\n",
 57 |     "sagemaker_session = sagemaker.Session()\n",
 58 |     "bucket = sagemaker_session.default_bucket()\n",
 59 |     "account_id = boto3.client('sts').get_caller_identity().get('Account')\n",
 60 |     "region_name = sagemaker_session.boto_session.region_name\n",
 61 |     "\n",
 62 |     "image_name='gstreamer'\n",
 63 |     "image_tag=\"py3-1.0\"\n",
 64 |     "image_uri=f\"{account_id}.dkr.ecr.{region_name}.amazonaws.com/{image_name}:{image_tag}\"\n",
 65 |     "print(f'Custom docker image: {image_uri}')"
 66 |    ]
 67 |   },
 68 |   {
 69 |    "cell_type": "markdown",
 70 |    "id": "0279f53b",
 71 |    "metadata": {},
 72 |    "source": [
 73 |     "## 3) Build a custom docker container with additional libraries\n",
 74 |     "**YOU DON\"T NEED TO RUN** this section if you already did that before"
 75 |    ]
 76 |   },
 77 |   {
 78 |    "cell_type": "code",
 79 |    "execution_count": null,
 80 |    "id": "c8fca5e5",
 81 |    "metadata": {},
 82 |    "outputs": [],
 83 |    "source": [
 84 |     "!pygmentize container_02/Dockerfile"
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "markdown",
 89 |    "id": "8298ebba",
 90 |    "metadata": {
 91 |     "scrolled": true
 92 |    },
 93 |    "source": [
 94 |     "### 3.1) Build and push the container image"
 95 |    ]
 96 |   },
 97 |   {
 98 |    "cell_type": "code",
 99 |    "execution_count": null,
100 |    "id": "242b1d6d",
101 |    "metadata": {},
102 |    "outputs": [],
103 |    "source": [
104 |     "!sm-docker build container_02/ --repository $image_name:$image_tag"
105 |    ]
106 |   },
107 |   {
108 |    "cell_type": "markdown",
109 |    "id": "260a05e0",
110 |    "metadata": {},
111 |    "source": [
112 |     "## 4) Create an application for processing our videos\n",
113 |     "This application will run inside a container executed by SageMaker Processing Jobs"
114 |    ]
115 |   },
116 |   {
117 |    "cell_type": "markdown",
118 |    "id": "ddc49c24",
119 |    "metadata": {},
120 |    "source": [
121 |     "### 4.1) Tracker object that makes use of ByteTrack\n",
122 |     "Source: https://github.com/ifzhang/ByteTrack  \n",
123 |     "This class assigns ids to detected objects and keeps track of them across multiple frames"
124 |    ]
125 |   },
126 |   {
127 |    "cell_type": "code",
128 |    "execution_count": null,
129 |    "id": "44a95c38",
130 |    "metadata": {},
131 |    "outputs": [],
132 |    "source": [
133 |     "!pygmentize libs/tracker.py"
134 |    ]
135 |   },
136 |   {
137 |    "cell_type": "markdown",
138 |    "id": "7e0038a7",
139 |    "metadata": {},
140 |    "source": [
141 |     "### 4.2) CV Pipeline that wraps a GStreamer pipeline\n",
142 |     "Extend this class to create your own GStreamer pipeline solution"
143 |    ]
144 |   },
145 |   {
146 |    "cell_type": "code",
147 |    "execution_count": null,
148 |    "id": "213faf51",
149 |    "metadata": {},
150 |    "outputs": [],
151 |    "source": [
152 |     "!pygmentize libs/cvpipeline.py"
153 |    ]
154 |   },
155 |   {
156 |    "cell_type": "markdown",
157 |    "id": "9062a172",
158 |    "metadata": {},
159 |    "source": [
160 |     "### 4.3) SageMaker CV Pipeline\n",
161 |     "Extends a CVPipeline and invokes a SageMaker Endpoint for each frame"
162 |    ]
163 |   },
164 |   {
165 |    "cell_type": "code",
166 |    "execution_count": null,
167 |    "id": "39230ed6",
168 |    "metadata": {},
169 |    "outputs": [],
170 |    "source": [
171 |     "!pygmentize libs/smcvpipeline.py"
172 |    ]
173 |   },
174 |   {
175 |    "cell_type": "markdown",
176 |    "id": "31ee6845",
177 |    "metadata": {},
178 |    "source": [
179 |     "### 4.4) Main application\n",
180 |     "This script will parse all the parameters passed through SageMaker Processing jobs api and invoke the Gstreamer pipeline"
181 |    ]
182 |   },
183 |   {
184 |    "cell_type": "code",
185 |    "execution_count": null,
186 |    "id": "2156a6e7",
187 |    "metadata": {},
188 |    "outputs": [],
189 |    "source": [
190 |     "!pygmentize code_02/pipeline.py"
191 |    ]
192 |   },
193 |   {
194 |    "cell_type": "markdown",
195 |    "id": "0c2ff4c8",
196 |    "metadata": {},
197 |    "source": [
198 |     "### 4.5) Clone the correct version of ByteTrack\n",
199 |     "This library is required when object tracking is enabled"
200 |    ]
201 |   },
202 |   {
203 |    "cell_type": "code",
204 |    "execution_count": null,
205 |    "id": "875e59b8",
206 |    "metadata": {},
207 |    "outputs": [],
208 |    "source": [
209 |     "import os\n",
210 |     "if not os.path.isdir('libs/bytetrack'):\n",
211 |     "    !git clone https://github.com/ifzhang/ByteTrack libs/bytetrack && \\\n",
212 |     "        cd libs/bytetrack && git checkout d1bf019"
213 |    ]
214 |   },
215 |   {
216 |    "cell_type": "markdown",
217 |    "id": "e4249e25",
218 |    "metadata": {},
219 |    "source": [
220 |     "## 5) Kick-off a SageMaker Processing job to process all our video files"
221 |    ]
222 |   },
223 |   {
224 |    "cell_type": "code",
225 |    "execution_count": null,
226 |    "id": "fd0d8a5c",
227 |    "metadata": {},
228 |    "outputs": [],
229 |    "source": [
230 |     "import sagemaker\n",
231 |     "from sagemaker.processing import ScriptProcessor\n",
232 |     "from sagemaker.processing import ProcessingInput, ProcessingOutput\n",
233 |     "from sagemaker.network import NetworkConfig\n",
234 |     "\n",
235 |     "sagemaker_session = sagemaker.Session()\n",
236 |     "bucket = sagemaker_session.default_bucket()\n",
237 |     "print(f\"s3://{bucket}/samples/\")"
238 |    ]
239 |   },
240 |   {
241 |    "cell_type": "markdown",
242 |    "id": "d34408e2",
243 |    "metadata": {},
244 |    "source": [
245 |     "### 5.1) Upload your .mp4 files to S3\n",
246 |     "If you don't have a video now and just want to run some tests, go to https://pixabay.com/videos/ or any other website which has video of people.\n",
247 |     "\n",
248 |     "Download the **.mp4** as 720p (1280x720) files and upload them to the S3 path printed in the last cell (above).\n",
249 |     "\n",
250 |     "Run the following command, then to make sure you uploaded the files:\n",
251 |     "```bash\n",
252 |     "aws s3 ls s3://<YOUR_BUCKET>/samples/  \n",
253 |     "```"
254 |    ]
255 |   },
256 |   {
257 |    "cell_type": "markdown",
258 |    "id": "bf519d08",
259 |    "metadata": {},
260 |    "source": [
261 |     "### 5.2) Finally run the Processing Job"
262 |    ]
263 |   },
264 |   {
265 |    "cell_type": "code",
266 |    "execution_count": null,
267 |    "id": "9ca2f694",
268 |    "metadata": {},
269 |    "outputs": [],
270 |    "source": [
271 |     "import time\n",
272 |     "script_processor = ScriptProcessor(\n",
273 |     "    base_job_name=f'cv-pipeline-{int(time.time()*1000)}',\n",
274 |     "    image_uri=image_uri,\n",
275 |     "    role=sagemaker.get_execution_role(),\n",
276 |     "    instance_type='ml.c5.xlarge',\n",
277 |     "    instance_count=1,\n",
278 |     "    max_runtime_in_seconds=60 * 30,\n",
279 |     "    command=[\"/home/ec2-user/entrypoint.sh\", \"python3\"],\n",
280 |     "    # for production it is important to define vpc_config and use a vpc_endpoint\n",
281 |     "    #vpc_config={\n",
282 |     "    #    'Subnets': ['<SUBNET1>', '<SUBNET2>'],\n",
283 |     "    #    'SecurityGroupIds': ['<SECURITYGROUP1>', '<DEFAULTSECURITYGROUP>']\n",
284 |     "    #}\n",
285 |     ")\n",
286 |     "\n",
287 |     "script_processor.run(\n",
288 |     "    code='code_02/pipeline.py',\n",
289 |     "    inputs=[\n",
290 |     "        # always keep this input in the first place to avoid\n",
291 |     "        # issues with the pipe name\n",
292 |     "        ProcessingInput(\n",
293 |     "            source=f's3://{bucket}/samples',\n",
294 |     "            destination='/opt/ml/processing/input/data',            \n",
295 |     "            s3_input_mode='Pipe',\n",
296 |     "            s3_data_distribution_type='ShardedByS3Key'\n",
297 |     "        ),\n",
298 |     "        ProcessingInput(\n",
299 |     "            source='libs',\n",
300 |     "            destination='/opt/ml/processing/input/libs',\n",
301 |     "            s3_input_mode='File'\n",
302 |     "        )        \n",
303 |     "    ],\n",
304 |     "    outputs=[ProcessingOutput(\n",
305 |     "        source='/opt/ml/processing/output/predictions',\n",
306 |     "        destination=f's3://{bucket}/predictions/',\n",
307 |     "        s3_upload_mode='Continuous'\n",
308 |     "    )],\n",
309 |     "    arguments=[\n",
310 |     "        '--input-shape', '1280 720',\n",
311 |     "        '--endpoint-name', \"yolov7-pose-inferentia\",\n",
312 |     "        '--region-name', 'us-east-1'\n",
313 |     "    ]\n",
314 |     ")"
315 |    ]
316 |   },
317 |   {
318 |    "cell_type": "code",
319 |    "execution_count": null,
320 |    "id": "2836fa31",
321 |    "metadata": {},
322 |    "outputs": [],
323 |    "source": []
324 |   }
325 |  ],
326 |  "metadata": {
327 |   "kernelspec": {
328 |    "display_name": "conda_python3",
329 |    "language": "python",
330 |    "name": "conda_python3"
331 |   },
332 |   "language_info": {
333 |    "codemirror_mode": {
334 |     "name": "ipython",
335 |     "version": 3
336 |    },
337 |    "file_extension": ".py",
338 |    "mimetype": "text/x-python",
339 |    "name": "python",
340 |    "nbconvert_exporter": "python",
341 |    "pygments_lexer": "ipython3",
342 |    "version": "3.6.13"
343 |   }
344 |  },
345 |  "nbformat": 4,
346 |  "nbformat_minor": 5
347 | }
348 | 


--------------------------------------------------------------------------------
/tutorials/02_ObjectTrackingSageMakerGStreamer/README.md:
--------------------------------------------------------------------------------
 1 | # Object Tracking for video files with SageMaker and GStreamer
 2 | 
 3 | Process multiple **video files** with ML models, using [SageMaker Processing Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) and [GStreamer](https://gstreamer.freedesktop.org/) in batch mode. 
 4 | 
 5 | In this tutorial you'll learn how to create a people pathing mechanism that tracks people on video files in batch mode. The output is a set of Numpy files with the predictions of each frame, which contain:  
 6 |  - bouding boxes for each person;
 7 |  - keypoints for pose estimation;
 8 |  - id of each detecte person across the frames;
 9 | 
10 | 
11 | ### Notebooks:
12 |  - [01_Yolov7SageMakerInferentia](01_Yolov7SageMakerInferentia.ipynb): First deploy a real-time endpoint on SageMaker on an Inferentia (inf1) instance, with the Object Detection & Pose estimation model.
13 |  - [02_CVPipeline](02_CVPipeline.ipynb): Launch a SageMaker Processing Job with a Python script that defines a GStreamer pipeline that processes multiple files at once by sending each frame to the endpoint and saving the predictions as Numpy files.
14 | 
15 | 
16 | ### Activities
17 |   - First upload some **.mp4** to an S3 bucket.
18 |   - Run notebook 01: 1/ follow the instructions there to compile a Yolov7 for Inferentia; 2/ deploy the compiled model to an enpoint
19 |   - Run notebook 02: Prepare a python application that will be executed by SageMaker to read the .mp4 files and get the predictions.
20 | 


--------------------------------------------------------------------------------
/tutorials/02_ObjectTrackingSageMakerGStreamer/code_01/inference.py:
--------------------------------------------------------------------------------
  1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
  2 | # SPDX-License-Identifier: MIT-0
  3 | 
  4 | import os
  5 | os.environ['NEURON_RT_NUM_CORES'] = '4'
  6 | import io
  7 | import cv2
  8 | import json
  9 | import time
 10 | import torch
 11 | import torch.neuron
 12 | import numpy as np
 13 | 
 14 | from turbojpeg import TurboJPEG
 15 | 
 16 | class Detector(object):
 17 |     '''Main class responsible for pre/post processing + model invocation'''
 18 |     def __init__(self, model_path):
 19 |             
 20 |         self.model = torch.jit.load(model_path).eval()
 21 |         self.jpeg = TurboJPEG()
 22 |         
 23 |         print(f'Model loaded')
 24 | 
 25 |     def xywh2xyxy(self, x):
 26 |         # Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] 
 27 |         # where xy1=top-left, xy2=bottom-right
 28 |         y = np.copy(x)
 29 |         y[:, 0] = x[:, 0] - x[:, 2] / 2  # top left x
 30 |         y[:, 1] = x[:, 1] - x[:, 3] / 2  # top left y
 31 |         y[:, 2] = x[:, 0] + x[:, 2] / 2  # bottom right x
 32 |         y[:, 3] = x[:, 1] + x[:, 3] / 2  # bottom right y
 33 |         return y
 34 | 
 35 |     # non maximum suppression. Inspired by torchvision.nms
 36 |     def nms(self, bboxes, scores, iou_threshold=0.45):
 37 |         x1 = bboxes[:, 0]
 38 |         y1 = bboxes[:, 1]
 39 |         x2 = bboxes[:, 2]
 40 |         y2 = bboxes[:, 3]
 41 |         areas = (x2 - x1 + 1) * (y2 - y1 + 1) 
 42 |         order = scores.ravel().argsort()[::-1]
 43 |         keep = []
 44 |         while order.size > 0:
 45 |             i = order[0]
 46 |             keep.append(i)
 47 |             xx1 = np.maximum(x1[i], x1[order[1:]])
 48 |             yy1 = np.maximum(y1[i], y1[order[1:]])
 49 |             xx2 = np.minimum(x2[i], x2[order[1:]])
 50 |             yy2 = np.minimum(y2[i], y2[order[1:]])
 51 |             w = np.maximum(0.0, xx2 - xx1 + 1)
 52 |             h = np.maximum(0.0, yy2 - yy1 + 1)
 53 |             inter = w * h
 54 |             iou = inter / (areas[i] + areas[order[1:]] - inter)
 55 |             inds = np.where(iou <= iou_threshold)[0]
 56 |             order = order[inds + 1]
 57 |         bboxes = bboxes[keep]
 58 |         scores = scores[keep]
 59 |         return bboxes, scores, keep
 60 | 
 61 |     def non_max_suppression_kpt(self, prediction, conf_thres=0.25, iou_thres=0.45, classes=None, agnostic=False,
 62 |                             labels=(), kpt_label=False, nc=None, nkpt=None):
 63 |         """Runs Non-Maximum Suppression (NMS) on inference results
 64 | 
 65 |         Returns:
 66 |              list of detections, on (n,6) tensor per image [xyxy, conf, cls, keypoints]
 67 |         """
 68 |         if nc is None:
 69 |             nc = prediction.shape[2] - 5  if not kpt_label else prediction.shape[2] - 56 # number of classes
 70 |         xc = prediction[..., 4] > conf_thres  # candidates
 71 | 
 72 |         # Settings
 73 |         min_wh, max_wh = 2, 4096  # (pixels) minimum and maximum box width and height
 74 |         max_det = 300  # maximum number of detections per image
 75 |         max_nms = 30000  # maximum number of boxes
 76 |         time_limit = 10.0  # seconds to quit after
 77 | 
 78 |         t = time.time()
 79 |         output = [np.zeros((0,57))] * prediction.shape[0]
 80 |         for xi, x in enumerate(prediction):  # image index, image inference
 81 |             # Apply constraints        
 82 |             x = x[xc[xi]]  # confidence
 83 | 
 84 |             # Cat apriori labels if autolabelling
 85 |             if labels and len(labels[xi]):
 86 |                 l = labels[xi]
 87 |                 v = np.zeros((len(l), nc + 5))
 88 |                 v[:, :4] = l[:, 1:5]  # box
 89 |                 v[:, 4] = 1.0  # conf
 90 |                 v[range(len(l)), l[:, 0].long() + 5] = 1.0  # cls
 91 |                 x = np.concatenate((x, v), axis=0)
 92 | 
 93 |             # If none remain process next image
 94 |             if not x.shape[0]:
 95 |                 continue
 96 | 
 97 |             # Compute conf
 98 |             x[:, 5:5+nc] *= x[:, 4:5]  # conf = obj_conf * cls_conf
 99 | 
100 |             # Box (center x, center y, width, height) to (x1, y1, x2, y2)
101 |             box = self.xywh2xyxy(x[:, :4])
102 | 
103 |             if not kpt_label:
104 |                 conf = x[:, 5:].max(axis=1, keepdims=True)
105 |                 j = np.argmax(x[:, 5:], axis=1).reshape(x[:, 5:].shape[0],-1)
106 |                 x = np.concatenate((box, conf, j), axis=1)[conf.ravel() > conf_thres]
107 |             else:
108 |                 kpts = x[:, 6:]
109 |                 conf = x[:, 5:6].max(axis=1, keepdims=True)                
110 |                 j = np.argmax(x[:, 5:6], axis=1).reshape(x[:, 5:6].shape[0],-1)
111 |                 x = np.concatenate((box, conf, j, kpts), axis=1)[conf.ravel() > conf_thres]
112 | 
113 |             # Filter by class
114 |             if classes is not None:
115 |                 x = x[(x[:, 5:6] == classes).any(1)]
116 | 
117 |             # Check shape
118 |             n = x.shape[0]  # number of boxes        
119 |             if not n:  # no boxes
120 |                 continue
121 |             elif n > max_nms:  # excess boxes
122 |                 x = x[x[:, 4].argsort()[::-1][:max_nms]]  # sort by confidence
123 | 
124 |             # Batched NMS
125 |             c = x[:, 5:6] * (0 if agnostic else max_wh)  # classes
126 |             boxes, scores = x[:, :4] + c, x[:, 4]  # boxes (offset by class), scores
127 |             
128 |             boxes,scores,i = self.nms(boxes, scores, iou_thres)  # NMS
129 |             
130 |             #if len(i) > max_det:  # limit detections
131 |             #    i = i[:max_det]
132 |             #    boxes = boxes[:max_det]
133 |             #    scores = scores[:max_det]
134 | 
135 |             output[xi] = x[i]
136 |             if (time.time() - t) > time_limit:
137 |                 print(f'WARNING: NMS time limit {time_limit}s exceeded')
138 |                 break  # time limit exceeded
139 |         return output
140 |  
141 |     def predict(self,x):
142 |         with torch.no_grad():
143 |             return self.model(x).numpy()#torch.from_numpy(x)).numpy()
144 | 
145 |     def preprocess(self, img, img_size=960):
146 |         '''Make the image squared and prepare the tensor as [B,C,H,W]'''
147 |         h,w,c = img.shape
148 |         img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
149 |         if h!=w:
150 |             max_size=max(h,w)
151 |             img_sqr = np.zeros((max_size, max_size,c), dtype=np.uint8)
152 |             img_sqr[0:h,0:w],img = img[:],img_sqr
153 |         x = cv2.resize(img, (img_size, img_size), interpolation=cv2.INTER_LINEAR)
154 |         return x
155 |         #x = np.expand_dims((x.transpose(2,0,1) / 255.0).astype(np.float32), axis=0)
156 |         #return np.ascontiguousarray(x)
157 | 
158 |     def postprocess(self, output, tensor_shape, img_shape, conf_thres=0.15, iou_thres=0.45, nc=1, nkpt=17):
159 |         '''Run NMS to filter bboxes & return detections with keypoints'''
160 |         detections = self.non_max_suppression_kpt(
161 |             output, conf_thres,iou_thres, nc=nc, nkpt=nkpt, kpt_label=True)
162 |         
163 |         # targets in the format
164 |         # [det_index, int(class_id), [x1,y1,x2,y2], conf, [x0,y0,conf0...x16,y16,conf16]]
165 |         targets = []
166 |         for i,det in enumerate(detections):
167 |             bboxes,scores,classes,keypoints = det[:, :4],det[:, 4], det[:, 5],det[:,6:]
168 |             bboxes = bboxes.clip(0,tensor_shape[0])
169 |             # rescale bboxes and poses
170 |             # fix the distortion provoked by preprocess
171 |             tw,th,ih,iw = *tensor_shape, *img_shape
172 |             bboxes = bboxes / [tw,th,tw,th] * [iw,ih,iw,ih]        
173 |             keypoints = (keypoints / ([tw,th,1]*nkpt)) * ([ih,iw,1]*nkpt)
174 |             dets = []
175 |             for index, (box, conf, cls, pose) in enumerate(zip(bboxes,scores,classes,keypoints)):
176 |                 dets.append([index, int(cls), box.astype(np.int32), conf, pose])
177 |             if len(dets)>0: targets.append(dets)
178 |         return targets
179 |     
180 |     def mosaic2batch(self, data, tile_width=960, tile_height=540, img_size=960):
181 |         mosaic = self.jpeg.decode(data)
182 |         h,w,c = mosaic.shape
183 | 
184 |         max_size=max(tile_width, tile_height)
185 |         min_size=min(tile_width, tile_height)
186 |         num_pixels = max_size*max_size*3
187 |         batch_size = h//tile_height * w//tile_width
188 |         batch = torch.zeros(max_size*max_size*c * batch_size, dtype=torch.float32)
189 |         ttl_pixels=0
190 |         # build a batch out of the tiles
191 |         for row in range(h//tile_height):
192 |             for col in range(w//tile_width):
193 |                 pw,ph=col*tile_width,row*tile_height
194 |                 tile = mosaic[ph:ph+tile_height, pw:pw+tile_width]
195 |                 
196 |                 tile = self.preprocess(tile, img_size)
197 |                 
198 |                 batch[ttl_pixels:ttl_pixels + num_pixels] = torch.from_numpy(tile).ravel()
199 |                 ttl_pixels = ttl_pixels + num_pixels
200 | 
201 |         batch = batch.reshape(-1,max_size,max_size,c)
202 |         batch = (batch / 255.0).float() # to FP32
203 |         batch = batch.permute(0,3,1,2) # NHWC --> NCHW
204 | 
205 |         return batch
206 |     
207 | ## SAGEMAKER FUNCTIONS ##
208 | # The following functions are invoked by SageMaker to load the model, 
209 | # receive the payload, invoke the model and prepare the output
210 | def model_fn(model_dir):
211 |     return Detector(os.path.join(model_dir, 'model.pt'))
212 | 
213 | def input_fn(data, content_type, context=None):
214 |     if content_type != 'image/jpeg':
215 |         raise Exception(f'Invalid data type. Expected image/jpeg, got {content_type}')
216 |    
217 |     try:
218 |         custom_attributes = context.get_request_header(0,'X-Amzn-SageMaker-Custom-Attributes')
219 |         params = json.loads(custom_attributes)
220 |         return data, params
221 |     except Exception as e:
222 |         raise Exception(f"You need to pass Custom Attributes")
223 | 
224 | def output_fn(predictions, accept, context=None):
225 |     if accept!='application/x-npy':
226 |         raise Exception(f'Invalid data type. Expected application/x-npy, got {accept}')
227 | 
228 |     with io.BytesIO() as b:   
229 |         data = []
230 |         for i,objs in enumerate(predictions):
231 |             for obj_id, obj_cls, bbox, conf, pose_kpts in objs:
232 |                 data.append(np.hstack([
233 |                     [i, obj_id, obj_cls],
234 |                     bbox.astype(np.float32),
235 |                     pose_kpts
236 |                 ]))
237 |         np.save(b, np.vstack(data))
238 |         b.seek(0)
239 |         return b.read()
240 | 
241 | def predict_fn(data, detector, context=None):
242 |     mosaic,params = data
243 |     # adjust img_size accordinly with the input shape of your model
244 |     img_size=960
245 |     tile_width=params.get('tile_width', 960)
246 |     tile_height=params.get('tile_height', 540)
247 |     conf_thres=params.get('conf_thres', 0.15)
248 |     iou_thres=params.get('iou_thres', 0.45)
249 |     
250 |     x = detector.mosaic2batch(mosaic, tile_width, tile_height, img_size)
251 |     out = detector.predict(x)
252 |     detections = detector.postprocess(out, x.shape[2:], (tile_height, tile_width), conf_thres, iou_thres)
253 |     return detections
254 | 


--------------------------------------------------------------------------------
/tutorials/02_ObjectTrackingSageMakerGStreamer/code_02/pipeline.py:
--------------------------------------------------------------------------------
 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 2 | # SPDX-License-Identifier: MIT-0
 3 | import os
 4 | import sys
 5 | import time
 6 | import argparse
 7 | sys.path.append("/opt/ml/processing/input/libs/bytetrack")
 8 | sys.path.append("/opt/ml/processing/input/libs")
 9 | from smcvpipeline import SageMakerCVPipeline
10 | 
11 | if __name__ == '__main__':
12 |     parser = argparse.ArgumentParser(
13 |                     prog = 'CV Pipeline for ML',
14 |                     description = 'Video streaming processing with ML')
15 |     
16 |     parser.add_argument('-k','--enable-tracking', type=bool, help="Enable object tracking", default=True)
17 |     parser.add_argument('-j','--jpeg-quality', type=int, help="Quality of the jpeg mosaic sent to SM endpoint", default=90)
18 |     parser.add_argument('-e','--endpoint-name', type=str, help="SageMaker Endpoint Name", required=True)
19 |     parser.add_argument('-r','--region-name', type=str, help="Region Name", default="us-east-1")
20 |     parser.add_argument('-p','--preds-per-output-file', type=int, help="Number of predictions per output file", default=150)
21 |     parser.add_argument('-w','--num-workers', type=int, help="Number of workers that will invoke the model", default=5)
22 |     parser.add_argument('-n','--cams-per-row', type=int, help="Number of cams per row", default=2)
23 |     parser.add_argument('-m','--max-cams-per-batch', type=int, help="Max number of cams per batch", default=4)
24 |     parser.add_argument('-i','--input-shape', type=int, help="Resized resolution of the feeds", nargs=2, default=[1280, 720])
25 |     parser.add_argument('-t','--tile-size', type=int, help="Shape of each tile in the mosaic", nargs=2, default=[960, 540])
26 |     parser.add_argument('-c','--conf-thres', type=float, help="Confidence threshold of the object", default=0.15)
27 |     parser.add_argument('-o','--iou-thres', type=float, help="Confidence threshold of the IoU ", default=0.45)
28 |     
29 |     args = parser.parse_args()
30 |     print(args)
31 |     
32 |     cams_per_row=args.cams_per_row
33 |     max_cams_per_batch=args.max_cams_per_batch
34 |     raw_width,raw_height=args.input_shape
35 |     input_dir = "/opt/ml/processing/input"
36 |     output_dir = "/opt/ml/processing/output"    
37 |     failure_file = output_dir + "/failure"
38 |     pipeline = None
39 |     if not os.path.isdir(output_dir): os.makedirs(output_dir)
40 |     
41 |     # Tracking requires sequential processing, that's why we can have only 1 active worker
42 |     if args.enable_tracking and args.num_workers > 1:
43 |         print(f"Tracking enabled. Setting num_workers to 1. Current: {args.num_workers}")
44 |         args.num_workers = 1
45 |     
46 |     try:        
47 |         # list the pipes in the input dir
48 |         # parse the manifest file and extract all file names
49 |         file_names = [f.strip() for f in open(f'{input_dir}/data/input-1-manifest', 'r').readlines()[1:]] # skip first line
50 |         num_batches = ((len(file_names)-1)//max_cams_per_batch) + 1
51 |         print(f"Num files: {len(file_names)}, Num batches: {num_batches}")
52 |         
53 |         for batch in range(num_batches):
54 |             start = batch * max_cams_per_batch
55 |             end = start + min(max_cams_per_batch,len(file_names[start:]))
56 |             
57 |             mosaic,sources = [],[]
58 |             for i,s3_path in enumerate(file_names[start:end]):
59 |                 # convert the s3 path to the expected by awss3src
60 |                 s3_path = s3_path.replace('s3://', f's3://{args.region_name}/')
61 |                 
62 |                 xoff,yoff = raw_width * (i%cams_per_row), raw_height * (i//cams_per_row)
63 |                 mosaic.append(f"sink_{i}::xpos={xoff} sink_{i}::ypos={yoff}")
64 |                 sources.append(f"\n    awss3src uri={s3_path} ! decodebin ! videoconvert ! video/x-raw,format=(string)BGR ")
65 |                 sources.append(f"! videoscale method=0 add-borders=false ! video/x-raw,width={raw_width},height={raw_height} ! queue2 max-size-buffers=1000 ! comp.sink_{i}")
66 |             mosaic,sources = " ".join(mosaic), "".join(sources)
67 |             pre_pipeline = f"""
68 |             compositor name=comp {mosaic}
69 |             ! videoconvert ! video/x-raw,format=BGR ! fakesink name=input {sources}
70 |             """
71 |             print(pre_pipeline)
72 |             params = (
73 |                 pre_pipeline, args.endpoint_name, args.region_name, max_cams_per_batch, 
74 |                 output_dir, args.tile_size, args.conf_thres, args.iou_thres, 
75 |                 args.num_workers, args.preds_per_output_file, args.jpeg_quality,
76 |                 args.enable_tracking
77 |             )
78 |             pipeline = SageMakerCVPipeline(*params)
79 |             t = time.time()
80 |             pipeline.start()
81 |             pipeline.join()
82 |             print(f"Total time: {time.time()-t}")
83 |     except Exception as e:                        
84 |         print(f"ERROR: {sys.exc_info()[0]} {e}")
85 |         with open(failure_file, 'w') as f:
86 |             f.write(str(e))
87 |         raise e
88 |     finally:
89 |         if not pipeline is None and pipeline.is_running(): # should not happen
90 |             print('Stopping pipeline...')
91 |             pipeline.stop()
92 |             pipeline.join()
93 | 


--------------------------------------------------------------------------------
/tutorials/02_ObjectTrackingSageMakerGStreamer/container_01/Dockerfile:
--------------------------------------------------------------------------------
 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 2 | # SPDX-License-Identifier: MIT-0
 3 | ARG REGION_NAME=us-east-1
 4 | ARG ACCOUNT_ID=763104351884
 5 | FROM $ACCOUNT_ID.dkr.ecr.$REGION_NAME.amazonaws.com/pytorch-inference-neuron:1.10.2-neuron-py37-sdk1.19.0-ubuntu18.04
 6 | RUN echo '\
 7 | --- /opt/conda/lib/python3.7/site-packages/sagemaker_inference/transformer.py    2022-08-23 17:26:42.000000000 +0000\n\
 8 | +++ /opt/conda/lib/python3.7/site-packages/sagemaker_inference/transformer_.py   2022-12-07 13:15:09.753360938 +0000\n\
 9 | @@ -250,9 +250,9 @@\n\
10 |                  (response_data, content_type)\n\
11 | \n\
12 |          """\n\
13 | -        data = self._run_handler_function(self._input_fn, *(input_data, content_type))\n\
14 | -        prediction = self._run_handler_function(self._predict_fn, *(data, model))\n\
15 | -        result = self._run_handler_function(self._output_fn, *(prediction, accept))\n\
16 | +        data = self._run_handler_function(self._input_fn, *(input_data, content_type, context))\n\
17 | +        prediction = self._run_handler_function(self._predict_fn, *(data, model, context))\n\
18 | +        result = self._run_handler_function(self._output_fn, *(prediction, accept, context))\n\
19 |          return result\n\
20 | \n\
21 |      def _run_handler_function(self, func, *argv):' > /tmp/transformer.py.patch
22 | 
23 | RUN patch /opt/conda/lib/python3.7/site-packages/sagemaker_inference/transformer.py /tmp/transformer.py.patch
24 | RUN wget --quiet --output-document=/tmp/libjpeg.deb https://netactuate.dl.sourceforge.net/project/libjpeg-turbo/2.1.4/libjpeg-turbo-official_2.1.4_amd64.deb && dpkg -i /tmp/libjpeg.deb && rm -f /tmp/libjpeg.deb
25 | RUN pip3 install PyTurboJPEG
26 | 


--------------------------------------------------------------------------------
/tutorials/02_ObjectTrackingSageMakerGStreamer/container_02/Dockerfile:
--------------------------------------------------------------------------------
 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 2 | # SPDX-License-Identifier: MIT-0
 3 | FROM ubuntu:22.04
 4 | 
 5 | ARG DEBIAN_FRONTEND=noninteractive
 6 | 
 7 | ENV TZ UTC
 8 | ENV LANG=C.UTF-8
 9 | ENV LC_ALL=C.UTF-8
10 | ENV PYTHONDONTWRITEBYTECODE=1
11 | ENV PYTHONUNBUFFERED=1
12 | ENV PYTHONIOENCODING=UTF-8
13 | 
14 | # install required packages
15 | RUN apt-get update && \
16 |     apt-get dist-upgrade -y && \
17 |     apt-get install -y --no-install-recommends \
18 |     libchromaprint1 libgles2 libgmp10 libgme0 libxkbcommon0  \
19 |     libbs2b0 libgsl27 libwavpack1 libxslt1.1 gstreamer1.0-libav  \
20 |     libavformat58 libwayland-server0 libwayland-client0 libharfbuzz-icu0 librtmp1  \
21 |     libtheora0 libxtst6 libmpg123-0 libxcomposite1 libgl1  \
22 |     libmjpegutils-2.1-0 python3-numpy libsrt1.4-openssl libmpeg2-4 libvulkan1  \
23 |     libxdamage1 libjpeg8 mjpegtools libpng16-16 wayland-protocols  \
24 |     libcap2 libofa0 udev libgstreamer-plugins-base1.0-dev libcups2  \
25 |     libgstreamer1.0-dev libopenexr25 libmfx1 libde265-0 libgirepository1.0-dev  \
26 |     libfdk-aac2 libavcodec58 git libunwind8 xdg-dbus-proxy  \
27 |     libtwolame0 mesa-utils libtag1v5 libaa1 libgles1  \
28 |     ffmpeg liborc-0.4-0 libgraphene-1.0-dev libwebpdemux2 libsoup2.4-1  \
29 |     build-essential libsm6 libglu1 libwebrtc-audio-processing1 liba52-0.7.4  \
30 |     libva2 libwayland-cursor0 libcurl3-gnutls libvisual-0.4-0 libbz2-1.0  \
31 |     libvpx7 libdv4 libatspi2.0-0 liblilv-0-0 gstreamer1.0-plugins-bad  \
32 |     gstreamer1.0-plugins-good gstreamer1.0-plugins-ugly libdvdnav4 libssl3 libgsm1  \
33 |     libwoff1 libwebp7 x264 libopenblas-dev libxrandr-dev  \
34 |     freeglut3-dev intel-media-va-driver-non-free ladspa-sdk gfortran libvo-aacenc0  \
35 |     python3-opencv libcaca0 python3-opengl libsbc1 libatk1.0-0  \
36 |     libsoundtouch1 libsndfile1 python3 libgudev-1.0-0 liblcms2-2  \
37 |     libzvbi0 libatk-bridge2.0-0 libass9 libgbm1 libglib2.0-0  \
38 |     libaom3 bubblewrap libdw1 libseccomp2 libepoxy0  \
39 |     libavutil56 glibc-tools libmodplug1 libshout3 libwebpmux3  \
40 |     libfaad2 iso-codes libgcrypt20 xvfb libspandsp2  \
41 |     libvorbis0a libfaac0 libmpcdec6 libopus0 libsrtp2-1  \
42 |     libx264-163 gstreamer1.0-rtsp python3-pip ca-certificates libva-wayland2  \
43 |     gcc libwildmidi2 libpango-1.0-0 libflite1 libdca0  \
44 |     libopenjp2-7 libzbar0 libspeex1 libkate1 pkg-config  \
45 |     libx264-dev libopencore-amrwb0 gstreamer1.0-tools libxv1 gstreamer1.0-plugins-base  \
46 |     libcairo2-dev python3-gst-1.0 wget cmake libwayland-egl1  \
47 |     libavfilter7 libegl1 libdvdread8 libvo-amrwbenc0 libogg0  \
48 |     librsvg2-2 libopencore-amrnb0 libx265-199 libatk-adaptor sudo  \
49 |     libmp3lame0 python3-dev libssl-dev \
50 |     && rm -rf /var/lib/apt/lists/*
51 | 
52 | # download and install libjpeg turbo
53 | RUN wget -qO libjpeg-turbo.deb \
54 |     https://deac-ams.dl.sourceforge.net/project/libjpeg-turbo/2.1.4/libjpeg-turbo-official_2.1.4_amd64.deb && \
55 |     dpkg -i libjpeg-turbo.deb && \
56 |     rm -f libjpeg-turbo.deb
57 | 
58 | # download and install aws plugins for gstreamer
59 | RUN wget -qO sh.rustup.rs https://sh.rustup.rs && \
60 |     bash sh.rustup.rs -q -y --profile default && \
61 |     . "$HOME/.cargo/env" && \
62 |     rm -f sh.rustup.rs && \
63 |     cargo install cargo-c && \
64 |     git clone -b gstreamer-1.21.1 https://gitlab.freedesktop.org/gstreamer/gst-plugins-rs.git && \
65 |     cd gst-plugins-rs && \
66 |     cargo cbuild -p gst-plugin-aws --libdir=/usr/lib/x86_64-linux-gnu && \
67 |     cargo cinstall -p gst-plugin-aws --libdir=/usr/lib/x86_64-linux-gnu && \
68 |     cd .. && rm -rf gst-plugins-rs
69 | 
70 | # create a user
71 | RUN mkdir -p /opt/ml/processing/output /opt/ml/processing/code
72 | RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
73 | RUN groupadd --gid 500 --non-unique ec2-user
74 | RUN adduser --uid 500 --disabled-password --gecos '' --ingroup ec2-user ec2-user
75 | RUN usermod -a -G sudo,video,ec2-user ec2-user
76 | ENV PATH="$PATH:/home/ec2-user/.local/bin"
77 | RUN chown -R ec2-user:ec2-user /opt/ml
78 | 
79 | USER ec2-user
80 | WORKDIR /opt/ml/processing/code
81 | # install some required python packages
82 | RUN pip3 install --upgrade pip
83 | RUN pip3 install pycairo PyGObject PyTurboJPEG boto3 Cython
84 | # torch is required for ByteTrack
85 | RUN pip3 install torch torchvision thop loguru scikit-learn lap cython_bbox
86 | 
87 | RUN echo "#!/bin/sh\n/usr/bin/xvfb-run -a \$@\n" > /home/ec2-user/entrypoint.sh && chmod +x /home/ec2-user/entrypoint.sh
88 | 
89 | 
90 | ENTRYPOINT [ "/home/ec2-user/entrypoint.sh"]
91 | 


--------------------------------------------------------------------------------
/tutorials/02_ObjectTrackingSageMakerGStreamer/libs/cvpipeline.py:
--------------------------------------------------------------------------------
 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 2 | # SPDX-License-Identifier: MIT-0
 3 | import gi
 4 | import threading
 5 | import numpy as np
 6 | gi.require_version('Gst', '1.0')
 7 | from gi.repository import Gst
 8 | 
 9 | class CVPipeline(threading.Thread):
10 |     '''Base class for a gstreamer pipeline'''
11 |     def __init__(self, pipeline):
12 |         threading.Thread.__init__(self)
13 |         self.running = False
14 |         self.gst_pipeline = pipeline
15 | 
16 |     def stop(self):
17 |         self.running = False
18 | 
19 |     def is_running(self):
20 |         return self.running
21 |     
22 |     def run(self):
23 |         '''Invoked as a thread to execute the gstreamer loop'''
24 |         self.running = True
25 | 
26 |         Gst.init(None)
27 |         self.pipeline = Gst.parse_launch(self.gst_pipeline)
28 |         self.pipeline.get_by_name('input').get_static_pad('sink').add_probe(
29 |             Gst.PadProbeType.BUFFER,
30 |             self.__on_frame_probe__
31 |         )
32 |         self.pipeline.set_state(Gst.State.PLAYING)
33 |         self.bus = self.pipeline.get_bus()
34 |         try:
35 |             while self.running:
36 |                 msg = self.bus.timed_pop_filtered(
37 |                     Gst.SECOND,
38 |                     Gst.MessageType.EOS | Gst.MessageType.ERROR
39 |                 )
40 |                 if msg:
41 |                     text = msg.get_structure().to_string() if msg.get_structure() else ''
42 |                     msg_type = Gst.message_type_get_name(msg.type)
43 |                     print(f'{msg.src.name}: [{msg_type}] {text}')
44 |                     self.stop()
45 |         finally:
46 |             self.pipeline.set_state(Gst.State.NULL)
47 | 
48 |     def __on_frame_probe__(self, pad, info):
49 |         '''Handler that reads a buffer from gstreamer and loads a numpy rgb frame'''
50 |         buf = info.get_buffer()
51 |         caps = pad.get_current_caps()
52 |         caps_structure = caps.get_structure(0)
53 |         height, width = caps_structure.get_value('height'), caps_structure.get_value('width')
54 |         pixel_bytes = 3
55 |         is_mapped, map_info = buf.map(Gst.MapFlags.READ)
56 |         if is_mapped:
57 |             try:
58 |                 image_array = np.ndarray(
59 |                     (height, width, pixel_bytes), dtype=np.uint8, buffer=map_info.data
60 |                 ).copy()
61 |                 self.process_frame(image_array, buf.pts)
62 |             finally:
63 |                 buf.unmap(map_info)
64 | 
65 |         return Gst.PadProbeReturn.OK
66 | 
67 |     def process_frame(self, frame, timestamp):
68 |         pass
69 | 


--------------------------------------------------------------------------------
/tutorials/02_ObjectTrackingSageMakerGStreamer/libs/smcvpipeline.py:
--------------------------------------------------------------------------------
  1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
  2 | # SPDX-License-Identifier: MIT-0
  3 | import io
  4 | import os
  5 | import time
  6 | import json
  7 | import boto3
  8 | import queue
  9 | import tracker
 10 | import threading
 11 | import cvpipeline
 12 | import numpy as np
 13 | from turbojpeg import TurboJPEG
 14 | 
 15 | class SageMakerCVPipeline(cvpipeline.CVPipeline):
 16 |     def __init__(self, pipeline, endpoint_name, region_name, max_cams_per_batch, output_dir, 
 17 |                  tile_size=(960,540), conf_thres=0.15, iou_thres=0.45, max_workers=5, 
 18 |                  preds_per_file=100, jpeg_quality=90, enable_tracking=False):
 19 |         super().__init__(pipeline)
 20 |         
 21 |         self.endpoint_name = endpoint_name
 22 |         self.jpeg = TurboJPEG()
 23 |         self.jpeg_quality = jpeg_quality
 24 |         self.frames = queue.Queue()
 25 |         self.sm_client = boto3.client('sagemaker-runtime', region_name=region_name)
 26 |         self.endpoint_name = endpoint_name
 27 |         self.output_dir = output_dir
 28 |         self.tile_size = tile_size
 29 |         self.params = json.dumps({
 30 |             "tile_width": tile_size[0],#960,
 31 |             "tile_height": tile_size[1],#540,
 32 |             "conf_thres": conf_thres, #0.15,
 33 |             "iou_thres": iou_thres, #0.45
 34 |         })
 35 |         self.cache = []
 36 |         self.workers = [threading.Thread(target=self.__worker__, args=(i,)) for i in range(max_workers)]
 37 |         self.trackers = [tracker.Tracker() for i in range(max_cams_per_batch)] if enable_tracking else None
 38 |         self.preds_per_file = preds_per_file
 39 |         self.cache_lock = threading.Lock()
 40 |         self.cache_counter = 0
 41 |         if not os.path.isdir(self.output_dir): os.mkdir(self.output_dir)
 42 | 
 43 |     def run(self):
 44 |         self.running = True
 45 |         # initialize workers
 46 |         for w in self.workers: w.start()
 47 |         # run gstreamer main loop
 48 |         super(SageMakerCVPipeline, self).run()
 49 |         # wait for all workers to finalize
 50 |         for w in self.workers: w.join()
 51 |         # dump to disk the last pending predictions
 52 |         self.dump_cache(True)
 53 | 
 54 |     def dump_cache(self, flush=False):
 55 |         '''Saves predictions to disk as compressed numpy filers'''
 56 |         if flush or len(self.cache) >= self.preds_per_file:
 57 |             dump = False
 58 |             self.cache_lock.acquire()
 59 |             # check if there are predictions to be flushed
 60 |             if len(self.cache) > 0:
 61 |                 cache,self.cache,pred_file_id,dump = self.cache,[],self.cache_counter,True
 62 |                 self.cache_counter += 1
 63 |             self.cache_lock.release()
 64 |             if dump:
 65 |                 # ok. there are predictions, save to a file
 66 |                 print(f'Dumping {len(cache)}... ')
 67 |                 np.savez(os.path.join(self.output_dir, f'pred_{pred_file_id:05d}.npz'), cache)
 68 | 
 69 |     def __worker__(self, worker_id):
 70 |         '''A worker will keep listening to a queue for frames to process'''
 71 |         while self.running:
 72 |             if self.frames.empty():
 73 |                 time.sleep(0.1)
 74 |             else:
 75 |                 # alrigth, there is a new frame
 76 |                 frame,timestamp = self.frames.get()
 77 |                 with io.BytesIO() as resp:
 78 |                     # invoke the endpoint and keep the predictions
 79 |                     resp.write(self.sm_client.invoke_endpoint(
 80 |                         EndpointName=self.endpoint_name,
 81 |                         Body=frame,
 82 |                         ContentType="image/jpeg",
 83 |                         Accept="application/x-npy",
 84 |                         CustomAttributes=self.params
 85 |                     )['Body'].read())
 86 |                     resp.seek(0)
 87 |                     # resp format: [cam_id, obj_id, obj_cls, conf, bbox(x1,y1,x2,y2) pose(x1,y1,conf1,...,x17,y17,conf17)]
 88 |                     preds = np.load(resp).astype(np.object)
 89 |                     data = [timestamp, preds, []]
 90 |                     if not self.trackers is None:
 91 |                         dets = []
 92 |                         for pred in preds:
 93 |                             cam_id,conf,bbox = pred[0],pred[3],pred[4:8]
 94 |                             dets.append(np.hstack((bbox, [conf])))
 95 |                         dets = np.array(dets).astype(np.object)
 96 |                         data[2].append(self.trackers[int(cam_id)].step(dets, self.tile_size))
 97 | 
 98 |                     self.cache_lock.acquire()
 99 |                     self.cache.append(data)
100 |                     self.cache_lock.release()
101 |                     self.dump_cache()
102 | 
103 |     def process_frame(self, frame, timestamp):
104 |         '''Concrete implementation of frame processing'''
105 |         # a mosaic will be encoded as jpeg outside gstreamer for max performance
106 |         frame = self.jpeg.encode(frame, quality=self.jpeg_quality)
107 |         self.frames.put((frame,timestamp))
108 | 


--------------------------------------------------------------------------------
/tutorials/02_ObjectTrackingSageMakerGStreamer/libs/tracker.py:
--------------------------------------------------------------------------------
 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 2 | # SPDX-License-Identifier: MIT-0
 3 | import cv2
 4 | import sys
 5 | import numpy as np
 6 | from yolox.tracker.byte_tracker import BYTETracker
 7 | 
 8 | # Helper class that emulates argparse
 9 | class AllMyFields:
10 |     def __init__(self, dictionary):
11 |         for k, v in dictionary.items():
12 |             setattr(self, k, v)
13 | 
14 | class Tracker(object):
15 |     def __init__(self, frame_rate=25, track_tresh=0.25, track_buffer=30, match_tresh=0.5, min_box_area=10):
16 |         self.args = AllMyFields({
17 |             'track_thresh': track_tresh,
18 |             'track_buffer': track_buffer,
19 |             'match_thresh': match_tresh,
20 |             'mot20': False,
21 |             'min_box_area': min_box_area
22 |         })
23 |         self.tracker = BYTETracker(self.args, frame_rate=frame_rate)
24 |         self.online_targets = None
25 | 
26 |     def render(self, frame, objects):
27 |         '''Render BBoxes & ID to an image'''
28 |         for obj_id,xyxy,score in objects:
29 |             x1,y1,x2,y2 = xyxy
30 |             cv2.rectangle(frame, (x1,y1), (x2,y2), (255,255,0), 3)
31 |             cv2.putText(frame, f'{obj_id}', (x1+50,y1+50), cv2.FONT_HERSHEY_SIMPLEX,
32 |                    2, (0, 0, 255), 2, cv2.LINE_AA)
33 |             
34 |     def step(self, detections, img_size=(960,540)):
35 |         '''Update the tracker based on predictions
36 |         Detections[ [x1,y1,x2,y2,conf] ]
37 |         '''
38 |         self.online_targets = self.tracker.update(detections, [img_size[1], img_size[0]], [img_size[1], img_size[0]])
39 |         results = []
40 |         for t in self.online_targets:
41 |             tlwh = t.tlwh
42 |             tid = t.track_id
43 |             vertical = tlwh[2] / tlwh[3] > 1.6
44 |             if tlwh[2] * tlwh[3] > self.args.min_box_area and not vertical:
45 |                 x1,y1,bw,bh = tlwh.astype(np.int32)
46 |                 xyxy = [x1,y1,x1+bw,y1+bh]
47 |                 # obj_id, bbox, conf
48 |                 results.append((tid, xyxy, t.score))
49 |         return results
50 | 


--------------------------------------------------------------------------------
/tutorials/03_QuestionAnsweringMachine/01_QuestionAnsweringWithT5SSM.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "f6cb03f7",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "# FAQ Bot - Q&A model, trained using pairs of questions and answers\n",
  9 |     "\n",
 10 |     "Fine tune a large language model with a list of question and answers. This approach os called Closed Book Q&A because the model doesn't require context and is capable of answering variations of the questions you provide in your dataset.\n",
 11 |     "\n",
 12 |     "This is an evolution of classic ChatBots because LLMs like T5 can disambiguate and generalize better than the old technologies we find in these ChatBots services.\n",
 13 |     "\n",
 14 |     "For that purpose you'll use a **[T5 SMALL SSM ~80MParams](https://huggingface.co/google/t5-small-ssm)** model, accelerated by a trn1 instance ([AWS Trainium](https://aws.amazon.com/machine-learning/trainium/)), running on [Amazon SageMaker](https://aws.amazon.com/sagemaker/).\n",
 15 |     "\n",
 16 |     "You can set the hyperperameter **--model_name** to change the model size. This solution works well with:  \n",
 17 |     "  - t5-small-ssm\n",
 18 |     "  - t5-large-ssm\n",
 19 |     "  \n",
 20 |     "If you need to fine tune **t5-3b-ssm, t5-11b-ssm or t5-xxl-ssm**, you need **FSDP**, which is out of the scope of this tutorial.\n",
 21 |     "\n",
 22 |     "You can see the results of the predictions at the end of this notebook. You'll notice the questions sent to the model are not in the training dataset. They are just variations of the questions used to fine tune the model.\n",
 23 |     "\n",
 24 |     "The dataset is the content of all **AWS FAQ** pages, downloaded from: https://aws.amazon.com/faqs/\n",
 25 |     "\n",
 26 |     "This notebook was tested with **Python3.8+**\n",
 27 |     "\n",
 28 |     ">**If you have never before done a SageMaker training job with Trn1, you'll need to do a service level request. This can take a few hours, best to make the request early so you don't have to wait.**\n",
 29 |     "\n",
 30 |     "You can edit this URL to go directly to the page to request the increase:\n",
 31 |     "\n",
 32 |     "`https://<region>.console.aws.amazon.com/servicequotas/home/services/sagemaker/quotas/L-79A1FE57`"
 33 |    ]
 34 |   },
 35 |   {
 36 |    "cell_type": "markdown",
 37 |    "id": "5cb78690",
 38 |    "metadata": {},
 39 |    "source": [
 40 |     "## 1) Install some dependencies\n",
 41 |     "You need a more recent version of **sagemaker** Python Library. After this install you'll need to restart the kernel."
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "code",
 46 |    "execution_count": null,
 47 |    "id": "103ae05a",
 48 |    "metadata": {
 49 |     "scrolled": true
 50 |    },
 51 |    "outputs": [],
 52 |    "source": [
 53 |     "# add --force-reinstall if it fails to resolve dependencies\n",
 54 |     "%pip install -U sagemaker"
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "code",
 59 |    "execution_count": null,
 60 |    "id": "d4a951ec",
 61 |    "metadata": {},
 62 |    "outputs": [],
 63 |    "source": [
 64 |     "import sagemaker\n",
 65 |     "print(sagemaker.__version__)\n",
 66 |     "if not sagemaker.__version__ >= \"2.146.0\": print(\"You need to upgrade or restart the kernel if you already upgraded\")\n",
 67 |     "\n",
 68 |     "sess = sagemaker.Session()\n",
 69 |     "role = sagemaker.get_execution_role()\n",
 70 |     "bucket = sess.default_bucket()\n",
 71 |     "region = sess.boto_region_name\n",
 72 |     "\n",
 73 |     "print(f\"sagemaker role arn: {role}\")\n",
 74 |     "print(f\"sagemaker bucket: {bucket}\")\n",
 75 |     "print(f\"sagemaker session region: {region}\")"
 76 |    ]
 77 |   },
 78 |   {
 79 |    "cell_type": "markdown",
 80 |    "id": "ea60dba6",
 81 |    "metadata": {},
 82 |    "source": [
 83 |     "## 2) Visualize and upload the dataset\n",
 84 |     "Take note of the S3 URI here if you get interrupted, no need to reupload later."
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "code",
 89 |    "execution_count": 5,
 90 |    "id": "855c1822",
 91 |    "metadata": {
 92 |     "scrolled": true
 93 |    },
 94 |    "outputs": [
 95 |     {
 96 |      "data": {
 97 |       "text/html": [
 98 |        "<div>\n",
 99 |        "<style scoped>\n",
100 |        "    .dataframe tbody tr th:only-of-type {\n",
101 |        "        vertical-align: middle;\n",
102 |        "    }\n",
103 |        "\n",
104 |        "    .dataframe tbody tr th {\n",
105 |        "        vertical-align: top;\n",
106 |        "    }\n",
107 |        "\n",
108 |        "    .dataframe thead th {\n",
109 |        "        text-align: right;\n",
110 |        "    }\n",
111 |        "</style>\n",
112 |        "<table border=\"1\" class=\"dataframe\">\n",
113 |        "  <thead>\n",
114 |        "    <tr style=\"text-align: right;\">\n",
115 |        "      <th></th>\n",
116 |        "      <th>service</th>\n",
117 |        "      <th>question</th>\n",
118 |        "      <th>answers</th>\n",
119 |        "    </tr>\n",
120 |        "  </thead>\n",
121 |        "  <tbody>\n",
122 |        "    <tr>\n",
123 |        "      <th>0</th>\n",
124 |        "      <td>/ec2/autoscaling/faqs/</td>\n",
125 |        "      <td>What is Amazon EC2 Auto Scaling?</td>\n",
126 |        "      <td>Amazon EC2 Auto Scaling is a fully managed ser...</td>\n",
127 |        "    </tr>\n",
128 |        "    <tr>\n",
129 |        "      <th>1</th>\n",
130 |        "      <td>/ec2/autoscaling/faqs/</td>\n",
131 |        "      <td>When should I use Amazon EC2 Auto Scaling vs. ...</td>\n",
132 |        "      <td>You should use AWS Auto Scaling to manage scal...</td>\n",
133 |        "    </tr>\n",
134 |        "    <tr>\n",
135 |        "      <th>2</th>\n",
136 |        "      <td>/ec2/autoscaling/faqs/</td>\n",
137 |        "      <td>How is Predictive Scaling Policy different fro...</td>\n",
138 |        "      <td>Predictive Scaling Policy brings the similar p...</td>\n",
139 |        "    </tr>\n",
140 |        "    <tr>\n",
141 |        "      <th>3</th>\n",
142 |        "      <td>/ec2/autoscaling/faqs/</td>\n",
143 |        "      <td>What are the benefits of using Amazon EC2 Auto...</td>\n",
144 |        "      <td>Amazon EC2 Auto Scaling helps to maintain your...</td>\n",
145 |        "    </tr>\n",
146 |        "    <tr>\n",
147 |        "      <th>4</th>\n",
148 |        "      <td>/ec2/autoscaling/faqs/</td>\n",
149 |        "      <td>What is fleet management and how is it differe...</td>\n",
150 |        "      <td>If your application runs on Amazon EC2 instanc...</td>\n",
151 |        "    </tr>\n",
152 |        "  </tbody>\n",
153 |        "</table>\n",
154 |        "</div>"
155 |       ],
156 |       "text/plain": [
157 |        "                  service                                           question  \\\n",
158 |        "0  /ec2/autoscaling/faqs/                   What is Amazon EC2 Auto Scaling?   \n",
159 |        "1  /ec2/autoscaling/faqs/  When should I use Amazon EC2 Auto Scaling vs. ...   \n",
160 |        "2  /ec2/autoscaling/faqs/  How is Predictive Scaling Policy different fro...   \n",
161 |        "3  /ec2/autoscaling/faqs/  What are the benefits of using Amazon EC2 Auto...   \n",
162 |        "4  /ec2/autoscaling/faqs/  What is fleet management and how is it differe...   \n",
163 |        "\n",
164 |        "                                             answers  \n",
165 |        "0  Amazon EC2 Auto Scaling is a fully managed ser...  \n",
166 |        "1  You should use AWS Auto Scaling to manage scal...  \n",
167 |        "2  Predictive Scaling Policy brings the similar p...  \n",
168 |        "3  Amazon EC2 Auto Scaling helps to maintain your...  \n",
169 |        "4  If your application runs on Amazon EC2 instanc...  "
170 |       ]
171 |      },
172 |      "execution_count": 5,
173 |      "metadata": {},
174 |      "output_type": "execute_result"
175 |     }
176 |    ],
177 |    "source": [
178 |     "import pandas as pd\n",
179 |     "df = pd.read_csv('train.csv.gz', compression='gzip', sep=';')\n",
180 |     "df.head()"
181 |    ]
182 |   },
183 |   {
184 |    "cell_type": "code",
185 |    "execution_count": null,
186 |    "id": "df8e1b39",
187 |    "metadata": {},
188 |    "outputs": [],
189 |    "source": [
190 |     "s3_uri = sess.upload_data(path='train.csv.gz', key_prefix='datasets/aws-faq/train')\n",
191 |     "print(s3_uri)"
192 |    ]
193 |   },
194 |   {
195 |    "cell_type": "markdown",
196 |    "id": "92dd77a3",
197 |    "metadata": {},
198 |    "source": [
199 |     "## 3) Prepare the train/inference script"
200 |    ]
201 |   },
202 |   {
203 |    "cell_type": "code",
204 |    "execution_count": null,
205 |    "id": "b28590cf",
206 |    "metadata": {},
207 |    "outputs": [],
208 |    "source": [
209 |     "import os\n",
210 |     "if not os.path.isdir('src'): os.mkdir('src')"
211 |    ]
212 |   },
213 |   {
214 |    "cell_type": "code",
215 |    "execution_count": null,
216 |    "id": "8c266667",
217 |    "metadata": {},
218 |    "outputs": [],
219 |    "source": [
220 |     "## requirements.txt will be used by SageMaker to install\n",
221 |     "## additional Python packages"
222 |    ]
223 |   },
224 |   {
225 |    "cell_type": "code",
226 |    "execution_count": null,
227 |    "id": "d3b08830",
228 |    "metadata": {},
229 |    "outputs": [],
230 |    "source": [
231 |     "%%writefile src/requirements.txt\n",
232 |     "torchvision\n",
233 |     "transformers==4.27.4"
234 |    ]
235 |   },
236 |   {
237 |    "cell_type": "code",
238 |    "execution_count": null,
239 |    "id": "27761ac8",
240 |    "metadata": {},
241 |    "outputs": [],
242 |    "source": [
243 |     "!pygmentize src/question_answering.py"
244 |    ]
245 |   },
246 |   {
247 |    "cell_type": "markdown",
248 |    "id": "81e8b7c0",
249 |    "metadata": {},
250 |    "source": [
251 |     "## 4) Kick-off our fine tuning job on Amazon SageMaker\n",
252 |     "We need to create a SageMaker Estimator first and then invoke **.fit**. \n",
253 |     "\n",
254 |     "Please, notice we're passing the parameter **checkpoint_s3_uri**. This is important because NeuronSDK will spend some time compiling the model before fine tuning it. The compiler saves the model to cache files and, with this param, the files will be uploaded to **S3**. So, next time we run a job, NeuronSDK can just load back the cache files and start training immediately.\n",
255 |     "\n",
256 |     "When training for the first time, the training job takes ~9 hours to process all 60 Epochs on an **trn1.32xlarge**.\n",
257 |     "\n",
258 |     "If you need to wait for a quota increase like I did. When you come back, run cell 2 to setup the sagemaker session and S3 uris, etc. Then run the below to get the process started."
259 |    ]
260 |   },
261 |   {
262 |    "cell_type": "code",
263 |    "execution_count": null,
264 |    "id": "3d7f8c86",
265 |    "metadata": {},
266 |    "outputs": [],
267 |    "source": [
268 |     "from sagemaker.pytorch import PyTorch\n",
269 |     "\n",
270 |     "# https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers\n",
271 |     "image_name=\"pytorch-training-neuronx\"\n",
272 |     "# We need SDK2.9+ to deal with T5s\n",
273 |     "image_tag=\"1.13.0-neuronx-py38-sdk2.9.1-ubuntu20.04\"\n",
274 |     "\n",
275 |     "estimator = PyTorch(\n",
276 |     "    entry_point=\"question_answering.py\", # Specify your train script\n",
277 |     "    source_dir=\"src\",\n",
278 |     "    role=role,\n",
279 |     "    sagemaker_session=sess,\n",
280 |     "    instance_count=1,\n",
281 |     "    instance_type='ml.trn1.32xlarge',    \n",
282 |     "    disable_profiler=True,\n",
283 |     "    output_path=f\"s3://{bucket}/output\",\n",
284 |     "    image_uri=f\"763104351884.dkr.ecr.{region}.amazonaws.com/{image_name}:{image_tag}\",\n",
285 |     "    \n",
286 |     "    # Parameters required to enable checkpointing\n",
287 |     "    # This is necessary for caching XLA HLO files and reduce training time next time    \n",
288 |     "    checkpoint_s3_uri=f\"s3://{bucket}/checkpoints\",\n",
289 |     "    volume_size = 512,\n",
290 |     "    distribution={\n",
291 |     "        \"torch_distributed\": {\n",
292 |     "            \"enabled\": True\n",
293 |     "        }\n",
294 |     "    },\n",
295 |     "    hyperparameters={\n",
296 |     "        \"model-name\": \"t5-small-ssm\",\n",
297 |     "        \"lr\": 5e-5,\n",
298 |     "        \"num-epochs\": 60\n",
299 |     "    },\n",
300 |     "    metric_definitions=[\n",
301 |     "        {'Name': 'train:loss', 'Regex': 'loss:(\\S+);'}\n",
302 |     "    ]\n",
303 |     ")\n",
304 |     "estimator.framework_version = '1.13.1' # workround when using image_uri"
305 |    ]
306 |   },
307 |   {
308 |    "cell_type": "code",
309 |    "execution_count": null,
310 |    "id": "1ab4e4ab",
311 |    "metadata": {
312 |     "scrolled": true
313 |    },
314 |    "outputs": [],
315 |    "source": [
316 |     "estimator.fit({\"train\": s3_uri})"
317 |    ]
318 |   },
319 |   {
320 |    "cell_type": "markdown",
321 |    "id": "e4f082b4",
322 |    "metadata": {},
323 |    "source": [
324 |     "## 5) Deploy our model to a SageMaker endpoint\n",
325 |     "Here, we're using a pre-defined HuggingFace model class+container to just load our fine tuned model on a CPU based instance: c6i.4xlarge (an Intel Xeon based machine).\n",
326 |     "\n",
327 |     ">If you're picking this up later uncomment line 4, fill in the path to your model artifacts, comment line 9 out, and uncomment line 10."
328 |    ]
329 |   },
330 |   {
331 |    "cell_type": "code",
332 |    "execution_count": null,
333 |    "id": "d90af272",
334 |    "metadata": {},
335 |    "outputs": [],
336 |    "source": [
337 |     "# uncomment and modify this if you're picking this back up later and your training was sucessful.\n",
338 |     "# you'll need to get the model s3 URI from sagemaker -> Training -> Training Jobs -> <Your job name> -> Output -> S3 model artifact\n",
339 |     "\n",
340 |     "# pre_trained_model = YOUR_S3_PATH\n",
341 |     "from sagemaker.huggingface.model import HuggingFaceModel\n",
342 |     "\n",
343 |     "# create Hugging Face Model Class\n",
344 |     "huggingface_model = HuggingFaceModel(\n",
345 |     "   model_data=estimator.model_data,       # path to your model and script\n",
346 |     "   # model_data=pre_trained_model,       # path to your model and script\n",
347 |     "   role=role,                             # iam role with permissions to create an Endpoint\n",
348 |     "   transformers_version=\"4.26.0\",         # transformers version used\n",
349 |     "   pytorch_version=\"1.13.1\",              # pytorch version used\n",
350 |     "   py_version='py39',                     # python version used\n",
351 |     "   sagemaker_session=sess,\n",
352 |     "   \n",
353 |     "   # for production it is important to define vpc_config and use a vpc_endpoint\n",
354 |     "   #vpc_config={\n",
355 |     "   #    'Subnets': ['subnet-A-REPLACE', 'subnet-B-REPLACE'],\n",
356 |     "   #    'SecurityGroupIds': ['sg-A-REPLACE', 'sg-B-REPLACE']\n",
357 |     "   #}    \n",
358 |     ")"
359 |    ]
360 |   },
361 |   {
362 |    "cell_type": "code",
363 |    "execution_count": null,
364 |    "id": "23f9c74f",
365 |    "metadata": {},
366 |    "outputs": [],
367 |    "source": [
368 |     "predictor = huggingface_model.deploy(\n",
369 |     "    initial_instance_count=1,\n",
370 |     "    instance_type=\"ml.c6i.4xlarge\",\n",
371 |     ")"
372 |    ]
373 |   },
374 |   {
375 |    "cell_type": "markdown",
376 |    "id": "a6f12446",
377 |    "metadata": {},
378 |    "source": [
379 |     "## 6) Run a quick test"
380 |    ]
381 |   },
382 |   {
383 |    "cell_type": "code",
384 |    "execution_count": 17,
385 |    "id": "8e393b0f",
386 |    "metadata": {},
387 |    "outputs": [
388 |     {
389 |      "name": "stdout",
390 |      "output_type": "stream",
391 |      "text": [
392 |       "Q: What is SageMaker?\n",
393 |       "A: SageMaker is a new ML (ML) service that makes it easy to build, train, and deploy notebook data inference, and deploy and tune models of data. SageMaker helps you build, train, and manage your ML models, and deploy model data to build your models up and down\n",
394 |       "\n",
395 |       "Q: What is EC2 AutoScaling?\n",
396 |       "A: Amazon-based EC2 instancess let you reduce your applications on multiple factors by allowing you to scale your application requirements and costs across multiple instances. Amazoning EC2 instances as a result of optimization in your applications, reducing the number of compute EC and the number of available instances to optimize your\n",
397 |       "\n",
398 |       "Q: What are the benefits of autoscaling?\n",
399 |       "A: You can use autoscaling to help you optimize the capacity of your applications by allowing you to take advantage of your application across multiple applications. Autoscaling allows you to easily scale the number of your applications across multiple devices, and optimize your fleet up or down to 40%. You can also use auto\n",
400 |       "\n",
401 |       "CPU times: user 5.16 ms, sys: 0 ns, total: 5.16 ms\n",
402 |       "Wall time: 1.66 s\n"
403 |      ]
404 |     }
405 |    ],
406 |    "source": [
407 |     "%%time\n",
408 |     "questions = [\n",
409 |     "    \"What is SageMaker?\",\n",
410 |     "    \"What is EC2 AutoScaling?\",\n",
411 |     "    \"What are the benefits of autoscaling?\"\n",
412 |     "]\n",
413 |     "resp = predictor.predict({'inputs': questions})\n",
414 |     "for q,a in zip(questions, resp['answers']):\n",
415 |     "    print(f\"Q: {q}\\nA: {a}\\n\")"
416 |    ]
417 |   },
418 |   {
419 |    "cell_type": "markdown",
420 |    "id": "10cd7c8a",
421 |    "metadata": {},
422 |    "source": [
423 |     "## 7) Clean up\n",
424 |     "This will delete the model and the endpoint you created"
425 |    ]
426 |   },
427 |   {
428 |    "cell_type": "code",
429 |    "execution_count": null,
430 |    "id": "b3f1afb2",
431 |    "metadata": {},
432 |    "outputs": [],
433 |    "source": [
434 |     "predictor.delete_model()\n",
435 |     "predictor.delete_endpoint()"
436 |    ]
437 |   }
438 |  ],
439 |  "metadata": {
440 |   "kernelspec": {
441 |    "display_name": "conda_pytorch_p39",
442 |    "language": "python",
443 |    "name": "conda_pytorch_p39"
444 |   },
445 |   "language_info": {
446 |    "codemirror_mode": {
447 |     "name": "ipython",
448 |     "version": 3
449 |    },
450 |    "file_extension": ".py",
451 |    "mimetype": "text/x-python",
452 |    "name": "python",
453 |    "nbconvert_exporter": "python",
454 |    "pygments_lexer": "ipython3",
455 |    "version": "3.9.15"
456 |   }
457 |  },
458 |  "nbformat": 4,
459 |  "nbformat_minor": 5
460 | }
461 | 


--------------------------------------------------------------------------------
/tutorials/03_QuestionAnsweringMachine/src/question_answering.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import csv
  3 | import glob
  4 | import time
  5 | import json
  6 | import gzip
  7 | import torch
  8 | import argparse
  9 | 
 10 | from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
 11 | from torch.utils.data import IterableDataset, DataLoader
 12 | 
 13 | max_sentence_len="<MAX_LEN>"
 14 | max_new_tokens="<MAX_NEW_TOKENS>"
 15 | 
 16 | class QnADataset(IterableDataset):
 17 |     '''Dataset that streams batches instead of loading the whole file into memory'''
 18 |     def __init__(self, files_path, max_sentence_len=256, shuffle=True, tokenizer=None):
 19 |         super(QnADataset).__init__()
 20 |         self.files = glob.glob(os.path.join(files_path, "*.csv.gz"))
 21 |         if len(self.files) == 0: raise Exception("No .csv files found")
 22 |         print(f"{len(self.files)} csv files found")
 23 |         self.reader = None
 24 |         self.shuffle = shuffle
 25 |         self.tokenizer = tokenizer
 26 |         self.max_sentence_len = max_sentence_len
 27 | 
 28 |     def batch_generator(self):
 29 |         for file_path in self.files:
 30 |             with gzip.open(file_path, 'rt') as csvfile:
 31 |                 data = csv.reader(csvfile, delimiter = ";")
 32 |                 next(data) # skip header
 33 |                 for i,row in enumerate(data):
 34 |                     e = self.tokenizer(row[1], max_length=self.max_sentence_len, padding='max_length', truncation=True, return_tensors="pt")
 35 |                     e['labels'] = self.tokenizer(row[2], max_length=self.max_sentence_len, padding='max_length', truncation=True, return_tensors="pt").input_ids
 36 |                     yield i,e
 37 |     def __iter__(self):
 38 |         return self.batch_generator()
 39 | 
 40 | def collate_fn(data):
 41 |     # rebuild all samples of a given batch into a dictionary HF way
 42 |     batch = {}
 43 |     for j,sample in data:
 44 |         for k,v in sample.items():
 45 |             if batch.get(k) is None: batch[k] = []
 46 |             batch[k].append(torch.LongTensor(v))
 47 |     batch = {k:torch.vstack(batch[k]) for k in batch.keys()}
 48 |     return batch
 49 | 
 50 | def train(args, world_size, device):
 51 |     print("Starting training job")
 52 |     os.makedirs(args.checkpoints_path, exist_ok=True)
 53 | 
 54 |     model = AutoModelForSeq2SeqLM.from_pretrained(f"google/{args.model_name}")
 55 |     model.to(device)
 56 | 
 57 |     tokenizer = AutoTokenizer.from_pretrained(f"google/{args.model_name}")
 58 |     optimizer = AdamW(model.parameters(), lr=args.lr * xm.xrt_world_size())
 59 | 
 60 |     train_dataset = QnADataset(args.train, args.max_sentence_len, True, tokenizer)
 61 |     train_dataloader = DataLoader(train_dataset, collate_fn=collate_fn, batch_size=args.batch_size)
 62 |     train_dataloader = pl.MpDeviceLoader(train_dataloader, device)
 63 | 
 64 |     best_path = os.path.join(args.checkpoints_path, args.model_name, 'best.pt')
 65 |     best_loss = float("inf")
 66 |     for epoch in range(args.num_epochs):
 67 |         model.train()
 68 |         epoch_loss = 0.0
 69 |         num_batches = 0
 70 |         epoch_time = time.time()
 71 |         for step, batch in enumerate(train_dataloader):
 72 |             outputs = model(**batch)
 73 |             optimizer.zero_grad()
 74 |             loss = outputs.loss
 75 |             loss.backward()
 76 |             epoch_loss += outputs.logits.shape[0] * loss.detach().to('cpu')
 77 |             num_batches += 1
 78 | 
 79 |             # gather gradient updates from all cores and apply them
 80 |             xm.optimizer_step(optimizer)
 81 |         elapsed = time.time()-epoch_time
 82 |         epoch_loss /= num_batches*args.batch_size
 83 |         xm.master_print(f"epoch:{epoch}; elapsed_time(sec):{elapsed:0.2f}; loss:{epoch_loss};")
 84 |         if epoch_loss < best_loss:
 85 |             best_loss = epoch_loss
 86 |             xm.save({'state_dict': model.state_dict(), 'loss': best_loss}, best_path)
 87 | 
 88 |     best_model = torch.load(best_path)
 89 |     best_loss = best_model["loss"]
 90 |     print(f'Saving best model. Loss: {best_loss}')
 91 |     model.load_state_dict(best_model['state_dict'])
 92 |     model.to('cpu')
 93 |     model.eval()
 94 |     model.save_pretrained(args.model_path)
 95 |     tokenizer.save_pretrained(args.model_path)
 96 | 
 97 | def input_fn(request_body, request_content_type):
 98 |     if request_content_type == "application/json":
 99 |         inputs = json.loads(request_body)
100 |         return inputs['inputs']
101 | 
102 |     raise Exception(f"Unsupported content type: {request_content_type}")
103 | 
104 | def output_fn(prediction, content_type):
105 |     if content_type == "application/json":
106 |         return json.dumps({'answer': prediction})
107 |     raise Exception(f"Unsupported accept: {content_type}")
108 | 
109 | 
110 | def model_fn(model_dir):
111 |     model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
112 |     model.eval()
113 |     tokenizer = AutoTokenizer.from_pretrained(model_dir)
114 | 
115 |     return model,tokenizer
116 | 
117 | def predict_fn(input_object, model_tokenizer):
118 |     global max_sentence_len,max_new_tokens
119 |     model,tokenizer = model_tokenizer
120 | 
121 |     input_ids = tokenizer(input_object, max_length=max_sentence_len, padding='max_length', truncation=True, return_tensors="pt").input_ids
122 |     gen_output = model.generate(input_ids, max_new_tokens=max_new_tokens)
123 |     return [tokenizer.decode(o, skip_special_tokens=True) for o in gen_output]
124 | 
125 | if __name__=='__main__':
126 |     parser = argparse.ArgumentParser(
127 |                 prog = 'Train script for Trainium',
128 |                 description = 'Hyperparameters for the training process')
129 | 
130 |     # t5-xxl-ssm" # requires split ~46GB
131 |     parser.add_argument('--num-epochs', type=int, help="Number of epochs", default=2)
132 |     parser.add_argument('--batch-size', type=int, help="Batch size", default=4)
133 |     parser.add_argument('--max-sentence-len', type=int, help="Maximum sentence length", default=128)
134 |     parser.add_argument('--max-new-tokens', type=int, help="Maximum number of generated tokens", default=64)
135 |     parser.add_argument('--model-name', type=str, help="Name of the model", default="t5-large-ssm")
136 |     parser.add_argument('--lr', type=float, help="Learning rate", default=5e-5)
137 | 
138 |     parser.add_argument('--model-path', type=str, help="Path where we'll save the model", default=os.environ["SM_MODEL_DIR"])
139 |     parser.add_argument('--checkpoints-path', type=str, help="Path where we'll save the best model and cache", default='/opt/ml/checkpoints')
140 |     parser.add_argument('--train', type=str, help="Path to train data", default=os.environ["SM_CHANNEL_TRAIN"])
141 | 
142 |     args = parser.parse_args()
143 |     print(args)
144 |     import torch_xla.core.xla_model as xm
145 |     import torch_xla.distributed.xla_backend
146 |     import torch_xla.test.test_utils as test_utils
147 |     import torch_xla.distributed.parallel_loader as pl
148 |     
149 |     from torch.optim import AdamW    
150 | 
151 |     cache_dir = os.path.join(args.checkpoints_path, args.model_name)
152 |     os.environ['TOKENIZERS_PARALLELISM'] = 'false'
153 |     os.environ['NEURON_CC_FLAGS']=f"--cache_dir={cache_dir} --retry_failed_compilation"
154 |     os.environ['XLA_USE_BF16'] = '1'
155 |     
156 |     device = 'xla'
157 |     # Initialize XLA process group for torchrun
158 |     torch.distributed.init_process_group(device)
159 |     world_size = xm.xrt_world_size()
160 |     
161 |     print(f"Device: {device} World size: {world_size}")
162 |     train(args, world_size, device)
163 | 
164 |     # define the max_seq len
165 |     with open(__file__, "r") as f:
166 |         data = f.read()
167 |         data = data.replace("\"<MAX_LEN>\"", f"{args.max_sentence_len}")
168 |         data = data.replace("\"<MAX_NEW_TOKENS>\"", f"{args.max_new_tokens}")
169 | 
170 |     code_path = os.path.join(args.model_path, 'code')
171 |     if not os.path.isdir(code_path): os.makedirs(code_path, exist_ok=True)
172 |     # save a copy of the inference file to the correct dir
173 |     with open(os.path.join(code_path, 'inference.py'), "w") as f:
174 |         f.write(data)  
175 | 


--------------------------------------------------------------------------------
/tutorials/03_QuestionAnsweringMachine/train.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/ml-specialized-hardware/d73aadcdd1b966d23e5191882f707c0cc01cbe23/tutorials/03_QuestionAnsweringMachine/train.csv.gz


--------------------------------------------------------------------------------
/tutorials/07_DeployToInferentiaWithTGI/inf2-tgi-demo.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "id": "0ea34d80-1592-48b0-905b-845f29437577",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     "!pip install -U sagemaker==2.232.2"
 11 |    ]
 12 |   },
 13 |   {
 14 |    "cell_type": "code",
 15 |    "execution_count": 1,
 16 |    "id": "9604ec03-8aeb-4c31-a688-62163172c277",
 17 |    "metadata": {},
 18 |    "outputs": [
 19 |     {
 20 |      "name": "stdout",
 21 |      "output_type": "stream",
 22 |      "text": [
 23 |       "sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml\n",
 24 |       "sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml\n"
 25 |      ]
 26 |     }
 27 |    ],
 28 |    "source": [
 29 |     "import sagemaker\n",
 30 |     "\n",
 31 |     "sess = sagemaker.Session()\n",
 32 |     "session_bucket = sess.default_bucket()\n",
 33 |     "role = sagemaker.get_execution_role()"
 34 |    ]
 35 |   },
 36 |   {
 37 |    "cell_type": "code",
 38 |    "execution_count": 39,
 39 |    "id": "fe07bddd-7ee0-40e1-af62-ac0c0cf530bf",
 40 |    "metadata": {},
 41 |    "outputs": [],
 42 |    "source": [
 43 |     "# Import the necessary libraries for using Hugging Face models and SageMaker\n",
 44 |     "from sagemaker.huggingface import HuggingFaceModel\n",
 45 |     "\n",
 46 |     "# Define the instance type that will be used for inference\n",
 47 |     "# ml.inf2.24xlarge is based on AWS Inferentia2 hardware, optimized for high-performance machine learning inference\n",
 48 |     "instance_type = \"ml.inf2.24xlarge\"\n",
 49 |     "\n",
 50 |     "# Set the health check timeout and volume size for the SageMaker model endpoint\n",
 51 |     "health_check_timeout = 2400  # The maximum time (in seconds) SageMaker waits for the model to be ready\n",
 52 |     "volume_size = 128  # Storage size in GB allocated to the model\n",
 53 |     "\n",
 54 |     "# Define the environment configuration for the Hugging Face model\n",
 55 |     "config = {\n",
 56 |     "    \"HF_MODEL_ID\": \"meta-llama/Meta-Llama-3.1-8B\",  # Hugging Face model ID\n",
 57 |     "    \"HF_NUM_CORES\": \"8\",  # Number of Neuron cores to use for inference\n",
 58 |     "    \"HF_AUTO_CAST_TYPE\": \"bf16\",  # Enable automatic casting to bf16 (half precision for faster inference)\n",
 59 |     "    \"MAX_BATCH_SIZE\": \"4\",  # Maximum batch size to process in one forward pass\n",
 60 |     "    \"MAX_INPUT_LENGTH\": \"4095\",  # Maximum input sequence length (tokens) allowed for inference\n",
 61 |     "    \"MAX_TOTAL_TOKENS\": \"4096\",  # Maximum total number of tokens (input + output)\n",
 62 |     "    \"HF_TOKEN\": \"<put your HF token there>\"  # Token to authenticate with Hugging Face Hub (ensure to keep this secure)\n",
 63 |     "}\n",
 64 |     "\n",
 65 |     "# Set the URI for the Hugging Face TGI (Text Generation Inference) image\n",
 66 |     "# This image is designed for optimized inference using AWS Neuron SDK (for Inferentia)\n",
 67 |     "tgi_image = \"763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.25-neuronx-py310-ubuntu22.04\"\n",
 68 |     "\n",
 69 |     "# Create the HuggingFaceModel object with the specified role, image, and environment configuration\n",
 70 |     "model = HuggingFaceModel(\n",
 71 |     "  role=role,  # IAM role that grants SageMaker permissions\n",
 72 |     "  image_uri=tgi_image,  # URI for the Hugging Face inference image\n",
 73 |     "  env=config  # Pass the environment variables defined in the config\n",
 74 |     ")\n",
 75 |     "\n",
 76 |     "# In this case, we are deploying a precompiled model, stored at https://huggingface.co/aws-neuron/optimum-neuron-cache \n",
 77 |     "# If the model you need to deploy the model that is not precompiled,  you can export your own neuron model\n",
 78 |     "# as explained in https://huggingface.co/docs/optimum-neuron/main/en/guides/export_model#exporting-neuron-models-using-neuronx-tgi\n",
 79 |     "\n",
 80 |     "# Mark the model as precompiled\n",
 81 |     "model._is_compiled_model = True\n"
 82 |    ]
 83 |   },
 84 |   {
 85 |    "cell_type": "code",
 86 |    "execution_count": 19,
 87 |    "id": "62f5ed8d-77e6-45a8-8ba4-c35566631a5d",
 88 |    "metadata": {},
 89 |    "outputs": [
 90 |     {
 91 |      "name": "stdout",
 92 |      "output_type": "stream",
 93 |      "text": [
 94 |       "------------------------!"
 95 |      ]
 96 |     }
 97 |    ],
 98 |    "source": [
 99 |     "predictor = model.deploy(\n",
100 |     "  initial_instance_count=1,\n",
101 |     "  instance_type=instance_type,\n",
102 |     "  container_startup_health_check_timeout=health_check_timeout,\n",
103 |     "  volume_size=volume_size\n",
104 |     ")"
105 |    ]
106 |   },
107 |   {
108 |    "cell_type": "code",
109 |    "execution_count": 38,
110 |    "id": "0a8ddbde-4dc7-4b85-bd01-1311b78987b7",
111 |    "metadata": {},
112 |    "outputs": [
113 |     {
114 |      "data": {
115 |       "text/plain": [
116 |        "[{'generated_text': 'What are the pros and cons of different energy sources? Is there a link between electricity usage and climate change? How can we tackle energy poverty, the issue of clean air at home, or the challenge of providing electricity access in refugee camps? Why should we care about these issues? And how can we better communicate these issues to diverse audiences?\\nThese are key issues for the energy sector – both at home and abroad. This degree will equip you to address them from the perspective of economics, innovation and policy – and prepare you for an exciting career.\\nOur innovative'}]"
117 |       ]
118 |      },
119 |      "execution_count": 38,
120 |      "metadata": {},
121 |      "output_type": "execute_result"
122 |     }
123 |    ],
124 |    "source": [
125 |     "data = {\n",
126 |     "    \"inputs\": \"What are the pros and cons of different energy sources?\",\n",
127 |     "    \"temperature\": 0.7,\n",
128 |     "    \"max_tokens\": 100,\n",
129 |     "    \"top_p\": 0.9,\n",
130 |     "    \"n\": 1,\n",
131 |     "}\n",
132 |     "\n",
133 |     "predictor.predict(data)"
134 |    ]
135 |   },
136 |   {
137 |    "cell_type": "code",
138 |    "execution_count": 40,
139 |    "id": "38015e6b-82be-4e9a-bc1e-b52e95b21937",
140 |    "metadata": {},
141 |    "outputs": [],
142 |    "source": [
143 |     "#clean-up\n",
144 |     "\n",
145 |     "predictor.delete_endpoint()"
146 |    ]
147 |   },
148 |   {
149 |    "cell_type": "code",
150 |    "execution_count": null,
151 |    "id": "7a2bb14b-a5a2-46c7-80c9-bbc4d301d20b",
152 |    "metadata": {},
153 |    "outputs": [],
154 |    "source": []
155 |   }
156 |  ],
157 |  "metadata": {
158 |   "kernelspec": {
159 |    "display_name": "Python 3 (ipykernel)",
160 |    "language": "python",
161 |    "name": "python3"
162 |   },
163 |   "language_info": {
164 |    "codemirror_mode": {
165 |     "name": "ipython",
166 |     "version": 3
167 |    },
168 |    "file_extension": ".py",
169 |    "mimetype": "text/x-python",
170 |    "name": "python",
171 |    "nbconvert_exporter": "python",
172 |    "pygments_lexer": "ipython3",
173 |    "version": "3.10.14"
174 |   }
175 |  },
176 |  "nbformat": 4,
177 |  "nbformat_minor": 5
178 | }
179 | 


--------------------------------------------------------------------------------
/tutorials/08_TextClassificationWithNaturalLanguageInference/NLI_with_BART_inf2.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# 0. Import libraries / Setup"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "This notebook was tested with the neuron **sdk 2.21.1** (in Python 3.10.12).\n",
 15 |     "It requires the following packages:\n",
 16 |     "```\n",
 17 |     "torch==2.5.1\n",
 18 |     "torch-neuronx==2.5.1.2.4.0\n",
 19 |     "torch-xla==2.5.1\n",
 20 |     "torchvision==0.20.1\n",
 21 |     "libneuronxla==2.1.714.0\n",
 22 |     "neuronx-cc==2.16.372.0+4a9b2326\n",
 23 |     "```\n",
 24 |     "Normally those should be already installed when you setup the system for said sdk version.\n",
 25 |     "Then you need to install the following:\n",
 26 |     "```\n",
 27 |     "huggingface-hub==0.28.1\n",
 28 |     "transformers==4.48.2\n",
 29 |     "```"
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "code",
 34 |    "execution_count": null,
 35 |    "metadata": {
 36 |     "editable": true,
 37 |     "scrolled": true,
 38 |     "slideshow": {
 39 |      "slide_type": ""
 40 |     },
 41 |     "tags": []
 42 |    },
 43 |    "outputs": [],
 44 |    "source": [
 45 |     "import sys\n",
 46 |     "\n",
 47 |     "!{sys.executable} -m pip install transformers==4.48.2 huggingface-hub==0.28.1"
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "code",
 52 |    "execution_count": null,
 53 |    "metadata": {},
 54 |    "outputs": [],
 55 |    "source": [
 56 |     "import transformers\n",
 57 |     "import torch_neuronx\n",
 58 |     "import os\n",
 59 |     "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\""
 60 |    ]
 61 |   },
 62 |   {
 63 |    "cell_type": "markdown",
 64 |    "metadata": {},
 65 |    "source": [
 66 |     "# 1. Load model pretrained on MNLI"
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "code",
 71 |    "execution_count": null,
 72 |    "metadata": {},
 73 |    "outputs": [],
 74 |    "source": [
 75 |     "from transformers import BartForSequenceClassification, BartTokenizer\n",
 76 |     "tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli', export=True)\n",
 77 |     "model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli', export=True)\n",
 78 |     "model_cpu = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')\n",
 79 |     "model_dir = \"Bart\"\n",
 80 |     "os.makedirs(model_dir, exist_ok=True)"
 81 |    ]
 82 |   },
 83 |   {
 84 |    "cell_type": "markdown",
 85 |    "metadata": {},
 86 |    "source": [
 87 |     "## 1.1 Test loaded model"
 88 |    ]
 89 |   },
 90 |   {
 91 |    "cell_type": "code",
 92 |    "execution_count": null,
 93 |    "metadata": {},
 94 |    "outputs": [],
 95 |    "source": [
 96 |     "# pose sequence as a NLI premise and label (politics) as a hypothesis\n",
 97 |     "premise = 'What is your favorite team, Madrid or Barca?'\n",
 98 |     "hypothesis = 'This text is about sports.'\n",
 99 |     "max_length = 128\n",
100 |     "\n",
101 |     "# run through model pre-trained on MNLI\n",
102 |     "encoded_input = tokenizer.encode_plus(premise, hypothesis, return_tensors='pt', truncation='only_first', padding=\"max_length\", max_length=max_length)\n",
103 |     "logits = model(encoded_input[\"input_ids\"], encoded_input[\"attention_mask\"], use_cache=False)[0]\n",
104 |     "\n",
105 |     "# we throw away \"neutral\" (dim 1) and take the probability of\n",
106 |     "# \"entailment\" (2) as the probability of the label being true \n",
107 |     "entail_contradiction_logits = logits[:,[0,2]]\n",
108 |     "probs = entail_contradiction_logits.softmax(dim=1)\n",
109 |     "true_prob = probs[:,1].item() * 100\n",
110 |     "print(f'Probability that the label is true: {true_prob:0.2f}%')"
111 |    ]
112 |   },
113 |   {
114 |    "cell_type": "markdown",
115 |    "metadata": {},
116 |    "source": [
117 |     "## 1.2 Test tracing the model as it comes"
118 |    ]
119 |   },
120 |   {
121 |    "cell_type": "code",
122 |    "execution_count": null,
123 |    "metadata": {
124 |     "scrolled": true
125 |    },
126 |    "outputs": [],
127 |    "source": [
128 |     "neuron_encoder = torch_neuronx.trace(\n",
129 |     "        model, \n",
130 |     "        encoded_input[\"input_ids\"],\n",
131 |     "        compiler_args='--target inf2 --model-type transformer --auto-cast all',\n",
132 |     "        compiler_workdir='./enc_dir')"
133 |    ]
134 |   },
135 |   {
136 |    "cell_type": "markdown",
137 |    "metadata": {},
138 |    "source": [
139 |     "The step above fails because between the encoder and decoder, the arguments are passes as a dictionary with tuples as values. The compiler doesn't work well with this setup, so the idea is to split the model in two parts, encoder and decoder compile them independently and then put them back into the original model structure.\n",
140 |     "\n",
141 |     "Given this model is around 400M params (1.5GB), it fits into just 1 core when quantized to bf16. After that, both encoder and decoder will be accelerated on inf2."
142 |    ]
143 |   },
144 |   {
145 |    "cell_type": "markdown",
146 |    "metadata": {},
147 |    "source": [
148 |     "# 2. Prepare model for compilation"
149 |    ]
150 |   },
151 |   {
152 |    "cell_type": "code",
153 |    "execution_count": null,
154 |    "metadata": {},
155 |    "outputs": [],
156 |    "source": [
157 |     "dim_enc=model.config.max_position_embeddings\n",
158 |     "dim_dec=model.config.d_model\n",
159 |     "print(f'Dim enc: {dim_enc}; Dim dec: {dim_dec}')\n",
160 |     "max_dec_len = 1024"
161 |    ]
162 |   },
163 |   {
164 |    "cell_type": "code",
165 |    "execution_count": null,
166 |    "metadata": {},
167 |    "outputs": [],
168 |    "source": []
169 |   },
170 |   {
171 |    "cell_type": "code",
172 |    "execution_count": null,
173 |    "metadata": {},
174 |    "outputs": [],
175 |    "source": [
176 |     "import torch\n",
177 |     "import torch.nn.functional as F\n",
178 |     "from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPastAndCrossAttentions\n",
179 |     "\n",
180 |     "# Define one function for the encoder part\n",
181 |     "def enc_f(self, input_ids, attention_mask, **kwargs):\n",
182 |     "    if hasattr(self, 'forward_neuron'):\n",
183 |     "        out = self.forward_neuron(input_ids, attention_mask)\n",
184 |     "    else:\n",
185 |     "        out = self.forward_(input_ids, attention_mask=attention_mask, return_dict=True)\n",
186 |     "    return BaseModelOutput(**out)\n",
187 |     "\n",
188 |     "\n",
189 |     "# Define one function for the decoder part\n",
190 |     "def dec_f(self, input_ids, encoder_hidden_states, encoder_attention_mask, **kwargs):    \n",
191 |     "    out = None\n",
192 |     "    \n",
193 |     "    if input_ids.shape[1] > self.max_length:\n",
194 |     "        raise Exception(f\"The decoded sequence is not supported. Max: {self.max_length}\")\n",
195 |     "\n",
196 |     "    if hasattr(self, 'forward_neuron'):\n",
197 |     "        out = self.forward_neuron(input_ids,\n",
198 |     "                                  encoder_hidden_states,\n",
199 |     "                                  encoder_attention_mask)\n",
200 |     "    else:\n",
201 |     "        out = self.forward_(input_ids=input_ids,\n",
202 |     "                            encoder_hidden_states=encoder_hidden_states,\n",
203 |     "                            encoder_attention_mask=encoder_attention_mask,\n",
204 |     "                            return_dict=True,\n",
205 |     "                            use_cache=False,\n",
206 |     "                            output_attentions=False)\n",
207 |     "    \n",
208 |     "    # Ensure the output is compatible with BaseModelOutputWithPastAndCrossAttentions\n",
209 |     "    if 'cross_attentions' not in out:\n",
210 |     "        out['cross_attentions'] = None\n",
211 |     "    if 'hidden_states' not in out:\n",
212 |     "        out['hidden_states'] = None\n",
213 |     "    if 'attentions' not in out:\n",
214 |     "        out['attentions'] = None\n",
215 |     "    \n",
216 |     "    return BaseModelOutputWithPastAndCrossAttentions(**out)"
217 |    ]
218 |   },
219 |   {
220 |    "cell_type": "code",
221 |    "execution_count": null,
222 |    "metadata": {},
223 |    "outputs": [],
224 |    "source": [
225 |     "import types\n",
226 |     "\n",
227 |     "# Backup the original forward methods\n",
228 |     "if not hasattr(model.model.encoder, 'forward_'): \n",
229 |     "    model.model.encoder.forward_ = model.model.encoder.forward\n",
230 |     "if not hasattr(model.model.decoder, 'forward_'): \n",
231 |     "    model.model.decoder.forward_ = model.model.decoder.forward\n",
232 |     "\n",
233 |     "# Replace the forward methods with the custom ones\n",
234 |     "model.model.encoder.forward = types.MethodType(enc_f, model.model.encoder)\n",
235 |     "model.model.decoder.forward = types.MethodType(dec_f, model.model.decoder)\n",
236 |     "\n",
237 |     "# Set the max_length attribute for the decoder\n",
238 |     "model.model.decoder.max_length = max_dec_len  # or any other appropriate value"
239 |    ]
240 |   },
241 |   {
242 |    "cell_type": "code",
243 |    "execution_count": null,
244 |    "metadata": {},
245 |    "outputs": [],
246 |    "source": [
247 |     "# Run only the encoder to prepare the sample input for the decoder\n",
248 |     "encoder_inputs = encoded_input[\"input_ids\"], encoded_input[\"attention_mask\"]\n",
249 |     "encoder_outputs = model.model.encoder(encoded_input[\"input_ids\"], encoded_input[\"attention_mask\"])"
250 |    ]
251 |   },
252 |   {
253 |    "cell_type": "markdown",
254 |    "metadata": {},
255 |    "source": [
256 |     "## 2.1 Trace Encoder"
257 |    ]
258 |   },
259 |   {
260 |    "cell_type": "code",
261 |    "execution_count": null,
262 |    "metadata": {},
263 |    "outputs": [],
264 |    "source": [
265 |     "import os\n",
266 |     "import torch\n",
267 |     "\n",
268 |     "model_filename=f\"{model_dir}/BART-large-nli-encoder.pt\"\n",
269 |     "\n",
270 |     "if not os.path.isfile(model_filename):\n",
271 |     "    if hasattr(model.model.encoder, 'forward_neuron'): del model.model.encoder.forward_neuron\n",
272 |     "    neuron_encoder = torch_neuronx.trace(\n",
273 |     "        model.model.encoder, \n",
274 |     "        encoder_inputs,\n",
275 |     "        compiler_args='--target inf2 --model-type transformer --auto-cast all',\n",
276 |     "        compiler_workdir='./enc_dir')\n",
277 |     "    # neuron_encoder_dynamic_batch = torch_neuronx.dynamic_batch(neuron_encoder)\n",
278 |     "    neuron_encoder.save(model_filename)\n",
279 |     "    model.model.encoder.forward_neuron = neuron_encoder\n",
280 |     "else:\n",
281 |     "    model.model.encoder.forward_neuron = torch.jit.load(model_filename)\n",
282 |     "\n"
283 |    ]
284 |   },
285 |   {
286 |    "cell_type": "markdown",
287 |    "metadata": {},
288 |    "source": [
289 |     "## 2.2 Trace Decoder"
290 |    ]
291 |   },
292 |   {
293 |    "cell_type": "code",
294 |    "execution_count": null,
295 |    "metadata": {},
296 |    "outputs": [],
297 |    "source": [
298 |     "model_filename=f\"{model_dir}/BART-large-nli-decoder.pt\"\n",
299 |     "\n",
300 |     "if not os.path.isfile(model_filename):\n",
301 |     "    inp = encoded_input[\"input_ids\"], encoder_outputs[0], encoded_input[\"attention_mask\"]\n",
302 |     "    if hasattr(model.model.decoder, 'forward_neuron'): del model.model.decoder.forward_neuron\n",
303 |     "    neuron_decoder = torch_neuronx.trace(\n",
304 |     "        model.model.decoder,\n",
305 |     "        inp,\n",
306 |     "        compiler_args='--target inf2 --model-type transformer --auto-cast all',\n",
307 |     "        compiler_workdir='./dec_dir')\n",
308 |     "    # neuron_decoder_dynamic_batch = torch_neuronx.dynamic_batch(neuron_decoder)\n",
309 |     "    neuron_decoder.save(model_filename)\n",
310 |     "    model.model.decoder.forward_neuron = neuron_decoder\n",
311 |     "else:\n",
312 |     "    model.model.decoder.forward_neuron = torch.jit.load(model_filename)"
313 |    ]
314 |   },
315 |   {
316 |    "cell_type": "markdown",
317 |    "metadata": {
318 |     "scrolled": true
319 |    },
320 |    "source": [
321 |     "# 3. Test"
322 |    ]
323 |   },
324 |   {
325 |    "cell_type": "code",
326 |    "execution_count": null,
327 |    "metadata": {
328 |     "scrolled": true
329 |    },
330 |    "outputs": [],
331 |    "source": [
332 |     "# pass sequence as a NLI premise and label (politics) as a hypothesis\n",
333 |     "premise = 'how do you like the potatoes?'\n",
334 |     "hypothesis = 'This text is about cooking.'\n",
335 |     "\n",
336 |     "# run through model pre-trained on MNLI\n",
337 |     "max_length=128\n",
338 |     "x = tokenizer.encode_plus(premise, hypothesis, return_tensors='pt', truncation='only_first', padding=\"max_length\", max_length=max_length, return_attention_mask=True)\n",
339 |     "y = model(x[\"input_ids\"],x[\"attention_mask\"])\n",
340 |     "logits = y[0]\n",
341 |     "\n",
342 |     "# we throw away \"neutral\" (dim 1) and take the probability of\n",
343 |     "# \"entailment\" (2) as the probability of the label being true \n",
344 |     "entail_contradiction_logits = logits[:,[0,2]]\n",
345 |     "probs = entail_contradiction_logits.softmax(dim=1)\n",
346 |     "true_prob = probs[:,1].item() * 100\n",
347 |     "print(f'Probability that the label is true: {true_prob:0.2f}%')\n"
348 |    ]
349 |   },
350 |   {
351 |    "cell_type": "markdown",
352 |    "metadata": {},
353 |    "source": [
354 |     "### Now we can test the inference latency in the Inf2 chips:"
355 |    ]
356 |   },
357 |   {
358 |    "cell_type": "code",
359 |    "execution_count": null,
360 |    "metadata": {},
361 |    "outputs": [],
362 |    "source": [
363 |     "%%timeit -r 10\n",
364 |     "\n",
365 |     "model(x[\"input_ids\"], x[\"attention_mask\"])"
366 |    ]
367 |   },
368 |   {
369 |    "cell_type": "markdown",
370 |    "metadata": {},
371 |    "source": [
372 |     "### And compare it with the model hosted in the CPU:"
373 |    ]
374 |   },
375 |   {
376 |    "cell_type": "code",
377 |    "execution_count": null,
378 |    "metadata": {},
379 |    "outputs": [],
380 |    "source": [
381 |     "%%timeit -r 10\n",
382 |     "model_cpu(x[\"input_ids\"], x[\"attention_mask\"])"
383 |    ]
384 |   },
385 |   {
386 |    "cell_type": "markdown",
387 |    "metadata": {},
388 |    "source": [
389 |     "### Finally we can compare the output of CPU model vs the Inf2"
390 |    ]
391 |   },
392 |   {
393 |    "cell_type": "code",
394 |    "execution_count": null,
395 |    "metadata": {},
396 |    "outputs": [],
397 |    "source": [
398 |     "y = model_cpu(x[\"input_ids\"],x[\"attention_mask\"])\n",
399 |     "logits = y[0]\n",
400 |     "# we throw away \"neutral\" (dim 1) and take the probability of\n",
401 |     "# \"entailment\" (2) as the probability of the label being true \n",
402 |     "entail_contradiction_logits = logits[:,[0,2]]\n",
403 |     "probs = entail_contradiction_logits.softmax(dim=1)\n",
404 |     "true_prob = probs[:,1].item() * 100\n",
405 |     "print(f'Probability that the label is true: {true_prob:0.2f}%')\n"
406 |    ]
407 |   },
408 |   {
409 |    "cell_type": "markdown",
410 |    "metadata": {},
411 |    "source": [
412 |     "the value should be very similar to the one 3 cells above."
413 |    ]
414 |   }
415 |  ],
416 |  "metadata": {
417 |   "kernelspec": {
418 |    "display_name": "Python 3 (ipykernel)",
419 |    "language": "python",
420 |    "name": "python3"
421 |   },
422 |   "language_info": {
423 |    "codemirror_mode": {
424 |     "name": "ipython",
425 |     "version": 3
426 |    },
427 |    "file_extension": ".py",
428 |    "mimetype": "text/x-python",
429 |    "name": "python",
430 |    "nbconvert_exporter": "python",
431 |    "pygments_lexer": "ipython3",
432 |    "version": "3.10.12"
433 |   }
434 |  },
435 |  "nbformat": 4,
436 |  "nbformat_minor": 4
437 | }
438 | 
439 | 


--------------------------------------------------------------------------------
/workshops/01_FineTuneSpamClassifier/README.md:
--------------------------------------------------------------------------------
 1 | # How to reduce costs and improve performance of your Machine Learning (ML) workloads?
 2 | ## AWS Machine Learning Purpose-built Accelerators Tutorial
 3 | 
 4 | In this workshop you'll learn how to use [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/) and [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) with [Amazon SageMaker](https://aws.amazon.com/sagemaker/) and [Hugging Face Optimum Neuron](https://huggingface.co/docs/optimum-neuron/index), to optimize your ML workloads! You'll also learn a new methodology to map/qualify/implement end2end solutions for different business challenges. A **top-down** approach that starts with the **use case/business challenge** identification/mapping and ends with a trained model deployed as an API, which can be then integrated to your application.
 5 | 
 6 | Supposing you have a **business challenge** to address, which requires custom ML models. You need to prepare a dataset, train/deploy your models and finally integrate these models to your application (eventually automate this whole process). And, in the end, you expect to have a cost-optimized solution that fits into your budget.
 7 | 
 8 | The picture bellow shows the steps of the proposed methodology you need to follow in order to successfuly apply it to your own business problem:
 9 | <p align="center">
10 |   <img src="docs/imgs/01_activities.png"></img>
11 | </p>
12 | 
13 | <table>
14 |     <tr><td><strong>1) Use case identification:</strong></td><td>The first step of the process is to identify your use case. We prepared a table with a list of common use cases, framed as questions. The idea is to find the <b>Task</b> we'll use to address the problem.</td></tr>
15 |     <tr><td><strong>1.1) Task mapping:</strong></td><td>After identifying the use case/business challenge, using the **use cases table** or your own judgment, now it is time to prepare a model for that given <b>Task</b></td></tr>
16 |     <tr><td><strong>2) Model selection:</strong></td><td>There is a second table which lists all the current supported models and the <b>Tasks</b> it can implement. Use that table to select your model</td></tr>
17 |     <tr><td><strong>3) Model building:</strong></td><td>Now, you can make use of the available notebooks to run: 1/ Data Preparation; 2/ Model fine-tuning and 3/ Model deploying. If you already have a pre-trained model, you can skip steps 1 and 2</td></tr>
18 |     <tr><td><strong>4) App integration:</strong></td><td>In the previous step you deployed your model and it is now exposed as an API. Just integrate your application to this API and start using your model</td></tr>
19 | </table>
20 |             
21 | ## 1) Use case mapping
22 | 
23 | The following table brings a list of common use cases (framed as questions) and their associated tasks. Use this table as a reference to idenfity which **Task** is the best option to address your problem. Frame your use **case/business challenge** as a question and try to find the most similar option in the table. Then, use the task associated to the mappend use case, in the second column, and follow the next steps. 
24 | 
25 | **IMPORTANT:** If you don't find a use case (question) that resonates with your own use case, try to identify which **Task** is more appropriate for your scenario (using the tasks table). Also, please cut a ticket with the description of your use case + a framed question so that we can improve this table.
26 | 
27 | |Use case question|Task|
28 | |:-|:-|
29 | |How to create an auto-complete mechanism for my application?|CausalLM|
30 | |How to create a chat-bot to answer questions from an FAQ to my customers?|QuestionAnswering|
31 | |How can I summarize a long document into a few paragraphs?|CausalLM|
32 | |How can I create a spam classifier for my emails?|SequenceClassification|
33 | |How to check if a given text has a good or a bad comment?|SequenceClassification|
34 | |How do I translate documents from multiple languages to dutch?|CausalLM|
35 | |How to complete a sentence, given its initial words only|CausalLM|
36 | |How to classify pictures of products into different classes?|ImageClassification|
37 | |How to create an Alexa like mechanism which detects specific keywords?|AudioClassification|
38 | |How to create subtitles to audiobooks?|Text-To-Speech|
39 | |Given two sentences, how to make sure the second sentence is related to the first?|NextSentencePrediction|
40 | 
41 | ### 1.1) Available Tasks
42 | 
43 | |Task|Description|
44 | |:-|:-|
45 | |SequenceClassification|Text classification - binary or multi class|
46 | |MultipleChoice|Given a context and multiple options, the model predicts which one is correct|
47 | |TokenClassification|Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER)|
48 | |MaskedLM|When the input text has a mask that needs to be replaced by a generated term|
49 | |QuestionAnswering|It answers questions bases on a context or on the acquired knowledge via training|
50 | |CausalLM|Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens|
51 | |ConditionalGeneration|Fills a mask based on the conditions of the sentence|
52 | |NextSentencePrediction|NSP consists of giving the model two sentences, sentence A and sentence B. We then say, ‘hey Model, does sentence B come after sentence A?’ — and Model says either IsNextSentence or NotNextSentence.|
53 | |MaskedImageModeling|Predict masks of the objects in a given picture|
54 | |ImageClassification|Classifies (binary or multiclass) an image into different classes of objects|
55 | 
56 | ## 2) HF Optimum Neuron - Supported Models
57 | 
58 | [Click here to see the current supported models for training and inference in Hugging Face Optimum Neuron](purpose-built-accelerators/docs/optimum_neuron_models.md)
59 | 
60 | ## 3) Model Building
61 | Here you can find notebooks you can run on [Amazon SageMaker Studio](https://aws.amazon.com/sagemaker/studio/) to prepare a model that addresses a task associated to your own use case. They implement a solution for the following use case: **How can I create a spam detection mechanism?**. The required task is **SequenceClassification**. In the end we'll have a Binary Text classification model which receives a given email as input and return 0=NOT SPAM and 1=SPAM.
62 | 
63 | - The first notebook downloads a public dataset named **Deysi/spam-detection-dataset**. The dataset has already samples labelade as **spam** or **not spam**. 
64 | - The second notebook is configured to train a **bert-base-uncased** for **SequenceClassification**. You'll notice there are variables you can configure to define the model and the task, then you define some hyperparameters and kick-off the training job using Amazon SageMaker.
65 | - The third notebook shows how to compile a pre-trained model to AWS Inferentia and deploy it to a SageMaker real-time Endpoint which will exposes the model as a simple API (WebService).
66 | 
67 | **ATTENTION:** if you already have a trained model, compatible with the models listed in the table linked in section 2, then just use the third notebook (you don't need the first two in this case).
68 | 
69 | |Notebook|Description|
70 | |-|-|
71 | |[01 - Data Preparation](notebooks/01_DatasetPreparation.ipynb)|How to load and prepare a dataset for fine-tuning a model|
72 | |[02 - Model Fine-tuning](notebooks/02_ModelFineTuning.ipynb)|How to kick-off a fine-tuning job using the dataset prepared in the previous notebook|
73 | |[03 - Model Deployment](notebooks/03_ModelInference.ipynb)|How to compile and deploy a pre-trained model to Inferentia|
74 | 
75 | ## 4) App Integration
76 | 
77 | If you followed the steps in the previous sections, you have a running SageMaker real-time endpoint with your model. Now you can make use of [AWS SDK for SageMaker runtime](https://aws.amazon.com/developer/tools/) which offers libraries available for the most common programming languages. If your application is Python based, you can also make use of [Amazon SageMaker Inference API](https://sagemaker.readthedocs.io/en/stable/api/inference/index.html).
78 | 


--------------------------------------------------------------------------------
/workshops/01_FineTuneSpamClassifier/docs/imgs/01_activities.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/ml-specialized-hardware/d73aadcdd1b966d23e5191882f707c0cc01cbe23/workshops/01_FineTuneSpamClassifier/docs/imgs/01_activities.png


--------------------------------------------------------------------------------
/workshops/01_FineTuneSpamClassifier/notebooks/requirements.txt:
--------------------------------------------------------------------------------
1 | datasets
2 | transformers==4.43.2
3 | optimum-neuron==0.0.25


--------------------------------------------------------------------------------
/workshops/01_FineTuneSpamClassifier/notebooks/src/compile.py:
--------------------------------------------------------------------------------
 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 2 | # SPDX-License-Identifier: MIT-0
 3 | 
 4 | import os
 5 | os.environ['NEURON_RT_NUM_CORES'] = '1'
 6 | import sys
 7 | import glob
 8 | import json
 9 | import torch
10 | import shutil
11 | import tarfile
12 | import logging
13 | import argparse
14 | import traceback
15 | import optimum.neuron
16 | from transformers import AutoTokenizer
17 | 
18 | def model_fn(model_dir, context=None):
19 |     task = os.environ.get("TASK")
20 |     if task is None: raise Exception("Invalid TASK. You need to invoke the compilation job once to set TASK variable")
21 |         
22 |     NeuronModel = eval(f"optimum.neuron.NeuronModelFor{task}")
23 |     tokenizer = AutoTokenizer.from_pretrained(model_dir)
24 |     model = NeuronModel.from_pretrained(model_dir)
25 |     return model,tokenizer
26 | 
27 | def input_fn(input_data, content_type, context=None):
28 |     if content_type == 'application/json':
29 |         req = json.loads(input_data)
30 |         prompt = req.get('prompt')
31 |         if prompt is None or len(prompt) < 3:
32 |             raise("Invalid prompt. Provide an input like: {'prompt': 'text text text'}")
33 |         return prompt
34 |     else:
35 |         raise Exception(f"Unsupported mime type: {content_type}. Supported: application/json")    
36 | 
37 | def predict_fn(input_object, model_tokenizer, context=None):
38 |     model,tokenizer = model_tokenizer
39 |     inputs = tokenizer(input_object, truncation=True, return_tensors="pt")
40 |     logits = model(**inputs).logits
41 |     idx = logits.argmax(1, keepdim=True)
42 |     conf = torch.gather(logits, 1, idx)
43 |     return torch.cat([idx,conf], 1)    
44 | 
45 | if __name__ == "__main__":
46 |     parser = argparse.ArgumentParser()
47 | 
48 |     # hyperparameters sent by the client are passed as command-line arguments to the script.    
49 |     parser.add_argument("--task", type=str, default="")
50 |     parser.add_argument("--dynamic_batch_size", type=bool, default=False)
51 |     parser.add_argument("--input_shapes", type=str, required=True)
52 |     parser.add_argument("--is_model_compressed", type=bool, default=True)
53 |     
54 |     parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])    
55 |     parser.add_argument("--checkpoint_dir", type=str, default=os.environ["SM_CHANNEL_CHECKPOINT"])
56 |     
57 |     args, _ = parser.parse_known_args()
58 | 
59 |     # Set up logging        
60 |     logging.basicConfig(
61 |         level=logging.getLevelName("DEBUG"),
62 |         handlers=[logging.StreamHandler(sys.stdout)],
63 |         format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
64 |     )
65 |     logger = logging.getLogger(__name__)
66 |     logger.info(args)
67 | 
68 |     NeuronModel = eval(f"optimum.neuron.NeuronModel{'For' + args.task if len(args.task) > 0 else ''}")
69 |     logger.info(f"Checkpoint files: {os.listdir(args.checkpoint_dir)}")
70 | 
71 |     model_path = args.checkpoint_dir
72 |     if args.is_model_compressed:
73 |         logger.info("Decompressing model file...")
74 |         with tarfile.open(os.path.join(args.checkpoint_dir, "model.tar.gz"), 'r:gz') as tar:
75 |             tar.extractall(os.path.join(args.checkpoint_dir, "model"))
76 |         model_path = os.path.join(args.checkpoint_dir, "model")
77 |         logger.info(f"Done! Model path: {model_path}")
78 |         logger.info(f"Model path files: {os.listdir(model_path)}")
79 | 
80 |     input_shapes = json.loads(args.input_shapes)
81 |     model = NeuronModel.from_pretrained(model_path, export=True, dynamic_batch_size=args.dynamic_batch_size, **input_shapes)
82 |     model.save_pretrained(args.model_dir)
83 | 
84 |     code_path = os.path.join(args.model_dir, 'code')
85 |     os.makedirs(code_path, exist_ok=True)
86 | 
87 |     shutil.copy(__file__, os.path.join(code_path, "inference.py"))
88 |     shutil.copy('requirements.txt', os.path.join(code_path, 'requirements.txt'))
89 | 


--------------------------------------------------------------------------------
/workshops/01_FineTuneSpamClassifier/notebooks/src/dump_model_table.py:
--------------------------------------------------------------------------------
 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 2 | # SPDX-License-Identifier: MIT-0
 3 | 
 4 | import os
 5 | import re
 6 | import sys
 7 | import argparse
 8 | import pandas as pd
 9 | from optimum.neuron import version
10 | from optimum.exporters.tasks import TasksManager
11 | from optimum.exporters.neuron.model_configs import *
12 | from optimum.neuron.distributed.parallelizers_manager import ParallelizersManager
13 | from optimum.neuron.utils.training_utils import (
14 |     _SUPPORTED_MODEL_NAMES,
15 |     _SUPPORTED_MODEL_TYPES,
16 |     _generate_supported_model_class_names
17 | )
18 | 
19 | def training_models():
20 |     # retrieve supported models for Tensor Parallelism
21 |     tp_support = list(ParallelizersManager._MODEL_TYPE_TO_PARALLEL_MODEL_CLASS.keys())
22 | 
23 |     # build compability table for training
24 |     data_training = {'Model': []}
25 |     for m in _SUPPORTED_MODEL_TYPES:
26 |         if type(m) != str: m = m[0]
27 |         if m=='gpt-2': m='gpt2' # fix the name
28 |         model_id = len(data_training['Model'])
29 |         model_link = f'<a rel="noopener noreferrer" target="_new" href="https://huggingface.co/models?sort=trending&search={m}">{m}</a>'
30 |         data_training['Model'].append(f"{model_link} <font style='color: red;'><b>[TP]</b></font>" if m in tp_support else model_link)
31 |         tasks = [re.sub(r'.+For(.+)', r'\1', t) for t in set(_generate_supported_model_class_names(m)) if not t.endswith('Model')]
32 |         for t in tasks:
33 |             if data_training.get(t) is None: data_training[t] = [''] * len(_SUPPORTED_MODEL_TYPES)
34 |             data_training[t][model_id] = f'<a rel="noopener noreferrer" target="_new" href="https://huggingface.co/docs/transformers/model_doc/{m}#transformers.{m.title()}For{t}">doc</a>'        
35 |     df_training = pd.DataFrame.from_dict(data_training).set_index('Model')
36 |     return df_training.to_markdown()
37 |     
38 | def inference_models():
39 |     # retrieve supported models for Tensor Parallelism
40 |     tp_support = list(ParallelizersManager._MODEL_TYPE_TO_PARALLEL_MODEL_CLASS.keys())
41 | 
42 |     # build compability table for inference
43 |     meta = [(k,list(v['neuron'].keys())) for k,v in TasksManager._SUPPORTED_MODEL_TYPE.items() if v.get('neuron') is not None]
44 |     data_inference = {'Model': []}
45 |     for m,t in meta:
46 |         model_id = len(data_inference['Model'])
47 |         model_link = f'<a rel="noopener noreferrer" target="_new" href="https://huggingface.co/models?sort=trending&search={m}">{m}</a>'
48 |         data_inference['Model'].append(f"{model_link} <font style='color: red;'><b>[TP]</b></font>" if m in tp_support else model_link)
49 |         for task in t:
50 |             if data_inference.get(task) is None: data_inference[task] = [''] * len(meta)
51 |             data_inference[task][model_id] = f'<a rel="noopener noreferrer" target="_new" href="https://huggingface.co/models?pipeline_tag={task}&sort=trending&search={m}">doc</a>'
52 | 
53 |     df_inference = pd.DataFrame.from_dict(data_inference).set_index('Model')
54 |     return df_inference.to_markdown()
55 | 
56 | if __name__ == "__main__":
57 |     parser = argparse.ArgumentParser()
58 | 
59 |     # input parameters of this script
60 |     parser.add_argument("--output_file", type=str, required=True)
61 |     
62 |     try:
63 |         args, _ = parser.parse_known_args()
64 |         print(f"Dumping the metadata file to: {args.output_file}")
65 |         with open(args.output_file, 'w') as f:
66 |             f.write("# HF Optimum Neuron - Supported Models\n")
67 |             f.write(f"**version: {version.__version__}**  \n")
68 |             f.write("Models marked with <font style='color: red;'><b>[TP]</b></font> support **Tensor Parallelism** for training and inference\n")
69 |             f.write("## Models/tasks for training\n")
70 |             f.write(f"{training_models()}\n")
71 |             f.write("## Models/tasks for inference\n")
72 |             f.write(f"{inference_models()}\n")
73 |     except Exception as e:
74 |         print(traceback.format_exc())
75 |         sys.exit(1)
76 |         
77 |     finally:
78 |         print("Done! ", sys.exc_info())
79 |         sys.exit(0)
80 | 


--------------------------------------------------------------------------------
/workshops/01_FineTuneSpamClassifier/notebooks/src/requirements.txt:
--------------------------------------------------------------------------------
1 | --extra-index-url https://pip.repos.neuron.amazonaws.com
2 | evaluate==0.4.3
3 | optimum-neuron==0.0.25


--------------------------------------------------------------------------------
/workshops/01_FineTuneSpamClassifier/notebooks/src/train.py:
--------------------------------------------------------------------------------
  1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
  2 | # SPDX-License-Identifier: MIT-0
  3 | 
  4 | import os
  5 | import sys
  6 | import torch
  7 | import random
  8 | import argparse
  9 | import evaluate
 10 | import importlib
 11 | import traceback
 12 | import subprocess
 13 | import transformers
 14 | 
 15 | from huggingface_hub import login
 16 | from datasets import load_from_disk
 17 | from transformers import AutoTokenizer
 18 | from optimum.neuron import NeuronTrainer as Trainer
 19 | from optimum.neuron import NeuronTrainingArguments as TrainingArguments
 20 | 
 21 | if __name__ == "__main__":
 22 |     parser = argparse.ArgumentParser()
 23 | 
 24 |     # hyperparameters sent by the client are passed as command-line arguments to the script.
 25 |     parser.add_argument("--epochs", type=int, default=3)
 26 |     parser.add_argument("--max_sen_len", type=int, default=256)
 27 |     parser.add_argument("--train_batch_size", type=int, default=32)
 28 |     parser.add_argument("--eval_batch_size", type=int, default=64)
 29 |     parser.add_argument("--warmup_steps", type=int, default=500)
 30 |     parser.add_argument("--tensor_parallel_size", type=int, default=1)
 31 |     parser.add_argument("--model_id", type=str, required=True)
 32 |     parser.add_argument("--zero_1", type=bool, default=False)
 33 |     parser.add_argument("--task", type=str, default="")
 34 |     parser.add_argument("--collator", type=str, default="DefaultDataCollator")
 35 |     parser.add_argument("--learning_rate", type=float, default=5e-5)
 36 |     parser.add_argument("--weight_decay", type=float, default=0.01)
 37 |     parser.add_argument("--bf16", type=bool, default=True)
 38 | 
 39 |     # hugging face hub
 40 |     parser.add_argument("--hf_token", type=str, default=None)
 41 | 
 42 |     # Data, model, and output directories
 43 |     parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
 44 |     parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
 45 |     parser.add_argument("--n_neurons", type=str, default=os.environ["SM_NUM_NEURONS"])
 46 |     parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
 47 |     parser.add_argument("--eval_dir", type=str, default=os.environ.get("SM_CHANNEL_EVAL", None))
 48 | 
 49 |     parser.add_argument('--checkpoints-path', type=str, help="Path where we'll save the cache", default='/opt/ml/checkpoints')
 50 | 
 51 |     args, _ = parser.parse_known_args()
 52 |     os.makedirs(args.checkpoints_path, exist_ok=True)
 53 | 
 54 |     if not args.hf_token is None and len(args.hf_token) > 0:
 55 |         print("HF token defined. Logging in...")
 56 |         login(token=args.hf_token)
 57 |         
 58 |         cmd = f"optimum-cli neuron cache set {os.environ['CUSTOM_CACHE_REPO']}"
 59 |         subprocess.check_call(cmd.split(' '))
 60 | 
 61 |     Collator = eval(f"transformers.{args.collator}")
 62 |     AutoModel = eval(f"transformers.AutoModel{'For' + args.task if len(args.task) > 0 else ''}")
 63 | 
 64 |     train_dataset=load_from_disk(args.training_dir)
 65 |     eval_dataset=load_from_disk(args.eval_dir) if not args.eval_dir is None else None
 66 | 
 67 |     tokenizer = AutoTokenizer.from_pretrained(args.model_id)
 68 |     tokenizer.pad_token = tokenizer.eos_token
 69 |     tokenizer.model_max_length = args.max_sen_len
 70 | 
 71 |     data_collator = Collator(return_tensors="pt")
 72 |     model = AutoModel.from_pretrained(args.model_id, trust_remote_code=True) # TODO: add a hyperparameter with model params
 73 | 
 74 |     training_args = TrainingArguments(
 75 |         evaluation_strategy="epoch" if not args.eval_dir is None else "no",
 76 |         learning_rate=args.learning_rate,
 77 |         weight_decay=args.weight_decay,
 78 |         bf16=args.bf16,
 79 |         num_train_epochs=args.epochs,
 80 |         output_dir=args.checkpoints_path,
 81 |         overwrite_output_dir=True,
 82 |         tensor_parallel_size=args.tensor_parallel_size,
 83 |         zero_1=args.zero_1,
 84 | 
 85 |         per_device_train_batch_size=args.train_batch_size,
 86 |         per_device_eval_batch_size=args.eval_batch_size if not args.eval_dir is None else None,
 87 |         logging_dir=f"{args.output_data_dir}/logs",
 88 |         logging_strategy="steps",
 89 |         logging_steps=500,
 90 |         save_steps=2000,
 91 |         save_strategy="steps",
 92 |         save_total_limit=1,
 93 |     )
 94 |     trainer = Trainer(
 95 |         model=model,
 96 |         args=training_args,
 97 |         train_dataset=train_dataset,
 98 |         eval_dataset=eval_dataset,
 99 |         data_collator=data_collator,
100 |     )
101 |     trainer.train()
102 |     # save artifacts that will be uploaded to S3
103 |     trainer.save_model(args.model_dir)
104 |     tokenizer.save_pretrained(args.model_dir)
105 | 


--------------------------------------------------------------------------------
/workshops/02_DomainAdaptation/README.md:
--------------------------------------------------------------------------------
  1 | # Adapting LLMs for domain-aware applications with AWS Trainium post-training
  2 | 
  3 | ## Introduction
  4 | 
  5 | Large language models are typically trained on a broad corpus of data from various domains, making them highly capable of handling diverse tasks and topics. However, when these models are deployed in specific domains or applications, their performance may not be optimal due to the domain-specific language, terminology, and context. Domain adaptation aims to fine-tune or adapt the pre-trained LLM to a particular domain or task, improving its performance and enabling better understanding of domain-specific data.
  6 | 
  7 | # Scenarios and Use Cases for LLM Domain Adaptation
  8 | LLM domain adaptation is useful in various scenarios where domain-specific knowledge or language patterns are crucial. Some common use cases include:
  9 | 
 10 |  - Specialized industries (e.g., healthcare, finance, legal, engineering)
 11 |  - Domain-specific applications (e.g., chatbots for customer service, virtual assistants for specific tasks)
 12 |  - Text summarization or generation for specific domains
 13 |  - Question-answering systems for domain-specific knowledge bases
 14 |  - Sentiment analysis or text classification in domain-specific contexts
 15 |  - Machine translation for domain-specific terminologies
 16 | 
 17 | # Techniques for LLM Domain Adaptation
 18 | 
 19 | **Supervised Fine-Tuning (SFT)**: In this approach, the language model is fine-tuned on a labeled dataset specific to the target domain or task. The model learns to generate outputs similar to the labeled examples in the dataset.
 20 |    - Use case: Fine-tuning a language model for legal document summarization, where you have a dataset of legal documents and their corresponding summaries.
 21 | 
 22 | **Reinforcement Learning from Human Feedback (RLHF)**: This technique involves providing human feedback (e.g., ratings, comparisons, or corrections) to the language model during the fine-tuning process. The model is trained to generate outputs that align with the human feedback, effectively shaping its behavior according to human preferences. RLHF can be particularly useful when you want to imbue the language model with specific traits, such as factuality, safety, or ethical behavior.
 23 |    - Use case: Training a language model for customer service applications, where it needs to provide helpful, polite, and factual responses.
 24 | 
 25 | **Direct Preference Optimization (DPO)**: DPO is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning.
 26 |    - Use case: Pre-training a language model on corrupted scientific paper abstracts to improve its performance on tasks related to scientific literature, such as question answering or text summarization.
 27 | 
 28 | **Odds Ratio Preference Optimization (ORPO)**: ORPO is a simple and novel monolithic alignment method technique which efficiently penalizes the model from learning undesired generation styles during SFT.
 29 |    - Use case: Adapting a general-purpose language model to the financial domain by training it with ORPO on a corpus of financial reports and news articles.
 30 | 
 31 | <p align="center">
 32 |   <img src="docs/imgs/model_alignment_techniques.png" alt="model alignment techniques"></img>
 33 | </p>
 34 | 
 35 | These techniques can be used individually or in combination, depending on the specific requirements and constraints of the domain adaptation task. The choice of technique often depends on factors such as the availability of labeled data, the desired traits or behaviors of the adapted model, and the computational resources available.
 36 | 
 37 | It's worth noting that domain adaptation is an active area of research, and new techniques and approaches are constantly emerging. Additionally, the effectiveness of these techniques can vary depending on the specific domain, task, and language model used.
 38 | 
 39 | # This workshop
 40 | 
 41 | For this particular workshop, we'll use ORPO, given its effienciency while managing the resources required to adapt our LLMs. It uses less memory to achieve similar results to DPO, resulting in a cheaper setup.
 42 | 
 43 | Duration: Approximately 60 minutes
 44 | 
 45 | ## ORPO
 46 | [ORPO](https://arxiv.org/html/2403.07691v2) is a fine-tuning technique that streamlines the process of adapting LLMs to specific tasks. It addresses a limitation of the traditional two-stage approach. While SFT effectively adapts the model to a desired domain, it can inadvertently increase the probability of generating undesirable responses alongside preferred ones.
 47 | 
 48 | ![orpo intro](./docs/imgs/6-orpo-intro.png)
 49 | 
 50 | Here’s a breakdown of the issue:
 51 |  - Supervised Fine-Tuning (SFT): Trains the LLM on task-specific data, improving its performance in that domain.
 52 |  - Drawback: During SFT, the probability of generating undesirable responses along with preferred ones also increases, as shown in the image.
 53 | 
 54 | ![orpo intro](./docs/imgs/6-orpo-curve.png)
 55 | 
 56 | Preference alignment is then employed to address this issue. It aims to:
 57 | 
 58 |  - Increase the likelihood of generating preferred responses.
 59 |  - Decrease the likelihood of generating rejected responses.
 60 | 
 61 | Traditionally, preference alignment is achieved through techniques like Reinforcement Learning with Human Feedback (RLHF) or Direct Preference Optimization (DPO). However, these methods require a separate reference model, increasing computational complexity.
 62 | 
 63 | ORPO elegantly solves this problem by combining SFT and preference alignment into a single objective function. It modifies the standard language modeling loss by incorporating an odds ratio (OR) term. This term:
 64 | 
 65 |   - Weakly penalizes rejected responses.
 66 |   - Strongly rewards preferred responses.
 67 | 
 68 | By simultaneously optimizing for both objectives, ORPO allows the LLM to learn the target task while aligning its outputs with human preferences.
 69 | 
 70 | [For more details, you can check this blog post](https://huggingface.co/blog/mlabonne/orpo-llama-3).
 71 | 
 72 | In this workshop we'll use an implementation of ORPO provided by Hugging Face Optimum Neuron.
 73 | 
 74 | ## HF Optimum Neuron
 75 | 
 76 | 🤗 Optimum Neuron is the interface between the 🤗 Transformers library and AWS Accelerators including AWS Trainium and AWS Inferentia. It provides a set of tools enabling easy model loading, training and inference on single- and multi-Accelerator settings for different downstream tasks.
 77 | 
 78 | With Optimum Neuron you can bring your transformers training code and with minimal changes execute it on AWS Trainium. Here you can see an example of a training code compatible with Optimum Neuron.
 79 | 
 80 | ```python
 81 | - from transformers import Trainer, TrainingArguments
 82 | + from optimum.neuron import NeuronTrainer as Trainer
 83 | + from optimum.neuron import NeuronTrainingArguments as TrainingArguments
 84 | 
 85 | from transformers import TrainingArguments
 86 | from optimum.neuron import NeuronTrainer as Trainer
 87 | 
 88 | def parse_args():
 89 | 	...
 90 | 
 91 | def training_function(args):
 92 | 
 93 |     # load dataset from disk and tokenizer
 94 |     train_dataset = load_from_disk(os.path.join(args.dataset_path, "train"))
 95 | 		...
 96 | 
 97 |     # Download the model from huggingface.co/models
 98 |     model = AutoModelForSequenceClassification.from_pretrained(
 99 |         args.model_id, num_labels=num_labels, label2id=label2id, id2label=id2label
100 |     )
101 | 
102 |     training_args = TrainingArguments(
103 | 			...
104 |     )
105 | 
106 |     # Create Trainer instance
107 |     trainer = Trainer(
108 |         model=model,
109 |         args=training_args,
110 |         train_dataset=train_dataset,
111 |         eval_dataset=eval_dataset,
112 |         compute_metrics=compute_metrics,
113 |     )
114 | 
115 |     # Start training
116 |     trainer.train()
117 | ```
118 | 
119 | For more information about HF Optimum Neuron, please [check the official documentation](https://huggingface.co/docs/optimum-neuron/index).
120 | 


--------------------------------------------------------------------------------
/workshops/02_DomainAdaptation/docs/imgs/6-orpo-curve.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/ml-specialized-hardware/d73aadcdd1b966d23e5191882f707c0cc01cbe23/workshops/02_DomainAdaptation/docs/imgs/6-orpo-curve.png


--------------------------------------------------------------------------------
/workshops/02_DomainAdaptation/docs/imgs/6-orpo-intro.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/ml-specialized-hardware/d73aadcdd1b966d23e5191882f707c0cc01cbe23/workshops/02_DomainAdaptation/docs/imgs/6-orpo-intro.png


--------------------------------------------------------------------------------
/workshops/02_DomainAdaptation/docs/imgs/model_alignment_techniques.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/ml-specialized-hardware/d73aadcdd1b966d23e5191882f707c0cc01cbe23/workshops/02_DomainAdaptation/docs/imgs/model_alignment_techniques.png


--------------------------------------------------------------------------------
/workshops/03_NKIWorkshop/README.md:
--------------------------------------------------------------------------------
 1 | # Building Custom Accelerator Kernels with AWS Neuron Kernel Interface (NKI)
 2 | 
 3 | ## Introduction
 4 | 
 5 | The Neuron Kernel Interface (NKI) is a domain-specific language and runtime that allows developers to write custom kernels optimized for AWS Neuron devices (Trainium/Inferentia). NKI enables direct access to the hardware's compute and memory resources, giving you fine-grained control over how your workloads execute on Neuron accelerators.
 6 | 
 7 | With NKI, you can create highly optimized implementations of custom operators, extend ML frameworks with new functionality, and maximize performance for your unique machine learning workloads. This interface bridges the gap between high-level ML frameworks and the underlying Neuron hardware, providing a programming environment similar to CUDA but specifically designed for AWS Neuron devices.
 8 | 
 9 | NKI allows developers to harness the full computational power of AWS Neuron devices by writing kernels that explicitly manage computation and memory operations. This level of control is essential for specialized workloads that require custom optimization beyond what's possible with standard operators provided by ML frameworks.
10 | Scenarios and Use Cases for NKI
11 | 
12 | ## NKI Programming Model
13 | 
14 | NKI follows a three-phase programming model that gives developers explicit control over the execution of their kernels:
15 | 
16 | 1. Load - Move data from device memory (HBM) to on-chip memory (SBUF)
17 |    * Explicitly define which data to bring into fast on-chip memory
18 |    * Control memory access patterns to optimize bandwidth utilization
19 |    * Apply data transformations during loading if needed
20 | 
21 | 2. Compute - Perform operations using on-chip memory
22 |    * Execute arithmetic operations on data in on-chip memory
23 |    * Leverage vector and matrix operations for efficient computation
24 |    * Utilize specialized hardware units for operations like matrix multiplication
25 | 
26 | 3. Store - Move results from on-chip memory back to device memory
27 |    * Control when and how results are written back to device memory
28 |    * Optimize for memory bandwidth by storing results efficiently
29 |    * Apply masks or conditions to selectively update memory
30 | 
31 | This programming model is based on the architecture of Neuron devices, which feature a large HBM (High Bandwidth Memory) for storing model weights and activations, and a smaller but faster on-chip memory (SBUF) for active computations. By explicitly managing data movement between these memory tiers, developers can optimize for both performance and energy efficiency.
32 | 
33 | ## This Workshop
34 | 
35 | This hands-on workshop will teach you how to build, optimize, and integrate custom kernels for AWS Neuron devices using NKI. You'll learn the fundamentals of kernel development, how to integrate kernels with PyTorch, and how to analyze performance with Neuron Profile.
36 | 
37 | Duration: Approximately 90 minutes
38 | Workshop Outline
39 | 
40 | 1. Environment Setup
41 |    * Configuring your Trn1 or Inf2 instance
42 |    * Installing required packages
43 |    * Verifying your setup
44 | 
45 | 2. Implementing Your First NKI Kernel
46 |    * Understanding the NKI programming model
47 |    * Writing a simple tensor addition kernel
48 |    * Running kernels in baremetal mode and with PyTorch
49 | 
50 | 3. Integrating Prebuilt Kernels
51 |    * Using optimized kernels from the neuronxcc.nki.kernels namespace
52 |    * Comparing custom Flash Attention implementation with standard attention
53 |    * Understanding performance benefits of optimized kernels
54 | 
55 | 4. Creating Custom Operators
56 |    * Inserting NKI kernels as custom operators in PyTorch
57 |    * Implementing forward and backward passes for training
58 |    * Supporting autograd for custom operators
59 | 
60 | 5. Performance Analysis with Neuron Profile
61 |    * Installing and using Neuron Profile
62 |    * Capturing execution traces
63 |    * Analyzing kernel performance metrics
64 |    * Identifying optimization opportunities
65 | 
66 | 
67 | Each section builds upon the previous one, gradually introducing more advanced concepts while providing hands-on experience with real code examples.
68 | Prerequisites
69 | 
70 | In the first notebook, you'll set up your environment and verify that NKI is properly installed and configured. Then you'll implement your first kernel - a simple tensor addition operation - and learn how to run it both in baremetal mode and through PyTorch. This will establish the foundation for the more advanced topics covered in subsequent notebooks.
71 | 
72 | Let's begin with the environment setup and your first NKI kernel!
73 | 


--------------------------------------------------------------------------------
/workshops/03_NKIWorkshop/notebooks/0-setup.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# 0 - setup\n",
  8 |     "In this guide, we will implement a simple “Hello World” style NKI kernel and run it on a NeuronDevice (Trainium/Inferentia2 or beyond device). We will showcase how to invoke a NKI kernel standalone through NKI baremetal mode and also through ML frameworks (PyTorch). Before diving into kernel implementation, let’s make sure you have the correct environment setup for running NKI kernels."
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "markdown",
 13 |    "metadata": {},
 14 |    "source": [
 15 |     "## Environment Setup\n",
 16 |     "You need a [Trn1](https://aws.amazon.com/ec2/instance-types/trn1/) or [Inf2](https://aws.amazon.com/ec2/instance-types/inf2/) instance set up on AWS to run NKI kernels on a NeuronDevice. Once logged into the instance, follow steps below to ensure you have all the required packages installed in your Python environment.\n",
 17 |     "\n",
 18 |     "NKI is shipped as part of the Neuron compiler package. To make sure you have the latest compiler package, see [Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/index.html) for an installation guide.\n",
 19 |     "\n",
 20 |     "You can verify that NKI is available in your compiler installation by running the following command:"
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "code",
 25 |    "execution_count": null,
 26 |    "metadata": {},
 27 |    "outputs": [],
 28 |    "source": [
 29 |     "import neuronxcc.nki"
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "markdown",
 34 |    "metadata": {},
 35 |    "source": [
 36 |     "This attempts to import the NKI package. It will error out if NKI is not included in your Neuron compiler version or if the Neuron compiler is not installed. The import might take about a minute the first time you run it. Whenever possible, we recommend using local instance NVMe volumes instead of EBS for executable code.\n",
 37 |     "\n",
 38 |     "If you intend to run NKI kernels without any ML framework for quick prototyping, you will also need NumPy installed.\n",
 39 |     "\n",
 40 |     "To call NKI kernels from PyTorch, you also need to have torch_neuronx installed. For an installation guide, see PyTorch Neuron Setup. You can verify that you have torch_neuronx installed by running the following command:"
 41 |    ]
 42 |   },
 43 |   {
 44 |    "cell_type": "code",
 45 |    "execution_count": null,
 46 |    "metadata": {},
 47 |    "outputs": [],
 48 |    "source": [
 49 |     "import torch_neuronx"
 50 |    ]
 51 |   },
 52 |   {
 53 |    "cell_type": "markdown",
 54 |    "metadata": {},
 55 |    "source": [
 56 |     "## Implementing your first NKI kernel\n",
 57 |     "In current NKI release, all input and output tensors must be passed into the kernel as device memory (HBM) tensors on a NeuronDevice. The body of the kernel typically consists of three main phases:\n",
 58 |     "\n",
 59 |     "1. Load the inputs from device memory to on-chip memory (SBUF).\n",
 60 |     "2. Perform the desired computation.\n",
 61 |     "3. Store the outputs from on-chip memory to device memory.\n",
 62 |     "\n",
 63 |     "For more details on the above terms, see [NKI Programming Model](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/programming_model.html)."
 64 |    ]
 65 |   },
 66 |   {
 67 |    "cell_type": "code",
 68 |    "execution_count": null,
 69 |    "metadata": {},
 70 |    "outputs": [],
 71 |    "source": [
 72 |     "from neuronxcc import nki\n",
 73 |     "import neuronxcc.nki.language as nl\n",
 74 |     "\n",
 75 |     "@nki.jit\n",
 76 |     "def nki_tensor_add_kernel(a_input, b_input):\n",
 77 |     "    \"\"\"\n",
 78 |     "    NKI kernel to compute element-wise addition of two input tensors\n",
 79 |     "    \"\"\"\n",
 80 |     "    \n",
 81 |     "    # Check all input/output tensor shapes are the same for element-wise operation\n",
 82 |     "    assert a_input.shape == b_input.shape\n",
 83 |     "\n",
 84 |     "    # Check size of the first dimension does not exceed on-chip memory tile size limit,\n",
 85 |     "    # so that we don't need to tile the input to keep this example simple\n",
 86 |     "    assert a_input.shape[0] <= nl.tile_size.pmax\n",
 87 |     "\n",
 88 |     "    # Load the inputs from device memory to on-chip memory\n",
 89 |     "    a_tile = nl.load(a_input)\n",
 90 |     "    b_tile = nl.load(b_input)\n",
 91 |     "\n",
 92 |     "    # Specify the computation (in our case: a + b)\n",
 93 |     "    c_tile = nl.add(a_tile, b_tile)\n",
 94 |     "\n",
 95 |     "    # Create a HBM tensor as the kernel output\n",
 96 |     "    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)\n",
 97 |     "\n",
 98 |     "    # Store the result to c_output from on-chip memory to device memory\n",
 99 |     "    nl.store(c_output, value=c_tile)\n",
100 |     "\n",
101 |     "    # Return kernel output as function output\n",
102 |     "    return c_output"
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "markdown",
107 |    "metadata": {},
108 |    "source": [
109 |     "## Running the kernel\n",
110 |     "Next, we will cover unique ways to run the above NKI kernel on a NeuronDevice:\n",
111 |     "\n",
112 |     "1. NKI baremetal: run NKI kernel with no ML framework involvement\n",
113 |     "2. PyTorch: run NKI kernel as a PyTorch operator\n",
114 |     "3. JAX: run NKI kernel as a JAX operator (not used in this workshop)\n",
115 |     "\n",
116 |     "All three run modes can call the same kernel function decorated with the `nki.jit` decorator as discussed above:\n",
117 |     "```\n",
118 |     "@nki.jit\n",
119 |     "def nki_tensor_add_kernel(a_input, b_input):\n",
120 |     "```\n",
121 |     "The `nki.jit` decorator automatically chooses the correct run mode by checking the incoming tensor type:\n",
122 |     "\n",
123 |     "1. NumPy arrays as input: run in NKI baremetal mode\n",
124 |     "2. PyTorch tensors as input: run in PyTorch mode\n",
125 |     "3. JAX tensors: run in JAX mode\n",
126 |     "\n",
127 |     "See [nki.jit](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.jit.html) API doc for more details."
128 |    ]
129 |   },
130 |   {
131 |    "cell_type": "markdown",
132 |    "metadata": {},
133 |    "source": [
134 |     "### NKI baremetal\n",
135 |     "\n",
136 |     "Baremetal mode expects input tensors of the NKI kernel to be NumPy arrays. The kernel also converts its NKI output tensors to NumPy arrays. To invoke the kernel, we first initialize the two input tensors `a` and `b` as NumPy arrays. Finally, we call the NKI kernel just like any other Python function:"
137 |    ]
138 |   },
139 |   {
140 |    "cell_type": "code",
141 |    "execution_count": null,
142 |    "metadata": {},
143 |    "outputs": [],
144 |    "source": [
145 |     "import numpy as np\n",
146 |     "\n",
147 |     "a = np.ones((4, 3), dtype=np.float16)\n",
148 |     "b = np.ones((4, 3), dtype=np.float16)\n",
149 |     "\n",
150 |     "# Run NKI kernel on a NeuronDevice\n",
151 |     "c = nki_tensor_add_kernel(a, b)\n",
152 |     "\n",
153 |     "print(c)"
154 |    ]
155 |   },
156 |   {
157 |    "cell_type": "markdown",
158 |    "metadata": {},
159 |    "source": [
160 |     "> Alternatively, we can decorate the kernel with [nki.baremetal](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.baremetal.html) or pass the `mode` parameter to the `nki.jit` decorator, `@nki.jit(mode='baremetal')`, to bypass the dynamic mode detection. See [nki.baremetal](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.baremetal.html) API doc for more available input arguments for the baremetal mode."
161 |    ]
162 |   },
163 |   {
164 |    "cell_type": "markdown",
165 |    "metadata": {},
166 |    "source": [
167 |     "### PyTorch\n",
168 |     "\n",
169 |     "To run the above `nki_tensor_add_kernel` kernel using PyTorch, we initialize the input and output tensors as PyTorch `device` tensors instead."
170 |    ]
171 |   },
172 |   {
173 |    "cell_type": "code",
174 |    "execution_count": null,
175 |    "metadata": {},
176 |    "outputs": [],
177 |    "source": [
178 |     "import torch\n",
179 |     "from torch_xla.core import xla_model as xm\n",
180 |     "\n",
181 |     "nki_tensor_add_kernel_pytorch = nki.jit(nki_tensor_add_kernel, mode='torchxla')\n",
182 |     "\n",
183 |     "device = xm.xla_device()\n",
184 |     "\n",
185 |     "a = torch.ones((4, 3), dtype=torch.float16).to(device=device)\n",
186 |     "b = torch.ones((4, 3), dtype=torch.float16).to(device=device)\n",
187 |     "\n",
188 |     "c = nki_tensor_add_kernel_pytorch(a, b)\n",
189 |     "\n",
190 |     "print(c)  # an implicit XLA barrier/mark-step (triggers XLA compilation)"
191 |    ]
192 |   },
193 |   {
194 |    "cell_type": "markdown",
195 |    "metadata": {},
196 |    "source": [
197 |     "> Alternatively, we can pass the `mode='torchxla'` parameter into the `nki.jit` decorator to bypass the dynamic mode detection."
198 |    ]
199 |   },
200 |   {
201 |    "cell_type": "markdown",
202 |    "metadata": {},
203 |    "source": [
204 |     "## Release the NeuronCore for the next notebook\n",
205 |     "\n",
206 |     "Before moving to the next notebook we need to release the NeuronCore. If we don't do this the next notebook will not be able resources - you can also stop the kernel via the GUI"
207 |    ]
208 |   },
209 |   {
210 |    "cell_type": "code",
211 |    "execution_count": null,
212 |    "metadata": {},
213 |    "outputs": [],
214 |    "source": [
215 |     "import IPython\n",
216 |     "IPython.Application.instance().kernel.do_shutdown(True)"
217 |    ]
218 |   },
219 |   {
220 |    "cell_type": "code",
221 |    "execution_count": null,
222 |    "metadata": {},
223 |    "outputs": [],
224 |    "source": []
225 |   }
226 |  ],
227 |  "metadata": {
228 |   "kernelspec": {
229 |    "display_name": "Python 3 (ipykernel)",
230 |    "language": "python",
231 |    "name": "python3"
232 |   },
233 |   "language_info": {
234 |    "codemirror_mode": {
235 |     "name": "ipython",
236 |     "version": 3
237 |    },
238 |    "file_extension": ".py",
239 |    "mimetype": "text/x-python",
240 |    "name": "python",
241 |    "nbconvert_exporter": "python",
242 |    "pygments_lexer": "ipython3",
243 |    "version": "3.10.12"
244 |   }
245 |  },
246 |  "nbformat": 4,
247 |  "nbformat_minor": 4
248 | }
249 | 


--------------------------------------------------------------------------------
/workshops/03_NKIWorkshop/notebooks/2-custom-operators.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# 2-Custom Operators\n",
  8 |     "This notebook demonstrates how to insert a NKI kernel as a custom operators into a PyTorch.\n",
  9 |     "\n",
 10 |     "## Using NKI kernels\n",
 11 |     "To register a NKI kernel registration, you need to call a decorated NKI function.\n",
 12 |     "\n",
 13 |     "Let’s examine a guiding example below where we randomly initialize two inputs, add them together, and then multiply the result by the two input tensors element-wise. This effectively calculates: `a * b * (a + b)`.\n",
 14 |     "\n",
 15 |     "We define a common NKI kernel for addition. For more information on the kernel, see [SPMD Tensor Addition](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/tutorials/spmd_tensor_addition.html)."
 16 |    ]
 17 |   },
 18 |   {
 19 |    "cell_type": "code",
 20 |    "execution_count": null,
 21 |    "metadata": {},
 22 |    "outputs": [],
 23 |    "source": [
 24 |     "import neuronxcc.nki as nki\n",
 25 |     "import neuronxcc.nki.language as nl\n",
 26 |     "\n",
 27 |     "@nki.jit\n",
 28 |     "def nki_tensor_add_kernel_(a_input, b_input):\n",
 29 |     "  \"\"\"NKI kernel to compute element-wise addition of two input tensors\n",
 30 |     "  \n",
 31 |     "  This kernel assumes strict input/output sizes can be uniformly tiled to [128,512]\n",
 32 |     "\n",
 33 |     "  Args:\n",
 34 |     "      a_input: a first input tensor\n",
 35 |     "      b_input: a second input tensor\n",
 36 |     "\n",
 37 |     "  Returns:\n",
 38 |     "      c_output: an output tensor\n",
 39 |     "  \"\"\"\n",
 40 |     "\n",
 41 |     "  # Create output tensor shared between all SPMD instances as result tensor\n",
 42 |     "  c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)\n",
 43 |     "\n",
 44 |     "  # Calculate tile offsets based on current 'program'\n",
 45 |     "  offset_i_x = nl.program_id(0) * 128\n",
 46 |     "  offset_i_y = nl.program_id(1) * 512\n",
 47 |     "\n",
 48 |     "  # Generate tensor indices to index tensors a and b\n",
 49 |     "  ix_, iy_ = nl.mgrid[0:128, 0:512]\n",
 50 |     "  ix = offset_i_x + ix_\n",
 51 |     "  iy = offset_i_y + iy_\n",
 52 |     "\n",
 53 |     "  # Load input data from device memory (HBM) to on-chip memory (SBUF)\n",
 54 |     "  # We refer to an indexed portion of a tensor as an intermediate tensor\n",
 55 |     "  a_tile = nl.load(a_input[ix, iy])\n",
 56 |     "  b_tile = nl.load(b_input[ix, iy])\n",
 57 |     "\n",
 58 |     "  # compute a + b\n",
 59 |     "  c_tile = a_tile + b_tile\n",
 60 |     "\n",
 61 |     "  # store the addition results back to device memory (c_output)\n",
 62 |     "  nl.store(c_output[ix, iy], value=c_tile)\n",
 63 |     "\n",
 64 |     "  # Transfer the ownership of `c_output` to the caller\n",
 65 |     "  return c_output"
 66 |    ]
 67 |   },
 68 |   {
 69 |    "cell_type": "markdown",
 70 |    "metadata": {},
 71 |    "source": [
 72 |     "## PyTorch\n",
 73 |     "We can perform `(a + b) * a * b` using native PyTorch code."
 74 |    ]
 75 |   },
 76 |   {
 77 |    "cell_type": "code",
 78 |    "execution_count": null,
 79 |    "metadata": {},
 80 |    "outputs": [],
 81 |    "source": [
 82 |     "import torch\n",
 83 |     "from torch_xla.core import xla_model as xm\n",
 84 |     "\n",
 85 |     "device = xm.xla_device()\n",
 86 |     "\n",
 87 |     "a = torch.randn(256, 1024, dtype=torch.float32).to(device)\n",
 88 |     "b = torch.randn(256, 1024, dtype=torch.float32).to(device)\n",
 89 |     "c = a + b\n",
 90 |     "out = a * b * c\n",
 91 |     "\n",
 92 |     "print(out)"
 93 |    ]
 94 |   },
 95 |   {
 96 |    "cell_type": "markdown",
 97 |    "metadata": {},
 98 |    "source": [
 99 |     "Now let’s replace the tensor addition (`c = a + b`) with a NKI kernel. To do this we replace the `+` operator with a call to the NKI kernel caller (`nki_tensor_add`), and everything else works as before."
100 |    ]
101 |   },
102 |   {
103 |    "cell_type": "code",
104 |    "execution_count": null,
105 |    "metadata": {},
106 |    "outputs": [],
107 |    "source": [
108 |     "def nki_tensor_add(a_input, b_input):\n",
109 |     "  \"\"\"NKI kernel caller to compute element-wise addition of two input tensors\n",
110 |     "\n",
111 |     "  This kernel caller lifts tile-size restriction, by applying the kernel on tiles of the inputs/outputs\n",
112 |     "\n",
113 |     "  Args:\n",
114 |     "      a_input: a first input tensor, of shape [N*128, M*512]\n",
115 |     "      b_input: a second input tensor, of shape [N*128, M*512]\n",
116 |     "\n",
117 |     "  Returns:\n",
118 |     "      a tensor of shape [N*128, M*512], the result of a_input + b_input\n",
119 |     "  \"\"\"\n",
120 |     "\n",
121 |     "  # The SPMD launch grid denotes the number of kernel instances.\n",
122 |     "  # In this case, we use a 2D grid where the size of each invocation is 128x512\n",
123 |     "  grid_x = a_input.shape[0] // 128\n",
124 |     "  grid_y = a_input.shape[1] // 512\n",
125 |     "\n",
126 |     "  return nki_tensor_add_kernel_[grid_x, grid_y](a_input, b_input)\n",
127 |     "\n",
128 |     "device = xm.xla_device()\n",
129 |     "a = torch.randn(256, 1024, dtype=torch.float32).to(device)\n",
130 |     "b = torch.randn(256, 1024, dtype=torch.float32).to(device)\n",
131 |     "c = nki_tensor_add(a, b) # calling a NKI kernel, instead of the built-in torch op\n",
132 |     "out = a * b * c\n",
133 |     "print(out)"
134 |    ]
135 |   },
136 |   {
137 |    "cell_type": "markdown",
138 |    "metadata": {},
139 |    "source": [
140 |     "To understand what happens under the hood when we compile the above code, we can print HLO IR graph generated by XLA by setting the `NEURON_FRAMEWORK_DEBUG` environment variable. For example, you may add the following lines to your code:"
141 |    ]
142 |   },
143 |   {
144 |    "cell_type": "code",
145 |    "execution_count": null,
146 |    "metadata": {},
147 |    "outputs": [],
148 |    "source": [
149 |     "import os\n",
150 |     "os.environ['NEURON_FRAMEWORK_DEBUG'] = \"1\""
151 |    ]
152 |   },
153 |   {
154 |    "cell_type": "markdown",
155 |    "metadata": {},
156 |    "source": [
157 |     "A `.pbtxt` file is then written in your run directory that has the corresponding human-readable HLO IR.\n",
158 |     "\n",
159 |     "Let’s examine the XLA output of this example. In line #5 we can identify that the tensor addition is now mapped to an HLO `custom-call` instruction, with `AwsNeuronCustomNativeKernel` as `custom_call_target`. The output of that `custom-call` is then consumed by the next instruction in line #6 as usual.\n",
160 |     "\n",
161 |     "```python\n",
162 |     "ENTRY %SyncTensorsGraph.22 (p0.2: f32[256,1024], p1.2: f32[256,1024]) -> (f32[256,1024]) {\n",
163 |     " %p1.2 = f32[256,1024]{1,0} parameter(1), frontend_attributes={neff_input_name=\"input1\"}\n",
164 |     " %p0.2 = f32[256,1024]{1,0} parameter(0), frontend_attributes={neff_input_name=\"input0\"}\n",
165 |     " %multiply = f32[256,1024]{1,0} multiply(f32[256,1024]{1,0} %p1.2, f32[256,1024]{1,0} %p0.2)\n",
166 |     " %custom-call.2 = f32[256,1024]{1,0} custom-call(f32[256,1024]{1,0} %p1.2, f32[256,1024]{1,0} %p0.2), custom_call_target=\"AwsNeuronCustomNativeKernel\", api_version=API_VERSION_UNSPECIFIED, backend_config=\"...\")\n",
167 |     " %multiply.1 = f32[256,1024]{1,0} multiply(f32[256,1024]{1,0} %multiply, f32[256,1024]{1,0} %custom-call.2)\n",
168 |     " ROOT %tuple = (f32[256,1024]{1,0}) tuple(f32[256,1024]{1,0} %multiply.1), frontend_attributes={neff_output_names=\"output0\"}\n",
169 |     "}\n",
170 |     "```\n",
171 |     "\n",
172 |     "The Neuron compiler replaces the above custom-call with the corresponding NKI kernel implementation while optimizing the rest of the compute graph as usual. At the end of the compilation process, a single compiled binary NEFF file is generated representing the entire graph including the NKI kernel. For more information about NEFF files, see [Neuron Compiler](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/index.html)."
173 |    ]
174 |   },
175 |   {
176 |    "cell_type": "markdown",
177 |    "metadata": {},
178 |    "source": [
179 |     "## Using NKI in training graphs\n",
180 |     "\n",
181 |     "If you are using NKI to implement a new operator in a training graph, you might need to make the new operator interplay with the `autograd` engine in the framework. To do this, in PyTorch, you can subclass the framework’s base operator class and implement both the `forward()` and `backward()` methods. The `autograd` engine then uses the `backward()` method when performing auto-differentiation. See Extending [torch.autograd](https://pytorch.org/docs/stable/notes/extending.html) in the PyTorch Docs for instructions on doing this in PyTorch.\n",
182 |     "\n",
183 |     "Let’s reuse the `nki_tensor_add` kernel from before and demonstrate how to train a simple compute graph `(a+b)*a*b` in PyTorch.\n",
184 |     "\n",
185 |     "## PyTorch\n",
186 |     "\n",
187 |     "We define a `NkiAddFunc` class, which leverages the `nki_tensor_add` kernel in its `forward()` function. The gradients of both input tensors in `y = a + b` are ones, so the `backward()` function propagates the `dy` gradients from the previous backward function."
188 |    ]
189 |   },
190 |   {
191 |    "cell_type": "code",
192 |    "execution_count": null,
193 |    "metadata": {},
194 |    "outputs": [],
195 |    "source": [
196 |     "import torch\n",
197 |     "import torch_xla.core.xla_model as xm\n",
198 |     "device = xm.xla_device()\n",
199 |     "\n",
200 |     "class NkiAddFunc(torch.autograd.Function):\n",
201 |     "  @staticmethod\n",
202 |     "  def forward(ctx, a, b):\n",
203 |     "    return nki_tensor_add(a, b)\n",
204 |     "\n",
205 |     "  @staticmethod\n",
206 |     "  def backward(ctx, dy, *args):\n",
207 |     "    # gradients for a and b\n",
208 |     "    return dy, dy\n",
209 |     "\n",
210 |     "# now, let's define the compute graph\n",
211 |     "a = torch.randn(256, 1024, dtype=torch.float32).to(device).detach().requires_grad_()\n",
212 |     "b = torch.randn(256, 1024, dtype=torch.float32).to(device).detach().requires_grad_()\n",
213 |     "c = NkiAddFunc.apply(a, b)\n",
214 |     "out = a * b * c\n",
215 |     "\n",
216 |     "# here we define a (dummy) loss-function, in prep for backward propagation\n",
217 |     "loss = out.sum()\n",
218 |     "\n",
219 |     "# lastly, let's invoke the auto-grad engine\n",
220 |     "loss.backward()\n",
221 |     "\n",
222 |     "xm.mark_step()"
223 |    ]
224 |   },
225 |   {
226 |    "cell_type": "markdown",
227 |    "metadata": {},
228 |    "source": [
229 |     "## Release the NeuronCore for the next notebook\n",
230 |     "\n",
231 |     "Before moving to the next notebook we need to release the NeuronCore. If we don't do this the next notebook will not be able resources - you can also stop the kernel via the GUI"
232 |    ]
233 |   },
234 |   {
235 |    "cell_type": "code",
236 |    "execution_count": null,
237 |    "metadata": {},
238 |    "outputs": [],
239 |    "source": [
240 |     "import IPython\n",
241 |     "IPython.Application.instance().kernel.do_shutdown(True)"
242 |    ]
243 |   },
244 |   {
245 |    "cell_type": "code",
246 |    "execution_count": null,
247 |    "metadata": {},
248 |    "outputs": [],
249 |    "source": []
250 |   }
251 |  ],
252 |  "metadata": {
253 |   "kernelspec": {
254 |    "display_name": "Python 3 (ipykernel)",
255 |    "language": "python",
256 |    "name": "python3"
257 |   },
258 |   "language_info": {
259 |    "codemirror_mode": {
260 |     "name": "ipython",
261 |     "version": 3
262 |    },
263 |    "file_extension": ".py",
264 |    "mimetype": "text/x-python",
265 |    "name": "python",
266 |    "nbconvert_exporter": "python",
267 |    "pygments_lexer": "ipython3",
268 |    "version": "3.10.12"
269 |   }
270 |  },
271 |  "nbformat": 4,
272 |  "nbformat_minor": 4
273 | }
274 | 


--------------------------------------------------------------------------------
/workshops/03_NKIWorkshop/notebooks/3-neuron-profile.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# 3-Neuron Profile \n",
  8 |     "In this tutorial, we use Neuron Profile to view the execution trace of a NKI kernel captured on a NeuronCore. In doing so, we learn about:\n",
  9 |     "\n",
 10 |     "- Installation and usage of Neuron Profile.\n",
 11 |     "\n",
 12 |     "- Inspecting a detailed execution timeline of compute engine instructions and DMA engine activities generated from your NKI kernel.\n",
 13 |     "\n",
 14 |     "As background, [Neuron Profile](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-profile-user-guide.html) is the tool you need to visualize where time is being spent during kernel execution on NeuronDevices, which is crucial for identifying performance bottlenecks and opportunities of your kernel. Neuron Profile produces runtime execution data for every instruction executed on each compute engine and also every data movement activity completed by DMA engines. Neuron Profile also reports key performance metrics such as compute engine and memory bandwidth utilization, which allows developers to quickly find out the achieved hardware efficiency of their kernel. Profiling typically has near zero overhead thanks to the dedicated on-chip profiling hardware in NeuronDevices.\n",
 15 |     "\n",
 16 |     "## Profile a NKI Kernel\n",
 17 |     "\n",
 18 |     "### Install Neuron Profile\n",
 19 |     "Make sure you have the latest version of the `aws-neuronx-tools`, which includes updated profiling support for NKI kernels. Neuron Profile is included within this package and is installed to `/opt/aws/neuron/bin`.\n",
 20 |     "\n",
 21 |     "The `aws-neuronx-tools` package comes pre-installed on [Neuron DLAMIs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html). For detailed installation instructions see [Neuron Profile User Guide: Installation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-profile-user-guide.html#installation).\n",
 22 |     "\n",
 23 |     "### Profile using `neuron-profile capture`\n",
 24 |     "\n",
 25 |     "To profile a NKI kernel the required steps are (1) enable `NEURON_FRAMEWORK_DEBUG` to tell the compiler to save the `NEFF` file, (2) execute the NKI kernel to generate the `NEFF`, and (3) run `neuron-profile capture` to generate a `NTFF` profile. Each step is described in more detail below.\n",
 26 |     "\n",
 27 |     "We will profile a NKI kernel which computes the element-wise exponential of an input tensor of any 2D shape. The rest of this tutorial will use a performance profile generated from this kernel as an example. Full code of `prof-kernel.py`:"
 28 |    ]
 29 |   },
 30 |   {
 31 |    "cell_type": "code",
 32 |    "execution_count": null,
 33 |    "metadata": {},
 34 |    "outputs": [],
 35 |    "source": [
 36 |     "%%writefile prof-kernel.py\n",
 37 |     "\"\"\"\n",
 38 |     "Example kernel used to demmonstrate Neuron Profile.\n",
 39 |     "\"\"\"\n",
 40 |     "import torch\n",
 41 |     "from neuronxcc import nki\n",
 42 |     "import neuronxcc.nki.language as nl\n",
 43 |     "import math\n",
 44 |     "import os\n",
 45 |     "os.environ[\"NEURON_FRAMEWORK_DEBUG\"] = \"1\"\n",
 46 |     "os.environ[\"NEURON_CC_FLAGS\"]= \" --disable-dge \"\n",
 47 |     "\n",
 48 |     "@nki.jit\n",
 49 |     "def tensor_exp_kernel_(in_tensor):\n",
 50 |     "  \"\"\"NKI kernel to compute elementwise exponential of an input tensor\n",
 51 |     "\n",
 52 |     "  Args:\n",
 53 |     "      in_tensor: an input tensor of ANY 2D shape (up to SBUF size)\n",
 54 |     "  Returns:\n",
 55 |     "      out_tensor: an output tensor of ANY 2D shape (up to SBUF size)\n",
 56 |     "  \"\"\"\n",
 57 |     "  out_tensor = nl.ndarray(in_tensor.shape, dtype=in_tensor.dtype,\n",
 58 |     "                          buffer=nl.shared_hbm)\n",
 59 |     "\n",
 60 |     "  sz_p, sz_f = in_tensor.shape\n",
 61 |     "\n",
 62 |     "  i_f = nl.arange(sz_f)[None, :]\n",
 63 |     "\n",
 64 |     "  for p in nl.affine_range(math.ceil(sz_p / nl.tile_size.pmax)):\n",
 65 |     "    # Generate tensor indices for the input/output tensors\n",
 66 |     "    # pad index to pmax, for simplicity\n",
 67 |     "    i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]\n",
 68 |     "\n",
 69 |     "    # Load input data from external memory to on-chip memory\n",
 70 |     "    # only read up to sz_p\n",
 71 |     "    in_tile = nl.load(in_tensor[i_p, i_f], mask=(i_p<sz_p))\n",
 72 |     "\n",
 73 |     "    # perform the computation\n",
 74 |     "    out_tile = nl.exp(in_tile)\n",
 75 |     "\n",
 76 |     "    # store the results back to external memory\n",
 77 |     "    # only write up to sz_p\n",
 78 |     "    nl.store(out_tensor[i_p, i_f], value=out_tile, mask=(i_p<sz_p))\n",
 79 |     "\n",
 80 |     "    return out_tensor\n",
 81 |     "\n",
 82 |     "if __name__ == \"__main__\":\n",
 83 |     "  from torch_xla.core import xla_model as xm\n",
 84 |     "  device = xm.xla_device()\n",
 85 |     "\n",
 86 |     "  in_tensor = torch.rand((250, 512), dtype=torch.float32).to(device=device)\n",
 87 |     "\n",
 88 |     "  out_tensor = tensor_exp_kernel_(in_tensor)\n",
 89 |     "  print(f\"output_nki={out_tensor}\")"
 90 |    ]
 91 |   },
 92 |   {
 93 |    "cell_type": "markdown",
 94 |    "metadata": {},
 95 |    "source": [
 96 |     "To profile this NKI kernel, follow these steps:\n",
 97 |     "\n",
 98 |     "1. Enable Neuron debug output by setting the `NEURON_FRAMEWORK_DEBUG` environment variable. This will trigger the Neuron compiler to save the Neuron Executable File Format (NEFF) artifact to the current directory after compilation of your NKI kernel. The NEFF contains all hardware instructions required to execute your NKI kernel on a NeuronDevice, as well as metadata and debug info needed for profiling. For example, add the following lines to your NKI kernel source file:"
 99 |    ]
100 |   },
101 |   {
102 |    "cell_type": "code",
103 |    "execution_count": null,
104 |    "metadata": {},
105 |    "outputs": [],
106 |    "source": [
107 |     "import os\n",
108 |     "os.environ[\"NEURON_FRAMEWORK_DEBUG\"] = \"1\"\n",
109 |     "os.environ[\"NEURON_CC_FLAGS\"]= \" --disable-dge \""
110 |    ]
111 |   },
112 |   {
113 |    "cell_type": "markdown",
114 |    "metadata": {},
115 |    "source": [
116 |     "<blockquote>\n",
117 |     "Use the flag `--disable-dge` to temporarily disable a new compiler feature which is interfering with DMA debugging information display in neuron-profile. This is highly recommended to improve NKI performance debugging experience until we release a software fix for this issue.\n",
118 |     "</blockquote>\n",
119 |     "\n",
120 |     "2. Compile your NKI kernel to create a NEFF in your current directory:"
121 |    ]
122 |   },
123 |   {
124 |    "cell_type": "code",
125 |    "execution_count": null,
126 |    "metadata": {},
127 |    "outputs": [],
128 |    "source": [
129 |     "!python3 prof-kernel.py"
130 |    ]
131 |   },
132 |   {
133 |    "cell_type": "markdown",
134 |    "metadata": {},
135 |    "source": [
136 |     "<blockquote>\n",
137 |     "Find your NEFF named similarly to `MODULE_0_SyncTensorsGraph.13_12659246067793504316.neff`.\n",
138 |     "</blockquote>\n",
139 |     "\n",
140 |     "3. Profile the NEFF. This profiling step executes the NEFF on the NeuronDevice and records a raw execution trace into an Neuron Trace File Format (NTFF) artifact."
141 |    ]
142 |   },
143 |   {
144 |    "cell_type": "code",
145 |    "execution_count": null,
146 |    "metadata": {},
147 |    "outputs": [],
148 |    "source": [
149 |     "!neuron-profile capture -n <path_to_neff> -s profile.ntff --profile-nth-exec=2"
150 |    ]
151 |   },
152 |   {
153 |    "cell_type": "markdown",
154 |    "metadata": {},
155 |    "source": [
156 |     "This will save your NTFF profile to `profile_exec_2.ntff`.\n",
157 |     "\n",
158 |     "<blockquote>\n",
159 |     "The `--profile-nth-exec=2` option will profile your NEFF twice on the NeuronDevice and output a NTFF profile for the second iteration. This is recommended to avoid one-time warmup delays which can be seen in the first iteration of execution.\n",
160 |     "</blockquote>\n",
161 |     "\n",
162 |     "In [View Neuron Profile UI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/neuron_profile_for_nki.html#nki-view-neuron-profile-ui), we will view the profile in a user-friendly format using the Neuron Profile UI.\n",
163 |     "\n",
164 |     "### Profile using nki.benchmark\n",
165 |     "\n",
166 |     "You may also use the [nki.benchmark](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.benchmark.html) API to generate a NEFF and NTFF programmatically. One caveat is [nki.benchmark](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.benchmark.html) runs your NEFF without an ML framework in [nki.baremetal](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.baremetal.html) mode, so the input tensors to the kernel must be NumPy arrays instead of framework tensors such as `torch.Tensor`.\n",
167 |     "\n",
168 |     "Below is an example NKI kernel decorated by [nki.benchmark](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.benchmark.html). Full code of `prof-kernel-benchmark.py`:"
169 |    ]
170 |   },
171 |   {
172 |    "cell_type": "code",
173 |    "execution_count": null,
174 |    "metadata": {},
175 |    "outputs": [],
176 |    "source": [
177 |     "%%writefile prof-kernel-benchmark.py\n",
178 |     "\"\"\"\n",
179 |     "Example kernel used to demmonstrate Neuron Profile with nki.benchmark.\n",
180 |     "\"\"\"\n",
181 |     "from neuronxcc import nki\n",
182 |     "from neuronxcc.nki.typing import tensor\n",
183 |     "import neuronxcc.nki.language as nl\n",
184 |     "import math\n",
185 |     "\n",
186 |     "\n",
187 |     "@nki.benchmark(save_neff_name='file.neff', save_trace_name='profile.ntff')\n",
188 |     "def tensor_exp_kernel_(in_tensor):\n",
189 |     "  \"\"\"NKI kernel to compute elementwise exponential of an input tensor\n",
190 |     "  Args:\n",
191 |     "      in_tensor: an input tensor of ANY 2D shape (up to SBUF size)\n",
192 |     "  Returns:\n",
193 |     "      out_tensor: an output tensor of ANY 2D shape (up to SBUF size)\n",
194 |     "  \"\"\"\n",
195 |     "  out_tensor = nl.ndarray(in_tensor.shape, dtype=in_tensor.dtype,\n",
196 |     "                          buffer=nl.shared_hbm)\n",
197 |     "\n",
198 |     "  sz_p, sz_f = in_tensor.shape\n",
199 |     "  i_f = nl.arange(sz_f)[None, :]\n",
200 |     "  for p in nl.affine_range(math.ceil(sz_p / nl.tile_size.pmax)):\n",
201 |     "    # Generate tensor indices for the input/output tensors\n",
202 |     "    # pad index to pmax, for simplicity\n",
203 |     "    i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]\n",
204 |     "    # Load input data from external memory to on-chip memory\n",
205 |     "    # only read up to sz_p\n",
206 |     "    in_tile = nl.load(in_tensor[i_p, i_f], mask=(i_p<sz_p))\n",
207 |     "    # perform the computation\n",
208 |     "    out_tile = nl.exp(in_tile)\n",
209 |     "    # store the results back to external memory\n",
210 |     "    # only write up to sz_p\n",
211 |     "    nl.store(out_tensor[i_p, i_f], value=out_tile, mask=(i_p<sz_p))\n",
212 |     "\n",
213 |     "  return out_tensor\n",
214 |     "\n",
215 |     "if __name__ == \"__main__\":\n",
216 |     "  tensor_exp_kernel_(tensor[[250, 512], nl.float32])"
217 |    ]
218 |   },
219 |   {
220 |    "cell_type": "markdown",
221 |    "metadata": {},
222 |    "source": [
223 |     "To use [nki.benchmark](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.benchmark.html) to create a NEFF file and NTFF profile in your current directory, execute the example NKI kernel with:"
224 |    ]
225 |   },
226 |   {
227 |    "cell_type": "code",
228 |    "execution_count": null,
229 |    "metadata": {},
230 |    "outputs": [],
231 |    "source": [
232 |     "!python3 prof-kernel-benchmark.py"
233 |    ]
234 |   },
235 |   {
236 |    "cell_type": "markdown",
237 |    "metadata": {},
238 |    "source": [
239 |     "## Release the NeuronCore for the next notebook\n",
240 |     "\n",
241 |     "Before moving to the next notebook we need to release the NeuronCore. If we don't do this the next notebook will not be able resources - you can also stop the kernel via the GUI"
242 |    ]
243 |   },
244 |   {
245 |    "cell_type": "code",
246 |    "execution_count": null,
247 |    "metadata": {},
248 |    "outputs": [],
249 |    "source": [
250 |     "import IPython\n",
251 |     "IPython.Application.instance().kernel.do_shutdown(True)"
252 |    ]
253 |   },
254 |   {
255 |    "cell_type": "code",
256 |    "execution_count": null,
257 |    "metadata": {},
258 |    "outputs": [],
259 |    "source": []
260 |   }
261 |  ],
262 |  "metadata": {
263 |   "kernelspec": {
264 |    "display_name": "Python 3 (ipykernel)",
265 |    "language": "python",
266 |    "name": "python3"
267 |   },
268 |   "language_info": {
269 |    "codemirror_mode": {
270 |     "name": "ipython",
271 |     "version": 3
272 |    },
273 |    "file_extension": ".py",
274 |    "mimetype": "text/x-python",
275 |    "name": "python",
276 |    "nbconvert_exporter": "python",
277 |    "pygments_lexer": "ipython3",
278 |    "version": "3.10.12"
279 |   }
280 |  },
281 |  "nbformat": 4,
282 |  "nbformat_minor": 4
283 | }
284 | 


--------------------------------------------------------------------------------