├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE ├── README.md ├── contributed └── models │ ├── README.md │ └── qwen2 │ ├── modeling_qwen2.py │ └── qwen-2-test.ipynb ├── doc └── README.md └── labs ├── FineTuning └── HuggingFaceExample │ ├── 01_finetuning │ ├── Finetune-TinyLlama-1.1B.ipynb │ └── assets │ │ ├── consolidate_adapter_shards_and_merge_model.py │ │ ├── finetune_llama.py │ │ └── requirements.txt │ ├── 02_inference │ └── Inference-TinyLlama-1.1B.ipynb │ └── Local example.ipynb ├── Lab_Four_NKI_Profiling.ipynb ├── Lab_One_NxDI.ipynb ├── Lab_Three_NKI_Custom_Operators.ipynb ├── Lab_Two_NKI.ipynb ├── generation_config.json └── vLLM ├── Benchmarks.ipynb └── Servers.ipynb /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 
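As a rough illustration of steps 1-5 above (a minimal sketch — the fork URL, branch name, and commit message below are placeholders, and the GitHub CLI step is optional):

```bash
# 1. Fork the repository on GitHub, then clone your fork (placeholder URL).
git clone https://github.com/<your-username>/build-on-trainium-workshop.git
cd build-on-trainium-workshop

# 2. Make your focused change on a topic branch.
git checkout -b my-fix

# 3. Run any local tests relevant to the code you touched.

# 4. Commit with a clear message and push the branch to your fork.
git add <changed-files>
git commit -m "Describe the specific change"
git push origin my-fix

# 5. Open a pull request against the upstream main branch from the GitHub UI,
#    or with the GitHub CLI if you have it installed:
gh pr create --base main --fill
```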
38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 
40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Neuron Workshops 2 | 3 | In this workshop you will learn how to develop support for a new model with [NeuronX Distributed Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-overview.html#nxdi-overview), using Llama 3.2 1B as the working example. You will also learn how to write your own kernel to directly program the accelerator hardware with the [Neuron Kernel Interface](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html). These tools will help you design your research proposals and experiments on Trainium. 4 | 5 | The workshop also includes an end-to-end example of using Hugging Face Optimum Neuron to fine-tune and host a small language model with Amazon SageMaker. 6 | 7 | 8 | ### What are AWS Trainium and Neuron? 9 | AWS Trainium is an AI chip developed by AWS to accelerate building and deploying machine learning models. Built on a specialized architecture designed for deep learning, Trainium accelerates the training and inference of complex models with high performance and scalability, making it ideal for academic researchers looking to optimize performance and costs. This architecture also emphasizes sustainability through energy-efficient design, reducing environmental impact. Amazon has established a dedicated Trainium research cluster featuring up to 40,000 Trainium chips, accessible via Amazon EC2 Trn1 instances.
These instances are connected through a non-blocking, petabit-scale network using Amazon EC2 UltraClusters, enabling seamless high-performance ML training. The Trn1 instance family is optimized to deliver substantial compute power for cutting-edge AI research and development. This unique offering not only enhances the efficiency and affordability of model training but also presents academic researchers with opportunities to publish new papers on underrepresented compute architectures, thus advancing the field. 10 | 11 | Learn more about Trainium [here](https://aws.amazon.com/ai/machine-learning/trainium/). 12 | 13 | ### Your workshop 14 | This hands-on workshop is designed for developers, data scientists, and machine learning engineers who are getting started on their journey with the Neuron SDK. 15 | 16 | The workshop has multiple available modules: 17 | 1. Setup instructions 18 | 2. Run inference with Llama and NeuronX Distributed Inference (NxD) 19 | 3. Write your own kernel with the Neuron Kernel Interface (NKI) 20 | 4. Fine-tune and host an existing, supported model on a different dataset using SageMaker. 21 | 22 | #### Instructor-led workshop 23 | If you are participating in an instructor-led workshop, follow the guidance provided by your instructor for accessing the environment. 24 | 25 | #### Self-managed workshop 26 | If you are following the workshop steps in your own environment, you will need to take the following actions: 27 | 1. Launch a trn1.2xlarge instance on Amazon EC2, using the latest [DLAMI with Neuron packages preinstalled](https://repost.aws/articles/ARTxLi0wndTwquyl7frQYuKg) 28 | 2. Use a Python virtual environment preinstalled in that DLAMI, commonly located in `/opt/aws_`. 29 | 3. Set up and manage your own development environment on that instance, such as by using VSCode or a Jupyter Lab server. 30 | 31 | ### Background knowledge 32 | This workshop introduces developing on AWS Trainium for the academic AI research audience and technical innovators. As such, it's expected that the audience will already have a firm understanding of machine learning fundamentals. 33 | 34 | ### Workshop costs 35 | If you are participating in an instructor-led workshop hosted in an AWS-managed Workshop Studio environment, you will not incur any costs for using this environment. If you are following this workshop in your own environment, then you will incur the costs associated with provisioning an Amazon EC2 instance. Please see the service pricing details [here](https://aws.amazon.com/ec2/pricing/on-demand/). 36 | 37 | At the time of writing, this workshop uses a trn1.2xlarge instance with an on-demand hourly rate in supported US regions of $1.34 per hour. The fine-tuning workshop requires less than an hour of ml.trn1.2xlarge at $1.54 per hour, and an ml.inf2.xlarge at $0.99 per hour. Please ensure you delete the resources when you are finished. 38 | 39 | ## FAQs and known issues 40 | 1. Workshop instructions are available [here](https://catalog.us-east-1.prod.workshops.aws/workshops/bf9d80a3-5e4b-4648-bca8-1d887bb2a9ca/en-US). 41 | 2. If you use the `NousResearch` Llama 3.2 1B, please note you'll need to remove a trailing comma in the model config file. You can do this with vim or VSCode. If you do not take this step, you'll get an invalid-JSON error when the model config is read in Lab 1.
If editing the file through the terminal is a little challenging, you can also download the config file from this repository with the following command: 42 | `!wget https://raw.githubusercontent.com/aws-neuron/build-on-trainium-workshop/main/labs/generation_config.json -P /home/ec2-user/environment/models/llama/` 43 | 3. Jupyter kernels can hold on to the NeuronCores as a Python process even after your cell has completed. This can then cause issues when you try to run a new notebook, and sometimes when you try to run another cell. If you encounter a `NeuronCore not found` or similar error message, please just restart your Jupyter kernel and/or shut down kernels from previous sessions. You can also restart the instance through the EC2 console. Once your node is back online, you can always check the availability of the NeuronCores with `neuron-ls`. 44 | 4. Want to see how to integrate NKI with NxD? Check out our `nki-llama` [here](https://github.com/aws-samples/nki-llama). 45 | 46 | 47 | ## Security 48 | 49 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 50 | 51 | ## License 52 | 53 | This project is licensed under the Apache-2.0 License. 54 | 55 | -------------------------------------------------------------------------------- /contributed/models/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /contributed/models/qwen2/qwen-2-test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "libneuronxla 2.2.3493.0+78c3e78c\n", 13 | "neuronx-cc 2.18.121.0+9e31e41a\n", 14 | "neuronx-distributed 0.12.12111+cdd84048\n", 15 | "neuronx-distributed-inference 0.3.5591+f50feae2\n", 16 | "torch-neuronx 2.6.0.2.7.5413+113e6810\n" 17 | ] 18 | } 19 | ], 20 | "source": [ 21 | "!pip list | grep neuron" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import torch\n", 31 | "from transformers import AutoTokenizer, GenerationConfig\n", 32 | "from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig\n", 33 | "from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter, load_pretrained_config" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 2, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "model_path = \"/home/ubuntu/model_hf_qwen/qwen2/\"\n", 43 | "traced_model_path = \"/home/ubuntu/traced_model_qwen/qwen2\"" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "from huggingface_hub import snapshot_download\n", 53 | "\n", 54 | "snapshot_download(\"Qwen/QwQ-32B\", local_dir=model_path)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "from modeling_qwen_v2 import Qwen2InferenceConfig, NeuronQwen2ForCausalLM\n", 64 | "\n", 65 | "def run_qwen2_compile():\n", 66 | "    # Initialize configs and tokenizer.\n", 67 | "    tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side=\"right\")\n", 68 | "    tokenizer.pad_token = tokenizer.eos_token\n", 69 | "\n", 70 | "    
generation_config = GenerationConfig.from_pretrained(model_path)\n", 71 | " generation_config_kwargs = {\n", 72 | " \"do_sample\": False,\n", 73 | " \"top_k\": 1,\n", 74 | " \"pad_token_id\": tokenizer.pad_token_id,\n", 75 | " }\n", 76 | " generation_config.update(**generation_config_kwargs)\n", 77 | " \n", 78 | " neuron_config = NeuronConfig(\n", 79 | " tp_degree=8,\n", 80 | " batch_size=1,\n", 81 | " max_context_length=128,\n", 82 | " seq_len=256,\n", 83 | " enable_bucketing=True,\n", 84 | " context_encoding_buckets=[128],\n", 85 | " token_generation_buckets=[256],\n", 86 | " flash_decoding_enabled=False,\n", 87 | " torch_dtype=torch.bfloat16,\n", 88 | " fused_qkv=False,\n", 89 | " attn_kernel_enabled=True,\n", 90 | " attn_cls=\"NeuronQwen2Attention\"\n", 91 | " )\n", 92 | " config = Qwen2InferenceConfig(\n", 93 | " neuron_config,\n", 94 | " load_config=load_pretrained_config(model_path),\n", 95 | " )\n", 96 | " \n", 97 | " # Compile and save model.\n", 98 | " print(\"\\nCompiling and saving model...\")\n", 99 | " model = NeuronQwen2ForCausalLM(model_path, config)\n", 100 | " model.compile(traced_model_path)\n", 101 | " tokenizer.save_pretrained(traced_model_path)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "run_qwen2_compile()" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "from modeling_qwen_v2 import Qwen2InferenceConfig, NeuronQwen2ForCausalLM\n", 120 | "\n", 121 | "model = NeuronQwen2ForCausalLM(traced_model_path)\n", 122 | "model.load(traced_model_path)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "config = model.get_config_cls()\n", 132 | "config.get_neuron_config_cls()" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 9, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "40" 144 | ] 145 | }, 146 | "execution_count": 9, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "model.config.num_attention_heads" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 10, 158 | "metadata": {}, 159 | "outputs": [ 160 | { 161 | "data": { 162 | "text/plain": [ 163 | "8" 164 | ] 165 | }, 166 | "execution_count": 10, 167 | "metadata": {}, 168 | "output_type": "execute_result" 169 | } 170 | ], 171 | "source": [ 172 | "model.config.num_key_value_heads" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 11, 178 | "metadata": {}, 179 | "outputs": [ 180 | { 181 | "data": { 182 | "text/plain": [ 183 | "5120" 184 | ] 185 | }, 186 | "execution_count": 11, 187 | "metadata": {}, 188 | "output_type": "execute_result" 189 | } 190 | ], 191 | "source": [ 192 | "model.config.hidden_size" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 12, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "name": "stderr", 202 | "output_type": "stream", 203 | "text": [ 204 | "Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.\n" 205 | ] 206 | }, 207 | { 208 | "data": { 209 | "text/plain": [ 210 | "\"Okay, the user wants a short introduction to large language models. Let me start by defining what a large language model is. 
I should mention that they are AI systems trained on vast amounts of text data. Maybe include that they use deep learning, specifically transformer architectures.\\n\\nI need to highlight their capabilities, like generating text, understanding context, and performing various tasks such as answering questions, writing stories, or coding. It's important to note their scale—large parameter counts and extensive training data. \\n\\nAlso, touch on their applications: customer service, content creation, research, etc. Maybe mention some examples like GPT, BERT, or\"" 211 | ] 212 | }, 213 | "execution_count": 12, 214 | "metadata": {}, 215 | "output_type": "execute_result" 216 | } 217 | ], 218 | "source": [ 219 | "tokenizer = AutoTokenizer.from_pretrained(traced_model_path)\n", 220 | "tokenizer.pad_token = tokenizer.eos_token\n", 221 | "generation_config = GenerationConfig.from_pretrained(model_path)\n", 222 | "generation_config_kwargs = {\n", 223 | " \"do_sample\": True,\n", 224 | " \"temperature\": 0.9,\n", 225 | " \"top_k\": 5,\n", 226 | " \"pad_token_id\": tokenizer.pad_token_id,\n", 227 | "}\n", 228 | "\n", 229 | "prompt = \"Give me a short introduction to large language model.\"\n", 230 | "messages = [\n", 231 | " {\"role\": \"system\", \"content\": \"You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\"},\n", 232 | " {\"role\": \"user\", \"content\": prompt}\n", 233 | "]\n", 234 | "text = tokenizer.apply_chat_template(\n", 235 | " messages,\n", 236 | " tokenize=False,\n", 237 | " add_generation_prompt=True\n", 238 | ")\n", 239 | "model_inputs = tokenizer([text], return_tensors=\"pt\")\n", 240 | "generation_model = HuggingFaceGenerationAdapter(model)\n", 241 | "generated_ids = generation_model.generate(\n", 242 | " **model_inputs,\n", 243 | " max_new_tokens=128\n", 244 | ")\n", 245 | "generated_ids = [\n", 246 | " output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n", 247 | "]\n", 248 | "\n", 249 | "response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n", 250 | "response" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 13, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "model.reset()" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "# Run Benchmarks" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 1, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "model_path = \"/home/ubuntu/model_hf_qwen/qwen2\"\n", 276 | "traced_model_path = \"/home/ubuntu/traced_model_qwen/qwen2/logit\"" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "dir = '/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/'\n", 286 | "!cp modeling_qwen2.py {dir}" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "# Edit the inference_demo.py file to include the following:\n", 294 | "\n", 295 | "```python\n", 296 | "from .modeling_qwen2 import NeuronQwen2ForCausalLM\n", 297 | "\n", 298 | "MODEL_TYPES = {\n", 299 | " \"llama\": {\"causal-lm\": NeuronLlamaForCausalLM},\n", 300 | " \"mixtral\": {\"causal-lm\": NeuronMixtralForCausalLM},\n", 301 | " \"dbrx\": {\"causal-lm\": NeuronDbrxForCausalLM},\n", 302 | " 'qwen2': {\"causal-lm\": NeuronQwen2ForCausalLM}\n", 303 | "}\n", 304 | "```" 305 | ] 
306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 8, 310 | "metadata": {}, 311 | "outputs": [ 312 | { 313 | "name": "stdout", 314 | "output_type": "stream", 315 | "text": [ 316 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 317 | " from neuronx_distributed.modules.moe.blockwise import (\n", 318 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 319 | " from neuronx_distributed.modules.moe.blockwise import (\n", 320 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 321 | " from neuronx_distributed.modules.moe.blockwise import (\n", 322 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/attention/utils.py:14: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 323 | " from neuronx_distributed_inference.modules.custom_calls import neuron_cumsum\n", 324 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:745: UserWarning: Set seed for `privateuseone` device does not take effect, please add API's `_is_in_bad_fork` and `manual_seed_all` to `privateuseone` device module.\n", 325 | " return fn(*args, **kwargs)\n", 326 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 327 | " from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV\n", 328 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 329 | " from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV\n", 330 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 331 | " from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV\n", 332 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/dbrx/modeling_dbrx.py:38: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 333 | " from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase\n", 334 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/dbrx/modeling_dbrx.py:38: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 335 | " from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase\n", 336 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/inference_demo.py:25: DeprecationWarning: torch_neuronx.nki_jit 
is deprecated, use nki.jit instead.\n", 337 | " from neuronx_distributed_inference.models.dbrx.modeling_dbrx import NeuronDbrxForCausalLM\n", 338 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/inference_demo.py:27: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 339 | " from neuronx_distributed_inference.models.mixtral.modeling_mixtral import NeuronMixtralForCausalLM\n", 340 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/mllama/modeling_mllama.py:72: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 341 | " from .modeling_mllama_vision import NeuronMllamaVisionModel # noqa: E402\n", 342 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/utils/accuracy.py:29: UserWarning: Intel extension for pytorch not found. For faster CPU references install `intel-extension-for-pytorch`.\n", 343 | " warnings.warn(\n", 344 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:745: UserWarning: Set seed for `privateuseone` device does not take effect, please add API's `_is_in_bad_fork` and `manual_seed_all` to `privateuseone` device module.\n", 345 | " return fn(*args, **kwargs)\n", 346 | "Loading configs...\n", 347 | "WARNING:root:NeuronConfig init: Unexpected keyword arguments: {'model_type': 'qwen2', 'task_type': 'causal-lm', 'model_path': '/home/ubuntu/model_hf_qwen/qwen2', 'compiled_model_path': '/home/ubuntu/traced_model_qwen/qwen2/logit', 'benchmark': True, 'check_accuracy_mode': , 'divergence_difference_tol': 0.001, 'prompts': ['To be, or not to be'], 'top_k': 1, 'top_p': 1.0, 'temperature': 1.0, 'do_sample': False, 'dynamic': False, 'pad_token_id': 151645, 'on_device_sampling': False, 'enable_torch_dist': False, 'enable_lora': False, 'max_loras': 1, 'max_lora_rank': 16, 'skip_warmup': False, 'skip_compile': False, 'compile_only': False, 'compile_dry_run': False, 'hlo_debug': False}\n", 348 | "\n", 349 | "Compiling and saving model...\n", 350 | "INFO:Neuron:Generating HLOs for the following models: ['context_encoding_model', 'token_generation_model']\n", 351 | "[2025-06-02 13:35:56.009: I neuronx_distributed/parallel_layers/parallel_state.py:592] > initializing tensor model parallel with size 8\n", 352 | "[2025-06-02 13:35:56.009: I neuronx_distributed/parallel_layers/parallel_state.py:593] > initializing pipeline model parallel with size 1\n", 353 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:594] > initializing context model parallel with size 1\n", 354 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:595] > initializing data parallel with size 1\n", 355 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:596] > initializing world size to 8\n", 356 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:343] [rank_0_pp-1_tp-1_dp-1_cp-1] Chosen Logic for replica groups ret_logic=, 'Ascending Ring PG Group')>\n", 357 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:632] [rank_0_pp-1_tp-1_dp-1_cp-1] tp_groups: replica_groups.tp_groups=[[0, 1, 2, 3, 4, 5, 6, 7]]\n", 358 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:633] [rank_0_pp-1_tp-1_dp-1_cp-1] dp_groups: 
replica_groups.dp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 359 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:634] [rank_0_pp-1_tp-1_dp-1_cp-1] pp_groups: replica_groups.pp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 360 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:635] [rank_0_pp-1_tp-1_dp-1_cp-1] cp_groups: replica_groups.cp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 361 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:636] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_model_groups: replica_groups.ep_model_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 362 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:637] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_data_groups: replica_groups.ep_data_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 363 | "INFO:Neuron:Generating 1 hlos for key: context_encoding_model\n", 364 | "INFO:Neuron:Started loading module context_encoding_model\n", 365 | "INFO:Neuron:Finished loading module context_encoding_model in 0.3605782985687256 seconds\n", 366 | "INFO:Neuron:generating HLO: context_encoding_model, input example shape = torch.Size([1, 16])\n", 367 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/parallel_layers/layers.py:478: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n", 368 | " with torch.cuda.amp.autocast(enabled=False):\n", 369 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=1, shape=torch.Size([1, 16]), dtype=torch.int32)\n", 370 | " warnings.warn(\n", 371 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=3, shape=torch.Size([1]), dtype=torch.int32)\n", 372 | " warnings.warn(\n", 373 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=4, shape=torch.Size([1, 3]), dtype=torch.float32)\n", 374 | " warnings.warn(\n", 375 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=5, shape=torch.Size([1]), dtype=torch.int32)\n", 376 | " warnings.warn(\n", 377 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. 
(index=6, shape=torch.Size([1]), dtype=torch.int32)\n", 378 | " warnings.warn(\n", 379 | "INFO:Neuron:Finished generating HLO for context_encoding_model in 8.811824083328247 seconds, input example shape = torch.Size([1, 16])\n", 380 | "INFO:Neuron:Generating 1 hlos for key: token_generation_model\n", 381 | "INFO:Neuron:Started loading module token_generation_model\n", 382 | "INFO:Neuron:Finished loading module token_generation_model in 0.13971686363220215 seconds\n", 383 | "INFO:Neuron:generating HLO: token_generation_model, input example shape = torch.Size([1, 1])\n", 384 | "INFO:Neuron:Finished generating HLO for token_generation_model in 9.776893615722656 seconds, input example shape = torch.Size([1, 1])\n", 385 | "INFO:Neuron:Generated all HLOs in 19.276326656341553 seconds\n", 386 | "INFO:Neuron:Starting compilation for the priority HLO\n", 387 | "INFO:Neuron:'token_generation_model' is the priority model with bucket rank 0\n", 388 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py:283: SyntaxWarning: str format compiler_flags is discouraged as its handling involves repeated joining and splitting, which can easily make mistakes if something is quoted or escaped. Use list[str] instead. Refer to documentation of the Python subprocess module for details.\n", 389 | " warnings.warn(SyntaxWarning(\n", 390 | "2025-06-02 13:36:15.000516: 7289 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_9b906898286ddf239aa0+91ef39e9.hlo_module.pb --output /tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_9b906898286ddf239aa0+91ef39e9.neff --target=trn1 --auto-cast=none --model-type=transformer --tensorizer-options=--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2 --vectorize-strided-dma --lnc=1 --logfile=/tmp/nxd_model/token_generation_model/_tp0_bk0/log-neuron-cc.txt --enable-internal-neff-wrapper --verbose=35\n", 391 | ".........Completed run_backend_driver.\n", 392 | "\n", 393 | "Compiler status PASS\n", 394 | "INFO:Neuron:Done compilation for the priority HLO in 169.35613083839417 seconds\n", 395 | "INFO:Neuron:Updating the hlo module with optimized layout\n", 396 | "INFO:Neuron:Done optimizing weight layout for all HLOs in 0.3216278553009033 seconds\n", 397 | "INFO:Neuron:Starting compilation for all HLOs\n", 398 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py:245: SyntaxWarning: str format compiler_flags is discouraged as its handling involves repeated joining and splitting, which can easily make mistakes if something is quoted or escaped. Use list[str] instead. 
Refer to documentation of the Python subprocess module for details.\n", 399 | " warnings.warn(SyntaxWarning(\n", 400 | "2025-06-02 13:39:05.000174: 7289 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/nxd_model/context_encoding_model/_tp0_bk0/model.MODULE_d4332219e6ee5f826cce+d43b5474.hlo_module.pb --output /tmp/nxd_model/context_encoding_model/_tp0_bk0/model.MODULE_d4332219e6ee5f826cce+d43b5474.neff --target=trn1 --auto-cast=none --model-type=transformer --tensorizer-options=--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2 --vectorize-strided-dma --lnc=1 -O1 --internal-hlo2tensorizer-options= --modular-flow-mac-threshold=10 --logfile=/tmp/nxd_model/context_encoding_model/_tp0_bk0/log-neuron-cc.txt --verbose=35\n", 401 | ".Completed run_backend_driver.\n", 402 | "\n", 403 | "Compiler status PASS\n", 404 | "INFO:Neuron:Finished Compilation for all HLOs in 9.435595512390137 seconds\n", 405 | "......Completed run_backend_driver.\n", 406 | "\n", 407 | "Compiler status PASS\n", 408 | "INFO:Neuron:Done preparing weight layout transformation\n", 409 | "INFO:Neuron:Finished building model in 307.08067560195923 seconds\n", 410 | "INFO:Neuron:SKIPPING pre-sharding the checkpoints. The checkpoints will be sharded during load time.\n", 411 | "Compiling and tracing time: 307.11146965399985 seconds\n", 412 | "\n", 413 | "Loading model to Neuron...\n", 414 | "INFO:Neuron:Sharding weights on load...\n", 415 | "INFO:Neuron:Sharding Weights for ranks: 0...7\n", 416 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:592] > initializing tensor model parallel with size 8\n", 417 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:593] > initializing pipeline model parallel with size 1\n", 418 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:594] > initializing context model parallel with size 1\n", 419 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:595] > initializing data parallel with size 1\n", 420 | "[2025-06-02 13:41:03.158: I neuronx_distributed/parallel_layers/parallel_state.py:596] > initializing world size to 8\n", 421 | "[2025-06-02 13:41:03.158: I neuronx_distributed/parallel_layers/parallel_state.py:343] [rank_0_pp-1_tp-1_dp-1_cp-1] Chosen Logic for replica groups ret_logic=, 'Ascending Ring PG Group')>\n", 422 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:632] [rank_0_pp-1_tp-1_dp-1_cp-1] tp_groups: replica_groups.tp_groups=[[0, 1, 2, 3, 4, 5, 6, 7]]\n", 423 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:633] [rank_0_pp-1_tp-1_dp-1_cp-1] dp_groups: replica_groups.dp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 424 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:634] [rank_0_pp-1_tp-1_dp-1_cp-1] pp_groups: replica_groups.pp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 425 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:635] [rank_0_pp-1_tp-1_dp-1_cp-1] cp_groups: replica_groups.cp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 426 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:636] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_model_groups: replica_groups.ep_model_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 427 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:637] 
[rank_0_pp-1_tp-1_dp-1_cp-1] ep_data_groups: replica_groups.ep_data_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 428 | "INFO:Neuron:Done Sharding weights in 3.519328597999902\n", 429 | "INFO:Neuron:Finished weights loading in 16.628388952000023 seconds\n", 430 | "INFO:Neuron:Warming up the model.\n", 431 | "2025-Jun-02 13:41:22.0009 7289:8468 [7] nccl_net_ofi_create_plugin:211 CCOM WARN NET/OFI Failed to initialize sendrecv protocol\n", 432 | "2025-Jun-02 13:41:22.0010 7289:8468 [7] nccl_net_ofi_create_plugin:334 CCOM WARN NET/OFI aws-ofi-nccl initialization failed\n", 433 | "2025-Jun-02 13:41:22.0011 7289:8468 [7] nccl_net_ofi_init:155 CCOM WARN NET/OFI Initializing plugin failed\n", 434 | "2025-Jun-02 13:41:22.0012 7289:8468 [7] net_plugin.cc:94 CCOM WARN OFI plugin initNet() failed is EFA enabled?\n", 435 | "INFO:Neuron:Warmup completed in 0.33977651596069336 seconds.\n", 436 | "Total model loading time: 19.222302051000042 seconds\n", 437 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:650: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `1` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`.\n", 438 | " warnings.warn(\n", 439 | "\n", 440 | "Checking accuracy by logit matching\n", 441 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/utils/accuracy.py:363: UserWarning: input_len + num_tokens_to_check exceeds max_context_length. If output divergences at an index greater than max_context_length, a ValueError will occur because the next input len exceeds max_context_length. To avoid this, set num_tokens_to_check to a value of max_context_length - input_len or less.\n", 442 | " warnings.warn(\n", 443 | "Loading checkpoint shards: 100%|████████████████| 14/14 [00:08<00:00, 1.58it/s]\n", 444 | "From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.\n", 445 | "Expected Output: [\", that is the question. Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune\"] tensor([[ 11, 429, 374, 279, 3405, 13, 13139, 364, 83, 285,\n", 446 | " 13049, 1536, 304, 279, 3971, 311, 7676, 279, 1739, 819,\n", 447 | " 323, 36957, 315, 54488, 32315]])\n", 448 | "Expected Logits Shape: torch.Size([25, 1, 152064])\n", 449 | "Actual Output: [\", that is the question. Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune\"] tensor([[ 11, 429, 374, 279, 3405, 13, 13139, 364, 83, 285,\n", 450 | " 13049, 1536, 304, 279, 3971, 311, 7676, 279, 1739, 819,\n", 451 | " 323, 36957, 315, 54488, 32315]])\n", 452 | "Actual Logits Shape: torch.Size([25, 1, 152064])\n", 453 | "Passed logits validation!\n", 454 | "\n", 455 | "Generating outputs...\n", 456 | "Prompts: ['To be, or not to be']\n", 457 | "Generated outputs:\n", 458 | "Output 0: To be, or not to be, that is the question. 
Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune\n", 459 | "Starting end-to-end benchmark with 20\n", 460 | "Benchmark completed and its result is as following\n", 461 | "{\n", 462 | " \"e2e_model\": {\n", 463 | " \"latency_ms_p50\": 569.0377950668335,\n", 464 | " \"latency_ms_p90\": 570.0641632080078,\n", 465 | " \"latency_ms_p95\": 570.2431917190552,\n", 466 | " \"latency_ms_p99\": 570.8965921401978,\n", 467 | " \"latency_ms_p100\": 571.0599422454834,\n", 468 | " \"latency_ms_avg\": 569.459593296051,\n", 469 | " \"throughput\": 56.19362703995017\n", 470 | " },\n", 471 | " \"context_encoding_model\": {\n", 472 | " \"latency_ms_p50\": 41.747450828552246,\n", 473 | " \"latency_ms_p90\": 42.02606678009033,\n", 474 | " \"latency_ms_p95\": 42.056477069854736,\n", 475 | " \"latency_ms_p99\": 42.05883264541626,\n", 476 | " \"latency_ms_p100\": 42.05942153930664,\n", 477 | " \"latency_ms_avg\": 41.80266857147217,\n", 478 | " \"throughput\": 382.75068426897144\n", 479 | " },\n", 480 | " \"token_generation_model\": {\n", 481 | " \"latency_ms_p50\": 33.631086349487305,\n", 482 | " \"latency_ms_p90\": 33.74745845794678,\n", 483 | " \"latency_ms_p95\": 33.88720750808716,\n", 484 | " \"latency_ms_p99\": 34.08886194229126,\n", 485 | " \"latency_ms_p100\": 34.223079681396484,\n", 486 | " \"latency_ms_avg\": 33.66035064061483,\n", 487 | " \"throughput\": 31.68911334451813\n", 488 | " }\n", 489 | "}\n", 490 | "Completed saving result to benchmark_report.json\n" 491 | ] 492 | } 493 | ], 494 | "source": [ 495 | "!inference_demo \\\n", 496 | " --model-type qwen2 \\\n", 497 | " --task-type causal-lm \\\n", 498 | " run \\\n", 499 | " --model-path /home/ubuntu/model_hf_qwen/qwen2 \\\n", 500 | " --compiled-model-path /home/ubuntu/traced_model_qwen/qwen2/logit \\\n", 501 | " --torch-dtype bfloat16 \\\n", 502 | " --tp-degree 8 \\\n", 503 | " --batch-size 1 \\\n", 504 | " --max-context-length 16 \\\n", 505 | " --seq-len 32 \\\n", 506 | " --top-k 1 \\\n", 507 | " --pad-token-id 151645 \\\n", 508 | " --prompt \"To be, or not to be\" \\\n", 509 | " --check-accuracy-mode logit-matching \\\n", 510 | " --benchmark" 511 | ] 512 | } 513 | ], 514 | "metadata": { 515 | "kernelspec": { 516 | "display_name": "aws_neuronx_venv_pytorch_2_6_nxd_inference", 517 | "language": "python", 518 | "name": "python3" 519 | }, 520 | "language_info": { 521 | "codemirror_mode": { 522 | "name": "ipython", 523 | "version": 3 524 | }, 525 | "file_extension": ".py", 526 | "mimetype": "text/x-python", 527 | "name": "python", 528 | "nbconvert_exporter": "python", 529 | "pygments_lexer": "ipython3", 530 | "version": "3.10.12" 531 | } 532 | }, 533 | "nbformat": 4, 534 | "nbformat_minor": 2 535 | } 536 | -------------------------------------------------------------------------------- /doc/README.md: -------------------------------------------------------------------------------- 1 | # Build On Trainium Resources 2 | **Purpose:** 3 | 4 | Collection of resources (documentation, examples, tutorials and workshops) to help onboard new students and researchers. This set of resources will need to updated and maintained as new resources become available. 5 | 6 | # Resources 7 | 8 | This section contains links to various documentation sources and is a helpful index when working on Neuron. It is organized into several sections based on workload and relevance. 
9 | 10 | ## Getting Started with Neuron 11 | 12 | |Title |Description |Link | 13 |--- |--- |--- | 14 | |Getting Started with AWS |Getting started resource for AWS, generally, including AWS environment provisioning, budget alarms, CLI, instance setup and best practices for working in an AWS environment |[BoT Getting Started on AWS](https://github.com/scttfrdmn/aws-101-for-tranium) | 15 | |Neuron Documentation |The official Neuron product documentation. This contains details on our software libraries and hardware. |[Neuron Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html) | 16 | |Inf2 Instance Details |Helpful overview links for the Inferentia2 Instance and associated accelerators |
  • [AWS Landing Page](https://aws.amazon.com/ai/machine-learning/inferentia/)
  • [Instance Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inf2-arch.html#aws-inf2-arch)
  • [Chip Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inferentia2.html#inferentia2-arch)
  • [Core Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html#neuroncores-v2-arch)
 | 17 | |Trn1 Instance Details |Similar overview links for Trn1 instances and accelerators |
  • [AWS Landing Page](https://aws.amazon.com/ai/machine-learning/trainium/)
  • [Instance Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trn1-arch.html#aws-trn1-arch)
  • [Chip Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium.html#trainium-arch)
  • [Core Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html#neuroncores-v2-arch)
 | 18 | |Trn2 Instance Details |Similar overview links for Trn2 instances and accelerators |
  • [YouTube Launch Video](https://www.youtube.com/watch?v=Bteba8KLeGc)
  • [Instance Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trn2-arch.html#aws-trn2-arch)
  • [Chip Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium2.html#trainium2-arch)
  • [Core Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v3.html#neuroncores-v3-arch)
 | 19 | |Software Overview - General |Overview video of the Trainium software stack |[Video](https://www.youtube.com/watch?v=vaqj8XQfqwM&t=806s) | 20 | |Software Overview - Framework |Application frameworks for developing on Neuron: Torch-NeuronX for small-model inference and training, NxD for distributed modeling primitives, NxD-I (a higher-level abstraction library for inference), and NxD-T (a corresponding abstraction for training). |
  • Torch-NeuronX ([Training](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide.html#pytorch-neuronx-programming-guide), [Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/inference/trace-vs-xla-lazytensor.html))
  • [NxD](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/developer-guide.html)
  • [NxD-T](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/overview.html#nxd-training-overview)
  • [NxD-I](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-overview.html#nxdi-overview)
| 21 | |Software Overview - ML Libraries |ML libraries which offer another interface for deploying to trn/inf. Optimum-Neuron provides an interface between transformers and AWS Accelerators. AXLearn is a training library built on top of JAX and XLA. |[Optimum Neuron](https://huggingface.co/docs/optimum-neuron/index) [AXLearn](https://github.com/apple/axlearn) | 22 | |Environment Setup |A set of resources on provisioning instances and setting up development environments with the appropriate Neuron Software. |
  • [Instance Guide](https://repost.aws/articles/ARTxLi0wndTwquyl7frQYuKg)
  • [Remote Development Guide](https://repost.aws/articles/ARmgDHboGkRKmaEyfBzyVP4w)
  • [AMIs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html)
  • [Containers](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/index.html)
  • [Manual Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.html#setup-torch-neuronx-ubuntu22)
| 23 | |Release Versions |Index of the latest release versions and their semantic version information. |
  • [Latest Release Version](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#latest-neuron-release)
  • [Component Package Versions](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/releasecontent.html#latest-neuron-release-artifacts)
| 24 | 25 | ## Training Resources 26 | 27 | |Title |Description |Link | 28 | |--- |--- |--- | 29 | |Torch-NeuronX Docs |Torch-NeuronX docs on the XLA flow, and constructing a simple training loop on Trainium/Inferentia. |[Torch-NeuronX Training Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide.html#pytorch-neuronx-programming-guide) | 30 | |NxD Docs |Details on NxD, as well as the Distributed Layer Primitives (Tensor Parallelism, Pipeline Parallelism, etc.) |[NxD Developer Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/developer-guide-training.html#neuronx-distributed-developer-guide-training) | 31 | |NxD Docs + PyTorch Lightning |PyTorch Lightning Docs for NxD Training |[PTL Developer Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/ptl_developer_guide.html#ptl-developer-guide) | 32 | |NxD-T Developer Guide |NxD-Training, A higher level abstraction library on NxD for training specific workloads. |[NxD Training Developer Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/overview.html#nxd-training-overview) | 33 | |PreTraining |Pre-Training samples within various different libraries above |
  • [Torch-NeuronX](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html#neuronx-mlp-training-tutorial)
  • [NXD](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/training_llama_tp_zero1.html#llama2-7b-tp-zero1-tutorial)
  • [NxD-T](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_pretraining.html#hf-llama3-8b-pretraining)
  • [Optimum Neuron](https://huggingface.co/docs/optimum-neuron/training_tutorials/pretraining_hyperpod_llm)
| 34 | |LoRA Fine Tuning |LoRA Samples within the various libraries for Neuron |
  • [NxD-T](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_SFT_LORA.html#hf-llama3-8b-sft-lora)
  • [Optimum Neuron](https://huggingface.co/docs/optimum-neuron/training_tutorials/sft_lora_finetune_llm)
| 35 | |Preference Alignment |Preference Alignment Samples within the various libraries for Neuron |[NxD-T](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_DPO_ORPO.html#hf-llama3-8b-dpo-orpo) | 36 | |Awsome Distributed Training |Reference Distributed Training Examples on AWS |[Awsome-distributed-training](https://github.com/aws-samples/awsome-distributed-training) | 37 | 38 | ## Inference Resources 39 | 40 | |Title |Description |Link | 41 | |--- |--- |--- | 42 | |Torch-NeuronX Docs |Torch-NeuronX docs on the XLA flow, and tracing models for Inference on a single core. Samples of various common models as well. |[Torch-NeuronX Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-trace.html#torch-neuronx-trace-api) 43 | [Samples](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx/inference) | 44 | |NxD-I Developer Guide |NxD-Inference, A higher level abstraction library on NxD for inference specific workloads. |[NxD-I Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/index.html) | 45 | |Deployment vLLM |Guide for vLLM development with NxDI |[vLLM Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html) | 46 | |TGI |Guide on how to use HuggingFace Text Generation Inference (TGI) with Neuron |[TGI Docs](https://huggingface.co/docs/optimum-neuron/en/guides/neuronx_tgi) | 47 | 48 | ## Kernel Resources 49 | 50 | |Title |Description |Link | 51 | |--- |--- |--- | 52 | |NKI Docs |General NKI docs |[NKI Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html) | 53 | |Getting Started With NKI |Getting started writing NKI Kernels |[Getting Started With NKI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/getting_started.html#nki-getting-started) | 54 | |Performant Kernels with NKI |Understanding NKI kernel performance |[Performant Kernels with NKI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/nki_arch_guides.html#nki-arch-guides) | 55 | |NKI - Sample Kernels |Sample Kernel Repository with reference implementation |[NKI - Sample Kernels](https://github.com/aws-neuron/nki-samples/tree/main) | 56 | 57 | ## Tools Resources 58 | 59 | |Title |Description |Link | 60 | |--- |--- |--- | 61 | |Profiler |Neuron Profiler User Guide |[Profiler Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-profiler-2-0-beta-user-guide.html) | 62 | |Monitoring Tools and CLI |Monitoring and CLI tools for working with Neuron Hardware. |[Monitoring Tools and CLI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) | 63 | 64 | # Learning Paths 65 | 66 | Learning Paths are a list or organized exercises 67 | 68 | ## Training 69 | 70 | |Title |Description |Link |Minimum Instance Required | 71 | |--- |--- |--- |--- | 72 | |Setup an Instance/Developer Environment |This section contains resources to provision a developer Environment. This is a great starting place if you need a clean environment for development, or for starting any of the following exercises. |
  • [Instance Setup](https://repost.aws/articles/ARTxLi0wndTwquyl7frQYuKg)
  • [DLAMIs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html)
|trn1.2xlarge | 73 | |Construct a simple Training Loop with torch-neuronx |This is a sample of how to construct a training loop using torch-neuronx. Relevant for getting started with XLA flows, as well as models which require a single core/DP. |[MLP Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html#neuronx-mlp-training-tutorial) |trn1.2xlarge | 74 | |Implement Tensor Parallelism with NeuronX Distributed |Implement Tensor Parallel for a model to shard training across accelerators. |[BERT Pretraining Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/training.html#tp-training-tutorial) |trn1.32xlarge | 75 | |Pre-training Llama with TP, PP and ZeRO-1 |Train a model using multiple forms of parallelism (Tensor Parallelism, Pipeline Parallelism, and ZeRO-1). This uses the NxD Core Library and should give a good view of the parallel primatives. |[Llama Pretraining Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/training_llama_tp_zero1.html) |4x trn1.32xlarge cluster | 76 | |LoRA Fine Tuning with Optimum Neuron |Fine-Tune a model with LoRA on Optimum Neuron. Optimum Neuron is a library developed by HF and allows for simple modifications to transformers code to port to Neuron. |[Qwen LoRA Optimum Neuron](https://huggingface.co/docs/optimum-neuron/training_tutorials/qwen3-fine-tuning) |trn1.32xlarge | 77 | |LoRA Fine-Tuning with NxDT |LoRA based Fine-tune a model using NxD-T, our higher level training library built on top of NxD core. |[LoRA NxDT Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_SFT_LORA.html#hf-llama3-8b-sft-lora) |trn1.32xlarge | 78 | |DPO/ORPO Fine-Tuning with NxDT |Preference Alignment for a model using NxD-T, our higher level training library built on top of NxD core. |[DPO/ORPO Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_DPO_ORPO.html) |trn1.32xlarge | 79 | 80 | 81 | 82 | ## Inference Path 83 | 84 | |Title |Description |Link |Minimum Instance Required | 85 | |--- |--- |--- |--- | 86 | |Setup an Instance/Developer Environment |This section contains resources to provision a developer Environment. This is a great starting place if you need a clean environment for development, or for starting any of the following exercises. |
  • [Instance Setup](https://repost.aws/articles/ARTxLi0wndTwquyl7frQYuKg)
  • [DLAMIs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html)
|trn1.2xlarge | 87 | |Trace Models with Torch-NeuronX |Trace small models without model parallelism for inference with torch-neuronx. |[Torch-NeuronX Tutorials](https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/README.md#inference) |trn1.2xlarge | 88 | |Deploy Various Models with Optimum Neuron |Optimum Neuron allows for popular models in diffusers and transformers to easily be deployed to Neuron devices. |[Optimum Neuron Tutorials](https://huggingface.co/docs/optimum-neuron/inference_tutorials/notebooks) |trn1.32xlarge | 89 | |Deploy LLM with NxD |NxD is our library with model sharding primitives. This guide serves as a good jumping off point for common LLMs |[NxD-I Production Models](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/model-reference.html) |trn1.32xlarge | 90 | |vLLM Integration |This guide walks through how to run models with vLLM on Neuron devices. This uses the previously mentioned NxDI back-end for the model deployments. |[vLLM User Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html) |trn1.32xlarge | 91 | |Deploy a DiT with NxD |This guide walks through a non-LLM model architecture to be sharded and deployed on Neuron. In this case it is a Diffusion Transformer architecture for image generation |[PixArt Sigma on Neuron](https://aws.amazon.com/blogs/machine-learning/cost-effective-ai-image-generation-with-pixart-sigma-inference-on-aws-trainium-and-aws-inferentia/) |trn1.32xlarge | 92 | |Onboard a new Model to NxD-I |This guide walks through how to onboard a new model to NxD |[Model Onboarding Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/onboarding-models.html) |trn1.32xlarge | 93 | |Explore Additional features of NxD-I |Here are a few additional references for NxD-I features that may be relevant for your specific use case (Multi-LoRA, Quantization, Spec. decode) |
  • [Quantization](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
  • [Spec. Decode](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/trn2-llama3.3-70b-tutorial.html#nxdi-trn2-llama3-3-70b-tutorial)
  • [Multi-LoRA](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/trn2-llama3.1-8b-multi-lora-tutorial.html#nxdi-trn2-llama3-1-8b-multi-lora-tutorial)
|trn1.32xlarge | 94 | 95 | ## Kernel/Compiler Path 96 | 97 | |Title |Description |Link |Minimum Instance Required | 98 | |--- |--- |--- |--- | 99 | |Setup an Instance/Developer Environment |This section contains resources to provision a developer Environment. This is a great starting place if you need a clean environment for development, or for starting any of the following exercises. |
  • [Instance Setup](https://repost.aws/articles/ARTxLi0wndTwquyl7frQYuKg)
  • [DLAMIs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html)
|trn1.2xlarge | 100 | |Writing Functional Kernels |This Getting Started Guide will demonstrate how to write a Hello World, element-wise tensor add kernel. This will give you a good foundation for reading and understanding the other kernels. |[Getting Started with NKI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/getting_started.html#nki-getting-started) |trn1.2xlarge | 101 | |NKI workshop |This workshop walks through how to build, profile and integrate a kernel into PyTorch modelling. |[NKI Workshop](https://github.com/aws-samples/ml-specialized-hardware/tree/main/workshops/03_NKIWorkshop) |trn1.2xlarge | 102 | |Walkthrough NKI Tutorials |These tutorials walkthrough popular kernels and the associated optimizations applied. This is a good set of kernels to show how to iteratively write and optimize kernels. |[NKI Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/tutorials.html) |trn1.2xlarge | 103 | |Review NKI Samples |This repository contains the implementations of optimized reference kernels, used within our serving libraries and implementations. |[NKI Samples](https://github.com/aws-neuron/nki-samples/) |trn1.2xlarge | 104 | |Profiling NKI Kernels |This guide walks through how to profile kernels and use the Neuron Profiler |[Profiling NKI Kernels](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/neuron_profile_for_nki.html#neuron-profile-for-nki) |trn1.2xlarge | 105 | 106 | # Appendix 107 | 108 | ## Other Resources 109 | 110 | |Title |Description |Link | 111 | |--- |--- |--- | 112 | |Re:Invent 2024 Recap |REcap Post from Re:Invent, which includes links to workshops and sessions on Neuron |[RePost Article](https://repost.aws/articles/ARuhbPQliOSqKn74zJpGmMYQ) | 113 | |AI on EKS |Reference implementation for AI workloads on EKS including hosting on Trainium |[AI on EKS](https://github.com/awslabs/ai-on-eks) | 114 | -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/Finetune-TinyLlama-1.1B.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "37be34fc-0fa9-4811-865c-a3fdc38d38e8", 6 | "metadata": {}, 7 | "source": [ 8 | "# Fine-tune TinyLlama-1.1B for text-to-SQL generation\n", 9 | "\n", 10 | "## Introduction\n", 11 | "\n", 12 | "In this workshop module, you will learn how to fine-tune a Llama-based LLM ([TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)) using causal language modelling so that the model learns how to generate SQL queries for text-based instructions. Your fine-tuning job will be launched using SageMaker Training which provides a serverless training environment where you do not need to manage the underlying infrastructure. 
You will learn how to configure a PyTorch training job using [SageMaker's PyTorch estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html), and how to leverage the [Hugging Face Optimum Neuron](https://github.com/huggingface/optimum-neuron) package to easily run the PyTorch training job with AWS Trainium accelerators via an [AWS EC2 trn1.2xlarge instance](https://aws.amazon.com/ec2/instance-types/trn1/).\n", 13 | "\n", 14 | "For this module, you will be using the [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset which consists of thousands of examples of SQL schemas, questions about the schemas, and SQL queries intended to answer the questions.\n", 15 | "\n", 16 | "*Dataset example 1:*\n", 17 | "* *SQL schema/context:* `CREATE TABLE management (department_id VARCHAR); CREATE TABLE department (department_id VARCHAR)`\n", 18 | "* *Question:* `How many departments are led by heads who are not mentioned?`\n", 19 | "* *SQL query/answer:* `SELECT COUNT(*) FROM department WHERE NOT department_id IN (SELECT department_id FROM management)`\n", 20 | "\n", 21 | "*Dataset example 2:*\n", 22 | "* *SQL schema/context:* `CREATE TABLE courses (course_name VARCHAR, course_id VARCHAR); CREATE TABLE student_course_registrations (student_id VARCHAR, course_id VARCHAR)`\n", 23 | "* *Question:* `What are the ids of all students for courses and what are the names of those courses?`\n", 24 | "* *SQL query/answer:* `SELECT T1.student_id, T2.course_name FROM student_course_registrations AS T1 JOIN courses AS T2 ON T1.course_id = T2.course_id`\n", 25 | "\n", 26 | "By fine-tuning the model over several thousand of these text-to-SQL examples, the model will then learn how to generate an appropriate SQL query when presented with a SQL context and a free-form question.\n", 27 | "\n", 28 | "This text-to-SQL use case was selected so you can successfully fine-tune your model in a reasonably short amount of time (~20 minutes) which is appropriate for this 1hr workshop. Although this is a relatively simple use case, please keep in mind that the same techniques and components used in this module can also be applied to fine-tune LLMs for more advanced use cases such as writing code, summarizing documents, creating blog posts - the possibilities are endless!" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "866074ee-c300-4793-8e63-adbcfc314ad8", 34 | "metadata": { 35 | "tags": [] 36 | }, 37 | "source": [ 38 | "## Prerequisites\n", 39 | "\n", 40 | "This notebook uses the SageMaker Python SDK to prepare, launch, and monitor the progress of a PyTorch-based training job. Before we get started, it is important to upgrade the SageMaker SDK to ensure that you are using the latest version. Run the next two cells to upgrade the SageMaker SDK and set up your session." 
41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "id": "3264aae2-1f18-4b59-a92c-2f169903c202", 47 | "metadata": { 48 | "tags": [] 49 | }, 50 | "outputs": [ 51 | { 52 | "ename": "", 53 | "evalue": "", 54 | "output_type": "error", 55 | "traceback": [ 56 | "\u001b[1;31mRunning cells with 'Python 3.11.12' requires the ipykernel package.\n", 57 | "\u001b[1;31mCreate a Python Environment with the required packages.\n", 58 | "\u001b[1;31mOr install 'ipykernel' using the command: '/opt/homebrew/bin/python3.11 -m pip install ipykernel -U --user --force-reinstall'" 59 | ] 60 | } 61 | ], 62 | "source": [ 63 | "# Upgrade SageMaker SDK to the latest version\n", 64 | "%pip install -U sagemaker awscli huggingface_hub ipywidgets -q 2>&1 | grep -v \"warnings/venv\"\n", 65 | "# Definitely restart your kernel after this cell" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "id": "9b5ed574-6db5-471b-8515-c0f6189e653e", 72 | "metadata": { 73 | "tags": [] 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "import logging\n", 78 | "sagemaker_config_logger = logging.getLogger(\"sagemaker.config\")\n", 79 | "sagemaker_config_logger.setLevel(logging.WARNING)\n", 80 | "\n", 81 | "# Import SageMaker SDK, setup our session\n", 82 | "from sagemaker import get_execution_role, Session\n", 83 | "from sagemaker.pytorch import PyTorch\n", 84 | "import boto3\n", 85 | "\n", 86 | "region_name=\"us-east-2\" #this is hard coded to a specific region because of Workshop quotas. You could use sess.boto_region_name\n", 87 | "sess = Session(boto_session=boto3.Session(region_name=region_name))\n", 88 | "default_bucket = sess.default_bucket()\n" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "id": "2ce630d1", 94 | "metadata": {}, 95 | "source": [ 96 | "This next command just configures the EC2 instance (in us-west-2) to have a default region of us-east-2. This is specific to the environment in AWS Workshop Studio." 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "id": "5542b3d1", 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "!aws configure set region us-east-2" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "id": "ed3a1f57", 112 | "metadata": {}, 113 | "source": [ 114 | "## Log into Hugging Face\n", 115 | "\n", 116 | "The following step is recommended but optional. If you can log in with your Hugging Face token, it will let you avoid any rate limits for unauthenticated requests. Even though none of the models or datasets we are using require special permission, if you don't log in your training may fail because of too many unauthenticated requests. 
" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "id": "5f142253", 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "# If the cell below stays empty, RESTART YOUR KERNEL if you didn't and run the cells above again\n", 127 | "# If you can't login in, you can proceed to the next cell.\n", 128 | "from huggingface_hub import notebook_login\n", 129 | "\n", 130 | "# Uncheck \"Add token as git credential\" or just ignore the error message about it not being added.\n", 131 | "notebook_login()" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "id": "4193108b-25fb-4d3e-85db-c66b8c04c251", 137 | "metadata": {}, 138 | "source": [ 139 | "## Specify the Optimum Neuron deep learning container (DLC) image\n", 140 | "\n", 141 | "The SageMaker Training service uses containers to execute your training script, allowing you to fully customize your training script environment and any required dependencies. For this workshop, you will use a recent Pytorch Training deep learning container (DLC) image which is an AWS-maintained image containing the Neuron SDK and PyTorch. The Optimum-Neuron library is installed with the requirements.txt file in the assets directory." 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "id": "247ad886-6977-4295-947b-86d4892b48bd", 148 | "metadata": { 149 | "tags": [] 150 | }, 151 | "outputs": [ 152 | { 153 | "name": "stdout", 154 | "output_type": "stream", 155 | "text": [ 156 | "763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training-neuronx:2.7.0-neuronx-py310-sdk2.24.1-ubuntu22.04\n" 157 | ] 158 | } 159 | ], 160 | "source": [ 161 | "# Specify the Neuron DLC that we will use for training\n", 162 | "# For now, we'll use the standard Neuron DLC and install Optimum Neuron v0.0.27 at training time because we want to use a later SDK \n", 163 | "# You can see more about the images here: https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx\n", 164 | "\n", 165 | "training_image = f\"763104351884.dkr.ecr.{sess.boto_region_name}.amazonaws.com/pytorch-training-neuronx:2.7.0-neuronx-py310-sdk2.24.1-ubuntu22.04\"\n", 166 | "print(training_image)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "id": "8a8802bc-657a-419d-b86d-eb8af5eff90e", 172 | "metadata": { 173 | "tags": [] 174 | }, 175 | "source": [ 176 | "## Configure the PyTorch Estimator\n", 177 | "\n", 178 | "The SageMaker SDK includes a [PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html) class which you can use to define a PyTorch training job that will be executed in the SageMaker managed environment. \n", 179 | "\n", 180 | "In the following cell, you will create a PyTorch Estimator which will run the attached `finetune_llama.py` training script on an ml.trn1.2xlarge instance. The `finetune_llama.py` script is an Optimum Neuron training script that can be used for causal language modelling with AWS Trainium. The scripts will be downloaded as the instance is brought up, and the scripts will download the model and the datasets onto the SageMaker training instance.\n", 181 | "\n", 182 | "The PyTorch Estimator has many parameters that can be used to configure your training job. 
A few of the most important parameters include:\n", 183 | "\n", 184 | "- *entry_point*: refers to the name of the training script that will be executed as part of this training job\n", 185 | "- *source_dir*: the path to the local source code directory (relative to your notebook) that will be packaged up and included inside your training container\n", 186 | "- *instance_count*: defines how many EC2 instances to use for this training job\n", 187 | "- *instance_type*: determines which type of EC2 instance will be used for training\n", 188 | "- *image_uri*: defines which training DLC will be used to run the training job (see Neuron DLC, above)\n", 189 | "- *distribution*: determines which type of distribution to use for the training job - you will need 'torch_distributed' for this workshop\n", 190 | "- *environment*: provides a dictionary of environment variables which will be applied to your training environment\n", 191 | "- *hyperparameters*: provides a dictionary of command-line arguments to pass to your training script, ex: finetune_llama.py\n", 192 | "\n", 193 | "In the `hyperparameters` section, you can see the specific command-line arguments that are used to control the behavior of the `finetune_llama.py` training script. Notably:\n", 194 | "- *model_id*: specifies which model you will be fine-tuning, in this case a recent checkpoint from the TinyLlama-1.1B project\n", 195 | "- *tokenizer_id*: specifies which tokenizer you will used to tokenize the dataset examples during training\n", 196 | "- *output_dir*: directory in which the fine-tuned model will be saved. Here we use the SageMaker-specific `/opt/ml/model` directory. At the end of the training job, SageMaker automatically copies the contents of this directory to the output S3 bucket\n", 197 | "- *tensor_parallel_size*: the tensor parallel degree for which we want to use for training. In this case we use '2' to shard the model across the 2 NeuronCores available in the trn1.2xlarge instance\n", 198 | "- *bf16*: request BFloat16 training\n", 199 | "- *per_device_train_batch_size*: the microbatch size to be used for fine-tuning\n", 200 | "- *gradient_accumulation_steps*: how many steps for which gradients will be accumulated between updates\n", 201 | "- *max_steps*: the maximum number of steps of fine-tuning that we want to perform\n", 202 | "- *lora_r*, *lora_alpha*, *lora_dropout*: the LoRA rank, alpha, and dropout values to use during fine-tuning\n", 203 | "\n", 204 | "The below estimator has been pre-configured for you, so you do not need to make any changes." 
205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "id": "7dd1c2b2", 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "# Note that the hyperparameters are command-line args passed to the finetune_llama.py script to control its behavior\n", 215 | "# Create hyperparameters dictionary\n", 216 | "hyperparameters = {\n", 217 | " \"model_id\": \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n", 218 | " \"tokenizer_id\": \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n", 219 | " \"skip_cache_push\": True,\n", 220 | " \"output_dir\": \"/opt/ml/model\",\n", 221 | " \"tensor_parallel_size\": 2,\n", 222 | " \"bf16\": True,\n", 223 | " \"per_device_train_batch_size\": 2,\n", 224 | " \"gradient_accumulation_steps\": 1,\n", 225 | " \"gradient_checkpointing\": True,\n", 226 | " \"max_steps\": 1000,\n", 227 | " \"lora_r\": 16,\n", 228 | " \"lora_alpha\": 32,\n", 229 | " \"lora_dropout\": 0.05,\n", 230 | " \"logging_steps\": 10,\n", 231 | " \"learning_rate\": 5e-5,\n", 232 | " \"dataloader_drop_last\": True,\n", 233 | " \"disable_tqdm\": True,\n", 234 | "}\n", 235 | "\n", 236 | "# Set up environment variables\n", 237 | "from huggingface_hub import HfFolder\n", 238 | "\n", 239 | "environment = {\"FI_EFA_FORK_SAFE\": \"1\", \"WANDB_DISABLED\": \"true\"}\n", 240 | "token = HfFolder.get_token()\n", 241 | "if token is not None:\n", 242 | " environment[\"HF_TOKEN\"] = token\n", 243 | "\n", 244 | "# Set up the PyTorch estimator\n", 245 | "pt_estimator = PyTorch(\n", 246 | " entry_point=\"finetune_llama.py\",\n", 247 | " source_dir=\"./assets\",\n", 248 | " role=get_execution_role(),\n", 249 | " instance_count=1,\n", 250 | " instance_type=\"ml.trn1.2xlarge\",\n", 251 | " disable_profiler=True,\n", 252 | " output_path=f\"s3://{default_bucket}/neuron_events2025\",\n", 253 | " base_job_name=\"trn1-tinyllama\",\n", 254 | " sagemaker_session=sess,\n", 255 | " code_bucket=f\"s3://{default_bucket}/neuron_events2025_code\",\n", 256 | " checkpoint_s3_uri=f\"s3://{default_bucket}/neuron_events_output\",\n", 257 | " image_uri=training_image,\n", 258 | " distribution={\"torch_distributed\": {\"enabled\": True}},\n", 259 | " environment=environment,\n", 260 | " disable_output_compression=True,\n", 261 | " hyperparameters=hyperparameters\n", 262 | ")\n" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "id": "2278940b-f563-4582-9df0-bd56d9b5fd28", 268 | "metadata": {}, 269 | "source": [ 270 | "## Launch the training job\n", 271 | "\n", 272 | "Once the estimator has been created, you can then launch your training job by calling `.fit()` on the estimator:" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 9, 278 | "id": "b7829c64-0190-43c3-be1a-0ccce7d45248", 279 | "metadata": { 280 | "tags": [] 281 | }, 282 | "outputs": [ 283 | { 284 | "name": "stderr", 285 | "output_type": "stream", 286 | "text": [ 287 | "INFO:sagemaker:Creating training-job with name: trn1-tinyllama-2025-05-13-00-40-31-750\n" 288 | ] 289 | } 290 | ], 291 | "source": [ 292 | "# Call fit() on the estimator to initiate the training job\n", 293 | "pt_estimator.fit(wait=False, logs=False)" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "id": "b77434b2-94d7-4256-8d0b-d5d2ddb1d5ae", 299 | "metadata": {}, 300 | "source": [ 301 | "## Monitor the training job\n", 302 | "\n", 303 | "When the training job has been launched, the SageMaker Training service will then take care of:\n", 304 | "- launching and configuring the requested EC2 infrastructure for your training job\n", 305 
| "- launching the requested container image on each of the EC2 instances\n", 306 | "- copying your source code directory and running your training script within the container(s)\n", 307 | "- storing your trained model artifacts in Amazon Simple Storage Service (S3)\n", 308 | "- decommissioning the training infrastructure\n", 309 | "\n", 310 | "While the training job is running, the following cell will periodically check and output the job status. When you see 'Completed', you know that your training job is finished and you can proceed to the remainder of the notebook. The training job typically takes about 20 minutes to complete.\n", 311 | "\n", 312 | "If you are interested in viewing the output logs from your training job, you can view the logs by navigating to the AWS CloudWatch console, selecting `Logs -> Log Groups` in the left-hand menu, and then looking for your SageMaker training job in the list. **Note:** it will usually take 4-5 minutes before the infrastructure is running and the output logs begin to be populated in CloudWatch." 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 10, 318 | "id": "0c223037-2f8e-4eb0-9e4b-ff4dac6ede7a", 319 | "metadata": { 320 | "tags": [] 321 | }, 322 | "outputs": [ 323 | { 324 | "name": "stdout", 325 | "output_type": "stream", 326 | "text": [ 327 | "2025-05-13T00:40:37.718399 Training job status: InProgress!\n", 328 | "2025-05-13T00:41:07.827456 Training job status: InProgress!\n", 329 | "2025-05-13T00:41:37.941892 Training job status: InProgress!\n", 330 | "2025-05-13T00:42:08.055514 Training job status: InProgress!\n", 331 | "2025-05-13T00:42:38.170184 Training job status: InProgress!\n", 332 | "2025-05-13T00:43:08.285526 Training job status: InProgress!\n", 333 | "2025-05-13T00:43:38.401669 Training job status: InProgress!\n", 334 | "2025-05-13T00:44:08.517601 Training job status: InProgress!\n", 335 | "2025-05-13T00:44:38.607279 Training job status: InProgress!\n", 336 | "2025-05-13T00:45:08.901240 Training job status: InProgress!\n", 337 | "2025-05-13T00:45:39.029987 Training job status: InProgress!\n", 338 | "2025-05-13T00:46:09.148483 Training job status: InProgress!\n", 339 | "2025-05-13T00:46:39.262424 Training job status: InProgress!\n", 340 | "2025-05-13T00:47:09.378729 Training job status: InProgress!\n", 341 | "2025-05-13T00:47:39.477011 Training job status: InProgress!\n", 342 | "2025-05-13T00:48:09.589262 Training job status: InProgress!\n", 343 | "2025-05-13T00:48:39.715998 Training job status: InProgress!\n", 344 | "2025-05-13T00:49:09.833712 Training job status: InProgress!\n", 345 | "2025-05-13T00:49:40.132350 Training job status: InProgress!\n", 346 | "2025-05-13T00:50:10.259671 Training job status: InProgress!\n", 347 | "2025-05-13T00:50:40.376526 Training job status: InProgress!\n", 348 | "2025-05-13T00:51:10.492630 Training job status: InProgress!\n", 349 | "2025-05-13T00:51:40.612684 Training job status: InProgress!\n", 350 | "2025-05-13T00:52:10.735871 Training job status: InProgress!\n", 351 | "2025-05-13T00:52:40.856541 Training job status: InProgress!\n", 352 | "2025-05-13T00:53:10.978185 Training job status: InProgress!\n", 353 | "2025-05-13T00:53:41.102406 Training job status: InProgress!\n", 354 | "2025-05-13T00:54:11.391318 Training job status: InProgress!\n", 355 | "2025-05-13T00:54:41.506542 Training job status: InProgress!\n", 356 | "2025-05-13T00:55:11.619419 Training job status: InProgress!\n", 357 | "2025-05-13T00:55:41.736144 Training job status: InProgress!\n", 358 | 
"2025-05-13T00:56:11.850643 Training job status: InProgress!\n", 359 | "2025-05-13T00:56:41.965740 Training job status: InProgress!\n", 360 | "2025-05-13T00:57:12.082235 Training job status: InProgress!\n", 361 | "2025-05-13T00:57:42.193146 Training job status: InProgress!\n", 362 | "2025-05-13T00:58:12.309523 Training job status: InProgress!\n", 363 | "2025-05-13T00:58:42.596288 Training job status: InProgress!\n", 364 | "2025-05-13T00:59:12.715701 Training job status: InProgress!\n", 365 | "2025-05-13T00:59:42.835134 Training job status: InProgress!\n", 366 | "2025-05-13T01:00:12.952002 Training job status: InProgress!\n", 367 | "2025-05-13T01:00:43.070275 Training job status: InProgress!\n", 368 | "2025-05-13T01:01:13.187416 Training job status: InProgress!\n", 369 | "2025-05-13T01:01:43.291955 Training job status: InProgress!\n", 370 | "\n", 371 | "2025-05-13T01:02:13.412501 Training job status: Completed!\n" 372 | ] 373 | } 374 | ], 375 | "source": [ 376 | "# Periodically check job status until it shows 'Completed' (ETA ~20 minutes)\n", 377 | "# You can also monitor job status in the SageMaker console, and view the\n", 378 | "# SageMaker Training job logs in the CloudWatch console\n", 379 | "from time import sleep\n", 380 | "from datetime import datetime\n", 381 | "\n", 382 | "while (job_status := pt_estimator.jobs[-1].describe()['TrainingJobStatus']) not in ['Completed', 'Error', 'Failed']:\n", 383 | " print(f\"{datetime.now().isoformat()} Training job status: {job_status}!\")\n", 384 | " sleep(30)\n", 385 | "\n", 386 | "print(f\"\\n{datetime.now().isoformat()} Training job status: {job_status}!\")" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "id": "16c94343-b0c6-4903-82cc-c8ab2f88b26b", 392 | "metadata": {}, 393 | "source": [ 394 | "## Determine location of fine-tuned model artifacts\n", 395 | "\n", 396 | "Once the training job has completed, SageMaker will copy your fine-tuned model artifacts to a specified location in S3.\n", 397 | "\n", 398 | "In the following cell, you can see how to programmatically determine the location of your model artifacts:" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 11, 404 | "id": "213af977-8ed6-4081-af65-59c70db2dbfb", 405 | "metadata": { 406 | "tags": [] 407 | }, 408 | "outputs": [ 409 | { 410 | "name": "stdout", 411 | "output_type": "stream", 412 | "text": [ 413 | "Your fine-tuned model is available here:\n", 414 | "\n", 415 | "s3://this.output.should.be.replaced.with.a.real.s3.path.once.the.cell.is.executed/\n" 416 | ] 417 | } 418 | ], 419 | "source": [ 420 | "# Show where the fine-tuned model is stored - previous job must be 'Completed' before running this cell\n", 421 | "model_archive_path = pt_estimator.jobs[-1].describe()['ModelArtifacts']['S3ModelArtifacts']\n", 422 | "print(f\"Your fine-tuned model is available here:\\n\\n{model_archive_path}/\")" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "id": "b68f529f-a548-4fbd-b160-3cab5f52c488", 428 | "metadata": {}, 429 | "source": [ 430 | "
\n", 431 | "\n", 432 | "**Note:** Please copy the above S3 path, as it will be required in the subsequent workshop module.\n", 433 | "\n", 434 | "\n", 435 | "Lastly, run the following cell to list the model artifacts available in your S3 model_archive_path:" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 12, 441 | "id": "27ad8c7e-6a73-4f20-944f-ac12ef286a6f", 442 | "metadata": { 443 | "tags": [] 444 | }, 445 | "outputs": [ 446 | { 447 | "name": "stdout", 448 | "output_type": "stream", 449 | "text": [ 450 | "2025-05-13 01:01:39 714 config.json\n", 451 | "2025-05-13 01:01:48 124 generation_config.json\n", 452 | "2025-05-13 01:01:40 4400216536 model.safetensors\n", 453 | "2025-05-13 01:01:47 551 special_tokens_map.json\n", 454 | "2025-05-13 01:01:47 1842795 tokenizer.json\n", 455 | "2025-05-13 01:01:39 499723 tokenizer.model\n", 456 | "2025-05-13 01:01:48 1368 tokenizer_config.json\n" 457 | ] 458 | } 459 | ], 460 | "source": [ 461 | "# View the contents of the fine-tuned model path in S3\n", 462 | "!aws s3 ls {model_archive_path}/merged_model/" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "id": "fca9ffa7-a694-48c0-acde-cd468d18a448", 468 | "metadata": {}, 469 | "source": [ 470 | "Congratulations on completing the LLM fine-tuning module!\n", 471 | "\n", 472 | "In the next notebook, you will learn how to deploy your fine-tuned model in a SageMaker hosted endpoint, and leverage AWS Inferentia accelerators to perform model inference. Have fun!" 473 | ] 474 | } 475 | ], 476 | "metadata": { 477 | "availableInstances": [ 478 | { 479 | "_defaultOrder": 0, 480 | "_isFastLaunch": true, 481 | "category": "General purpose", 482 | "gpuNum": 0, 483 | "hideHardwareSpecs": false, 484 | "memoryGiB": 4, 485 | "name": "ml.t3.medium", 486 | "vcpuNum": 2 487 | }, 488 | { 489 | "_defaultOrder": 1, 490 | "_isFastLaunch": false, 491 | "category": "General purpose", 492 | "gpuNum": 0, 493 | "hideHardwareSpecs": false, 494 | "memoryGiB": 8, 495 | "name": "ml.t3.large", 496 | "vcpuNum": 2 497 | }, 498 | { 499 | "_defaultOrder": 2, 500 | "_isFastLaunch": false, 501 | "category": "General purpose", 502 | "gpuNum": 0, 503 | "hideHardwareSpecs": false, 504 | "memoryGiB": 16, 505 | "name": "ml.t3.xlarge", 506 | "vcpuNum": 4 507 | }, 508 | { 509 | "_defaultOrder": 3, 510 | "_isFastLaunch": false, 511 | "category": "General purpose", 512 | "gpuNum": 0, 513 | "hideHardwareSpecs": false, 514 | "memoryGiB": 32, 515 | "name": "ml.t3.2xlarge", 516 | "vcpuNum": 8 517 | }, 518 | { 519 | "_defaultOrder": 4, 520 | "_isFastLaunch": true, 521 | "category": "General purpose", 522 | "gpuNum": 0, 523 | "hideHardwareSpecs": false, 524 | "memoryGiB": 8, 525 | "name": "ml.m5.large", 526 | "vcpuNum": 2 527 | }, 528 | { 529 | "_defaultOrder": 5, 530 | "_isFastLaunch": false, 531 | "category": "General purpose", 532 | "gpuNum": 0, 533 | "hideHardwareSpecs": false, 534 | "memoryGiB": 16, 535 | "name": "ml.m5.xlarge", 536 | "vcpuNum": 4 537 | }, 538 | { 539 | "_defaultOrder": 6, 540 | "_isFastLaunch": false, 541 | "category": "General purpose", 542 | "gpuNum": 0, 543 | "hideHardwareSpecs": false, 544 | "memoryGiB": 32, 545 | "name": "ml.m5.2xlarge", 546 | "vcpuNum": 8 547 | }, 548 | { 549 | "_defaultOrder": 7, 550 | "_isFastLaunch": false, 551 | "category": "General purpose", 552 | "gpuNum": 0, 553 | "hideHardwareSpecs": false, 554 | "memoryGiB": 64, 555 | "name": "ml.m5.4xlarge", 556 | "vcpuNum": 16 557 | }, 558 | { 559 | "_defaultOrder": 8, 560 | "_isFastLaunch": false, 561 | "category": "General 
purpose", 562 | "gpuNum": 0, 563 | "hideHardwareSpecs": false, 564 | "memoryGiB": 128, 565 | "name": "ml.m5.8xlarge", 566 | "vcpuNum": 32 567 | }, 568 | { 569 | "_defaultOrder": 9, 570 | "_isFastLaunch": false, 571 | "category": "General purpose", 572 | "gpuNum": 0, 573 | "hideHardwareSpecs": false, 574 | "memoryGiB": 192, 575 | "name": "ml.m5.12xlarge", 576 | "vcpuNum": 48 577 | }, 578 | { 579 | "_defaultOrder": 10, 580 | "_isFastLaunch": false, 581 | "category": "General purpose", 582 | "gpuNum": 0, 583 | "hideHardwareSpecs": false, 584 | "memoryGiB": 256, 585 | "name": "ml.m5.16xlarge", 586 | "vcpuNum": 64 587 | }, 588 | { 589 | "_defaultOrder": 11, 590 | "_isFastLaunch": false, 591 | "category": "General purpose", 592 | "gpuNum": 0, 593 | "hideHardwareSpecs": false, 594 | "memoryGiB": 384, 595 | "name": "ml.m5.24xlarge", 596 | "vcpuNum": 96 597 | }, 598 | { 599 | "_defaultOrder": 12, 600 | "_isFastLaunch": false, 601 | "category": "General purpose", 602 | "gpuNum": 0, 603 | "hideHardwareSpecs": false, 604 | "memoryGiB": 8, 605 | "name": "ml.m5d.large", 606 | "vcpuNum": 2 607 | }, 608 | { 609 | "_defaultOrder": 13, 610 | "_isFastLaunch": false, 611 | "category": "General purpose", 612 | "gpuNum": 0, 613 | "hideHardwareSpecs": false, 614 | "memoryGiB": 16, 615 | "name": "ml.m5d.xlarge", 616 | "vcpuNum": 4 617 | }, 618 | { 619 | "_defaultOrder": 14, 620 | "_isFastLaunch": false, 621 | "category": "General purpose", 622 | "gpuNum": 0, 623 | "hideHardwareSpecs": false, 624 | "memoryGiB": 32, 625 | "name": "ml.m5d.2xlarge", 626 | "vcpuNum": 8 627 | }, 628 | { 629 | "_defaultOrder": 15, 630 | "_isFastLaunch": false, 631 | "category": "General purpose", 632 | "gpuNum": 0, 633 | "hideHardwareSpecs": false, 634 | "memoryGiB": 64, 635 | "name": "ml.m5d.4xlarge", 636 | "vcpuNum": 16 637 | }, 638 | { 639 | "_defaultOrder": 16, 640 | "_isFastLaunch": false, 641 | "category": "General purpose", 642 | "gpuNum": 0, 643 | "hideHardwareSpecs": false, 644 | "memoryGiB": 128, 645 | "name": "ml.m5d.8xlarge", 646 | "vcpuNum": 32 647 | }, 648 | { 649 | "_defaultOrder": 17, 650 | "_isFastLaunch": false, 651 | "category": "General purpose", 652 | "gpuNum": 0, 653 | "hideHardwareSpecs": false, 654 | "memoryGiB": 192, 655 | "name": "ml.m5d.12xlarge", 656 | "vcpuNum": 48 657 | }, 658 | { 659 | "_defaultOrder": 18, 660 | "_isFastLaunch": false, 661 | "category": "General purpose", 662 | "gpuNum": 0, 663 | "hideHardwareSpecs": false, 664 | "memoryGiB": 256, 665 | "name": "ml.m5d.16xlarge", 666 | "vcpuNum": 64 667 | }, 668 | { 669 | "_defaultOrder": 19, 670 | "_isFastLaunch": false, 671 | "category": "General purpose", 672 | "gpuNum": 0, 673 | "hideHardwareSpecs": false, 674 | "memoryGiB": 384, 675 | "name": "ml.m5d.24xlarge", 676 | "vcpuNum": 96 677 | }, 678 | { 679 | "_defaultOrder": 20, 680 | "_isFastLaunch": false, 681 | "category": "General purpose", 682 | "gpuNum": 0, 683 | "hideHardwareSpecs": true, 684 | "memoryGiB": 0, 685 | "name": "ml.geospatial.interactive", 686 | "supportedImageNames": [ 687 | "sagemaker-geospatial-v1-0" 688 | ], 689 | "vcpuNum": 0 690 | }, 691 | { 692 | "_defaultOrder": 21, 693 | "_isFastLaunch": true, 694 | "category": "Compute optimized", 695 | "gpuNum": 0, 696 | "hideHardwareSpecs": false, 697 | "memoryGiB": 4, 698 | "name": "ml.c5.large", 699 | "vcpuNum": 2 700 | }, 701 | { 702 | "_defaultOrder": 22, 703 | "_isFastLaunch": false, 704 | "category": "Compute optimized", 705 | "gpuNum": 0, 706 | "hideHardwareSpecs": false, 707 | "memoryGiB": 8, 708 | "name": "ml.c5.xlarge", 709 | 
"vcpuNum": 4 710 | }, 711 | { 712 | "_defaultOrder": 23, 713 | "_isFastLaunch": false, 714 | "category": "Compute optimized", 715 | "gpuNum": 0, 716 | "hideHardwareSpecs": false, 717 | "memoryGiB": 16, 718 | "name": "ml.c5.2xlarge", 719 | "vcpuNum": 8 720 | }, 721 | { 722 | "_defaultOrder": 24, 723 | "_isFastLaunch": false, 724 | "category": "Compute optimized", 725 | "gpuNum": 0, 726 | "hideHardwareSpecs": false, 727 | "memoryGiB": 32, 728 | "name": "ml.c5.4xlarge", 729 | "vcpuNum": 16 730 | }, 731 | { 732 | "_defaultOrder": 25, 733 | "_isFastLaunch": false, 734 | "category": "Compute optimized", 735 | "gpuNum": 0, 736 | "hideHardwareSpecs": false, 737 | "memoryGiB": 72, 738 | "name": "ml.c5.9xlarge", 739 | "vcpuNum": 36 740 | }, 741 | { 742 | "_defaultOrder": 26, 743 | "_isFastLaunch": false, 744 | "category": "Compute optimized", 745 | "gpuNum": 0, 746 | "hideHardwareSpecs": false, 747 | "memoryGiB": 96, 748 | "name": "ml.c5.12xlarge", 749 | "vcpuNum": 48 750 | }, 751 | { 752 | "_defaultOrder": 27, 753 | "_isFastLaunch": false, 754 | "category": "Compute optimized", 755 | "gpuNum": 0, 756 | "hideHardwareSpecs": false, 757 | "memoryGiB": 144, 758 | "name": "ml.c5.18xlarge", 759 | "vcpuNum": 72 760 | }, 761 | { 762 | "_defaultOrder": 28, 763 | "_isFastLaunch": false, 764 | "category": "Compute optimized", 765 | "gpuNum": 0, 766 | "hideHardwareSpecs": false, 767 | "memoryGiB": 192, 768 | "name": "ml.c5.24xlarge", 769 | "vcpuNum": 96 770 | }, 771 | { 772 | "_defaultOrder": 29, 773 | "_isFastLaunch": true, 774 | "category": "Accelerated computing", 775 | "gpuNum": 1, 776 | "hideHardwareSpecs": false, 777 | "memoryGiB": 16, 778 | "name": "ml.g4dn.xlarge", 779 | "vcpuNum": 4 780 | }, 781 | { 782 | "_defaultOrder": 30, 783 | "_isFastLaunch": false, 784 | "category": "Accelerated computing", 785 | "gpuNum": 1, 786 | "hideHardwareSpecs": false, 787 | "memoryGiB": 32, 788 | "name": "ml.g4dn.2xlarge", 789 | "vcpuNum": 8 790 | }, 791 | { 792 | "_defaultOrder": 31, 793 | "_isFastLaunch": false, 794 | "category": "Accelerated computing", 795 | "gpuNum": 1, 796 | "hideHardwareSpecs": false, 797 | "memoryGiB": 64, 798 | "name": "ml.g4dn.4xlarge", 799 | "vcpuNum": 16 800 | }, 801 | { 802 | "_defaultOrder": 32, 803 | "_isFastLaunch": false, 804 | "category": "Accelerated computing", 805 | "gpuNum": 1, 806 | "hideHardwareSpecs": false, 807 | "memoryGiB": 128, 808 | "name": "ml.g4dn.8xlarge", 809 | "vcpuNum": 32 810 | }, 811 | { 812 | "_defaultOrder": 33, 813 | "_isFastLaunch": false, 814 | "category": "Accelerated computing", 815 | "gpuNum": 4, 816 | "hideHardwareSpecs": false, 817 | "memoryGiB": 192, 818 | "name": "ml.g4dn.12xlarge", 819 | "vcpuNum": 48 820 | }, 821 | { 822 | "_defaultOrder": 34, 823 | "_isFastLaunch": false, 824 | "category": "Accelerated computing", 825 | "gpuNum": 1, 826 | "hideHardwareSpecs": false, 827 | "memoryGiB": 256, 828 | "name": "ml.g4dn.16xlarge", 829 | "vcpuNum": 64 830 | }, 831 | { 832 | "_defaultOrder": 35, 833 | "_isFastLaunch": false, 834 | "category": "Accelerated computing", 835 | "gpuNum": 1, 836 | "hideHardwareSpecs": false, 837 | "memoryGiB": 61, 838 | "name": "ml.p3.2xlarge", 839 | "vcpuNum": 8 840 | }, 841 | { 842 | "_defaultOrder": 36, 843 | "_isFastLaunch": false, 844 | "category": "Accelerated computing", 845 | "gpuNum": 4, 846 | "hideHardwareSpecs": false, 847 | "memoryGiB": 244, 848 | "name": "ml.p3.8xlarge", 849 | "vcpuNum": 32 850 | }, 851 | { 852 | "_defaultOrder": 37, 853 | "_isFastLaunch": false, 854 | "category": "Accelerated computing", 855 | "gpuNum": 
8, 856 | "hideHardwareSpecs": false, 857 | "memoryGiB": 488, 858 | "name": "ml.p3.16xlarge", 859 | "vcpuNum": 64 860 | }, 861 | { 862 | "_defaultOrder": 38, 863 | "_isFastLaunch": false, 864 | "category": "Accelerated computing", 865 | "gpuNum": 8, 866 | "hideHardwareSpecs": false, 867 | "memoryGiB": 768, 868 | "name": "ml.p3dn.24xlarge", 869 | "vcpuNum": 96 870 | }, 871 | { 872 | "_defaultOrder": 39, 873 | "_isFastLaunch": false, 874 | "category": "Memory Optimized", 875 | "gpuNum": 0, 876 | "hideHardwareSpecs": false, 877 | "memoryGiB": 16, 878 | "name": "ml.r5.large", 879 | "vcpuNum": 2 880 | }, 881 | { 882 | "_defaultOrder": 40, 883 | "_isFastLaunch": false, 884 | "category": "Memory Optimized", 885 | "gpuNum": 0, 886 | "hideHardwareSpecs": false, 887 | "memoryGiB": 32, 888 | "name": "ml.r5.xlarge", 889 | "vcpuNum": 4 890 | }, 891 | { 892 | "_defaultOrder": 41, 893 | "_isFastLaunch": false, 894 | "category": "Memory Optimized", 895 | "gpuNum": 0, 896 | "hideHardwareSpecs": false, 897 | "memoryGiB": 64, 898 | "name": "ml.r5.2xlarge", 899 | "vcpuNum": 8 900 | }, 901 | { 902 | "_defaultOrder": 42, 903 | "_isFastLaunch": false, 904 | "category": "Memory Optimized", 905 | "gpuNum": 0, 906 | "hideHardwareSpecs": false, 907 | "memoryGiB": 128, 908 | "name": "ml.r5.4xlarge", 909 | "vcpuNum": 16 910 | }, 911 | { 912 | "_defaultOrder": 43, 913 | "_isFastLaunch": false, 914 | "category": "Memory Optimized", 915 | "gpuNum": 0, 916 | "hideHardwareSpecs": false, 917 | "memoryGiB": 256, 918 | "name": "ml.r5.8xlarge", 919 | "vcpuNum": 32 920 | }, 921 | { 922 | "_defaultOrder": 44, 923 | "_isFastLaunch": false, 924 | "category": "Memory Optimized", 925 | "gpuNum": 0, 926 | "hideHardwareSpecs": false, 927 | "memoryGiB": 384, 928 | "name": "ml.r5.12xlarge", 929 | "vcpuNum": 48 930 | }, 931 | { 932 | "_defaultOrder": 45, 933 | "_isFastLaunch": false, 934 | "category": "Memory Optimized", 935 | "gpuNum": 0, 936 | "hideHardwareSpecs": false, 937 | "memoryGiB": 512, 938 | "name": "ml.r5.16xlarge", 939 | "vcpuNum": 64 940 | }, 941 | { 942 | "_defaultOrder": 46, 943 | "_isFastLaunch": false, 944 | "category": "Memory Optimized", 945 | "gpuNum": 0, 946 | "hideHardwareSpecs": false, 947 | "memoryGiB": 768, 948 | "name": "ml.r5.24xlarge", 949 | "vcpuNum": 96 950 | }, 951 | { 952 | "_defaultOrder": 47, 953 | "_isFastLaunch": false, 954 | "category": "Accelerated computing", 955 | "gpuNum": 1, 956 | "hideHardwareSpecs": false, 957 | "memoryGiB": 16, 958 | "name": "ml.g5.xlarge", 959 | "vcpuNum": 4 960 | }, 961 | { 962 | "_defaultOrder": 48, 963 | "_isFastLaunch": false, 964 | "category": "Accelerated computing", 965 | "gpuNum": 1, 966 | "hideHardwareSpecs": false, 967 | "memoryGiB": 32, 968 | "name": "ml.g5.2xlarge", 969 | "vcpuNum": 8 970 | }, 971 | { 972 | "_defaultOrder": 49, 973 | "_isFastLaunch": false, 974 | "category": "Accelerated computing", 975 | "gpuNum": 1, 976 | "hideHardwareSpecs": false, 977 | "memoryGiB": 64, 978 | "name": "ml.g5.4xlarge", 979 | "vcpuNum": 16 980 | }, 981 | { 982 | "_defaultOrder": 50, 983 | "_isFastLaunch": false, 984 | "category": "Accelerated computing", 985 | "gpuNum": 1, 986 | "hideHardwareSpecs": false, 987 | "memoryGiB": 128, 988 | "name": "ml.g5.8xlarge", 989 | "vcpuNum": 32 990 | }, 991 | { 992 | "_defaultOrder": 51, 993 | "_isFastLaunch": false, 994 | "category": "Accelerated computing", 995 | "gpuNum": 1, 996 | "hideHardwareSpecs": false, 997 | "memoryGiB": 256, 998 | "name": "ml.g5.16xlarge", 999 | "vcpuNum": 64 1000 | }, 1001 | { 1002 | "_defaultOrder": 52, 1003 | 
"_isFastLaunch": false, 1004 | "category": "Accelerated computing", 1005 | "gpuNum": 4, 1006 | "hideHardwareSpecs": false, 1007 | "memoryGiB": 192, 1008 | "name": "ml.g5.12xlarge", 1009 | "vcpuNum": 48 1010 | }, 1011 | { 1012 | "_defaultOrder": 53, 1013 | "_isFastLaunch": false, 1014 | "category": "Accelerated computing", 1015 | "gpuNum": 4, 1016 | "hideHardwareSpecs": false, 1017 | "memoryGiB": 384, 1018 | "name": "ml.g5.24xlarge", 1019 | "vcpuNum": 96 1020 | }, 1021 | { 1022 | "_defaultOrder": 54, 1023 | "_isFastLaunch": false, 1024 | "category": "Accelerated computing", 1025 | "gpuNum": 8, 1026 | "hideHardwareSpecs": false, 1027 | "memoryGiB": 768, 1028 | "name": "ml.g5.48xlarge", 1029 | "vcpuNum": 192 1030 | }, 1031 | { 1032 | "_defaultOrder": 55, 1033 | "_isFastLaunch": false, 1034 | "category": "Accelerated computing", 1035 | "gpuNum": 8, 1036 | "hideHardwareSpecs": false, 1037 | "memoryGiB": 1152, 1038 | "name": "ml.p4d.24xlarge", 1039 | "vcpuNum": 96 1040 | }, 1041 | { 1042 | "_defaultOrder": 56, 1043 | "_isFastLaunch": false, 1044 | "category": "Accelerated computing", 1045 | "gpuNum": 8, 1046 | "hideHardwareSpecs": false, 1047 | "memoryGiB": 1152, 1048 | "name": "ml.p4de.24xlarge", 1049 | "vcpuNum": 96 1050 | }, 1051 | { 1052 | "_defaultOrder": 57, 1053 | "_isFastLaunch": false, 1054 | "category": "Accelerated computing", 1055 | "gpuNum": 0, 1056 | "hideHardwareSpecs": false, 1057 | "memoryGiB": 32, 1058 | "name": "ml.trn1.2xlarge", 1059 | "vcpuNum": 8 1060 | }, 1061 | { 1062 | "_defaultOrder": 58, 1063 | "_isFastLaunch": false, 1064 | "category": "Accelerated computing", 1065 | "gpuNum": 0, 1066 | "hideHardwareSpecs": false, 1067 | "memoryGiB": 512, 1068 | "name": "ml.trn1.32xlarge", 1069 | "vcpuNum": 128 1070 | }, 1071 | { 1072 | "_defaultOrder": 59, 1073 | "_isFastLaunch": false, 1074 | "category": "Accelerated computing", 1075 | "gpuNum": 0, 1076 | "hideHardwareSpecs": false, 1077 | "memoryGiB": 512, 1078 | "name": "ml.trn1n.32xlarge", 1079 | "vcpuNum": 128 1080 | } 1081 | ], 1082 | "instance_type": "ml.t3.medium", 1083 | "kernelspec": { 1084 | "display_name": "Python 3", 1085 | "language": "python", 1086 | "name": "python3" 1087 | }, 1088 | "language_info": { 1089 | "codemirror_mode": { 1090 | "name": "ipython", 1091 | "version": 3 1092 | }, 1093 | "file_extension": ".py", 1094 | "mimetype": "text/x-python", 1095 | "name": "python", 1096 | "nbconvert_exporter": "python", 1097 | "pygments_lexer": "ipython3", 1098 | "version": "3.11.12" 1099 | } 1100 | }, 1101 | "nbformat": 4, 1102 | "nbformat_minor": 5 1103 | } 1104 | -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/assets/consolidate_adapter_shards_and_merge_model.py: -------------------------------------------------------------------------------- 1 | from optimum.neuron.models.training import ( 2 | consolidate_model_parallel_checkpoints_to_unified_checkpoint, 3 | ) 4 | from transformers import AutoModel, AutoTokenizer 5 | from argparse import ArgumentParser 6 | from shutil import copyfile 7 | import os 8 | import peft 9 | 10 | parser = ArgumentParser() 11 | parser.add_argument( 12 | "-i", 13 | "--input_dir", 14 | help="source checkpoint directory containing sharded adapter checkpoint files", 15 | required=True, 16 | ) 17 | parser.add_argument( 18 | "-o", 19 | "--output_dir", 20 | help="destination directory for final merged model (adapters merged into base model)", 21 | required=True, 22 | ) 23 | args = parser.parse_args() 24 | 25 | 
consolidated_ckpt_dir = os.path.join(args.input_dir, "consolidated") 26 | 27 | # Consolidate the adapter shards into a PEFT-compatible checkpoint 28 | print("Consolidating LoRA adapter shards") 29 | consolidate_model_parallel_checkpoints_to_unified_checkpoint( 30 | args.input_dir, consolidated_ckpt_dir 31 | ) 32 | copyfile( 33 | os.path.join(args.input_dir, "adapter_default/adapter_config.json"), 34 | os.path.join(consolidated_ckpt_dir, "adapter_config.json"), 35 | ) 36 | 37 | # Load AutoPeftModel using the consolidated PEFT checkpoint 38 | peft_model = peft.AutoPeftModelForCausalLM.from_pretrained(consolidated_ckpt_dir) 39 | 40 | # Merge adapter weights into base model, save new pretrained model 41 | print("Merging LoRA adapter shards into base model") 42 | merged_model = peft_model.merge_and_unload() 43 | print(f"Saving merged model to {args.output_dir}") 44 | merged_model.save_pretrained(args.output_dir) 45 | 46 | print(f"Saving tokenizer to {args.output_dir}") 47 | tokenizer = AutoTokenizer.from_pretrained(args.input_dir) 48 | tokenizer.save_pretrained(args.output_dir) 49 | 50 | # Load the pretrained model and print config 51 | print("Merged model config:") 52 | model = AutoModel.from_pretrained(args.output_dir) 53 | print(model) 54 | -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/assets/finetune_llama.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass, field 2 | from datasets import load_dataset 3 | from peft import LoraConfig 4 | from transformers import ( 5 | AutoTokenizer, 6 | set_seed, 7 | ) 8 | import os 9 | import subprocess 10 | import boto3 11 | from botocore.exceptions import ClientError 12 | from huggingface_hub import login 13 | import torch 14 | 15 | from optimum.neuron import NeuronHfArgumentParser as HfArgumentParser 16 | from optimum.neuron import NeuronSFTConfig, NeuronSFTTrainer, NeuronTrainingArguments 17 | from torch_xla.core.xla_model import is_master_ordinal 18 | from optimum.neuron.models.training import NeuronModelForCausalLM 19 | 20 | 21 | 22 | def training_function(script_args, training_args): 23 | dataset = load_dataset("b-mc2/sql-create-context", split="train") 24 | dataset = dataset.shuffle(seed=23) 25 | train_dataset = dataset.select(range(50000)) 26 | eval_dataset = dataset.select(range(50000, 50500)) 27 | 28 | def create_conversation(sample): 29 | system_message = ( 30 | "You are a text to SQL query translator. 
Users will ask you questions in English and you will generate a " 31 | "SQL query based on the provided SCHEMA.\nSCHEMA:\n{schema}" 32 | ) 33 | return { 34 | "messages": [ 35 | { 36 | "role": "system", 37 | "content": system_message.format(schema=sample["context"]), 38 | }, 39 | {"role": "user", "content": sample["question"]}, 40 | {"role": "assistant", "content": sample["answer"] + ";"}, 41 | ] 42 | } 43 | 44 | train_dataset = train_dataset.map( 45 | create_conversation, remove_columns=train_dataset.features, batched=False 46 | ) 47 | eval_dataset = eval_dataset.map( 48 | create_conversation, remove_columns=eval_dataset.features, batched=False 49 | ) 50 | 51 | tokenizer = AutoTokenizer.from_pretrained(script_args.tokenizer_id) 52 | # tokenizer.pad_token = tokenizer.eos_token 53 | # tokenizer.eos_token_id = 128001 54 | 55 | trn_config = training_args.trn_config 56 | dtype = torch.bfloat16 if training_args.bf16 else torch.float32 57 | model = NeuronModelForCausalLM.from_pretrained( 58 | script_args.model_id, 59 | trn_config, 60 | torch_dtype=dtype, 61 | # Use FlashAttention2 for better performance and to be able to use larger sequence lengths. 62 | use_flash_attention_2=False, #Because we are training a sequence lower than 2K for the workshop 63 | ) 64 | 65 | config = LoraConfig( 66 | r=script_args.lora_r, 67 | lora_alpha=script_args.lora_alpha, 68 | lora_dropout=script_args.lora_dropout, 69 | target_modules=[ 70 | "q_proj", 71 | "gate_proj", 72 | "v_proj", 73 | "o_proj", 74 | "k_proj", 75 | "up_proj", 76 | "down_proj", 77 | ], 78 | bias="none", 79 | task_type="CAUSAL_LM", 80 | ) 81 | 82 | args = training_args.to_dict() 83 | 84 | sft_config = NeuronSFTConfig( 85 | max_seq_length=1024, 86 | packing=True, 87 | **args, 88 | dataset_kwargs={ 89 | "add_special_tokens": False, 90 | "append_concat_token": True, 91 | }, 92 | ) 93 | 94 | trainer = NeuronSFTTrainer( 95 | args=sft_config, 96 | model=model, 97 | peft_config=config, 98 | tokenizer=tokenizer, 99 | train_dataset=train_dataset, 100 | eval_dataset=eval_dataset, 101 | ) 102 | 103 | # Start training 104 | trainer.train() 105 | del trainer 106 | 107 | 108 | @dataclass 109 | class ScriptArguments: 110 | model_id: str = field( 111 | default="TinyLlama/TinyLlama-1.1B-Chat-v1.0", 112 | metadata={ 113 | "help": "The model that you want to train from the Hugging Face hub." 114 | }, 115 | ) 116 | tokenizer_id: str = field( 117 | default="TinyLlama/TinyLlama-1.1B-Chat-v1.0", 118 | metadata={"help": "The tokenizer used to tokenize text for fine-tuning."}, 119 | ) 120 | lora_r: int = field( 121 | default=16, 122 | metadata={"help": "LoRA r value to be used during fine-tuning."}, 123 | ) 124 | lora_alpha: int = field( 125 | default=32, 126 | metadata={"help": "LoRA alpha value to be used during fine-tuning."}, 127 | ) 128 | lora_dropout: float = field( 129 | default=0.05, 130 | metadata={"help": "LoRA dropout value to be used during fine-tuning."}, 131 | ) 132 | secret_name: str = field( 133 | default="huggingface/token", 134 | metadata={"help": "AWS Secrets Manager secret name containing Hugging Face token."}, 135 | ) 136 | secret_region: str = field( 137 | default="us-west-2", 138 | metadata={"help": "AWS region where the secret is stored."}, 139 | ) 140 | 141 | 142 | def get_secret(secret_name, region_name): 143 | """ 144 | Retrieve a secret from AWS Secrets Manager by searching for secrets with the given name prefix. 145 | This is specific to the workshop environment. 
146 | """ 147 | try: 148 | session = boto3.session.Session() 149 | client = session.client(service_name='secretsmanager', region_name=region_name) 150 | 151 | # List secrets and find one that starts with the secret_name 152 | paginator = client.get_paginator('list_secrets') 153 | for page in paginator.paginate(): 154 | for secret in page['SecretList']: 155 | if secret['Name'].startswith(secret_name): 156 | response = client.get_secret_value(SecretId=secret['ARN']) 157 | if 'SecretString' in response: 158 | return response['SecretString'] 159 | return None 160 | except ClientError: 161 | print("Could not retrieve secret from AWS Secrets Manager") 162 | return None 163 | 164 | if __name__ == "__main__": 165 | parser = HfArgumentParser([ScriptArguments, NeuronTrainingArguments]) 166 | script_args, training_args = parser.parse_args_into_dataclasses() 167 | 168 | # Check for Hugging Face token in environment variable 169 | hf_token = os.environ.get("HF_TOKEN") 170 | 171 | # If no token in environment, try to get it from AWS Secrets Manager 172 | if not hf_token: 173 | print("No Hugging Face token found in environment, checking AWS Secrets Manager...") 174 | hf_token = get_secret(script_args.secret_name, script_args.secret_region) 175 | 176 | # Login to Hugging Face if a valid token is found 177 | if hf_token: 178 | print("Logging in to Hugging Face Hub...") 179 | login(token=hf_token) 180 | else: 181 | print("No valid Hugging Face token found, continuing without authentication") 182 | 183 | set_seed(training_args.seed) 184 | training_function(script_args, training_args) 185 | 186 | # Consolidate LoRA adapter shards, merge LoRA adapters into base model, save merged model 187 | if is_master_ordinal(): 188 | input_ckpt_dir = os.path.join( 189 | training_args.output_dir, f"checkpoint-{training_args.max_steps}" 190 | ) 191 | output_ckpt_dir = os.path.join(training_args.output_dir, "merged_model") 192 | # the spawned process expects to see 2 NeuronCores for consolidating checkpoints with a tp=2 193 | # Either the second core isn't really used or it is freed up by the other thread finishing. 194 | # Adjusting Neuron env. var to advertise 2 NeuronCores to the process. 195 | env = os.environ.copy() 196 | env["NEURON_RT_VISIBLE_CORES"] = "0-1" 197 | subprocess.run( 198 | [ 199 | "python3", 200 | "consolidate_adapter_shards_and_merge_model.py", 201 | "-i", 202 | input_ckpt_dir, 203 | "-o", 204 | output_ckpt_dir, 205 | ], 206 | env=env 207 | ) -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/assets/requirements.txt: -------------------------------------------------------------------------------- 1 | optimum-neuron==0.3.0 2 | peft==0.16.0 3 | trl==0.11.4 4 | huggingface_hub==0.33.4 5 | datasets==3.6.0 6 | -------------------------------------------------------------------------------- /labs/Lab_Four_NKI_Profiling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Neuron Profile \n", 8 | "\n", 9 | "This workshop was borrowed from the AWS NKI Workshop. 
To find the full original content, see here:\n", 10 | "- Workshop: https://catalog.us-east-1.prod.workshops.aws/workshops/0d84c975-7a94-469a-b6bc-661768d303f7/en-US/lab-0\n", 11 | "- Github: https://github.com/aws-samples/ml-specialized-hardware/tree/main/workshops/03_NKIWorkshop\n", 12 | "\n", 13 | "In this tutorial, we use Neuron Profile to view the execution trace of a NKI kernel captured on a NeuronCore. In doing so, we learn about:\n", 14 | "\n", 15 | "- Installation and usage of Neuron Profile.\n", 16 | "\n", 17 | "- Inspecting a detailed execution timeline of compute engine instructions and DMA engine activities generated from your NKI kernel.\n", 18 | "\n", 19 | "As background, [Neuron Profile](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-profile-user-guide.html) is the tool you need to visualize where time is being spent during kernel execution on NeuronDevices, which is crucial for identifying performance bottlenecks and opportunities of your kernel. Neuron Profile produces runtime execution data for every instruction executed on each compute engine and also every data movement activity completed by DMA engines. Neuron Profile also reports key performance metrics such as compute engine and memory bandwidth utilization, which allows developers to quickly find out the achieved hardware efficiency of their kernel. Profiling typically has near zero overhead thanks to the dedicated on-chip profiling hardware in NeuronDevices.\n", 20 | "\n", 21 | "## Profile a NKI Kernel\n", 22 | "\n", 23 | "### Install Neuron Profile\n", 24 | "Make sure you have the latest version of the `aws-neuronx-tools`, which includes updated profiling support for NKI kernels. Neuron Profile is included within this package and is installed to `/opt/aws/neuron/bin`.\n", 25 | "\n", 26 | "The `aws-neuronx-tools` package comes pre-installed on [Neuron DLAMIs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html). For detailed installation instructions see [Neuron Profile User Guide: Installation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-profile-user-guide.html#installation).\n", 27 | "\n", 28 | "### Profile using `neuron-profile capture`\n", 29 | "\n", 30 | "To profile a NKI kernel the required steps are (1) enable `NEURON_FRAMEWORK_DEBUG` to tell the compiler to save the `NEFF` file, (2) execute the NKI kernel to generate the `NEFF`, and (3) run `neuron-profile capture` to generate a `NTFF` profile. Each step is described in more detail below.\n", 31 | "\n", 32 | "We will profile a NKI kernel which computes the element-wise exponential of an input tensor of any 2D shape. The rest of this tutorial will use a performance profile generated from this kernel as an example. 
Full code of `prof-kernel.py`:" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "%%writefile prof-kernel.py\n", 42 | "\"\"\"\n", 43 | "Example kernel used to demmonstrate Neuron Profile.\n", 44 | "\"\"\"\n", 45 | "import torch\n", 46 | "from neuronxcc import nki\n", 47 | "import neuronxcc.nki.language as nl\n", 48 | "import math\n", 49 | "import os\n", 50 | "os.environ[\"NEURON_FRAMEWORK_DEBUG\"] = \"1\"\n", 51 | "os.environ[\"NEURON_CC_FLAGS\"]= \" --disable-dge \"\n", 52 | "\n", 53 | "@nki.jit\n", 54 | "def tensor_exp_kernel_(in_tensor):\n", 55 | " \"\"\"NKI kernel to compute elementwise exponential of an input tensor\n", 56 | "\n", 57 | " Args:\n", 58 | " in_tensor: an input tensor of ANY 2D shape (up to SBUF size)\n", 59 | " Returns:\n", 60 | " out_tensor: an output tensor of ANY 2D shape (up to SBUF size)\n", 61 | " \"\"\"\n", 62 | " out_tensor = nl.ndarray(in_tensor.shape, dtype=in_tensor.dtype,\n", 63 | " buffer=nl.shared_hbm)\n", 64 | "\n", 65 | " sz_p, sz_f = in_tensor.shape\n", 66 | "\n", 67 | " i_f = nl.arange(sz_f)[None, :]\n", 68 | "\n", 69 | " for p in nl.affine_range(math.ceil(sz_p / nl.tile_size.pmax)):\n", 70 | " # Generate tensor indices for the input/output tensors\n", 71 | " # pad index to pmax, for simplicity\n", 72 | " i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]\n", 73 | "\n", 74 | " # Load input data from external memory to on-chip memory\n", 75 | " # only read up to sz_p\n", 76 | " in_tile = nl.load(in_tensor[i_p, i_f], mask=(i_p\n", 122 | "Use the flag `--disable-dge` to temporarily disable a new compiler feature which is interfering with DMA debugging information display in neuron-profile. This is highly recommended to improve NKI performance debugging experience until we release a software fix for this issue.\n", 123 | "\n", 124 | "\n", 125 | "2. Compile your NKI kernel to create a NEFF in your current directory:" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "!python3 prof-kernel.py" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "
\n", 142 | "Find your NEFF named similarly to `MODULE_0_SyncTensorsGraph.13_12659246067793504316.neff`.\n", 143 | "
\n", 144 | "\n", 145 | "3. Profile the NEFF. This profiling step executes the NEFF on the NeuronDevice and records a raw execution trace into an Neuron Trace File Format (NTFF) artifact." 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "!neuron-profile capture -n -s profile.ntff --profile-nth-exec=2" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "This will save your NTFF profile to `profile_exec_2.ntff`.\n", 162 | "\n", 163 | "
\n", 164 | "The `--profile-nth-exec=2` option will profile your NEFF twice on the NeuronDevice and output a NTFF profile for the second iteration. This is recommended to avoid one-time warmup delays which can be seen in the first iteration of execution.\n", 165 | "
\n", 166 | "\n", 167 | "In [View Neuron Profile UI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/neuron_profile_for_nki.html#nki-view-neuron-profile-ui), we will view the profile in a user-friendly format using the Neuron Profile UI.\n", 168 | "\n", 169 | "### Profile using nki.benchmark\n", 170 | "\n", 171 | "You may also use the [nki.benchmark](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.benchmark.html) API to generate a NEFF and NTFF programmatically. One caveat is [nki.benchmark](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.benchmark.html) runs your NEFF without an ML framework in [nki.baremetal](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.baremetal.html) mode, so the input tensors to the kernel must be NumPy arrays instead of framework tensors such as `torch.Tensor`.\n", 172 | "\n", 173 | "Below is an example NKI kernel decorated by [nki.benchmark](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.benchmark.html). Full code of `prof-kernel-benchmark.py`:" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "%%writefile prof-kernel-benchmark.py\n", 183 | "\"\"\"\n", 184 | "Example kernel used to demmonstrate Neuron Profile with nki.benchmark.\n", 185 | "\"\"\"\n", 186 | "from neuronxcc import nki\n", 187 | "from neuronxcc.nki.typing import tensor\n", 188 | "import neuronxcc.nki.language as nl\n", 189 | "import math\n", 190 | "\n", 191 | "\n", 192 | "@nki.benchmark(save_neff_name='file.neff', save_trace_name='profile.ntff')\n", 193 | "def tensor_exp_kernel_(in_tensor):\n", 194 | " \"\"\"NKI kernel to compute elementwise exponential of an input tensor\n", 195 | " Args:\n", 196 | " in_tensor: an input tensor of ANY 2D shape (up to SBUF size)\n", 197 | " Returns:\n", 198 | " out_tensor: an output tensor of ANY 2D shape (up to SBUF size)\n", 199 | " \"\"\"\n", 200 | " out_tensor = nl.ndarray(in_tensor.shape, dtype=in_tensor.dtype,\n", 201 | " buffer=nl.shared_hbm)\n", 202 | "\n", 203 | " sz_p, sz_f = in_tensor.shape\n", 204 | " i_f = nl.arange(sz_f)[None, :]\n", 205 | " for p in nl.affine_range(math.ceil(sz_p / nl.tile_size.pmax)):\n", 206 | " # Generate tensor indices for the input/output tensors\n", 207 | " # pad index to pmax, for simplicity\n", 208 | " i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]\n", 209 | " # Load input data from external memory to on-chip memory\n", 210 | " # only read up to sz_p\n", 211 | " in_tile = nl.load(in_tensor[i_p, i_f], mask=(i_p