├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE ├── README.md ├── contributed └── models │ ├── README.md │ └── qwen2 │ ├── modeling_qwen2.py │ └── qwen-2-test.ipynb └── labs ├── FineTuning └── HuggingFaceExample │ ├── 01_finetuning │ ├── Finetune-TinyLlama-1.1B.ipynb │ └── assets │ │ ├── consolidate_adapter_shards_and_merge_model.py │ │ ├── finetune_llama.py │ │ └── requirements.txt │ └── 02_inference │ └── Inference-TinyLlama-1.1B.ipynb ├── Lab_One_NxDI.ipynb ├── Lab_Two_NKI.ipynb └── generation_config.json /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 
41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. 
For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. 
You are solely responsible for determining the
151 | appropriateness of using or redistributing the Work and assume any
152 | risks associated with Your exercise of permissions under this License.
153 | 
154 | 8. Limitation of Liability. In no event and under no legal theory,
155 | whether in tort (including negligence), contract, or otherwise,
156 | unless required by applicable law (such as deliberate and grossly
157 | negligent acts) or agreed to in writing, shall any Contributor be
158 | liable to You for damages, including any direct, indirect, special,
159 | incidental, or consequential damages of any character arising as a
160 | result of this License or out of the use or inability to use the
161 | Work (including but not limited to damages for loss of goodwill,
162 | work stoppage, computer failure or malfunction, or any and all
163 | other commercial damages or losses), even if such Contributor
164 | has been advised of the possibility of such damages.
165 | 
166 | 9. Accepting Warranty or Additional Liability. While redistributing
167 | the Work or Derivative Works thereof, You may choose to offer,
168 | and charge a fee for, acceptance of support, warranty, indemnity,
169 | or other liability obligations and/or rights consistent with this
170 | License. However, in accepting such obligations, You may act only
171 | on Your own behalf and on Your sole responsibility, not on behalf
172 | of any other Contributor, and only if You agree to indemnify,
173 | defend, and hold each Contributor harmless for any liability
174 | incurred by, or claims asserted against, such Contributor by reason
175 | of your accepting any such warranty or additional liability.
176 | 
--------------------------------------------------------------------------------
/NOTICE:
--------------------------------------------------------------------------------
1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Build On Trainium Workshop
2 | 
3 | In this workshop, you will learn how to develop support for a new model with [NeuronX Distributed Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-overview.html#nxdi-overview), using Llama 3.2 1B as the working example. You will also learn how to write your own kernel to program the accelerator hardware directly with the [Neuron Kernel Interface](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html). Both of these tools will help you design your research proposals and experiments on Trainium.
4 | 
5 | The workshop also includes an end-to-end example of using Hugging Face Optimum Neuron to fine-tune and host a small language model with Amazon SageMaker.
6 | 
7 | ### What is Build on Trainium?
8 | Build on Trainium is a $110M credit program focused on AI research and university education to support the next generation of innovation and development on AWS Trainium. AWS Trainium chips are purpose-built for high-performance deep learning (DL) training of generative AI models, including large language models (LLMs) and latent diffusion models. Build on Trainium provides compute credits for novel AI research on Trainium, investing in leading academic teams to build innovations in critical areas including new model architectures, ML libraries, optimizations, large-scale distributed systems, and more.
This multi-year initiative lays the foundation for the future of AI by inspiring the academic community to utilize, invest in, and contribute to the open-source community around Trainium. Combined with the Neuron software development kit (SDK) and the recently launched Neuron Kernel Interface (NKI), these benefits enable AI researchers to innovate at scale in the cloud.
9 | 
10 | ### What are AWS Trainium and Neuron?
11 | AWS Trainium is an AI chip developed by AWS to accelerate building and deploying machine learning models. Built on a specialized architecture designed for deep learning, Trainium accelerates the training and inference of complex models with high throughput and scalability, making it ideal for academic researchers looking to optimize performance and costs. This architecture also emphasizes sustainability through energy-efficient design, reducing environmental impact. Amazon has established a dedicated Trainium research cluster featuring up to 40,000 Trainium chips, accessible via Amazon EC2 Trn1 instances. These instances are connected through a non-blocking, petabit-scale network using Amazon EC2 UltraClusters, enabling seamless high-performance ML training. The Trn1 instance family is optimized to deliver substantial compute power for cutting-edge AI research and development. This unique offering not only enhances the efficiency and affordability of model training but also presents academic researchers with opportunities to publish new papers on underrepresented compute architectures, thus advancing the field.
12 | 
13 | Learn more about Build On Trainium [here](https://aws.amazon.com/ai/machine-learning/trainium/research/).
14 | 
15 | ### Your workshop
16 | This hands-on workshop is designed for academic researchers who are planning to submit proposals to [Build On Trainium](https://www.amazon.science/research-awards/call-for-proposals).
17 | 
18 | The workshop has multiple available modules:
19 | 1. Setup instructions
20 | 2. Run inference with Llama and NeuronX Distributed Inference (NxD)
21 | 3. Write your own kernel with the Neuron Kernel Interface (NKI)
22 | 4. Fine-tune and host an existing, supported model with a different dataset using Amazon SageMaker
23 | 
24 | #### Instructor-led workshop
25 | If you are participating in an instructor-led workshop, follow the guidance provided by your instructor for accessing the environment.
26 | 
27 | #### Self-managed workshop
28 | If you are following the workshop steps in your own environment, you will need to take the following actions:
29 | 1. Launch a trn1.2xlarge instance on Amazon EC2, using the latest [DLAMI with Neuron packages preinstalled](https://repost.aws/articles/ARTxLi0wndTwquyl7frQYuKg).
30 | 2. Use a Python virtual environment preinstalled in that DLAMI, commonly located in `/opt/aws_`.
31 | 3. Set up and manage your own development environment on that instance, such as by using VSCode or a Jupyter Lab server.
32 | 
33 | ### Background knowledge
34 | This workshop introduces developing on AWS Trainium for an academic AI research audience. As such, it's expected that the audience will already have a firm understanding of machine learning fundamentals.
35 | 
36 | ### Workshop costs
37 | If you are participating in an instructor-led workshop hosted in an AWS-managed Workshop Studio environment, you will not incur any costs for using this environment. If you are following this workshop in your own environment, you will incur the costs associated with provisioning an Amazon EC2 instance.
Please see the service pricing details [here](https://aws.amazon.com/ec2/pricing/on-demand/).
38 | 
39 | At the time of writing, this workshop uses a trn1.2xlarge instance with an on-demand rate of $1.34 per hour in supported US regions. The fine-tuning lab requires less than an hour of an ml.trn1.2xlarge at $1.54 an hour, plus an ml.inf2.xlarge at $0.99 an hour (you deploy it and **delete it when you are done**).
40 | 
41 | ## FAQs and known issues
42 | 1. Workshop instructions are available [here](https://catalog.us-east-1.prod.workshops.aws/workshops/bf9d80a3-5e4b-4648-bca8-1d887bb2a9ca/en-US).
43 | 2. If you use the `NousResearch` Llama 3.2 1B, please note that you'll need to remove a trailing comma from the model config file. You can do this with Vim in the VSCode terminal. If you do not take this step, you'll get an invalid JSON error when the model config is read in Lab 1. If editing the file through the terminal is a little challenging, you can also download the config file from this repository with the following command:
44 | `!wget https://raw.githubusercontent.com/aws-neuron/build-on-trainium-workshop/main/labs/generation_config.json -P /home/ec2-user/environment/models/llama/`
45 | 3. Jupyter kernels can hold on to the NeuronCores even after your cell has completed, because the kernel's Python process stays alive. This can then cause issues when you try to run a new notebook, and sometimes when you try to run another cell. If you encounter a `NeuronCore not found` or similar error, restart your Jupyter kernel and/or shut down kernels from previous sessions. You can also restart the instance through the EC2 console. Once your node is back online, you can always check the availability of the NeuronCores with `neuron-ls`.
46 | 4. Want to see how to integrate NKI with NxD? Check out our `nki-llama` [here](https://github.com/aws-samples/nki-llama).
47 | 
48 | 
49 | ## Security
50 | 
51 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
52 | 
53 | ## License
54 | 
55 | This project is licensed under the Apache-2.0 License.
56 | 
57 | 
--------------------------------------------------------------------------------
/contributed/models/README.md:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/contributed/models/qwen2/modeling_qwen2.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3 | #
4 | # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5 | # and OPT implementations in this library. It has been modified from its
6 | # original forms to accommodate minor architectural differences compared
7 | # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8 | #
9 | # Licensed under the Apache License, Version 2.0 (the "License");
10 | # you may not use this file except in compliance with the License.
11 | # You may obtain a copy of the License at
12 | #
13 | # http://www.apache.org/licenses/LICENSE-2.0
14 | #
15 | # Unless required by applicable law or agreed to in writing, software
16 | # distributed under the License is distributed on an "AS IS" BASIS,
17 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18 | # See the License for the specific language governing permissions and
19 | # limitations under the License.
20 | """PyTorch Qwen2 model for NXD inference.""" 21 | import copy 22 | import gc 23 | import logging 24 | import math 25 | from typing import List, Optional, Tuple, Type 26 | 27 | import torch 28 | from neuronx_distributed.parallel_layers import parallel_state # noqa: E402 29 | from neuronx_distributed.parallel_layers.layers import ( # noqa: E402; noqa: E402; noqa: E402; noqa: E402; noqa: E402 30 | ColumnParallelLinear, 31 | ParallelEmbedding, 32 | RowParallelLinear, 33 | ) 34 | from neuronx_distributed_inference.modules.attention.gqa import GroupQueryAttention_O 35 | from neuronx_distributed.parallel_layers.mappings import ( 36 | gather_from_sequence_parallel_region, 37 | reduce_from_tensor_model_parallel_region, 38 | reduce_scatter_to_sequence_parallel_region, 39 | _gather_along_first_dim, 40 | ) 41 | from neuronx_distributed.parallel_layers.utils import get_padding_length 42 | from neuronx_distributed.utils import cpu_mode 43 | from neuronxcc.nki._private_kernels.mlp import ( 44 | mlp_fused_add_isa_kernel, 45 | mlp_isa_kernel, 46 | quant_mlp_fused_add_isa_kernel, 47 | quant_mlp_isa_kernel, 48 | ) 49 | from neuronxcc.nki._private_kernels.rmsnorm import rmsnorm_quant_isa_kernel 50 | from neuronxcc.nki.language import nc 51 | from torch import nn 52 | from torch_neuronx.xla_impl.ops import nki_jit 53 | from transformers import Qwen2ForCausalLM 54 | from transformers.activations import ACT2FN 55 | from transformers.models.qwen2.modeling_qwen2 import Qwen2RMSNorm, Qwen2RotaryEmbedding 56 | 57 | from neuronx_distributed_inference.models.config import InferenceConfig, NeuronConfig # noqa: E402 58 | from neuronx_distributed_inference.models.model_base import ( # noqa: E402 59 | NeuronBaseForCausalLM, 60 | NeuronBaseModel, 61 | ) 62 | from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase 63 | from neuronx_distributed_inference.modules.attention.gqa import ( # noqa: E402 64 | BaseGroupQueryAttention, 65 | ) 66 | from neuronx_distributed_inference.modules.attention.utils import ( 67 | RotaryEmbedding, 68 | preprocess_quantized_linear_layer, 69 | transpose_parallel_linear_layer, 70 | ) 71 | from neuronx_distributed_inference.modules.custom_calls import CustomRMSNorm 72 | from neuronx_distributed_inference.modules.flashdecode.utils import calculate_num_cores_per_group 73 | from neuronx_distributed_inference.modules.lora_serving.lora_module import is_lora_module 74 | from neuronx_distributed_inference.utils.distributed import get_tp_group 75 | 76 | _Qwen2_MODULE_MAP = {} 77 | 78 | logger = logging.getLogger("Neuron") 79 | 80 | 81 | def get_rmsnorm_cls(): 82 | # Initialize to the appropriate implementation of RMSNorm 83 | # If infer on NXD -> CustomRMSNorm 84 | # If infer on CPU -> HF_RMSNorm (CustomRMSNorm does not work on CPU) 85 | return Qwen2RMSNorm if cpu_mode() else CustomRMSNorm 86 | 87 | 88 | def preshard_hook_fn(module: torch.nn.Module, model_state_dict: dict, prefix: str) -> bool: 89 | if isinstance(module, (BaseGroupQueryAttention,)): 90 | return module.preshard_hook(model_state_dict, prefix) 91 | 92 | return False 93 | 94 | 95 | # Get the modules_to_not_convert from the neuron configs 96 | def get_modules_to_not_convert(neuron_config: NeuronConfig): 97 | return getattr(neuron_config, "modules_to_not_convert", None) 98 | 99 | 100 | def get_updated_configs(config: InferenceConfig): 101 | """ 102 | Generate a list of configurations for each hidden layer in a Qwen2 model. 
103 | 104 | This function creates a list of InferenceConfig objects, one for each layer. It 105 | modifies the configurations for certain layers based on which modules should not 106 | be converted to quantized format. The function uses get_modules_to_not_convert() 107 | to determine which modules should not be converted. 108 | 109 | Args: 110 | config (InferenceConfig): The inference configuration for the model. 111 | 112 | Returns: 113 | list[InferenceConfig]: A list of InferenceConfig objects, one for each layer in the model. 114 | Each config may be either the original config or a modified version 115 | with "quantized_mlp_kernel_enabled" as False for that specific layer. 116 | """ 117 | updated_configs = [] 118 | modules_to_not_convert = get_modules_to_not_convert(config.neuron_config) 119 | if modules_to_not_convert is None: 120 | modules_to_not_convert = [] 121 | 122 | for i in range(config.num_hidden_layers): 123 | # If any of the MLP modules for this layer are in modules_to_not_convert 124 | module_pattern = f"layers.{i}.mlp" 125 | if any(module_pattern in module for module in modules_to_not_convert): 126 | non_quant_config = copy.deepcopy(config) 127 | non_quant_config.neuron_config.quantized_mlp_kernel_enabled = False 128 | non_quant_config.neuron_config.activation_quantization_type = None 129 | non_quant_config.neuron_config.quantize_clamp_bound = float("inf") 130 | updated_configs.append(non_quant_config) 131 | else: 132 | updated_configs.append(config) 133 | return updated_configs 134 | 135 | 136 | def _register_module(key: str, cls: Type[nn.Module]): 137 | _Qwen2_MODULE_MAP[key] = cls 138 | 139 | 140 | def register_module(key: str): 141 | """ 142 | Register a module for use in NeuronQwen2. 143 | 144 | Arguments: 145 | key: String used to identify the module 146 | 147 | Example: 148 | @register_module("NeuronQwen2Attention") 149 | class NeuronQwen2Attention(nn.Module): 150 | ... 151 | """ 152 | 153 | def inner(cls: Type[nn.Module]): 154 | _register_module(key, cls) 155 | return cls 156 | 157 | return inner 158 | 159 | 160 | def _helper_concat_and_delete_qkv(Qwen2_state_dict, layer_num, attr): 161 | """ 162 | Helper function to concatenate and delete QKV attributes for fusedqkv (weight or scale). 163 | Args: 164 | Qwen2_state_dict: The state dictionary containing model weights 165 | layer_num: The index of the layer to process 166 | attr: The attribute to process ('weight' or 'scale') 167 | """ 168 | Qwen2_state_dict[f"layers.{layer_num}.self_attn.Wqkv.{attr}"] = torch.cat( 169 | [ 170 | Qwen2_state_dict[f"layers.{layer_num}.self_attn.q_proj.{attr}"], 171 | Qwen2_state_dict[f"layers.{layer_num}.self_attn.k_proj.{attr}"], 172 | Qwen2_state_dict[f"layers.{layer_num}.self_attn.v_proj.{attr}"], 173 | ], 174 | ) 175 | del Qwen2_state_dict[f"layers.{layer_num}.self_attn.q_proj.{attr}"] 176 | del Qwen2_state_dict[f"layers.{layer_num}.self_attn.k_proj.{attr}"] 177 | del Qwen2_state_dict[f"layers.{layer_num}.self_attn.v_proj.{attr}"] 178 | 179 | 180 | def convert_state_dict_to_fused_qkv(Qwen2_state_dict, cfg: InferenceConfig): 181 | """ 182 | This function concats the qkv weights and scales to a Wqkv weight and scale for fusedqkv, and deletes the qkv weights. 
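    The q, k, and v projection weights are stored in (out_features, in_features) layout and are
    concatenated along dim 0, so the fused Wqkv weight has shape
    ((num_attention_heads + 2 * num_key_value_heads) * head_dim, hidden_size).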
183 |     """
184 |     mods_to_not_conv = get_modules_to_not_convert(cfg.neuron_config)
185 |     if mods_to_not_conv is None:
186 |         mods_to_not_conv = []
187 | 
188 |     for l in range(cfg.num_hidden_layers):  # noqa: E741
189 |         _helper_concat_and_delete_qkv(Qwen2_state_dict, l, "weight")
190 |         if (
191 |             cfg.neuron_config.quantized_mlp_kernel_enabled or cfg.neuron_config.quantized
192 |         ) and f"layers.{l}.self_attn" not in mods_to_not_conv:
193 |             _helper_concat_and_delete_qkv(Qwen2_state_dict, l, "scale")
194 | 
195 |     gc.collect()
196 | 
197 |     return Qwen2_state_dict
198 | 
199 | 
200 | class WeightGatheredColumnParallel(ColumnParallelLinear):
201 |     """
202 |     A specialized column-parallel linear layer that implements weight gathering optimization
203 |     for efficient processing of long sequences in transformer models during eagle speculation.
204 | 
205 |     This layer provides two forward paths:
206 |     1. Standard column-parallel forward (inherited from parent)
207 |     2. Weight-gathered forward for long sequences
208 |     """
209 |     def forward_wg(self, input: torch.Tensor, weight_gather: bool = False):
210 |         """
211 |         Performs the forward pass with optional weight gathering optimization.
212 | 
213 |         Args:
214 |             input (torch.Tensor): Input tensor of shape (batch_size, seq_len/TP, 2*hidden_size)
215 |             weight_gather (bool): Whether to use weight gathering optimization.
216 |                 Typically True for sequences >= 32K
217 | 
218 |         Returns:
219 |             torch.Tensor or Tuple[torch.Tensor, torch.Tensor]:
220 |                 - If skip_bias_add is False: Output tensor of shape (batch_size, seq_len, hidden_size)
221 |                 - If skip_bias_add is True: Tuple of (output tensor, bias)
222 |         """
223 |         if weight_gather:
224 |             weight = _gather_along_first_dim(self.weight, process_group=self.tensor_parallel_group)
225 |             output = self._forward_impl(
226 |                 input=input,
227 |                 weight=weight,
228 |                 bias=None,
229 |                 async_grad_allreduce=self.async_tensor_model_parallel_allreduce,
230 |                 sequence_parallel_enabled=self.sequence_parallel_enabled,
231 |                 sequence_dimension=self.sequence_dimension,
232 |                 autograd_func_class=self.autograd_func_class,
233 |                 process_group=self.tensor_parallel_group
234 |             )
235 | 
236 |             output = gather_from_sequence_parallel_region(
237 |                 output,
238 |                 self.sequence_dimension,
239 |                 process_group=self.tensor_parallel_group,
240 |             )
241 |             if self.skip_bias_add:
242 |                 return output, self.bias
243 | 
244 |             output = (output + self.bias) if self.bias is not None else output
245 |             return output
246 |         else:
247 |             return self.forward(input)
248 | 
249 | 
250 | class Qwen2InferenceConfig(InferenceConfig):
251 |     def add_derived_config(self):
252 |         self.num_cores_per_group = 1
253 |         if self.neuron_config.flash_decoding_enabled:
254 |             num_attn_heads, num_kv_heads = self.num_attention_heads, self.num_key_value_heads
255 |             self.num_cores_per_group = calculate_num_cores_per_group(
256 |                 num_attn_heads, num_kv_heads, self.neuron_config.tp_degree
257 |             )
258 | 
259 |     def get_required_attributes(self) -> List[str]:
260 |         return [
261 |             "hidden_size",
262 |             "num_attention_heads",
263 |             "num_hidden_layers",
264 |             "num_key_value_heads",
265 |             "pad_token_id",
266 |             "vocab_size",
267 |             "max_position_embeddings",
268 |             "rope_theta",
269 |             "rms_norm_eps",
270 |             "hidden_act",
271 |         ]
272 | 
273 |     @classmethod
274 |     def get_neuron_config_cls(cls) -> Type[NeuronConfig]:
275 |         return NeuronConfig
276 | 
277 | 
278 | class NeuronQwen2MLP(nn.Module):
279 |     """
280 |     This class just replaces the linear layers (gate_proj, up_proj and down_proj) with column and row parallel layers.
281 |     """
282 | 
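    # Tensor-parallel layout set up in __init__ below:
    #   - gate_proj and up_proj are ColumnParallelLinear with gather_output=False, so each
    #     tensor-parallel rank holds a shard of the intermediate dimension.
    #   - down_proj is RowParallelLinear with input_is_parallel=True, so its partial outputs are
    #     reduced across ranks (all-reduce, or reduce-scatter when sequence parallelism is enabled).
    #   - When MLP kernels are enabled, the weights are additionally transposed (or preprocessed,
    #     for the quantized path) into the layout expected by the NKI kernels.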
283 | def __init__(self, config: InferenceConfig): 284 | super().__init__() 285 | self.config = config 286 | self.neuron_config = config.neuron_config 287 | self.tp_degree = config.neuron_config.tp_degree 288 | self.hidden_size = config.hidden_size 289 | self.intermediate_size = config.intermediate_size 290 | self.act_fn = ACT2FN[config.hidden_act] 291 | 292 | self.sequence_parallel_enabled = getattr( 293 | self.neuron_config, "sequence_parallel_enabled", False 294 | ) 295 | self.sequence_dimension = 1 if self.sequence_parallel_enabled else None 296 | self.rms_norm_eps = config.rms_norm_eps 297 | self.mlp_kernel_enabled = self.neuron_config.mlp_kernel_enabled 298 | self.fused_rmsnorm_skip_gamma = self.config.neuron_config.fused_rmsnorm_skip_gamma 299 | self.quantized_mlp_kernel_enabled = self.neuron_config.quantized_mlp_kernel_enabled 300 | self.rmsnorm_quantize_kernel_enabled = self.neuron_config.rmsnorm_quantize_kernel_enabled 301 | self.quantize_clamp_bound = self.neuron_config.quantize_clamp_bound 302 | self.logical_nc_config = self.neuron_config.logical_nc_config 303 | self.activation_quantization_type = self.neuron_config.activation_quantization_type 304 | mlp_bias = getattr(config, "mlp_bias", False) 305 | 306 | if self.neuron_config.quantized_mlp_kernel_enabled and self.quantize_clamp_bound == float( 307 | "inf" 308 | ): 309 | logging.warning( 310 | "quantize_clamp_bound is not specified in NeuronConfig. We will use the default value of 1200 for Qwen2 models in quantized kernels." 311 | ) 312 | self.quantize_clamp_bound = 1200.0 313 | if parallel_state.model_parallel_is_initialized(): 314 | if self.neuron_config.quantized_mlp_kernel_enabled: 315 | # # Quantized MLP kernels expect intermediate size to be multiple of 128, so we need to pad 316 | tp_degree = self.neuron_config.tp_degree 317 | self.intermediate_size += ( 318 | get_padding_length(self.intermediate_size // tp_degree, 128) * tp_degree 319 | ) 320 | logger.debug(f"Quantized intermediate_size: {self.intermediate_size}") 321 | self.gate_proj = ColumnParallelLinear( 322 | self.hidden_size, 323 | self.intermediate_size, 324 | bias=mlp_bias, 325 | gather_output=False, 326 | dtype=config.neuron_config.torch_dtype, 327 | pad=True, 328 | sequence_parallel_enabled=False, 329 | sequence_dimension=None, 330 | tensor_model_parallel_group=get_tp_group(config), 331 | ) 332 | self.up_proj = ColumnParallelLinear( 333 | self.hidden_size, 334 | self.intermediate_size, 335 | bias=mlp_bias, 336 | gather_output=False, 337 | dtype=config.neuron_config.torch_dtype, 338 | pad=True, 339 | sequence_parallel_enabled=False, 340 | sequence_dimension=None, 341 | tensor_model_parallel_group=get_tp_group(config), 342 | ) 343 | self.down_proj = RowParallelLinear( 344 | self.intermediate_size, 345 | self.hidden_size, 346 | bias=mlp_bias, 347 | input_is_parallel=True, 348 | dtype=config.neuron_config.torch_dtype, 349 | pad=True, 350 | sequence_parallel_enabled=self.sequence_parallel_enabled, 351 | sequence_dimension=self.sequence_dimension, 352 | tensor_model_parallel_group=get_tp_group(config), 353 | reduce_dtype=config.neuron_config.rpl_reduce_dtype, 354 | ) 355 | if self.mlp_kernel_enabled: 356 | if self.neuron_config.quantized_mlp_kernel_enabled: 357 | setattr( 358 | self.gate_proj, 359 | "post_create_quantized_module_hook", 360 | preprocess_quantized_linear_layer, 361 | ) 362 | setattr( 363 | self.up_proj, 364 | "post_create_quantized_module_hook", 365 | preprocess_quantized_linear_layer, 366 | ) 367 | setattr( 368 | self.down_proj, 369 | 
"post_create_quantized_module_hook", 370 | preprocess_quantized_linear_layer, 371 | ) 372 | else: 373 | # Transpose the weights to the layout expected by kernels 374 | self.gate_proj.weight = transpose_parallel_linear_layer(self.gate_proj.weight) 375 | self.up_proj.weight = transpose_parallel_linear_layer(self.up_proj.weight) 376 | self.down_proj.weight = transpose_parallel_linear_layer(self.down_proj.weight) 377 | 378 | else: 379 | self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=mlp_bias) 380 | self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=mlp_bias) 381 | self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=mlp_bias) 382 | 383 | def _kernel_enabled_quantized_mlp(self, x, rmsnorm, residual, adapter_ids): 384 | grid = (nc(self.logical_nc_config),) 385 | fused_residual = residual is not None 386 | fused_rmsnorm = rmsnorm is not None 387 | logger.debug( 388 | f"MLP: quantized kernel, fused_residual={fused_residual}, fused_rmsnorm={fused_rmsnorm}, logical_nc_config={self.logical_nc_config}" 389 | ) 390 | 391 | # Can't do residual add in the kernel if SP is enabled 392 | if fused_residual: 393 | assert ( 394 | not self.sequence_parallel_enabled 395 | ), "Quantized MLP cannot have both fused residual add and sequence parallel RMSnorm!" 396 | # Using fused residual add 397 | _mlp_fwd_call = nki_jit()(quant_mlp_fused_add_isa_kernel) 398 | else: 399 | _mlp_fwd_call = nki_jit()(quant_mlp_isa_kernel) 400 | 401 | if fused_rmsnorm: 402 | ln_w = rmsnorm.weight.unsqueeze(0) 403 | else: 404 | ln_w = torch.zeros(size=(1, self.hidden_size), dtype=x.dtype, device=x.device) 405 | 406 | # Handle SP RMSnorm 407 | x_orig_dtype = x.dtype 408 | if self.sequence_parallel_enabled: 409 | # This RMSNormQuant kernel will do quantization inside, so we pass the 410 | # clamp_bound for clipping. 411 | # If we don't use this kernel, the MLP kernel below will do the 412 | # quantization, so we also pass clamp_bound to that kernel. 413 | if self.rmsnorm_quantize_kernel_enabled: 414 | logger.debug( 415 | "Running Quantized MLP kernel with sequence-parallel RMSnorm-Quantize kernel!" 416 | ) 417 | _rmsnorm_quant_fwd_call = nki_jit()(rmsnorm_quant_isa_kernel) 418 | quant_rmsnorm_out = torch.zeros( 419 | size=( 420 | x.shape[0], # batch size 421 | x.shape[1], # sequence length 422 | x.shape[2] + 4, # hidden size + 4 bytes for packing fp32 scale 423 | ), 424 | dtype=torch.int8, 425 | device=x.device, 426 | ) 427 | clamp_bound = self.quantize_clamp_bound 428 | _rmsnorm_quant_fwd_call[grid]( 429 | x, ln_w, clamp_bound, quant_rmsnorm_out, kernel_name="QuantOnly" 430 | ) 431 | x = gather_from_sequence_parallel_region( 432 | quant_rmsnorm_out, 433 | self.sequence_dimension, 434 | process_group=get_tp_group(self.config), 435 | ) 436 | 437 | else: 438 | logger.debug( 439 | "Running Quantized MLP kernel with external (native compiler) sequence-parallel RMSnorm!" 
440 | ) 441 | x = gather_from_sequence_parallel_region( 442 | x, self.sequence_dimension, process_group=get_tp_group(self.config) 443 | ) 444 | 445 | # Build output tensor 446 | output_tensor_seqlen = x.shape[1] 447 | if fused_residual: 448 | # seqlen dim is doubled to store the residual add output 449 | output_tensor_seqlen *= 2 450 | 451 | output_tensor = torch.zeros( 452 | size=( 453 | x.shape[0], # batch size 454 | output_tensor_seqlen, 455 | self.hidden_size, # hidden size 456 | ), 457 | dtype=x_orig_dtype, 458 | device=x.device, 459 | ) 460 | 461 | # Grab weights 462 | # all weights of the layers are stored in (out, in) shape 463 | # unsqueeze so that shape of RMS gamma weight is [1, hidden] instead of [hidden] 464 | gate_w = self.gate_proj.weight.data 465 | gate_w_scale = self.gate_proj.scale 466 | up_w = self.up_proj.weight.data 467 | up_w_scale = self.up_proj.scale 468 | down_w = self.down_proj.weight.data 469 | down_w_scale = self.down_proj.scale 470 | clamp_bound = self.quantize_clamp_bound 471 | 472 | if fused_residual: 473 | _mlp_fwd_call[grid]( 474 | x, # attn_output 475 | residual, # hidden 476 | ln_w, # ln_w 477 | gate_w, # gate_w 478 | gate_w_scale, 479 | up_w, # up_w 480 | up_w_scale, 481 | down_w, # down_w 482 | down_w_scale, 483 | clamp_bound, 484 | output_tensor, # out 485 | fused_rmsnorm=fused_rmsnorm, 486 | eps=self.rms_norm_eps, 487 | kernel_name="MLP", 488 | store_add=True, 489 | ) 490 | original_seqlen = x.shape[1] 491 | residual = output_tensor[:, original_seqlen:, :] 492 | output_tensor = output_tensor[:, :original_seqlen, :] 493 | else: 494 | _mlp_fwd_call[grid]( 495 | x, # hidden 496 | # should be fine to pass gamma is as a dummy even if not using fused rmsnorm 497 | ln_w, 498 | gate_w, # gate_w 499 | gate_w_scale, 500 | up_w, # up_w 501 | up_w_scale, 502 | down_w, # down_w 503 | down_w_scale, 504 | clamp_bound, 505 | output_tensor, # out 506 | # Run RMSNorm inside the kernel if NOT using SP rmsnorm 507 | fused_rmsnorm=fused_rmsnorm, 508 | eps=self.rms_norm_eps, 509 | kernel_name="MLP", 510 | ) 511 | residual = None 512 | 513 | # All-reduce or reduce-scatter, depending on whether SP is enabled 514 | if self.sequence_parallel_enabled: 515 | output_tensor = reduce_scatter_to_sequence_parallel_region( 516 | output_tensor, self.sequence_dimension, process_group=get_tp_group(self.config) 517 | ) 518 | else: 519 | output_tensor = reduce_from_tensor_model_parallel_region(output_tensor) 520 | 521 | logger.debug(f"Quantized MLP output shape {output_tensor.shape}") 522 | return (output_tensor, residual) 523 | 524 | def _kernel_enabled_mlp(self, x, rmsnorm, residual, adapter_ids): 525 | fused_residual = residual is not None 526 | fused_rmsnorm = rmsnorm is not None 527 | logger.debug( 528 | f"MLP: kernel, fused_residual={fused_residual}, fused_rmsnorm={fused_rmsnorm}, skip_gamma={self.fused_rmsnorm_skip_gamma}, logical_nc_config={self.logical_nc_config}" 529 | ) 530 | 531 | # Choose which kernel to call 532 | if fused_residual: 533 | assert ( 534 | not self.sequence_parallel_enabled 535 | ), "MLP kernel cannot have both fused residual add and sequence parallel RMSnorm!" 
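            # The fused-add kernel writes both the MLP output and the residual-add result into a
            # single output tensor whose sequence dimension is doubled; the two halves are split
            # back apart after the kernel call below.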
536 | # Using fused residual add 537 | _mlp_fwd_call = nki_jit()(mlp_fused_add_isa_kernel) 538 | else: 539 | _mlp_fwd_call = nki_jit()(mlp_isa_kernel) 540 | 541 | if self.sequence_parallel_enabled: 542 | x = gather_from_sequence_parallel_region( 543 | x, self.sequence_dimension, process_group=get_tp_group(self.config) 544 | ) 545 | 546 | # Build output tensor 547 | output_tensor_seqlen = x.shape[1] 548 | if fused_residual: 549 | # seqlen dim is doubled to store the residual add output 550 | output_tensor_seqlen *= 2 551 | 552 | output_tensor = torch.zeros( 553 | size=( 554 | x.shape[0], # batch size 555 | output_tensor_seqlen, 556 | self.hidden_size, # hidden size 557 | ), 558 | dtype=x.dtype, 559 | device=x.device, 560 | ) 561 | 562 | # Grab weights 563 | # all weights of the layers are stored in (out, in) shape 564 | # unsqueeze so that shape of RMS gamma weight is [1, hidden] instead of [hidden] 565 | if fused_rmsnorm: 566 | ln_w = rmsnorm.weight.unsqueeze(0) 567 | else: 568 | ln_w = torch.zeros(size=(1, self.hidden_size), dtype=x.dtype, device=x.device) 569 | gate_w = self.gate_proj.weight.data 570 | up_w = self.up_proj.weight.data 571 | down_w = self.down_proj.weight.data 572 | 573 | grid = (nc(self.logical_nc_config),) 574 | 575 | if fused_residual: 576 | _mlp_fwd_call[grid]( 577 | x, # attn_output 578 | residual, # hidden 579 | ln_w, # ln_w 580 | gate_w, # gate_w 581 | up_w, # up_w 582 | down_w, # down_w 583 | output_tensor, # out 584 | kernel_name="MLP", 585 | fused_rmsnorm=fused_rmsnorm, 586 | skip_gamma=self.fused_rmsnorm_skip_gamma, 587 | eps=self.rms_norm_eps, 588 | store_add=True, 589 | ) 590 | original_seqlen = x.shape[1] 591 | residual = output_tensor[:, original_seqlen:, :] 592 | output_tensor = output_tensor[:, :original_seqlen, :] 593 | else: 594 | _mlp_fwd_call[grid]( 595 | x, # hidden 596 | # should be fine to pass gamma is as a dummy even if not using fused rmsnorm 597 | ln_w, 598 | gate_w, 599 | up_w, 600 | down_w, 601 | output_tensor, # out 602 | kernel_name="MLP", 603 | # Run RMSNorm inside the kernel if NOT using SP rmsnorm 604 | fused_rmsnorm=fused_rmsnorm, 605 | skip_gamma=self.fused_rmsnorm_skip_gamma, 606 | eps=self.rms_norm_eps, 607 | ) 608 | residual = None 609 | 610 | # All-reduce or reduce-scatter, depending on whether SP is enabled 611 | if self.sequence_parallel_enabled: 612 | output_tensor = reduce_scatter_to_sequence_parallel_region( 613 | output_tensor, self.sequence_dimension, process_group=get_tp_group(self.config) 614 | ) 615 | else: 616 | output_tensor = reduce_from_tensor_model_parallel_region( 617 | output_tensor, process_group=get_tp_group(self.config) 618 | ) 619 | 620 | logger.debug(f"MLP output shape {output_tensor.shape}") 621 | return (output_tensor, residual) 622 | 623 | def _native_mlp(self, x, adapter_ids=None): 624 | logger.debug("MLP: native compiler") 625 | # all-gather is done here instead of CPL layers to 626 | # avoid 2 all-gathers from up and gate projections 627 | if self.sequence_parallel_enabled: 628 | x = gather_from_sequence_parallel_region( 629 | x, self.sequence_dimension, process_group=get_tp_group(self.config) 630 | ) 631 | gate_proj_output = ( 632 | self.gate_proj(x) 633 | if not is_lora_module(self.gate_proj) 634 | else self.gate_proj(x, adapter_ids) 635 | ) 636 | 637 | up_proj_output = ( 638 | self.up_proj(x) if not is_lora_module(self.up_proj) else self.up_proj(x, adapter_ids) 639 | ) 640 | 641 | down_proj_input = self.act_fn(gate_proj_output) * up_proj_output 642 | output = ( 643 | self.down_proj(down_proj_input) 644 
| if not is_lora_module(self.down_proj) 645 | else self.down_proj(down_proj_input, adapter_ids) 646 | ) 647 | logger.debug(f"MLP output shape {output.shape}") 648 | return output 649 | 650 | def forward(self, x, rmsnorm=None, residual=None, adapter_ids=None): 651 | """ 652 | If residual is passed in, will fuse its add into the MLP kernel 653 | If rmsnorm is passed in, will fuse the rmsnorm into the MLP kernel 654 | 655 | Returns a tuple of (output, residual), where residual is the output of the residual add 656 | """ 657 | 658 | if self.mlp_kernel_enabled: 659 | # Quantized MLP kernel 660 | if self.quantized_mlp_kernel_enabled: 661 | return self._kernel_enabled_quantized_mlp( 662 | x, rmsnorm, residual, adapter_ids=adapter_ids 663 | ) 664 | # MLP kernel 665 | return self._kernel_enabled_mlp(x, rmsnorm, residual, adapter_ids=adapter_ids) 666 | else: 667 | # No kernel 668 | assert rmsnorm is None and residual is None 669 | return (self._native_mlp(x, adapter_ids=adapter_ids), None) 670 | 671 | 672 | @register_module("NeuronQwen2Attention") 673 | class NeuronQwen2Attention(NeuronAttentionBase): 674 | """ 675 | Compared with Qwen2Attention, this class just 676 | 1. replaces the q_proj, k_proj, v_proj with column parallel layer 677 | 2. replaces the o_proj with row parallel layer 678 | 3. update self.num_head to be self.num_head / tp_degree 679 | 4. update self.num_key_value_heads to be self.num_key_value_heads / tp_degree 680 | 5. update forward() method to adjust to changes from self.num_head 681 | """ 682 | 683 | def __init__(self, config: InferenceConfig, tensor_model_parallel_group=None): 684 | super().__init__(tensor_model_parallel_group=tensor_model_parallel_group) 685 | 686 | self.config = config 687 | self.neuron_config = config.neuron_config 688 | self.hidden_size = config.hidden_size 689 | self.num_attention_heads = config.num_attention_heads 690 | self.num_key_value_heads = config.num_key_value_heads 691 | self.head_dim = getattr(config, "head_dim", self.hidden_size // self.num_attention_heads) 692 | self.max_position_embeddings = config.max_position_embeddings 693 | self.rope_theta = config.rope_theta 694 | self.padding_side = config.neuron_config.padding_side 695 | self.torch_dtype = config.neuron_config.torch_dtype 696 | self.is_medusa = config.neuron_config.is_medusa 697 | self.flash_decoding_enabled = config.neuron_config.flash_decoding_enabled 698 | self.num_cores_per_group = config.num_cores_per_group 699 | self.bias = getattr(config, "attention_bias", True) 700 | self.rpl_reduce_dtype = config.neuron_config.rpl_reduce_dtype 701 | self.mlp_kernel_enabled = config.neuron_config.mlp_kernel_enabled 702 | self.rms_norm_eps = config.rms_norm_eps 703 | self.attn_tkg_builtin_kernel_enabled = self.neuron_config.attn_tkg_builtin_kernel_enabled 704 | 705 | if parallel_state.model_parallel_is_initialized(): 706 | self.tp_degree = self.config.neuron_config.tp_degree 707 | else: 708 | self.tp_degree = 1 709 | 710 | self.fused_qkv = config.neuron_config.fused_qkv 711 | self.clip_qkv = None 712 | 713 | self.sequence_parallel_enabled = self.neuron_config.sequence_parallel_enabled 714 | self.sequence_dimension = 1 if self.sequence_parallel_enabled else None 715 | logger.debug( 716 | f"Hello from NeuronQwen2Attention init! Is SP enabled? {self.sequence_parallel_enabled}. Dim? 
{self.sequence_dimension}" 717 | ) 718 | 719 | self.init_gqa_properties() 720 | 721 | self.init_rope() 722 | 723 | self.o_proj = GroupQueryAttention_O( 724 | hidden_size=self.hidden_size, 725 | head_dim=self.head_dim, 726 | num_attention_heads=self.num_attention_heads, 727 | num_key_value_heads=self.num_key_value_heads, 728 | tp_degree=self.tp_degree, 729 | dtype=self.torch_dtype, 730 | bias=False, 731 | input_is_parallel=True, 732 | layer_name=self.o_proj_layer_name, 733 | sequence_parallel_enabled=self.sequence_parallel_enabled, 734 | sequence_dimension=self.sequence_dimension, 735 | tensor_model_parallel_group=self.tensor_model_parallel_group, 736 | rpl_reduce_dtype=self.rpl_reduce_dtype, 737 | ) 738 | 739 | def init_rope(self): 740 | self.rotary_emb = Qwen2RotaryEmbedding(self.config) 741 | 742 | if self.attn_tkg_builtin_kernel_enabled: 743 | self.inv_freqs = self.rotary_emb.get_inv_freqs().unsqueeze(1) 744 | 745 | 746 | 747 | class NeuronQwen2DecoderLayer(nn.Module): 748 | """ 749 | Just replace the attention with the NXD version, and MLP with the NXD version 750 | """ 751 | 752 | def __init__(self, config: InferenceConfig): 753 | super().__init__() 754 | self.hidden_size = config.hidden_size 755 | 756 | self.self_attn = NeuronQwen2Attention( 757 | config=config, tensor_model_parallel_group=get_tp_group(config) 758 | ) 759 | 760 | self.mlp = NeuronQwen2MLP(config) 761 | logger.debug( 762 | f"Instantiating RMSNorm modules with hidden size {config.hidden_size} and EPS {config.rms_norm_eps}" 763 | ) 764 | self.input_layernorm = None 765 | if ( 766 | not config.neuron_config.is_eagle_draft 767 | or config.neuron_config.enable_eagle_draft_input_norm 768 | ): 769 | self.input_layernorm = get_rmsnorm_cls()( 770 | config.hidden_size, 771 | eps=config.rms_norm_eps, 772 | ) 773 | self.post_attention_layernorm = get_rmsnorm_cls()( 774 | config.hidden_size, 775 | eps=config.rms_norm_eps, 776 | ) 777 | self.qkv_kernel_enabled = config.neuron_config.qkv_kernel_enabled 778 | self.mlp_kernel_enabled = config.neuron_config.mlp_kernel_enabled 779 | self.quantized_mlp_kernel_enabled = config.neuron_config.quantized_mlp_kernel_enabled 780 | self.rmsnorm_quantize_kernel_enabled = config.neuron_config.rmsnorm_quantize_kernel_enabled 781 | self.mlp_kernel_fuse_residual_add = config.neuron_config.mlp_kernel_fuse_residual_add 782 | self.qkv_kernel_fuse_residual_add = config.neuron_config.qkv_kernel_fuse_residual_add 783 | self.sequence_parallel_enabled = config.neuron_config.sequence_parallel_enabled 784 | self.is_prefill_stage = config.neuron_config.is_prefill_stage 785 | self.config = config 786 | 787 | if self.is_prefill_stage and self.config.neuron_config.is_mlp_quantized(): 788 | # for CTE, quantized MLP kernel does not support fused rmsnorm 789 | self.mlp_kernel_fused_rmsnorm = False 790 | else: 791 | self.mlp_kernel_fused_rmsnorm = not self.sequence_parallel_enabled 792 | 793 | def forward( 794 | self, 795 | hidden_states: torch.Tensor, 796 | attention_mask: Optional[torch.Tensor] = None, 797 | position_ids: Optional[torch.LongTensor] = None, 798 | past_key_value: Optional[Tuple[torch.Tensor]] = None, 799 | adapter_ids=None, 800 | rotary_position_ids: Optional[torch.LongTensor] = None, 801 | residual: Optional[torch.Tensor] = None, # residual from previous layer used by QKV 802 | **kwargs, 803 | ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]], Optional[torch.FloatTensor], Optional[torch.FloatTensor], Optional[torch.FloatTensor]]: 804 | entry_hidden_states = 
hidden_states 805 | # RMSNorm (fused with QKV kernel when SP is disabled) 806 | if (not self.qkv_kernel_enabled or self.sequence_parallel_enabled) and self.input_layernorm: 807 | hidden_states = self.input_layernorm(hidden_states) 808 | 809 | # Self Attention 810 | # produced another residual used by MLP 811 | attn_output = self.self_attn( 812 | hidden_states=hidden_states, 813 | attention_mask=attention_mask, 814 | position_ids=position_ids, 815 | past_key_value=past_key_value, 816 | adapter_ids=adapter_ids, 817 | rmsnorm=self.input_layernorm, 818 | rotary_position_ids=rotary_position_ids, 819 | residual=residual, 820 | **kwargs, 821 | ) 822 | 823 | if attn_output.residual is None: 824 | residual = entry_hidden_states # input to attention 825 | else: 826 | # residual will only be returned by attn/qkv if fuse add qkv kernel is enabled 827 | assert self.qkv_kernel_fuse_residual_add, \ 828 | "residual add before qkv should be computed in the previous layer, \ 829 | unless qkv_kernel_fuse_residual_add is specified" 830 | assert ( 831 | not self.sequence_parallel_enabled 832 | ), "qkv_kernel_fuse_residual_add should be off when sequence parallelism is enabled" 833 | assert ( 834 | self.qkv_kernel_enabled 835 | ), "qkv_kernel_fuse_residual_add should be used with qkv_kernel_enabled" 836 | residual = attn_output.residual 837 | 838 | hidden_states = attn_output.hidden_states 839 | if self.mlp_kernel_enabled and self.mlp_kernel_fuse_residual_add: 840 | assert ( 841 | not self.sequence_parallel_enabled 842 | ), "mlp_kernel_fuse_residual_add should be off when sequence parallelism is enabled" 843 | # First residual add handled in the MLP kernel 844 | hidden_states, residual = self.mlp( 845 | hidden_states, 846 | rmsnorm=self.post_attention_layernorm, 847 | residual=residual, 848 | adapter_ids=adapter_ids, 849 | ) 850 | else: 851 | hidden_states = residual + hidden_states 852 | residual = hidden_states 853 | # RMSNorm (fused with QKV kernel when SP is disabled) 854 | if self.mlp_kernel_enabled and self.mlp_kernel_fused_rmsnorm: 855 | rmsnorm = self.post_attention_layernorm 856 | else: 857 | hidden_states = self.post_attention_layernorm(hidden_states) 858 | rmsnorm = None 859 | hidden_states, _ = self.mlp( 860 | hidden_states, 861 | rmsnorm=rmsnorm, 862 | adapter_ids=adapter_ids, 863 | ) 864 | 865 | # if fuse residual add with qkv, we leave this add to the next layer's QKV 866 | # unless it is the last layer in which case we add it here 867 | if not self.qkv_kernel_fuse_residual_add: 868 | hidden_states = residual + hidden_states 869 | residual = None # set to None to prevent it from being used again 870 | 871 | # also return residual for QKV in the next layer 872 | outputs = (hidden_states, attn_output.present_key_value, attn_output.cos_cache, attn_output.sin_cache, residual) 873 | return outputs 874 | 875 | 876 | class ResBlock(nn.Module): 877 | """ 878 | A Residual Block module. 879 | 880 | This module performs a linear transformation followed by a SiLU activation, 881 | and then adds the result to the original input, creating a residual connection. 882 | 883 | Args: 884 | hidden_size (int): The size of the hidden layers in the block. 
885 | """ 886 | 887 | def __init__(self, hidden_size): 888 | super().__init__() 889 | self.linear = nn.Linear(hidden_size, hidden_size) 890 | # Initialize as an identity mapping 891 | torch.nn.init.zeros_(self.linear.weight) 892 | # Use SiLU activation to keep consistent with the Qwen2 model 893 | self.act = nn.SiLU() 894 | 895 | def forward(self, x): 896 | """ 897 | Forward pass of the ResBlock. 898 | 899 | Args: 900 | x (torch.Tensor): Input tensor. 901 | 902 | Returns: 903 | torch.Tensor: Output after the residual connection and activation. 904 | """ 905 | return x + self.act(self.linear(x)) 906 | 907 | 908 | class NeuronQwen2Model(NeuronBaseModel): 909 | """ 910 | The neuron version of the Qwen2Model 911 | """ 912 | 913 | def setup_attr_for_model(self, config: InferenceConfig): 914 | # Needed for init_inference_optimization() 915 | self.on_device_sampling = config.neuron_config.on_device_sampling_config is not None 916 | self.tp_degree = config.neuron_config.tp_degree 917 | self.hidden_size = config.hidden_size 918 | self.num_attention_heads = config.num_attention_heads 919 | self.num_key_value_heads = config.num_key_value_heads 920 | self.max_batch_size = config.neuron_config.max_batch_size 921 | self.buckets = config.neuron_config.buckets 922 | 923 | def init_model(self, config: InferenceConfig): 924 | self.padding_idx = config.pad_token_id 925 | self.vocab_size = config.vocab_size 926 | 927 | if parallel_state.model_parallel_is_initialized(): 928 | self.embed_tokens = ParallelEmbedding( 929 | config.vocab_size, 930 | config.hidden_size, 931 | self.padding_idx, 932 | dtype=config.neuron_config.torch_dtype, 933 | shard_across_embedding=not config.neuron_config.vocab_parallel, 934 | sequence_parallel_enabled=config.neuron_config.sequence_parallel_enabled, 935 | sequence_dimension=1, 936 | pad=True, 937 | tensor_model_parallel_group=get_tp_group(config), 938 | use_spmd_rank=config.neuron_config.vocab_parallel, 939 | ) 940 | 941 | self.lm_head = ColumnParallelLinear( 942 | config.hidden_size, 943 | config.vocab_size, 944 | gather_output=not self.on_device_sampling, 945 | bias=False, 946 | pad=True, 947 | tensor_model_parallel_group=get_tp_group(config), 948 | ) 949 | else: 950 | self.embed_tokens = nn.Embedding( 951 | config.vocab_size, 952 | config.hidden_size, 953 | self.padding_idx, 954 | ) 955 | self.lm_head = nn.Linear( 956 | config.hidden_size, 957 | config.vocab_size, 958 | bias=False, 959 | ) 960 | 961 | updated_configs = get_updated_configs(config) 962 | 963 | self.layers = nn.ModuleList([NeuronQwen2DecoderLayer(conf) for conf in updated_configs]) 964 | 965 | if not config.neuron_config.is_eagle_draft: 966 | self.norm = get_rmsnorm_cls()(config.hidden_size, eps=config.rms_norm_eps) 967 | 968 | if config.neuron_config.is_eagle_draft: 969 | fc_bias = getattr(config, "fc_bias", False) 970 | # replicate fc weights since activations are sequence sharded 971 | self.fc = WeightGatheredColumnParallel( 972 | config.hidden_size * 2, config.hidden_size, bias=fc_bias, gather_output=True, sequence_dimension=1 973 | ) 974 | self.is_medusa = config.neuron_config.is_medusa 975 | self.num_medusa_heads = config.neuron_config.num_medusa_heads 976 | self.medusa_speculation_length = config.neuron_config.medusa_speculation_length 977 | 978 | if self.is_medusa: 979 | if parallel_state.model_parallel_is_initialized(): 980 | medusa_head_cls = ColumnParallelLinear 981 | else: 982 | medusa_head_cls = nn.Linear 983 | for i in range(self.num_medusa_heads): 984 | medusa_head = nn.Sequential( 985 | 
*([ResBlock(config.hidden_size)] * 1), 986 | medusa_head_cls( 987 | config.hidden_size, 988 | config.vocab_size, 989 | gather_output=not self.on_device_sampling, 990 | bias=False, 991 | ), 992 | ) 993 | setattr(self, f"medusa_head_{i}", medusa_head) 994 | 995 | 996 | class NeuronQwen2ForCausalLM(NeuronBaseForCausalLM): 997 | """ 998 | This class extends Qwen2ForCausalLM create traceable 999 | blocks for Neuron. 1000 | 1001 | Args: 1002 | Qwen2ForCausalLM (_type_): _description_ 1003 | """ 1004 | 1005 | _model_cls = NeuronQwen2Model 1006 | 1007 | @staticmethod 1008 | def load_hf_model(model_path, **kwargs): 1009 | return Qwen2ForCausalLM.from_pretrained(model_path, **kwargs) 1010 | 1011 | @staticmethod 1012 | def convert_hf_to_neuron_state_dict(state_dict: dict, config: InferenceConfig) -> dict: 1013 | """This function should be over-ridden in child classes as needed""" 1014 | 1015 | neuron_config = config.neuron_config 1016 | # to facilitate rank usage in attention 1017 | num_layers = config.num_hidden_layers 1018 | tp_degree = neuron_config.tp_degree 1019 | for i in range(num_layers): 1020 | state_dict[f"layers.{i}.self_attn.rank_util.rank"] = torch.arange( 1021 | 0, tp_degree, dtype=torch.int32 1022 | ) 1023 | 1024 | """ 1025 | for every layer do the following transformations 1026 | gate_w_prime = (gate_w.T * gamma).T 1027 | up_w_prime = (up_w.T * gamma).T 1028 | """ 1029 | if ( 1030 | neuron_config.fused_rmsnorm_skip_gamma 1031 | and not neuron_config.sequence_parallel_enabled 1032 | ): 1033 | if neuron_config.mlp_kernel_enabled: 1034 | # MLP 1035 | state_dict[f"layers.{i}.mlp.gate_proj.weight"] = state_dict[ 1036 | f"layers.{i}.mlp.gate_proj.weight" 1037 | ] * state_dict[f"layers.{i}.input_layernorm.weight"].unsqueeze(0) 1038 | state_dict[f"layers.{i}.mlp.up_proj.weight"] = state_dict[ 1039 | f"layers.{i}.mlp.up_proj.weight" 1040 | ] * state_dict[f"layers.{i}.input_layernorm.weight"].unsqueeze(0) 1041 | 1042 | if neuron_config.qkv_kernel_enabled: 1043 | # QKV 1044 | state_dict[f"layers.{i}.self_attn.q_proj.weight"] = state_dict[ 1045 | f"layers.{i}.self_attn.q_proj.weight" 1046 | ] * state_dict[f"layers.{i}.input_layernorm.weight"].unsqueeze(0) 1047 | state_dict[f"layers.{i}.self_attn.k_proj.weight"] = state_dict[ 1048 | f"layers.{i}.self_attn.k_proj.weight" 1049 | ] * state_dict[f"layers.{i}.input_layernorm.weight"].unsqueeze(0) 1050 | state_dict[f"layers.{i}.self_attn.v_proj.weight"] = state_dict[ 1051 | f"layers.{i}.self_attn.v_proj.weight" 1052 | ] * state_dict[f"layers.{i}.input_layernorm.weight"].unsqueeze(0) 1053 | 1054 | if neuron_config.fused_qkv: 1055 | state_dict = convert_state_dict_to_fused_qkv(state_dict, config) 1056 | 1057 | if neuron_config.vocab_parallel: 1058 | # TODO: this hack can be removed after replication_id is ready to use 1059 | state_dict["embed_tokens.rank_util.rank"] = torch.arange( 1060 | 0, neuron_config.local_ranks_size, dtype=torch.int32 1061 | ) 1062 | 1063 | # to facilitate rank usage in base model 1064 | state_dict["rank_util.rank"] = torch.arange(0, tp_degree, dtype=torch.int32) 1065 | return state_dict 1066 | 1067 | @staticmethod 1068 | def update_state_dict_for_tied_weights(state_dict): 1069 | state_dict["lm_head.weight"] = state_dict["embed_tokens.weight"].clone() 1070 | 1071 | @classmethod 1072 | def get_config_cls(cls): 1073 | return Qwen2InferenceConfig 1074 | -------------------------------------------------------------------------------- /contributed/models/qwen2/qwen-2-test.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "libneuronxla 2.2.3493.0+78c3e78c\n", 13 | "neuronx-cc 2.18.121.0+9e31e41a\n", 14 | "neuronx-distributed 0.12.12111+cdd84048\n", 15 | "neuronx-distributed-inference 0.3.5591+f50feae2\n", 16 | "torch-neuronx 2.6.0.2.7.5413+113e6810\n" 17 | ] 18 | } 19 | ], 20 | "source": [ 21 | "!pip list | grep neuron" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import torch\n", 31 | "from transformers import AutoTokenizer, GenerationConfig\n", 32 | "from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig\n", 33 | "from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter, load_pretrained_config" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 2, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "model_path = \"/home/ubuntu/model_hf_qwen/qwen2/\"\n", 43 | "traced_model_path = \"/home/ubuntu/traced_model_qwen/qwen2\"" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "from huggingface_hub import snapshot_download\n", 53 | "\n", 54 | "snapshot_download(\"Qwen/QwQ-32B\", local_dir=model_path)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "from modeling_qwen_v2 import Qwen2InferenceConfig, NeuronQwen2ForCausalLM\n", 64 | "\n", 65 | "def run_qwen2_compile():\n", 66 | " # Initialize configs and tokenizer.\n", 67 | " tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side=\"right\")\n", 68 | " tokenizer.pad_token = tokenizer.eos_token\n", 69 | "\n", 70 | " generation_config = GenerationConfig.from_pretrained(model_path)\n", 71 | " generation_config_kwargs = {\n", 72 | " \"do_sample\": False,\n", 73 | " \"top_k\": 1,\n", 74 | " \"pad_token_id\": tokenizer.pad_token_id,\n", 75 | " }\n", 76 | " generation_config.update(**generation_config_kwargs)\n", 77 | " \n", 78 | " neuron_config = NeuronConfig(\n", 79 | " tp_degree=8,\n", 80 | " batch_size=1,\n", 81 | " max_context_length=128,\n", 82 | " seq_len=256,\n", 83 | " enable_bucketing=True,\n", 84 | " context_encoding_buckets=[128],\n", 85 | " token_generation_buckets=[256],\n", 86 | " flash_decoding_enabled=False,\n", 87 | " torch_dtype=torch.bfloat16,\n", 88 | " fused_qkv=False,\n", 89 | " attn_kernel_enabled=True,\n", 90 | " attn_cls=\"NeuronQwen2Attention\"\n", 91 | " )\n", 92 | " config = Qwen2InferenceConfig(\n", 93 | " neuron_config,\n", 94 | " load_config=load_pretrained_config(model_path),\n", 95 | " )\n", 96 | " \n", 97 | " # Compile and save model.\n", 98 | " print(\"\\nCompiling and saving model...\")\n", 99 | " model = NeuronQwen2ForCausalLM(model_path, config)\n", 100 | " model.compile(traced_model_path)\n", 101 | " tokenizer.save_pretrained(traced_model_path)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "run_qwen2_compile()" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "from 
modeling_qwen_v2 import Qwen2InferenceConfig, NeuronQwen2ForCausalLM\n", 120 | "\n", 121 | "model = NeuronQwen2ForCausalLM(traced_model_path)\n", 122 | "model.load(traced_model_path)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "config = model.get_config_cls()\n", 132 | "config.get_neuron_config_cls()" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 9, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "40" 144 | ] 145 | }, 146 | "execution_count": 9, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "model.config.num_attention_heads" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 10, 158 | "metadata": {}, 159 | "outputs": [ 160 | { 161 | "data": { 162 | "text/plain": [ 163 | "8" 164 | ] 165 | }, 166 | "execution_count": 10, 167 | "metadata": {}, 168 | "output_type": "execute_result" 169 | } 170 | ], 171 | "source": [ 172 | "model.config.num_key_value_heads" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 11, 178 | "metadata": {}, 179 | "outputs": [ 180 | { 181 | "data": { 182 | "text/plain": [ 183 | "5120" 184 | ] 185 | }, 186 | "execution_count": 11, 187 | "metadata": {}, 188 | "output_type": "execute_result" 189 | } 190 | ], 191 | "source": [ 192 | "model.config.hidden_size" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 12, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "name": "stderr", 202 | "output_type": "stream", 203 | "text": [ 204 | "Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.\n" 205 | ] 206 | }, 207 | { 208 | "data": { 209 | "text/plain": [ 210 | "\"Okay, the user wants a short introduction to large language models. Let me start by defining what a large language model is. I should mention that they are AI systems trained on vast amounts of text data. Maybe include that they use deep learning, specifically transformer architectures.\\n\\nI need to highlight their capabilities, like generating text, understanding context, and performing various tasks such as answering questions, writing stories, or coding. It's important to note their scale—large parameter counts and extensive training data. \\n\\nAlso, touch on their applications: customer service, content creation, research, etc. Maybe mention some examples like GPT, BERT, or\"" 211 | ] 212 | }, 213 | "execution_count": 12, 214 | "metadata": {}, 215 | "output_type": "execute_result" 216 | } 217 | ], 218 | "source": [ 219 | "tokenizer = AutoTokenizer.from_pretrained(traced_model_path)\n", 220 | "tokenizer.pad_token = tokenizer.eos_token\n", 221 | "generation_config = GenerationConfig.from_pretrained(model_path)\n", 222 | "generation_config_kwargs = {\n", 223 | " \"do_sample\": True,\n", 224 | " \"temperature\": 0.9,\n", 225 | " \"top_k\": 5,\n", 226 | " \"pad_token_id\": tokenizer.pad_token_id,\n", 227 | "}\n", 228 | "\n", 229 | "prompt = \"Give me a short introduction to large language model.\"\n", 230 | "messages = [\n", 231 | " {\"role\": \"system\", \"content\": \"You are Qwen, created by Alibaba Cloud. 
You are a helpful assistant.\"},\n", 232 | " {\"role\": \"user\", \"content\": prompt}\n", 233 | "]\n", 234 | "text = tokenizer.apply_chat_template(\n", 235 | " messages,\n", 236 | " tokenize=False,\n", 237 | " add_generation_prompt=True\n", 238 | ")\n", 239 | "model_inputs = tokenizer([text], return_tensors=\"pt\")\n", 240 | "generation_model = HuggingFaceGenerationAdapter(model)\n", 241 | "generated_ids = generation_model.generate(\n", 242 | " **model_inputs,\n", 243 | " max_new_tokens=128\n", 244 | ")\n", 245 | "generated_ids = [\n", 246 | " output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n", 247 | "]\n", 248 | "\n", 249 | "response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n", 250 | "response" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 13, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "model.reset()" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "# Run Benchmarks" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 1, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "model_path = \"/home/ubuntu/model_hf_qwen/qwen2\"\n", 276 | "traced_model_path = \"/home/ubuntu/traced_model_qwen/qwen2/logit\"" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "dir = '/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/'\n", 286 | "!cp modeling_qwen2.py {dir}" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "# Edit the inference_demo.py file to include the following:\n", 294 | "\n", 295 | "```python\n", 296 | "from .modeling_qwen2 import NeuronQwen2ForCausalLM\n", 297 | "\n", 298 | "MODEL_TYPES = {\n", 299 | " \"llama\": {\"causal-lm\": NeuronLlamaForCausalLM},\n", 300 | " \"mixtral\": {\"causal-lm\": NeuronMixtralForCausalLM},\n", 301 | " \"dbrx\": {\"causal-lm\": NeuronDbrxForCausalLM},\n", 302 | " 'qwen2': {\"causal-lm\": NeuronQwen2ForCausalLM}\n", 303 | "}\n", 304 | "```" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 8, 310 | "metadata": {}, 311 | "outputs": [ 312 | { 313 | "name": "stdout", 314 | "output_type": "stream", 315 | "text": [ 316 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 317 | " from neuronx_distributed.modules.moe.blockwise import (\n", 318 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 319 | " from neuronx_distributed.modules.moe.blockwise import (\n", 320 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 321 | " from neuronx_distributed.modules.moe.blockwise import (\n", 322 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/attention/utils.py:14: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 323 | " from 
neuronx_distributed_inference.modules.custom_calls import neuron_cumsum\n", 324 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:745: UserWarning: Set seed for `privateuseone` device does not take effect, please add API's `_is_in_bad_fork` and `manual_seed_all` to `privateuseone` device module.\n", 325 | " return fn(*args, **kwargs)\n", 326 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 327 | " from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV\n", 328 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 329 | " from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV\n", 330 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 331 | " from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV\n", 332 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/dbrx/modeling_dbrx.py:38: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 333 | " from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase\n", 334 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/dbrx/modeling_dbrx.py:38: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 335 | " from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase\n", 336 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/inference_demo.py:25: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 337 | " from neuronx_distributed_inference.models.dbrx.modeling_dbrx import NeuronDbrxForCausalLM\n", 338 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/inference_demo.py:27: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 339 | " from neuronx_distributed_inference.models.mixtral.modeling_mixtral import NeuronMixtralForCausalLM\n", 340 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/mllama/modeling_mllama.py:72: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 341 | " from .modeling_mllama_vision import NeuronMllamaVisionModel # noqa: E402\n", 342 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/utils/accuracy.py:29: UserWarning: Intel extension for pytorch not found. 
For faster CPU references install `intel-extension-for-pytorch`.\n", 343 | " warnings.warn(\n", 344 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:745: UserWarning: Set seed for `privateuseone` device does not take effect, please add API's `_is_in_bad_fork` and `manual_seed_all` to `privateuseone` device module.\n", 345 | " return fn(*args, **kwargs)\n", 346 | "Loading configs...\n", 347 | "WARNING:root:NeuronConfig init: Unexpected keyword arguments: {'model_type': 'qwen2', 'task_type': 'causal-lm', 'model_path': '/home/ubuntu/model_hf_qwen/qwen2', 'compiled_model_path': '/home/ubuntu/traced_model_qwen/qwen2/logit', 'benchmark': True, 'check_accuracy_mode': , 'divergence_difference_tol': 0.001, 'prompts': ['To be, or not to be'], 'top_k': 1, 'top_p': 1.0, 'temperature': 1.0, 'do_sample': False, 'dynamic': False, 'pad_token_id': 151645, 'on_device_sampling': False, 'enable_torch_dist': False, 'enable_lora': False, 'max_loras': 1, 'max_lora_rank': 16, 'skip_warmup': False, 'skip_compile': False, 'compile_only': False, 'compile_dry_run': False, 'hlo_debug': False}\n", 348 | "\n", 349 | "Compiling and saving model...\n", 350 | "INFO:Neuron:Generating HLOs for the following models: ['context_encoding_model', 'token_generation_model']\n", 351 | "[2025-06-02 13:35:56.009: I neuronx_distributed/parallel_layers/parallel_state.py:592] > initializing tensor model parallel with size 8\n", 352 | "[2025-06-02 13:35:56.009: I neuronx_distributed/parallel_layers/parallel_state.py:593] > initializing pipeline model parallel with size 1\n", 353 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:594] > initializing context model parallel with size 1\n", 354 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:595] > initializing data parallel with size 1\n", 355 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:596] > initializing world size to 8\n", 356 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:343] [rank_0_pp-1_tp-1_dp-1_cp-1] Chosen Logic for replica groups ret_logic=, 'Ascending Ring PG Group')>\n", 357 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:632] [rank_0_pp-1_tp-1_dp-1_cp-1] tp_groups: replica_groups.tp_groups=[[0, 1, 2, 3, 4, 5, 6, 7]]\n", 358 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:633] [rank_0_pp-1_tp-1_dp-1_cp-1] dp_groups: replica_groups.dp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 359 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:634] [rank_0_pp-1_tp-1_dp-1_cp-1] pp_groups: replica_groups.pp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 360 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:635] [rank_0_pp-1_tp-1_dp-1_cp-1] cp_groups: replica_groups.cp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 361 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:636] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_model_groups: replica_groups.ep_model_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 362 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:637] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_data_groups: replica_groups.ep_data_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 363 | "INFO:Neuron:Generating 1 hlos for key: context_encoding_model\n", 364 | 
"INFO:Neuron:Started loading module context_encoding_model\n", 365 | "INFO:Neuron:Finished loading module context_encoding_model in 0.3605782985687256 seconds\n", 366 | "INFO:Neuron:generating HLO: context_encoding_model, input example shape = torch.Size([1, 16])\n", 367 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/parallel_layers/layers.py:478: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n", 368 | " with torch.cuda.amp.autocast(enabled=False):\n", 369 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=1, shape=torch.Size([1, 16]), dtype=torch.int32)\n", 370 | " warnings.warn(\n", 371 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=3, shape=torch.Size([1]), dtype=torch.int32)\n", 372 | " warnings.warn(\n", 373 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=4, shape=torch.Size([1, 3]), dtype=torch.float32)\n", 374 | " warnings.warn(\n", 375 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=5, shape=torch.Size([1]), dtype=torch.int32)\n", 376 | " warnings.warn(\n", 377 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=6, shape=torch.Size([1]), dtype=torch.int32)\n", 378 | " warnings.warn(\n", 379 | "INFO:Neuron:Finished generating HLO for context_encoding_model in 8.811824083328247 seconds, input example shape = torch.Size([1, 16])\n", 380 | "INFO:Neuron:Generating 1 hlos for key: token_generation_model\n", 381 | "INFO:Neuron:Started loading module token_generation_model\n", 382 | "INFO:Neuron:Finished loading module token_generation_model in 0.13971686363220215 seconds\n", 383 | "INFO:Neuron:generating HLO: token_generation_model, input example shape = torch.Size([1, 1])\n", 384 | "INFO:Neuron:Finished generating HLO for token_generation_model in 9.776893615722656 seconds, input example shape = torch.Size([1, 1])\n", 385 | "INFO:Neuron:Generated all HLOs in 19.276326656341553 seconds\n", 386 | "INFO:Neuron:Starting compilation for the priority HLO\n", 387 | "INFO:Neuron:'token_generation_model' is the priority model with bucket rank 0\n", 388 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py:283: SyntaxWarning: str format compiler_flags is discouraged as its handling involves repeated joining and splitting, which can easily make mistakes if something is quoted or escaped. Use list[str] instead. 
Refer to documentation of the Python subprocess module for details.\n", 389 | " warnings.warn(SyntaxWarning(\n", 390 | "2025-06-02 13:36:15.000516: 7289 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_9b906898286ddf239aa0+91ef39e9.hlo_module.pb --output /tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_9b906898286ddf239aa0+91ef39e9.neff --target=trn1 --auto-cast=none --model-type=transformer --tensorizer-options=--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2 --vectorize-strided-dma --lnc=1 --logfile=/tmp/nxd_model/token_generation_model/_tp0_bk0/log-neuron-cc.txt --enable-internal-neff-wrapper --verbose=35\n", 391 | ".........Completed run_backend_driver.\n", 392 | "\n", 393 | "Compiler status PASS\n", 394 | "INFO:Neuron:Done compilation for the priority HLO in 169.35613083839417 seconds\n", 395 | "INFO:Neuron:Updating the hlo module with optimized layout\n", 396 | "INFO:Neuron:Done optimizing weight layout for all HLOs in 0.3216278553009033 seconds\n", 397 | "INFO:Neuron:Starting compilation for all HLOs\n", 398 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py:245: SyntaxWarning: str format compiler_flags is discouraged as its handling involves repeated joining and splitting, which can easily make mistakes if something is quoted or escaped. Use list[str] instead. Refer to documentation of the Python subprocess module for details.\n", 399 | " warnings.warn(SyntaxWarning(\n", 400 | "2025-06-02 13:39:05.000174: 7289 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/nxd_model/context_encoding_model/_tp0_bk0/model.MODULE_d4332219e6ee5f826cce+d43b5474.hlo_module.pb --output /tmp/nxd_model/context_encoding_model/_tp0_bk0/model.MODULE_d4332219e6ee5f826cce+d43b5474.neff --target=trn1 --auto-cast=none --model-type=transformer --tensorizer-options=--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2 --vectorize-strided-dma --lnc=1 -O1 --internal-hlo2tensorizer-options= --modular-flow-mac-threshold=10 --logfile=/tmp/nxd_model/context_encoding_model/_tp0_bk0/log-neuron-cc.txt --verbose=35\n", 401 | ".Completed run_backend_driver.\n", 402 | "\n", 403 | "Compiler status PASS\n", 404 | "INFO:Neuron:Finished Compilation for all HLOs in 9.435595512390137 seconds\n", 405 | "......Completed run_backend_driver.\n", 406 | "\n", 407 | "Compiler status PASS\n", 408 | "INFO:Neuron:Done preparing weight layout transformation\n", 409 | "INFO:Neuron:Finished building model in 307.08067560195923 seconds\n", 410 | "INFO:Neuron:SKIPPING pre-sharding the checkpoints. 
The checkpoints will be sharded during load time.\n", 411 | "Compiling and tracing time: 307.11146965399985 seconds\n", 412 | "\n", 413 | "Loading model to Neuron...\n", 414 | "INFO:Neuron:Sharding weights on load...\n", 415 | "INFO:Neuron:Sharding Weights for ranks: 0...7\n", 416 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:592] > initializing tensor model parallel with size 8\n", 417 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:593] > initializing pipeline model parallel with size 1\n", 418 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:594] > initializing context model parallel with size 1\n", 419 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:595] > initializing data parallel with size 1\n", 420 | "[2025-06-02 13:41:03.158: I neuronx_distributed/parallel_layers/parallel_state.py:596] > initializing world size to 8\n", 421 | "[2025-06-02 13:41:03.158: I neuronx_distributed/parallel_layers/parallel_state.py:343] [rank_0_pp-1_tp-1_dp-1_cp-1] Chosen Logic for replica groups ret_logic=, 'Ascending Ring PG Group')>\n", 422 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:632] [rank_0_pp-1_tp-1_dp-1_cp-1] tp_groups: replica_groups.tp_groups=[[0, 1, 2, 3, 4, 5, 6, 7]]\n", 423 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:633] [rank_0_pp-1_tp-1_dp-1_cp-1] dp_groups: replica_groups.dp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 424 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:634] [rank_0_pp-1_tp-1_dp-1_cp-1] pp_groups: replica_groups.pp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 425 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:635] [rank_0_pp-1_tp-1_dp-1_cp-1] cp_groups: replica_groups.cp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 426 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:636] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_model_groups: replica_groups.ep_model_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 427 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:637] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_data_groups: replica_groups.ep_data_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 428 | "INFO:Neuron:Done Sharding weights in 3.519328597999902\n", 429 | "INFO:Neuron:Finished weights loading in 16.628388952000023 seconds\n", 430 | "INFO:Neuron:Warming up the model.\n", 431 | "2025-Jun-02 13:41:22.0009 7289:8468 [7] nccl_net_ofi_create_plugin:211 CCOM WARN NET/OFI Failed to initialize sendrecv protocol\n", 432 | "2025-Jun-02 13:41:22.0010 7289:8468 [7] nccl_net_ofi_create_plugin:334 CCOM WARN NET/OFI aws-ofi-nccl initialization failed\n", 433 | "2025-Jun-02 13:41:22.0011 7289:8468 [7] nccl_net_ofi_init:155 CCOM WARN NET/OFI Initializing plugin failed\n", 434 | "2025-Jun-02 13:41:22.0012 7289:8468 [7] net_plugin.cc:94 CCOM WARN OFI plugin initNet() failed is EFA enabled?\n", 435 | "INFO:Neuron:Warmup completed in 0.33977651596069336 seconds.\n", 436 | "Total model loading time: 19.222302051000042 seconds\n", 437 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:650: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `1` -- this flag is only used in sample-based generation modes. 
You should set `do_sample=True` or unset `top_k`.\n", 438 | " warnings.warn(\n", 439 | "\n", 440 | "Checking accuracy by logit matching\n", 441 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/utils/accuracy.py:363: UserWarning: input_len + num_tokens_to_check exceeds max_context_length. If output divergences at an index greater than max_context_length, a ValueError will occur because the next input len exceeds max_context_length. To avoid this, set num_tokens_to_check to a value of max_context_length - input_len or less.\n", 442 | " warnings.warn(\n", 443 | "Loading checkpoint shards: 100%|████████████████| 14/14 [00:08<00:00, 1.58it/s]\n", 444 | "From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.\n", 445 | "Expected Output: [\", that is the question. Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune\"] tensor([[ 11, 429, 374, 279, 3405, 13, 13139, 364, 83, 285,\n", 446 | " 13049, 1536, 304, 279, 3971, 311, 7676, 279, 1739, 819,\n", 447 | " 323, 36957, 315, 54488, 32315]])\n", 448 | "Expected Logits Shape: torch.Size([25, 1, 152064])\n", 449 | "Actual Output: [\", that is the question. Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune\"] tensor([[ 11, 429, 374, 279, 3405, 13, 13139, 364, 83, 285,\n", 450 | " 13049, 1536, 304, 279, 3971, 311, 7676, 279, 1739, 819,\n", 451 | " 323, 36957, 315, 54488, 32315]])\n", 452 | "Actual Logits Shape: torch.Size([25, 1, 152064])\n", 453 | "Passed logits validation!\n", 454 | "\n", 455 | "Generating outputs...\n", 456 | "Prompts: ['To be, or not to be']\n", 457 | "Generated outputs:\n", 458 | "Output 0: To be, or not to be, that is the question. 
Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune\n", 459 | "Starting end-to-end benchmark with 20\n", 460 | "Benchmark completed and its result is as following\n", 461 | "{\n", 462 | " \"e2e_model\": {\n", 463 | " \"latency_ms_p50\": 569.0377950668335,\n", 464 | " \"latency_ms_p90\": 570.0641632080078,\n", 465 | " \"latency_ms_p95\": 570.2431917190552,\n", 466 | " \"latency_ms_p99\": 570.8965921401978,\n", 467 | " \"latency_ms_p100\": 571.0599422454834,\n", 468 | " \"latency_ms_avg\": 569.459593296051,\n", 469 | " \"throughput\": 56.19362703995017\n", 470 | " },\n", 471 | " \"context_encoding_model\": {\n", 472 | " \"latency_ms_p50\": 41.747450828552246,\n", 473 | " \"latency_ms_p90\": 42.02606678009033,\n", 474 | " \"latency_ms_p95\": 42.056477069854736,\n", 475 | " \"latency_ms_p99\": 42.05883264541626,\n", 476 | " \"latency_ms_p100\": 42.05942153930664,\n", 477 | " \"latency_ms_avg\": 41.80266857147217,\n", 478 | " \"throughput\": 382.75068426897144\n", 479 | " },\n", 480 | " \"token_generation_model\": {\n", 481 | " \"latency_ms_p50\": 33.631086349487305,\n", 482 | " \"latency_ms_p90\": 33.74745845794678,\n", 483 | " \"latency_ms_p95\": 33.88720750808716,\n", 484 | " \"latency_ms_p99\": 34.08886194229126,\n", 485 | " \"latency_ms_p100\": 34.223079681396484,\n", 486 | " \"latency_ms_avg\": 33.66035064061483,\n", 487 | " \"throughput\": 31.68911334451813\n", 488 | " }\n", 489 | "}\n", 490 | "Completed saving result to benchmark_report.json\n" 491 | ] 492 | } 493 | ], 494 | "source": [ 495 | "!inference_demo \\\n", 496 | " --model-type qwen2 \\\n", 497 | " --task-type causal-lm \\\n", 498 | " run \\\n", 499 | " --model-path /home/ubuntu/model_hf_qwen/qwen2 \\\n", 500 | " --compiled-model-path /home/ubuntu/traced_model_qwen/qwen2/logit \\\n", 501 | " --torch-dtype bfloat16 \\\n", 502 | " --tp-degree 8 \\\n", 503 | " --batch-size 1 \\\n", 504 | " --max-context-length 16 \\\n", 505 | " --seq-len 32 \\\n", 506 | " --top-k 1 \\\n", 507 | " --pad-token-id 151645 \\\n", 508 | " --prompt \"To be, or not to be\" \\\n", 509 | " --check-accuracy-mode logit-matching \\\n", 510 | " --benchmark" 511 | ] 512 | } 513 | ], 514 | "metadata": { 515 | "kernelspec": { 516 | "display_name": "aws_neuronx_venv_pytorch_2_6_nxd_inference", 517 | "language": "python", 518 | "name": "python3" 519 | }, 520 | "language_info": { 521 | "codemirror_mode": { 522 | "name": "ipython", 523 | "version": 3 524 | }, 525 | "file_extension": ".py", 526 | "mimetype": "text/x-python", 527 | "name": "python", 528 | "nbconvert_exporter": "python", 529 | "pygments_lexer": "ipython3", 530 | "version": "3.10.12" 531 | } 532 | }, 533 | "nbformat": 4, 534 | "nbformat_minor": 2 535 | } 536 | -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/Finetune-TinyLlama-1.1B.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "37be34fc-0fa9-4811-865c-a3fdc38d38e8", 6 | "metadata": {}, 7 | "source": [ 8 | "# Fine-tune TinyLlama-1.1B for text-to-SQL generation\n", 9 | "\n", 10 | "## Introduction\n", 11 | "\n", 12 | "In this workshop module, you will learn how to fine-tune a Llama-based LLM ([TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)) using causal language modelling so that the model learns how to generate SQL queries for text-based instructions. 
Your fine-tuning job will be launched using SageMaker Training which provides a serverless training environment where you do not need to manage the underlying infrastructure. You will learn how to configure a PyTorch training job using [SageMaker's PyTorch estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html), and how to leverage the [Hugging Face Optimum Neuron](https://github.com/huggingface/optimum-neuron) package to easily run the PyTorch training job with AWS Trainium accelerators via an [AWS EC2 trn1.2xlarge instance](https://aws.amazon.com/ec2/instance-types/trn1/).\n", 13 | "\n", 14 | "For this module, you will be using the [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset which consists of thousands of examples of SQL schemas, questions about the schemas, and SQL queries intended to answer the questions.\n", 15 | "\n", 16 | "*Dataset example 1:*\n", 17 | "* *SQL schema/context:* `CREATE TABLE management (department_id VARCHAR); CREATE TABLE department (department_id VARCHAR)`\n", 18 | "* *Question:* `How many departments are led by heads who are not mentioned?`\n", 19 | "* *SQL query/answer:* `SELECT COUNT(*) FROM department WHERE NOT department_id IN (SELECT department_id FROM management)`\n", 20 | "\n", 21 | "*Dataset example 2:*\n", 22 | "* *SQL schema/context:* `CREATE TABLE courses (course_name VARCHAR, course_id VARCHAR); CREATE TABLE student_course_registrations (student_id VARCHAR, course_id VARCHAR)`\n", 23 | "* *Question:* `What are the ids of all students for courses and what are the names of those courses?`\n", 24 | "* *SQL query/answer:* `SELECT T1.student_id, T2.course_name FROM student_course_registrations AS T1 JOIN courses AS T2 ON T1.course_id = T2.course_id`\n", 25 | "\n", 26 | "By fine-tuning the model over several thousand of these text-to-SQL examples, the model will then learn how to generate an appropriate SQL query when presented with a SQL context and a free-form question.\n", 27 | "\n", 28 | "This text-to-SQL use case was selected so you can successfully fine-tune your model in a reasonably short amount of time (~20 minutes) which is appropriate for this 1hr workshop. Although this is a relatively simple use case, please keep in mind that the same techniques and components used in this module can also be applied to fine-tune LLMs for more advanced use cases such as writing code, summarizing documents, creating blog posts - the possibilities are endless!" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "866074ee-c300-4793-8e63-adbcfc314ad8", 34 | "metadata": { 35 | "tags": [] 36 | }, 37 | "source": [ 38 | "## Prerequisites\n", 39 | "\n", 40 | "This notebook uses the SageMaker Python SDK to prepare, launch, and monitor the progress of a PyTorch-based training job. Before we get started, it is important to upgrade the SageMaker SDK to ensure that you are using the latest version. Run the next two cells to upgrade the SageMaker SDK and set up your session." 
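Before moving on, it may help to make the fine-tuning data concrete. The snippet below is an illustrative sketch only: the field names follow the b-mc2/sql-create-context dataset, but the exact prompt template actually used during training is defined in `assets/finetune_llama.py`, not here.

```python
# Illustrative sketch: render one text-to-SQL record (taken from the dataset
# examples above) into a single training prompt. The real template used by
# finetune_llama.py may differ.
example = {
    "context": "CREATE TABLE management (department_id VARCHAR); "
               "CREATE TABLE department (department_id VARCHAR)",
    "question": "How many departments are led by heads who are not mentioned?",
    "answer": "SELECT COUNT(*) FROM department WHERE NOT department_id IN "
              "(SELECT department_id FROM management)",
}

prompt = (
    f"Given the following SQL schema:\n{example['context']}\n\n"
    f"Write a SQL query that answers this question: {example['question']}\n\n"
    f"SQL: {example['answer']}"
)
print(prompt)
```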
41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "id": "3264aae2-1f18-4b59-a92c-2f169903c202", 47 | "metadata": { 48 | "tags": [] 49 | }, 50 | "outputs": [ 51 | { 52 | "name": "stdout", 53 | "output_type": "stream", 54 | "text": [ 55 | "\n", 56 | "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", 57 | "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", 58 | "Note: you may need to restart the kernel to use updated packages.\n" 59 | ] 60 | } 61 | ], 62 | "source": [ 63 | "# Upgrade SageMaker SDK to the latest version\n", 64 | "%pip install -U sagemaker awscli -q 2>&1 | grep -v \"warnings/venv\"" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "id": "9b5ed574-6db5-471b-8515-c0f6189e653e", 71 | "metadata": { 72 | "tags": [] 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "import logging\n", 77 | "sagemaker_config_logger = logging.getLogger(\"sagemaker.config\")\n", 78 | "sagemaker_config_logger.setLevel(logging.WARNING)\n", 79 | "\n", 80 | "# Import SageMaker SDK, setup our session\n", 81 | "from sagemaker import get_execution_role, Session\n", 82 | "from sagemaker.pytorch import PyTorch\n", 83 | "import boto3\n", 84 | "\n", 85 | "region_name=\"us-east-2\" #this is hard coded to a specific region because of Workshop quotas. You could use sess.boto_region_name\n", 86 | "sess = Session(boto_session=boto3.Session(region_name=region_name))\n", 87 | "default_bucket = sess.default_bucket()\n" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "id": "2ce630d1", 93 | "metadata": {}, 94 | "source": [ 95 | "This next command just configures the EC2 instance (in us-west-2) to have a default region of us-east-2. This is specific to the environment in AWS Workshop Studio." 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "id": "5542b3d1", 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "!aws configure set region us-east-2" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "id": "4193108b-25fb-4d3e-85db-c66b8c04c251", 111 | "metadata": {}, 112 | "source": [ 113 | "## Specify the Optimum Neuron deep learning container (DLC) image\n", 114 | "\n", 115 | "The SageMaker Training service uses containers to execute your training script, allowing you to fully customize your training script environment and any required dependencies. For this workshop, you will use a recent Pytorch Training deep learning container (DLC) image which is an AWS-maintained image containing the Neuron SDK and PyTorch. The Optimum-Neuron library is installed with the requirements.txt file in the assets directory." 
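For context on the note above: when a `requirements.txt` is present in the estimator's `source_dir` (here, `./assets`), the SageMaker framework container pip-installs it before launching the entry point. The snippet below only illustrates that effect; the version pin is taken from the comment in the next cell, and the shipped `assets/requirements.txt` remains the source of truth.

```python
# Illustrative only: roughly what the training container does at startup for
# each entry in assets/requirements.txt before finetune_llama.py runs.
import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install", "optimum-neuron==0.0.27"])
```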
116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "id": "247ad886-6977-4295-947b-86d4892b48bd", 122 | "metadata": { 123 | "tags": [] 124 | }, 125 | "outputs": [ 126 | { 127 | "name": "stdout", 128 | "output_type": "stream", 129 | "text": [ 130 | "763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training-neuronx:2.5.1-neuronx-py310-sdk2.22.0-ubuntu22.04\n" 131 | ] 132 | } 133 | ], 134 | "source": [ 135 | "# Specify the Neuron DLC that we will use for training\n", 136 | "# For now, we'll use the standard Neuron DLC and install Optimum Neuron v0.0.27 at training time because we want to use a later SDK \n", 137 | "# You can see more about the images here: https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx\n", 138 | "\n", 139 | "training_image = f\"763104351884.dkr.ecr.{sess.boto_region_name}.amazonaws.com/pytorch-training-neuronx:2.5.1-neuronx-py310-sdk2.22.0-ubuntu22.04\"\n", 140 | "print(training_image)" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "id": "8a8802bc-657a-419d-b86d-eb8af5eff90e", 146 | "metadata": { 147 | "tags": [] 148 | }, 149 | "source": [ 150 | "## Configure the PyTorch Estimator\n", 151 | "\n", 152 | "The SageMaker SDK includes a [PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html) class which you can use to define a PyTorch training job that will be executed in the SageMaker managed environment. \n", 153 | "\n", 154 | "In the following cell, you will create a PyTorch Estimator which will run the attached `finetune_llama.py` training script on an ml.trn1.2xlarge instance. The `finetune_llama.py` script is an Optimum Neuron training script that can be used for causal language modelling with AWS Trainium. The scripts will be downloaded as the instance is brought up, and the scripts will download the model and the datasets onto the SageMaker training instance.\n", 155 | "\n", 156 | "The PyTorch Estimator has many parameters that can be used to configure your training job. A few of the most important parameters include:\n", 157 | "\n", 158 | "- *entry_point*: refers to the name of the training script that will be executed as part of this training job\n", 159 | "- *source_dir*: the path to the local source code directory (relative to your notebook) that will be packaged up and included inside your training container\n", 160 | "- *instance_count*: defines how many EC2 instances to use for this training job\n", 161 | "- *instance_type*: determines which type of EC2 instance will be used for training\n", 162 | "- *image_uri*: defines which training DLC will be used to run the training job (see Neuron DLC, above)\n", 163 | "- *distribution*: determines which type of distribution to use for the training job - you will need 'torch_distributed' for this workshop\n", 164 | "- *environment*: provides a dictionary of environment variables which will be applied to your training environment\n", 165 | "- *hyperparameters*: provides a dictionary of command-line arguments to pass to your training script, ex: finetune_llama.py\n", 166 | "\n", 167 | "In the `hyperparameters` section, you can see the specific command-line arguments that are used to control the behavior of the `finetune_llama.py` training script. 
Notably:\n",
168 |     "- *model_id*: specifies which model you will be fine-tuning, in this case a recent checkpoint from the TinyLlama-1.1B project\n",
169 |     "- *tokenizer_id*: specifies which tokenizer you will use to tokenize the dataset examples during training\n",
170 |     "- *output_dir*: directory in which the fine-tuned model will be saved. Here we use the SageMaker-specific `/opt/ml/model` directory. At the end of the training job, SageMaker automatically copies the contents of this directory to the output S3 bucket\n",
171 |     "- *tensor_parallel_size*: the tensor parallel degree to use for training. In this case we use '2' to shard the model across the 2 NeuronCores available in the trn1.2xlarge instance\n",
172 |     "- *bf16*: request BFloat16 training\n",
173 |     "- *per_device_train_batch_size*: the microbatch size to be used for fine-tuning\n",
174 |     "- *gradient_accumulation_steps*: the number of steps over which gradients are accumulated between updates\n",
175 |     "- *max_steps*: the maximum number of steps of fine-tuning that we want to perform\n",
176 |     "- *lora_r*, *lora_alpha*, *lora_dropout*: the LoRA rank, alpha, and dropout values to use during fine-tuning\n",
177 |     "\n",
178 |     "The estimator below has been pre-configured for you, so you do not need to make any changes."
179 |    ]
180 |   },
181 |   {
182 |    "cell_type": "code",
183 |    "execution_count": 8,
184 |    "id": "9e28014c-4d0b-452b-9bde-44aa10e61bb6",
185 |    "metadata": {
186 |     "tags": []
187 |    },
188 |    "outputs": [],
189 |    "source": [
190 |     "# Set up the PyTorch estimator\n",
191 |     "# Note that the hyperparameters are just command-line args passed to the finetune_llama.py script to control its behavior\n",
192 |     "\n",
193 |     "pt_estimator = PyTorch(\n",
194 |     "    entry_point=\"finetune_llama.py\",\n",
195 |     "    source_dir=\"./assets\",\n",
196 |     "    role=get_execution_role(),\n",
197 |     "    instance_count=1,\n",
198 |     "    instance_type=\"ml.trn1.2xlarge\",\n",
199 |     "    disable_profiler=True,\n",
200 |     "    output_path=f\"s3://{default_bucket}/neuron_events2025\",\n",
201 |     "    base_job_name=\"trn1-tinyllama\",\n",
202 |     "    sagemaker_session=sess,\n",
203 |     "    code_bucket=f\"s3://{default_bucket}/neuron_events2025_code\",\n",
204 |     "    checkpoint_s3_uri=f\"s3://{default_bucket}/neuron_events_output\",\n",
205 |     "    image_uri=training_image,\n",
206 |     "    distribution={\"torch_distributed\": {\"enabled\": True}},\n",
207 |     "    environment={\"FI_EFA_FORK_SAFE\": \"1\", \"WANDB_DISABLED\": \"true\"},\n",
208 |     "    disable_output_compression=True,\n",
209 |     "    hyperparameters={\n",
210 |     "        \"model_id\": \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n",
211 |     "        \"tokenizer_id\": \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n",
212 |     "        \"output_dir\": \"/opt/ml/model\",\n",
213 |     "        \"tensor_parallel_size\": 2,\n",
214 |     "        \"bf16\": True,\n",
215 |     "        \"per_device_train_batch_size\": 2,\n",
216 |     "        \"gradient_accumulation_steps\": 1,\n",
217 |     "        \"gradient_checkpointing\": True,\n",
218 |     "        \"max_steps\": 1000,\n",
219 |     "        \"lora_r\": 16,\n",
220 |     "        \"lora_alpha\": 32,\n",
221 |     "        \"lora_dropout\": 0.05,\n",
222 |     "        \"logging_steps\": 10,\n",
223 |     "        \"learning_rate\": 5e-5,\n",
224 |     "        \"dataloader_drop_last\": True,\n",
225 |     "        \"disable_tqdm\": True\n",
226 |     "    }\n",
227 |     "    )"
228 |    ]
229 |   },
230 |   {
231 |    "cell_type": "markdown",
232 |    "id": "2278940b-f563-4582-9df0-bd56d9b5fd28",
233 |    "metadata": {},
234 |    "source": [
235 |     "## Launch the training job\n",
236 |     "\n",
237 |     "Once the estimator has been created, you can then launch your training job by 
calling `.fit()` on the estimator:" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 9, 243 | "id": "b7829c64-0190-43c3-be1a-0ccce7d45248", 244 | "metadata": { 245 | "tags": [] 246 | }, 247 | "outputs": [ 248 | { 249 | "name": "stderr", 250 | "output_type": "stream", 251 | "text": [ 252 | "INFO:sagemaker:Creating training-job with name: trn1-tinyllama-2025-05-13-00-40-31-750\n" 253 | ] 254 | } 255 | ], 256 | "source": [ 257 | "# Call fit() on the estimator to initiate the training job\n", 258 | "pt_estimator.fit(wait=False, logs=False)" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "id": "b77434b2-94d7-4256-8d0b-d5d2ddb1d5ae", 264 | "metadata": {}, 265 | "source": [ 266 | "## Monitor the training job\n", 267 | "\n", 268 | "When the training job has been launched, the SageMaker Training service will then take care of:\n", 269 | "- launching and configuring the requested EC2 infrastructure for your training job\n", 270 | "- launching the requested container image on each of the EC2 instances\n", 271 | "- copying your source code directory and running your training script within the container(s)\n", 272 | "- storing your trained model artifacts in Amazon Simple Storage Service (S3)\n", 273 | "- decommissioning the training infrastructure\n", 274 | "\n", 275 | "While the training job is running, the following cell will periodically check and output the job status. When you see 'Completed', you know that your training job is finished and you can proceed to the remainder of the notebook. The training job typically takes about 20 minutes to complete.\n", 276 | "\n", 277 | "If you are interested in viewing the output logs from your training job, you can view the logs by navigating to the AWS CloudWatch console, selecting `Logs -> Log Groups` in the left-hand menu, and then looking for your SageMaker training job in the list. **Note:** it will usually take 4-5 minutes before the infrastructure is running and the output logs begin to be populated in CloudWatch." 
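If you prefer to pull the logs into this notebook rather than opening the CloudWatch console, a minimal sketch along these lines can work once the job has started emitting logs. It assumes the standard `/aws/sagemaker/TrainingJobs` log group and reuses the `pt_estimator` and `region_name` variables defined earlier; it is not part of the workshop assets.

```python
# Minimal sketch: print the CloudWatch log events emitted so far by the training job.
# Run this only after the job has been in progress for a few minutes, once log
# streams exist.
import boto3

job_name = pt_estimator.jobs[-1].describe()["TrainingJobName"]
logs_client = boto3.client("logs", region_name=region_name)
log_group = "/aws/sagemaker/TrainingJobs"

for stream in logs_client.describe_log_streams(
    logGroupName=log_group, logStreamNamePrefix=job_name
)["logStreams"]:
    events = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=stream["logStreamName"],
        startFromHead=True,
    )
    for event in events["events"]:
        print(event["message"])
```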
278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 10, 283 | "id": "0c223037-2f8e-4eb0-9e4b-ff4dac6ede7a", 284 | "metadata": { 285 | "tags": [] 286 | }, 287 | "outputs": [ 288 | { 289 | "name": "stdout", 290 | "output_type": "stream", 291 | "text": [ 292 | "2025-05-13T00:40:37.718399 Training job status: InProgress!\n", 293 | "2025-05-13T00:41:07.827456 Training job status: InProgress!\n", 294 | "2025-05-13T00:41:37.941892 Training job status: InProgress!\n", 295 | "2025-05-13T00:42:08.055514 Training job status: InProgress!\n", 296 | "2025-05-13T00:42:38.170184 Training job status: InProgress!\n", 297 | "2025-05-13T00:43:08.285526 Training job status: InProgress!\n", 298 | "2025-05-13T00:43:38.401669 Training job status: InProgress!\n", 299 | "2025-05-13T00:44:08.517601 Training job status: InProgress!\n", 300 | "2025-05-13T00:44:38.607279 Training job status: InProgress!\n", 301 | "2025-05-13T00:45:08.901240 Training job status: InProgress!\n", 302 | "2025-05-13T00:45:39.029987 Training job status: InProgress!\n", 303 | "2025-05-13T00:46:09.148483 Training job status: InProgress!\n", 304 | "2025-05-13T00:46:39.262424 Training job status: InProgress!\n", 305 | "2025-05-13T00:47:09.378729 Training job status: InProgress!\n", 306 | "2025-05-13T00:47:39.477011 Training job status: InProgress!\n", 307 | "2025-05-13T00:48:09.589262 Training job status: InProgress!\n", 308 | "2025-05-13T00:48:39.715998 Training job status: InProgress!\n", 309 | "2025-05-13T00:49:09.833712 Training job status: InProgress!\n", 310 | "2025-05-13T00:49:40.132350 Training job status: InProgress!\n", 311 | "2025-05-13T00:50:10.259671 Training job status: InProgress!\n", 312 | "2025-05-13T00:50:40.376526 Training job status: InProgress!\n", 313 | "2025-05-13T00:51:10.492630 Training job status: InProgress!\n", 314 | "2025-05-13T00:51:40.612684 Training job status: InProgress!\n", 315 | "2025-05-13T00:52:10.735871 Training job status: InProgress!\n", 316 | "2025-05-13T00:52:40.856541 Training job status: InProgress!\n", 317 | "2025-05-13T00:53:10.978185 Training job status: InProgress!\n", 318 | "2025-05-13T00:53:41.102406 Training job status: InProgress!\n", 319 | "2025-05-13T00:54:11.391318 Training job status: InProgress!\n", 320 | "2025-05-13T00:54:41.506542 Training job status: InProgress!\n", 321 | "2025-05-13T00:55:11.619419 Training job status: InProgress!\n", 322 | "2025-05-13T00:55:41.736144 Training job status: InProgress!\n", 323 | "2025-05-13T00:56:11.850643 Training job status: InProgress!\n", 324 | "2025-05-13T00:56:41.965740 Training job status: InProgress!\n", 325 | "2025-05-13T00:57:12.082235 Training job status: InProgress!\n", 326 | "2025-05-13T00:57:42.193146 Training job status: InProgress!\n", 327 | "2025-05-13T00:58:12.309523 Training job status: InProgress!\n", 328 | "2025-05-13T00:58:42.596288 Training job status: InProgress!\n", 329 | "2025-05-13T00:59:12.715701 Training job status: InProgress!\n", 330 | "2025-05-13T00:59:42.835134 Training job status: InProgress!\n", 331 | "2025-05-13T01:00:12.952002 Training job status: InProgress!\n", 332 | "2025-05-13T01:00:43.070275 Training job status: InProgress!\n", 333 | "2025-05-13T01:01:13.187416 Training job status: InProgress!\n", 334 | "2025-05-13T01:01:43.291955 Training job status: InProgress!\n", 335 | "\n", 336 | "2025-05-13T01:02:13.412501 Training job status: Completed!\n" 337 | ] 338 | } 339 | ], 340 | "source": [ 341 | "# Periodically check job status until it shows 'Completed' (ETA ~20 minutes)\n", 342 | "# You 
can also monitor job status in the SageMaker console, and view the\n", 343 | "# SageMaker Training job logs in the CloudWatch console\n", 344 | "from time import sleep\n", 345 | "from datetime import datetime\n", 346 | "\n", 347 | "while (job_status := pt_estimator.jobs[-1].describe()['TrainingJobStatus']) not in ['Completed', 'Error', 'Failed']:\n", 348 | " print(f\"{datetime.now().isoformat()} Training job status: {job_status}!\")\n", 349 | " sleep(30)\n", 350 | "\n", 351 | "print(f\"\\n{datetime.now().isoformat()} Training job status: {job_status}!\")" 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "id": "16c94343-b0c6-4903-82cc-c8ab2f88b26b", 357 | "metadata": {}, 358 | "source": [ 359 | "## Determine location of fine-tuned model artifacts\n", 360 | "\n", 361 | "Once the training job has completed, SageMaker will copy your fine-tuned model artifacts to a specified location in S3.\n", 362 | "\n", 363 | "In the following cell, you can see how to programmatically determine the location of your model artifacts:" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 11, 369 | "id": "213af977-8ed6-4081-af65-59c70db2dbfb", 370 | "metadata": { 371 | "tags": [] 372 | }, 373 | "outputs": [ 374 | { 375 | "name": "stdout", 376 | "output_type": "stream", 377 | "text": [ 378 | "Your fine-tuned model is available here:\n", 379 | "\n", 380 | "s3://this.output.should.be.replaced.with.a.real.s3.path.once.the.cell.is.executed/\n" 381 | ] 382 | } 383 | ], 384 | "source": [ 385 | "# Show where the fine-tuned model is stored - previous job must be 'Completed' before running this cell\n", 386 | "model_archive_path = pt_estimator.jobs[-1].describe()['ModelArtifacts']['S3ModelArtifacts']\n", 387 | "print(f\"Your fine-tuned model is available here:\\n\\n{model_archive_path}/\")" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "id": "b68f529f-a548-4fbd-b160-3cab5f52c488", 393 | "metadata": {}, 394 | "source": [ 395 | "
\n", 396 | "\n", 397 | "**Note:** Please copy the above S3 path, as it will be required in the subsequent workshop module.\n", 398 | "\n", 399 | "\n", 400 | "Lastly, run the following cell to list the model artifacts available in your S3 model_archive_path:" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 12, 406 | "id": "27ad8c7e-6a73-4f20-944f-ac12ef286a6f", 407 | "metadata": { 408 | "tags": [] 409 | }, 410 | "outputs": [ 411 | { 412 | "name": "stdout", 413 | "output_type": "stream", 414 | "text": [ 415 | "2025-05-13 01:01:39 714 config.json\n", 416 | "2025-05-13 01:01:48 124 generation_config.json\n", 417 | "2025-05-13 01:01:40 4400216536 model.safetensors\n", 418 | "2025-05-13 01:01:47 551 special_tokens_map.json\n", 419 | "2025-05-13 01:01:47 1842795 tokenizer.json\n", 420 | "2025-05-13 01:01:39 499723 tokenizer.model\n", 421 | "2025-05-13 01:01:48 1368 tokenizer_config.json\n" 422 | ] 423 | } 424 | ], 425 | "source": [ 426 | "# View the contents of the fine-tuned model path in S3\n", 427 | "!aws s3 ls {model_archive_path}/merged_model/" 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "id": "fca9ffa7-a694-48c0-acde-cd468d18a448", 433 | "metadata": {}, 434 | "source": [ 435 | "Congratulations on completing the LLM fine-tuning module!\n", 436 | "\n", 437 | "In the next notebook, you will learn how to deploy your fine-tuned model in a SageMaker hosted endpoint, and leverage AWS Inferentia accelerators to perform model inference. Have fun!" 438 | ] 439 | } 440 | ], 441 | "metadata": { 442 | "availableInstances": [ 443 | { 444 | "_defaultOrder": 0, 445 | "_isFastLaunch": true, 446 | "category": "General purpose", 447 | "gpuNum": 0, 448 | "hideHardwareSpecs": false, 449 | "memoryGiB": 4, 450 | "name": "ml.t3.medium", 451 | "vcpuNum": 2 452 | }, 453 | { 454 | "_defaultOrder": 1, 455 | "_isFastLaunch": false, 456 | "category": "General purpose", 457 | "gpuNum": 0, 458 | "hideHardwareSpecs": false, 459 | "memoryGiB": 8, 460 | "name": "ml.t3.large", 461 | "vcpuNum": 2 462 | }, 463 | { 464 | "_defaultOrder": 2, 465 | "_isFastLaunch": false, 466 | "category": "General purpose", 467 | "gpuNum": 0, 468 | "hideHardwareSpecs": false, 469 | "memoryGiB": 16, 470 | "name": "ml.t3.xlarge", 471 | "vcpuNum": 4 472 | }, 473 | { 474 | "_defaultOrder": 3, 475 | "_isFastLaunch": false, 476 | "category": "General purpose", 477 | "gpuNum": 0, 478 | "hideHardwareSpecs": false, 479 | "memoryGiB": 32, 480 | "name": "ml.t3.2xlarge", 481 | "vcpuNum": 8 482 | }, 483 | { 484 | "_defaultOrder": 4, 485 | "_isFastLaunch": true, 486 | "category": "General purpose", 487 | "gpuNum": 0, 488 | "hideHardwareSpecs": false, 489 | "memoryGiB": 8, 490 | "name": "ml.m5.large", 491 | "vcpuNum": 2 492 | }, 493 | { 494 | "_defaultOrder": 5, 495 | "_isFastLaunch": false, 496 | "category": "General purpose", 497 | "gpuNum": 0, 498 | "hideHardwareSpecs": false, 499 | "memoryGiB": 16, 500 | "name": "ml.m5.xlarge", 501 | "vcpuNum": 4 502 | }, 503 | { 504 | "_defaultOrder": 6, 505 | "_isFastLaunch": false, 506 | "category": "General purpose", 507 | "gpuNum": 0, 508 | "hideHardwareSpecs": false, 509 | "memoryGiB": 32, 510 | "name": "ml.m5.2xlarge", 511 | "vcpuNum": 8 512 | }, 513 | { 514 | "_defaultOrder": 7, 515 | "_isFastLaunch": false, 516 | "category": "General purpose", 517 | "gpuNum": 0, 518 | "hideHardwareSpecs": false, 519 | "memoryGiB": 64, 520 | "name": "ml.m5.4xlarge", 521 | "vcpuNum": 16 522 | }, 523 | { 524 | "_defaultOrder": 8, 525 | "_isFastLaunch": false, 526 | "category": "General 
purpose", 527 | "gpuNum": 0, 528 | "hideHardwareSpecs": false, 529 | "memoryGiB": 128, 530 | "name": "ml.m5.8xlarge", 531 | "vcpuNum": 32 532 | }, 533 | { 534 | "_defaultOrder": 9, 535 | "_isFastLaunch": false, 536 | "category": "General purpose", 537 | "gpuNum": 0, 538 | "hideHardwareSpecs": false, 539 | "memoryGiB": 192, 540 | "name": "ml.m5.12xlarge", 541 | "vcpuNum": 48 542 | }, 543 | { 544 | "_defaultOrder": 10, 545 | "_isFastLaunch": false, 546 | "category": "General purpose", 547 | "gpuNum": 0, 548 | "hideHardwareSpecs": false, 549 | "memoryGiB": 256, 550 | "name": "ml.m5.16xlarge", 551 | "vcpuNum": 64 552 | }, 553 | { 554 | "_defaultOrder": 11, 555 | "_isFastLaunch": false, 556 | "category": "General purpose", 557 | "gpuNum": 0, 558 | "hideHardwareSpecs": false, 559 | "memoryGiB": 384, 560 | "name": "ml.m5.24xlarge", 561 | "vcpuNum": 96 562 | }, 563 | { 564 | "_defaultOrder": 12, 565 | "_isFastLaunch": false, 566 | "category": "General purpose", 567 | "gpuNum": 0, 568 | "hideHardwareSpecs": false, 569 | "memoryGiB": 8, 570 | "name": "ml.m5d.large", 571 | "vcpuNum": 2 572 | }, 573 | { 574 | "_defaultOrder": 13, 575 | "_isFastLaunch": false, 576 | "category": "General purpose", 577 | "gpuNum": 0, 578 | "hideHardwareSpecs": false, 579 | "memoryGiB": 16, 580 | "name": "ml.m5d.xlarge", 581 | "vcpuNum": 4 582 | }, 583 | { 584 | "_defaultOrder": 14, 585 | "_isFastLaunch": false, 586 | "category": "General purpose", 587 | "gpuNum": 0, 588 | "hideHardwareSpecs": false, 589 | "memoryGiB": 32, 590 | "name": "ml.m5d.2xlarge", 591 | "vcpuNum": 8 592 | }, 593 | { 594 | "_defaultOrder": 15, 595 | "_isFastLaunch": false, 596 | "category": "General purpose", 597 | "gpuNum": 0, 598 | "hideHardwareSpecs": false, 599 | "memoryGiB": 64, 600 | "name": "ml.m5d.4xlarge", 601 | "vcpuNum": 16 602 | }, 603 | { 604 | "_defaultOrder": 16, 605 | "_isFastLaunch": false, 606 | "category": "General purpose", 607 | "gpuNum": 0, 608 | "hideHardwareSpecs": false, 609 | "memoryGiB": 128, 610 | "name": "ml.m5d.8xlarge", 611 | "vcpuNum": 32 612 | }, 613 | { 614 | "_defaultOrder": 17, 615 | "_isFastLaunch": false, 616 | "category": "General purpose", 617 | "gpuNum": 0, 618 | "hideHardwareSpecs": false, 619 | "memoryGiB": 192, 620 | "name": "ml.m5d.12xlarge", 621 | "vcpuNum": 48 622 | }, 623 | { 624 | "_defaultOrder": 18, 625 | "_isFastLaunch": false, 626 | "category": "General purpose", 627 | "gpuNum": 0, 628 | "hideHardwareSpecs": false, 629 | "memoryGiB": 256, 630 | "name": "ml.m5d.16xlarge", 631 | "vcpuNum": 64 632 | }, 633 | { 634 | "_defaultOrder": 19, 635 | "_isFastLaunch": false, 636 | "category": "General purpose", 637 | "gpuNum": 0, 638 | "hideHardwareSpecs": false, 639 | "memoryGiB": 384, 640 | "name": "ml.m5d.24xlarge", 641 | "vcpuNum": 96 642 | }, 643 | { 644 | "_defaultOrder": 20, 645 | "_isFastLaunch": false, 646 | "category": "General purpose", 647 | "gpuNum": 0, 648 | "hideHardwareSpecs": true, 649 | "memoryGiB": 0, 650 | "name": "ml.geospatial.interactive", 651 | "supportedImageNames": [ 652 | "sagemaker-geospatial-v1-0" 653 | ], 654 | "vcpuNum": 0 655 | }, 656 | { 657 | "_defaultOrder": 21, 658 | "_isFastLaunch": true, 659 | "category": "Compute optimized", 660 | "gpuNum": 0, 661 | "hideHardwareSpecs": false, 662 | "memoryGiB": 4, 663 | "name": "ml.c5.large", 664 | "vcpuNum": 2 665 | }, 666 | { 667 | "_defaultOrder": 22, 668 | "_isFastLaunch": false, 669 | "category": "Compute optimized", 670 | "gpuNum": 0, 671 | "hideHardwareSpecs": false, 672 | "memoryGiB": 8, 673 | "name": "ml.c5.xlarge", 674 | 
"vcpuNum": 4 675 | }, 676 | { 677 | "_defaultOrder": 23, 678 | "_isFastLaunch": false, 679 | "category": "Compute optimized", 680 | "gpuNum": 0, 681 | "hideHardwareSpecs": false, 682 | "memoryGiB": 16, 683 | "name": "ml.c5.2xlarge", 684 | "vcpuNum": 8 685 | }, 686 | { 687 | "_defaultOrder": 24, 688 | "_isFastLaunch": false, 689 | "category": "Compute optimized", 690 | "gpuNum": 0, 691 | "hideHardwareSpecs": false, 692 | "memoryGiB": 32, 693 | "name": "ml.c5.4xlarge", 694 | "vcpuNum": 16 695 | }, 696 | { 697 | "_defaultOrder": 25, 698 | "_isFastLaunch": false, 699 | "category": "Compute optimized", 700 | "gpuNum": 0, 701 | "hideHardwareSpecs": false, 702 | "memoryGiB": 72, 703 | "name": "ml.c5.9xlarge", 704 | "vcpuNum": 36 705 | }, 706 | { 707 | "_defaultOrder": 26, 708 | "_isFastLaunch": false, 709 | "category": "Compute optimized", 710 | "gpuNum": 0, 711 | "hideHardwareSpecs": false, 712 | "memoryGiB": 96, 713 | "name": "ml.c5.12xlarge", 714 | "vcpuNum": 48 715 | }, 716 | { 717 | "_defaultOrder": 27, 718 | "_isFastLaunch": false, 719 | "category": "Compute optimized", 720 | "gpuNum": 0, 721 | "hideHardwareSpecs": false, 722 | "memoryGiB": 144, 723 | "name": "ml.c5.18xlarge", 724 | "vcpuNum": 72 725 | }, 726 | { 727 | "_defaultOrder": 28, 728 | "_isFastLaunch": false, 729 | "category": "Compute optimized", 730 | "gpuNum": 0, 731 | "hideHardwareSpecs": false, 732 | "memoryGiB": 192, 733 | "name": "ml.c5.24xlarge", 734 | "vcpuNum": 96 735 | }, 736 | { 737 | "_defaultOrder": 29, 738 | "_isFastLaunch": true, 739 | "category": "Accelerated computing", 740 | "gpuNum": 1, 741 | "hideHardwareSpecs": false, 742 | "memoryGiB": 16, 743 | "name": "ml.g4dn.xlarge", 744 | "vcpuNum": 4 745 | }, 746 | { 747 | "_defaultOrder": 30, 748 | "_isFastLaunch": false, 749 | "category": "Accelerated computing", 750 | "gpuNum": 1, 751 | "hideHardwareSpecs": false, 752 | "memoryGiB": 32, 753 | "name": "ml.g4dn.2xlarge", 754 | "vcpuNum": 8 755 | }, 756 | { 757 | "_defaultOrder": 31, 758 | "_isFastLaunch": false, 759 | "category": "Accelerated computing", 760 | "gpuNum": 1, 761 | "hideHardwareSpecs": false, 762 | "memoryGiB": 64, 763 | "name": "ml.g4dn.4xlarge", 764 | "vcpuNum": 16 765 | }, 766 | { 767 | "_defaultOrder": 32, 768 | "_isFastLaunch": false, 769 | "category": "Accelerated computing", 770 | "gpuNum": 1, 771 | "hideHardwareSpecs": false, 772 | "memoryGiB": 128, 773 | "name": "ml.g4dn.8xlarge", 774 | "vcpuNum": 32 775 | }, 776 | { 777 | "_defaultOrder": 33, 778 | "_isFastLaunch": false, 779 | "category": "Accelerated computing", 780 | "gpuNum": 4, 781 | "hideHardwareSpecs": false, 782 | "memoryGiB": 192, 783 | "name": "ml.g4dn.12xlarge", 784 | "vcpuNum": 48 785 | }, 786 | { 787 | "_defaultOrder": 34, 788 | "_isFastLaunch": false, 789 | "category": "Accelerated computing", 790 | "gpuNum": 1, 791 | "hideHardwareSpecs": false, 792 | "memoryGiB": 256, 793 | "name": "ml.g4dn.16xlarge", 794 | "vcpuNum": 64 795 | }, 796 | { 797 | "_defaultOrder": 35, 798 | "_isFastLaunch": false, 799 | "category": "Accelerated computing", 800 | "gpuNum": 1, 801 | "hideHardwareSpecs": false, 802 | "memoryGiB": 61, 803 | "name": "ml.p3.2xlarge", 804 | "vcpuNum": 8 805 | }, 806 | { 807 | "_defaultOrder": 36, 808 | "_isFastLaunch": false, 809 | "category": "Accelerated computing", 810 | "gpuNum": 4, 811 | "hideHardwareSpecs": false, 812 | "memoryGiB": 244, 813 | "name": "ml.p3.8xlarge", 814 | "vcpuNum": 32 815 | }, 816 | { 817 | "_defaultOrder": 37, 818 | "_isFastLaunch": false, 819 | "category": "Accelerated computing", 820 | "gpuNum": 
8, 821 | "hideHardwareSpecs": false, 822 | "memoryGiB": 488, 823 | "name": "ml.p3.16xlarge", 824 | "vcpuNum": 64 825 | }, 826 | { 827 | "_defaultOrder": 38, 828 | "_isFastLaunch": false, 829 | "category": "Accelerated computing", 830 | "gpuNum": 8, 831 | "hideHardwareSpecs": false, 832 | "memoryGiB": 768, 833 | "name": "ml.p3dn.24xlarge", 834 | "vcpuNum": 96 835 | }, 836 | { 837 | "_defaultOrder": 39, 838 | "_isFastLaunch": false, 839 | "category": "Memory Optimized", 840 | "gpuNum": 0, 841 | "hideHardwareSpecs": false, 842 | "memoryGiB": 16, 843 | "name": "ml.r5.large", 844 | "vcpuNum": 2 845 | }, 846 | { 847 | "_defaultOrder": 40, 848 | "_isFastLaunch": false, 849 | "category": "Memory Optimized", 850 | "gpuNum": 0, 851 | "hideHardwareSpecs": false, 852 | "memoryGiB": 32, 853 | "name": "ml.r5.xlarge", 854 | "vcpuNum": 4 855 | }, 856 | { 857 | "_defaultOrder": 41, 858 | "_isFastLaunch": false, 859 | "category": "Memory Optimized", 860 | "gpuNum": 0, 861 | "hideHardwareSpecs": false, 862 | "memoryGiB": 64, 863 | "name": "ml.r5.2xlarge", 864 | "vcpuNum": 8 865 | }, 866 | { 867 | "_defaultOrder": 42, 868 | "_isFastLaunch": false, 869 | "category": "Memory Optimized", 870 | "gpuNum": 0, 871 | "hideHardwareSpecs": false, 872 | "memoryGiB": 128, 873 | "name": "ml.r5.4xlarge", 874 | "vcpuNum": 16 875 | }, 876 | { 877 | "_defaultOrder": 43, 878 | "_isFastLaunch": false, 879 | "category": "Memory Optimized", 880 | "gpuNum": 0, 881 | "hideHardwareSpecs": false, 882 | "memoryGiB": 256, 883 | "name": "ml.r5.8xlarge", 884 | "vcpuNum": 32 885 | }, 886 | { 887 | "_defaultOrder": 44, 888 | "_isFastLaunch": false, 889 | "category": "Memory Optimized", 890 | "gpuNum": 0, 891 | "hideHardwareSpecs": false, 892 | "memoryGiB": 384, 893 | "name": "ml.r5.12xlarge", 894 | "vcpuNum": 48 895 | }, 896 | { 897 | "_defaultOrder": 45, 898 | "_isFastLaunch": false, 899 | "category": "Memory Optimized", 900 | "gpuNum": 0, 901 | "hideHardwareSpecs": false, 902 | "memoryGiB": 512, 903 | "name": "ml.r5.16xlarge", 904 | "vcpuNum": 64 905 | }, 906 | { 907 | "_defaultOrder": 46, 908 | "_isFastLaunch": false, 909 | "category": "Memory Optimized", 910 | "gpuNum": 0, 911 | "hideHardwareSpecs": false, 912 | "memoryGiB": 768, 913 | "name": "ml.r5.24xlarge", 914 | "vcpuNum": 96 915 | }, 916 | { 917 | "_defaultOrder": 47, 918 | "_isFastLaunch": false, 919 | "category": "Accelerated computing", 920 | "gpuNum": 1, 921 | "hideHardwareSpecs": false, 922 | "memoryGiB": 16, 923 | "name": "ml.g5.xlarge", 924 | "vcpuNum": 4 925 | }, 926 | { 927 | "_defaultOrder": 48, 928 | "_isFastLaunch": false, 929 | "category": "Accelerated computing", 930 | "gpuNum": 1, 931 | "hideHardwareSpecs": false, 932 | "memoryGiB": 32, 933 | "name": "ml.g5.2xlarge", 934 | "vcpuNum": 8 935 | }, 936 | { 937 | "_defaultOrder": 49, 938 | "_isFastLaunch": false, 939 | "category": "Accelerated computing", 940 | "gpuNum": 1, 941 | "hideHardwareSpecs": false, 942 | "memoryGiB": 64, 943 | "name": "ml.g5.4xlarge", 944 | "vcpuNum": 16 945 | }, 946 | { 947 | "_defaultOrder": 50, 948 | "_isFastLaunch": false, 949 | "category": "Accelerated computing", 950 | "gpuNum": 1, 951 | "hideHardwareSpecs": false, 952 | "memoryGiB": 128, 953 | "name": "ml.g5.8xlarge", 954 | "vcpuNum": 32 955 | }, 956 | { 957 | "_defaultOrder": 51, 958 | "_isFastLaunch": false, 959 | "category": "Accelerated computing", 960 | "gpuNum": 1, 961 | "hideHardwareSpecs": false, 962 | "memoryGiB": 256, 963 | "name": "ml.g5.16xlarge", 964 | "vcpuNum": 64 965 | }, 966 | { 967 | "_defaultOrder": 52, 968 | 
"_isFastLaunch": false, 969 | "category": "Accelerated computing", 970 | "gpuNum": 4, 971 | "hideHardwareSpecs": false, 972 | "memoryGiB": 192, 973 | "name": "ml.g5.12xlarge", 974 | "vcpuNum": 48 975 | }, 976 | { 977 | "_defaultOrder": 53, 978 | "_isFastLaunch": false, 979 | "category": "Accelerated computing", 980 | "gpuNum": 4, 981 | "hideHardwareSpecs": false, 982 | "memoryGiB": 384, 983 | "name": "ml.g5.24xlarge", 984 | "vcpuNum": 96 985 | }, 986 | { 987 | "_defaultOrder": 54, 988 | "_isFastLaunch": false, 989 | "category": "Accelerated computing", 990 | "gpuNum": 8, 991 | "hideHardwareSpecs": false, 992 | "memoryGiB": 768, 993 | "name": "ml.g5.48xlarge", 994 | "vcpuNum": 192 995 | }, 996 | { 997 | "_defaultOrder": 55, 998 | "_isFastLaunch": false, 999 | "category": "Accelerated computing", 1000 | "gpuNum": 8, 1001 | "hideHardwareSpecs": false, 1002 | "memoryGiB": 1152, 1003 | "name": "ml.p4d.24xlarge", 1004 | "vcpuNum": 96 1005 | }, 1006 | { 1007 | "_defaultOrder": 56, 1008 | "_isFastLaunch": false, 1009 | "category": "Accelerated computing", 1010 | "gpuNum": 8, 1011 | "hideHardwareSpecs": false, 1012 | "memoryGiB": 1152, 1013 | "name": "ml.p4de.24xlarge", 1014 | "vcpuNum": 96 1015 | }, 1016 | { 1017 | "_defaultOrder": 57, 1018 | "_isFastLaunch": false, 1019 | "category": "Accelerated computing", 1020 | "gpuNum": 0, 1021 | "hideHardwareSpecs": false, 1022 | "memoryGiB": 32, 1023 | "name": "ml.trn1.2xlarge", 1024 | "vcpuNum": 8 1025 | }, 1026 | { 1027 | "_defaultOrder": 58, 1028 | "_isFastLaunch": false, 1029 | "category": "Accelerated computing", 1030 | "gpuNum": 0, 1031 | "hideHardwareSpecs": false, 1032 | "memoryGiB": 512, 1033 | "name": "ml.trn1.32xlarge", 1034 | "vcpuNum": 128 1035 | }, 1036 | { 1037 | "_defaultOrder": 59, 1038 | "_isFastLaunch": false, 1039 | "category": "Accelerated computing", 1040 | "gpuNum": 0, 1041 | "hideHardwareSpecs": false, 1042 | "memoryGiB": 512, 1043 | "name": "ml.trn1n.32xlarge", 1044 | "vcpuNum": 128 1045 | } 1046 | ], 1047 | "instance_type": "ml.t3.medium", 1048 | "kernelspec": { 1049 | "display_name": "Python 3 (ipykernel)", 1050 | "language": "python", 1051 | "name": "python3" 1052 | }, 1053 | "language_info": { 1054 | "codemirror_mode": { 1055 | "name": "ipython", 1056 | "version": 3 1057 | }, 1058 | "file_extension": ".py", 1059 | "mimetype": "text/x-python", 1060 | "name": "python", 1061 | "nbconvert_exporter": "python", 1062 | "pygments_lexer": "ipython3", 1063 | "version": "3.9.21" 1064 | } 1065 | }, 1066 | "nbformat": 4, 1067 | "nbformat_minor": 5 1068 | } 1069 | -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/assets/consolidate_adapter_shards_and_merge_model.py: -------------------------------------------------------------------------------- 1 | from optimum.neuron.distributed.checkpointing import ( 2 | consolidate_model_parallel_checkpoints_to_unified_checkpoint, 3 | ) 4 | from transformers import AutoModel, AutoTokenizer 5 | from argparse import ArgumentParser 6 | from shutil import copyfile 7 | import os 8 | import peft 9 | 10 | parser = ArgumentParser() 11 | parser.add_argument( 12 | "-i", 13 | "--input_dir", 14 | help="source checkpoint directory containing sharded adapter checkpoint files", 15 | required=True, 16 | ) 17 | parser.add_argument( 18 | "-o", 19 | "--output_dir", 20 | help="destination directory for final merged model (adapters merged into base model)", 21 | required=True, 22 | ) 23 | args = parser.parse_args() 24 | 25 | 
consolidated_ckpt_dir = os.path.join(args.input_dir, "consolidated") 26 | 27 | # Consolidate the adapter shards into a PEFT-compatible checkpoint 28 | print("Consolidating LoRA adapter shards") 29 | consolidate_model_parallel_checkpoints_to_unified_checkpoint( 30 | args.input_dir, consolidated_ckpt_dir 31 | ) 32 | copyfile( 33 | os.path.join(args.input_dir, "adapter_config.json"), 34 | os.path.join(consolidated_ckpt_dir, "adapter_config.json"), 35 | ) 36 | 37 | # Load AutoPeftModel using the consolidated PEFT checkpoint 38 | peft_model = peft.AutoPeftModelForCausalLM.from_pretrained(consolidated_ckpt_dir) 39 | 40 | # Merge adapter weights into base model, save new pretrained model 41 | print("Merging LoRA adapter shards into base model") 42 | merged_model = peft_model.merge_and_unload() 43 | print(f"Saving merged model to {args.output_dir}") 44 | merged_model.save_pretrained(args.output_dir) 45 | 46 | print(f"Saving tokenizer to {args.output_dir}") 47 | tokenizer = AutoTokenizer.from_pretrained(args.input_dir) 48 | tokenizer.save_pretrained(args.output_dir) 49 | 50 | # Load the pretrained model and print config 51 | print("Merged model config:") 52 | model = AutoModel.from_pretrained(args.output_dir) 53 | print(model) 54 | -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/assets/finetune_llama.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass, field 2 | from datasets import load_dataset 3 | from peft import LoraConfig 4 | from transformers import ( 5 | AutoModelForCausalLM, 6 | AutoTokenizer, 7 | set_seed, 8 | ) 9 | import os 10 | import subprocess 11 | 12 | from optimum.neuron import NeuronHfArgumentParser as HfArgumentParser 13 | from optimum.neuron import NeuronSFTConfig, NeuronSFTTrainer, NeuronTrainingArguments 14 | from optimum.neuron.distributed import lazy_load_for_parallelism 15 | from torch_xla.core.xla_model import is_master_ordinal 16 | 17 | 18 | def training_function(script_args, training_args): 19 | dataset = load_dataset("b-mc2/sql-create-context", split="train") 20 | dataset = dataset.shuffle(seed=23) 21 | train_dataset = dataset.select(range(50000)) 22 | eval_dataset = dataset.select(range(50000, 50500)) 23 | 24 | def create_conversation(sample): 25 | system_message = ( 26 | "You are a text to SQL query translator. 
Users will ask you questions in English and you will generate a " 27 | "SQL query based on the provided SCHEMA.\nSCHEMA:\n{schema}" 28 | ) 29 | return { 30 | "messages": [ 31 | { 32 | "role": "system", 33 | "content": system_message.format(schema=sample["context"]), 34 | }, 35 | {"role": "user", "content": sample["question"]}, 36 | {"role": "assistant", "content": sample["answer"] + ";"}, 37 | ] 38 | } 39 | 40 | train_dataset = train_dataset.map( 41 | create_conversation, remove_columns=train_dataset.features, batched=False 42 | ) 43 | eval_dataset = eval_dataset.map( 44 | create_conversation, remove_columns=eval_dataset.features, batched=False 45 | ) 46 | 47 | tokenizer = AutoTokenizer.from_pretrained(script_args.tokenizer_id) 48 | # tokenizer.pad_token = tokenizer.eos_token 49 | # tokenizer.eos_token_id = 128001 50 | 51 | with lazy_load_for_parallelism( 52 | tensor_parallel_size=training_args.tensor_parallel_size 53 | ): 54 | model = AutoModelForCausalLM.from_pretrained(script_args.model_id) 55 | 56 | config = LoraConfig( 57 | r=script_args.lora_r, 58 | lora_alpha=script_args.lora_alpha, 59 | lora_dropout=script_args.lora_dropout, 60 | target_modules=[ 61 | "q_proj", 62 | "gate_proj", 63 | "v_proj", 64 | "o_proj", 65 | "k_proj", 66 | "up_proj", 67 | "down_proj", 68 | ], 69 | bias="none", 70 | task_type="CAUSAL_LM", 71 | ) 72 | 73 | args = training_args.to_dict() 74 | 75 | sft_config = NeuronSFTConfig( 76 | max_seq_length=1024, 77 | packing=True, 78 | **args, 79 | dataset_kwargs={ 80 | "add_special_tokens": False, 81 | "append_concat_token": True, 82 | }, 83 | ) 84 | 85 | trainer = NeuronSFTTrainer( 86 | args=sft_config, 87 | model=model, 88 | peft_config=config, 89 | tokenizer=tokenizer, 90 | train_dataset=train_dataset, 91 | eval_dataset=eval_dataset, 92 | ) 93 | 94 | # Start training 95 | trainer.train() 96 | del trainer 97 | 98 | 99 | @dataclass 100 | class ScriptArguments: 101 | model_id: str = field( 102 | default="TinyLlama/TinyLlama-1.1B-Chat-v1.0", 103 | metadata={ 104 | "help": "The model that you want to train from the Hugging Face hub." 105 | }, 106 | ) 107 | tokenizer_id: str = field( 108 | default="TinyLlama/TinyLlama-1.1B-Chat-v1.0", 109 | metadata={"help": "The tokenizer used to tokenize text for fine-tuning."}, 110 | ) 111 | lora_r: int = field( 112 | default=16, 113 | metadata={"help": "LoRA r value to be used during fine-tuning."}, 114 | ) 115 | lora_alpha: int = field( 116 | default=32, 117 | metadata={"help": "LoRA alpha value to be used during fine-tuning."}, 118 | ) 119 | lora_dropout: float = field( 120 | default=0.05, 121 | metadata={"help": "LoRA dropout value to be used during fine-tuning."}, 122 | ) 123 | 124 | 125 | if __name__ == "__main__": 126 | parser = HfArgumentParser([ScriptArguments, NeuronTrainingArguments]) 127 | script_args, training_args = parser.parse_args_into_dataclasses() 128 | 129 | set_seed(training_args.seed) 130 | training_function(script_args, training_args) 131 | 132 | # Consolidate LoRA adapter shards, merge LoRA adapters into base model, save merged model 133 | if is_master_ordinal(): 134 | input_ckpt_dir = os.path.join( 135 | training_args.output_dir, f"checkpoint-{training_args.max_steps}" 136 | ) 137 | output_ckpt_dir = os.path.join(training_args.output_dir, "merged_model") 138 | # the spawned process expects to see 2 NeuronCores for consolidating checkpoints with a tp=2 139 | # Either the second core isn't really used or it is freed up by the other thread finishing. 140 | # Adjusting Neuron env. 
var to advertise 2 NeuronCores to the process. 141 | env = os.environ.copy() 142 | env["NEURON_RT_VISIBLE_CORES"] = "0-1" 143 | subprocess.run( 144 | [ 145 | "python3", 146 | "consolidate_adapter_shards_and_merge_model.py", 147 | "-i", 148 | input_ckpt_dir, 149 | "-o", 150 | output_ckpt_dir, 151 | ], 152 | env=env 153 | ) -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/assets/requirements.txt: -------------------------------------------------------------------------------- 1 | optimum-neuron==0.0.27 2 | peft==0.14.0 3 | trl==0.11.4 -------------------------------------------------------------------------------- /labs/Lab_One_NxDI.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "5a972332", 6 | "metadata": {}, 7 | "source": [ 8 | "# Develop support for a new model with NeuronX Distributed Inference\n", 9 | "\n", 10 | "In this notebook you will learn how to develop support for a new model with NeuronX Distributed Inference (NxD). NxD is a Python package developed by Annapurna Labs that enables you to shard, compile, train, and host PyTorch models on Trainium and Inferentia instances. We develop two key packages demonstrating how to use this, [NxD Inference](https://github.com/aws-neuron/neuronx-distributed-inference/tree/main) and [NxD Training](https://github.com/aws-neuron/neuronx-distributed-training). This notebook focuses on inference. You will learn how to develop support for a new model in NxD Inference in the context of Llama 3.2 1B.\n", 11 | "\n", 12 | "#### Overview\n", 13 | "1. Check dependencies for AWS Neuron SDK\n", 14 | "2. Accept the Meta usage terms and download the model from Hugging Face.\n", 15 | "3. Learn how to invoke the model step-by-step\n", 16 | " - Load the model from a local path.\n", 17 | " - Shard and compile it for Trainium.\n", 18 | " - Download and tokenize the dataset\n", 19 | " - Invoke the model with prompts\n", 20 | "4. Learn how to modify the underlying APIs to work with your own models\n", 21 | "\n", 22 | "#### Prerequisites\n", 23 | "This notebook was developed on a trn1.2xlarge instance, using the latest Amazon Linux DLAMI. Both the Amazon Linux and Ubuntu Neuron DLAMIs have preinstalled Python virtual environments with all the basic software packages included. The virtual environment used to develop this notebook is located at this path in both Amazon Linux and Ubuntu DLAMIs: `/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference`. " 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "id": "3652fc5a", 29 | "metadata": {}, 30 | "source": [ 31 | "### Step 1. Import NxD Inference packages\n", 32 | "\n", 33 | "If you are running this notebook in the virtual environment for NxD Inference, then the package should already be installed. Let's verify that with the following import." 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "id": "c4405a13-5431-4d29-a6a6-2eb989fb0f50", 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "import neuronx_distributed_inference" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "id": "0d1970fc", 49 | "metadata": {}, 50 | "source": [ 51 | "### Step 2. 
Accept the Meta usage terms and download the model\n", 52 | "\n", 53 | "If you would like to use the model directly from Meta, you'll need to navigate over to the Hugging Face hub for Llama 3.2 1B [here](https://huggingface.co/meta-llama/Llama-3.2-1B). Log in to the Hub, accept the usage terms, and request access to the model. Once access has been granted, copy your Hugging Face token and paste it into the download command below.\n", 54 | "\n", 55 | "If you do not have your token readily available, you can proceed with the alternative model shown below." 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "id": "959fb008-a2c8-4505-8f60-42e5b2060b31", 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "# helpful packages to speed up the download\n", 66 | "!pip install hf_transfer \"huggingface_hub[cli]\"" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "ccff01a8-94f7-4d10-bdf7-71229ec19cb9", 72 | "metadata": {}, 73 | "source": [ 74 | "We'll download the `NousResearch/Llama-3.2-1B` model here." 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "id": "75a2e3d1-7c1b-4d9d-b1f5-d294a1381566", 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "!huggingface-cli download NousResearch/Llama-3.2-1B --local-dir /home/ec2-user/environment/models/llama/" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "id": "02214b8a", 90 | "metadata": {}, 91 | "source": [ 92 | "### Step 3. Establish model configs\n", 93 | "Next, you'll point to the local model files and establish config objects. Each of these configs is helpful in successfully invoking the model." 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "id": "77e54a5f-842f-4b2c-ab79-c0f11a6ef292", 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "# the original checkpoint\n", 104 | "model_path = '/home/ec2-user/environment/models/llama/'" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "id": "094dc24d-dd06-45c8-adec-fa997f02e6d1", 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "# where your NxD trace will go\n", 115 | "traced_model_path = '/home/ec2-user/environment/models/traced_llama'" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "id": "9f72bda4-5e04-442c-b016-f30816db54d4", 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "import torch\n", 126 | "from transformers import AutoTokenizer, GenerationConfig\n", 127 | "\n", 128 | "from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig\n", 129 | "from neuronx_distributed_inference.models.llama.modeling_llama import LlamaInferenceConfig, NeuronLlamaForCausalLM\n", 130 | "from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter, load_pretrained_config\n", 131 | "from neuronx_distributed_inference.modules.generation.sampling import prepare_sampling_params\n", 132 | "\n", 133 | "# torch.manual_seed(0)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "id": "812403b6", 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "# update the generation config to address a trailing comma\n", 144 | "!cp generation_config.json $model_path/" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "id": "857c6e49-ce3a-47c9-868a-520f0cd68276", 151 | "metadata": {}, 152 | "outputs": [], 153 | 
"source": [ 154 | "# Initialize configs \n", 155 | "generation_config = GenerationConfig.from_pretrained(model_path)\n", 156 | "\n", 157 | "# Some sample overrides for generation\n", 158 | "generation_config_kwargs = {\n", 159 | " \"do_sample\": True,\n", 160 | " \"top_k\": 1,\n", 161 | " \"pad_token_id\": generation_config.eos_token_id,\n", 162 | "}\n", 163 | "generation_config.update(**generation_config_kwargs)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "id": "d196acdb-d094-41c0-9638-9974cec332c4", 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "neuron_config = NeuronConfig(\n", 174 | " tp_degree=2,\n", 175 | " batch_size=2,\n", 176 | " max_context_length=32,\n", 177 | " seq_len=64,\n", 178 | " on_device_sampling_config=OnDeviceSamplingConfig(top_k=1),\n", 179 | " enable_bucketing=True,\n", 180 | " flash_decoding_enabled=False\n", 181 | ")\n", 182 | "\n", 183 | "# Build the Llama Inference config\n", 184 | "config = LlamaInferenceConfig(\n", 185 | " neuron_config,\n", 186 | " load_config=load_pretrained_config(model_path),\n", 187 | ")" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "id": "5269bcdd-cf8c-4b10-a428-0cd0fafd83d1", 193 | "metadata": {}, 194 | "source": [ 195 | "### Step 4. Shard and compile the model\n", 196 | "The NeuronX compiler will optimize your model for Trainium hardware, ultimately generating the assembly code that executes your operations. We will invoke that compiler now. Generally it's suggested to compile for some of the larger input and output shapes for your model, while using bucketing to optimize performance. Both of those are handled for you automatically with NxD.\n", 197 | "\n", 198 | "With NxD, this step also shards your checkpoint for the TP degree that you defined above. Compilation can take some time, for a 1B model this should run for a few minutes." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "id": "afd1e5d5-a989-40fb-8350-fca737470b19", 205 | "metadata": { 206 | "scrolled": true 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "model = NeuronLlamaForCausalLM(model_path, config)\n", 211 | "model.compile(traced_model_path)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "id": "38178c7e-0f6e-41ab-9383-2942615b82ed", 217 | "metadata": {}, 218 | "source": [ 219 | "Once compilation is complete your new model is saved and ready to load! " 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "id": "63a37f02-ed94-4c3e-81cc-6d9e23c04175", 225 | "metadata": {}, 226 | "source": [ 227 | "### Step 5. Download the tokenizer" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "id": "0c5e306f-9488-4b0e-8e6a-f238a50f2cfe", 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side=\"right\")\n", 238 | "tokenizer.pad_token = tokenizer.eos_token\n", 239 | "tokenizer.save_pretrained(traced_model_path)" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "id": "212cfe39-9e66-4a02-bf21-2560de065a34", 245 | "metadata": {}, 246 | "source": [ 247 | "### Step 6. 
Load the traced model" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "id": "c945db68-5392-406c-8dd6-9e66b9ab0a63", 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "model = NeuronLlamaForCausalLM(traced_model_path)\n", 258 | "model.load(traced_model_path)\n", 259 | "tokenizer = AutoTokenizer.from_pretrained(traced_model_path)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "id": "75b0f5f3-b12b-4ac4-883c-856604f8d44e", 265 | "metadata": {}, 266 | "source": [ 267 | "### Step 7. Define the prompts and prepare them for sampling" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "id": "f203a455-402b-4ddc-81d3-d4d1b4335c5c", 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "prompts = [\"I believe the meaning of life is\", \"The color of the sky is\"]\n", 278 | "\n", 279 | "# Example: parameter sweeps for sampling\n", 280 | "sampling_params = prepare_sampling_params(batch_size=neuron_config.batch_size,\n", 281 | " top_k=[10, 5],\n", 282 | " top_p=[0.5, 0.9],\n", 283 | " temperature=[0.9, 0.5])\n", 284 | "\n", 285 | "inputs = tokenizer(prompts, padding=True, return_tensors=\"pt\")" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "id": "108f43f8-a2a8-4986-af7c-fdc58a37f3cd", 291 | "metadata": {}, 292 | "source": [ 293 | "### Step 8. Create a Generation Adapter and run inference" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "id": "2f511a6f-049c-4a05-bccc-f5cce8071334", 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "generation_model = HuggingFaceGenerationAdapter(model)\n", 304 | "outputs = generation_model.generate(\n", 305 | " inputs.input_ids,\n", 306 | " generation_config=generation_config,\n", 307 | " attention_mask=inputs.attention_mask,\n", 308 | " max_length=model.config.neuron_config.max_length,\n", 309 | " sampling_params=sampling_params,\n", 310 | ")\n", 311 | "output_tokens = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)\n", 312 | "\n", 313 | "print(\"Generated outputs:\")\n", 314 | "for i, output_token in enumerate(output_tokens):\n", 315 | " print(f\"Output {i}: {output_token}\")\n" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "id": "a5b840fc-dcba-428a-bcf8-c35702d144e0", 321 | "metadata": {}, 322 | "source": [ 323 | "---\n", 324 | "# Develop support for a new model with NxDI\n", 325 | "Now that you've run inference with this model, let's take a closer look at how this works. The cells you just ran are based on a script available in our repository [here](https://github.com/aws-neuron/neuronx-distributed-inference/tree/main). You can step through this repository to understand how the objects are developed, inherited, and made available for inference. The full developer guide on the topic is available [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/onboarding-models.html#nxdi-onboarding-models). Let's look at some of the key points!" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "id": "f5ec151f-ce53-4051-a9d0-957654834f51", 331 | "metadata": {}, 332 | "source": [ 333 | "#### 1/ NeuronConfig class\n", 334 | "You can inherit our base `NeuronConfig` class and extend it with your own model parameters. 
In the notebook you just ran, this is how we defined the following parameters:\n", 335 | "- Tensor Parallel (TP) Degree\n", 336 | "- Batch size\n", 337 | "- Max context length (input shape)\n", 338 | "- Sequence length (output shape)\n", 339 | "- On device sampling\n", 340 | "- Enabling bucketing\n", 341 | "- Flash decoding\n", 342 | "\n", 343 | "\n", 344 | "This object and these parameters will be sent to the compiler when you call `model.compile`. It's a helpful way to ensure that the compiler registers your design choices so that it can start optimizations. It also enables model sharding with NxDI for your preferred TP degree, which lets you very quickly test a variety of TP degrees (TP=8, 32, 64, etc)." 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "id": "ac98eb22-c02b-4c74-bd4c-3cd1bd196f54", 350 | "metadata": {}, 351 | "source": [ 352 | "#### 2/ InferenceConfig class\n", 353 | "Next, you can inherit our base `InferenceConfig` class and extend it with the rest of your modeling parameters. In the notebook you ran above, we took two important steps with this config.\n", 354 | "1. Passed into it the base `NeuronConfig`.\n", 355 | "2. Passed the rest of the model config from the HuggingFace pretrained config.\n", 356 | "\n", 357 | "Your inference class is where you define modeling parameters like the following:\n", 358 | "- hidden size\n", 359 | "- num attention heads\n", 360 | "- num hidden layers\n", 361 | "- num key value heads\n", 362 | "- vocab size\n", 363 | "\n", 364 | "You'll use this `config` object to save and compile your model. Let's learn how!" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "id": "71016dc5-112d-470f-a1ee-ce1855a5487d", 370 | "metadata": {}, 371 | "source": [ 372 | "#### 3/ NeuronModel\n", 373 | "This is how you fundamentally integrate your modeling code into the Neuron SDK. If you'd like to simply reuse our `NeuronAttentionBase`, you can inherit this directly through the library and simply pass your parameters through the `InferenceConfig` you defined above. This is how the example code in our notebook works. This is also the fastest way of getting your model online with NxDI.\n", 374 | "\n", 375 | "In the example code you ran, you also used our code for `NeuronLlamaMLP`. This is a layer in the network which inherits from `nn.Module` directly, and it's where you can define the structure of your computations. The `NeuronLlamaMLP` uses a predefined `ColumnParallelLinear` object for both the gate and up projections, while using a predefined `RowParallelLinear` object for the down projection. It also defines a forward pass on that layer.\n", 376 | "\n", 377 | "The rest of the model is defined similarly: either you inherit from our base objects and just pass in your `InferenceConfig`, or you define a new layer inheriting from `nn.Module` and write those layers as either `RowParallelLinear`, `ColumnParallelLinear`, or something else. The benefit of writing your layers into the `Row` and `Column` parallel layers as presented here is that we can handle the distribution of your model for you. \n", 378 | "\n", 379 | "For a more complete guide, check out our documentation on the subject [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api_guide.html#api-guide)." 
380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "id": "8e98a0d4", 385 | "metadata": {}, 386 | "source": [ 387 | "### Notebook Wrap-Up\n", 388 | "\n", 389 | "For more advanced topics:\n", 390 | "- **Profiling**: See [Neuron Profiling Tools](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-profile/index.html).\n", 391 | "- **Distributed Serving**: Explore vLLM or other serving frameworks.\n", 392 | "- **Performance Benchmarking**: Use `llmperf` or custom scripts.\n", 393 | "\n", 394 | "Thank you for using AWS Trainium, and happy LLM experimentation!\n" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": null, 400 | "id": "3bfc3c62-08a4-49ae-adef-5c0d661f2712", 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [] 404 | } 405 | ], 406 | "metadata": { 407 | "kernelspec": { 408 | "display_name": "Python 3 (ipykernel)", 409 | "language": "python", 410 | "name": "python3" 411 | }, 412 | "language_info": { 413 | "codemirror_mode": { 414 | "name": "ipython", 415 | "version": 3 416 | }, 417 | "file_extension": ".py", 418 | "mimetype": "text/x-python", 419 | "name": "python", 420 | "nbconvert_exporter": "python", 421 | "pygments_lexer": "ipython3", 422 | "version": "3.9.21" 423 | } 424 | }, 425 | "nbformat": 4, 426 | "nbformat_minor": 5 427 | } 428 | -------------------------------------------------------------------------------- /labs/Lab_Two_NKI.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "d6b1e73f-dc2c-4d66-b3ba-4fb71b5243c8", 6 | "metadata": {}, 7 | "source": [ 8 | "# Write your own kernel with the Neuron Kernel Interface (NKI)\n", 9 | "In this notebook you'll learn how to develop your own kernel with [NKI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html). A kernel is a set of user-defined functions that are executed largely as defined by the user, not by the compiler. With NKI you can write your own functions to define any operations you like, using supported APIs, and execute them on Trainium and Inferentia hardware. You have the control and lower-level access to define the data movement, computational patterns, and physical execution for the mathematics of your algorithms with NKI.\n", 10 | "\n", 11 | "The structure of the notebook is as follows:\n", 12 | "1. Brief introduction to the NeuronCore and the NKI programming model\n", 13 | "2. Your first NKI kernel - tensor addition\n", 14 | "3. Your second NKI kernel - matrix multiplication\n", 15 | "\n", 16 | "4. Wrap up and next steps." 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "54d843c8-b824-4896-ad23-1098dd859872", 22 | "metadata": {}, 23 | "source": [ 24 | "### 1. Introduction to the NeuronCore and NKI programming model\n", 25 | "The NeuronCore is the main acceleration unit within AWS AI chips Trainium and Inferentia. As you can see in the image below, it is composed of 4 compute engines. These engines are based on a systolic array architecture. The compute engines are fed data from the primary on-chip memory cache, SBUF. Data is moved from the HBM banks to SBUF when you call `nl.load`. You'll index into your tensors to create lower-level objects, called `tiles`. A tile is the result of `nl.load`. Once you've defined `tiles`, you can send them to various NKI mathematical APIs such as `add`, `subtract`, `matmul`, etc. The results of these operations are stored on the secondary on-chip memory cache, PSUM. 
After moving the data back to SBUF, you can then send it back to HBM with `nl.store`.\n", 26 | "\n", 27 | "" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "id": "37fc461d-1067-4b96-95a5-3da49e15723f", 33 | "metadata": {}, 34 | "source": [ 35 | "Trainium1 chips feature two NeuronCore-v2 acceleration units, 2 HBM banks, NeuronLink-v2 chip-to-chip interconnect, host PCIe, and dedicated engines for both data movement and collective communications. Trainium1 offers 32 GiB of device memory (sum of all 4 HBM banks), with 840 GiB/sec of bandwidth. Trainium1 instances feature 16 Trainium chips, providing a total of up to 3 petaflops of FP16 compute and 512 GiB of accelerator memory capacity. For more architectural details, see our docs [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium.html#trainium-arch). \n", 36 | "\n", 37 | "\n", 38 | "The on-chip memory cache, SBUF, **has ~20x higher memory bandwidth than HBM**. The purpose of your kernel is to exploit as much of that compute acceleration as you can within the context of your model and workload." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "id": "0fffc189-2875-4a42-a14a-de4ea122d2ef", 44 | "metadata": {}, 45 | "source": [ 46 | "#### Structuring data and tensors for NKI\n", 47 | "\n", 48 | "To easily move data and design our kernels on NKI, we'll want to exploit the 128 partitions built into SBUF as shown in the image below. In particular, SBUF has 128 partition lanes. Each of these lanes can execute programs in parallel on the engines. As much as possible, we'll want to align the tensors and data structures in our algorithms to follow this physical design. The benefit is that our kernels will run faster and be easier to develop!\n", 49 | "\n", 50 | "Your data movement from HBM to SBUF should be very carefully aligned with this 128-lane partition dimension, also called p-dim. Each tile needs a precise definition along the p-dim. Your second dimension is called the free dimension, or f-dim. As the name suggests, this dimension is much more flexible than p-dim. Though it may surprise you, it's better not to fully saturate SBUF with extremely large tiles. This is so that the compiler can overlap data movement and collectives with compute, giving you better overall compute utilization and performance." 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "id": "0c786fce-ac8e-4549-8bf1-edaee7512211", 56 | "metadata": {}, 57 | "source": [ 58 | "" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "id": "03a9fbbe-2fa4-41c0-9cc8-7560fbc7a49f", 64 | "metadata": {}, 65 | "source": [ 66 | "### 2. Your first NKI kernel\n", 67 | "Now that you have some understanding of the compute architecture and motivation for kernels, let's write your first NKI kernel! Importing the `nki` library may take a few moments the first time you've imported it on an instance." 
68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 1, 73 | "id": "2da52760-db72-403a-ade9-d8bebac40de3", 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "import numpy as np\n", 78 | "import neuronxcc.nki as nki\n", 79 | "import neuronxcc.nki.language as nl" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 2, 85 | "id": "b83039ee-1788-478f-809f-f139cb032cce", 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "@nki.jit\n", 90 | "def nki_tensor_add_kernel_(a_input, b_input):\n", 91 | " \n", 92 | " # Create output tensor \n", 93 | " c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)\n", 94 | "\n", 95 | " # Load input data from device memory (HBM) to on-chip memory (SBUF)\n", 96 | " a_tile = nl.load(a_input)\n", 97 | " b_tile = nl.load(b_input)\n", 98 | "\n", 99 | " # compute a + b\n", 100 | " c_tile = a_tile + b_tile\n", 101 | "\n", 102 | " # return the final tensor\n", 103 | " nl.store(c_output, value=c_tile)\n", 104 | "\n", 105 | " # Transfer the ownership of `c_output` to the caller\n", 106 | " return c_output\n" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 3, 112 | "id": "486f0e0a-6af1-4882-afe2-4ce5a1912ddc", 113 | "metadata": {}, 114 | "outputs": [ 115 | { 116 | "name": "stdout", 117 | "output_type": "stream", 118 | "text": [ 119 | "NKI and NumPy match\n" 120 | ] 121 | } 122 | ], 123 | "source": [ 124 | "a = np.random.rand(128, 512).astype(np.float16)\n", 125 | "b = np.random.rand(128, 512).astype(np.float16)\n", 126 | "\n", 127 | "output_nki = nki_tensor_add_kernel_(a, b)\n", 128 | "\n", 129 | "output_np = a + b\n", 130 | "\n", 131 | "allclose = np.allclose(output_np, output_nki, atol=1e-4, rtol=1e-2)\n", 132 | "if allclose:\n", 133 | " print(\"NKI and NumPy match\")\n", 134 | "else:\n", 135 | " print(\"NKI and NumPy differ\")\n" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "id": "35f65891-2d62-4af4-aa5d-7620c707f6bd", 141 | "metadata": {}, 142 | "source": [ 143 | "Now let's see if we can do that for matrix multiplication!" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "id": "e8a65cb0-215d-4590-8335-d53c23eef5c1", 149 | "metadata": {}, 150 | "source": [ 151 | "### 3. Your second NKI kernel\n", 152 | "Now, let's try to use PyTorch arrays and pass them to the device with XLA. Then we'll try a matrix multiplication kernel." 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 4, 158 | "id": "e4e24399-7bae-4db2-b964-b5fdcc93fb32", 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "name": "stderr", 163 | "output_type": "stream", 164 | "text": [ 165 | "WARNING:root:MASTER_ADDR environment variable is not set, defaulting to localhost\n", 166 | "WARNING:root:Found libneuronpjrt.so. 
Setting PJRT_DEVICE=NEURON.\n" 167 | ] 168 | } 169 | ], 170 | "source": [ 171 | "import torch\n", 172 | "from torch_xla.core import xla_model as xm\n", 173 | "\n", 174 | "device = xm.xla_device()\n", 175 | "\n", 176 | "lhs_small = torch.rand((64, 128), dtype=torch.bfloat16, device=device)\n", 177 | "rhs_small = torch.rand((128, 512), dtype=torch.bfloat16, device=device)" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 5, 183 | "id": "0bc1f344-6e02-4f3a-928a-9f1bccabfb12", 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "@nki.jit\n", 188 | "def nki_matmul_basic_(lhsT, rhs):\n", 189 | " \"\"\"NKI kernel to compute a 64x128x512 matrix multiplication operation\n", 190 | "\n", 191 | " Args:\n", 192 | " lhsT: an input tensor of shape [128,64], a left hand side argument of the\n", 193 | " matrix multiplication, delivered transposed for optimal performance\n", 194 | " rhs: an input tensor of shape [128,512], a right hand side argument of the\n", 195 | " matrix multiplication\n", 196 | " Returns:\n", 197 | " result: the resulting output tensor of shape [64,512]\n", 198 | " \"\"\"\n", 199 | " result = nl.ndarray((64, 512), dtype=lhsT.dtype, buffer=nl.shared_hbm)\n", 200 | "\n", 201 | " # Defining indexes for input LHS.T\n", 202 | " # - Note: here we take LayoutConstraint #1 into account:\n", 203 | " # \"For MatMult, contraction axis must be mapped to P-dim\"\n", 204 | " i_lhsT_p, i_lhsT_f = nl.mgrid[0:128, 0:64]\n", 205 | "\n", 206 | " # Defining indexes for input RHS\n", 207 | " # - Note: here we take LayoutConstraint #1 into account:\n", 208 | " # \"For MatMult, contraction axis must be mapped to P-dim\"\n", 209 | " i_rhs_p, i_rhs_f = nl.mgrid[0:128, 0:512]\n", 210 | "\n", 211 | " # Defining indexes for the output ([64,128]@[128,512] -> [64,512])\n", 212 | " i_out_p, i_out_f = nl.mgrid[0:64, 0:512]\n", 213 | "\n", 214 | " # Loading the inputs (HBM->SBUF)\n", 215 | " # Note: here we take Tile dtype definition into account,\n", 216 | " # which forces P-dim as the left most index\n", 217 | " lhs_tile = nl.load(lhsT[i_lhsT_p, i_lhsT_f])\n", 218 | " rhs_tile = nl.load(rhs[i_rhs_p, i_rhs_f])\n", 219 | "\n", 220 | " # Perform the matrix-multiplication\n", 221 | " # Note1: We set transpose_x to True, to indicate that the LHS input is transposed\n", 222 | " # Note2: A NKI matmul instruction always writes to PSUM in float32 data-type\n", 223 | " result_psum = nl.matmul(lhs_tile, rhs_tile, transpose_x=True)\n", 224 | "\n", 225 | " # Copy the result from PSUM back to SBUF, and cast to expected output data-type\n", 226 | " result_sbuf = nl.copy(result_psum, dtype=result.dtype)\n", 227 | "\n", 228 | " # The result of a [64,128] x [128,512] matrix multiplication has a shape of [64, 512].\n", 229 | " # This dictates which indices to use to address the result tile.\n", 230 | " nl.store(result[i_out_p, i_out_f], value=result_sbuf)\n", 231 | "\n", 232 | " return result" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 6, 238 | "id": "d5b2a228-0a08-42fc-9bd4-81dedba0e4d6", 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "name": "stdout", 243 | "output_type": "stream", 244 | "text": [ 245 | "Checking correctness of nki_matmul_basic\n", 246 | "2025-03-17 22:45:04.000657: 512118 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ec2-user/neuroncc_compile_workdir/58a5f9b5-7dd1-4569-b58f-bae92b1f0d13/model.MODULE_6255296715421101974+e30acd3a.hlo_module.pb --output 
/tmp/ec2-user/neuroncc_compile_workdir/58a5f9b5-7dd1-4569-b58f-bae92b1f0d13/model.MODULE_6255296715421101974+e30acd3a.neff --target=trn1 --verbose=35\n", 247 | ".\n", 248 | "Compiler status PASS\n", 249 | "NKI and Torch match\n" 250 | ] 251 | } 252 | ], 253 | "source": [ 254 | "# Run NKI kernel\n", 255 | "output_small = nki_matmul_basic_(lhs_small.T, rhs_small)\n", 256 | "\n", 257 | "# Run torch reference\n", 258 | "output_small_torch = torch.matmul(lhs_small, rhs_small)\n", 259 | "\n", 260 | "# Compare results\n", 261 | "print(\"Checking correctness of nki_matmul_basic\")\n", 262 | "if torch.allclose(output_small_torch, output_small, atol=1e-4, rtol=1e-2):\n", 263 | " print(\"NKI and Torch match\")\n", 264 | "else:\n", 265 | " print(\"NKI and Torch differ\")" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "id": "801236a2-9d4d-4630-a750-dc42bb2e4514", 271 | "metadata": {}, 272 | "source": [ 273 | "### 4. Wrap up and next steps\n", 274 | "The simplicity you see in the `tensor_add` kernel above is possible because the shapes we pass in are very small. We've intentionally selected them to exactly match the shapes of tiles that NKI supports as maximum dimensions, for both the partition and free dimensions.\n", 275 | "\n", 276 | "As you saw above, the partition dimension has a maximum length of 128. This is the most important dimension and shape to embrace in your kernels, because it impacts your ability to load data onto the chip. In order to exploit the parallelism of execution enabled through the 128 lanes on SBUF, you might want to develop into your kernel the ability to extract data in batches of 128 to load onto SBUF. \n", 277 | "\n", 278 | "The second dimension, also known as the free dimension, is more flexible. Once you have clean batches of 128 lanes being loaded onto SBUF, you can build in tiling on the second dimension with much more varied sizes, up to 512. \n", 279 | "\n", 280 | "To learn more about tiling, and to step through the rest of the matrix multiplication tutorial, see our docs on the topic [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/tutorials/matrix_multiplication.html#)." 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "id": "bd810a4b-2365-48a3-ad0f-23f3850ffc71", 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [] 290 | } 291 | ], 292 | "metadata": { 293 | "kernelspec": { 294 | "display_name": "Python 3 (ipykernel)", 295 | "language": "python", 296 | "name": "python3" 297 | }, 298 | "language_info": { 299 | "codemirror_mode": { 300 | "name": "ipython", 301 | "version": 3 302 | }, 303 | "file_extension": ".py", 304 | "mimetype": "text/x-python", 305 | "name": "python", 306 | "nbconvert_exporter": "python", 307 | "pygments_lexer": "ipython3", 308 | "version": "3.9.16" 309 | } 310 | }, 311 | "nbformat": 4, 312 | "nbformat_minor": 5 313 | } 314 | -------------------------------------------------------------------------------- /labs/generation_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "_from_model_config": true, 3 | "bos_token_id": 128000, 4 | "eos_token_id": 128001, 5 | "transformers_version": "4.45.0.dev0", 6 | "do_sample": true, 7 | "temperature": 0.6, 8 | "top_p": 0.9 9 | } --------------------------------------------------------------------------------