├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE ├── README.md ├── contributed └── models │ ├── README.md │ └── qwen2 │ ├── modeling_qwen2.py │ └── qwen-2-test.ipynb └── labs ├── FineTuning └── HuggingFaceExample │ ├── 01_finetuning │ ├── Finetune-TinyLlama-1.1B.ipynb │ └── assets │ │ ├── consolidate_adapter_shards_and_merge_model.py │ │ ├── finetune_llama.py │ │ └── requirements.txt │ └── 02_inference │ └── Inference-TinyLlama-1.1B.ipynb ├── Lab_One_NxDI.ipynb ├── Lab_Two_NKI.ipynb └── generation_config.json /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 
41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. 
For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. 
You are solely responsible for determining the
151 | appropriateness of using or redistributing the Work and assume any
152 | risks associated with Your exercise of permissions under this License.
153 | 
154 | 8. Limitation of Liability. In no event and under no legal theory,
155 | whether in tort (including negligence), contract, or otherwise,
156 | unless required by applicable law (such as deliberate and grossly
157 | negligent acts) or agreed to in writing, shall any Contributor be
158 | liable to You for damages, including any direct, indirect, special,
159 | incidental, or consequential damages of any character arising as a
160 | result of this License or out of the use or inability to use the
161 | Work (including but not limited to damages for loss of goodwill,
162 | work stoppage, computer failure or malfunction, or any and all
163 | other commercial damages or losses), even if such Contributor
164 | has been advised of the possibility of such damages.
165 | 
166 | 9. Accepting Warranty or Additional Liability. While redistributing
167 | the Work or Derivative Works thereof, You may choose to offer,
168 | and charge a fee for, acceptance of support, warranty, indemnity,
169 | or other liability obligations and/or rights consistent with this
170 | License. However, in accepting such obligations, You may act only
171 | on Your own behalf and on Your sole responsibility, not on behalf
172 | of any other Contributor, and only if You agree to indemnify,
173 | defend, and hold each Contributor harmless for any liability
174 | incurred by, or claims asserted against, such Contributor by reason
175 | of your accepting any such warranty or additional liability.
176 | 
--------------------------------------------------------------------------------
/NOTICE:
--------------------------------------------------------------------------------
1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Build On Trainium Workshop
2 | 
3 | In this workshop, you will learn how to develop support for a new model with [NeuronX Distributed Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-overview.html#nxdi-overview), using Llama 3.2 1B as the working example. You will also learn how to write your own kernel to program the accelerator hardware directly with the [Neuron Kernel Interface](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html). Both of these tools will help you design your research proposals and experiments on Trainium.
4 | 
5 | The workshop also includes an end-to-end example of using Hugging Face Optimum Neuron to fine-tune and host a small language model with Amazon SageMaker.
6 | 
7 | ### What is Build on Trainium?
8 | Build on Trainium is a $110M credit program focused on AI research and university education to support the next generation of innovation and development on AWS Trainium. AWS Trainium chips are purpose-built for high-performance deep learning (DL) training of generative AI models, including large language models (LLMs) and latent diffusion models. Build on Trainium provides compute credits for novel AI research on Trainium, investing in leading academic teams to build innovations in critical areas including new model architectures, ML libraries, optimizations, large-scale distributed systems, and more.
This multi-year initiative lays the foundation for the future of AI by inspiring the academic community to utilize, invest in, and contribute to the open-source community around Trainium. Combined with the Neuron software development kit (SDK) and the recently launched Neuron Kernel Interface (NKI), these benefits enable AI researchers to innovate at scale in the cloud.
9 | 
10 | ### What are AWS Trainium and Neuron?
11 | AWS Trainium is an AI chip developed by AWS to accelerate building and deploying machine learning models. Built on a specialized architecture designed for deep learning, Trainium accelerates the training and inference of complex models with high throughput and scalability, making it ideal for academic researchers looking to optimize performance and costs. This architecture also emphasizes sustainability through energy-efficient design, reducing environmental impact. Amazon has established a dedicated Trainium research cluster featuring up to 40,000 Trainium chips, accessible via Amazon EC2 Trn1 instances. These instances are connected through a non-blocking, petabit-scale network using Amazon EC2 UltraClusters, enabling seamless high-performance ML training. The Trn1 instance family is optimized to deliver substantial compute power for cutting-edge AI research and development. This unique offering not only enhances the efficiency and affordability of model training but also presents academic researchers with opportunities to publish new papers on underrepresented compute architectures, thus advancing the field.
12 | 
13 | Learn more about Build On Trainium [here](https://aws.amazon.com/ai/machine-learning/trainium/research/).
14 | 
15 | ### Your workshop
16 | This hands-on workshop is designed for academic researchers who are planning to submit proposals to [Build On Trainium](https://www.amazon.science/research-awards/call-for-proposals).
17 | 
18 | The workshop has multiple available modules:
19 | 1. Setup instructions
20 | 2. Run inference with Llama and NeuronX Distributed Inference (NxD)
21 | 3. Write your own kernel with the Neuron Kernel Interface (NKI)
22 | 4. Fine-tune and host an existing, supported model with a different dataset using Amazon SageMaker
23 | 
24 | #### Instructor-led workshop
25 | If you are participating in an instructor-led workshop, follow the guidance provided by your instructor for accessing the environment.
26 | 
27 | #### Self-managed workshop
28 | If you are following the workshop steps in your own environment, you will need to take the following actions:
29 | 1. Launch a trn1.2xlarge instance on Amazon EC2, using the latest [DLAMI with Neuron packages preinstalled](https://repost.aws/articles/ARTxLi0wndTwquyl7frQYuKg).
30 | 2. Use a Python virtual environment preinstalled in that DLAMI, commonly located in `/opt/aws_`.
31 | 3. Set up and manage your own development environment on that instance, such as by using VSCode or a Jupyter Lab server.
32 | 
33 | ### Background knowledge
34 | This workshop introduces developing on AWS Trainium for an academic AI research audience. As such, it's expected that the audience will already have a firm understanding of machine learning fundamentals.
35 | 
36 | ### Workshop costs
37 | If you are participating in an instructor-led workshop hosted in an AWS-managed Workshop Studio environment, you will not incur any costs for using this environment. If you are following this workshop in your own environment, you will incur the costs associated with provisioning an Amazon EC2 instance.
Please see the service pricing details [here](https://aws.amazon.com/ec2/pricing/on-demand/).
38 | 
39 | At the time of writing, this workshop uses a trn1.2xlarge instance with an on-demand rate of $1.34 per hour in supported US regions. The fine-tuning lab requires less than an hour of an ml.trn1.2xlarge at $1.54 an hour, plus an ml.inf2.xlarge at $0.99 an hour (you deploy it and **delete it when you are done**).
40 | 
41 | ## FAQs and known issues
42 | 1. Workshop instructions are available [here](https://catalog.us-east-1.prod.workshops.aws/workshops/bf9d80a3-5e4b-4648-bca8-1d887bb2a9ca/en-US).
43 | 2. If you use the `NousResearch` Llama 3.2 1B, please note that you'll need to remove a trailing comma from the model config file. You can do this with Vim in the VSCode terminal. If you do not take this step, you'll get an invalid JSON error when the model config is read in Lab 1. If editing the file through the terminal is a little challenging, you can also download the config file from this repository with the following command:
44 | `!wget https://raw.githubusercontent.com/aws-neuron/build-on-trainium-workshop/main/labs/generation_config.json -P /home/ec2-user/environment/models/llama/`
45 | 3. Jupyter kernels can hold on to the NeuronCores even after your cell has completed, because the kernel's Python process stays alive. This can then cause issues when you try to run a new notebook, and sometimes when you try to run another cell. If you encounter a `NeuronCore not found` or similar error, restart your Jupyter kernel and/or shut down kernels from previous sessions. You can also restart the instance through the EC2 console. Once your node is back online, you can always check the availability of the NeuronCores with `neuron-ls`.
46 | 4. Want to see how to integrate NKI with NxD? Check out our `nki-llama` [here](https://github.com/aws-samples/nki-llama).
47 | 
48 | 
49 | ## Security
50 | 
51 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
52 | 
53 | ## License
54 | 
55 | This project is licensed under the Apache-2.0 License.
56 | 
57 | 
--------------------------------------------------------------------------------
/contributed/models/README.md:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/contributed/models/qwen2/modeling_qwen2.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3 | #
4 | # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5 | # and OPT implementations in this library. It has been modified from its
6 | # original forms to accommodate minor architectural differences compared
7 | # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8 | #
9 | # Licensed under the Apache License, Version 2.0 (the "License");
10 | # you may not use this file except in compliance with the License.
11 | # You may obtain a copy of the License at
12 | #
13 | # http://www.apache.org/licenses/LICENSE-2.0
14 | #
15 | # Unless required by applicable law or agreed to in writing, software
16 | # distributed under the License is distributed on an "AS IS" BASIS,
17 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18 | # See the License for the specific language governing permissions and
19 | # limitations under the License.
20 | """PyTorch Qwen2 model for NXD inference.""" 21 | import copy 22 | import gc 23 | import logging 24 | import math 25 | from typing import List, Optional, Tuple, Type 26 | 27 | import torch 28 | from neuronx_distributed.parallel_layers import parallel_state # noqa: E402 29 | from neuronx_distributed.parallel_layers.layers import ( # noqa: E402; noqa: E402; noqa: E402; noqa: E402; noqa: E402 30 | ColumnParallelLinear, 31 | ParallelEmbedding, 32 | RowParallelLinear, 33 | ) 34 | from neuronx_distributed_inference.modules.attention.gqa import GroupQueryAttention_O 35 | from neuronx_distributed.parallel_layers.mappings import ( 36 | gather_from_sequence_parallel_region, 37 | reduce_from_tensor_model_parallel_region, 38 | reduce_scatter_to_sequence_parallel_region, 39 | _gather_along_first_dim, 40 | ) 41 | from neuronx_distributed.parallel_layers.utils import get_padding_length 42 | from neuronx_distributed.utils import cpu_mode 43 | from neuronxcc.nki._private_kernels.mlp import ( 44 | mlp_fused_add_isa_kernel, 45 | mlp_isa_kernel, 46 | quant_mlp_fused_add_isa_kernel, 47 | quant_mlp_isa_kernel, 48 | ) 49 | from neuronxcc.nki._private_kernels.rmsnorm import rmsnorm_quant_isa_kernel 50 | from neuronxcc.nki.language import nc 51 | from torch import nn 52 | from torch_neuronx.xla_impl.ops import nki_jit 53 | from transformers import Qwen2ForCausalLM 54 | from transformers.activations import ACT2FN 55 | from transformers.models.qwen2.modeling_qwen2 import Qwen2RMSNorm, Qwen2RotaryEmbedding 56 | 57 | from neuronx_distributed_inference.models.config import InferenceConfig, NeuronConfig # noqa: E402 58 | from neuronx_distributed_inference.models.model_base import ( # noqa: E402 59 | NeuronBaseForCausalLM, 60 | NeuronBaseModel, 61 | ) 62 | from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase 63 | from neuronx_distributed_inference.modules.attention.gqa import ( # noqa: E402 64 | BaseGroupQueryAttention, 65 | ) 66 | from neuronx_distributed_inference.modules.attention.utils import ( 67 | RotaryEmbedding, 68 | preprocess_quantized_linear_layer, 69 | transpose_parallel_linear_layer, 70 | ) 71 | from neuronx_distributed_inference.modules.custom_calls import CustomRMSNorm 72 | from neuronx_distributed_inference.modules.flashdecode.utils import calculate_num_cores_per_group 73 | from neuronx_distributed_inference.modules.lora_serving.lora_module import is_lora_module 74 | from neuronx_distributed_inference.utils.distributed import get_tp_group 75 | 76 | _Qwen2_MODULE_MAP = {} 77 | 78 | logger = logging.getLogger("Neuron") 79 | 80 | 81 | def get_rmsnorm_cls(): 82 | # Initialize to the appropriate implementation of RMSNorm 83 | # If infer on NXD -> CustomRMSNorm 84 | # If infer on CPU -> HF_RMSNorm (CustomRMSNorm does not work on CPU) 85 | return Qwen2RMSNorm if cpu_mode() else CustomRMSNorm 86 | 87 | 88 | def preshard_hook_fn(module: torch.nn.Module, model_state_dict: dict, prefix: str) -> bool: 89 | if isinstance(module, (BaseGroupQueryAttention,)): 90 | return module.preshard_hook(model_state_dict, prefix) 91 | 92 | return False 93 | 94 | 95 | # Get the modules_to_not_convert from the neuron configs 96 | def get_modules_to_not_convert(neuron_config: NeuronConfig): 97 | return getattr(neuron_config, "modules_to_not_convert", None) 98 | 99 | 100 | def get_updated_configs(config: InferenceConfig): 101 | """ 102 | Generate a list of configurations for each hidden layer in a Qwen2 model. 
103 | 104 | This function creates a list of InferenceConfig objects, one for each layer. It 105 | modifies the configurations for certain layers based on which modules should not 106 | be converted to quantized format. The function uses get_modules_to_not_convert() 107 | to determine which modules should not be converted. 108 | 109 | Args: 110 | config (InferenceConfig): The inference configuration for the model. 111 | 112 | Returns: 113 | list[InferenceConfig]: A list of InferenceConfig objects, one for each layer in the model. 114 | Each config may be either the original config or a modified version 115 | with "quantized_mlp_kernel_enabled" as False for that specific layer. 116 | """ 117 | updated_configs = [] 118 | modules_to_not_convert = get_modules_to_not_convert(config.neuron_config) 119 | if modules_to_not_convert is None: 120 | modules_to_not_convert = [] 121 | 122 | for i in range(config.num_hidden_layers): 123 | # If any of the MLP modules for this layer are in modules_to_not_convert 124 | module_pattern = f"layers.{i}.mlp" 125 | if any(module_pattern in module for module in modules_to_not_convert): 126 | non_quant_config = copy.deepcopy(config) 127 | non_quant_config.neuron_config.quantized_mlp_kernel_enabled = False 128 | non_quant_config.neuron_config.activation_quantization_type = None 129 | non_quant_config.neuron_config.quantize_clamp_bound = float("inf") 130 | updated_configs.append(non_quant_config) 131 | else: 132 | updated_configs.append(config) 133 | return updated_configs 134 | 135 | 136 | def _register_module(key: str, cls: Type[nn.Module]): 137 | _Qwen2_MODULE_MAP[key] = cls 138 | 139 | 140 | def register_module(key: str): 141 | """ 142 | Register a module for use in NeuronQwen2. 143 | 144 | Arguments: 145 | key: String used to identify the module 146 | 147 | Example: 148 | @register_module("NeuronQwen2Attention") 149 | class NeuronQwen2Attention(nn.Module): 150 | ... 151 | """ 152 | 153 | def inner(cls: Type[nn.Module]): 154 | _register_module(key, cls) 155 | return cls 156 | 157 | return inner 158 | 159 | 160 | def _helper_concat_and_delete_qkv(Qwen2_state_dict, layer_num, attr): 161 | """ 162 | Helper function to concatenate and delete QKV attributes for fusedqkv (weight or scale). 163 | Args: 164 | Qwen2_state_dict: The state dictionary containing model weights 165 | layer_num: The index of the layer to process 166 | attr: The attribute to process ('weight' or 'scale') 167 | """ 168 | Qwen2_state_dict[f"layers.{layer_num}.self_attn.Wqkv.{attr}"] = torch.cat( 169 | [ 170 | Qwen2_state_dict[f"layers.{layer_num}.self_attn.q_proj.{attr}"], 171 | Qwen2_state_dict[f"layers.{layer_num}.self_attn.k_proj.{attr}"], 172 | Qwen2_state_dict[f"layers.{layer_num}.self_attn.v_proj.{attr}"], 173 | ], 174 | ) 175 | del Qwen2_state_dict[f"layers.{layer_num}.self_attn.q_proj.{attr}"] 176 | del Qwen2_state_dict[f"layers.{layer_num}.self_attn.k_proj.{attr}"] 177 | del Qwen2_state_dict[f"layers.{layer_num}.self_attn.v_proj.{attr}"] 178 | 179 | 180 | def convert_state_dict_to_fused_qkv(Qwen2_state_dict, cfg: InferenceConfig): 181 | """ 182 | This function concats the qkv weights and scales to a Wqkv weight and scale for fusedqkv, and deletes the qkv weights. 
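    The q, k, and v projection weights are stored in (out_features, in_features) layout and are
    concatenated along dim 0, so the fused Wqkv weight has shape
    ((num_attention_heads + 2 * num_key_value_heads) * head_dim, hidden_size).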
183 |     """
184 |     mods_to_not_conv = get_modules_to_not_convert(cfg.neuron_config)
185 |     if mods_to_not_conv is None:
186 |         mods_to_not_conv = []
187 | 
188 |     for l in range(cfg.num_hidden_layers):  # noqa: E741
189 |         _helper_concat_and_delete_qkv(Qwen2_state_dict, l, "weight")
190 |         if (
191 |             cfg.neuron_config.quantized_mlp_kernel_enabled or cfg.neuron_config.quantized
192 |         ) and f"layers.{l}.self_attn" not in mods_to_not_conv:
193 |             _helper_concat_and_delete_qkv(Qwen2_state_dict, l, "scale")
194 | 
195 |     gc.collect()
196 | 
197 |     return Qwen2_state_dict
198 | 
199 | 
200 | class WeightGatheredColumnParallel(ColumnParallelLinear):
201 |     """
202 |     A specialized column-parallel linear layer that implements weight gathering optimization
203 |     for efficient processing of long sequences in transformer models during eagle speculation.
204 | 
205 |     This layer provides two forward paths:
206 |     1. Standard column-parallel forward (inherited from parent)
207 |     2. Weight-gathered forward for long sequences
208 |     """
209 |     def forward_wg(self, input: torch.Tensor, weight_gather: bool = False):
210 |         """
211 |         Performs the forward pass with optional weight gathering optimization.
212 | 
213 |         Args:
214 |             input (torch.Tensor): Input tensor of shape (batch_size, seq_len/TP, 2*hidden_size)
215 |             weight_gather (bool): Whether to use weight gathering optimization.
216 |                 Typically True for sequences >= 32K
217 | 
218 |         Returns:
219 |             torch.Tensor or Tuple[torch.Tensor, torch.Tensor]:
220 |                 - If skip_bias_add is False: Output tensor of shape (batch_size, seq_len, hidden_size)
221 |                 - If skip_bias_add is True: Tuple of (output tensor, bias)
222 |         """
223 |         if weight_gather:
224 |             weight = _gather_along_first_dim(self.weight, process_group=self.tensor_parallel_group)
225 |             output = self._forward_impl(
226 |                 input=input,
227 |                 weight=weight,
228 |                 bias=None,
229 |                 async_grad_allreduce=self.async_tensor_model_parallel_allreduce,
230 |                 sequence_parallel_enabled=self.sequence_parallel_enabled,
231 |                 sequence_dimension=self.sequence_dimension,
232 |                 autograd_func_class=self.autograd_func_class,
233 |                 process_group=self.tensor_parallel_group
234 |             )
235 | 
236 |             output = gather_from_sequence_parallel_region(
237 |                 output,
238 |                 self.sequence_dimension,
239 |                 process_group=self.tensor_parallel_group,
240 |             )
241 |             if self.skip_bias_add:
242 |                 return output, self.bias
243 | 
244 |             output = (output + self.bias) if self.bias is not None else output
245 |             return output
246 |         else:
247 |             return self.forward(input)
248 | 
249 | 
250 | class Qwen2InferenceConfig(InferenceConfig):
251 |     def add_derived_config(self):
252 |         self.num_cores_per_group = 1
253 |         if self.neuron_config.flash_decoding_enabled:
254 |             num_attn_heads, num_kv_heads = self.num_attention_heads, self.num_key_value_heads
255 |             self.num_cores_per_group = calculate_num_cores_per_group(
256 |                 num_attn_heads, num_kv_heads, self.neuron_config.tp_degree
257 |             )
258 | 
259 |     def get_required_attributes(self) -> List[str]:
260 |         return [
261 |             "hidden_size",
262 |             "num_attention_heads",
263 |             "num_hidden_layers",
264 |             "num_key_value_heads",
265 |             "pad_token_id",
266 |             "vocab_size",
267 |             "max_position_embeddings",
268 |             "rope_theta",
269 |             "rms_norm_eps",
270 |             "hidden_act",
271 |         ]
272 | 
273 |     @classmethod
274 |     def get_neuron_config_cls(cls) -> Type[NeuronConfig]:
275 |         return NeuronConfig
276 | 
277 | 
278 | class NeuronQwen2MLP(nn.Module):
279 |     """
280 |     This class just replaces the linear layers (gate_proj, up_proj and down_proj) with column and row parallel layers.
281 |     """
282 | 
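    # Tensor-parallel layout set up in __init__ below:
    #   - gate_proj and up_proj are ColumnParallelLinear with gather_output=False, so each
    #     tensor-parallel rank holds a shard of the intermediate dimension.
    #   - down_proj is RowParallelLinear with input_is_parallel=True, so its partial outputs are
    #     reduced across ranks (all-reduce, or reduce-scatter when sequence parallelism is enabled).
    #   - When MLP kernels are enabled, the weights are additionally transposed (or preprocessed,
    #     for the quantized path) into the layout expected by the NKI kernels.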
283 | def __init__(self, config: InferenceConfig): 284 | super().__init__() 285 | self.config = config 286 | self.neuron_config = config.neuron_config 287 | self.tp_degree = config.neuron_config.tp_degree 288 | self.hidden_size = config.hidden_size 289 | self.intermediate_size = config.intermediate_size 290 | self.act_fn = ACT2FN[config.hidden_act] 291 | 292 | self.sequence_parallel_enabled = getattr( 293 | self.neuron_config, "sequence_parallel_enabled", False 294 | ) 295 | self.sequence_dimension = 1 if self.sequence_parallel_enabled else None 296 | self.rms_norm_eps = config.rms_norm_eps 297 | self.mlp_kernel_enabled = self.neuron_config.mlp_kernel_enabled 298 | self.fused_rmsnorm_skip_gamma = self.config.neuron_config.fused_rmsnorm_skip_gamma 299 | self.quantized_mlp_kernel_enabled = self.neuron_config.quantized_mlp_kernel_enabled 300 | self.rmsnorm_quantize_kernel_enabled = self.neuron_config.rmsnorm_quantize_kernel_enabled 301 | self.quantize_clamp_bound = self.neuron_config.quantize_clamp_bound 302 | self.logical_nc_config = self.neuron_config.logical_nc_config 303 | self.activation_quantization_type = self.neuron_config.activation_quantization_type 304 | mlp_bias = getattr(config, "mlp_bias", False) 305 | 306 | if self.neuron_config.quantized_mlp_kernel_enabled and self.quantize_clamp_bound == float( 307 | "inf" 308 | ): 309 | logging.warning( 310 | "quantize_clamp_bound is not specified in NeuronConfig. We will use the default value of 1200 for Qwen2 models in quantized kernels." 311 | ) 312 | self.quantize_clamp_bound = 1200.0 313 | if parallel_state.model_parallel_is_initialized(): 314 | if self.neuron_config.quantized_mlp_kernel_enabled: 315 | # # Quantized MLP kernels expect intermediate size to be multiple of 128, so we need to pad 316 | tp_degree = self.neuron_config.tp_degree 317 | self.intermediate_size += ( 318 | get_padding_length(self.intermediate_size // tp_degree, 128) * tp_degree 319 | ) 320 | logger.debug(f"Quantized intermediate_size: {self.intermediate_size}") 321 | self.gate_proj = ColumnParallelLinear( 322 | self.hidden_size, 323 | self.intermediate_size, 324 | bias=mlp_bias, 325 | gather_output=False, 326 | dtype=config.neuron_config.torch_dtype, 327 | pad=True, 328 | sequence_parallel_enabled=False, 329 | sequence_dimension=None, 330 | tensor_model_parallel_group=get_tp_group(config), 331 | ) 332 | self.up_proj = ColumnParallelLinear( 333 | self.hidden_size, 334 | self.intermediate_size, 335 | bias=mlp_bias, 336 | gather_output=False, 337 | dtype=config.neuron_config.torch_dtype, 338 | pad=True, 339 | sequence_parallel_enabled=False, 340 | sequence_dimension=None, 341 | tensor_model_parallel_group=get_tp_group(config), 342 | ) 343 | self.down_proj = RowParallelLinear( 344 | self.intermediate_size, 345 | self.hidden_size, 346 | bias=mlp_bias, 347 | input_is_parallel=True, 348 | dtype=config.neuron_config.torch_dtype, 349 | pad=True, 350 | sequence_parallel_enabled=self.sequence_parallel_enabled, 351 | sequence_dimension=self.sequence_dimension, 352 | tensor_model_parallel_group=get_tp_group(config), 353 | reduce_dtype=config.neuron_config.rpl_reduce_dtype, 354 | ) 355 | if self.mlp_kernel_enabled: 356 | if self.neuron_config.quantized_mlp_kernel_enabled: 357 | setattr( 358 | self.gate_proj, 359 | "post_create_quantized_module_hook", 360 | preprocess_quantized_linear_layer, 361 | ) 362 | setattr( 363 | self.up_proj, 364 | "post_create_quantized_module_hook", 365 | preprocess_quantized_linear_layer, 366 | ) 367 | setattr( 368 | self.down_proj, 369 | 
"post_create_quantized_module_hook", 370 | preprocess_quantized_linear_layer, 371 | ) 372 | else: 373 | # Transpose the weights to the layout expected by kernels 374 | self.gate_proj.weight = transpose_parallel_linear_layer(self.gate_proj.weight) 375 | self.up_proj.weight = transpose_parallel_linear_layer(self.up_proj.weight) 376 | self.down_proj.weight = transpose_parallel_linear_layer(self.down_proj.weight) 377 | 378 | else: 379 | self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=mlp_bias) 380 | self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=mlp_bias) 381 | self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=mlp_bias) 382 | 383 | def _kernel_enabled_quantized_mlp(self, x, rmsnorm, residual, adapter_ids): 384 | grid = (nc(self.logical_nc_config),) 385 | fused_residual = residual is not None 386 | fused_rmsnorm = rmsnorm is not None 387 | logger.debug( 388 | f"MLP: quantized kernel, fused_residual={fused_residual}, fused_rmsnorm={fused_rmsnorm}, logical_nc_config={self.logical_nc_config}" 389 | ) 390 | 391 | # Can't do residual add in the kernel if SP is enabled 392 | if fused_residual: 393 | assert ( 394 | not self.sequence_parallel_enabled 395 | ), "Quantized MLP cannot have both fused residual add and sequence parallel RMSnorm!" 396 | # Using fused residual add 397 | _mlp_fwd_call = nki_jit()(quant_mlp_fused_add_isa_kernel) 398 | else: 399 | _mlp_fwd_call = nki_jit()(quant_mlp_isa_kernel) 400 | 401 | if fused_rmsnorm: 402 | ln_w = rmsnorm.weight.unsqueeze(0) 403 | else: 404 | ln_w = torch.zeros(size=(1, self.hidden_size), dtype=x.dtype, device=x.device) 405 | 406 | # Handle SP RMSnorm 407 | x_orig_dtype = x.dtype 408 | if self.sequence_parallel_enabled: 409 | # This RMSNormQuant kernel will do quantization inside, so we pass the 410 | # clamp_bound for clipping. 411 | # If we don't use this kernel, the MLP kernel below will do the 412 | # quantization, so we also pass clamp_bound to that kernel. 413 | if self.rmsnorm_quantize_kernel_enabled: 414 | logger.debug( 415 | "Running Quantized MLP kernel with sequence-parallel RMSnorm-Quantize kernel!" 416 | ) 417 | _rmsnorm_quant_fwd_call = nki_jit()(rmsnorm_quant_isa_kernel) 418 | quant_rmsnorm_out = torch.zeros( 419 | size=( 420 | x.shape[0], # batch size 421 | x.shape[1], # sequence length 422 | x.shape[2] + 4, # hidden size + 4 bytes for packing fp32 scale 423 | ), 424 | dtype=torch.int8, 425 | device=x.device, 426 | ) 427 | clamp_bound = self.quantize_clamp_bound 428 | _rmsnorm_quant_fwd_call[grid]( 429 | x, ln_w, clamp_bound, quant_rmsnorm_out, kernel_name="QuantOnly" 430 | ) 431 | x = gather_from_sequence_parallel_region( 432 | quant_rmsnorm_out, 433 | self.sequence_dimension, 434 | process_group=get_tp_group(self.config), 435 | ) 436 | 437 | else: 438 | logger.debug( 439 | "Running Quantized MLP kernel with external (native compiler) sequence-parallel RMSnorm!" 
440 | ) 441 | x = gather_from_sequence_parallel_region( 442 | x, self.sequence_dimension, process_group=get_tp_group(self.config) 443 | ) 444 | 445 | # Build output tensor 446 | output_tensor_seqlen = x.shape[1] 447 | if fused_residual: 448 | # seqlen dim is doubled to store the residual add output 449 | output_tensor_seqlen *= 2 450 | 451 | output_tensor = torch.zeros( 452 | size=( 453 | x.shape[0], # batch size 454 | output_tensor_seqlen, 455 | self.hidden_size, # hidden size 456 | ), 457 | dtype=x_orig_dtype, 458 | device=x.device, 459 | ) 460 | 461 | # Grab weights 462 | # all weights of the layers are stored in (out, in) shape 463 | # unsqueeze so that shape of RMS gamma weight is [1, hidden] instead of [hidden] 464 | gate_w = self.gate_proj.weight.data 465 | gate_w_scale = self.gate_proj.scale 466 | up_w = self.up_proj.weight.data 467 | up_w_scale = self.up_proj.scale 468 | down_w = self.down_proj.weight.data 469 | down_w_scale = self.down_proj.scale 470 | clamp_bound = self.quantize_clamp_bound 471 | 472 | if fused_residual: 473 | _mlp_fwd_call[grid]( 474 | x, # attn_output 475 | residual, # hidden 476 | ln_w, # ln_w 477 | gate_w, # gate_w 478 | gate_w_scale, 479 | up_w, # up_w 480 | up_w_scale, 481 | down_w, # down_w 482 | down_w_scale, 483 | clamp_bound, 484 | output_tensor, # out 485 | fused_rmsnorm=fused_rmsnorm, 486 | eps=self.rms_norm_eps, 487 | kernel_name="MLP", 488 | store_add=True, 489 | ) 490 | original_seqlen = x.shape[1] 491 | residual = output_tensor[:, original_seqlen:, :] 492 | output_tensor = output_tensor[:, :original_seqlen, :] 493 | else: 494 | _mlp_fwd_call[grid]( 495 | x, # hidden 496 | # should be fine to pass gamma is as a dummy even if not using fused rmsnorm 497 | ln_w, 498 | gate_w, # gate_w 499 | gate_w_scale, 500 | up_w, # up_w 501 | up_w_scale, 502 | down_w, # down_w 503 | down_w_scale, 504 | clamp_bound, 505 | output_tensor, # out 506 | # Run RMSNorm inside the kernel if NOT using SP rmsnorm 507 | fused_rmsnorm=fused_rmsnorm, 508 | eps=self.rms_norm_eps, 509 | kernel_name="MLP", 510 | ) 511 | residual = None 512 | 513 | # All-reduce or reduce-scatter, depending on whether SP is enabled 514 | if self.sequence_parallel_enabled: 515 | output_tensor = reduce_scatter_to_sequence_parallel_region( 516 | output_tensor, self.sequence_dimension, process_group=get_tp_group(self.config) 517 | ) 518 | else: 519 | output_tensor = reduce_from_tensor_model_parallel_region(output_tensor) 520 | 521 | logger.debug(f"Quantized MLP output shape {output_tensor.shape}") 522 | return (output_tensor, residual) 523 | 524 | def _kernel_enabled_mlp(self, x, rmsnorm, residual, adapter_ids): 525 | fused_residual = residual is not None 526 | fused_rmsnorm = rmsnorm is not None 527 | logger.debug( 528 | f"MLP: kernel, fused_residual={fused_residual}, fused_rmsnorm={fused_rmsnorm}, skip_gamma={self.fused_rmsnorm_skip_gamma}, logical_nc_config={self.logical_nc_config}" 529 | ) 530 | 531 | # Choose which kernel to call 532 | if fused_residual: 533 | assert ( 534 | not self.sequence_parallel_enabled 535 | ), "MLP kernel cannot have both fused residual add and sequence parallel RMSnorm!" 
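            # The fused-add kernel writes both the MLP output and the residual-add result into a
            # single output tensor whose sequence dimension is doubled; the two halves are split
            # back apart after the kernel call below.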
536 | # Using fused residual add 537 | _mlp_fwd_call = nki_jit()(mlp_fused_add_isa_kernel) 538 | else: 539 | _mlp_fwd_call = nki_jit()(mlp_isa_kernel) 540 | 541 | if self.sequence_parallel_enabled: 542 | x = gather_from_sequence_parallel_region( 543 | x, self.sequence_dimension, process_group=get_tp_group(self.config) 544 | ) 545 | 546 | # Build output tensor 547 | output_tensor_seqlen = x.shape[1] 548 | if fused_residual: 549 | # seqlen dim is doubled to store the residual add output 550 | output_tensor_seqlen *= 2 551 | 552 | output_tensor = torch.zeros( 553 | size=( 554 | x.shape[0], # batch size 555 | output_tensor_seqlen, 556 | self.hidden_size, # hidden size 557 | ), 558 | dtype=x.dtype, 559 | device=x.device, 560 | ) 561 | 562 | # Grab weights 563 | # all weights of the layers are stored in (out, in) shape 564 | # unsqueeze so that shape of RMS gamma weight is [1, hidden] instead of [hidden] 565 | if fused_rmsnorm: 566 | ln_w = rmsnorm.weight.unsqueeze(0) 567 | else: 568 | ln_w = torch.zeros(size=(1, self.hidden_size), dtype=x.dtype, device=x.device) 569 | gate_w = self.gate_proj.weight.data 570 | up_w = self.up_proj.weight.data 571 | down_w = self.down_proj.weight.data 572 | 573 | grid = (nc(self.logical_nc_config),) 574 | 575 | if fused_residual: 576 | _mlp_fwd_call[grid]( 577 | x, # attn_output 578 | residual, # hidden 579 | ln_w, # ln_w 580 | gate_w, # gate_w 581 | up_w, # up_w 582 | down_w, # down_w 583 | output_tensor, # out 584 | kernel_name="MLP", 585 | fused_rmsnorm=fused_rmsnorm, 586 | skip_gamma=self.fused_rmsnorm_skip_gamma, 587 | eps=self.rms_norm_eps, 588 | store_add=True, 589 | ) 590 | original_seqlen = x.shape[1] 591 | residual = output_tensor[:, original_seqlen:, :] 592 | output_tensor = output_tensor[:, :original_seqlen, :] 593 | else: 594 | _mlp_fwd_call[grid]( 595 | x, # hidden 596 | # should be fine to pass gamma is as a dummy even if not using fused rmsnorm 597 | ln_w, 598 | gate_w, 599 | up_w, 600 | down_w, 601 | output_tensor, # out 602 | kernel_name="MLP", 603 | # Run RMSNorm inside the kernel if NOT using SP rmsnorm 604 | fused_rmsnorm=fused_rmsnorm, 605 | skip_gamma=self.fused_rmsnorm_skip_gamma, 606 | eps=self.rms_norm_eps, 607 | ) 608 | residual = None 609 | 610 | # All-reduce or reduce-scatter, depending on whether SP is enabled 611 | if self.sequence_parallel_enabled: 612 | output_tensor = reduce_scatter_to_sequence_parallel_region( 613 | output_tensor, self.sequence_dimension, process_group=get_tp_group(self.config) 614 | ) 615 | else: 616 | output_tensor = reduce_from_tensor_model_parallel_region( 617 | output_tensor, process_group=get_tp_group(self.config) 618 | ) 619 | 620 | logger.debug(f"MLP output shape {output_tensor.shape}") 621 | return (output_tensor, residual) 622 | 623 | def _native_mlp(self, x, adapter_ids=None): 624 | logger.debug("MLP: native compiler") 625 | # all-gather is done here instead of CPL layers to 626 | # avoid 2 all-gathers from up and gate projections 627 | if self.sequence_parallel_enabled: 628 | x = gather_from_sequence_parallel_region( 629 | x, self.sequence_dimension, process_group=get_tp_group(self.config) 630 | ) 631 | gate_proj_output = ( 632 | self.gate_proj(x) 633 | if not is_lora_module(self.gate_proj) 634 | else self.gate_proj(x, adapter_ids) 635 | ) 636 | 637 | up_proj_output = ( 638 | self.up_proj(x) if not is_lora_module(self.up_proj) else self.up_proj(x, adapter_ids) 639 | ) 640 | 641 | down_proj_input = self.act_fn(gate_proj_output) * up_proj_output 642 | output = ( 643 | self.down_proj(down_proj_input) 644 
| if not is_lora_module(self.down_proj) 645 | else self.down_proj(down_proj_input, adapter_ids) 646 | ) 647 | logger.debug(f"MLP output shape {output.shape}") 648 | return output 649 | 650 | def forward(self, x, rmsnorm=None, residual=None, adapter_ids=None): 651 | """ 652 | If residual is passed in, will fuse its add into the MLP kernel 653 | If rmsnorm is passed in, will fuse the rmsnorm into the MLP kernel 654 | 655 | Returns a tuple of (output, residual), where residual is the output of the residual add 656 | """ 657 | 658 | if self.mlp_kernel_enabled: 659 | # Quantized MLP kernel 660 | if self.quantized_mlp_kernel_enabled: 661 | return self._kernel_enabled_quantized_mlp( 662 | x, rmsnorm, residual, adapter_ids=adapter_ids 663 | ) 664 | # MLP kernel 665 | return self._kernel_enabled_mlp(x, rmsnorm, residual, adapter_ids=adapter_ids) 666 | else: 667 | # No kernel 668 | assert rmsnorm is None and residual is None 669 | return (self._native_mlp(x, adapter_ids=adapter_ids), None) 670 | 671 | 672 | @register_module("NeuronQwen2Attention") 673 | class NeuronQwen2Attention(NeuronAttentionBase): 674 | """ 675 | Compared with Qwen2Attention, this class just 676 | 1. replaces the q_proj, k_proj, v_proj with column parallel layer 677 | 2. replaces the o_proj with row parallel layer 678 | 3. update self.num_head to be self.num_head / tp_degree 679 | 4. update self.num_key_value_heads to be self.num_key_value_heads / tp_degree 680 | 5. update forward() method to adjust to changes from self.num_head 681 | """ 682 | 683 | def __init__(self, config: InferenceConfig, tensor_model_parallel_group=None): 684 | super().__init__(tensor_model_parallel_group=tensor_model_parallel_group) 685 | 686 | self.config = config 687 | self.neuron_config = config.neuron_config 688 | self.hidden_size = config.hidden_size 689 | self.num_attention_heads = config.num_attention_heads 690 | self.num_key_value_heads = config.num_key_value_heads 691 | self.head_dim = getattr(config, "head_dim", self.hidden_size // self.num_attention_heads) 692 | self.max_position_embeddings = config.max_position_embeddings 693 | self.rope_theta = config.rope_theta 694 | self.padding_side = config.neuron_config.padding_side 695 | self.torch_dtype = config.neuron_config.torch_dtype 696 | self.is_medusa = config.neuron_config.is_medusa 697 | self.flash_decoding_enabled = config.neuron_config.flash_decoding_enabled 698 | self.num_cores_per_group = config.num_cores_per_group 699 | self.bias = getattr(config, "attention_bias", True) 700 | self.rpl_reduce_dtype = config.neuron_config.rpl_reduce_dtype 701 | self.mlp_kernel_enabled = config.neuron_config.mlp_kernel_enabled 702 | self.rms_norm_eps = config.rms_norm_eps 703 | self.attn_tkg_builtin_kernel_enabled = self.neuron_config.attn_tkg_builtin_kernel_enabled 704 | 705 | if parallel_state.model_parallel_is_initialized(): 706 | self.tp_degree = self.config.neuron_config.tp_degree 707 | else: 708 | self.tp_degree = 1 709 | 710 | self.fused_qkv = config.neuron_config.fused_qkv 711 | self.clip_qkv = None 712 | 713 | self.sequence_parallel_enabled = self.neuron_config.sequence_parallel_enabled 714 | self.sequence_dimension = 1 if self.sequence_parallel_enabled else None 715 | logger.debug( 716 | f"Hello from NeuronQwen2Attention init! Is SP enabled? {self.sequence_parallel_enabled}. Dim? 
{self.sequence_dimension}" 717 | ) 718 | 719 | self.init_gqa_properties() 720 | 721 | self.init_rope() 722 | 723 | self.o_proj = GroupQueryAttention_O( 724 | hidden_size=self.hidden_size, 725 | head_dim=self.head_dim, 726 | num_attention_heads=self.num_attention_heads, 727 | num_key_value_heads=self.num_key_value_heads, 728 | tp_degree=self.tp_degree, 729 | dtype=self.torch_dtype, 730 | bias=False, 731 | input_is_parallel=True, 732 | layer_name=self.o_proj_layer_name, 733 | sequence_parallel_enabled=self.sequence_parallel_enabled, 734 | sequence_dimension=self.sequence_dimension, 735 | tensor_model_parallel_group=self.tensor_model_parallel_group, 736 | rpl_reduce_dtype=self.rpl_reduce_dtype, 737 | ) 738 | 739 | def init_rope(self): 740 | self.rotary_emb = Qwen2RotaryEmbedding(self.config) 741 | 742 | if self.attn_tkg_builtin_kernel_enabled: 743 | self.inv_freqs = self.rotary_emb.get_inv_freqs().unsqueeze(1) 744 | 745 | 746 | 747 | class NeuronQwen2DecoderLayer(nn.Module): 748 | """ 749 | Just replace the attention with the NXD version, and MLP with the NXD version 750 | """ 751 | 752 | def __init__(self, config: InferenceConfig): 753 | super().__init__() 754 | self.hidden_size = config.hidden_size 755 | 756 | self.self_attn = NeuronQwen2Attention( 757 | config=config, tensor_model_parallel_group=get_tp_group(config) 758 | ) 759 | 760 | self.mlp = NeuronQwen2MLP(config) 761 | logger.debug( 762 | f"Instantiating RMSNorm modules with hidden size {config.hidden_size} and EPS {config.rms_norm_eps}" 763 | ) 764 | self.input_layernorm = None 765 | if ( 766 | not config.neuron_config.is_eagle_draft 767 | or config.neuron_config.enable_eagle_draft_input_norm 768 | ): 769 | self.input_layernorm = get_rmsnorm_cls()( 770 | config.hidden_size, 771 | eps=config.rms_norm_eps, 772 | ) 773 | self.post_attention_layernorm = get_rmsnorm_cls()( 774 | config.hidden_size, 775 | eps=config.rms_norm_eps, 776 | ) 777 | self.qkv_kernel_enabled = config.neuron_config.qkv_kernel_enabled 778 | self.mlp_kernel_enabled = config.neuron_config.mlp_kernel_enabled 779 | self.quantized_mlp_kernel_enabled = config.neuron_config.quantized_mlp_kernel_enabled 780 | self.rmsnorm_quantize_kernel_enabled = config.neuron_config.rmsnorm_quantize_kernel_enabled 781 | self.mlp_kernel_fuse_residual_add = config.neuron_config.mlp_kernel_fuse_residual_add 782 | self.qkv_kernel_fuse_residual_add = config.neuron_config.qkv_kernel_fuse_residual_add 783 | self.sequence_parallel_enabled = config.neuron_config.sequence_parallel_enabled 784 | self.is_prefill_stage = config.neuron_config.is_prefill_stage 785 | self.config = config 786 | 787 | if self.is_prefill_stage and self.config.neuron_config.is_mlp_quantized(): 788 | # for CTE, quantized MLP kernel does not support fused rmsnorm 789 | self.mlp_kernel_fused_rmsnorm = False 790 | else: 791 | self.mlp_kernel_fused_rmsnorm = not self.sequence_parallel_enabled 792 | 793 | def forward( 794 | self, 795 | hidden_states: torch.Tensor, 796 | attention_mask: Optional[torch.Tensor] = None, 797 | position_ids: Optional[torch.LongTensor] = None, 798 | past_key_value: Optional[Tuple[torch.Tensor]] = None, 799 | adapter_ids=None, 800 | rotary_position_ids: Optional[torch.LongTensor] = None, 801 | residual: Optional[torch.Tensor] = None, # residual from previous layer used by QKV 802 | **kwargs, 803 | ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]], Optional[torch.FloatTensor], Optional[torch.FloatTensor], Optional[torch.FloatTensor]]: 804 | entry_hidden_states = 
hidden_states 805 | # RMSNorm (fused with QKV kernel when SP is disabled) 806 | if (not self.qkv_kernel_enabled or self.sequence_parallel_enabled) and self.input_layernorm: 807 | hidden_states = self.input_layernorm(hidden_states) 808 | 809 | # Self Attention 810 | # produced another residual used by MLP 811 | attn_output = self.self_attn( 812 | hidden_states=hidden_states, 813 | attention_mask=attention_mask, 814 | position_ids=position_ids, 815 | past_key_value=past_key_value, 816 | adapter_ids=adapter_ids, 817 | rmsnorm=self.input_layernorm, 818 | rotary_position_ids=rotary_position_ids, 819 | residual=residual, 820 | **kwargs, 821 | ) 822 | 823 | if attn_output.residual is None: 824 | residual = entry_hidden_states # input to attention 825 | else: 826 | # residual will only be returned by attn/qkv if fuse add qkv kernel is enabled 827 | assert self.qkv_kernel_fuse_residual_add, \ 828 | "residual add before qkv should be computed in the previous layer, \ 829 | unless qkv_kernel_fuse_residual_add is specified" 830 | assert ( 831 | not self.sequence_parallel_enabled 832 | ), "qkv_kernel_fuse_residual_add should be off when sequence parallelism is enabled" 833 | assert ( 834 | self.qkv_kernel_enabled 835 | ), "qkv_kernel_fuse_residual_add should be used with qkv_kernel_enabled" 836 | residual = attn_output.residual 837 | 838 | hidden_states = attn_output.hidden_states 839 | if self.mlp_kernel_enabled and self.mlp_kernel_fuse_residual_add: 840 | assert ( 841 | not self.sequence_parallel_enabled 842 | ), "mlp_kernel_fuse_residual_add should be off when sequence parallelism is enabled" 843 | # First residual add handled in the MLP kernel 844 | hidden_states, residual = self.mlp( 845 | hidden_states, 846 | rmsnorm=self.post_attention_layernorm, 847 | residual=residual, 848 | adapter_ids=adapter_ids, 849 | ) 850 | else: 851 | hidden_states = residual + hidden_states 852 | residual = hidden_states 853 | # RMSNorm (fused with QKV kernel when SP is disabled) 854 | if self.mlp_kernel_enabled and self.mlp_kernel_fused_rmsnorm: 855 | rmsnorm = self.post_attention_layernorm 856 | else: 857 | hidden_states = self.post_attention_layernorm(hidden_states) 858 | rmsnorm = None 859 | hidden_states, _ = self.mlp( 860 | hidden_states, 861 | rmsnorm=rmsnorm, 862 | adapter_ids=adapter_ids, 863 | ) 864 | 865 | # if fuse residual add with qkv, we leave this add to the next layer's QKV 866 | # unless it is the last layer in which case we add it here 867 | if not self.qkv_kernel_fuse_residual_add: 868 | hidden_states = residual + hidden_states 869 | residual = None # set to None to prevent it from being used again 870 | 871 | # also return residual for QKV in the next layer 872 | outputs = (hidden_states, attn_output.present_key_value, attn_output.cos_cache, attn_output.sin_cache, residual) 873 | return outputs 874 | 875 | 876 | class ResBlock(nn.Module): 877 | """ 878 | A Residual Block module. 879 | 880 | This module performs a linear transformation followed by a SiLU activation, 881 | and then adds the result to the original input, creating a residual connection. 882 | 883 | Args: 884 | hidden_size (int): The size of the hidden layers in the block. 
885 | """ 886 | 887 | def __init__(self, hidden_size): 888 | super().__init__() 889 | self.linear = nn.Linear(hidden_size, hidden_size) 890 | # Initialize as an identity mapping 891 | torch.nn.init.zeros_(self.linear.weight) 892 | # Use SiLU activation to keep consistent with the Qwen2 model 893 | self.act = nn.SiLU() 894 | 895 | def forward(self, x): 896 | """ 897 | Forward pass of the ResBlock. 898 | 899 | Args: 900 | x (torch.Tensor): Input tensor. 901 | 902 | Returns: 903 | torch.Tensor: Output after the residual connection and activation. 904 | """ 905 | return x + self.act(self.linear(x)) 906 | 907 | 908 | class NeuronQwen2Model(NeuronBaseModel): 909 | """ 910 | The neuron version of the Qwen2Model 911 | """ 912 | 913 | def setup_attr_for_model(self, config: InferenceConfig): 914 | # Needed for init_inference_optimization() 915 | self.on_device_sampling = config.neuron_config.on_device_sampling_config is not None 916 | self.tp_degree = config.neuron_config.tp_degree 917 | self.hidden_size = config.hidden_size 918 | self.num_attention_heads = config.num_attention_heads 919 | self.num_key_value_heads = config.num_key_value_heads 920 | self.max_batch_size = config.neuron_config.max_batch_size 921 | self.buckets = config.neuron_config.buckets 922 | 923 | def init_model(self, config: InferenceConfig): 924 | self.padding_idx = config.pad_token_id 925 | self.vocab_size = config.vocab_size 926 | 927 | if parallel_state.model_parallel_is_initialized(): 928 | self.embed_tokens = ParallelEmbedding( 929 | config.vocab_size, 930 | config.hidden_size, 931 | self.padding_idx, 932 | dtype=config.neuron_config.torch_dtype, 933 | shard_across_embedding=not config.neuron_config.vocab_parallel, 934 | sequence_parallel_enabled=config.neuron_config.sequence_parallel_enabled, 935 | sequence_dimension=1, 936 | pad=True, 937 | tensor_model_parallel_group=get_tp_group(config), 938 | use_spmd_rank=config.neuron_config.vocab_parallel, 939 | ) 940 | 941 | self.lm_head = ColumnParallelLinear( 942 | config.hidden_size, 943 | config.vocab_size, 944 | gather_output=not self.on_device_sampling, 945 | bias=False, 946 | pad=True, 947 | tensor_model_parallel_group=get_tp_group(config), 948 | ) 949 | else: 950 | self.embed_tokens = nn.Embedding( 951 | config.vocab_size, 952 | config.hidden_size, 953 | self.padding_idx, 954 | ) 955 | self.lm_head = nn.Linear( 956 | config.hidden_size, 957 | config.vocab_size, 958 | bias=False, 959 | ) 960 | 961 | updated_configs = get_updated_configs(config) 962 | 963 | self.layers = nn.ModuleList([NeuronQwen2DecoderLayer(conf) for conf in updated_configs]) 964 | 965 | if not config.neuron_config.is_eagle_draft: 966 | self.norm = get_rmsnorm_cls()(config.hidden_size, eps=config.rms_norm_eps) 967 | 968 | if config.neuron_config.is_eagle_draft: 969 | fc_bias = getattr(config, "fc_bias", False) 970 | # replicate fc weights since activations are sequence sharded 971 | self.fc = WeightGatheredColumnParallel( 972 | config.hidden_size * 2, config.hidden_size, bias=fc_bias, gather_output=True, sequence_dimension=1 973 | ) 974 | self.is_medusa = config.neuron_config.is_medusa 975 | self.num_medusa_heads = config.neuron_config.num_medusa_heads 976 | self.medusa_speculation_length = config.neuron_config.medusa_speculation_length 977 | 978 | if self.is_medusa: 979 | if parallel_state.model_parallel_is_initialized(): 980 | medusa_head_cls = ColumnParallelLinear 981 | else: 982 | medusa_head_cls = nn.Linear 983 | for i in range(self.num_medusa_heads): 984 | medusa_head = nn.Sequential( 985 | 
*([ResBlock(config.hidden_size)] * 1), 986 | medusa_head_cls( 987 | config.hidden_size, 988 | config.vocab_size, 989 | gather_output=not self.on_device_sampling, 990 | bias=False, 991 | ), 992 | ) 993 | setattr(self, f"medusa_head_{i}", medusa_head) 994 | 995 | 996 | class NeuronQwen2ForCausalLM(NeuronBaseForCausalLM): 997 | """ 998 | This class extends Qwen2ForCausalLM create traceable 999 | blocks for Neuron. 1000 | 1001 | Args: 1002 | Qwen2ForCausalLM (_type_): _description_ 1003 | """ 1004 | 1005 | _model_cls = NeuronQwen2Model 1006 | 1007 | @staticmethod 1008 | def load_hf_model(model_path, **kwargs): 1009 | return Qwen2ForCausalLM.from_pretrained(model_path, **kwargs) 1010 | 1011 | @staticmethod 1012 | def convert_hf_to_neuron_state_dict(state_dict: dict, config: InferenceConfig) -> dict: 1013 | """This function should be over-ridden in child classes as needed""" 1014 | 1015 | neuron_config = config.neuron_config 1016 | # to facilitate rank usage in attention 1017 | num_layers = config.num_hidden_layers 1018 | tp_degree = neuron_config.tp_degree 1019 | for i in range(num_layers): 1020 | state_dict[f"layers.{i}.self_attn.rank_util.rank"] = torch.arange( 1021 | 0, tp_degree, dtype=torch.int32 1022 | ) 1023 | 1024 | """ 1025 | for every layer do the following transformations 1026 | gate_w_prime = (gate_w.T * gamma).T 1027 | up_w_prime = (up_w.T * gamma).T 1028 | """ 1029 | if ( 1030 | neuron_config.fused_rmsnorm_skip_gamma 1031 | and not neuron_config.sequence_parallel_enabled 1032 | ): 1033 | if neuron_config.mlp_kernel_enabled: 1034 | # MLP 1035 | state_dict[f"layers.{i}.mlp.gate_proj.weight"] = state_dict[ 1036 | f"layers.{i}.mlp.gate_proj.weight" 1037 | ] * state_dict[f"layers.{i}.input_layernorm.weight"].unsqueeze(0) 1038 | state_dict[f"layers.{i}.mlp.up_proj.weight"] = state_dict[ 1039 | f"layers.{i}.mlp.up_proj.weight" 1040 | ] * state_dict[f"layers.{i}.input_layernorm.weight"].unsqueeze(0) 1041 | 1042 | if neuron_config.qkv_kernel_enabled: 1043 | # QKV 1044 | state_dict[f"layers.{i}.self_attn.q_proj.weight"] = state_dict[ 1045 | f"layers.{i}.self_attn.q_proj.weight" 1046 | ] * state_dict[f"layers.{i}.input_layernorm.weight"].unsqueeze(0) 1047 | state_dict[f"layers.{i}.self_attn.k_proj.weight"] = state_dict[ 1048 | f"layers.{i}.self_attn.k_proj.weight" 1049 | ] * state_dict[f"layers.{i}.input_layernorm.weight"].unsqueeze(0) 1050 | state_dict[f"layers.{i}.self_attn.v_proj.weight"] = state_dict[ 1051 | f"layers.{i}.self_attn.v_proj.weight" 1052 | ] * state_dict[f"layers.{i}.input_layernorm.weight"].unsqueeze(0) 1053 | 1054 | if neuron_config.fused_qkv: 1055 | state_dict = convert_state_dict_to_fused_qkv(state_dict, config) 1056 | 1057 | if neuron_config.vocab_parallel: 1058 | # TODO: this hack can be removed after replication_id is ready to use 1059 | state_dict["embed_tokens.rank_util.rank"] = torch.arange( 1060 | 0, neuron_config.local_ranks_size, dtype=torch.int32 1061 | ) 1062 | 1063 | # to facilitate rank usage in base model 1064 | state_dict["rank_util.rank"] = torch.arange(0, tp_degree, dtype=torch.int32) 1065 | return state_dict 1066 | 1067 | @staticmethod 1068 | def update_state_dict_for_tied_weights(state_dict): 1069 | state_dict["lm_head.weight"] = state_dict["embed_tokens.weight"].clone() 1070 | 1071 | @classmethod 1072 | def get_config_cls(cls): 1073 | return Qwen2InferenceConfig 1074 | -------------------------------------------------------------------------------- /contributed/models/qwen2/qwen-2-test.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "libneuronxla 2.2.3493.0+78c3e78c\n", 13 | "neuronx-cc 2.18.121.0+9e31e41a\n", 14 | "neuronx-distributed 0.12.12111+cdd84048\n", 15 | "neuronx-distributed-inference 0.3.5591+f50feae2\n", 16 | "torch-neuronx 2.6.0.2.7.5413+113e6810\n" 17 | ] 18 | } 19 | ], 20 | "source": [ 21 | "!pip list | grep neuron" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import torch\n", 31 | "from transformers import AutoTokenizer, GenerationConfig\n", 32 | "from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig\n", 33 | "from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter, load_pretrained_config" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 2, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "model_path = \"/home/ubuntu/model_hf_qwen/qwen2/\"\n", 43 | "traced_model_path = \"/home/ubuntu/traced_model_qwen/qwen2\"" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "from huggingface_hub import snapshot_download\n", 53 | "\n", 54 | "snapshot_download(\"Qwen/QwQ-32B\", local_dir=model_path)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "from modeling_qwen_v2 import Qwen2InferenceConfig, NeuronQwen2ForCausalLM\n", 64 | "\n", 65 | "def run_qwen2_compile():\n", 66 | " # Initialize configs and tokenizer.\n", 67 | " tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side=\"right\")\n", 68 | " tokenizer.pad_token = tokenizer.eos_token\n", 69 | "\n", 70 | " generation_config = GenerationConfig.from_pretrained(model_path)\n", 71 | " generation_config_kwargs = {\n", 72 | " \"do_sample\": False,\n", 73 | " \"top_k\": 1,\n", 74 | " \"pad_token_id\": tokenizer.pad_token_id,\n", 75 | " }\n", 76 | " generation_config.update(**generation_config_kwargs)\n", 77 | " \n", 78 | " neuron_config = NeuronConfig(\n", 79 | " tp_degree=8,\n", 80 | " batch_size=1,\n", 81 | " max_context_length=128,\n", 82 | " seq_len=256,\n", 83 | " enable_bucketing=True,\n", 84 | " context_encoding_buckets=[128],\n", 85 | " token_generation_buckets=[256],\n", 86 | " flash_decoding_enabled=False,\n", 87 | " torch_dtype=torch.bfloat16,\n", 88 | " fused_qkv=False,\n", 89 | " attn_kernel_enabled=True,\n", 90 | " attn_cls=\"NeuronQwen2Attention\"\n", 91 | " )\n", 92 | " config = Qwen2InferenceConfig(\n", 93 | " neuron_config,\n", 94 | " load_config=load_pretrained_config(model_path),\n", 95 | " )\n", 96 | " \n", 97 | " # Compile and save model.\n", 98 | " print(\"\\nCompiling and saving model...\")\n", 99 | " model = NeuronQwen2ForCausalLM(model_path, config)\n", 100 | " model.compile(traced_model_path)\n", 101 | " tokenizer.save_pretrained(traced_model_path)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "run_qwen2_compile()" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "from 
modeling_qwen_v2 import Qwen2InferenceConfig, NeuronQwen2ForCausalLM\n", 120 | "\n", 121 | "model = NeuronQwen2ForCausalLM(traced_model_path)\n", 122 | "model.load(traced_model_path)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "config = model.get_config_cls()\n", 132 | "config.get_neuron_config_cls()" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 9, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "40" 144 | ] 145 | }, 146 | "execution_count": 9, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "model.config.num_attention_heads" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 10, 158 | "metadata": {}, 159 | "outputs": [ 160 | { 161 | "data": { 162 | "text/plain": [ 163 | "8" 164 | ] 165 | }, 166 | "execution_count": 10, 167 | "metadata": {}, 168 | "output_type": "execute_result" 169 | } 170 | ], 171 | "source": [ 172 | "model.config.num_key_value_heads" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 11, 178 | "metadata": {}, 179 | "outputs": [ 180 | { 181 | "data": { 182 | "text/plain": [ 183 | "5120" 184 | ] 185 | }, 186 | "execution_count": 11, 187 | "metadata": {}, 188 | "output_type": "execute_result" 189 | } 190 | ], 191 | "source": [ 192 | "model.config.hidden_size" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 12, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "name": "stderr", 202 | "output_type": "stream", 203 | "text": [ 204 | "Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.\n" 205 | ] 206 | }, 207 | { 208 | "data": { 209 | "text/plain": [ 210 | "\"Okay, the user wants a short introduction to large language models. Let me start by defining what a large language model is. I should mention that they are AI systems trained on vast amounts of text data. Maybe include that they use deep learning, specifically transformer architectures.\\n\\nI need to highlight their capabilities, like generating text, understanding context, and performing various tasks such as answering questions, writing stories, or coding. It's important to note their scale—large parameter counts and extensive training data. \\n\\nAlso, touch on their applications: customer service, content creation, research, etc. Maybe mention some examples like GPT, BERT, or\"" 211 | ] 212 | }, 213 | "execution_count": 12, 214 | "metadata": {}, 215 | "output_type": "execute_result" 216 | } 217 | ], 218 | "source": [ 219 | "tokenizer = AutoTokenizer.from_pretrained(traced_model_path)\n", 220 | "tokenizer.pad_token = tokenizer.eos_token\n", 221 | "generation_config = GenerationConfig.from_pretrained(model_path)\n", 222 | "generation_config_kwargs = {\n", 223 | " \"do_sample\": True,\n", 224 | " \"temperature\": 0.9,\n", 225 | " \"top_k\": 5,\n", 226 | " \"pad_token_id\": tokenizer.pad_token_id,\n", 227 | "}\n", 228 | "\n", 229 | "prompt = \"Give me a short introduction to large language model.\"\n", 230 | "messages = [\n", 231 | " {\"role\": \"system\", \"content\": \"You are Qwen, created by Alibaba Cloud. 
You are a helpful assistant.\"},\n", 232 | " {\"role\": \"user\", \"content\": prompt}\n", 233 | "]\n", 234 | "text = tokenizer.apply_chat_template(\n", 235 | " messages,\n", 236 | " tokenize=False,\n", 237 | " add_generation_prompt=True\n", 238 | ")\n", 239 | "model_inputs = tokenizer([text], return_tensors=\"pt\")\n", 240 | "generation_model = HuggingFaceGenerationAdapter(model)\n", 241 | "generated_ids = generation_model.generate(\n", 242 | " **model_inputs,\n", 243 | " max_new_tokens=128\n", 244 | ")\n", 245 | "generated_ids = [\n", 246 | " output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n", 247 | "]\n", 248 | "\n", 249 | "response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n", 250 | "response" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 13, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "model.reset()" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "# Run Benchmarks" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 1, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "model_path = \"/home/ubuntu/model_hf_qwen/qwen2\"\n", 276 | "traced_model_path = \"/home/ubuntu/traced_model_qwen/qwen2/logit\"" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "dir = '/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/'\n", 286 | "!cp modeling_qwen2.py {dir}" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "# Edit the inference_demo.py file to include the following:\n", 294 | "\n", 295 | "```python\n", 296 | "from .modeling_qwen2 import NeuronQwen2ForCausalLM\n", 297 | "\n", 298 | "MODEL_TYPES = {\n", 299 | " \"llama\": {\"causal-lm\": NeuronLlamaForCausalLM},\n", 300 | " \"mixtral\": {\"causal-lm\": NeuronMixtralForCausalLM},\n", 301 | " \"dbrx\": {\"causal-lm\": NeuronDbrxForCausalLM},\n", 302 | " 'qwen2': {\"causal-lm\": NeuronQwen2ForCausalLM}\n", 303 | "}\n", 304 | "```" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 8, 310 | "metadata": {}, 311 | "outputs": [ 312 | { 313 | "name": "stdout", 314 | "output_type": "stream", 315 | "text": [ 316 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 317 | " from neuronx_distributed.modules.moe.blockwise import (\n", 318 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 319 | " from neuronx_distributed.modules.moe.blockwise import (\n", 320 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 321 | " from neuronx_distributed.modules.moe.blockwise import (\n", 322 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/attention/utils.py:14: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 323 | " from 
neuronx_distributed_inference.modules.custom_calls import neuron_cumsum\n", 324 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:745: UserWarning: Set seed for `privateuseone` device does not take effect, please add API's `_is_in_bad_fork` and `manual_seed_all` to `privateuseone` device module.\n", 325 | " return fn(*args, **kwargs)\n", 326 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 327 | " from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV\n", 328 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 329 | " from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV\n", 330 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 331 | " from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV\n", 332 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/dbrx/modeling_dbrx.py:38: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 333 | " from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase\n", 334 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/dbrx/modeling_dbrx.py:38: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 335 | " from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase\n", 336 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/inference_demo.py:25: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 337 | " from neuronx_distributed_inference.models.dbrx.modeling_dbrx import NeuronDbrxForCausalLM\n", 338 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/inference_demo.py:27: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 339 | " from neuronx_distributed_inference.models.mixtral.modeling_mixtral import NeuronMixtralForCausalLM\n", 340 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/mllama/modeling_mllama.py:72: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 341 | " from .modeling_mllama_vision import NeuronMllamaVisionModel # noqa: E402\n", 342 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/utils/accuracy.py:29: UserWarning: Intel extension for pytorch not found. 
For faster CPU references install `intel-extension-for-pytorch`.\n", 343 | " warnings.warn(\n", 344 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:745: UserWarning: Set seed for `privateuseone` device does not take effect, please add API's `_is_in_bad_fork` and `manual_seed_all` to `privateuseone` device module.\n", 345 | " return fn(*args, **kwargs)\n", 346 | "Loading configs...\n", 347 | "WARNING:root:NeuronConfig init: Unexpected keyword arguments: {'model_type': 'qwen2', 'task_type': 'causal-lm', 'model_path': '/home/ubuntu/model_hf_qwen/qwen2', 'compiled_model_path': '/home/ubuntu/traced_model_qwen/qwen2/logit', 'benchmark': True, 'check_accuracy_mode': , 'divergence_difference_tol': 0.001, 'prompts': ['To be, or not to be'], 'top_k': 1, 'top_p': 1.0, 'temperature': 1.0, 'do_sample': False, 'dynamic': False, 'pad_token_id': 151645, 'on_device_sampling': False, 'enable_torch_dist': False, 'enable_lora': False, 'max_loras': 1, 'max_lora_rank': 16, 'skip_warmup': False, 'skip_compile': False, 'compile_only': False, 'compile_dry_run': False, 'hlo_debug': False}\n", 348 | "\n", 349 | "Compiling and saving model...\n", 350 | "INFO:Neuron:Generating HLOs for the following models: ['context_encoding_model', 'token_generation_model']\n", 351 | "[2025-06-02 13:35:56.009: I neuronx_distributed/parallel_layers/parallel_state.py:592] > initializing tensor model parallel with size 8\n", 352 | "[2025-06-02 13:35:56.009: I neuronx_distributed/parallel_layers/parallel_state.py:593] > initializing pipeline model parallel with size 1\n", 353 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:594] > initializing context model parallel with size 1\n", 354 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:595] > initializing data parallel with size 1\n", 355 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:596] > initializing world size to 8\n", 356 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:343] [rank_0_pp-1_tp-1_dp-1_cp-1] Chosen Logic for replica groups ret_logic=, 'Ascending Ring PG Group')>\n", 357 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:632] [rank_0_pp-1_tp-1_dp-1_cp-1] tp_groups: replica_groups.tp_groups=[[0, 1, 2, 3, 4, 5, 6, 7]]\n", 358 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:633] [rank_0_pp-1_tp-1_dp-1_cp-1] dp_groups: replica_groups.dp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 359 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:634] [rank_0_pp-1_tp-1_dp-1_cp-1] pp_groups: replica_groups.pp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 360 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:635] [rank_0_pp-1_tp-1_dp-1_cp-1] cp_groups: replica_groups.cp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 361 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:636] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_model_groups: replica_groups.ep_model_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 362 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:637] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_data_groups: replica_groups.ep_data_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 363 | "INFO:Neuron:Generating 1 hlos for key: context_encoding_model\n", 364 | 
"INFO:Neuron:Started loading module context_encoding_model\n", 365 | "INFO:Neuron:Finished loading module context_encoding_model in 0.3605782985687256 seconds\n", 366 | "INFO:Neuron:generating HLO: context_encoding_model, input example shape = torch.Size([1, 16])\n", 367 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/parallel_layers/layers.py:478: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n", 368 | " with torch.cuda.amp.autocast(enabled=False):\n", 369 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=1, shape=torch.Size([1, 16]), dtype=torch.int32)\n", 370 | " warnings.warn(\n", 371 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=3, shape=torch.Size([1]), dtype=torch.int32)\n", 372 | " warnings.warn(\n", 373 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=4, shape=torch.Size([1, 3]), dtype=torch.float32)\n", 374 | " warnings.warn(\n", 375 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=5, shape=torch.Size([1]), dtype=torch.int32)\n", 376 | " warnings.warn(\n", 377 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=6, shape=torch.Size([1]), dtype=torch.int32)\n", 378 | " warnings.warn(\n", 379 | "INFO:Neuron:Finished generating HLO for context_encoding_model in 8.811824083328247 seconds, input example shape = torch.Size([1, 16])\n", 380 | "INFO:Neuron:Generating 1 hlos for key: token_generation_model\n", 381 | "INFO:Neuron:Started loading module token_generation_model\n", 382 | "INFO:Neuron:Finished loading module token_generation_model in 0.13971686363220215 seconds\n", 383 | "INFO:Neuron:generating HLO: token_generation_model, input example shape = torch.Size([1, 1])\n", 384 | "INFO:Neuron:Finished generating HLO for token_generation_model in 9.776893615722656 seconds, input example shape = torch.Size([1, 1])\n", 385 | "INFO:Neuron:Generated all HLOs in 19.276326656341553 seconds\n", 386 | "INFO:Neuron:Starting compilation for the priority HLO\n", 387 | "INFO:Neuron:'token_generation_model' is the priority model with bucket rank 0\n", 388 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py:283: SyntaxWarning: str format compiler_flags is discouraged as its handling involves repeated joining and splitting, which can easily make mistakes if something is quoted or escaped. Use list[str] instead. 
Refer to documentation of the Python subprocess module for details.\n", 389 | " warnings.warn(SyntaxWarning(\n", 390 | "2025-06-02 13:36:15.000516: 7289 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_9b906898286ddf239aa0+91ef39e9.hlo_module.pb --output /tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_9b906898286ddf239aa0+91ef39e9.neff --target=trn1 --auto-cast=none --model-type=transformer --tensorizer-options=--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2 --vectorize-strided-dma --lnc=1 --logfile=/tmp/nxd_model/token_generation_model/_tp0_bk0/log-neuron-cc.txt --enable-internal-neff-wrapper --verbose=35\n", 391 | ".........Completed run_backend_driver.\n", 392 | "\n", 393 | "Compiler status PASS\n", 394 | "INFO:Neuron:Done compilation for the priority HLO in 169.35613083839417 seconds\n", 395 | "INFO:Neuron:Updating the hlo module with optimized layout\n", 396 | "INFO:Neuron:Done optimizing weight layout for all HLOs in 0.3216278553009033 seconds\n", 397 | "INFO:Neuron:Starting compilation for all HLOs\n", 398 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py:245: SyntaxWarning: str format compiler_flags is discouraged as its handling involves repeated joining and splitting, which can easily make mistakes if something is quoted or escaped. Use list[str] instead. Refer to documentation of the Python subprocess module for details.\n", 399 | " warnings.warn(SyntaxWarning(\n", 400 | "2025-06-02 13:39:05.000174: 7289 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/nxd_model/context_encoding_model/_tp0_bk0/model.MODULE_d4332219e6ee5f826cce+d43b5474.hlo_module.pb --output /tmp/nxd_model/context_encoding_model/_tp0_bk0/model.MODULE_d4332219e6ee5f826cce+d43b5474.neff --target=trn1 --auto-cast=none --model-type=transformer --tensorizer-options=--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2 --vectorize-strided-dma --lnc=1 -O1 --internal-hlo2tensorizer-options= --modular-flow-mac-threshold=10 --logfile=/tmp/nxd_model/context_encoding_model/_tp0_bk0/log-neuron-cc.txt --verbose=35\n", 401 | ".Completed run_backend_driver.\n", 402 | "\n", 403 | "Compiler status PASS\n", 404 | "INFO:Neuron:Finished Compilation for all HLOs in 9.435595512390137 seconds\n", 405 | "......Completed run_backend_driver.\n", 406 | "\n", 407 | "Compiler status PASS\n", 408 | "INFO:Neuron:Done preparing weight layout transformation\n", 409 | "INFO:Neuron:Finished building model in 307.08067560195923 seconds\n", 410 | "INFO:Neuron:SKIPPING pre-sharding the checkpoints. 
The checkpoints will be sharded during load time.\n", 411 | "Compiling and tracing time: 307.11146965399985 seconds\n", 412 | "\n", 413 | "Loading model to Neuron...\n", 414 | "INFO:Neuron:Sharding weights on load...\n", 415 | "INFO:Neuron:Sharding Weights for ranks: 0...7\n", 416 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:592] > initializing tensor model parallel with size 8\n", 417 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:593] > initializing pipeline model parallel with size 1\n", 418 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:594] > initializing context model parallel with size 1\n", 419 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:595] > initializing data parallel with size 1\n", 420 | "[2025-06-02 13:41:03.158: I neuronx_distributed/parallel_layers/parallel_state.py:596] > initializing world size to 8\n", 421 | "[2025-06-02 13:41:03.158: I neuronx_distributed/parallel_layers/parallel_state.py:343] [rank_0_pp-1_tp-1_dp-1_cp-1] Chosen Logic for replica groups ret_logic=, 'Ascending Ring PG Group')>\n", 422 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:632] [rank_0_pp-1_tp-1_dp-1_cp-1] tp_groups: replica_groups.tp_groups=[[0, 1, 2, 3, 4, 5, 6, 7]]\n", 423 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:633] [rank_0_pp-1_tp-1_dp-1_cp-1] dp_groups: replica_groups.dp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 424 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:634] [rank_0_pp-1_tp-1_dp-1_cp-1] pp_groups: replica_groups.pp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 425 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:635] [rank_0_pp-1_tp-1_dp-1_cp-1] cp_groups: replica_groups.cp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 426 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:636] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_model_groups: replica_groups.ep_model_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 427 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:637] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_data_groups: replica_groups.ep_data_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 428 | "INFO:Neuron:Done Sharding weights in 3.519328597999902\n", 429 | "INFO:Neuron:Finished weights loading in 16.628388952000023 seconds\n", 430 | "INFO:Neuron:Warming up the model.\n", 431 | "2025-Jun-02 13:41:22.0009 7289:8468 [7] nccl_net_ofi_create_plugin:211 CCOM WARN NET/OFI Failed to initialize sendrecv protocol\n", 432 | "2025-Jun-02 13:41:22.0010 7289:8468 [7] nccl_net_ofi_create_plugin:334 CCOM WARN NET/OFI aws-ofi-nccl initialization failed\n", 433 | "2025-Jun-02 13:41:22.0011 7289:8468 [7] nccl_net_ofi_init:155 CCOM WARN NET/OFI Initializing plugin failed\n", 434 | "2025-Jun-02 13:41:22.0012 7289:8468 [7] net_plugin.cc:94 CCOM WARN OFI plugin initNet() failed is EFA enabled?\n", 435 | "INFO:Neuron:Warmup completed in 0.33977651596069336 seconds.\n", 436 | "Total model loading time: 19.222302051000042 seconds\n", 437 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:650: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `1` -- this flag is only used in sample-based generation modes. 
You should set `do_sample=True` or unset `top_k`.\n", 438 | " warnings.warn(\n", 439 | "\n", 440 | "Checking accuracy by logit matching\n", 441 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/utils/accuracy.py:363: UserWarning: input_len + num_tokens_to_check exceeds max_context_length. If output divergences at an index greater than max_context_length, a ValueError will occur because the next input len exceeds max_context_length. To avoid this, set num_tokens_to_check to a value of max_context_length - input_len or less.\n", 442 | " warnings.warn(\n", 443 | "Loading checkpoint shards: 100%|████████████████| 14/14 [00:08<00:00, 1.58it/s]\n", 444 | "From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.\n", 445 | "Expected Output: [\", that is the question. Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune\"] tensor([[ 11, 429, 374, 279, 3405, 13, 13139, 364, 83, 285,\n", 446 | " 13049, 1536, 304, 279, 3971, 311, 7676, 279, 1739, 819,\n", 447 | " 323, 36957, 315, 54488, 32315]])\n", 448 | "Expected Logits Shape: torch.Size([25, 1, 152064])\n", 449 | "Actual Output: [\", that is the question. Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune\"] tensor([[ 11, 429, 374, 279, 3405, 13, 13139, 364, 83, 285,\n", 450 | " 13049, 1536, 304, 279, 3971, 311, 7676, 279, 1739, 819,\n", 451 | " 323, 36957, 315, 54488, 32315]])\n", 452 | "Actual Logits Shape: torch.Size([25, 1, 152064])\n", 453 | "Passed logits validation!\n", 454 | "\n", 455 | "Generating outputs...\n", 456 | "Prompts: ['To be, or not to be']\n", 457 | "Generated outputs:\n", 458 | "Output 0: To be, or not to be, that is the question. 
Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune\n", 459 | "Starting end-to-end benchmark with 20\n", 460 | "Benchmark completed and its result is as following\n", 461 | "{\n", 462 | " \"e2e_model\": {\n", 463 | " \"latency_ms_p50\": 569.0377950668335,\n", 464 | " \"latency_ms_p90\": 570.0641632080078,\n", 465 | " \"latency_ms_p95\": 570.2431917190552,\n", 466 | " \"latency_ms_p99\": 570.8965921401978,\n", 467 | " \"latency_ms_p100\": 571.0599422454834,\n", 468 | " \"latency_ms_avg\": 569.459593296051,\n", 469 | " \"throughput\": 56.19362703995017\n", 470 | " },\n", 471 | " \"context_encoding_model\": {\n", 472 | " \"latency_ms_p50\": 41.747450828552246,\n", 473 | " \"latency_ms_p90\": 42.02606678009033,\n", 474 | " \"latency_ms_p95\": 42.056477069854736,\n", 475 | " \"latency_ms_p99\": 42.05883264541626,\n", 476 | " \"latency_ms_p100\": 42.05942153930664,\n", 477 | " \"latency_ms_avg\": 41.80266857147217,\n", 478 | " \"throughput\": 382.75068426897144\n", 479 | " },\n", 480 | " \"token_generation_model\": {\n", 481 | " \"latency_ms_p50\": 33.631086349487305,\n", 482 | " \"latency_ms_p90\": 33.74745845794678,\n", 483 | " \"latency_ms_p95\": 33.88720750808716,\n", 484 | " \"latency_ms_p99\": 34.08886194229126,\n", 485 | " \"latency_ms_p100\": 34.223079681396484,\n", 486 | " \"latency_ms_avg\": 33.66035064061483,\n", 487 | " \"throughput\": 31.68911334451813\n", 488 | " }\n", 489 | "}\n", 490 | "Completed saving result to benchmark_report.json\n" 491 | ] 492 | } 493 | ], 494 | "source": [ 495 | "!inference_demo \\\n", 496 | " --model-type qwen2 \\\n", 497 | " --task-type causal-lm \\\n", 498 | " run \\\n", 499 | " --model-path /home/ubuntu/model_hf_qwen/qwen2 \\\n", 500 | " --compiled-model-path /home/ubuntu/traced_model_qwen/qwen2/logit \\\n", 501 | " --torch-dtype bfloat16 \\\n", 502 | " --tp-degree 8 \\\n", 503 | " --batch-size 1 \\\n", 504 | " --max-context-length 16 \\\n", 505 | " --seq-len 32 \\\n", 506 | " --top-k 1 \\\n", 507 | " --pad-token-id 151645 \\\n", 508 | " --prompt \"To be, or not to be\" \\\n", 509 | " --check-accuracy-mode logit-matching \\\n", 510 | " --benchmark" 511 | ] 512 | } 513 | ], 514 | "metadata": { 515 | "kernelspec": { 516 | "display_name": "aws_neuronx_venv_pytorch_2_6_nxd_inference", 517 | "language": "python", 518 | "name": "python3" 519 | }, 520 | "language_info": { 521 | "codemirror_mode": { 522 | "name": "ipython", 523 | "version": 3 524 | }, 525 | "file_extension": ".py", 526 | "mimetype": "text/x-python", 527 | "name": "python", 528 | "nbconvert_exporter": "python", 529 | "pygments_lexer": "ipython3", 530 | "version": "3.10.12" 531 | } 532 | }, 533 | "nbformat": 4, 534 | "nbformat_minor": 2 535 | } 536 | -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/Finetune-TinyLlama-1.1B.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "37be34fc-0fa9-4811-865c-a3fdc38d38e8", 6 | "metadata": {}, 7 | "source": [ 8 | "# Fine-tune TinyLlama-1.1B for text-to-SQL generation\n", 9 | "\n", 10 | "## Introduction\n", 11 | "\n", 12 | "In this workshop module, you will learn how to fine-tune a Llama-based LLM ([TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)) using causal language modelling so that the model learns how to generate SQL queries for text-based instructions. 
Your fine-tuning job will be launched using SageMaker Training which provides a serverless training environment where you do not need to manage the underlying infrastructure. You will learn how to configure a PyTorch training job using [SageMaker's PyTorch estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html), and how to leverage the [Hugging Face Optimum Neuron](https://github.com/huggingface/optimum-neuron) package to easily run the PyTorch training job with AWS Trainium accelerators via an [AWS EC2 trn1.2xlarge instance](https://aws.amazon.com/ec2/instance-types/trn1/).\n", 13 | "\n", 14 | "For this module, you will be using the [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset which consists of thousands of examples of SQL schemas, questions about the schemas, and SQL queries intended to answer the questions.\n", 15 | "\n", 16 | "*Dataset example 1:*\n", 17 | "* *SQL schema/context:* `CREATE TABLE management (department_id VARCHAR); CREATE TABLE department (department_id VARCHAR)`\n", 18 | "* *Question:* `How many departments are led by heads who are not mentioned?`\n", 19 | "* *SQL query/answer:* `SELECT COUNT(*) FROM department WHERE NOT department_id IN (SELECT department_id FROM management)`\n", 20 | "\n", 21 | "*Dataset example 2:*\n", 22 | "* *SQL schema/context:* `CREATE TABLE courses (course_name VARCHAR, course_id VARCHAR); CREATE TABLE student_course_registrations (student_id VARCHAR, course_id VARCHAR)`\n", 23 | "* *Question:* `What are the ids of all students for courses and what are the names of those courses?`\n", 24 | "* *SQL query/answer:* `SELECT T1.student_id, T2.course_name FROM student_course_registrations AS T1 JOIN courses AS T2 ON T1.course_id = T2.course_id`\n", 25 | "\n", 26 | "By fine-tuning the model over several thousand of these text-to-SQL examples, the model will then learn how to generate an appropriate SQL query when presented with a SQL context and a free-form question.\n", 27 | "\n", 28 | "This text-to-SQL use case was selected so you can successfully fine-tune your model in a reasonably short amount of time (~20 minutes) which is appropriate for this 1hr workshop. Although this is a relatively simple use case, please keep in mind that the same techniques and components used in this module can also be applied to fine-tune LLMs for more advanced use cases such as writing code, summarizing documents, creating blog posts - the possibilities are endless!" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "866074ee-c300-4793-8e63-adbcfc314ad8", 34 | "metadata": { 35 | "tags": [] 36 | }, 37 | "source": [ 38 | "## Prerequisites\n", 39 | "\n", 40 | "This notebook uses the SageMaker Python SDK to prepare, launch, and monitor the progress of a PyTorch-based training job. Before we get started, it is important to upgrade the SageMaker SDK to ensure that you are using the latest version. Run the next two cells to upgrade the SageMaker SDK and set up your session." 
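Before moving on, it may help to make the fine-tuning data concrete. The snippet below is an illustrative sketch only: the field names follow the b-mc2/sql-create-context dataset, but the exact prompt template actually used during training is defined in `assets/finetune_llama.py`, not here.

```python
# Illustrative sketch: render one text-to-SQL record (taken from the dataset
# examples above) into a single training prompt. The real template used by
# finetune_llama.py may differ.
example = {
    "context": "CREATE TABLE management (department_id VARCHAR); "
               "CREATE TABLE department (department_id VARCHAR)",
    "question": "How many departments are led by heads who are not mentioned?",
    "answer": "SELECT COUNT(*) FROM department WHERE NOT department_id IN "
              "(SELECT department_id FROM management)",
}

prompt = (
    f"Given the following SQL schema:\n{example['context']}\n\n"
    f"Write a SQL query that answers this question: {example['question']}\n\n"
    f"SQL: {example['answer']}"
)
print(prompt)
```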
41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "id": "3264aae2-1f18-4b59-a92c-2f169903c202", 47 | "metadata": { 48 | "tags": [] 49 | }, 50 | "outputs": [ 51 | { 52 | "name": "stdout", 53 | "output_type": "stream", 54 | "text": [ 55 | "\n", 56 | "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", 57 | "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", 58 | "Note: you may need to restart the kernel to use updated packages.\n" 59 | ] 60 | } 61 | ], 62 | "source": [ 63 | "# Upgrade SageMaker SDK to the latest version\n", 64 | "%pip install -U sagemaker awscli -q 2>&1 | grep -v \"warnings/venv\"" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "id": "9b5ed574-6db5-471b-8515-c0f6189e653e", 71 | "metadata": { 72 | "tags": [] 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "import logging\n", 77 | "sagemaker_config_logger = logging.getLogger(\"sagemaker.config\")\n", 78 | "sagemaker_config_logger.setLevel(logging.WARNING)\n", 79 | "\n", 80 | "# Import SageMaker SDK, setup our session\n", 81 | "from sagemaker import get_execution_role, Session\n", 82 | "from sagemaker.pytorch import PyTorch\n", 83 | "import boto3\n", 84 | "\n", 85 | "region_name=\"us-east-2\" #this is hard coded to a specific region because of Workshop quotas. You could use sess.boto_region_name\n", 86 | "sess = Session(boto_session=boto3.Session(region_name=region_name))\n", 87 | "default_bucket = sess.default_bucket()\n" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "id": "2ce630d1", 93 | "metadata": {}, 94 | "source": [ 95 | "This next command just configures the EC2 instance (in us-west-2) to have a default region of us-east-2. This is specific to the environment in AWS Workshop Studio." 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "id": "5542b3d1", 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "!aws configure set region us-east-2" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "id": "4193108b-25fb-4d3e-85db-c66b8c04c251", 111 | "metadata": {}, 112 | "source": [ 113 | "## Specify the Optimum Neuron deep learning container (DLC) image\n", 114 | "\n", 115 | "The SageMaker Training service uses containers to execute your training script, allowing you to fully customize your training script environment and any required dependencies. For this workshop, you will use a recent Pytorch Training deep learning container (DLC) image which is an AWS-maintained image containing the Neuron SDK and PyTorch. The Optimum-Neuron library is installed with the requirements.txt file in the assets directory." 
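For context on the note above: when a `requirements.txt` is present in the estimator's `source_dir` (here, `./assets`), the SageMaker framework container pip-installs it before launching the entry point. The snippet below only illustrates that effect; the version pin is taken from the comment in the next cell, and the shipped `assets/requirements.txt` remains the source of truth.

```python
# Illustrative only: roughly what the training container does at startup for
# each entry in assets/requirements.txt before finetune_llama.py runs.
import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install", "optimum-neuron==0.0.27"])
```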
116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "id": "247ad886-6977-4295-947b-86d4892b48bd", 122 | "metadata": { 123 | "tags": [] 124 | }, 125 | "outputs": [ 126 | { 127 | "name": "stdout", 128 | "output_type": "stream", 129 | "text": [ 130 | "763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training-neuronx:2.5.1-neuronx-py310-sdk2.22.0-ubuntu22.04\n" 131 | ] 132 | } 133 | ], 134 | "source": [ 135 | "# Specify the Neuron DLC that we will use for training\n", 136 | "# For now, we'll use the standard Neuron DLC and install Optimum Neuron v0.0.27 at training time because we want to use a later SDK \n", 137 | "# You can see more about the images here: https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx\n", 138 | "\n", 139 | "training_image = f\"763104351884.dkr.ecr.{sess.boto_region_name}.amazonaws.com/pytorch-training-neuronx:2.5.1-neuronx-py310-sdk2.22.0-ubuntu22.04\"\n", 140 | "print(training_image)" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "id": "8a8802bc-657a-419d-b86d-eb8af5eff90e", 146 | "metadata": { 147 | "tags": [] 148 | }, 149 | "source": [ 150 | "## Configure the PyTorch Estimator\n", 151 | "\n", 152 | "The SageMaker SDK includes a [PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html) class which you can use to define a PyTorch training job that will be executed in the SageMaker managed environment. \n", 153 | "\n", 154 | "In the following cell, you will create a PyTorch Estimator which will run the attached `finetune_llama.py` training script on an ml.trn1.2xlarge instance. The `finetune_llama.py` script is an Optimum Neuron training script that can be used for causal language modelling with AWS Trainium. The scripts will be downloaded as the instance is brought up, and the scripts will download the model and the datasets onto the SageMaker training instance.\n", 155 | "\n", 156 | "The PyTorch Estimator has many parameters that can be used to configure your training job. A few of the most important parameters include:\n", 157 | "\n", 158 | "- *entry_point*: refers to the name of the training script that will be executed as part of this training job\n", 159 | "- *source_dir*: the path to the local source code directory (relative to your notebook) that will be packaged up and included inside your training container\n", 160 | "- *instance_count*: defines how many EC2 instances to use for this training job\n", 161 | "- *instance_type*: determines which type of EC2 instance will be used for training\n", 162 | "- *image_uri*: defines which training DLC will be used to run the training job (see Neuron DLC, above)\n", 163 | "- *distribution*: determines which type of distribution to use for the training job - you will need 'torch_distributed' for this workshop\n", 164 | "- *environment*: provides a dictionary of environment variables which will be applied to your training environment\n", 165 | "- *hyperparameters*: provides a dictionary of command-line arguments to pass to your training script, ex: finetune_llama.py\n", 166 | "\n", 167 | "In the `hyperparameters` section, you can see the specific command-line arguments that are used to control the behavior of the `finetune_llama.py` training script. 
Notably:\n",
168 |     "- *model_id*: specifies which model you will be fine-tuning, in this case a recent checkpoint from the TinyLlama-1.1B project\n",
169 |     "- *tokenizer_id*: specifies which tokenizer you will use to tokenize the dataset examples during training\n",
170 |     "- *output_dir*: directory in which the fine-tuned model will be saved. Here we use the SageMaker-specific `/opt/ml/model` directory. At the end of the training job, SageMaker automatically copies the contents of this directory to the output S3 bucket\n",
171 |     "- *tensor_parallel_size*: the tensor parallel degree to use for training. In this case we use '2' to shard the model across the 2 NeuronCores available in the trn1.2xlarge instance\n",
172 |     "- *bf16*: request BFloat16 training\n",
173 |     "- *per_device_train_batch_size*: the microbatch size to be used for fine-tuning\n",
174 |     "- *gradient_accumulation_steps*: the number of steps over which gradients are accumulated between updates\n",
175 |     "- *max_steps*: the maximum number of steps of fine-tuning that we want to perform\n",
176 |     "- *lora_r*, *lora_alpha*, *lora_dropout*: the LoRA rank, alpha, and dropout values to use during fine-tuning\n",
177 |     "\n",
178 |     "The estimator below has been pre-configured for you, so you do not need to make any changes."
179 |    ]
180 |   },
181 |   {
182 |    "cell_type": "code",
183 |    "execution_count": 8,
184 |    "id": "9e28014c-4d0b-452b-9bde-44aa10e61bb6",
185 |    "metadata": {
186 |     "tags": []
187 |    },
188 |    "outputs": [],
189 |    "source": [
190 |     "# Set up the PyTorch estimator\n",
191 |     "# Note that the hyperparameters are just command-line args passed to the finetune_llama.py script to control its behavior\n",
192 |     "\n",
193 |     "pt_estimator = PyTorch(\n",
194 |     "    entry_point=\"finetune_llama.py\",\n",
195 |     "    source_dir=\"./assets\",\n",
196 |     "    role=get_execution_role(),\n",
197 |     "    instance_count=1,\n",
198 |     "    instance_type=\"ml.trn1.2xlarge\",\n",
199 |     "    disable_profiler=True,\n",
200 |     "    output_path=f\"s3://{default_bucket}/neuron_events2025\",\n",
201 |     "    base_job_name=\"trn1-tinyllama\",\n",
202 |     "    sagemaker_session=sess,\n",
203 |     "    code_bucket=f\"s3://{default_bucket}/neuron_events2025_code\",\n",
204 |     "    checkpoint_s3_uri=f\"s3://{default_bucket}/neuron_events_output\",\n",
205 |     "    image_uri=training_image,\n",
206 |     "    distribution={\"torch_distributed\": {\"enabled\": True}},\n",
207 |     "    environment={\"FI_EFA_FORK_SAFE\": \"1\", \"WANDB_DISABLED\": \"true\"},\n",
208 |     "    disable_output_compression=True,\n",
209 |     "    hyperparameters={\n",
210 |     "        \"model_id\": \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n",
211 |     "        \"tokenizer_id\": \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n",
212 |     "        \"output_dir\": \"/opt/ml/model\",\n",
213 |     "        \"tensor_parallel_size\": 2,\n",
214 |     "        \"bf16\": True,\n",
215 |     "        \"per_device_train_batch_size\": 2,\n",
216 |     "        \"gradient_accumulation_steps\": 1,\n",
217 |     "        \"gradient_checkpointing\": True,\n",
218 |     "        \"max_steps\": 1000,\n",
219 |     "        \"lora_r\": 16,\n",
220 |     "        \"lora_alpha\": 32,\n",
221 |     "        \"lora_dropout\": 0.05,\n",
222 |     "        \"logging_steps\": 10,\n",
223 |     "        \"learning_rate\": 5e-5,\n",
224 |     "        \"dataloader_drop_last\": True,\n",
225 |     "        \"disable_tqdm\": True\n",
226 |     "    }\n",
227 |     "    )"
228 |    ]
229 |   },
230 |   {
231 |    "cell_type": "markdown",
232 |    "id": "2278940b-f563-4582-9df0-bd56d9b5fd28",
233 |    "metadata": {},
234 |    "source": [
235 |     "## Launch the training job\n",
236 |     "\n",
237 |     "Once the estimator has been created, you can then launch your training job by 
calling `.fit()` on the estimator:" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 9, 243 | "id": "b7829c64-0190-43c3-be1a-0ccce7d45248", 244 | "metadata": { 245 | "tags": [] 246 | }, 247 | "outputs": [ 248 | { 249 | "name": "stderr", 250 | "output_type": "stream", 251 | "text": [ 252 | "INFO:sagemaker:Creating training-job with name: trn1-tinyllama-2025-05-13-00-40-31-750\n" 253 | ] 254 | } 255 | ], 256 | "source": [ 257 | "# Call fit() on the estimator to initiate the training job\n", 258 | "pt_estimator.fit(wait=False, logs=False)" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "id": "b77434b2-94d7-4256-8d0b-d5d2ddb1d5ae", 264 | "metadata": {}, 265 | "source": [ 266 | "## Monitor the training job\n", 267 | "\n", 268 | "When the training job has been launched, the SageMaker Training service will then take care of:\n", 269 | "- launching and configuring the requested EC2 infrastructure for your training job\n", 270 | "- launching the requested container image on each of the EC2 instances\n", 271 | "- copying your source code directory and running your training script within the container(s)\n", 272 | "- storing your trained model artifacts in Amazon Simple Storage Service (S3)\n", 273 | "- decommissioning the training infrastructure\n", 274 | "\n", 275 | "While the training job is running, the following cell will periodically check and output the job status. When you see 'Completed', you know that your training job is finished and you can proceed to the remainder of the notebook. The training job typically takes about 20 minutes to complete.\n", 276 | "\n", 277 | "If you are interested in viewing the output logs from your training job, you can view the logs by navigating to the AWS CloudWatch console, selecting `Logs -> Log Groups` in the left-hand menu, and then looking for your SageMaker training job in the list. **Note:** it will usually take 4-5 minutes before the infrastructure is running and the output logs begin to be populated in CloudWatch." 
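If you prefer to pull the logs into this notebook rather than opening the CloudWatch console, a minimal sketch along these lines can work once the job has started emitting logs. It assumes the standard `/aws/sagemaker/TrainingJobs` log group and reuses the `pt_estimator` and `region_name` variables defined earlier; it is not part of the workshop assets.

```python
# Minimal sketch: print the CloudWatch log events emitted so far by the training job.
# Run this only after the job has been in progress for a few minutes, once log
# streams exist.
import boto3

job_name = pt_estimator.jobs[-1].describe()["TrainingJobName"]
logs_client = boto3.client("logs", region_name=region_name)
log_group = "/aws/sagemaker/TrainingJobs"

for stream in logs_client.describe_log_streams(
    logGroupName=log_group, logStreamNamePrefix=job_name
)["logStreams"]:
    events = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=stream["logStreamName"],
        startFromHead=True,
    )
    for event in events["events"]:
        print(event["message"])
```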
278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 10, 283 | "id": "0c223037-2f8e-4eb0-9e4b-ff4dac6ede7a", 284 | "metadata": { 285 | "tags": [] 286 | }, 287 | "outputs": [ 288 | { 289 | "name": "stdout", 290 | "output_type": "stream", 291 | "text": [ 292 | "2025-05-13T00:40:37.718399 Training job status: InProgress!\n", 293 | "2025-05-13T00:41:07.827456 Training job status: InProgress!\n", 294 | "2025-05-13T00:41:37.941892 Training job status: InProgress!\n", 295 | "2025-05-13T00:42:08.055514 Training job status: InProgress!\n", 296 | "2025-05-13T00:42:38.170184 Training job status: InProgress!\n", 297 | "2025-05-13T00:43:08.285526 Training job status: InProgress!\n", 298 | "2025-05-13T00:43:38.401669 Training job status: InProgress!\n", 299 | "2025-05-13T00:44:08.517601 Training job status: InProgress!\n", 300 | "2025-05-13T00:44:38.607279 Training job status: InProgress!\n", 301 | "2025-05-13T00:45:08.901240 Training job status: InProgress!\n", 302 | "2025-05-13T00:45:39.029987 Training job status: InProgress!\n", 303 | "2025-05-13T00:46:09.148483 Training job status: InProgress!\n", 304 | "2025-05-13T00:46:39.262424 Training job status: InProgress!\n", 305 | "2025-05-13T00:47:09.378729 Training job status: InProgress!\n", 306 | "2025-05-13T00:47:39.477011 Training job status: InProgress!\n", 307 | "2025-05-13T00:48:09.589262 Training job status: InProgress!\n", 308 | "2025-05-13T00:48:39.715998 Training job status: InProgress!\n", 309 | "2025-05-13T00:49:09.833712 Training job status: InProgress!\n", 310 | "2025-05-13T00:49:40.132350 Training job status: InProgress!\n", 311 | "2025-05-13T00:50:10.259671 Training job status: InProgress!\n", 312 | "2025-05-13T00:50:40.376526 Training job status: InProgress!\n", 313 | "2025-05-13T00:51:10.492630 Training job status: InProgress!\n", 314 | "2025-05-13T00:51:40.612684 Training job status: InProgress!\n", 315 | "2025-05-13T00:52:10.735871 Training job status: InProgress!\n", 316 | "2025-05-13T00:52:40.856541 Training job status: InProgress!\n", 317 | "2025-05-13T00:53:10.978185 Training job status: InProgress!\n", 318 | "2025-05-13T00:53:41.102406 Training job status: InProgress!\n", 319 | "2025-05-13T00:54:11.391318 Training job status: InProgress!\n", 320 | "2025-05-13T00:54:41.506542 Training job status: InProgress!\n", 321 | "2025-05-13T00:55:11.619419 Training job status: InProgress!\n", 322 | "2025-05-13T00:55:41.736144 Training job status: InProgress!\n", 323 | "2025-05-13T00:56:11.850643 Training job status: InProgress!\n", 324 | "2025-05-13T00:56:41.965740 Training job status: InProgress!\n", 325 | "2025-05-13T00:57:12.082235 Training job status: InProgress!\n", 326 | "2025-05-13T00:57:42.193146 Training job status: InProgress!\n", 327 | "2025-05-13T00:58:12.309523 Training job status: InProgress!\n", 328 | "2025-05-13T00:58:42.596288 Training job status: InProgress!\n", 329 | "2025-05-13T00:59:12.715701 Training job status: InProgress!\n", 330 | "2025-05-13T00:59:42.835134 Training job status: InProgress!\n", 331 | "2025-05-13T01:00:12.952002 Training job status: InProgress!\n", 332 | "2025-05-13T01:00:43.070275 Training job status: InProgress!\n", 333 | "2025-05-13T01:01:13.187416 Training job status: InProgress!\n", 334 | "2025-05-13T01:01:43.291955 Training job status: InProgress!\n", 335 | "\n", 336 | "2025-05-13T01:02:13.412501 Training job status: Completed!\n" 337 | ] 338 | } 339 | ], 340 | "source": [ 341 | "# Periodically check job status until it shows 'Completed' (ETA ~20 minutes)\n", 342 | "# You 
can also monitor job status in the SageMaker console, and view the\n", 343 | "# SageMaker Training job logs in the CloudWatch console\n", 344 | "from time import sleep\n", 345 | "from datetime import datetime\n", 346 | "\n", 347 | "while (job_status := pt_estimator.jobs[-1].describe()['TrainingJobStatus']) not in ['Completed', 'Error', 'Failed']:\n", 348 | " print(f\"{datetime.now().isoformat()} Training job status: {job_status}!\")\n", 349 | " sleep(30)\n", 350 | "\n", 351 | "print(f\"\\n{datetime.now().isoformat()} Training job status: {job_status}!\")" 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "id": "16c94343-b0c6-4903-82cc-c8ab2f88b26b", 357 | "metadata": {}, 358 | "source": [ 359 | "## Determine location of fine-tuned model artifacts\n", 360 | "\n", 361 | "Once the training job has completed, SageMaker will copy your fine-tuned model artifacts to a specified location in S3.\n", 362 | "\n", 363 | "In the following cell, you can see how to programmatically determine the location of your model artifacts:" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 11, 369 | "id": "213af977-8ed6-4081-af65-59c70db2dbfb", 370 | "metadata": { 371 | "tags": [] 372 | }, 373 | "outputs": [ 374 | { 375 | "name": "stdout", 376 | "output_type": "stream", 377 | "text": [ 378 | "Your fine-tuned model is available here:\n", 379 | "\n", 380 | "s3://this.output.should.be.replaced.with.a.real.s3.path.once.the.cell.is.executed/\n" 381 | ] 382 | } 383 | ], 384 | "source": [ 385 | "# Show where the fine-tuned model is stored - previous job must be 'Completed' before running this cell\n", 386 | "model_archive_path = pt_estimator.jobs[-1].describe()['ModelArtifacts']['S3ModelArtifacts']\n", 387 | "print(f\"Your fine-tuned model is available here:\\n\\n{model_archive_path}/\")" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "id": "b68f529f-a548-4fbd-b160-3cab5f52c488", 393 | "metadata": {}, 394 | "source": [ 395 | "
\n", 396 | "\n", 397 | "**Note:** Please copy the above S3 path, as it will be required in the subsequent workshop module.\n", 398 | "\n", 399 | "\n", 400 | "Lastly, run the following cell to list the model artifacts available in your S3 model_archive_path:" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 12, 406 | "id": "27ad8c7e-6a73-4f20-944f-ac12ef286a6f", 407 | "metadata": { 408 | "tags": [] 409 | }, 410 | "outputs": [ 411 | { 412 | "name": "stdout", 413 | "output_type": "stream", 414 | "text": [ 415 | "2025-05-13 01:01:39 714 config.json\n", 416 | "2025-05-13 01:01:48 124 generation_config.json\n", 417 | "2025-05-13 01:01:40 4400216536 model.safetensors\n", 418 | "2025-05-13 01:01:47 551 special_tokens_map.json\n", 419 | "2025-05-13 01:01:47 1842795 tokenizer.json\n", 420 | "2025-05-13 01:01:39 499723 tokenizer.model\n", 421 | "2025-05-13 01:01:48 1368 tokenizer_config.json\n" 422 | ] 423 | } 424 | ], 425 | "source": [ 426 | "# View the contents of the fine-tuned model path in S3\n", 427 | "!aws s3 ls {model_archive_path}/merged_model/" 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "id": "fca9ffa7-a694-48c0-acde-cd468d18a448", 433 | "metadata": {}, 434 | "source": [ 435 | "Congratulations on completing the LLM fine-tuning module!\n", 436 | "\n", 437 | "In the next notebook, you will learn how to deploy your fine-tuned model in a SageMaker hosted endpoint, and leverage AWS Inferentia accelerators to perform model inference. Have fun!" 438 | ] 439 | } 440 | ], 441 | "metadata": { 442 | "availableInstances": [ 443 | { 444 | "_defaultOrder": 0, 445 | "_isFastLaunch": true, 446 | "category": "General purpose", 447 | "gpuNum": 0, 448 | "hideHardwareSpecs": false, 449 | "memoryGiB": 4, 450 | "name": "ml.t3.medium", 451 | "vcpuNum": 2 452 | }, 453 | { 454 | "_defaultOrder": 1, 455 | "_isFastLaunch": false, 456 | "category": "General purpose", 457 | "gpuNum": 0, 458 | "hideHardwareSpecs": false, 459 | "memoryGiB": 8, 460 | "name": "ml.t3.large", 461 | "vcpuNum": 2 462 | }, 463 | { 464 | "_defaultOrder": 2, 465 | "_isFastLaunch": false, 466 | "category": "General purpose", 467 | "gpuNum": 0, 468 | "hideHardwareSpecs": false, 469 | "memoryGiB": 16, 470 | "name": "ml.t3.xlarge", 471 | "vcpuNum": 4 472 | }, 473 | { 474 | "_defaultOrder": 3, 475 | "_isFastLaunch": false, 476 | "category": "General purpose", 477 | "gpuNum": 0, 478 | "hideHardwareSpecs": false, 479 | "memoryGiB": 32, 480 | "name": "ml.t3.2xlarge", 481 | "vcpuNum": 8 482 | }, 483 | { 484 | "_defaultOrder": 4, 485 | "_isFastLaunch": true, 486 | "category": "General purpose", 487 | "gpuNum": 0, 488 | "hideHardwareSpecs": false, 489 | "memoryGiB": 8, 490 | "name": "ml.m5.large", 491 | "vcpuNum": 2 492 | }, 493 | { 494 | "_defaultOrder": 5, 495 | "_isFastLaunch": false, 496 | "category": "General purpose", 497 | "gpuNum": 0, 498 | "hideHardwareSpecs": false, 499 | "memoryGiB": 16, 500 | "name": "ml.m5.xlarge", 501 | "vcpuNum": 4 502 | }, 503 | { 504 | "_defaultOrder": 6, 505 | "_isFastLaunch": false, 506 | "category": "General purpose", 507 | "gpuNum": 0, 508 | "hideHardwareSpecs": false, 509 | "memoryGiB": 32, 510 | "name": "ml.m5.2xlarge", 511 | "vcpuNum": 8 512 | }, 513 | { 514 | "_defaultOrder": 7, 515 | "_isFastLaunch": false, 516 | "category": "General purpose", 517 | "gpuNum": 0, 518 | "hideHardwareSpecs": false, 519 | "memoryGiB": 64, 520 | "name": "ml.m5.4xlarge", 521 | "vcpuNum": 16 522 | }, 523 | { 524 | "_defaultOrder": 8, 525 | "_isFastLaunch": false, 526 | "category": "General 
purpose", 527 | "gpuNum": 0, 528 | "hideHardwareSpecs": false, 529 | "memoryGiB": 128, 530 | "name": "ml.m5.8xlarge", 531 | "vcpuNum": 32 532 | }, 533 | { 534 | "_defaultOrder": 9, 535 | "_isFastLaunch": false, 536 | "category": "General purpose", 537 | "gpuNum": 0, 538 | "hideHardwareSpecs": false, 539 | "memoryGiB": 192, 540 | "name": "ml.m5.12xlarge", 541 | "vcpuNum": 48 542 | }, 543 | { 544 | "_defaultOrder": 10, 545 | "_isFastLaunch": false, 546 | "category": "General purpose", 547 | "gpuNum": 0, 548 | "hideHardwareSpecs": false, 549 | "memoryGiB": 256, 550 | "name": "ml.m5.16xlarge", 551 | "vcpuNum": 64 552 | }, 553 | { 554 | "_defaultOrder": 11, 555 | "_isFastLaunch": false, 556 | "category": "General purpose", 557 | "gpuNum": 0, 558 | "hideHardwareSpecs": false, 559 | "memoryGiB": 384, 560 | "name": "ml.m5.24xlarge", 561 | "vcpuNum": 96 562 | }, 563 | { 564 | "_defaultOrder": 12, 565 | "_isFastLaunch": false, 566 | "category": "General purpose", 567 | "gpuNum": 0, 568 | "hideHardwareSpecs": false, 569 | "memoryGiB": 8, 570 | "name": "ml.m5d.large", 571 | "vcpuNum": 2 572 | }, 573 | { 574 | "_defaultOrder": 13, 575 | "_isFastLaunch": false, 576 | "category": "General purpose", 577 | "gpuNum": 0, 578 | "hideHardwareSpecs": false, 579 | "memoryGiB": 16, 580 | "name": "ml.m5d.xlarge", 581 | "vcpuNum": 4 582 | }, 583 | { 584 | "_defaultOrder": 14, 585 | "_isFastLaunch": false, 586 | "category": "General purpose", 587 | "gpuNum": 0, 588 | "hideHardwareSpecs": false, 589 | "memoryGiB": 32, 590 | "name": "ml.m5d.2xlarge", 591 | "vcpuNum": 8 592 | }, 593 | { 594 | "_defaultOrder": 15, 595 | "_isFastLaunch": false, 596 | "category": "General purpose", 597 | "gpuNum": 0, 598 | "hideHardwareSpecs": false, 599 | "memoryGiB": 64, 600 | "name": "ml.m5d.4xlarge", 601 | "vcpuNum": 16 602 | }, 603 | { 604 | "_defaultOrder": 16, 605 | "_isFastLaunch": false, 606 | "category": "General purpose", 607 | "gpuNum": 0, 608 | "hideHardwareSpecs": false, 609 | "memoryGiB": 128, 610 | "name": "ml.m5d.8xlarge", 611 | "vcpuNum": 32 612 | }, 613 | { 614 | "_defaultOrder": 17, 615 | "_isFastLaunch": false, 616 | "category": "General purpose", 617 | "gpuNum": 0, 618 | "hideHardwareSpecs": false, 619 | "memoryGiB": 192, 620 | "name": "ml.m5d.12xlarge", 621 | "vcpuNum": 48 622 | }, 623 | { 624 | "_defaultOrder": 18, 625 | "_isFastLaunch": false, 626 | "category": "General purpose", 627 | "gpuNum": 0, 628 | "hideHardwareSpecs": false, 629 | "memoryGiB": 256, 630 | "name": "ml.m5d.16xlarge", 631 | "vcpuNum": 64 632 | }, 633 | { 634 | "_defaultOrder": 19, 635 | "_isFastLaunch": false, 636 | "category": "General purpose", 637 | "gpuNum": 0, 638 | "hideHardwareSpecs": false, 639 | "memoryGiB": 384, 640 | "name": "ml.m5d.24xlarge", 641 | "vcpuNum": 96 642 | }, 643 | { 644 | "_defaultOrder": 20, 645 | "_isFastLaunch": false, 646 | "category": "General purpose", 647 | "gpuNum": 0, 648 | "hideHardwareSpecs": true, 649 | "memoryGiB": 0, 650 | "name": "ml.geospatial.interactive", 651 | "supportedImageNames": [ 652 | "sagemaker-geospatial-v1-0" 653 | ], 654 | "vcpuNum": 0 655 | }, 656 | { 657 | "_defaultOrder": 21, 658 | "_isFastLaunch": true, 659 | "category": "Compute optimized", 660 | "gpuNum": 0, 661 | "hideHardwareSpecs": false, 662 | "memoryGiB": 4, 663 | "name": "ml.c5.large", 664 | "vcpuNum": 2 665 | }, 666 | { 667 | "_defaultOrder": 22, 668 | "_isFastLaunch": false, 669 | "category": "Compute optimized", 670 | "gpuNum": 0, 671 | "hideHardwareSpecs": false, 672 | "memoryGiB": 8, 673 | "name": "ml.c5.xlarge", 674 | 
"vcpuNum": 4 675 | }, 676 | { 677 | "_defaultOrder": 23, 678 | "_isFastLaunch": false, 679 | "category": "Compute optimized", 680 | "gpuNum": 0, 681 | "hideHardwareSpecs": false, 682 | "memoryGiB": 16, 683 | "name": "ml.c5.2xlarge", 684 | "vcpuNum": 8 685 | }, 686 | { 687 | "_defaultOrder": 24, 688 | "_isFastLaunch": false, 689 | "category": "Compute optimized", 690 | "gpuNum": 0, 691 | "hideHardwareSpecs": false, 692 | "memoryGiB": 32, 693 | "name": "ml.c5.4xlarge", 694 | "vcpuNum": 16 695 | }, 696 | { 697 | "_defaultOrder": 25, 698 | "_isFastLaunch": false, 699 | "category": "Compute optimized", 700 | "gpuNum": 0, 701 | "hideHardwareSpecs": false, 702 | "memoryGiB": 72, 703 | "name": "ml.c5.9xlarge", 704 | "vcpuNum": 36 705 | }, 706 | { 707 | "_defaultOrder": 26, 708 | "_isFastLaunch": false, 709 | "category": "Compute optimized", 710 | "gpuNum": 0, 711 | "hideHardwareSpecs": false, 712 | "memoryGiB": 96, 713 | "name": "ml.c5.12xlarge", 714 | "vcpuNum": 48 715 | }, 716 | { 717 | "_defaultOrder": 27, 718 | "_isFastLaunch": false, 719 | "category": "Compute optimized", 720 | "gpuNum": 0, 721 | "hideHardwareSpecs": false, 722 | "memoryGiB": 144, 723 | "name": "ml.c5.18xlarge", 724 | "vcpuNum": 72 725 | }, 726 | { 727 | "_defaultOrder": 28, 728 | "_isFastLaunch": false, 729 | "category": "Compute optimized", 730 | "gpuNum": 0, 731 | "hideHardwareSpecs": false, 732 | "memoryGiB": 192, 733 | "name": "ml.c5.24xlarge", 734 | "vcpuNum": 96 735 | }, 736 | { 737 | "_defaultOrder": 29, 738 | "_isFastLaunch": true, 739 | "category": "Accelerated computing", 740 | "gpuNum": 1, 741 | "hideHardwareSpecs": false, 742 | "memoryGiB": 16, 743 | "name": "ml.g4dn.xlarge", 744 | "vcpuNum": 4 745 | }, 746 | { 747 | "_defaultOrder": 30, 748 | "_isFastLaunch": false, 749 | "category": "Accelerated computing", 750 | "gpuNum": 1, 751 | "hideHardwareSpecs": false, 752 | "memoryGiB": 32, 753 | "name": "ml.g4dn.2xlarge", 754 | "vcpuNum": 8 755 | }, 756 | { 757 | "_defaultOrder": 31, 758 | "_isFastLaunch": false, 759 | "category": "Accelerated computing", 760 | "gpuNum": 1, 761 | "hideHardwareSpecs": false, 762 | "memoryGiB": 64, 763 | "name": "ml.g4dn.4xlarge", 764 | "vcpuNum": 16 765 | }, 766 | { 767 | "_defaultOrder": 32, 768 | "_isFastLaunch": false, 769 | "category": "Accelerated computing", 770 | "gpuNum": 1, 771 | "hideHardwareSpecs": false, 772 | "memoryGiB": 128, 773 | "name": "ml.g4dn.8xlarge", 774 | "vcpuNum": 32 775 | }, 776 | { 777 | "_defaultOrder": 33, 778 | "_isFastLaunch": false, 779 | "category": "Accelerated computing", 780 | "gpuNum": 4, 781 | "hideHardwareSpecs": false, 782 | "memoryGiB": 192, 783 | "name": "ml.g4dn.12xlarge", 784 | "vcpuNum": 48 785 | }, 786 | { 787 | "_defaultOrder": 34, 788 | "_isFastLaunch": false, 789 | "category": "Accelerated computing", 790 | "gpuNum": 1, 791 | "hideHardwareSpecs": false, 792 | "memoryGiB": 256, 793 | "name": "ml.g4dn.16xlarge", 794 | "vcpuNum": 64 795 | }, 796 | { 797 | "_defaultOrder": 35, 798 | "_isFastLaunch": false, 799 | "category": "Accelerated computing", 800 | "gpuNum": 1, 801 | "hideHardwareSpecs": false, 802 | "memoryGiB": 61, 803 | "name": "ml.p3.2xlarge", 804 | "vcpuNum": 8 805 | }, 806 | { 807 | "_defaultOrder": 36, 808 | "_isFastLaunch": false, 809 | "category": "Accelerated computing", 810 | "gpuNum": 4, 811 | "hideHardwareSpecs": false, 812 | "memoryGiB": 244, 813 | "name": "ml.p3.8xlarge", 814 | "vcpuNum": 32 815 | }, 816 | { 817 | "_defaultOrder": 37, 818 | "_isFastLaunch": false, 819 | "category": "Accelerated computing", 820 | "gpuNum": 
8, 821 | "hideHardwareSpecs": false, 822 | "memoryGiB": 488, 823 | "name": "ml.p3.16xlarge", 824 | "vcpuNum": 64 825 | }, 826 | { 827 | "_defaultOrder": 38, 828 | "_isFastLaunch": false, 829 | "category": "Accelerated computing", 830 | "gpuNum": 8, 831 | "hideHardwareSpecs": false, 832 | "memoryGiB": 768, 833 | "name": "ml.p3dn.24xlarge", 834 | "vcpuNum": 96 835 | }, 836 | { 837 | "_defaultOrder": 39, 838 | "_isFastLaunch": false, 839 | "category": "Memory Optimized", 840 | "gpuNum": 0, 841 | "hideHardwareSpecs": false, 842 | "memoryGiB": 16, 843 | "name": "ml.r5.large", 844 | "vcpuNum": 2 845 | }, 846 | { 847 | "_defaultOrder": 40, 848 | "_isFastLaunch": false, 849 | "category": "Memory Optimized", 850 | "gpuNum": 0, 851 | "hideHardwareSpecs": false, 852 | "memoryGiB": 32, 853 | "name": "ml.r5.xlarge", 854 | "vcpuNum": 4 855 | }, 856 | { 857 | "_defaultOrder": 41, 858 | "_isFastLaunch": false, 859 | "category": "Memory Optimized", 860 | "gpuNum": 0, 861 | "hideHardwareSpecs": false, 862 | "memoryGiB": 64, 863 | "name": "ml.r5.2xlarge", 864 | "vcpuNum": 8 865 | }, 866 | { 867 | "_defaultOrder": 42, 868 | "_isFastLaunch": false, 869 | "category": "Memory Optimized", 870 | "gpuNum": 0, 871 | "hideHardwareSpecs": false, 872 | "memoryGiB": 128, 873 | "name": "ml.r5.4xlarge", 874 | "vcpuNum": 16 875 | }, 876 | { 877 | "_defaultOrder": 43, 878 | "_isFastLaunch": false, 879 | "category": "Memory Optimized", 880 | "gpuNum": 0, 881 | "hideHardwareSpecs": false, 882 | "memoryGiB": 256, 883 | "name": "ml.r5.8xlarge", 884 | "vcpuNum": 32 885 | }, 886 | { 887 | "_defaultOrder": 44, 888 | "_isFastLaunch": false, 889 | "category": "Memory Optimized", 890 | "gpuNum": 0, 891 | "hideHardwareSpecs": false, 892 | "memoryGiB": 384, 893 | "name": "ml.r5.12xlarge", 894 | "vcpuNum": 48 895 | }, 896 | { 897 | "_defaultOrder": 45, 898 | "_isFastLaunch": false, 899 | "category": "Memory Optimized", 900 | "gpuNum": 0, 901 | "hideHardwareSpecs": false, 902 | "memoryGiB": 512, 903 | "name": "ml.r5.16xlarge", 904 | "vcpuNum": 64 905 | }, 906 | { 907 | "_defaultOrder": 46, 908 | "_isFastLaunch": false, 909 | "category": "Memory Optimized", 910 | "gpuNum": 0, 911 | "hideHardwareSpecs": false, 912 | "memoryGiB": 768, 913 | "name": "ml.r5.24xlarge", 914 | "vcpuNum": 96 915 | }, 916 | { 917 | "_defaultOrder": 47, 918 | "_isFastLaunch": false, 919 | "category": "Accelerated computing", 920 | "gpuNum": 1, 921 | "hideHardwareSpecs": false, 922 | "memoryGiB": 16, 923 | "name": "ml.g5.xlarge", 924 | "vcpuNum": 4 925 | }, 926 | { 927 | "_defaultOrder": 48, 928 | "_isFastLaunch": false, 929 | "category": "Accelerated computing", 930 | "gpuNum": 1, 931 | "hideHardwareSpecs": false, 932 | "memoryGiB": 32, 933 | "name": "ml.g5.2xlarge", 934 | "vcpuNum": 8 935 | }, 936 | { 937 | "_defaultOrder": 49, 938 | "_isFastLaunch": false, 939 | "category": "Accelerated computing", 940 | "gpuNum": 1, 941 | "hideHardwareSpecs": false, 942 | "memoryGiB": 64, 943 | "name": "ml.g5.4xlarge", 944 | "vcpuNum": 16 945 | }, 946 | { 947 | "_defaultOrder": 50, 948 | "_isFastLaunch": false, 949 | "category": "Accelerated computing", 950 | "gpuNum": 1, 951 | "hideHardwareSpecs": false, 952 | "memoryGiB": 128, 953 | "name": "ml.g5.8xlarge", 954 | "vcpuNum": 32 955 | }, 956 | { 957 | "_defaultOrder": 51, 958 | "_isFastLaunch": false, 959 | "category": "Accelerated computing", 960 | "gpuNum": 1, 961 | "hideHardwareSpecs": false, 962 | "memoryGiB": 256, 963 | "name": "ml.g5.16xlarge", 964 | "vcpuNum": 64 965 | }, 966 | { 967 | "_defaultOrder": 52, 968 | 
"_isFastLaunch": false, 969 | "category": "Accelerated computing", 970 | "gpuNum": 4, 971 | "hideHardwareSpecs": false, 972 | "memoryGiB": 192, 973 | "name": "ml.g5.12xlarge", 974 | "vcpuNum": 48 975 | }, 976 | { 977 | "_defaultOrder": 53, 978 | "_isFastLaunch": false, 979 | "category": "Accelerated computing", 980 | "gpuNum": 4, 981 | "hideHardwareSpecs": false, 982 | "memoryGiB": 384, 983 | "name": "ml.g5.24xlarge", 984 | "vcpuNum": 96 985 | }, 986 | { 987 | "_defaultOrder": 54, 988 | "_isFastLaunch": false, 989 | "category": "Accelerated computing", 990 | "gpuNum": 8, 991 | "hideHardwareSpecs": false, 992 | "memoryGiB": 768, 993 | "name": "ml.g5.48xlarge", 994 | "vcpuNum": 192 995 | }, 996 | { 997 | "_defaultOrder": 55, 998 | "_isFastLaunch": false, 999 | "category": "Accelerated computing", 1000 | "gpuNum": 8, 1001 | "hideHardwareSpecs": false, 1002 | "memoryGiB": 1152, 1003 | "name": "ml.p4d.24xlarge", 1004 | "vcpuNum": 96 1005 | }, 1006 | { 1007 | "_defaultOrder": 56, 1008 | "_isFastLaunch": false, 1009 | "category": "Accelerated computing", 1010 | "gpuNum": 8, 1011 | "hideHardwareSpecs": false, 1012 | "memoryGiB": 1152, 1013 | "name": "ml.p4de.24xlarge", 1014 | "vcpuNum": 96 1015 | }, 1016 | { 1017 | "_defaultOrder": 57, 1018 | "_isFastLaunch": false, 1019 | "category": "Accelerated computing", 1020 | "gpuNum": 0, 1021 | "hideHardwareSpecs": false, 1022 | "memoryGiB": 32, 1023 | "name": "ml.trn1.2xlarge", 1024 | "vcpuNum": 8 1025 | }, 1026 | { 1027 | "_defaultOrder": 58, 1028 | "_isFastLaunch": false, 1029 | "category": "Accelerated computing", 1030 | "gpuNum": 0, 1031 | "hideHardwareSpecs": false, 1032 | "memoryGiB": 512, 1033 | "name": "ml.trn1.32xlarge", 1034 | "vcpuNum": 128 1035 | }, 1036 | { 1037 | "_defaultOrder": 59, 1038 | "_isFastLaunch": false, 1039 | "category": "Accelerated computing", 1040 | "gpuNum": 0, 1041 | "hideHardwareSpecs": false, 1042 | "memoryGiB": 512, 1043 | "name": "ml.trn1n.32xlarge", 1044 | "vcpuNum": 128 1045 | } 1046 | ], 1047 | "instance_type": "ml.t3.medium", 1048 | "kernelspec": { 1049 | "display_name": "Python 3 (ipykernel)", 1050 | "language": "python", 1051 | "name": "python3" 1052 | }, 1053 | "language_info": { 1054 | "codemirror_mode": { 1055 | "name": "ipython", 1056 | "version": 3 1057 | }, 1058 | "file_extension": ".py", 1059 | "mimetype": "text/x-python", 1060 | "name": "python", 1061 | "nbconvert_exporter": "python", 1062 | "pygments_lexer": "ipython3", 1063 | "version": "3.9.21" 1064 | } 1065 | }, 1066 | "nbformat": 4, 1067 | "nbformat_minor": 5 1068 | } 1069 | -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/assets/consolidate_adapter_shards_and_merge_model.py: -------------------------------------------------------------------------------- 1 | from optimum.neuron.distributed.checkpointing import ( 2 | consolidate_model_parallel_checkpoints_to_unified_checkpoint, 3 | ) 4 | from transformers import AutoModel, AutoTokenizer 5 | from argparse import ArgumentParser 6 | from shutil import copyfile 7 | import os 8 | import peft 9 | 10 | parser = ArgumentParser() 11 | parser.add_argument( 12 | "-i", 13 | "--input_dir", 14 | help="source checkpoint directory containing sharded adapter checkpoint files", 15 | required=True, 16 | ) 17 | parser.add_argument( 18 | "-o", 19 | "--output_dir", 20 | help="destination directory for final merged model (adapters merged into base model)", 21 | required=True, 22 | ) 23 | args = parser.parse_args() 24 | 25 | 
consolidated_ckpt_dir = os.path.join(args.input_dir, "consolidated") 26 | 27 | # Consolidate the adapter shards into a PEFT-compatible checkpoint 28 | print("Consolidating LoRA adapter shards") 29 | consolidate_model_parallel_checkpoints_to_unified_checkpoint( 30 | args.input_dir, consolidated_ckpt_dir 31 | ) 32 | copyfile( 33 | os.path.join(args.input_dir, "adapter_config.json"), 34 | os.path.join(consolidated_ckpt_dir, "adapter_config.json"), 35 | ) 36 | 37 | # Load AutoPeftModel using the consolidated PEFT checkpoint 38 | peft_model = peft.AutoPeftModelForCausalLM.from_pretrained(consolidated_ckpt_dir) 39 | 40 | # Merge adapter weights into base model, save new pretrained model 41 | print("Merging LoRA adapter shards into base model") 42 | merged_model = peft_model.merge_and_unload() 43 | print(f"Saving merged model to {args.output_dir}") 44 | merged_model.save_pretrained(args.output_dir) 45 | 46 | print(f"Saving tokenizer to {args.output_dir}") 47 | tokenizer = AutoTokenizer.from_pretrained(args.input_dir) 48 | tokenizer.save_pretrained(args.output_dir) 49 | 50 | # Load the pretrained model and print config 51 | print("Merged model config:") 52 | model = AutoModel.from_pretrained(args.output_dir) 53 | print(model) 54 | -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/assets/finetune_llama.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass, field 2 | from datasets import load_dataset 3 | from peft import LoraConfig 4 | from transformers import ( 5 | AutoModelForCausalLM, 6 | AutoTokenizer, 7 | set_seed, 8 | ) 9 | import os 10 | import subprocess 11 | 12 | from optimum.neuron import NeuronHfArgumentParser as HfArgumentParser 13 | from optimum.neuron import NeuronSFTConfig, NeuronSFTTrainer, NeuronTrainingArguments 14 | from optimum.neuron.distributed import lazy_load_for_parallelism 15 | from torch_xla.core.xla_model import is_master_ordinal 16 | 17 | 18 | def training_function(script_args, training_args): 19 | dataset = load_dataset("b-mc2/sql-create-context", split="train") 20 | dataset = dataset.shuffle(seed=23) 21 | train_dataset = dataset.select(range(50000)) 22 | eval_dataset = dataset.select(range(50000, 50500)) 23 | 24 | def create_conversation(sample): 25 | system_message = ( 26 | "You are a text to SQL query translator. 
Users will ask you questions in English and you will generate a " 27 | "SQL query based on the provided SCHEMA.\nSCHEMA:\n{schema}" 28 | ) 29 | return { 30 | "messages": [ 31 | { 32 | "role": "system", 33 | "content": system_message.format(schema=sample["context"]), 34 | }, 35 | {"role": "user", "content": sample["question"]}, 36 | {"role": "assistant", "content": sample["answer"] + ";"}, 37 | ] 38 | } 39 | 40 | train_dataset = train_dataset.map( 41 | create_conversation, remove_columns=train_dataset.features, batched=False 42 | ) 43 | eval_dataset = eval_dataset.map( 44 | create_conversation, remove_columns=eval_dataset.features, batched=False 45 | ) 46 | 47 | tokenizer = AutoTokenizer.from_pretrained(script_args.tokenizer_id) 48 | # tokenizer.pad_token = tokenizer.eos_token 49 | # tokenizer.eos_token_id = 128001 50 | 51 | with lazy_load_for_parallelism( 52 | tensor_parallel_size=training_args.tensor_parallel_size 53 | ): 54 | model = AutoModelForCausalLM.from_pretrained(script_args.model_id) 55 | 56 | config = LoraConfig( 57 | r=script_args.lora_r, 58 | lora_alpha=script_args.lora_alpha, 59 | lora_dropout=script_args.lora_dropout, 60 | target_modules=[ 61 | "q_proj", 62 | "gate_proj", 63 | "v_proj", 64 | "o_proj", 65 | "k_proj", 66 | "up_proj", 67 | "down_proj", 68 | ], 69 | bias="none", 70 | task_type="CAUSAL_LM", 71 | ) 72 | 73 | args = training_args.to_dict() 74 | 75 | sft_config = NeuronSFTConfig( 76 | max_seq_length=1024, 77 | packing=True, 78 | **args, 79 | dataset_kwargs={ 80 | "add_special_tokens": False, 81 | "append_concat_token": True, 82 | }, 83 | ) 84 | 85 | trainer = NeuronSFTTrainer( 86 | args=sft_config, 87 | model=model, 88 | peft_config=config, 89 | tokenizer=tokenizer, 90 | train_dataset=train_dataset, 91 | eval_dataset=eval_dataset, 92 | ) 93 | 94 | # Start training 95 | trainer.train() 96 | del trainer 97 | 98 | 99 | @dataclass 100 | class ScriptArguments: 101 | model_id: str = field( 102 | default="TinyLlama/TinyLlama-1.1B-Chat-v1.0", 103 | metadata={ 104 | "help": "The model that you want to train from the Hugging Face hub." 105 | }, 106 | ) 107 | tokenizer_id: str = field( 108 | default="TinyLlama/TinyLlama-1.1B-Chat-v1.0", 109 | metadata={"help": "The tokenizer used to tokenize text for fine-tuning."}, 110 | ) 111 | lora_r: int = field( 112 | default=16, 113 | metadata={"help": "LoRA r value to be used during fine-tuning."}, 114 | ) 115 | lora_alpha: int = field( 116 | default=32, 117 | metadata={"help": "LoRA alpha value to be used during fine-tuning."}, 118 | ) 119 | lora_dropout: float = field( 120 | default=0.05, 121 | metadata={"help": "LoRA dropout value to be used during fine-tuning."}, 122 | ) 123 | 124 | 125 | if __name__ == "__main__": 126 | parser = HfArgumentParser([ScriptArguments, NeuronTrainingArguments]) 127 | script_args, training_args = parser.parse_args_into_dataclasses() 128 | 129 | set_seed(training_args.seed) 130 | training_function(script_args, training_args) 131 | 132 | # Consolidate LoRA adapter shards, merge LoRA adapters into base model, save merged model 133 | if is_master_ordinal(): 134 | input_ckpt_dir = os.path.join( 135 | training_args.output_dir, f"checkpoint-{training_args.max_steps}" 136 | ) 137 | output_ckpt_dir = os.path.join(training_args.output_dir, "merged_model") 138 | # the spawned process expects to see 2 NeuronCores for consolidating checkpoints with a tp=2 139 | # Either the second core isn't really used or it is freed up by the other thread finishing. 140 | # Adjusting Neuron env. 
var to advertise 2 NeuronCores to the process. 141 | env = os.environ.copy() 142 | env["NEURON_RT_VISIBLE_CORES"] = "0-1" 143 | subprocess.run( 144 | [ 145 | "python3", 146 | "consolidate_adapter_shards_and_merge_model.py", 147 | "-i", 148 | input_ckpt_dir, 149 | "-o", 150 | output_ckpt_dir, 151 | ], 152 | env=env 153 | ) -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/assets/requirements.txt: -------------------------------------------------------------------------------- 1 | optimum-neuron==0.0.27 2 | peft==0.14.0 3 | trl==0.11.4 -------------------------------------------------------------------------------- /labs/Lab_One_NxDI.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "5a972332", 6 | "metadata": {}, 7 | "source": [ 8 | "# Develop support for a new model with NeuronX Distributed Inference\n", 9 | "\n", 10 | "In this notebook you will learn how to develop support for a new model with NeuronX Distributed Inference (NxD). NxD is a Python package developed by Annapurna Labs that enables you to shard, compile, train, and host PyTorch models on Trainium and Inferentia instances. We develop two key packages demonstrating how to use this, [NxD Inference](https://github.com/aws-neuron/neuronx-distributed-inference/tree/main) and [NxD Training](https://github.com/aws-neuron/neuronx-distributed-training). This notebook focuses on inference. You will learn how to develop support for a new model in NxD Inference in the context of Llama 3.2 1B.\n", 11 | "\n", 12 | "#### Overview\n", 13 | "1. Check dependencies for AWS Neuron SDK\n", 14 | "2. Accept the Meta usage terms and download the model from Hugging Face.\n", 15 | "3. Learn how to invoke the model step-by-step\n", 16 | " - Load the model from a local path.\n", 17 | " - Shard and compile it for Trainium.\n", 18 | " - Download and tokenize the dataset\n", 19 | " - Invoke the model with prompts\n", 20 | "4. Learn how to modify the underlying APIs to work with your own models\n", 21 | "\n", 22 | "#### Prerequisites\n", 23 | "This notebook was developed on a trn1.2xlarge instance, using the latest Amazon Linux DLAMI. Both the Amazon Linux and Ubuntu Neuron DLAMIs have preinstalled Python virtual environments with all the basic software packages included. The virtual environment used to develop this notebook is located at this path in both Amazon Linux and Ubuntu DLAMIs: `/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference`. " 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "id": "3652fc5a", 29 | "metadata": {}, 30 | "source": [ 31 | "### Step 1. Import NxD Inference packages\n", 32 | "\n", 33 | "If you are running this notebook in the virtual environment for NxD Inference, then the package should already be installed. Let's verify that with the following import." 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "id": "c4405a13-5431-4d29-a6a6-2eb989fb0f50", 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "import neuronx_distributed_inference" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "id": "0d1970fc", 49 | "metadata": {}, 50 | "source": [ 51 | "### Step 2. 
Accept the Meta usage terms and download the model\n", 52 | "\n", 53 | "If you would like to use the model directly from Meta, you'll need to navigate over to the Hugging Face hub for Llama 3.2 1B [here](https://huggingface.co/meta-llama/Llama-3.2-1B). Log in to the Hub, accept the usage terms, and request access to the model. Once access has been granted, copy your Hugging Face token and paste it into the download command below.\n", 54 | "\n", 55 | "If you do not have your token readily available, you can proceed with the alternative model shown below." 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "id": "959fb008-a2c8-4505-8f60-42e5b2060b31", 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "# helpful packages to speed up the download\n", 66 | "!pip install hf_transfer \"huggingface_hub[cli]\"" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "ccff01a8-94f7-4d10-bdf7-71229ec19cb9", 72 | "metadata": {}, 73 | "source": [ 74 | "We'll download the `NousResearch/Llama-3.2-1B` model here." 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "id": "75a2e3d1-7c1b-4d9d-b1f5-d294a1381566", 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "!huggingface-cli download NousResearch/Llama-3.2-1B --local-dir /home/ec2-user/environment/models/llama/" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "id": "02214b8a", 90 | "metadata": {}, 91 | "source": [ 92 | "### Step 3. Establish model configs\n", 93 | "Next, you'll point to the local model files and establish config objects. Each of these configs is helpful in successfully invoking the model." 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "id": "77e54a5f-842f-4b2c-ab79-c0f11a6ef292", 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "# the original checkpoint\n", 104 | "model_path = '/home/ec2-user/environment/models/llama/'" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "id": "094dc24d-dd06-45c8-adec-fa997f02e6d1", 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "# where your NxD trace will go\n", 115 | "traced_model_path = '/home/ec2-user/environment/models/traced_llama'" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "id": "9f72bda4-5e04-442c-b016-f30816db54d4", 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "import torch\n", 126 | "from transformers import AutoTokenizer, GenerationConfig\n", 127 | "\n", 128 | "from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig\n", 129 | "from neuronx_distributed_inference.models.llama.modeling_llama import LlamaInferenceConfig, NeuronLlamaForCausalLM\n", 130 | "from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter, load_pretrained_config\n", 131 | "from neuronx_distributed_inference.modules.generation.sampling import prepare_sampling_params\n", 132 | "\n", 133 | "# torch.manual_seed(0)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "id": "812403b6", 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "# update the generation config to address a trailing comma\n", 144 | "!cp generation_config.json $model_path/" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "id": "857c6e49-ce3a-47c9-868a-520f0cd68276", 151 | "metadata": {}, 152 | "outputs": [], 153 | 
"source": [ 154 | "# Initialize configs \n", 155 | "generation_config = GenerationConfig.from_pretrained(model_path)\n", 156 | "\n", 157 | "# Some sample overrides for generation\n", 158 | "generation_config_kwargs = {\n", 159 | " \"do_sample\": True,\n", 160 | " \"top_k\": 1,\n", 161 | " \"pad_token_id\": generation_config.eos_token_id,\n", 162 | "}\n", 163 | "generation_config.update(**generation_config_kwargs)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "id": "d196acdb-d094-41c0-9638-9974cec332c4", 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "neuron_config = NeuronConfig(\n", 174 | " tp_degree=2,\n", 175 | " batch_size=2,\n", 176 | " max_context_length=32,\n", 177 | " seq_len=64,\n", 178 | " on_device_sampling_config=OnDeviceSamplingConfig(top_k=1),\n", 179 | " enable_bucketing=True,\n", 180 | " flash_decoding_enabled=False\n", 181 | ")\n", 182 | "\n", 183 | "# Build the Llama Inference config\n", 184 | "config = LlamaInferenceConfig(\n", 185 | " neuron_config,\n", 186 | " load_config=load_pretrained_config(model_path),\n", 187 | ")" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "id": "5269bcdd-cf8c-4b10-a428-0cd0fafd83d1", 193 | "metadata": {}, 194 | "source": [ 195 | "### Step 4. Shard and compile the model\n", 196 | "The NeuronX compiler will optimize your model for Trainium hardware, ultimately generating the assembly code that executes your operations. We will invoke that compiler now. Generally it's suggested to compile for some of the larger input and output shapes for your model, while using bucketing to optimize performance. Both of those are handled for you automatically with NxD.\n", 197 | "\n", 198 | "With NxD, this step also shards your checkpoint for the TP degree that you defined above. Compilation can take some time, for a 1B model this should run for a few minutes." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "id": "afd1e5d5-a989-40fb-8350-fca737470b19", 205 | "metadata": { 206 | "scrolled": true 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "model = NeuronLlamaForCausalLM(model_path, config)\n", 211 | "model.compile(traced_model_path)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "id": "38178c7e-0f6e-41ab-9383-2942615b82ed", 217 | "metadata": {}, 218 | "source": [ 219 | "Once compilation is complete your new model is saved and ready to load! " 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "id": "63a37f02-ed94-4c3e-81cc-6d9e23c04175", 225 | "metadata": {}, 226 | "source": [ 227 | "### Step 5. Download the tokenizer" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "id": "0c5e306f-9488-4b0e-8e6a-f238a50f2cfe", 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side=\"right\")\n", 238 | "tokenizer.pad_token = tokenizer.eos_token\n", 239 | "tokenizer.save_pretrained(traced_model_path)" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "id": "212cfe39-9e66-4a02-bf21-2560de065a34", 245 | "metadata": {}, 246 | "source": [ 247 | "### Step 6. 
Load the traced model" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "id": "c945db68-5392-406c-8dd6-9e66b9ab0a63", 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "model = NeuronLlamaForCausalLM(traced_model_path)\n", 258 | "model.load(traced_model_path)\n", 259 | "tokenizer = AutoTokenizer.from_pretrained(traced_model_path)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "id": "75b0f5f3-b12b-4ac4-883c-856604f8d44e", 265 | "metadata": {}, 266 | "source": [ 267 | "### Step 7. Define the prompts and prepare them for sampling" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "id": "f203a455-402b-4ddc-81d3-d4d1b4335c5c", 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "prompts = [\"I believe the meaning of life is\", \"The color of the sky is\"]\n", 278 | "\n", 279 | "# Example: parameter sweeps for sampling\n", 280 | "sampling_params = prepare_sampling_params(batch_size=neuron_config.batch_size,\n", 281 | " top_k=[10, 5],\n", 282 | " top_p=[0.5, 0.9],\n", 283 | " temperature=[0.9, 0.5])\n", 284 | "\n", 285 | "inputs = tokenizer(prompts, padding=True, return_tensors=\"pt\")" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "id": "108f43f8-a2a8-4986-af7c-fdc58a37f3cd", 291 | "metadata": {}, 292 | "source": [ 293 | "### Step 8. Create a Generation Adapter and run inference" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "id": "2f511a6f-049c-4a05-bccc-f5cce8071334", 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "generation_model = HuggingFaceGenerationAdapter(model)\n", 304 | "outputs = generation_model.generate(\n", 305 | " inputs.input_ids,\n", 306 | " generation_config=generation_config,\n", 307 | " attention_mask=inputs.attention_mask,\n", 308 | " max_length=model.config.neuron_config.max_length,\n", 309 | " sampling_params=sampling_params,\n", 310 | ")\n", 311 | "output_tokens = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)\n", 312 | "\n", 313 | "print(\"Generated outputs:\")\n", 314 | "for i, output_token in enumerate(output_tokens):\n", 315 | " print(f\"Output {i}: {output_token}\")\n" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "id": "a5b840fc-dcba-428a-bcf8-c35702d144e0", 321 | "metadata": {}, 322 | "source": [ 323 | "---\n", 324 | "# Develop support for a new model with NxDI\n", 325 | "Now that you've run inference with this model, let's take a closer look at how this works. The cells you just ran are based on a script available in our repository [here](https://github.com/aws-neuron/neuronx-distributed-inference/tree/main). You can step through this repository to understand how the objects are developed, inherited, and made available for inference. The full developer guide on the topic is available [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/onboarding-models.html#nxdi-onboarding-models). Let's look at some of the key points!" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "id": "f5ec151f-ce53-4051-a9d0-957654834f51", 331 | "metadata": {}, 332 | "source": [ 333 | "#### 1/ NeuronConfig class\n", 334 | "You can inherit our base `NeuronConfig` class and extend it with your own model parameters. 
In the notebook you just ran, this is how we defined the following parameters:\n", 335 | "- Tensor Parallel (TP) Degree\n", 336 | "- Batch size\n", 337 | "- Max context length (input shape)\n", 338 | "- Sequence length (output shape)\n", 339 | "- On device sampling\n", 340 | "- Enabling bucketing\n", 341 | "- Flash decoding\n", 342 | "\n", 343 | "\n", 344 | "This object and these parameters will be sent to the compiler when you call `model.compile`. It's a helpful way to ensure that the compiler registers your design choices so that it can start optimizations. It also enables model sharding with NxDI for your preferred TP degree, which lets you very quickly test a variety of TP degrees (TP=8, 32, 64, etc)." 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "id": "ac98eb22-c02b-4c74-bd4c-3cd1bd196f54", 350 | "metadata": {}, 351 | "source": [ 352 | "#### 2/ InferenceConfig class\n", 353 | "Next, you can inherit our base `InferenceConfig` class and extend it with the rest of your modeling parameters. In the notebook you ran above, we took two important steps with this config.\n", 354 | "1. Passed into it the base `NeuronConfig`.\n", 355 | "2. Passed the rest of the model config from the HuggingFace pretrained config.\n", 356 | "\n", 357 | "Your inference class is where you define modeling parameters like the following:\n", 358 | "- hidden size\n", 359 | "- num attention heads\n", 360 | "- num hidden layers\n", 361 | "- num key value heads\n", 362 | "- vocab size\n", 363 | "\n", 364 | "You'll use this `config` object to save and compile your model. Let's learn how!" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "id": "71016dc5-112d-470f-a1ee-ce1855a5487d", 370 | "metadata": {}, 371 | "source": [ 372 | "#### 3/ NeuronModel\n", 373 | "This is how you fundamentally integrate your modeling code into the Neuron SDK. If you'd like to simply reuse our `NeuronAttentionBase`, you can inherit this directly through the library and simply pass your parameters through the `InferenceConfig` you defined above. This is how the example code in our notebook works. This is also the fastest way of getting your model online with NxDI.\n", 374 | "\n", 375 | "In the example code you ran, you also used our code for `NeuronLlamaMLP`. This is a layer in the network which inherits from `nn.Module` directly, and it's where you can define the structure of your computations. The `NeuronLlamaMLP` uses a predefined `ColumnParallelLinear` object for both the gate and up projections, while using a predefined `RowParallelLinear` object for the down projection. It also defines a forward pass on that layer.\n", 376 | "\n", 377 | "The rest of the model is defined similarly: either you inherit from our base objects and just pass in your `InferenceConfig`, or you define a new layer inheriting from `nn.Module` and write those layers as either `RowParallelLinear`, `ColumnParallelLinear`, or something else. The benefit of writing your layers into the `Row` and `Column` parallel layers as presented here is that we can handle the distribution of your model for you. \n", 378 | "\n", 379 | "For a more complete guide, check out our documentation on the subject [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api_guide.html#api-guide)." 
380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "id": "8e98a0d4", 385 | "metadata": {}, 386 | "source": [ 387 | "### Notebook Wrap-Up\n", 388 | "\n", 389 | "For more advanced topics:\n", 390 | "- **Profiling**: See [Neuron Profiling Tools](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-profile/index.html).\n", 391 | "- **Distributed Serving**: Explore vLLM or other serving frameworks.\n", 392 | "- **Performance Benchmarking**: Use `llmperf` or custom scripts.\n", 393 | "\n", 394 | "Thank you for using AWS Trainium, and happy LLM experimentation!\n" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": null, 400 | "id": "3bfc3c62-08a4-49ae-adef-5c0d661f2712", 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [] 404 | } 405 | ], 406 | "metadata": { 407 | "kernelspec": { 408 | "display_name": "Python 3 (ipykernel)", 409 | "language": "python", 410 | "name": "python3" 411 | }, 412 | "language_info": { 413 | "codemirror_mode": { 414 | "name": "ipython", 415 | "version": 3 416 | }, 417 | "file_extension": ".py", 418 | "mimetype": "text/x-python", 419 | "name": "python", 420 | "nbconvert_exporter": "python", 421 | "pygments_lexer": "ipython3", 422 | "version": "3.9.21" 423 | } 424 | }, 425 | "nbformat": 4, 426 | "nbformat_minor": 5 427 | } 428 | -------------------------------------------------------------------------------- /labs/Lab_Two_NKI.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "d6b1e73f-dc2c-4d66-b3ba-4fb71b5243c8", 6 | "metadata": {}, 7 | "source": [ 8 | "# Write your own kernel with the Neuron Kernel Interface (NKI)\n", 9 | "In this notebook you'll learn how to develop your own kernel with [NKI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html). A kernel is a set of user-defined functions that are executed largely as defined by the user, not by the compiler. With NKI you can write your own functions to define any operations you like, using supported APIs, and execute them on Trainium and Inferentia hardware. You have the control and lower-level access to define the data movement, computational patterns, and physical execution for the mathematics of your algorithms with NKI.\n", 10 | "\n", 11 | "The structure of the notebook is as follows:\n", 12 | "1. Brief introduction to the NeuronCore and the NKI programming model\n", 13 | "2. Your first NKI kernel - tensor addition\n", 14 | "3. Your second NKI kernel - matrix multiplication\n", 15 | "\n", 16 | "4. Wrap up and next steps." 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "54d843c8-b824-4896-ad23-1098dd859872", 22 | "metadata": {}, 23 | "source": [ 24 | "### 1. Introduction to the NeuronCore and NKI programming model\n", 25 | "The NeuronCore is the main acceleration unit within AWS AI chips Trainium and Inferentia. As you can see in the image below, it is composed of 4 compute engines. These engines are based on a systolic array architecture. The compute engines are fed data from the primary on-chip memory cache, SBUF. Data is moved from the HBM banks to SBUF when you call `nl.load`. You'll index into your tensors to create lower-level objects, called `tiles`. A tile is the result of `nl.load`. Once you've defined `tiles`, you can send them to various NKI mathematical APIs such as `add`, `subtract`, `matmul`, etc. The results of these operations are stored on the secondary on-chip memory cache, PSUM. 
After moving the data back to SBUF, you can then send it back to HBM with `nl.store`.\n", 26 | "\n", 27 | "" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "id": "37fc461d-1067-4b96-95a5-3da49e15723f", 33 | "metadata": {}, 34 | "source": [ 35 | "Trainium1 chips feature two NeuronCore-v2 acceleration units, 2 HBM banks, NeuronLink-v2 chip-to-chip interconnect, host PCIe, and dedicated engines for both data movement and collective communications. Trainium1 offers 32 GiB of device memory (sum of all 4 HBM banks), with 840 GiB/sec of bandwidth. Trainium1 instances feature 16 Trainium chips, providing a total of up to 3 petaflops of FP16 compute and 512 GiB of accelerator memory capacity. For more architectural details, see our docs [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium.html#trainium-arch). \n", 36 | "\n", 37 | "\n", 38 | "The on-chip memory cache, SBUF, **has ~20x higher memory bandwidth than HBM**. The purpose of your kernel is to exploit as much of that compute acceleration as you can within the context of your model and workload." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "id": "0fffc189-2875-4a42-a14a-de4ea122d2ef", 44 | "metadata": {}, 45 | "source": [ 46 | "#### Structuring data and tensors for NKI\n", 47 | "\n", 48 | "To easily move data and design our kernels on NKI, we'll want to exploit the 128 partitions built into SBUF as shown in the image below. In particular, SBUF has 128 partition lanes. Each of these lanes can execute programs in parallel on the engines. As much as possible, we'll want to align the tensors and data structures in our algorithms to follow this physical design. The benefit is that our kernels will run faster and be easier to develop!\n", 49 | "\n", 50 | "Your data movement from HBM to SBUF should be very carefully aligned with this 128-lane partition dimension, also called p-dim. Each tile needs a precise definition along the p-dim. Your second dimension is called the free dimension, or f-dim. As the name suggests, this dimension is much more flexible than p-dim. Though it may surprise you, it's better not to fully saturate SBUF with extremely large tiles. This is so that the compiler can overlap data movement and collectives with compute, giving you better overall compute utilization and performance." 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "id": "0c786fce-ac8e-4549-8bf1-edaee7512211", 56 | "metadata": {}, 57 | "source": [ 58 | "" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "id": "03a9fbbe-2fa4-41c0-9cc8-7560fbc7a49f", 64 | "metadata": {}, 65 | "source": [ 66 | "### 2. Your first NKI kernel\n", 67 | "Now that you have some understanding of the compute architecture and motivation for kernels, let's write your first NKI kernel! Importing the `nki` library may take a few moments the first time you've imported it on an instance." 
68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 1, 73 | "id": "2da52760-db72-403a-ade9-d8bebac40de3", 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "import numpy as np\n", 78 | "import neuronxcc.nki as nki\n", 79 | "import neuronxcc.nki.language as nl" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 2, 85 | "id": "b83039ee-1788-478f-809f-f139cb032cce", 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "@nki.jit\n", 90 | "def nki_tensor_add_kernel_(a_input, b_input):\n", 91 | " \n", 92 | " # Create output tensor \n", 93 | " c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)\n", 94 | "\n", 95 | " # Load input data from device memory (HBM) to on-chip memory (SBUF)\n", 96 | " a_tile = nl.load(a_input)\n", 97 | " b_tile = nl.load(b_input)\n", 98 | "\n", 99 | " # compute a + b\n", 100 | " c_tile = a_tile + b_tile\n", 101 | "\n", 102 | " # return the final tensor\n", 103 | " nl.store(c_output, value=c_tile)\n", 104 | "\n", 105 | " # Transfer the ownership of `c_output` to the caller\n", 106 | " return c_output\n" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 3, 112 | "id": "486f0e0a-6af1-4882-afe2-4ce5a1912ddc", 113 | "metadata": {}, 114 | "outputs": [ 115 | { 116 | "name": "stdout", 117 | "output_type": "stream", 118 | "text": [ 119 | "NKI and NumPy match\n" 120 | ] 121 | } 122 | ], 123 | "source": [ 124 | "a = np.random.rand(128, 512).astype(np.float16)\n", 125 | "b = np.random.rand(128, 512).astype(np.float16)\n", 126 | "\n", 127 | "output_nki = nki_tensor_add_kernel_(a, b)\n", 128 | "\n", 129 | "output_np = a + b\n", 130 | "\n", 131 | "allclose = np.allclose(output_np, output_nki, atol=1e-4, rtol=1e-2)\n", 132 | "if allclose:\n", 133 | " print(\"NKI and NumPy match\")\n", 134 | "else:\n", 135 | " print(\"NKI and NumPy differ\")\n" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "id": "35f65891-2d62-4af4-aa5d-7620c707f6bd", 141 | "metadata": {}, 142 | "source": [ 143 | "Now let's see if we can do that for matrix multiplication!" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "id": "e8a65cb0-215d-4590-8335-d53c23eef5c1", 149 | "metadata": {}, 150 | "source": [ 151 | "### 3. Your second NKI kernel\n", 152 | "Now, let's try to use PyTorch arrays and pass them to the device with XLA. Then we'll try a matrix multiplication kernel." 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 4, 158 | "id": "e4e24399-7bae-4db2-b964-b5fdcc93fb32", 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "name": "stderr", 163 | "output_type": "stream", 164 | "text": [ 165 | "WARNING:root:MASTER_ADDR environment variable is not set, defaulting to localhost\n", 166 | "WARNING:root:Found libneuronpjrt.so. 
Setting PJRT_DEVICE=NEURON.\n" 167 | ] 168 | } 169 | ], 170 | "source": [ 171 | "import torch\n", 172 | "from torch_xla.core import xla_model as xm\n", 173 | "\n", 174 | "device = xm.xla_device()\n", 175 | "\n", 176 | "lhs_small = torch.rand((64, 128), dtype=torch.bfloat16, device=device)\n", 177 | "rhs_small = torch.rand((128, 512), dtype=torch.bfloat16, device=device)" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 5, 183 | "id": "0bc1f344-6e02-4f3a-928a-9f1bccabfb12", 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "@nki.jit\n", 188 | "def nki_matmul_basic_(lhsT, rhs):\n", 189 | " \"\"\"NKI kernel to compute a 64x128x512 matrix multiplication operation\n", 190 | "\n", 191 | " Args:\n", 192 | " lhsT: an input tensor of shape [128,64], a left hand side argument of the\n", 193 | " matrix multiplication, delivered transposed for optimal performance\n", 194 | " rhs: an input tensor of shape [128,512], a right hand side argument of the\n", 195 | " matrix multiplication\n", 196 | " Returns:\n", 197 | " result: the resulting output tensor of shape [64,512]\n", 198 | " \"\"\"\n", 199 | " result = nl.ndarray((64, 512), dtype=lhsT.dtype, buffer=nl.shared_hbm)\n", 200 | "\n", 201 | " # Defining indexes for input LHS.T\n", 202 | " # - Note: here we take LayoutConstraint #1 into account:\n", 203 | " # \"For MatMult, contraction axis must be mapped to P-dim\"\n", 204 | " i_lhsT_p, i_lhsT_f = nl.mgrid[0:128, 0:64]\n", 205 | "\n", 206 | " # Defining indexes for input RHS\n", 207 | " # - Note: here we take LayoutConstraint #1 into account:\n", 208 | " # \"For MatMult, contraction axis must be mapped to P-dim\"\n", 209 | " i_rhs_p, i_rhs_f = nl.mgrid[0:128, 0:512]\n", 210 | "\n", 211 | " # Defining indexes for the output ([64,128]@[128,512] -> [64,512])\n", 212 | " i_out_p, i_out_f = nl.mgrid[0:64, 0:512]\n", 213 | "\n", 214 | " # Loading the inputs (HBM->SBUF)\n", 215 | " # Note: here we take Tile dtype definition into account,\n", 216 | " # which forces P-dim as the left most index\n", 217 | " lhs_tile = nl.load(lhsT[i_lhsT_p, i_lhsT_f])\n", 218 | " rhs_tile = nl.load(rhs[i_rhs_p, i_rhs_f])\n", 219 | "\n", 220 | " # Perform the matrix-multiplication\n", 221 | " # Note1: We set transpose_x to True, to indicate that the LHS input is transposed\n", 222 | " # Note2: A NKI matmul instruction always writes to PSUM in float32 data-type\n", 223 | " result_psum = nl.matmul(lhs_tile, rhs_tile, transpose_x=True)\n", 224 | "\n", 225 | " # Copy the result from PSUM back to SBUF, and cast to expected output data-type\n", 226 | " result_sbuf = nl.copy(result_psum, dtype=result.dtype)\n", 227 | "\n", 228 | " # The result of a [64,128] x [128,512] matrix multiplication has a shape of [64, 512].\n", 229 | " # This dictates which indices to use to address the result tile.\n", 230 | " nl.store(result[i_out_p, i_out_f], value=result_sbuf)\n", 231 | "\n", 232 | " return result" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 6, 238 | "id": "d5b2a228-0a08-42fc-9bd4-81dedba0e4d6", 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "name": "stdout", 243 | "output_type": "stream", 244 | "text": [ 245 | "Checking correctness of nki_matmul_basic\n", 246 | "2025-03-17 22:45:04.000657: 512118 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ec2-user/neuroncc_compile_workdir/58a5f9b5-7dd1-4569-b58f-bae92b1f0d13/model.MODULE_6255296715421101974+e30acd3a.hlo_module.pb --output 
/tmp/ec2-user/neuroncc_compile_workdir/58a5f9b5-7dd1-4569-b58f-bae92b1f0d13/model.MODULE_6255296715421101974+e30acd3a.neff --target=trn1 --verbose=35\n", 247 | ".\n", 248 | "Compiler status PASS\n", 249 | "NKI and Torch match\n" 250 | ] 251 | } 252 | ], 253 | "source": [ 254 | "# Run NKI kernel\n", 255 | "output_small = nki_matmul_basic_(lhs_small.T, rhs_small)\n", 256 | "\n", 257 | "# Run torch reference\n", 258 | "output_small_torch = torch.matmul(lhs_small, rhs_small)\n", 259 | "\n", 260 | "# Compare results\n", 261 | "print(\"Checking correctness of nki_matmul_basic\")\n", 262 | "if torch.allclose(output_small_torch, output_small, atol=1e-4, rtol=1e-2):\n", 263 | " print(\"NKI and Torch match\")\n", 264 | "else:\n", 265 | " print(\"NKI and Torch differ\")" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "id": "801236a2-9d4d-4630-a750-dc42bb2e4514", 271 | "metadata": {}, 272 | "source": [ 273 | "### 4. Wrap up and next steps\n", 274 | "The simplicity you see in the `tensor_add` kernel above is possible because the shapes we pass in are very small. We've intentionally selected them to exactly match the shapes of tiles that NKI supports as maximum dimensions, for both the partition and free dimensions.\n", 275 | "\n", 276 | "As you saw above, the partition dimension has a maximum length of 128. This is the most important dimension and shape to embrace in your kernels, because it impacts your ability to load data onto the chip. In order to exploit the parallelism of execution enabled through the 128 lanes on SBUF, you might want to develop into your kernel the ability to extract data in batches of 128 to load onto SBUF. \n", 277 | "\n", 278 | "The second dimension, also known as the free dimension, is more flexible. Once you have clean batches of 128 lanes being loaded onto SBUF, you can build in tiling on the second dimension with much more varied sizes, up to 512. \n", 279 | "\n", 280 | "To learn more about tiling, and to step through the rest of the matrix multiplication tutorial, see our docs on the topic [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/tutorials/matrix_multiplication.html#)." 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "id": "bd810a4b-2365-48a3-ad0f-23f3850ffc71", 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [] 290 | } 291 | ], 292 | "metadata": { 293 | "kernelspec": { 294 | "display_name": "Python 3 (ipykernel)", 295 | "language": "python", 296 | "name": "python3" 297 | }, 298 | "language_info": { 299 | "codemirror_mode": { 300 | "name": "ipython", 301 | "version": 3 302 | }, 303 | "file_extension": ".py", 304 | "mimetype": "text/x-python", 305 | "name": "python", 306 | "nbconvert_exporter": "python", 307 | "pygments_lexer": "ipython3", 308 | "version": "3.9.16" 309 | } 310 | }, 311 | "nbformat": 4, 312 | "nbformat_minor": 5 313 | } 314 | -------------------------------------------------------------------------------- /labs/generation_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "_from_model_config": true, 3 | "bos_token_id": 128000, 4 | "eos_token_id": 128001, 5 | "transformers_version": "4.45.0.dev0", 6 | "do_sample": true, 7 | "temperature": 0.6, 8 | "top_p": 0.9 9 | } --------------------------------------------------------------------------------