├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE ├── README.md ├── contributed └── models │ ├── README.md │ └── qwen2 │ ├── modeling_qwen2.py │ └── qwen-2-test.ipynb ├── doc └── README.md └── labs ├── FineTuning └── HuggingFaceExample │ ├── 01_finetuning │ ├── Finetune-TinyLlama-1.1B.ipynb │ └── assets │ │ ├── consolidate_adapter_shards_and_merge_model.py │ │ ├── finetune_llama.py │ │ └── requirements.txt │ ├── 02_inference │ └── Inference-TinyLlama-1.1B.ipynb │ └── Local example.ipynb ├── Lab_Four_NKI_Profiling.ipynb ├── Lab_One_NxDI.ipynb ├── Lab_Three_NKI_Custom_Operators.ipynb ├── Lab_Two_NKI.ipynb ├── generation_config.json └── vLLM ├── Benchmarks.ipynb └── Servers.ipynb /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 
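As a rough illustration of steps 1-5 above (a minimal sketch — the fork URL, branch name, and commit message below are placeholders, and the GitHub CLI step is optional):

```bash
# 1. Fork the repository on GitHub, then clone your fork (placeholder URL).
git clone https://github.com/<your-username>/build-on-trainium-workshop.git
cd build-on-trainium-workshop

# 2. Make your focused change on a topic branch.
git checkout -b my-fix

# 3. Run any local tests relevant to the code you touched.

# 4. Commit with a clear message and push the branch to your fork.
git add <changed-files>
git commit -m "Describe the specific change"
git push origin my-fix

# 5. Open a pull request against the upstream main branch from the GitHub UI,
#    or with the GitHub CLI if you have it installed:
gh pr create --base main --fill
```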
38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 
40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Neuron Workshops 2 | 3 | In this workshop you will learn how to develop support for a new model with [NeuronX Distributed Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-overview.html#nxdi-overview), using Llama 3.2 1B as the working example. You will also learn how to write your own kernel to directly program the accelerator hardware with the [Neuron Kernel Interface](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html). These tools will help you design your research proposals and experiments on Trainium. 4 | 5 | The workshop also includes an end-to-end example of using Hugging Face Optimum Neuron to fine-tune and host a small language model with Amazon SageMaker. 6 | 7 | 8 | ### What are AWS Trainium and Neuron? 9 | AWS Trainium is an AI chip developed by AWS to accelerate building and deploying machine learning models. Built on a specialized architecture designed for deep learning, Trainium accelerates the training and inference of complex models with high performance and scalability, making it ideal for academic researchers looking to optimize performance and costs. This architecture also emphasizes sustainability through energy-efficient design, reducing environmental impact. Amazon has established a dedicated Trainium research cluster featuring up to 40,000 Trainium chips, accessible via Amazon EC2 Trn1 instances.
These instances are connected through a non-blocking, petabit-scale network using Amazon EC2 UltraClusters, enabling seamless high-performance ML training. The Trn1 instance family is optimized to deliver substantial compute power for cutting-edge AI research and development. This unique offering not only enhances the efficiency and affordability of model training but also presents academic researchers with opportunities to publish new papers on underrepresented compute architectures, thus advancing the field. 10 | 11 | Learn more about Trainium [here](https://aws.amazon.com/ai/machine-learning/trainium/). 12 | 13 | ### Your workshop 14 | This hands-on workshop is designed for developers, data scientists, and machine learning engineers who are getting started on their journey with the Neuron SDK. 15 | 16 | The workshop has multiple available modules: 17 | 1. Setup instructions 18 | 2. Run inference with Llama and NeuronX Distributed Inference (NxD) 19 | 3. Write your own kernel with the Neuron Kernel Interface (NKI) 20 | 4. Fine-tune and host an existing, supported model on a different dataset using SageMaker. 21 | 22 | #### Instructor-led workshop 23 | If you are participating in an instructor-led workshop, follow the guidance provided by your instructor for accessing the environment. 24 | 25 | #### Self-managed workshop 26 | If you are following the workshop steps in your own environment, you will need to take the following actions: 27 | 1. Launch a trn1.2xlarge instance on Amazon EC2, using the latest [DLAMI with Neuron packages preinstalled](https://repost.aws/articles/ARTxLi0wndTwquyl7frQYuKg) 28 | 2. Use a Python virtual environment preinstalled in that DLAMI, commonly located in `/opt/aws_`. 29 | 3. Set up and manage your own development environment on that instance, such as by using VSCode or a Jupyter Lab server. 30 | 31 | ### Background knowledge 32 | This workshop introduces developing on AWS Trainium for the academic AI research audience and technical innovators. As such, it's expected that the audience will already have a firm understanding of machine learning fundamentals. 33 | 34 | ### Workshop costs 35 | If you are participating in an instructor-led workshop hosted in an AWS-managed Workshop Studio environment, you will not incur any costs for using this environment. If you are following this workshop in your own environment, then you will incur the costs associated with provisioning an Amazon EC2 instance. Please see the service pricing details [here](https://aws.amazon.com/ec2/pricing/on-demand/). 36 | 37 | At the time of writing, this workshop uses a trn1.2xlarge instance with an on-demand hourly rate in supported US regions of $1.34 per hour. The fine-tuning workshop requires less than an hour of ml.trn1.2xlarge at $1.54 per hour, and an ml.inf2.xlarge at $0.99 per hour. Please ensure you delete the resources when you are finished. 38 | 39 | ## FAQs and known issues 40 | 1. Workshop instructions are available [here](https://catalog.us-east-1.prod.workshops.aws/workshops/bf9d80a3-5e4b-4648-bca8-1d887bb2a9ca/en-US). 41 | 2. If you use the `NousResearch` Llama 3.2 1B, please note you'll need to remove a trailing comma in the model config file. You can do this with vim or VSCode. If you do not take this step, you'll get an invalid-JSON error when the model config is read in Lab 1.
If editing the file through the terminal is a little challenging, you can also download the config file from this repository with the following command: 42 | `!wget https://raw.githubusercontent.com/aws-neuron/build-on-trainium-workshop/main/labs/generation_config.json -P /home/ec2-user/environment/models/llama/` 43 | 3. Jupyter kernels can hold on to the NeuronCores as a Python process even after your cell has completed. This can then cause issues when you try to run a new notebook, and sometimes when you try to run another cell. If you encounter a `NeuronCore not found` or similar error message, please just restart your Jupyter kernel and/or shut down kernels from previous sessions. You can also restart the instance through the EC2 console. Once your node is back online, you can always check the availability of the NeuronCores with `neuron-ls`. 44 | 4. Want to see how to integrate NKI with NxD? Check out our `nki-llama` [here](https://github.com/aws-samples/nki-llama). 45 | 46 | 47 | ## Security 48 | 49 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 50 | 51 | ## License 52 | 53 | This project is licensed under the Apache-2.0 License. 54 | 55 | -------------------------------------------------------------------------------- /contributed/models/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /contributed/models/qwen2/qwen-2-test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "libneuronxla 2.2.3493.0+78c3e78c\n", 13 | "neuronx-cc 2.18.121.0+9e31e41a\n", 14 | "neuronx-distributed 0.12.12111+cdd84048\n", 15 | "neuronx-distributed-inference 0.3.5591+f50feae2\n", 16 | "torch-neuronx 2.6.0.2.7.5413+113e6810\n" 17 | ] 18 | } 19 | ], 20 | "source": [ 21 | "!pip list | grep neuron" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import torch\n", 31 | "from transformers import AutoTokenizer, GenerationConfig\n", 32 | "from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig\n", 33 | "from neuronx_distributed_inference.utils.hf_adapter import HuggingFaceGenerationAdapter, load_pretrained_config" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 2, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "model_path = \"/home/ubuntu/model_hf_qwen/qwen2/\"\n", 43 | "traced_model_path = \"/home/ubuntu/traced_model_qwen/qwen2\"" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "from huggingface_hub import snapshot_download\n", 53 | "\n", 54 | "snapshot_download(\"Qwen/QwQ-32B\", local_dir=model_path)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "from modeling_qwen_v2 import Qwen2InferenceConfig, NeuronQwen2ForCausalLM\n", 64 | "\n", 65 | "def run_qwen2_compile():\n", 66 | "    # Initialize configs and tokenizer.\n", 67 | "    tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side=\"right\")\n", 68 | "    tokenizer.pad_token = tokenizer.eos_token\n", 69 | "\n", 70 | "    
generation_config = GenerationConfig.from_pretrained(model_path)\n", 71 | " generation_config_kwargs = {\n", 72 | " \"do_sample\": False,\n", 73 | " \"top_k\": 1,\n", 74 | " \"pad_token_id\": tokenizer.pad_token_id,\n", 75 | " }\n", 76 | " generation_config.update(**generation_config_kwargs)\n", 77 | " \n", 78 | " neuron_config = NeuronConfig(\n", 79 | " tp_degree=8,\n", 80 | " batch_size=1,\n", 81 | " max_context_length=128,\n", 82 | " seq_len=256,\n", 83 | " enable_bucketing=True,\n", 84 | " context_encoding_buckets=[128],\n", 85 | " token_generation_buckets=[256],\n", 86 | " flash_decoding_enabled=False,\n", 87 | " torch_dtype=torch.bfloat16,\n", 88 | " fused_qkv=False,\n", 89 | " attn_kernel_enabled=True,\n", 90 | " attn_cls=\"NeuronQwen2Attention\"\n", 91 | " )\n", 92 | " config = Qwen2InferenceConfig(\n", 93 | " neuron_config,\n", 94 | " load_config=load_pretrained_config(model_path),\n", 95 | " )\n", 96 | " \n", 97 | " # Compile and save model.\n", 98 | " print(\"\\nCompiling and saving model...\")\n", 99 | " model = NeuronQwen2ForCausalLM(model_path, config)\n", 100 | " model.compile(traced_model_path)\n", 101 | " tokenizer.save_pretrained(traced_model_path)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "run_qwen2_compile()" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "from modeling_qwen_v2 import Qwen2InferenceConfig, NeuronQwen2ForCausalLM\n", 120 | "\n", 121 | "model = NeuronQwen2ForCausalLM(traced_model_path)\n", 122 | "model.load(traced_model_path)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "config = model.get_config_cls()\n", 132 | "config.get_neuron_config_cls()" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 9, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "40" 144 | ] 145 | }, 146 | "execution_count": 9, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "model.config.num_attention_heads" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 10, 158 | "metadata": {}, 159 | "outputs": [ 160 | { 161 | "data": { 162 | "text/plain": [ 163 | "8" 164 | ] 165 | }, 166 | "execution_count": 10, 167 | "metadata": {}, 168 | "output_type": "execute_result" 169 | } 170 | ], 171 | "source": [ 172 | "model.config.num_key_value_heads" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 11, 178 | "metadata": {}, 179 | "outputs": [ 180 | { 181 | "data": { 182 | "text/plain": [ 183 | "5120" 184 | ] 185 | }, 186 | "execution_count": 11, 187 | "metadata": {}, 188 | "output_type": "execute_result" 189 | } 190 | ], 191 | "source": [ 192 | "model.config.hidden_size" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 12, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "name": "stderr", 202 | "output_type": "stream", 203 | "text": [ 204 | "Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.\n" 205 | ] 206 | }, 207 | { 208 | "data": { 209 | "text/plain": [ 210 | "\"Okay, the user wants a short introduction to large language models. Let me start by defining what a large language model is. 
I should mention that they are AI systems trained on vast amounts of text data. Maybe include that they use deep learning, specifically transformer architectures.\\n\\nI need to highlight their capabilities, like generating text, understanding context, and performing various tasks such as answering questions, writing stories, or coding. It's important to note their scale—large parameter counts and extensive training data. \\n\\nAlso, touch on their applications: customer service, content creation, research, etc. Maybe mention some examples like GPT, BERT, or\"" 211 | ] 212 | }, 213 | "execution_count": 12, 214 | "metadata": {}, 215 | "output_type": "execute_result" 216 | } 217 | ], 218 | "source": [ 219 | "tokenizer = AutoTokenizer.from_pretrained(traced_model_path)\n", 220 | "tokenizer.pad_token = tokenizer.eos_token\n", 221 | "generation_config = GenerationConfig.from_pretrained(model_path)\n", 222 | "generation_config_kwargs = {\n", 223 | " \"do_sample\": True,\n", 224 | " \"temperature\": 0.9,\n", 225 | " \"top_k\": 5,\n", 226 | " \"pad_token_id\": tokenizer.pad_token_id,\n", 227 | "}\n", 228 | "\n", 229 | "prompt = \"Give me a short introduction to large language model.\"\n", 230 | "messages = [\n", 231 | " {\"role\": \"system\", \"content\": \"You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\"},\n", 232 | " {\"role\": \"user\", \"content\": prompt}\n", 233 | "]\n", 234 | "text = tokenizer.apply_chat_template(\n", 235 | " messages,\n", 236 | " tokenize=False,\n", 237 | " add_generation_prompt=True\n", 238 | ")\n", 239 | "model_inputs = tokenizer([text], return_tensors=\"pt\")\n", 240 | "generation_model = HuggingFaceGenerationAdapter(model)\n", 241 | "generated_ids = generation_model.generate(\n", 242 | " **model_inputs,\n", 243 | " max_new_tokens=128\n", 244 | ")\n", 245 | "generated_ids = [\n", 246 | " output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)\n", 247 | "]\n", 248 | "\n", 249 | "response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n", 250 | "response" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 13, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "model.reset()" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "# Run Benchmarks" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 1, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "model_path = \"/home/ubuntu/model_hf_qwen/qwen2\"\n", 276 | "traced_model_path = \"/home/ubuntu/traced_model_qwen/qwen2/logit\"" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "dir = '/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/'\n", 286 | "!cp modeling_qwen2.py {dir}" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "# Edit the inference_demo.py file to include the following:\n", 294 | "\n", 295 | "```python\n", 296 | "from .modeling_qwen2 import NeuronQwen2ForCausalLM\n", 297 | "\n", 298 | "MODEL_TYPES = {\n", 299 | " \"llama\": {\"causal-lm\": NeuronLlamaForCausalLM},\n", 300 | " \"mixtral\": {\"causal-lm\": NeuronMixtralForCausalLM},\n", 301 | " \"dbrx\": {\"causal-lm\": NeuronDbrxForCausalLM},\n", 302 | " 'qwen2': {\"causal-lm\": NeuronQwen2ForCausalLM}\n", 303 | "}\n", 304 | "```" 305 | ] 
306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 8, 310 | "metadata": {}, 311 | "outputs": [ 312 | { 313 | "name": "stdout", 314 | "output_type": "stream", 315 | "text": [ 316 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 317 | " from neuronx_distributed.modules.moe.blockwise import (\n", 318 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 319 | " from neuronx_distributed.modules.moe.blockwise import (\n", 320 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/modules/moe/expert_mlps.py:11: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 321 | " from neuronx_distributed.modules.moe.blockwise import (\n", 322 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/attention/utils.py:14: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 323 | " from neuronx_distributed_inference.modules.custom_calls import neuron_cumsum\n", 324 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:745: UserWarning: Set seed for `privateuseone` device does not take effect, please add API's `_is_in_bad_fork` and `manual_seed_all` to `privateuseone` device module.\n", 325 | " return fn(*args, **kwargs)\n", 326 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 327 | " from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV\n", 328 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 329 | " from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV\n", 330 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/modules/lora_serving/lora_model.py:12: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 331 | " from neuronx_distributed_inference.modules.attention.gqa import GQA, GroupQueryAttention_QKV\n", 332 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/dbrx/modeling_dbrx.py:38: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 333 | " from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase\n", 334 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/dbrx/modeling_dbrx.py:38: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 335 | " from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase\n", 336 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/inference_demo.py:25: DeprecationWarning: torch_neuronx.nki_jit 
is deprecated, use nki.jit instead.\n", 337 | " from neuronx_distributed_inference.models.dbrx.modeling_dbrx import NeuronDbrxForCausalLM\n", 338 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/inference_demo.py:27: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 339 | " from neuronx_distributed_inference.models.mixtral.modeling_mixtral import NeuronMixtralForCausalLM\n", 340 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/mllama/modeling_mllama.py:72: DeprecationWarning: torch_neuronx.nki_jit is deprecated, use nki.jit instead.\n", 341 | " from .modeling_mllama_vision import NeuronMllamaVisionModel # noqa: E402\n", 342 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/utils/accuracy.py:29: UserWarning: Intel extension for pytorch not found. For faster CPU references install `intel-extension-for-pytorch`.\n", 343 | " warnings.warn(\n", 344 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:745: UserWarning: Set seed for `privateuseone` device does not take effect, please add API's `_is_in_bad_fork` and `manual_seed_all` to `privateuseone` device module.\n", 345 | " return fn(*args, **kwargs)\n", 346 | "Loading configs...\n", 347 | "WARNING:root:NeuronConfig init: Unexpected keyword arguments: {'model_type': 'qwen2', 'task_type': 'causal-lm', 'model_path': '/home/ubuntu/model_hf_qwen/qwen2', 'compiled_model_path': '/home/ubuntu/traced_model_qwen/qwen2/logit', 'benchmark': True, 'check_accuracy_mode': , 'divergence_difference_tol': 0.001, 'prompts': ['To be, or not to be'], 'top_k': 1, 'top_p': 1.0, 'temperature': 1.0, 'do_sample': False, 'dynamic': False, 'pad_token_id': 151645, 'on_device_sampling': False, 'enable_torch_dist': False, 'enable_lora': False, 'max_loras': 1, 'max_lora_rank': 16, 'skip_warmup': False, 'skip_compile': False, 'compile_only': False, 'compile_dry_run': False, 'hlo_debug': False}\n", 348 | "\n", 349 | "Compiling and saving model...\n", 350 | "INFO:Neuron:Generating HLOs for the following models: ['context_encoding_model', 'token_generation_model']\n", 351 | "[2025-06-02 13:35:56.009: I neuronx_distributed/parallel_layers/parallel_state.py:592] > initializing tensor model parallel with size 8\n", 352 | "[2025-06-02 13:35:56.009: I neuronx_distributed/parallel_layers/parallel_state.py:593] > initializing pipeline model parallel with size 1\n", 353 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:594] > initializing context model parallel with size 1\n", 354 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:595] > initializing data parallel with size 1\n", 355 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:596] > initializing world size to 8\n", 356 | "[2025-06-02 13:35:56.010: I neuronx_distributed/parallel_layers/parallel_state.py:343] [rank_0_pp-1_tp-1_dp-1_cp-1] Chosen Logic for replica groups ret_logic=, 'Ascending Ring PG Group')>\n", 357 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:632] [rank_0_pp-1_tp-1_dp-1_cp-1] tp_groups: replica_groups.tp_groups=[[0, 1, 2, 3, 4, 5, 6, 7]]\n", 358 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:633] [rank_0_pp-1_tp-1_dp-1_cp-1] dp_groups: 
replica_groups.dp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 359 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:634] [rank_0_pp-1_tp-1_dp-1_cp-1] pp_groups: replica_groups.pp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 360 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:635] [rank_0_pp-1_tp-1_dp-1_cp-1] cp_groups: replica_groups.cp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 361 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:636] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_model_groups: replica_groups.ep_model_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 362 | "[2025-06-02 13:35:56.011: I neuronx_distributed/parallel_layers/parallel_state.py:637] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_data_groups: replica_groups.ep_data_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 363 | "INFO:Neuron:Generating 1 hlos for key: context_encoding_model\n", 364 | "INFO:Neuron:Started loading module context_encoding_model\n", 365 | "INFO:Neuron:Finished loading module context_encoding_model in 0.3605782985687256 seconds\n", 366 | "INFO:Neuron:generating HLO: context_encoding_model, input example shape = torch.Size([1, 16])\n", 367 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/parallel_layers/layers.py:478: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n", 368 | " with torch.cuda.amp.autocast(enabled=False):\n", 369 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=1, shape=torch.Size([1, 16]), dtype=torch.int32)\n", 370 | " warnings.warn(\n", 371 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=3, shape=torch.Size([1]), dtype=torch.int32)\n", 372 | " warnings.warn(\n", 373 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=4, shape=torch.Size([1, 3]), dtype=torch.float32)\n", 374 | " warnings.warn(\n", 375 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. (index=5, shape=torch.Size([1]), dtype=torch.int32)\n", 376 | " warnings.warn(\n", 377 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/hlo_conversion.py:210: UserWarning: Received an input tensor that was unused. Tensor will be ignored. 
(index=6, shape=torch.Size([1]), dtype=torch.int32)\n", 378 | " warnings.warn(\n", 379 | "INFO:Neuron:Finished generating HLO for context_encoding_model in 8.811824083328247 seconds, input example shape = torch.Size([1, 16])\n", 380 | "INFO:Neuron:Generating 1 hlos for key: token_generation_model\n", 381 | "INFO:Neuron:Started loading module token_generation_model\n", 382 | "INFO:Neuron:Finished loading module token_generation_model in 0.13971686363220215 seconds\n", 383 | "INFO:Neuron:generating HLO: token_generation_model, input example shape = torch.Size([1, 1])\n", 384 | "INFO:Neuron:Finished generating HLO for token_generation_model in 9.776893615722656 seconds, input example shape = torch.Size([1, 1])\n", 385 | "INFO:Neuron:Generated all HLOs in 19.276326656341553 seconds\n", 386 | "INFO:Neuron:Starting compilation for the priority HLO\n", 387 | "INFO:Neuron:'token_generation_model' is the priority model with bucket rank 0\n", 388 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py:283: SyntaxWarning: str format compiler_flags is discouraged as its handling involves repeated joining and splitting, which can easily make mistakes if something is quoted or escaped. Use list[str] instead. Refer to documentation of the Python subprocess module for details.\n", 389 | " warnings.warn(SyntaxWarning(\n", 390 | "2025-06-02 13:36:15.000516: 7289 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_9b906898286ddf239aa0+91ef39e9.hlo_module.pb --output /tmp/nxd_model/token_generation_model/_tp0_bk0/model.MODULE_9b906898286ddf239aa0+91ef39e9.neff --target=trn1 --auto-cast=none --model-type=transformer --tensorizer-options=--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2 --vectorize-strided-dma --lnc=1 --logfile=/tmp/nxd_model/token_generation_model/_tp0_bk0/log-neuron-cc.txt --enable-internal-neff-wrapper --verbose=35\n", 391 | ".........Completed run_backend_driver.\n", 392 | "\n", 393 | "Compiler status PASS\n", 394 | "INFO:Neuron:Done compilation for the priority HLO in 169.35613083839417 seconds\n", 395 | "INFO:Neuron:Updating the hlo module with optimized layout\n", 396 | "INFO:Neuron:Done optimizing weight layout for all HLOs in 0.3216278553009033 seconds\n", 397 | "INFO:Neuron:Starting compilation for all HLOs\n", 398 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py:245: SyntaxWarning: str format compiler_flags is discouraged as its handling involves repeated joining and splitting, which can easily make mistakes if something is quoted or escaped. Use list[str] instead. 
Refer to documentation of the Python subprocess module for details.\n", 399 | " warnings.warn(SyntaxWarning(\n", 400 | "2025-06-02 13:39:05.000174: 7289 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/nxd_model/context_encoding_model/_tp0_bk0/model.MODULE_d4332219e6ee5f826cce+d43b5474.hlo_module.pb --output /tmp/nxd_model/context_encoding_model/_tp0_bk0/model.MODULE_d4332219e6ee5f826cce+d43b5474.neff --target=trn1 --auto-cast=none --model-type=transformer --tensorizer-options=--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2 --vectorize-strided-dma --lnc=1 -O1 --internal-hlo2tensorizer-options= --modular-flow-mac-threshold=10 --logfile=/tmp/nxd_model/context_encoding_model/_tp0_bk0/log-neuron-cc.txt --verbose=35\n", 401 | ".Completed run_backend_driver.\n", 402 | "\n", 403 | "Compiler status PASS\n", 404 | "INFO:Neuron:Finished Compilation for all HLOs in 9.435595512390137 seconds\n", 405 | "......Completed run_backend_driver.\n", 406 | "\n", 407 | "Compiler status PASS\n", 408 | "INFO:Neuron:Done preparing weight layout transformation\n", 409 | "INFO:Neuron:Finished building model in 307.08067560195923 seconds\n", 410 | "INFO:Neuron:SKIPPING pre-sharding the checkpoints. The checkpoints will be sharded during load time.\n", 411 | "Compiling and tracing time: 307.11146965399985 seconds\n", 412 | "\n", 413 | "Loading model to Neuron...\n", 414 | "INFO:Neuron:Sharding weights on load...\n", 415 | "INFO:Neuron:Sharding Weights for ranks: 0...7\n", 416 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:592] > initializing tensor model parallel with size 8\n", 417 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:593] > initializing pipeline model parallel with size 1\n", 418 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:594] > initializing context model parallel with size 1\n", 419 | "[2025-06-02 13:41:03.157: I neuronx_distributed/parallel_layers/parallel_state.py:595] > initializing data parallel with size 1\n", 420 | "[2025-06-02 13:41:03.158: I neuronx_distributed/parallel_layers/parallel_state.py:596] > initializing world size to 8\n", 421 | "[2025-06-02 13:41:03.158: I neuronx_distributed/parallel_layers/parallel_state.py:343] [rank_0_pp-1_tp-1_dp-1_cp-1] Chosen Logic for replica groups ret_logic=, 'Ascending Ring PG Group')>\n", 422 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:632] [rank_0_pp-1_tp-1_dp-1_cp-1] tp_groups: replica_groups.tp_groups=[[0, 1, 2, 3, 4, 5, 6, 7]]\n", 423 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:633] [rank_0_pp-1_tp-1_dp-1_cp-1] dp_groups: replica_groups.dp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 424 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:634] [rank_0_pp-1_tp-1_dp-1_cp-1] pp_groups: replica_groups.pp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 425 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:635] [rank_0_pp-1_tp-1_dp-1_cp-1] cp_groups: replica_groups.cp_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 426 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:636] [rank_0_pp-1_tp-1_dp-1_cp-1] ep_model_groups: replica_groups.ep_model_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 427 | "[2025-06-02 13:41:03.159: I neuronx_distributed/parallel_layers/parallel_state.py:637] 
[rank_0_pp-1_tp-1_dp-1_cp-1] ep_data_groups: replica_groups.ep_data_groups=[[0], [1], [2], [3], [4], [5], [6], [7]]\n", 428 | "INFO:Neuron:Done Sharding weights in 3.519328597999902\n", 429 | "INFO:Neuron:Finished weights loading in 16.628388952000023 seconds\n", 430 | "INFO:Neuron:Warming up the model.\n", 431 | "2025-Jun-02 13:41:22.0009 7289:8468 [7] nccl_net_ofi_create_plugin:211 CCOM WARN NET/OFI Failed to initialize sendrecv protocol\n", 432 | "2025-Jun-02 13:41:22.0010 7289:8468 [7] nccl_net_ofi_create_plugin:334 CCOM WARN NET/OFI aws-ofi-nccl initialization failed\n", 433 | "2025-Jun-02 13:41:22.0011 7289:8468 [7] nccl_net_ofi_init:155 CCOM WARN NET/OFI Initializing plugin failed\n", 434 | "2025-Jun-02 13:41:22.0012 7289:8468 [7] net_plugin.cc:94 CCOM WARN OFI plugin initNet() failed is EFA enabled?\n", 435 | "INFO:Neuron:Warmup completed in 0.33977651596069336 seconds.\n", 436 | "Total model loading time: 19.222302051000042 seconds\n", 437 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:650: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `1` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`.\n", 438 | " warnings.warn(\n", 439 | "\n", 440 | "Checking accuracy by logit matching\n", 441 | "/opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/utils/accuracy.py:363: UserWarning: input_len + num_tokens_to_check exceeds max_context_length. If output divergences at an index greater than max_context_length, a ValueError will occur because the next input len exceeds max_context_length. To avoid this, set num_tokens_to_check to a value of max_context_length - input_len or less.\n", 442 | " warnings.warn(\n", 443 | "Loading checkpoint shards: 100%|████████████████| 14/14 [00:08<00:00, 1.58it/s]\n", 444 | "From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.\n", 445 | "Expected Output: [\", that is the question. Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune\"] tensor([[ 11, 429, 374, 279, 3405, 13, 13139, 364, 83, 285,\n", 446 | " 13049, 1536, 304, 279, 3971, 311, 7676, 279, 1739, 819,\n", 447 | " 323, 36957, 315, 54488, 32315]])\n", 448 | "Expected Logits Shape: torch.Size([25, 1, 152064])\n", 449 | "Actual Output: [\", that is the question. Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune\"] tensor([[ 11, 429, 374, 279, 3405, 13, 13139, 364, 83, 285,\n", 450 | " 13049, 1536, 304, 279, 3971, 311, 7676, 279, 1739, 819,\n", 451 | " 323, 36957, 315, 54488, 32315]])\n", 452 | "Actual Logits Shape: torch.Size([25, 1, 152064])\n", 453 | "Passed logits validation!\n", 454 | "\n", 455 | "Generating outputs...\n", 456 | "Prompts: ['To be, or not to be']\n", 457 | "Generated outputs:\n", 458 | "Output 0: To be, or not to be, that is the question. 
Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune\n", 459 | "Starting end-to-end benchmark with 20\n", 460 | "Benchmark completed and its result is as following\n", 461 | "{\n", 462 | " \"e2e_model\": {\n", 463 | " \"latency_ms_p50\": 569.0377950668335,\n", 464 | " \"latency_ms_p90\": 570.0641632080078,\n", 465 | " \"latency_ms_p95\": 570.2431917190552,\n", 466 | " \"latency_ms_p99\": 570.8965921401978,\n", 467 | " \"latency_ms_p100\": 571.0599422454834,\n", 468 | " \"latency_ms_avg\": 569.459593296051,\n", 469 | " \"throughput\": 56.19362703995017\n", 470 | " },\n", 471 | " \"context_encoding_model\": {\n", 472 | " \"latency_ms_p50\": 41.747450828552246,\n", 473 | " \"latency_ms_p90\": 42.02606678009033,\n", 474 | " \"latency_ms_p95\": 42.056477069854736,\n", 475 | " \"latency_ms_p99\": 42.05883264541626,\n", 476 | " \"latency_ms_p100\": 42.05942153930664,\n", 477 | " \"latency_ms_avg\": 41.80266857147217,\n", 478 | " \"throughput\": 382.75068426897144\n", 479 | " },\n", 480 | " \"token_generation_model\": {\n", 481 | " \"latency_ms_p50\": 33.631086349487305,\n", 482 | " \"latency_ms_p90\": 33.74745845794678,\n", 483 | " \"latency_ms_p95\": 33.88720750808716,\n", 484 | " \"latency_ms_p99\": 34.08886194229126,\n", 485 | " \"latency_ms_p100\": 34.223079681396484,\n", 486 | " \"latency_ms_avg\": 33.66035064061483,\n", 487 | " \"throughput\": 31.68911334451813\n", 488 | " }\n", 489 | "}\n", 490 | "Completed saving result to benchmark_report.json\n" 491 | ] 492 | } 493 | ], 494 | "source": [ 495 | "!inference_demo \\\n", 496 | " --model-type qwen2 \\\n", 497 | " --task-type causal-lm \\\n", 498 | " run \\\n", 499 | " --model-path /home/ubuntu/model_hf_qwen/qwen2 \\\n", 500 | " --compiled-model-path /home/ubuntu/traced_model_qwen/qwen2/logit \\\n", 501 | " --torch-dtype bfloat16 \\\n", 502 | " --tp-degree 8 \\\n", 503 | " --batch-size 1 \\\n", 504 | " --max-context-length 16 \\\n", 505 | " --seq-len 32 \\\n", 506 | " --top-k 1 \\\n", 507 | " --pad-token-id 151645 \\\n", 508 | " --prompt \"To be, or not to be\" \\\n", 509 | " --check-accuracy-mode logit-matching \\\n", 510 | " --benchmark" 511 | ] 512 | } 513 | ], 514 | "metadata": { 515 | "kernelspec": { 516 | "display_name": "aws_neuronx_venv_pytorch_2_6_nxd_inference", 517 | "language": "python", 518 | "name": "python3" 519 | }, 520 | "language_info": { 521 | "codemirror_mode": { 522 | "name": "ipython", 523 | "version": 3 524 | }, 525 | "file_extension": ".py", 526 | "mimetype": "text/x-python", 527 | "name": "python", 528 | "nbconvert_exporter": "python", 529 | "pygments_lexer": "ipython3", 530 | "version": "3.10.12" 531 | } 532 | }, 533 | "nbformat": 4, 534 | "nbformat_minor": 2 535 | } 536 | -------------------------------------------------------------------------------- /doc/README.md: -------------------------------------------------------------------------------- 1 | # Build On Trainium Resources 2 | **Purpose:** 3 | 4 | Collection of resources (documentation, examples, tutorials and workshops) to help onboard new students and researchers. This set of resources will need to updated and maintained as new resources become available. 5 | 6 | # Resources 7 | 8 | This section contains links to various documentation sources and is a helpful index when working on Neuron. It is organized into several sections based on workload and relevance. 
9 | 10 | ## Getting Started with Neuron 11 | 12 | |Title |Description |Link | 13 |--- |--- |--- | 14 | |Getting Started with AWS |Getting started resource for AWS, generally, including AWS environment provisioning, budget alarms, CLI, instance setup and best practices for working in an AWS environment |[BoT Getting Started on AWS](https://github.com/scttfrdmn/aws-101-for-tranium) | 15 | |Neuron Documentation |The official Neuron product documentation. This contains details on our software libraries and hardware. |[Neuron Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html) | 16 | |Inf2 Instance Details |Helpful overview links for the Inferentia2 Instance and associated accelerators |
  • [AWS Landing Page](https://aws.amazon.com/ai/machine-learning/inferentia/)
  • [Instance Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inf2-arch.html#aws-inf2-arch)
  • [Chip Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inferentia2.html#inferentia2-arch)
  • [Core Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html#neuroncores-v2-arch)
 | 17 | |Trn1 Instance Details |Similar overview links for Trn1 instances and accelerators |
  • [AWS Landing Page](https://aws.amazon.com/ai/machine-learning/trainium/)
  • [Instance Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trn1-arch.html#aws-trn1-arch)
  • [Chip Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium.html#trainium-arch)
  • [Core Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html#neuroncores-v2-arch)
 | 18 | |Trn2 Instance Details |Similar overview links for Trn2 instances and accelerators |
  • [YouTube Launch Video](https://www.youtube.com/watch?v=Bteba8KLeGc)
  • [Instance Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trn2-arch.html#aws-trn2-arch)
  • [Chip Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium2.html#trainium2-arch)
  • [Core Details](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v3.html#neuroncores-v3-arch)
 | 19 | |Software Overview - General |Overview video of the Trainium software stack |[Video](https://www.youtube.com/watch?v=vaqj8XQfqwM&t=806s) | 20 | |Software Overview - Framework |Application frameworks for developing on Neuron: Torch-NeuronX for small-model inference and training, NxD for distributed modeling primitives, NxD-I (a higher-level abstraction library for inference), and NxD-T (a corresponding abstraction for training). |
  • Torch-NeuronX ([Training](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide.html#pytorch-neuronx-programming-guide), [Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/inference/trace-vs-xla-lazytensor.html))
  • [NxD](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/developer-guide.html)
  • [NxD-T](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/overview.html#nxd-training-overview)
  • [NxD-I](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-overview.html#nxdi-overview)
| 21 | |Software Overview - ML Libraries |ML libraries which offer another interface for deploying to trn/inf. Optimum-Neuron provides an interface between transformers and AWS Accelerators. AXLearn is a training library built on top of JAX and XLA. |[Optimum Neuron](https://huggingface.co/docs/optimum-neuron/index) [AXLearn](https://github.com/apple/axlearn) | 22 | |Environment Setup |A set of resources on provisioning instances and setting up development environments with the appropriate Neuron Software. |
  • [Instance Guide](https://repost.aws/articles/ARTxLi0wndTwquyl7frQYuKg)
  • [Remote Development Guide](https://repost.aws/articles/ARmgDHboGkRKmaEyfBzyVP4w)
  • [AMIs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html)
  • [Containers](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/index.html)
  • [Manual Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.html#setup-torch-neuronx-ubuntu22)
| 23 | |Release Versions |Index of the latest release versions and their semantic version information. |
  • [Latest Release Version](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#latest-neuron-release)
  • [Component Package Versions](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/releasecontent.html#latest-neuron-release-artifacts)
| 24 | 25 | ## Training Resources 26 | 27 | |Title |Description |Link | 28 | |--- |--- |--- | 29 | |Torch-NeuronX Docs |Torch-NeuronX docs on the XLA flow, and constructing a simple training loop on Trainium/Inferentia. |[Torch-NeuronX Training Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide.html#pytorch-neuronx-programming-guide) | 30 | |NxD Docs |Details on NxD, as well as the Distributed Layer Primitives (Tensor Parallelism, Pipeline Parallelism, etc.) |[NxD Developer Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/developer-guide-training.html#neuronx-distributed-developer-guide-training) | 31 | |NxD Docs + PyTorch Lightning |PyTorch Lightning Docs for NxD Training |[PTL Developer Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/ptl_developer_guide.html#ptl-developer-guide) | 32 | |NxD-T Developer Guide |NxD-Training, A higher level abstraction library on NxD for training specific workloads. |[NxD Training Developer Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/overview.html#nxd-training-overview) | 33 | |PreTraining |Pre-Training samples within various different libraries above |
  • [Torch-NeuronX](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html#neuronx-mlp-training-tutorial)
  • [NXD](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/training_llama_tp_zero1.html#llama2-7b-tp-zero1-tutorial)
  • [NxD-T](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_pretraining.html#hf-llama3-8b-pretraining)
  • [Optimum Neuron](https://huggingface.co/docs/optimum-neuron/training_tutorials/pretraining_hyperpod_llm)
| 34 | |LoRA Fine Tuning |LoRA Samples within the various libraries for Neuron |
  • [NxD-T](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_SFT_LORA.html#hf-llama3-8b-sft-lora)
  • [Optimum Neuron](https://huggingface.co/docs/optimum-neuron/training_tutorials/sft_lora_finetune_llm)
| 35 | |Preference Alignment |Preference Alignment Samples within the various libraries for Neuron |[NxD-T](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_DPO_ORPO.html#hf-llama3-8b-dpo-orpo) | 36 | |Awsome Distributed Training |Reference Distributed Training Examples on AWS |[Awsome-distributed-training](https://github.com/aws-samples/awsome-distributed-training) | 37 | 38 | ## Inference Resources 39 | 40 | |Title |Description |Link | 41 | |--- |--- |--- | 42 | |Torch-NeuronX Docs |Torch-NeuronX docs on the XLA flow, and tracing models for Inference on a single core. Samples of various common models as well. |[Torch-NeuronX Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-trace.html#torch-neuronx-trace-api) 43 | [Samples](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx/inference) | 44 | |NxD-I Developer Guide |NxD-Inference, A higher level abstraction library on NxD for inference specific workloads. |[NxD-I Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/index.html) | 45 | |Deployment vLLM |Guide for vLLM development with NxDI |[vLLM Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html) | 46 | |TGI |Guide on how to use HuggingFace Text Generation Inference (TGI) with Neuron |[TGI Docs](https://huggingface.co/docs/optimum-neuron/en/guides/neuronx_tgi) | 47 | 48 | ## Kernel Resources 49 | 50 | |Title |Description |Link | 51 | |--- |--- |--- | 52 | |NKI Docs |General NKI docs |[NKI Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html) | 53 | |Getting Started With NKI |Getting started writing NKI Kernels |[Getting Started With NKI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/getting_started.html#nki-getting-started) | 54 | |Performant Kernels with NKI |Understanding NKI kernel performance |[Performant Kernels with NKI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/nki_arch_guides.html#nki-arch-guides) | 55 | |NKI - Sample Kernels |Sample Kernel Repository with reference implementation |[NKI - Sample Kernels](https://github.com/aws-neuron/nki-samples/tree/main) | 56 | 57 | ## Tools Resources 58 | 59 | |Title |Description |Link | 60 | |--- |--- |--- | 61 | |Profiler |Neuron Profiler User Guide |[Profiler Docs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-profiler-2-0-beta-user-guide.html) | 62 | |Monitoring Tools and CLI |Monitoring and CLI tools for working with Neuron Hardware. |[Monitoring Tools and CLI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) | 63 | 64 | # Learning Paths 65 | 66 | Learning Paths are a list or organized exercises 67 | 68 | ## Training 69 | 70 | |Title |Description |Link |Minimum Instance Required | 71 | |--- |--- |--- |--- | 72 | |Setup an Instance/Developer Environment |This section contains resources to provision a developer Environment. This is a great starting place if you need a clean environment for development, or for starting any of the following exercises. |
  • [Instance Setup](https://repost.aws/articles/ARTxLi0wndTwquyl7frQYuKg)
  • [DLAMIs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html)
|trn1.2xlarge | 73 | |Construct a simple Training Loop with torch-neuronx |This is a sample of how to construct a training loop using torch-neuronx. Relevant for getting started with XLA flows, as well as models which require a single core/DP. |[MLP Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html#neuronx-mlp-training-tutorial) |trn1.2xlarge | 74 | |Implement Tensor Parallelism with NeuronX Distributed |Implement Tensor Parallel for a model to shard training across accelerators. |[BERT Pretraining Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/training.html#tp-training-tutorial) |trn1.32xlarge | 75 | |Pre-training Llama with TP, PP and ZeRO-1 |Train a model using multiple forms of parallelism (Tensor Parallelism, Pipeline Parallelism, and ZeRO-1). This uses the NxD Core Library and should give a good view of the parallel primatives. |[Llama Pretraining Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/training_llama_tp_zero1.html) |4x trn1.32xlarge cluster | 76 | |LoRA Fine Tuning with Optimum Neuron |Fine-Tune a model with LoRA on Optimum Neuron. Optimum Neuron is a library developed by HF and allows for simple modifications to transformers code to port to Neuron. |[Qwen LoRA Optimum Neuron](https://huggingface.co/docs/optimum-neuron/training_tutorials/qwen3-fine-tuning) |trn1.32xlarge | 77 | |LoRA Fine-Tuning with NxDT |LoRA based Fine-tune a model using NxD-T, our higher level training library built on top of NxD core. |[LoRA NxDT Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_SFT_LORA.html#hf-llama3-8b-sft-lora) |trn1.32xlarge | 78 | |DPO/ORPO Fine-Tuning with NxDT |Preference Alignment for a model using NxD-T, our higher level training library built on top of NxD core. |[DPO/ORPO Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_DPO_ORPO.html) |trn1.32xlarge | 79 | 80 | 81 | 82 | ## Inference Path 83 | 84 | |Title |Description |Link |Minimum Instance Required | 85 | |--- |--- |--- |--- | 86 | |Setup an Instance/Developer Environment |This section contains resources to provision a developer Environment. This is a great starting place if you need a clean environment for development, or for starting any of the following exercises. |
  • [Instance Setup](https://repost.aws/articles/ARTxLi0wndTwquyl7frQYuKg)
  • [DLAMIs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html)
|trn1.2xlarge | 87 | |Trace Models with Torch-NeuronX |Trace small models without model parallelism for inference with torch-neuronx. |[Torch-NeuronX Tutorials](https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/README.md#inference) |trn1.2xlarge | 88 | |Deploy Various Models with Optimum Neuron |Optimum Neuron allows for popular models in diffusers and transformers to easily be deployed to Neuron devices. |[Optimum Neuron Tutorials](https://huggingface.co/docs/optimum-neuron/inference_tutorials/notebooks) |trn1.32xlarge | 89 | |Deploy LLM with NxD |NxD is our library with model sharding primitives. This guide serves as a good jumping off point for common LLMs |[NxD-I Production Models](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/model-reference.html) |trn1.32xlarge | 90 | |vLLM Integration |This guide walks through how to run models with vLLM on Neuron devices. This uses the previously mentioned NxDI back-end for the model deployments. |[vLLM User Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html) |trn1.32xlarge | 91 | |Deploy a DiT with NxD |This guide walks through a non-LLM model architecture to be sharded and deployed on Neuron. In this case it is a Diffusion Transformer architecture for image generation |[PixArt Sigma on Neuron](https://aws.amazon.com/blogs/machine-learning/cost-effective-ai-image-generation-with-pixart-sigma-inference-on-aws-trainium-and-aws-inferentia/) |trn1.32xlarge | 92 | |Onboard a new Model to NxD-I |This guide walks through how to onboard a new model to NxD |[Model Onboarding Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/onboarding-models.html) |trn1.32xlarge | 93 | |Explore Additional features of NxD-I |Here are a few additional references for NxD-I features that may be relevant for your specific use case (Multi-LoRA, Quantization, Spec. decode) |
  • [Quantization](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
  • [Spec. Decode](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/trn2-llama3.3-70b-tutorial.html#nxdi-trn2-llama3-3-70b-tutorial)
  • [Multi-LoRA](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/trn2-llama3.1-8b-multi-lora-tutorial.html#nxdi-trn2-llama3-1-8b-multi-lora-tutorial)
|trn1.32xlarge | 94 | 95 | ## Kernel/Compiler Path 96 | 97 | |Title |Description |Link |Minimum Instance Required | 98 | |--- |--- |--- |--- | 99 | |Setup an Instance/Developer Environment |This section contains resources to provision a developer Environment. This is a great starting place if you need a clean environment for development, or for starting any of the following exercises. |
  • [Instance Setup](https://repost.aws/articles/ARTxLi0wndTwquyl7frQYuKg)
  • [DLAMIs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html)
|trn1.2xlarge | 100 | |Writing Functional Kernels |This Getting Started Guide will demonstrate how to write a Hello World, element-wise tensor add kernel. This will give you a good foundation for reading and understanding the other kernels. |[Getting Started with NKI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/getting_started.html#nki-getting-started) |trn1.2xlarge | 101 | |NKI workshop |This workshop walks through how to build, profile and integrate a kernel into PyTorch modelling. |[NKI Workshop](https://github.com/aws-samples/ml-specialized-hardware/tree/main/workshops/03_NKIWorkshop) |trn1.2xlarge | 102 | |Walkthrough NKI Tutorials |These tutorials walkthrough popular kernels and the associated optimizations applied. This is a good set of kernels to show how to iteratively write and optimize kernels. |[NKI Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/tutorials.html) |trn1.2xlarge | 103 | |Review NKI Samples |This repository contains the implementations of optimized reference kernels, used within our serving libraries and implementations. |[NKI Samples](https://github.com/aws-neuron/nki-samples/) |trn1.2xlarge | 104 | |Profiling NKI Kernels |This guide walks through how to profile kernels and use the Neuron Profiler |[Profiling NKI Kernels](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/neuron_profile_for_nki.html#neuron-profile-for-nki) |trn1.2xlarge | 105 | 106 | # Appendix 107 | 108 | ## Other Resources 109 | 110 | |Title |Description |Link | 111 | |--- |--- |--- | 112 | |Re:Invent 2024 Recap |REcap Post from Re:Invent, which includes links to workshops and sessions on Neuron |[RePost Article](https://repost.aws/articles/ARuhbPQliOSqKn74zJpGmMYQ) | 113 | |AI on EKS |Reference implementation for AI workloads on EKS including hosting on Trainium |[AI on EKS](https://github.com/awslabs/ai-on-eks) | 114 | -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/Finetune-TinyLlama-1.1B.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "37be34fc-0fa9-4811-865c-a3fdc38d38e8", 6 | "metadata": {}, 7 | "source": [ 8 | "# Fine-tune TinyLlama-1.1B for text-to-SQL generation\n", 9 | "\n", 10 | "## Introduction\n", 11 | "\n", 12 | "In this workshop module, you will learn how to fine-tune a Llama-based LLM ([TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)) using causal language modelling so that the model learns how to generate SQL queries for text-based instructions. Your fine-tuning job will be launched using SageMaker Training which provides a serverless training environment where you do not need to manage the underlying infrastructure. 
You will learn how to configure a PyTorch training job using [SageMaker's PyTorch estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html), and how to leverage the [Hugging Face Optimum Neuron](https://github.com/huggingface/optimum-neuron) package to easily run the PyTorch training job with AWS Trainium accelerators via an [AWS EC2 trn1.2xlarge instance](https://aws.amazon.com/ec2/instance-types/trn1/).\n", 13 | "\n", 14 | "For this module, you will be using the [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset which consists of thousands of examples of SQL schemas, questions about the schemas, and SQL queries intended to answer the questions.\n", 15 | "\n", 16 | "*Dataset example 1:*\n", 17 | "* *SQL schema/context:* `CREATE TABLE management (department_id VARCHAR); CREATE TABLE department (department_id VARCHAR)`\n", 18 | "* *Question:* `How many departments are led by heads who are not mentioned?`\n", 19 | "* *SQL query/answer:* `SELECT COUNT(*) FROM department WHERE NOT department_id IN (SELECT department_id FROM management)`\n", 20 | "\n", 21 | "*Dataset example 2:*\n", 22 | "* *SQL schema/context:* `CREATE TABLE courses (course_name VARCHAR, course_id VARCHAR); CREATE TABLE student_course_registrations (student_id VARCHAR, course_id VARCHAR)`\n", 23 | "* *Question:* `What are the ids of all students for courses and what are the names of those courses?`\n", 24 | "* *SQL query/answer:* `SELECT T1.student_id, T2.course_name FROM student_course_registrations AS T1 JOIN courses AS T2 ON T1.course_id = T2.course_id`\n", 25 | "\n", 26 | "By fine-tuning the model over several thousand of these text-to-SQL examples, the model will then learn how to generate an appropriate SQL query when presented with a SQL context and a free-form question.\n", 27 | "\n", 28 | "This text-to-SQL use case was selected so you can successfully fine-tune your model in a reasonably short amount of time (~20 minutes) which is appropriate for this 1hr workshop. Although this is a relatively simple use case, please keep in mind that the same techniques and components used in this module can also be applied to fine-tune LLMs for more advanced use cases such as writing code, summarizing documents, creating blog posts - the possibilities are endless!" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "866074ee-c300-4793-8e63-adbcfc314ad8", 34 | "metadata": { 35 | "tags": [] 36 | }, 37 | "source": [ 38 | "## Prerequisites\n", 39 | "\n", 40 | "This notebook uses the SageMaker Python SDK to prepare, launch, and monitor the progress of a PyTorch-based training job. Before we get started, it is important to upgrade the SageMaker SDK to ensure that you are using the latest version. Run the next two cells to upgrade the SageMaker SDK and set up your session." 
41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "id": "3264aae2-1f18-4b59-a92c-2f169903c202", 47 | "metadata": { 48 | "tags": [] 49 | }, 50 | "outputs": [ 51 | { 52 | "ename": "", 53 | "evalue": "", 54 | "output_type": "error", 55 | "traceback": [ 56 | "\u001b[1;31mRunning cells with 'Python 3.11.12' requires the ipykernel package.\n", 57 | "\u001b[1;31mCreate a Python Environment with the required packages.\n", 58 | "\u001b[1;31mOr install 'ipykernel' using the command: '/opt/homebrew/bin/python3.11 -m pip install ipykernel -U --user --force-reinstall'" 59 | ] 60 | } 61 | ], 62 | "source": [ 63 | "# Upgrade SageMaker SDK to the latest version\n", 64 | "%pip install -U sagemaker awscli huggingface_hub ipywidgets -q 2>&1 | grep -v \"warnings/venv\"\n", 65 | "# Definitely restart your kernel after this cell" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "id": "9b5ed574-6db5-471b-8515-c0f6189e653e", 72 | "metadata": { 73 | "tags": [] 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "import logging\n", 78 | "sagemaker_config_logger = logging.getLogger(\"sagemaker.config\")\n", 79 | "sagemaker_config_logger.setLevel(logging.WARNING)\n", 80 | "\n", 81 | "# Import SageMaker SDK, setup our session\n", 82 | "from sagemaker import get_execution_role, Session\n", 83 | "from sagemaker.pytorch import PyTorch\n", 84 | "import boto3\n", 85 | "\n", 86 | "region_name=\"us-east-2\" #this is hard coded to a specific region because of Workshop quotas. You could use sess.boto_region_name\n", 87 | "sess = Session(boto_session=boto3.Session(region_name=region_name))\n", 88 | "default_bucket = sess.default_bucket()\n" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "id": "2ce630d1", 94 | "metadata": {}, 95 | "source": [ 96 | "This next command just configures the EC2 instance (in us-west-2) to have a default region of us-east-2. This is specific to the environment in AWS Workshop Studio." 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "id": "5542b3d1", 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "!aws configure set region us-east-2" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "id": "ed3a1f57", 112 | "metadata": {}, 113 | "source": [ 114 | "## Log into Hugging Face\n", 115 | "\n", 116 | "The following step is recommended but optional. If you can log in with your Hugging Face token, it will let you avoid any rate limits for unauthenticated requests. Even though none of the models or datasets we are using require special permission, if you don't log in your training may fail because of too many unauthenticated requests. 
" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "id": "5f142253", 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "# If the cell below stays empty, RESTART YOUR KERNEL if you didn't and run the cells above again\n", 127 | "# If you can't login in, you can proceed to the next cell.\n", 128 | "from huggingface_hub import notebook_login\n", 129 | "\n", 130 | "# Uncheck \"Add token as git credential\" or just ignore the error message about it not being added.\n", 131 | "notebook_login()" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "id": "4193108b-25fb-4d3e-85db-c66b8c04c251", 137 | "metadata": {}, 138 | "source": [ 139 | "## Specify the Optimum Neuron deep learning container (DLC) image\n", 140 | "\n", 141 | "The SageMaker Training service uses containers to execute your training script, allowing you to fully customize your training script environment and any required dependencies. For this workshop, you will use a recent Pytorch Training deep learning container (DLC) image which is an AWS-maintained image containing the Neuron SDK and PyTorch. The Optimum-Neuron library is installed with the requirements.txt file in the assets directory." 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "id": "247ad886-6977-4295-947b-86d4892b48bd", 148 | "metadata": { 149 | "tags": [] 150 | }, 151 | "outputs": [ 152 | { 153 | "name": "stdout", 154 | "output_type": "stream", 155 | "text": [ 156 | "763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training-neuronx:2.7.0-neuronx-py310-sdk2.24.1-ubuntu22.04\n" 157 | ] 158 | } 159 | ], 160 | "source": [ 161 | "# Specify the Neuron DLC that we will use for training\n", 162 | "# For now, we'll use the standard Neuron DLC and install Optimum Neuron v0.0.27 at training time because we want to use a later SDK \n", 163 | "# You can see more about the images here: https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx\n", 164 | "\n", 165 | "training_image = f\"763104351884.dkr.ecr.{sess.boto_region_name}.amazonaws.com/pytorch-training-neuronx:2.7.0-neuronx-py310-sdk2.24.1-ubuntu22.04\"\n", 166 | "print(training_image)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "id": "8a8802bc-657a-419d-b86d-eb8af5eff90e", 172 | "metadata": { 173 | "tags": [] 174 | }, 175 | "source": [ 176 | "## Configure the PyTorch Estimator\n", 177 | "\n", 178 | "The SageMaker SDK includes a [PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html) class which you can use to define a PyTorch training job that will be executed in the SageMaker managed environment. \n", 179 | "\n", 180 | "In the following cell, you will create a PyTorch Estimator which will run the attached `finetune_llama.py` training script on an ml.trn1.2xlarge instance. The `finetune_llama.py` script is an Optimum Neuron training script that can be used for causal language modelling with AWS Trainium. The scripts will be downloaded as the instance is brought up, and the scripts will download the model and the datasets onto the SageMaker training instance.\n", 181 | "\n", 182 | "The PyTorch Estimator has many parameters that can be used to configure your training job. 
A few of the most important parameters include:\n", 183 | "\n", 184 | "- *entry_point*: refers to the name of the training script that will be executed as part of this training job\n", 185 | "- *source_dir*: the path to the local source code directory (relative to your notebook) that will be packaged up and included inside your training container\n", 186 | "- *instance_count*: defines how many EC2 instances to use for this training job\n", 187 | "- *instance_type*: determines which type of EC2 instance will be used for training\n", 188 | "- *image_uri*: defines which training DLC will be used to run the training job (see Neuron DLC, above)\n", 189 | "- *distribution*: determines which type of distribution to use for the training job - you will need 'torch_distributed' for this workshop\n", 190 | "- *environment*: provides a dictionary of environment variables which will be applied to your training environment\n", 191 | "- *hyperparameters*: provides a dictionary of command-line arguments to pass to your training script, ex: finetune_llama.py\n", 192 | "\n", 193 | "In the `hyperparameters` section, you can see the specific command-line arguments that are used to control the behavior of the `finetune_llama.py` training script. Notably:\n", 194 | "- *model_id*: specifies which model you will be fine-tuning, in this case a recent checkpoint from the TinyLlama-1.1B project\n", 195 | "- *tokenizer_id*: specifies which tokenizer you will used to tokenize the dataset examples during training\n", 196 | "- *output_dir*: directory in which the fine-tuned model will be saved. Here we use the SageMaker-specific `/opt/ml/model` directory. At the end of the training job, SageMaker automatically copies the contents of this directory to the output S3 bucket\n", 197 | "- *tensor_parallel_size*: the tensor parallel degree for which we want to use for training. In this case we use '2' to shard the model across the 2 NeuronCores available in the trn1.2xlarge instance\n", 198 | "- *bf16*: request BFloat16 training\n", 199 | "- *per_device_train_batch_size*: the microbatch size to be used for fine-tuning\n", 200 | "- *gradient_accumulation_steps*: how many steps for which gradients will be accumulated between updates\n", 201 | "- *max_steps*: the maximum number of steps of fine-tuning that we want to perform\n", 202 | "- *lora_r*, *lora_alpha*, *lora_dropout*: the LoRA rank, alpha, and dropout values to use during fine-tuning\n", 203 | "\n", 204 | "The below estimator has been pre-configured for you, so you do not need to make any changes." 
205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "id": "7dd1c2b2", 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "# Note that the hyperparameters are command-line args passed to the finetune_llama.py script to control its behavior\n", 215 | "# Create hyperparameters dictionary\n", 216 | "hyperparameters = {\n", 217 | " \"model_id\": \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n", 218 | " \"tokenizer_id\": \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n", 219 | " \"skip_cache_push\": True,\n", 220 | " \"output_dir\": \"/opt/ml/model\",\n", 221 | " \"tensor_parallel_size\": 2,\n", 222 | " \"bf16\": True,\n", 223 | " \"per_device_train_batch_size\": 2,\n", 224 | " \"gradient_accumulation_steps\": 1,\n", 225 | " \"gradient_checkpointing\": True,\n", 226 | " \"max_steps\": 1000,\n", 227 | " \"lora_r\": 16,\n", 228 | " \"lora_alpha\": 32,\n", 229 | " \"lora_dropout\": 0.05,\n", 230 | " \"logging_steps\": 10,\n", 231 | " \"learning_rate\": 5e-5,\n", 232 | " \"dataloader_drop_last\": True,\n", 233 | " \"disable_tqdm\": True,\n", 234 | "}\n", 235 | "\n", 236 | "# Set up environment variables\n", 237 | "from huggingface_hub import HfFolder\n", 238 | "\n", 239 | "environment = {\"FI_EFA_FORK_SAFE\": \"1\", \"WANDB_DISABLED\": \"true\"}\n", 240 | "token = HfFolder.get_token()\n", 241 | "if token is not None:\n", 242 | " environment[\"HF_TOKEN\"] = token\n", 243 | "\n", 244 | "# Set up the PyTorch estimator\n", 245 | "pt_estimator = PyTorch(\n", 246 | " entry_point=\"finetune_llama.py\",\n", 247 | " source_dir=\"./assets\",\n", 248 | " role=get_execution_role(),\n", 249 | " instance_count=1,\n", 250 | " instance_type=\"ml.trn1.2xlarge\",\n", 251 | " disable_profiler=True,\n", 252 | " output_path=f\"s3://{default_bucket}/neuron_events2025\",\n", 253 | " base_job_name=\"trn1-tinyllama\",\n", 254 | " sagemaker_session=sess,\n", 255 | " code_bucket=f\"s3://{default_bucket}/neuron_events2025_code\",\n", 256 | " checkpoint_s3_uri=f\"s3://{default_bucket}/neuron_events_output\",\n", 257 | " image_uri=training_image,\n", 258 | " distribution={\"torch_distributed\": {\"enabled\": True}},\n", 259 | " environment=environment,\n", 260 | " disable_output_compression=True,\n", 261 | " hyperparameters=hyperparameters\n", 262 | ")\n" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "id": "2278940b-f563-4582-9df0-bd56d9b5fd28", 268 | "metadata": {}, 269 | "source": [ 270 | "## Launch the training job\n", 271 | "\n", 272 | "Once the estimator has been created, you can then launch your training job by calling `.fit()` on the estimator:" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 9, 278 | "id": "b7829c64-0190-43c3-be1a-0ccce7d45248", 279 | "metadata": { 280 | "tags": [] 281 | }, 282 | "outputs": [ 283 | { 284 | "name": "stderr", 285 | "output_type": "stream", 286 | "text": [ 287 | "INFO:sagemaker:Creating training-job with name: trn1-tinyllama-2025-05-13-00-40-31-750\n" 288 | ] 289 | } 290 | ], 291 | "source": [ 292 | "# Call fit() on the estimator to initiate the training job\n", 293 | "pt_estimator.fit(wait=False, logs=False)" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "id": "b77434b2-94d7-4256-8d0b-d5d2ddb1d5ae", 299 | "metadata": {}, 300 | "source": [ 301 | "## Monitor the training job\n", 302 | "\n", 303 | "When the training job has been launched, the SageMaker Training service will then take care of:\n", 304 | "- launching and configuring the requested EC2 infrastructure for your training job\n", 305 
| "- launching the requested container image on each of the EC2 instances\n", 306 | "- copying your source code directory and running your training script within the container(s)\n", 307 | "- storing your trained model artifacts in Amazon Simple Storage Service (S3)\n", 308 | "- decommissioning the training infrastructure\n", 309 | "\n", 310 | "While the training job is running, the following cell will periodically check and output the job status. When you see 'Completed', you know that your training job is finished and you can proceed to the remainder of the notebook. The training job typically takes about 20 minutes to complete.\n", 311 | "\n", 312 | "If you are interested in viewing the output logs from your training job, you can view the logs by navigating to the AWS CloudWatch console, selecting `Logs -> Log Groups` in the left-hand menu, and then looking for your SageMaker training job in the list. **Note:** it will usually take 4-5 minutes before the infrastructure is running and the output logs begin to be populated in CloudWatch." 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 10, 318 | "id": "0c223037-2f8e-4eb0-9e4b-ff4dac6ede7a", 319 | "metadata": { 320 | "tags": [] 321 | }, 322 | "outputs": [ 323 | { 324 | "name": "stdout", 325 | "output_type": "stream", 326 | "text": [ 327 | "2025-05-13T00:40:37.718399 Training job status: InProgress!\n", 328 | "2025-05-13T00:41:07.827456 Training job status: InProgress!\n", 329 | "2025-05-13T00:41:37.941892 Training job status: InProgress!\n", 330 | "2025-05-13T00:42:08.055514 Training job status: InProgress!\n", 331 | "2025-05-13T00:42:38.170184 Training job status: InProgress!\n", 332 | "2025-05-13T00:43:08.285526 Training job status: InProgress!\n", 333 | "2025-05-13T00:43:38.401669 Training job status: InProgress!\n", 334 | "2025-05-13T00:44:08.517601 Training job status: InProgress!\n", 335 | "2025-05-13T00:44:38.607279 Training job status: InProgress!\n", 336 | "2025-05-13T00:45:08.901240 Training job status: InProgress!\n", 337 | "2025-05-13T00:45:39.029987 Training job status: InProgress!\n", 338 | "2025-05-13T00:46:09.148483 Training job status: InProgress!\n", 339 | "2025-05-13T00:46:39.262424 Training job status: InProgress!\n", 340 | "2025-05-13T00:47:09.378729 Training job status: InProgress!\n", 341 | "2025-05-13T00:47:39.477011 Training job status: InProgress!\n", 342 | "2025-05-13T00:48:09.589262 Training job status: InProgress!\n", 343 | "2025-05-13T00:48:39.715998 Training job status: InProgress!\n", 344 | "2025-05-13T00:49:09.833712 Training job status: InProgress!\n", 345 | "2025-05-13T00:49:40.132350 Training job status: InProgress!\n", 346 | "2025-05-13T00:50:10.259671 Training job status: InProgress!\n", 347 | "2025-05-13T00:50:40.376526 Training job status: InProgress!\n", 348 | "2025-05-13T00:51:10.492630 Training job status: InProgress!\n", 349 | "2025-05-13T00:51:40.612684 Training job status: InProgress!\n", 350 | "2025-05-13T00:52:10.735871 Training job status: InProgress!\n", 351 | "2025-05-13T00:52:40.856541 Training job status: InProgress!\n", 352 | "2025-05-13T00:53:10.978185 Training job status: InProgress!\n", 353 | "2025-05-13T00:53:41.102406 Training job status: InProgress!\n", 354 | "2025-05-13T00:54:11.391318 Training job status: InProgress!\n", 355 | "2025-05-13T00:54:41.506542 Training job status: InProgress!\n", 356 | "2025-05-13T00:55:11.619419 Training job status: InProgress!\n", 357 | "2025-05-13T00:55:41.736144 Training job status: InProgress!\n", 358 | 
"2025-05-13T00:56:11.850643 Training job status: InProgress!\n", 359 | "2025-05-13T00:56:41.965740 Training job status: InProgress!\n", 360 | "2025-05-13T00:57:12.082235 Training job status: InProgress!\n", 361 | "2025-05-13T00:57:42.193146 Training job status: InProgress!\n", 362 | "2025-05-13T00:58:12.309523 Training job status: InProgress!\n", 363 | "2025-05-13T00:58:42.596288 Training job status: InProgress!\n", 364 | "2025-05-13T00:59:12.715701 Training job status: InProgress!\n", 365 | "2025-05-13T00:59:42.835134 Training job status: InProgress!\n", 366 | "2025-05-13T01:00:12.952002 Training job status: InProgress!\n", 367 | "2025-05-13T01:00:43.070275 Training job status: InProgress!\n", 368 | "2025-05-13T01:01:13.187416 Training job status: InProgress!\n", 369 | "2025-05-13T01:01:43.291955 Training job status: InProgress!\n", 370 | "\n", 371 | "2025-05-13T01:02:13.412501 Training job status: Completed!\n" 372 | ] 373 | } 374 | ], 375 | "source": [ 376 | "# Periodically check job status until it shows 'Completed' (ETA ~20 minutes)\n", 377 | "# You can also monitor job status in the SageMaker console, and view the\n", 378 | "# SageMaker Training job logs in the CloudWatch console\n", 379 | "from time import sleep\n", 380 | "from datetime import datetime\n", 381 | "\n", 382 | "while (job_status := pt_estimator.jobs[-1].describe()['TrainingJobStatus']) not in ['Completed', 'Error', 'Failed']:\n", 383 | " print(f\"{datetime.now().isoformat()} Training job status: {job_status}!\")\n", 384 | " sleep(30)\n", 385 | "\n", 386 | "print(f\"\\n{datetime.now().isoformat()} Training job status: {job_status}!\")" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "id": "16c94343-b0c6-4903-82cc-c8ab2f88b26b", 392 | "metadata": {}, 393 | "source": [ 394 | "## Determine location of fine-tuned model artifacts\n", 395 | "\n", 396 | "Once the training job has completed, SageMaker will copy your fine-tuned model artifacts to a specified location in S3.\n", 397 | "\n", 398 | "In the following cell, you can see how to programmatically determine the location of your model artifacts:" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 11, 404 | "id": "213af977-8ed6-4081-af65-59c70db2dbfb", 405 | "metadata": { 406 | "tags": [] 407 | }, 408 | "outputs": [ 409 | { 410 | "name": "stdout", 411 | "output_type": "stream", 412 | "text": [ 413 | "Your fine-tuned model is available here:\n", 414 | "\n", 415 | "s3://this.output.should.be.replaced.with.a.real.s3.path.once.the.cell.is.executed/\n" 416 | ] 417 | } 418 | ], 419 | "source": [ 420 | "# Show where the fine-tuned model is stored - previous job must be 'Completed' before running this cell\n", 421 | "model_archive_path = pt_estimator.jobs[-1].describe()['ModelArtifacts']['S3ModelArtifacts']\n", 422 | "print(f\"Your fine-tuned model is available here:\\n\\n{model_archive_path}/\")" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "id": "b68f529f-a548-4fbd-b160-3cab5f52c488", 428 | "metadata": {}, 429 | "source": [ 430 | "
\n", 431 | "\n", 432 | "**Note:** Please copy the above S3 path, as it will be required in the subsequent workshop module.\n", 433 | "\n", 434 | "\n", 435 | "Lastly, run the following cell to list the model artifacts available in your S3 model_archive_path:" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 12, 441 | "id": "27ad8c7e-6a73-4f20-944f-ac12ef286a6f", 442 | "metadata": { 443 | "tags": [] 444 | }, 445 | "outputs": [ 446 | { 447 | "name": "stdout", 448 | "output_type": "stream", 449 | "text": [ 450 | "2025-05-13 01:01:39 714 config.json\n", 451 | "2025-05-13 01:01:48 124 generation_config.json\n", 452 | "2025-05-13 01:01:40 4400216536 model.safetensors\n", 453 | "2025-05-13 01:01:47 551 special_tokens_map.json\n", 454 | "2025-05-13 01:01:47 1842795 tokenizer.json\n", 455 | "2025-05-13 01:01:39 499723 tokenizer.model\n", 456 | "2025-05-13 01:01:48 1368 tokenizer_config.json\n" 457 | ] 458 | } 459 | ], 460 | "source": [ 461 | "# View the contents of the fine-tuned model path in S3\n", 462 | "!aws s3 ls {model_archive_path}/merged_model/" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "id": "fca9ffa7-a694-48c0-acde-cd468d18a448", 468 | "metadata": {}, 469 | "source": [ 470 | "Congratulations on completing the LLM fine-tuning module!\n", 471 | "\n", 472 | "In the next notebook, you will learn how to deploy your fine-tuned model in a SageMaker hosted endpoint, and leverage AWS Inferentia accelerators to perform model inference. Have fun!" 473 | ] 474 | } 475 | ], 476 | "metadata": { 477 | "availableInstances": [ 478 | { 479 | "_defaultOrder": 0, 480 | "_isFastLaunch": true, 481 | "category": "General purpose", 482 | "gpuNum": 0, 483 | "hideHardwareSpecs": false, 484 | "memoryGiB": 4, 485 | "name": "ml.t3.medium", 486 | "vcpuNum": 2 487 | }, 488 | { 489 | "_defaultOrder": 1, 490 | "_isFastLaunch": false, 491 | "category": "General purpose", 492 | "gpuNum": 0, 493 | "hideHardwareSpecs": false, 494 | "memoryGiB": 8, 495 | "name": "ml.t3.large", 496 | "vcpuNum": 2 497 | }, 498 | { 499 | "_defaultOrder": 2, 500 | "_isFastLaunch": false, 501 | "category": "General purpose", 502 | "gpuNum": 0, 503 | "hideHardwareSpecs": false, 504 | "memoryGiB": 16, 505 | "name": "ml.t3.xlarge", 506 | "vcpuNum": 4 507 | }, 508 | { 509 | "_defaultOrder": 3, 510 | "_isFastLaunch": false, 511 | "category": "General purpose", 512 | "gpuNum": 0, 513 | "hideHardwareSpecs": false, 514 | "memoryGiB": 32, 515 | "name": "ml.t3.2xlarge", 516 | "vcpuNum": 8 517 | }, 518 | { 519 | "_defaultOrder": 4, 520 | "_isFastLaunch": true, 521 | "category": "General purpose", 522 | "gpuNum": 0, 523 | "hideHardwareSpecs": false, 524 | "memoryGiB": 8, 525 | "name": "ml.m5.large", 526 | "vcpuNum": 2 527 | }, 528 | { 529 | "_defaultOrder": 5, 530 | "_isFastLaunch": false, 531 | "category": "General purpose", 532 | "gpuNum": 0, 533 | "hideHardwareSpecs": false, 534 | "memoryGiB": 16, 535 | "name": "ml.m5.xlarge", 536 | "vcpuNum": 4 537 | }, 538 | { 539 | "_defaultOrder": 6, 540 | "_isFastLaunch": false, 541 | "category": "General purpose", 542 | "gpuNum": 0, 543 | "hideHardwareSpecs": false, 544 | "memoryGiB": 32, 545 | "name": "ml.m5.2xlarge", 546 | "vcpuNum": 8 547 | }, 548 | { 549 | "_defaultOrder": 7, 550 | "_isFastLaunch": false, 551 | "category": "General purpose", 552 | "gpuNum": 0, 553 | "hideHardwareSpecs": false, 554 | "memoryGiB": 64, 555 | "name": "ml.m5.4xlarge", 556 | "vcpuNum": 16 557 | }, 558 | { 559 | "_defaultOrder": 8, 560 | "_isFastLaunch": false, 561 | "category": "General 
purpose", 562 | "gpuNum": 0, 563 | "hideHardwareSpecs": false, 564 | "memoryGiB": 128, 565 | "name": "ml.m5.8xlarge", 566 | "vcpuNum": 32 567 | }, 568 | { 569 | "_defaultOrder": 9, 570 | "_isFastLaunch": false, 571 | "category": "General purpose", 572 | "gpuNum": 0, 573 | "hideHardwareSpecs": false, 574 | "memoryGiB": 192, 575 | "name": "ml.m5.12xlarge", 576 | "vcpuNum": 48 577 | }, 578 | { 579 | "_defaultOrder": 10, 580 | "_isFastLaunch": false, 581 | "category": "General purpose", 582 | "gpuNum": 0, 583 | "hideHardwareSpecs": false, 584 | "memoryGiB": 256, 585 | "name": "ml.m5.16xlarge", 586 | "vcpuNum": 64 587 | }, 588 | { 589 | "_defaultOrder": 11, 590 | "_isFastLaunch": false, 591 | "category": "General purpose", 592 | "gpuNum": 0, 593 | "hideHardwareSpecs": false, 594 | "memoryGiB": 384, 595 | "name": "ml.m5.24xlarge", 596 | "vcpuNum": 96 597 | }, 598 | { 599 | "_defaultOrder": 12, 600 | "_isFastLaunch": false, 601 | "category": "General purpose", 602 | "gpuNum": 0, 603 | "hideHardwareSpecs": false, 604 | "memoryGiB": 8, 605 | "name": "ml.m5d.large", 606 | "vcpuNum": 2 607 | }, 608 | { 609 | "_defaultOrder": 13, 610 | "_isFastLaunch": false, 611 | "category": "General purpose", 612 | "gpuNum": 0, 613 | "hideHardwareSpecs": false, 614 | "memoryGiB": 16, 615 | "name": "ml.m5d.xlarge", 616 | "vcpuNum": 4 617 | }, 618 | { 619 | "_defaultOrder": 14, 620 | "_isFastLaunch": false, 621 | "category": "General purpose", 622 | "gpuNum": 0, 623 | "hideHardwareSpecs": false, 624 | "memoryGiB": 32, 625 | "name": "ml.m5d.2xlarge", 626 | "vcpuNum": 8 627 | }, 628 | { 629 | "_defaultOrder": 15, 630 | "_isFastLaunch": false, 631 | "category": "General purpose", 632 | "gpuNum": 0, 633 | "hideHardwareSpecs": false, 634 | "memoryGiB": 64, 635 | "name": "ml.m5d.4xlarge", 636 | "vcpuNum": 16 637 | }, 638 | { 639 | "_defaultOrder": 16, 640 | "_isFastLaunch": false, 641 | "category": "General purpose", 642 | "gpuNum": 0, 643 | "hideHardwareSpecs": false, 644 | "memoryGiB": 128, 645 | "name": "ml.m5d.8xlarge", 646 | "vcpuNum": 32 647 | }, 648 | { 649 | "_defaultOrder": 17, 650 | "_isFastLaunch": false, 651 | "category": "General purpose", 652 | "gpuNum": 0, 653 | "hideHardwareSpecs": false, 654 | "memoryGiB": 192, 655 | "name": "ml.m5d.12xlarge", 656 | "vcpuNum": 48 657 | }, 658 | { 659 | "_defaultOrder": 18, 660 | "_isFastLaunch": false, 661 | "category": "General purpose", 662 | "gpuNum": 0, 663 | "hideHardwareSpecs": false, 664 | "memoryGiB": 256, 665 | "name": "ml.m5d.16xlarge", 666 | "vcpuNum": 64 667 | }, 668 | { 669 | "_defaultOrder": 19, 670 | "_isFastLaunch": false, 671 | "category": "General purpose", 672 | "gpuNum": 0, 673 | "hideHardwareSpecs": false, 674 | "memoryGiB": 384, 675 | "name": "ml.m5d.24xlarge", 676 | "vcpuNum": 96 677 | }, 678 | { 679 | "_defaultOrder": 20, 680 | "_isFastLaunch": false, 681 | "category": "General purpose", 682 | "gpuNum": 0, 683 | "hideHardwareSpecs": true, 684 | "memoryGiB": 0, 685 | "name": "ml.geospatial.interactive", 686 | "supportedImageNames": [ 687 | "sagemaker-geospatial-v1-0" 688 | ], 689 | "vcpuNum": 0 690 | }, 691 | { 692 | "_defaultOrder": 21, 693 | "_isFastLaunch": true, 694 | "category": "Compute optimized", 695 | "gpuNum": 0, 696 | "hideHardwareSpecs": false, 697 | "memoryGiB": 4, 698 | "name": "ml.c5.large", 699 | "vcpuNum": 2 700 | }, 701 | { 702 | "_defaultOrder": 22, 703 | "_isFastLaunch": false, 704 | "category": "Compute optimized", 705 | "gpuNum": 0, 706 | "hideHardwareSpecs": false, 707 | "memoryGiB": 8, 708 | "name": "ml.c5.xlarge", 709 | 
"vcpuNum": 4 710 | }, 711 | { 712 | "_defaultOrder": 23, 713 | "_isFastLaunch": false, 714 | "category": "Compute optimized", 715 | "gpuNum": 0, 716 | "hideHardwareSpecs": false, 717 | "memoryGiB": 16, 718 | "name": "ml.c5.2xlarge", 719 | "vcpuNum": 8 720 | }, 721 | { 722 | "_defaultOrder": 24, 723 | "_isFastLaunch": false, 724 | "category": "Compute optimized", 725 | "gpuNum": 0, 726 | "hideHardwareSpecs": false, 727 | "memoryGiB": 32, 728 | "name": "ml.c5.4xlarge", 729 | "vcpuNum": 16 730 | }, 731 | { 732 | "_defaultOrder": 25, 733 | "_isFastLaunch": false, 734 | "category": "Compute optimized", 735 | "gpuNum": 0, 736 | "hideHardwareSpecs": false, 737 | "memoryGiB": 72, 738 | "name": "ml.c5.9xlarge", 739 | "vcpuNum": 36 740 | }, 741 | { 742 | "_defaultOrder": 26, 743 | "_isFastLaunch": false, 744 | "category": "Compute optimized", 745 | "gpuNum": 0, 746 | "hideHardwareSpecs": false, 747 | "memoryGiB": 96, 748 | "name": "ml.c5.12xlarge", 749 | "vcpuNum": 48 750 | }, 751 | { 752 | "_defaultOrder": 27, 753 | "_isFastLaunch": false, 754 | "category": "Compute optimized", 755 | "gpuNum": 0, 756 | "hideHardwareSpecs": false, 757 | "memoryGiB": 144, 758 | "name": "ml.c5.18xlarge", 759 | "vcpuNum": 72 760 | }, 761 | { 762 | "_defaultOrder": 28, 763 | "_isFastLaunch": false, 764 | "category": "Compute optimized", 765 | "gpuNum": 0, 766 | "hideHardwareSpecs": false, 767 | "memoryGiB": 192, 768 | "name": "ml.c5.24xlarge", 769 | "vcpuNum": 96 770 | }, 771 | { 772 | "_defaultOrder": 29, 773 | "_isFastLaunch": true, 774 | "category": "Accelerated computing", 775 | "gpuNum": 1, 776 | "hideHardwareSpecs": false, 777 | "memoryGiB": 16, 778 | "name": "ml.g4dn.xlarge", 779 | "vcpuNum": 4 780 | }, 781 | { 782 | "_defaultOrder": 30, 783 | "_isFastLaunch": false, 784 | "category": "Accelerated computing", 785 | "gpuNum": 1, 786 | "hideHardwareSpecs": false, 787 | "memoryGiB": 32, 788 | "name": "ml.g4dn.2xlarge", 789 | "vcpuNum": 8 790 | }, 791 | { 792 | "_defaultOrder": 31, 793 | "_isFastLaunch": false, 794 | "category": "Accelerated computing", 795 | "gpuNum": 1, 796 | "hideHardwareSpecs": false, 797 | "memoryGiB": 64, 798 | "name": "ml.g4dn.4xlarge", 799 | "vcpuNum": 16 800 | }, 801 | { 802 | "_defaultOrder": 32, 803 | "_isFastLaunch": false, 804 | "category": "Accelerated computing", 805 | "gpuNum": 1, 806 | "hideHardwareSpecs": false, 807 | "memoryGiB": 128, 808 | "name": "ml.g4dn.8xlarge", 809 | "vcpuNum": 32 810 | }, 811 | { 812 | "_defaultOrder": 33, 813 | "_isFastLaunch": false, 814 | "category": "Accelerated computing", 815 | "gpuNum": 4, 816 | "hideHardwareSpecs": false, 817 | "memoryGiB": 192, 818 | "name": "ml.g4dn.12xlarge", 819 | "vcpuNum": 48 820 | }, 821 | { 822 | "_defaultOrder": 34, 823 | "_isFastLaunch": false, 824 | "category": "Accelerated computing", 825 | "gpuNum": 1, 826 | "hideHardwareSpecs": false, 827 | "memoryGiB": 256, 828 | "name": "ml.g4dn.16xlarge", 829 | "vcpuNum": 64 830 | }, 831 | { 832 | "_defaultOrder": 35, 833 | "_isFastLaunch": false, 834 | "category": "Accelerated computing", 835 | "gpuNum": 1, 836 | "hideHardwareSpecs": false, 837 | "memoryGiB": 61, 838 | "name": "ml.p3.2xlarge", 839 | "vcpuNum": 8 840 | }, 841 | { 842 | "_defaultOrder": 36, 843 | "_isFastLaunch": false, 844 | "category": "Accelerated computing", 845 | "gpuNum": 4, 846 | "hideHardwareSpecs": false, 847 | "memoryGiB": 244, 848 | "name": "ml.p3.8xlarge", 849 | "vcpuNum": 32 850 | }, 851 | { 852 | "_defaultOrder": 37, 853 | "_isFastLaunch": false, 854 | "category": "Accelerated computing", 855 | "gpuNum": 
8, 856 | "hideHardwareSpecs": false, 857 | "memoryGiB": 488, 858 | "name": "ml.p3.16xlarge", 859 | "vcpuNum": 64 860 | }, 861 | { 862 | "_defaultOrder": 38, 863 | "_isFastLaunch": false, 864 | "category": "Accelerated computing", 865 | "gpuNum": 8, 866 | "hideHardwareSpecs": false, 867 | "memoryGiB": 768, 868 | "name": "ml.p3dn.24xlarge", 869 | "vcpuNum": 96 870 | }, 871 | { 872 | "_defaultOrder": 39, 873 | "_isFastLaunch": false, 874 | "category": "Memory Optimized", 875 | "gpuNum": 0, 876 | "hideHardwareSpecs": false, 877 | "memoryGiB": 16, 878 | "name": "ml.r5.large", 879 | "vcpuNum": 2 880 | }, 881 | { 882 | "_defaultOrder": 40, 883 | "_isFastLaunch": false, 884 | "category": "Memory Optimized", 885 | "gpuNum": 0, 886 | "hideHardwareSpecs": false, 887 | "memoryGiB": 32, 888 | "name": "ml.r5.xlarge", 889 | "vcpuNum": 4 890 | }, 891 | { 892 | "_defaultOrder": 41, 893 | "_isFastLaunch": false, 894 | "category": "Memory Optimized", 895 | "gpuNum": 0, 896 | "hideHardwareSpecs": false, 897 | "memoryGiB": 64, 898 | "name": "ml.r5.2xlarge", 899 | "vcpuNum": 8 900 | }, 901 | { 902 | "_defaultOrder": 42, 903 | "_isFastLaunch": false, 904 | "category": "Memory Optimized", 905 | "gpuNum": 0, 906 | "hideHardwareSpecs": false, 907 | "memoryGiB": 128, 908 | "name": "ml.r5.4xlarge", 909 | "vcpuNum": 16 910 | }, 911 | { 912 | "_defaultOrder": 43, 913 | "_isFastLaunch": false, 914 | "category": "Memory Optimized", 915 | "gpuNum": 0, 916 | "hideHardwareSpecs": false, 917 | "memoryGiB": 256, 918 | "name": "ml.r5.8xlarge", 919 | "vcpuNum": 32 920 | }, 921 | { 922 | "_defaultOrder": 44, 923 | "_isFastLaunch": false, 924 | "category": "Memory Optimized", 925 | "gpuNum": 0, 926 | "hideHardwareSpecs": false, 927 | "memoryGiB": 384, 928 | "name": "ml.r5.12xlarge", 929 | "vcpuNum": 48 930 | }, 931 | { 932 | "_defaultOrder": 45, 933 | "_isFastLaunch": false, 934 | "category": "Memory Optimized", 935 | "gpuNum": 0, 936 | "hideHardwareSpecs": false, 937 | "memoryGiB": 512, 938 | "name": "ml.r5.16xlarge", 939 | "vcpuNum": 64 940 | }, 941 | { 942 | "_defaultOrder": 46, 943 | "_isFastLaunch": false, 944 | "category": "Memory Optimized", 945 | "gpuNum": 0, 946 | "hideHardwareSpecs": false, 947 | "memoryGiB": 768, 948 | "name": "ml.r5.24xlarge", 949 | "vcpuNum": 96 950 | }, 951 | { 952 | "_defaultOrder": 47, 953 | "_isFastLaunch": false, 954 | "category": "Accelerated computing", 955 | "gpuNum": 1, 956 | "hideHardwareSpecs": false, 957 | "memoryGiB": 16, 958 | "name": "ml.g5.xlarge", 959 | "vcpuNum": 4 960 | }, 961 | { 962 | "_defaultOrder": 48, 963 | "_isFastLaunch": false, 964 | "category": "Accelerated computing", 965 | "gpuNum": 1, 966 | "hideHardwareSpecs": false, 967 | "memoryGiB": 32, 968 | "name": "ml.g5.2xlarge", 969 | "vcpuNum": 8 970 | }, 971 | { 972 | "_defaultOrder": 49, 973 | "_isFastLaunch": false, 974 | "category": "Accelerated computing", 975 | "gpuNum": 1, 976 | "hideHardwareSpecs": false, 977 | "memoryGiB": 64, 978 | "name": "ml.g5.4xlarge", 979 | "vcpuNum": 16 980 | }, 981 | { 982 | "_defaultOrder": 50, 983 | "_isFastLaunch": false, 984 | "category": "Accelerated computing", 985 | "gpuNum": 1, 986 | "hideHardwareSpecs": false, 987 | "memoryGiB": 128, 988 | "name": "ml.g5.8xlarge", 989 | "vcpuNum": 32 990 | }, 991 | { 992 | "_defaultOrder": 51, 993 | "_isFastLaunch": false, 994 | "category": "Accelerated computing", 995 | "gpuNum": 1, 996 | "hideHardwareSpecs": false, 997 | "memoryGiB": 256, 998 | "name": "ml.g5.16xlarge", 999 | "vcpuNum": 64 1000 | }, 1001 | { 1002 | "_defaultOrder": 52, 1003 | 
"_isFastLaunch": false, 1004 | "category": "Accelerated computing", 1005 | "gpuNum": 4, 1006 | "hideHardwareSpecs": false, 1007 | "memoryGiB": 192, 1008 | "name": "ml.g5.12xlarge", 1009 | "vcpuNum": 48 1010 | }, 1011 | { 1012 | "_defaultOrder": 53, 1013 | "_isFastLaunch": false, 1014 | "category": "Accelerated computing", 1015 | "gpuNum": 4, 1016 | "hideHardwareSpecs": false, 1017 | "memoryGiB": 384, 1018 | "name": "ml.g5.24xlarge", 1019 | "vcpuNum": 96 1020 | }, 1021 | { 1022 | "_defaultOrder": 54, 1023 | "_isFastLaunch": false, 1024 | "category": "Accelerated computing", 1025 | "gpuNum": 8, 1026 | "hideHardwareSpecs": false, 1027 | "memoryGiB": 768, 1028 | "name": "ml.g5.48xlarge", 1029 | "vcpuNum": 192 1030 | }, 1031 | { 1032 | "_defaultOrder": 55, 1033 | "_isFastLaunch": false, 1034 | "category": "Accelerated computing", 1035 | "gpuNum": 8, 1036 | "hideHardwareSpecs": false, 1037 | "memoryGiB": 1152, 1038 | "name": "ml.p4d.24xlarge", 1039 | "vcpuNum": 96 1040 | }, 1041 | { 1042 | "_defaultOrder": 56, 1043 | "_isFastLaunch": false, 1044 | "category": "Accelerated computing", 1045 | "gpuNum": 8, 1046 | "hideHardwareSpecs": false, 1047 | "memoryGiB": 1152, 1048 | "name": "ml.p4de.24xlarge", 1049 | "vcpuNum": 96 1050 | }, 1051 | { 1052 | "_defaultOrder": 57, 1053 | "_isFastLaunch": false, 1054 | "category": "Accelerated computing", 1055 | "gpuNum": 0, 1056 | "hideHardwareSpecs": false, 1057 | "memoryGiB": 32, 1058 | "name": "ml.trn1.2xlarge", 1059 | "vcpuNum": 8 1060 | }, 1061 | { 1062 | "_defaultOrder": 58, 1063 | "_isFastLaunch": false, 1064 | "category": "Accelerated computing", 1065 | "gpuNum": 0, 1066 | "hideHardwareSpecs": false, 1067 | "memoryGiB": 512, 1068 | "name": "ml.trn1.32xlarge", 1069 | "vcpuNum": 128 1070 | }, 1071 | { 1072 | "_defaultOrder": 59, 1073 | "_isFastLaunch": false, 1074 | "category": "Accelerated computing", 1075 | "gpuNum": 0, 1076 | "hideHardwareSpecs": false, 1077 | "memoryGiB": 512, 1078 | "name": "ml.trn1n.32xlarge", 1079 | "vcpuNum": 128 1080 | } 1081 | ], 1082 | "instance_type": "ml.t3.medium", 1083 | "kernelspec": { 1084 | "display_name": "Python 3", 1085 | "language": "python", 1086 | "name": "python3" 1087 | }, 1088 | "language_info": { 1089 | "codemirror_mode": { 1090 | "name": "ipython", 1091 | "version": 3 1092 | }, 1093 | "file_extension": ".py", 1094 | "mimetype": "text/x-python", 1095 | "name": "python", 1096 | "nbconvert_exporter": "python", 1097 | "pygments_lexer": "ipython3", 1098 | "version": "3.11.12" 1099 | } 1100 | }, 1101 | "nbformat": 4, 1102 | "nbformat_minor": 5 1103 | } 1104 | -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/assets/consolidate_adapter_shards_and_merge_model.py: -------------------------------------------------------------------------------- 1 | from optimum.neuron.models.training import ( 2 | consolidate_model_parallel_checkpoints_to_unified_checkpoint, 3 | ) 4 | from transformers import AutoModel, AutoTokenizer 5 | from argparse import ArgumentParser 6 | from shutil import copyfile 7 | import os 8 | import peft 9 | 10 | parser = ArgumentParser() 11 | parser.add_argument( 12 | "-i", 13 | "--input_dir", 14 | help="source checkpoint directory containing sharded adapter checkpoint files", 15 | required=True, 16 | ) 17 | parser.add_argument( 18 | "-o", 19 | "--output_dir", 20 | help="destination directory for final merged model (adapters merged into base model)", 21 | required=True, 22 | ) 23 | args = parser.parse_args() 24 | 25 | 
consolidated_ckpt_dir = os.path.join(args.input_dir, "consolidated") 26 | 27 | # Consolidate the adapter shards into a PEFT-compatible checkpoint 28 | print("Consolidating LoRA adapter shards") 29 | consolidate_model_parallel_checkpoints_to_unified_checkpoint( 30 | args.input_dir, consolidated_ckpt_dir 31 | ) 32 | copyfile( 33 | os.path.join(args.input_dir, "adapter_default/adapter_config.json"), 34 | os.path.join(consolidated_ckpt_dir, "adapter_config.json"), 35 | ) 36 | 37 | # Load AutoPeftModel using the consolidated PEFT checkpoint 38 | peft_model = peft.AutoPeftModelForCausalLM.from_pretrained(consolidated_ckpt_dir) 39 | 40 | # Merge adapter weights into base model, save new pretrained model 41 | print("Merging LoRA adapter shards into base model") 42 | merged_model = peft_model.merge_and_unload() 43 | print(f"Saving merged model to {args.output_dir}") 44 | merged_model.save_pretrained(args.output_dir) 45 | 46 | print(f"Saving tokenizer to {args.output_dir}") 47 | tokenizer = AutoTokenizer.from_pretrained(args.input_dir) 48 | tokenizer.save_pretrained(args.output_dir) 49 | 50 | # Load the pretrained model and print config 51 | print("Merged model config:") 52 | model = AutoModel.from_pretrained(args.output_dir) 53 | print(model) 54 | -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/assets/finetune_llama.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass, field 2 | from datasets import load_dataset 3 | from peft import LoraConfig 4 | from transformers import ( 5 | AutoTokenizer, 6 | set_seed, 7 | ) 8 | import os 9 | import subprocess 10 | import boto3 11 | from botocore.exceptions import ClientError 12 | from huggingface_hub import login 13 | import torch 14 | 15 | from optimum.neuron import NeuronHfArgumentParser as HfArgumentParser 16 | from optimum.neuron import NeuronSFTConfig, NeuronSFTTrainer, NeuronTrainingArguments 17 | from torch_xla.core.xla_model import is_master_ordinal 18 | from optimum.neuron.models.training import NeuronModelForCausalLM 19 | 20 | 21 | 22 | def training_function(script_args, training_args): 23 | dataset = load_dataset("b-mc2/sql-create-context", split="train") 24 | dataset = dataset.shuffle(seed=23) 25 | train_dataset = dataset.select(range(50000)) 26 | eval_dataset = dataset.select(range(50000, 50500)) 27 | 28 | def create_conversation(sample): 29 | system_message = ( 30 | "You are a text to SQL query translator. 
Users will ask you questions in English and you will generate a " 31 | "SQL query based on the provided SCHEMA.\nSCHEMA:\n{schema}" 32 | ) 33 | return { 34 | "messages": [ 35 | { 36 | "role": "system", 37 | "content": system_message.format(schema=sample["context"]), 38 | }, 39 | {"role": "user", "content": sample["question"]}, 40 | {"role": "assistant", "content": sample["answer"] + ";"}, 41 | ] 42 | } 43 | 44 | train_dataset = train_dataset.map( 45 | create_conversation, remove_columns=train_dataset.features, batched=False 46 | ) 47 | eval_dataset = eval_dataset.map( 48 | create_conversation, remove_columns=eval_dataset.features, batched=False 49 | ) 50 | 51 | tokenizer = AutoTokenizer.from_pretrained(script_args.tokenizer_id) 52 | # tokenizer.pad_token = tokenizer.eos_token 53 | # tokenizer.eos_token_id = 128001 54 | 55 | trn_config = training_args.trn_config 56 | dtype = torch.bfloat16 if training_args.bf16 else torch.float32 57 | model = NeuronModelForCausalLM.from_pretrained( 58 | script_args.model_id, 59 | trn_config, 60 | torch_dtype=dtype, 61 | # Use FlashAttention2 for better performance and to be able to use larger sequence lengths. 62 | use_flash_attention_2=False, #Because we are training a sequence lower than 2K for the workshop 63 | ) 64 | 65 | config = LoraConfig( 66 | r=script_args.lora_r, 67 | lora_alpha=script_args.lora_alpha, 68 | lora_dropout=script_args.lora_dropout, 69 | target_modules=[ 70 | "q_proj", 71 | "gate_proj", 72 | "v_proj", 73 | "o_proj", 74 | "k_proj", 75 | "up_proj", 76 | "down_proj", 77 | ], 78 | bias="none", 79 | task_type="CAUSAL_LM", 80 | ) 81 | 82 | args = training_args.to_dict() 83 | 84 | sft_config = NeuronSFTConfig( 85 | max_seq_length=1024, 86 | packing=True, 87 | **args, 88 | dataset_kwargs={ 89 | "add_special_tokens": False, 90 | "append_concat_token": True, 91 | }, 92 | ) 93 | 94 | trainer = NeuronSFTTrainer( 95 | args=sft_config, 96 | model=model, 97 | peft_config=config, 98 | tokenizer=tokenizer, 99 | train_dataset=train_dataset, 100 | eval_dataset=eval_dataset, 101 | ) 102 | 103 | # Start training 104 | trainer.train() 105 | del trainer 106 | 107 | 108 | @dataclass 109 | class ScriptArguments: 110 | model_id: str = field( 111 | default="TinyLlama/TinyLlama-1.1B-Chat-v1.0", 112 | metadata={ 113 | "help": "The model that you want to train from the Hugging Face hub." 114 | }, 115 | ) 116 | tokenizer_id: str = field( 117 | default="TinyLlama/TinyLlama-1.1B-Chat-v1.0", 118 | metadata={"help": "The tokenizer used to tokenize text for fine-tuning."}, 119 | ) 120 | lora_r: int = field( 121 | default=16, 122 | metadata={"help": "LoRA r value to be used during fine-tuning."}, 123 | ) 124 | lora_alpha: int = field( 125 | default=32, 126 | metadata={"help": "LoRA alpha value to be used during fine-tuning."}, 127 | ) 128 | lora_dropout: float = field( 129 | default=0.05, 130 | metadata={"help": "LoRA dropout value to be used during fine-tuning."}, 131 | ) 132 | secret_name: str = field( 133 | default="huggingface/token", 134 | metadata={"help": "AWS Secrets Manager secret name containing Hugging Face token."}, 135 | ) 136 | secret_region: str = field( 137 | default="us-west-2", 138 | metadata={"help": "AWS region where the secret is stored."}, 139 | ) 140 | 141 | 142 | def get_secret(secret_name, region_name): 143 | """ 144 | Retrieve a secret from AWS Secrets Manager by searching for secrets with the given name prefix. 145 | This is specific to the workshop environment. 
146 | """ 147 | try: 148 | session = boto3.session.Session() 149 | client = session.client(service_name='secretsmanager', region_name=region_name) 150 | 151 | # List secrets and find one that starts with the secret_name 152 | paginator = client.get_paginator('list_secrets') 153 | for page in paginator.paginate(): 154 | for secret in page['SecretList']: 155 | if secret['Name'].startswith(secret_name): 156 | response = client.get_secret_value(SecretId=secret['ARN']) 157 | if 'SecretString' in response: 158 | return response['SecretString'] 159 | return None 160 | except ClientError: 161 | print("Could not retrieve secret from AWS Secrets Manager") 162 | return None 163 | 164 | if __name__ == "__main__": 165 | parser = HfArgumentParser([ScriptArguments, NeuronTrainingArguments]) 166 | script_args, training_args = parser.parse_args_into_dataclasses() 167 | 168 | # Check for Hugging Face token in environment variable 169 | hf_token = os.environ.get("HF_TOKEN") 170 | 171 | # If no token in environment, try to get it from AWS Secrets Manager 172 | if not hf_token: 173 | print("No Hugging Face token found in environment, checking AWS Secrets Manager...") 174 | hf_token = get_secret(script_args.secret_name, script_args.secret_region) 175 | 176 | # Login to Hugging Face if a valid token is found 177 | if hf_token: 178 | print("Logging in to Hugging Face Hub...") 179 | login(token=hf_token) 180 | else: 181 | print("No valid Hugging Face token found, continuing without authentication") 182 | 183 | set_seed(training_args.seed) 184 | training_function(script_args, training_args) 185 | 186 | # Consolidate LoRA adapter shards, merge LoRA adapters into base model, save merged model 187 | if is_master_ordinal(): 188 | input_ckpt_dir = os.path.join( 189 | training_args.output_dir, f"checkpoint-{training_args.max_steps}" 190 | ) 191 | output_ckpt_dir = os.path.join(training_args.output_dir, "merged_model") 192 | # the spawned process expects to see 2 NeuronCores for consolidating checkpoints with a tp=2 193 | # Either the second core isn't really used or it is freed up by the other thread finishing. 194 | # Adjusting Neuron env. var to advertise 2 NeuronCores to the process. 195 | env = os.environ.copy() 196 | env["NEURON_RT_VISIBLE_CORES"] = "0-1" 197 | subprocess.run( 198 | [ 199 | "python3", 200 | "consolidate_adapter_shards_and_merge_model.py", 201 | "-i", 202 | input_ckpt_dir, 203 | "-o", 204 | output_ckpt_dir, 205 | ], 206 | env=env 207 | ) -------------------------------------------------------------------------------- /labs/FineTuning/HuggingFaceExample/01_finetuning/assets/requirements.txt: -------------------------------------------------------------------------------- 1 | optimum-neuron==0.3.0 2 | peft==0.16.0 3 | trl==0.11.4 4 | huggingface_hub==0.33.4 5 | datasets==3.6.0 6 | -------------------------------------------------------------------------------- /labs/Lab_Four_NKI_Profiling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Neuron Profile \n", 8 | "\n", 9 | "This workshop was borrowed from the AWS NKI Workshop. 
To find the full original content, see here:\n", 10 | "- Workshop: https://catalog.us-east-1.prod.workshops.aws/workshops/0d84c975-7a94-469a-b6bc-661768d303f7/en-US/lab-0\n", 11 | "- Github: https://github.com/aws-samples/ml-specialized-hardware/tree/main/workshops/03_NKIWorkshop\n", 12 | "\n", 13 | "In this tutorial, we use Neuron Profile to view the execution trace of a NKI kernel captured on a NeuronCore. In doing so, we learn about:\n", 14 | "\n", 15 | "- Installation and usage of Neuron Profile.\n", 16 | "\n", 17 | "- Inspecting a detailed execution timeline of compute engine instructions and DMA engine activities generated from your NKI kernel.\n", 18 | "\n", 19 | "As background, [Neuron Profile](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-profile-user-guide.html) is the tool you need to visualize where time is being spent during kernel execution on NeuronDevices, which is crucial for identifying performance bottlenecks and opportunities of your kernel. Neuron Profile produces runtime execution data for every instruction executed on each compute engine and also every data movement activity completed by DMA engines. Neuron Profile also reports key performance metrics such as compute engine and memory bandwidth utilization, which allows developers to quickly find out the achieved hardware efficiency of their kernel. Profiling typically has near zero overhead thanks to the dedicated on-chip profiling hardware in NeuronDevices.\n", 20 | "\n", 21 | "## Profile a NKI Kernel\n", 22 | "\n", 23 | "### Install Neuron Profile\n", 24 | "Make sure you have the latest version of the `aws-neuronx-tools`, which includes updated profiling support for NKI kernels. Neuron Profile is included within this package and is installed to `/opt/aws/neuron/bin`.\n", 25 | "\n", 26 | "The `aws-neuronx-tools` package comes pre-installed on [Neuron DLAMIs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/dlami/index.html). For detailed installation instructions see [Neuron Profile User Guide: Installation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-profile-user-guide.html#installation).\n", 27 | "\n", 28 | "### Profile using `neuron-profile capture`\n", 29 | "\n", 30 | "To profile a NKI kernel the required steps are (1) enable `NEURON_FRAMEWORK_DEBUG` to tell the compiler to save the `NEFF` file, (2) execute the NKI kernel to generate the `NEFF`, and (3) run `neuron-profile capture` to generate a `NTFF` profile. Each step is described in more detail below.\n", 31 | "\n", 32 | "We will profile a NKI kernel which computes the element-wise exponential of an input tensor of any 2D shape. The rest of this tutorial will use a performance profile generated from this kernel as an example. 
Full code of `prof-kernel.py`:" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "%%writefile prof-kernel.py\n", 42 | "\"\"\"\n", 43 | "Example kernel used to demmonstrate Neuron Profile.\n", 44 | "\"\"\"\n", 45 | "import torch\n", 46 | "from neuronxcc import nki\n", 47 | "import neuronxcc.nki.language as nl\n", 48 | "import math\n", 49 | "import os\n", 50 | "os.environ[\"NEURON_FRAMEWORK_DEBUG\"] = \"1\"\n", 51 | "os.environ[\"NEURON_CC_FLAGS\"]= \" --disable-dge \"\n", 52 | "\n", 53 | "@nki.jit\n", 54 | "def tensor_exp_kernel_(in_tensor):\n", 55 | " \"\"\"NKI kernel to compute elementwise exponential of an input tensor\n", 56 | "\n", 57 | " Args:\n", 58 | " in_tensor: an input tensor of ANY 2D shape (up to SBUF size)\n", 59 | " Returns:\n", 60 | " out_tensor: an output tensor of ANY 2D shape (up to SBUF size)\n", 61 | " \"\"\"\n", 62 | " out_tensor = nl.ndarray(in_tensor.shape, dtype=in_tensor.dtype,\n", 63 | " buffer=nl.shared_hbm)\n", 64 | "\n", 65 | " sz_p, sz_f = in_tensor.shape\n", 66 | "\n", 67 | " i_f = nl.arange(sz_f)[None, :]\n", 68 | "\n", 69 | " for p in nl.affine_range(math.ceil(sz_p / nl.tile_size.pmax)):\n", 70 | " # Generate tensor indices for the input/output tensors\n", 71 | " # pad index to pmax, for simplicity\n", 72 | " i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]\n", 73 | "\n", 74 | " # Load input data from external memory to on-chip memory\n", 75 | " # only read up to sz_p\n", 76 | " in_tile = nl.load(in_tensor[i_p, i_f], mask=(i_p\n", 122 | "Use the flag `--disable-dge` to temporarily disable a new compiler feature which is interfering with DMA debugging information display in neuron-profile. This is highly recommended to improve NKI performance debugging experience until we release a software fix for this issue.\n", 123 | "\n", 124 | "\n", 125 | "2. Compile your NKI kernel to create a NEFF in your current directory:" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "!python3 prof-kernel.py" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "
\n", 142 | "Find your NEFF named similarly to `MODULE_0_SyncTensorsGraph.13_12659246067793504316.neff`.\n", 143 | "
\n", 144 | "\n", 145 | "3. Profile the NEFF. This profiling step executes the NEFF on the NeuronDevice and records a raw execution trace into an Neuron Trace File Format (NTFF) artifact." 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "!neuron-profile capture -n -s profile.ntff --profile-nth-exec=2" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "This will save your NTFF profile to `profile_exec_2.ntff`.\n", 162 | "\n", 163 | "
\n", 164 | "The `--profile-nth-exec=2` option will profile your NEFF twice on the NeuronDevice and output a NTFF profile for the second iteration. This is recommended to avoid one-time warmup delays which can be seen in the first iteration of execution.\n", 165 | "
\n", 166 | "\n", 167 | "In [View Neuron Profile UI](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/neuron_profile_for_nki.html#nki-view-neuron-profile-ui), we will view the profile in a user-friendly format using the Neuron Profile UI.\n", 168 | "\n", 169 | "### Profile using nki.benchmark\n", 170 | "\n", 171 | "You may also use the [nki.benchmark](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.benchmark.html) API to generate a NEFF and NTFF programmatically. One caveat is [nki.benchmark](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.benchmark.html) runs your NEFF without an ML framework in [nki.baremetal](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.baremetal.html) mode, so the input tensors to the kernel must be NumPy arrays instead of framework tensors such as `torch.Tensor`.\n", 172 | "\n", 173 | "Below is an example NKI kernel decorated by [nki.benchmark](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.benchmark.html). Full code of `prof-kernel-benchmark.py`:" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "%%writefile prof-kernel-benchmark.py\n", 183 | "\"\"\"\n", 184 | "Example kernel used to demmonstrate Neuron Profile with nki.benchmark.\n", 185 | "\"\"\"\n", 186 | "from neuronxcc import nki\n", 187 | "from neuronxcc.nki.typing import tensor\n", 188 | "import neuronxcc.nki.language as nl\n", 189 | "import math\n", 190 | "\n", 191 | "\n", 192 | "@nki.benchmark(save_neff_name='file.neff', save_trace_name='profile.ntff')\n", 193 | "def tensor_exp_kernel_(in_tensor):\n", 194 | " \"\"\"NKI kernel to compute elementwise exponential of an input tensor\n", 195 | " Args:\n", 196 | " in_tensor: an input tensor of ANY 2D shape (up to SBUF size)\n", 197 | " Returns:\n", 198 | " out_tensor: an output tensor of ANY 2D shape (up to SBUF size)\n", 199 | " \"\"\"\n", 200 | " out_tensor = nl.ndarray(in_tensor.shape, dtype=in_tensor.dtype,\n", 201 | " buffer=nl.shared_hbm)\n", 202 | "\n", 203 | " sz_p, sz_f = in_tensor.shape\n", 204 | " i_f = nl.arange(sz_f)[None, :]\n", 205 | " for p in nl.affine_range(math.ceil(sz_p / nl.tile_size.pmax)):\n", 206 | " # Generate tensor indices for the input/output tensors\n", 207 | " # pad index to pmax, for simplicity\n", 208 | " i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]\n", 209 | " # Load input data from external memory to on-chip memory\n", 210 | " # only read up to sz_p\n", 211 | " in_tile = nl.load(in_tensor[i_p, i_f], mask=(i_p