├── CODEOWNERS ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE ├── README.md ├── examples ├── cluster-configs │ └── trn1-16-nodes-pcluster.md ├── general │ └── network │ │ └── vpc-subnet-setup.md ├── images │ ├── add-route.png │ ├── az.png │ ├── create-vpc.png │ ├── edit-subnet.png │ ├── ipv4.png │ ├── subnets-nat.png │ ├── subnets.png │ ├── vpc-entry.png │ └── vpc-setup.png └── jobs │ ├── dp-bert-launch-job.md │ ├── finetuning │ └── neuronx-nemo-megatron-llamav2-finetuning-job.md │ ├── gpt3-launch-job.md │ ├── neuronx-nemo-megatron-gpt-job.md │ └── neuronx-nemo-megatron-llamav2-job.md ├── install_neuron.sh └── releasenotes.md /CODEOWNERS: -------------------------------------------------------------------------------- 1 | # This file creates codeowners for the documentation. It will allow setting code reviewers for all Pull requests to merge to the master branch 2 | # Each line is a file pattern followed by one or more owners. 3 | 4 | # Refernce guide - https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/creating-a-repository-on-github/about-code-owners#example-[…]ners-file 5 | # Example - These owners will be the default owners for everything in 6 | # the repo. Unless a later match takes precedence, 7 | # @global-owner1 and @global-owner2 will be requested for 8 | # review when someone opens a pull request. 9 | # * @global-owner1 @global-owner2 10 | 11 | * @aws-maens @aws-sadaf 12 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. 
You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Amazon Software License 1.0 2 | 3 | This Amazon Software License ("License") governs your use, reproduction, and 4 | distribution of the accompanying software as specified below. 5 | 6 | 1. Definitions 7 | 8 | "Licensor" means any person or entity that distributes its Work. 9 | 10 | "Software" means the original work of authorship made available under this 11 | License. 12 | 13 | "Work" means the Software and any additions to or derivative works of the 14 | Software that are made available under this License. 15 | 16 | The terms "reproduce," "reproduction," "derivative works," and 17 | "distribution" have the meaning as provided under U.S. copyright law; 18 | provided, however, that for the purposes of this License, derivative works 19 | shall not include works that remain separable from, or merely link (or bind 20 | by name) to the interfaces of, the Work. 21 | 22 | Works, including the Software, are "made available" under this License by 23 | including in or with the Work either (a) a copyright notice referencing the 24 | applicability of this License to the Work, or (b) a copy of this License. 25 | 26 | 2. License Grants 27 | 28 | 2.1 Copyright Grant. 
Subject to the terms and conditions of this License, 29 | each Licensor grants to you a perpetual, worldwide, non-exclusive, 30 | royalty-free, copyright license to reproduce, prepare derivative works of, 31 | publicly display, publicly perform, sublicense and distribute its Work and 32 | any resulting derivative works in any form. 33 | 34 | 2.2 Patent Grant. Subject to the terms and conditions of this License, each 35 | Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free 36 | patent license to make, have made, use, sell, offer for sale, import, and 37 | otherwise transfer its Work, in whole or in part. The foregoing license 38 | applies only to the patent claims licensable by Licensor that would be 39 | infringed by Licensor's Work (or portion thereof) individually and 40 | excluding any combinations with any other materials or technology. 41 | 42 | 3. Limitations 43 | 44 | 3.1 Redistribution. You may reproduce or distribute the Work only if 45 | (a) you do so under this License, (b) you include a complete copy of this 46 | License with your distribution, and (c) you retain without modification 47 | any copyright, patent, trademark, or attribution notices that are present 48 | in the Work. 49 | 50 | 3.2 Derivative Works. You may specify that additional or different terms 51 | apply to the use, reproduction, and distribution of your derivative works 52 | of the Work ("Your Terms") only if (a) Your Terms provide that the use 53 | limitation in Section 3.3 applies to your derivative works, and (b) you 54 | identify the specific derivative works that are subject to Your Terms. 55 | Notwithstanding Your Terms, this License (including the redistribution 56 | requirements in Section 3.1) will continue to apply to the Work itself. 57 | 58 | 3.3 Use Limitation. The Work and any derivative works thereof only may be 59 | used or intended for use with the web services, computing platforms or 60 | applications provided by Amazon.com, Inc. or its affiliates, including 61 | Amazon Web Services, Inc. 62 | 63 | 3.4 Patent Claims. If you bring or threaten to bring a patent claim against 64 | any Licensor (including any claim, cross-claim or counterclaim in a 65 | lawsuit) to enforce any patents that you allege are infringed by any Work, 66 | then your rights under this License from such Licensor (including the 67 | grants in Sections 2.1 and 2.2) will terminate immediately. 68 | 69 | 3.5 Trademarks. This License does not grant any rights to use any 70 | Licensor's or its affiliates' names, logos, or trademarks, except as 71 | necessary to reproduce the notices described in this License. 72 | 73 | 3.6 Termination. If you violate any term of this License, then your rights 74 | under this License (including the grants in Sections 2.1 and 2.2) will 75 | terminate immediately. 76 | 77 | 4. Disclaimer of Warranty. 78 | 79 | THE WORK IS PROVIDED "AS IS" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 80 | EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF 81 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR 82 | NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER 83 | THIS LICENSE. SOME STATES' CONSUMER LAWS DO NOT ALLOW EXCLUSION OF AN 84 | IMPLIED WARRANTY, SO THIS DISCLAIMER MAY NOT APPLY TO YOU. 85 | 86 | 5. Limitation of Liability. 
87 | 88 | EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL 89 | THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE 90 | SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, 91 | INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR 92 | RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING 93 | BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS 94 | OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER COMM ERCIAL DAMAGES 95 | OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF 96 | SUCH DAMAGES. 97 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Train a model on AWS Trn1 ParallelCluster 2 | 3 | ## Introduction 4 | 5 | This document explains how to use AWS ParallelCluster to build HPC compute cluster that uses trn1 compute nodes to run your distributed ML training job. Once the nodes are launched, we will run a training task to confirm that the nodes are working, and use SLURM commands to check the job status. In this tutorial, we will use AWS pcluster command to run a YAML file in order to generate the cluster. As an example, we are going to launch multiple trn1.32xl nodes in our cluster. 6 | 7 | We are going to set up our ParallelCluster infrastructure as below: 8 | 9 | ![image info](./examples/images/vpc-setup.png) 10 | 11 | As shown in the figure above, inside a VPC, there are two subnets, a public and a private ones. Head node resides in the public subnet, while the compute fleet (in this case, trn1 instances) are in the private subnet. A Network Address Translation (NAT) gateway is also needed in order for nodes in the private subnet to connect to clients outside the VPC. In the next section, we are going to describe how to set up all the necessary infrastructure for Trn1 ParallelCluster. 12 | 13 | 14 | 15 | ## Prerequisite infrastructure 16 | 17 | ### VPC Creation 18 | A ParallelCluster requires a VPC that has two subnets and a Network Address Translation (NAT) gateway as shown in the diagram above. [Here](./examples/general/network/vpc-subnet-setup.md) are the instructions to create the VPC and enable auto-assign public IPv4 address for the public subnet. 19 | 20 | ### Key pair 21 | A key pair is needed for access to the head node of the cluster. You may use an existing one or create a new key pair by following the instruction [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/create-key-pairs.html#having-ec2-create-your-key-pair "Create key pair") 22 | 23 | ### AWS ParallelCluster Python package 24 | 25 | AWS ParallelCluster Python package is needed in a local environment (i.e., your Mac/PC desktop with a CLI terminal or an AWS Cloud9) where you issue the command to launch the creation process for your HPC environment in AWS. See [here](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-virtual-environment.html) for instructions about installing AWS ParallelCluster Python package in your local environment. 
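If you are setting up a fresh environment, the installation is typically just a pip install into a Python virtual environment. The sketch below assumes the standard `aws-parallelcluster` package from PyPI and a POSIX shell; follow the linked guide for the authoritative steps (ParallelCluster 3.x also relies on Node.js for the underlying AWS CDK).

```
# Create and activate an isolated virtual environment for the CLI
python3 -m venv ~/pcluster-venv
source ~/pcluster-venv/bin/activate

# Install the ParallelCluster CLI and confirm it is on the PATH
python3 -m pip install --upgrade pip
python3 -m pip install aws-parallelcluster
pcluster version
```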
26 | 27 | ## Create a cluster 28 | 29 | See table below for script to create trn1 ParallelCluster: 30 | 31 | |Cluster | Link | 32 | |-------------|------------------| 33 | |16xTrn1 nodes | [trn1-16-nodes-pcluster.md](./examples/cluster-configs/trn1-16-nodes-pcluster.md) | 34 | 35 | ## Launch training job 36 | 37 | See table below for script to launch a model training job on the ParallelCluster: 38 | 39 | |Job | Link | 40 | |-------------|-------------------| 41 | |BERT Large | [dp-bert-launch-job.md](./examples/jobs/dp-bert-launch-job.md) | 42 | |GPT3 (neuronx-nemo-megatron) | [neuronx-nemo-megatron-gpt-job.md](./examples/jobs/neuronx-nemo-megatron-gpt-job.md) | 43 | |Llama 2 7B (neuronx-nemo-megatron) | [neuronx-nemo-megatron-llamav2-job.md](./examples/jobs/neuronx-nemo-megatron-llamav2-job.md) | 44 | 45 | ## Launch training job [End of Support] 46 | 47 | See table below for scripts that are no longer supported: 48 | 49 | |Job | Link | 50 | |-------------|-------------------| 51 | |GPT3 (Megatron-LM) | [gpt3-launch-job.md](./examples/jobs/gpt3-launch-job.md) | 52 | 53 | ## Security 54 | 55 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 56 | 57 | ## License 58 | 59 | This library is licensed under the Amazon Software License. 60 | 61 | 62 | ## Release Notes 63 | 64 | Please refer to the [Change Log](releasenotes.md). 65 | 66 | -------------------------------------------------------------------------------- /examples/cluster-configs/trn1-16-nodes-pcluster.md: -------------------------------------------------------------------------------- 1 | # Create ParallelCluster 2 | 3 | 1. Once your VPC, ParallelCluster python package, and key pair are set up, you are ready to create a ParallelCluster. Copy the following content into a launch.yaml file in your local desktop where AWS ParallelCluster CLI is installed. Here is an example YAML file content: 4 | 5 | ``` 6 | Region: # i.e., us-west-2 7 | Image: 8 | Os: ubuntu2004 9 | HeadNode: 10 | InstanceType: c5.4xlarge 11 | Networking: 12 | SubnetId: subnet- 13 | Ssh: 14 | KeyName: 15 | LocalStorage: 16 | RootVolume: 17 | Size: 1024 18 | CustomActions: 19 | OnNodeConfigured: 20 | Script: s3://neuron-s3/pcluster/post-install-scripts/neuron-installation/v/u20/pt/install_neuron.sh 21 | Iam: 22 | S3Access: 23 | - BucketName: neuron-s3 24 | EnableWriteAccess: false 25 | Scheduling: 26 | Scheduler: slurm 27 | SlurmQueues: 28 | - Name: compute1 29 | CapacityType: ONDEMAND 30 | ComputeSettings: 31 | LocalStorage: 32 | RootVolume: 33 | Size: 1024 34 | EphemeralVolume: 35 | MountDir: /local_storage 36 | ComputeResources: 37 | - Efa: 38 | Enabled: true 39 | InstanceType: trn1.32xlarge 40 | MaxCount: 16 41 | MinCount: 0 42 | Name: queue1-i1 43 | Networking: 44 | SubnetIds: 45 | - subnet- 46 | PlacementGroup: 47 | Enabled: true 48 | CustomActions: 49 | OnNodeConfigured: 50 | Script: s3://neuron-s3/pcluster/post-install-scripts/neuron-installation/v/u20/pt/install_neuron.sh 51 | Iam: 52 | S3Access: 53 | - BucketName: neuron-s3 54 | EnableWriteAccess: false 55 | SharedStorage: 56 | - FsxLustreSettings: 57 | DeploymentType: SCRATCH_2 58 | StorageCapacity: 1200 59 | MountDir: /fsx 60 | Name: pclusterfsx 61 | StorageType: FsxLustre 62 | ``` 63 | 64 | Please replace the placeholders ``, ``, `` and `` with appropriate values. 65 | 66 | For example, to use Neuron SDK release version 2.13.0, replace `` with `2.13.0` in the two `CustomActions: OnNodeConfigured: Script` fields (make sure `2.13.0` is preceded by the letter `v`). 
67 | 68 | ``` 69 | CustomActions: 70 | OnNodeConfigured: 71 | Script: s3://neuron-s3/pcluster/post-install-scripts/neuron-installation/v2.13.0/u20/pt/install_neuron.sh 72 | ``` 73 | The ``and ``values are obtained following the [VPC creation guide](./examples/general/network/vpc-subnet-setup.md). 74 | 75 | The `` is obtained following [key pair setup](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/create-key-pairs.html#having-ec2-create-your-key-pair) 76 | 77 | The YAML file above will create a ParallelCluster with a c5.4xlarge head node, and 16 trn1.32xl compute nodes. All `MaxCount` trn1 nodes are in the same queue. In case you need to isolate compute nodes with different queues, simply append another instanceType designation to the current instanceType, and designate `MaxCount` for each queue, for example, `InstanceType` section would be become: 78 | 79 | ``` 80 | InstanceType: trn1.32xlarge 81 | MaxCount: 8 82 | MinCount: 0 83 | Name: queue-0 84 | InstanceType: trn1.32xlarge 85 | MaxCount: 8 86 | MinCount: 0 87 | Name: queue-1 88 | ``` 89 | 90 | So now you have two queues, each queue is designated to a number of trn1 compute nodes. An unique feature for trn1.32xlarge instance is the EFA interfaces built for high performance/low latency network data transfer. This is indicated by: 91 | 92 | ``` 93 | - Efa: 94 | Enabled: true 95 | ``` 96 | 97 | If you are using trn1.2xl instance, this feature is not enabled, and in which case, you don’t need such designation. 98 | 99 | 2. In the virtual environment where you installed AWS ParallelCluster API, run the following command (assuming you have saved the configurations above in `configuration.yaml`): 100 | 101 | ``` 102 | pcluster create-cluster --cluster-configuration configuration.yaml -n My-PCluster-Trn1 103 | ``` 104 | Where 105 | 106 | `cluster-configuration` is the YAML file 107 | 108 | This will create a ParallelCluster in your AWS account, and you may inspect the progress in AWS CloudFormation console. 109 | 110 | You may also check cluster status using `pcluster` command, for example: 111 | 112 | `pcluster describe-cluster -r us-west-2 -n My-PCluster-Trn1` 113 | 114 | 3. During the cluster creation process, post-install actions now takes place automatically via `CustomActions` indicated in `configuration.yaml` to configure the head node and any static compute nodes (`MinCount` > 0). `CustomActions` will install Neuron drivers and runtime, EFA drivers, and Neuron tools. 115 | 116 | 4. After post-installation actions are complete, the ParallelCluster environment is properly configured to run SLURM jobs. Rerun `pcluster describe-cluster ...` command above to see the head node IP address, such that you may SSH into it for the [next part of the tutorial](../jobs/dp-bert-launch-job.md) where you would launch a training job. 117 | 118 | ## Custom script update 119 | 120 | To make it easier to upgrade to newer versions, you can copy the `install_neuron.sh` to your own s3 bucket s3://```` and modify `CustomActions: OnNodeConfigured: Script` and `CustomActions: OnNodeConfigured: Iam: S3Access:` fields (2 sets). 121 | 122 | ``` 123 | CustomActions: 124 | OnNodeConfigured: 125 | Script: s3:///install_neuron.sh 126 | Iam: 127 | S3Access: 128 | - BucketName: 129 | EnableWriteAccess: false 130 | ``` 131 | 132 | Then create the cluster as above. 
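For example, a straightforward way to seed your bucket is to pull the script from the public `neuron-s3` bucket and push it to your own. The bucket name below is a placeholder, and the source path follows the same versioning convention described above:

```
# Download the post-install script for the Neuron release you want (v2.13.0 shown),
# then upload it to the bucket referenced in your configuration.yaml.
aws s3 cp s3://neuron-s3/pcluster/post-install-scripts/neuron-installation/v2.13.0/u20/pt/install_neuron.sh .
aws s3 cp ./install_neuron.sh s3://<your_bucket_name>/install_neuron.sh
```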
133 | 134 | When you are ready to update your own s3 bucket's copy after a new Neuron SDK release, first stop your ParallelCluster from the system and environment where ParallelCluster API was installed: 135 | ``` 136 | pcluster update-compute-fleet --cluster-name -r --status STOP_REQUESTED 137 | ``` 138 | Next copy the new version to your own s3 bucket, then restart the ParallelCluster. 139 | ``` 140 | pcluster update-compute-fleet --cluster-name -r 146 | cd ~/ 147 | aws s3 cp s3:///install_neuron.sh . 148 | source install_neuron.sh 149 | ``` 150 | 151 | ## Updating configuration.yaml 152 | 153 | You can change the configuration.yaml to update for example compute resources. First stop your ParallelCluster from the system and environment where ParallelCluster API was installed: 154 | ``` 155 | pcluster update-compute-fleet --cluster-name -r --status STOP_REQUESTED 156 | ``` 157 | Update the cluster using the new configuration YAML version (assuming it is named `new_configuration.yaml`): 158 | ``` 159 | pcluster update-cluster -c new_configuration.yaml --cluster-name -r YOUR_REGION 160 | ``` 161 | Then restart the ParallelCluster. 162 | ``` 163 | pcluster update-compute-fleet --cluster-name -r .out` file generated in `~/aws-neuron-samples/torch-neuronx/training/dp_bert_hf_pretrain`. 42 | 43 | ## Launch training 44 | After the compilation job is finished, start the actual pretraining: 45 | 46 | ``` 47 | cd ~/aws-neuron-samples/torch-neuronx/training/dp_bert_hf_pretrain 48 | sbatch --exclusive --nodes=16 --wrap "srun ./run_dp_bert_large_hf_pretrain_bf16_s128_lamb.sh" 49 | ``` 50 | 51 | Again, the job id will be displayed by sbatch and you can follow the training by inspecting the file `slurm_.out` file generated in `~/examples/dp_bert_hf_pretrain`. 52 | 53 | ### Cluster scalability 54 | 55 | In a Trn1 cluster, multiple interconnected Trn1 instances run a large model training workload in parallel and reduce total computation time, or time to convergence. There are two measures of scalability of a cluster: strong scaling and weak scaling. Typically, for model training, the need is to speed up training run, because usage cost is determined by sample throughput for rounds of gradient updates. This means strong scaling is an important measure of scalability for model training. Strong scaling refers to the scenario where the total problem size stays the same as the number of processors increases. In evaluating strong scaling, or the impact of parallelization, we want to keep global batch size same and see how much time it takes to convergence. In such scenario, we need to adjust gradient accumulation micro-step according to number of compute nodes. This is achieved with the following in the downloaded training shell script `run_dp_bert_large_hf_pretrain_bf16_s128_lamb.sh`: 56 | 57 | ``` 58 | GRAD_ACCUM_USTEPS=$(($GRAD_ACCUM_USTEPS/$WORLD_SIZE_JOB)) 59 | ``` 60 | 61 | The SLURM shell script automatically adjust the gradient accumulation microsteps to keep the global batch size for phase 1 at 65536 with LAMB optimizer (strong scaling). 62 | 63 | On the other hand, if the interest is to evaluate how much more workloads can be executed at a fixed time by adding more nodes, then use weak scaling to measure scalability. In weak scaling, the problem size increases at the same rate as number of processors, thereby keeping amount of work per processor the same. To see performance for larger global batch size (weak scaling), please comment out the line above. 
Doing so would keep number of steps for gradient accumulation constant with a default value (i.e., 128) provided in the training script `run_dp_bert_large_hf_pretrain_bf16_s128_lamb.sh`. 64 | 65 | ## Tips 66 | 67 | Some useful SLURM commands are `sinfo`, `squeue` and `scontrol`. `sinfo` command displays information about SLURM node names and partitions. `squeue` command provides information about job queues currently running in the Slurm schedule. SLURM will generate a log file `slurm-XXXXXX.out`. You may then use `tail -f slurm-XXXXXX.out`, to inspect the job summary. `scontrol show node ` can show more information such as node state, power consumption, and more. 68 | 69 | 70 | ## Known issues/limitations 71 | 72 | - The current setup supports up to 128 nodes BERT pretraining with LAMB optmizer when using strong scaling. 73 | 74 | ## Troubleshooting guide 75 | 76 | See [Troubleshooting Guide for AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html) for more details and fixes to common issues. 77 | -------------------------------------------------------------------------------- /examples/jobs/finetuning/neuronx-nemo-megatron-llamav2-finetuning-job.md: -------------------------------------------------------------------------------- 1 | # Launch a Llama2 finetuning job using neuronx-nemo-megatron 2 | 3 | This tutorial explains how to run Llama V2 finetuning jobs with AWS EC2 trn1.32xl instances using [neuronx-nemo-megatron](https://github.com/aws-neuron/neuronx-nemo-megatron) and [AWS ParallelCluster](https://aws.amazon.com/hpc/parallelcluster/). 4 | 5 | neuronx-nemo-megatron (also known as "AWS Neuron Reference for NeMo Megatron") includes modified versions of the open-source packages [NeMo](https://github.com/NVIDIA/NeMo) and [Apex](https://github.com/NVIDIA/apex) that have been adapted for use with AWS Neuron and AWS EC2 Trn1 instances. neuronx-nemo-megatron allows for pretraining models with hundreds of billions of parameters across thousands of Trainium accelerators, and enables advanced training capabilities such as 3D parallelism, sequence parallelism, and activation checkpointing. 6 | 7 | ## Prerequisites 8 | Before proceeding with this tutorial, please follow [these instructions](https://github.com/aws-neuron/aws-neuron-parallelcluster-samples#train-a-model-on-aws-trn1-parallelcluster) to create a ParallelCluster consisting of 1 or more trn1.32xl or trn1n.32xl nodes. ParallelCluster automates the creation of trn1 clusters, and provides the SLURM job management system for scheduling and managing distributed training jobs. Please note that the home directory on your ParallelCluster head node will be shared with all of the worker nodes via NFS. 9 | 10 | ## Install neuronx-nemo-megatron 11 | 12 | With your trn1 ParallelCluster in place, begin by logging into the head node of your cluster using SSH. To provide access to TensorBoard (required in a later step), please make sure that you enable port forwarding for TCP port 6006 when you login, ex: 13 | ``` 14 | ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006 15 | ``` 16 | 17 | Once logged into the head node, activate the provided PyTorch Neuron virtual environment that was created when you set up your ParallelCluster. 
**Note**: if your PyTorch Neuron environment is lower than Neuron 2.11, please refer to the [Neuron documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update.html#pytorch-neuronx-update) for instructions on updating to Neuron 2.11 or later. 18 | ``` 19 | cd ~ 20 | source ./aws_neuron_venv_pytorch/bin/activate 21 | ``` 22 | 23 | Next, clone the neuronx-nemo-megatron repo to the head node: 24 | ``` 25 | cd ~ 26 | git clone https://github.com/aws-neuron/neuronx-nemo-megatron.git 27 | cd neuronx-nemo-megatron 28 | ``` 29 | 30 | Install the `wheel` Python package and run the build script to create the neuronx-nemo-megatron wheels: 31 | ``` 32 | pip3 install wheel 33 | ./build.sh 34 | ``` 35 | 36 | Install the neuronx-nemo-megatron packages and dependencies in your virtual environment: 37 | ``` 38 | pip3 install ./build/*.whl 39 | pip3 install -r requirements.txt torch==1.13.1 protobuf==3.20.3 40 | ``` 41 | 42 | Build the Megatron helper module 43 | ``` 44 | cd ~ 45 | python3 -c "from nemo.collections.nlp.data.language_modeling.megatron.dataset_utils import compile_helper; \ 46 | compile_helper()" 47 | ``` 48 | 49 | The above utility will help make this file : ```nemo.collections.nlp.data.language_modeling.megatron.dataset_utils``` and below is the expected output (You can ignore the error) 50 | ``` 51 | 2023-Aug-17 22:53:01.0674 47940:47940 ERROR TDRV:tdrv_get_dev_info No neuron device available 52 | [NeMo W 2023-08-17 22:53:03 optimizers:67] Could not import distributed_fused_adam optimizer from Apex 53 | [NeMo W 2023-08-17 22:53:04 experimental:27] Module is experimental, not ready for production and is not fully supported. Use at your own risk. 54 | ``` 55 | 56 | ## Download LlamaV2 dataset and tokenizer 57 | This tutorial makes use of the xsum dataset. The dataset can be downloaded from HuggingFace by running the following commands in a python3 shell or file: 58 | 59 | ``` 60 | from datasets import load_dataset 61 | 62 | dataset = load_dataset("xsum") 63 | 64 | dataset = dataset.rename_column('document', 'input') 65 | dataset = dataset.rename_column('summary', 'output') 66 | 67 | output_file_path = "xsum_dataset.jsonl" 68 | dataset['train'].to_json(output_file_path, orient='records', lines=True) 69 | ``` 70 | The above command will give you the raw dataset of around 255mb which needs to be tokenized using a llamaV2 tokenizer. To tokenize the data, you need to request the tokenizer from hugging face and meta following the below link : 71 | 72 | [Request Tokenizer and model weights from hugging face](https://huggingface.co/meta-llama/Llama-2-7b) 73 | 74 | Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the above website and accept our License before requesting access here. 75 | 76 | The file will be tokenized automatically by Neuron NeMo and converted to memory map format. 77 | ## Convert LlamaV2 to Neuron NeMo Format 78 | The LLama models from HuggingFace must be converted into the specified tensor parallel and pipeline parallel format. 79 | Please run the below command with the tensor parallel and pipeline parallel config you are using 80 | We will assume tp=8 and pp=1. 
81 | ``` 82 | python3 checkpoint_conversion/convert_hf_checkpoint_to_nemo_llama.py \ 83 | --path_to_checkpoint='PATH_TO_LLAMA_TOKENIZER/llamav2_weights/7b-hf' \ 84 | --config_file='PATH_TO_LLAMA_TOKENIZER/llamav2_weights/7b-hf/config.json' \ 85 | --output_path="/output/directory" \ 86 | --tp_degree=8 \ 87 | --pp_degree=1 \ 88 | --save_bf16=True 89 | ``` 90 | 91 | ## Llama2 training configurations 92 | We tested with the following model sizes: 7B 93 | ### Llama2 7B 94 | 95 | - Model configuration 96 | - Attention heads: 32 97 | - Layers: 32 98 | - Sequence length: 4096 99 | - Hidden size: 4096 100 | - Hidden FFN size: 11008 101 | - Microbatch size: 1 102 | - Global batch size: 256 103 | 104 | - Distributed training configuration 105 | - Number of nodes: 4 106 | - Tensor parallel degree: 8 107 | - Pipeline parallel degree: 1 108 | - Data parallel degree: 16 109 | 110 | 111 | ## Pre-compile the model 112 | By default, PyTorch Neuron uses a just in time (JIT) compilation flow that sequentially compiles all of the neural network compute graphs as they are encountered during a training job. The compiled graphs are cached in a local compiler cache so that subsequent training jobs can leverage the compiled graphs and avoid compilation (so long as the graph signatures and Neuron version have not changed). 113 | 114 | An alternative to the JIT flow is to use the included [neuron_parallel_compile](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/training/pytorch-neuron-parallel-compile.html?highlight=neuron_parallel_compile) command to perform ahead of time (AOT) compilation. In the AOT compilation flow, the compute graphs are first identified and extracted during a short simulated training run, and the extracted graphs are then compiled and cached using parallel compilation, which is considerably faster than the JIT flow. 115 | 116 | Before starting the compilation you need to update your path to the dataset and tokenizer in the ```test_llama.sh``` script for pretraining llama 7b : 117 | You also want to enable automatic conversion to HuggingFace. Note do not include the # comments in the script, as it breaks hydra parsing. 118 | ``` 119 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 120 | 121 | # For llama 7b 122 | vi test_llama.sh 123 | 124 | 125 | # Update the below lines 126 | # For tokenizer 127 | model.tokenizer.type='PATH_TO_LLAMA_TOKENIZER/llamav2_weights/7b-hf' \ 128 | 129 | # For Dataset and Finetuning 130 | model.data.fine_tuning=True \ 131 | model.data.train_ds.file_names=[PATH_TO_XSUM_JSONL/xsum_dataset.jsonl] \ 132 | 133 | # To load pretrained Llama model 134 | +model.load_xser=True \ 135 | model.resume_from_checkpoint='CONVERTED_CHECKPOINT_PATH/model_optim_rng.ckpt' \ 136 | model.use_cpu_initialization=False \ 137 | 138 | # For HuggingFace Conversion 139 | model.convert_to_hf=True \ 140 | model.output_dir='PATH_TO_SAVE_CONVERTED_MODEL' \ 141 | model.config_path='PATH_TO_LLAMA_TOKENIZER/llamav2_weights/7b-hf/config.json' \ 142 | 143 | # To save checkpoint on end 144 | exp_manager.checkpoint_callback_params.save_last=True \ 145 | 146 | 147 | ``` 148 | Run the following commands to launch an AOT pre-compilation job on your ParallelCluster: 149 | ``` 150 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 151 | sbatch --nodes 4 compile.slurm ./llama_7b.sh 152 | ``` 153 | 154 | 155 | Once you have launched the precompilation job, run the `squeue` command to view the SLURM job queue on your cluster. 
If you have not recently run a job on your cluster, it may take 4-5 minutes for the requested trn1.32xlarge nodes to be launched and initialized. Once the job is running, `squeue` should show output similar to the following: 156 | ``` 157 | JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 158 | 10 compute1 compile.slurm ubuntu R 5:11 4 compute1-dy-queue1-i1-[1-4] 159 | ``` 160 | 161 | You can view the output of the precompilation job by examining the file named `slurm-compile.slurm-ZZ.out` where ZZ represents the JOBID of your job in the `squeue` output, above. Ex: 162 | ``` 163 | tail -f slurm-compile.slurm-10.out 164 | ``` 165 | 166 | Once the precompilation job is complete, you should see a message similar to the following in the logs: 167 | ``` 168 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total graphs: 22 169 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total successful compilations: 22 170 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total failed compilations: 0 171 | ``` 172 | 173 | At this point, you can press `CTRL-C` to exit the tail command. 174 | 175 | ## Launch a finetuning job 176 | The Llama2 finetuning job can be launched in the same manner as the precompilation job described above. In this case, we change the SLURM script from `compile.slurm` to `run.slurm`, but the other parameters remain the same: 177 | ``` 178 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 179 | sbatch --nodes 4 run.slurm ./llama_7b.sh 180 | ``` 181 | 182 | 183 | As outlined above, you can again use the `squeue` command to view the job queue. Once you see that your pretraining job is running, you can view the output of the training job by examining the file named `slurm-run.slurm-ZZ.out` where ZZ represents the JOBID of your job: 184 | ``` 185 | tail -f slurm-run.slurm-11.out 186 | ``` 187 | 188 | Once the model is loaded onto the Trainium accelerators and training has commenced, you will begin to see output indicating the job progress: 189 | ``` 190 | Epoch 0: 28%|██▊ | 424/1507 [41:42<1:46:32, 5.90s/it, loss=1.22, v_num=1778, reduced_train_loss=1.270, gradient_norm=5.780, parameter_norm=1568.0, global_step=423.0, consumed_samples=27072.0, throughput=11.10, thoughput_peak=11.30] 191 | ``` 192 | 193 | ## Monitor training 194 | ### TensorBoard 195 | In addition to the text-based job monitoring described in the previous section, you can also use standard tools such as TensorBoard to monitor training job progress. To view an ongoing training job in TensorBoard, you first need to identify the experiment directory associated with your ongoing job. This will typically be the most recently created directory under `~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_llama`. Once you have identifed the directory, `cd` into it, and then launch TensorBoard: 196 | ``` 197 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_llama 198 | ls -alt|head 199 | # Identify the correct experiment directory in the 200 | # output of the ls command, ex: 2023-06-10_00-22-42 201 | cd YOUR_EXPERIMENT_DIR # <- replace this with your experiment directory 202 | tensorboard --logdir ./ 203 | ``` 204 | 205 | With TensorBoard running, you can then view the TensorBoard dashboard by browsing to http://localhost:6006 on your local machine. 
If you cannot access TensorBoard at this address, please make sure that you have port-forwarded TCP port 6006 when SSH'ing into the head node, ex: `ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006` 206 | 207 | ### neuron-top / neuron-monitor / neuron-ls 208 | The [neuron-top](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-top-user-guide.html?highlight=neuron-top) tool can be used to view useful information about NeuronCore utilization, vCPU and RAM utilization, and loaded graphs on a per-node basis. To use neuron-top during on ongoing training job, first SSH into one of your compute nodes from the head node, and then run `neuron-top`: 209 | ``` 210 | ssh compute1-dy-queue1-i1-1 # to determine which compute nodes are in use, run the squeue command 211 | neuron-top 212 | ``` 213 | 214 | Similarly, once you are logged into one of the active compute nodes, you can also use other Neuron tools such as [neuron-monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) and [neuron-ls](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) to capture performance/utilization statistics and to understand NeuronCore allocation. 215 | 216 | ## Key Features 217 | * We were able to make llama work with zero optimizer but have enabled it by default. To reduce the memory pressure, you can give it by adding the below hyper parameter in your run script : 218 | ``` 219 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/ 220 | 221 | 222 | # For llama 7b 223 | vi test_llama.sh 224 | 225 | # Add the below line in the run script : 226 | model.wrap_with_zero=True \ 227 | 228 | ``` 229 | 230 | ## Known issues/limitations 231 | * The initial release of neuronx-nemo-megatron supports Llama2 pretraining and finetuning only. Model evaluation can be performed in transformers-neuronx library with the converted HuggingFace model. 232 | * neuronx-nemo-megatron currently requires pytorch-lightning v1.8.6 233 | 234 | ## Troubleshooting guide 235 | See [Troubleshooting Guide for AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html) for more details and fixes to common issues. 236 | -------------------------------------------------------------------------------- /examples/jobs/gpt3-launch-job.md: -------------------------------------------------------------------------------- 1 | # Launch GPT-3 training job [End of Support] 2 | Once the cluster is successfully created and the Neuron packages are installed, please ssh into the head node to begin the training example. As an example here, we expand the [GPT3 pretraining with Megatron-LM [End of Support]](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/megatron_lm_gpt.html) tutorial to run on a cluster using SLURM job scheduler. 3 | 4 | You will be running commands from the head node of the cluster. On ParallelCluster, the home directory is shared between the head node and compute nodes so files in the home directory are visible to worker nodes. 5 | 6 | You may inspect the ParallelCluster with the following command: 7 | 8 | ```sh 9 | sinfo 10 | ``` 11 | 12 | and expect to see the output which indicates your node's status. 
An example is: 13 | 14 | ``` 15 | PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 16 | compute1* up infinite 16 alloc compute1-dy-queue1-i1-[1-16] 17 | ``` 18 | 19 | For all the commands below, make sure you are in the virtual environment created during setup before you run the commands. SLURM job scheduler will automatically activate the virtual environment when the training script is run on the worker nodes. 20 | 21 | ```sh 22 | source ~/aws_neuron_venv_pytorch/bin/activate 23 | ``` 24 | 25 | Use this virtual environment in the head node for the steps below. 26 | 27 | ## Download preprocessed training dataset 28 | 29 | In this tutorial, we use the Wikipedia dataset as an example to demonstrate training at scale. 30 | You can download the vocabulary file, the merge table file, and the preprocessed Wikipedia dataset with the following commands: 31 | 32 | ```sh 33 | export DATA_DIR=~/examples_datasets/gpt2 34 | 35 | mkdir -p ${DATA_DIR} && cd ${DATA_DIR} 36 | 37 | wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json 38 | wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt 39 | aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.bin . --no-sign-request 40 | aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.idx . --no-sign-request 41 | aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/license.txt . --no-sign-request 42 | ``` 43 | 44 | To prepare your own dataset from scratch, please follow the steps in [Preparing Wikipedia Dataset from Scratch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/megatron_lm_gpt.html#preparing-wikipedia-dataset-from-scratch) 45 | 46 | ## Download training scripts 47 | 48 | In this section we will download the training scripts and build the necessary dependencies. 49 | 50 | First, install Python3 development package needed to build the data helpers tools. If you are on Amazon Linux, do: 51 | 52 | ```sh 53 | sudo yum install -y python3-devel 54 | ``` 55 | 56 | If you are on Ubuntu, do: 57 | 58 | ```sh 59 | sudo apt install -y python3-dev 60 | ``` 61 | 62 | Next, clone the AWS Neuron Reference for Megatron-LM package, install dependencies and build helpers tool: 63 | 64 | ```sh 65 | cd ~/ 66 | git clone https://github.com/aws-neuron/aws-neuron-reference-for-megatron-lm.git 67 | pip install pybind11 regex 68 | pushd . 69 | cd aws-neuron-reference-for-megatron-lm/megatron/data/ 70 | make 71 | popd 72 | ``` 73 | 74 | There will be an `~/aws-neuron-reference-for-megatron-lm` directory from which you will be running the SLURM commands. The shell scripts needed to run the tutorial are in `~/aws-neuron-reference-for-megatron-lm/examples` directory. 75 | 76 | ## GPT-3 6.7B training configuration 77 | 78 | In this example, we are going to run a pretraining job for the GPT-3 6.7B model using the following model configuration: 79 | 80 | ```sh 81 | Hidden size = 4096 82 | Sequence len = 2048 83 | Num heads = 32 84 | Num layers = 32 85 | Microbatch = 1 86 | Gradient accumulation microsteps = 64 87 | ``` 88 | The distributed configuration is tensor parallel degree 32, pipeline parallel degree 1, and data parallel degree 16 if using 16 nodes. The global batch size is 1024. 
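As a quick sanity check on these numbers (a back-of-the-envelope sketch, assuming 16 trn1.32xl nodes with 32 workers each, as used by the training script below):

```sh
# data parallel degree = (16 nodes x 32 workers) / (tensor parallel 32 x pipeline parallel 1) = 16
# global batch size    = microbatch 1 x gradient accumulation microsteps 64 x data parallel 16 = 1024
```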
89 | 90 | ## GPT-3 6.7B training script 91 | 92 | The training shell script pretrain_gpt3_6.7B_32layers_bf16_bs1024_slurm.sh will be launched by SLURM on each worker node, where it prepares the environment and invokes torchrun to launch the Python script pretrain_gpt.py on 32 workers. The environment settings are: 93 | 94 | - Enable Elastic Fabric Adapter for higher networking performance 95 | - Mark all parameter transfers as static to enable runtime optimizations for wrapped torch.nn modules 96 | - Enables custom lowering for Softmax operation to enable compiler optimizations and improve GPT performance 97 | - Cast training to BF16 and enable stochastic rounding 98 | - Increase Neuron RT execution timeout in case slow compilation causes Neuron RT to wait longer than default timeout 99 | - Ensure enough framework threads to execute collective compute operations to prevent hang 100 | - Separate NeuronCache dir per node, workaround limitation to file locking on NFS 101 | - Run fewer steps and redirect TensorBoard logging when extract graphs only (during neuron_parallel_compile) 102 | 103 | The training shell script uses `torchrun` to run multiple pretrain_gpt.py processes, with world size, node rank, and master address extracted from SLURM node information. 104 | 105 | ```sh 106 | MASTER_ADDR=(`scontrol show hostnames $SLURM_JOB_NODELIST`) 107 | WORLD_SIZE_JOB=$SLURM_NTASKS 108 | RANK_NODE=$SLURM_NODEID 109 | DISTRIBUTED_ARGS="--nproc_per_node 32 --nnodes $WORLD_SIZE_JOB --node_rank $RANK_NODE --master_addr $MASTER_ADDR --master_port 2022" 110 | torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \ 111 | ...(options)... 112 | ``` 113 | 114 | The Python script pretrain_gpt.py calls the Megatron pretraining API with the Megatron GPT model builder, the loss and forward functions, and the dataset provider. It also sets the default compiler flag to model-type transformer for improved transformer support in compiler. 115 | 116 | ## Precompiling the training graphs 117 | 118 | Precompiling the training graphs is an optional step to reduce just-in-time graph compilations during training. This is done using the [Neuron Parallel Compile tool](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/training/pytorch-neuron-parallel-compile.html), which extracts the graphs used during training and perform multiple graph compilations in parallel. 119 | 120 | To precompile the training graphs, in `~/aws-neuron-reference-for-megatron-lm` directory, issue the following command to submit a SLURM job: 121 | 122 | ```sh 123 | cd ~/aws-neuron-reference-for-megatron-lm 124 | sbatch --exclusive -N 16 --wrap "srun neuron_parallel_compile ./examples/pretrain_gpt3_6.7B_32layers_bf16_bs1024_slurm.sh" 125 | ``` 126 | 127 | You can also use the SLURM script `examples/pretrain_gpt3_6.7B_compile.slurm` provided for convenience: 128 | 129 | ```sh 130 | cd ~/aws-neuron-reference-for-megatron-lm 131 | sbatch ./examples/pretrain_gpt3_6.7B_compile.slurm 132 | ``` 133 | 134 | Note that currently each node has it's own separate cache in `~/neuron_cache/gpt//neuron-compile-cache` to workaround a known NFS file-locking limitation. This is configured by the following line in the `./examples/pretrain_gpt3_6.7B_32layers_bf16_bs1024_slurm.sh` script. 
135 | 136 | ```sh 137 | export NEURON_CC_FLAGS="--cache_dir=$HOME/neuron_cache/gpt/`hostname`" 138 | ``` 139 | If the cluster size is larger than 16, use `--nodelist=` to limit the nodes used during precompilation and actual training run to ensure the workers on each node reads from the correct cache. The nodelist must match between precompilation and the actual run. 140 | 141 | The job id will be displayed by `squeue` command. You can monitor the results of the compilation job by inspecting the file `slurm-.out`file generated in `~/aws-neuron-reference-for-megatron-lm` directory. To follow the progress of this SLURM job, you may stream the SLURM output file in real time: 142 | 143 | ```sh 144 | tail -f slurm-.out 145 | ``` 146 | Note that there are many processes across many instances (nodes) running in parallel, and all the outputs are combined into the `slurm-.out` file. You can examine individual node's log by looking into `run_log_gpt3_6.7B_32layers_bf16..16.txt` file. 147 | 148 | The graph extraction sets NEURON_EXTRACT_GRAPHS_ONLY environment variable which cause the graph execution to execute empty graphs with zero outputs. The zero outputs cause execution results to be random, so the TensorBoard log is redirected to `/tmp/parallel_compile_ignored_tb_output`to enable clean TB log of the actual run in the next section. 149 | 150 | Currently, the total compilation time for GPT-3 6.7B example is about 30 minutes. When each node is finished compilation, you should see the following at the end of each node's log: 151 | 152 | ```sh 153 | 2023-05-08 20:14:50.000983: INFO ||PARALLEL_COMPILE||: Total graphs: 26 154 | 2023-05-08 20:14:50.000983: INFO ||PARALLEL_COMPILE||: Total successful compilations: 26 155 | 2023-05-08 20:14:50.000983: INFO ||PARALLEL_COMPILE||: Total failed compilations: 0 156 | ``` 157 | 158 | ## Launch training script 159 | 160 | Before or after the precompilation job is finished, submit the actual pretraining job by running the following slurm command: 161 | 162 | ``` 163 | cd ~/aws-neuron-reference-for-megatron-lm 164 | sbatch --exclusive -N 16 --wrap "srun ./examples/pretrain_gpt3_6.7B_32layers_bf16_bs1024_slurm.sh" 165 | ``` 166 | 167 | You can also use the SLURM script `examples/pretrain_gpt3_6.7B.slurm` provided for convenience: 168 | 169 | ```sh 170 | cd ~/aws-neuron-reference-for-megatron-lm 171 | sbatch ./examples/pretrain_gpt3_6.7B.slurm 172 | ``` 173 | 174 | As mentioned above, If the cluster size is larger than 16, use `--nodelist=` to limit the nodes used during precompilation and actual training run to ensure the workers on each node reads from the correct cache. The nodelist must match between precompilation and the actual run. 175 | 176 | You can also run the script with more or less number of nodes as long as the cluster size supports the node count. Further more, note that the global batch size (equal to gradient accumulation count times number of nodes) will change when the number of nodes is changed. 177 | 178 | If the submission is done before precompilation job is finished, the new submitted job will be queued in the SLURM job queue. You can use `squeue` to see running and queued jobs. 179 | 180 | Again, the job id will be displayed by `squeue` command and you can follow the training by inspecting the file `slurm-.out` file. You can examine individual node's log by looking into `run_log_gpt3_6.7B_32layers_bf16..16.txt` file. 
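If your cluster has more than 16 nodes and you follow the `--nodelist` advice above, the submission might look like the sketch below; the node names mirror the `sinfo` example earlier and are otherwise an assumption about your cluster's naming:

```sh
# Use the same node list for precompilation and training so each node
# reuses the compiler cache it populated under ~/neuron_cache/gpt/<hostname>.
sbatch --exclusive -N 16 --nodelist=compute1-dy-queue1-i1-[1-16] \
  --wrap "srun ./examples/pretrain_gpt3_6.7B_32layers_bf16_bs1024_slurm.sh"
```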
181 | 182 | After an initial startup, you should see lines like the following that indicate training progress (iteration, loss, elapsed time, throughput, etc). 183 | 184 | ```sh 185 | iteration 64/ 143051 | consumed samples: 65536 | elapsed time per iteration (ms): 7313.1 | learning rate: 4.956E-05 | global batch size: 1024 | lm loss: 7.281250E+00 | grad norm: 3.562 | throughput: 140.022 | 186 | 187 | ``` 188 | ## View TensorBoard trace 189 | 190 | You can examine the TensorBoard trace by ssh into the headnode with `-L 6006:localhost:6006' option, then: 191 | 192 | ```sh 193 | source ~/aws_neuron_venv_pytorch/bin/activate 194 | cd ~/aws-neuron-reference-for-megatron-lm 195 | tensorboard --logdir ./tb_gpt3_6.7B_32layers_bf16 196 | ``` 197 | On the host from where you ssh into the headnode, you can use a browser to go to `http://localhost:6006/` to view TensorBoard. 198 | 199 | ## View NeuronCore activities 200 | 201 | To inspect NeuronCore activities in the compute node, you can SSH from head node into any compute node, for example: 202 | 203 | ```sh 204 | ssh compute1-dy-queue1-i1-1 205 | ``` 206 | and then run `neuron-top` in the compute node to see NeuronCore activities while training in happening 207 | 208 | ## Tips 209 | 210 | Some useful SLURM commands are `sinfo`, `squeue` and `scontrol`. `sinfo` command displays information about SLURM node names and partitions. `squeue` command provides information about job queues currently running in the Slurm schedule. SLURM will generate a log file `slurm-XXXXXX.out`. You may then use `tail -f slurm-XXXXXX.out`, to inspect the job summary. `scontrol show node ` can show more information such as node state, power consumption, and more. 211 | 212 | 213 | ## Known issues/limitations 214 | 215 | - "Failed accept4: Too many open files" error and the solution from [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/megatron_lm_gpt.html?highlight=megatron%20lm#failed-accept4-too-many-open-files) 216 | - "cannot import name'helppers' from 'megatron.data' error and the solution from [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/megatron_lm_gpt.html?highlight=megatron%20lm#error-cannot-import-name-helpers-from-megatron-data) 217 | 218 | ## Troubleshooting guide 219 | 220 | See [Troubleshooting Guide for AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html) for more details and fixes to common issues. 221 | -------------------------------------------------------------------------------- /examples/jobs/neuronx-nemo-megatron-gpt-job.md: -------------------------------------------------------------------------------- 1 | # Launch a GPT-3 pretraining job using neuronx-nemo-megatron 2 | 3 | This tutorial explains how to run GPT-3 pretraining jobs with AWS EC2 trn1.32xl instances using [neuronx-nemo-megatron](https://github.com/aws-neuron/neuronx-nemo-megatron) and [AWS ParallelCluster](https://aws.amazon.com/hpc/parallelcluster/). 4 | 5 | neuronx-nemo-megatron (also known as "AWS Neuron Reference for NeMo Megatron") includes modified versions of the open-source packages [NeMo](https://github.com/NVIDIA/NeMo) and [Apex](https://github.com/NVIDIA/apex) that have been adapted for use with AWS Neuron and AWS EC2 Trn1 instances. 
neuronx-nemo-megatron allows for pretraining models with hundreds of billions of parameters across thousands of Trainium accelerators, and enables advanced training capabilities such as 3D parallelism, sequence parallelism, and activation checkpointing. 6 | 7 | ## Prerequisites 8 | Before proceeding with this tutorial, please follow [these instructions](https://github.com/aws-neuron/aws-neuron-parallelcluster-samples#train-a-model-on-aws-trn1-parallelcluster) to create a ParallelCluster consisting of 1 or more trn1.32xl or trn1n.32xl nodes. ParallelCluster automates the creation of trn1 clusters, and provides the SLURM job management system for scheduling and managing distributed training jobs. Please note that the home directory on your ParallelCluster head node will be shared with all of the worker nodes via NFS. 9 | 10 | ## Install neuronx-nemo-megatron 11 | 12 | With your trn1 ParallelCluster in place, begin by logging into the head node of your cluster using SSH. To provide access to TensorBoard (required in a later step), please make sure that you enable port forwarding for TCP port 6006 when you login, ex: 13 | ``` 14 | ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006 15 | ``` 16 | 17 | Once logged into the head node, activate the provided PyTorch Neuron virtual environment that was created when you set up your ParallelCluster. **Note**: if your PyTorch Neuron environment is lower than Neuron 2.11, please refer to the [Neuron documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update.html#pytorch-neuronx-update) for instructions on updating to Neuron 2.11 or later. 18 | ``` 19 | cd ~ 20 | source ./aws_neuron_venv_pytorch/bin/activate 21 | ``` 22 | 23 | Next, clone the neuronx-nemo-megatron repo to the head node: 24 | ``` 25 | cd ~ 26 | git clone https://github.com/aws-neuron/neuronx-nemo-megatron.git 27 | cd neuronx-nemo-megatron 28 | ``` 29 | 30 | Install the `wheel` Python package and run the build script to create the neuronx-nemo-megatron wheels: 31 | ``` 32 | pip3 install wheel 33 | ./build.sh 34 | ``` 35 | 36 | Install the neuronx-nemo-megatron packages and dependencies in your virtual environment: 37 | ``` 38 | pip3 install ./build/*.whl 39 | pip3 install -r requirements.txt protobuf==3.20.3 40 | ``` 41 | 42 | Build the Megatron helper module 43 | ``` 44 | cd ~ 45 | python3 -c "from nemo.collections.nlp.data.language_modeling.megatron.dataset_utils import compile_helper; \ 46 | compile_helper()" 47 | ``` 48 | 49 | The above utility will help make this file : ```nemo.collections.nlp.data.language_modeling.megatron.dataset_utils``` and below is the expected output (You can ignore the error) 50 | ``` 51 | 2023-Aug-17 22:53:01.0674 47940:47940 ERROR TDRV:tdrv_get_dev_info No neuron device available 52 | [NeMo W 2023-08-17 22:53:03 optimizers:67] Could not import distributed_fused_adam optimizer from Apex 53 | [NeMo W 2023-08-17 22:53:04 experimental:27] Module is experimental, not ready for production and is not fully supported. Use at your own risk. 54 | ``` 55 | 56 | ## Download GPT dataset 57 | This tutorial makes use of a preprocessed Wikipedia dataset that is stored in S3. 
The dataset can be downloaded to your cluster by running the following commands on the head node: 58 | ```sh 59 | export DATA_DIR=~/examples_datasets/gpt2 60 | mkdir -p ${DATA_DIR} && cd ${DATA_DIR} 61 | wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json 62 | wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt 63 | aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.bin . --no-sign-request 64 | aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.idx . --no-sign-request 65 | aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/license.txt . --no-sign-request 66 | ``` 67 | 68 | ## GPT-3 training configurations 69 | We tested with the following model sizes: 23B, 46B, 175B 70 | ### GPT-3 23B 71 | - Model configuration 72 | - Attention heads: 64 73 | - Layers: 28 74 | - Sequence length: 2048 75 | - Hidden size: 8192 76 | - Hidden FFN size: 32768 77 | - Microbatch size: 1 78 | - Global batch size: 32 * number_of_nodes 79 | 80 | - Distributed training configuration 81 | - Number of nodes: 4 82 | - Tensor parallel degree: 8 83 | - Pipeline parallel degree: 4 84 | - Data parallel degree: 4 85 | 86 | ### GPT-3 46B 87 | - Model configuration 88 | - Attention heads: 64 89 | - Layers: 56 90 | - Sequence length: 2048 91 | - Hidden size: 8192 92 | - Hidden FFN size: 32768 93 | - Microbatch size: 1 94 | - Global batch size: 32 * number_of_nodes 95 | 96 | - Distributed training configuration 97 | - Number of nodes: 8 98 | - Tensor parallel degree: 8 99 | - Pipeline parallel degree: 8 100 | - Data parallel degree: 4 101 | 102 | ### GPT-3 175B 103 | - Model configuration 104 | - Attention heads: 96 105 | - Layers: 96 106 | - Sequence length: 2048 107 | - Hidden size: 12288 108 | - Hidden FFN size: 49152 109 | - Microbatch size: 1 110 | - Global batch size: 32 * number_of_nodes 111 | 112 | - Distributed training configuration 113 | - Number of nodes: 8 114 | - Tensor parallel degree: 32 115 | - Pipeline parallel degree: 8 116 | - Data parallel degree: 1 117 | 118 | ## Pre-compile the model 119 | By default, PyTorch Neuron uses a just in time (JIT) compilation flow that sequentially compiles all of the neural network compute graphs as they are encountered during a training job. The compiled graphs are cached in a local compiler cache so that subsequent training jobs can leverage the compiled graphs and avoid compilation (so long as the graph signatures and Neuron version have not changed). 120 | 121 | An alternative to the JIT flow is to use the included [neuron_parallel_compile](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/training/pytorch-neuron-parallel-compile.html?highlight=neuron_parallel_compile) command to perform ahead of time (AOT) compilation. In the AOT compilation flow, the compute graphs are first identified and extracted during a short simulated training run, and the extracted graphs are then compiled and cached using parallel compilation, which is considerably faster than the JIT flow. 122 | 123 | Run the following commands to launch an AOT pre-compilation job on your ParallelCluster: 124 | ``` 125 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 126 | sbatch --nodes 4 compile.slurm ./gpt_23b.sh 127 | ``` 128 | For the 46B and 175B the `--nodes 8` would be used instead of 4. 129 | 130 | Once you have launched the precompilation job, run the `squeue` command to view the SLURM job queue on your cluster. 
If you have not recently run a job on your cluster, it may take 4-5 minutes for the requested trn1.32xlarge nodes to be launched and initialized. Once the job is running, `squeue` should show output similar to the following: 131 | ``` 132 | JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 133 | 10 compute1 compile.slurm ubuntu R 5:11 4 compute1-dy-queue1-i1-[1-4] 134 | ``` 135 | 136 | You can view the output of the precompilation job by examining the file named `slurm-compile.slurm-ZZ.out` where ZZ represents the JOBID of your job in the `squeue` output, above. Ex: 137 | ``` 138 | tail -f slurm-compile.slurm-10.out 139 | ``` 140 | 141 | Once the precompilation job is complete, you should see a message similar to the following in the logs: 142 | ``` 143 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total graphs: 22 144 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total successful compilations: 22 145 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total failed compilations: 0 146 | ``` 147 | 148 | At this point, you can press `CTRL-C` to exit the tail command. 149 | 150 | ## Launch a pretraining job 151 | The GPT-3 pretraining job can be launched in the same manner as the precompilation job described above. In this case, we change the SLURM script from `compile.slurm` to `run.slurm`, but the other parameters remain the same: 152 | ``` 153 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 154 | sbatch --nodes 4 run.slurm ./gpt_23b.sh 155 | ``` 156 | For the 46B and 175B the `--nodes 8` would be used instead of 4, like the compile above. 157 | 158 | 159 | As outlined above, you can again use the `squeue` command to view the job queue. Once you see that your pretraining job is running, you can view the output of the training job by examining the file named `slurm-run.slurm-ZZ.out` where ZZ represents the JOBID of your job: 160 | ``` 161 | tail -f slurm-run.slurm-11.out 162 | ``` 163 | 164 | Once the model is loaded onto the Trainium accelerators and training has commenced, you will begin to see output indicating the job progress: 165 | ``` 166 | Epoch 0: 0%| | 189/301501 [59:12<1573:03:24, 18.79s/it, loss=7.75, v_num=3-16, reduced_train_loss=7.560, global_step=188.0, consumed_samples=24064.0] 167 | Epoch 0: 0%| | 190/301501 [59:30<1572:41:13, 18.79s/it, loss=7.74, v_num=3-16, reduced_train_loss=7.560, global_step=189.0, consumed_samples=24192.0] 168 | Epoch 0: 0%| | 191/301501 [59:48<1572:21:28, 18.79s/it, loss=7.73, v_num=3-16, reduced_train_loss=7.910, global_step=190.0, consumed_samples=24320.0] 169 | ``` 170 | 171 | ## Monitor training 172 | ### TensorBoard 173 | In addition to the text-based job monitoring described in the previous section, you can also use standard tools such as TensorBoard to monitor training job progress. To view an ongoing training job in TensorBoard, you first need to identify the experiment directory associated with your ongoing job. This will typically be the most recently created directory under `~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_gpt`. 
Once you have identified the directory, `cd` into it, and then launch TensorBoard: 174 | ``` 175 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_gpt 176 | ls -alt|head 177 | # Identify the correct experiment directory in the 178 | # output of the ls command, ex: 2023-06-10_00-22-42 179 | cd YOUR_EXPERIMENT_DIR # <- replace this with your experiment directory 180 | tensorboard --logdir ./ 181 | ``` 182 | 183 | With TensorBoard running, you can then view the TensorBoard dashboard by browsing to http://localhost:6006 on your local machine. If you cannot access TensorBoard at this address, please make sure that you have port-forwarded TCP port 6006 when SSH'ing into the head node, ex: `ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006` 184 | 185 | ### neuron-top / neuron-monitor / neuron-ls 186 | The [neuron-top](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-top-user-guide.html?highlight=neuron-top) tool can be used to view useful information about NeuronCore utilization, vCPU and RAM utilization, and loaded graphs on a per-node basis. To use neuron-top during an ongoing training job, first SSH into one of your compute nodes from the head node, and then run `neuron-top`: 187 | ``` 188 | ssh compute1-dy-queue1-i1-1 # to determine which compute nodes are in use, run the squeue command 189 | neuron-top 190 | ``` 191 | 192 | Similarly, once you are logged into one of the active compute nodes, you can also use other Neuron tools such as [neuron-monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) and [neuron-ls](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) to capture performance/utilization statistics and to understand NeuronCore allocation. 193 | 194 | ## Key Features 195 | * GPT pretraining also works with the ZeRO optimizer, which is not enabled by default. To reduce memory pressure, you can enable it by adding the hyperparameter below to your run script: 196 | ``` 197 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/ 198 | 199 | vi test.sh 200 | 201 | # Add the line below to the run script: 202 | model.wrap_with_zero=True \ 203 | 204 | ``` 205 | 206 | ## Known issues/limitations 207 | * The initial release of neuronx-nemo-megatron supports GPT pretraining only. Model evaluation will be available in a future release. 208 | * The Neuron compiler's modular flow (ex: `--enable-experimental-O1`) is not supported by this initial release of neuronx-nemo-megatron. 209 | * neuronx-nemo-megatron currently requires pytorch-lightning v1.8.6 210 | 211 | ## Troubleshooting guide 212 | See [Troubleshooting Guide for AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html) for more details and fixes to common issues. 213 | -------------------------------------------------------------------------------- /examples/jobs/neuronx-nemo-megatron-llamav2-job.md: -------------------------------------------------------------------------------- 1 | # Launch a Llama2 pretraining job using neuronx-nemo-megatron 2 | 3 | This tutorial explains how to run Llama V2 pretraining jobs with AWS EC2 trn1.32xl instances using [neuronx-nemo-megatron](https://github.com/aws-neuron/neuronx-nemo-megatron) and [AWS ParallelCluster](https://aws.amazon.com/hpc/parallelcluster/). 
4 | 5 | neuronx-nemo-megatron (also known as "AWS Neuron Reference for NeMo Megatron") includes modified versions of the open-source packages [NeMo](https://github.com/NVIDIA/NeMo) and [Apex](https://github.com/NVIDIA/apex) that have been adapted for use with AWS Neuron and AWS EC2 Trn1 instances. neuronx-nemo-megatron allows for pretraining models with hundreds of billions of parameters across thousands of Trainium accelerators, and enables advanced training capabilities such as 3D parallelism, sequence parallelism, and activation checkpointing. 6 | 7 | ## Prerequisites 8 | Before proceeding with this tutorial, please follow [these instructions](https://github.com/aws-neuron/aws-neuron-parallelcluster-samples#train-a-model-on-aws-trn1-parallelcluster) to create a ParallelCluster consisting of 1 or more trn1.32xl or trn1n.32xl nodes. ParallelCluster automates the creation of trn1 clusters, and provides the SLURM job management system for scheduling and managing distributed training jobs. Please note that the home directory on your ParallelCluster head node will be shared with all of the worker nodes via NFS. 9 | 10 | ## Install neuronx-nemo-megatron 11 | 12 | With your trn1 ParallelCluster in place, begin by logging into the head node of your cluster using SSH. To provide access to TensorBoard (required in a later step), please make sure that you enable port forwarding for TCP port 6006 when you login, ex: 13 | ``` 14 | ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006 15 | ``` 16 | 17 | Once logged into the head node, activate the provided PyTorch Neuron virtual environment that was created when you set up your ParallelCluster. **Note**: if your PyTorch Neuron environment is lower than Neuron 2.11, please refer to the [Neuron documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update.html#pytorch-neuronx-update) for instructions on updating to Neuron 2.11 or later. 18 | ``` 19 | cd ~ 20 | source ./aws_neuron_venv_pytorch/bin/activate 21 | ``` 22 | 23 | Next, clone the neuronx-nemo-megatron repo to the head node: 24 | ``` 25 | cd ~ 26 | git clone https://github.com/aws-neuron/neuronx-nemo-megatron.git 27 | cd neuronx-nemo-megatron 28 | ``` 29 | 30 | Install the `wheel` Python package and run the build script to create the neuronx-nemo-megatron wheels: 31 | ``` 32 | pip3 install wheel 33 | ./build.sh 34 | ``` 35 | 36 | Install the neuronx-nemo-megatron packages and dependencies in your virtual environment: 37 | ``` 38 | pip3 install ./build/*.whl 39 | pip3 install -r requirements.txt protobuf==3.20.3 40 | ``` 41 | 42 | Build the Megatron helper module 43 | ``` 44 | cd ~ 45 | python3 -c "from nemo.collections.nlp.data.language_modeling.megatron.dataset_utils import compile_helper; \ 46 | compile_helper()" 47 | ``` 48 | 49 | The above utility will help make this file : ```nemo.collections.nlp.data.language_modeling.megatron.dataset_utils``` and below is the expected output (You can ignore the error) 50 | ``` 51 | 2023-Aug-17 22:53:01.0674 47940:47940 ERROR TDRV:tdrv_get_dev_info No neuron device available 52 | [NeMo W 2023-08-17 22:53:03 optimizers:67] Could not import distributed_fused_adam optimizer from Apex 53 | [NeMo W 2023-08-17 22:53:04 experimental:27] Module is experimental, not ready for production and is not fully supported. Use at your own risk. 54 | ``` 55 | 56 | ## Download LlamaV2 dataset and tokenizer 57 | This tutorial makes use of a Red pyjama dataset. 
The dataset can be downloaded to your cluster by running the following commands on the head node: 58 | 59 | ``` 60 | wget https://data.together.xyz/redpajama-data-1T/v1.0.0/book/book.jsonl 61 | ``` 62 | Note: The dataset is approximately 50 GB and will take roughly 3-4 hours to download. 63 | 64 | The above command gives you the raw dataset of around 50 GB, which needs to be tokenized using a Llama V2 tokenizer. To tokenize the data, you need to request the tokenizer from Hugging Face and Meta via the link below: 65 | 66 | [Request the tokenizer and model weights from Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b) 67 | 68 | Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the above website and accept Meta's license before requesting access. 69 | 70 | Once you have the tokenizer and the dataset, you can tokenize the dataset with the following command: 71 | ``` 72 | python nemo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \ 73 | --input=DATA_FOLDER/DATA.jsonl \ 74 | --json-keys=text \ 75 | --tokenizer-library=huggingface \ 76 | --tokenizer-type=TOKENIZER_FOLDER/llama7b-hf \ 77 | --dataset-impl=mmap \ 78 | --output-prefix=DATA_FOLDER/DATA_tokenized \ 79 | --append-eod \ 80 | --need-pad-id \ 81 | --workers=32 82 | ``` 83 | 84 | After tokenizing the dataset, you will have paths to the tokenizer and the tokenized dataset, which will be used for pretraining. 85 | 86 | ## Llama2 training configurations 87 | We tested with the following model sizes: 7B, 13B, and 70B. 88 | ### Llama2 7B 89 | 90 | - Model configuration 91 | - Attention heads: 32 92 | - Layers: 32 93 | - Sequence length: 4096 94 | - Hidden size: 4096 95 | - Hidden FFN size: 11008 96 | - Microbatch size: 1 97 | - Global batch size: 256 98 | 99 | - Distributed training configuration 100 | - Number of nodes: 4 101 | - Tensor parallel degree: 8 102 | - Pipeline parallel degree: 1 103 | - Data parallel degree: 16 104 | 105 | ### Llama2 13B 106 | 107 | - Model configuration 108 | - Attention heads: 40 109 | - Layers: 40 110 | - Sequence length: 4096 111 | - Hidden size: 5120 112 | - Hidden FFN size: 13824 113 | - Microbatch size: 1 114 | - Global batch size: 1024 115 | 116 | - Distributed training configuration 117 | - Number of nodes: 4 118 | - Tensor parallel degree: 8 119 | - Pipeline parallel degree: 4 120 | - Data parallel degree: 4 121 | 122 | ### Llama2 70B 123 | 124 | - Model configuration 125 | - Attention heads: 64 126 | - Layers: 80 127 | - Sequence length: 4096 128 | - Hidden size: 8192 129 | - Hidden FFN size: 28672 130 | - Microbatch size: 1 131 | - Global batch size: 512 132 | 133 | - Distributed training configuration 134 | - Number of nodes: 8 135 | - Tensor parallel degree: 8 136 | - Pipeline parallel degree: 16 137 | - Data parallel degree: 2 138 | 139 | ## Pre-compile the model 140 | By default, PyTorch Neuron uses a just-in-time (JIT) compilation flow that sequentially compiles all of the neural network compute graphs as they are encountered during a training job. The compiled graphs are cached in a local compiler cache so that subsequent training jobs can leverage the compiled graphs and avoid compilation (so long as the graph signatures and Neuron version have not changed). 
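To see this caching behavior for yourself, you can peek at the compiler cache on one of your compute nodes between runs. The snippet below is a minimal sketch: it assumes the default on-disk cache location (`/var/tmp/neuron-compile-cache`), which may differ in your environment, and reuses the example compute node name from this tutorial.

```
# Minimal sketch: inspect the local Neuron compiler cache on a compute node.
# /var/tmp/neuron-compile-cache is the assumed default cache path; adjust it if
# your setup overrides the cache location, and substitute one of your own node names.
ssh compute1-dy-queue1-i1-1 \
  'ls /var/tmp/neuron-compile-cache | head; find /var/tmp/neuron-compile-cache -name "*.neff" | wc -l'
```

A populated cache is what lets a later training job with unchanged graph signatures skip compilation entirely; the ahead-of-time flow described next simply fills this cache up front.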
141 | 142 | An alternative to the JIT flow is to use the included [neuron_parallel_compile](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/training/pytorch-neuron-parallel-compile.html?highlight=neuron_parallel_compile) command to perform ahead of time (AOT) compilation. In the AOT compilation flow, the compute graphs are first identified and extracted during a short simulated training run, and the extracted graphs are then compiled and cached using parallel compilation, which is considerably faster than the JIT flow. 143 | 144 | Before starting the compilation you need to update your path to the dataset and tokenizer in the ```test_llama.sh``` script for pretraining llama 7b and llama 13b and ```test_llama_gqa.sh``` for pretraining llama 70b as below : 145 | 146 | ``` 147 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 148 | 149 | # For llama 7b and 13b 150 | vi test_llama.sh 151 | 152 | # For llama 70b 153 | vi test_llama_gqa.sh 154 | 155 | # Update the below lines 156 | # For tokenizer 157 | model.tokenizer.type='PATH_TO_LLAMA_TOKENIZER/' \ 158 | 159 | # For Dataset 160 | model.data.data_prefix=[1.0,PATH_TO_TOKENIZED_DATASET/books/book.jsonl-processed_text_document] \ 161 | ``` 162 | Run the following commands to launch an AOT pre-compilation job on your ParallelCluster: 163 | ``` 164 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 165 | sbatch --nodes 4 compile.slurm ./llama_7b.sh 166 | ``` 167 | 168 | For compiling llama 13b, run the following commands: 169 | ``` 170 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 171 | sbatch --nodes 4 compile.slurm ./llama_13b.sh 172 | ``` 173 | 174 | For compiling llama 70b, run the following commands: 175 | ``` 176 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 177 | sbatch --nodes 32 compile.slurm ./llama_70b.sh 178 | ``` 179 | 180 | Note : For the 70B the `--nodes 32` would be used instead of 4. 181 | 182 | Once you have launched the precompilation job, run the `squeue` command to view the SLURM job queue on your cluster. If you have not recently run a job on your cluster, it may take 4-5 minutes for the requested trn1.32xlarge nodes to be launched and initialized. Once the job is running, `squeue` should show output similar to the following: 183 | ``` 184 | JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 185 | 10 compute1 compile.slurm ubuntu R 5:11 4 compute1-dy-queue1-i1-[1-4] 186 | ``` 187 | 188 | You can view the output of the precompilation job by examining the file named `slurm-compile.slurm-ZZ.out` where ZZ represents the JOBID of your job in the `squeue` output, above. Ex: 189 | ``` 190 | tail -f slurm-compile.slurm-10.out 191 | ``` 192 | 193 | Once the precompilation job is complete, you should see a message similar to the following in the logs: 194 | ``` 195 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total graphs: 22 196 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total successful compilations: 22 197 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total failed compilations: 0 198 | ``` 199 | 200 | At this point, you can press `CTRL-C` to exit the tail command. 201 | 202 | ## Launch a pretraining job 203 | The Llama2 pretraining job can be launched in the same manner as the precompilation job described above. 
In this case, we change the SLURM script from `compile.slurm` to `run.slurm`, but the other parameters remain the same: 204 | ``` 205 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 206 | sbatch --nodes 4 run.slurm ./llama_7b.sh 207 | ``` 208 | 209 | For llama_13b, run the command below: 210 | ``` 211 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 212 | sbatch --nodes 4 run.slurm ./llama_13b.sh 213 | ``` 214 | 215 | For llama_70b, run the command below: 216 | ``` 217 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 218 | sbatch --nodes 32 run.slurm ./llama_70b.sh 219 | ``` 220 | Note: for the 70B model, `--nodes 32` is used instead of 4. 221 | 222 | As outlined above, you can again use the `squeue` command to view the job queue. Once you see that your pretraining job is running, you can view the output of the training job by examining the file named `slurm-run.slurm-ZZ.out` where ZZ represents the JOBID of your job: 223 | ``` 224 | tail -f slurm-run.slurm-11.out 225 | ``` 226 | 227 | Once the model is loaded onto the Trainium accelerators and training has commenced, you will begin to see output indicating the job progress: 228 | ``` 229 | Epoch 0: 22%|██▏ | 4499/20101 [22:26:14<77:48:37, 17.95s/it, loss=2.43, v_num=5563, reduced_train_loss=2.470, gradient_norm=0.121, parameter_norm=1864.0, global_step=4512.0, consumed_samples=1.16e+6, iteration_time=16.40] 230 | Epoch 0: 22%|██▏ | 4500/20101 [22:26:32<77:48:18, 17.95s/it, loss=2.43, v_num=5563, reduced_train_loss=2.470, gradient_norm=0.121, parameter_norm=1864.0, global_step=4512.0, consumed_samples=1.16e+6, iteration_time=16.40] 231 | Epoch 0: 22%|██▏ | 4500/20101 [22:26:32<77:48:18, 17.95s/it, loss=2.44, v_num=5563, reduced_train_loss=2.450, gradient_norm=0.120, parameter_norm=1864.0, global_step=4512.0, consumed_samples=1.16e+6, iteration_time=16.50] 232 | ``` 233 | 234 | ## Monitor training 235 | ### TensorBoard 236 | In addition to the text-based job monitoring described in the previous section, you can also use standard tools such as TensorBoard to monitor training job progress. To view an ongoing training job in TensorBoard, you first need to identify the experiment directory associated with your ongoing job. This will typically be the most recently created directory under `~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_llama`. Once you have identified the directory, `cd` into it, and then launch TensorBoard: 237 | ``` 238 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_llama 239 | ls -alt|head 240 | # Identify the correct experiment directory in the 241 | # output of the ls command, ex: 2023-06-10_00-22-42 242 | cd YOUR_EXPERIMENT_DIR # <- replace this with your experiment directory 243 | tensorboard --logdir ./ 244 | ``` 245 | 246 | With TensorBoard running, you can then view the TensorBoard dashboard by browsing to http://localhost:6006 on your local machine. 
If you cannot access TensorBoard at this address, please make sure that you have port-forwarded TCP port 6006 when SSH'ing into the head node, ex: `ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006` 247 | 248 | ### neuron-top / neuron-monitor / neuron-ls 249 | The [neuron-top](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-top-user-guide.html?highlight=neuron-top) tool can be used to view useful information about NeuronCore utilization, vCPU and RAM utilization, and loaded graphs on a per-node basis. To use neuron-top during an ongoing training job, first SSH into one of your compute nodes from the head node, and then run `neuron-top`: 250 | ``` 251 | ssh compute1-dy-queue1-i1-1 # to determine which compute nodes are in use, run the squeue command 252 | neuron-top 253 | ``` 254 | 255 | Similarly, once you are logged into one of the active compute nodes, you can also use other Neuron tools such as [neuron-monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) and [neuron-ls](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) to capture performance/utilization statistics and to understand NeuronCore allocation. 256 | 257 | ## Key Features 258 | * Llama2 pretraining also works with the ZeRO optimizer, which is not enabled by default. To reduce memory pressure, you can enable it by adding the hyperparameter below to your run script: 259 | ``` 260 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/ 261 | 262 | # For llama 7b and 13b 263 | vi test_llama.sh 264 | 265 | # For llama 70b 266 | vi test_llama_gqa.sh 267 | 268 | # Add the line below to the run script: 269 | model.wrap_with_zero=True \ 270 | 271 | ``` 272 | 273 | ## Known issues/limitations 274 | * The initial release of neuronx-nemo-megatron supports Llama2 pretraining only. Model evaluation will be available in a future release. 275 | * neuronx-nemo-megatron currently requires pytorch-lightning v1.8.6 276 | * Llama2-70B: tested and validated on 8 nodes. Scaling beyond that may run into memory issues. 277 | 278 | ## Troubleshooting guide 279 | See [Troubleshooting Guide for AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html) for more details and fixes to common issues. 280 | -------------------------------------------------------------------------------- /install_neuron.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -e 3 | 4 | echo "Neuron SDK Release 2.6.0" 5 | # Configure Linux for Neuron repository updates 6 | . 
/etc/os-release 7 | 8 | sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null </dev/null && DNS_SERVER=$(resolvectl dns | awk '{print $4}' | sort -r | head -1) 69 | IP="$(host $HOSTNAME $DNS_SERVER | tail -1 | awk '{print $4}')" 70 | DOMAIN=$(jq .cluster.dns_domain /etc/chef/dna.json | tr -d \") 71 | sudo sed -i "/$HOSTNAME/d" /etc/hosts 72 | sudo bash -c "echo '$IP $HOSTNAME.${DOMAIN::-1} $HOSTNAME' >> /etc/hosts" 73 | fi 74 | 75 | -------------------------------------------------------------------------------- /releasenotes.md: -------------------------------------------------------------------------------- 1 | # Change Log 2 | 3 | ## September, 15th 2023 4 | * Added sample for pre-training Llama 2 13B model using neuronx-nemo-megatron library 5 | 6 | ## August, 29th 2023 7 | * Added samples for pre-training Llama 2 7B model using neuronx-nemo-megatron library 8 | 9 | ## August, 28th 2023 10 | * Added samples for pre-training GPT-3 23B, 46B and 175B models using neuronx-nemo-megatron library 11 | * Announced End of Support for GPT-3 training using aws-neuron-reference-for-megatron-lm library 12 | 13 | ## October, 10th 2022 14 | 15 | * Added a parallel cluster example that explains how to use AWS ParallelCluster to build an HPC compute cluster with trn1 compute nodes to run distributed ML training jobs. 16 | 17 | ## Known Issues 18 | 19 | ### Relaunching a dynamic cluster created with `MinCount = 0` may fail due to a compute node IP address mismatch 20 | 21 | #### **Amazon Linux 2** 22 | 23 | For a dynamic cluster with `MinCount = 0`, the compute node IP addresses in `/etc/hosts` may not match those returned by `nslookup` upon cluster relaunch. A temporary workaround is therefore included in the `install_neuron.sh` post-install script: 24 | 25 | ``` 26 | IP="$(host $HOSTNAME| awk '{print $4}')" 27 | DOMAIN=$(jq .cluster.dns_domain /etc/chef/dna.json | tr -d \") 28 | sudo sed -i "/$HOSTNAME/d" /etc/hosts 29 | sudo bash -c "echo '$IP $HOSTNAME.${DOMAIN::-1} $HOSTNAME' >> /etc/hosts" 30 | ``` 31 | 32 | This fix is already implemented in the custom installation script to ensure an AL2 dynamic cluster relaunches successfully. 33 | 34 | #### **Ubuntu** 35 | 36 | On Ubuntu, the default DNS resolver is systemd-resolved. In its configuration file `/etc/systemd/resolved.conf`, the option `ReadEtcHosts` defaults to `yes`, so systemd-resolved on Ubuntu queries the file `/etc/hosts` first and then the DNS server. The workaround for this behavior is: 37 | 38 | ``` 39 | DNS_SERVER="" 40 | grep Ubuntu /etc/issue &>/dev/null && DNS_SERVER=$(resolvectl dns | awk '{print $4}' | sort -r | head -1) 41 | IP="$(host $HOSTNAME $DNS_SERVER | tail -1 | awk '{print $4}')" 42 | DOMAIN=$(jq .cluster.dns_domain /etc/chef/dna.json | tr -d \") 43 | sudo sed -i "/$HOSTNAME/d" /etc/hosts 44 | sudo bash -c "echo '$IP $HOSTNAME.${DOMAIN::-1} $HOSTNAME' >> /etc/hosts" 45 | ``` 46 | 47 | This fix is implemented in the custom installation script to ensure an Ubuntu dynamic cluster relaunches successfully. 48 | 49 | 50 | ### Error “Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)’ failed” followed by ‘RPC failed with status = “UNAVAILABLE: Connection reset by peer”’ 51 | 52 | 53 | ``` 54 | 55 | INFO: Inconsistency detected by ld.so: ../elf/dl-tls.c: 488: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed! 
56 | INFO: 2022-10-03 02:16:04.488054: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1664763364.487962663","description":"Error received from peer ipv4:10.0.9.150:41677","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC 57 | 58 | ``` 59 | This error may occur intermittently when using GNU C Library glibc 2.26. To find out what version you have, run ```ldd --version```. glibc 2.27 provides a workaround, and therefore the error is fixed in Ubuntu 20.04. For more information on this issue, see the [Neuron troubleshooting guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/training-troubleshooting.html#error-assertion-listp-slotinfo-cnt-gen-gl-dl-tls-generation-failed-followed-by-rpc-failed-with-status-unavailable-connection-reset-by-peer). 60 | 61 | ## Troubleshooting guide 62 | 63 | See [Troubleshooting Guide for AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html) for more details and fixes to common issues. 64 | --------------------------------------------------------------------------------