├── CODEOWNERS ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE ├── README.md ├── examples ├── cluster-configs │ └── trn1-16-nodes-pcluster.md ├── general │ └── network │ │ └── vpc-subnet-setup.md ├── images │ ├── add-route.png │ ├── az.png │ ├── create-vpc.png │ ├── edit-subnet.png │ ├── ipv4.png │ ├── subnets-nat.png │ ├── subnets.png │ ├── vpc-entry.png │ └── vpc-setup.png └── jobs │ ├── dp-bert-launch-job.md │ ├── finetuning │ └── neuronx-nemo-megatron-llamav2-finetuning-job.md │ ├── gpt3-launch-job.md │ ├── neuronx-nemo-megatron-gpt-job.md │ └── neuronx-nemo-megatron-llamav2-job.md ├── install_neuron.sh └── releasenotes.md /CODEOWNERS: -------------------------------------------------------------------------------- 1 | # This file creates codeowners for the documentation. It will allow setting code reviewers for all Pull requests to merge to the master branch 2 | # Each line is a file pattern followed by one or more owners. 3 | 4 | # Refernce guide - https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/creating-a-repository-on-github/about-code-owners#example-[…]ners-file 5 | # Example - These owners will be the default owners for everything in 6 | # the repo. Unless a later match takes precedence, 7 | # @global-owner1 and @global-owner2 will be requested for 8 | # review when someone opens a pull request. 9 | # * @global-owner1 @global-owner2 10 | 11 | * @aws-maens @aws-sadaf 12 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. 
You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Amazon Software License 1.0 2 | 3 | This Amazon Software License ("License") governs your use, reproduction, and 4 | distribution of the accompanying software as specified below. 5 | 6 | 1. Definitions 7 | 8 | "Licensor" means any person or entity that distributes its Work. 9 | 10 | "Software" means the original work of authorship made available under this 11 | License. 12 | 13 | "Work" means the Software and any additions to or derivative works of the 14 | Software that are made available under this License. 15 | 16 | The terms "reproduce," "reproduction," "derivative works," and 17 | "distribution" have the meaning as provided under U.S. copyright law; 18 | provided, however, that for the purposes of this License, derivative works 19 | shall not include works that remain separable from, or merely link (or bind 20 | by name) to the interfaces of, the Work. 21 | 22 | Works, including the Software, are "made available" under this License by 23 | including in or with the Work either (a) a copyright notice referencing the 24 | applicability of this License to the Work, or (b) a copy of this License. 25 | 26 | 2. License Grants 27 | 28 | 2.1 Copyright Grant. 
Subject to the terms and conditions of this License, 29 | each Licensor grants to you a perpetual, worldwide, non-exclusive, 30 | royalty-free, copyright license to reproduce, prepare derivative works of, 31 | publicly display, publicly perform, sublicense and distribute its Work and 32 | any resulting derivative works in any form. 33 | 34 | 2.2 Patent Grant. Subject to the terms and conditions of this License, each 35 | Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free 36 | patent license to make, have made, use, sell, offer for sale, import, and 37 | otherwise transfer its Work, in whole or in part. The foregoing license 38 | applies only to the patent claims licensable by Licensor that would be 39 | infringed by Licensor's Work (or portion thereof) individually and 40 | excluding any combinations with any other materials or technology. 41 | 42 | 3. Limitations 43 | 44 | 3.1 Redistribution. You may reproduce or distribute the Work only if 45 | (a) you do so under this License, (b) you include a complete copy of this 46 | License with your distribution, and (c) you retain without modification 47 | any copyright, patent, trademark, or attribution notices that are present 48 | in the Work. 49 | 50 | 3.2 Derivative Works. You may specify that additional or different terms 51 | apply to the use, reproduction, and distribution of your derivative works 52 | of the Work ("Your Terms") only if (a) Your Terms provide that the use 53 | limitation in Section 3.3 applies to your derivative works, and (b) you 54 | identify the specific derivative works that are subject to Your Terms. 55 | Notwithstanding Your Terms, this License (including the redistribution 56 | requirements in Section 3.1) will continue to apply to the Work itself. 57 | 58 | 3.3 Use Limitation. The Work and any derivative works thereof only may be 59 | used or intended for use with the web services, computing platforms or 60 | applications provided by Amazon.com, Inc. or its affiliates, including 61 | Amazon Web Services, Inc. 62 | 63 | 3.4 Patent Claims. If you bring or threaten to bring a patent claim against 64 | any Licensor (including any claim, cross-claim or counterclaim in a 65 | lawsuit) to enforce any patents that you allege are infringed by any Work, 66 | then your rights under this License from such Licensor (including the 67 | grants in Sections 2.1 and 2.2) will terminate immediately. 68 | 69 | 3.5 Trademarks. This License does not grant any rights to use any 70 | Licensor's or its affiliates' names, logos, or trademarks, except as 71 | necessary to reproduce the notices described in this License. 72 | 73 | 3.6 Termination. If you violate any term of this License, then your rights 74 | under this License (including the grants in Sections 2.1 and 2.2) will 75 | terminate immediately. 76 | 77 | 4. Disclaimer of Warranty. 78 | 79 | THE WORK IS PROVIDED "AS IS" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 80 | EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF 81 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR 82 | NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER 83 | THIS LICENSE. SOME STATES' CONSUMER LAWS DO NOT ALLOW EXCLUSION OF AN 84 | IMPLIED WARRANTY, SO THIS DISCLAIMER MAY NOT APPLY TO YOU. 85 | 86 | 5. Limitation of Liability. 
87 | 88 | EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL 89 | THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE 90 | SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, 91 | INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR 92 | RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING 93 | BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS 94 | OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER COMM ERCIAL DAMAGES 95 | OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF 96 | SUCH DAMAGES. 97 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Train a model on AWS Trn1 ParallelCluster 2 | 3 | ## Introduction 4 | 5 | This document explains how to use AWS ParallelCluster to build HPC compute cluster that uses trn1 compute nodes to run your distributed ML training job. Once the nodes are launched, we will run a training task to confirm that the nodes are working, and use SLURM commands to check the job status. In this tutorial, we will use AWS pcluster command to run a YAML file in order to generate the cluster. As an example, we are going to launch multiple trn1.32xl nodes in our cluster. 6 | 7 | We are going to set up our ParallelCluster infrastructure as below: 8 | 9 | ![image info](./examples/images/vpc-setup.png) 10 | 11 | As shown in the figure above, inside a VPC, there are two subnets, a public and a private ones. Head node resides in the public subnet, while the compute fleet (in this case, trn1 instances) are in the private subnet. A Network Address Translation (NAT) gateway is also needed in order for nodes in the private subnet to connect to clients outside the VPC. In the next section, we are going to describe how to set up all the necessary infrastructure for Trn1 ParallelCluster. 12 | 13 | 14 | 15 | ## Prerequisite infrastructure 16 | 17 | ### VPC Creation 18 | A ParallelCluster requires a VPC that has two subnets and a Network Address Translation (NAT) gateway as shown in the diagram above. [Here](./examples/general/network/vpc-subnet-setup.md) are the instructions to create the VPC and enable auto-assign public IPv4 address for the public subnet. 19 | 20 | ### Key pair 21 | A key pair is needed for access to the head node of the cluster. You may use an existing one or create a new key pair by following the instruction [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/create-key-pairs.html#having-ec2-create-your-key-pair "Create key pair") 22 | 23 | ### AWS ParallelCluster Python package 24 | 25 | AWS ParallelCluster Python package is needed in a local environment (i.e., your Mac/PC desktop with a CLI terminal or an AWS Cloud9) where you issue the command to launch the creation process for your HPC environment in AWS. See [here](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-virtual-environment.html) for instructions about installing AWS ParallelCluster Python package in your local environment. 
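If you are setting up a fresh environment, the installation is typically just a pip install into a Python virtual environment. The sketch below assumes the standard `aws-parallelcluster` package from PyPI and a POSIX shell; follow the linked guide for the authoritative steps (ParallelCluster 3.x also relies on Node.js for the underlying AWS CDK).

```
# Create and activate an isolated virtual environment for the CLI
python3 -m venv ~/pcluster-venv
source ~/pcluster-venv/bin/activate

# Install the ParallelCluster CLI and confirm it is on the PATH
python3 -m pip install --upgrade pip
python3 -m pip install aws-parallelcluster
pcluster version
```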
26 | 27 | ## Create a cluster 28 | 29 | See table below for script to create trn1 ParallelCluster: 30 | 31 | |Cluster | Link | 32 | |-------------|------------------| 33 | |16xTrn1 nodes | [trn1-16-nodes-pcluster.md](./examples/cluster-configs/trn1-16-nodes-pcluster.md) | 34 | 35 | ## Launch training job 36 | 37 | See table below for script to launch a model training job on the ParallelCluster: 38 | 39 | |Job | Link | 40 | |-------------|-------------------| 41 | |BERT Large | [dp-bert-launch-job.md](./examples/jobs/dp-bert-launch-job.md) | 42 | |GPT3 (neuronx-nemo-megatron) | [neuronx-nemo-megatron-gpt-job.md](./examples/jobs/neuronx-nemo-megatron-gpt-job.md) | 43 | |Llama 2 7B (neuronx-nemo-megatron) | [neuronx-nemo-megatron-llamav2-job.md](./examples/jobs/neuronx-nemo-megatron-llamav2-job.md) | 44 | 45 | ## Launch training job [End of Support] 46 | 47 | See table below for scripts that are no longer supported: 48 | 49 | |Job | Link | 50 | |-------------|-------------------| 51 | |GPT3 (Megatron-LM) | [gpt3-launch-job.md](./examples/jobs/gpt3-launch-job.md) | 52 | 53 | ## Security 54 | 55 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 56 | 57 | ## License 58 | 59 | This library is licensed under the Amazon Software License. 60 | 61 | 62 | ## Release Notes 63 | 64 | Please refer to the [Change Log](releasenotes.md). 65 | 66 | -------------------------------------------------------------------------------- /examples/cluster-configs/trn1-16-nodes-pcluster.md: -------------------------------------------------------------------------------- 1 | # Create ParallelCluster 2 | 3 | 1. Once your VPC, ParallelCluster python package, and key pair are set up, you are ready to create a ParallelCluster. Copy the following content into a launch.yaml file in your local desktop where AWS ParallelCluster CLI is installed. Here is an example YAML file content: 4 | 5 | ``` 6 | Region: # i.e., us-west-2 7 | Image: 8 | Os: ubuntu2004 9 | HeadNode: 10 | InstanceType: c5.4xlarge 11 | Networking: 12 | SubnetId: subnet- 13 | Ssh: 14 | KeyName: 15 | LocalStorage: 16 | RootVolume: 17 | Size: 1024 18 | CustomActions: 19 | OnNodeConfigured: 20 | Script: s3://neuron-s3/pcluster/post-install-scripts/neuron-installation/v/u20/pt/install_neuron.sh 21 | Iam: 22 | S3Access: 23 | - BucketName: neuron-s3 24 | EnableWriteAccess: false 25 | Scheduling: 26 | Scheduler: slurm 27 | SlurmQueues: 28 | - Name: compute1 29 | CapacityType: ONDEMAND 30 | ComputeSettings: 31 | LocalStorage: 32 | RootVolume: 33 | Size: 1024 34 | EphemeralVolume: 35 | MountDir: /local_storage 36 | ComputeResources: 37 | - Efa: 38 | Enabled: true 39 | InstanceType: trn1.32xlarge 40 | MaxCount: 16 41 | MinCount: 0 42 | Name: queue1-i1 43 | Networking: 44 | SubnetIds: 45 | - subnet- 46 | PlacementGroup: 47 | Enabled: true 48 | CustomActions: 49 | OnNodeConfigured: 50 | Script: s3://neuron-s3/pcluster/post-install-scripts/neuron-installation/v/u20/pt/install_neuron.sh 51 | Iam: 52 | S3Access: 53 | - BucketName: neuron-s3 54 | EnableWriteAccess: false 55 | SharedStorage: 56 | - FsxLustreSettings: 57 | DeploymentType: SCRATCH_2 58 | StorageCapacity: 1200 59 | MountDir: /fsx 60 | Name: pclusterfsx 61 | StorageType: FsxLustre 62 | ``` 63 | 64 | Please replace the placeholders ``, ``, `` and `` with appropriate values. 65 | 66 | For example, to use Neuron SDK release version 2.13.0, replace `` with `2.13.0` in the two `CustomActions: OnNodeConfigured: Script` fields (make sure `2.13.0` is preceded by the letter `v`). 
67 | 68 | ``` 69 | CustomActions: 70 | OnNodeConfigured: 71 | Script: s3://neuron-s3/pcluster/post-install-scripts/neuron-installation/v2.13.0/u20/pt/install_neuron.sh 72 | ``` 73 | The ``and ``values are obtained following the [VPC creation guide](./examples/general/network/vpc-subnet-setup.md). 74 | 75 | The `` is obtained following [key pair setup](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/create-key-pairs.html#having-ec2-create-your-key-pair) 76 | 77 | The YAML file above will create a ParallelCluster with a c5.4xlarge head node, and 16 trn1.32xl compute nodes. All `MaxCount` trn1 nodes are in the same queue. In case you need to isolate compute nodes with different queues, simply append another instanceType designation to the current instanceType, and designate `MaxCount` for each queue, for example, `InstanceType` section would be become: 78 | 79 | ``` 80 | InstanceType: trn1.32xlarge 81 | MaxCount: 8 82 | MinCount: 0 83 | Name: queue-0 84 | InstanceType: trn1.32xlarge 85 | MaxCount: 8 86 | MinCount: 0 87 | Name: queue-1 88 | ``` 89 | 90 | So now you have two queues, each queue is designated to a number of trn1 compute nodes. An unique feature for trn1.32xlarge instance is the EFA interfaces built for high performance/low latency network data transfer. This is indicated by: 91 | 92 | ``` 93 | - Efa: 94 | Enabled: true 95 | ``` 96 | 97 | If you are using trn1.2xl instance, this feature is not enabled, and in which case, you don’t need such designation. 98 | 99 | 2. In the virtual environment where you installed AWS ParallelCluster API, run the following command (assuming you have saved the configurations above in `configuration.yaml`): 100 | 101 | ``` 102 | pcluster create-cluster --cluster-configuration configuration.yaml -n My-PCluster-Trn1 103 | ``` 104 | Where 105 | 106 | `cluster-configuration` is the YAML file 107 | 108 | This will create a ParallelCluster in your AWS account, and you may inspect the progress in AWS CloudFormation console. 109 | 110 | You may also check cluster status using `pcluster` command, for example: 111 | 112 | `pcluster describe-cluster -r us-west-2 -n My-PCluster-Trn1` 113 | 114 | 3. During the cluster creation process, post-install actions now takes place automatically via `CustomActions` indicated in `configuration.yaml` to configure the head node and any static compute nodes (`MinCount` > 0). `CustomActions` will install Neuron drivers and runtime, EFA drivers, and Neuron tools. 115 | 116 | 4. After post-installation actions are complete, the ParallelCluster environment is properly configured to run SLURM jobs. Rerun `pcluster describe-cluster ...` command above to see the head node IP address, such that you may SSH into it for the [next part of the tutorial](../jobs/dp-bert-launch-job.md) where you would launch a training job. 117 | 118 | ## Custom script update 119 | 120 | To make it easier to upgrade to newer versions, you can copy the `install_neuron.sh` to your own s3 bucket s3://```` and modify `CustomActions: OnNodeConfigured: Script` and `CustomActions: OnNodeConfigured: Iam: S3Access:` fields (2 sets). 121 | 122 | ``` 123 | CustomActions: 124 | OnNodeConfigured: 125 | Script: s3:///install_neuron.sh 126 | Iam: 127 | S3Access: 128 | - BucketName: 129 | EnableWriteAccess: false 130 | ``` 131 | 132 | Then create the cluster as above. 
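For example, a straightforward way to seed your bucket is to pull the script from the public `neuron-s3` bucket and push it to your own. The bucket name below is a placeholder, and the source path follows the same versioning convention described above:

```
# Download the post-install script for the Neuron release you want (v2.13.0 shown),
# then upload it to the bucket referenced in your configuration.yaml.
aws s3 cp s3://neuron-s3/pcluster/post-install-scripts/neuron-installation/v2.13.0/u20/pt/install_neuron.sh .
aws s3 cp ./install_neuron.sh s3://<your_bucket_name>/install_neuron.sh
```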
133 | 134 | When you are ready to update your own s3 bucket's copy after a new Neuron SDK release, first stop your ParallelCluster from the system and environment where ParallelCluster API was installed: 135 | ``` 136 | pcluster update-compute-fleet --cluster-name -r --status STOP_REQUESTED 137 | ``` 138 | Next copy the new version to your own s3 bucket, then restart the ParallelCluster. 139 | ``` 140 | pcluster update-compute-fleet --cluster-name -r 146 | cd ~/ 147 | aws s3 cp s3:///install_neuron.sh . 148 | source install_neuron.sh 149 | ``` 150 | 151 | ## Updating configuration.yaml 152 | 153 | You can change the configuration.yaml to update for example compute resources. First stop your ParallelCluster from the system and environment where ParallelCluster API was installed: 154 | ``` 155 | pcluster update-compute-fleet --cluster-name -r --status STOP_REQUESTED 156 | ``` 157 | Update the cluster using the new configuration YAML version (assuming it is named `new_configuration.yaml`): 158 | ``` 159 | pcluster update-cluster -c new_configuration.yaml --cluster-name -r YOUR_REGION 160 | ``` 161 | Then restart the ParallelCluster. 162 | ``` 163 | pcluster update-compute-fleet --cluster-name -r .out` file generated in `~/aws-neuron-samples/torch-neuronx/training/dp_bert_hf_pretrain`. 42 | 43 | ## Launch training 44 | After the compilation job is finished, start the actual pretraining: 45 | 46 | ``` 47 | cd ~/aws-neuron-samples/torch-neuronx/training/dp_bert_hf_pretrain 48 | sbatch --exclusive --nodes=16 --wrap "srun ./run_dp_bert_large_hf_pretrain_bf16_s128_lamb.sh" 49 | ``` 50 | 51 | Again, the job id will be displayed by sbatch and you can follow the training by inspecting the file `slurm_.out` file generated in `~/examples/dp_bert_hf_pretrain`. 52 | 53 | ### Cluster scalability 54 | 55 | In a Trn1 cluster, multiple interconnected Trn1 instances run a large model training workload in parallel and reduce total computation time, or time to convergence. There are two measures of scalability of a cluster: strong scaling and weak scaling. Typically, for model training, the need is to speed up training run, because usage cost is determined by sample throughput for rounds of gradient updates. This means strong scaling is an important measure of scalability for model training. Strong scaling refers to the scenario where the total problem size stays the same as the number of processors increases. In evaluating strong scaling, or the impact of parallelization, we want to keep global batch size same and see how much time it takes to convergence. In such scenario, we need to adjust gradient accumulation micro-step according to number of compute nodes. This is achieved with the following in the downloaded training shell script `run_dp_bert_large_hf_pretrain_bf16_s128_lamb.sh`: 56 | 57 | ``` 58 | GRAD_ACCUM_USTEPS=$(($GRAD_ACCUM_USTEPS/$WORLD_SIZE_JOB)) 59 | ``` 60 | 61 | The SLURM shell script automatically adjust the gradient accumulation microsteps to keep the global batch size for phase 1 at 65536 with LAMB optimizer (strong scaling). 62 | 63 | On the other hand, if the interest is to evaluate how much more workloads can be executed at a fixed time by adding more nodes, then use weak scaling to measure scalability. In weak scaling, the problem size increases at the same rate as number of processors, thereby keeping amount of work per processor the same. To see performance for larger global batch size (weak scaling), please comment out the line above. 
Doing so would keep number of steps for gradient accumulation constant with a default value (i.e., 128) provided in the training script `run_dp_bert_large_hf_pretrain_bf16_s128_lamb.sh`. 64 | 65 | ## Tips 66 | 67 | Some useful SLURM commands are `sinfo`, `squeue` and `scontrol`. `sinfo` command displays information about SLURM node names and partitions. `squeue` command provides information about job queues currently running in the Slurm schedule. SLURM will generate a log file `slurm-XXXXXX.out`. You may then use `tail -f slurm-XXXXXX.out`, to inspect the job summary. `scontrol show node ` can show more information such as node state, power consumption, and more. 68 | 69 | 70 | ## Known issues/limitations 71 | 72 | - The current setup supports up to 128 nodes BERT pretraining with LAMB optmizer when using strong scaling. 73 | 74 | ## Troubleshooting guide 75 | 76 | See [Troubleshooting Guide for AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html) for more details and fixes to common issues. 77 | -------------------------------------------------------------------------------- /examples/jobs/finetuning/neuronx-nemo-megatron-llamav2-finetuning-job.md: -------------------------------------------------------------------------------- 1 | # Launch a Llama2 finetuning job using neuronx-nemo-megatron 2 | 3 | This tutorial explains how to run Llama V2 finetuning jobs with AWS EC2 trn1.32xl instances using [neuronx-nemo-megatron](https://github.com/aws-neuron/neuronx-nemo-megatron) and [AWS ParallelCluster](https://aws.amazon.com/hpc/parallelcluster/). 4 | 5 | neuronx-nemo-megatron (also known as "AWS Neuron Reference for NeMo Megatron") includes modified versions of the open-source packages [NeMo](https://github.com/NVIDIA/NeMo) and [Apex](https://github.com/NVIDIA/apex) that have been adapted for use with AWS Neuron and AWS EC2 Trn1 instances. neuronx-nemo-megatron allows for pretraining models with hundreds of billions of parameters across thousands of Trainium accelerators, and enables advanced training capabilities such as 3D parallelism, sequence parallelism, and activation checkpointing. 6 | 7 | ## Prerequisites 8 | Before proceeding with this tutorial, please follow [these instructions](https://github.com/aws-neuron/aws-neuron-parallelcluster-samples#train-a-model-on-aws-trn1-parallelcluster) to create a ParallelCluster consisting of 1 or more trn1.32xl or trn1n.32xl nodes. ParallelCluster automates the creation of trn1 clusters, and provides the SLURM job management system for scheduling and managing distributed training jobs. Please note that the home directory on your ParallelCluster head node will be shared with all of the worker nodes via NFS. 9 | 10 | ## Install neuronx-nemo-megatron 11 | 12 | With your trn1 ParallelCluster in place, begin by logging into the head node of your cluster using SSH. To provide access to TensorBoard (required in a later step), please make sure that you enable port forwarding for TCP port 6006 when you login, ex: 13 | ``` 14 | ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006 15 | ``` 16 | 17 | Once logged into the head node, activate the provided PyTorch Neuron virtual environment that was created when you set up your ParallelCluster. 
**Note**: if your PyTorch Neuron environment is lower than Neuron 2.11, please refer to the [Neuron documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update.html#pytorch-neuronx-update) for instructions on updating to Neuron 2.11 or later. 18 | ``` 19 | cd ~ 20 | source ./aws_neuron_venv_pytorch/bin/activate 21 | ``` 22 | 23 | Next, clone the neuronx-nemo-megatron repo to the head node: 24 | ``` 25 | cd ~ 26 | git clone https://github.com/aws-neuron/neuronx-nemo-megatron.git 27 | cd neuronx-nemo-megatron 28 | ``` 29 | 30 | Install the `wheel` Python package and run the build script to create the neuronx-nemo-megatron wheels: 31 | ``` 32 | pip3 install wheel 33 | ./build.sh 34 | ``` 35 | 36 | Install the neuronx-nemo-megatron packages and dependencies in your virtual environment: 37 | ``` 38 | pip3 install ./build/*.whl 39 | pip3 install -r requirements.txt torch==1.13.1 protobuf==3.20.3 40 | ``` 41 | 42 | Build the Megatron helper module 43 | ``` 44 | cd ~ 45 | python3 -c "from nemo.collections.nlp.data.language_modeling.megatron.dataset_utils import compile_helper; \ 46 | compile_helper()" 47 | ``` 48 | 49 | The above utility will help make this file : ```nemo.collections.nlp.data.language_modeling.megatron.dataset_utils``` and below is the expected output (You can ignore the error) 50 | ``` 51 | 2023-Aug-17 22:53:01.0674 47940:47940 ERROR TDRV:tdrv_get_dev_info No neuron device available 52 | [NeMo W 2023-08-17 22:53:03 optimizers:67] Could not import distributed_fused_adam optimizer from Apex 53 | [NeMo W 2023-08-17 22:53:04 experimental:27] Module is experimental, not ready for production and is not fully supported. Use at your own risk. 54 | ``` 55 | 56 | ## Download LlamaV2 dataset and tokenizer 57 | This tutorial makes use of the xsum dataset. The dataset can be downloaded from HuggingFace by running the following commands in a python3 shell or file: 58 | 59 | ``` 60 | from datasets import load_dataset 61 | 62 | dataset = load_dataset("xsum") 63 | 64 | dataset = dataset.rename_column('document', 'input') 65 | dataset = dataset.rename_column('summary', 'output') 66 | 67 | output_file_path = "xsum_dataset.jsonl" 68 | dataset['train'].to_json(output_file_path, orient='records', lines=True) 69 | ``` 70 | The above command will give you the raw dataset of around 255mb which needs to be tokenized using a llamaV2 tokenizer. To tokenize the data, you need to request the tokenizer from hugging face and meta following the below link : 71 | 72 | [Request Tokenizer and model weights from hugging face](https://huggingface.co/meta-llama/Llama-2-7b) 73 | 74 | Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the above website and accept our License before requesting access here. 75 | 76 | The file will be tokenized automatically by Neuron NeMo and converted to memory map format. 77 | ## Convert LlamaV2 to Neuron NeMo Format 78 | The LLama models from HuggingFace must be converted into the specified tensor parallel and pipeline parallel format. 79 | Please run the below command with the tensor parallel and pipeline parallel config you are using 80 | We will assume tp=8 and pp=1. 
81 | ``` 82 | python3 checkpoint_conversion/convert_hf_checkpoint_to_nemo_llama.py \ 83 | --path_to_checkpoint='PATH_TO_LLAMA_TOKENIZER/llamav2_weights/7b-hf' \ 84 | --config_file='PATH_TO_LLAMA_TOKENIZER/llamav2_weights/7b-hf/config.json' \ 85 | --output_path="/output/directory" \ 86 | --tp_degree=8 \ 87 | --pp_degree=1 \ 88 | --save_bf16=True 89 | ``` 90 | 91 | ## Llama2 training configurations 92 | We tested with the following model sizes: 7B 93 | ### Llama2 7B 94 | 95 | - Model configuration 96 | - Attention heads: 32 97 | - Layers: 32 98 | - Sequence length: 4096 99 | - Hidden size: 4096 100 | - Hidden FFN size: 11008 101 | - Microbatch size: 1 102 | - Global batch size: 256 103 | 104 | - Distributed training configuration 105 | - Number of nodes: 4 106 | - Tensor parallel degree: 8 107 | - Pipeline parallel degree: 1 108 | - Data parallel degree: 16 109 | 110 | 111 | ## Pre-compile the model 112 | By default, PyTorch Neuron uses a just in time (JIT) compilation flow that sequentially compiles all of the neural network compute graphs as they are encountered during a training job. The compiled graphs are cached in a local compiler cache so that subsequent training jobs can leverage the compiled graphs and avoid compilation (so long as the graph signatures and Neuron version have not changed). 113 | 114 | An alternative to the JIT flow is to use the included [neuron_parallel_compile](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/training/pytorch-neuron-parallel-compile.html?highlight=neuron_parallel_compile) command to perform ahead of time (AOT) compilation. In the AOT compilation flow, the compute graphs are first identified and extracted during a short simulated training run, and the extracted graphs are then compiled and cached using parallel compilation, which is considerably faster than the JIT flow. 115 | 116 | Before starting the compilation you need to update your path to the dataset and tokenizer in the ```test_llama.sh``` script for pretraining llama 7b : 117 | You also want to enable automatic conversion to HuggingFace. Note do not include the # comments in the script, as it breaks hydra parsing. 118 | ``` 119 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 120 | 121 | # For llama 7b 122 | vi test_llama.sh 123 | 124 | 125 | # Update the below lines 126 | # For tokenizer 127 | model.tokenizer.type='PATH_TO_LLAMA_TOKENIZER/llamav2_weights/7b-hf' \ 128 | 129 | # For Dataset and Finetuning 130 | model.data.fine_tuning=True \ 131 | model.data.train_ds.file_names=[PATH_TO_XSUM_JSONL/xsum_dataset.jsonl] \ 132 | 133 | # To load pretrained Llama model 134 | +model.load_xser=True \ 135 | model.resume_from_checkpoint='CONVERTED_CHECKPOINT_PATH/model_optim_rng.ckpt' \ 136 | model.use_cpu_initialization=False \ 137 | 138 | # For HuggingFace Conversion 139 | model.convert_to_hf=True \ 140 | model.output_dir='PATH_TO_SAVE_CONVERTED_MODEL' \ 141 | model.config_path='PATH_TO_LLAMA_TOKENIZER/llamav2_weights/7b-hf/config.json' \ 142 | 143 | # To save checkpoint on end 144 | exp_manager.checkpoint_callback_params.save_last=True \ 145 | 146 | 147 | ``` 148 | Run the following commands to launch an AOT pre-compilation job on your ParallelCluster: 149 | ``` 150 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 151 | sbatch --nodes 4 compile.slurm ./llama_7b.sh 152 | ``` 153 | 154 | 155 | Once you have launched the precompilation job, run the `squeue` command to view the SLURM job queue on your cluster. 
If you have not recently run a job on your cluster, it may take 4-5 minutes for the requested trn1.32xlarge nodes to be launched and initialized. Once the job is running, `squeue` should show output similar to the following: 156 | ``` 157 | JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 158 | 10 compute1 compile.slurm ubuntu R 5:11 4 compute1-dy-queue1-i1-[1-4] 159 | ``` 160 | 161 | You can view the output of the precompilation job by examining the file named `slurm-compile.slurm-ZZ.out` where ZZ represents the JOBID of your job in the `squeue` output, above. Ex: 162 | ``` 163 | tail -f slurm-compile.slurm-10.out 164 | ``` 165 | 166 | Once the precompilation job is complete, you should see a message similar to the following in the logs: 167 | ``` 168 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total graphs: 22 169 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total successful compilations: 22 170 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total failed compilations: 0 171 | ``` 172 | 173 | At this point, you can press `CTRL-C` to exit the tail command. 174 | 175 | ## Launch a finetuning job 176 | The Llama2 finetuning job can be launched in the same manner as the precompilation job described above. In this case, we change the SLURM script from `compile.slurm` to `run.slurm`, but the other parameters remain the same: 177 | ``` 178 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 179 | sbatch --nodes 4 run.slurm ./llama_7b.sh 180 | ``` 181 | 182 | 183 | As outlined above, you can again use the `squeue` command to view the job queue. Once you see that your pretraining job is running, you can view the output of the training job by examining the file named `slurm-run.slurm-ZZ.out` where ZZ represents the JOBID of your job: 184 | ``` 185 | tail -f slurm-run.slurm-11.out 186 | ``` 187 | 188 | Once the model is loaded onto the Trainium accelerators and training has commenced, you will begin to see output indicating the job progress: 189 | ``` 190 | Epoch 0: 28%|██▊ | 424/1507 [41:42<1:46:32, 5.90s/it, loss=1.22, v_num=1778, reduced_train_loss=1.270, gradient_norm=5.780, parameter_norm=1568.0, global_step=423.0, consumed_samples=27072.0, throughput=11.10, thoughput_peak=11.30] 191 | ``` 192 | 193 | ## Monitor training 194 | ### TensorBoard 195 | In addition to the text-based job monitoring described in the previous section, you can also use standard tools such as TensorBoard to monitor training job progress. To view an ongoing training job in TensorBoard, you first need to identify the experiment directory associated with your ongoing job. This will typically be the most recently created directory under `~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_llama`. Once you have identifed the directory, `cd` into it, and then launch TensorBoard: 196 | ``` 197 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_llama 198 | ls -alt|head 199 | # Identify the correct experiment directory in the 200 | # output of the ls command, ex: 2023-06-10_00-22-42 201 | cd YOUR_EXPERIMENT_DIR # <- replace this with your experiment directory 202 | tensorboard --logdir ./ 203 | ``` 204 | 205 | With TensorBoard running, you can then view the TensorBoard dashboard by browsing to http://localhost:6006 on your local machine. 
If you cannot access TensorBoard at this address, please make sure that you have port-forwarded TCP port 6006 when SSH'ing into the head node, ex: `ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006` 206 | 207 | ### neuron-top / neuron-monitor / neuron-ls 208 | The [neuron-top](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-top-user-guide.html?highlight=neuron-top) tool can be used to view useful information about NeuronCore utilization, vCPU and RAM utilization, and loaded graphs on a per-node basis. To use neuron-top during on ongoing training job, first SSH into one of your compute nodes from the head node, and then run `neuron-top`: 209 | ``` 210 | ssh compute1-dy-queue1-i1-1 # to determine which compute nodes are in use, run the squeue command 211 | neuron-top 212 | ``` 213 | 214 | Similarly, once you are logged into one of the active compute nodes, you can also use other Neuron tools such as [neuron-monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) and [neuron-ls](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) to capture performance/utilization statistics and to understand NeuronCore allocation. 215 | 216 | ## Key Features 217 | * We were able to make llama work with zero optimizer but have enabled it by default. To reduce the memory pressure, you can give it by adding the below hyper parameter in your run script : 218 | ``` 219 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/ 220 | 221 | 222 | # For llama 7b 223 | vi test_llama.sh 224 | 225 | # Add the below line in the run script : 226 | model.wrap_with_zero=True \ 227 | 228 | ``` 229 | 230 | ## Known issues/limitations 231 | * The initial release of neuronx-nemo-megatron supports Llama2 pretraining and finetuning only. Model evaluation can be performed in transformers-neuronx library with the converted HuggingFace model. 232 | * neuronx-nemo-megatron currently requires pytorch-lightning v1.8.6 233 | 234 | ## Troubleshooting guide 235 | See [Troubleshooting Guide for AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html) for more details and fixes to common issues. 236 | -------------------------------------------------------------------------------- /examples/jobs/gpt3-launch-job.md: -------------------------------------------------------------------------------- 1 | # Launch GPT-3 training job [End of Support] 2 | Once the cluster is successfully created and the Neuron packages are installed, please ssh into the head node to begin the training example. As an example here, we expand the [GPT3 pretraining with Megatron-LM [End of Support]](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/megatron_lm_gpt.html) tutorial to run on a cluster using SLURM job scheduler. 3 | 4 | You will be running commands from the head node of the cluster. On ParallelCluster, the home directory is shared between the head node and compute nodes so files in the home directory are visible to worker nodes. 5 | 6 | You may inspect the ParallelCluster with the following command: 7 | 8 | ```sh 9 | sinfo 10 | ``` 11 | 12 | and expect to see the output which indicates your node's status. 
An example is: 13 | 14 | ``` 15 | PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 16 | compute1* up infinite 16 alloc compute1-dy-queue1-i1-[1-16] 17 | ``` 18 | 19 | For all the commands below, make sure you are in the virtual environment created during setup before you run the commands. SLURM job scheduler will automatically activate the virtual environment when the training script is run on the worker nodes. 20 | 21 | ```sh 22 | source ~/aws_neuron_venv_pytorch/bin/activate 23 | ``` 24 | 25 | Use this virtual environment in the head node for the steps below. 26 | 27 | ## Download preprocessed training dataset 28 | 29 | In this tutorial, we use the Wikipedia dataset as an example to demonstrate training at scale. 30 | You can download the vocabulary file, the merge table file, and the preprocessed Wikipedia dataset with the following commands: 31 | 32 | ```sh 33 | export DATA_DIR=~/examples_datasets/gpt2 34 | 35 | mkdir -p ${DATA_DIR} && cd ${DATA_DIR} 36 | 37 | wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json 38 | wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt 39 | aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.bin . --no-sign-request 40 | aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.idx . --no-sign-request 41 | aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/license.txt . --no-sign-request 42 | ``` 43 | 44 | To prepare your own dataset from scratch, please follow the steps in [Preparing Wikipedia Dataset from Scratch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/megatron_lm_gpt.html#preparing-wikipedia-dataset-from-scratch) 45 | 46 | ## Download training scripts 47 | 48 | In this section we will download the training scripts and build the necessary dependencies. 49 | 50 | First, install Python3 development package needed to build the data helpers tools. If you are on Amazon Linux, do: 51 | 52 | ```sh 53 | sudo yum install -y python3-devel 54 | ``` 55 | 56 | If you are on Ubuntu, do: 57 | 58 | ```sh 59 | sudo apt install -y python3-dev 60 | ``` 61 | 62 | Next, clone the AWS Neuron Reference for Megatron-LM package, install dependencies and build helpers tool: 63 | 64 | ```sh 65 | cd ~/ 66 | git clone https://github.com/aws-neuron/aws-neuron-reference-for-megatron-lm.git 67 | pip install pybind11 regex 68 | pushd . 69 | cd aws-neuron-reference-for-megatron-lm/megatron/data/ 70 | make 71 | popd 72 | ``` 73 | 74 | There will be an `~/aws-neuron-reference-for-megatron-lm` directory from which you will be running the SLURM commands. The shell scripts needed to run the tutorial are in `~/aws-neuron-reference-for-megatron-lm/examples` directory. 75 | 76 | ## GPT-3 6.7B training configuration 77 | 78 | In this example, we are going to run a pretraining job for the GPT-3 6.7B model using the following model configuration: 79 | 80 | ```sh 81 | Hidden size = 4096 82 | Sequence len = 2048 83 | Num heads = 32 84 | Num layers = 32 85 | Microbatch = 1 86 | Gradient accumulation microsteps = 64 87 | ``` 88 | The distributed configuration is tensor parallel degree 32, pipeline parallel degree 1, and data parallel degree 16 if using 16 nodes. The global batch size is 1024. 
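As a quick sanity check on these numbers (a back-of-the-envelope sketch, assuming 16 trn1.32xl nodes with 32 workers each, as used by the training script below):

```sh
# data parallel degree = (16 nodes x 32 workers) / (tensor parallel 32 x pipeline parallel 1) = 16
# global batch size    = microbatch 1 x gradient accumulation microsteps 64 x data parallel 16 = 1024
```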
89 | 90 | ## GPT-3 6.7B training script 91 | 92 | The training shell script pretrain_gpt3_6.7B_32layers_bf16_bs1024_slurm.sh will be launched by SLURM on each worker node, where it prepares the environment and invokes torchrun to launch the Python script pretrain_gpt.py on 32 workers. The environment settings are: 93 | 94 | - Enable Elastic Fabric Adapter for higher networking performance 95 | - Mark all parameter transfers as static to enable runtime optimizations for wrapped torch.nn modules 96 | - Enables custom lowering for Softmax operation to enable compiler optimizations and improve GPT performance 97 | - Cast training to BF16 and enable stochastic rounding 98 | - Increase Neuron RT execution timeout in case slow compilation causes Neuron RT to wait longer than default timeout 99 | - Ensure enough framework threads to execute collective compute operations to prevent hang 100 | - Separate NeuronCache dir per node, workaround limitation to file locking on NFS 101 | - Run fewer steps and redirect TensorBoard logging when extract graphs only (during neuron_parallel_compile) 102 | 103 | The training shell script uses `torchrun` to run multiple pretrain_gpt.py processes, with world size, node rank, and master address extracted from SLURM node information. 104 | 105 | ```sh 106 | MASTER_ADDR=(`scontrol show hostnames $SLURM_JOB_NODELIST`) 107 | WORLD_SIZE_JOB=$SLURM_NTASKS 108 | RANK_NODE=$SLURM_NODEID 109 | DISTRIBUTED_ARGS="--nproc_per_node 32 --nnodes $WORLD_SIZE_JOB --node_rank $RANK_NODE --master_addr $MASTER_ADDR --master_port 2022" 110 | torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \ 111 | ...(options)... 112 | ``` 113 | 114 | The Python script pretrain_gpt.py calls the Megatron pretraining API with the Megatron GPT model builder, the loss and forward functions, and the dataset provider. It also sets the default compiler flag to model-type transformer for improved transformer support in compiler. 115 | 116 | ## Precompiling the training graphs 117 | 118 | Precompiling the training graphs is an optional step to reduce just-in-time graph compilations during training. This is done using the [Neuron Parallel Compile tool](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/training/pytorch-neuron-parallel-compile.html), which extracts the graphs used during training and perform multiple graph compilations in parallel. 119 | 120 | To precompile the training graphs, in `~/aws-neuron-reference-for-megatron-lm` directory, issue the following command to submit a SLURM job: 121 | 122 | ```sh 123 | cd ~/aws-neuron-reference-for-megatron-lm 124 | sbatch --exclusive -N 16 --wrap "srun neuron_parallel_compile ./examples/pretrain_gpt3_6.7B_32layers_bf16_bs1024_slurm.sh" 125 | ``` 126 | 127 | You can also use the SLURM script `examples/pretrain_gpt3_6.7B_compile.slurm` provided for convenience: 128 | 129 | ```sh 130 | cd ~/aws-neuron-reference-for-megatron-lm 131 | sbatch ./examples/pretrain_gpt3_6.7B_compile.slurm 132 | ``` 133 | 134 | Note that currently each node has it's own separate cache in `~/neuron_cache/gpt//neuron-compile-cache` to workaround a known NFS file-locking limitation. This is configured by the following line in the `./examples/pretrain_gpt3_6.7B_32layers_bf16_bs1024_slurm.sh` script. 
135 | 136 | ```sh 137 | export NEURON_CC_FLAGS="--cache_dir=$HOME/neuron_cache/gpt/`hostname`" 138 | ``` 139 | If the cluster size is larger than 16, use `--nodelist=` to limit the nodes used during precompilation and actual training run to ensure the workers on each node reads from the correct cache. The nodelist must match between precompilation and the actual run. 140 | 141 | The job id will be displayed by `squeue` command. You can monitor the results of the compilation job by inspecting the file `slurm-.out`file generated in `~/aws-neuron-reference-for-megatron-lm` directory. To follow the progress of this SLURM job, you may stream the SLURM output file in real time: 142 | 143 | ```sh 144 | tail -f slurm-.out 145 | ``` 146 | Note that there are many processes across many instances (nodes) running in parallel, and all the outputs are combined into the `slurm-.out` file. You can examine individual node's log by looking into `run_log_gpt3_6.7B_32layers_bf16..16.txt` file. 147 | 148 | The graph extraction sets NEURON_EXTRACT_GRAPHS_ONLY environment variable which cause the graph execution to execute empty graphs with zero outputs. The zero outputs cause execution results to be random, so the TensorBoard log is redirected to `/tmp/parallel_compile_ignored_tb_output`to enable clean TB log of the actual run in the next section. 149 | 150 | Currently, the total compilation time for GPT-3 6.7B example is about 30 minutes. When each node is finished compilation, you should see the following at the end of each node's log: 151 | 152 | ```sh 153 | 2023-05-08 20:14:50.000983: INFO ||PARALLEL_COMPILE||: Total graphs: 26 154 | 2023-05-08 20:14:50.000983: INFO ||PARALLEL_COMPILE||: Total successful compilations: 26 155 | 2023-05-08 20:14:50.000983: INFO ||PARALLEL_COMPILE||: Total failed compilations: 0 156 | ``` 157 | 158 | ## Launch training script 159 | 160 | Before or after the precompilation job is finished, submit the actual pretraining job by running the following slurm command: 161 | 162 | ``` 163 | cd ~/aws-neuron-reference-for-megatron-lm 164 | sbatch --exclusive -N 16 --wrap "srun ./examples/pretrain_gpt3_6.7B_32layers_bf16_bs1024_slurm.sh" 165 | ``` 166 | 167 | You can also use the SLURM script `examples/pretrain_gpt3_6.7B.slurm` provided for convenience: 168 | 169 | ```sh 170 | cd ~/aws-neuron-reference-for-megatron-lm 171 | sbatch ./examples/pretrain_gpt3_6.7B.slurm 172 | ``` 173 | 174 | As mentioned above, If the cluster size is larger than 16, use `--nodelist=` to limit the nodes used during precompilation and actual training run to ensure the workers on each node reads from the correct cache. The nodelist must match between precompilation and the actual run. 175 | 176 | You can also run the script with more or less number of nodes as long as the cluster size supports the node count. Further more, note that the global batch size (equal to gradient accumulation count times number of nodes) will change when the number of nodes is changed. 177 | 178 | If the submission is done before precompilation job is finished, the new submitted job will be queued in the SLURM job queue. You can use `squeue` to see running and queued jobs. 179 | 180 | Again, the job id will be displayed by `squeue` command and you can follow the training by inspecting the file `slurm-.out` file. You can examine individual node's log by looking into `run_log_gpt3_6.7B_32layers_bf16..16.txt` file. 
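If your cluster has more than 16 nodes and you follow the `--nodelist` advice above, the submission might look like the sketch below; the node names mirror the `sinfo` example earlier and are otherwise an assumption about your cluster's naming:

```sh
# Use the same node list for precompilation and training so each node
# reuses the compiler cache it populated under ~/neuron_cache/gpt/<hostname>.
sbatch --exclusive -N 16 --nodelist=compute1-dy-queue1-i1-[1-16] \
  --wrap "srun ./examples/pretrain_gpt3_6.7B_32layers_bf16_bs1024_slurm.sh"
```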
181 | 182 | After an initial startup, you should see lines like the following that indicate training progress (iteration, loss, elapsed time, throughput, etc). 183 | 184 | ```sh 185 | iteration 64/ 143051 | consumed samples: 65536 | elapsed time per iteration (ms): 7313.1 | learning rate: 4.956E-05 | global batch size: 1024 | lm loss: 7.281250E+00 | grad norm: 3.562 | throughput: 140.022 | 186 | 187 | ``` 188 | ## View TensorBoard trace 189 | 190 | You can examine the TensorBoard trace by ssh into the headnode with `-L 6006:localhost:6006' option, then: 191 | 192 | ```sh 193 | source ~/aws_neuron_venv_pytorch/bin/activate 194 | cd ~/aws-neuron-reference-for-megatron-lm 195 | tensorboard --logdir ./tb_gpt3_6.7B_32layers_bf16 196 | ``` 197 | On the host from where you ssh into the headnode, you can use a browser to go to `http://localhost:6006/` to view TensorBoard. 198 | 199 | ## View NeuronCore activities 200 | 201 | To inspect NeuronCore activities in the compute node, you can SSH from head node into any compute node, for example: 202 | 203 | ```sh 204 | ssh compute1-dy-queue1-i1-1 205 | ``` 206 | and then run `neuron-top` in the compute node to see NeuronCore activities while training in happening 207 | 208 | ## Tips 209 | 210 | Some useful SLURM commands are `sinfo`, `squeue` and `scontrol`. `sinfo` command displays information about SLURM node names and partitions. `squeue` command provides information about job queues currently running in the Slurm schedule. SLURM will generate a log file `slurm-XXXXXX.out`. You may then use `tail -f slurm-XXXXXX.out`, to inspect the job summary. `scontrol show node ` can show more information such as node state, power consumption, and more. 211 | 212 | 213 | ## Known issues/limitations 214 | 215 | - "Failed accept4: Too many open files" error and the solution from [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/megatron_lm_gpt.html?highlight=megatron%20lm#failed-accept4-too-many-open-files) 216 | - "cannot import name'helppers' from 'megatron.data' error and the solution from [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/megatron_lm_gpt.html?highlight=megatron%20lm#error-cannot-import-name-helpers-from-megatron-data) 217 | 218 | ## Troubleshooting guide 219 | 220 | See [Troubleshooting Guide for AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html) for more details and fixes to common issues. 221 | -------------------------------------------------------------------------------- /examples/jobs/neuronx-nemo-megatron-gpt-job.md: -------------------------------------------------------------------------------- 1 | # Launch a GPT-3 pretraining job using neuronx-nemo-megatron 2 | 3 | This tutorial explains how to run GPT-3 pretraining jobs with AWS EC2 trn1.32xl instances using [neuronx-nemo-megatron](https://github.com/aws-neuron/neuronx-nemo-megatron) and [AWS ParallelCluster](https://aws.amazon.com/hpc/parallelcluster/). 4 | 5 | neuronx-nemo-megatron (also known as "AWS Neuron Reference for NeMo Megatron") includes modified versions of the open-source packages [NeMo](https://github.com/NVIDIA/NeMo) and [Apex](https://github.com/NVIDIA/apex) that have been adapted for use with AWS Neuron and AWS EC2 Trn1 instances. 
neuronx-nemo-megatron allows for pretraining models with hundreds of billions of parameters across thousands of Trainium accelerators, and enables advanced training capabilities such as 3D parallelism, sequence parallelism, and activation checkpointing. 6 | 7 | ## Prerequisites 8 | Before proceeding with this tutorial, please follow [these instructions](https://github.com/aws-neuron/aws-neuron-parallelcluster-samples#train-a-model-on-aws-trn1-parallelcluster) to create a ParallelCluster consisting of 1 or more trn1.32xl or trn1n.32xl nodes. ParallelCluster automates the creation of trn1 clusters, and provides the SLURM job management system for scheduling and managing distributed training jobs. Please note that the home directory on your ParallelCluster head node will be shared with all of the worker nodes via NFS. 9 | 10 | ## Install neuronx-nemo-megatron 11 | 12 | With your trn1 ParallelCluster in place, begin by logging into the head node of your cluster using SSH. To provide access to TensorBoard (required in a later step), please make sure that you enable port forwarding for TCP port 6006 when you login, ex: 13 | ``` 14 | ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006 15 | ``` 16 | 17 | Once logged into the head node, activate the provided PyTorch Neuron virtual environment that was created when you set up your ParallelCluster. **Note**: if your PyTorch Neuron environment is lower than Neuron 2.11, please refer to the [Neuron documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update.html#pytorch-neuronx-update) for instructions on updating to Neuron 2.11 or later. 18 | ``` 19 | cd ~ 20 | source ./aws_neuron_venv_pytorch/bin/activate 21 | ``` 22 | 23 | Next, clone the neuronx-nemo-megatron repo to the head node: 24 | ``` 25 | cd ~ 26 | git clone https://github.com/aws-neuron/neuronx-nemo-megatron.git 27 | cd neuronx-nemo-megatron 28 | ``` 29 | 30 | Install the `wheel` Python package and run the build script to create the neuronx-nemo-megatron wheels: 31 | ``` 32 | pip3 install wheel 33 | ./build.sh 34 | ``` 35 | 36 | Install the neuronx-nemo-megatron packages and dependencies in your virtual environment: 37 | ``` 38 | pip3 install ./build/*.whl 39 | pip3 install -r requirements.txt protobuf==3.20.3 40 | ``` 41 | 42 | Build the Megatron helper module 43 | ``` 44 | cd ~ 45 | python3 -c "from nemo.collections.nlp.data.language_modeling.megatron.dataset_utils import compile_helper; \ 46 | compile_helper()" 47 | ``` 48 | 49 | The above utility will help make this file : ```nemo.collections.nlp.data.language_modeling.megatron.dataset_utils``` and below is the expected output (You can ignore the error) 50 | ``` 51 | 2023-Aug-17 22:53:01.0674 47940:47940 ERROR TDRV:tdrv_get_dev_info No neuron device available 52 | [NeMo W 2023-08-17 22:53:03 optimizers:67] Could not import distributed_fused_adam optimizer from Apex 53 | [NeMo W 2023-08-17 22:53:04 experimental:27] Module is experimental, not ready for production and is not fully supported. Use at your own risk. 54 | ``` 55 | 56 | ## Download GPT dataset 57 | This tutorial makes use of a preprocessed Wikipedia dataset that is stored in S3. 
The dataset can be downloaded to your cluster by running the following commands on the head node: 58 | ```sh 59 | export DATA_DIR=~/examples_datasets/gpt2 60 | mkdir -p ${DATA_DIR} && cd ${DATA_DIR} 61 | wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json 62 | wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt 63 | aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.bin . --no-sign-request 64 | aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.idx . --no-sign-request 65 | aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/license.txt . --no-sign-request 66 | ``` 67 | 68 | ## GPT-3 training configurations 69 | We tested with the following model sizes: 23B, 46B, 175B 70 | ### GPT-3 23B 71 | - Model configuration 72 | - Attention heads: 64 73 | - Layers: 28 74 | - Sequence length: 2048 75 | - Hidden size: 8192 76 | - Hidden FFN size: 32768 77 | - Microbatch size: 1 78 | - Global batch size: 32 * number_of_nodes 79 | 80 | - Distributed training configuration 81 | - Number of nodes: 4 82 | - Tensor parallel degree: 8 83 | - Pipeline parallel degree: 4 84 | - Data parallel degree: 4 85 | 86 | ### GPT-3 46B 87 | - Model configuration 88 | - Attention heads: 64 89 | - Layers: 56 90 | - Sequence length: 2048 91 | - Hidden size: 8192 92 | - Hidden FFN size: 32768 93 | - Microbatch size: 1 94 | - Global batch size: 32 * number_of_nodes 95 | 96 | - Distributed training configuration 97 | - Number of nodes: 8 98 | - Tensor parallel degree: 8 99 | - Pipeline parallel degree: 8 100 | - Data parallel degree: 4 101 | 102 | ### GPT-3 175B 103 | - Model configuration 104 | - Attention heads: 96 105 | - Layers: 96 106 | - Sequence length: 2048 107 | - Hidden size: 12288 108 | - Hidden FFN size: 49152 109 | - Microbatch size: 1 110 | - Global batch size: 32 * number_of_nodes 111 | 112 | - Distributed training configuration 113 | - Number of nodes: 8 114 | - Tensor parallel degree: 32 115 | - Pipeline parallel degree: 8 116 | - Data parallel degree: 1 117 | 118 | ## Pre-compile the model 119 | By default, PyTorch Neuron uses a just in time (JIT) compilation flow that sequentially compiles all of the neural network compute graphs as they are encountered during a training job. The compiled graphs are cached in a local compiler cache so that subsequent training jobs can leverage the compiled graphs and avoid compilation (so long as the graph signatures and Neuron version have not changed). 120 | 121 | An alternative to the JIT flow is to use the included [neuron_parallel_compile](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/training/pytorch-neuron-parallel-compile.html?highlight=neuron_parallel_compile) command to perform ahead of time (AOT) compilation. In the AOT compilation flow, the compute graphs are first identified and extracted during a short simulated training run, and the extracted graphs are then compiled and cached using parallel compilation, which is considerably faster than the JIT flow. 122 | 123 | Run the following commands to launch an AOT pre-compilation job on your ParallelCluster: 124 | ``` 125 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 126 | sbatch --nodes 4 compile.slurm ./gpt_23b.sh 127 | ``` 128 | For the 46B and 175B the `--nodes 8` would be used instead of 4. 129 | 130 | Once you have launched the precompilation job, run the `squeue` command to view the SLURM job queue on your cluster. 
If you have not recently run a job on your cluster, it may take 4-5 minutes for the requested trn1.32xlarge nodes to be launched and initialized. Once the job is running, `squeue` should show output similar to the following: 131 | ``` 132 | JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 133 | 10 compute1 compile.slurm ubuntu R 5:11 4 compute1-dy-queue1-i1-[1-4] 134 | ``` 135 | 136 | You can view the output of the precompilation job by examining the file named `slurm-compile.slurm-ZZ.out` where ZZ represents the JOBID of your job in the `squeue` output, above. Ex: 137 | ``` 138 | tail -f slurm-compile.slurm-10.out 139 | ``` 140 | 141 | Once the precompilation job is complete, you should see a message similar to the following in the logs: 142 | ``` 143 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total graphs: 22 144 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total successful compilations: 22 145 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total failed compilations: 0 146 | ``` 147 | 148 | At this point, you can press `CTRL-C` to exit the tail command. 149 | 150 | ## Launch a pretraining job 151 | The GPT-3 pretraining job can be launched in the same manner as the precompilation job described above. In this case, we change the SLURM script from `compile.slurm` to `run.slurm`, but the other parameters remain the same: 152 | ``` 153 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 154 | sbatch --nodes 4 run.slurm ./gpt_23b.sh 155 | ``` 156 | For the 46B and 175B the `--nodes 8` would be used instead of 4, like the compile above. 157 | 158 | 159 | As outlined above, you can again use the `squeue` command to view the job queue. Once you see that your pretraining job is running, you can view the output of the training job by examining the file named `slurm-run.slurm-ZZ.out` where ZZ represents the JOBID of your job: 160 | ``` 161 | tail -f slurm-run.slurm-11.out 162 | ``` 163 | 164 | Once the model is loaded onto the Trainium accelerators and training has commenced, you will begin to see output indicating the job progress: 165 | ``` 166 | Epoch 0: 0%| | 189/301501 [59:12<1573:03:24, 18.79s/it, loss=7.75, v_num=3-16, reduced_train_loss=7.560, global_step=188.0, consumed_samples=24064.0] 167 | Epoch 0: 0%| | 190/301501 [59:30<1572:41:13, 18.79s/it, loss=7.74, v_num=3-16, reduced_train_loss=7.560, global_step=189.0, consumed_samples=24192.0] 168 | Epoch 0: 0%| | 191/301501 [59:48<1572:21:28, 18.79s/it, loss=7.73, v_num=3-16, reduced_train_loss=7.910, global_step=190.0, consumed_samples=24320.0] 169 | ``` 170 | 171 | ## Monitor training 172 | ### TensorBoard 173 | In addition to the text-based job monitoring described in the previous section, you can also use standard tools such as TensorBoard to monitor training job progress. To view an ongoing training job in TensorBoard, you first need to identify the experiment directory associated with your ongoing job. This will typically be the most recently created directory under `~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_gpt`. 
Once you have identified the directory, `cd` into it, and then launch TensorBoard: 174 | ``` 175 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_gpt 176 | ls -alt|head 177 | # Identify the correct experiment directory in the 178 | # output of the ls command, ex: 2023-06-10_00-22-42 179 | cd YOUR_EXPERIMENT_DIR # <- replace this with your experiment directory 180 | tensorboard --logdir ./ 181 | ``` 182 | 183 | With TensorBoard running, you can then view the TensorBoard dashboard by browsing to http://localhost:6006 on your local machine. If you cannot access TensorBoard at this address, please make sure that you have port-forwarded TCP port 6006 when SSH'ing into the head node, ex: `ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006` 184 | 185 | ### neuron-top / neuron-monitor / neuron-ls 186 | The [neuron-top](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-top-user-guide.html?highlight=neuron-top) tool can be used to view useful information about NeuronCore utilization, vCPU and RAM utilization, and loaded graphs on a per-node basis. To use neuron-top during an ongoing training job, first SSH into one of your compute nodes from the head node, and then run `neuron-top`: 187 | ``` 188 | ssh compute1-dy-queue1-i1-1 # to determine which compute nodes are in use, run the squeue command 189 | neuron-top 190 | ``` 191 | 192 | Similarly, once you are logged into one of the active compute nodes, you can also use other Neuron tools such as [neuron-monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) and [neuron-ls](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) to capture performance/utilization statistics and to understand NeuronCore allocation. 193 | 194 | ## Key Features 195 | * GPT pretraining also works with the ZeRO optimizer, which is not enabled by default. To reduce memory pressure, you can enable it by adding the hyperparameter below to your run script: 196 | ``` 197 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/ 198 | 199 | vi test.sh 200 | 201 | # Add the line below to the run script: 202 | model.wrap_with_zero=True \ 203 | 204 | ``` 205 | 206 | ## Known issues/limitations 207 | * The initial release of neuronx-nemo-megatron supports GPT pretraining only. Model evaluation will be available in a future release. 208 | * The Neuron compiler's modular flow (ex: `--enable-experimental-O1`) is not supported by this initial release of neuronx-nemo-megatron. 209 | * neuronx-nemo-megatron currently requires pytorch-lightning v1.8.6 210 | 211 | ## Troubleshooting guide 212 | See [Troubleshooting Guide for AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html) for more details and fixes to common issues. 213 | -------------------------------------------------------------------------------- /examples/jobs/neuronx-nemo-megatron-llamav2-job.md: -------------------------------------------------------------------------------- 1 | # Launch a Llama2 pretraining job using neuronx-nemo-megatron 2 | 3 | This tutorial explains how to run Llama V2 pretraining jobs with AWS EC2 trn1.32xl instances using [neuronx-nemo-megatron](https://github.com/aws-neuron/neuronx-nemo-megatron) and [AWS ParallelCluster](https://aws.amazon.com/hpc/parallelcluster/). 
4 | 5 | neuronx-nemo-megatron (also known as "AWS Neuron Reference for NeMo Megatron") includes modified versions of the open-source packages [NeMo](https://github.com/NVIDIA/NeMo) and [Apex](https://github.com/NVIDIA/apex) that have been adapted for use with AWS Neuron and AWS EC2 Trn1 instances. neuronx-nemo-megatron allows for pretraining models with hundreds of billions of parameters across thousands of Trainium accelerators, and enables advanced training capabilities such as 3D parallelism, sequence parallelism, and activation checkpointing. 6 | 7 | ## Prerequisites 8 | Before proceeding with this tutorial, please follow [these instructions](https://github.com/aws-neuron/aws-neuron-parallelcluster-samples#train-a-model-on-aws-trn1-parallelcluster) to create a ParallelCluster consisting of 1 or more trn1.32xl or trn1n.32xl nodes. ParallelCluster automates the creation of trn1 clusters, and provides the SLURM job management system for scheduling and managing distributed training jobs. Please note that the home directory on your ParallelCluster head node will be shared with all of the worker nodes via NFS. 9 | 10 | ## Install neuronx-nemo-megatron 11 | 12 | With your trn1 ParallelCluster in place, begin by logging into the head node of your cluster using SSH. To provide access to TensorBoard (required in a later step), please make sure that you enable port forwarding for TCP port 6006 when you login, ex: 13 | ``` 14 | ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006 15 | ``` 16 | 17 | Once logged into the head node, activate the provided PyTorch Neuron virtual environment that was created when you set up your ParallelCluster. **Note**: if your PyTorch Neuron environment is lower than Neuron 2.11, please refer to the [Neuron documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update.html#pytorch-neuronx-update) for instructions on updating to Neuron 2.11 or later. 18 | ``` 19 | cd ~ 20 | source ./aws_neuron_venv_pytorch/bin/activate 21 | ``` 22 | 23 | Next, clone the neuronx-nemo-megatron repo to the head node: 24 | ``` 25 | cd ~ 26 | git clone https://github.com/aws-neuron/neuronx-nemo-megatron.git 27 | cd neuronx-nemo-megatron 28 | ``` 29 | 30 | Install the `wheel` Python package and run the build script to create the neuronx-nemo-megatron wheels: 31 | ``` 32 | pip3 install wheel 33 | ./build.sh 34 | ``` 35 | 36 | Install the neuronx-nemo-megatron packages and dependencies in your virtual environment: 37 | ``` 38 | pip3 install ./build/*.whl 39 | pip3 install -r requirements.txt protobuf==3.20.3 40 | ``` 41 | 42 | Build the Megatron helper module 43 | ``` 44 | cd ~ 45 | python3 -c "from nemo.collections.nlp.data.language_modeling.megatron.dataset_utils import compile_helper; \ 46 | compile_helper()" 47 | ``` 48 | 49 | The above utility will help make this file : ```nemo.collections.nlp.data.language_modeling.megatron.dataset_utils``` and below is the expected output (You can ignore the error) 50 | ``` 51 | 2023-Aug-17 22:53:01.0674 47940:47940 ERROR TDRV:tdrv_get_dev_info No neuron device available 52 | [NeMo W 2023-08-17 22:53:03 optimizers:67] Could not import distributed_fused_adam optimizer from Apex 53 | [NeMo W 2023-08-17 22:53:04 experimental:27] Module is experimental, not ready for production and is not fully supported. Use at your own risk. 54 | ``` 55 | 56 | ## Download LlamaV2 dataset and tokenizer 57 | This tutorial makes use of a Red pyjama dataset. 
The dataset can be downloaded to your cluster by running the following commands on the head node: 58 | 59 | ``` 60 | wget https://data.together.xyz/redpajama-data-1T/v1.0.0/book/book.jsonl 61 | ``` 62 | Note: The dataset is approximately 50 GB and will take roughly 3-4 hours to download. 63 | 64 | The above command gives you the raw dataset of around 50 GB, which needs to be tokenized using a Llama V2 tokenizer. To tokenize the data, you need to request the tokenizer from Hugging Face and Meta via the link below: 65 | 66 | [Request the tokenizer and model weights from Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b) 67 | 68 | Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the above website and accept Meta's license before requesting access. 69 | 70 | Once you have the tokenizer and the dataset, you can tokenize the dataset with the following command: 71 | ``` 72 | python nemo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \ 73 | --input=DATA_FOLDER/DATA.jsonl \ 74 | --json-keys=text \ 75 | --tokenizer-library=huggingface \ 76 | --tokenizer-type=TOKENIZER_FOLDER/llama7b-hf \ 77 | --dataset-impl=mmap \ 78 | --output-prefix=DATA_FOLDER/DATA_tokenized \ 79 | --append-eod \ 80 | --need-pad-id \ 81 | --workers=32 82 | ``` 83 | 84 | After tokenizing the dataset, you will have paths to the tokenizer and the tokenized dataset, which will be used for pretraining. 85 | 86 | ## Llama2 training configurations 87 | We tested with the following model sizes: 7B, 13B, and 70B. 88 | ### Llama2 7B 89 | 90 | - Model configuration 91 | - Attention heads: 32 92 | - Layers: 32 93 | - Sequence length: 4096 94 | - Hidden size: 4096 95 | - Hidden FFN size: 11008 96 | - Microbatch size: 1 97 | - Global batch size: 256 98 | 99 | - Distributed training configuration 100 | - Number of nodes: 4 101 | - Tensor parallel degree: 8 102 | - Pipeline parallel degree: 1 103 | - Data parallel degree: 16 104 | 105 | ### Llama2 13B 106 | 107 | - Model configuration 108 | - Attention heads: 40 109 | - Layers: 40 110 | - Sequence length: 4096 111 | - Hidden size: 5120 112 | - Hidden FFN size: 13824 113 | - Microbatch size: 1 114 | - Global batch size: 1024 115 | 116 | - Distributed training configuration 117 | - Number of nodes: 4 118 | - Tensor parallel degree: 8 119 | - Pipeline parallel degree: 4 120 | - Data parallel degree: 4 121 | 122 | ### Llama2 70B 123 | 124 | - Model configuration 125 | - Attention heads: 64 126 | - Layers: 80 127 | - Sequence length: 4096 128 | - Hidden size: 8192 129 | - Hidden FFN size: 28672 130 | - Microbatch size: 1 131 | - Global batch size: 512 132 | 133 | - Distributed training configuration 134 | - Number of nodes: 8 135 | - Tensor parallel degree: 8 136 | - Pipeline parallel degree: 16 137 | - Data parallel degree: 2 138 | 139 | ## Pre-compile the model 140 | By default, PyTorch Neuron uses a just-in-time (JIT) compilation flow that sequentially compiles all of the neural network compute graphs as they are encountered during a training job. The compiled graphs are cached in a local compiler cache so that subsequent training jobs can leverage the compiled graphs and avoid compilation (so long as the graph signatures and Neuron version have not changed). 
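To see this caching behavior for yourself, you can peek at the compiler cache on one of your compute nodes between runs. The snippet below is a minimal sketch: it assumes the default on-disk cache location (`/var/tmp/neuron-compile-cache`), which may differ in your environment, and reuses the example compute node name from this tutorial.

```
# Minimal sketch: inspect the local Neuron compiler cache on a compute node.
# /var/tmp/neuron-compile-cache is the assumed default cache path; adjust it if
# your setup overrides the cache location, and substitute one of your own node names.
ssh compute1-dy-queue1-i1-1 \
  'ls /var/tmp/neuron-compile-cache | head; find /var/tmp/neuron-compile-cache -name "*.neff" | wc -l'
```

A populated cache is what lets a later training job with unchanged graph signatures skip compilation entirely; the ahead-of-time flow described next simply fills this cache up front.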
141 | 142 | An alternative to the JIT flow is to use the included [neuron_parallel_compile](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/training/pytorch-neuron-parallel-compile.html?highlight=neuron_parallel_compile) command to perform ahead of time (AOT) compilation. In the AOT compilation flow, the compute graphs are first identified and extracted during a short simulated training run, and the extracted graphs are then compiled and cached using parallel compilation, which is considerably faster than the JIT flow. 143 | 144 | Before starting the compilation you need to update your path to the dataset and tokenizer in the ```test_llama.sh``` script for pretraining llama 7b and llama 13b and ```test_llama_gqa.sh``` for pretraining llama 70b as below : 145 | 146 | ``` 147 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 148 | 149 | # For llama 7b and 13b 150 | vi test_llama.sh 151 | 152 | # For llama 70b 153 | vi test_llama_gqa.sh 154 | 155 | # Update the below lines 156 | # For tokenizer 157 | model.tokenizer.type='PATH_TO_LLAMA_TOKENIZER/' \ 158 | 159 | # For Dataset 160 | model.data.data_prefix=[1.0,PATH_TO_TOKENIZED_DATASET/books/book.jsonl-processed_text_document] \ 161 | ``` 162 | Run the following commands to launch an AOT pre-compilation job on your ParallelCluster: 163 | ``` 164 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 165 | sbatch --nodes 4 compile.slurm ./llama_7b.sh 166 | ``` 167 | 168 | For compiling llama 13b, run the following commands: 169 | ``` 170 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 171 | sbatch --nodes 4 compile.slurm ./llama_13b.sh 172 | ``` 173 | 174 | For compiling llama 70b, run the following commands: 175 | ``` 176 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 177 | sbatch --nodes 32 compile.slurm ./llama_70b.sh 178 | ``` 179 | 180 | Note : For the 70B the `--nodes 32` would be used instead of 4. 181 | 182 | Once you have launched the precompilation job, run the `squeue` command to view the SLURM job queue on your cluster. If you have not recently run a job on your cluster, it may take 4-5 minutes for the requested trn1.32xlarge nodes to be launched and initialized. Once the job is running, `squeue` should show output similar to the following: 183 | ``` 184 | JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 185 | 10 compute1 compile.slurm ubuntu R 5:11 4 compute1-dy-queue1-i1-[1-4] 186 | ``` 187 | 188 | You can view the output of the precompilation job by examining the file named `slurm-compile.slurm-ZZ.out` where ZZ represents the JOBID of your job in the `squeue` output, above. Ex: 189 | ``` 190 | tail -f slurm-compile.slurm-10.out 191 | ``` 192 | 193 | Once the precompilation job is complete, you should see a message similar to the following in the logs: 194 | ``` 195 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total graphs: 22 196 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total successful compilations: 22 197 | 2023-06-11 23:04:08.000738: INFO ||PARALLEL_COMPILE||: Total failed compilations: 0 198 | ``` 199 | 200 | At this point, you can press `CTRL-C` to exit the tail command. 201 | 202 | ## Launch a pretraining job 203 | The Llama2 pretraining job can be launched in the same manner as the precompilation job described above. 
In this case, we change the SLURM script from `compile.slurm` to `run.slurm`, but the other parameters remain the same: 204 | ``` 205 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 206 | sbatch --nodes 4 run.slurm ./llama_7b.sh 207 | ``` 208 | 209 | For llama_13b, run the command below: 210 | ``` 211 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 212 | sbatch --nodes 4 run.slurm ./llama_13b.sh 213 | ``` 214 | 215 | For llama_70b, run the command below: 216 | ``` 217 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling 218 | sbatch --nodes 32 run.slurm ./llama_70b.sh 219 | ``` 220 | Note: for the 70B model, `--nodes 32` is used instead of 4. 221 | 222 | As outlined above, you can again use the `squeue` command to view the job queue. Once you see that your pretraining job is running, you can view the output of the training job by examining the file named `slurm-run.slurm-ZZ.out` where ZZ represents the JOBID of your job: 223 | ``` 224 | tail -f slurm-run.slurm-11.out 225 | ``` 226 | 227 | Once the model is loaded onto the Trainium accelerators and training has commenced, you will begin to see output indicating the job progress: 228 | ``` 229 | Epoch 0: 22%|██▏ | 4499/20101 [22:26:14<77:48:37, 17.95s/it, loss=2.43, v_num=5563, reduced_train_loss=2.470, gradient_norm=0.121, parameter_norm=1864.0, global_step=4512.0, consumed_samples=1.16e+6, iteration_time=16.40] 230 | Epoch 0: 22%|██▏ | 4500/20101 [22:26:32<77:48:18, 17.95s/it, loss=2.43, v_num=5563, reduced_train_loss=2.470, gradient_norm=0.121, parameter_norm=1864.0, global_step=4512.0, consumed_samples=1.16e+6, iteration_time=16.40] 231 | Epoch 0: 22%|██▏ | 4500/20101 [22:26:32<77:48:18, 17.95s/it, loss=2.44, v_num=5563, reduced_train_loss=2.450, gradient_norm=0.120, parameter_norm=1864.0, global_step=4512.0, consumed_samples=1.16e+6, iteration_time=16.50] 232 | ``` 233 | 234 | ## Monitor training 235 | ### TensorBoard 236 | In addition to the text-based job monitoring described in the previous section, you can also use standard tools such as TensorBoard to monitor training job progress. To view an ongoing training job in TensorBoard, you first need to identify the experiment directory associated with your ongoing job. This will typically be the most recently created directory under `~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_llama`. Once you have identified the directory, `cd` into it, and then launch TensorBoard: 237 | ``` 238 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/nemo_experiments/megatron_llama 239 | ls -alt|head 240 | # Identify the correct experiment directory in the 241 | # output of the ls command, ex: 2023-06-10_00-22-42 242 | cd YOUR_EXPERIMENT_DIR # <- replace this with your experiment directory 243 | tensorboard --logdir ./ 244 | ``` 245 | 246 | With TensorBoard running, you can then view the TensorBoard dashboard by browsing to http://localhost:6006 on your local machine. 
If you cannot access TensorBoard at this address, please make sure that you have port-forwarded TCP port 6006 when SSH'ing into the head node, ex: `ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006` 247 | 248 | ### neuron-top / neuron-monitor / neuron-ls 249 | The [neuron-top](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-top-user-guide.html?highlight=neuron-top) tool can be used to view useful information about NeuronCore utilization, vCPU and RAM utilization, and loaded graphs on a per-node basis. To use neuron-top during an ongoing training job, first SSH into one of your compute nodes from the head node, and then run `neuron-top`: 250 | ``` 251 | ssh compute1-dy-queue1-i1-1 # to determine which compute nodes are in use, run the squeue command 252 | neuron-top 253 | ``` 254 | 255 | Similarly, once you are logged into one of the active compute nodes, you can also use other Neuron tools such as [neuron-monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) and [neuron-ls](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html) to capture performance/utilization statistics and to understand NeuronCore allocation. 256 | 257 | ## Key Features 258 | * Llama2 pretraining also works with the ZeRO optimizer, which is not enabled by default. To reduce memory pressure, you can enable it by adding the hyperparameter below to your run script: 259 | ``` 260 | cd ~/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/ 261 | 262 | # For llama 7b and 13b 263 | vi test_llama.sh 264 | 265 | # For llama 70b 266 | vi test_llama_gqa.sh 267 | 268 | # Add the line below to the run script: 269 | model.wrap_with_zero=True \ 270 | 271 | ``` 272 | 273 | ## Known issues/limitations 274 | * The initial release of neuronx-nemo-megatron supports Llama2 pretraining only. Model evaluation will be available in a future release. 275 | * neuronx-nemo-megatron currently requires pytorch-lightning v1.8.6 276 | * Llama2-70B: tested and validated on 8 nodes. Scaling beyond that may run into memory issues. 277 | 278 | ## Troubleshooting guide 279 | See [Troubleshooting Guide for AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html) for more details and fixes to common issues. 280 | -------------------------------------------------------------------------------- /install_neuron.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -e 3 | 4 | echo "Neuron SDK Release 2.6.0" 5 | # Configure Linux for Neuron repository updates 6 | . 
/etc/os-release 7 | 8 | sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null </dev/null && DNS_SERVER=$(resolvectl dns | awk '{print $4}' | sort -r | head -1) 69 | IP="$(host $HOSTNAME $DNS_SERVER | tail -1 | awk '{print $4}')" 70 | DOMAIN=$(jq .cluster.dns_domain /etc/chef/dna.json | tr -d \") 71 | sudo sed -i "/$HOSTNAME/d" /etc/hosts 72 | sudo bash -c "echo '$IP $HOSTNAME.${DOMAIN::-1} $HOSTNAME' >> /etc/hosts" 73 | fi 74 | 75 | -------------------------------------------------------------------------------- /releasenotes.md: -------------------------------------------------------------------------------- 1 | # Change Log 2 | 3 | ## September, 15th 2023 4 | * Added sample for pre-training Llama 2 13B model using neuronx-nemo-megatron library 5 | 6 | ## August, 29th 2023 7 | * Added samples for pre-training Llama 2 7B model using neuronx-nemo-megatron library 8 | 9 | ## August, 28th 2023 10 | * Added samples for pre-training GPT-3 23B, 46B and 175B models using neuronx-nemo-megatron library 11 | * Announced End of Support for GPT-3 training using aws-neuron-reference-for-megatron-lm library 12 | 13 | ## October, 10th 2022 14 | 15 | * Added a parallel cluster example that explains how to use AWS ParallelCluster to build an HPC compute cluster with trn1 compute nodes to run distributed ML training jobs. 16 | 17 | ## Known Issues 18 | 19 | ### Relaunching a dynamic cluster created with `MinCount = 0` may fail due to a compute node IP address mismatch 20 | 21 | #### **Amazon Linux 2** 22 | 23 | For a dynamic cluster with `MinCount = 0`, the compute node IP addresses in `/etc/hosts` may not match those returned by `nslookup` upon cluster relaunch. A temporary workaround is therefore included in the `install_neuron.sh` post-install script: 24 | 25 | ``` 26 | IP="$(host $HOSTNAME| awk '{print $4}')" 27 | DOMAIN=$(jq .cluster.dns_domain /etc/chef/dna.json | tr -d \") 28 | sudo sed -i "/$HOSTNAME/d" /etc/hosts 29 | sudo bash -c "echo '$IP $HOSTNAME.${DOMAIN::-1} $HOSTNAME' >> /etc/hosts" 30 | ``` 31 | 32 | This fix is already implemented in the custom installation script to ensure an AL2 dynamic cluster relaunches successfully. 33 | 34 | #### **Ubuntu** 35 | 36 | On Ubuntu, the default DNS resolver is systemd-resolved. In its configuration file `/etc/systemd/resolved.conf`, the option `ReadEtcHosts` defaults to `yes`, so systemd-resolved on Ubuntu queries the file `/etc/hosts` first and then the DNS server. The workaround for this behavior is: 37 | 38 | ``` 39 | DNS_SERVER="" 40 | grep Ubuntu /etc/issue &>/dev/null && DNS_SERVER=$(resolvectl dns | awk '{print $4}' | sort -r | head -1) 41 | IP="$(host $HOSTNAME $DNS_SERVER | tail -1 | awk '{print $4}')" 42 | DOMAIN=$(jq .cluster.dns_domain /etc/chef/dna.json | tr -d \") 43 | sudo sed -i "/$HOSTNAME/d" /etc/hosts 44 | sudo bash -c "echo '$IP $HOSTNAME.${DOMAIN::-1} $HOSTNAME' >> /etc/hosts" 45 | ``` 46 | 47 | This fix is implemented in the custom installation script to ensure an Ubuntu dynamic cluster relaunches successfully. 48 | 49 | 50 | ### Error “Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)’ failed” followed by ‘RPC failed with status = “UNAVAILABLE: Connection reset by peer”’ 51 | 52 | 53 | ``` 54 | 55 | INFO: Inconsistency detected by ld.so: ../elf/dl-tls.c: 488: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed! 
56 | INFO: 2022-10-03 02:16:04.488054: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1664763364.487962663","description":"Error received from peer ipv4:10.0.9.150:41677","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC 57 | 58 | ``` 59 | This error may occur intermittently when using GNU C Library glibc 2.26. To find out what version you have, run ```ldd --version```. glibc 2.27 provides a workaround, and therefore the error is fixed in Ubuntu 20.04. For more information on this issue, see the [Neuron troubleshooting guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/training-troubleshooting.html#error-assertion-listp-slotinfo-cnt-gen-gl-dl-tls-generation-failed-followed-by-rpc-failed-with-status-unavailable-connection-reset-by-peer). 60 | 61 | ## Troubleshooting guide 62 | 63 | See [Troubleshooting Guide for AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html) for more details and fixes to common issues. 64 | --------------------------------------------------------------------------------