├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── TUTORIAL.md
├── configs
│   ├── base-config-build-ami.ini
│   └── multi-queue-config.ini
├── scripts
│   ├── create_custom_ami.sh
│   ├── custom_dlami_user_data.sh
│   └── train_8B_gpt2.sh
└── ultra.png

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.env
**/*tmp
**/.DS_Store

--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.

--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contributing Guidelines

Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
documentation, we greatly value feedback and contributions from our community.

Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
information to respond effectively to your bug report or contribution.


## Reporting Bugs/Feature Requests

We welcome you to use the GitHub issue tracker to report bugs or suggest features.

When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:

* A reproducible test case or series of steps
* The version of our code being used
* Any modifications you've made relevant to the bug
* Anything unusual about your environment or deployment


## Contributing via Pull Requests
Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:

1. You are working against the latest source on the *main* branch.
2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
3. You open an issue to discuss any significant work - we would hate for your time to be wasted.

To send us a pull request, please:

1. Fork the repository.
2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
3. Ensure local tests pass.
4. Commit to your fork using clear commit messages.
5. Send us a pull request, answering any default questions in the pull request interface.
6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.

GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).


## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute to.
As our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.


## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.


## Security issue notifications
If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.


## Licensing

See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Megatron on AWS EC2 UltraCluster

**Megatron on AWS EC2 UltraCluster** provides steps, code and configuration samples to deploy and train a GPT-type Natural Language Understanding (NLU) model using an [AWS EC2 UltraCluster of P4d instances]() and the [NVIDIA Megatron-LM]() framework.

[Megatron]() is a large and powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. Refer to [Megatron's original GitHub repository]() for more information.

## Repository Structure

This repository contains configuration files for [AWS ParallelCluster]() in the [`configs`](<./configs>) folder.
The configurations implement a tightly coupled cluster of [p4d.24xlarge EC2 instances](), leveraging [AWS Elastic Fabric Adapter (EFA)]() and [Amazon FSx for Lustre]() in an EC2 UltraCluster of P4d configuration.

The [`scripts`](<./scripts>) folder contains scripts to build a custom [Deep Learning AMI]() with Megatron-LM and its dependencies.
It also contains scripts to train a GPT-2 8-billion-parameter model, in an 8-way model-parallel configuration, using the [SLURM scheduler]() available through ParallelCluster.
The [TUTORIAL.md]() describes how to:

- Set up a cluster management environment with AWS ParallelCluster.
- Build a custom Deep Learning AMI using the `pcluster` [CLI]().
- Configure and deploy a multi-queue cluster with CPU and GPU instances.
- Preprocess the latest English Wikipedia data dump from [Wikimedia]() on a large CPU instance.
- Train the 8B-parameter version of GPT-2 using the preprocessed data across 64 GPUs (8 p4d.24xlarge instances).

The Advanced section of the tutorial also describes how to monitor training using [TensorBoard]().

### Amazon EC2 UltraCluster

For an overview of the Amazon EC2 UltraClusters of P4d instances, follow [this link](https://pages.awscloud.com/amazon-ec2-p4d.html).

![p4d ultracluster](<./ultra.png>)

## Contribution guidelines

If you want to contribute, please review the [contribution guidelines]().

## License

This project is licensed under \[MIT\]; see the [LICENSE]() file.

--------------------------------------------------------------------------------
/TUTORIAL.md:
--------------------------------------------------------------------------------
# Megatron on AWS UltraCluster

This tutorial walks through the end-to-end process of configuring a cluster with AWS ParallelCluster and a customized Deep Learning AMI, preprocessing a large dataset on CPU-optimized Amazon EC2 instances, and training a GPT-2 Natural Language Understanding model on an AWS EC2 UltraCluster.

Familiarity with the AWS cloud concepts of Virtual Private Cloud (VPC), e.g. Subnets and Availability Zones, the AWS CLI, and bash scripting is recommended.

## Contents:

* [Contents](<#contents>)
* [ParallelCluster Management Setup](<#parallelcluster-management-setup>)
  * [Local Environment](<#local-environment>)
  * [AWS](<#aws>)
* [Building a Custom AMI](<#building-a-custom-ami>)
* [Configure and Deploy a Cluster](<#configure-and-deploy-a-cluster>)
* [Preprocessing the Training Dataset with CPU Instances](<#preprocessing-the-training-dataset-with-cpu-instances>)
* [Model Parallel Training on a p4d.24xlarge UltraCluster](<#model-parallel-training-on-a-p4d24xlarge-ultracluster>)
* [Monitoring Training with TensorBoard](<#monitoring-training-with-tensorboard>)

## Prerequisites

If you don't have Python3 pip already installed, follow the instructions on the [pip installation]() page before running the commands below.

## ParallelCluster Management Setup

#### Local Environment

To deploy a cluster with AWS ParallelCluster, you'll need to install the `aws-parallelcluster` CLI.
From the root of this project, execute the following to create and activate a virtual environment with ParallelCluster installed:

```bash
python3 -m venv .megatron_env
source .megatron_env/bin/activate
python3 -m pip install awscli aws-parallelcluster==2.10.1
pcluster version

# Set AWS Region
export AWS_REGION="us-west-2"
```

To execute this sample repo, you will need credentials to access the AWS CLI. Refer to the [Getting Started documentation]() for more information on setting up the ParallelCluster CLI.

If you don't have HashiCorp Packer already installed, follow the instructions on the [HashiCorp Packer getting started]() page before running the commands below.
It is required to build custom AMIs.
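With the CLI installed, you can optionally verify both prerequisites before moving on. A minimal sanity check, assuming your AWS credentials are already configured for the target account:

```bash
# Confirm the AWS CLI can authenticate, and that Packer is on the PATH
aws sts get-caller-identity --region ${AWS_REGION}
packer --version
```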
#### AWS

This sample repo assumes an existing VPC with private and public subnets. For details on how to provision such infrastructure, check out [this tutorial]().
Private subnets are a requirement for running p4d.24xlarge instances with 4 EFA cards.
Please note that the private subnet should have a NAT Gateway set up, since AWS ParallelCluster and the Megatron-LM lab require an internet connection.

Use the following command to list VPCs, Subnets and Availability Zones:

```bash
aws ec2 describe-subnets --query 'Subnets[].{VPC:VpcId,SUBNET:SubnetId,AZ:AvailabilityZone}' --region ${AWS_REGION}
```

Take note of the IDs to properly configure the cluster environment, and set the following environment variables:

```bash
VPC_ID=
PUBLIC_SUBNET_ID=
PRIVATE_SUBNET_ID=
```

This sample also requires an S3 bucket and an EC2 key pair. You can use the following AWS CLI commands to create new ones:

```bash
# Create an EC2 key pair
SSH_KEY_NAME="megatron-lab-key"

aws ec2 create-key-pair --key-name ${SSH_KEY_NAME} \
    --query KeyMaterial \
    --region ${AWS_REGION} \
    --output text > ~/.ssh/${SSH_KEY_NAME}

# Create an S3 bucket with a randomized suffix
BUCKET_POSTFIX=$(python3 -S -c "import uuid; print(str(uuid.uuid4().hex)[:10])")
BUCKET_NAME="megatron-lab-${BUCKET_POSTFIX}"

aws s3 mb s3://${BUCKET_NAME} --region ${AWS_REGION}
```

## Building a Custom AMI

[Build a custom AMI]() to avoid the long provisioning times associated with using [post-installation scripts]() for the Megatron-LM dependencies.

The base AMI for customization is an AWS Deep Learning AMI (DLAMI).
It already provides the software required to run distributed training of large machine learning models, including NVIDIA drivers and CUDA, EFA plugins, and the major deep learning frameworks such as PyTorch and TensorFlow, managed in Conda environments.
The Conda package manager can also manage the Megatron-LM dependencies.

To retrieve the AMI ID of the Deep Learning AMI v38.0 based on Amazon Linux 2 in the region of deployment, you can use the following command:

```bash
# Retrieve Deep Learning AMI ID
export DEEP_LEARNING_AMI_ID=`aws ec2 describe-images --owners amazon \
    --query 'Images[*].{ImageId:ImageId,CreationDate:CreationDate}' \
    --filters "Name=name,Values='Deep Learning AMI (Amazon Linux 2) Version 38.0'" \
    --region ${AWS_REGION} \
    | jq -r 'sort_by(.CreationDate)[-1] | .ImageId'`
```
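You can optionally confirm that the query resolved to a real image before proceeding. This is just a sanity check using the standard `describe-images` call:

```bash
# Optional: print the name of the image behind the retrieved AMI ID
aws ec2 describe-images --image-ids ${DEEP_LEARNING_AMI_ID} \
    --query 'Images[0].Name' --output text --region ${AWS_REGION}
```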
Before building the custom AMI, you will have to modify the argument values between `<...>` in the base configuration file for AWS ParallelCluster, located in `./configs/base-config-build-ami.ini`.

The build calls the script [custom\_dlami\_user\_data.sh](<./scripts/custom_dlami_user_data.sh>), which installs Megatron-LM and its dependencies, including NVIDIA APEX.
The instance used for the build is `-i p4d.24xlarge`, as NVIDIA APEX will be compiled for the host's platform during installation.

The instructions below help set the variables in the AWS ParallelCluster configuration file that is used to build the custom AMI, i.e. `./configs/base-config-build-ami.ini`:

```bash
git clone https://github.com/pixelb/crudini

# Install dependencies
pip3 install iniparse

# Change the cluster configuration file
python3 crudini/crudini --set ./configs/base-config-build-ami.ini "aws" aws_region_name "${AWS_REGION}"
python3 crudini/crudini --set ./configs/base-config-build-ami.ini "vpc megatron" vpc_id "${VPC_ID}"
python3 crudini/crudini --set ./configs/base-config-build-ami.ini "vpc megatron" master_subnet_id "${PUBLIC_SUBNET_ID}"
python3 crudini/crudini --set ./configs/base-config-build-ami.ini "cluster base-config-build-ami" key_name "${SSH_KEY_NAME}"
```

To build the custom AMI from the root of the sample project, use the `pcluster createami` command as shown in the script [create\_custom\_ami.sh](<./scripts/create_custom_ami.sh>):

```bash
./scripts/create_custom_ami.sh
```

After the build is complete, you get the AMI ID printed on screen:

```bash
Custom AMI ami-xxxxxxxxxxxxx created with name megatron-on-pcluster-aws-parallelcluster-2.10.1-amzn2-hvm-x86_64-202101071208

To use it, add the following variable to the AWS ParallelCluster config file, under the [cluster ...] section
custom_ami = ami-xxxxxxxxxxxxx
```

Please set the following variable, which will be used to set up AWS ParallelCluster for running Megatron-LM:

```bash
CUSTOM_AMI=
```

## Configure and Deploy a Cluster

Use the [configs/multi-queue-config.ini](<./configs/multi-queue-config.ini>) configuration file to stand up the cluster.
You can do that manually by changing the VPC, COMPUTE and FSX sections, or use the following commands to change the configuration file using `crudini`:

```bash
# Change the cluster configuration file
python3 crudini/crudini --set ./configs/multi-queue-config.ini "aws" aws_region_name "${AWS_REGION}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "vpc megatron" vpc_id "${VPC_ID}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "vpc megatron" master_subnet_id "${PUBLIC_SUBNET_ID}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "vpc megatron" compute_subnet_id "${PRIVATE_SUBNET_ID}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "cluster multi-queue-us-west-2" key_name "${SSH_KEY_NAME}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "cluster multi-queue-us-west-2" s3_read_write_resource "arn:aws:s3:::${BUCKET_NAME}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "cluster multi-queue-us-west-2" custom_ami "${CUSTOM_AMI}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "fsx sharedfsx" import_path "s3://${BUCKET_NAME}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "fsx sharedfsx" export_path "s3://${BUCKET_NAME}"
```

You are now ready to create the cluster with:

```bash
pcluster create megatron-on-pcluster -c configs/multi-queue-config.ini
```

Once deployment completes, the Head Node's public and private IPs are printed on the screen at the end of cluster creation:

```bash
Creating stack named: parallelcluster-megatron-on-pcluster
Status: parallelcluster-megatron-on-pcluster - CREATE_COMPLETE
MasterPublicIP: yyy.yyy.yy.yyy
ClusterUser: ec2-user
MasterPrivateIP: xxx.xxx.xx.xxx
```
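While the stack is being created, you can optionally poll its state from the same terminal using the ParallelCluster 2.x `status` subcommand (the cluster name is the one used above):

```bash
# Optional: check the cluster's CloudFormation status
pcluster status megatron-on-pcluster -r ${AWS_REGION}
```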
Access the cluster Head Node using the CLI command:

```bash
pcluster ssh megatron-on-pcluster -i ~/.ssh/${SSH_KEY_NAME}
```

## Preprocessing the Training Dataset with CPU Instances

Once connected to the cluster head node, set up a data folder in the _/lustre_ directory and download the latest English Wikipedia data dump from Wikimedia.
This process follows the original [Megatron-LM documentation]():

```bash
export WIKI_DIR=/lustre/data/wiki
mkdir -p $WIKI_DIR && cd $WIKI_DIR

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```

Download the vocab and merge table files for the desired model. This example uses the GPT-2 model:

```bash
export DATA_DIR=/lustre/data
export GPT2_DATA=${DATA_DIR}/gpt2

mkdir -p ${GPT2_DATA} && cd ${GPT2_DATA}

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

mkdir -p ${GPT2_DATA}/checkpoint
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O ${GPT2_DATA}/checkpoint/megatron_lm_345m_v0.0.zip
```

Once the data are available, provision a CPU node using Slurm: `salloc --nodes 1 -p cpu`.

All data preprocessing work will proceed on the CPU machine. You can check the provisioning status of your new machine using the `squeue` command. Once the status, _ST_, changes to running, _R_, access the CPU machine's terminal through ssh with: `ssh cpu-dy-c5n18xlarge-1`.

Extract the downloaded data using WikiExtractor:

```bash
git clone https://github.com/attardi/wikiextractor.git /lustre/wikiextractor
cd /lustre/wikiextractor
cd -
python -m wikiextractor.WikiExtractor --json /lustre/data/wiki/enwiki-latest-pages-articles.xml.bz2 --output /lustre/data/wiki/text/ -q --processes 70 2>&1 | tee wikiextract.out &
```

WikiExtractor first preprocesses the template of all pages sequentially, followed by a Map/Reduce process that extracts the pages and converts them to the loose JSON format required by Megatron-LM.

Once the extraction completes, merge the text files with:

```bash
conda activate pytorch_latest_p37
cd /lustre/data/wiki
find /lustre/data/wiki/text/ -name wiki* | parallel -m -j 70 "cat {} >> mergedfile.json"
```

The `mergedfile.json` size on disk is 16GB. With it, create the binary data format for Megatron GPT-2.
**NOTE**: Refer to [this solution]() if an `IndexError: list index out of range` occurs.

To create the binary data, run the following command:

```bash
python /home/ec2-user/megatron/tools/preprocess_data.py \
       --input /lustre/data/wiki/mergedfile.json \
       --output-prefix my-gpt2 \
       --vocab /lustre/data/gpt2/gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file /lustre/data/gpt2/gpt2-merges.txt \
       --append-eod \
       --workers 70
```
Once all preprocessing is done, we can persist the data from FSx back to S3 using a [Data Repository Task]() from the terminal used to spin up the cluster.
This guarantees that the data persists even if the cluster is terminated.

```bash
# Retrieve the FSx for Lustre file system Id
export FSX_ID=$(aws fsx describe-file-systems --query "FileSystems[?LustreConfiguration.DataRepositoryConfiguration.ExportPath=='s3://${BUCKET_NAME}'].FileSystemId" --output text --region ${AWS_REGION})
# Create data repository task
aws fsx create-data-repository-task \
    --file-system-id $FSX_ID \
    --type EXPORT_TO_REPOSITORY \
    --paths data \
    --report Enabled=true,Scope=FAILED_FILES_ONLY,Format=REPORT_CSV_20191124,Path=s3://${BUCKET_NAME}/reports \
    --region ${AWS_REGION}
```

You can return to the original terminal by running the `exit` command twice: (1) to exit the `ssh` session on the CPU node, and (2) to release the `salloc` Slurm allocation.

## Model Parallel Training on a p4d.24xlarge UltraCluster

In this section you will train the 8-billion-parameter version of the Megatron-LM GPT-2 model across 64 GPUs - 8 p4d.24xlarge instances. Log back into the cluster head node using `pcluster ssh ...` if not already on the machine.

Start by creating a training script according to the [original documentation]().
To train using `slurm` on 8 nodes, modify the distributed world configuration section according to the script [scripts/train\_8B\_gpt2.sh](<./scripts/train_8B_gpt2.sh>).
Make sure to include the CUDA, EFA and NCCL environment variables to enable NCCL to communicate between GPUs through AWS EFA using GPU Remote Direct Memory Access.

Create a file named `/lustre/scripts/train_8B_gpt2.sh` and copy the content below into it.

```bash
#!/bin/bash
# scripts/train_8B_gpt2.sh

# Shared data paths from the FSx for Lustre mount point /lustre
DATA_PATH=/lustre/data/wiki/my-gpt2_text_document
CHECKPOINT_PATH=/lustre/data/gpt2/checkpoint
VOCAB_FILE=/lustre/data/gpt2/gpt2-vocab.json
MERGES_FILE=/lustre/data/gpt2/gpt2-merges.txt

# Distributed World configuration
MP_SIZE=8
GPUS_PER_NODE=8
DDP_IMPL=torch
MASTER_ADDR=$SLURM_SUBMIT_HOST
MASTER_PORT=6000
NNODES=$SLURM_NTASKS
NODE_RANK=$SLURM_NODEID
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

# CUDA, EFA and NCCL configs
export LD_LIBRARY_PATH=/usr/local/cuda-11.0/efa/lib:/usr/local/cuda-11.0/lib:/usr/local/cuda-11.0/lib64:/usr/local/cuda-11.0:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export NCCL_ALGO=ring
export NCCL_DEBUG=INFO
export RDMAV_FORK_SAFE=1

# Distributed args for PyTorch DDP
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

# Training:
/home/ec2-user/anaconda3/envs/pytorch_latest_p37/bin/python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       /home/ec2-user/megatron/pretrain_gpt2.py \
       --model-parallel-size $MP_SIZE \
       --DDP-impl $DDP_IMPL \
       --num-layers 42 \
       --hidden-size 4096 \
       --num-attention-heads 32 \
       --batch-size 16 \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --train-iters 1000 \
       --lr-decay-iters 320000 \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       --vocab-file $VOCAB_FILE \
       --merge-file $MERGES_FILE \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --min-lr 1.0e-5 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --checkpoint-activations \
       --distribute-checkpointed-activations \
       --log-interval 50 \
       --save-interval 1000 \
       --eval-interval 1000 \
       --eval-iters 10 \
       --num-workers 2 \
       --fp16 \
       --tensorboard-dir /lustre/logs/gpt2_param8B_nodes16_bs16_sjob${SLURM_JOB_ID}

set +x
```
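Before making the script executable and wiring it into Slurm, you can optionally ask bash to parse it without executing anything (the `-n` flag performs a syntax-only check):

```bash
bash -n /lustre/scripts/train_8B_gpt2.sh && echo "syntax OK"
```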
Make the script executable:

```bash
chmod +x /lustre/scripts/train_8B_gpt2.sh
```

To drive the `sbatch` execution of the training script, wrap it in a `job.sh` script on a path shared across all nodes, such as `/lustre/scripts`:

```bash
mkdir -p /lustre/scripts
cat > /lustre/scripts/job.sh << EOF
#!/bin/bash
#SBATCH --wait-all-nodes=1
#SBATCH -p gpu
#SBATCH -n 8
#SBATCH -N 8
#SBATCH -o out_%j.out
srun /lustre/scripts/train_8B_gpt2.sh
EOF

sbatch /lustre/scripts/job.sh
```

Now you can start training by running `sbatch /lustre/scripts/job.sh` from the head node of the cluster. The output from the run will be recorded in the `.out` file in the current folder.
If your job fails with `slurmstepd: error: execve(): /lustre/scripts/train_8B_gpt2.sh: Permission denied`, change the permissions of your scripts with `chmod +x /lustre/scripts/*.sh`.

When inspecting the NCCL logs in the `.out` file, expect to find entries showing that the OFI plugin selected the EFA provider, such as the one below:

```bash
gpu-dy-p4d24xlarge-10:33337:33337 [0] NCCL INFO NET/OFI Selected Provider is efa
```

### Monitoring training with TensorBoard

The Megatron-LM framework writes TensorBoard logs to the `--tensorboard-dir` specified in the training script. The custom AMI built for the cluster has TensorBoard installed in the `pytorch_latest_p37` environment used for training.
Use the following command to start TensorBoard in the background and expose it on a specific port:

```bash
python -m tensorboard.main --port=8080 --logdir /lustre/logs --host 0.0.0.0 2>&1 | tee ~/tensorboard.logs & disown
```
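You can optionally confirm that TensorBoard is serving before opening the tunnel; this simple probe assumes it was started on port 8080 as above:

```bash
curl -s http://localhost:8080 > /dev/null && echo "TensorBoard is up"
```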
Using the following `ssh` tunnel configuration when connecting to the head node, you can access TensorBoard at `localhost:8080`:

```bash
pcluster ssh megatron-on-pcluster -i ~/.ssh/${SSH_KEY_NAME} -L 8080:localhost:8080
```

--------------------------------------------------------------------------------
/configs/base-config-build-ami.ini:
--------------------------------------------------------------------------------
[global]
update_check = false
sanity_check = true
cluster_template = base-config-build-ami

[aws]
aws_region_name = us-west-2

[scaling quick]
scaledown_idletime = 15

[vpc megatron]
vpc_id =
master_subnet_id =

[queue gpu]
compute_resource_settings = gpu_resources
disable_hyperthreading = true
enable_efa = true
enable_efa_gdr = true
placement_group = DYNAMIC

[compute_resource gpu_resources]
instance_type = p4d.24xlarge
max_count = 128


[cluster base-config-build-ami]
key_name =
base_os = alinux2 # optional, defaults to alinux2
scheduler = slurm
master_instance_type = c5.4xlarge # optional, defaults to t2.micro
vpc_settings = megatron
scaling_settings = quick
queue_settings = gpu
compute_root_volume_size = 256
master_root_volume_size = 128


[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

--------------------------------------------------------------------------------
/configs/multi-queue-config.ini:
--------------------------------------------------------------------------------
[global]
update_check = false
sanity_check = true
cluster_template = multi-queue-us-west-2

[aws]
aws_region_name = us-west-2

[scaling quick]
scaledown_idletime = 120

[vpc megatron]
vpc_id =
master_subnet_id =
compute_subnet_id =

[queue gpu]
compute_resource_settings = gpu_resources
disable_hyperthreading = true
enable_efa = true
enable_efa_gdr = true
placement_group = DYNAMIC

[compute_resource gpu_resources]
instance_type = p4d.24xlarge
max_count = 128

[queue cpu]
compute_resource_settings = cpu_resources

[compute_resource cpu_resources]
instance_type = c5n.18xlarge
max_count = 12

[fsx sharedfsx]
shared_dir = /lustre
storage_capacity = 1200
import_path = s3://
export_path = s3://
deployment_type = SCRATCH_2

[cluster multi-queue-us-west-2]
key_name =
base_os = alinux2 # optional, defaults to alinux2
scheduler = slurm
master_instance_type = c5.4xlarge # optional, defaults to t2.micro
vpc_settings = megatron
scaling_settings = quick
queue_settings = gpu, cpu
custom_ami =
s3_read_write_resource = arn:aws:s3:::*
compute_root_volume_size = 256
master_root_volume_size = 128
fsx_settings = sharedfsx

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

--------------------------------------------------------------------------------
/scripts/create_custom_ami.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#
#
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#
#

echo "Deep Learning AMI ID $DEEP_LEARNING_AMI_ID"

echo "Creating a new AMI"

pcluster createami -ai $DEEP_LEARNING_AMI_ID \
    -os alinux2 \
    -ap megatron-on-pcluster- \
    -c $(pwd)/configs/base-config-build-ami.ini \
    -i p4d.24xlarge \
    --post-install file://$(pwd)/scripts/custom_dlami_user_data.sh \
    -r ${AWS_REGION}

--------------------------------------------------------------------------------
/scripts/custom_dlami_user_data.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#
#
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#
#

# This script is based on the DLAMI v36 ami-0f899ff8474ea45a9
yum install htop parallel -y

# Point /usr/local/cuda at CUDA 11.0
sudo rm /usr/local/cuda
ln -s /usr/local/cuda-11.0 /usr/local/cuda
export CUDA_HOME=/usr/local/cuda-11.0/
export LD_LIBRARY_PATH=/usr/local/cuda-11.0/efa/lib:/usr/local/cuda-11.0/lib:/usr/local/cuda-11.0/lib64:/usr/local/cuda-11.0:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH
export USR_HOME=/home/ec2-user && mkdir -p $USR_HOME && cd $USR_HOME
export PIP_EXEC=$USR_HOME/anaconda3/envs/pytorch_latest_p37/bin/pip
export PYTHON_EXEC=$USR_HOME/anaconda3/envs/pytorch_latest_p37/bin/python

# Package installation
MEGATRON_DIRECTORY=$USR_HOME/megatron
APEX=$USR_HOME/apex

if [ ! -d "$MEGATRON_DIRECTORY" ]; then
    # Control will enter here if $MEGATRON_DIRECTORY doesn't exist
    echo "Megatron repository not found. Installing..."
    git clone -b v1.1 https://github.com/NVIDIA/Megatron-LM/ $MEGATRON_DIRECTORY
    chown -R ec2-user:ec2-user $MEGATRON_DIRECTORY
    $PIP_EXEC install pipenv transformers dataclasses pybind11 wikiextractor tensorboard jupyterlab
    $PIP_EXEC install -e $MEGATRON_DIRECTORY -U
fi

if [ ! -d $APEX ]; then
    # NVIDIA APEX is compiled against the CUDA version selected above
    echo "Apex directory doesn't exist, installing..."
    git clone https://www.github.com/nvidia/apex $APEX
    chown -R ec2-user:ec2-user $APEX
    $PIP_EXEC install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" $APEX
fi

chown -R ec2-user:ec2-user $USR_HOME

exit $?
--------------------------------------------------------------------------------
/scripts/train_8B_gpt2.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Shared data paths from the FSx for Lustre mount point /lustre
DATA_PATH=/lustre/data/wiki/my-gpt2_text_document
CHECKPOINT_PATH=/lustre/data/gpt2/checkpoint
VOCAB_FILE=/lustre/data/gpt2/gpt2-vocab.json
MERGES_FILE=/lustre/data/gpt2/gpt2-merges.txt

# Distributed World configuration
MP_SIZE=8
GPUS_PER_NODE=8
DDP_IMPL=torch
MASTER_ADDR=$SLURM_SUBMIT_HOST
MASTER_PORT=6000
NNODES=$SLURM_NTASKS
NODE_RANK=$SLURM_NODEID
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

# CUDA, EFA and NCCL configs
export LD_LIBRARY_PATH=/usr/local/cuda-11.0/efa/lib:/usr/local/cuda-11.0/lib:/usr/local/cuda-11.0/lib64:/usr/local/cuda-11.0:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export NCCL_ALGO=ring
export NCCL_DEBUG=INFO
export RDMAV_FORK_SAFE=1

# Distributed args for PyTorch DDP
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

# Training:
/home/ec2-user/anaconda3/envs/pytorch_latest_p37/bin/python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       /home/ec2-user/megatron/pretrain_gpt2.py \
       --model-parallel-size $MP_SIZE \
       --DDP-impl $DDP_IMPL \
       --num-layers 42 \
       --hidden-size 4096 \
       --num-attention-heads 32 \
       --batch-size 16 \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --train-iters 1000 \
       --lr-decay-iters 320000 \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       --vocab-file $VOCAB_FILE \
       --merge-file $MERGES_FILE \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --min-lr 1.0e-5 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --checkpoint-activations \
       --distribute-checkpointed-activations \
       --log-interval 50 \
       --save-interval 1000 \
       --eval-interval 1000 \
       --eval-iters 10 \
       --num-workers 2 \
       --fp16 \
       --tensorboard-dir /lustre/logs/gpt2_param8B_nodes16_bs16_sjob${SLURM_JOB_ID}

set +x

--------------------------------------------------------------------------------
/ultra.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-megatron/2511e0a6237cb1a30dcf4d47e3783e6e9475235e/ultra.png
--------------------------------------------------------------------------------