├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── TUTORIAL.md
├── configs
│   ├── base-config-build-ami.ini
│   └── multi-queue-config.ini
├── scripts
│   ├── create_custom_ami.sh
│   ├── custom_dlami_user_data.sh
│   └── train_8B_gpt2.sh
└── ultra.png

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.env
**/*tmp
**/.DS_Store

--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.

--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contributing Guidelines

Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
documentation, we greatly value feedback and contributions from our community.

Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
information to respond effectively to your bug report or contribution.


## Reporting Bugs/Feature Requests

We welcome you to use the GitHub issue tracker to report bugs or suggest features.

When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:

* A reproducible test case or series of steps
* The version of our code being used
* Any modifications you've made relevant to the bug
* Anything unusual about your environment or deployment


## Contributing via Pull Requests
Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:

1. You are working against the latest source on the *main* branch.
2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
3. You open an issue to discuss any significant work - we would hate for your time to be wasted.

To send us a pull request, please:

1. Fork the repository.
2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
3. Ensure local tests pass.
4. Commit to your fork using clear commit messages.
5. Send us a pull request, answering any default questions in the pull request interface.
6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.

GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).


## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute to.
As our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.


## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.


## Security issue notifications
If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.


## Licensing

See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Megatron on AWS EC2 UltraCluster

**Megatron on AWS EC2 UltraCluster** provides steps, code and configuration samples to deploy and train a GPT-type Natural Language Understanding (NLU) model using an [AWS EC2 UltraCluster of P4d instances]() and the [NVIDIA Megatron-LM]() framework.

[Megatron]() is a large and powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. Refer to [Megatron's original GitHub repository]() for more information.

## Repository Structure

This repository contains configuration files for [AWS ParallelCluster]() in the [`configs`](<./configs>) folder.
The configurations implement a tightly coupled cluster of [p4d.24xlarge EC2 instances](), leveraging [AWS Elastic Fabric Adapter (EFA)]() and [Amazon FSx for Lustre]() in an EC2 UltraCluster of P4d configuration.

The [`scripts`](<./scripts>) folder contains scripts to build a custom [Deep Learning AMI]() with Megatron-LM and its dependencies.
It also contains scripts to train a GPT-2 8-billion-parameter model, in an 8-way model-parallel configuration, using the [SLURM scheduler]() available through ParallelCluster.
The [TUTORIAL.md]() describes how to:

- Set up a cluster management environment with AWS ParallelCluster.
- Build a custom Deep Learning AMI using the `pcluster` [CLI]().
- Configure and deploy a multi-queue cluster with CPU and GPU instances.
- Preprocess the latest English Wikipedia data dump from [Wikimedia]() on a large CPU instance.
- Train the 8B-parameter version of GPT-2 using the preprocessed data across 64 GPUs (8 p4d.24xlarge instances).

The Advanced section of the tutorial also describes how to monitor training using [TensorBoard]().

### Amazon EC2 UltraCluster

For an overview of the Amazon EC2 UltraClusters of P4d instances, follow [this link](https://pages.awscloud.com/amazon-ec2-p4d.html).

![p4d ultracluster](<./ultra.png>)

## Contribution guidelines

If you want to contribute, please review the [contribution guidelines]().

## License

This project is licensed under \[MIT\]; see the [LICENSE]() file.

--------------------------------------------------------------------------------
/TUTORIAL.md:
--------------------------------------------------------------------------------
# Megatron on AWS UltraCluster

This tutorial walks through the end-to-end process of configuring a cluster with AWS ParallelCluster and a customized Deep Learning AMI, preprocessing a large dataset on CPU-optimized Amazon EC2 instances, and training a GPT-2 Natural Language Understanding model on an AWS EC2 UltraCluster.

Familiarity with the AWS cloud concepts of Virtual Private Cloud (VPC), e.g. Subnets and Availability Zones, the AWS CLI, and bash scripting is recommended.

## Contents:

* [Contents](<#contents>)
* [ParallelCluster Management Setup](<#parallelcluster-management-setup>)
  * [Local Environment](<#local-environment>)
  * [AWS](<#aws>)
* [Building a Custom AMI](<#building-a-custom-ami>)
* [Configure and Deploy a Cluster](<#configure-and-deploy-a-cluster>)
* [Preprocessing the Training Dataset with CPU Instances](<#preprocessing-the-training-dataset-with-cpu-instances>)
* [Model Parallel Training on a p4d.24xlarge UltraCluster](<#model-parallel-training-on-a-p4d24xlarge-ultracluster>)
* [Monitoring Training with TensorBoard](<#monitoring-training-with-tensorboard>)

## Prerequisites

If you don't have Python3 pip already installed, follow the instructions on the [pip installation]() page before running the commands below.

## ParallelCluster Management Setup

#### Local Environment

To deploy a cluster with AWS ParallelCluster, you'll need to install the `aws-parallelcluster` CLI.
From the root of this project, execute the following to create and activate a virtual environment with ParallelCluster installed:

```bash
python3 -m venv .megatron_env
source .megatron_env/bin/activate
python3 -m pip install awscli aws-parallelcluster==2.10.1
pcluster version

# Set AWS Region
export AWS_REGION="us-west-2"
```

To execute this sample repo, you will need credentials to access the AWS CLI. Refer to the [Getting Started documentation]() for more information on setting up the ParallelCluster CLI.

If you don't have HashiCorp Packer already installed, follow the instructions on the [HashiCorp Packer getting started]() page before running the commands below.
It is required to build custom AMIs.
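With the CLI installed, you can optionally verify both prerequisites before moving on. A minimal sanity check, assuming your AWS credentials are already configured for the target account:

```bash
# Confirm the AWS CLI can authenticate, and that Packer is on the PATH
aws sts get-caller-identity --region ${AWS_REGION}
packer --version
```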
#### AWS

This sample repo assumes an existing VPC with private and public subnets. For details on how to provision such infrastructure, check out [this tutorial]().
Private subnets are a requirement for running p4d.24xlarge instances with 4 EFA cards.
Please note that the private subnet should have a NAT Gateway set up, since AWS ParallelCluster and the Megatron-LM lab require an internet connection.

Use the following command to list VPCs, Subnets and Availability Zones:

```bash
aws ec2 describe-subnets --query 'Subnets[].{VPC:VpcId,SUBNET:SubnetId,AZ:AvailabilityZone}' --region ${AWS_REGION}
```

Take note of the IDs to properly configure the cluster environment, and set the following environment variables:

```bash
VPC_ID=
PUBLIC_SUBNET_ID=
PRIVATE_SUBNET_ID=
```

This sample also requires an S3 bucket and an EC2 key pair. You can use the following AWS CLI commands to create new ones:

```bash
# Create an EC2 key pair
SSH_KEY_NAME="megatron-lab-key"

aws ec2 create-key-pair --key-name ${SSH_KEY_NAME} \
    --query KeyMaterial \
    --region ${AWS_REGION} \
    --output text > ~/.ssh/${SSH_KEY_NAME}

# Create an S3 bucket with a randomized suffix
BUCKET_POSTFIX=$(python3 -S -c "import uuid; print(str(uuid.uuid4().hex)[:10])")
BUCKET_NAME="megatron-lab-${BUCKET_POSTFIX}"

aws s3 mb s3://${BUCKET_NAME} --region ${AWS_REGION}
```

## Building a Custom AMI

[Build a custom AMI]() to avoid the long provisioning times associated with using [post-installation scripts]() for the Megatron-LM dependencies.

The base AMI for customization is an AWS Deep Learning AMI (DLAMI).
It already provides the software required to run distributed training of large machine learning models, including NVIDIA drivers and CUDA, EFA plugins, and the major deep learning frameworks such as PyTorch and TensorFlow, managed in Conda environments.
The Conda package manager can also manage the Megatron-LM dependencies.

To retrieve the AMI ID of the Deep Learning AMI v38.0 based on Amazon Linux 2 in the region of deployment, you can use the following command:

```bash
# Retrieve Deep Learning AMI ID
export DEEP_LEARNING_AMI_ID=`aws ec2 describe-images --owners amazon \
    --query 'Images[*].{ImageId:ImageId,CreationDate:CreationDate}' \
    --filters "Name=name,Values='Deep Learning AMI (Amazon Linux 2) Version 38.0'" \
    --region ${AWS_REGION} \
    | jq -r 'sort_by(.CreationDate)[-1] | .ImageId'`
```
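You can optionally confirm that the query resolved to a real image before proceeding. This is just a sanity check using the standard `describe-images` call:

```bash
# Optional: print the name of the image behind the retrieved AMI ID
aws ec2 describe-images --image-ids ${DEEP_LEARNING_AMI_ID} \
    --query 'Images[0].Name' --output text --region ${AWS_REGION}
```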
Before building the custom AMI, you will have to modify the argument values between `<...>` in the base configuration file for AWS ParallelCluster, located in `./configs/base-config-build-ami.ini`.

The build calls the script [custom\_dlami\_user\_data.sh](<./scripts/custom_dlami_user_data.sh>), which installs Megatron-LM and its dependencies, including NVIDIA APEX.
The instance used for the build is `-i p4d.24xlarge`, as NVIDIA APEX will be compiled for the host's platform during installation.

The instructions below help set the variables in the AWS ParallelCluster configuration file that is used to build the custom AMI, i.e. `./configs/base-config-build-ami.ini`:

```bash
git clone https://github.com/pixelb/crudini

# Install dependencies
pip3 install iniparse

# Change the cluster configuration file
python3 crudini/crudini --set ./configs/base-config-build-ami.ini "aws" aws_region_name "${AWS_REGION}"
python3 crudini/crudini --set ./configs/base-config-build-ami.ini "vpc megatron" vpc_id "${VPC_ID}"
python3 crudini/crudini --set ./configs/base-config-build-ami.ini "vpc megatron" master_subnet_id "${PUBLIC_SUBNET_ID}"
python3 crudini/crudini --set ./configs/base-config-build-ami.ini "cluster base-config-build-ami" key_name "${SSH_KEY_NAME}"
```

To build the custom AMI from the root of the sample project, use the `pcluster createami` command as shown in the script [create\_custom\_ami.sh](<./scripts/create_custom_ami.sh>):

```bash
./scripts/create_custom_ami.sh
```

After the build is complete, you get the AMI ID printed on screen:

```bash
Custom AMI ami-xxxxxxxxxxxxx created with name megatron-on-pcluster-aws-parallelcluster-2.10.1-amzn2-hvm-x86_64-202101071208

To use it, add the following variable to the AWS ParallelCluster config file, under the [cluster ...] section
custom_ami = ami-xxxxxxxxxxxxx
```

Please set the following variable, which will be used to set up AWS ParallelCluster for running Megatron-LM:

```bash
CUSTOM_AMI=
```

## Configure and Deploy a Cluster

Use the [configs/multi-queue-config.ini](<./configs/multi-queue-config.ini>) configuration file to stand up the cluster.
You can do that manually by changing the VPC, COMPUTE and FSX sections, or use the following commands to change the configuration file using `crudini`:

```bash
# Change the cluster configuration file
python3 crudini/crudini --set ./configs/multi-queue-config.ini "aws" aws_region_name "${AWS_REGION}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "vpc megatron" vpc_id "${VPC_ID}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "vpc megatron" master_subnet_id "${PUBLIC_SUBNET_ID}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "vpc megatron" compute_subnet_id "${PRIVATE_SUBNET_ID}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "cluster multi-queue-us-west-2" key_name "${SSH_KEY_NAME}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "cluster multi-queue-us-west-2" s3_read_write_resource "arn:aws:s3:::${BUCKET_NAME}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "cluster multi-queue-us-west-2" custom_ami "${CUSTOM_AMI}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "fsx sharedfsx" import_path "s3://${BUCKET_NAME}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "fsx sharedfsx" export_path "s3://${BUCKET_NAME}"
```

You are now ready to create the cluster with:

```bash
pcluster create megatron-on-pcluster -c configs/multi-queue-config.ini
```

Once deployment completes, the Head Node's public and private IPs are printed on the screen at the end of cluster creation:

```bash
Creating stack named: parallelcluster-megatron-on-pcluster
Status: parallelcluster-megatron-on-pcluster - CREATE_COMPLETE
MasterPublicIP: yyy.yyy.yy.yyy
ClusterUser: ec2-user
MasterPrivateIP: xxx.xxx.xx.xxx
```
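While the stack is being created, you can optionally poll its state from the same terminal using the ParallelCluster 2.x `status` subcommand (the cluster name is the one used above):

```bash
# Optional: check the cluster's CloudFormation status
pcluster status megatron-on-pcluster -r ${AWS_REGION}
```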
Access the cluster Head Node using the CLI command:

```bash
pcluster ssh megatron-on-pcluster -i ~/.ssh/${SSH_KEY_NAME}
```

## Preprocessing the Training Dataset with CPU Instances

Once connected to the cluster head node, set up a data folder in the _/lustre_ directory and download the latest English Wikipedia data dump from Wikimedia.
This process follows the original [Megatron-LM documentation]():

```bash
export WIKI_DIR=/lustre/data/wiki
mkdir -p $WIKI_DIR && cd $WIKI_DIR

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```

Download the vocab and merge table files for the desired model. This example uses the GPT-2 model:

```bash
export DATA_DIR=/lustre/data
export GPT2_DATA=${DATA_DIR}/gpt2

mkdir -p ${GPT2_DATA} && cd ${GPT2_DATA}

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

mkdir -p ${GPT2_DATA}/checkpoint
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O ${GPT2_DATA}/checkpoint/megatron_lm_345m_v0.0.zip
```

Once the data are available, provision a CPU node using Slurm: `salloc --nodes 1 -p cpu`.

All data preprocessing work will proceed on the CPU machine. You can check the provisioning status of your new machine using the `squeue` command. Once the status, _ST_, changes to running, _R_, access the CPU machine's terminal through ssh with: `ssh cpu-dy-c5n18xlarge-1`.

Extract the downloaded data using WikiExtractor:

```bash
git clone https://github.com/attardi/wikiextractor.git /lustre/wikiextractor
cd /lustre/wikiextractor
cd -
python -m wikiextractor.WikiExtractor --json /lustre/data/wiki/enwiki-latest-pages-articles.xml.bz2 --output /lustre/data/wiki/text/ -q --processes 70 2>&1 | tee wikiextract.out &
```

WikiExtractor first preprocesses the template of all pages sequentially, followed by a Map/Reduce process that extracts the pages and converts them to the loose JSON format required by Megatron-LM.

Once the extraction completes, merge the text files with:

```bash
conda activate pytorch_latest_p37
cd /lustre/data/wiki
find /lustre/data/wiki/text/ -name wiki* | parallel -m -j 70 "cat {} >> mergedfile.json"
```

The `mergedfile.json` size on disk is 16GB. With it, create the binary data format for Megatron GPT-2.
**NOTE**: Refer to [this solution]() if an `IndexError: list index out of range` occurs.

To create the binary data, run the following command:

```bash
python /home/ec2-user/megatron/tools/preprocess_data.py \
       --input /lustre/data/wiki/mergedfile.json \
       --output-prefix my-gpt2 \
       --vocab /lustre/data/gpt2/gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file /lustre/data/gpt2/gpt2-merges.txt \
       --append-eod \
       --workers 70
```
Once all preprocessing is done, we can persist the data from FSx back to S3 using a [Data Repository Task]() from the terminal used to spin up the cluster.
This guarantees that the data persists even if the cluster is terminated.

```bash
# Retrieve the FSx for Lustre file system Id
export FSX_ID=$(aws fsx describe-file-systems --query "FileSystems[?LustreConfiguration.DataRepositoryConfiguration.ExportPath=='s3://${BUCKET_NAME}'].FileSystemId" --output text --region ${AWS_REGION})
# Create data repository task
aws fsx create-data-repository-task \
    --file-system-id $FSX_ID \
    --type EXPORT_TO_REPOSITORY \
    --paths data \
    --report Enabled=true,Scope=FAILED_FILES_ONLY,Format=REPORT_CSV_20191124,Path=s3://${BUCKET_NAME}/reports \
    --region ${AWS_REGION}
```

You can return to the original terminal by running the `exit` command twice: (1) to exit the `ssh` session on the CPU node, and (2) to release the `salloc` Slurm allocation.

## Model Parallel Training on a p4d.24xlarge UltraCluster

In this section you will train the 8-billion-parameter version of the Megatron-LM GPT-2 model across 64 GPUs - 8 p4d.24xlarge instances. Log back into the cluster head node using `pcluster ssh ...` if not already on the machine.

Start by creating a training script according to the [original documentation]().
To train using `slurm` on 8 nodes, modify the distributed world configuration section according to the script [scripts/train\_8B\_gpt2.sh](<./scripts/train_8B_gpt2.sh>).
Make sure to include the CUDA, EFA and NCCL environment variables to enable NCCL to communicate between GPUs through AWS EFA using GPU Remote Direct Memory Access.

Create a file named `/lustre/scripts/train_8B_gpt2.sh` and copy the content below into it.

```bash
#!/bin/bash
# scripts/train_8B_gpt2.sh

# Shared data paths from the FSx for Lustre mount point /lustre
DATA_PATH=/lustre/data/wiki/my-gpt2_text_document
CHECKPOINT_PATH=/lustre/data/gpt2/checkpoint
VOCAB_FILE=/lustre/data/gpt2/gpt2-vocab.json
MERGES_FILE=/lustre/data/gpt2/gpt2-merges.txt

# Distributed World configuration
MP_SIZE=8
GPUS_PER_NODE=8
DDP_IMPL=torch
MASTER_ADDR=$SLURM_SUBMIT_HOST
MASTER_PORT=6000
NNODES=$SLURM_NTASKS
NODE_RANK=$SLURM_NODEID
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

# CUDA, EFA and NCCL configs
export LD_LIBRARY_PATH=/usr/local/cuda-11.0/efa/lib:/usr/local/cuda-11.0/lib:/usr/local/cuda-11.0/lib64:/usr/local/cuda-11.0:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export NCCL_ALGO=ring
export NCCL_DEBUG=INFO
export RDMAV_FORK_SAFE=1

# Distributed args for PyTorch DDP
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

# Training:
/home/ec2-user/anaconda3/envs/pytorch_latest_p37/bin/python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       /home/ec2-user/megatron/pretrain_gpt2.py \
       --model-parallel-size $MP_SIZE \
       --DDP-impl $DDP_IMPL \
       --num-layers 42 \
       --hidden-size 4096 \
       --num-attention-heads 32 \
       --batch-size 16 \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --train-iters 1000 \
       --lr-decay-iters 320000 \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       --vocab-file $VOCAB_FILE \
       --merge-file $MERGES_FILE \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --min-lr 1.0e-5 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --checkpoint-activations \
       --distribute-checkpointed-activations \
       --log-interval 50 \
       --save-interval 1000 \
       --eval-interval 1000 \
       --eval-iters 10 \
       --num-workers 2 \
       --fp16 \
       --tensorboard-dir /lustre/logs/gpt2_param8B_nodes16_bs16_sjob${SLURM_JOB_ID}

set +x
```
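Before making the script executable and wiring it into Slurm, you can optionally ask bash to parse it without executing anything (the `-n` flag performs a syntax-only check):

```bash
bash -n /lustre/scripts/train_8B_gpt2.sh && echo "syntax OK"
```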
Make the script executable:

```bash
chmod +x /lustre/scripts/train_8B_gpt2.sh
```

To drive the `sbatch` execution of the training script, wrap it in a `job.sh` script on a path shared across all nodes, such as `/lustre/scripts`:

```bash
mkdir -p /lustre/scripts
cat > /lustre/scripts/job.sh << EOF
#!/bin/bash
#SBATCH --wait-all-nodes=1
#SBATCH -p gpu
#SBATCH -n 8
#SBATCH -N 8
#SBATCH -o out_%j.out
srun /lustre/scripts/train_8B_gpt2.sh
EOF

sbatch /lustre/scripts/job.sh
```

Now you can start training by running `sbatch /lustre/scripts/job.sh` from the head node of the cluster. The output from the run will be recorded in the `.out` file in the current folder.
If your job fails with `slurmstepd: error: execve(): /lustre/scripts/train_8B_gpt2.sh: Permission denied`, change the permissions of your scripts with `chmod +x /lustre/scripts/*.sh`.

When inspecting the NCCL logs in the `.out` file, expect to find entries showing that the OFI plugin selected the EFA provider, such as the one below:

```bash
gpu-dy-p4d24xlarge-10:33337:33337 [0] NCCL INFO NET/OFI Selected Provider is efa
```

### Monitoring training with TensorBoard

The Megatron-LM framework writes TensorBoard logs to the `--tensorboard-dir` specified in the training script. The custom AMI built for the cluster has TensorBoard installed in the `pytorch_latest_p37` environment used for training.
Use the following command to start TensorBoard in the background and expose it on a specific port:

```bash
python -m tensorboard.main --port=8080 --logdir /lustre/logs --host 0.0.0.0 2>&1 | tee ~/tensorboard.logs & disown
```
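You can optionally confirm that TensorBoard is serving before opening the tunnel; this simple probe assumes it was started on port 8080 as above:

```bash
curl -s http://localhost:8080 > /dev/null && echo "TensorBoard is up"
```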
Using the following `ssh` tunnel configuration when connecting to the head node, you can access TensorBoard at `localhost:8080`:

```bash
pcluster ssh megatron-on-pcluster -i ~/.ssh/${SSH_KEY_NAME} -L 8080:localhost:8080
```

--------------------------------------------------------------------------------
/configs/base-config-build-ami.ini:
--------------------------------------------------------------------------------
[global]
update_check = false
sanity_check = true
cluster_template = base-config-build-ami

[aws]
aws_region_name = us-west-2

[scaling quick]
scaledown_idletime = 15

[vpc megatron]
vpc_id =
master_subnet_id =

[queue gpu]
compute_resource_settings = gpu_resources
disable_hyperthreading = true
enable_efa = true
enable_efa_gdr = true
placement_group = DYNAMIC

[compute_resource gpu_resources]
instance_type = p4d.24xlarge
max_count = 128


[cluster base-config-build-ami]
key_name =
base_os = alinux2 # optional, defaults to alinux2
scheduler = slurm
master_instance_type = c5.4xlarge # optional, defaults to t2.micro
vpc_settings = megatron
scaling_settings = quick
queue_settings = gpu
compute_root_volume_size = 256
master_root_volume_size = 128


[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

--------------------------------------------------------------------------------
/configs/multi-queue-config.ini:
--------------------------------------------------------------------------------
[global]
update_check = false
sanity_check = true
cluster_template = multi-queue-us-west-2

[aws]
aws_region_name = us-west-2

[scaling quick]
scaledown_idletime = 120

[vpc megatron]
vpc_id =
master_subnet_id =
compute_subnet_id =

[queue gpu]
compute_resource_settings = gpu_resources
disable_hyperthreading = true
enable_efa = true
enable_efa_gdr = true
placement_group = DYNAMIC

[compute_resource gpu_resources]
instance_type = p4d.24xlarge
max_count = 128

[queue cpu]
compute_resource_settings = cpu_resources

[compute_resource cpu_resources]
instance_type = c5n.18xlarge
max_count = 12

[fsx sharedfsx]
shared_dir = /lustre
storage_capacity = 1200
import_path = s3://
export_path = s3://
deployment_type = SCRATCH_2

[cluster multi-queue-us-west-2]
key_name =
base_os = alinux2 # optional, defaults to alinux2
scheduler = slurm
master_instance_type = c5.4xlarge # optional, defaults to t2.micro
vpc_settings = megatron
scaling_settings = quick
queue_settings = gpu, cpu
custom_ami =
s3_read_write_resource = arn:aws:s3:::*
compute_root_volume_size = 256
master_root_volume_size = 128
fsx_settings = sharedfsx

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

--------------------------------------------------------------------------------
/scripts/create_custom_ami.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#
#
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#
#

echo "Deep Learning AMI ID $DEEP_LEARNING_AMI_ID"

echo "Creating a new AMI"

pcluster createami -ai $DEEP_LEARNING_AMI_ID \
    -os alinux2 \
    -ap megatron-on-pcluster- \
    -c $(pwd)/configs/base-config-build-ami.ini \
    -i p4d.24xlarge \
    --post-install file://$(pwd)/scripts/custom_dlami_user_data.sh \
    -r ${AWS_REGION}

--------------------------------------------------------------------------------
/scripts/custom_dlami_user_data.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#
#
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#
#

# This script is based on the DLAMI v36 ami-0f899ff8474ea45a9
yum install htop parallel -y

# Point /usr/local/cuda at CUDA 11.0
sudo rm /usr/local/cuda
ln -s /usr/local/cuda-11.0 /usr/local/cuda
export CUDA_HOME=/usr/local/cuda-11.0/
export LD_LIBRARY_PATH=/usr/local/cuda-11.0/efa/lib:/usr/local/cuda-11.0/lib:/usr/local/cuda-11.0/lib64:/usr/local/cuda-11.0:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH
export USR_HOME=/home/ec2-user && mkdir -p $USR_HOME && cd $USR_HOME
export PIP_EXEC=$USR_HOME/anaconda3/envs/pytorch_latest_p37/bin/pip
export PYTHON_EXEC=$USR_HOME/anaconda3/envs/pytorch_latest_p37/bin/python

# Package installation
MEGATRON_DIRECTORY=$USR_HOME/megatron
APEX=$USR_HOME/apex

if [ ! -d "$MEGATRON_DIRECTORY" ]; then
    # Control will enter here if $MEGATRON_DIRECTORY doesn't exist
    echo "Megatron repository not found. Installing..."
    git clone -b v1.1 https://github.com/NVIDIA/Megatron-LM/ $MEGATRON_DIRECTORY
    chown -R ec2-user:ec2-user $MEGATRON_DIRECTORY
    $PIP_EXEC install pipenv transformers dataclasses pybind11 wikiextractor tensorboard jupyterlab
    $PIP_EXEC install -e $MEGATRON_DIRECTORY -U
fi

if [ ! -d $APEX ]; then
    # NVIDIA APEX is compiled against the CUDA version selected above
    echo "Apex directory doesn't exist, installing..."
    git clone https://www.github.com/nvidia/apex $APEX
    chown -R ec2-user:ec2-user $APEX
    $PIP_EXEC install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" $APEX
fi

chown -R ec2-user:ec2-user $USR_HOME

exit $?
--------------------------------------------------------------------------------
/scripts/train_8B_gpt2.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Shared data paths from the FSx for Lustre mount point /lustre
DATA_PATH=/lustre/data/wiki/my-gpt2_text_document
CHECKPOINT_PATH=/lustre/data/gpt2/checkpoint
VOCAB_FILE=/lustre/data/gpt2/gpt2-vocab.json
MERGES_FILE=/lustre/data/gpt2/gpt2-merges.txt

# Distributed World configuration
MP_SIZE=8
GPUS_PER_NODE=8
DDP_IMPL=torch
MASTER_ADDR=$SLURM_SUBMIT_HOST
MASTER_PORT=6000
NNODES=$SLURM_NTASKS
NODE_RANK=$SLURM_NODEID
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

# CUDA, EFA and NCCL configs
export LD_LIBRARY_PATH=/usr/local/cuda-11.0/efa/lib:/usr/local/cuda-11.0/lib:/usr/local/cuda-11.0/lib64:/usr/local/cuda-11.0:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export NCCL_ALGO=ring
export NCCL_DEBUG=INFO
export RDMAV_FORK_SAFE=1

# Distributed args for PyTorch DDP
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

# Training:
/home/ec2-user/anaconda3/envs/pytorch_latest_p37/bin/python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       /home/ec2-user/megatron/pretrain_gpt2.py \
       --model-parallel-size $MP_SIZE \
       --DDP-impl $DDP_IMPL \
       --num-layers 42 \
       --hidden-size 4096 \
       --num-attention-heads 32 \
       --batch-size 16 \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --train-iters 1000 \
       --lr-decay-iters 320000 \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       --vocab-file $VOCAB_FILE \
       --merge-file $MERGES_FILE \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --min-lr 1.0e-5 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --checkpoint-activations \
       --distribute-checkpointed-activations \
       --log-interval 50 \
       --save-interval 1000 \
       --eval-interval 1000 \
       --eval-iters 10 \
       --num-workers 2 \
       --fp16 \
       --tensorboard-dir /lustre/logs/gpt2_param8B_nodes16_bs16_sjob${SLURM_JOB_ID}

set +x

--------------------------------------------------------------------------------
/ultra.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-megatron/2511e0a6237cb1a30dcf4d47e3783e6e9475235e/ultra.png
--------------------------------------------------------------------------------