├── README.md ├── images ├── .DS_Store ├── DDP_Steps.png ├── DDP_BuildBlock.png ├── DataParallel.png ├── Bucket_Allreduce.png ├── Backends_Difference.png └── Collective_Communication.png ├── .gitignore ├── codes ├── mnist.py ├── mnist-env.py └── mnist-tcp.py └── DDP-Tutorial.md /README.md: -------------------------------------------------------------------------------- 1 | # DDP-Tutorial 2 | A tutorial for DDP in Pytorch. 3 | -------------------------------------------------------------------------------- /images/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/.DS_Store -------------------------------------------------------------------------------- /images/DDP_Steps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/DDP_Steps.png -------------------------------------------------------------------------------- /images/DDP_BuildBlock.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/DDP_BuildBlock.png -------------------------------------------------------------------------------- /images/DataParallel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/DataParallel.png -------------------------------------------------------------------------------- /images/Bucket_Allreduce.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/Bucket_Allreduce.png -------------------------------------------------------------------------------- /images/Backends_Difference.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/Backends_Difference.png -------------------------------------------------------------------------------- /images/Collective_Communication.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/Collective_Communication.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | # Dataset 132 | /codes/data/ 133 | -------------------------------------------------------------------------------- /codes/mnist.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Author: Kai Zhang 3 | Date: 2022-03-19 15:12:04 4 | LastEditors: Kai Zhang 5 | LastEditTime: 2022-03-23 19:49:38 6 | Description: demo of mnist classification 7 | ''' 8 | from datetime import datetime 9 | import argparse 10 | import torchvision 11 | import torchvision.transforms as transforms 12 | import torch 13 | import torch.nn as nn 14 | from tqdm import tqdm 15 | 16 | class ConvNet(nn.Module): 17 | def __init__(self, num_classes=10): 18 | super(ConvNet, self).__init__() 19 | self.layer1 = nn.Sequential( 20 | nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2), 21 | nn.BatchNorm2d(16), 22 | nn.ReLU(), 23 | nn.MaxPool2d(kernel_size=2, stride=2)) 24 | self.layer2 = nn.Sequential( 25 | nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2), 26 | nn.BatchNorm2d(32), 27 | nn.ReLU(), 28 | nn.MaxPool2d(kernel_size=2, stride=2)) 29 | self.fc = nn.Linear(7*7*32, num_classes) 30 | 31 | def forward(self, x): 32 | out = self.layer1(x) 33 | out = self.layer2(out) 34 | out = out.reshape(out.size(0), -1) 35 | out = self.fc(out) 36 | return out 37 | 38 | def train(gpu, args): 39 | model = ConvNet() 40 | model.cuda(gpu) 41 | # define loss function (criterion) and 
optimizer 42 | criterion = nn.CrossEntropyLoss().to(gpu) 43 | optimizer = torch.optim.SGD(model.parameters(), 1e-4) 44 | 45 | # Data loading code 46 | train_dataset = torchvision.datasets.MNIST(root='./data', 47 | train=True, 48 | transform=transforms.ToTensor(), 49 | download=True) 50 | train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 51 | batch_size=args.batch_size, 52 | shuffle=True, 53 | num_workers=0, 54 | pin_memory=True, 55 | sampler=None) 56 | 57 | start = datetime.now() 58 | total_step = len(train_loader) 59 | for epoch in range(args.epochs): 60 | model.train() 61 | for i, (images, labels) in enumerate(tqdm(train_loader)): 62 | images = images.to(gpu) 63 | labels = labels.to(gpu) 64 | # Forward pass 65 | outputs = model(images) 66 | loss = criterion(outputs, labels) 67 | 68 | # Backward and optimize 69 | optimizer.zero_grad() 70 | loss.backward() 71 | optimizer.step() 72 | if (i + 1) % 100 == 0: 73 | print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step, 74 | loss.item())) 75 | print("Training complete in: " + str(datetime.now() - start)) 76 | 77 | def main(): 78 | parser = argparse.ArgumentParser() 79 | parser.add_argument('-g', '--gpuid', default=0, type=int, 80 | help="which gpu to use") 81 | parser.add_argument('-e', '--epochs', default=1, type=int, 82 | metavar='N', 83 | help='number of total epochs to run') 84 | parser.add_argument('-b', '--batch_size', default=4, type=int, 85 | metavar='N', 86 | help='number of batchsize') 87 | 88 | args = parser.parse_args() 89 | train(args.gpuid, args) 90 | 91 | if __name__ == '__main__': 92 | main() 93 | -------------------------------------------------------------------------------- /codes/mnist-env.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Author: Kai Zhang 3 | Date: 2022-03-19 16:05:57 4 | LastEditors: Kai Zhang 5 | LastEditTime: 2022-03-23 21:25:20 6 | Description: Start DDP in env model 7 | 
''' 8 | from datetime import datetime 9 | import argparse 10 | import torchvision 11 | import torchvision.transforms as transforms 12 | import torch 13 | import torch.nn as nn 14 | import torch.distributed as dist 15 | from tqdm import tqdm 16 | from torch.cuda.amp import GradScaler 17 | 18 | 19 | def main(): 20 | parser = argparse.ArgumentParser() 21 | parser.add_argument('-g', '--gpuid', default=0, type=int, 22 | help="which gpu to use") 23 | parser.add_argument('-e', '--epochs', default=1, type=int, 24 | metavar='N', 25 | help='number of total epochs to run') 26 | parser.add_argument('-b', '--batch_size', default=4, type=int, 27 | metavar='N', 28 | help='number of batchsize') 29 | ################################################################################## 30 | parser.add_argument("--local_rank", type=int, # 31 | help='rank in current node') # 32 | parser.add_argument('--use_mix_precision', default=False, # 33 | action='store_true', help="whether to use mix precision") # 34 | ################################################################################## 35 | args = parser.parse_args() 36 | train(args.local_rank, args) 37 | 38 | 39 | class ConvNet(nn.Module): 40 | def __init__(self, num_classes=10): 41 | super(ConvNet, self).__init__() 42 | self.layer1 = nn.Sequential( 43 | nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2), 44 | nn.BatchNorm2d(16), 45 | nn.ReLU(), 46 | nn.MaxPool2d(kernel_size=2, stride=2)) 47 | self.layer2 = nn.Sequential( 48 | nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2), 49 | nn.BatchNorm2d(32), 50 | nn.ReLU(), 51 | nn.MaxPool2d(kernel_size=2, stride=2)) 52 | self.fc = nn.Linear(7*7*32, num_classes) 53 | 54 | def forward(self, x): 55 | out = self.layer1(x) 56 | out = self.layer2(out) 57 | out = out.reshape(out.size(0), -1) 58 | out = self.fc(out) 59 | return out 60 | 61 | 62 | #################################### N11 ################################## 63 | def evaluate(model, gpu, test_loader, rank): 64 | model.eval() 65 
| size = torch.tensor(0.).to(gpu) 66 | correct = torch.tensor(0.).to(gpu) 67 | with torch.no_grad(): 68 | for i, (images, labels) in enumerate(tqdm(test_loader)): 69 | images = images.to(gpu) 70 | labels = labels.to(gpu) 71 | # Forward pass 72 | outputs = model(images) 73 | size += images.shape[0] 74 | correct += (outputs.argmax(1) == labels).type(torch.float).sum() 75 | # Collective communication: reduce the sample count onto rank 0; change to allreduce if using Gloo 76 | dist.reduce(size, 0, op=dist.ReduceOp.SUM) 77 | # Collective communication: reduce the correct count onto rank 0; change to allreduce if using Gloo 78 | dist.reduce(correct, 0, op=dist.ReduceOp.SUM) 79 | if rank == 0: 80 | print('Evaluate accuracy is {:.2f}'.format(correct / size)) 81 | ################################################################################# 82 | 83 | 84 | def train(gpu, args): 85 | ######################################## N1 ################ 86 | dist.init_process_group(backend='nccl', init_method='env://') # 87 | args.rank = dist.get_rank() # 88 | ################################################################## 89 | model = ConvNet() 90 | model.cuda(gpu) 91 | # define loss function (criterion) and optimizer 92 | criterion = nn.CrossEntropyLoss().to(gpu) 93 | optimizer = torch.optim.SGD(model.parameters(), 1e-4) 94 | # Wrap the model 95 | ####################################### N2 ######################## 96 | model = nn.SyncBatchNorm.convert_sync_batchnorm(model) # 97 | model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu]) # 98 | scaler = GradScaler(enabled=args.use_mix_precision) # 99 | ######################################################################### 100 | # Data loading code 101 | train_dataset = torchvision.datasets.MNIST(root='./data', 102 | train=True, 103 | transform=transforms.ToTensor(), 104 | download=True) 105 | #################################### N3 ####################################### 106 | train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) # 107 | train_loader =
torch.utils.data.DataLoader(dataset=train_dataset, # 108 | batch_size=args.batch_size, # 109 | shuffle=False, # 110 | num_workers=0, # 111 | pin_memory=True, # 112 | sampler=train_sampler) # 113 | ##################################################################################### 114 | 115 | #################################### N9 ################################### 116 | test_dataset = torchvision.datasets.MNIST(root='./data', # 117 | train=False, # 118 | transform=transforms.ToTensor(), # 119 | download=True) # 120 | test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset) # 121 | test_loader = torch.utils.data.DataLoader(dataset=test_dataset, # 122 | batch_size=args.batch_size, # 123 | shuffle=False, # 124 | num_workers=0, # 125 | pin_memory=True, # 126 | sampler=test_sampler) # 127 | ################################################################################# 128 | start = datetime.now() 129 | # The number changes to orignal_length // args.world_size 130 | total_step = len(train_loader) 131 | for epoch in range(args.epochs): 132 | ################ N4 ################ 133 | train_loader.sampler.set_epoch(epoch) # 134 | ########################################## 135 | model.train() 136 | for i, (images, labels) in enumerate(tqdm(train_loader) if args.rank == 0 else train_loader): # only use tqdm in rank0 137 | images = images.to(gpu) 138 | labels = labels.to(gpu) 139 | # Forward pass 140 | ######################## N5 ################################ 141 | with torch.cuda.amp.autocast(enabled=args.use_mix_precision): # 142 | outputs = model(images) # 143 | loss = criterion(outputs, labels) # 144 | ################################################################## 145 | # Backward and optimize 146 | optimizer.zero_grad() 147 | ############## N6 ########## 148 | scaler.scale(loss).backward() # 149 | scaler.step(optimizer) # 150 | scaler.update() # 151 | ################################## 152 | ################ N7 #################### 153 | 
if (i + 1) % 100 == 0 and args.rank == 0: # 154 | ############################################## 155 | print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step, 156 | loss.item())) 157 | evaluate(model, gpu, test_loader, args.rank) 158 | ########### N8 ############ 159 | dist.destroy_process_group() # 160 | if args.rank == 0: # 161 | ################################# 162 | print("Training complete in: " + str(datetime.now() - start)) 163 | 164 | 165 | if __name__ == '__main__': 166 | main() 167 | -------------------------------------------------------------------------------- /codes/mnist-tcp.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Author: Kai Zhang 3 | Date: 2022-03-19 16:05:57 4 | LastEditors: Kai Zhang 5 | LastEditTime: 2022-03-23 21:24:58 6 | Description: Start DDP in tcp model 7 | ''' 8 | from datetime import datetime 9 | import argparse 10 | import torchvision 11 | import torchvision.transforms as transforms 12 | import torch 13 | import torch.nn as nn 14 | import torch.distributed as dist 15 | from tqdm import tqdm 16 | from torch.cuda.amp import GradScaler 17 | 18 | def main(): 19 | parser = argparse.ArgumentParser() 20 | parser.add_argument('-g', '--gpuid', default=0, type=int, 21 | help="which gpu to use") 22 | parser.add_argument('-e', '--epochs', default=1, type=int, 23 | metavar='N', 24 | help='number of total epochs to run') 25 | parser.add_argument('-b', '--batch_size', default=4, type=int, 26 | metavar='N', 27 | help='number of batchsize') 28 | ################################################################################## 29 | parser.add_argument('--init_method', default='tcp://localhost:18888', # 30 | help="init-method") # 31 | parser.add_argument('-r', '--rank', default=0, type=int, # 32 | help='rank of current process') # 33 | parser.add_argument('--world_size', default=2, type=int, # 34 | help="world size") # 35 | 
parser.add_argument('--use_mix_precision', default=False, # 36 | action='store_true', help="whether to use mix precision") # 37 | ################################################################################## 38 | args = parser.parse_args() 39 | train(args.gpuid, args) 40 | 41 | class ConvNet(nn.Module): 42 | def __init__(self, num_classes=10): 43 | super(ConvNet, self).__init__() 44 | self.layer1 = nn.Sequential( 45 | nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2), 46 | nn.BatchNorm2d(16), 47 | nn.ReLU(), 48 | nn.MaxPool2d(kernel_size=2, stride=2)) 49 | self.layer2 = nn.Sequential( 50 | nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2), 51 | nn.BatchNorm2d(32), 52 | nn.ReLU(), 53 | nn.MaxPool2d(kernel_size=2, stride=2)) 54 | self.fc = nn.Linear(7*7*32, num_classes) 55 | 56 | def forward(self, x): 57 | out = self.layer1(x) 58 | out = self.layer2(out) 59 | out = out.reshape(out.size(0), -1) 60 | out = self.fc(out) 61 | return out 62 | 63 | 64 | #################################### N11 ################################## 65 | def evaluate(model, gpu, test_loader, rank): 66 | model.eval() 67 | size = torch.tensor(0.).to(gpu) 68 | correct = torch.tensor(0.).to(gpu) 69 | with torch.no_grad(): 70 | for i, (images, labels) in enumerate(tqdm(test_loader)): 71 | images = images.to(gpu) 72 | labels = labels.to(gpu) 73 | # Forward pass 74 | outputs = model(images) 75 | size += images.shape[0] 76 | correct += (outputs.argmax(1) == labels).type(torch.float).sum() 77 | # Collective communication: reduce the sample count onto rank 0; change to allreduce if using Gloo 78 | dist.reduce(size, 0, op=dist.ReduceOp.SUM) 79 | # Collective communication: reduce the correct count onto rank 0; change to allreduce if using Gloo 80 | dist.reduce(correct, 0, op=dist.ReduceOp.SUM) 81 | if rank == 0: 82 | print('Evaluate accuracy is {:.2f}'.format(correct / size)) 83 | ################################################################################# 84 | 85 | def train(gpu, args): 86 | ######################################## N1
#################################################################### 87 | dist.init_process_group(backend='nccl', init_method=args.init_method, rank=args.rank, world_size=args.world_size) # 88 | ###################################################################################################################### 89 | model = ConvNet() 90 | model.cuda(gpu) 91 | # Wrap the model 92 | ####################################### N2 ######################## 93 | model = nn.SyncBatchNorm.convert_sync_batchnorm(model) # 94 | model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu]) # 95 | scaler = GradScaler(enabled=args.use_mix_precision) # 96 | ######################################################################### 97 | # define loss function (criterion) and optimizer 98 | criterion = nn.CrossEntropyLoss().to(gpu) 99 | optimizer = torch.optim.SGD(model.parameters(), 1e-4) 100 | # Data loading code 101 | train_dataset = torchvision.datasets.MNIST(root='./data', 102 | train=True, 103 | transform=transforms.ToTensor(), 104 | download=True) 105 | #################################### N3 ####################################### 106 | train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) # 107 | train_loader = torch.utils.data.DataLoader(dataset=train_dataset, # 108 | batch_size=args.batch_size, # 109 | shuffle=False, # 110 | num_workers=0, # 111 | pin_memory=True, # 112 | sampler=train_sampler) # 113 | ##################################################################################### 114 | 115 | #################################### N9 ################################### 116 | test_dataset = torchvision.datasets.MNIST(root='./data', # 117 | train=False, # 118 | transform=transforms.ToTensor(), # 119 | download=True) # 120 | test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset) # 121 | test_loader = torch.utils.data.DataLoader(dataset=test_dataset, # 122 | batch_size=args.batch_size, # 123 | shuffle=False, # 124 | 
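num_workers=0, # 125 | pin_memory=True, # 126 | sampler=test_sampler) # 127 | ################################################################################# 128 | start = datetime.now() 129 | total_step = len(train_loader) # The number changes to original_length // args.world_size 130 | for epoch in range(args.epochs): 131 | ################ N4 ################ 132 | train_loader.sampler.set_epoch(epoch) # 133 | ########################################## 134 | model.train() 135 | for i, (images, labels) in enumerate(tqdm(train_loader)): 136 | images = images.to(gpu) 137 | labels = labels.to(gpu) 138 | # Forward pass 139 | ######################## N5 ################################ 140 | with torch.cuda.amp.autocast(enabled=args.use_mix_precision): # 141 | outputs = model(images) # 142 | loss = criterion(outputs, labels) # 143 | ################################################################## 144 | # Backward and optimize 145 | optimizer.zero_grad() 146 | ############## N6 ########## 147 | scaler.scale(loss).backward() # 148 | scaler.step(optimizer) # 149 | scaler.update() # 150 | ################################## 151 | ################ N7 #################### 152 | if (i + 1) % 100 == 0 and args.rank == 0: # 153 | ############################################## 154 | print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step, 155 | loss.item())) 156 | evaluate(model, gpu, test_loader, args.rank) 157 | dist.destroy_process_group() 158 | ###### N8 ####### 159 | if args.rank == 0: # 160 | ####################### 161 | print("Training complete in: " + str(datetime.now() - start)) 162 | 163 | 164 | if __name__ == '__main__': 165 | main() 166 |
--------------------------------------------------------------------------------
/DDP-Tutorial.md:
--------------------------------------------------------------------------------

# A Detailed Hands-On Tutorial for Distributed Data Parallel

## Preface

Distributed Data Parallel (DDP) has been part of PyTorch for a long time, and current work on self-supervised learning and pre-training is hard to do without multi-machine parallelism. Judging from my own experience and that of my classmates, however, many people are still unfamiliar with how to use DDP, or feel they have no need for multi-machine, multi-GPU training. In fact, DDP also outperforms `DataParallel` on a single machine with multiple GPUs, and multi-machine multi-GPU training is an essential skill for training large models in industry.

While learning DDP, my main pain points were:

* The material in the official PyTorch documentation is scattered and sometimes overly terse, which makes it hard to follow for readers without prior experience in parallel training.
* Blog posts found through Google or Baidu vary widely in quality, and most only scratch the surface, so it is hard to get a systematic introduction.

After quite a few detours, I have written down what I have learned so far in this article. I hope it helps interested readers avoid some pitfalls and then read the official documentation more smoothly (the official PyTorch documentation is very complete).

My knowledge is limited, so the article inevitably contains some mistakes or misunderstandings. Corrections are welcome. Thank you.

***

## Background

The progress of deep learning has demonstrated the value of big data and big models. In both CV and NLP, the ability to train models on large-scale compute resources is increasingly important. GPUs speed up training because they perform matrix multiplications and additions faster than CPUs. But as datasets and parameter counts grow, a single GPU quickly becomes insufficient. We therefore need suitable ways to partition and replicate data and models across multiple GPUs, or even multiple compute nodes, to achieve shorter training cycles and larger models.

## Data-Parallel Training

Parallel training is an important way to address the problem above. The parallel-training functionality in `Pytorch` falls into three parts:

* **`nn.DataParallel`**
* **`nn.parallel.DistributedDataParallel`**
* `torch.distributed.rpc`

The first two are the focus of this article and are widely used in CV and NLP. RPC is a more general distributed solution, used mainly in areas such as reinforcement learning, and is out of scope here. `DataParallel` is the easiest parallel-training option: a single extra line of code is enough to train a model on multiple GPUs. In PyTorch, however, `DataParallel` is not the best parallel solution in either functionality or performance, and it falls short of `DistributedDataParallel` (DDP) in several respects.

## DataParallel

| ![Flow of DataParallel](images/DataParallel.png) |
|:--:|
| *Flow of DataParallel (figure from [Hugging Face](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255))* |

The figure above shows how `DataParallel` implements parallel training. In every forward pass, the model is copied from GPU-1 to the other GPUs, which introduces latency. In addition, after each GPU finishes its forward computation, its output logits are gathered on GPU-1. This gathering of logits is the root cause of the unbalanced GPU memory usage in `DataParallel`. The effect is not noticeable in tasks such as image classification, where the logits are low-dimensional, but in dense prediction tasks such as semantic segmentation and text translation, GPU-1 uses noticeably more memory than the other GPUs, which wastes GPU resources.

More importantly, `DataParallel` runs as a single process with multiple threads, so it suffers from contention for the `Python` GIL; this is not an ideal form of parallelism compared with running multiple processes. `DataParallel` also supports at most single-machine multi-GPU training, so it cannot train large models that require multiple machines.

The pros and cons of `DataParallel` can be summarized as follows:

### Pros

* Only one extra line of code is needed, which makes it convenient for prototyping

```Python
model = nn.DataParallel(model)
```

### Cons

* Latency from replicating the model in every forward pass
* Unbalanced GPU utilization
* Single-process, multi-threaded implementation
* No support for multi-machine multi-GPU training

## DistributedDataParallel

| ![Flow of DDP](images/DDP_Steps.png) |
|:--:|
| *Flow of DDP. Construction runs only once before training starts; only Forward and Backward are repeated during training.* |

The figure above shows the DDP workflow. DDP needs an additional stage that sets up the process group (Construction). In the Construction stage, the communication backend and the total number of processes must be specified first. The communication backend is the foundation DDP is built on, and it is introduced separately below. The total number of processes is the number of independent parallel processes, and is called the world size. Each process may use one or more GPUs as needed, but sharing one GPU among several processes is not recommended because it can hurt performance. To keep things simple, every example in this article assumes that each process uses exactly one GPU; handling multiple GPUs per process only requires adjusting the GPU mapping.

Once the process group is established, each process builds the model independently on its own GPU, and the model state on GPU-1 is then broadcast to all other processes so that every replica starts from the same state. Note that Construction runs only before training begins; during training only the forward and backward passes are iterated, so Construction adds no recurring latency.

Compared with `DataParallel`, the forward and backward passes of DDP are simpler. Inference, loss computation, and gradient computation are all done independently and in parallel. The core of DDP's parallel training is **gradient synchronization**. Gradients are synchronized across replicas with the `allreduce` collective operation, after which every GPU holds exactly the same gradients. As shown in step 2 of the backward pass in the figure, communication between GPUs is triggered (through hook functions) once gradients have been computed. Not shown in the figure: each GPU usually also creates its own optimizer. Because the replicas start from the same state and subsequently receive identical gradients, the models in different processes remain identical after every iteration, which guarantees the mathematical consistency of DDP.

To improve performance, DDP goes further in its design of the `allreduce` operation. Gradient computation and inter-process communication each take time, and waiting until the gradients of all parameters have been computed before communicating is clearly suboptimal. As shown in the figure below, DDP partitions the model parameters into multiple buckets and runs `allreduce` at bucket granularity. As soon as the gradients of bucket0 are ready in all processes, communication for that bucket starts while the gradients of bucket1 are still being computed. Computation and communication therefore overlap in time, which makes DDP training more efficient. To keep this article readable, the details of bucket sizing are omitted here; interested readers can continue with the [original paper](https://arxiv.org/pdf/2006.15704.pdf).

| ![How DDP partitions model parameters into multiple buckets](images/Bucket_Allreduce.png) |
|:--:|
| *How DDP partitions model parameters into multiple buckets (figure from [DDP Note](https://pytorch.org/docs/stable/notes/ddp.html))* |

Finally, we introduce the communication layer of DDP. Communication in DDP is provided by several backends written in C++. Different backends support different communication operators, and you can choose among them according to your needs.

| ![DDP implementation block diagram](images/DDP_BuildBlock.png) |
|:--:|
| *Block diagram of the DDP implementation (figure from the [paper](https://arxiv.org/pdf/2006.15704.pdf))* |

For CV and NLP tasks that are typically trained on GPUs, choosing between Gloo and NCCL is sufficient. One deciding factor is the network environment of your cluster:

* **Ethernet (the environment on most machines)**: prefer NCCL, which performs better; if you run into NCCL communication problems, fall back to Gloo. (Rule of thumb: on a single machine with multiple GPUs, use NCCL directly; on multiple machines, try NCCL first, and switch to Gloo if there are communication problems you cannot resolve.)
* **InfiniBand**: only NCCL is supported.

The other deciding factor is that the two backends support different sets of operators, so the choice also depends on what your code needs. The figure below lists the operators each backend supports: Gloo can handle the most basic DDP training on GPUs, while NCCL supports a wider range of operators.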
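
As a rough illustration of the backend rule of thumb above, the sketch below picks a backend name and the matching collective for summing evaluation counters. The helper names `choose_backend` and `metric_collective` are hypothetical and not part of PyTorch; in a real script the first two flags could come from `torch.cuda.is_available()` and `torch.distributed.is_nccl_available()`. The fallback to `all_reduce` under Gloo follows the comment in this repository's `evaluate` function and is an assumption, not an exhaustive reading of the operator table.

```Python
def choose_backend(use_gpu: bool, nccl_available: bool, nccl_comm_ok: bool = True) -> str:
    """Pick a backend name for init_process_group (hypothetical helper)."""
    # Prefer NCCL for GPU training; fall back to Gloo when NCCL is missing
    # or its communication turned out to be unreliable on this cluster.
    if use_gpu and nccl_available and nccl_comm_ok:
        return "nccl"
    return "gloo"


def metric_collective(backend: str) -> str:
    """Name the collective used to sum evaluation counters (hypothetical helper)."""
    # Assumption taken from the repository's evaluate(): use reduce with NCCL,
    # and switch to all_reduce when running on Gloo.
    return "reduce" if backend == "nccl" else "all_reduce"


if __name__ == "__main__":
    # Three scenarios: healthy NCCL, NCCL with network trouble, CPU-only machine.
    for flags in [(True, True, True), (True, True, False), (False, False, True)]:
        backend = choose_backend(*flags)
        print(flags, "->", backend, "+", metric_collective(backend))
```

Keeping the decision in one small function means the same training script can be moved between clusters by changing only the flags that are passed in.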

| ![](images/Backends_Difference.png) |
|:--:|
| *Operator support of the different backends (figure from the [docs](https://pytorch.org/docs/stable/distributed.html))* |

In summary, thanks to its distributed design, DDP runs as multiple processes and is not affected by contention for the Python GIL. This also allows DDP to support multi-machine multi-GPU training. The pros and cons of DDP can be summarized as follows:

### Pros

* Faster training
* Runs as multiple processes
* Supports both single-machine multi-GPU and multi-machine multi-GPU training
* Balanced GPU usage

### Cons

* Requires more code and more design work

***

## Show Me the Code

This article first builds a minimal prototype based on MNIST image classification, then improves it step by step to support multi-machine multi-GPU training and mixed precision. The narrative borrows from [Kevin Kaichuang Yang's tutorial](https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html), but the implementation details differ considerably. In particular, this article adds a discussion of how DDP is launched and presents usage examples of collective communication between processes.

### Single-Process Example

First, import all the libraries that are used.

```Python
from datetime import datetime
import argparse
import torchvision
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.distributed as dist
from tqdm import tqdm
```

Define a simple convolutional neural network.

```Python
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out
```

Define the main function and add some optional command-line arguments for the launch script.

```Python
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-g', '--gpuid', default=0, type=int,
                        help="which gpu to use")
    parser.add_argument('-e', '--epochs', default=2, type=int,
                        metavar='N',
                        help='number of total epochs to run')
    parser.add_argument('-b', '--batch_size', default=4, type=int,
                        metavar='N',
                        help='number of batchsize')

    args = parser.parse_args()
    train(args.gpuid, args)
```

Next comes the training function in full.

```Python
def train(gpu, args):
    model = ConvNet()
    model.cuda(gpu)
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().to(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)

    # Data loading code
    train_dataset = torchvision.datasets.MNIST(root='./data',
                                               train=True,
                                               transform=transforms.ToTensor(),
                                               download=True)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                               batch_size=args.batch_size,
                                               shuffle=True,
                                               num_workers=0,
                                               pin_memory=True,
                                               sampler=None)

    start = datetime.now()
    total_step = len(train_loader)
    for epoch in range(args.epochs):
        model.train()
        for i, (images, labels) in enumerate(tqdm(train_loader)):
            images = images.to(gpu)
            labels = labels.to(gpu)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
                                                                         loss.item()))
    print("Training complete in: " + str(datetime.now() - start))
```

Finally, make sure the main function is invoked.

```Python
if __name__ == '__main__':
    main()
```

That is our minimal MNIST image-classification prototype. Training on a specified single GPU can be started with the following command:

```Bash
python train.py -g 0
```

### Multi-Process Example

Before we start modifying the minimal prototype, a few things need to be explained. In a DDP implementation, one of the most important steps is initialization. Initialization corresponds to the Construction stage described above, and each process must specify several key parameters:

* **backend**: the communication backend, NCCL or Gloo
* **init_method**: the initialization method, TCP or environment variables (Env); it can be understood simply as where and how a process obtains the key parameters
* **world_size**: the total number of processes
* **rank**: the index of the current process among all processes

The choice of initialization method affects the launch portion of the code. This article gives examples for both TCP and ENV modes.

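
To make the four parameters concrete, here is a minimal sketch that assembles the keyword arguments for `torch.distributed.init_process_group` in both modes. The helper `build_init_kwargs` is hypothetical and written only for illustration; the default address `tcp://localhost:18888` mirrors the default in this repository's `mnist-tcp.py`, and the environment variable names are the ones that `env://` initialization reads. The sketch only builds the dictionary, so it can be tried without GPUs.

```Python
import os


def build_init_kwargs(mode, backend="nccl", rank=None, world_size=None,
                      address="localhost", port=18888):
    """Assemble kwargs for dist.init_process_group (hypothetical helper)."""
    if mode == "tcp":
        # TCP mode: every process is told its rank and the world size explicitly.
        if rank is None or world_size is None:
            raise ValueError("TCP mode needs an explicit rank and world_size")
        return {
            "backend": backend,
            "init_method": f"tcp://{address}:{port}",
            "rank": rank,
            "world_size": world_size,
        }
    if mode == "env":
        # ENV mode: rank and world size are read from environment variables,
        # which a launcher is expected to set for each process.
        required = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
        missing = [name for name in required if name not in os.environ]
        if missing:
            raise RuntimeError("missing environment variables: " + ", ".join(missing))
        return {"backend": backend, "init_method": "env://"}
    raise ValueError("mode must be 'tcp' or 'env'")


if __name__ == "__main__":
    # In a training script the result would be passed on as
    # dist.init_process_group(**kwargs).
    print(build_init_kwargs("tcp", rank=0, world_size=2))
```

The difference between the two modes is visible in the returned dictionaries: TCP mode carries `rank` and `world_size` itself, while ENV mode leaves them to the environment.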