├── README.md
├── images
    ├── .DS_Store
    ├── DDP_Steps.png
    ├── DDP_BuildBlock.png
    ├── DataParallel.png
    ├── Bucket_Allreduce.png
    ├── Backends_Difference.png
    └── Collective_Communication.png
├── .gitignore
├── codes
    ├── mnist.py
    ├── mnist-env.py
    └── mnist-tcp.py
└── DDP-Tutorial.md


/README.md:
--------------------------------------------------------------------------------
1 | # DDP-Tutorial
2 | A tutorial for DDP in PyTorch.
3 | 


--------------------------------------------------------------------------------
/images/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/.DS_Store


--------------------------------------------------------------------------------
/images/DDP_Steps.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/DDP_Steps.png


--------------------------------------------------------------------------------
/images/DDP_BuildBlock.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/DDP_BuildBlock.png


--------------------------------------------------------------------------------
/images/DataParallel.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/DataParallel.png


--------------------------------------------------------------------------------
/images/Bucket_Allreduce.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/Bucket_Allreduce.png


--------------------------------------------------------------------------------
/images/Backends_Difference.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/Backends_Difference.png


--------------------------------------------------------------------------------
/images/Collective_Communication.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KaiiZhang/DDP-Tutorial/HEAD/images/Collective_Communication.png


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | build/
 12 | develop-eggs/
 13 | dist/
 14 | downloads/
 15 | eggs/
 16 | .eggs/
 17 | lib/
 18 | lib64/
 19 | parts/
 20 | sdist/
 21 | var/
 22 | wheels/
 23 | pip-wheel-metadata/
 24 | share/python-wheels/
 25 | *.egg-info/
 26 | .installed.cfg
 27 | *.egg
 28 | MANIFEST
 29 | 
 30 | # PyInstaller
 31 | #  Usually these files are written by a python script from a template
 32 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 33 | *.manifest
 34 | *.spec
 35 | 
 36 | # Installer logs
 37 | pip-log.txt
 38 | pip-delete-this-directory.txt
 39 | 
 40 | # Unit test / coverage reports
 41 | htmlcov/
 42 | .tox/
 43 | .nox/
 44 | .coverage
 45 | .coverage.*
 46 | .cache
 47 | nosetests.xml
 48 | coverage.xml
 49 | *.cover
 50 | *.py,cover
 51 | .hypothesis/
 52 | .pytest_cache/
 53 | 
 54 | # Translations
 55 | *.mo
 56 | *.pot
 57 | 
 58 | # Django stuff:
 59 | *.log
 60 | local_settings.py
 61 | db.sqlite3
 62 | db.sqlite3-journal
 63 | 
 64 | # Flask stuff:
 65 | instance/
 66 | .webassets-cache
 67 | 
 68 | # Scrapy stuff:
 69 | .scrapy
 70 | 
 71 | # Sphinx documentation
 72 | docs/_build/
 73 | 
 74 | # PyBuilder
 75 | target/
 76 | 
 77 | # Jupyter Notebook
 78 | .ipynb_checkpoints
 79 | 
 80 | # IPython
 81 | profile_default/
 82 | ipython_config.py
 83 | 
 84 | # pyenv
 85 | .python-version
 86 | 
 87 | # pipenv
 88 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 89 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 90 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 91 | #   install all needed dependencies.
 92 | #Pipfile.lock
 93 | 
 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
 95 | __pypackages__/
 96 | 
 97 | # Celery stuff
 98 | celerybeat-schedule
 99 | celerybeat.pid
100 | 
101 | # SageMath parsed files
102 | *.sage.py
103 | 
104 | # Environments
105 | .env
106 | .venv
107 | env/
108 | venv/
109 | ENV/
110 | env.bak/
111 | venv.bak/
112 | 
113 | # Spyder project settings
114 | .spyderproject
115 | .spyproject
116 | 
117 | # Rope project settings
118 | .ropeproject
119 | 
120 | # mkdocs documentation
121 | /site
122 | 
123 | # mypy
124 | .mypy_cache/
125 | .dmypy.json
126 | dmypy.json
127 | 
128 | # Pyre type checker
129 | .pyre/
130 | 
131 | # Dataset
132 | /codes/data/
133 | 


--------------------------------------------------------------------------------
/codes/mnist.py:
--------------------------------------------------------------------------------
 1 | '''
 2 | Author: Kai Zhang
 3 | Date: 2022-03-19 15:12:04
 4 | LastEditors: Kai Zhang
 5 | LastEditTime: 2022-03-23 19:49:38
 6 | Description: demo of mnist classification 
 7 | '''
 8 | from datetime import datetime
 9 | import argparse
10 | import torchvision
11 | import torchvision.transforms as transforms
12 | import torch
13 | import torch.nn as nn
14 | from tqdm import tqdm
15 | 
16 | class ConvNet(nn.Module):
17 |     def __init__(self, num_classes=10):
18 |         super(ConvNet, self).__init__()
19 |         self.layer1 = nn.Sequential(
20 |             nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
21 |             nn.BatchNorm2d(16),
22 |             nn.ReLU(),
23 |             nn.MaxPool2d(kernel_size=2, stride=2))
24 |         self.layer2 = nn.Sequential(
25 |             nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
26 |             nn.BatchNorm2d(32),
27 |             nn.ReLU(),
28 |             nn.MaxPool2d(kernel_size=2, stride=2))
29 |         self.fc = nn.Linear(7*7*32, num_classes)
30 | 
31 |     def forward(self, x):
32 |         out = self.layer1(x)
33 |         out = self.layer2(out)
34 |         out = out.reshape(out.size(0), -1)
35 |         out = self.fc(out)
36 |         return out
37 | 
38 | def train(gpu, args):
39 |     model = ConvNet()
40 |     model.cuda(gpu)
41 |     # define loss function (criterion) and optimizer
42 |     criterion = nn.CrossEntropyLoss().to(gpu)
43 |     optimizer = torch.optim.SGD(model.parameters(), 1e-4)
44 | 
45 |     # Data loading code
46 |     train_dataset = torchvision.datasets.MNIST(root='./data',
47 |                                                train=True,
48 |                                                transform=transforms.ToTensor(),
49 |                                                download=True)
50 |     train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
51 |                                                batch_size=args.batch_size,
52 |                                                shuffle=True,
53 |                                                num_workers=0,
54 |                                                pin_memory=True,
55 |                                                sampler=None)
56 | 
57 |     start = datetime.now()
58 |     total_step = len(train_loader)
59 |     for epoch in range(args.epochs):
60 |         model.train()
61 |         for i, (images, labels) in enumerate(tqdm(train_loader)):
62 |             images = images.to(gpu)
63 |             labels = labels.to(gpu)
64 |             # Forward pass
65 |             outputs = model(images)
66 |             loss = criterion(outputs, labels)
67 | 
68 |             # Backward and optimize
69 |             optimizer.zero_grad()
70 |             loss.backward()
71 |             optimizer.step()
72 |             if (i + 1) % 100 == 0:
73 |                 print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
74 |                                                                    loss.item()))
75 |     print("Training complete in: " + str(datetime.now() - start))
76 | 
77 | def main():
78 |     parser = argparse.ArgumentParser()
79 |     parser.add_argument('-g', '--gpuid', default=0, type=int,
80 |                         help="which gpu to use")
81 |     parser.add_argument('-e', '--epochs', default=1, type=int,
82 |                         metavar='N',
83 |                         help='number of total epochs to run')
84 |     parser.add_argument('-b', '--batch_size', default=4, type=int,
85 |                         metavar='N',
86 |                         help='batch size')
87 | 
88 |     args = parser.parse_args()
89 |     train(args.gpuid, args)
90 | 
91 | if __name__ == '__main__':
92 |     main()
93 | 


--------------------------------------------------------------------------------
/codes/mnist-env.py:
--------------------------------------------------------------------------------
  1 | '''
  2 | Author: Kai Zhang
  3 | Date: 2022-03-19 16:05:57
  4 | LastEditors: Kai Zhang
  5 | LastEditTime: 2022-03-23 21:25:20
  6 | Description: Start DDP in env mode
  7 | '''
  8 | from datetime import datetime
  9 | import argparse
 10 | import torchvision
 11 | import torchvision.transforms as transforms
 12 | import torch
 13 | import torch.nn as nn
 14 | import torch.distributed as dist
 15 | from tqdm import tqdm
 16 | from torch.cuda.amp import GradScaler
 17 | 
 18 | 
 19 | def main():
 20 |     parser = argparse.ArgumentParser()
 21 |     parser.add_argument('-g', '--gpuid', default=0, type=int,
 22 |                         help="which gpu to use")
 23 |     parser.add_argument('-e', '--epochs', default=1, type=int,
 24 |                         metavar='N',
 25 |                         help='number of total epochs to run')
 26 |     parser.add_argument('-b', '--batch_size', default=4, type=int,
 27 |                         metavar='N',
 28 |                         help='batch size per process')
 29 |     ##################################################################################
 30 |     parser.add_argument("--local_rank", type=int,                                    #
 31 |                         help='rank in current node')                                 #
 32 |     parser.add_argument('--use_mix_precision', default=False,                        #
 33 |                         action='store_true', help="whether to use mix precision")    #
 34 |     ##################################################################################
 35 |     args = parser.parse_args()
 36 |     train(args.local_rank, args)
 37 | 
 38 | 
 39 | class ConvNet(nn.Module):
 40 |     def __init__(self, num_classes=10):
 41 |         super(ConvNet, self).__init__()
 42 |         self.layer1 = nn.Sequential(
 43 |             nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
 44 |             nn.BatchNorm2d(16),
 45 |             nn.ReLU(),
 46 |             nn.MaxPool2d(kernel_size=2, stride=2))
 47 |         self.layer2 = nn.Sequential(
 48 |             nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
 49 |             nn.BatchNorm2d(32),
 50 |             nn.ReLU(),
 51 |             nn.MaxPool2d(kernel_size=2, stride=2))
 52 |         self.fc = nn.Linear(7*7*32, num_classes)
 53 | 
 54 |     def forward(self, x):
 55 |         out = self.layer1(x)
 56 |         out = self.layer2(out)
 57 |         out = out.reshape(out.size(0), -1)
 58 |         out = self.fc(out)
 59 |         return out
 60 | 
 61 | 
 62 | ####################################    N11    ##################################
 63 | def evaluate(model, gpu, test_loader, rank):
 64 |     model.eval()
 65 |     size = torch.tensor(0.).to(gpu)
 66 |     correct = torch.tensor(0.).to(gpu)
 67 |     with torch.no_grad():
 68 |         for i, (images, labels) in enumerate(tqdm(test_loader)):
 69 |             images = images.to(gpu)
 70 |             labels = labels.to(gpu)
 71 |             # Forward pass
 72 |             outputs = model(images)
 73 |             size += images.shape[0]
 74 |             correct += (outputs.argmax(1) == labels).type(torch.float).sum()
 75 |     # Collective communication: reduce to rank 0 (switch to all_reduce when using Gloo)
 76 |     dist.reduce(size, 0, op=dist.ReduceOp.SUM)
 77 |     # Collective communication: reduce to rank 0 (switch to all_reduce when using Gloo)
 78 |     dist.reduce(correct, 0, op=dist.ReduceOp.SUM)
 79 |     if rank == 0:
 80 |         print('Evaluate accuracy is {:.2f}'.format(correct / size))
 81 |  #################################################################################
 82 | 
 83 | 
 84 | def train(gpu, args):
 85 |     ########################################    N1    ################
 86 |     dist.init_process_group(backend='nccl', init_method='env://')    #
 87 |     args.rank = dist.get_rank()                                      #
 88 |     ##################################################################
 89 |     model = ConvNet()
 90 |     model.cuda(gpu)
 91 |     # define loss function (criterion) and optimizer
 92 |     criterion = nn.CrossEntropyLoss().to(gpu)
 93 |     optimizer = torch.optim.SGD(model.parameters(), 1e-4)
 94 |     # Wrap the model
 95 |     #######################################    N2    ########################
 96 |     model = nn.SyncBatchNorm.convert_sync_batchnorm(model)                  #
 97 |     model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])    #
 98 |     scaler = GradScaler(enabled=args.use_mix_precision)                     #
 99 |     #########################################################################
100 |     # Data loading code
101 |     train_dataset = torchvision.datasets.MNIST(root='./data',
102 |                                                train=True,
103 |                                                transform=transforms.ToTensor(),
104 |                                                download=True)
105 |     ####################################    N3    #######################################
106 |     train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)      #
107 |     train_loader = torch.utils.data.DataLoader(dataset=train_dataset,                   #
108 |                                                batch_size=args.batch_size,              #
109 |                                                shuffle=False,                           #
110 |                                                num_workers=0,                           #
111 |                                                pin_memory=True,                         #
112 |                                                sampler=train_sampler)                   #
113 |     #####################################################################################
114 | 
115 |     ####################################    N9    ###################################
116 |     test_dataset = torchvision.datasets.MNIST(root='./data',                        #
117 |                                               train=False,                          #
118 |                                               transform=transforms.ToTensor(),      #
119 |                                               download=True)                        #
120 |     test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)    #
121 |     test_loader = torch.utils.data.DataLoader(dataset=test_dataset,                 #
122 |                                               batch_size=args.batch_size,           #
123 |                                               shuffle=False,                        #
124 |                                               num_workers=0,                        #
125 |                                               pin_memory=True,                      #
126 |                                               sampler=test_sampler)                 #
127 |     #################################################################################
128 |     start = datetime.now()
129 |     # The number of steps becomes roughly original_length // world_size
130 |     total_step = len(train_loader)
131 |     for epoch in range(args.epochs):
132 |         ################    N4    ################
133 |         train_loader.sampler.set_epoch(epoch)    #
134 |         ##########################################
135 |         model.train()
136 |         for i, (images, labels) in enumerate(tqdm(train_loader) if args.rank == 0 else train_loader):  # only use tqdm in rank0
137 |             images = images.to(gpu)
138 |             labels = labels.to(gpu)
139 |             # Forward pass
140 |             ########################    N5    ################################
141 |             with torch.cuda.amp.autocast(enabled=args.use_mix_precision):    #
142 |                 outputs = model(images)                                      #
143 |                 loss = criterion(outputs, labels)                            #
144 |             ##################################################################
145 |             # Backward and optimize
146 |             optimizer.zero_grad()
147 |             ##############    N6    ##########
148 |             scaler.scale(loss).backward()    #
149 |             scaler.step(optimizer)           #
150 |             scaler.update()                  #
151 |             ##################################
152 |             ################    N7    ####################
153 |             if (i + 1) % 100 == 0 and args.rank == 0:    #
154 |                 ##############################################
155 |                 print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
156 |                                                                          loss.item()))
157 |         evaluate(model, gpu, test_loader, args.rank)
158 |     ###########    N8    ############
159 |     dist.destroy_process_group()    #
160 |     if args.rank == 0:              #
161 |         #################################
162 |         print("Training complete in: " + str(datetime.now() - start))
163 | 
164 | 
165 | if __name__ == '__main__':
166 |     main()
167 | 


--------------------------------------------------------------------------------
/codes/mnist-tcp.py:
--------------------------------------------------------------------------------
  1 | '''
  2 | Author: Kai Zhang
  3 | Date: 2022-03-19 16:05:57
  4 | LastEditors: Kai Zhang
  5 | LastEditTime: 2022-03-23 21:24:58
  6 | Description: Start DDP in tcp mode
  7 | '''
  8 | from datetime import datetime
  9 | import argparse
 10 | import torchvision
 11 | import torchvision.transforms as transforms
 12 | import torch
 13 | import torch.nn as nn
 14 | import torch.distributed as dist
 15 | from tqdm import tqdm
 16 | from torch.cuda.amp import GradScaler
 17 | 
 18 | def main():
 19 |     parser = argparse.ArgumentParser()
 20 |     parser.add_argument('-g', '--gpuid', default=0, type=int,
 21 |                         help="which gpu to use")
 22 |     parser.add_argument('-e', '--epochs', default=1, type=int,
 23 |                         metavar='N',
 24 |                         help='number of total epochs to run')
 25 |     parser.add_argument('-b', '--batch_size', default=4, type=int,
 26 |                         metavar='N',
 27 |                         help='batch size per process')
 28 |     ##################################################################################
 29 |     parser.add_argument('--init_method', default='tcp://localhost:18888',            #
 30 |                         help="init-method")                                          #
 31 |     parser.add_argument('-r', '--rank', default=0, type=int,                         #
 32 |                     help='rank of current process')                                  #
 33 |     parser.add_argument('--world_size', default=2, type=int,                         #
 34 |                         help="world size")                                           #
 35 |     parser.add_argument('--use_mix_precision', default=False,                        #
 36 |                         action='store_true', help="whether to use mix precision")    #
 37 |     ##################################################################################
 38 |     args = parser.parse_args()
 39 |     train(args.gpuid, args)
 40 | 
 41 | class ConvNet(nn.Module):
 42 |     def __init__(self, num_classes=10):
 43 |         super(ConvNet, self).__init__()
 44 |         self.layer1 = nn.Sequential(
 45 |             nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
 46 |             nn.BatchNorm2d(16),
 47 |             nn.ReLU(),
 48 |             nn.MaxPool2d(kernel_size=2, stride=2))
 49 |         self.layer2 = nn.Sequential(
 50 |             nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
 51 |             nn.BatchNorm2d(32),
 52 |             nn.ReLU(),
 53 |             nn.MaxPool2d(kernel_size=2, stride=2))
 54 |         self.fc = nn.Linear(7*7*32, num_classes)
 55 | 
 56 |     def forward(self, x):
 57 |         out = self.layer1(x)
 58 |         out = self.layer2(out)
 59 |         out = out.reshape(out.size(0), -1)
 60 |         out = self.fc(out)
 61 |         return out
 62 | 
 63 | 
 64 | ####################################    N11    ##################################
 65 | def evaluate(model, gpu, test_loader, rank):
 66 |     model.eval()
 67 |     size = torch.tensor(0.).to(gpu)
 68 |     correct = torch.tensor(0.).to(gpu)
 69 |     with torch.no_grad():
 70 |         for i, (images, labels) in enumerate(tqdm(test_loader)):
 71 |             images = images.to(gpu)
 72 |             labels = labels.to(gpu)
 73 |             # Forward pass
 74 |             outputs = model(images)
 75 |             size += images.shape[0]
 76 |             correct += (outputs.argmax(1) == labels).type(torch.float).sum()
 77 |     # Collective communication: reduce to rank 0 (switch to all_reduce when using Gloo)
 78 |     dist.reduce(size, 0, op=dist.ReduceOp.SUM)
 79 |     # Collective communication: reduce to rank 0 (switch to all_reduce when using Gloo)
 80 |     dist.reduce(correct, 0, op=dist.ReduceOp.SUM)
 81 |     if rank == 0:
 82 |         print('Evaluate accuracy is {:.2f}'.format(correct / size))
 83 |  #################################################################################
 84 | 
 85 | def train(gpu, args):
 86 |     ########################################    N1    ####################################################################
 87 |     dist.init_process_group(backend='nccl', init_method=args.init_method, rank=args.rank, world_size=args.world_size)    #
 88 |     ######################################################################################################################
 89 |     model = ConvNet()
 90 |     model.cuda(gpu)
 91 |     # Wrap the model
 92 |     #######################################    N2    ########################
 93 |     model = nn.SyncBatchNorm.convert_sync_batchnorm(model)                  #
 94 |     model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])    #
 95 |     scaler = GradScaler(enabled=args.use_mix_precision)                     #
 96 |     #########################################################################
 97 |     # define loss function (criterion) and optimizer
 98 |     criterion = nn.CrossEntropyLoss().to(gpu)
 99 |     optimizer = torch.optim.SGD(model.parameters(), 1e-4)
100 |     # Data loading code
101 |     train_dataset = torchvision.datasets.MNIST(root='./data',
102 |                                                train=True,
103 |                                                transform=transforms.ToTensor(),
104 |                                                download=True)
105 |     ####################################    N3    #######################################
106 |     train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)      #
107 |     train_loader = torch.utils.data.DataLoader(dataset=train_dataset,                   #
108 |                                                batch_size=args.batch_size,              #
109 |                                                shuffle=False,                           #
110 |                                                num_workers=0,                           #
111 |                                                pin_memory=True,                         #
112 |                                                sampler=train_sampler)                   #
113 |     #####################################################################################
114 | 
115 |     ####################################    N9    ###################################
116 |     test_dataset = torchvision.datasets.MNIST(root='./data',                        #
117 |                                                train=False,                         #
118 |                                                transform=transforms.ToTensor(),     #
119 |                                                download=True)                       #
120 |     test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)    #
121 |     test_loader = torch.utils.data.DataLoader(dataset=test_dataset,                 #
122 |                                                batch_size=args.batch_size,          #
123 |                                                shuffle=False,                       #
124 |                                                num_workers=0,                       #
125 |                                                pin_memory=True,                     #
126 |                                                sampler=test_sampler)                #
127 |     #################################################################################
128 |     start = datetime.now()
129 |     total_step = len(train_loader) # The number of steps becomes roughly original_length // args.world_size
130 |     for epoch in range(args.epochs):
131 |         ################    N4    ################
132 |         train_loader.sampler.set_epoch(epoch)    #
133 |         ##########################################
134 |         model.train()
135 |         for i, (images, labels) in enumerate(tqdm(train_loader)):
136 |             images = images.to(gpu)
137 |             labels = labels.to(gpu)
138 |             # Forward pass
139 |             ########################    N5    ################################
140 |             with torch.cuda.amp.autocast(enabled=args.use_mix_precision):    #
141 |                 outputs = model(images)                                      #
142 |                 loss = criterion(outputs, labels)                            #
143 |             ##################################################################
144 |             # Backward and optimize
145 |             optimizer.zero_grad()
146 |             ##############    N6    ##########
147 |             scaler.scale(loss).backward()    #
148 |             scaler.step(optimizer)           #
149 |             scaler.update()                  #
150 |             ##################################
151 |             ################    N7    ####################
152 |             if (i + 1) % 100 == 0 and args.rank == 0:    #
153 |             ##############################################
154 |                 print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
155 |                                                                    loss.item()))
156 |         evaluate(model, gpu, test_loader, args.rank)
157 |     dist.destroy_process_group()
158 |     ######    N8    #######
159 |     if args.rank == 0:    #
160 |     #######################
161 |         print("Training complete in: " + str(datetime.now() - start))
162 | 
163 | 
164 | if __name__ == '__main__':
165 |     main()
166 | 


--------------------------------------------------------------------------------
/DDP-Tutorial.md:
--------------------------------------------------------------------------------
  1 | <!--
  2 |  * @Author: Kai Zhang
  3 |  * @Date: 2022-03-20 15:01:01
  4 |  * @LastEditors: Kai Zhang
  5 |  * @LastEditTime: 2022-03-23 21:21:54
  6 |  * @Description: Tutorial for DDP
  7 | -->
  8 | # A Detailed Hands-On Tutorial for Distributed Data Parallel
  9 | 
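 10 | ## Preface
 11 | Distributed Data Parallel (DDP) has been available in PyTorch for a long time, and current work on self-supervised learning and pre-training depends on multi-machine parallelism. Yet, judging from my own experience and that of the students around me, many people are still unfamiliar with DDP or feel that they have no need for multi-node, multi-GPU training. In fact, DDP also outperforms `DataParallel` on a single machine with multiple GPUs, and multi-node, multi-GPU training is an essential skill for training large models in industry.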
 12 | 
 13 | While learning DDP, my main pain points were:
 14 | 
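 15 | * The relevant material in the official PyTorch documentation is scattered and sometimes too terse, which makes it hard to follow for readers without parallel-computing experience.
 16 | * Blog posts found through Google or Baidu vary widely in quality. Most only scratch the surface, so it is hard to learn the topic systematically.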
 17 | 
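 18 | After taking quite a few detours, I wrote down what I have learned so far in this article. I hope it helps interested readers avoid some pitfalls and then read the official documentation more smoothly (the official PyTorch documentation is very complete).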
 19 | 
 20 | My knowledge is limited, so this article will inevitably contain some errors or misunderstandings. Corrections are welcome. Thank you.
 21 | ***
 22 | 
 23 | ## Background
 24 | 
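 25 | The progress of deep learning has demonstrated the value of big data and large models. In both CV and NLP, the ability to train models on large-scale computing resources is becoming increasingly important. GPUs accelerate training because they perform matrix multiplication and addition faster than CPUs. But as data volume and parameter counts grow, a single GPU quickly becomes insufficient. We therefore need suitable ways to partition and replicate data and models across multiple GPUs, or even multiple compute nodes, in order to shorten training time and train models with more parameters.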
 26 | 
 27 | ## Data-Parallel Training
 28 | 
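 29 | Parallel training is an important way to address the problems above. In `PyTorch`, parallel training is covered by the following three components: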
 30 | 
 31 | * **`nn.DataParallel`**
 32 | * **`nn.parallel.DistributedDataParallel`**
 33 | * `torch.distributed.rpc`
 34 | 
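 35 | The first two are the focus of this article and are widely used in CV and NLP. RPC is a more general distributed solution, used mainly in areas such as reinforcement learning, and is outside the scope of this article. `DataParallel` is the simplest parallel training option: adding one line of code is enough to train a model on multiple GPUs. However, in PyTorch, `DataParallel` is not the best parallel solution in either functionality or performance, and it has many shortcomings compared with `DistributedDataParallel` (DDP).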
 36 | 
 37 | ## DataParallel
 38 | 
 39 | | ![figure](images/DataParallel.png) |
 40 | |:--:|
 41 | | *Flow diagram of DataParallel (figure from [Hugging Face](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255))* |
 42 | 
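 43 | The figure above shows how `DataParallel` implements parallel training. In every forward pass, the model is replicated from GPU-1 to the other GPUs, which introduces latency. Then, after each GPU finishes its forward computation, the output logits are gathered on GPU-1. Gathering the logits is the root cause of the unbalanced GPU memory usage in `DataParallel`. The effect is not obvious in tasks such as image classification, where the logits are small, but in dense prediction tasks such as semantic segmentation and machine translation, GPU-1 uses significantly more memory than the other GPUs. This wastes GPU resources.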
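
To get a feel for how much the gather step can cost, the following back-of-the-envelope sketch compares the size of the gathered logits for a classification task and for a segmentation task. The batch sizes, class counts, and image resolution are illustrative assumptions rather than measurements from this tutorial, and `logits_bytes` is a hypothetical helper introduced only for this estimate.

```python
def logits_bytes(batch_size, num_classes, height=1, width=1, bytes_per_value=4):
    """Size in bytes of float32 logits shaped (batch, classes, height, width)."""
    return batch_size * num_classes * height * width * bytes_per_value


# Assumed workloads (illustrative numbers only).
# Image classification: one score per class for each image.
cls_total = logits_bytes(batch_size=256, num_classes=1000)
# Semantic segmentation: one score per class for each pixel.
seg_total = logits_bytes(batch_size=32, num_classes=21, height=512, width=512)

print(f"classification logits: {cls_total / 2**20:.2f} MiB")
print(f"segmentation logits:   {seg_total / 2**20:.2f} MiB")
```

Under these assumptions the segmentation logits gathered on GPU-1 are several hundred times larger than the classification logits, which is why the imbalance only becomes visible in dense prediction tasks.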
 44 | 
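 45 | More importantly, `DataParallel` is affected by GIL contention in `Python`, because it is implemented as a single process with multiple threads, which is not an ideal multi-process parallel implementation. In addition, `DataParallel` supports at most single-machine, multi-GPU training, so it cannot train large models that require multiple machines.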
 46 | 
 47 | We summarize the advantages and disadvantages of `DataParallel` as follows:
 48 | 
 49 | ### Advantages
 50 | 
 51 | * Only one added line of code is needed, which makes it convenient for prototyping
 52 | 
 53 | ```Python
 54 | model = nn.DataParallel(model)
 55 | ```
 56 | 
 57 | ### Disadvantages
 58 | 
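 59 | * Latency from replicating the model in every forward pass
 60 | * Unbalanced GPU utilization
 61 | * Single-process, multi-thread implementation
 62 | * No support for multi-node, multi-GPU training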
 63 |   
 64 | ## DistributedDataParallel
 65 | 
 66 | | ![figure](images/DDP_Steps.png)|
 67 | |:--:|
 68 | | *Flow diagram of DDP. Construction runs only once before training starts; only Forward and Backward are repeated during training* |
 69 | 
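 70 | The figure above shows the DDP workflow. DDP requires an additional stage that builds the process group (Construction). In the Construction stage, the communication backend and the total number of processes must be specified first. The communication backend is the foundation of DDP, and we introduce it separately later. The total number of processes is the number of independent parallel processes and is called the world size. Each process can use one or more GPUs as needed, but letting multiple processes share one GPU is not recommended because it can hurt performance. To keep things simple, all examples in this article assume that each process uses exactly one GPU. Using multiple GPUs per process only requires adjusting the GPU mapping.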
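
As a small illustration of the one-process-per-GPU setup, the sketch below computes the world size and the global rank of each process for a hypothetical cluster of 2 machines with 4 GPUs each. The helper names `world_size` and `global_rank` and the node-major numbering are assumptions for this example. They reflect a common launcher convention, not something DDP itself requires.

```python
def world_size(num_nodes, gpus_per_node):
    """Total number of processes when every process drives exactly one GPU."""
    return num_nodes * gpus_per_node


def global_rank(node_rank, local_rank, gpus_per_node):
    """Global rank under node-major numbering (an assumed convention)."""
    return node_rank * gpus_per_node + local_rank


# Hypothetical cluster: 2 machines with 4 GPUs each.
NUM_NODES, GPUS_PER_NODE = 2, 4
ranks = []
for node in range(NUM_NODES):
    for local in range(GPUS_PER_NODE):
        rank = global_rank(node, local, GPUS_PER_NODE)
        ranks.append(rank)
        print(f"node {node}, local rank (GPU id) {local} -> global rank {rank}")

print("world size:", world_size(NUM_NODES, GPUS_PER_NODE))
```

Here the local rank doubles as the GPU index on each machine, which is the mapping the examples in this article rely on.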
 71 | 
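 72 | After the process group is built, each GPU constructs the model independently, and then the model state on GPU-1 (the rank 0 process) is broadcast to all other processes so that all replicas start from the same state. Note that Construction runs only before training starts. Training only iterates the forward and backward passes, so Construction adds no extra latency to the training steps.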
 73 | 
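 74 | Compared with `DataParallel`, the forward and backward passes of DDP are simpler. Inference, loss computation, and gradient computation all run independently and in parallel. The core of DDP's parallel training is **gradient synchronization**. Gradients are synchronized across replicas with the `allreduce` collective operation, so every GPU obtains exactly the same gradients. As shown in step 2 of the backward pass in the figure, communication between GPUs is triggered (by hook functions) once the gradients have been computed. What the figure does not show is that each GPU usually also creates its own optimizer. Because the replicas start from the same state and then receive the same gradients, the models in different processes are identical after every iteration, which guarantees the mathematical consistency of DDP.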
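
The consistency argument above can be checked with a tiny pure-Python simulation. This is only a sketch of the idea, not how PyTorch implements it: `allreduce_mean` and `sgd_step` are hypothetical helpers, the workers are plain lists instead of processes, and the gradient values are made up.

```python
def allreduce_mean(per_worker_grads):
    """Simulate allreduce: sum across workers, divide by world size,
    and hand every worker its own copy of the same result."""
    num_workers = len(per_worker_grads)
    num_params = len(per_worker_grads[0])
    mean = [sum(g[i] for g in per_worker_grads) / num_workers
            for i in range(num_params)]
    return [list(mean) for _ in range(num_workers)]


def sgd_step(params, grads, lr):
    """Plain SGD update applied by each worker's own optimizer."""
    return [p - lr * g for p, g in zip(params, grads)]


# Identical initial state on 3 workers (the effect of the initial broadcast).
initial = [0.5, -1.0]
replicas = [list(initial) for _ in range(3)]

# Each worker sees a different data shard, so local gradients differ.
local_grads = [[0.25, 0.5], [0.75, -0.5], [0.5, 1.5]]

synced = allreduce_mean(local_grads)
replicas = [sgd_step(p, g, lr=0.1) for p, g in zip(replicas, synced)]

print("synced gradients:", synced[0])
print("replicas identical:", all(r == replicas[0] for r in replicas))
```

Although every worker starts from different local gradients, all of them apply the same averaged gradient, so the replicas remain identical after the update without ever exchanging parameters.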
 75 | 
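 76 | To improve performance, DDP refines the design of the `allreduce` operation. Gradient computation and inter-process communication each take a certain amount of time, and waiting until the gradients of all parameters have been computed before communicating is clearly not optimal. As shown in the figure below, DDP partitions the model parameters into multiple small buckets and runs `allreduce` at the bucket level. As soon as the gradients of bucket0 are ready in all processes, communication starts immediately, while the gradients of bucket1 are still being computed. This overlaps computation with communication and makes DDP training more efficient. To keep this article easy to read, we skip the design details of bucket sizing. Interested readers are encouraged to read the [original paper](https://arxiv.org/pdf/2006.15704.pdf).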
 77 | 
 78 | | ![figure](images/Bucket_Allreduce.png) |
 79 | |:--:|
 80 | | *How DDP partitions model parameters into multiple buckets (figure from [DDP Note](https://pytorch.org/docs/stable/notes/ddp.html))* |
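
The bucketing idea can be sketched as follows. This is a simplified illustration, not DDP's actual implementation: the function `assign_buckets` and the byte sizes are hypothetical. It walks the parameters in reverse order because gradients of the last layers become available first during the backward pass, and it closes a bucket once a size cap would be exceeded (in real DDP the cap is controlled by the `bucket_cap_mb` constructor argument).

```python
def assign_buckets(param_sizes, bucket_cap):
    """Group parameter indices into buckets, walking parameters in reverse.

    param_sizes: gradient size in bytes of each parameter, in model order.
    bucket_cap:  soft size limit of one bucket, in bytes.
    """
    buckets, current, current_size = [], [], 0
    for index in reversed(range(len(param_sizes))):
        size = param_sizes[index]
        # Close the current bucket when adding this parameter would overflow it.
        if current and current_size + size > bucket_cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


# Hypothetical gradient sizes of five parameters, listed in model order.
sizes = [400, 100, 300, 200, 250]
buckets = assign_buckets(sizes, bucket_cap=500)
print(buckets)  # [[4, 3], [2, 1], [0]]
```

The first bucket holds the last parameters of the model, so its `allreduce` can start while the gradients of the earlier layers are still being computed.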
 81 | 
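 82 | Finally, we introduce the communication layer of DDP. DDP communication is backed by several backends written in C++. Each backend supports a different set of communication operators, so you can choose one according to your needs.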
 83 | 
 84 | | ![figure](images/DDP_BuildBlock.png) |
 85 | |:--:|
 86 | | *Building blocks of DDP (figure from [Paper](https://arxiv.org/pdf/2006.15704.pdf))* |
 87 | 
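 88 | For CV and NLP tasks that are typically trained on GPUs, choosing between Gloo and NCCL is enough. One deciding factor is the network environment of your cluster:
 89 | 
 90 | * **Ethernet (the environment of most machines)**: prefer NCCL for its better performance; if you run into NCCL communication problems, fall back to Gloo. (Rule of thumb: use NCCL directly for single-machine multi-GPU; for multi-machine multi-GPU, try NCCL first, and switch to Gloo if communication fails and you cannot fix it.)
 91 | * **InfiniBand**: only NCCL is supported.
 92 |   
 93 | The other deciding factor is that the two backends support different sets of operators, so the choice also depends on what your code uses. The figure below lists the operators each backend supports: Gloo covers the most basic DDP training on GPUs, while NCCL supports a wider range of operators.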
 94 | 
 95 | | ![figure](images/Backends_Difference.png) |
 96 | |:--:|
 97 | | *Operator support of the different backends (figure from [Doc](https://pytorch.org/docs/stable/distributed.html))* |
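
The rule of thumb for picking a backend can be written down as a small helper. This is only a sketch: `choose_backend` and its `nccl_healthy` flag are hypothetical and encode the advice given here, not an official API. In a real script, `torch.distributed.is_nccl_available()` can supply the `nccl_available` value.

```python
def choose_backend(use_gpu, nccl_available, nccl_healthy=True):
    """Return the backend name to pass to dist.init_process_group.

    use_gpu:        whether training runs on GPUs.
    nccl_available: whether this PyTorch build ships NCCL.
    nccl_healthy:   set to False after hitting NCCL communication errors
                    that could not be fixed (hypothetical flag).
    """
    if use_gpu and nccl_available and nccl_healthy:
        return "nccl"
    # Fallback: Gloo still covers basic DDP training.
    return "gloo"


print(choose_backend(use_gpu=True, nccl_available=True))
print(choose_backend(use_gpu=True, nccl_available=True, nccl_healthy=False))
```

Note that falling back to Gloo also narrows the set of available collectives on GPU tensors, so code that calls operators beyond the basic ones may need adjusting.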
 98 | 
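 99 | In summary, thanks to its distributed design, DDP runs as multiple processes and is therefore not affected by contention on Python's GIL. This also allows DDP to support multi-machine multi-GPU training. The pros and cons of DDP are summarized below:
100 | 
101 | ### Pros
102 | 
103 | * Faster training
104 | * Multi-process execution
105 | * Supports both single-machine and multi-machine multi-GPU training
106 | * Balanced GPU usage
107 | 
108 | ### Cons
109 | 
110 | * Requires more code and design effort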
111 |   
112 | ***
113 | 
114 | ## Show Me the Code
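115 | This article first builds a minimal prototype based on MNIST image classification and then improves it step by step to support multi-machine multi-GPU training and mixed precision. The line of presentation follows [Kevin Kaichuang Yang's tutorial](https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html), but the implementation details differ considerably. In particular, this article adds a discussion of the ways to launch DDP and gives a usage example of collective communication between processes.
116 | 
117 | ### Single-process example
118 | 
119 | First, import all the libraries that are used.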
120 | 
121 | ```Python
122 | from datetime import datetime
123 | import argparse
124 | import torchvision
125 | import torchvision.transforms as transforms
126 | import torch
127 | import torch.nn as nn
128 | import torch.distributed as dist
129 | from tqdm import tqdm
130 | ```
131 | 
132 | Define a simple convolutional neural network.
133 | 
134 | ```Python
135 | class ConvNet(nn.Module):
136 |     def __init__(self, num_classes=10):
137 |         super(ConvNet, self).__init__()
138 |         self.layer1 = nn.Sequential(
139 |             nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
140 |             nn.BatchNorm2d(16),
141 |             nn.ReLU(),
142 |             nn.MaxPool2d(kernel_size=2, stride=2))
143 |         self.layer2 = nn.Sequential(
144 |             nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
145 |             nn.BatchNorm2d(32),
146 |             nn.ReLU(),
147 |             nn.MaxPool2d(kernel_size=2, stride=2))
148 |         self.fc = nn.Linear(7*7*32, num_classes)
149 | 
150 |     def forward(self, x):
151 |         out = self.layer1(x)
152 |         out = self.layer2(out)
153 |         out = out.reshape(out.size(0), -1)
154 |         out = self.fc(out)
155 |         return out
156 | ```
157 | 
158 | Define the main function and add some optional command-line arguments.
159 | 
160 | ```Python
161 | def main():
162 |     parser = argparse.ArgumentParser()
163 |     parser.add_argument('-g', '--gpuid', default=0, type=int,
164 |                         help="which gpu to use")
165 |     parser.add_argument('-e', '--epochs', default=2, type=int, 
166 |                         metavar='N',
167 |                         help='number of total epochs to run')
168 |     parser.add_argument('-b', '--batch_size', default=4, type=int, 
169 |                         metavar='N',
170 |                         help='number of batchsize')         
171 | 
172 |     args = parser.parse_args()
173 |     train(args.gpuid, args)
174 | ```
175 | 
176 | Then comes the training function in full.
177 | 
178 | ```Python
179 | def train(gpu, args):
180 |     model = ConvNet()
181 |     model.cuda(gpu)
182 |     # define loss function (criterion) and optimizer
183 |     criterion = nn.CrossEntropyLoss().to(gpu)
184 |     optimizer = torch.optim.SGD(model.parameters(), 1e-4)
185 | 
186 |     # Data loading code
187 |     train_dataset = torchvision.datasets.MNIST(root='./data',
188 |                                                train=True,
189 |                                                transform=transforms.ToTensor(),
190 |                                                download=True)
191 |     train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
192 |                                                batch_size=args.batch_size,
193 |                                                shuffle=True,
194 |                                                num_workers=0,
195 |                                                pin_memory=True,
196 |                                                sampler=None)
197 | 
198 |     start = datetime.now()
199 |     total_step = len(train_loader)
200 |     for epoch in range(args.epochs):
201 |         model.train()
202 |         for i, (images, labels) in enumerate(tqdm(train_loader)):
203 |             images = images.to(gpu)
204 |             labels = labels.to(gpu)
205 |             # Forward pass
206 |             outputs = model(images)
207 |             loss = criterion(outputs, labels)
208 | 
209 |             # Backward and optimize
210 |             optimizer.zero_grad()
211 |             loss.backward()
212 |             optimizer.step()
213 |             if (i + 1) % 100 == 0:
214 |                 print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
215 |                                                                    loss.item()))
216 |     print("Training complete in: " + str(datetime.now() - start))
217 | ```
218 | 
219 | Finally, make sure the main function is invoked.
220 | 
221 | ```Python
222 | if __name__ == '__main__':
223 |     main()
224 | ```
225 | 
226 | That is our minimal MNIST image classification prototype. It can be trained on a single specified GPU with the following command:
227 | 
228 | ```Bash
229 | python mnist.py -g 0
230 | ```
231 | 
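232 | ### Multi-process example
233 | 
234 | Before modifying the minimal prototype, a few things need to be explained. One of the most important steps in a DDP implementation is initialization. Initialization corresponds to the Construction stage described above, and each process has to specify several key parameters:
235 | 
236 | * **backend**: the communication backend, NCCL or Gloo
237 | * **init_method**: the initialization method, TCP or environment variables (Env); roughly, where and how a process obtains the key parameters
238 | * **world_size**: the total number of processes
239 | * **rank**: the index of the current process among all processes
240 | 
241 | The initialization method affects the launch part of the code. This article gives examples for both the TCP and the ENV mode.
242 | 
243 | <center>TCP mode</center>
244 | 
245 | Let us start with TCP. Pay attention to the parts of the code marked as changed: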
246 | 
247 | ```Python
248 | def main():
249 |     parser = argparse.ArgumentParser()
250 |     parser.add_argument('-g', '--gpuid', default=0, type=int,
251 |                         help="which gpu to use")
252 |     parser.add_argument('-e', '--epochs', default=1, type=int, 
253 |                         metavar='N',
254 |                         help='number of total epochs to run')
255 |     parser.add_argument('-b', '--batch_size', default=4, type=int, 
256 |                         metavar='N',
257 |                         help='number of batchsize')   
258 |     ##################################################################################
259 |     parser.add_argument('--init_method', default='tcp://localhost:18888',            #
260 |                         help="init-method")                                          #
261 |     parser.add_argument('-r', '--rank', default=0, type=int,                         #
262 |                     help='rank of current process')                                  #
263 |     parser.add_argument('--world_size', default=2, type=int,                         #
264 |                         help="world size")                                           #
265 |     parser.add_argument('--use_mix_precision', default=False,                        #
266 |                         action='store_true', help="whether to use mix precision")    #
267 |     ##################################################################################                  
268 |     args = parser.parse_args()
269 |     train(args.gpuid, args)
270 | ```
271 | 
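272 | The following arguments are added to the main function:
273 | 
274 | * args.init_method: a URL that specifies the initialization method. For TCP initialization its format is tcp://[IP]:[Port], where IP is the IP address of the machine that runs the rank=0 process and Port is any free port. In single-machine multi-GPU mode the IP can simply be localhost
275 | * args.rank: the index of the current process among all processes
276 | * args.world_size: the total number of processes
277 | * args.use_mix_precision: a boolean flag that controls whether mixed precision is used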
278 |   
279 | ```Python
280 | def train(gpu, args):
281 |     ########################################    N1    ####################################################################
282 |     dist.init_process_group(backend='nccl', init_method=args.init_method, rank=args.rank, world_size=args.world_size)    #
283 |     ######################################################################################################################
284 |     model = ConvNet()
285 |     model.cuda(gpu)
286 |     # define loss function (criterion) and optimizer
287 |     criterion = nn.CrossEntropyLoss().to(gpu)
288 |     optimizer = torch.optim.SGD(model.parameters(), 1e-4)
289 |     # Wrap the model
290 |     #######################################    N2    ########################
291 |     model = nn.SyncBatchNorm.convert_sync_batchnorm(model)                  #
292 |     model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])    #
293 |     scaler = torch.cuda.amp.GradScaler(enabled=args.use_mix_precision)      #
294 |     #########################################################################
295 |     # Data loading code
296 |     train_dataset = torchvision.datasets.MNIST(root='./data',
297 |                                                train=True,
298 |                                                transform=transforms.ToTensor(),
299 |                                                download=True)
300 |     ####################################    N3    #######################################
301 |     train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)      #
302 |     train_loader = torch.utils.data.DataLoader(dataset=train_dataset,                   #
303 |                                                batch_size=args.batch_size,              #
304 |                                                shuffle=False,                           #
305 |                                                num_workers=0,                           #
306 |                                                pin_memory=True,                         #
307 |                                                sampler=train_sampler)                   #
308 |     #####################################################################################
309 |     start = datetime.now()
310 |     total_step = len(train_loader) # The number changes to original_length // args.world_size
311 |     for epoch in range(args.epochs):
312 |         ################    N4    ################
313 |         train_loader.sampler.set_epoch(epoch)    #
314 |         ##########################################
315 |         model.train()
316 |         for i, (images, labels) in enumerate(tqdm(train_loader)):
317 |             images = images.to(gpu)
318 |             labels = labels.to(gpu)
319 |             # Forward pass
320 |             ########################    N5    ################################
321 |             with torch.cuda.amp.autocast(enabled=args.use_mix_precision):    #
322 |                 outputs = model(images)                                      #
323 |                 loss = criterion(outputs, labels)                            #
324 |             ##################################################################  
325 |             # Backward and optimize
326 |             optimizer.zero_grad()
327 |             ##############    N6    ##########
328 |             scaler.scale(loss).backward()    #
329 |             scaler.step(optimizer)           #
330 |             scaler.update()                  #
331 |             ##################################
332 |             ################    N7    ####################
333 |             if (i + 1) % 100 == 0 and args.rank == 0:    #
334 |             ##############################################   
335 |                 print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
336 |                                                                    loss.item()))            
337 |     ############    N8    ###########
338 |     dist.destroy_process_group()    #                                       
339 |     if args.rank == 0:              #
340 |     #################################
341 |         print("Training complete in: " + str(datetime.now() - start))
342 | ```
343 | 
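344 | The following was added to or changed in the training function:
345 | 
346 | * N1: adds the DDP initialization code, which must specify backend, init_method, rank, and world_size. Their meanings were introduced above.
347 | * N2: in a parallel setting, a model that uses BN layers should have them converted to synchronized BN layers. Next, `DistributedDataParallel` wraps the model into a DDP model on the specified GPU; no change to the model's internal code is needed. Finally, the scaler for mixed precision is created, and its `enabled` argument controls whether it takes effect.
348 | * N3: DDP requires a `distributed.DistributedSampler`, created by wrapping `train_dataset` and passed as `sampler` when building the `DataLoader`. Note also that `shuffle=False`: shuffling in DDP is done through the `sampler`, see N4.
349 | * N4: reshuffles the data at the start of each epoch. (Note that total_step has become `original_length // args.world_size`.)
350 | * N5: uses `torch.cuda.amp.autocast` to control whether the forward pass runs in half precision.
351 | * N6: with mixed precision, the scaler scales the loss to keep gradients from underflowing to 0 because of the reduced precision.
352 | * N7: to avoid printing duplicate log messages, only the rank 0 process is allowed to print.
353 | * N8: cleans up the process group; the final message is then printed by rank 0 only, as in N7.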
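
To make N3 and N4 more concrete, here is a plain-Python sketch of how a distributed sampler can shard a dataset. It is a simplification under stated assumptions: `shard_indices` is a hypothetical helper, and it uses Python's `random` module instead of the generator that `DistributedSampler` uses internally, so the actual permutations differ. The structure (shuffle with a seed that depends on the epoch, pad to a multiple of the world size, take every `world_size`-th index) is the part being illustrated.

```python
import math
import random


def shard_indices(dataset_len, world_size, rank, epoch, seed=0, shuffle=True):
    """Return the sample indices one process iterates over in one epoch."""
    indices = list(range(dataset_len))
    if shuffle:
        # Same seed on every process, so all ranks agree on the permutation;
        # adding the epoch changes the order from epoch to epoch (cf. set_epoch).
        random.Random(seed + epoch).shuffle(indices)
    # Pad with repeated samples so every process gets the same number of items.
    num_samples = math.ceil(dataset_len / world_size)
    total_size = num_samples * world_size
    indices += indices[: total_size - len(indices)]
    # Interleaved split: rank r takes positions r, r + world_size, ...
    return indices[rank:total_size:world_size]


shards = [shard_indices(10, world_size=4, rank=r, epoch=0) for r in range(4)]
print(shards)

# Steps per epoch for MNIST (60000 training images), 4 processes, batch size 4.
steps = math.ceil(len(shard_indices(60000, 4, rank=0, epoch=0)) / 4)
print(steps)  # 3750
```

If `set_epoch` were never called, the seed would stay the same and every epoch would visit the samples in the same order, which is why N4 matters.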
354 | 
355 | Assume the environment consists of 2 servers (also called 2 nodes) with two GPUs each. The processes are launched as follows:
356 | 
357 | ```bash
358 | # Node 0 : ip 192.168.1.201  port : 12345
359 | # terminal-0
360 | python mnist-tcp.py --init_method tcp://192.168.1.201:12345 -g 0 --rank 0 --world_size 4 --use_mix_precision
361 | # terminal-1
362 | python mnist-tcp.py --init_method tcp://192.168.1.201:12345 -g 1 --rank 1 --world_size 4 --use_mix_precision
363 | 
364 | # Node 1 : 
365 | # terminal-0
366 | python mnist-tcp.py --init_method tcp://192.168.1.201:12345 -g 0 --rank 2 --world_size 4 --use_mix_precision
367 | # terminal-1
368 | python mnist-tcp.py --init_method tcp://192.168.1.201:12345 -g 1 --rank 3 --world_size 4 --use_mix_precision
369 | ```
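370 | Launching in TCP mode is easy to understand: every process is started separately in bash and assigned its own rank. The drawback is that launching becomes tedious when there are many processes. The complete script is available [here](https://github.com/KaiiZhang/DDP-Tutorial).
371 | ***
372 | <center>ENV mode</center>
373 | 
374 | Launching in ENV mode is more concise: a process does not need to pass its rank, world_size, and URL to `dist.init_process_group` by hand, because the program looks these values up in environment variables. The code is as follows: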
375 | 
376 | ```Python
377 | def main():
378 |     parser = argparse.ArgumentParser()
379 |     parser.add_argument('-g', '--gpuid', default=0, type=int,
380 |                         help="which gpu to use")
381 |     parser.add_argument('-e', '--epochs', default=1, type=int, 
382 |                         metavar='N',
383 |                         help='number of total epochs to run')
384 |     parser.add_argument('-b', '--batch_size', default=4, type=int, 
385 |                         metavar='N',
386 |                         help='number of batchsize')   
387 |     ##################################################################################
388 |     parser.add_argument("--local_rank", type=int,                                    #
389 |                         help='rank in current node')                                 #
390 |     parser.add_argument('--use_mix_precision', default=False,                        #
391 |                         action='store_true', help="whether to use mix precision")    #
392 |     ##################################################################################                  
393 |     args = parser.parse_args()
394 |     #################################
395 |     train(args.local_rank, args)    #
396 |     #################################
397 | ```
398 | 
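399 | * args.local_rank: the index of the current process on its own machine; note the difference from its index among all processes. In ENV mode this argument is required; it is assigned automatically by the launch script and does not need to be set by hand. Use `local_rank` to assign the GPU_ID.
400 | * `train(args.local_rank, args)`: in general, keep local_rank equal to the GPU_ID used by the process.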
401 |   
402 | ```Python
403 | def train(gpu, args):
404 |     ##################################################################
405 |     dist.init_process_group(backend='nccl', init_method='env://')    #
406 |     args.rank = dist.get_rank()                                      #
407 |     ##################################################################
408 |     model = ConvNet()
409 |     ...
410 | ```
411 | 
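412 | * In the training function only the initialization needs to change. In ENV mode it is enough to specify `init_method='env://'`. The key parameters that TCP mode takes explicitly are read automatically from environment variables, which can be set outside the program at launch time; see the launch commands.
413 | * The rank of the current process is available from `dist.get_rank()`
414 | * The rest of the code is identical to the TCP version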
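
As a rough picture of what `env://` initialization relies on, the sketch below reads the four variables that this mode expects: `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK`. The helper `read_dist_env` is hypothetical and exists only for illustration; PyTorch does this lookup internally, and the example values simply mirror the two-node setup used in this article.

```python
import os


def read_dist_env(env=None):
    """Collect the settings that env:// initialization reads from the environment."""
    env = os.environ if env is None else env
    return {
        "master_addr": env["MASTER_ADDR"],       # IP of the machine running rank 0
        "master_port": int(env["MASTER_PORT"]),  # free port on that machine
        "world_size": int(env["WORLD_SIZE"]),    # total number of processes
        "rank": int(env["RANK"]),                # global index of this process
    }


# Example values, as the launcher would export them for rank 2 of 4.
example_env = {
    "MASTER_ADDR": "192.168.1.201",
    "MASTER_PORT": "12345",
    "WORLD_SIZE": "4",
    "RANK": "2",
}
config = read_dist_env(example_env)
print(config)
```

A missing variable raises `KeyError` here, which is a convenient way to notice that a script was started directly instead of through the launcher.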
415 | 
416 | Assume the environment consists of 2 servers (also called 2 nodes) with two GPUs each. In ENV mode the processes are launched as follows:
417 | 
418 | ```bash
419 | # Node 0 : ip 192.168.1.201  port : 12345
420 | # terminal-0
421 | python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="192.168.1.201" --master_port=12345 mnist-env.py --use_mix_precision
422 | 
423 | # Node 1 : 
424 | # terminal-0
425 | python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr="192.168.1.201" --master_port=12345 mnist-env.py --use_mix_precision
426 | ```
427 | 
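428 | ENV mode can be launched with PyTorch's launch script `torch.distributed.launch`. The launch command has to specify several arguments:
429 | 
430 | * nproc_per_node: the number of processes to run on each machine
431 | * nnodes: the total number of machines
432 | * node_rank: the index of the current machine
433 | * master_addr: the IP of machine 0
434 | * master_port: a free port on machine 0
435 |   
436 | As you can see, a single command launches all processes on a machine no matter how many there are, which makes launching more concise than in TCP mode.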
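
The relation between these launch arguments and the rank of each process can be sketched in a few lines. The functions `global_rank` and `launch_plan` are hypothetical helpers for illustration; the formula itself is the usual mapping in which processes are numbered node by node.

```python
def global_rank(node_rank, nproc_per_node, local_rank):
    """Global rank of a process, numbering processes node by node."""
    return node_rank * nproc_per_node + local_rank


def launch_plan(nnodes, nproc_per_node):
    """List the rank assignment of every process started by the launcher."""
    world_size = nnodes * nproc_per_node
    return [
        {
            "node_rank": node,
            "local_rank": local,
            "rank": global_rank(node, nproc_per_node, local),
            "world_size": world_size,
        }
        for node in range(nnodes)
        for local in range(nproc_per_node)
    ]


# Two nodes with two GPUs each, as in the launch commands above.
for entry in launch_plan(nnodes=2, nproc_per_node=2):
    print(entry)
```

For this setup the plan matches the ranks that were assigned by hand in TCP mode: node 0 hosts ranks 0 and 1, node 1 hosts ranks 2 and 3.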
437 | 
438 | ***
439 | 
440 | <center>Bonus</center>
441 | 
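442 | The demo above can be used as a reference for converting existing training code to DDP. At its core, gradients are synchronized across the process group by the `all_reduce` collective hidden inside `loss.backward()`. To give a deeper understanding of DDP's collective communication and make it easier to customize such operations, this article implements parallel model validation as an example.
443 | 
444 | | ![figure](images/Collective_Communication.png) |
445 | |:--:|
446 | | *Collective communication (figure from [Doc](https://pytorch.org/tutorials/intermediate/dist_tuto.html))* |
447 | 
448 | Validating the model on a validation set during training is an essential step. How can validation code be added to the demo above, and how can the validation itself run in parallel?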
449 | 
450 | ```Python
451 | ####################################    N11    ##################################
452 | def evaluate(model, gpu, test_loader, rank):
453 |     model.eval()
454 |     size = torch.tensor(0.).to(gpu)
455 |     correct = torch.tensor(0.).to(gpu)
456 |     with torch.no_grad():
457 |         for i, (images, labels) in enumerate(tqdm(test_loader)):
458 |             images = images.to(gpu)
459 |             labels = labels.to(gpu)
460 |             outputs = model(images)
461 |             size += images.shape[0]
462 |             correct += (outputs.argmax(1) == labels).type(torch.float).sum() 
463 |     dist.reduce(size, 0, op=dist.ReduceOp.SUM) # collective reduce; change to all_reduce if using Gloo
464 |     dist.reduce(correct, 0, op=dist.ReduceOp.SUM) # collective reduce; change to all_reduce if using Gloo
465 |     if rank==0:
466 |         print('Evaluate accuracy is {:.2f}'.format(correct / size))
467 |  #################################################################################
468 | 
469 | def train(gpu, args):
470 |     ...
471 |     ####################################    N9    ###################################
472 |     test_dataset = torchvision.datasets.MNIST(root='./data',                        #
473 |                                                train=False,                         #
474 |                                                transform=transforms.ToTensor(),     #
475 |                                                download=True)                       #
476 |     test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)    #
477 |     test_loader = torch.utils.data.DataLoader(dataset=test_dataset,                 #
478 |                                                batch_size=args.batch_size,               #
479 |                                                shuffle=False,                       #
480 |                                                num_workers=0,                       #
481 |                                                pin_memory=True,                     #
482 |                                                sampler=test_sampler)                #
483 |     #################################################################################
484 |     start = datetime.now()
485 |     total_step = len(train_loader) # The number changes to original_length // args.world_size
486 |     for epoch in range(args.epochs):
487 |         ...
488 |         #####################    N10    #################
489 |         evaluate(model, gpu, test_loader, args.rank)    #
490 |         #################################################
491 |     ...        
492 | ```
493 | 
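494 | Unchanged code is omitted; see the [scripts](https://github.com/KaiiZhang/DDP-Tutorial) for the complete program.
495 | * N9: adds a DataLoader for the validation set and sets a sampler to split the data across processes
496 | * N10: validates the model at the end of each epoch
497 | * N11: uses the collective `Reduce` operation to gather the number of correct predictions and the global number of samples, both needed for the accuracy, on the rank 0 process
498 | 
499 | Parallel model validation only requires gathering the number of validation samples and the number of correct predictions on rank 0 with collective communication. Other tasks can follow the same idea; for example, to compute mIoU for semantic segmentation, sum the confusion matrices of all processes on rank 0.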
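
The mIoU case can be sketched without any GPUs. In the toy example below, `sum_confusion` stands in for a `dist.reduce(..., op=dist.ReduceOp.SUM)` over the per-process confusion matrices, and `mean_iou` computes the metric from the summed matrix. Both helpers and the small matrices are hypothetical and chosen only for illustration.

```python
def sum_confusion(matrices):
    """Element-wise sum of per-process confusion matrices (stand-in for reduce with SUM)."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]


def mean_iou(confusion):
    """Mean IoU, where confusion[i][j] counts pixels of true class i predicted as class j."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes that never occur
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Confusion matrices computed independently by two processes on their own shards.
rank0 = [[3, 1], [0, 2]]
rank1 = [[1, 1], [2, 2]]

total = sum_confusion([rank0, rank1])
print(total)            # [[4, 2], [2, 4]]
print(mean_iou(total))  # 0.5
```

Summing the matrices first and computing the metric once is the important design choice: averaging the mIoU values of the individual processes generally gives a different number, because IoU is not linear in the counts.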
500 | 
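501 | ## Problems you may encounter
502 | 
503 | A network firewall can cause communication between compute nodes to fail the first time you run multi-machine multi-GPU training. If code that runs fine on a single machine fails when extended to multiple machines, first try switching the backend to Gloo, which sidesteps some potential problems. Below are the problems I ran into in practice and their solutions.
504 | 
505 | ### [address family mismatch error](https://discuss.pytorch.org/t/runtimeerror-address-family-mismatch-when-use-gloo-backend/64753)
506 | 
507 | The solution is to set the network interface used for communication by hand. The interfaces of a machine can be listed with `ifconfig`; if there are several, try each of them.
508 | 
509 | When backend==NCCL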
510 | 
511 | ```Bash
512 | # Node 0 
513 | # terminal-0
514 | export NCCL_SOCKET_IFNAME=eth0
515 | python ...
516 | 
517 | # Node 1 : 
518 | # terminal-0
519 | export NCCL_SOCKET_IFNAME=eth0
520 | python ...
521 | ```
522 | 
523 | When backend==Gloo
524 | 
525 | ```Bash
526 | # Node 0 
527 | # terminal-0
528 | export GLOO_SOCKET_IFNAME=eth0
529 | python ...
530 | 
531 | # Node 1 : 
532 | # terminal-0
533 | export GLOO_SOCKET_IFNAME=eth0
534 | python ...
535 | ```
536 | 
537 | ## References
538 | 1. https://pytorch.org/docs/stable/distributed.html#choosing-the-network-interface-to-use
539 | 2. https://pytorch.org/tutorials/beginner/dist_overview.html
540 | 3. Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., ... & Chintala, S. (2020). Pytorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704.
541 | 4. https://zhuanlan.zhihu.com/p/76638962
542 | 5. https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html
543 | 6. https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255


--------------------------------------------------------------------------------