├── .gitignore ├── LICENSE ├── README.md ├── apex_distributed.py ├── assets ├── fig1_experimental_result.jpg └── fig2_allreduce.jpg ├── dataparallel.py ├── distributed.py ├── distributed_slurm_main.py ├── horovod_distributed.py ├── multiprocessing_distributed.py ├── requirements.txt ├── start.sh └── statistics.sh /.gitignore: -------------------------------------------------------------------------------- 1 | # Compiled python 2 | *.pyc 3 | *.pyd 4 | 5 | # Compiled MATLAB 6 | *.mex* 7 | 8 | # IPython notebook checkpoints 9 | .ipynb_checkpoints 10 | 11 | # Editor temporaries 12 | *.swn 13 | *.swo 14 | *.swp 15 | *~ 16 | 17 | # Sublime Text settings 18 | *.sublime-workspace 19 | *.sublime-project 20 | 21 | # Eclipse Project settings 22 | *.*project 23 | .settings 24 | 25 | # QtCreator files 26 | *.user 27 | 28 | # PyCharm files 29 | .idea 30 | 31 | # Visual Studio Code files 32 | .vscode 33 | .vs 34 | 35 | # OSX dir files 36 | .DS_Store 37 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2019-present, Zhi Zhang 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in 13 | all copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 21 | THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Distribution is all you need 2 | 3 | ## Take-Away 4 | 5 | 笔者使用 PyTorch 编写了不同加速库在 ImageNet 上的使用示例(单机多卡),需要的同学可以当作 quickstart 将需要的部分 copy 到自己的项目中(Github 请点击下面链接): 6 | 7 | 1. **[nn.DataParallel ](https://github.com/tczhangzhi/pytorch-distributed/blob/master/dataparallel.py) 简单方便的 nn.DataParallel** 8 | 2. **[torch.distributed](https://github.com/tczhangzhi/pytorch-distributed/blob/master/distributed.py) 使用 torch.distributed 加速并行训练** 9 | 3. **[torch.multiprocessing](https://github.com/tczhangzhi/pytorch-distributed/blob/master/multiprocessing_distributed.py) 使用 torch.multiprocessing 取代启动器** 10 | 4. **[apex](https://github.com/tczhangzhi/pytorch-distributed/blob/master/apex_distributed.py) 使用 apex 再加速** 11 | 5. **[horovod](https://github.com/tczhangzhi/pytorch-distributed/blob/master/horovod_distributed.py)** **horovod 的优雅实现** 12 | 6. **[slurm](https://github.com/tczhangzhi/pytorch-distributed/blob/master/distributed_slurm_main.py) GPU 集群上的分布式** 13 | 7. 
**补充:分布式 [evaluation](https://github.com/tczhangzhi/pytorch-distributed/blob/master/distributed.py)** 14 | 15 | 这里,笔者记录了使用 4 块 Tesla V100-PCIe 在 ImageNet 上进行的运行时间测试,测试结果发现 **Apex 的加速效果最好,但与 Horovod/Distributed 差别不大**,平时可以直接使用内置的 Distributed。**DataParallel 较慢,不推荐使用**。(后续会补上 V100/K80 上的测试结果,穿插了一些试验所以中断了) 16 | 17 | ![experimental_results](https://github.com/tczhangzhi/pytorch-distributed/blob/master/assets/fig1_experimental_result.jpg) 18 | 19 | 简要记录一下不同库的分布式训练方式: 20 | 21 | ## 简单方便的 nn.DataParallel 22 | 23 | > DataParallel 可以帮助我们(使用单进程控制)将模型和数据加载到多个 GPU 中,控制数据在 GPU 之间的流动,协同不同 GPU 上的模型进行并行训练(细粒度的方法有 scatter,gather 等等)。 24 | 25 | DataParallel 使用起来非常方便,我们只需要用 DataParallel 包装模型,再设置一些参数即可。需要定义的参数包括:参与训练的 GPU 有哪些,device_ids=gpus;用于汇总输出的 GPU 是哪个,output_device=gpus[0]。DataParallel 会自动帮我们将数据切分 load 到相应 GPU,将模型复制到相应 GPU,进行正向传播计算梯度并汇总: 26 | 27 | ``` 28 | model = nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0]) 29 | ``` 30 | 31 | 值得注意的是,模型和数据都需要先 load 进 GPU 中,DataParallel 的 module 才能对其进行处理,否则会报错: 32 | 33 | ``` 34 | # 这里要 model.cuda() 35 | model = nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0]) 36 | 37 | for epoch in range(100): 38 | for batch_idx, (images, target) in enumerate(train_loader): 39 | # 这里要 images/target.cuda() 40 | images = images.cuda(non_blocking=True) 41 | target = target.cuda(non_blocking=True) 42 | ... 43 | output = model(images) 44 | loss = criterion(output, target) 45 | ... 46 | optimizer.zero_grad() 47 | loss.backward() 48 | optimizer.step() 49 | ``` 50 | 51 | 汇总一下,DataParallel 并行训练部分主要与如下代码段有关: 52 | 53 | ``` 54 | # main.py 55 | import torch 56 | import torch.nn as nn 57 | import torch.optim as optim 58 | gpus = [0, 1, 2, 3] 59 | torch.cuda.set_device('cuda:{}'.format(gpus[0])) 60 | 61 | train_dataset = ... 62 | 63 | train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=...) 64 | 65 | model = ... 66 | model = nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0]) 67 | 68 | optimizer = optim.SGD(model.parameters()) 69 | 70 | for epoch in range(100): 71 | for batch_idx, (images, target) in enumerate(train_loader): 72 | images = images.cuda(non_blocking=True) 73 | target = target.cuda(non_blocking=True) 74 | ... 75 | output = model(images) 76 | loss = criterion(output, target) 77 | ...
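# 说明:反向传播时,DataParallel 会把各 GPU 副本上的梯度汇总到 gpus[0](原始模型参数所在的 GPU)上,optimizer 随后在这块卡上统一更新参数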
78 | optimizer.zero_grad() 79 | loss.backward() 80 | optimizer.step() 81 | ``` 82 | 83 | 在使用时,使用 python 执行即可: 84 | 85 | ``` 86 | python main.py 87 | ``` 88 | 89 | 在 ImageNet 上的完整训练代码,请点击[Github](https://link.zhihu.com/?target=https%3A//github.com/tczhangzhi/pytorch-distributed/blob/master/dataparallel.py)。 90 | 91 | ## 使用 torch.distributed 加速并行训练 92 | 93 | > 在 pytorch 1.0 之后,官方终于对分布式的常用方法进行了封装,支持 all-reduce,broadcast,send 和 receive 等等。通过 MPI 实现 CPU 通信,通过 NCCL 实现 GPU 通信。官方也曾经提到用 DistributedDataParallel 解决 DataParallel 速度慢,GPU 负载不均衡的问题,目前已经很成熟了~ 94 | 95 | 与 DataParallel 的单进程控制多 GPU 不同,在 distributed 的帮助下,我们只需要编写一份代码,torch 就会自动将其分配给 ![[公式]](https://www.zhihu.com/equation?tex=n) 个进程,分别在 ![[公式]](https://www.zhihu.com/equation?tex=n) 个 GPU 上运行。 96 | 97 | 在 API 层面,pytorch 为我们提供了 torch.distributed.launch 启动器,用于在命令行分布式地执行 python 文件。在执行过程中,启动器会将当前进程的(其实就是 GPU的)index 通过参数传递给 python,我们可以这样获得当前进程的 index: 98 | 99 | ``` 100 | parser = argparse.ArgumentParser() 101 | parser.add_argument('--local_rank', default=-1, type=int, 102 | help='node rank for distributed training') 103 | args = parser.parse_args() 104 | print(args.local_rank) 105 | ``` 106 | 107 | 接着,使用 init_process_group 设置GPU 之间通信使用的后端和端口: 108 | 109 | ``` 110 | dist.init_process_group(backend='nccl') 111 | ``` 112 | 113 | 之后,使用 DistributedSampler 对数据集进行划分。如此前我们介绍的那样,它能帮助我们将每个 batch 划分成几个 partition,在当前进程中只需要获取和 rank 对应的那个 partition 进行训练: 114 | 115 | ``` 116 | train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) 117 | 118 | train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler) 119 | ``` 120 | 121 | 然后,使用 DistributedDataParallel 包装模型,它能帮助我们为不同 GPU 上求得的梯度进行 all reduce(即汇总不同 GPU 计算所得的梯度,并同步计算结果)。all reduce 后不同 GPU 中模型的梯度均为 all reduce 之前各 GPU 梯度的均值: 122 | 123 | ``` 124 | model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank]) 125 | ``` 126 | 127 | 最后,把数据和模型加载到当前进程使用的 GPU 中,正常进行正反向传播: 128 | 129 | ``` 130 | torch.cuda.set_device(args.local_rank) 131 | 132 | model.cuda() 133 | 134 | for epoch in range(100): 135 | for batch_idx, (data, target) in enumerate(train_loader): 136 | images = images.cuda(non_blocking=True) 137 | target = target.cuda(non_blocking=True) 138 | ... 139 | output = model(images) 140 | loss = criterion(output, target) 141 | ... 142 | optimizer.zero_grad() 143 | loss.backward() 144 | optimizer.step() 145 | ``` 146 | 147 | 汇总一下,torch.distributed 并行训练部分主要与如下代码段有关: 148 | 149 | ``` 150 | # main.py 151 | import torch 152 | import argparse 153 | import torch.distributed as dist 154 | 155 | parser = argparse.ArgumentParser() 156 | parser.add_argument('--local_rank', default=-1, type=int, 157 | help='node rank for distributed training') 158 | args = parser.parse_args() 159 | 160 | dist.init_process_group(backend='nccl') 161 | torch.cuda.set_device(args.local_rank) 162 | 163 | train_dataset = ... 164 | train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) 165 | 166 | train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler) 167 | 168 | model = ... 169 | model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank]) 170 | 171 | optimizer = optim.SGD(model.parameters()) 172 | 173 | for epoch in range(100): 174 | for batch_idx, (data, target) in enumerate(train_loader): 175 | images = images.cuda(non_blocking=True) 176 | target = target.cuda(non_blocking=True) 177 | ... 178 | output = model(images) 179 | loss = criterion(output, target) 180 | ... 
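# 说明:loss.backward() 时,DistributedDataParallel 会自动对各进程的梯度做 all-reduce(求均值),因此各进程的 optimizer.step() 得到完全一致的参数更新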
181 | optimizer.zero_grad() 182 | loss.backward() 183 | optimizer.step() 184 | ``` 185 | 186 | 在使用时,调用 torch.distributed.launch 启动器启动: 187 | 188 | ``` 189 | CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py 190 | ``` 191 | 192 | 在 ImageNet 上的完整训练代码,请点击[Github](https://link.zhihu.com/?target=https%3A//github.com/tczhangzhi/pytorch-distributed/blob/master/distributed.py)。 193 | 194 | ## 使用 torch.multiprocessing 取代启动器 195 | 196 | > 有的同学可能比较熟悉 torch.multiprocessing,也可以手动使用 torch.multiprocessing 进行多进程控制。绕开 torch.distributed.launch 自动控制开启和退出进程的一些小毛病~ 197 | 198 | 使用时,只需要调用 torch.multiprocessing.spawn,torch.multiprocessing 就会帮助我们自动创建进程。如下面的代码所示,spawn 开启了 nprocs=4 个进程,每个进程执行 main_worker 并向其中传入 local_rank(当前进程 index)和 args(即 4 和 myargs)作为参数: 199 | 200 | ``` 201 | import torch.multiprocessing as mp 202 | 203 | mp.spawn(main_worker, nprocs=4, args=(4, myargs)) 204 | ``` 205 | 206 | 这里,我们直接将原本需要 torch.distributed.launch 管理的执行内容,封装进 main_worker 函数中,其中 proc 对应 local_rank(当前进程 index),进程数 nproc 对应 4, args 对应 myargs: 207 | 208 | ``` 209 | def main_worker(proc, nproc, args): 210 | 211 | dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=4, rank=gpu) 212 | torch.cuda.set_device(args.local_rank) 213 | 214 | train_dataset = ... 215 | train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) 216 | 217 | train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler) 218 | 219 | model = ... 220 | model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank]) 221 | 222 | optimizer = optim.SGD(model.parameters()) 223 | 224 | for epoch in range(100): 225 | for batch_idx, (data, target) in enumerate(train_loader): 226 | images = images.cuda(non_blocking=True) 227 | target = target.cuda(non_blocking=True) 228 | ... 229 | output = model(images) 230 | loss = criterion(output, target) 231 | ... 232 | optimizer.zero_grad() 233 | loss.backward() 234 | optimizer.step() 235 | ``` 236 | 237 | 在上面的代码中值得注意的是,由于没有 torch.distributed.launch 读取的默认环境变量作为配置,我们需要手动为 init_process_group 指定参数: 238 | 239 | ``` 240 | dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=4, rank=gpu) 241 | ``` 242 | 243 | 汇总一下,添加 multiprocessing 后并行训练部分主要与如下代码段有关: 244 | 245 | ``` 246 | # main.py 247 | import torch 248 | import torch.distributed as dist 249 | import torch.multiprocessing as mp 250 | 251 | mp.spawn(main_worker, nprocs=4, args=(4, myargs)) 252 | 253 | def main_worker(proc, nprocs, args): 254 | 255 | dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=4, rank=gpu) 256 | torch.cuda.set_device(args.local_rank) 257 | 258 | train_dataset = ... 259 | train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) 260 | 261 | train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler) 262 | 263 | model = ... 264 | model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank]) 265 | 266 | optimizer = optim.SGD(model.parameters()) 267 | 268 | for epoch in range(100): 269 | for batch_idx, (data, target) in enumerate(train_loader): 270 | images = images.cuda(non_blocking=True) 271 | target = target.cuda(non_blocking=True) 272 | ... 273 | output = model(images) 274 | loss = criterion(output, target) 275 | ... 
276 | optimizer.zero_grad() 277 | loss.backward() 278 | optimizer.step() 279 | ``` 280 | 281 | 在使用时,直接使用 python 运行就可以了: 282 | 283 | ``` 284 | python main.py 285 | ``` 286 | 287 | 在 ImageNet 上的完整训练代码,请点击[Github](https://link.zhihu.com/?target=https%3A//github.com/tczhangzhi/pytorch-distributed/blob/master/multiprocessing_distributed.py)。 288 | 289 | ## 使用 Apex 再加速 290 | 291 | > Apex 是 NVIDIA 开源的用于混合精度训练和分布式训练库。Apex 对混合精度训练的过程进行了封装,改两三行配置就可以进行混合精度的训练,从而大幅度降低显存占用,节约运算时间。此外,Apex 也提供了对分布式训练的封装,针对 NVIDIA 的 NCCL 通信库进行了优化。 292 | 293 | 在混合精度训练上,Apex 的封装十分优雅。直接使用 amp.initialize 包装模型和优化器,apex 就会自动帮助我们管理模型参数和优化器的精度了,根据精度需求不同可以传入其他配置参数。 294 | 295 | ``` 296 | from apex import amp 297 | 298 | model, optimizer = amp.initialize(model, optimizer) 299 | ``` 300 | 301 | 在分布式训练的封装上,Apex 在胶水层的改动并不大,主要是优化了 NCCL 的通信。因此,大部分代码仍与 torch.distributed 保持一致。使用的时候只需要将 torch.nn.parallel.DistributedDataParallel 替换为 apex.parallel.DistributedDataParallel 用于包装模型。在 API 层面,相对于 torch.distributed ,它可以自动管理一些参数(可以少传一点): 302 | 303 | ``` 304 | from apex.parallel import DistributedDataParallel 305 | 306 | model = DistributedDataParallel(model) 307 | # # torch.distributed 308 | # model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank]) 309 | # model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank) 310 | ``` 311 | 312 | 在正向传播计算 loss 时,Apex 需要使用 amp.scale_loss 包装,用于根据 loss 值自动对精度进行缩放: 313 | 314 | ``` 315 | with amp.scale_loss(loss, optimizer) as scaled_loss: 316 | scaled_loss.backward() 317 | ``` 318 | 319 | 汇总一下,Apex 的并行训练部分主要与如下代码段有关: 320 | 321 | ``` 322 | # main.py 323 | import torch 324 | import argparse 325 | import torch.distributed as dist 326 | 327 | from apex.parallel import DistributedDataParallel 328 | 329 | parser = argparse.ArgumentParser() 330 | parser.add_argument('--local_rank', default=-1, type=int, 331 | help='node rank for distributed training') 332 | args = parser.parse_args() 333 | 334 | dist.init_process_group(backend='nccl') 335 | torch.cuda.set_device(args.local_rank) 336 | 337 | train_dataset = ... 338 | train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) 339 | 340 | train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler) 341 | 342 | model = ... 343 | model, optimizer = amp.initialize(model, optimizer) 344 | model = DistributedDataParallel(model, device_ids=[args.local_rank]) 345 | 346 | optimizer = optim.SGD(model.parameters()) 347 | 348 | for epoch in range(100): 349 | for batch_idx, (data, target) in enumerate(train_loader): 350 | images = images.cuda(non_blocking=True) 351 | target = target.cuda(non_blocking=True) 352 | ... 
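# 说明:前向传播与普通训练一致;反向传播时 loss 需用 amp.scale_loss 包装,由 amp 做动态 loss scaling(见下文的 scaled_loss.backward())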
353 | output = model(images) 354 | loss = criterion(output, target) 355 | optimizer.zero_grad() 356 | with amp.scale_loss(loss, optimizer) as scaled_loss: 357 | scaled_loss.backward() 358 | optimizer.step() 359 | ``` 360 | 361 | 在使用时,调用 torch.distributed.launch 启动器启动: 362 | 363 | ``` 364 | CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py 365 | ``` 366 | 367 | 在 ImageNet 上的完整训练代码,请点击[Github](https://link.zhihu.com/?target=https%3A//github.com/tczhangzhi/pytorch-distributed/blob/master/apex_distributed.py)。 368 | 369 | ## Horovod 的优雅实现 370 | 371 | > Horovod 是 Uber 开源的深度学习工具,它的发展吸取了 Facebook "Training ImageNet In 1 Hour" 与百度 "Ring Allreduce" 的优点,可以无痛与 PyTorch/TensorFlow 等深度学习框架结合,实现并行训练。 372 | 373 | 在 API 层面,Horovod 和 torch.distributed 十分相似。在 mpirun 的基础上,Horovod 提供了自己封装的 horovodrun 作为启动器。 374 | 375 | 与 torch.distributed.launch 相似,我们只需要编写一份代码,horovodrun 启动器就会自动将其分配给 n 个进程,分别在 n 个 GPU 上运行。在执行过程中,启动器会将当前进程的(其实就是 GPU 的)index 注入 hvd,我们可以这样获得当前进程的 index: 376 | 377 | ``` 378 | import horovod.torch as hvd 379 | 380 | hvd.local_rank() 381 | ``` 382 | 383 | 与 init_process_group 相似,Horovod 使用 init 设置 GPU 之间通信使用的后端和端口: 384 | 385 | ``` 386 | hvd.init() 387 | ``` 388 | 389 | 接着,使用 DistributedSampler 对数据集进行划分。如此前我们介绍的那样,它能帮助我们将每个 batch 划分成几个 partition,在当前进程中只需要获取和 rank 对应的那个 partition 进行训练: 390 | 391 | ``` 392 | train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) 393 | 394 | train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler) 395 | ``` 396 | 397 | 之后,使用 broadcast_parameters 广播模型参数,将模型参数从编号为 root_rank 的 GPU 复制到所有其他 GPU 中: 398 | 399 | ``` 400 | hvd.broadcast_parameters(model.state_dict(), root_rank=0) 401 | ``` 402 | 403 | 然后,使用 DistributedOptimizer 包装优化器。它能帮助我们为不同 GPU 上求得的梯度进行 all reduce(即汇总不同 GPU 计算所得的梯度,并同步计算结果)。all reduce 后不同 GPU 中模型的梯度均为 all reduce 之前各 GPU 梯度的均值: 404 | 405 | ``` 406 | optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters(), compression=hvd.Compression.fp16) 407 | ``` 408 | 409 | 最后,把数据加载到当前 GPU 中。在编写代码时,我们只需要关注正常进行正向传播和反向传播: 410 | 411 | ``` 412 | torch.cuda.set_device(hvd.local_rank()) 413 | 414 | for epoch in range(100): 415 | for batch_idx, (images, target) in enumerate(train_loader): 416 | images = images.cuda(non_blocking=True) 417 | target = target.cuda(non_blocking=True) 418 | ... 419 | output = model(images) 420 | loss = criterion(output, target) 421 | ... 422 | optimizer.zero_grad() 423 | loss.backward() 424 | optimizer.step() 425 | ``` 426 | 427 | 汇总一下,Horovod 的并行训练部分主要与如下代码段有关: 428 | 429 | ``` 430 | # main.py 431 | import torch 432 | import horovod.torch as hvd 433 | 434 | hvd.init() 435 | torch.cuda.set_device(hvd.local_rank()) 436 | 437 | train_dataset = ... 438 | train_sampler = torch.utils.data.distributed.DistributedSampler( 439 | train_dataset, num_replicas=hvd.size(), rank=hvd.rank()) 440 | 441 | train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler) 442 | 443 | model = ... 444 | model.cuda() 445 | 446 | optimizer = optim.SGD(model.parameters()) 447 | 448 | optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters()) 449 | hvd.broadcast_parameters(model.state_dict(), root_rank=0) 450 | 451 | for epoch in range(100): 452 | for batch_idx, (images, target) in enumerate(train_loader): 453 | images = images.cuda(non_blocking=True) 454 | target = target.cuda(non_blocking=True) 455 | ...
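# 说明:hvd.DistributedOptimizer 会在梯度计算完成后、optimizer.step() 应用更新前,对各进程的梯度做 allreduce(求均值),作用与 DistributedDataParallel 相当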
456 | output = model(images) 457 | loss = criterion(output, target) 458 | ... 459 | optimizer.zero_grad() 460 | loss.backward() 461 | optimizer.step() 462 | ``` 463 | 464 | 在使用时,调用 horovodrun 启动器启动: 465 | 466 | ``` 467 | CUDA_VISIBLE_DEVICES=0,1,2,3 horovodrun -np 4 -H localhost:4 --verbose python main.py 468 | ``` 469 | 470 | 在 ImageNet 上的完整训练代码,请点击[Github](https://link.zhihu.com/?target=https%3A//github.com/tczhangzhi/pytorch-distributed/blob/master/horovod_distributed.py)。 471 | 472 | ## GPU 集群上的分布式 473 | 474 | > Slurm,是一个用于 Linux 系统的免费、开源的任务调度工具。它提供了三个关键功能。第一,为用户分配资源(计算机节点),以供用户执行工作。第二,它提供了一个框架,用于执行在节点上运行着的任务(通常是并行的任务),第三,为任务队列合理地分配资源。如果你还没有部署 Slurm 可以按照笔者总结的[部署教程](https://zhuanlan.zhihu.com/p/149771261)进行部署。 475 | 476 | 通过运行 slurm 的控制命令,slurm 会将写好的 python 程序在每个节点上分别执行,调用节点上定义的 GPU 资源进行运算。要编写能被 Slurm 在 GPU 集群上执行的 python 分布式训练程序,我们只需要对上文中多进程的 DistributedDataParallel 代码进行修改,告诉每一个执行的任务(每个节点上的 python 程序),要用哪些训练哪一部分数据,反向传播的结果如何合并就可以了。 477 | 478 | 我们首先需要获得每个任务(对应每个节点)的基本信息,以便针对任务的基本信息处理其应当负责的数据。在使用 slurm 执行 srun python 代码时,python 可以从环境变量 os.environ 中获取当前 python 进程的基本信息: 479 | 480 | ``` 481 | import os 482 | local_rank = os.environ['SLURM_PROCID'] # 当前任务的编号(比如节点 1 执行 1 号任务,节点 2 执行 2 号任务) 483 | world_size = os.environ['SLURM_NPROCS'] # 共开启的任务的总数(共有 2 个节点执行了 2 个任务) 484 | job_id = os.environ['SLURM_JOBID'] # 当前作业的编号(这是第 1 次执行 srun,编号为 1) 485 | ``` 486 | 487 | 在每个任务(节点)中,我们需要为节点中的每个 GPU 资源分配一个进程,管理该 GPU 应当处理的数据。 488 | 489 | 当前节点的 GPU 的数量可以由 torch.cuda 查询得到: 490 | 491 | ``` 492 | ngpus_per_node = torch.cuda.device_count() 493 | ``` 494 | 495 | 接着,与上文相似,我们使用 torch.multiprocessing 创建 ngpus_per_node 个进程,其中,每个进程执行的函数为 main_worker ,该函数调用所需要的由 args 传入: 496 | 497 | ``` 498 | mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args)) 499 | ``` 500 | 501 | 在编写 main_worker 时,我们首先需要解决的问题是:不同节点、或者同一节点间的不同进程之间需要通信来实现数据的分割、参数的合并。我们可以使用 pytorch 的 dist 库在共享文件系统上创建一个文件进行通信: 502 | 503 | ``` 504 | import torch.distributed as dist 505 | 506 | def main_worker(gpu, ngpus_per_node, args): 507 | dist_url = "file://dist_file.{}".format(job_id) 508 | rank = local_rank * ngpus_per_node + gpu 509 | dist.init_process_group(backend='nccl', init_method=dist_url, world_size=world_size, rank=rank) 510 | ... 511 | ``` 512 | 513 | 完成进程创建和通信后,下一步就是实现我们常用的 pipline 了,即加载模型、加载数据、正向传播、反向传播。与上文相似,这里,我们把模型加载进当前进程所对应的 GPU 中: 514 | 515 | ``` 516 | def main_worker(gpu, ngpus_per_node, args): 517 | dist_url = "file://dist_file.{}".format(job_id) 518 | rank = local_rank * ngpus_per_node + gpu 519 | dist.init_process_group(backend='nccl', init_method=dist_url, world_size=world_size, rank=rank) 520 | ... 521 | torch.cuda.set_device(gpu) 522 | model.cuda(gpu) 523 | model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu]) 524 | ``` 525 | 526 | 接着,把当前进程对应的数据段采样出来,也加载到对应的 GPU 中。同样可以使用 pytorch 的 dist 库实现这个采样过程: 527 | 528 | ``` 529 | def main_worker(gpu, ngpus_per_node, args): 530 | dist_url = "file://dist_file.{}".format(job_id) 531 | rank = local_rank * ngpus_per_node + gpu 532 | dist.init_process_group(backend='nccl', init_method=dist_url, world_size=world_size, rank=rank) 533 | ... 534 | torch.cuda.set_device(gpu) 535 | model.cuda(gpu) 536 | model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu]) 537 | ... 
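# 说明:DistributedSampler 会按照 init_process_group 中设定的 rank 和 world_size 划分数据,每个进程只读取自己负责的那一份数据分片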
538 | train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) 539 | train_loader = torch.utils.data.DataLoader(train_dataset, 540 | batch_size=args.batch_size, 541 | num_workers=2, 542 | pin_memory=True, 543 | sampler=train_sampler) 544 | for i, (images, target) in enumerate(train_loader): 545 | images = images.cuda(gpu, non_blocking=True) 546 | target = target.cuda(gpu, non_blocking=True) 547 | ``` 548 | 549 | 最后,进行正常的正向和反向传播: 550 | 551 | ``` 552 | def main_worker(gpu, ngpus_per_node, args): 553 | dist_url = "file://dist_file.{}".format(job_id) 554 | rank = local_rank * ngpus_per_node + gpu 555 | dist.init_process_group(backend='nccl', init_method=dist_url, world_size=world_size, rank=rank) 556 | ... 557 | torch.cuda.set_device(gpu) 558 | model.cuda(gpu) 559 | model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu]) 560 | ... 561 | train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) 562 | train_loader = torch.utils.data.DataLoader(train_dataset, 563 | batch_size=args.batch_size, 564 | num_workers=2, 565 | pin_memory=True, 566 | sampler=train_sampler) 567 | for i, (images, target) in enumerate(train_loader): 568 | images = images.cuda(gpu, non_blocking=True) 569 | target = target.cuda(gpu, non_blocking=True) 570 | ... 571 | output = model(images) 572 | loss = criterion(output, target) 573 | optimizer.zero_grad() 574 | loss.backward() 575 | optimizer.step() 576 | ``` 577 | 578 | 在使用时,调用 srun 启动任务: 579 | 580 | ``` 581 | srun -N2 --gres gpu:1 python distributed_slurm_main.py --dist-file dist_file 582 | ``` 583 | 584 | 在 ImageNet 上的完整训练代码,请点击[Github](https://github.com/tczhangzhi/pytorch-distributed/blob/master/distributed_slurm_main.py)。 585 | 586 | ## 分布式 evaluation 587 | 588 | > all_reduce, barrier 等 API 是 distributed 中更为基础和底层的 API。这些 API 可以帮助我们控制进程之间的交互,控制 GPU 数据的传输。在自定义 GPU 协作逻辑,汇总 GPU 间少量的统计信息时,大有用处。熟练掌握这些 API 也可以帮助我们自己设计、优化分布式训练、测试流程。 589 | 590 | 到目前为止,Distributed Sampler 能够帮助我们分发数据,DistributedDataParallel、hvd.broadcast_parameters 能够帮助我们分发模型,并在框架的支持下解决梯度汇总和参数更新的问题。然而,还有一些同学还有这样的疑惑, 591 | 592 | 1. 训练样本被切分成了若干个部分,被若干个进程分别控制运行在若干个 GPU 上,如何在进程间进行通信汇总这些(GPU 上的)信息? 593 | 2. 使用一张卡进行推理、测试太慢了,如何使用 Distributed 进行分布式地推理和测试,并将结果汇总在一起? 594 | 3. ...... 595 | 596 | 要解决这些问题,我们缺少一个更为基础的 API,**汇总记录不同 GPU 上生成的准确率、损失函数等指标信息**。这个 API 就是 `torch.distributed.all_reduce`。示意图如下: 597 | 598 | ![all_reduce](https://github.com/tczhangzhi/pytorch-distributed/blob/master/assets/fig2_allreduce.jpg) 599 | 600 | 具体来说,它的工作过程包含以下三步: 601 | 602 | 1. 通过调用 `all_reduce(tensor, op=...)`,当前进程会向其他进程发送 `tensor`(例如 rank 0 会发送 rank 0 的 tensor 到 rank 1、2、3) 603 | 2. 接受其他进程发来的 `tensor`(例如 rank 0 会接收 rank 1 的 tensor、rank 2 的 tensor、rank 3 的 tensor)。 604 | 3. 
在全部接收完成后,当前进程(例如 rank 0)会对当前进程的 `tensor` 和接收到的 `tensor`(例如 rank 0 的 tensor、rank 1 的 tensor、rank 2 的 tensor、rank 3 的 tensor)进行 `op`(例如求和)操作。 605 | 606 | 使用 `torch.distributed.all_reduce(loss, op=torch.distributed.ReduceOp.SUM)`,我们就能够对不同数据切片(不同 GPU 上的训练数据)的损失函数进行求和了。接着,我们只要再将其除以进程(GPU)数量 `world_size` 就可以得到损失函数的平均值。 607 | 608 | 正确率也能够通过同样方法进行计算: 609 | 610 | ``` 611 | # 原始代码 612 | output = model(images) 613 | loss = criterion(output, target) 614 | 615 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 616 | losses.update(loss.item(), images.size(0)) 617 | top1.update(acc1.item(), images.size(0)) 618 | top5.update(acc5.item(), images.size(0)) 619 | 620 | # 修改后,同步各 GPU 中数据切片的统计信息,用于分布式的 evaluation 621 | def reduce_tensor(tensor): 622 | rt = tensor.clone() 623 | dist.all_reduce(rt, op=dist.ReduceOp.SUM) 624 | rt /= args.world_size 625 | return rt 626 | 627 | output = model(images) 628 | loss = criterion(output, target) 629 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 630 | 631 | torch.distributed.barrier() 632 | 633 | reduced_loss = reduce_tensor(loss.data) 634 | reduced_acc1 = reduce_tensor(acc1) 635 | reduced_acc5 = reduce_tensor(acc5) 636 | 637 | losses.update(reduced_loss.item(), images.size(0)) 638 | top1.update(reduced_acc1.item(), images.size(0)) 639 | top5.update(reduced_acc5.item(), images.size(0)) 640 | ``` 641 | 642 | 值得注意的是,为了同步各进程的计算进度,我们在 reduce 之前插入了一个同步 API `torch.distributed.barrier()`。在所有进程运行到这一步之前,先完成此前代码的进程会等待其他进程。这使得我们能够得到准确、有序的输出。在 Horovod 中,我们无法使用 `torch.distributed.barrier()`,取而代之的是,我们可以在 allreduce 时为操作指定统一的 name,allreduce 本身的同步语义就能起到对齐各进程进度的作用: 643 | 644 | ``` 645 | def reduce_mean(tensor, world_size): 646 | rt = tensor.clone() 647 | hvd.allreduce(rt, name='barrier') 648 | rt /= world_size 649 | return rt 650 | 651 | output = model(images) 652 | loss = criterion(output, target) 653 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 654 | 655 | reduced_loss = reduce_mean(loss.data, hvd.size()) 656 | reduced_acc1 = reduce_mean(acc1, hvd.size()) 657 | reduced_acc5 = reduce_mean(acc5, hvd.size()) 658 | 659 | losses.update(reduced_loss.item(), images.size(0)) 660 | top1.update(reduced_acc1.item(), images.size(0)) 661 | top5.update(reduced_acc5.item(), images.size(0)) 662 | ``` -------------------------------------------------------------------------------- /apex_distributed.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import random 4 | import shutil 5 | import time 6 | import warnings 7 | 8 | import torch 9 | import torch.nn as nn 10 | import torch.nn.parallel 11 | import torch.backends.cudnn as cudnn 12 | import torch.distributed as dist 13 | import torch.optim 14 | import torch.multiprocessing as mp 15 | import torch.utils.data 16 | import torch.utils.data.distributed 17 | import torchvision.transforms as transforms 18 | import torchvision.datasets as datasets 19 | import torchvision.models as models 20 | 21 | from apex import amp 22 | from apex.parallel import DistributedDataParallel 23 | 24 | model_names = sorted(name for name in models.__dict__ 25 | if name.islower() and not name.startswith("__") 26 | and callable(models.__dict__[name])) 27 | 28 | parser = argparse.ArgumentParser(description='PyTorch ImageNet Training') 29 | parser.add_argument('--data', 30 | metavar='DIR', 31 | default='/home/zhangzhi/Data/exports/ImageNet2012', 32 | help='path to dataset') 33 | parser.add_argument('-a', 34 | '--arch', 35 | metavar='ARCH', 36 | default='resnet18', 37 | choices=model_names, 38 | help='model architecture: ' + ' | '.join(model_names) + 39 | ' (default: resnet18)') 40 | parser.add_argument('-j', 41 |
'--workers', 42 | default=4, 43 | type=int, 44 | metavar='N', 45 | help='number of data loading workers (default: 4)') 46 | parser.add_argument('--epochs', 47 | default=90, 48 | type=int, 49 | metavar='N', 50 | help='number of total epochs to run') 51 | parser.add_argument('--start-epoch', 52 | default=0, 53 | type=int, 54 | metavar='N', 55 | help='manual epoch number (useful on restarts)') 56 | parser.add_argument('-b', 57 | '--batch-size', 58 | default=3200, 59 | type=int, 60 | metavar='N', 61 | help='mini-batch size (default: 6400), this is the total ' 62 | 'batch size of all GPUs on the current node when ' 63 | 'using Data Parallel or Distributed Data Parallel') 64 | parser.add_argument('--lr', 65 | '--learning-rate', 66 | default=0.1, 67 | type=float, 68 | metavar='LR', 69 | help='initial learning rate', 70 | dest='lr') 71 | parser.add_argument('--momentum', 72 | default=0.9, 73 | type=float, 74 | metavar='M', 75 | help='momentum') 76 | parser.add_argument('--local_rank', 77 | default=-1, 78 | type=int, 79 | help='node rank for distributed training') 80 | parser.add_argument('--wd', 81 | '--weight-decay', 82 | default=1e-4, 83 | type=float, 84 | metavar='W', 85 | help='weight decay (default: 1e-4)', 86 | dest='weight_decay') 87 | parser.add_argument('-p', 88 | '--print-freq', 89 | default=10, 90 | type=int, 91 | metavar='N', 92 | help='print frequency (default: 10)') 93 | parser.add_argument('-e', 94 | '--evaluate', 95 | dest='evaluate', 96 | action='store_true', 97 | help='evaluate model on validation set') 98 | parser.add_argument('--pretrained', 99 | dest='pretrained', 100 | action='store_true', 101 | help='use pre-trained model') 102 | parser.add_argument('--seed', 103 | default=None, 104 | type=int, 105 | help='seed for initializing training. ') 106 | 107 | 108 | def reduce_mean(tensor, nprocs): 109 | rt = tensor.clone() 110 | dist.all_reduce(rt, op=dist.ReduceOp.SUM) 111 | rt /= nprocs 112 | return rt 113 | 114 | 115 | class data_prefetcher(): 116 | def __init__(self, loader): 117 | self.loader = iter(loader) 118 | self.stream = torch.cuda.Stream() 119 | self.mean = torch.tensor([0.485 * 255, 0.456 * 255, 120 | 0.406 * 255]).cuda().view(1, 3, 1, 1) 121 | self.std = torch.tensor([0.229 * 255, 0.224 * 255, 122 | 0.225 * 255]).cuda().view(1, 3, 1, 1) 123 | # With Amp, it isn't necessary to manually convert data to half. 124 | # if args.fp16: 125 | # self.mean = self.mean.half() 126 | # self.std = self.std.half() 127 | self.preload() 128 | 129 | def preload(self): 130 | try: 131 | self.next_input, self.next_target = next(self.loader) 132 | except StopIteration: 133 | self.next_input = None 134 | self.next_target = None 135 | return 136 | # if record_stream() doesn't work, another option is to make sure device inputs are created 137 | # on the main stream. 138 | # self.next_input_gpu = torch.empty_like(self.next_input, device='cuda') 139 | # self.next_target_gpu = torch.empty_like(self.next_target, device='cuda') 140 | # Need to make sure the memory allocated for next_* is not still in use by the main stream 141 | # at the time we start copying to next_*: 142 | # self.stream.wait_stream(torch.cuda.current_stream()) 143 | with torch.cuda.stream(self.stream): 144 | self.next_input = self.next_input.cuda(non_blocking=True) 145 | self.next_target = self.next_target.cuda(non_blocking=True) 146 | # more code for the alternative if record_stream() doesn't work: 147 | # copy_ will record the use of the pinned source tensor in this side stream. 
148 | # self.next_input_gpu.copy_(self.next_input, non_blocking=True) 149 | # self.next_target_gpu.copy_(self.next_target, non_blocking=True) 150 | # self.next_input = self.next_input_gpu 151 | # self.next_target = self.next_target_gpu 152 | 153 | # With Amp, it isn't necessary to manually convert data to half. 154 | # if args.fp16: 155 | # self.next_input = self.next_input.half() 156 | # else: 157 | self.next_input = self.next_input.float() 158 | self.next_input = self.next_input.sub_(self.mean).div_(self.std) 159 | 160 | def next(self): 161 | torch.cuda.current_stream().wait_stream(self.stream) 162 | input = self.next_input 163 | target = self.next_target 164 | if input is not None: 165 | input.record_stream(torch.cuda.current_stream()) 166 | if target is not None: 167 | target.record_stream(torch.cuda.current_stream()) 168 | self.preload() 169 | return input, target 170 | 171 | 172 | def main(): 173 | args = parser.parse_args() 174 | args.nprocs = torch.cuda.device_count() 175 | 176 | if args.seed is not None: 177 | random.seed(args.seed) 178 | torch.manual_seed(args.seed) 179 | cudnn.deterministic = True 180 | warnings.warn('You have chosen to seed training. ' 181 | 'This will turn on the CUDNN deterministic setting, ' 182 | 'which can slow down your training considerably! ' 183 | 'You may see unexpected behavior when restarting ' 184 | 'from checkpoints.') 185 | 186 | main_worker(args.local_rank, args.nprocs, args) 187 | 188 | 189 | def main_worker(local_rank, nprocs, args): 190 | best_acc1 = .0 191 | 192 | dist.init_process_group(backend='nccl') 193 | # create model 194 | if args.pretrained: 195 | print("=> using pre-trained model '{}'".format(args.arch)) 196 | model = models.__dict__[args.arch](pretrained=True) 197 | else: 198 | print("=> creating model '{}'".format(args.arch)) 199 | model = models.__dict__[args.arch]() 200 | 201 | torch.cuda.set_device(local_rank) 202 | model.cuda() 203 | # When using a single GPU per process and per 204 | # DistributedDataParallel, we need to divide the batch size 205 | # ourselves based on the total number of GPUs we have 206 | args.batch_size = int(args.batch_size / nprocs) 207 | 208 | # define loss function (criterion) and optimizer 209 | criterion = nn.CrossEntropyLoss().cuda() 210 | 211 | optimizer = torch.optim.SGD(model.parameters(), 212 | args.lr, 213 | momentum=args.momentum, 214 | weight_decay=args.weight_decay) 215 | 216 | model, optimizer = amp.initialize(model, optimizer) 217 | model = DistributedDataParallel(model) 218 | 219 | cudnn.benchmark = True 220 | 221 | # Data loading code 222 | traindir = os.path.join(args.data, 'train') 223 | valdir = os.path.join(args.data, 'val') 224 | normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], 225 | std=[0.229, 0.224, 0.225]) 226 | 227 | train_dataset = datasets.ImageFolder( 228 | traindir, 229 | transforms.Compose([ 230 | transforms.RandomResizedCrop(224), 231 | transforms.RandomHorizontalFlip(), 232 | transforms.ToTensor(), 233 | normalize, 234 | ])) 235 | 236 | train_sampler = torch.utils.data.distributed.DistributedSampler( 237 | train_dataset) 238 | 239 | train_loader = torch.utils.data.DataLoader(train_dataset, 240 | batch_size=args.batch_size, 241 | shuffle=(train_sampler is None), 242 | num_workers=2, 243 | pin_memory=True, 244 | sampler=train_sampler) 245 | 246 | val_loader = torch.utils.data.DataLoader(datasets.ImageFolder( 247 | valdir, 248 | transforms.Compose([ 249 | transforms.Resize(256), 250 | transforms.CenterCrop(224), 251 | transforms.ToTensor(), 252 | normalize, 253 
| ])), 254 | batch_size=args.batch_size, 255 | shuffle=False, 256 | num_workers=2, 257 | pin_memory=True) 258 | 259 | if args.evaluate: 260 | validate(val_loader, model, criterion, local_rank, args) 261 | return 262 | 263 | for epoch in range(args.start_epoch, args.epochs): 264 | train_sampler.set_epoch(epoch) 265 | adjust_learning_rate(optimizer, epoch, args) 266 | 267 | # train for one epoch 268 | train(train_loader, model, criterion, optimizer, epoch, local_rank, 269 | args) 270 | 271 | # evaluate on validation set 272 | acc1 = validate(val_loader, model, criterion, local_rank, args) 273 | 274 | # remember best acc@1 and save checkpoint 275 | is_best = acc1 > best_acc1 276 | best_acc1 = max(acc1, best_acc1) 277 | 278 | if args.local_rank == 0: 279 | save_checkpoint( 280 | { 281 | 'epoch': epoch + 1, 282 | 'arch': args.arch, 283 | 'state_dict': model.module.state_dict(), 284 | 'best_acc1': best_acc1, 285 | }, is_best) 286 | 287 | 288 | def train(train_loader, model, criterion, optimizer, epoch, local_rank, args): 289 | batch_time = AverageMeter('Time', ':6.3f') 290 | data_time = AverageMeter('Data', ':6.3f') 291 | losses = AverageMeter('Loss', ':.4e') 292 | top1 = AverageMeter('Acc@1', ':6.2f') 293 | top5 = AverageMeter('Acc@5', ':6.2f') 294 | progress = ProgressMeter(len(train_loader), 295 | [batch_time, data_time, losses, top1, top5], 296 | prefix="Epoch: [{}]".format(epoch)) 297 | 298 | # switch to train mode 299 | model.train() 300 | 301 | end = time.time() 302 | prefetcher = data_prefetcher(train_loader) 303 | images, target = prefetcher.next() 304 | i = 0 305 | while images is not None: 306 | # measure data loading time 307 | data_time.update(time.time() - end) 308 | 309 | # compute output 310 | output = model(images) 311 | loss = criterion(output, target) 312 | 313 | # measure accuracy and record loss 314 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 315 | 316 | torch.distributed.barrier() 317 | 318 | reduced_loss = reduce_mean(loss, args.nprocs) 319 | reduced_acc1 = reduce_mean(acc1, args.nprocs) 320 | reduced_acc5 = reduce_mean(acc5, args.nprocs) 321 | 322 | losses.update(reduced_loss.item(), images.size(0)) 323 | top1.update(reduced_acc1.item(), images.size(0)) 324 | top5.update(reduced_acc5.item(), images.size(0)) 325 | 326 | # compute gradient and do SGD step 327 | optimizer.zero_grad() 328 | with amp.scale_loss(loss, optimizer) as scaled_loss: 329 | scaled_loss.backward() 330 | optimizer.step() 331 | 332 | # measure elapsed time 333 | batch_time.update(time.time() - end) 334 | end = time.time() 335 | 336 | if i % args.print_freq == 0: 337 | progress.display(i) 338 | 339 | i += 1 340 | 341 | images, target = prefetcher.next() 342 | 343 | 344 | def validate(val_loader, model, criterion, local_rank, args): 345 | batch_time = AverageMeter('Time', ':6.3f') 346 | losses = AverageMeter('Loss', ':.4e') 347 | top1 = AverageMeter('Acc@1', ':6.2f') 348 | top5 = AverageMeter('Acc@5', ':6.2f') 349 | progress = ProgressMeter(len(val_loader), [batch_time, losses, top1, top5], 350 | prefix='Test: ') 351 | 352 | # switch to evaluate mode 353 | model.eval() 354 | 355 | with torch.no_grad(): 356 | end = time.time() 357 | prefetcher = data_prefetcher(val_loader) 358 | images, target = prefetcher.next() 359 | i = 0 360 | while images is not None: 361 | 362 | # compute output 363 | output = model(images) 364 | loss = criterion(output, target) 365 | 366 | # measure accuracy and record loss 367 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 368 | 369 | torch.distributed.barrier() 370 | 
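# NOTE: after the barrier, all_reduce averages loss/accuracy across processes so every rank logs the same global metrics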
371 | reduced_loss = reduce_mean(loss, args.nprocs) 372 | reduced_acc1 = reduce_mean(acc1, args.nprocs) 373 | reduced_acc5 = reduce_mean(acc5, args.nprocs) 374 | 375 | losses.update(reduced_loss.item(), images.size(0)) 376 | top1.update(reduced_acc1.item(), images.size(0)) 377 | top5.update(reduced_acc5.item(), images.size(0)) 378 | 379 | # measure elapsed time 380 | batch_time.update(time.time() - end) 381 | end = time.time() 382 | 383 | if i % args.print_freq == 0: 384 | progress.display(i) 385 | 386 | i += 1 387 | 388 | images, target = prefetcher.next() 389 | 390 | # TODO: this should also be done with the ProgressMeter 391 | print(' * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'.format(top1=top1, 392 | top5=top5)) 393 | 394 | return top1.avg 395 | 396 | 397 | def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'): 398 | torch.save(state, filename) 399 | if is_best: 400 | shutil.copyfile(filename, 'model_best.pth.tar') 401 | 402 | 403 | class AverageMeter(object): 404 | """Computes and stores the average and current value""" 405 | def __init__(self, name, fmt=':f'): 406 | self.name = name 407 | self.fmt = fmt 408 | self.reset() 409 | 410 | def reset(self): 411 | self.val = 0 412 | self.avg = 0 413 | self.sum = 0 414 | self.count = 0 415 | 416 | def update(self, val, n=1): 417 | self.val = val 418 | self.sum += val * n 419 | self.count += n 420 | self.avg = self.sum / self.count 421 | 422 | def __str__(self): 423 | fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})' 424 | return fmtstr.format(**self.__dict__) 425 | 426 | 427 | class ProgressMeter(object): 428 | def __init__(self, num_batches, meters, prefix=""): 429 | self.batch_fmtstr = self._get_batch_fmtstr(num_batches) 430 | self.meters = meters 431 | self.prefix = prefix 432 | 433 | def display(self, batch): 434 | entries = [self.prefix + self.batch_fmtstr.format(batch)] 435 | entries += [str(meter) for meter in self.meters] 436 | print('\t'.join(entries)) 437 | 438 | def _get_batch_fmtstr(self, num_batches): 439 | num_digits = len(str(num_batches // 1)) 440 | fmt = '{:' + str(num_digits) + 'd}' 441 | return '[' + fmt + '/' + fmt.format(num_batches) + ']' 442 | 443 | 444 | def adjust_learning_rate(optimizer, epoch, args): 445 | """Sets the learning rate to the initial LR decayed by 10 every 30 epochs""" 446 | lr = args.lr * (0.1**(epoch // 30)) 447 | for param_group in optimizer.param_groups: 448 | param_group['lr'] = lr 449 | 450 | 451 | def accuracy(output, target, topk=(1, )): 452 | """Computes the accuracy over the k top predictions for the specified values of k""" 453 | with torch.no_grad(): 454 | maxk = max(topk) 455 | batch_size = target.size(0) 456 | 457 | _, pred = output.topk(maxk, 1, True, True) 458 | pred = pred.t() 459 | correct = pred.eq(target.view(1, -1).expand_as(pred)) 460 | 461 | res = [] 462 | for k in topk: 463 | correct_k = correct[:k].view(-1).float().sum(0, keepdim=True) 464 | res.append(correct_k.mul_(100.0 / batch_size)) 465 | return res 466 | 467 | 468 | if __name__ == '__main__': 469 | main() -------------------------------------------------------------------------------- /assets/fig1_experimental_result.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tczhangzhi/pytorch-distributed/cd12856420858b14e02873e7d5c8cc7bb5aab5b0/assets/fig1_experimental_result.jpg -------------------------------------------------------------------------------- /assets/fig2_allreduce.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/tczhangzhi/pytorch-distributed/cd12856420858b14e02873e7d5c8cc7bb5aab5b0/assets/fig2_allreduce.jpg -------------------------------------------------------------------------------- /dataparallel.py: -------------------------------------------------------------------------------- 1 | import csv 2 | 3 | import argparse 4 | import os 5 | import random 6 | import shutil 7 | import time 8 | import warnings 9 | 10 | import torch 11 | import torch.nn as nn 12 | import torch.nn.parallel 13 | import torch.backends.cudnn as cudnn 14 | import torch.distributed as dist 15 | import torch.optim 16 | import torch.multiprocessing as mp 17 | import torch.utils.data 18 | import torch.utils.data.distributed 19 | import torchvision.transforms as transforms 20 | import torchvision.datasets as datasets 21 | import torchvision.models as models 22 | 23 | model_names = sorted(name for name in models.__dict__ 24 | if name.islower() and not name.startswith("__") 25 | and callable(models.__dict__[name])) 26 | 27 | parser = argparse.ArgumentParser(description='PyTorch ImageNet Training') 28 | parser.add_argument('--data', 29 | metavar='DIR', 30 | default='/home/zhangzhi/Data/ImageNet2012', 31 | help='path to dataset') 32 | parser.add_argument('-a', 33 | '--arch', 34 | metavar='ARCH', 35 | default='resnet18', 36 | choices=model_names, 37 | help='model architecture: ' + ' | '.join(model_names) + 38 | ' (default: resnet18)') 39 | parser.add_argument('-j', 40 | '--workers', 41 | default=4, 42 | type=int, 43 | metavar='N', 44 | help='number of data loading workers (default: 4)') 45 | parser.add_argument('--epochs', 46 | default=90, 47 | type=int, 48 | metavar='N', 49 | help='number of total epochs to run') 50 | parser.add_argument('--start-epoch', 51 | default=0, 52 | type=int, 53 | metavar='N', 54 | help='manual epoch number (useful on restarts)') 55 | parser.add_argument('-b', 56 | '--batch-size', 57 | default=3200, 58 | type=int, 59 | metavar='N', 60 | help='mini-batch size (default: 3200), this is the total ' 61 | 'batch size of all GPUs on the current node when ' 62 | 'using Data Parallel or Distributed Data Parallel') 63 | parser.add_argument('--lr', 64 | '--learning-rate', 65 | default=0.1, 66 | type=float, 67 | metavar='LR', 68 | help='initial learning rate', 69 | dest='lr') 70 | parser.add_argument('--momentum', 71 | default=0.9, 72 | type=float, 73 | metavar='M', 74 | help='momentum') 75 | parser.add_argument('--wd', 76 | '--weight-decay', 77 | default=1e-4, 78 | type=float, 79 | metavar='W', 80 | help='weight decay (default: 1e-4)', 81 | dest='weight_decay') 82 | parser.add_argument('-p', 83 | '--print-freq', 84 | default=10, 85 | type=int, 86 | metavar='N', 87 | help='print frequency (default: 10)') 88 | parser.add_argument('-e', 89 | '--evaluate', 90 | dest='evaluate', 91 | action='store_true', 92 | help='evaluate model on validation set') 93 | parser.add_argument('--pretrained', 94 | dest='pretrained', 95 | action='store_true', 96 | help='use pre-trained model') 97 | parser.add_argument('--seed', 98 | default=None, 99 | type=int, 100 | help='seed for initializing training. ') 101 | 102 | best_acc1 = 0 103 | 104 | 105 | def main(): 106 | args = parser.parse_args() 107 | 108 | if args.seed is not None: 109 | random.seed(args.seed) 110 | torch.manual_seed(args.seed) 111 | cudnn.deterministic = True 112 | warnings.warn('You have chosen to seed training. 
' 113 | 'This will turn on the CUDNN deterministic setting, ' 114 | 'which can slow down your training considerably! ' 115 | 'You may see unexpected behavior when restarting ' 116 | 'from checkpoints.') 117 | 118 | gpus = [0, 1, 2, 3] 119 | main_worker(gpus=gpus, args=args) 120 | 121 | 122 | def main_worker(gpus, args): 123 | global best_acc1 124 | 125 | # create model 126 | if args.pretrained: 127 | print("=> using pre-trained model '{}'".format(args.arch)) 128 | model = models.__dict__[args.arch](pretrained=True) 129 | else: 130 | print("=> creating model '{}'".format(args.arch)) 131 | model = models.__dict__[args.arch]() 132 | 133 | torch.cuda.set_device('cuda:{}'.format(gpus[0])) 134 | model.cuda() 135 | # When using a single GPU per process and per 136 | # DistributedDataParallel, we need to divide the batch size 137 | # ourselves based on the total number of GPUs we have 138 | model = nn.DataParallel(model, device_ids=gpus, output_device=gpus[0]) 139 | 140 | # define loss function (criterion) and optimizer 141 | criterion = nn.CrossEntropyLoss() 142 | 143 | optimizer = torch.optim.SGD(model.parameters(), 144 | args.lr, 145 | momentum=args.momentum, 146 | weight_decay=args.weight_decay) 147 | 148 | cudnn.benchmark = True 149 | 150 | # Data loading code 151 | traindir = os.path.join(args.data, 'train') 152 | valdir = os.path.join(args.data, 'val') 153 | normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], 154 | std=[0.229, 0.224, 0.225]) 155 | 156 | train_dataset = datasets.ImageFolder( 157 | traindir, 158 | transforms.Compose([ 159 | transforms.RandomResizedCrop(224), 160 | transforms.RandomHorizontalFlip(), 161 | transforms.ToTensor(), 162 | normalize, 163 | ])) 164 | 165 | train_loader = torch.utils.data.DataLoader(train_dataset, 166 | batch_size=args.batch_size, 167 | shuffle=True, 168 | num_workers=2, 169 | pin_memory=True) 170 | 171 | val_loader = torch.utils.data.DataLoader(datasets.ImageFolder( 172 | valdir, 173 | transforms.Compose([ 174 | transforms.Resize(256), 175 | transforms.CenterCrop(224), 176 | transforms.ToTensor(), 177 | normalize, 178 | ])), 179 | batch_size=args.batch_size, 180 | shuffle=False, 181 | num_workers=2, 182 | pin_memory=True) 183 | 184 | if args.evaluate: 185 | validate(val_loader, model, criterion, args) 186 | return 187 | 188 | log_csv = "dataparallel.csv" 189 | 190 | for epoch in range(args.start_epoch, args.epochs): 191 | epoch_start = time.time() 192 | 193 | adjust_learning_rate(optimizer, epoch, args) 194 | 195 | # train for one epoch 196 | train(train_loader, model, criterion, optimizer, epoch, args) 197 | 198 | # evaluate on validation set 199 | acc1 = validate(val_loader, model, criterion, args) 200 | 201 | # remember best acc@1 and save checkpoint 202 | is_best = acc1 > best_acc1 203 | best_acc1 = max(acc1, best_acc1) 204 | 205 | epoch_end = time.time() 206 | 207 | with open(log_csv, 'a+') as f: 208 | csv_write = csv.writer(f) 209 | data_row = [ 210 | time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(epoch_start)), 211 | epoch_end - epoch_start 212 | ] 213 | csv_write.writerow(data_row) 214 | 215 | save_checkpoint( 216 | { 217 | 'epoch': epoch + 1, 218 | 'arch': args.arch, 219 | 'state_dict': model.module.state_dict(), 220 | 'best_acc1': best_acc1, 221 | }, is_best) 222 | 223 | 224 | def train(train_loader, model, criterion, optimizer, epoch, args): 225 | batch_time = AverageMeter('Time', ':6.3f') 226 | data_time = AverageMeter('Data', ':6.3f') 227 | losses = AverageMeter('Loss', ':.4e') 228 | top1 = AverageMeter('Acc@1', ':6.2f') 229 
| top5 = AverageMeter('Acc@5', ':6.2f') 230 | progress = ProgressMeter(len(train_loader), 231 | [batch_time, data_time, losses, top1, top5], 232 | prefix="Epoch: [{}]".format(epoch)) 233 | 234 | # switch to train mode 235 | model.train() 236 | 237 | end = time.time() 238 | for i, (images, target) in enumerate(train_loader): 239 | # measure data loading time 240 | data_time.update(time.time() - end) 241 | 242 | images = images.cuda(non_blocking=True) 243 | target = target.cuda(non_blocking=True) 244 | 245 | # compute output 246 | output = model(images) 247 | loss = criterion(output, target) 248 | 249 | # measure accuracy and record loss 250 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 251 | losses.update(loss.item(), images.size(0)) 252 | top1.update(acc1[0], images.size(0)) 253 | top5.update(acc5[0], images.size(0)) 254 | 255 | # compute gradient and do SGD step 256 | optimizer.zero_grad() 257 | loss.backward() 258 | optimizer.step() 259 | 260 | # measure elapsed time 261 | batch_time.update(time.time() - end) 262 | end = time.time() 263 | 264 | if i % args.print_freq == 0: 265 | progress.display(i) 266 | 267 | 268 | def validate(val_loader, model, criterion, args): 269 | batch_time = AverageMeter('Time', ':6.3f') 270 | losses = AverageMeter('Loss', ':.4e') 271 | top1 = AverageMeter('Acc@1', ':6.2f') 272 | top5 = AverageMeter('Acc@5', ':6.2f') 273 | progress = ProgressMeter(len(val_loader), [batch_time, losses, top1, top5], 274 | prefix='Test: ') 275 | 276 | # switch to evaluate mode 277 | model.eval() 278 | 279 | with torch.no_grad(): 280 | end = time.time() 281 | for i, (images, target) in enumerate(val_loader): 282 | images = images.cuda(non_blocking=True) 283 | target = target.cuda(non_blocking=True) 284 | 285 | # compute output 286 | output = model(images) 287 | loss = criterion(output, target) 288 | 289 | # measure accuracy and record loss 290 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 291 | losses.update(loss.item(), images.size(0)) 292 | top1.update(acc1[0], images.size(0)) 293 | top5.update(acc5[0], images.size(0)) 294 | 295 | # measure elapsed time 296 | batch_time.update(time.time() - end) 297 | end = time.time() 298 | 299 | if i % args.print_freq == 0: 300 | progress.display(i) 301 | 302 | # TODO: this should also be done with the ProgressMeter 303 | print(' * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'.format(top1=top1, 304 | top5=top5)) 305 | 306 | return top1.avg 307 | 308 | 309 | def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'): 310 | torch.save(state, filename) 311 | if is_best: 312 | shutil.copyfile(filename, 'model_best.pth.tar') 313 | 314 | 315 | class AverageMeter(object): 316 | """Computes and stores the average and current value""" 317 | def __init__(self, name, fmt=':f'): 318 | self.name = name 319 | self.fmt = fmt 320 | self.reset() 321 | 322 | def reset(self): 323 | self.val = 0 324 | self.avg = 0 325 | self.sum = 0 326 | self.count = 0 327 | 328 | def update(self, val, n=1): 329 | self.val = val 330 | self.sum += val * n 331 | self.count += n 332 | self.avg = self.sum / self.count 333 | 334 | def __str__(self): 335 | fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})' 336 | return fmtstr.format(**self.__dict__) 337 | 338 | 339 | class ProgressMeter(object): 340 | def __init__(self, num_batches, meters, prefix=""): 341 | self.batch_fmtstr = self._get_batch_fmtstr(num_batches) 342 | self.meters = meters 343 | self.prefix = prefix 344 | 345 | def display(self, batch): 346 | entries = [self.prefix + 
self.batch_fmtstr.format(batch)] 347 | entries += [str(meter) for meter in self.meters] 348 | print('\t'.join(entries)) 349 | 350 | def _get_batch_fmtstr(self, num_batches): 351 | num_digits = len(str(num_batches // 1)) 352 | fmt = '{:' + str(num_digits) + 'd}' 353 | return '[' + fmt + '/' + fmt.format(num_batches) + ']' 354 | 355 | 356 | def adjust_learning_rate(optimizer, epoch, args): 357 | """Sets the learning rate to the initial LR decayed by 10 every 30 epochs""" 358 | lr = args.lr * (0.1**(epoch // 30)) 359 | for param_group in optimizer.param_groups: 360 | param_group['lr'] = lr 361 | 362 | 363 | def accuracy(output, target, topk=(1, )): 364 | """Computes the accuracy over the k top predictions for the specified values of k""" 365 | with torch.no_grad(): 366 | maxk = max(topk) 367 | batch_size = target.size(0) 368 | 369 | _, pred = output.topk(maxk, 1, True, True) 370 | pred = pred.t() 371 | correct = pred.eq(target.view(1, -1).expand_as(pred)) 372 | 373 | res = [] 374 | for k in topk: 375 | correct_k = correct[:k].view(-1).float().sum(0, keepdim=True) 376 | res.append(correct_k.mul_(100.0 / batch_size)) 377 | return res 378 | 379 | 380 | if __name__ == '__main__': 381 | main() -------------------------------------------------------------------------------- /distributed.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import random 4 | import shutil 5 | import time 6 | import warnings 7 | 8 | import torch 9 | import torch.nn as nn 10 | import torch.nn.parallel 11 | import torch.backends.cudnn as cudnn 12 | import torch.distributed as dist 13 | import torch.optim 14 | import torch.multiprocessing as mp 15 | import torch.utils.data 16 | import torch.utils.data.distributed 17 | import torchvision.transforms as transforms 18 | import torchvision.datasets as datasets 19 | import torchvision.models as models 20 | 21 | model_names = sorted(name for name in models.__dict__ 22 | if name.islower() and not name.startswith("__") 23 | and callable(models.__dict__[name])) 24 | 25 | parser = argparse.ArgumentParser(description='PyTorch ImageNet Training') 26 | parser.add_argument('--data', 27 | metavar='DIR', 28 | default='/home/zhangzhi/Data/exports/ImageNet2012', 29 | help='path to dataset') 30 | parser.add_argument('-a', 31 | '--arch', 32 | metavar='ARCH', 33 | default='resnet18', 34 | choices=model_names, 35 | help='model architecture: ' + ' | '.join(model_names) + 36 | ' (default: resnet18)') 37 | parser.add_argument('-j', 38 | '--workers', 39 | default=4, 40 | type=int, 41 | metavar='N', 42 | help='number of data loading workers (default: 4)') 43 | parser.add_argument('--epochs', 44 | default=90, 45 | type=int, 46 | metavar='N', 47 | help='number of total epochs to run') 48 | parser.add_argument('--start-epoch', 49 | default=0, 50 | type=int, 51 | metavar='N', 52 | help='manual epoch number (useful on restarts)') 53 | parser.add_argument('-b', 54 | '--batch-size', 55 | default=3200, 56 | type=int, 57 | metavar='N', 58 | help='mini-batch size (default: 3200), this is the total ' 59 | 'batch size of all GPUs on the current node when ' 60 | 'using Data Parallel or Distributed Data Parallel') 61 | parser.add_argument('--lr', 62 | '--learning-rate', 63 | default=0.1, 64 | type=float, 65 | metavar='LR', 66 | help='initial learning rate', 67 | dest='lr') 68 | parser.add_argument('--momentum', 69 | default=0.9, 70 | type=float, 71 | metavar='M', 72 | help='momentum') 73 | parser.add_argument('--local_rank', 74 | default=-1, 
75 | type=int, 76 | help='node rank for distributed training') 77 | parser.add_argument('--wd', 78 | '--weight-decay', 79 | default=1e-4, 80 | type=float, 81 | metavar='W', 82 | help='weight decay (default: 1e-4)', 83 | dest='weight_decay') 84 | parser.add_argument('-p', 85 | '--print-freq', 86 | default=10, 87 | type=int, 88 | metavar='N', 89 | help='print frequency (default: 10)') 90 | parser.add_argument('-e', 91 | '--evaluate', 92 | dest='evaluate', 93 | action='store_true', 94 | help='evaluate model on validation set') 95 | parser.add_argument('--pretrained', 96 | dest='pretrained', 97 | action='store_true', 98 | help='use pre-trained model') 99 | parser.add_argument('--seed', 100 | default=None, 101 | type=int, 102 | help='seed for initializing training. ') 103 | 104 | 105 | def reduce_mean(tensor, nprocs): 106 | rt = tensor.clone() 107 | dist.all_reduce(rt, op=dist.ReduceOp.SUM) 108 | rt /= nprocs 109 | return rt 110 | 111 | 112 | def main(): 113 | args = parser.parse_args() 114 | args.nprocs = torch.cuda.device_count() 115 | 116 | if args.seed is not None: 117 | random.seed(args.seed) 118 | torch.manual_seed(args.seed) 119 | cudnn.deterministic = True 120 | warnings.warn('You have chosen to seed training. ' 121 | 'This will turn on the CUDNN deterministic setting, ' 122 | 'which can slow down your training considerably! ' 123 | 'You may see unexpected behavior when restarting ' 124 | 'from checkpoints.') 125 | 126 | main_worker(args.local_rank, args.nprocs, args) 127 | 128 | 129 | def main_worker(local_rank, nprocs, args): 130 | best_acc1 = .0 131 | 132 | dist.init_process_group(backend='nccl') 133 | # create model 134 | if args.pretrained: 135 | print("=> using pre-trained model '{}'".format(args.arch)) 136 | model = models.__dict__[args.arch](pretrained=True) 137 | else: 138 | print("=> creating model '{}'".format(args.arch)) 139 | model = models.__dict__[args.arch]() 140 | 141 | torch.cuda.set_device(local_rank) 142 | model.cuda(local_rank) 143 | # When using a single GPU per process and per 144 | # DistributedDataParallel, we need to divide the batch size 145 | # ourselves based on the total number of GPUs we have 146 | args.batch_size = int(args.batch_size / nprocs) 147 | model = torch.nn.parallel.DistributedDataParallel(model, 148 | device_ids=[local_rank]) 149 | 150 | # define loss function (criterion) and optimizer 151 | criterion = nn.CrossEntropyLoss().cuda(local_rank) 152 | 153 | optimizer = torch.optim.SGD(model.parameters(), 154 | args.lr, 155 | momentum=args.momentum, 156 | weight_decay=args.weight_decay) 157 | 158 | cudnn.benchmark = True 159 | 160 | # Data loading code 161 | traindir = os.path.join(args.data, 'train') 162 | valdir = os.path.join(args.data, 'val') 163 | normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], 164 | std=[0.229, 0.224, 0.225]) 165 | 166 | train_dataset = datasets.ImageFolder( 167 | traindir, 168 | transforms.Compose([ 169 | transforms.RandomResizedCrop(224), 170 | transforms.RandomHorizontalFlip(), 171 | transforms.ToTensor(), 172 | normalize, 173 | ])) 174 | train_sampler = torch.utils.data.distributed.DistributedSampler( 175 | train_dataset) 176 | train_loader = torch.utils.data.DataLoader(train_dataset, 177 | batch_size=args.batch_size, 178 | num_workers=2, 179 | pin_memory=True, 180 | sampler=train_sampler) 181 | 182 | val_dataset = datasets.ImageFolder( 183 | valdir, 184 | transforms.Compose([ 185 | transforms.Resize(256), 186 | transforms.CenterCrop(224), 187 | transforms.ToTensor(), 188 | normalize, 189 | ])) 190 | 
val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset) 191 | val_loader = torch.utils.data.DataLoader(val_dataset, 192 | batch_size=args.batch_size, 193 | num_workers=2, 194 | pin_memory=True, 195 | sampler=val_sampler) 196 | 197 | if args.evaluate: 198 | validate(val_loader, model, criterion, local_rank, args) 199 | return 200 | 201 | for epoch in range(args.start_epoch, args.epochs): 202 | train_sampler.set_epoch(epoch) 203 | val_sampler.set_epoch(epoch) 204 | 205 | adjust_learning_rate(optimizer, epoch, args) 206 | 207 | # train for one epoch 208 | train(train_loader, model, criterion, optimizer, epoch, local_rank, 209 | args) 210 | 211 | # evaluate on validation set 212 | acc1 = validate(val_loader, model, criterion, local_rank, args) 213 | 214 | # remember best acc@1 and save checkpoint 215 | is_best = acc1 > best_acc1 216 | best_acc1 = max(acc1, best_acc1) 217 | 218 | if args.local_rank == 0: 219 | save_checkpoint( 220 | { 221 | 'epoch': epoch + 1, 222 | 'arch': args.arch, 223 | 'state_dict': model.module.state_dict(), 224 | 'best_acc1': best_acc1, 225 | }, is_best) 226 | 227 | 228 | def train(train_loader, model, criterion, optimizer, epoch, local_rank, args): 229 | batch_time = AverageMeter('Time', ':6.3f') 230 | data_time = AverageMeter('Data', ':6.3f') 231 | losses = AverageMeter('Loss', ':.4e') 232 | top1 = AverageMeter('Acc@1', ':6.2f') 233 | top5 = AverageMeter('Acc@5', ':6.2f') 234 | progress = ProgressMeter(len(train_loader), 235 | [batch_time, data_time, losses, top1, top5], 236 | prefix="Epoch: [{}]".format(epoch)) 237 | 238 | # switch to train mode 239 | model.train() 240 | 241 | end = time.time() 242 | for i, (images, target) in enumerate(train_loader): 243 | # measure data loading time 244 | data_time.update(time.time() - end) 245 | 246 | images = images.cuda(local_rank, non_blocking=True) 247 | target = target.cuda(local_rank, non_blocking=True) 248 | 249 | # compute output 250 | output = model(images) 251 | loss = criterion(output, target) 252 | 253 | # measure accuracy and record loss 254 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 255 | 256 | torch.distributed.barrier() 257 | 258 | reduced_loss = reduce_mean(loss, args.nprocs) 259 | reduced_acc1 = reduce_mean(acc1, args.nprocs) 260 | reduced_acc5 = reduce_mean(acc5, args.nprocs) 261 | 262 | losses.update(reduced_loss.item(), images.size(0)) 263 | top1.update(reduced_acc1.item(), images.size(0)) 264 | top5.update(reduced_acc5.item(), images.size(0)) 265 | 266 | # compute gradient and do SGD step 267 | optimizer.zero_grad() 268 | loss.backward() 269 | optimizer.step() 270 | 271 | # measure elapsed time 272 | batch_time.update(time.time() - end) 273 | end = time.time() 274 | 275 | if i % args.print_freq == 0: 276 | progress.display(i) 277 | 278 | 279 | def validate(val_loader, model, criterion, local_rank, args): 280 | batch_time = AverageMeter('Time', ':6.3f') 281 | losses = AverageMeter('Loss', ':.4e') 282 | top1 = AverageMeter('Acc@1', ':6.2f') 283 | top5 = AverageMeter('Acc@5', ':6.2f') 284 | progress = ProgressMeter(len(val_loader), [batch_time, losses, top1, top5], 285 | prefix='Test: ') 286 | 287 | # switch to evaluate mode 288 | model.eval() 289 | 290 | with torch.no_grad(): 291 | end = time.time() 292 | for i, (images, target) in enumerate(val_loader): 293 | images = images.cuda(local_rank, non_blocking=True) 294 | target = target.cuda(local_rank, non_blocking=True) 295 | 296 | # compute output 297 | output = model(images) 298 | loss = criterion(output, target) 299 | 300 | # 
measure accuracy and record loss 301 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 302 | 303 | torch.distributed.barrier() 304 | 305 | reduced_loss = reduce_mean(loss, args.nprocs) 306 | reduced_acc1 = reduce_mean(acc1, args.nprocs) 307 | reduced_acc5 = reduce_mean(acc5, args.nprocs) 308 | 309 | losses.update(reduced_loss.item(), images.size(0)) 310 | top1.update(reduced_acc1.item(), images.size(0)) 311 | top5.update(reduced_acc5.item(), images.size(0)) 312 | 313 | # measure elapsed time 314 | batch_time.update(time.time() - end) 315 | end = time.time() 316 | 317 | if i % args.print_freq == 0: 318 | progress.display(i) 319 | 320 | # TODO: this should also be done with the ProgressMeter 321 | print(' * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'.format(top1=top1, 322 | top5=top5)) 323 | 324 | return top1.avg 325 | 326 | 327 | def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'): 328 | torch.save(state, filename) 329 | if is_best: 330 | shutil.copyfile(filename, 'model_best.pth.tar') 331 | 332 | 333 | class AverageMeter(object): 334 | """Computes and stores the average and current value""" 335 | def __init__(self, name, fmt=':f'): 336 | self.name = name 337 | self.fmt = fmt 338 | self.reset() 339 | 340 | def reset(self): 341 | self.val = 0 342 | self.avg = 0 343 | self.sum = 0 344 | self.count = 0 345 | 346 | def update(self, val, n=1): 347 | self.val = val 348 | self.sum += val * n 349 | self.count += n 350 | self.avg = self.sum / self.count 351 | 352 | def __str__(self): 353 | fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})' 354 | return fmtstr.format(**self.__dict__) 355 | 356 | 357 | class ProgressMeter(object): 358 | def __init__(self, num_batches, meters, prefix=""): 359 | self.batch_fmtstr = self._get_batch_fmtstr(num_batches) 360 | self.meters = meters 361 | self.prefix = prefix 362 | 363 | def display(self, batch): 364 | entries = [self.prefix + self.batch_fmtstr.format(batch)] 365 | entries += [str(meter) for meter in self.meters] 366 | print('\t'.join(entries)) 367 | 368 | def _get_batch_fmtstr(self, num_batches): 369 | num_digits = len(str(num_batches // 1)) 370 | fmt = '{:' + str(num_digits) + 'd}' 371 | return '[' + fmt + '/' + fmt.format(num_batches) + ']' 372 | 373 | 374 | def adjust_learning_rate(optimizer, epoch, args): 375 | """Sets the learning rate to the initial LR decayed by 10 every 30 epochs""" 376 | lr = args.lr * (0.1**(epoch // 30)) 377 | for param_group in optimizer.param_groups: 378 | param_group['lr'] = lr 379 | 380 | 381 | def accuracy(output, target, topk=(1, )): 382 | """Computes the accuracy over the k top predictions for the specified values of k""" 383 | with torch.no_grad(): 384 | maxk = max(topk) 385 | batch_size = target.size(0) 386 | 387 | _, pred = output.topk(maxk, 1, True, True) 388 | pred = pred.t() 389 | correct = pred.eq(target.view(1, -1).expand_as(pred)) 390 | 391 | res = [] 392 | for k in topk: 393 | correct_k = correct[:k].view(-1).float().sum(0, keepdim=True) 394 | res.append(correct_k.mul_(100.0 / batch_size)) 395 | return res 396 | 397 | 398 | if __name__ == '__main__': 399 | main() -------------------------------------------------------------------------------- /distributed_slurm_main.py: -------------------------------------------------------------------------------- 1 | import os 2 | import csv 3 | import time 4 | import socket 5 | import random 6 | import shutil 7 | import argparse 8 | import warnings 9 | 10 | import torch 11 | import torch.optim 12 | import torch.nn as nn 13 | import 
torch.nn.parallel 14 | import torch.backends.cudnn as cudnn 15 | import torch.distributed as dist 16 | import torch.multiprocessing as mp 17 | import torch.utils.data 18 | import torch.utils.data.distributed 19 | 20 | import torchvision.transforms as transforms 21 | import torchvision.datasets as datasets 22 | import torchvision.models as models 23 | 24 | model_names = sorted(name for name in models.__dict__ 25 | if name.islower() and not name.startswith("__") 26 | and callable(models.__dict__[name])) 27 | 28 | parser = argparse.ArgumentParser(description='PyTorch ImageNet Training') 29 | parser.add_argument('--data', 30 | metavar='DIR', 31 | default='/home/zhangzhi/Data/exports/ImageNet2012', 32 | help='path to dataset') 33 | parser.add_argument('-a', 34 | '--arch', 35 | metavar='ARCH', 36 | default='resnet18', 37 | choices=model_names, 38 | help='model architecture: ' + ' | '.join(model_names) + 39 | ' (default: resnet18)') 40 | parser.add_argument('-j', 41 | '--workers', 42 | default=4, 43 | type=int, 44 | metavar='N', 45 | help='number of data loading workers (default: 4)') 46 | parser.add_argument('--epochs', 47 | default=90, 48 | type=int, 49 | metavar='N', 50 | help='number of total epochs to run') 51 | parser.add_argument('--start-epoch', 52 | default=0, 53 | type=int, 54 | metavar='N', 55 | help='manual epoch number (useful on restarts)') 56 | parser.add_argument('-b', 57 | '--batch-size', 58 | default=3200, 59 | type=int, 60 | metavar='N', 61 | help='mini-batch size (default: 3200), this is the total ' 62 | 'batch size of all GPUs on the current node when ' 63 | 'using Data Parallel or Distributed Data Parallel') 64 | parser.add_argument('--lr', 65 | '--learning-rate', 66 | default=0.1, 67 | type=float, 68 | metavar='LR', 69 | help='initial learning rate', 70 | dest='lr') 71 | parser.add_argument('--momentum', 72 | default=0.9, 73 | type=float, 74 | metavar='M', 75 | help='momentum') 76 | parser.add_argument('--wd', 77 | '--weight-decay', 78 | default=1e-4, 79 | type=float, 80 | metavar='W', 81 | help='weight decay (default: 1e-4)', 82 | dest='weight_decay') 83 | parser.add_argument('-p', 84 | '--print-freq', 85 | default=10, 86 | type=int, 87 | metavar='N', 88 | help='print frequency (default: 10)') 89 | parser.add_argument('-e', 90 | '--evaluate', 91 | dest='evaluate', 92 | action='store_true', 93 | help='evaluate model on validation set') 94 | parser.add_argument('--pretrained', 95 | dest='pretrained', 96 | action='store_true', 97 | help='use pre-trained model') 98 | parser.add_argument('--seed', 99 | default=None, 100 | type=int, 101 | help='seed for initializing training. ') 102 | parser.add_argument('--dist-file', 103 | default=None, 104 | type=str, 105 | help='file used to initial distributed training') 106 | 107 | best_acc1 = 0 108 | 109 | 110 | def main(): 111 | args = parser.parse_args() 112 | 113 | if args.seed is not None: 114 | random.seed(args.seed) 115 | torch.manual_seed(args.seed) 116 | cudnn.deterministic = True 117 | # torch.backends.cudnn.enabled = False 118 | warnings.warn('You have chosen to seed training. ' 119 | 'This will turn on the CUDNN deterministic setting, ' 120 | 'which can slow down your training considerably! 
' 121 | 'You may see unexpected behavior when restarting ' 122 | 'from checkpoints.') 123 | 124 | args.local_rank = int(os.environ["SLURM_PROCID"]) 125 | args.world_size = int(os.environ["SLURM_NPROCS"]) 126 | ngpus_per_node = torch.cuda.device_count() 127 | 128 | job_id = os.environ["SLURM_JOBID"] 129 | args.dist_url = "file://{}.{}".format(os.path.realpath(args.dist_file), 130 | job_id) 131 | mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args)) 132 | 133 | 134 | def main_worker(gpu, ngpus_per_node, args): 135 | global best_acc1 136 | rank = args.local_rank * ngpus_per_node + gpu 137 | dist.init_process_group(backend='nccl', 138 | init_method=args.dist_url, 139 | world_size=args.world_size, 140 | rank=rank) 141 | # create model 142 | if args.pretrained: 143 | print("=> using pre-trained model '{}'".format(args.arch)) 144 | model = models.__dict__[args.arch](pretrained=True) 145 | else: 146 | print("=> creating model '{}'".format(args.arch)) 147 | model = models.__dict__[args.arch]() 148 | 149 | torch.cuda.set_device(gpu) 150 | model.cuda(gpu) 151 | # When using a single GPU per process and per 152 | # DistributedDataParallel, we need to divide the batch size 153 | # ourselves based on the total number of GPUs we have 154 | args.batch_size = int(args.batch_size / ngpus_per_node) 155 | model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu]) 156 | 157 | # define loss function (criterion) and optimizer 158 | criterion = nn.CrossEntropyLoss().cuda(gpu) 159 | 160 | optimizer = torch.optim.SGD(model.parameters(), 161 | args.lr, 162 | momentum=args.momentum, 163 | weight_decay=args.weight_decay) 164 | 165 | cudnn.benchmark = True 166 | 167 | # Data loading code 168 | traindir = os.path.join(args.data, 'train') 169 | valdir = os.path.join(args.data, 'val') 170 | normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], 171 | std=[0.229, 0.224, 0.225]) 172 | 173 | train_dataset = datasets.ImageFolder( 174 | traindir, 175 | transforms.Compose([ 176 | transforms.RandomResizedCrop(224), 177 | transforms.RandomHorizontalFlip(), 178 | transforms.ToTensor(), 179 | normalize, 180 | ])) 181 | 182 | train_sampler = torch.utils.data.distributed.DistributedSampler( 183 | train_dataset) 184 | 185 | train_loader = torch.utils.data.DataLoader(train_dataset, 186 | batch_size=args.batch_size, 187 | shuffle=(train_sampler is None), 188 | num_workers=2, 189 | pin_memory=True, 190 | sampler=train_sampler) 191 | 192 | val_loader = torch.utils.data.DataLoader(datasets.ImageFolder( 193 | valdir, 194 | transforms.Compose([ 195 | transforms.Resize(256), 196 | transforms.CenterCrop(224), 197 | transforms.ToTensor(), 198 | normalize, 199 | ])), 200 | batch_size=args.batch_size, 201 | shuffle=False, 202 | num_workers=2, 203 | pin_memory=True) 204 | 205 | if args.evaluate: 206 | validate(val_loader, model, criterion, gpu, args) 207 | return 208 | 209 | log_csv = "distributed.csv" 210 | 211 | for epoch in range(args.start_epoch, args.epochs): 212 | epoch_start = time.time() 213 | 214 | train_sampler.set_epoch(epoch) 215 | adjust_learning_rate(optimizer, epoch, args) 216 | 217 | # train for one epoch 218 | train(train_loader, model, criterion, optimizer, epoch, gpu, args) 219 | 220 | # evaluate on validation set 221 | acc1 = validate(val_loader, model, criterion, gpu, args) 222 | 223 | # remember best acc@1 and save checkpoint 224 | is_best = acc1 > best_acc1 225 | best_acc1 = max(acc1, best_acc1) 226 | 227 | epoch_end = time.time() 228 | 229 | with open(log_csv, 'a+') as f: 230 | 
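            # Per-epoch timing log: append this epoch's start timestamp and its
            # wall-clock duration in seconds to distributed.csv for later runtime comparison.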
csv_write = csv.writer(f) 231 | data_row = [ 232 | time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(epoch_start)), 233 | epoch_end - epoch_start 234 | ] 235 | csv_write.writerow(data_row) 236 | 237 | save_checkpoint( 238 | { 239 | 'epoch': epoch + 1, 240 | 'arch': args.arch, 241 | 'state_dict': model.module.state_dict(), 242 | 'best_acc1': best_acc1, 243 | }, is_best) 244 | 245 | 246 | def train(train_loader, model, criterion, optimizer, epoch, gpu, args): 247 | batch_time = AverageMeter('Time', ':6.3f') 248 | data_time = AverageMeter('Data', ':6.3f') 249 | losses = AverageMeter('Loss', ':.4e') 250 | top1 = AverageMeter('Acc@1', ':6.2f') 251 | top5 = AverageMeter('Acc@5', ':6.2f') 252 | progress = ProgressMeter(len(train_loader), 253 | [batch_time, data_time, losses, top1, top5], 254 | prefix="Epoch: [{}]".format(epoch)) 255 | 256 | # switch to train mode 257 | model.train() 258 | 259 | end = time.time() 260 | for i, (images, target) in enumerate(train_loader): 261 | # measure data loading time 262 | data_time.update(time.time() - end) 263 | 264 | images = images.cuda(gpu, non_blocking=True) 265 | target = target.cuda(gpu, non_blocking=True) 266 | 267 | # compute output 268 | output = model(images) 269 | loss = criterion(output, target) 270 | 271 | # measure accuracy and record loss 272 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 273 | losses.update(loss.item(), images.size(0)) 274 | top1.update(acc1[0], images.size(0)) 275 | top5.update(acc5[0], images.size(0)) 276 | 277 | # compute gradient and do SGD step 278 | optimizer.zero_grad() 279 | loss.backward() 280 | optimizer.step() 281 | 282 | # measure elapsed time 283 | batch_time.update(time.time() - end) 284 | end = time.time() 285 | 286 | if i % args.print_freq == 0: 287 | progress.display(i) 288 | 289 | 290 | def validate(val_loader, model, criterion, gpu, args): 291 | batch_time = AverageMeter('Time', ':6.3f') 292 | losses = AverageMeter('Loss', ':.4e') 293 | top1 = AverageMeter('Acc@1', ':6.2f') 294 | top5 = AverageMeter('Acc@5', ':6.2f') 295 | progress = ProgressMeter(len(val_loader), [batch_time, losses, top1, top5], 296 | prefix='Test: ') 297 | 298 | # switch to evaluate mode 299 | model.eval() 300 | 301 | with torch.no_grad(): 302 | end = time.time() 303 | for i, (images, target) in enumerate(val_loader): 304 | images = images.cuda(gpu, non_blocking=True) 305 | target = target.cuda(gpu, non_blocking=True) 306 | 307 | # compute output 308 | output = model(images) 309 | loss = criterion(output, target) 310 | 311 | # measure accuracy and record loss 312 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 313 | losses.update(loss.item(), images.size(0)) 314 | top1.update(acc1[0], images.size(0)) 315 | top5.update(acc5[0], images.size(0)) 316 | 317 | # measure elapsed time 318 | batch_time.update(time.time() - end) 319 | end = time.time() 320 | 321 | if i % args.print_freq == 0: 322 | progress.display(i) 323 | 324 | # TODO: this should also be done with the ProgressMeter 325 | print(' * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'.format(top1=top1, 326 | top5=top5)) 327 | 328 | return top1.avg 329 | 330 | 331 | def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'): 332 | torch.save(state, filename) 333 | if is_best: 334 | shutil.copyfile(filename, 'model_best.pth.tar') 335 | 336 | 337 | class AverageMeter(object): 338 | """Computes and stores the average and current value""" 339 | def __init__(self, name, fmt=':f'): 340 | self.name = name 341 | self.fmt = fmt 342 | self.reset() 343 | 344 | def reset(self): 345 | 
self.val = 0 346 | self.avg = 0 347 | self.sum = 0 348 | self.count = 0 349 | 350 | def update(self, val, n=1): 351 | self.val = val 352 | self.sum += val * n 353 | self.count += n 354 | self.avg = self.sum / self.count 355 | 356 | def __str__(self): 357 | fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})' 358 | return fmtstr.format(**self.__dict__) 359 | 360 | 361 | class ProgressMeter(object): 362 | def __init__(self, num_batches, meters, prefix=""): 363 | self.batch_fmtstr = self._get_batch_fmtstr(num_batches) 364 | self.meters = meters 365 | self.prefix = prefix 366 | 367 | def display(self, batch): 368 | entries = [self.prefix + self.batch_fmtstr.format(batch)] 369 | entries += [str(meter) for meter in self.meters] 370 | print('\t'.join(entries)) 371 | 372 | def _get_batch_fmtstr(self, num_batches): 373 | num_digits = len(str(num_batches // 1)) 374 | fmt = '{:' + str(num_digits) + 'd}' 375 | return '[' + fmt + '/' + fmt.format(num_batches) + ']' 376 | 377 | 378 | def adjust_learning_rate(optimizer, epoch, args): 379 | """Sets the learning rate to the initial LR decayed by 10 every 30 epochs""" 380 | lr = args.lr * (0.1**(epoch // 30)) 381 | for param_group in optimizer.param_groups: 382 | param_group['lr'] = lr 383 | 384 | 385 | def accuracy(output, target, topk=(1, )): 386 | """Computes the accuracy over the k top predictions for the specified values of k""" 387 | with torch.no_grad(): 388 | maxk = max(topk) 389 | batch_size = target.size(0) 390 | 391 | _, pred = output.topk(maxk, 1, True, True) 392 | pred = pred.t() 393 | correct = pred.eq(target.view(1, -1).expand_as(pred)) 394 | 395 | res = [] 396 | for k in topk: 397 | correct_k = correct[:k].view(-1).float().sum(0, keepdim=True) 398 | res.append(correct_k.mul_(100.0 / batch_size)) 399 | return res 400 | 401 | 402 | if __name__ == '__main__': 403 | main() -------------------------------------------------------------------------------- /horovod_distributed.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import random 4 | import shutil 5 | import time 6 | import warnings 7 | 8 | import torch 9 | import torch.nn as nn 10 | import torch.nn.parallel 11 | import torch.backends.cudnn as cudnn 12 | import torch.distributed as dist 13 | import torch.optim 14 | import torch.multiprocessing as mp 15 | import torch.utils.data 16 | import torch.utils.data.distributed 17 | import torchvision.transforms as transforms 18 | import torchvision.datasets as datasets 19 | import torchvision.models as models 20 | import horovod.torch as hvd 21 | 22 | model_names = sorted(name for name in models.__dict__ 23 | if name.islower() and not name.startswith("__") 24 | and callable(models.__dict__[name])) 25 | 26 | parser = argparse.ArgumentParser(description='PyTorch ImageNet Training') 27 | parser.add_argument('--data', 28 | metavar='DIR', 29 | default='/home/zhangzhi/Data/exports/ImageNet2012', 30 | help='path to dataset') 31 | parser.add_argument('-a', 32 | '--arch', 33 | metavar='ARCH', 34 | default='resnet18', 35 | choices=model_names, 36 | help='model architecture: ' + ' | '.join(model_names) + 37 | ' (default: resnet18)') 38 | parser.add_argument('-j', 39 | '--workers', 40 | default=4, 41 | type=int, 42 | metavar='N', 43 | help='number of data loading workers (default: 4)') 44 | parser.add_argument('--epochs', 45 | default=90, 46 | type=int, 47 | metavar='N', 48 | help='number of total epochs to run') 49 | parser.add_argument('--start-epoch', 50 | default=0, 51 | 
type=int, 52 | metavar='N', 53 | help='manual epoch number (useful on restarts)') 54 | parser.add_argument('-b', 55 | '--batch-size', 56 | default=3200, 57 | type=int, 58 | metavar='N', 59 | help='mini-batch size (default: 3200), this is the total ' 60 | 'batch size of all GPUs on the current node when ' 61 | 'using Data Parallel or Distributed Data Parallel') 62 | parser.add_argument('--lr', 63 | '--learning-rate', 64 | default=0.1, 65 | type=float, 66 | metavar='LR', 67 | help='initial learning rate', 68 | dest='lr') 69 | parser.add_argument('--momentum', 70 | default=0.9, 71 | type=float, 72 | metavar='M', 73 | help='momentum') 74 | parser.add_argument('--wd', 75 | '--weight-decay', 76 | default=1e-4, 77 | type=float, 78 | metavar='W', 79 | help='weight decay (default: 1e-4)', 80 | dest='weight_decay') 81 | parser.add_argument('-p', 82 | '--print-freq', 83 | default=10, 84 | type=int, 85 | metavar='N', 86 | help='print frequency (default: 10)') 87 | parser.add_argument('-e', 88 | '--evaluate', 89 | dest='evaluate', 90 | action='store_true', 91 | help='evaluate model on validation set') 92 | parser.add_argument('--pretrained', 93 | dest='pretrained', 94 | action='store_true', 95 | help='use pre-trained model') 96 | parser.add_argument('--seed', 97 | default=None, 98 | type=int, 99 | help='seed for initializing training. ') 100 | 101 | 102 | def reduce_mean(tensor, nprocs): 103 | rt = tensor.clone() 104 | hvd.allreduce(rt, name='barrier') 105 | # # horovod.allreduce calculates the average value by default 106 | # # https://github.com/tczhangzhi/pytorch-distributed/issues/14 107 | # rt /= nprocs 108 | return rt 109 | 110 | 111 | def main(): 112 | args = parser.parse_args() 113 | args.nprocs = torch.cuda.device_count() 114 | 115 | if args.seed is not None: 116 | random.seed(args.seed) 117 | torch.manual_seed(args.seed) 118 | cudnn.deterministic = True 119 | warnings.warn('You have chosen to seed training. ' 120 | 'This will turn on the CUDNN deterministic setting, ' 121 | 'which can slow down your training considerably! 
' 122 | 'You may see unexpected behavior when restarting ' 123 | 'from checkpoints.') 124 | 125 | hvd.init() 126 | args.local_rank = hvd.local_rank() 127 | torch.cuda.set_device(args.local_rank) 128 | 129 | main_worker(args.local_rank, args.nprocs, args) 130 | 131 | 132 | def main_worker(local_rank, nprocs, args): 133 | best_acc1 = .0 134 | 135 | # create model 136 | if args.pretrained: 137 | print("=> using pre-trained model '{}'".format(args.arch)) 138 | model = models.__dict__[args.arch](pretrained=True) 139 | else: 140 | print("=> creating model '{}'".format(args.arch)) 141 | model = models.__dict__[args.arch]() 142 | 143 | model.cuda() 144 | # When using a single GPU per process and per 145 | # DistributedDataParallel, we need to divide the batch size 146 | # ourselves based on the total number of GPUs we have 147 | args.batch_size = int(args.batch_size / nprocs) 148 | 149 | hvd.broadcast_parameters(model.state_dict(), root_rank=0) 150 | 151 | # define loss function (criterion) and optimizer 152 | criterion = nn.CrossEntropyLoss().cuda() 153 | 154 | optimizer = torch.optim.SGD(model.parameters(), 155 | args.lr, 156 | momentum=args.momentum, 157 | weight_decay=args.weight_decay) 158 | hvd.broadcast_optimizer_state(optimizer, root_rank=0) 159 | compression = hvd.Compression.fp16 160 | 161 | optimizer = hvd.DistributedOptimizer( 162 | optimizer, 163 | named_parameters=model.named_parameters(), 164 | compression=compression) 165 | 166 | cudnn.benchmark = True 167 | 168 | # Data loading code 169 | traindir = os.path.join(args.data, 'train') 170 | valdir = os.path.join(args.data, 'val') 171 | normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], 172 | std=[0.229, 0.224, 0.225]) 173 | 174 | train_dataset = datasets.ImageFolder( 175 | traindir, 176 | transforms.Compose([ 177 | transforms.RandomResizedCrop(224), 178 | transforms.RandomHorizontalFlip(), 179 | transforms.ToTensor(), 180 | normalize, 181 | ])) 182 | train_sampler = torch.utils.data.distributed.DistributedSampler( 183 | train_dataset, num_replicas=hvd.size(), rank=hvd.rank()) 184 | train_loader = torch.utils.data.DataLoader(train_dataset, 185 | batch_size=args.batch_size, 186 | num_workers=2, 187 | pin_memory=True, 188 | sampler=train_sampler) 189 | 190 | val_dataset = datasets.ImageFolder( 191 | valdir, 192 | transforms.Compose([ 193 | transforms.Resize(256), 194 | transforms.CenterCrop(224), 195 | transforms.ToTensor(), 196 | normalize, 197 | ])) 198 | val_sampler = torch.utils.data.distributed.DistributedSampler( 199 | val_dataset, num_replicas=hvd.size(), rank=hvd.rank()) 200 | val_loader = torch.utils.data.DataLoader(val_dataset, 201 | batch_size=args.batch_size, 202 | num_workers=2, 203 | pin_memory=True, 204 | sampler=val_sampler) 205 | 206 | if args.evaluate: 207 | validate(val_loader, model, criterion, args) 208 | return 209 | 210 | for epoch in range(args.start_epoch, args.epochs): 211 | 212 | train_sampler.set_epoch(epoch) 213 | val_sampler.set_epoch(epoch) 214 | 215 | adjust_learning_rate(optimizer, epoch, args) 216 | 217 | # train for one epoch 218 | train(train_loader, model, criterion, optimizer, epoch, args) 219 | 220 | # evaluate on validation set 221 | acc1 = validate(val_loader, model, criterion, args) 222 | 223 | # remember best acc@1 and save checkpoint 224 | is_best = acc1 > best_acc1 225 | best_acc1 = max(acc1, best_acc1) 226 | 227 | if args.local_rank == 0: 228 | save_checkpoint( 229 | { 230 | 'epoch': epoch + 1, 231 | 'arch': args.arch, 232 | 'state_dict': model.state_dict(), 233 | 'best_acc1': 
best_acc1, 234 | }, is_best) 235 | 236 | 237 | def train(train_loader, model, criterion, optimizer, epoch, args): 238 | batch_time = AverageMeter('Time', ':6.3f') 239 | data_time = AverageMeter('Data', ':6.3f') 240 | losses = AverageMeter('Loss', ':.4e') 241 | top1 = AverageMeter('Acc@1', ':6.2f') 242 | top5 = AverageMeter('Acc@5', ':6.2f') 243 | progress = ProgressMeter(len(train_loader), 244 | [batch_time, data_time, losses, top1, top5], 245 | prefix="Epoch: [{}]".format(epoch)) 246 | 247 | # switch to train mode 248 | model.train() 249 | 250 | end = time.time() 251 | for i, (images, target) in enumerate(train_loader): 252 | # measure data loading time 253 | data_time.update(time.time() - end) 254 | 255 | images = images.cuda(non_blocking=True) 256 | target = target.cuda(non_blocking=True) 257 | 258 | # compute output 259 | output = model(images) 260 | loss = criterion(output, target) 261 | 262 | # measure accuracy and record loss 263 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 264 | 265 | reduced_loss = reduce_mean(loss, args.nprocs) 266 | reduced_acc1 = reduce_mean(acc1, args.nprocs) 267 | reduced_acc5 = reduce_mean(acc5, args.nprocs) 268 | 269 | losses.update(reduced_loss.item(), images.size(0)) 270 | top1.update(reduced_acc1.item(), images.size(0)) 271 | top5.update(reduced_acc5.item(), images.size(0)) 272 | 273 | # compute gradient and do SGD step 274 | optimizer.zero_grad() 275 | loss.backward() 276 | optimizer.step() 277 | 278 | # measure elapsed time 279 | batch_time.update(time.time() - end) 280 | end = time.time() 281 | 282 | if i % args.print_freq == 0: 283 | progress.display(i) 284 | 285 | 286 | def validate(val_loader, model, criterion, args): 287 | batch_time = AverageMeter('Time', ':6.3f') 288 | losses = AverageMeter('Loss', ':.4e') 289 | top1 = AverageMeter('Acc@1', ':6.2f') 290 | top5 = AverageMeter('Acc@5', ':6.2f') 291 | progress = ProgressMeter(len(val_loader), [batch_time, losses, top1, top5], 292 | prefix='Test: ') 293 | 294 | # switch to evaluate mode 295 | model.eval() 296 | 297 | with torch.no_grad(): 298 | end = time.time() 299 | for i, (images, target) in enumerate(val_loader): 300 | images = images.cuda(non_blocking=True) 301 | target = target.cuda(non_blocking=True) 302 | 303 | # compute output 304 | output = model(images) 305 | loss = criterion(output, target) 306 | 307 | # measure accuracy and record loss 308 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 309 | 310 | reduced_loss = reduce_mean(loss, args.nprocs) 311 | reduced_acc1 = reduce_mean(acc1, args.nprocs) 312 | reduced_acc5 = reduce_mean(acc5, args.nprocs) 313 | 314 | losses.update(reduced_loss.item(), images.size(0)) 315 | top1.update(reduced_acc1.item(), images.size(0)) 316 | top5.update(reduced_acc5.item(), images.size(0)) 317 | 318 | # measure elapsed time 319 | batch_time.update(time.time() - end) 320 | end = time.time() 321 | 322 | if i % args.print_freq == 0: 323 | progress.display(i) 324 | 325 | # TODO: this should also be done with the ProgressMeter 326 | print(' * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'.format(top1=top1, 327 | top5=top5)) 328 | 329 | return top1.avg 330 | 331 | 332 | def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'): 333 | torch.save(state, filename) 334 | if is_best: 335 | shutil.copyfile(filename, 'model_best.pth.tar') 336 | 337 | 338 | class AverageMeter(object): 339 | """Computes and stores the average and current value""" 340 | def __init__(self, name, fmt=':f'): 341 | self.name = name 342 | self.fmt = fmt 343 | self.reset() 344 | 
345 | def reset(self): 346 | self.val = 0 347 | self.avg = 0 348 | self.sum = 0 349 | self.count = 0 350 | 351 | def update(self, val, n=1): 352 | self.val = val 353 | self.sum += val * n 354 | self.count += n 355 | self.avg = self.sum / self.count 356 | 357 | def __str__(self): 358 | fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})' 359 | return fmtstr.format(**self.__dict__) 360 | 361 | 362 | class ProgressMeter(object): 363 | def __init__(self, num_batches, meters, prefix=""): 364 | self.batch_fmtstr = self._get_batch_fmtstr(num_batches) 365 | self.meters = meters 366 | self.prefix = prefix 367 | 368 | def display(self, batch): 369 | entries = [self.prefix + self.batch_fmtstr.format(batch)] 370 | entries += [str(meter) for meter in self.meters] 371 | print('\t'.join(entries)) 372 | 373 | def _get_batch_fmtstr(self, num_batches): 374 | num_digits = len(str(num_batches // 1)) 375 | fmt = '{:' + str(num_digits) + 'd}' 376 | return '[' + fmt + '/' + fmt.format(num_batches) + ']' 377 | 378 | 379 | def adjust_learning_rate(optimizer, epoch, args): 380 | """Sets the learning rate to the initial LR decayed by 10 every 30 epochs""" 381 | lr = args.lr * (0.1**(epoch // 30)) 382 | for param_group in optimizer.param_groups: 383 | param_group['lr'] = lr 384 | 385 | 386 | def accuracy(output, target, topk=(1, )): 387 | """Computes the accuracy over the k top predictions for the specified values of k""" 388 | with torch.no_grad(): 389 | maxk = max(topk) 390 | batch_size = target.size(0) 391 | 392 | _, pred = output.topk(maxk, 1, True, True) 393 | pred = pred.t() 394 | correct = pred.eq(target.view(1, -1).expand_as(pred)) 395 | 396 | res = [] 397 | for k in topk: 398 | correct_k = correct[:k].view(-1).float().sum(0, keepdim=True) 399 | res.append(correct_k.mul_(100.0 / batch_size)) 400 | return res 401 | 402 | 403 | if __name__ == '__main__': 404 | main() 405 | -------------------------------------------------------------------------------- /multiprocessing_distributed.py: -------------------------------------------------------------------------------- 1 | import csv 2 | 3 | import argparse 4 | import os 5 | import random 6 | import shutil 7 | import time 8 | import warnings 9 | 10 | import torch 11 | import torch.nn as nn 12 | import torch.nn.parallel 13 | import torch.backends.cudnn as cudnn 14 | import torch.distributed as dist 15 | import torch.optim 16 | import torch.multiprocessing as mp 17 | import torch.utils.data 18 | import torch.utils.data.distributed 19 | import torchvision.transforms as transforms 20 | import torchvision.datasets as datasets 21 | import torchvision.models as models 22 | 23 | model_names = sorted(name for name in models.__dict__ 24 | if name.islower() and not name.startswith("__") 25 | and callable(models.__dict__[name])) 26 | 27 | parser = argparse.ArgumentParser(description='PyTorch ImageNet Training') 28 | parser.add_argument('--data', 29 | metavar='DIR', 30 | default='/home/zhangzhi/Data/exports/ImageNet2012', 31 | help='path to dataset') 32 | parser.add_argument('-a', 33 | '--arch', 34 | metavar='ARCH', 35 | default='resnet18', 36 | choices=model_names, 37 | help='model architecture: ' + ' | '.join(model_names) + 38 | ' (default: resnet18)') 39 | parser.add_argument('-j', 40 | '--workers', 41 | default=4, 42 | type=int, 43 | metavar='N', 44 | help='number of data loading workers (default: 4)') 45 | parser.add_argument('--epochs', 46 | default=90, 47 | type=int, 48 | metavar='N', 49 | help='number of total epochs to run') 50 | 
parser.add_argument('--start-epoch', 51 | default=0, 52 | type=int, 53 | metavar='N', 54 | help='manual epoch number (useful on restarts)') 55 | parser.add_argument('-b', 56 | '--batch-size', 57 | default=3200, 58 | type=int, 59 | metavar='N', 60 | help='mini-batch size (default: 256), this is the total ' 61 | 'batch size of all GPUs on the current node when ' 62 | 'using Data Parallel or Distributed Data Parallel') 63 | parser.add_argument('--lr', 64 | '--learning-rate', 65 | default=0.1, 66 | type=float, 67 | metavar='LR', 68 | help='initial learning rate', 69 | dest='lr') 70 | parser.add_argument('--momentum', 71 | default=0.9, 72 | type=float, 73 | metavar='M', 74 | help='momentum') 75 | parser.add_argument('--wd', 76 | '--weight-decay', 77 | default=1e-4, 78 | type=float, 79 | metavar='W', 80 | help='weight decay (default: 1e-4)', 81 | dest='weight_decay') 82 | parser.add_argument('-p', 83 | '--print-freq', 84 | default=10, 85 | type=int, 86 | metavar='N', 87 | help='print frequency (default: 10)') 88 | parser.add_argument('-e', 89 | '--evaluate', 90 | dest='evaluate', 91 | action='store_true', 92 | help='evaluate model on validation set') 93 | parser.add_argument('--pretrained', 94 | dest='pretrained', 95 | action='store_true', 96 | help='use pre-trained model') 97 | parser.add_argument('--seed', 98 | default=None, 99 | type=int, 100 | help='seed for initializing training. ') 101 | 102 | 103 | def reduce_mean(tensor, nprocs): 104 | rt = tensor.clone() 105 | dist.all_reduce(rt, op=dist.ReduceOp.SUM) 106 | rt /= nprocs 107 | return rt 108 | 109 | 110 | def main(): 111 | args = parser.parse_args() 112 | args.nprocs = torch.cuda.device_count() 113 | 114 | mp.spawn(main_worker, nprocs=args.nprocs, args=(args.nprocs, args)) 115 | 116 | 117 | def main_worker(local_rank, nprocs, args): 118 | args.local_rank = local_rank 119 | 120 | if args.seed is not None: 121 | random.seed(args.seed) 122 | torch.manual_seed(args.seed) 123 | cudnn.deterministic = True 124 | warnings.warn('You have chosen to seed training. ' 125 | 'This will turn on the CUDNN deterministic setting, ' 126 | 'which can slow down your training considerably! 
' 127 | 'You may see unexpected behavior when restarting ' 128 | 'from checkpoints.') 129 | 130 | best_acc1 = .0 131 | 132 | dist.init_process_group(backend='nccl', 133 | init_method='tcp://127.0.0.1:23456', 134 | world_size=args.nprocs, 135 | rank=local_rank) 136 | # create model 137 | if args.pretrained: 138 | print("=> using pre-trained model '{}'".format(args.arch)) 139 | model = models.__dict__[args.arch](pretrained=True) 140 | else: 141 | print("=> creating model '{}'".format(args.arch)) 142 | model = models.__dict__[args.arch]() 143 | 144 | torch.cuda.set_device(local_rank) 145 | model.cuda(local_rank) 146 | # When using a single GPU per process and per 147 | # DistributedDataParallel, we need to divide the batch size 148 | # ourselves based on the total number of GPUs we have 149 | args.batch_size = int(args.batch_size / args.nprocs) 150 | model = torch.nn.parallel.DistributedDataParallel(model, 151 | device_ids=[local_rank]) 152 | 153 | # define loss function (criterion) and optimizer 154 | criterion = nn.CrossEntropyLoss().cuda(local_rank) 155 | 156 | optimizer = torch.optim.SGD(model.parameters(), 157 | args.lr, 158 | momentum=args.momentum, 159 | weight_decay=args.weight_decay) 160 | 161 | cudnn.benchmark = True 162 | 163 | # Data loading code 164 | traindir = os.path.join(args.data, 'train') 165 | valdir = os.path.join(args.data, 'val') 166 | normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], 167 | std=[0.229, 0.224, 0.225]) 168 | 169 | train_dataset = datasets.ImageFolder( 170 | traindir, 171 | transforms.Compose([ 172 | transforms.RandomResizedCrop(224), 173 | transforms.RandomHorizontalFlip(), 174 | transforms.ToTensor(), 175 | normalize, 176 | ])) 177 | train_sampler = torch.utils.data.distributed.DistributedSampler( 178 | train_dataset) 179 | train_loader = torch.utils.data.DataLoader(train_dataset, 180 | batch_size=args.batch_size, 181 | num_workers=2, 182 | pin_memory=True, 183 | sampler=train_sampler) 184 | 185 | val_dataset = datasets.ImageFolder( 186 | valdir, 187 | transforms.Compose([ 188 | transforms.Resize(256), 189 | transforms.CenterCrop(224), 190 | transforms.ToTensor(), 191 | normalize, 192 | ])) 193 | val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset) 194 | val_loader = torch.utils.data.DataLoader(val_dataset, 195 | batch_size=args.batch_size, 196 | num_workers=2, 197 | pin_memory=True, 198 | sampler=val_sampler) 199 | 200 | if args.evaluate: 201 | validate(val_loader, model, criterion, local_rank, args) 202 | return 203 | 204 | for epoch in range(args.start_epoch, args.epochs): 205 | 206 | train_sampler.set_epoch(epoch) 207 | val_sampler.set_epoch(epoch) 208 | 209 | adjust_learning_rate(optimizer, epoch, args) 210 | 211 | # train for one epoch 212 | train(train_loader, model, criterion, optimizer, epoch, local_rank, 213 | args) 214 | 215 | # evaluate on validation set 216 | acc1 = validate(val_loader, model, criterion, local_rank, args) 217 | 218 | # remember best acc@1 and save checkpoint 219 | is_best = acc1 > best_acc1 220 | best_acc1 = max(acc1, best_acc1) 221 | 222 | if args.local_rank == 0: 223 | save_checkpoint( 224 | { 225 | 'epoch': epoch + 1, 226 | 'arch': args.arch, 227 | 'state_dict': model.module.state_dict(), 228 | 'best_acc1': best_acc1, 229 | }, is_best) 230 | 231 | 232 | def train(train_loader, model, criterion, optimizer, epoch, local_rank, args): 233 | batch_time = AverageMeter('Time', ':6.3f') 234 | data_time = AverageMeter('Data', ':6.3f') 235 | losses = AverageMeter('Loss', ':.4e') 236 | top1 = 
AverageMeter('Acc@1', ':6.2f') 237 | top5 = AverageMeter('Acc@5', ':6.2f') 238 | progress = ProgressMeter(len(train_loader), 239 | [batch_time, data_time, losses, top1, top5], 240 | prefix="Epoch: [{}]".format(epoch)) 241 | 242 | # switch to train mode 243 | model.train() 244 | 245 | end = time.time() 246 | for i, (images, target) in enumerate(train_loader): 247 | # measure data loading time 248 | data_time.update(time.time() - end) 249 | 250 | images = images.cuda(local_rank, non_blocking=True) 251 | target = target.cuda(local_rank, non_blocking=True) 252 | 253 | # compute output 254 | output = model(images) 255 | loss = criterion(output, target) 256 | 257 | # measure accuracy and record loss 258 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 259 | 260 | torch.distributed.barrier() 261 | 262 | reduced_loss = reduce_mean(loss, args.nprocs) 263 | reduced_acc1 = reduce_mean(acc1, args.nprocs) 264 | reduced_acc5 = reduce_mean(acc5, args.nprocs) 265 | 266 | losses.update(reduced_loss.item(), images.size(0)) 267 | top1.update(reduced_acc1.item(), images.size(0)) 268 | top5.update(reduced_acc5.item(), images.size(0)) 269 | 270 | # compute gradient and do SGD step 271 | optimizer.zero_grad() 272 | loss.backward() 273 | optimizer.step() 274 | 275 | # measure elapsed time 276 | batch_time.update(time.time() - end) 277 | end = time.time() 278 | 279 | if i % args.print_freq == 0: 280 | progress.display(i) 281 | 282 | 283 | def validate(val_loader, model, criterion, local_rank, args): 284 | batch_time = AverageMeter('Time', ':6.3f') 285 | losses = AverageMeter('Loss', ':.4e') 286 | top1 = AverageMeter('Acc@1', ':6.2f') 287 | top5 = AverageMeter('Acc@5', ':6.2f') 288 | progress = ProgressMeter(len(val_loader), [batch_time, losses, top1, top5], 289 | prefix='Test: ') 290 | 291 | # switch to evaluate mode 292 | model.eval() 293 | 294 | with torch.no_grad(): 295 | end = time.time() 296 | for i, (images, target) in enumerate(val_loader): 297 | images = images.cuda(local_rank, non_blocking=True) 298 | target = target.cuda(local_rank, non_blocking=True) 299 | 300 | # compute output 301 | output = model(images) 302 | loss = criterion(output, target) 303 | 304 | # measure accuracy and record loss 305 | acc1, acc5 = accuracy(output, target, topk=(1, 5)) 306 | 307 | torch.distributed.barrier() 308 | 309 | reduced_loss = reduce_mean(loss, args.nprocs) 310 | reduced_acc1 = reduce_mean(acc1, args.nprocs) 311 | reduced_acc5 = reduce_mean(acc5, args.nprocs) 312 | 313 | losses.update(reduced_loss.item(), images.size(0)) 314 | top1.update(reduced_acc1.item(), images.size(0)) 315 | top5.update(reduced_acc5.item(), images.size(0)) 316 | 317 | # measure elapsed time 318 | batch_time.update(time.time() - end) 319 | end = time.time() 320 | 321 | if i % args.print_freq == 0: 322 | progress.display(i) 323 | 324 | # TODO: this should also be done with the ProgressMeter 325 | print(' * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'.format(top1=top1, 326 | top5=top5)) 327 | 328 | return top1.avg 329 | 330 | 331 | def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'): 332 | torch.save(state, filename) 333 | if is_best: 334 | shutil.copyfile(filename, 'model_best.pth.tar') 335 | 336 | 337 | class AverageMeter(object): 338 | """Computes and stores the average and current value""" 339 | def __init__(self, name, fmt=':f'): 340 | self.name = name 341 | self.fmt = fmt 342 | self.reset() 343 | 344 | def reset(self): 345 | self.val = 0 346 | self.avg = 0 347 | self.sum = 0 348 | self.count = 0 349 | 350 | def update(self, 
val, n=1): 351 | self.val = val 352 | self.sum += val * n 353 | self.count += n 354 | self.avg = self.sum / self.count 355 | 356 | def __str__(self): 357 | fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})' 358 | return fmtstr.format(**self.__dict__) 359 | 360 | 361 | class ProgressMeter(object): 362 | def __init__(self, num_batches, meters, prefix=""): 363 | self.batch_fmtstr = self._get_batch_fmtstr(num_batches) 364 | self.meters = meters 365 | self.prefix = prefix 366 | 367 | def display(self, batch): 368 | entries = [self.prefix + self.batch_fmtstr.format(batch)] 369 | entries += [str(meter) for meter in self.meters] 370 | print('\t'.join(entries)) 371 | 372 | def _get_batch_fmtstr(self, num_batches): 373 | num_digits = len(str(num_batches // 1)) 374 | fmt = '{:' + str(num_digits) + 'd}' 375 | return '[' + fmt + '/' + fmt.format(num_batches) + ']' 376 | 377 | 378 | def adjust_learning_rate(optimizer, epoch, args): 379 | """Sets the learning rate to the initial LR decayed by 10 every 30 epochs""" 380 | lr = args.lr * (0.1**(epoch // 30)) 381 | for param_group in optimizer.param_groups: 382 | param_group['lr'] = lr 383 | 384 | 385 | def accuracy(output, target, topk=(1, )): 386 | """Computes the accuracy over the k top predictions for the specified values of k""" 387 | with torch.no_grad(): 388 | maxk = max(topk) 389 | batch_size = target.size(0) 390 | 391 | _, pred = output.topk(maxk, 1, True, True) 392 | pred = pred.t() 393 | correct = pred.eq(target.view(1, -1).expand_as(pred)) 394 | 395 | res = [] 396 | for k in topk: 397 | correct_k = correct[:k].view(-1).float().sum(0, keepdim=True) 398 | res.append(correct_k.mul_(100.0 / batch_size)) 399 | return res 400 | 401 | 402 | if __name__ == '__main__': 403 | main() -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | torch==1.3.0 2 | torchvision==0.4.0 3 | apex==0.9.10 4 | horovod==0.18.2 5 | -------------------------------------------------------------------------------- /start.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=0,1,2,3 python multiprocessing_distributed.py 2 | CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 distributed.py 3 | CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 apex_distributed.py 4 | HOROVOD_WITH_PYTORCH=1 CUDA_VISIBLE_DEVICES=0,1,2,3 horovodrun -np 4 -H localhost:4 --verbose python horovod_distributed.py 5 | srun -N2 --gres gpu:4 python distributed_slurm_main.py --dist-file dist_file -------------------------------------------------------------------------------- /statistics.sh: -------------------------------------------------------------------------------- 1 | nvidia-smi -i 0,1,2,3 --format=csv,noheader,nounits --query-gpu=timestamp,index,memory.total,memory.used,memory.free,utilization.gpu,utilization.memory -lms 500 -f multiprocessing_distributed_log.csv 2 | nvidia-smi -i 0,1,2,3 --format=csv,noheader,nounits --query-gpu=timestamp,index,memory.total,memory.used,memory.free,utilization.gpu,utilization.memory -lms 500 -f distributed_log.csv 3 | nvidia-smi -i 0,1,2,3 --format=csv,noheader,nounits --query-gpu=timestamp,index,memory.total,memory.used,memory.free,utilization.gpu,utilization.memory -lms 500 -f apex_distributed_log.csv 4 | nvidia-smi -i 0,1,2,3 --format=csv,noheader,nounits 
--query-gpu=timestamp,index,memory.total,memory.used,memory.free,utilization.gpu,utilization.memory -lms 500 -f horovod_distributed_log.csv --------------------------------------------------------------------------------