# pytorch-sync-batchnorm-example

The default behavior of batchnorm, in PyTorch and most other frameworks, is to compute batch statistics separately for each device. This means that if we train a model with batchnorm layers on multiple GPUs, the statistics will not reflect the whole *batch*; they will only reflect the slice of data passed to each GPU. The intuition is that this may harm convergence and hurt performance, and this drop is in fact known to happen for object detection models and GANs.

In order to compute batchnorm statistics across all GPUs, we need to use the synchronized batchnorm module that was recently released by PyTorch. This requires some changes to our code: `SyncBatchNorm` cannot be used with `nn.DataParallel(...)`. It requires a very specific setup: `torch.nn.parallel.DistributedDataParallel(...)` in the multi-process single-GPU configuration, i.e. one process per GPU. Below we show, step by step, how to use `SyncBatchNorm` on a single machine with multiple GPUs.

## Basic Idea

We'll launch one process for each GPU. Our training script will be given a `rank` argument, an integer that tells us which process is being launched; `rank 0` is the master. This way we can control what each process does. For example, we may want to print losses and other logging to the console only on the master process.

## Step 1: Parsing the local_rank argument

This argument is how we know which process is being launched. We keep the usual arguments to our script and just add one more to parse `--local_rank`.

```python
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()
```

## Step 2: Setting up the process and device

Next we need to initialize the process group. We do this by adding the following code to our script (`args.ngpu` is the total number of GPUs, parsed the same way as `--local_rank`):

```python
torch.cuda.set_device(args.local_rank)

world_size = args.ngpu
torch.distributed.init_process_group(
    'nccl',
    init_method='env://',
    world_size=world_size,
    rank=args.local_rank,
)
```

## Step 3: Converting your model to use torch.nn.SyncBatchNorm

We don't need to change our model; it stays as it is. We just need to convert its regular batchnorm layers to [torch.nn.SyncBatchNorm](https://pytorch.org/docs/master/nn.html#torch.nn.SyncBatchNorm).

```python
net = torch.nn.SyncBatchNorm.convert_sync_batchnorm(net)
```

It is our job to send the model to its device (one copy per process):

```python
device = torch.device('cuda:{}'.format(args.local_rank))
net = net.to(device)
```

Remember that we need to do the same for the inputs of the model, i.e.:

```python
for it, (input, target) in enumerate(self.data_loader):
    input, target = input.to(device), target.to(device)
```
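As a quick sanity check (not part of the training script itself), the conversion can be tried on a hypothetical toy model; every regular `nn.BatchNorm*d` layer is replaced by an `nn.SyncBatchNorm` layer that keeps the original parameters and running statistics:

```python
import torch
import torch.nn as nn

# Hypothetical toy model with a regular batchnorm layer.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

net = torch.nn.SyncBatchNorm.convert_sync_batchnorm(net)

# The BatchNorm2d layer is now a SyncBatchNorm layer.
assert isinstance(net[1], nn.SyncBatchNorm)
```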
## Step 4: Wrapping your model with DistributedDataParallel

The same way we wrapped our model with `DataParallel`, we now wrap it with [DistributedDataParallel](https://pytorch.org/docs/master/nn.html#distributeddataparallel).

```python
net = torch.nn.parallel.DistributedDataParallel(
    net,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
)
```

## Step 5: Adapting your DataLoader

Since we are going to launch multiple processes, we need to take care of which portion of the data is served to each process. This is very simple, assuming you already have your `Dataset` implemented.

```python
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset,
    num_replicas=args.ngpu,
    rank=args.local_rank,
)
data_loader = DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=8,
    pin_memory=True,
    sampler=sampler,
    drop_last=True,
)
```

## Step 6: Launching the processes

After we parse `--local_rank` and take care of what happens in each process, we can launch the processes with the [torch.distributed.launch](https://pytorch.org/docs/master/distributed.html#launch-utility) utility. `--nproc_per_node` is the number of GPUs.

```bash
python -m torch.distributed.launch --nproc_per_node=3 distributed_train.py \
    --arg1=arg1 --arg2=arg2 --arg3=arg3 --arg4=arg4 --argn=argn
```

`--arg1=arg1 --arg2=arg2 --arg3=arg3 --arg4=arg4 --argn=argn` are just the regular arguments we pass to our training script.
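Putting it all together, here is a minimal end-to-end sketch of what a `distributed_train.py` could look like, to be launched as in Step 6. The model, the random `TensorDataset`, the optimizer, and the `--ngpu` argument are placeholder assumptions, not part of the original example; only the `SyncBatchNorm` / `DistributedDataParallel` plumbing mirrors the steps above.

```python
# distributed_train.py -- minimal sketch tying the steps together.
# Model, dataset, optimizer, and the --ngpu argument are placeholders.
import argparse

import torch
import torch.distributed
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
parser.add_argument('--ngpu', type=int, default=3)
args = parser.parse_args()

# Step 2: bind this process to its GPU and join the process group.
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(
    'nccl', init_method='env://',
    world_size=args.ngpu, rank=args.local_rank,
)
device = torch.device('cuda:{}'.format(args.local_rank))

# Step 3: convert batchnorm layers and move the model to this GPU.
net = nn.Sequential(nn.Linear(32, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Linear(64, 10))
net = torch.nn.SyncBatchNorm.convert_sync_batchnorm(net).to(device)

# Step 4: wrap with DistributedDataParallel (one GPU per process).
net = torch.nn.parallel.DistributedDataParallel(
    net, device_ids=[args.local_rank], output_device=args.local_rank,
)

# Step 5: shard a (random, placeholder) dataset across the processes.
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
sampler = DistributedSampler(dataset, num_replicas=args.ngpu, rank=args.local_rank)
data_loader = DataLoader(dataset, batch_size=64, sampler=sampler, drop_last=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

for epoch in range(2):
    sampler.set_epoch(epoch)  # so each epoch gets a different shuffle
    for input, target in data_loader:
        input, target = input.to(device), target.to(device)
        loss = criterion(net(input), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if args.local_rank == 0:  # log only on the master process
        print('epoch {}: last loss {:.4f}'.format(epoch, loss.item()))
```

Note the `sampler.set_epoch(epoch)` call: `DistributedSampler` uses the epoch number to seed its shuffling, so without it every epoch would see the data in the same order.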