├── .gitignore
├── README.md
├── base.py
├── cifar10.py
└── imagenet.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.idea
.git
__pycache__
run_all.sh
get_time_all.sh
get_gpu_time.py
backup
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# PyTorch DataLoaders with DALI

PyTorch DataLoaders implemented with [nvidia-dali](https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html). We currently provide **CIFAR-10** and **ImageNet** dataloaders; more will be added in the future.

With two Intel(R) Xeon(R) Gold 6154 CPUs, one Tesla V100 GPU, and the whole dataset in a memory disk, DALI can **dramatically accelerate image preprocessing**:

| Time to iterate training data (bs=256) | CIFAR-10 | ImageNet |
| :-----------------------------: | :------: | :------: |
| DALI | 1.4s (2 processors) | 625s (8 processors) |
| torchvision | 280.1s (2 processors) | 13400s (8 processors) |

For CIFAR-10, this reduces training time **from 1 day to 1 hour** on our hardware.

## Requirements

You only need the nvidia-dali package, and the version should be >= 0.12 (we tested 0.11 and it did not work):

```bash
# for CUDA 9.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/9.0 nvidia-dali
# for CUDA 10.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/10.0 nvidia-dali
```

More details and documentation can be found [here](https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html#).

## Usage

You can use these dataloaders just like ordinary PyTorch dataloaders, as in the following example:

```python
from base import DALIDataloader
from cifar10 import HybridTrainPipe_CIFAR

pip_train = HybridTrainPipe_CIFAR(batch_size=TRAIN_BS,
                                  num_threads=NUM_WORKERS,
                                  device_id=0,
                                  data_dir=IMG_DIR,
                                  crop=CROP_SIZE,
                                  world_size=1,
                                  local_rank=0,
                                  cutout=0)
train_loader = DALIDataloader(pipeline=pip_train,
                              size=CIFAR_IMAGES_NUM_TRAIN,
                              batch_size=TRAIN_BS,
                              onehot_label=True)
for i, data in enumerate(train_loader):  # use it just like a PyTorch dataloader
    images = data[0].cuda(non_blocking=True)
    labels = data[1].cuda(non_blocking=True)
```

If you have enough memory to hold the dataset, we strongly recommend mounting a memory disk and putting the whole dataset in it to accelerate I/O:

```bash
mount -t tmpfs -o size=20g tmpfs /userhome/memory_data
```

It is noteworthy that `20g` above is only a ceiling: the tmpfs does **not** occupy 20g of memory at the moment you mount it; memory is consumed as you copy the dataset into it. Compressed files should **not** be extracted before you have copied them into memory, otherwise it can be much slower.
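When sizing training loops, note that `DALIDataloader.__len__` in `base.py` computes the number of batches per epoch with a ceiling division. A minimal standalone sketch of that calculation (the helper name `num_batches` is ours, not part of this repo):

```python
def num_batches(size, batch_size):
    # mirrors DALIDataloader.__len__: ceiling division of
    # dataset size by batch size (last partial batch counts)
    if size % batch_size == 0:
        return size // batch_size
    return size // batch_size + 1

print(num_batches(50000, 256))  # CIFAR-10 train, bs=256 -> 196
print(num_batches(10000, 200))  # CIFAR-10 test, bs=200 -> 50
```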
--------------------------------------------------------------------------------
/base.py:
--------------------------------------------------------------------------------
from nvidia.dali.plugin.pytorch import DALIGenericIterator


class DALIDataloader(DALIGenericIterator):
    def __init__(self, pipeline, size, batch_size, output_map=["data", "label"], auto_reset=True, onehot_label=False):
        self.size = size
        self.batch_size = batch_size
        self.onehot_label = onehot_label
        self.output_map = output_map
        super().__init__(pipelines=pipeline, size=size, auto_reset=auto_reset, output_map=output_map)

    def __next__(self):
        if self._first_batch is not None:
            batch = self._first_batch
            self._first_batch = None
            return batch
        data = super().__next__()[0]
        if self.onehot_label:
            return [data[self.output_map[0]], data[self.output_map[1]].squeeze().long()]
        else:
            return [data[self.output_map[0]], data[self.output_map[1]]]

    def __len__(self):
        if self.size % self.batch_size == 0:
            return self.size // self.batch_size
        else:
            return self.size // self.batch_size + 1
--------------------------------------------------------------------------------
/cifar10.py:
--------------------------------------------------------------------------------
import os
import sys
import time
import torch
import pickle
import numpy as np
import nvidia.dali.ops as ops
from base import DALIDataloader
import nvidia.dali.types as types
from sklearn.utils import shuffle
from torchvision.datasets import CIFAR10
from nvidia.dali.pipeline import Pipeline
import torchvision.transforms as transforms

CIFAR_MEAN = [0.49139968, 0.48215827, 0.44653124]
CIFAR_STD = [0.24703233, 0.24348505, 0.26158768]
CIFAR_IMAGES_NUM_TRAIN = 50000
CIFAR_IMAGES_NUM_TEST = 10000
IMG_DIR = '/userhome/data/cifar10'
TRAIN_BS = 256
TEST_BS = 200
NUM_WORKERS = 4
CROP_SIZE = 32


class HybridTrainPipe_CIFAR(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_dir, crop=32, dali_cpu=False, local_rank=0,
                 world_size=1, cutout=0):
        super(HybridTrainPipe_CIFAR, self).__init__(batch_size, num_threads, device_id, seed=12 + device_id)
        self.iterator = iter(CIFAR_INPUT_ITER(batch_size, 'train', root=data_dir))
        dali_device = "gpu"
        self.input = ops.ExternalSource()
        self.input_label = ops.ExternalSource()
        self.pad = ops.Paste(device=dali_device, ratio=1.25, fill_value=0)
        self.uniform = ops.Uniform(range=(0., 1.))
        self.crop = ops.Crop(device=dali_device, crop_h=crop, crop_w=crop)
        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                            output_dtype=types.FLOAT,
                                            output_layout=types.NCHW,
                                            image_type=types.RGB,
                                            mean=[0.49139968 * 255., 0.48215827 * 255., 0.44653124 * 255.],
                                            std=[0.24703233 * 255., 0.24348505 * 255., 0.26158768 * 255.])
        self.coin = ops.CoinFlip(probability=0.5)

    def iter_setup(self):
        (images, labels) = self.iterator.next()
        self.feed_input(self.jpegs, images, layout="HWC")
        self.feed_input(self.labels, labels)

    def define_graph(self):
        rng = self.coin()
        self.jpegs = self.input()
        self.labels = self.input_label()
        output = self.jpegs
        output = self.pad(output.gpu())
        output = self.crop(output, crop_pos_x=self.uniform(), crop_pos_y=self.uniform())
        output = self.cmnp(output, mirror=rng)
        return [output, self.labels]


class HybridTestPipe_CIFAR(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_dir, crop, size, local_rank=0, world_size=1):
        super(HybridTestPipe_CIFAR, self).__init__(batch_size, num_threads, device_id, seed=12 + device_id)
        self.iterator = iter(CIFAR_INPUT_ITER(batch_size, 'val', root=data_dir))
        self.input = ops.ExternalSource()
        self.input_label = ops.ExternalSource()
        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                            output_dtype=types.FLOAT,
                                            output_layout=types.NCHW,
                                            image_type=types.RGB,
                                            mean=[0.49139968 * 255., 0.48215827 * 255., 0.44653124 * 255.],
                                            std=[0.24703233 * 255., 0.24348505 * 255., 0.26158768 * 255.])

    def iter_setup(self):
        (images, labels) = self.iterator.next()
        self.feed_input(self.jpegs, images, layout="HWC")  # external data can only be fed in HWC layout
        self.feed_input(self.labels, labels)

    def define_graph(self):
        self.jpegs = self.input()
        self.labels = self.input_label()
        output = self.jpegs
        output = self.cmnp(output.gpu())
        return [output, self.labels]


class CIFAR_INPUT_ITER():
    base_folder = 'cifar-10-batches-py'
    train_list = [
        ['data_batch_1', 'c99cafc152244af753f735de768cd75f'],
        ['data_batch_2', 'd4bba439e000b95fd0a9bffe97cbabec'],
        ['data_batch_3', '54ebc095f3ab1f0389bbae665268c751'],
        ['data_batch_4', '634d18415352ddfa80567beed471001a'],
        ['data_batch_5', '482c414d41f54cd18b22e5b47cb7c3cb'],
    ]

    test_list = [
        ['test_batch', '40351d587109b95175f43aff81a1287e'],
    ]

    def __init__(self, batch_size, type='train', root='/userhome/memory_data/cifar10'):
        self.root = root
        self.batch_size = batch_size
        self.train = (type == 'train')
        if self.train:
            downloaded_list = self.train_list
        else:
            downloaded_list = self.test_list

        self.data = []
        self.targets = []
        for file_name, checksum in downloaded_list:
            file_path = os.path.join(self.root, self.base_folder, file_name)
            with open(file_path, 'rb') as f:
                if sys.version_info[0] == 2:
                    entry = pickle.load(f)
                else:
                    entry = pickle.load(f, encoding='latin1')
                self.data.append(entry['data'])
                if 'labels' in entry:
                    self.targets.extend(entry['labels'])
                else:
                    self.targets.extend(entry['fine_labels'])

        self.data = np.vstack(self.data).reshape(-1, 3, 32, 32)
        self.targets = np.vstack(self.targets)
        self.data = self.data.transpose((0, 2, 3, 1))  # convert to HWC
        np.save("cifar.npy", self.data)
        self.data = np.load('cifar.npy')  # to serialize, increase locality

    def __iter__(self):
        self.i = 0
        self.n = len(self.data)
        return self

    def __next__(self):
        batch = []
        labels = []
        for _ in range(self.batch_size):
            if self.train and self.i % self.n == 0:  # reshuffle at the start of each epoch
                self.data, self.targets = shuffle(self.data, self.targets, random_state=0)
            img, label = self.data[self.i], self.targets[self.i]
            batch.append(img)
            labels.append(label)
            self.i = (self.i + 1) % self.n
        return (batch, labels)

    next = __next__


if __name__ == '__main__':
    # iteration over the DALI dataloader
    pip_train = HybridTrainPipe_CIFAR(batch_size=TRAIN_BS, num_threads=NUM_WORKERS, device_id=0, data_dir=IMG_DIR, crop=CROP_SIZE, world_size=1, local_rank=0, cutout=0)
    pip_test = HybridTestPipe_CIFAR(batch_size=TEST_BS, num_threads=NUM_WORKERS, device_id=0, data_dir=IMG_DIR, crop=CROP_SIZE, size=CROP_SIZE, world_size=1, local_rank=0)
    train_loader = DALIDataloader(pipeline=pip_train, size=CIFAR_IMAGES_NUM_TRAIN, batch_size=TRAIN_BS, onehot_label=True)
    test_loader = DALIDataloader(pipeline=pip_test, size=CIFAR_IMAGES_NUM_TEST, batch_size=TEST_BS, onehot_label=True)
    print("[DALI] train dataloader length: %d" % len(train_loader))
    print('[DALI] start iterate train dataloader')
    start = time.time()
    for i, data in enumerate(train_loader):
        images = data[0].cuda(non_blocking=True)
        labels = data[1].cuda(non_blocking=True)
    end = time.time()
    train_time = end - start
    print('[DALI] end train dataloader iteration')

    print("[DALI] test dataloader length: %d" % len(test_loader))
    print('[DALI] start iterate test dataloader')
    start = time.time()
    for i, data in enumerate(test_loader):
        images = data[0].cuda(non_blocking=True)
        labels = data[1].cuda(non_blocking=True)
    end = time.time()
    test_time = end - start
    print('[DALI] end test dataloader iteration')
    print('[DALI] iteration time: %fs [train], %fs [test]' % (train_time, test_time))

    # iteration over the PyTorch dataloader
    transform_train = transforms.Compose([
        transforms.RandomCrop(CROP_SIZE, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
    ])
    train_dst = CIFAR10(root=IMG_DIR, train=True, download=True, transform=transform_train)
    train_loader = torch.utils.data.DataLoader(train_dst, batch_size=TRAIN_BS, shuffle=True, pin_memory=True, num_workers=NUM_WORKERS)
    transform_test = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
    ])
    test_dst = CIFAR10(root=IMG_DIR, train=False, download=True, transform=transform_test)
    # bind the name iterated below (was test_iter, which left test_loader pointing at the DALI loader)
    test_loader = torch.utils.data.DataLoader(test_dst, batch_size=TEST_BS, shuffle=False, pin_memory=True, num_workers=NUM_WORKERS)
    print("[PyTorch] train dataloader length: %d" % len(train_loader))
    print('[PyTorch] start iterate train dataloader')
    start = time.time()
    for i, data in enumerate(train_loader):
        images = data[0].cuda(non_blocking=True)
        labels = data[1].cuda(non_blocking=True)
    end = time.time()
    train_time = end - start
    print('[PyTorch] end train dataloader iteration')

    print("[PyTorch] test dataloader length: %d" % len(test_loader))
    print('[PyTorch] start iterate test dataloader')
    start = time.time()
    for i, data in enumerate(test_loader):
        images = data[0].cuda(non_blocking=True)
        labels = data[1].cuda(non_blocking=True)
    end = time.time()
    test_time = end - start
    print('[PyTorch] end test dataloader iteration')
    print('[PyTorch] iteration time: %fs [train], %fs [test]' % (train_time, test_time))
--------------------------------------------------------------------------------
/imagenet.py:
--------------------------------------------------------------------------------
import os
import sys
import time
import torch
import pickle
import numpy as np
import nvidia.dali.ops as ops
from base import DALIDataloader
from torchvision import datasets
from sklearn.utils import shuffle
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline
import torchvision.transforms as transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]
IMAGENET_IMAGES_NUM_TRAIN = 1281167
IMAGENET_IMAGES_NUM_TEST = 50000
IMG_DIR = '/gdata/ImageNet2012'
TRAIN_BS = 256
TEST_BS = 200
NUM_WORKERS = 4
VAL_SIZE = 256
CROP_SIZE = 224


class HybridTrainPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_dir, crop, dali_cpu=False, local_rank=0, world_size=1):
        super(HybridTrainPipe, self).__init__(batch_size, num_threads, device_id, seed=12 + device_id)
        dali_device = "gpu"
        self.input = ops.FileReader(file_root=data_dir, shard_id=local_rank, num_shards=world_size, random_shuffle=True)
        self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
        self.res = ops.RandomResizedCrop(device="gpu", size=crop, random_area=[0.08, 1.25])
        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                            output_dtype=types.FLOAT,
                                            output_layout=types.NCHW,
                                            image_type=types.RGB,
                                            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                            std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
        self.coin = ops.CoinFlip(probability=0.5)
        print('DALI "{0}" variant'.format(dali_device))

    def define_graph(self):
        rng = self.coin()
        self.jpegs, self.labels = self.input(name="Reader")
        images = self.decode(self.jpegs)
        images = self.res(images)
        output = self.cmnp(images, mirror=rng)
        return [output, self.labels]


class HybridValPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_dir, crop, size, local_rank=0, world_size=1):
        super(HybridValPipe, self).__init__(batch_size, num_threads, device_id, seed=12 + device_id)
        self.input = ops.FileReader(file_root=data_dir, shard_id=local_rank, num_shards=world_size,
                                    random_shuffle=False)
        self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
        self.res = ops.Resize(device="gpu", resize_shorter=size, interp_type=types.INTERP_TRIANGULAR)
        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                            output_dtype=types.FLOAT,
                                            output_layout=types.NCHW,
                                            crop=(crop, crop),
                                            image_type=types.RGB,
                                            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                            std=[0.229 * 255, 0.224 * 255, 0.225 * 255])

    def define_graph(self):
        self.jpegs, self.labels = self.input(name="Reader")
        images = self.decode(self.jpegs)
        images = self.res(images)
        output = self.cmnp(images)
        return [output, self.labels]


if __name__ == '__main__':
    # iteration over the DALI dataloader
    pip_train = HybridTrainPipe(batch_size=TRAIN_BS, num_threads=NUM_WORKERS, device_id=0, data_dir=IMG_DIR+'/train', crop=CROP_SIZE, world_size=1, local_rank=0)
    pip_test = HybridValPipe(batch_size=TEST_BS, num_threads=NUM_WORKERS, device_id=0, data_dir=IMG_DIR+'/val', crop=CROP_SIZE, size=VAL_SIZE, world_size=1, local_rank=0)
    train_loader = DALIDataloader(pipeline=pip_train, size=IMAGENET_IMAGES_NUM_TRAIN, batch_size=TRAIN_BS, onehot_label=True)
    test_loader = DALIDataloader(pipeline=pip_test, size=IMAGENET_IMAGES_NUM_TEST, batch_size=TEST_BS, onehot_label=True)
    # print("[DALI] train dataloader length: %d" % len(train_loader))
    # print('[DALI] start iterate train dataloader')
    # start = time.time()
    # for i, data in enumerate(train_loader):
    #     images = data[0].cuda(non_blocking=True)
    #     labels = data[1].cuda(non_blocking=True)
    # end = time.time()
    # train_time = end - start
    # print('[DALI] end train dataloader iteration')

    print("[DALI] test dataloader length: %d" % len(test_loader))
    print('[DALI] start iterate test dataloader')
    start = time.time()
    for i, data in enumerate(test_loader):
        images = data[0].cuda(non_blocking=True)
        labels = data[1].cuda(non_blocking=True)
    end = time.time()
    test_time = end - start
    print('[DALI] end test dataloader iteration')
    # print('[DALI] iteration time: %fs [train], %fs [test]' % (train_time, test_time))
    print('[DALI] iteration time: %fs [test]' % (test_time))

    # iteration over the PyTorch dataloader
    transform_train = transforms.Compose([
        transforms.RandomResizedCrop(CROP_SIZE, scale=(0.08, 1.25)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    train_dst = datasets.ImageFolder(IMG_DIR+'/train', transform_train)
    train_loader = torch.utils.data.DataLoader(train_dst, batch_size=TRAIN_BS, shuffle=True, pin_memory=True, num_workers=NUM_WORKERS)
    transform_test = transforms.Compose([
        transforms.Resize(VAL_SIZE),
        transforms.CenterCrop(CROP_SIZE),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    test_dst = datasets.ImageFolder(IMG_DIR+'/val', transform_test)
    # bind the name iterated below (was test_iter, which left test_loader pointing at the DALI loader)
    test_loader = torch.utils.data.DataLoader(test_dst, batch_size=TEST_BS, shuffle=False, pin_memory=True, num_workers=NUM_WORKERS)
    # print("[PyTorch] train dataloader length: %d" % len(train_loader))
    # print('[PyTorch] start iterate train dataloader')
    # start = time.time()
    # for i, data in enumerate(train_loader):
    #     images = data[0].cuda(non_blocking=True)
    #     labels = data[1].cuda(non_blocking=True)
    # end = time.time()
    # train_time = end - start
    # print('[PyTorch] end train dataloader iteration')

    print("[PyTorch] test dataloader length: %d" % len(test_loader))
    print('[PyTorch] start iterate test dataloader')
    start = time.time()
    for i, data in enumerate(test_loader):
        images = data[0].cuda(non_blocking=True)
        labels = data[1].cuda(non_blocking=True)
    end = time.time()
    test_time = end - start
    print('[PyTorch] end test dataloader iteration')
    # print('[PyTorch] iteration time: %fs [train], %fs [test]' % (train_time, test_time))
    print('[PyTorch] iteration time: %fs [test]' % (test_time))
--------------------------------------------------------------------------------