├── README.md
└── chexpert.py

/README.md:
--------------------------------------------------------------------------------
# Robust Deep AUC Maximization [![pdf](https://img.shields.io/badge/Arxiv-pdf-orange.svg?style=flat)](https://arxiv.org/abs/2012.03173)

This is the official implementation of the paper "**Large-scale Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification**", published at **ICCV 2021**.

Requirements
---------
```bash
pip install libauc
```

Benchmark Datasets
---------
The benchmark datasets are [Cat&Dog](https://www.kaggle.com/c/dogs-vs-cats), [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html), [CIFAR100](https://www.cs.toronto.edu/~kriz/cifar.html), and [STL10](https://cs.stanford.edu/~acoates/stl10/). The example below shows how to construct their imbalanced versions:

### Example

#### Importing LibAUC & Loading Datasets
```python
from libauc.datasets import CIFAR10, CIFAR100, CAT_vs_DOG, STL10
(train_data, train_label), (test_data, test_label) = CIFAR10()
```

#### Constructing Imbalanced Datasets
```python
from libauc.datasets import imbalance_generator
SEED = 123
imratio = 0.1  # positive_samples / total_samples
(train_images, train_labels) = imbalance_generator(train_data, train_label, imratio=imratio, shuffle=True, random_seed=SEED)
(test_images, test_labels) = imbalance_generator(test_data, test_label, is_balanced=True, random_seed=SEED)
```

#### Making Dataloader for Training and Testing
```python
import torch

# ImageDataset (a torch.utils.data.Dataset wrapper around the arrays) and
# BATCH_SIZE (an integer) must be defined by the user; neither is defined in this snippet.
trainloader = torch.utils.data.DataLoader(ImageDataset(train_images, train_labels), batch_size=BATCH_SIZE, shuffle=True, num_workers=1, pin_memory=True, drop_last=True)
testloader = torch.utils.data.DataLoader(ImageDataset(test_images, test_labels, mode='test'), batch_size=BATCH_SIZE, shuffle=False, num_workers=1, pin_memory=True)
```

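To make the meaning of `imratio` concrete, here is a minimal pure-Python sketch that subsamples positives until positives / total is roughly the target ratio. It assumes the labels are already binary (0/1), and the helper name `make_imbalanced` is hypothetical. It illustrates the idea only and is not necessarily how `imbalance_generator` is implemented.

```python
import random


def make_imbalanced(labels, imratio, seed=123):
    """Return shuffled indices of a subset whose positive fraction is about `imratio`.

    Hypothetical helper: keeps every negative sample and randomly keeps only
    as many positives as needed so that positives / total ~= imratio.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y != 1]
    # Solve n_pos / (n_pos + n_neg) = imratio for n_pos (requires imratio < 1).
    n_pos = int(round(imratio * len(neg) / (1.0 - imratio)))
    n_pos = min(n_pos, len(pos))
    kept = rng.sample(pos, n_pos) + neg
    rng.shuffle(kept)
    return kept


labels = [1] * 500 + [0] * 500          # balanced toy labels
idx = make_imbalanced(labels, imratio=0.1)
ratio = sum(labels[i] for i in idx) / len(idx)
print(len(idx), round(ratio, 3))
```

Keeping all negatives and dropping positives is one simple design choice; other schemes (for example, fixing the total size) would give the same ratio with a different subset.
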
For instructions on training the models, please refer to this [Notebook](https://github.com/yzhuoning/LibAUC/blob/main/examples/02_Optimizing_AUROC_with_ResNet20_on_Imbalanced_CIFAR10.ipynb).


CheXpert
---------
CheXpert is a large chest X-ray dataset and competition consisting of 224,316 chest radiographs of 65,240 patients. Details about the dataset can be found at https://stanfordmlgroup.github.io/competitions/chexpert/. The dataloader used in the paper can be downloaded [here](https://github.com/Optimization-AI/ICCV2021_DeepAUC/blob/main/chexpert.py).

### Example

```python
import torch
from chexpert import CheXpert  # chexpert.py from this repository

root = "YOUR_DATA_PATH"  # must end with '/', because file names are appended directly
class_id = 0  # replace with your target: 0-4 selects one of the five diseases, -1 selects all five (multi-label)
trainSet = CheXpert(csv_path=root+'train.csv', image_root_path=root, use_frontal=True, image_size=224, mode='train', class_index=class_id)
testSet = CheXpert(csv_path=root+'valid.csv', image_root_path=root, use_frontal=True, image_size=224, mode='valid', class_index=class_id)
trainloader = torch.utils.data.DataLoader(trainSet, batch_size=32, num_workers=2, shuffle=True)
testloader = torch.utils.data.DataLoader(testSet, batch_size=32, num_workers=2, shuffle=False)
```

For instructions on training the models, please refer to this [Notebook](https://github.com/Optimization-AI/LibAUC/blob/main/examples/05_Optimizing_AUROC_Loss_with_DenseNet121_on_CheXpert.ipynb).
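The CheXpert CSV files mark uncertain findings with `-1` and leave some cells empty, and the `chexpert.py` loader in this repository resolves them per column before training. The sketch below mirrors those rules in plain Python for a single value. The helper name `impute_label` and the two set names are hypothetical; the loader itself applies the same mapping to whole pandas columns.

```python
import math

# Columns where an uncertain label (-1) is treated as positive.
U_ONES = {"Edema", "Atelectasis"}
# Columns where an uncertain label (-1) is treated as negative.
U_ZEROS = {"Cardiomegaly", "Consolidation", "Pleural Effusion"}


def impute_label(column, value):
    """Map one raw CSV label to the value used for training (hypothetical helper)."""
    # Missing entries (None or NaN) always become negative.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0.0
    if value == -1:
        if column in U_ONES:
            return 1.0
        if column in U_ZEROS:
            return 0.0
    # Certain labels, and -1 in any other column, are left unchanged.
    return float(value)


print(impute_label("Edema", -1), impute_label("Cardiomegaly", -1), impute_label("Edema", float("nan")))
```
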


Citation
---------
If you find this repo helpful, please cite the following paper:
```
@inproceedings{yuan2021robust,
	title={Large-scale Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification},
	author={Yuan, Zhuoning and Yan, Yan and Sonka, Milan and Yang, Tianbao},
	booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
	year={2021}
	}
```

Contact
----------
If you have any questions, please contact [Zhuoning Yuan](https://homepage.divms.uiowa.edu/~zhuoning/) [yzhuoning@gmail.com] and [Tianbao Yang](http://people.tamu.edu/~tianbao-yang/) [tianbao-yang@tamu.edu], or open a new issue on GitHub.

--------------------------------------------------------------------------------
/chexpert.py:
--------------------------------------------------------------------------------
import numpy as np
import torch
from torch.utils.data import Dataset
import torchvision.transforms as tfs
import cv2
from PIL import Image
import pandas as pd

class CheXpert(Dataset):
    '''
    Reference:
        @inproceedings{yuan2021robust,
            title={Large-scale Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification},
            author={Yuan, Zhuoning and Yan, Yan and Sonka, Milan and Yang, Tianbao},
            booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
            year={2021}
            }
    '''
    def __init__(self,
                 csv_path,
                 image_root_path='',
                 image_size=320,
                 class_index=0,
                 use_frontal=True,
                 use_upsampling=True,
                 flip_label=False,
                 shuffle=True,
                 seed=123,
                 verbose=True,
                 upsampling_cols=['Cardiomegaly', 'Consolidation'],
                 train_cols=['Cardiomegaly', 'Edema', 'Consolidation', 'Atelectasis', 'Pleural Effusion'],
                 mode='train'):

        # load data from csv
        self.df = pd.read_csv(csv_path)
        # strip the dataset folder prefix so that paths are relative to image_root_path
        self.df['Path'] = self.df['Path'].str.replace('CheXpert-v1.0-small/', '', regex=False)
        self.df['Path'] = self.df['Path'].str.replace('CheXpert-v1.0/', '', regex=False)
        if use_frontal:
            # copy so that later column assignments do not act on a view of the original frame
            self.df = self.df[self.df['Frontal/Lateral'] == 'Frontal'].copy()

        # upsample selected cols: append one extra copy of every row that is positive in the column
        if use_upsampling:
            assert isinstance(upsampling_cols, list), 'Input should be list!'
            sampled_df_list = []
            for col in upsampling_cols:
                print('Upsampling %s...' % col)
                sampled_df_list.append(self.df[self.df[col] == 1])
            self.df = pd.concat([self.df] + sampled_df_list, axis=0)

        # impute missing values
        # (assign the result back instead of calling inplace methods on a column,
        # which does not reliably modify the frame in recent pandas versions)
        for col in train_cols:
            if col in ['Edema', 'Atelectasis']:
                # uncertain (-1) --> positive, missing --> negative
                self.df[col] = self.df[col].replace(-1, 1).fillna(0)
            elif col in ['Cardiomegaly', 'Consolidation', 'Pleural Effusion']:
                # uncertain (-1) --> negative, missing --> negative
                self.df[col] = self.df[col].replace(-1, 0).fillna(0)
            else:
                self.df[col] = self.df[col].fillna(0)

        self._num_images = len(self.df)

        # 0 --> -1
        if flip_label and class_index != -1:  # In multi-label mode we disable this option!
            self.df.replace(0, -1, inplace=True)

        # shuffle data
        if shuffle:
            data_index = list(range(self._num_images))
            np.random.seed(seed)
            np.random.shuffle(data_index)
            self.df = self.df.iloc[data_index]

        assert class_index in [-1, 0, 1, 2, 3, 4], 'Out of selection!'
        assert image_root_path != '', 'You need to pass the correct location for the dataset!'

        if class_index == -1:  # 5 classes
            print('Multi-label mode: True, Number of classes: [%d]' % len(train_cols))
            self.select_cols = train_cols
            self.value_counts_dict = {}
            for class_key, select_col in enumerate(train_cols):
                class_value_counts_dict = self.df[select_col].value_counts().to_dict()
                self.value_counts_dict[class_key] = class_value_counts_dict
        else:  # 1 class
            self.select_cols = [train_cols[class_index]]  # this var determines the number of classes
            self.value_counts_dict = self.df[self.select_cols[0]].value_counts().to_dict()

        self.mode = mode
        self.class_index = class_index
        self.image_size = image_size

        self._images_list = [image_root_path + path for path in self.df['Path'].tolist()]
        if class_index != -1:
            self._labels_list = self.df[train_cols].values[:, class_index].tolist()
        else:
            self._labels_list = self.df[train_cols].values.tolist()

        # compute the imbalance ratio unconditionally so that the
        # `imbalance_ratio` property also works when verbose=False
        if class_index != -1:
            neg_key = -1 if flip_label else 0  # negatives are stored as -1 when labels are flipped
            num_pos = self.value_counts_dict[1]
            num_neg = self.value_counts_dict[neg_key]
            self.imratio = num_pos / (num_pos + num_neg)
            if verbose:
                print('-' * 30)
                print('Found %s images in total, %s positive images, %s negative images' % (self._num_images, num_pos, num_neg))
                print('%s(C%s): imbalance ratio is %.4f' % (self.select_cols[0], class_index, self.imratio))
                print('-' * 30)
        else:
            imratio_list = []
            if verbose:
                print('-' * 30)
            for class_key, select_col in enumerate(train_cols):
                num_pos = self.value_counts_dict[class_key][1]
                num_neg = self.value_counts_dict[class_key][0]
                imratio = num_pos / (num_pos + num_neg)
                imratio_list.append(imratio)
                if verbose:
                    print('Found %s images in total, %s positive images, %s negative images' % (self._num_images, num_pos, num_neg))
                    print('%s(C%s): imbalance ratio is %.4f' % (select_col, class_key, imratio))
                    print()
            self.imratio = np.mean(imratio_list)
            self.imratio_list = imratio_list
            if verbose:
                print('-' * 30)

    @property
    def class_counts(self):
        return self.value_counts_dict

    @property
    def imbalance_ratio(self):
        return self.imratio

    @property
    def num_classes(self):
        return len(self.select_cols)

    @property
    def data_size(self):
        return self._num_images

    def image_augmentation(self, image):
        # recent torchvision versions take `fill`; older ones used `fillcolor`
        img_aug = tfs.Compose([tfs.RandomAffine(degrees=(-15, 15), translate=(0.05, 0.05), scale=(0.95, 1.05), fill=128)])
        image = img_aug(image)
        return image

    def __len__(self):
        return self._num_images

    def __getitem__(self, idx):

        image = cv2.imread(self._images_list[idx], 0)  # flag 0: read as grayscale
        image = Image.fromarray(image)
        if self.mode == 'train':
            image = self.image_augmentation(image)
        image = np.array(image)
        image = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)

        # resize and normalize; e.g., ToTensor()
        image = cv2.resize(image, dsize=(self.image_size, self.image_size), interpolation=cv2.INTER_LINEAR)
        image = image / 255.0
        __mean__ = np.array([[[0.485, 0.456, 0.406]]])
        __std__ = np.array([[[0.229, 0.224, 0.225]]])
        image = (image - __mean__) / __std__
        image = image.transpose((2, 0, 1)).astype(np.float32)  # HWC --> CHW
        # single-class and multi-label modes use the same conversion:
        # a flat float32 array with one entry per selected class
        label = np.array(self._labels_list[idx]).reshape(-1).astype(np.float32)
        return image, label


if __name__ == '__main__':
    root = '../chexpert/dataset/CheXpert-v1.0-small/'
    trainSet = CheXpert(csv_path=root+'train.csv', image_root_path=root, use_upsampling=True, use_frontal=True, image_size=320, mode='train', class_index=0)
    testSet = CheXpert(csv_path=root+'valid.csv', image_root_path=root, use_upsampling=False, use_frontal=True, image_size=320, mode='valid', class_index=0)
    trainloader = torch.utils.data.DataLoader(trainSet, batch_size=32, num_workers=2, drop_last=True, shuffle=True)
    testloader = torch.utils.data.DataLoader(testSet, batch_size=32, num_workers=2, drop_last=False, shuffle=False)


--------------------------------------------------------------------------------
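
The `use_upsampling` option in `chexpert.py` appends one extra copy of every row that is positive in each of `upsampling_cols`, which raises the positive fraction of those classes. The toy sketch below reproduces that effect on plain Python dictionaries instead of a pandas DataFrame. The helpers `upsample` and `positive_ratio` and the toy rows are hypothetical and only illustrate the arithmetic.

```python
def upsample(rows, upsampling_cols):
    """Append one extra copy of every row that is positive (== 1) in each listed column."""
    extra = []
    for col in upsampling_cols:
        extra.extend(row for row in rows if row.get(col) == 1)
    return rows + extra


def positive_ratio(rows, col):
    """Fraction of rows whose label in `col` is 1."""
    return sum(1 for row in rows if row.get(col) == 1) / len(rows)


# Toy data: 10 positive and 90 negative rows for Cardiomegaly, no Consolidation positives.
rows = ([{"Cardiomegaly": 1, "Consolidation": 0}] * 10
        + [{"Cardiomegaly": 0, "Consolidation": 0}] * 90)

before = positive_ratio(rows, "Cardiomegaly")
after = positive_ratio(upsample(rows, ["Cardiomegaly", "Consolidation"]), "Cardiomegaly")
print(before, round(after, 4))
```

In this toy case the 10 positives are duplicated once, so the ratio moves from 10/100 to 20/110. A row that is positive in several upsampled columns would be appended once per column.
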