├── README.md
└── chexpert.py


/README.md:
--------------------------------------------------------------------------------
 1 | # Robust Deep AUC Maximization  [![pdf](https://img.shields.io/badge/Arxiv-pdf-orange.svg?style=flat)](https://arxiv.org/abs/2012.03173)
 2 | 
 3 | This is the official implementation of the paper "**Large-scale Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification**", published at **ICCV 2021**.
 4 | 
 5 | Requirements
 6 | ---------
 7 | ```bash
 8 | pip install libauc
 9 | ```
10 | 
11 | Benchmark Datasets
12 | ---------
13 | The benchmark datasets are [Cat&Dog](https://www.kaggle.com/c/dogs-vs-cats), [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html), [CIFAR100](https://www.cs.toronto.edu/~kriz/cifar.html), and [STL10](https://cs.stanford.edu/~acoates/stl10/). The example below shows how to construct imbalanced versions of these datasets:
14 | 
15 | ### Example
16 | 
17 | #### Importing LibAUC & Loading Datasets
18 | ```python
19 | from libauc.datasets import CIFAR10, CIFAR100, CAT_vs_DOG, STL10
20 | (train_data, train_label), (test_data, test_label) = CIFAR10()
21 | ```
22 | 
23 | #### Constructing Imbalanced Datasets
24 | ```python
25 | from libauc.datasets import imbalance_generator
26 | SEED = 123
27 | imratio = 0.1 # positive_samples / total_samples
28 | (train_images, train_labels) = imbalance_generator(train_data, train_label, imratio=imratio, shuffle=True, random_seed=SEED)
29 | (test_images, test_labels) = imbalance_generator(test_data, test_label, is_balanced=True, random_seed=SEED)
30 | ```
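As the comment in the snippet above says, `imratio` is the fraction of positive samples in the resulting dataset. The sketch below illustrates one plausible way to reach a target ratio: keep every negative sample and randomly subsample the positives. It is an assumption for illustration only and not necessarily how `imbalance_generator` works internally; `make_imbalanced_indices` is a hypothetical helper, not part of LibAUC.

```python
import random


def make_imbalanced_indices(labels, imratio, seed=123):
    """Return shuffled indices that keep all negatives and a random subset of
    positives, so that positives / total is approximately `imratio`."""
    if not 0 < imratio < 1:
        raise ValueError("imratio must be in (0, 1)")
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y != 1]
    # Solve n_pos / (n_pos + n_neg) = imratio for n_pos.
    n_pos = min(len(pos), round(imratio * len(neg) / (1 - imratio)))
    rng = random.Random(seed)
    keep = neg + rng.sample(pos, n_pos)
    rng.shuffle(keep)
    return keep


# Toy balanced label list: 500 positives and 500 negatives.
labels = [1] * 500 + [0] * 500
idx = make_imbalanced_indices(labels, imratio=0.1)
ratio = sum(labels[i] for i in idx) / len(idx)
print(len(idx), round(ratio, 3))
```

With a fixed seed the selection is reproducible, which mirrors the role of `random_seed=SEED` in the snippet above.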
31 | 
32 | #### Making Dataloader for Training and Testing 
33 | ```python
34 | trainloader = torch.utils.data.DataLoader(ImageDataset(train_images, train_labels), batch_size=BATCH_SIZE, shuffle=True, num_workers=1, pin_memory=True, drop_last=True)
35 | testloader = torch.utils.data.DataLoader( ImageDataset(test_images, test_labels, mode='test'), batch_size=BATCH_SIZE, shuffle=False, num_workers=1,  pin_memory=True)
36 | ```
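The snippet above needs `import torch` and uses `ImageDataset` and `BATCH_SIZE` without defining them in this README. `BATCH_SIZE` is simply the batch size you choose. For `ImageDataset`, the sketch below is a hypothetical minimal stand-in, assuming only that it pairs images with labels and takes a `mode` argument; it is not the authors' implementation. It relies on the fact that a map-style PyTorch dataset only has to implement `__len__` and `__getitem__`; in practice it would usually subclass `torch.utils.data.Dataset` and return tensors.

```python
class ImageDataset:
    """Hypothetical minimal map-style dataset that pairs each image with its label."""

    def __init__(self, images, targets, mode='train'):
        if len(images) != len(targets):
            raise ValueError('images and targets must have the same length')
        if mode not in ('train', 'test'):
            raise ValueError("mode must be 'train' or 'test'")
        self.images = images
        self.targets = targets
        # A fuller version would apply data augmentation only in 'train' mode.
        self.mode = mode

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # A real implementation would convert the image to a normalized
        # float tensor here; this sketch returns the raw pair unchanged.
        return self.images[idx], self.targets[idx]


# Toy data standing in for image arrays and binary labels.
dataset = ImageDataset(images=[[0, 1], [2, 3], [4, 5]], targets=[0, 1, 0], mode='test')
print(len(dataset), dataset[1])
```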
37 | 
38 | For instructions on training the models, please refer to this [Notebook](https://github.com/yzhuoning/LibAUC/blob/main/examples/02_Optimizing_AUROC_with_ResNet20_on_Imbalanced_CIFAR10.ipynb). 
39 | 
40 | 
41 | CheXpert
42 | ---------
43 | CheXpert is a large chest X-ray dataset and competition consisting of 224,316 chest radiographs of 65,240 patients. Details about the dataset can be found at https://stanfordmlgroup.github.io/competitions/chexpert/. The dataloader used in the paper can be downloaded [here](https://github.com/Optimization-AI/ICCV2021_DeepAUC/blob/main/chexpert.py).
44 | 
45 | ### Example 
46 | 
47 | ```python
48 | root = "YOUR_DATA_PATH/"  # keep the trailing slash: file paths are built by string concatenation
49 | class_id = 0  # index into train_cols (0-4), or -1 for multi-label mode
50 | trainSet = CheXpert(csv_path=root+'train.csv', image_root_path=root, use_frontal=True, image_size=224, mode='train', class_index=class_id)
51 | testSet = CheXpert(csv_path=root+'valid.csv', image_root_path=root, use_upsampling=False, use_frontal=True, image_size=224, mode='valid', class_index=class_id)  # no upsampling for evaluation, as in chexpert.py
52 | trainloader = torch.utils.data.DataLoader(trainSet, batch_size=32, num_workers=2, shuffle=True)
53 | testloader = torch.utils.data.DataLoader(testSet, batch_size=32, num_workers=2, shuffle=False)
54 | ```
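In `chexpert.py`, `class_index` is an integer: `0` to `4` selects a single disease column from the default `train_cols`, and `-1` selects all five columns (multi-label mode). The sketch below mirrors that selection logic as a standalone function so the mapping is easy to check; `selected_columns` is a hypothetical helper written for illustration and does not exist in the repository.

```python
# Default `train_cols` from chexpert.py.
TRAIN_COLS = ['Cardiomegaly', 'Edema', 'Consolidation', 'Atelectasis', 'Pleural Effusion']


def selected_columns(class_index, train_cols=TRAIN_COLS):
    """Return the label columns a given class_index would select."""
    if class_index not in (-1, 0, 1, 2, 3, 4):
        raise ValueError('class_index must be -1 or an integer from 0 to 4')
    if class_index == -1:
        # Multi-label mode: every training column is used.
        return list(train_cols)
    # Single-class mode: exactly one column is used.
    return [train_cols[class_index]]


for index in (-1, 0, 4):
    print(index, selected_columns(index))
```

The length of the returned list corresponds to what the dataset exposes through its `num_classes` property.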
55 | 
56 | For instructions on training the models, please refer to this [Notebook](https://github.com/Optimization-AI/LibAUC/blob/main/examples/05_Optimizing_AUROC_Loss_with_DenseNet121_on_CheXpert.ipynb).
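When interpreting training results, it may help to know how the dataloader in `chexpert.py` treats uncertain (`-1`) and missing (NaN) labels: uncertain labels become positive for Edema and Atelectasis, negative for Cardiomegaly, Consolidation, and Pleural Effusion, and missing labels become negative everywhere. The sketch below restates that rule for a single value in plain Python; `impute_label` is a hypothetical helper for illustration, whereas the dataloader itself applies the rule to whole pandas columns.

```python
import math

# Column groups as they appear in chexpert.py.
UNCERTAIN_AS_POSITIVE = {'Edema', 'Atelectasis'}
UNCERTAIN_AS_NEGATIVE = {'Cardiomegaly', 'Consolidation', 'Pleural Effusion'}


def impute_label(col, value):
    """Map one raw CheXpert label (1, 0, -1, or NaN/None) to its training label."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0.0  # missing label -> negative
    if value == -1:
        if col in UNCERTAIN_AS_POSITIVE:
            return 1.0  # uncertain -> positive
        if col in UNCERTAIN_AS_NEGATIVE:
            return 0.0  # uncertain -> negative
    # Certain labels, and uncertain labels of any other column, pass through.
    return float(value)


for col, raw in [('Edema', -1), ('Cardiomegaly', -1), ('Atelectasis', float('nan'))]:
    print(col, raw, '->', impute_label(col, raw))
```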
57 | 
58 | 
59 | Citation
60 | ---------
61 | If you find this repo helpful, please cite the following paper:
62 | ```
63 | @inproceedings{yuan2021robust,
64 | 	title={Large-scale Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification},
65 | 	author={Yuan, Zhuoning and Yan, Yan and Sonka, Milan and Yang, Tianbao},
66 | 	booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
67 | 	year={2021}
68 | 	}
69 | ```
70 | 
71 | Contact
72 | ----------
73 | If you have any questions, please contact [Zhuoning Yuan](https://homepage.divms.uiowa.edu/~zhuoning/) [yzhuoning@gmail.com] or [Tianbao Yang](http://people.tamu.edu/~tianbao-yang/) [tianbao-yang@tamu.edu], or open a new issue on GitHub.
74 | 


--------------------------------------------------------------------------------
/chexpert.py:
--------------------------------------------------------------------------------
  1 | import numpy as np
  2 | import torch 
  3 | from torch.utils.data import Dataset
  4 | import torchvision.transforms as tfs
  5 | import cv2
  6 | from PIL import Image
  7 | import pandas as pd
  8 | 
  9 | class CheXpert(Dataset):
 10 |     '''
 11 |     Reference: 
 12 |         @inproceedings{yuan2021robust,
 13 |             title={Large-scale Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification},
 14 |             author={Yuan, Zhuoning and Yan, Yan and Sonka, Milan and Yang, Tianbao},
 15 |             booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
 16 |             year={2021}
 17 |             }
 18 |     '''
 19 |     def __init__(self, 
 20 |                  csv_path, 
 21 |                  image_root_path='',
 22 |                  image_size=320,
 23 |                  class_index=0, 
 24 |                  use_frontal=True,
 25 |                  use_upsampling=True,
 26 |                  flip_label=False,
 27 |                  shuffle=True,
 28 |                  seed=123,
 29 |                  verbose=True,
 30 |                  upsampling_cols=['Cardiomegaly', 'Consolidation'],
 31 |                  train_cols=['Cardiomegaly', 'Edema', 'Consolidation', 'Atelectasis',  'Pleural Effusion'],
 32 |                  mode='train'):
 33 |         
 34 |     
 35 |         # load data from csv
 36 |         self.df = pd.read_csv(csv_path)
 37 |         self.df['Path'] = self.df['Path'].str.replace('CheXpert-v1.0-small/', '', regex=False)  # literal match, '.' is not a wildcard
 38 |         self.df['Path'] = self.df['Path'].str.replace('CheXpert-v1.0/', '', regex=False)
 39 |         if use_frontal:
 40 |             self.df = self.df[self.df['Frontal/Lateral'] == 'Frontal']  
 41 |             
 42 |         # upsample selected cols
 43 |         if use_upsampling:
 44 |             assert isinstance(upsampling_cols, list), 'Input should be list!'
 45 |             sampled_df_list = []
 46 |             for col in upsampling_cols:
 47 |                 print ('Upsampling %s...'%col)
 48 |                 sampled_df_list.append(self.df[self.df[col] == 1])
 49 |             self.df = pd.concat([self.df] + sampled_df_list, axis=0)
 50 | 
 51 | 
 52 |         # impute uncertain (-1) and missing (NaN) labels
 53 |         for col in train_cols:
 54 |             if col in ['Edema', 'Atelectasis']:
 55 |                 self.df[col] = self.df[col].replace(-1, 1)  # uncertain -> positive
 56 |                 self.df[col] = self.df[col].fillna(0)       # missing -> negative
 57 |             elif col in ['Cardiomegaly','Consolidation',  'Pleural Effusion']:
 58 |                 self.df[col] = self.df[col].replace(-1, 0)  # uncertain -> negative
 59 |                 self.df[col] = self.df[col].fillna(0)       # missing -> negative
 60 |             else:
 61 |                 self.df[col] = self.df[col].fillna(0)
 62 |         
 63 |         self._num_images = len(self.df)
 64 |         
 65 |         # 0 --> -1
 66 |         if flip_label and class_index != -1: # this option is disabled in multi-label mode
 67 |             self.df.replace(0, -1, inplace=True)   
 68 |             
 69 |         # shuffle data
 70 |         if shuffle:
 71 |             data_index = list(range(self._num_images))
 72 |             np.random.seed(seed)
 73 |             np.random.shuffle(data_index)
 74 |             self.df = self.df.iloc[data_index]
 75 |         
 76 |         
 77 |         assert class_index in [-1, 0, 1, 2, 3, 4], 'Out of selection!'
 78 |         assert image_root_path != '', 'You need to pass the correct location for the dataset!'
 79 | 
 80 |         if class_index == -1: # 5 classes
 81 |             print ('Multi-label mode: True, Number of classes: [%d]'%len(train_cols))
 82 |             self.select_cols = train_cols
 83 |             self.value_counts_dict = {}
 84 |             for class_key, select_col in enumerate(train_cols):
 85 |                 class_value_counts_dict = self.df[select_col].value_counts().to_dict()
 86 |                 self.value_counts_dict[class_key] = class_value_counts_dict
 87 |         else:       # 1 class
 88 |             self.select_cols = [train_cols[class_index]]  # this var determines the number of classes
 89 |             self.value_counts_dict = self.df[self.select_cols[0]].value_counts().to_dict()
 90 |         
 91 |         self.mode = mode
 92 |         self.class_index = class_index
 93 |         self.image_size = image_size
 94 |         
 95 |         self._images_list =  [image_root_path+path for path in self.df['Path'].tolist()]
 96 |         if class_index != -1:
 97 |             self._labels_list = self.df[train_cols].values[:, class_index].tolist()
 98 |         else:
 99 |             self._labels_list = self.df[train_cols].values.tolist()
100 |     
101 |         if class_index != -1: # the ratio is computed even when verbose=False so that `imbalance_ratio` always works
102 |             num_pos = self.value_counts_dict.get(1, 0)
103 |             num_neg = self.value_counts_dict.get(-1 if flip_label else 0, 0)
104 |             self.imratio = num_pos/(num_pos+num_neg)
105 |             if verbose:
106 |                 print ('-'*30)
107 |                 print('Found %s images in total, %s positive images, %s negative images'%(self._num_images, num_pos, num_neg))
108 |                 print ('%s(C%s): imbalance ratio is %.4f'%(self.select_cols[0], class_index, self.imratio))
109 |                 print ('-'*30)
110 |         else:
111 |             imratio_list = []
112 |             if verbose:
113 |                 print ('-'*30)
114 |             for class_key, select_col in enumerate(train_cols):
115 |                 num_pos = self.value_counts_dict[class_key].get(1, 0)
116 |                 num_neg = self.value_counts_dict[class_key].get(0, 0)
117 |                 imratio_list.append(num_pos/(num_pos+num_neg))
118 |                 if verbose:
119 |                     print('Found %s images in total, %s positive images, %s negative images'%(self._num_images, num_pos, num_neg))
120 |                     print ('%s(C%s): imbalance ratio is %.4f\n'%(select_col, class_key, imratio_list[-1]))
121 |             self.imratio = np.mean(imratio_list)
122 |             self.imratio_list = imratio_list
123 |             if verbose:
124 |                 print ('-'*30)
125 |             
126 |     @property        
127 |     def class_counts(self):
128 |         return self.value_counts_dict
129 |     
130 |     @property
131 |     def imbalance_ratio(self):
132 |         return self.imratio
133 | 
134 |     @property
135 |     def num_classes(self):
136 |         return len(self.select_cols)
137 |        
138 |     @property  
139 |     def data_size(self):
140 |         return self._num_images 
141 |     
142 |     def image_augmentation(self, image):
143 |         img_aug = tfs.Compose([tfs.RandomAffine(degrees=(-15, 15), translate=(0.05, 0.05), scale=(0.95, 1.05), fill=128)]) # newer torchvision releases take `fill`; older ones used the deprecated `fillcolor`
144 |         image = img_aug(image)
145 |         return image
146 |     
147 |     def __len__(self):
148 |         return self._num_images
149 |     
150 |     def __getitem__(self, idx):
151 | 
152 |         image = cv2.imread(self._images_list[idx], 0)
153 |         image = Image.fromarray(image)
154 |         if self.mode == 'train':
155 |             image = self.image_augmentation(image)
156 |         image = np.array(image)
157 |         image = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)
158 |         
159 |         # resize and normalize; e.g., ToTensor()
160 |         image = cv2.resize(image, dsize=(self.image_size, self.image_size), interpolation=cv2.INTER_LINEAR)  
161 |         image = image/255.0
162 |         __mean__ = np.array([[[0.485, 0.456, 0.406]]])
163 |         __std__ =  np.array([[[0.229, 0.224, 0.225]  ]]) 
164 |         image = (image-__mean__)/__std__
165 |         image = image.transpose((2, 0, 1)).astype(np.float32)
166 |         if self.class_index != -1: # single-class mode: label has shape (1,)
167 |             label = np.array(self._labels_list[idx]).reshape(-1).astype(np.float32)
168 |         else: # multi-label mode: label has shape (len(train_cols),)
169 |             label = np.array(self._labels_list[idx]).reshape(-1).astype(np.float32)
170 |         return image, label
171 | 
172 | 
173 | if __name__ == '__main__':
174 |     root = '../chexpert/dataset/CheXpert-v1.0-small/'
175 |     trainSet = CheXpert(csv_path=root+'train.csv', image_root_path=root, use_upsampling=True, use_frontal=True, image_size=320, mode='train', class_index=0)
176 |     testSet =  CheXpert(csv_path=root+'valid.csv',  image_root_path=root, use_upsampling=False, use_frontal=True, image_size=320, mode='valid', class_index=0)
177 |     trainloader =  torch.utils.data.DataLoader(trainSet, batch_size=32, num_workers=2, drop_last=True, shuffle=True)
178 |     testloader =  torch.utils.data.DataLoader(testSet, batch_size=32, num_workers=2, drop_last=False, shuffle=False)
179 | 
180 |  
181 | 


--------------------------------------------------------------------------------