├── samples.png
├── diagset-a-container
│   ├── README.md
│   └── container.py
└── README.md
/samples.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michalkoziarski/DiagSet/HEAD/samples.png
--------------------------------------------------------------------------------
/diagset-a-container/README.md:
--------------------------------------------------------------------------------
# Python container for DiagSet-A dataset

## Sample usage

To create train and test containers, and sample batches for one epoch:

```python
from container import TrainingDiagSetDataset, EvaluationDiagSetDataset

train_set = TrainingDiagSetDataset(
    root_path='./DiagSet-A',
    partitions=['train', 'validation'],
    magnification=40
)
test_set = EvaluationDiagSetDataset(
    root_path='./DiagSet-A',
    partitions=['test'],
    magnification=40
)

for _ in range(train_set.length()):
    images, labels = train_set.batch()
```

To create a container for binary classification:

```python
train_set = TrainingDiagSetDataset(
    root_path='./DiagSet-A',
    partitions=['train', 'validation'],
    magnification=40,
    label_dictionary={'BG': 0, 'T': 0, 'N': 0, 'A': 0, 'R1': 1, 'R2': 1, 'R3': 1, 'R4': 1, 'R5': 1}
)
```

To create a container that will sample images from both classes with equal probability:

```python
train_set = TrainingDiagSetDataset(
    root_path='./DiagSet-A',
    partitions=['train', 'validation'],
    magnification=40,
    label_dictionary={'BG': 0, 'T': 0, 'N': 0, 'A': 0, 'R1': 1, 'R2': 1, 'R3': 1, 'R4': 1, 'R5': 1},
    class_ratios={0: 0.5, 1: 0.5}
)
```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# DiagSet: a dataset for prostate cancer histopathological image classification

## Description

The dataset consists of three different partitions: DiagSet-A, containing over 2.6 million tissue patches extracted from 430 fully annotated scans; DiagSet-B, containing 4675 scans with an assigned binary diagnosis; and DiagSet-C, containing 46 scans with a diagnosis given independently by a group of histopathologists.

![DiagSet-A samples](samples.png)
*Samples from the DiagSet-A dataset, containing patches from different tissue classes extracted at different magnifications.*

## Access

The data is publicly available and can be accessed by anyone (after registration) at . Note that after registration the site administrator will need to activate your account, which should typically take less than 24 hours.

In case of any problems with accessing the website, you can either email the [website administrator](mailto:pawel.wasowicz@diag.pl) directly or, if that fails, create an issue here.

## DiagSet-A container

For easier access to the data, we prepared a Python container for loading the patches from the raw data files. It can be found [here](diagset-a-container).

## Publication

A more detailed description of the dataset and the conducted experiments can be found at .
To cite the dataset, you can use:

```bibtex
@article{koziarski2024diagset,
  title={DiagSet: a dataset for prostate cancer histopathological image classification},
  author={Koziarski, Micha{\l} and Cyganek, Bogus{\l}aw and Niedziela, Przemys{\l}aw and Olborski, Bogus{\l}aw and Antosz, Zbigniew and {\.Z}ydak, Marcin and Kwolek, Bogdan and W{\k{a}}sowicz, Pawe{\l} and Buka{\l}a, Andrzej and Swad{\'z}ba, Jakub and others},
  journal={Scientific Reports},
  volume={14},
  number={1},
  pages={6780},
  year={2024},
  publisher={Nature Publishing Group UK London}
}
```
--------------------------------------------------------------------------------
/diagset-a-container/container.py:
--------------------------------------------------------------------------------
import json
import logging
import numpy as np
import pandas as pd

from abc import ABC, abstractmethod
from pathlib import Path
from queue import Queue
from threading import Thread


IMAGENET_IMAGE_MEAN = [123.68, 116.779, 103.939]
PATCH_SIZE = 224
DEFAULT_LABEL_DICTIONARY = {'BG': 0, 'T': 1, 'N': 2, 'A': 3, 'R1': 4, 'R2': 5, 'R3': 6, 'R4': 7, 'R5': 8}


class AbstractDiagSetDataset(ABC):
    def __init__(self, root_path, partitions, magnification=40, batch_size=32, augment=True,
                 subtract_mean=True, label_dictionary=None, shuffling=True, class_ratios=None,
                 scan_subset=None, buffer_size=64):
        """
        Abstract container for the DiagSet-A dataset.

        :param root_path: root directory of the dataset
        :param partitions: list containing all partitions ('train', 'validation' or 'test') that will be loaded
        :param magnification: int in [40, 20, 10, 5] describing the scan magnification for which patches will be loaded
        :param batch_size: int, number of images in a single batch
        :param augment: boolean, whether to apply random image augmentations
        :param subtract_mean: boolean, whether to subtract the ImageNet mean from every image
        :param label_dictionary: dict assigning an int label to every text key; DEFAULT_LABEL_DICTIONARY will
            be used if it is set to None
        :param shuffling: boolean, whether to shuffle the order of batches
        :param class_ratios: dict assigning a probability to each int key, specifying the ratio of images from the
            class with a given key that will be loaded in each batch (note that it will not always return a
            deterministic number of images per class, but will specify the probability of drawing from that
            class instead).
            Can be None, in which case the original dataset ratios will be used; otherwise all dict
            values should sum up to one
        :param scan_subset: subset of scans that will be loaded, either a list of strings with scan IDs or a
            float in (0, 1), in which case a random subset of scans from the given partitions will be selected
        :param buffer_size: number of images from each class that will be stored in the buffer
        """
        for partition in partitions:
            assert partition in ['train', 'validation', 'test']

        self.root_path = root_path
        self.partitions = partitions
        self.magnification = magnification
        self.batch_size = batch_size
        self.augment = augment
        self.subtract_mean = subtract_mean
        self.shuffling = shuffling
        self.scan_subset = scan_subset
        self.buffer_size = buffer_size

        if label_dictionary is None:
            logging.info('Using default label dictionary...')

            self.label_dictionary = DEFAULT_LABEL_DICTIONARY
        else:
            self.label_dictionary = label_dictionary

        self.numeric_labels = list(set(self.label_dictionary.values()))

        self.buffers = {}
        self.blob_paths = {}
        self.class_distribution = {}

        for numeric_label in self.numeric_labels:
            self.buffers[numeric_label] = Queue(buffer_size)
            self.blob_paths[numeric_label] = []
            self.class_distribution[numeric_label] = 0

        self.n_images = 0

        self.blobs_path = Path(root_path) / 'blobs' / 'S' / ('%dx' % magnification)
        self.distributions_path = Path(root_path) / 'distributions' / 'S' / ('%dx' % magnification)

        assert self.blobs_path.exists()

        self.scan_names = [path.name for path in self.blobs_path.iterdir()]

        partition_scan_names = []

        for partition in self.partitions:
            partition_path = Path(root_path) / 'partitions' / 'DiagSet-A.2' / ('%s.csv' % partition)

            if partition_path.exists():
                df = pd.read_csv(partition_path)
                partition_scan_names += df['scan_id'].astype(str).tolist()
            else:
                raise ValueError('Partition file not found under "%s".' % partition_path)

        self.scan_names = [scan_name for scan_name in self.scan_names if scan_name in partition_scan_names]

        if self.scan_subset is not None and self.scan_subset != 1.0:
            if type(self.scan_subset) is list:
                logging.info('Using given %d out of %d scans...' % (len(self.scan_subset), len(self.scan_names)))

                self.scan_names = self.scan_subset
            else:
                if type(self.scan_subset) is float:
                    n_scans = int(self.scan_subset * len(self.scan_names))
                else:
                    n_scans = self.scan_subset

                assert n_scans > 0
                assert n_scans <= len(self.scan_names)

                logging.info('Randomly selecting %d out of %d scans...' % (n_scans, len(self.scan_names)))

                self.scan_names = list(np.random.choice(self.scan_names, n_scans, replace=False))

        logging.info('Loading blob paths...')

        for scan_name in self.scan_names:
            for string_label, numeric_label in self.label_dictionary.items():
                blob_names = map(lambda x: x.name, sorted((self.blobs_path / scan_name / string_label).iterdir()))

                for blob_name in blob_names:
                    self.blob_paths[numeric_label].append(self.blobs_path / scan_name / string_label / blob_name)

            with open(self.distributions_path / ('%s.json' % scan_name), 'r') as f:
                scan_class_distribution = json.load(f)

            self.n_images += sum(scan_class_distribution.values())

            for string_label, numeric_label in self.label_dictionary.items():
                self.class_distribution[numeric_label] += scan_class_distribution[string_label]

        if class_ratios is None:
            self.class_ratios = {}

            for numeric_label in self.numeric_labels:
                self.class_ratios[numeric_label] = self.class_distribution[numeric_label] / self.n_images
        else:
            self.class_ratios = class_ratios
        logging.info('Found %d patches.' % self.n_images)

        class_distribution_text = ', '.join(['%s: %.2f%%' % (label, count / self.n_images * 100)
                                             for label, count in self.class_distribution.items()])
        logging.info('Class distribution: %s.' % class_distribution_text)

        if self.shuffling:
            for numeric_label in self.numeric_labels:
                np.random.shuffle(self.blob_paths[numeric_label])

        # One background thread per non-empty class keeps the corresponding buffer filled.
        for numeric_label in self.numeric_labels:
            if len(self.blob_paths[numeric_label]) > 0:
                Thread(target=self.fill_buffer, daemon=True, args=(numeric_label, )).start()

    @abstractmethod
    def batch(self):
        return

    def length(self):
        return int(np.ceil(self.n_images / self.batch_size))

    def fill_buffer(self, numeric_label):
        # Runs forever in a daemon thread: cycles over the class's blobs and pushes
        # single images into the bounded queue (put blocks when the buffer is full).
        while True:
            for blob_path in self.blob_paths[numeric_label]:
                images = self.prepare_images(blob_path)

                for image in images:
                    self.buffers[numeric_label].put(image)

            if self.shuffling:
                np.random.shuffle(self.blob_paths[numeric_label])

    def prepare_images(self, blob_path):
        # Loads a blob of patches and either augments them (training)
        # or center-crops them to PATCH_SIZE (evaluation).
        images = np.load(blob_path)

        if self.shuffling:
            np.random.shuffle(images)

        prepared_images = []

        for i in range(len(images)):
            image = images[i].astype(np.float32)

            if self.augment:
                image = self._augment(image)
            else:
                x = (image.shape[0] - PATCH_SIZE) // 2
                y = (image.shape[1] - PATCH_SIZE) // 2

                image = image[x:(x + PATCH_SIZE), y:(y + PATCH_SIZE)]

            if self.subtract_mean:
                image -= IMAGENET_IMAGE_MEAN

            prepared_images.append(image)

        prepared_images = np.array(prepared_images)

        return prepared_images

    def _augment(self, image):
        # Random crop to PATCH_SIZE, random horizontal flip, random rotation by a multiple of 90 degrees.
        x_max = image.shape[0] - PATCH_SIZE
        y_max = image.shape[1] - PATCH_SIZE

        x = np.random.randint(x_max)
        y = np.random.randint(y_max)

        image = image[x:(x + PATCH_SIZE), y:(y + PATCH_SIZE)]

        if np.random.choice([True, False]):
            image = np.fliplr(image)

        image = np.rot90(image, k=np.random.randint(4))

        return image


class TrainingDiagSetDataset(AbstractDiagSetDataset):
    def batch(self):
        # Draws class labels according to class_ratios, then takes one image per label from the buffers.
        probabilities = [self.class_ratios[label] for label in self.numeric_labels]

        labels = np.random.choice(self.numeric_labels, self.batch_size, p=probabilities)
        images = np.array([self.buffers[label].get() for label in labels])

        return images, labels


class EvaluationDiagSetDataset(AbstractDiagSetDataset):
    def __init__(self, **kwargs):
        assert kwargs.get('augment', False) is False
        assert kwargs.get('shuffling', False) is False
        assert kwargs.get('class_ratios') is None

        kwargs['augment'] = False
        kwargs['shuffling'] = False
        kwargs['class_ratios'] = None

        self.current_numeric_label_index = 0
        self.current_batch_index = 0

        super().__init__(**kwargs)

    def batch(self):
        # Walks through the classes sequentially, emitting every image of one class
        # before moving to the next; batches never span two classes.
        labels = []
        images = []

        for _ in range(self.batch_size):
            label = self.numeric_labels[self.current_numeric_label_index]

            while len(self.blob_paths[label]) == 0:
                self.current_numeric_label_index = (self.current_numeric_label_index + 1) % len(self.numeric_labels)

                label = self.numeric_labels[self.current_numeric_label_index]

            image = self.buffers[label].get()

            labels.append(label)
            images.append(image)

            self.current_batch_index += 1

            if self.current_batch_index >= self.class_distribution[label]:
                self.current_batch_index = 0
                self.current_numeric_label_index += 1

                if self.current_numeric_label_index >= len(self.numeric_labels):
                    self.current_numeric_label_index = 0

                break

        labels = np.array(labels)
        images = np.array(images)

        return images, labels
--------------------------------------------------------------------------------
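The patch-preparation logic in `container.py` can be sanity-checked without the dataset itself. The sketch below mirrors the `_augment` steps (random crop to `PATCH_SIZE`, random horizontal flip, random 90° rotation) and the center-crop branch of `prepare_images` on a dummy array; the function names and the 256×256 patch size are illustrative, not part of the repository.

```python
import numpy as np

PATCH_SIZE = 224  # same constant as in container.py


def random_crop_augment(image):
    # Random crop to PATCH_SIZE x PATCH_SIZE, mirroring AbstractDiagSetDataset._augment
    x = np.random.randint(image.shape[0] - PATCH_SIZE)
    y = np.random.randint(image.shape[1] - PATCH_SIZE)
    image = image[x:(x + PATCH_SIZE), y:(y + PATCH_SIZE)]

    # Random horizontal flip and rotation by a random multiple of 90 degrees
    if np.random.choice([True, False]):
        image = np.fliplr(image)

    return np.rot90(image, k=np.random.randint(4))


def center_crop(image):
    # Deterministic center crop, mirroring the evaluation branch of prepare_images
    x = (image.shape[0] - PATCH_SIZE) // 2
    y = (image.shape[1] - PATCH_SIZE) // 2

    return image[x:(x + PATCH_SIZE), y:(y + PATCH_SIZE)]


# A dummy patch slightly larger than PATCH_SIZE (the 256x256 size here is illustrative)
dummy = np.random.rand(256, 256, 3).astype(np.float32)

assert random_crop_augment(dummy).shape == (PATCH_SIZE, PATCH_SIZE, 3)
assert center_crop(dummy).shape == (PATCH_SIZE, PATCH_SIZE, 3)
```

Because the augmented crop is always square, rotating by a multiple of 90 degrees preserves its shape, which is why both branches yield `PATCH_SIZE`-sided outputs.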