├── samples.png
├── diagset-a-container
│   ├── README.md
│   └── container.py
└── README.md
/samples.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michalkoziarski/DiagSet/HEAD/samples.png
--------------------------------------------------------------------------------
/diagset-a-container/README.md:
--------------------------------------------------------------------------------
# Python container for DiagSet-A dataset

## Sample usage

To create train and test containers, and sample batches for one epoch:

```python
from container import TrainingDiagSetDataset, EvaluationDiagSetDataset

train_set = TrainingDiagSetDataset(
    root_path='./DiagSet-A',
    partitions=['train', 'validation'],
    magnification=40
)
test_set = EvaluationDiagSetDataset(
    root_path='./DiagSet-A',
    partitions=['test'],
    magnification=40
)

for _ in range(train_set.length()):
    images, labels = train_set.batch()
```

To create a container for binary classification:

```python
train_set = TrainingDiagSetDataset(
    root_path='./DiagSet-A',
    partitions=['train', 'validation'],
    magnification=40,
    label_dictionary={'BG': 0, 'T': 0, 'N': 0, 'A': 0, 'R1': 1, 'R2': 1, 'R3': 1, 'R4': 1, 'R5': 1}
)
```

To create a container that will sample images from both classes with equal probability:

```python
train_set = TrainingDiagSetDataset(
    root_path='./DiagSet-A',
    partitions=['train', 'validation'],
    magnification=40,
    label_dictionary={'BG': 0, 'T': 0, 'N': 0, 'A': 0, 'R1': 1, 'R2': 1, 'R3': 1, 'R4': 1, 'R5': 1},
    class_ratios={0: 0.5, 1: 0.5}
)
```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# DiagSet: a dataset for prostate cancer histopathological image classification

## Description

The dataset consists of three different partitions: DiagSet-A, containing over 2.6 million tissue patches extracted from 430 fully annotated scans; DiagSet-B, containing 4675 scans with an assigned binary diagnosis; and DiagSet-C, containing 46 scans with a diagnosis given independently by a group of histopathologists.

![DiagSet-A samples](samples.png)
*Samples from the DiagSet-A dataset, containing patches from different tissue classes extracted at different magnifications.*

## Access

The data is publicly available and can be accessed by anyone (after registration) at . Note that after registration the site administrator will need to activate your account, which should typically take less than 24 hours.

In case of any problems with accessing the website, you can either email the [website administrator](mailto:pawel.wasowicz@diag.pl) directly or, if that fails, create an issue here.

## DiagSet-A container

For easier access to the data, we prepared a Python container for loading the patches from the raw data files. It can be found [here](diagset-a-container).

## Publication

A more detailed description of the dataset and the conducted experiments can be found at .
To cite the dataset, you can use:

```bibtex
@article{koziarski2024diagset,
  title={DiagSet: a dataset for prostate cancer histopathological image classification},
  author={Koziarski, Micha{\l} and Cyganek, Bogus{\l}aw and Niedziela, Przemys{\l}aw and Olborski, Bogus{\l}aw and Antosz, Zbigniew and {\.Z}ydak, Marcin and Kwolek, Bogdan and W{\k{a}}sowicz, Pawe{\l} and Buka{\l}a, Andrzej and Swad{\'z}ba, Jakub and others},
  journal={Scientific Reports},
  volume={14},
  number={1},
  pages={6780},
  year={2024},
  publisher={Nature Publishing Group UK London}
}
```
--------------------------------------------------------------------------------
/diagset-a-container/container.py:
--------------------------------------------------------------------------------
import json
import logging
import numpy as np
import pandas as pd

from abc import ABC, abstractmethod
from pathlib import Path
from queue import Queue
from threading import Thread


IMAGENET_IMAGE_MEAN = [123.68, 116.779, 103.939]
PATCH_SIZE = 224
DEFAULT_LABEL_DICTIONARY = {'BG': 0, 'T': 1, 'N': 2, 'A': 3, 'R1': 4, 'R2': 5, 'R3': 6, 'R4': 7, 'R5': 8}


class AbstractDiagSetDataset(ABC):
    def __init__(self, root_path, partitions, magnification=40, batch_size=32, augment=True,
                 subtract_mean=True, label_dictionary=None, shuffling=True, class_ratios=None,
                 scan_subset=None, buffer_size=64):
        """
        Abstract container for the DiagSet-A dataset.

        :param root_path: root directory of the dataset
        :param partitions: list containing all partitions ('train', 'validation' or 'test') that will be loaded
        :param magnification: int in [40, 20, 10, 5] describing the scan magnification for which patches will be loaded
        :param batch_size: int, number of images in a single batch
        :param augment: boolean, whether to apply random image augmentations
        :param subtract_mean: boolean, whether to subtract the ImageNet mean from every image
        :param label_dictionary: dict assigning an int label to every text key; DEFAULT_LABEL_DICTIONARY will
            be used if it is set to None
        :param shuffling: boolean, whether to shuffle the order of batches
        :param class_ratios: dict assigning a probability to each int key, specifying the ratio of images from the
            class with a given key that will be loaded in each batch (note that it will not always return a
            deterministic number of images per class, but will specify the probability of drawing from that
            class instead).
            Can be None, in which case the original dataset ratios will be used; otherwise all dict
            values should sum up to one
        :param scan_subset: subset of scans that will be loaded, either a list of strings with scan IDs or a
            float in (0, 1), in which case a random subset of scans from the given partitions will be selected
        :param buffer_size: number of images from each class that will be stored in the buffer
        """
        for partition in partitions:
            assert partition in ['train', 'validation', 'test']

        self.root_path = root_path
        self.partitions = partitions
        self.magnification = magnification
        self.batch_size = batch_size
        self.augment = augment
        self.subtract_mean = subtract_mean
        self.shuffling = shuffling
        self.scan_subset = scan_subset
        self.buffer_size = buffer_size

        if label_dictionary is None:
            logging.info('Using default label dictionary...')

            self.label_dictionary = DEFAULT_LABEL_DICTIONARY
        else:
            self.label_dictionary = label_dictionary

        self.numeric_labels = list(set(self.label_dictionary.values()))

        self.buffers = {}
        self.blob_paths = {}
        self.class_distribution = {}

        for numeric_label in self.numeric_labels:
            self.buffers[numeric_label] = Queue(buffer_size)
            self.blob_paths[numeric_label] = []
            self.class_distribution[numeric_label] = 0

        self.n_images = 0

        self.blobs_path = Path(root_path) / 'blobs' / 'S' / ('%dx' % magnification)
        self.distributions_path = Path(root_path) / 'distributions' / 'S' / ('%dx' % magnification)

        assert self.blobs_path.exists()

        self.scan_names = [path.name for path in self.blobs_path.iterdir()]

        partition_scan_names = []

        for partition in self.partitions:
            partition_path = Path(root_path) / 'partitions' / 'DiagSet-A.2' / ('%s.csv' % partition)

            if partition_path.exists():
                df = pd.read_csv(partition_path)
                partition_scan_names += df['scan_id'].astype(str).tolist()
            else:
                raise ValueError('Partition file not found under "%s".' % partition_path)

        self.scan_names = [scan_name for scan_name in self.scan_names if scan_name in partition_scan_names]

        if self.scan_subset is not None and self.scan_subset != 1.0:
            if type(self.scan_subset) is list:
                logging.info('Using given %d out of %d scans...' % (len(self.scan_subset), len(self.scan_names)))

                self.scan_names = self.scan_subset
            else:
                if type(self.scan_subset) is float:
                    n_scans = int(self.scan_subset * len(self.scan_names))
                else:
                    n_scans = self.scan_subset

                assert n_scans > 0
                assert n_scans <= len(self.scan_names)

                logging.info('Randomly selecting %d out of %d scans...' % (n_scans, len(self.scan_names)))

                self.scan_names = list(np.random.choice(self.scan_names, n_scans, replace=False))

        logging.info('Loading blob paths...')

        for scan_name in self.scan_names:
            for string_label, numeric_label in self.label_dictionary.items():
                blob_names = map(lambda x: x.name, sorted((self.blobs_path / scan_name / string_label).iterdir()))

                for blob_name in blob_names:
                    self.blob_paths[numeric_label].append(self.blobs_path / scan_name / string_label / blob_name)

            with open(self.distributions_path / ('%s.json' % scan_name), 'r') as f:
                scan_class_distribution = json.load(f)

            self.n_images += sum(scan_class_distribution.values())

            for string_label, numeric_label in self.label_dictionary.items():
                self.class_distribution[numeric_label] += scan_class_distribution[string_label]

        if class_ratios is None:
            self.class_ratios = {}

            for numeric_label in self.numeric_labels:
                self.class_ratios[numeric_label] = self.class_distribution[numeric_label] / self.n_images
        else:
            self.class_ratios = class_ratios
        logging.info('Found %d patches.' % self.n_images)

        class_distribution_text = ', '.join(['%s: %.2f%%' % (label, count / self.n_images * 100)
                                             for label, count in self.class_distribution.items()])
        logging.info('Class distribution: %s.' % class_distribution_text)

        if self.shuffling:
            for numeric_label in self.numeric_labels:
                np.random.shuffle(self.blob_paths[numeric_label])

        # One background thread per non-empty class keeps the corresponding buffer filled.
        for numeric_label in self.numeric_labels:
            if len(self.blob_paths[numeric_label]) > 0:
                Thread(target=self.fill_buffer, daemon=True, args=(numeric_label, )).start()

    @abstractmethod
    def batch(self):
        return

    def length(self):
        return int(np.ceil(self.n_images / self.batch_size))

    def fill_buffer(self, numeric_label):
        # Runs forever in a daemon thread: cycles over the class's blobs and pushes
        # single images into the bounded queue (put blocks when the buffer is full).
        while True:
            for blob_path in self.blob_paths[numeric_label]:
                images = self.prepare_images(blob_path)

                for image in images:
                    self.buffers[numeric_label].put(image)

            if self.shuffling:
                np.random.shuffle(self.blob_paths[numeric_label])

    def prepare_images(self, blob_path):
        # Loads a blob of patches and either augments them (training)
        # or center-crops them to PATCH_SIZE (evaluation).
        images = np.load(blob_path)

        if self.shuffling:
            np.random.shuffle(images)

        prepared_images = []

        for i in range(len(images)):
            image = images[i].astype(np.float32)

            if self.augment:
                image = self._augment(image)
            else:
                x = (image.shape[0] - PATCH_SIZE) // 2
                y = (image.shape[1] - PATCH_SIZE) // 2

                image = image[x:(x + PATCH_SIZE), y:(y + PATCH_SIZE)]

            if self.subtract_mean:
                image -= IMAGENET_IMAGE_MEAN

            prepared_images.append(image)

        prepared_images = np.array(prepared_images)

        return prepared_images

    def _augment(self, image):
        # Random crop to PATCH_SIZE, random horizontal flip, random rotation by a multiple of 90 degrees.
        x_max = image.shape[0] - PATCH_SIZE
        y_max = image.shape[1] - PATCH_SIZE

        x = np.random.randint(x_max)
        y = np.random.randint(y_max)

        image = image[x:(x + PATCH_SIZE), y:(y + PATCH_SIZE)]

        if np.random.choice([True, False]):
            image = np.fliplr(image)

        image = np.rot90(image, k=np.random.randint(4))

        return image


class TrainingDiagSetDataset(AbstractDiagSetDataset):
    def batch(self):
        # Draws class labels according to class_ratios, then takes one image per label from the buffers.
        probabilities = [self.class_ratios[label] for label in self.numeric_labels]

        labels = np.random.choice(self.numeric_labels, self.batch_size, p=probabilities)
        images = np.array([self.buffers[label].get() for label in labels])

        return images, labels


class EvaluationDiagSetDataset(AbstractDiagSetDataset):
    def __init__(self, **kwargs):
        assert kwargs.get('augment', False) is False
        assert kwargs.get('shuffling', False) is False
        assert kwargs.get('class_ratios') is None

        kwargs['augment'] = False
        kwargs['shuffling'] = False
        kwargs['class_ratios'] = None

        self.current_numeric_label_index = 0
        self.current_batch_index = 0

        super().__init__(**kwargs)

    def batch(self):
        # Walks through the classes sequentially, emitting every image of one class
        # before moving to the next; batches never span two classes.
        labels = []
        images = []

        for _ in range(self.batch_size):
            label = self.numeric_labels[self.current_numeric_label_index]

            while len(self.blob_paths[label]) == 0:
                self.current_numeric_label_index = (self.current_numeric_label_index + 1) % len(self.numeric_labels)

                label = self.numeric_labels[self.current_numeric_label_index]

            image = self.buffers[label].get()

            labels.append(label)
            images.append(image)

            self.current_batch_index += 1

            if self.current_batch_index >= self.class_distribution[label]:
                self.current_batch_index = 0
                self.current_numeric_label_index += 1

                if self.current_numeric_label_index >= len(self.numeric_labels):
                    self.current_numeric_label_index = 0

                break

        labels = np.array(labels)
        images = np.array(images)

        return images, labels
--------------------------------------------------------------------------------
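The patch-preparation logic in `container.py` can be sanity-checked without the dataset itself. The sketch below mirrors the `_augment` steps (random crop to `PATCH_SIZE`, random horizontal flip, random 90° rotation) and the center-crop branch of `prepare_images` on a dummy array; the function names and the 256×256 patch size are illustrative, not part of the repository.

```python
import numpy as np

PATCH_SIZE = 224  # same constant as in container.py


def random_crop_augment(image):
    # Random crop to PATCH_SIZE x PATCH_SIZE, mirroring AbstractDiagSetDataset._augment
    x = np.random.randint(image.shape[0] - PATCH_SIZE)
    y = np.random.randint(image.shape[1] - PATCH_SIZE)
    image = image[x:(x + PATCH_SIZE), y:(y + PATCH_SIZE)]

    # Random horizontal flip and rotation by a random multiple of 90 degrees
    if np.random.choice([True, False]):
        image = np.fliplr(image)

    return np.rot90(image, k=np.random.randint(4))


def center_crop(image):
    # Deterministic center crop, mirroring the evaluation branch of prepare_images
    x = (image.shape[0] - PATCH_SIZE) // 2
    y = (image.shape[1] - PATCH_SIZE) // 2

    return image[x:(x + PATCH_SIZE), y:(y + PATCH_SIZE)]


# A dummy patch slightly larger than PATCH_SIZE (the 256x256 size here is illustrative)
dummy = np.random.rand(256, 256, 3).astype(np.float32)

assert random_crop_augment(dummy).shape == (PATCH_SIZE, PATCH_SIZE, 3)
assert center_crop(dummy).shape == (PATCH_SIZE, PATCH_SIZE, 3)
```

Because the augmented crop is always square, rotating by a multiple of 90 degrees preserves its shape, which is why both branches yield `PATCH_SIZE`-sided outputs.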