├── .gitignore
├── EDA
│   ├── README.md
│   ├── eda.ipynb
│   └── images
│       ├── nsvc.jpg
│       ├── pcpsc.jpg
│       └── pcpst.jpg
├── LICENSE
├── README.md
├── densenet.py
├── main.py
├── pipeline.py
├── train.py
└── utils.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 |
49 | # Translations
50 | *.mo
51 | *.pot
52 |
53 | # Django stuff:
54 | *.log
55 | local_settings.py
56 |
57 | # Flask stuff:
58 | instance/
59 | .webassets-cache
60 |
61 | # Scrapy stuff:
62 | .scrapy
63 |
64 | # Sphinx documentation
65 | docs/_build/
66 |
67 | # PyBuilder
68 | target/
69 |
70 | # Jupyter Notebook
71 | .ipynb_checkpoints
72 |
73 | # pyenv
74 | .python-version
75 |
76 | # celery beat schedule file
77 | celerybeat-schedule
78 |
79 | # SageMath parsed files
80 | *.sage.py
81 |
82 | # dotenv
83 | .env
84 |
85 | # virtualenv
86 | .venv
87 | venv/
88 | ENV/
89 |
90 | # Spyder project settings
91 | .spyderproject
92 | .spyproject
93 |
94 | # Rope project settings
95 | .ropeproject
96 |
97 | # mkdocs documentation
98 | /site
99 |
100 | # mypy
101 | .mypy_cache/
102 |
--------------------------------------------------------------------------------
/EDA/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Data Analysis Report for MURA
3 |
4 | MURA is a dataset of musculoskeletal radiographs consisting of 14,982 `studies` from 12,251 `patients`, with a total of 40,895 `multi-view radiographic images`. Each `study` belongs to one of seven standard upper extremity radiographic `study types`:
5 | elbow, finger, forearm, hand, humerus, shoulder and wrist.
6 |
7 | ## Components of MURA dataset
8 |
9 | The MURA dataset comes with `train`, `valid` and `test` folders containing the corresponding datasets; `train.csv` and `valid.csv` contain the paths of the `radiographic images` and their labels. Each image is labeled as 1 (abnormal) or 0 (normal) based on whether its corresponding study is positive or negative, respectively.
10 |
11 | Sometimes, these radiographic images are also referred to as `views`.
12 |
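For instance, the label files can be loaded with pandas. A minimal sketch, mirroring `eda.ipynb` (the relative `../MURA-v1.0/` path is an assumption: it presumes the dataset folder sits next to this repository):

```python
import pandas as pd

# the CSVs have no header row: each line is "<image path>,<label>"
train_df = pd.read_csv('../MURA-v1.0/train.csv', names=['Path', 'Label'])
valid_df = pd.read_csv('../MURA-v1.0/valid.csv', names=['Path', 'Label'])
print(train_df.shape, valid_df.shape)  # number of labeled images in each set
print(train_df.head(3))                # image paths with their 0/1 labels
```
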
13 | ## Components of `train` and `valid` set
14 |
15 | * `train` set consists of seven `study types` namely:
16 |
17 | `XR_ELBOW` `XR_FINGER` `XR_FOREARM` `XR_HAND` `XR_HUMERUS` `XR_SHOULDER` `XR_WRIST`
18 |
19 | * Each `study type` contains several folders named like:
20 |
21 | `patient12104` `patient12110` `patient12116` `patient12122` `patient12128` ...
22 |
23 | * These folders are named after patient IDs; each of these folders contains one or more `study` folders, named like:
24 |
25 | `study1_negative` `study2_negative` `study3_positive` ...
26 |
27 | * Each of these `study`s contains one or more radiographs (views or images), named like:
28 |
29 | `image1.png` `image2.png` ...
30 |
31 | * Each view (image) is RGB with pixel range [0, 255] and varies in dimensions.
32 |
33 | **NOTE**: all of the above points also hold for the `test` set, except the third one: there, the `study` folders are simply named `study1` `study2` ... (a directory-walking sketch follows below).
34 |
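As a quick illustration of the layout above, here is a minimal sketch that enumerates the studies of one `study type` in the `train` set and derives each label from the folder-name suffix; it mirrors the logic in `pipeline.py`, and the `../MURA-v1.0/` location is again an assumption:

```python
import os

BASE_DIR = '../MURA-v1.0/train/XR_WRIST/'
study_label = {'positive': 1, 'negative': 0}

for patient in sorted(os.listdir(BASE_DIR)):      # e.g. patient12104
    for study in os.listdir(BASE_DIR + patient):  # e.g. study1_negative
        label = study_label[study.split('_')[1]]  # 1 = abnormal, 0 = normal
        views = os.listdir(BASE_DIR + patient + '/' + study)  # image1.png, ...
        print(patient, study, label, len(views))
```
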
35 | ## Some insightful plots
36 |
37 | ### Plot of number of Patients vs `study type`
38 |
39 |
40 | ![Number of patients per study type](images/pcpst.jpg)
41 | In the `train` set, `XR_WRIST` has the maximum number of patients, followed by `XR_FINGER`, `XR_HUMERUS`, `XR_SHOULDER`, `XR_HAND`, `XR_ELBOW` and `XR_FOREARM`. `XR_FOREARM`, with 606 patients, has the fewest. A similar pattern can be seen in the `valid` set: `XR_WRIST` has the maximum, followed by `XR_FINGER`, `XR_SHOULDER`, `XR_HUMERUS`, `XR_HAND`, `XR_ELBOW` and `XR_FOREARM`.
42 |
43 | ### Plot of number of patients vs study count
44 |
45 | Patients of a `study type` might have multiple `study`s; for example, a patient may have 3 `study`s for the wrist, independent of each other.
46 | The following plot shows how the number of patients varies with the number of `study`s.
47 |
48 | **NOTE**: study count = number of studies, so if 4 patients have study count 3, that means each of those 4 patients has undergone 3 `study`s for a given `study type`.
49 |
50 |
51 | ![Number of patients vs study count](images/pcpsc.jpg)
52 |
53 | Patients of the `XR_FOREARM` and `XR_HUMERUS` `study type`s have either 1 or 2 `study`s only.
54 | Patients of `XR_FINGER`, `XR_HAND` and `XR_ELBOW` have up to 3 `study`s.
55 | Patients of `XR_SHOULDER` and `XR_WRIST` have up to 4 `study`s.
56 |
57 | ### Plot of number of `study`s vs number of views
58 |
59 | Each `study` may have one or more views; the following plot shows how the number of views per `study` varies in the `train` dataset.
60 |
61 |
62 | ![Number of studies vs number of views](images/nsvc.jpg)
63 | The maximum number of views per study can be found in `XR_SHOULDER`, which has a study with as many as 13 images (views); similarly, `XR_HUMERUS` has a study with 10 images. It can be seen that most `study`s have either 2, 3 or 4 images.
64 |
--------------------------------------------------------------------------------
/EDA/eda.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Exploratory Data Analysis"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import os\n",
19 | "import time\n",
20 | "import torch\n",
21 | "from torch.utils.data import DataLoader, Dataset\n",
22 | "from torchvision.utils import make_grid\n",
23 | "from torchvision import transforms\n",
24 | "from collections import defaultdict\n",
25 | "from torchvision.datasets.folder import pil_loader\n",
26 | "import random\n",
27 | "import numpy as np\n",
28 | "import pandas as pd\n",
29 | "import matplotlib.pyplot as plt\n",
30 | "import pylab\n",
31 | "from skimage import io, transform\n",
32 | "\n",
33 | "pd.set_option('max_colwidth', 800)\n",
34 | "\n",
35 | "%matplotlib inline"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "## Load Data"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "We have seven categories of musculoskeletal radiographs"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": null,
55 | "metadata": {
56 | "collapsed": true
57 | },
58 | "outputs": [],
59 | "source": [
60 | "train_df = pd.read_csv('../MURA-v1.0/train.csv', names=['Path', 'Label'])\n",
61 | "valid_df = pd.read_csv('../MURA-v1.0/valid.csv', names=['Path', 'Label'])"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 |     "Let's check out the shapes of the dataframes"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": null,
74 | "metadata": {},
75 | "outputs": [],
76 | "source": [
77 | "train_df.shape, valid_df.shape"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 |     "We have 37111 radiographs in the training set and 3225 in the validation set; let's peek into the dataframes"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": null,
90 | "metadata": {},
91 | "outputs": [],
92 | "source": [
93 | "train_df.head(3)"
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": null,
99 | "metadata": {},
100 | "outputs": [],
101 | "source": [
102 | "valid_df.head(3)"
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 |     "So, we have radiograph paths and their corresponding labels, each radiograph has a label of 0 (normal) or 1 (abnormal)"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "## Analysis"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 |     "According to the paper:\n",
124 | "1.\n",
125 | "\n",
126 | " The MURA abnormality detection task is a binary classification task, where the input is an upper \n",
127 |     " extremity radiograph study — with each study containing one or more views (images) — and the \n",
128 | " expected output is a binary label y ∈ {0, 1} indicating whether the \"study\" is normal or abnormal, \n",
129 | " respectively.\n",
130 | "2.\n",
131 | "\n",
132 | " The model takes as input one or more views for a study of an upper extremity. On each view, our 169-\n",
133 | " layer convolutional neural network predicts the probability of abnormality. We compute the overall \n",
134 | " probability of abnormality for the study by taking the arithmetic mean of the abnormality \n",
135 | " probabilities output by the network for each image. The model makes the binary prediction of \n",
136 | " abnormal if the probability of abnormality for the study is greater than 0.5."
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "metadata": {},
142 | "source": [
143 |     "So, we have to make predictions at the study level, taking into account the predictions for all the views (images) of a study. This can be done by taking the arithmetic mean of the predictions for all the views (images) under a particular study."
144 | ]
145 | },
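  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check of this rule, here is a minimal sketch with hypothetical per-view probabilities (not real model outputs):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# average the per-view abnormality probabilities, then threshold the mean at 0.5\n",
    "view_probs = torch.Tensor([0.2, 0.7, 0.9])  # hypothetical per-view outputs\n",
    "study_prob = view_probs.mean()              # arithmetic mean over the views\n",
    "study_prob, int(study_prob > 0.5)           # study probability and binary prediction"
   ]
  },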
146 | {
147 | "cell_type": "code",
148 | "execution_count": null,
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "train_df.head(30)"
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 |     "Analyzing this dataframe, we can see that images are annotated based on whether their corresponding study is positive (abnormal, 1) or negative (normal, 0)"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "### Plot some random radiographs from training and validation set"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": null,
172 | "metadata": {
173 | "collapsed": true
174 | },
175 | "outputs": [],
176 | "source": [
177 |     "train_mat = train_df.values\n",
178 |     "valid_mat = valid_df.values"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": null,
184 | "metadata": {},
185 | "outputs": [],
186 | "source": [
187 |     "ix = np.random.randint(0, len(train_mat)) # randomly select an index\n",
188 | "img_path = train_mat[ix][0]\n",
189 | "plt.imshow(io.imread(img_path), cmap='binary')\n",
190 | "cat = img_path.split('/')[2] # get the radiograph category\n",
191 |     "plt.title('Category: %s & Label: %d' % (cat, train_mat[ix][1]))\n",
192 | "plt.show()"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": null,
198 | "metadata": {
199 | "collapsed": true
200 | },
201 | "outputs": [],
202 | "source": [
203 | "ix = np.random.randint(0, len(valid_mat))\n",
204 | "img_path = valid_mat[ix][0]\n",
205 | "plt.imshow(io.imread(img_path), cmap='binary')\n",
206 | "cat = img_path.split('/')[2]\n",
207 |     "plt.title('Category: %s & Label: %d' % (cat, valid_mat[ix][1]))\n",
208 | "plt.show()"
209 | ]
210 | },
211 | {
212 | "cell_type": "markdown",
213 | "metadata": {},
214 | "source": [
215 |     "It can be seen that the images vary in resolution and dimensions"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": null,
221 | "metadata": {
222 | "collapsed": true
223 | },
224 | "outputs": [],
225 | "source": [
226 | "# look at the pixel values\n",
227 | "io.imread(img_path)[0]"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "### Data Exploration"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": null,
240 | "metadata": {
241 | "collapsed": true
242 | },
243 | "outputs": [],
244 | "source": [
245 | "!ls ../MURA-v1.0/train/"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": null,
251 | "metadata": {
252 | "collapsed": true
253 | },
254 | "outputs": [],
255 | "source": [
256 | "!ls ../MURA-v1.0/train/XR_ELBOW/"
257 | ]
258 | },
259 | {
260 | "cell_type": "markdown",
261 | "metadata": {},
262 | "source": [
263 |     "So, the train dataset has seven study types; each study type has studies on patients, stored in folders named like patient001, patient002, etc."
264 | ]
265 | },
266 | {
267 | "cell_type": "markdown",
268 | "metadata": {},
269 | "source": [
270 | "#### Patient count per study type"
271 | ]
272 | },
273 | {
274 | "cell_type": "markdown",
275 | "metadata": {},
276 | "source": [
277 |     "Let's count the number of patients in each study type"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": null,
283 | "metadata": {
284 | "collapsed": true
285 | },
286 | "outputs": [],
287 | "source": [
288 |     "data_cat = ['train', 'valid']\n",
289 | "study_types = list(os.walk('../MURA-v1.0/train/'))[0][1] # study types, same for train and valid sets\n",
290 | "patients_count = {} # to store all patients count for each study type, for train and valid sets\n",
291 | "for phase in data_cat:\n",
292 | " patients_count[phase] = {}\n",
293 | " for study_type in study_types:\n",
294 | " patients = list(os.walk('../MURA-v1.0/%s/%s' %(phase, study_type)))[0][1] # patient folder names\n",
295 | " patients_count[phase][study_type] = len(patients)"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": null,
301 | "metadata": {},
302 | "outputs": [],
303 | "source": [
304 | "print(study_types)\n",
305 | "print()\n",
306 | "print(patients_count)"
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": null,
312 | "metadata": {},
313 | "outputs": [],
314 | "source": [
315 | "# plot the patient counts per study type \n",
316 | "\n",
317 | "fig, ax = plt.subplots(figsize=(10, 5))\n",
318 | "for i, phase in enumerate(data_cat):\n",
319 | " counts = patients_count[phase].values()\n",
320 | " m = max(counts)\n",
321 | " for i, v in enumerate(counts):\n",
322 | " if v==m: ax.text(i-0.1, v+3, str(v))\n",
323 | " else: ax.text(i-0.1, v + 20, str(v))\n",
324 | " x_pos = np.arange(len(study_types))\n",
325 | " plt.bar(x_pos, counts, alpha=0.5)\n",
326 | " plt.xticks(x_pos, study_types)\n",
327 | "\n",
328 | "plt.xlabel('Study types')\n",
329 | "plt.ylabel('Number of patients')\n",
330 | "plt.legend(['train', 'valid'])\n",
331 | "plt.show()\n",
332 | "fig.savefig('images/pcpst.jpg', bbox_inches='tight', pad_inches=0) # name=patient count per study type"
333 | ]
334 | },
335 | {
336 | "cell_type": "markdown",
337 | "metadata": {},
338 | "source": [
339 |     "XR_WRIST has the greatest number of patients, followed by XR_FINGER (1867 patients in the train set, 166 in the valid set)"
340 | ]
341 | },
342 | {
343 | "cell_type": "markdown",
344 | "metadata": {},
345 | "source": [
346 | "### Study count"
347 | ]
348 | },
349 | {
350 | "cell_type": "markdown",
351 | "metadata": {},
352 | "source": [
353 |     "Patients might have multiple studies for a given study type; for example, a patient may have two studies for the wrist, independent of each other. Let's have a look at such cases. **NOTE**: here, study count = the number of studies a patient has undergone for a given study type"
354 | ]
355 | },
356 | {
357 | "cell_type": "code",
358 | "execution_count": null,
359 | "metadata": {
360 | "collapsed": true
361 | },
362 | "outputs": [],
363 | "source": [
364 |     "# let's find out how many studies each patient has, for each study_type\n",
365 | "study_count = {} # to store study counts for each study type \n",
366 | "for study_type in study_types:\n",
367 | " BASE_DIR = '../MURA-v1.0/train/%s/' % study_type\n",
368 | " study_count[study_type] = defaultdict(lambda:0) # to store study count for current study_type, initialized to 0 by default\n",
369 | " patients = list(os.walk(BASE_DIR))[0][1] # patient folder names\n",
370 | " for patient in patients:\n",
371 | " studies = os.listdir(BASE_DIR+patient)\n",
372 | " study_count[study_type][len(studies)] += 1"
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": null,
378 | "metadata": {},
379 | "outputs": [],
380 | "source": [
381 | "study_count"
382 | ]
383 | },
384 | {
385 | "cell_type": "markdown",
386 | "metadata": {},
387 | "source": [
388 |     "XR_WRIST has 3111 patients with only a single study; 158 patients have 2 studies, 12 patients have 3 studies and 4 patients have 4 studies. Let's plot this data"
389 | ]
390 | },
391 | {
392 | "cell_type": "code",
393 | "execution_count": null,
394 | "metadata": {},
395 | "outputs": [],
396 | "source": [
397 | "# plot the study count vs number of patients per study type data \n",
398 | "fig = plt.figure(figsize=(8, 25))\n",
399 | "for i, study_type in enumerate(study_count):\n",
400 | " ax = fig.add_subplot(7, 1, i+1)\n",
401 | " study = study_count[study_type]\n",
402 | " # text in the plot\n",
403 | " m = max(study.values())\n",
404 | " for i, v in enumerate(study.values()):\n",
405 | " if v==m: ax.text(i-0.1, v - 200, str(v))\n",
406 | " else: ax.text(i-0.1, v + 200, str(v))\n",
407 | " ax.text(i, m - 100, study_type, color='green')\n",
408 | " # plot the bar chart\n",
409 | " x_pos = np.arange(len(study))\n",
410 | " plt.bar(x_pos, study.values(), align='center', alpha=0.5)\n",
411 | " plt.xticks(x_pos, study.keys())\n",
412 | " plt.xlabel('Study count')\n",
413 | " plt.ylabel('Number of patients')\n",
414 | "plt.show()\n",
415 | "fig.savefig('images/pcpsc.jpg', bbox_inches='tight', pad_inches=0)"
416 | ]
417 | },
418 | {
419 | "cell_type": "markdown",
420 | "metadata": {},
421 | "source": [
422 | "### Number of views per study"
423 | ]
424 | },
425 | {
426 | "cell_type": "markdown",
427 | "metadata": {},
428 | "source": [
429 |     "It can be seen that each study may have more than one view (radiograph image), let's have a look"
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": null,
435 | "metadata": {
436 | "collapsed": true
437 | },
438 | "outputs": [],
439 | "source": [
440 |     "# let's find out the number of views per study, for each study_type\n",
441 |     "view_count = {} # maps study type -> {number of views: number of studies with that many views}\n",
442 | "for study_type in study_types:\n",
443 | " BASE_DIR = '../MURA-v1.0/train/%s/' % study_type\n",
444 |     "    view_count[study_type] = defaultdict(lambda:0) # view counts for the current study_type, initialized to 0 by default\n",
445 | " patients = list(os.walk(BASE_DIR))[0][1] # patient folder names\n",
446 | " for patient in patients:\n",
447 | " studies = os.listdir(BASE_DIR + patient)\n",
448 | " for study in studies:\n",
449 | " views = os.listdir(BASE_DIR + patient + '/' + study)\n",
450 | " view_count[study_type][len(views)] += 1"
451 | ]
452 | },
453 | {
454 | "cell_type": "code",
455 | "execution_count": null,
456 | "metadata": {},
457 | "outputs": [],
458 | "source": [
459 | "view_count"
460 | ]
461 | },
462 | {
463 | "cell_type": "markdown",
464 | "metadata": {},
465 | "source": [
466 |     "`XR_SHOULDER` has as many as 13 views in some studies, while `XR_HAND` has at most 5; this makes it challenging to predict on a whole study, taking into account all of its views, while keeping the batch size of 8 (as mentioned in the MURA paper)"
467 | ]
468 | },
469 | {
470 | "cell_type": "code",
471 | "execution_count": null,
472 | "metadata": {},
473 | "outputs": [],
474 | "source": [
475 | "# plot the view count vs number of studies per study type data \n",
476 | "fig = plt.figure(figsize=(10, 30))\n",
477 | "for i, view_type in enumerate(view_count):\n",
478 | " ax = fig.add_subplot(7, 1, i+1)\n",
479 | " view = view_count[view_type]\n",
480 | " # text in the plot\n",
481 | " m = max(view.values())\n",
482 | " for i, v in enumerate(view.values()):\n",
483 | " if v==m: ax.text(i-0.1, v - 200, str(v))\n",
484 | " else: ax.text(i-0.1, v + 80, str(v))\n",
485 | " ax.text(i - 0.5, m - 80, view_type, color='green')\n",
486 | " # plot the bar chart\n",
487 | " x_pos = np.arange(len(view))\n",
488 | " plt.bar(x_pos, view.values(), align='center', alpha=0.5)\n",
489 | " plt.xticks(x_pos, view.keys())\n",
490 | " plt.xlabel('Number of views')\n",
491 | " plt.ylabel('Number of studies')\n",
492 | "plt.show()\n",
493 | "fig.savefig('images/nsvc.jpg', bbox_inches='tight', pad_inches=0) # name=number of studies view count"
494 | ]
495 | },
496 | {
497 | "cell_type": "markdown",
498 | "metadata": {},
499 | "source": [
500 | "Most of the studies contain 2, 3 or 4 views"
501 | ]
502 | }
503 | ],
504 | "metadata": {
505 | "kernelspec": {
506 | "display_name": "Python [default]",
507 | "language": "python",
508 | "name": "python3"
509 | },
510 | "language_info": {
511 | "codemirror_mode": {
512 | "name": "ipython",
513 | "version": 3
514 | },
515 | "file_extension": ".py",
516 | "mimetype": "text/x-python",
517 | "name": "python",
518 | "nbconvert_exporter": "python",
519 | "pygments_lexer": "ipython3",
520 | "version": "3.5.3"
521 | }
522 | },
523 | "nbformat": 4,
524 | "nbformat_minor": 2
525 | }
526 |
--------------------------------------------------------------------------------
/EDA/images/nsvc.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pyaf/DenseNet-MURA-PyTorch/04f7c5ab4c0abca7a8e55042626362fd2c9a09ad/EDA/images/nsvc.jpg
--------------------------------------------------------------------------------
/EDA/images/pcpsc.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pyaf/DenseNet-MURA-PyTorch/04f7c5ab4c0abca7a8e55042626362fd2c9a09ad/EDA/images/pcpsc.jpg
--------------------------------------------------------------------------------
/EDA/images/pcpst.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pyaf/DenseNet-MURA-PyTorch/04f7c5ab4c0abca7a8e55042626362fd2c9a09ad/EDA/images/pcpst.jpg
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 Rishabh Agrahari
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DenseNet on MURA Dataset using PyTorch
2 |
3 | A PyTorch implementation of the 169-layer [DenseNet](https://arxiv.org/abs/1608.06993) model on the MURA dataset, inspired by the paper [arXiv:1712.06957v3](https://arxiv.org/abs/1712.06957) by Pranav Rajpurkar et al. MURA is a large dataset of musculoskeletal radiographs, where each study is manually labeled by radiologists as either normal or abnormal. [Know more](https://stanfordmlgroup.github.io/projects/mura/)
4 |
5 | ## Important Points:
6 | * The implemented model is a 169 layer DenseNet with single node output layer initialized with weights from a model pretrained on ImageNet dataset.
7 | * Before feeding the images to the network, each image is normalized to have the same mean and standard deviation as the images in the ImageNet training set, scaled to 224 x 224 and augmented with random lateral inversions and rotations.
8 | * The model uses a modified binary cross-entropy loss function as mentioned in the paper; a sketch follows after this list.
9 | * The learning rate decays by a factor of 10 every time the validation loss plateaus after an epoch.
10 | * The optimization algorithm is Adam with default parameters β1 = 0.9 and β2 = 0.999.
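
A minimal sketch of that weighted loss, simplified from the `Loss` class in [main.py](main.py) (the per-phase weight dictionaries used there are dropped; `Wt1` and `Wt0` stand for the image-count proportions computed from the data):

```python
import torch

class WeightedBCELoss(torch.nn.Module):
    """L(p, y) = -(Wt1 * y * log(p) + Wt0 * (1 - y) * log(1 - p))"""
    def __init__(self, Wt1, Wt0):
        super(WeightedBCELoss, self).__init__()
        self.Wt1, self.Wt0 = Wt1, Wt0

    def forward(self, inputs, targets):
        # inputs: sigmoid probabilities p, targets: labels y in {0, 1}
        return -(self.Wt1 * targets * inputs.log()
                 + self.Wt0 * (1 - targets) * (1 - inputs).log())
```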
11 |
12 | According to the MURA dataset paper:
13 |
14 | > The model takes as input one or more views for a study of an upper extremity. On each view, our 169-layer convolutional neural network predicts the probability of abnormality. We compute the overall probability of abnormality for the study by taking the arithmetic mean of the abnormality probabilities output by the network for each image.
15 |
16 | The model implemented in [densenet.py](densenet.py) takes as input all the views for a study of an upper extremity. On each view the model predicts the probability of abnormality; the overall probability of abnormality for the study is then computed in [train.py](train.py) by taking the arithmetic mean of the abnormality probabilities output by the network for each image.
17 |
18 | ## Instructions
19 |
20 | Install dependencies:
21 | * PyTorch
22 | * TorchVision
23 | * Numpy
24 | * Pandas
25 |
26 | Train the model with `python main.py`
27 |
28 | ## Citation
29 | @ARTICLE{2017arXiv171206957R,
30 | author = {{Rajpurkar}, P. and {Irvin}, J. and {Bagul}, A. and {Ding}, D. and
31 | {Duan}, T. and {Mehta}, H. and {Yang}, B. and {Zhu}, K. and
32 | {Laird}, D. and {Ball}, R.~L. and {Langlotz}, C. and {Shpanskaya}, K. and
33 | {Lungren}, M.~P. and {Ng}, A.},
34 | title = "{MURA Dataset: Towards Radiologist-Level Abnormality Detection in Musculoskeletal Radiographs}",
35 | journal = {ArXiv e-prints},
36 | archivePrefix = "arXiv",
37 | eprint = {1712.06957},
38 | primaryClass = "physics.med-ph",
39 | keywords = {Physics - Medical Physics, Computer Science - Artificial Intelligence},
40 | year = 2017,
41 | month = dec,
42 | adsurl = {http://adsabs.harvard.edu/abs/2017arXiv171206957R},
43 | adsnote = {Provided by the SAO/NASA Astrophysics Data System}
44 | }
45 |
--------------------------------------------------------------------------------
/densenet.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | import torch.nn.functional as F
4 | import torch.utils.model_zoo as model_zoo
5 | from collections import OrderedDict
6 |
7 | __all__ = ['DenseNet', 'densenet169']
8 |
9 |
10 | model_urls = {
11 | 'densenet169': 'https://download.pytorch.org/models/densenet169-b2777c0a.pth',
12 | }
13 |
14 | def densenet169(pretrained=False, **kwargs):
15 | r"""Densenet-169 model from
16 | `"Densely Connected Convolutional Networks" `_
17 |
18 | Args:
19 | pretrained (bool): If True, returns a model pre-trained on ImageNet
20 | """
21 | model = DenseNet(num_init_features=64, growth_rate=32, block_config=(6, 12, 32, 32),
22 | **kwargs)
23 | if pretrained:
24 | model.load_state_dict(model_zoo.load_url(model_urls['densenet169']), strict=False)
25 | return model
26 |
27 | class _DenseLayer(nn.Sequential):
28 | def __init__(self, num_input_features, growth_rate, bn_size, drop_rate):
29 | super(_DenseLayer, self).__init__()
30 | self.add_module('norm1', nn.BatchNorm2d(num_input_features)),
31 | self.add_module('relu1', nn.ReLU(inplace=True)),
32 | self.add_module('conv1', nn.Conv2d(num_input_features, bn_size *
33 | growth_rate, kernel_size=1, stride=1, bias=False)),
34 | self.add_module('norm2', nn.BatchNorm2d(bn_size * growth_rate)),
35 | self.add_module('relu2', nn.ReLU(inplace=True)),
36 | self.add_module('conv2', nn.Conv2d(bn_size * growth_rate, growth_rate,
37 | kernel_size=3, stride=1, padding=1, bias=False)),
38 | self.drop_rate = drop_rate
39 |
40 | def forward(self, x):
41 | new_features = super(_DenseLayer, self).forward(x)
42 | if self.drop_rate > 0:
43 | new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
44 | return torch.cat([x, new_features], 1)
45 |
46 |
47 | class _DenseBlock(nn.Sequential):
48 | def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate):
49 | super(_DenseBlock, self).__init__()
50 | for i in range(num_layers):
51 | layer = _DenseLayer(num_input_features + i * growth_rate, growth_rate, bn_size, drop_rate)
52 | self.add_module('denselayer%d' % (i + 1), layer)
53 |
54 |
55 | class _Transition(nn.Sequential):
56 | def __init__(self, num_input_features, num_output_features):
57 | super(_Transition, self).__init__()
58 | self.add_module('norm', nn.BatchNorm2d(num_input_features))
59 | self.add_module('relu', nn.ReLU(inplace=True))
60 | self.add_module('conv', nn.Conv2d(num_input_features, num_output_features,
61 | kernel_size=1, stride=1, bias=False))
62 | self.add_module('pool', nn.AvgPool2d(kernel_size=2, stride=2))
63 |
64 |
65 | class DenseNet(nn.Module):
66 | r"""Densenet-BC model class, based on
67 | `"Densely Connected Convolutional Networks" `_
68 |
69 | Args:
70 | growth_rate (int) - how many filters to add each layer (`k` in paper)
71 | block_config (list of 4 ints) - how many layers in each pooling block
72 | num_init_features (int) - the number of filters to learn in the first convolution layer
73 | bn_size (int) - multiplicative factor for number of bottle neck layers
74 | (i.e. bn_size * k features in the bottleneck layer)
75 | drop_rate (float) - dropout rate after each dense layer
76 | num_classes (int) - number of classification classes
77 | """
78 | def __init__(self, growth_rate=32, block_config=(6, 12, 24, 16),
79 | num_init_features=64, bn_size=4, drop_rate=0, num_classes=1000):
80 |
81 | super(DenseNet, self).__init__()
82 |
83 | # First convolution
84 | self.features = nn.Sequential(OrderedDict([
85 | ('conv0', nn.Conv2d(3, num_init_features, kernel_size=7, stride=2, padding=3, bias=False)),
86 | ('norm0', nn.BatchNorm2d(num_init_features)),
87 | ('relu0', nn.ReLU(inplace=True)),
88 | ('pool0', nn.MaxPool2d(kernel_size=3, stride=2, padding=1)),
89 | ]))
90 |
91 | # Each denseblock
92 | num_features = num_init_features
93 | for i, num_layers in enumerate(block_config):
94 | block = _DenseBlock(num_layers=num_layers, num_input_features=num_features,
95 | bn_size=bn_size, growth_rate=growth_rate, drop_rate=drop_rate)
96 | self.features.add_module('denseblock%d' % (i + 1), block)
97 | num_features = num_features + num_layers * growth_rate
98 | if i != len(block_config) - 1:
99 | trans = _Transition(num_input_features=num_features, num_output_features=num_features // 2)
100 | self.features.add_module('transition%d' % (i + 1), trans)
101 | num_features = num_features // 2
102 |
103 | # Final batch norm
104 | self.features.add_module('norm5', nn.BatchNorm2d(num_features))
105 |
106 | # Linear layer
107 | # self.classifier = nn.Linear(num_features, 1000)
108 | # self.fc = nn.Linear(1000, 1)
109 |
110 | self.fc = nn.Linear(num_features, 1)
111 |
112 | # Official init from torch repo.
113 | for m in self.modules():
114 | if isinstance(m, nn.Conv2d):
115 | nn.init.kaiming_normal(m.weight.data)
116 | elif isinstance(m, nn.BatchNorm2d):
117 | m.weight.data.fill_(1)
118 | m.bias.data.zero_()
119 | elif isinstance(m, nn.Linear):
120 | m.bias.data.zero_()
121 |
122 | def forward(self, x):
123 | features = self.features(x)
124 | out = F.relu(features, inplace=True)
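        # global average pooling: with the 224x224 inputs used in this repo the
        # final feature map is 7x7, so a kernel_size-7 average pool reduces it
        # to one value per channel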
125 | out = F.avg_pool2d(out, kernel_size=7, stride=1).view(features.size(0), -1)
126 | # out = F.relu(self.classifier(out))
127 | out = F.sigmoid(self.fc(out))
128 | return out
129 |
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | import time
2 | import copy
3 | import pandas as pd
4 | import torch
5 | from torch.autograd import Variable
6 | from densenet import densenet169
7 | from utils import plot_training, n_p, get_count
8 | from train import train_model, get_metrics
9 | from pipeline import get_study_level_data, get_dataloaders
10 |
11 | # #### load study level dict data
12 | study_data = get_study_level_data(study_type='XR_WRIST')
13 |
14 | # #### Create dataloaders pipeline
15 | data_cat = ['train', 'valid'] # data categories
16 | dataloaders = get_dataloaders(study_data, batch_size=1)
17 | dataset_sizes = {x: len(study_data[x]) for x in data_cat}
18 |
19 | # #### Build model
20 | # tai = total abnormal images, tni = total normal images
21 | tai = {x: get_count(study_data[x], 'positive') for x in data_cat}
22 | tni = {x: get_count(study_data[x], 'negative') for x in data_cat}
23 | Wt1 = {x: n_p(tni[x] / (tni[x] + tai[x])) for x in data_cat}
24 | Wt0 = {x: n_p(tai[x] / (tni[x] + tai[x])) for x in data_cat}
25 |
26 | print('tai:', tai)
27 | print('tni:', tni, '\n')
28 | print('Wt0 train:', Wt0['train'])
29 | print('Wt0 valid:', Wt0['valid'])
30 | print('Wt1 train:', Wt1['train'])
31 | print('Wt1 valid:', Wt1['valid'])
32 |
33 | class Loss(torch.nn.modules.Module):
34 | def __init__(self, Wt1, Wt0):
35 | super(Loss, self).__init__()
36 | self.Wt1 = Wt1
37 | self.Wt0 = Wt0
38 |
39 | def forward(self, inputs, targets, phase):
40 | loss = - (self.Wt1[phase] * targets * inputs.log() + self.Wt0[phase] * (1 - targets) * (1 - inputs).log())
41 | return loss
42 |
43 | model = densenet169(pretrained=True)
44 | model = model.cuda()
45 |
46 | criterion = Loss(Wt1, Wt0)
47 | optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
48 | scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=1, verbose=True)
49 |
50 | # #### Train model
51 | model = train_model(model, criterion, optimizer, dataloaders, scheduler, dataset_sizes, num_epochs=5)
52 |
53 | torch.save(model.state_dict(), 'models/model.pth')
54 |
55 | get_metrics(model, criterion, dataloaders, dataset_sizes)
--------------------------------------------------------------------------------
/pipeline.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pandas as pd
3 | from tqdm import tqdm
4 | import torch
5 | from torchvision import transforms
6 | from torch.utils.data import DataLoader, Dataset
7 | from torchvision.datasets.folder import pil_loader
8 |
9 | data_cat = ['train', 'valid'] # data categories
10 |
11 | def get_study_level_data(study_type):
12 | """
13 | Returns a dict, with keys 'train' and 'valid' and respective values as study level dataframes,
14 | these dataframes contain three columns 'Path', 'Count', 'Label'
15 | Args:
16 | study_type (string): one of the seven study type folder names in 'train/valid/test' dataset
17 | """
18 | study_data = {}
19 | study_label = {'positive': 1, 'negative': 0}
20 | for phase in data_cat:
21 | BASE_DIR = 'MURA-v1.0/%s/%s/' % (phase, study_type)
22 | patients = list(os.walk(BASE_DIR))[0][1] # list of patient folder names
23 | study_data[phase] = pd.DataFrame(columns=['Path', 'Count', 'Label'])
24 | i = 0
25 | for patient in tqdm(patients): # for each patient folder
26 | for study in os.listdir(BASE_DIR + patient): # for each study in that patient folder
27 | label = study_label[study.split('_')[1]] # get label 0 or 1
28 | path = BASE_DIR + patient + '/' + study + '/' # path to this study
29 | study_data[phase].loc[i] = [path, len(os.listdir(path)), label] # add new row
30 | i+=1
31 | return study_data
32 |
33 | class ImageDataset(Dataset):
34 | """training dataset."""
35 |
36 | def __init__(self, df, transform=None):
37 | """
38 | Args:
39 | df (pd.DataFrame): a pandas DataFrame with image path and labels.
40 | transform (callable, optional): Optional transform to be applied
41 | on a sample.
42 | """
43 | self.df = df
44 | self.transform = transform
45 |
46 | def __len__(self):
47 | return len(self.df)
48 |
49 | def __getitem__(self, idx):
50 | study_path = self.df.iloc[idx, 0]
51 | count = self.df.iloc[idx, 1]
52 | images = []
53 | for i in range(count):
54 | image = pil_loader(study_path + 'image%s.png' % (i+1))
55 | images.append(self.transform(image))
56 | images = torch.stack(images)
57 | label = self.df.iloc[idx, 2]
58 | sample = {'images': images, 'label': label}
59 | return sample
60 |
61 | def get_dataloaders(data, batch_size=8, study_level=False):
62 | '''
63 | Returns dataloader pipeline with data augmentation
64 | '''
65 | data_transforms = {
66 | 'train': transforms.Compose([
67 | transforms.Resize((224, 224)),
68 | transforms.RandomHorizontalFlip(),
69 | transforms.RandomRotation(10),
70 | transforms.ToTensor(),
71 | transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
72 | ]),
73 | 'valid': transforms.Compose([
74 | transforms.Resize((224, 224)),
75 | transforms.ToTensor(),
76 | transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
77 | ]),
78 | }
79 | image_datasets = {x: ImageDataset(data[x], transform=data_transforms[x]) for x in data_cat}
80 | dataloaders = {x: DataLoader(image_datasets[x], batch_size=batch_size, shuffle=True, num_workers=4) for x in data_cat}
81 | return dataloaders
82 |
83 | if __name__ == '__main__':
84 |     pass
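    # A hedged usage sketch (it assumes, as main.py does, that the MURA-v1.0
    # folder sits in the repository root); uncomment to try it:
    # study_data = get_study_level_data(study_type='XR_WRIST')
    # dataloaders = get_dataloaders(study_data, batch_size=1)
    # batch = next(iter(dataloaders['train']))
    # print(batch['images'].shape)  # torch.Size([1, n_views, 3, 224, 224])
    # print(batch['label'])         # the study label, 0 or 1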
85 |
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
1 | import time
2 | import copy
3 | import torch
4 | from torchnet import meter
5 | from torch.autograd import Variable
6 | from utils import plot_training
7 |
8 | data_cat = ['train', 'valid'] # data categories
9 |
10 | def train_model(model, criterion, optimizer, dataloaders, scheduler,
11 | dataset_sizes, num_epochs):
12 | since = time.time()
13 | best_model_wts = copy.deepcopy(model.state_dict())
14 | best_acc = 0.0
15 | costs = {x:[] for x in data_cat} # for storing costs per epoch
16 | accs = {x:[] for x in data_cat} # for storing accuracies per epoch
17 | print('Train batches:', len(dataloaders['train']))
18 | print('Valid batches:', len(dataloaders['valid']), '\n')
19 | for epoch in range(num_epochs):
20 | confusion_matrix = {x: meter.ConfusionMeter(2, normalized=True)
21 | for x in data_cat}
22 | print('Epoch {}/{}'.format(epoch+1, num_epochs))
23 | print('-' * 10)
24 | # Each epoch has a training and validation phase
25 | for phase in data_cat:
26 | model.train(phase=='train')
27 | running_loss = 0.0
28 | running_corrects = 0
29 | # Iterate over data.
30 | for i, data in enumerate(dataloaders[phase]):
31 | # get the inputs
32 | print(i, end='\r')
33 | inputs = data['images'][0]
34 | labels = data['label'].type(torch.FloatTensor)
35 | # wrap them in Variable
36 | inputs = Variable(inputs.cuda())
37 | labels = Variable(labels.cuda())
38 | # zero the parameter gradients
39 | optimizer.zero_grad()
40 | # forward
41 | outputs = model(inputs)
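                # study-level probability = arithmetic mean of the per-view
                # probabilities, as prescribed in the MURA paper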
42 | outputs = torch.mean(outputs)
43 | loss = criterion(outputs, labels, phase)
44 | running_loss += loss.data[0]
45 | # backward + optimize only if in training phase
46 | if phase == 'train':
47 | loss.backward()
48 | optimizer.step()
49 | # statistics
50 | preds = (outputs.data > 0.5).type(torch.cuda.FloatTensor)
51 | running_corrects += torch.sum(preds == labels.data)
52 | confusion_matrix[phase].add(preds, labels.data)
53 | epoch_loss = running_loss / dataset_sizes[phase]
54 | epoch_acc = running_corrects / dataset_sizes[phase]
55 | costs[phase].append(epoch_loss)
56 | accs[phase].append(epoch_acc)
57 | print('{} Loss: {:.4f} Acc: {:.4f}'.format(
58 | phase, epoch_loss, epoch_acc))
59 | print('Confusion Meter:\n', confusion_matrix[phase].value())
60 | # deep copy the model
61 | if phase == 'valid':
62 | scheduler.step(epoch_loss)
63 | if epoch_acc > best_acc:
64 | best_acc = epoch_acc
65 | best_model_wts = copy.deepcopy(model.state_dict())
66 | time_elapsed = time.time() - since
67 | print('Time elapsed: {:.0f}m {:.0f}s'.format(
68 | time_elapsed // 60, time_elapsed % 60))
69 | print()
70 | time_elapsed = time.time() - since
71 | print('Training complete in {:.0f}m {:.0f}s'.format(
72 | time_elapsed // 60, time_elapsed % 60))
73 | print('Best valid Acc: {:4f}'.format(best_acc))
74 | plot_training(costs, accs)
75 | # load best model weights
76 | model.load_state_dict(best_model_wts)
77 | return model
78 |
79 |
80 | def get_metrics(model, criterion, dataloaders, dataset_sizes, phase='valid'):
81 | '''
82 | Loops over phase (train or valid) set to determine acc, loss and
83 | confusion meter of the model.
84 | '''
85 | confusion_matrix = meter.ConfusionMeter(2, normalized=True)
86 | running_loss = 0.0
87 | running_corrects = 0
88 | for i, data in enumerate(dataloaders[phase]):
89 | print(i, end='\r')
90 | labels = data['label'].type(torch.FloatTensor)
91 | inputs = data['images'][0]
92 | # wrap them in Variable
93 | inputs = Variable(inputs.cuda())
94 | labels = Variable(labels.cuda())
95 | # forward
96 | outputs = model(inputs)
97 | outputs = torch.mean(outputs)
98 | loss = criterion(outputs, labels, phase)
99 | # statistics
100 | running_loss += loss.data[0] * inputs.size(0)
101 | preds = (outputs.data > 0.5).type(torch.cuda.FloatTensor)
102 | running_corrects += torch.sum(preds == labels.data)
103 | confusion_matrix.add(preds, labels.data)
104 |
105 | loss = running_loss / dataset_sizes[phase]
106 | acc = running_corrects / dataset_sizes[phase]
107 | print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, loss, acc))
108 | print('Confusion Meter:\n', confusion_matrix.value())
109 |
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from torch.autograd import Variable
3 | import matplotlib.pyplot as plt
4 | from torchnet import meter
5 |
6 | def plot_training(costs, accs):
7 | '''
8 | Plots curve of Cost vs epochs and Accuracy vs epochs for 'train' and 'valid' sets during training
9 | '''
10 | train_acc = accs['train']
11 | valid_acc = accs['valid']
12 | train_cost = costs['train']
13 | valid_cost = costs['valid']
14 | epochs = range(len(train_acc))
15 |
16 | plt.figure(figsize=(10, 5))
17 |
18 | plt.subplot(1, 2, 1,)
19 | plt.plot(epochs, train_acc)
20 | plt.plot(epochs, valid_acc)
21 | plt.legend(['train', 'valid'], loc='upper left')
22 | plt.title('Accuracy')
23 |
24 | plt.subplot(1, 2, 2)
25 | plt.plot(epochs, train_cost)
26 | plt.plot(epochs, valid_cost)
27 | plt.legend(['train', 'valid'], loc='upper left')
28 | plt.title('Cost')
29 |
30 | plt.show()
31 |
32 | def n_p(x):
33 | '''convert numpy float to Variable tensor float'''
34 | return Variable(torch.cuda.FloatTensor([x]), requires_grad=False)
35 |
36 | def get_count(df, cat):
37 | '''
38 |     Returns the number of images in a study-type dataframe that are abnormal or normal
39 | Args:
40 | df -- dataframe
41 | cat -- category, "positive" for abnormal and "negative" for normal
42 | '''
43 | return df[df['Path'].str.contains(cat)]['Count'].sum()
44 |
45 |
46 | if __name__ == '__main__':
47 |     pass
--------------------------------------------------------------------------------