├── .gitignore ├── EDA ├── README.md ├── eda.ipynb └── images │ ├── nsvc.jpg │ ├── pcpsc.jpg │ └── pcpst.jpg ├── LICENSE ├── README.md ├── densenet.py ├── main.py ├── pipeline.py ├── train.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | 49 | # Translations 50 | *.mo 51 | *.pot 52 | 53 | # Django stuff: 54 | *.log 55 | local_settings.py 56 | 57 | # Flask stuff: 58 | instance/ 59 | .webassets-cache 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | # Jupyter Notebook 71 | .ipynb_checkpoints 72 | 73 | # pyenv 74 | .python-version 75 | 76 | # celery beat schedule file 77 | celerybeat-schedule 78 | 79 | # SageMath parsed files 80 | *.sage.py 81 | 82 | # dotenv 83 | .env 84 | 85 | # virtualenv 86 | .venv 87 | venv/ 88 | ENV/ 89 | 90 | # Spyder project settings 91 | .spyderproject 92 | .spyproject 93 | 94 | # Rope project settings 95 | .ropeproject 96 | 97 | # mkdocs documentation 98 | /site 99 | 100 | # mypy 101 | .mypy_cache/ 102 | 
-------------------------------------------------------------------------------- /EDA/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Data Analysis Report for MURA 3 | 4 | MURA is a dataset of musculoskeletal radiographs consisting of 14,982 `studies` from 12,251 `patients`, with a total of 40,895 `multi-view radiographic images`. Each `study` belongs to one of seven standard upper extremity radiographic `study 5 | types`: elbow, finger, forearm, hand, humerus, shoulder and wrist. 6 | 7 | ## Components of MURA dataset 8 | 9 | The MURA dataset comes with `train`, `valid` and `test` folders containing the corresponding datasets; `train.csv` and `valid.csv` contain the paths of the `radiographic images` and their labels. Each image is labeled as 1 (abnormal) or 0 (normal) based on whether its corresponding study is positive or negative, respectively.
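The CSV layout described above can be loaded directly with pandas, as the accompanying notebook does. A minimal sketch (the sample rows below are illustrative, not actual entries from the dataset):

```python
import io
import pandas as pd

# Illustrative rows in the train.csv format: "<image path>,<label>", no header row.
sample_csv = io.StringIO(
    "MURA-v1.0/train/XR_WRIST/patient00001/study1_positive/image1.png,1\n"
    "MURA-v1.0/train/XR_WRIST/patient00001/study1_positive/image2.png,1\n"
    "MURA-v1.0/train/XR_ELBOW/patient00002/study1_negative/image1.png,0\n"
)

# Same call the EDA notebook uses on the real file.
train_df = pd.read_csv(sample_csv, names=['Path', 'Label'])

# The study type is the third path component (MURA-v1.0/<set>/<study type>/...),
# mirroring the notebook's img_path.split('/')[2].
train_df['StudyType'] = train_df['Path'].str.split('/').str[2]
print(train_df)
```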
10 | 11 | Sometimes, these radiographic images are also referred to as `views`. 12 | 13 | ## Components of `train` and `valid` set 14 | 15 | * The `train` set consists of seven `study types`, namely: 16 | 17 | `XR_ELBOW` `XR_FINGER` `XR_FOREARM` `XR_HAND` `XR_HUMERUS` `XR_SHOULDER` `XR_WRIST` 18 | 19 | * Each `study type` contains several folders named like: 20 | 21 | `patient12104` `patient12110` `patient12116` `patient12122` `patient12128` ... 22 | 23 | * These folders are named after patient IDs; each of these folders contains one or more `study`, named like: 24 | 25 | `study1_negative` `study2_negative` `study3_positive` ...
26 | 27 | * Each of these `study`s contains one or more radiographs (views or images), named like: 28 | 29 | `image1.png` `image2.png` ... 30 | 31 | * Each view (image) is RGB with pixel range [0, 255] and varies in dimensions. 32 | 33 | **NOTE**: all of the above points hold for the `test` set as well, except the third: there the `study` folders are named like `study1` `study2` ... 34 | 35 | ## Some insightful plots 36 | 37 | ### Plot of number of Patients vs `study type` 38 | 39 | ![Patient count per study type](images/pcpst.jpg) 40 | 41 | In the `train` set `XR_WRIST` has the maximum number of patients, followed by `XR_FINGER`, `XR_HUMERUS`, `XR_SHOULDER`, `XR_HAND`, `XR_ELBOW` and `XR_FOREARM`; `XR_FOREARM`, with 606 patients, has the fewest. A similar pattern can be seen in the `valid` set: `XR_WRIST` has the maximum, followed by `XR_FINGER`, `XR_SHOULDER`, `XR_HUMERUS`, `XR_HAND`, `XR_ELBOW`, `XR_FOREARM`. 42 | 43 | ### Plot of number of patients vs study count 44 | 45 | Patients of a `study type` might have multiple `study`s; for example, a patient may have 3 `study`s for the wrist, independent of each other.
46 | The following plot shows the variation of the number of patients with the number of `study`s. 47 | 48 | **NOTE**: study count = number of studies, so if 4 patients have study count 3, that means those 4 patients have each undergone 3 `study`s for a given `study type`. 49 | 50 | ![Number of patients vs study count](images/pcpsc.jpg) 51 | 52 | 53 | Patients of `XR_FOREARM` and `XR_HUMERUS` `study type`s have either 1 or 2 `study`s only. 54 | Patients of `XR_FINGER`, `XR_HAND` and `XR_ELBOW` have up to 3 `study`s. 55 | Patients of `XR_SHOULDER` and `XR_WRIST` have up to 4 `study`s. 56 | 57 | ### Plot of number of `study`s vs number of views 58 | 59 | Each `study` may have one or more views; the following plot shows the variation of the number of views per study in the train dataset. 60 | 61 | ![Number of studies vs view count](images/nsvc.jpg) 62 | 63 | The maximum number of views per study can be found in `XR_SHOULDER`: there is a study in it which has as many as 13 images (views); similarly, `XR_HUMERUS` has a study with 10 images. It can be seen that most of the `study`s have either 2, 3 or 4 images. 64 | -------------------------------------------------------------------------------- /EDA/eda.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Exploratory Data Analysis" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import os\n", 19 | "import time\n", 20 | "import torch\n", 21 | "from torch.utils.data import DataLoader, Dataset\n", 22 | "from torchvision.utils import make_grid\n", 23 | "from torchvision import transforms\n", 24 | "from collections import defaultdict\n", 25 | "from torchvision.datasets.folder import pil_loader\n", 26 | "import random\n", 27 | "import numpy as np\n", 28 | "import pandas as pd\n", 29 | "import matplotlib.pyplot as plt\n", 30 | "import pylab\n", 31 | "from skimage import io, transform\n", 32 | "\n", 33 | 
"pd.set_option('max_colwidth', 800)\n", 34 | "\n", 35 | "%matplotlib inline" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "## Load Data" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "We have seven categories of musculoskeletal radiographs" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": { 56 | "collapsed": true 57 | }, 58 | "outputs": [], 59 | "source": [ 60 | "train_df = pd.read_csv('../MURA-v1.0/train.csv', names=['Path', 'Label'])\n", 61 | "valid_df = pd.read_csv('../MURA-v1.0/valid.csv', names=['Path', 'Label'])" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "Let's check out the shapes of the dataframes" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "train_df.shape, valid_df.shape" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "We have 37111 radiographs for training and 3225 radiographs for the validation set; let's peek into the dataframes" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "train_df.head(3)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "valid_df.head(3)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "So, we have radiograph paths and their corresponding labels; each radiograph has a label of 0 (normal) or 1 (abnormal)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "## Analysis" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "According to the paper:
\n", 124 | "1.\n", 125 | "\n", 126 | "    The MURA abnormality detection task is a binary classification task, where the input is an upper \n", 127 | "    extremity radiograph study — with each study containing one or more views (images) — and the \n", 128 | "    expected output is a binary label y ∈ {0, 1} indicating whether the \"study\" is normal or abnormal, \n", 129 | "    respectively.\n", 130 | "2.\n", 131 | "\n", 132 | "    The model takes as input one or more views for a study of an upper extremity. On each view, our 169-\n", 133 | "    layer convolutional neural network predicts the probability of abnormality. We compute the overall \n", 134 | "    probability of abnormality for the study by taking the arithmetic mean of the abnormality \n", 135 | "    probabilities output by the network for each image. The model makes the binary prediction of \n", 136 | "    abnormal if the probability of abnormality for the study is greater than 0.5." 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "So, we have to make predictions at the study level, taking into account the predictions of all the views (images) of the study. This can be done by taking the arithmetic mean of the predictions for all the views (images) under a particular study."
144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "train_df.head(30)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "Analyzing this dataframe, we can see that images are annotated based on whether their corresponding study is positive (abnormal, 1) or negative (normal, 0)" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "### Plot some random radiographs from training and validation set" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": { 173 | "collapsed": true 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "train_mat = train_df.values\n", 178 | "valid_mat = valid_df.values" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "ix = np.random.randint(0, len(train_mat)) # randomly select an index\n", 188 | "img_path = train_mat[ix][0]\n", 189 | "plt.imshow(io.imread(img_path), cmap='binary')\n", 190 | "cat = img_path.split('/')[2] # get the radiograph category\n", 191 | "plt.title('Category: %s & Label: %d ' %(cat, train_mat[ix][1]))\n", 192 | "plt.show()" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": { 199 | "collapsed": true 200 | }, 201 | "outputs": [], 202 | "source": [ 203 | "ix = np.random.randint(0, len(valid_mat))\n", 204 | "img_path = valid_mat[ix][0]\n", 205 | "plt.imshow(io.imread(img_path), cmap='binary')\n", 206 | "cat = img_path.split('/')[2]\n", 207 | "plt.title('Category: %s & Label: %d ' %(cat, valid_mat[ix][1]))\n", 208 | "plt.show()" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "It can be seen that the images vary in resolution and dimensions" 216 | ] 217 | }, 218 | { 219 | 
"cell_type": "code", 220 | "execution_count": null, 221 | "metadata": { 222 | "collapsed": true 223 | }, 224 | "outputs": [], 225 | "source": [ 226 | "# look at the pixel values\n", 227 | "io.imread(img_path)[0]" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "### Data Exploration" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "metadata": { 241 | "collapsed": true 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "!ls ../MURA-v1.0/train/" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": { 252 | "collapsed": true 253 | }, 254 | "outputs": [], 255 | "source": [ 256 | "!ls ../MURA-v1.0/train/XR_ELBOW/" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "So, the train dataset has seven study types; each study type has studies on patients stored in folders named like patient001, patient002, etc." 
264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "#### Patient count per study type" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "Let's count the number of patients in each study type" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": null, 283 | "metadata": { 284 | "collapsed": true 285 | }, 286 | "outputs": [], 287 | "source": [ 288 | "data_cat = ['train', 'valid']\n", 289 | "study_types = list(os.walk('../MURA-v1.0/train/'))[0][1] # study types, same for train and valid sets\n", 290 | "patients_count = {} # to store patient counts for each study type, for train and valid sets\n", 291 | "for phase in data_cat:\n", 292 | "    patients_count[phase] = {}\n", 293 | "    for study_type in study_types:\n", 294 | "        patients = list(os.walk('../MURA-v1.0/%s/%s' %(phase, study_type)))[0][1] # patient folder names\n", 295 | "        patients_count[phase][study_type] = len(patients)" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": null, 301 | "metadata": {}, 302 | "outputs": [], 303 | "source": [ 304 | "print(study_types)\n", 305 | "print()\n", 306 | "print(patients_count)" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": null, 312 | "metadata": {}, 313 | "outputs": [], 314 | "source": [ 315 | "# plot the patient counts per study type \n", 316 | "\n", 317 | "fig, ax = plt.subplots(figsize=(10, 5))\n", 318 | "for i, phase in enumerate(data_cat):\n", 319 | "    counts = patients_count[phase].values()\n", 320 | "    m = max(counts)\n", 321 | "    for i, v in enumerate(counts):\n", 322 | "        if v==m: ax.text(i-0.1, v+3, str(v))\n", 323 | "        else: ax.text(i-0.1, v + 20, str(v))\n", 324 | "    x_pos = np.arange(len(study_types))\n", 325 | "    plt.bar(x_pos, counts, alpha=0.5)\n", 326 | "    plt.xticks(x_pos, study_types)\n", 327 | "\n", 328 | "plt.xlabel('Study types')\n", 329 | "plt.ylabel('Number of patients')\n", 330 | 
"plt.legend(['train', 'valid'])\n", 331 | "plt.show()\n", 332 | "fig.savefig('images/pcpst.jpg', bbox_inches='tight', pad_inches=0) # name=patient count per study type" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "XR_FINGER has the most patients (1867 in the train set, 166 in the valid set), followed by XR_WRIST" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "### Study count" 347 | ] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "metadata": {}, 352 | "source": [ 353 | "Patients might have multiple studies for a given study type; for example, a patient may have two studies for the wrist, independent of each other.
Let's have a look at such cases. **NOTE**: here study count = number of studies per patient, and for each study count we tally the number of patients having that many studies" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": null, 359 | "metadata": { 360 | "collapsed": true 361 | }, 362 | "outputs": [], 363 | "source": [ 364 | "# let's find out number of studies per study_type\n", 365 | "study_count = {} # maps each study type to its tally of patients per study count \n", 366 | "for study_type in study_types:\n", 367 | "    BASE_DIR = '../MURA-v1.0/train/%s/' % study_type\n", 368 | "    study_count[study_type] = defaultdict(lambda:0) # to store study count for current study_type, initialized to 0 by default\n", 369 | "    patients = list(os.walk(BASE_DIR))[0][1] # patient folder names\n", 370 | "    for patient in patients:\n", 371 | "        studies = os.listdir(BASE_DIR + patient)\n", 372 | "        study_count[study_type][len(studies)] += 1" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "study_count" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "XR_WRIST has 3111 patients who have only a single study; similarly, 158 patients have 2 studies, 12 patients have 3 studies and 4 patients have 4 studies.
Let's plot this data" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": null, 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "# plot the study count vs number of patients per study type data \n", 398 | "fig = plt.figure(figsize=(8, 25))\n", 399 | "for i, study_type in enumerate(study_count):\n", 400 | "    ax = fig.add_subplot(7, 1, i+1)\n", 401 | "    study = study_count[study_type]\n", 402 | "    # text in the plot\n", 403 | "    m = max(study.values())\n", 404 | "    for i, v in enumerate(study.values()):\n", 405 | "        if v==m: ax.text(i-0.1, v - 200, str(v))\n", 406 | "        else: ax.text(i-0.1, v + 200, str(v))\n", 407 | "    ax.text(i, m - 100, study_type, color='green')\n", 408 | "    # plot the bar chart\n", 409 | "    x_pos = np.arange(len(study))\n", 410 | "    plt.bar(x_pos, study.values(), align='center', alpha=0.5)\n", 411 | "    plt.xticks(x_pos, study.keys())\n", 412 | "    plt.xlabel('Study count')\n", 413 | "    plt.ylabel('Number of patients')\n", 414 | "plt.show()\n", 415 | "fig.savefig('images/pcpsc.jpg', bbox_inches='tight', pad_inches=0)" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "### Number of views per study" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "metadata": {}, 428 | "source": [ 429 | "It can be seen that each study may have more than one view (radiograph image); let's have a look" 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": null, 435 | "metadata": { 436 | "collapsed": true 437 | }, 438 | "outputs": [], 439 | "source": [ 440 | "# let's find out the number of views per study for each study_type\n", 441 | "view_count = {} # maps each study type to its tally of studies per view count \n", 442 | "for study_type in study_types:\n", 443 | "    BASE_DIR = '../MURA-v1.0/train/%s/' % study_type\n", 444 | "    view_count[study_type] = defaultdict(lambda:0) # to store view count for current study_type, 
initialized to 0 by default\n", 445 | "    patients = list(os.walk(BASE_DIR))[0][1] # patient folder names\n", 446 | "    for patient in patients:\n", 447 | "        studies = os.listdir(BASE_DIR + patient)\n", 448 | "        for study in studies:\n", 449 | "            views = os.listdir(BASE_DIR + patient + '/' + study)\n", 450 | "            view_count[study_type][len(views)] += 1" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": null, 456 | "metadata": {}, 457 | "outputs": [], 458 | "source": [ 459 | "view_count" 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "metadata": {}, 465 | "source": [ 466 | "`XR_SHOULDER` has as many as 13 views in some studies, while `XR_HAND` has at most 5; this makes it challenging to predict on a study taking into account all of its views while keeping the batch size of 8 (as mentioned in the MURA paper)" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": null, 472 | "metadata": {}, 473 | "outputs": [], 474 | "source": [ 475 | "# plot the view count vs number of studies per study type data \n", 476 | "fig = plt.figure(figsize=(10, 30))\n", 477 | "for i, view_type in enumerate(view_count):\n", 478 | "    ax = fig.add_subplot(7, 1, i+1)\n", 479 | "    view = view_count[view_type]\n", 480 | "    # text in the plot\n", 481 | "    m = max(view.values())\n", 482 | "    for i, v in enumerate(view.values()):\n", 483 | "        if v==m: ax.text(i-0.1, v - 200, str(v))\n", 484 | "        else: ax.text(i-0.1, v + 80, str(v))\n", 485 | "    ax.text(i - 0.5, m - 80, view_type, color='green')\n", 486 | "    # plot the bar chart\n", 487 | "    x_pos = np.arange(len(view))\n", 488 | "    plt.bar(x_pos, view.values(), align='center', alpha=0.5)\n", 489 | "    plt.xticks(x_pos, view.keys())\n", 490 | "    plt.xlabel('Number of views')\n", 491 | "    plt.ylabel('Number of studies')\n", 492 | "plt.show()\n", 493 | "fig.savefig('images/nsvc.jpg', bbox_inches='tight', pad_inches=0) # name=number of studies view count" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 
498 | "metadata": {}, 499 | "source": [ 500 | "Most of the studies contain 2, 3 or 4 views" 501 | ] 502 | } 503 | ], 504 | "metadata": { 505 | "kernelspec": { 506 | "display_name": "Python [default]", 507 | "language": "python", 508 | "name": "python3" 509 | }, 510 | "language_info": { 511 | "codemirror_mode": { 512 | "name": "ipython", 513 | "version": 3 514 | }, 515 | "file_extension": ".py", 516 | "mimetype": "text/x-python", 517 | "name": "python", 518 | "nbconvert_exporter": "python", 519 | "pygments_lexer": "ipython3", 520 | "version": "3.5.3" 521 | } 522 | }, 523 | "nbformat": 4, 524 | "nbformat_minor": 2 525 | } 526 | -------------------------------------------------------------------------------- /EDA/images/nsvc.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pyaf/DenseNet-MURA-PyTorch/04f7c5ab4c0abca7a8e55042626362fd2c9a09ad/EDA/images/nsvc.jpg -------------------------------------------------------------------------------- /EDA/images/pcpsc.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pyaf/DenseNet-MURA-PyTorch/04f7c5ab4c0abca7a8e55042626362fd2c9a09ad/EDA/images/pcpsc.jpg -------------------------------------------------------------------------------- /EDA/images/pcpst.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pyaf/DenseNet-MURA-PyTorch/04f7c5ab4c0abca7a8e55042626362fd2c9a09ad/EDA/images/pcpst.jpg -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Rishabh Agrahari 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, 
including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DenseNet on MURA Dataset using PyTorch 2 | 3 | A PyTorch implementation of the 169-layer [DenseNet](https://arxiv.org/abs/1608.06993) model on the MURA dataset, inspired by the paper [arXiv:1712.06957v3](https://arxiv.org/abs/1712.06957) by Pranav Rajpurkar et al. MURA is a large dataset of musculoskeletal radiographs, where each study is manually labeled by radiologists as either normal or abnormal. [Know more](https://stanfordmlgroup.github.io/projects/mura/) 4 | 5 | ## Important Points: 6 | * The implemented model is a 169-layer DenseNet with a single-node output layer, initialized with weights from a model pretrained on the ImageNet dataset. 7 | * Before feeding the images to the network, each image is normalized to have the same mean and standard deviation as the images in the ImageNet training set, scaled to 224 x 224 and augmented with random lateral inversions and rotations. 
8 | * The model uses a modified binary cross-entropy loss function, as mentioned in the paper. 9 | * The learning rate decays by a factor of 10 every time the validation loss plateaus after an epoch. 10 | * The optimization algorithm is Adam with default parameters β1 = 0.9 and β2 = 0.999. 11 | 12 | According to the MURA dataset paper: 13 | 14 | > The model takes as input one or more views for a study of an upper extremity. On each view, our 169-layer convolutional neural network predicts the probability of abnormality. We compute the overall probability of abnormality for the study by taking the arithmetic mean of the abnormality probabilities output by the network for each image. 15 | 16 | The model implemented in this repository takes as input all the views for a study of an upper extremity. On each view the model predicts the probability of abnormality. The model computes the overall probability of abnormality for the study by taking the arithmetic mean of the abnormality probabilities output by the network for each image. 17 | 18 | ## Instructions 19 | 20 | Install dependencies: 21 | * PyTorch 22 | * TorchVision 23 | * Numpy 24 | * Pandas 25 | 26 | Train the model with `python main.py` 27 | 28 | ## Citation 29 | @ARTICLE{2017arXiv171206957R, 30 | author = {{Rajpurkar}, P. and {Irvin}, J. and {Bagul}, A. and {Ding}, D. and 31 | {Duan}, T. and {Mehta}, H. and {Yang}, B. and {Zhu}, K. and 32 | {Laird}, D. and {Ball}, R.~L. and {Langlotz}, C. and {Shpanskaya}, K. and 33 | {Lungren}, M.~P. 
and {Ng}, A.}, 34 | title = "{MURA Dataset: Towards Radiologist-Level Abnormality Detection in Musculoskeletal Radiographs}", 35 | journal = {ArXiv e-prints}, 36 | archivePrefix = "arXiv", 37 | eprint = {1712.06957}, 38 | primaryClass = "physics.med-ph", 39 | keywords = {Physics - Medical Physics, Computer Science - Artificial Intelligence}, 40 | year = 2017, 41 | month = dec, 42 | adsurl = {http://adsabs.harvard.edu/abs/2017arXiv171206957R}, 43 | adsnote = {Provided by the SAO/NASA Astrophysics Data System} 44 | } 45 | -------------------------------------------------------------------------------- /densenet.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import torch.utils.model_zoo as model_zoo 5 | from collections import OrderedDict 6 | 7 | __all__ = ['DenseNet', 'densenet169'] 8 | 9 | 10 | model_urls = { 11 | 'densenet169': 'https://download.pytorch.org/models/densenet169-b2777c0a.pth', 12 | } 13 | 14 | def densenet169(pretrained=False, **kwargs): 15 | r"""Densenet-169 model from 16 | `"Densely Connected Convolutional Networks" `_ 17 | 18 | Args: 19 | pretrained (bool): If True, returns a model pre-trained on ImageNet 20 | """ 21 | model = DenseNet(num_init_features=64, growth_rate=32, block_config=(6, 12, 32, 32), 22 | **kwargs) 23 | if pretrained: 24 | model.load_state_dict(model_zoo.load_url(model_urls['densenet169']), strict=False) 25 | return model 26 | 27 | class _DenseLayer(nn.Sequential): 28 | def __init__(self, num_input_features, growth_rate, bn_size, drop_rate): 29 | super(_DenseLayer, self).__init__() 30 | self.add_module('norm1', nn.BatchNorm2d(num_input_features)), 31 | self.add_module('relu1', nn.ReLU(inplace=True)), 32 | self.add_module('conv1', nn.Conv2d(num_input_features, bn_size * 33 | growth_rate, kernel_size=1, stride=1, bias=False)), 34 | self.add_module('norm2', nn.BatchNorm2d(bn_size * growth_rate)), 35 | 
self.add_module('relu2', nn.ReLU(inplace=True)), 36 | self.add_module('conv2', nn.Conv2d(bn_size * growth_rate, growth_rate, 37 | kernel_size=3, stride=1, padding=1, bias=False)), 38 | self.drop_rate = drop_rate 39 | 40 | def forward(self, x): 41 | new_features = super(_DenseLayer, self).forward(x) 42 | if self.drop_rate > 0: 43 | new_features = F.dropout(new_features, p=self.drop_rate, training=self.training) 44 | return torch.cat([x, new_features], 1) 45 | 46 | 47 | class _DenseBlock(nn.Sequential): 48 | def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate): 49 | super(_DenseBlock, self).__init__() 50 | for i in range(num_layers): 51 | layer = _DenseLayer(num_input_features + i * growth_rate, growth_rate, bn_size, drop_rate) 52 | self.add_module('denselayer%d' % (i + 1), layer) 53 | 54 | 55 | class _Transition(nn.Sequential): 56 | def __init__(self, num_input_features, num_output_features): 57 | super(_Transition, self).__init__() 58 | self.add_module('norm', nn.BatchNorm2d(num_input_features)) 59 | self.add_module('relu', nn.ReLU(inplace=True)) 60 | self.add_module('conv', nn.Conv2d(num_input_features, num_output_features, 61 | kernel_size=1, stride=1, bias=False)) 62 | self.add_module('pool', nn.AvgPool2d(kernel_size=2, stride=2)) 63 | 64 | 65 | class DenseNet(nn.Module): 66 | r"""Densenet-BC model class, based on 67 | `"Densely Connected Convolutional Networks" `_ 68 | 69 | Args: 70 | growth_rate (int) - how many filters to add each layer (`k` in paper) 71 | block_config (list of 4 ints) - how many layers in each pooling block 72 | num_init_features (int) - the number of filters to learn in the first convolution layer 73 | bn_size (int) - multiplicative factor for number of bottle neck layers 74 | (i.e. 
bn_size * k features in the bottleneck layer) 75 | drop_rate (float) - dropout rate after each dense layer 76 | num_classes (int) - number of classification classes 77 | """ 78 | def __init__(self, growth_rate=32, block_config=(6, 12, 24, 16), 79 | num_init_features=64, bn_size=4, drop_rate=0, num_classes=1000): 80 | 81 | super(DenseNet, self).__init__() 82 | 83 | # First convolution 84 | self.features = nn.Sequential(OrderedDict([ 85 | ('conv0', nn.Conv2d(3, num_init_features, kernel_size=7, stride=2, padding=3, bias=False)), 86 | ('norm0', nn.BatchNorm2d(num_init_features)), 87 | ('relu0', nn.ReLU(inplace=True)), 88 | ('pool0', nn.MaxPool2d(kernel_size=3, stride=2, padding=1)), 89 | ])) 90 | 91 | # Each denseblock 92 | num_features = num_init_features 93 | for i, num_layers in enumerate(block_config): 94 | block = _DenseBlock(num_layers=num_layers, num_input_features=num_features, 95 | bn_size=bn_size, growth_rate=growth_rate, drop_rate=drop_rate) 96 | self.features.add_module('denseblock%d' % (i + 1), block) 97 | num_features = num_features + num_layers * growth_rate 98 | if i != len(block_config) - 1: 99 | trans = _Transition(num_input_features=num_features, num_output_features=num_features // 2) 100 | self.features.add_module('transition%d' % (i + 1), trans) 101 | num_features = num_features // 2 102 | 103 | # Final batch norm 104 | self.features.add_module('norm5', nn.BatchNorm2d(num_features)) 105 | 106 | # Linear layer 107 | # self.classifier = nn.Linear(num_features, 1000) 108 | # self.fc = nn.Linear(1000, 1) 109 | 110 | self.fc = nn.Linear(num_features, 1) 111 | 112 | # Official init from torch repo. 
113 | for m in self.modules(): 114 | if isinstance(m, nn.Conv2d): 115 | nn.init.kaiming_normal_(m.weight.data) 116 | elif isinstance(m, nn.BatchNorm2d): 117 | m.weight.data.fill_(1) 118 | m.bias.data.zero_() 119 | elif isinstance(m, nn.Linear): 120 | m.bias.data.zero_() 121 | 122 | def forward(self, x): 123 | features = self.features(x) 124 | out = F.relu(features, inplace=True) 125 | out = F.avg_pool2d(out, kernel_size=7, stride=1).view(features.size(0), -1) 126 | # out = F.relu(self.classifier(out)) 127 | out = torch.sigmoid(self.fc(out)) 128 | return out 129 | -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import time 2 | import copy 3 | import pandas as pd 4 | import torch 5 | from torch.autograd import Variable 6 | from densenet import densenet169 7 | from utils import plot_training, n_p, get_count 8 | from train import train_model, get_metrics 9 | from pipeline import get_study_level_data, get_dataloaders 10 | 11 | # #### load study level dict data 12 | study_data = get_study_level_data(study_type='XR_WRIST') 13 | 14 | # #### Create dataloaders pipeline 15 | data_cat = ['train', 'valid'] # data categories 16 | dataloaders = get_dataloaders(study_data, batch_size=1) 17 | dataset_sizes = {x: len(study_data[x]) for x in data_cat} 18 | 19 | # #### Build model 20 | # tai = total abnormal images, tni = total normal images 21 | tai = {x: get_count(study_data[x], 'positive') for x in data_cat} 22 | tni = {x: get_count(study_data[x], 'negative') for x in data_cat} 23 | Wt1 = {x: n_p(tni[x] / (tni[x] + tai[x])) for x in data_cat} 24 | Wt0 = {x: n_p(tai[x] / (tni[x] + tai[x])) for x in data_cat} 25 | 26 | print('tai:', tai) 27 | print('tni:', tni, '\n') 28 | print('Wt0 train:', Wt0['train']) 29 | print('Wt0 valid:', Wt0['valid']) 30 | print('Wt1 train:', Wt1['train']) 31 | print('Wt1 valid:', Wt1['valid']) 32 | 33 | class 
Loss(torch.nn.modules.Module): 34 | def __init__(self, Wt1, Wt0): 35 | super(Loss, self).__init__() 36 | self.Wt1 = Wt1 37 | self.Wt0 = Wt0 38 | 39 | def forward(self, inputs, targets, phase): 40 | loss = - (self.Wt1[phase] * targets * inputs.log() + self.Wt0[phase] * (1 - targets) * (1 - inputs).log()) 41 | return loss 42 | 43 | model = densenet169(pretrained=True) 44 | model = model.cuda() 45 | 46 | criterion = Loss(Wt1, Wt0) 47 | optimizer = torch.optim.Adam(model.parameters(), lr=0.0001) 48 | scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=1, verbose=True) 49 | 50 | # #### Train model 51 | model = train_model(model, criterion, optimizer, dataloaders, scheduler, dataset_sizes, num_epochs=5) 52 | 53 | torch.save(model.state_dict(), 'models/model.pth') 54 | 55 | get_metrics(model, criterion, dataloaders, dataset_sizes) -------------------------------------------------------------------------------- /pipeline.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pandas as pd 3 | from tqdm import tqdm 4 | import torch 5 | from torchvision import transforms 6 | from torch.utils.data import DataLoader, Dataset 7 | from torchvision.datasets.folder import pil_loader 8 | 9 | data_cat = ['train', 'valid'] # data categories 10 | 11 | def get_study_level_data(study_type): 12 | """ 13 | Returns a dict, with keys 'train' and 'valid' and respective values as study level dataframes, 14 | these dataframes contain three columns 'Path', 'Count', 'Label' 15 | Args: 16 | study_type (string): one of the seven study type folder names in 'train/valid/test' dataset 17 | """ 18 | study_data = {} 19 | study_label = {'positive': 1, 'negative': 0} 20 | for phase in data_cat: 21 | BASE_DIR = 'MURA-v1.0/%s/%s/' % (phase, study_type) 22 | patients = list(os.walk(BASE_DIR))[0][1] # list of patient folder names 23 | study_data[phase] = pd.DataFrame(columns=['Path', 'Count', 'Label']) 24 | i = 0 25 | 
for patient in tqdm(patients): # for each patient folder 26 | for study in os.listdir(BASE_DIR + patient): # for each study in that patient folder 27 | label = study_label[study.split('_')[1]] # get label 0 or 1 28 | path = BASE_DIR + patient + '/' + study + '/' # path to this study 29 | study_data[phase].loc[i] = [path, len(os.listdir(path)), label] # add new row 30 | i+=1 31 | return study_data 32 | 33 | class ImageDataset(Dataset): 34 | """training dataset.""" 35 | 36 | def __init__(self, df, transform=None): 37 | """ 38 | Args: 39 | df (pd.DataFrame): a pandas DataFrame with image path and labels. 40 | transform (callable, optional): Optional transform to be applied 41 | on a sample. 42 | """ 43 | self.df = df 44 | self.transform = transform 45 | 46 | def __len__(self): 47 | return len(self.df) 48 | 49 | def __getitem__(self, idx): 50 | study_path = self.df.iloc[idx, 0] 51 | count = self.df.iloc[idx, 1] 52 | images = [] 53 | for i in range(count): 54 | image = pil_loader(study_path + 'image%s.png' % (i+1)) 55 | images.append(self.transform(image)) 56 | images = torch.stack(images) 57 | label = self.df.iloc[idx, 2] 58 | sample = {'images': images, 'label': label} 59 | return sample 60 | 61 | def get_dataloaders(data, batch_size=8, study_level=False): 62 | ''' 63 | Returns dataloader pipeline with data augmentation 64 | ''' 65 | data_transforms = { 66 | 'train': transforms.Compose([ 67 | transforms.Resize((224, 224)), 68 | transforms.RandomHorizontalFlip(), 69 | transforms.RandomRotation(10), 70 | transforms.ToTensor(), 71 | transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) 72 | ]), 73 | 'valid': transforms.Compose([ 74 | transforms.Resize((224, 224)), 75 | transforms.ToTensor(), 76 | transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) 77 | ]), 78 | } 79 | image_datasets = {x: ImageDataset(data[x], transform=data_transforms[x]) for x in data_cat} 80 | dataloaders = {x: DataLoader(image_datasets[x], batch_size=batch_size, 
shuffle=True, num_workers=4) for x in data_cat} 81 | return dataloaders 82 | 83 | if __name__=='__main__': 84 | pass 85 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | import time 2 | import copy 3 | import torch 4 | from torchnet import meter 5 | from torch.autograd import Variable 6 | from utils import plot_training 7 | 8 | data_cat = ['train', 'valid'] # data categories 9 | 10 | def train_model(model, criterion, optimizer, dataloaders, scheduler, 11 | dataset_sizes, num_epochs): 12 | since = time.time() 13 | best_model_wts = copy.deepcopy(model.state_dict()) 14 | best_acc = 0.0 15 | costs = {x:[] for x in data_cat} # for storing costs per epoch 16 | accs = {x:[] for x in data_cat} # for storing accuracies per epoch 17 | print('Train batches:', len(dataloaders['train'])) 18 | print('Valid batches:', len(dataloaders['valid']), '\n') 19 | for epoch in range(num_epochs): 20 | confusion_matrix = {x: meter.ConfusionMeter(2, normalized=True) 21 | for x in data_cat} 22 | print('Epoch {}/{}'.format(epoch+1, num_epochs)) 23 | print('-' * 10) 24 | # Each epoch has a training and validation phase 25 | for phase in data_cat: 26 | model.train(phase=='train') 27 | running_loss = 0.0 28 | running_corrects = 0 29 | # Iterate over data.
30 | for i, data in enumerate(dataloaders[phase]): 31 | # get the inputs 32 | print(i, end='\r') 33 | inputs = data['images'][0] 34 | labels = data['label'].type(torch.FloatTensor) 35 | # wrap them in Variable 36 | inputs = Variable(inputs.cuda()) 37 | labels = Variable(labels.cuda()) 38 | # zero the parameter gradients 39 | optimizer.zero_grad() 40 | # forward 41 | outputs = model(inputs) 42 | outputs = torch.mean(outputs) 43 | loss = criterion(outputs, labels, phase) 44 | running_loss += loss.data[0] 45 | # backward + optimize only if in training phase 46 | if phase == 'train': 47 | loss.backward() 48 | optimizer.step() 49 | # statistics 50 | preds = (outputs.data > 0.5).type(torch.cuda.FloatTensor) 51 | running_corrects += torch.sum(preds == labels.data) 52 | confusion_matrix[phase].add(preds, labels.data) 53 | epoch_loss = running_loss / dataset_sizes[phase] 54 | epoch_acc = running_corrects / dataset_sizes[phase] 55 | costs[phase].append(epoch_loss) 56 | accs[phase].append(epoch_acc) 57 | print('{} Loss: {:.4f} Acc: {:.4f}'.format( 58 | phase, epoch_loss, epoch_acc)) 59 | print('Confusion Meter:\n', confusion_matrix[phase].value()) 60 | # deep copy the model 61 | if phase == 'valid': 62 | scheduler.step(epoch_loss) 63 | if epoch_acc > best_acc: 64 | best_acc = epoch_acc 65 | best_model_wts = copy.deepcopy(model.state_dict()) 66 | time_elapsed = time.time() - since 67 | print('Time elapsed: {:.0f}m {:.0f}s'.format( 68 | time_elapsed // 60, time_elapsed % 60)) 69 | print() 70 | time_elapsed = time.time() - since 71 | print('Training complete in {:.0f}m {:.0f}s'.format( 72 | time_elapsed // 60, time_elapsed % 60)) 73 | print('Best valid Acc: {:4f}'.format(best_acc)) 74 | plot_training(costs, accs) 75 | # load best model weights 76 | model.load_state_dict(best_model_wts) 77 | return model 78 | 79 | 80 | def get_metrics(model, criterion, dataloaders, dataset_sizes, phase='valid'): 81 | ''' 82 | Loops over phase (train or valid) set to determine acc, loss and 83 | 
confusion meter of the model. 84 | ''' 85 | confusion_matrix = meter.ConfusionMeter(2, normalized=True) 86 | running_loss = 0.0 87 | running_corrects = 0 88 | for i, data in enumerate(dataloaders[phase]): 89 | print(i, end='\r') 90 | labels = data['label'].type(torch.FloatTensor) 91 | inputs = data['images'][0] 92 | # wrap them in Variable 93 | inputs = Variable(inputs.cuda()) 94 | labels = Variable(labels.cuda()) 95 | # forward 96 | outputs = model(inputs) 97 | outputs = torch.mean(outputs) 98 | loss = criterion(outputs, labels, phase) 99 | # statistics 100 | running_loss += loss.data[0] * inputs.size(0) 101 | preds = (outputs.data > 0.5).type(torch.cuda.FloatTensor) 102 | running_corrects += torch.sum(preds == labels.data) 103 | confusion_matrix.add(preds, labels.data) 104 | 105 | loss = running_loss / dataset_sizes[phase] 106 | acc = running_corrects / dataset_sizes[phase] 107 | print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, loss, acc)) 108 | print('Confusion Meter:\n', confusion_matrix.value()) 109 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.autograd import Variable 3 | import matplotlib.pyplot as plt 4 | from torchnet import meter 5 | 6 | def plot_training(costs, accs): 7 | ''' 8 | Plots curve of Cost vs epochs and Accuracy vs epochs for 'train' and 'valid' sets during training 9 | ''' 10 | train_acc = accs['train'] 11 | valid_acc = accs['valid'] 12 | train_cost = costs['train'] 13 | valid_cost = costs['valid'] 14 | epochs = range(len(train_acc)) 15 | 16 | plt.figure(figsize=(10, 5)) 17 | 18 | plt.subplot(1, 2, 1,) 19 | plt.plot(epochs, train_acc) 20 | plt.plot(epochs, valid_acc) 21 | plt.legend(['train', 'valid'], loc='upper left') 22 | plt.title('Accuracy') 23 | 24 | plt.subplot(1, 2, 2) 25 | plt.plot(epochs, train_cost) 26 | plt.plot(epochs, valid_cost) 27 | plt.legend(['train', 'valid'], 
loc='upper left') 28 | plt.title('Cost') 29 | 30 | plt.show() 31 | 32 | def n_p(x): 33 | '''convert numpy float to Variable tensor float''' 34 | return Variable(torch.cuda.FloatTensor([x]), requires_grad=False) 35 | 36 | def get_count(df, cat): 37 | ''' 38 | Returns the number of images in a study type dataframe which are abnormal or normal 39 | Args: 40 | df -- dataframe 41 | cat -- category, "positive" for abnormal and "negative" for normal 42 | ''' 43 | return df[df['Path'].str.contains(cat)]['Count'].sum() 44 | 45 | 46 | if __name__=='__main__': 47 | pass --------------------------------------------------------------------------------
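The `Loss` class in main.py implements a weighted binary cross-entropy, with per-phase class weights `Wt1` and `Wt0` computed from the normal/abnormal image counts. A minimal stdlib-only sketch of the same formula, usable as a sanity check without a GPU or the dataset (the example weights and probabilities here are made up, not taken from the repo):

```python
import math

def weighted_bce(p, y, wt1, wt0):
    # Mirrors Loss.forward in main.py:
    # L = -(Wt1 * y * log(p) + Wt0 * (1 - y) * log(1 - p))
    return -(wt1 * y * math.log(p) + wt0 * (1 - y) * math.log(1 - p))

# With equal weights this reduces to plain binary cross-entropy:
# an uninformative p = 0.5 costs log(2) whatever the label is.
assert abs(weighted_bce(0.5, 1, 1.0, 1.0) - math.log(2)) < 1e-12

# With MURA-style weights (more normal than abnormal images, so Wt1 > Wt0),
# a missed abnormal study (y=1, low p) is penalised harder than an equally
# wrong normal study (y=0, high p).
wt1, wt0 = 0.6, 0.4  # hypothetical phase weights, Wt1 + Wt0 = 1
assert weighted_bce(0.2, 1, wt1, wt0) > weighted_bce(0.8, 0, wt1, wt0)
```

In main.py the weights come from the image counts per phase, `Wt1 = tni / (tni + tai)` and `Wt0 = tai / (tni + tai)`, so the rarer class receives the larger weight.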
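Both `train_model` and `get_metrics` collapse a study's per-image outputs into a single prediction via `torch.mean(outputs)` followed by a 0.5 threshold; since `batch_size=1` and each batch holds one study's stacked views, the mean pools evidence across views before the loss. A dependency-free sketch of that aggregation step (the function name and example probabilities are illustrative, not from the repo):

```python
def study_prediction(image_probs, threshold=0.5):
    # Mean of per-image abnormality probabilities, as in
    # `outputs = torch.mean(outputs)` in train.py, then thresholded
    # like `preds = (outputs.data > 0.5)`.
    mean_prob = sum(image_probs) / len(image_probs)
    return int(mean_prob > threshold)

# Two confident abnormal views outvote one borderline view: predicted abnormal.
assert study_prediction([0.9, 0.8, 0.4]) == 1
# A single low-probability view yields a normal prediction.
assert study_prediction([0.1]) == 0
```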