├── .gitignore
├── CONTRIBUTING.md
├── LICENSE.md
├── README.md
├── book.jpg
└── code
    ├── Chapter-10
    │   ├── README.md
    │   └── Testing_and_Remediating_Bias_constrained.ipynb
    ├── Chapter-11
    │   ├── Backdoor_testing.ipynb
    │   ├── Data_Poisoning.ipynb
    │   ├── README.md
    │   ├── Red_Teaming_an_XGBoost_model.ipynb
    │   └── Training_an_Overfit_and_a_Constrained_XGBoost_model.ipynb
    ├── Chapter-6
    │   ├── Constrained_XGB_and_Post_Hoc_Explanations.ipynb
    │   ├── GLM,GAM_and_EBM_code_example.ipynb
    │   └── README.md
    ├── Chapter-7 & 9
    │   ├── 1.Data Preparation.ipynb
    │   ├── 2.Transfer learning-Stage_1.ipynb
    │   ├── 3.Transfer learning-Stage_2.ipynb
    │   ├── 4.Post-Hoc Explanations.ipynb
    │   ├── 5.Adding Noise to images .ipynb
    │   ├── 6.Label_Randomization.ipynb
    │   ├── Adversarial Example Attacks.ipynb
    │   ├── README.md
    │   └── Retraining on Gaussian Noise.ipynb
    ├── Chapter-8
    │   ├── README.md
    │   ├── Residual_Analysis_for_XGBoost.ipynb
    │   ├── Selecting a Better XGBoost Model.ipynb
    │   ├── Sensitivity_Analysis_for_XGBoost_Adversarial_Example_Search.ipynb
    │   └── Stress_testing_XGBoost.ipynb
    └── Data.zip
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | *.ipynb
3 | *.jupyterlab-workspace
4 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Bug report
3 | about: Create a report to help us improve
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 | **Describe the bug**
11 | A clear and concise description of what the bug is.
12 |
13 | **To Reproduce**
14 | Steps to reproduce the behavior:
15 | 1. Go to '...'
16 | 2. Click on '....'
17 | 3. Scroll down to '....'
18 | 4. See error
19 |
20 | **Expected behavior**
21 | A clear and concise description of what you expected to happen.
22 |
23 | **Screenshots**
24 | If applicable, add screenshots to help explain your problem.
25 |
26 | **Desktop (please complete the following information):**
27 | - OS: [e.g. iOS]
28 | - Browser [e.g. chrome, safari]
29 | - Version [e.g. 22]
30 |
31 | **Additional context**
32 | Add any other context about the problem here.
33 |
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Machine Learning for High-Risk Applications Book
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Machine-Learning-for-High-Risk-Applications-Book
2 |
3 | This is a companion repository for the book [Machine Learning for High-Risk Applications](https://learning.oreilly.com/library/view/machine-learning-for/9781098102425/).
4 |
5 |
6 |
7 | 
8 |
9 | Buy on Amazon |
10 | Read on O'Reilly
11 |
12 |
13 | The past decade has witnessed the broad adoption of artificial intelligence and machine learning (AI/ML) technologies. However, a lack of oversight in their widespread implementation has resulted in some incidents and harmful outcomes that could have been avoided with proper risk management. Before we can realize AI/ML's true benefit, practitioners must understand how to mitigate its risks.
14 |
15 | This book describes approaches to responsible AI—a holistic framework for improving AI/ML technology, business processes, and cultural competencies that builds on best practices in risk management, cybersecurity, data privacy, and applied social science. Authors Patrick Hall, James Curtis, and Parul Pandey created this guide for data scientists who want to improve real-world AI/ML system outcomes for organizations, consumers, and the public.
16 |
17 | - Learn technical approaches for responsible AI across explainability, model validation and debugging, bias management, data privacy, and ML security
18 | - Learn how to create a successful and impactful AI risk management practice
19 | - Get a basic guide to existing standards, laws, and assessments for adopting AI technologies, including the new NIST AI Risk Management Framework
20 | - Engage with interactive resources on GitHub and Colab
21 |
22 | ## Code
23 |
24 | The code for this book can be found in the following sections:
25 |
26 |
27 | | Chapter | Code Notebooks |
28 | | ------- | -------------- |
29 | | 6 | [Explainable Boosting Machines and Explaining XGBoost](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/tree/main/code/Chapter-6) |
30 | | 7 | [Explaining a PyTorch Image Classifier](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/tree/main/code/Chapter-7%20%26%209) |
31 | | 8 | [Debugging XGBoost](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/tree/main/code/Chapter-8) |
32 | | 9 | [Debugging a PyTorch Image Classifier](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/tree/main/code/Chapter-7%20%26%209) |
33 | | 10 | [Testing and Remediating Bias with XGBoost](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/tree/main/code/Chapter-10) |
34 | | 11 | [Red-teaming XGBoost](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/tree/main/code/Chapter-11) |
35 |
36 |
37 |
38 |
39 |
40 |
41 |
--------------------------------------------------------------------------------
/book.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/0fbf704b8b2d84579ce9473cdecabf46747a7ac2/book.jpg
--------------------------------------------------------------------------------
/code/Chapter-10/README.md:
--------------------------------------------------------------------------------
1 | # Code for Chapter 10 - Testing and Remediating Bias with XGBoost
2 |
3 | This chapter focuses on technical implementations of bias testing and remediation approaches. We’ll start off by training XGBoost on a variant of the credit card data. We’ll then test for bias by checking for differences in performance and outcomes across demographic groups. We’ll additionally try and identify any bias concerns at the individual observation level. Once we confirm the existence of measurable levels of bias in our model predictions, we’ll start trying to fix, or remediate, that bias. We employ pre-, in- and post-processing remediation methods that attempt to fix the training data, model, and outcomes, respectively. We’ll finish off the chapter by conducting bias-aware model selection that leaves us with a model that is both performant and minimally biased.
4 |
5 | # Code
6 |
7 | 1. Testing and Remediating Bias in an XGBoost Credit Model [](https://githubtocolab.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-10/Testing_and_Remediating_Bias_constrained.ipynb) [](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-10/Testing_and_Remediating_Bias_constrained.ipynb)
8 |
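As an illustration of the group-level outcome testing described above, here is a minimal sketch of one common disparity measure, the adverse impact ratio (AIR): the favorable-outcome rate of a protected group divided by that of a reference group. The function names and the toy data are hypothetical and are not taken from the chapter's notebook.

```python
from collections import defaultdict


def acceptance_rates(groups, decisions):
    """Return the share of favorable decisions (1 = favorable) per group."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for group, decision in zip(groups, decisions):
        totals[group] += 1
        favorable[group] += decision
    return {g: favorable[g] / totals[g] for g in totals}


def adverse_impact_ratio(groups, decisions, protected, reference):
    """Protected group's acceptance rate divided by the reference group's."""
    rates = acceptance_rates(groups, decisions)
    return rates[protected] / rates[reference]


# Hypothetical toy data: group labels and binary model decisions.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
decisions = [1, 1, 1, 0, 1, 0, 0, 1]

air = adverse_impact_ratio(groups, decisions, protected="b", reference="a")
print(f"AIR (b vs. a): {air:.2f}")
```

A widely cited rule of thumb, the four-fifths rule, treats ratios below 0.8 as a signal worth investigating. A fuller analysis would also compare model performance across groups and check statistical significance.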
--------------------------------------------------------------------------------
/code/Chapter-11/README.md:
--------------------------------------------------------------------------------
1 | # Code for Chapter 11 - Red-teaming XGBoost
2 |
3 | In this chapter, we’ll show you how to hack your own models so that you can add red-teaming to your model debugging repertoire. The main idea of the chapter is that when you know what
4 | hackers will try to do to your model, then you can try it out first and devise effective defenses. We’ll start out with a concept refresher that reintroduces common ML
5 | attacks and countermeasures, then we’ll dive into examples of attacking an XGBoost classifier trained on structured data. We’ll then introduce two XGBoost models, one trained with the
6 | standard black-box approach, and one trained with constraints and a high degree of L2 regularization. We’ll use these two models to explain the attacks and to test
7 | whether transparency and L2 regularization are adequate countermeasures. After that, we’ll jump into attacks that are likely to be performed by external adversaries
8 | against a black-box ML API: model extraction and adversarial example attacks. From there, we’ll try out insider attacks that involve making deliberate changes to an ML
9 | modeling pipeline: data poisoning and model backdoors.
10 |
11 | # Code
12 |
13 | 1. Training an Overfit and a Constrained XGBoost model [](https://githubtocolab.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-11/Training_an_Overfit_and_a_Constrained_XGBoost_model.ipynb) [](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-11/Training_an_Overfit_and_a_Constrained_XGBoost_model.ipynb)
14 | 2. Red-Teaming an XGBoost Model [](https://githubtocolab.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-11/Red_Teaming_an_XGBoost_model.ipynb) [](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-11/Red_Teaming_an_XGBoost_model.ipynb)
15 | 3. Data Poisoning [](https://colab.research.google.com/drive/1tMxexHbgNoUaeTS179bXUA7BbvkbcSOe?usp=sharing) [](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-11/Data_Poisoning.ipynb)
16 | 4. Backdoor Testing [](https://githubtocolab.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-11/Backdoor_testing.ipynb) [](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-11/Backdoor_testing.ipynb)
17 |
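To make the idea of an adversarial example attack against a black-box API concrete, here is a minimal sketch of an exhaustive search over a small grid of feature values. It assumes the attacker can only call a scoring function and observe its output. `toy_api`, the feature names, and the search grid are hypothetical stand-ins, not the model or data used in the notebooks.

```python
from itertools import product


def adversarial_search(predict, row, search_space):
    """Try every combination of candidate feature values and return the
    perturbed row that moves the score furthest from the original score."""
    baseline = predict(row)
    best_row, best_shift = dict(row), 0.0
    features = list(search_space)
    for values in product(*(search_space[f] for f in features)):
        candidate = dict(row)  # copy so the original row is untouched
        candidate.update(zip(features, values))
        shift = abs(predict(candidate) - baseline)
        if shift > best_shift:
            best_row, best_shift = candidate, shift
    return best_row, best_shift


# Hypothetical black-box scoring function standing in for an ML API.
def toy_api(row):
    raw = 0.2 + 0.1 * row["late_payments"] - 0.000004 * row["credit_limit"]
    return min(1.0, max(0.0, raw))


row = {"late_payments": 1, "credit_limit": 25000}
space = {"late_payments": [0, 1, 2, 3], "credit_limit": [10000, 25000, 50000]}

adv_row, shift = adversarial_search(toy_api, row, space)
print(adv_row, round(shift, 2))
```

Running the same search against your own model before an adversary does shows which feature combinations swing predictions the most. Those are natural candidates for constraints, regularization, or monitoring.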
--------------------------------------------------------------------------------
/code/Chapter-6/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Code for Chapter 6 - Explainable Boosting Machines and Explaining XGBoost
3 |
4 | This chapter explores interpretable models and post-hoc explanation with examples relating to consumer finance. It also applies the approaches discussed in Chapter 2 using explainable boosting machines (EBMs), monotonically constrained XGBoost models, and post-hoc explanation techniques.
5 |
6 | # Code
7 | 1. GLM, GAM, and EBM code example
[](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-6/GLM,GAM_and_EBM_code_example.ipynb)
8 | 2. Constrained_XGB and PostHoc Explanations
[](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-6/Constrained_XGB_and_Post_Hoc_Explanations.ipynb)
9 |
10 |
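A monotonically constrained model should never lower its prediction when a constrained feature increases and everything else is held fixed. The sketch below shows one simple way to spot-check that property for any scoring function. `toy_predict` and the feature names are hypothetical placeholders for a trained model and its inputs.

```python
def is_monotone_increasing(predict, row, feature, grid, tol=1e-12):
    """Check that predictions never decrease as `feature` moves up `grid`,
    with all other features in `row` held fixed."""
    scores = []
    for value in sorted(grid):
        probe = dict(row)  # copy so the caller's row is untouched
        probe[feature] = value
        scores.append(predict(probe))
    return all(b >= a - tol for a, b in zip(scores, scores[1:]))


# Hypothetical stand-in for a trained model's scoring function.
def toy_predict(row):
    return 0.3 + 0.05 * row["late_payments"] - 0.000005 * row["credit_limit"]


row = {"late_payments": 1, "credit_limit": 20000}

increasing = is_monotone_increasing(toy_predict, row, "late_payments", range(0, 7))
not_increasing = is_monotone_increasing(
    toy_predict, row, "credit_limit", [10000, 20000, 50000]
)
print(increasing, not_increasing)
```

A check like this only probes one row and one grid, so passing it is evidence of monotonic behavior rather than a guarantee.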
--------------------------------------------------------------------------------
/code/Chapter-7 & 9/1.Data Preparation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "cd5db84d",
6 | "metadata": {
7 | "id": "cd5db84d"
8 | },
9 | "source": [
10 | "# Chapter 7: Explaining a PyTorch Image Classifier\n",
11 | "\n",
12 | "## **Data Preparation**"
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "id": "91f6f7c2",
18 | "metadata": {
19 | "id": "91f6f7c2"
20 | },
21 | "source": [
22 | "## 1. Setting the environment\n",
23 | "\n",
24 | "Colab comes preinstalled with PyTorch and other commonly used machine learning and deep learning libraries. However, if you are running this notebook on your local system, you will need to install them manually with the following command:\n",
25 | "\n"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 6,
31 | "id": "2467c0e9",
32 | "metadata": {
33 | "collapsed": true,
34 | "jupyter": {
35 | "outputs_hidden": true
36 | },
37 | "tags": []
38 | },
39 | "outputs": [
40 | {
41 | "name": "stdout",
42 | "output_type": "stream",
43 | "text": [
44 | "\n",
45 | "[notice] A new release of pip available: 22.2.2 -> 22.3.1\n",
46 | "[notice] To update, run: python3.10 -m pip install --upgrade pip\n"
47 | ]
48 | }
49 | ],
50 | "source": [
51 | "!pip3 install torch torchvision numpy pandas matplotlib seaborn scikit-learn --quiet"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 7,
57 | "id": "7b52ade8",
58 | "metadata": {
59 | "id": "7b52ade8"
60 | },
61 | "outputs": [],
62 | "source": [
63 | "# Importing the necessary libraries \n",
64 | "import os\n",
65 | "import numpy as np\n",
66 | "import pandas as pd\n",
67 | "import seaborn as sns\n",
68 | "import matplotlib.pyplot as plt\n",
69 | "\n",
70 | "from sklearn.model_selection import train_test_split\n",
71 | "\n",
72 | "import torch\n",
73 | "import torchvision\n",
74 | "import torch.nn as nn\n",
75 | "from torchvision import datasets, transforms\n",
76 | "from torchvision.datasets import ImageFolder\n",
77 | "from torch.utils.data import Dataset, DataLoader, ConcatDataset\n",
78 | "\n",
79 | "device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')\n"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "id": "874f772c",
85 | "metadata": {
86 | "id": "874f772c"
87 | },
88 | "source": [
89 | "## 2. Loading the dataset\n",
90 | "\n",
91 | "We'll create an image classifier to diagnose pneumonia in chest X-ray images. The training dataset comes from [Kaggle](https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia) and consists of 5,863 X-ray images of patients, split into two categories: pneumonia and normal. In this section, we'll look at the dataset and check whether it has any issues.\n",
92 | "\n",
93 | "> Download the [zipped dataset from Kaggle](https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia) and save it as `chest_xray_original.zip` on your local system."
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": 14,
99 | "id": "18adeea3",
100 | "metadata": {
101 | "id": "18adeea3"
102 | },
103 | "outputs": [],
104 | "source": [
105 | "import zipfile\n",
106 | "with zipfile.ZipFile(\"chest_xray_original.zip\",\"r\") as zip_ref:\n",
107 | " zip_ref.extractall(\"chest_xray_original\")"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 21,
113 | "id": "1e1c3ad1",
114 | "metadata": {
115 | "id": "1e1c3ad1"
116 | },
117 | "outputs": [],
118 | "source": [
119 | "\n",
120 | "# Assigning PATH to original dataset\n",
121 | "PATH_original = 'chest_xray_original/chest_xray_original'\n"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 22,
127 | "id": "8df3f9bf",
128 | "metadata": {
129 | "id": "8df3f9bf"
130 | },
131 | "outputs": [],
132 | "source": [
133 | "def distribution(data_set):\n",
134 | "    \"\"\"Count and plot the images per class for one split ('train', 'test' or 'val').\"\"\"\n",
135 | "    \n",
136 | "    normal_path = os.path.join(PATH_original, data_set, 'NORMAL')\n",
137 | "    pneumonia_path = os.path.join(PATH_original, data_set, 'PNEUMONIA')\n",
138 | "    \n",
139 | "    normal = len(os.listdir(normal_path))\n",
140 | "    pneumonia = len(os.listdir(pneumonia_path))\n",
141 | "\n",
142 | "    counts = {'Normal': normal, 'Pneumonia': pneumonia}\n",
143 | "    sns.barplot(x=list(counts.keys()), y=list(counts.values())).set_title(f\"{data_set} Data Imbalance\")\n",
144 | "    return counts"
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": 23,
150 | "id": "dd70b780",
151 | "metadata": {
152 | "id": "dd70b780",
153 | "outputId": "e28e2fa7-367c-4885-a2ac-118d2c2be228"
154 | },
155 | "outputs": [
156 | {
157 | "data": {
158 | "text/plain": [
159 | "{'Normal': 1342, 'Pneumonia': 3876}"
160 | ]
161 | },
162 | "execution_count": 23,
163 | "metadata": {},
164 | "output_type": "execute_result"
165 | },
166 | {
167 | "data": {
168 | "image/png": "",
169 | "text/plain": [
170 | ""
171 | ]
172 | },
173 | "metadata": {},
174 | "output_type": "display_data"
175 | }
176 | ],
177 | "source": [
178 | "distribution('train')"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 24,
184 | "id": "7bacfeab",
185 | "metadata": {
186 | "id": "7bacfeab",
187 | "outputId": "ca740fa9-8120-4eaa-e9b1-4f59a2a6fce2"
188 | },
189 | "outputs": [
190 | {
191 | "data": {
192 | "text/plain": [
193 | "{'Normal': 234, 'Pneumonia': 390}"
194 | ]
195 | },
196 | "execution_count": 24,
197 | "metadata": {},
198 | "output_type": "execute_result"
199 | },
200 | {
201 | "data": {
202 | "image/png": "",
203 | "text/plain": [
204 | ""
205 | ]
206 | },
207 | "metadata": {},
208 | "output_type": "display_data"
209 | }
210 | ],
211 | "source": [
212 | "distribution('test')"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": 25,
218 | "id": "8b61bbae",
219 | "metadata": {
220 | "id": "8b61bbae",
221 | "outputId": "334cb43e-959e-4d72-babf-ce1724a86683"
222 | },
223 | "outputs": [
224 | {
225 | "data": {
226 | "text/plain": [
227 | "{'Normal': 9, 'Pneumonia': 9}"
228 | ]
229 | },
230 | "execution_count": 25,
231 | "metadata": {},
232 | "output_type": "execute_result"
233 | },
234 | {
235 | "data": {
236 | "image/png": "",
237 | "text/plain": [
238 | ""
239 | ]
240 | },
241 | "metadata": {},
242 | "output_type": "display_data"
243 | }
244 | ],
245 | "source": [
246 | "distribution('val')"
247 | ]
248 | },
249 | {
250 | "cell_type": "markdown",
251 | "id": "41bfef58",
252 | "metadata": {
253 | "id": "41bfef58"
254 | },
255 | "source": [
256 | "Like many datasets in medical applications, this data has a class imbalance problem. Another concern is that the validation set contains very few images: only 9 for the Pneumonia class and another 9 for the Normal class. That is too few to validate the model adequately. We have to address these and other issues before training our model.\n",
257 | "\n",
258 | "> We will manually transfer 461 Normal and 498 Pneumonia unique images from the training folder to the validation folder. \n",
259 | " \n",
260 | "*Download and place the preprocessed dataset in a folder named `chest_xray_preprocessed`. The preprocessed dataset can be accessed from [here](https://drive.google.com/drive/folders/1-hvJdh-tONZHYFVdA3ttpB_yuRusZK5C?usp=sharing).*\n",
261 | " "
262 | ]
263 | },
264 | {
265 | "cell_type": "markdown",
266 | "id": "fd6f56a5",
267 | "metadata": {
268 | "id": "fd6f56a5"
269 | },
270 | "source": [
271 | "## 3. Oversampling the data\n",
272 | "\n",
273 | "We'll now augment the remaining training set of the preprocessed dataset by oversampling the minority class in order to balance the classes."
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": 32,
279 | "id": "ef39f361",
280 | "metadata": {
281 | "id": "ef39f361"
282 | },
283 | "outputs": [],
284 | "source": [
285 | "TRAIN_DIR = 'chest_xray_preprocessed/train'\n",
286 | "\n",
287 | "IMAGE_SIZE = 224 # Image size of resize when applying transforms.\n",
288 | "BATCH_SIZE = 32\n",
289 | "NUM_WORKERS = 4 # Number of parallel processes for data preparation.\n",
290 | "def get_augmented_data():\n",
291 | " sample1 = ImageFolder(TRAIN_DIR, \n",
292 | " transform = transforms.Compose([transforms.Resize((224,224)),\n",
293 | " transforms.RandomRotation(10),\n",
294 | " transforms.RandomGrayscale(),\n",
295 | " transforms.RandomAffine(translate=(0.05,0.05), degrees=0),\n",
296 | " transforms.ToTensor(),\n",
297 | " transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),\n",
298 | " ]))\n",
299 | "\n",
300 | " sample2 = ImageFolder(TRAIN_DIR, \n",
301 | " transform=transforms.Compose([transforms.Resize((224,224)),\n",
302 | " transforms.RandomGrayscale(),\n",
303 | " transforms.RandomAffine(translate=(0.1,0.05), degrees=10),\n",
304 | " \n",
305 | " transforms.ToTensor(),\n",
306 | " transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),\n",
307 | " ]))\n",
308 | "\n",
309 | " sample3 = ImageFolder(TRAIN_DIR, \n",
310 | " transforms.Compose([transforms.Resize((224,224)),\n",
311 | " transforms.RandomRotation(15),\n",
312 | " transforms.RandomGrayscale(p=1),\n",
313 | " transforms.RandomAffine(translate=(0.08,0.1), degrees=15),\n",
314 | " transforms.ToTensor(),\n",
315 | " transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),\n",
316 | " ]))\n",
317 | " sample4 = ImageFolder(TRAIN_DIR, \n",
318 | " transforms.Compose([transforms.Resize((224,224)),\n",
319 | " transforms.RandomRotation(15),\n",
320 | " transforms.RandomGrayscale(p=1),\n",
321 | " transforms.RandomAffine(translate=(0.09,0.2), degrees=17),\n",
322 | " transforms.ToTensor(),\n",
323 | " transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),\n",
324 | " ]))\n",
325 | "\n",
326 | "\n",
327 | " normal_1, _ = train_test_split(sample2, test_size= 3377/(880+3377), shuffle=False)\n",
328 | " normal_2, _ = train_test_split(sample3, test_size= 3377/(880+3377), shuffle=False)\n",
329 | " normal_3, _ = train_test_split(sample4, test_size= 3000/(880+3377), shuffle=False)\n",
330 | "\n",
331 | " train_dataset = ConcatDataset([sample1, normal_1, normal_2, normal_3])\n",
332 | " return train_dataset"
333 | ]
334 | },
335 | {
336 | "cell_type": "code",
337 | "execution_count": 34,
338 | "id": "93df4eda",
339 | "metadata": {
340 | "id": "93df4eda"
341 | },
342 | "outputs": [],
343 | "source": [
344 | "train_dataset = get_augmented_data()"
345 | ]
346 | },
347 | {
348 | "cell_type": "code",
349 | "execution_count": 35,
350 | "id": "647e1a6c",
351 | "metadata": {
352 | "id": "647e1a6c"
353 | },
354 | "outputs": [],
355 | "source": [
356 | "# Saving the augmented training set images\n",
357 | "torch.save(train_dataset,'train_dataset.pt')"
358 | ]
359 | },
360 | {
361 | "cell_type": "markdown",
362 | "id": "e89d79ce",
363 | "metadata": {
364 | "id": "e89d79ce"
365 | },
366 | "source": [
367 | "Sanity check to see whether everything works as intended:"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 36,
373 | "id": "c1701d59",
374 | "metadata": {
375 | "id": "c1701d59",
376 | "outputId": "8c150459-a10c-4d76-fe8e-56cdc2f07c4c"
377 | },
378 | "outputs": [
379 | {
380 | "name": "stdout",
381 | "output_type": "stream",
382 | "text": [
383 | "Normal : 3516 and Pneumonia : 3758\n"
384 | ]
385 | }
386 | ],
387 | "source": [
388 | "pneumonia = 0\n",
389 | "normal = 0\n",
390 | "for i in range(len(train_dataset)):\n",
391 | " if train_dataset[i][1]==1:\n",
392 | " pneumonia += 1\n",
393 | " else:\n",
394 | " normal += 1\n",
395 | " \n",
396 | "print(f'Normal : {normal} and Pneumonia : {pneumonia}')"
397 | ]
398 | },
399 | {
400 | "attachments": {},
401 | "cell_type": "markdown",
402 | "id": "c4a07f26",
403 | "metadata": {
404 | "id": "c4a07f26"
405 | },
406 | "source": [
407 | "Another preprocessing technique that we can use for the dataset is image cropping. We\n",
408 | "crop some of the images in the training set so as to highlight only the lung region.\n",
409 | "Cropping also helps to eliminate annotation markers and other markings on\n",
410 | "the chest X-ray images, so that the model focuses only on the region of interest.\n",
411 | "We saved these images as a separate dataset, which is used to fine-tune the network during\n",
412 | "the second stage of transfer learning. \n",
413 | "\n",
414 | "The **Cropped Dataset** can be accessed from [here](https://drive.google.com/drive/folders/15C9RiGnjYpISCPBkoLrHhEzMHtRh7E7J?usp=sharing)"
415 | ]
416 | },
417 | {
418 | "cell_type": "markdown",
419 | "id": "df38d4fd",
420 | "metadata": {
421 | "id": "df38d4fd"
422 | },
423 | "source": [
424 | "## 4. Loading the Cropped Dataset"
425 | ]
426 | },
427 | {
428 | "cell_type": "code",
429 | "execution_count": 37,
430 | "id": "ad7cc68e",
431 | "metadata": {
432 | "id": "ad7cc68e"
433 | },
434 | "outputs": [],
435 | "source": [
436 | "cropped_ds = ImageFolder('Cropped', \n",
437 | " transform = transforms.Compose([transforms.Resize((224,224)),\n",
438 | " transforms.RandomRotation(10),\n",
439 | " transforms.RandomGrayscale(),\n",
440 | " transforms.RandomAffine(translate=(0.05,0.05), degrees=0),\n",
441 | " transforms.ToTensor(),\n",
442 | " transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),\n",
443 | " ]))\n"
444 | ]
445 | },
446 | {
447 | "cell_type": "code",
448 | "execution_count": 38,
449 | "id": "bd7f5d42",
450 | "metadata": {
451 | "id": "bd7f5d42",
452 | "outputId": "44f42be6-941e-41e2-f689-353a4759c024"
453 | },
454 | "outputs": [
455 | {
456 | "data": {
457 | "text/plain": [
458 | "Dataset ImageFolder\n",
459 | " Number of datapoints: 422\n",
460 | " Root location: Cropped\n",
461 | " StandardTransform\n",
462 | "Transform: Compose(\n",
463 | " Resize(size=(224, 224), interpolation=bilinear, max_size=None, antialias=None)\n",
464 | " RandomRotation(degrees=[-10.0, 10.0], interpolation=nearest, expand=False, fill=0)\n",
465 | " RandomGrayscale(p=0.1)\n",
466 | " RandomAffine(degrees=[0.0, 0.0], translate=(0.05, 0.05))\n",
467 | " ToTensor()\n",
468 | " Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])\n",
469 | " )"
470 | ]
471 | },
472 | "execution_count": 38,
473 | "metadata": {},
474 | "output_type": "execute_result"
475 | }
476 | ],
477 | "source": [
478 | "cropped_ds"
479 | ]
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": null,
484 | "id": "f9e8cbe5-d957-4bb8-9f36-a0d7b3b5cd11",
485 | "metadata": {},
486 | "outputs": [],
487 | "source": []
488 | }
489 | ],
490 | "metadata": {
491 | "colab": {
492 | "include_colab_link": true,
493 | "provenance": []
494 | },
495 | "kernelspec": {
496 | "display_name": "Python 3 (ipykernel)",
497 | "language": "python",
498 | "name": "python3"
499 | },
500 | "language_info": {
501 | "codemirror_mode": {
502 | "name": "ipython",
503 | "version": 3
504 | },
505 | "file_extension": ".py",
506 | "mimetype": "text/x-python",
507 | "name": "python",
508 | "nbconvert_exporter": "python",
509 | "pygments_lexer": "ipython3",
510 | "version": "3.10.8"
511 | }
512 | },
513 | "nbformat": 4,
514 | "nbformat_minor": 5
515 | }
516 |
--------------------------------------------------------------------------------
/code/Chapter-7 & 9/3.Transfer learning-Stage_2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "attachments": {},
5 | "cell_type": "markdown",
6 | "id": "8b0c8670",
7 | "metadata": {},
8 | "source": [
9 | "# Chapter 7: Explaining a PyTorch Image Classifier\n",
10 | "\n",
11 | "## **Transfer Learning Stage 2**"
12 | ]
13 | },
14 | {
15 | "attachments": {},
16 | "cell_type": "markdown",
17 | "id": "7a4690cd",
18 | "metadata": {},
19 | "source": [
20 | "## Setting the environment\n",
21 | "\n",
22 | "Colab comes preinstalled with PyTorch and other commonly used machine learning and deep learning libraries. If you are running this notebook on your local system, you will need to install them manually with the following command:"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": null,
28 | "id": "46d89428",
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "#!pip3 install torch torchvision numpy pandas matplotlib seaborn"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 1,
38 | "id": "3cbe0711",
39 | "metadata": {},
40 | "outputs": [],
41 | "source": [
42 | "import numpy as np \n",
43 | "import pandas as pd \n",
44 | "import matplotlib.pyplot as plt\n",
45 | "from matplotlib.image import imread\n",
46 | "import seaborn as sns\n",
47 | "import random\n",
48 | "\n",
49 | "\n",
50 | "import copy\n",
51 | "import os\n",
52 | "from sklearn.model_selection import train_test_split,StratifiedShuffleSplit\n",
53 | "from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay, classification_report\n",
54 | "from skimage.util import random_noise\n",
55 | "import time\n",
56 | "\n",
57 | "\n",
58 | "import torch\n",
59 | "import torch.nn as nn\n",
60 | "import torch.nn.functional as F\n",
61 | "import torchvision\n",
62 | "import torchvision.models as models\n",
63 | "from torchvision.datasets import ImageFolder\n",
64 | "from torchvision.utils import make_grid, save_image\n",
65 | "import torchvision.transforms as transforms\n",
66 | "from torch.utils.data import Dataset, DataLoader, ConcatDataset\n",
67 | "from PIL import Image\n",
68 | "\n",
69 | "from mlxtend.plotting import plot_confusion_matrix\n",
70 | "from sklearn.metrics import confusion_matrix\n",
71 | "\n",
72 | "# Seed\n",
73 | "seed = 123\n",
74 | "torch.manual_seed(seed)\n",
75 | "torch.cuda.manual_seed(seed)\n",
76 | "torch.cuda.manual_seed_all(seed)\n",
77 | "np.random.seed(seed)\n",
78 | "random.seed(seed)\n",
79 | "torch.backends.cudnn.benchmark = False\n",
80 | "torch.backends.cudnn.deterministic = True\n",
81 | "\n",
82 | "device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')\n",
83 | "random_seed = 12345"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": 2,
89 | "id": "a73f9c43",
90 | "metadata": {},
91 | "outputs": [
92 | {
93 | "data": {
94 | "text/plain": [
95 | ""
96 | ]
97 | },
98 | "execution_count": 2,
99 | "metadata": {},
100 | "output_type": "execute_result"
101 | }
102 | ],
103 | "source": [
104 | "def seed_worker(worker_id):\n",
105 | " worker_seed = torch.initial_seed() % 2**32\n",
106 | " np.random.seed(worker_seed)\n",
107 | " random.seed(worker_seed)\n",
108 | "\n",
109 | "g = torch.Generator()\n",
110 | "g.manual_seed(0)"
111 | ]
112 | },
113 | {
114 | "cell_type": "markdown",
115 | "id": "87d9fb2c",
116 | "metadata": {},
117 | "source": [
118 | "## Loading the Cropped Dataset\n",
119 | "Refer to the `Data Preparation` notebook"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 3,
125 | "id": "241236e4",
126 | "metadata": {},
127 | "outputs": [
128 | {
129 | "data": {
130 | "text/plain": [
131 | "(422, 977)"
132 | ]
133 | },
134 | "execution_count": 3,
135 | "metadata": {},
136 | "output_type": "execute_result"
137 | }
138 | ],
139 | "source": [
140 | "cropped_train_ds = ImageFolder('chest_xray_pre-processed/chest_xray/Cropped', \n",
141 | " transform = transforms.Compose([transforms.Resize((224,224)),\n",
142 | " transforms.RandomRotation(10),\n",
143 | " transforms.RandomGrayscale(),\n",
144 | " transforms.RandomAffine(translate=(0.05,0.05), degrees=0),\n",
145 | " transforms.ToTensor(),\n",
146 | " transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),\n",
147 | " ]))\n",
148 | "val_ds = ImageFolder('chest_xray_pre-processed/chest_xray/val', \n",
149 | " transform = transforms.Compose([transforms.Resize((224,224)),\n",
150 | " transforms.ToTensor(),\n",
151 | " transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),\n",
152 | " ]))\n",
153 | "\n",
154 | "len(cropped_train_ds), len(val_ds)\n"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": 4,
160 | "id": "0db8d417",
161 | "metadata": {},
162 | "outputs": [],
163 | "source": [
164 | "batch_size=32\n",
165 | "\n",
166 | "cropped_train_dl = DataLoader(cropped_train_ds, batch_size, shuffle=True, worker_init_fn=seed_worker)\n",
167 | "val_dl = DataLoader(val_ds, batch_size, worker_init_fn=seed_worker)\n",
168 | "loaders = {'train':cropped_train_dl, 'val':val_dl}\n",
169 | "dataset_sizes = {'train':len(cropped_train_ds), 'val':len(val_ds)}"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": 38,
175 | "id": "6b499208",
176 | "metadata": {},
177 | "outputs": [],
178 | "source": [
179 | "model = torch.load('Finetuning_Stage1.pt')"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": 39,
185 | "id": "b0736ee8",
186 | "metadata": {},
187 | "outputs": [],
188 | "source": [
189 | "for param in model.parameters():\n",
190 | "    param.requires_grad = False  # freezes every parameter, so optimizer.step() below leaves the weights unchanged"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": 40,
196 | "id": "4c535cea",
197 | "metadata": {},
198 | "outputs": [],
199 | "source": [
200 | "criterion = nn.CrossEntropyLoss()\n",
201 | "optimizer = torch.optim.Adam(model.parameters(), lr = 0.001)\n",
202 | "scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 4, gamma=0.1)"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": 41,
208 | "id": "cd9985fd",
209 | "metadata": {},
210 | "outputs": [],
211 | "source": [
212 | "losses = {'train':[], 'val':[]}\n",
213 | "accuracies = {'train':[], 'val':[]}\n",
214 | "\n",
215 | "\n",
216 | "def train(model, criterion, optimizer, scheduler, epochs):\n",
217 | "    since = time.time()\n",
218 | "    best_model = copy.deepcopy(model.state_dict())\n",
219 | "    best_acc = 0.0\n",
220 | "    for epoch in range(epochs):\n",
221 | "        for phase in ['train', 'val']:\n",
222 | "            if phase == 'train':\n",
223 | "                model.train()\n",
224 | "            else:\n",
225 | "                model.eval()\n",
226 | "\n",
227 | "            running_loss = 0.0\n",
228 | "            running_corrects = 0.0\n",
229 | "\n",
230 | "            for inputs, labels in loaders[phase]:\n",
231 | "                inputs, labels = inputs.to(device), labels.to(device)\n",
232 | "\n",
233 | "                optimizer.zero_grad()\n",
234 | "\n",
235 | "                with torch.set_grad_enabled(phase=='train'):\n",
236 | "                    outp = model(inputs)\n",
237 | "                    _, pred = torch.max(outp, 1)\n",
238 | "                    loss = criterion(outp, labels)\n",
239 | "                    loss.requires_grad = True\n",
240 | "\n",
241 | "                    if phase == 'train':\n",
242 | "                        loss.backward()\n",
243 | "                        optimizer.step()\n",
244 | "\n",
245 | "                running_loss += loss.item()*inputs.size(0)\n",
246 | "                running_corrects += torch.sum(pred == labels.data)\n",
247 | "\n",
248 | "\n",
249 | "            epoch_loss = running_loss / dataset_sizes[phase]\n",
250 | "            epoch_acc = running_corrects.double()/dataset_sizes[phase]\n",
251 | "            losses[phase].append(epoch_loss)\n",
252 | "            accuracies[phase].append(epoch_acc)\n",
253 | "            if phase == 'train':\n",
254 | "                print('Epoch: {}/{}'.format(epoch+1, epochs))\n",
255 | "            print('{} - loss:{}, accuracy{}'.format(phase, epoch_loss, epoch_acc))\n",
256 | "\n",
257 | "\n",
258 | "            if phase == 'val' and epoch_acc > best_acc:\n",
259 | "                best_acc = epoch_acc\n",
260 | "                best_model = copy.deepcopy(model.state_dict())\n",
261 | "        scheduler.step()\n",
262 | "    print('Best accuracy {}'.format(best_acc))\n",
263 | "\n",
264 | "    model.load_state_dict(best_model)\n",
265 | "    return model"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": 42,
271 | "id": "a8cf8954",
272 | "metadata": {},
273 | "outputs": [
274 | {
275 | "name": "stdout",
276 | "output_type": "stream",
277 | "text": [
278 | "Epoch: 1/17\n",
279 | "train - loss:0.16213886227604374, accuracy0.9360189573459716\n",
280 | "val - loss:0.05905295006321144, accuracy0.9815762538382805\n",
281 | "Epoch: 2/17\n",
282 | "train - loss:0.2181292337970146, accuracy0.9218009478672986\n",
283 | "val - loss:0.09303262661174307, accuracy0.9733879222108496\n",
284 | "Epoch: 3/17\n",
285 | "train - loss:0.17090152465336694, accuracy0.9312796208530807\n",
286 | "val - loss:0.09203716716300457, accuracy0.9723643807574207\n",
287 | "Epoch: 4/17\n",
288 | "train - loss:0.17591952452197743, accuracy0.9312796208530807\n",
289 | "val - loss:0.08194925019857345, accuracy0.9744114636642784\n",
290 | "Epoch: 5/17\n",
291 | "train - loss:0.1736818109572781, accuracy0.9241706161137442\n",
292 | "val - loss:0.07081017136284617, accuracy0.9785056294779939\n",
293 | "Epoch: 6/17\n",
294 | "train - loss:0.1652766755689377, accuracy0.9265402843601896\n",
295 | "val - loss:0.06797165959056645, accuracy0.9795291709314228\n",
296 | "Epoch: 7/17\n",
297 | "train - loss:0.17079675276498907, accuracy0.9454976303317536\n",
298 | "val - loss:0.08522375099469713, accuracy0.9744114636642784\n",
299 | "Epoch: 8/17\n",
300 | "train - loss:0.16720860655159112, accuracy0.9454976303317536\n",
301 | "val - loss:0.08462242321445095, accuracy0.9744114636642784\n",
302 | "Epoch: 9/17\n",
303 | "train - loss:0.1643253081076518, accuracy0.9360189573459716\n",
304 | "val - loss:0.0878552215935862, accuracy0.9733879222108496\n",
305 | "Epoch: 10/17\n",
306 | "train - loss:0.18233245928988073, accuracy0.9265402843601896\n",
307 | "val - loss:0.08086182687440344, accuracy0.9744114636642784\n",
308 | "Epoch: 11/17\n",
309 | "train - loss:0.18567422757993376, accuracy0.9241706161137442\n",
310 | "val - loss:0.06240757423052341, accuracy0.9815762538382805\n",
311 | "Epoch: 12/17\n",
312 | "train - loss:0.17861574831732077, accuracy0.9336492890995262\n",
313 | "val - loss:0.06742130681486361, accuracy0.9805527123848516\n",
314 | "Epoch: 13/17\n",
315 | "train - loss:0.1907900203354833, accuracy0.9336492890995262\n",
316 | "val - loss:0.0762084919125589, accuracy0.9744114636642784\n",
317 | "Epoch: 14/17\n",
318 | "train - loss:0.162779003488765, accuracy0.9407582938388627\n",
319 | "val - loss:0.08570337082003851, accuracy0.9733879222108496\n",
320 | "Epoch: 15/17\n",
321 | "train - loss:0.18446631447999115, accuracy0.9336492890995262\n",
322 | "val - loss:0.08623684697826443, accuracy0.9744114636642784\n",
323 | "Epoch: 16/17\n",
324 | "train - loss:0.1861739684496587, accuracy0.9383886255924171\n",
325 | "val - loss:0.07517468557685293, accuracy0.9754350051177073\n",
326 | "Epoch: 17/17\n",
327 | "train - loss:0.1731398391638887, accuracy0.9431279620853081\n",
328 | "val - loss:0.0682922940559919, accuracy0.9805527123848516\n",
329 | "Best accuracy 0.9815762538382805\n"
330 | ]
331 | }
332 | ],
333 | "source": [
334 | "model.to(device)\n",
335 | "epochs = 17\n",
336 | "model = train(model, criterion, optimizer, scheduler, epochs)"
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": 46,
342 | "id": "3f022b90",
343 | "metadata": {},
344 | "outputs": [],
345 | "source": [
346 | "torch.save(model, 'Finetuning_Stage2.pt')"
347 | ]
348 | },
349 | {
350 | "cell_type": "markdown",
351 | "id": "7b98bfee",
352 | "metadata": {},
353 | "source": [
354 | "## Model Evaluation on Test Set"
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": 47,
360 | "id": "e654c750",
361 | "metadata": {},
362 | "outputs": [],
363 | "source": [
364 | "testset = ImageFolder('chest_xray_pre-processed/chest_xray/test', \n",
365 | " transform=torchvision.transforms.Compose([torchvision.transforms.Resize((224,224)), \n",
366 | " torchvision.transforms.ToTensor(),\n",
367 | " torchvision.transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]), \n",
368 | " \n",
369 | " ]))"
370 | ]
371 | },
372 | {
373 | "cell_type": "code",
374 | "execution_count": 48,
375 | "id": "c64639c6",
376 | "metadata": {},
377 | "outputs": [],
378 | "source": [
379 | "test_dl = DataLoader(testset, batch_size=256)\n",
380 | "model.to(device);\n"
381 | ]
382 | },
383 | {
384 | "cell_type": "code",
385 | "execution_count": 49,
386 | "id": "beaf99af",
387 | "metadata": {},
388 | "outputs": [],
389 | "source": [
390 | "def accuracy(outputs, labels):\n",
391 | "    _, preds = torch.max(outputs, dim=1)\n",
392 | "    return torch.tensor(torch.sum(preds == labels).item() / len(preds)), preds\n",
393 | "\n",
394 | "def validation(batch):\n",
395 | "    images,labels = batch\n",
396 | "    images,labels = images.to(device),labels.to(device)\n",
397 | "    output = model(images)\n",
398 | "    loss = F.cross_entropy(output, labels)\n",
399 | "    acc,predictions = accuracy(output, labels)\n",
400 | "\n",
401 | "    return {'valid_loss': loss.detach(), 'valid_accuracy':acc.detach(), 'predictions':predictions.detach(), 'labels':labels.detach()}\n",
402 | "\n",
403 | "@torch.no_grad()\n",
404 | "def test_predict(model, test_dataloader):\n",
405 | "    model.eval()\n",
406 | "\n",
407 | "    outputs = [validation(batch) for batch in test_dataloader]\n",
408 | "    batch_losses = [x['valid_loss'] for x in outputs]\n",
409 | "    epoch_loss = torch.stack(batch_losses).mean()\n",
410 | "    batch_accs = [x['valid_accuracy'] for x in outputs]\n",
411 | "    epoch_acc = torch.stack(batch_accs).mean()\n",
412 | "\n",
413 | "    batch_preds = [pred for x in outputs for pred in x['predictions'].tolist()]\n",
414 | "    batch_labels = [label for x in outputs for label in x['labels'].tolist()]\n",
415 | "\n",
416 | "    print('test_loss: {:.4f}, test_acc: {:.4f}'\n",
417 | "          .format(epoch_loss.item(), epoch_acc.item()))\n",
418 | "\n",
419 | "    return batch_preds, batch_labels"
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "execution_count": 50,
425 | "id": "0e4e2e1a",
426 | "metadata": {},
427 | "outputs": [
428 | {
429 | "name": "stdout",
430 | "output_type": "stream",
431 | "text": [
432 | "test_loss: 0.2626, test_acc: 0.9334\n"
433 | ]
434 | }
435 | ],
436 | "source": [
437 | "preds,labels = test_predict(model, test_dl)"
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": 51,
443 | "id": "4901a11c",
444 | "metadata": {},
445 | "outputs": [
446 | {
447 | "data": {
448 | "image/png": "",
449 | "text/plain": [
450 | ""
451 | ]
452 | },
453 | "metadata": {
454 | "needs_background": "light"
455 | },
456 | "output_type": "display_data"
457 | }
458 | ],
459 | "source": [
460 | "cm = confusion_matrix(labels, preds)\n",
461 | "plot_confusion_matrix(cm,figsize=(8,6),cmap=plt.cm.Greys)\n",
462 | "plt.xticks(range(2), ['Normal', 'Pneumonia'])\n",
463 | "plt.yticks(range(2), ['Normal', 'Pneumonia'])\n",
464 | "plt.xlabel('Predicted Label',fontsize=10)\n",
465 | "plt.ylabel('True Label',fontsize=10)\n",
466 | "plt.show()"
467 | ]
468 | },
469 | {
470 | "cell_type": "code",
471 | "execution_count": 52,
472 | "id": "9c3153ca",
473 | "metadata": {},
474 | "outputs": [
475 | {
476 | "name": "stdout",
477 | "output_type": "stream",
478 | "text": [
479 | " precision recall f1-score support\n",
480 | "\n",
481 | " 0 0.95 0.85 0.90 234\n",
482 | " 1 0.92 0.97 0.94 390\n",
483 | "\n",
484 | " accuracy 0.93 624\n",
485 | " macro avg 0.93 0.91 0.92 624\n",
486 | "weighted avg 0.93 0.93 0.93 624\n",
487 | "\n"
488 | ]
489 | }
490 | ],
491 | "source": [
492 | "\n",
493 | "print(classification_report(labels, preds))"
494 | ]
495 | },
496 | {
497 | "attachments": {},
498 | "cell_type": "markdown",
499 | "id": "4c8132f8",
500 | "metadata": {},
501 | "source": [
502 | "The saved model can be accessed from [here](https://drive.google.com/file/d/17pKbfQwLugq60_RGMHUqbg3DI-EC7M24/view?usp=share_link)"
503 | ]
504 | },
505 | {
506 | "cell_type": "markdown",
507 | "id": "66cfe614",
508 | "metadata": {},
509 | "source": []
510 | }
511 | ],
512 | "metadata": {
513 | "kernelspec": {
514 | "display_name": "Python 3 (ipykernel)",
515 | "language": "python",
516 | "name": "python3"
517 | },
518 | "language_info": {
519 | "codemirror_mode": {
520 | "name": "ipython",
521 | "version": 3
522 | },
523 | "file_extension": ".py",
524 | "mimetype": "text/x-python",
525 | "name": "python",
526 | "nbconvert_exporter": "python",
527 | "pygments_lexer": "ipython3",
528 | "version": "3.8.10"
529 | }
530 | },
531 | "nbformat": 4,
532 | "nbformat_minor": 5
533 | }
534 |
--------------------------------------------------------------------------------
/code/Chapter-7 & 9/Adversarial Example Attacks.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "attachments": {},
5 | "cell_type": "markdown",
6 | "id": "df42271f",
7 | "metadata": {},
8 | "source": [
9 | "# Chapter 9: Debugging a PyTorch Image Classifier\n",
10 | "\n",
11 | "## **Adversarial Example Attacks.**\n",
12 | "\n",
13 | "The code below performs the fast gradient sign method (FGSM) attack on a PyTorch model. FGSM generates adversarial examples by adding a small perturbation, epsilon times the sign of the loss gradient with respect to the input, that pushes the model toward misclassifying the input. For each epsilon, the code loops over the test set one image at a time, skips images the model already misclassifies, computes the loss and its input gradient, and applies FGSM to build an adversarial example. Finally, it returns the accuracy on the perturbed images and a few adversarial examples for visualization."
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 1,
19 | "id": "5ee6465c-fd8f-4df2-bb4b-ce8d5f9c671d",
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "import matplotlib.pyplot as plt\n",
24 | "import numpy as np\n",
25 | "import scipy\n",
26 | "import torch\n",
27 | "import torch.nn as nn\n",
28 | "import torch.nn.functional as F\n",
29 | "import torch.optim as optim\n",
30 | "from torch.utils.data import ConcatDataset, DataLoader, Dataset\n",
31 | "from torchvision import datasets, transforms\n",
32 | "from torchvision.datasets import ImageFolder\n",
33 | "\n",
34 | "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n",
35 | "random_seed = 12345\n",
36 | "\n",
37 | "PATH = \"chest_xray_pre-processed\""
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 2,
43 | "id": "ef53d588-eba4-4b47-b4f7-3c535bbc416f",
44 | "metadata": {},
45 | "outputs": [
46 | {
47 | "data": {
48 | "text/plain": [
49 | ""
50 | ]
51 | },
52 | "execution_count": 2,
53 | "metadata": {},
54 | "output_type": "execute_result"
55 | }
56 | ],
57 | "source": [
58 | "import random\n",
59 | "\n",
60 | "def seed_worker(worker_id):\n",
61 | "    worker_seed = torch.initial_seed() % 2**32\n",
62 | "    np.random.seed(worker_seed)\n",
63 | "    random.seed(worker_seed)\n",
64 | "g = torch.Generator()\n",
65 | "g.manual_seed(0)"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 3,
71 | "id": "3bf1ef57-f2d0-42c7-b8b1-30ce775b5df6",
72 | "metadata": {},
73 | "outputs": [],
74 | "source": [
75 | "import matplotlib.pyplot as plt\n",
76 | "from matplotlib.colors import LinearSegmentedColormap\n",
77 | "\n",
78 | "classes = dict({0: \"NORMAL\", 1: \"PNEUMONIA\"})"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": 4,
84 | "id": "3feab12c-d83f-4258-86d6-f8ff9688f30d",
85 | "metadata": {},
86 | "outputs": [],
87 | "source": [
88 | "MEAN = torch.tensor([0.485, 0.456, 0.406])\n",
89 | "STD = torch.tensor([0.229, 0.224, 0.225])\n",
90 | "\n",
91 | "\n",
92 | "from matplotlib.colors import LinearSegmentedColormap\n",
93 | "\n",
94 | "\n",
95 | "def imshow(img, transpose=True):\n",
96 | "    plt.figure(figsize=(11, 6))\n",
97 | "    x = img.cpu() * STD[:, None, None] + MEAN[:, None, None]\n",
98 | "    # x = img.cpu()\n",
99 | "    # x= x / 2 + 0.5 # unnormalize\n",
100 | "    npimg = x.cpu().detach().numpy()\n",
101 | "    plt.imshow(npimg.transpose(1, 2, 0))\n",
102 | "    plt.axis(\"off\")\n",
103 | "    plt.show()"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 11,
109 | "id": "74473fc5-ba3e-4c49-88b3-965a469855f2",
110 | "metadata": {},
111 | "outputs": [],
112 | "source": [
113 | "epsilons = [0, .02, 0.05, .1, .15, .2,]\n",
114 | "pretrained_model = \"model/Finetuning_Stage2.pt\"\n",
115 | "model = torch.load(pretrained_model, map_location=device)\n",
116 | "model.eval();"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": 12,
122 | "id": "3852fc26-80ea-4dcb-93ca-d33615f04828",
123 | "metadata": {},
124 | "outputs": [],
125 | "source": [
126 | "testset = ImageFolder(\n",
127 | " PATH + \"/test\",\n",
128 | " transform=transforms.Compose(\n",
129 | " [\n",
130 | " transforms.Resize((224, 224)),\n",
131 | " transforms.ToTensor(),\n",
132 | " transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),\n",
133 | " ]\n",
134 | " ),\n",
135 | ")\n",
136 | "test_dl = DataLoader(testset, batch_size=1)"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 13,
142 | "id": "c1c01940-dce2-4464-be07-d5f8efa84447",
143 | "metadata": {},
144 | "outputs": [],
145 | "source": [
146 | "# This function performs an FGSM attack on an input image using the given data gradient\n",
147 | "# Code adapted from https://pytorch.org/tutorials/beginner/fgsm_tutorial.html\n",
148 | "\n",
149 | "def fgsm_attack(image, epsilon, data_grad):\n",
150 | "    # Compute the sign of the data gradient\n",
151 | "    sign_data_grad = torch.sign(data_grad)\n",
152 | "    # Create a perturbed image by adding epsilon times the sign of the gradient to the original image\n",
153 | "    perturbed_image = image + epsilon * sign_data_grad\n",
154 | "    # Clamp values to [0, 1]; the inputs are ImageNet-normalized, so this also changes unperturbed values (even at epsilon = 0)\n",
155 | "    perturbed_image = torch.clamp(perturbed_image, 0, 1)\n",
156 | "    # Return the perturbed image\n",
157 | "    return perturbed_image"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": null,
163 | "id": "c16bfd1e",
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "# This function tests the robustness of a deep learning model against an FGSM attack at a given epsilon\n",
168 | "# It returns the accuracy of the model and up to five adversarial examples for visualization\n",
169 | "# Code adapted from https://pytorch.org/tutorials/beginner/fgsm_tutorial.html\n",
170 | "\n",
171 | "def test(model, device, test_loader, epsilon):\n",
172 | "\n",
173 | "    # Initialize counters and storage\n",
174 | "    num_correct = 0\n",
175 | "    adversarial_examples = []\n",
176 | "\n",
177 | "    # Iterate over the test set (the loader must use batch_size=1)\n",
178 | "    for images, labels in test_loader:\n",
179 | "\n",
180 | "        # Send data and labels to the device\n",
181 | "        images, labels = images.to(device), labels.to(device)\n",
182 | "\n",
183 | "        # Track gradients with respect to the input image\n",
184 | "        images.requires_grad = True\n",
185 | "\n",
186 | "        # Make predictions\n",
187 | "        output = model(images)\n",
188 | "        initial_predictions = output.max(1, keepdim=True)[1]\n",
189 | "\n",
190 | "        # Skip images the model already misclassifies\n",
191 | "        if initial_predictions.item() != labels.item():\n",
192 | "            continue\n",
193 | "\n",
194 | "        # Calculate loss and gradients (F.nll_loss expects log-probabilities; F.cross_entropy is the match for raw logits)\n",
195 | "        loss = F.nll_loss(output, labels)\n",
196 | "        model.zero_grad()\n",
197 | "        loss.backward()\n",
198 | "        datagrad = images.grad.data\n",
199 | "\n",
200 | "        # Generate adversarial examples using FGSM attack\n",
201 | "        perturbed_images = fgsm_attack(images, epsilon, datagrad)\n",
202 | "\n",
203 | "        # Make predictions on the adversarial examples\n",
204 | "        output = model(perturbed_images)\n",
205 | "        final_predictions = output.max(1, keepdim=True)[1]\n",
206 | "\n",
207 | "        # Count the prediction as correct if it survives the attack\n",
208 | "        if final_predictions.item() == labels.item():\n",
209 | "            num_correct += 1\n",
210 | "            if (epsilon == 0) and (len(adversarial_examples) < 5):\n",
211 | "                adversarial_example = perturbed_images.squeeze().detach().cpu().numpy()\n",
212 | "                adversarial_examples.append((initial_predictions.item(), final_predictions.item(), adversarial_example))\n",
213 | "        else:\n",
214 | "            if len(adversarial_examples) < 5:\n",
215 | "                adversarial_example = perturbed_images.squeeze().detach().cpu().numpy()\n",
216 | "                adversarial_examples.append((initial_predictions.item(), final_predictions.item(), adversarial_example))\n",
217 | "\n",
218 | "    # Calculate and print accuracy\n",
219 | "    accuracy = num_correct / float(len(test_loader))\n",
220 | "    print(\"Epsilon: {}\\tTest Accuracy = {} / {} = {}\".format(epsilon, num_correct, len(test_loader), accuracy))\n",
221 | "\n",
222 | "    # Return accuracy and adversarial examples\n",
223 | "    return accuracy, adversarial_examples"
224 | ]
225 | },
226 | {
227 | "cell_type": "code",
228 | "execution_count": 15,
229 | "id": "b7939be3-6f58-4435-9d50-024e80fef5d8",
230 | "metadata": {},
231 | "outputs": [
232 | {
233 | "name": "stderr",
234 | "output_type": "stream",
235 | "text": [
236 | "/Users/parul/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ../c10/core/TensorImpl.h:1156.)\n",
237 | " return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)\n"
238 | ]
239 | },
240 | {
241 | "name": "stdout",
242 | "output_type": "stream",
243 | "text": [
244 | "Epsilon: 0\tTest Accuracy = 519 / 624 = 0.8317307692307693\n",
245 | "Epsilon: 0.02\tTest Accuracy = 285 / 624 = 0.4567307692307692\n",
246 | "Epsilon: 0.05\tTest Accuracy = 160 / 624 = 0.2564102564102564\n",
247 | "Epsilon: 0.1\tTest Accuracy = 133 / 624 = 0.21314102564102563\n",
248 | "Epsilon: 0.15\tTest Accuracy = 155 / 624 = 0.2483974358974359\n",
249 | "Epsilon: 0.2\tTest Accuracy = 177 / 624 = 0.28365384615384615\n"
250 | ]
251 | }
252 | ],
253 | "source": [
254 | "accuracies = []\n",
255 | "examples = []\n",
256 | "\n",
257 | "# Run test for each epsilon\n",
258 | "for eps in epsilons:\n",
259 | "    acc, ex = test(model, device, test_dl, eps)\n",
260 | "    accuracies.append(acc)\n",
261 | "    examples.append(ex)"
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": 19,
267 | "id": "4cc9536c-2593-49aa-ade2-634f90eea06e",
268 | "metadata": {},
269 | "outputs": [
270 | {
271 | "data": {
272 | "image/png": "",
273 | "text/plain": [
274 | ""
275 | ]
276 | },
277 | "metadata": {
278 | "needs_background": "light"
279 | },
280 | "output_type": "display_data"
281 | }
282 | ],
283 | "source": [
284 | "plt.figure(figsize=(5, 5))\n",
285 | "plt.plot(epsilons, accuracies, \"*-\")\n",
286 | "plt.yticks(np.arange(0, 1.1, step=0.1))\n",
287 | "plt.xticks(np.arange(0, 0.25, step=0.05))\n",
288 | "plt.title(\"Accuracy vs Epsilon\")\n",
289 | "plt.xlabel(\"Epsilon\")\n",
290 | "plt.ylabel(\"Accuracy\")\n",
291 | "plt.savefig('ff.svg')\n",
292 | "plt.show()"
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "execution_count": null,
298 | "id": "f6f98377-1ab0-484c-8b97-12ee8cc63931",
299 | "metadata": {},
300 | "outputs": [],
301 | "source": []
302 | }
303 | ],
304 | "metadata": {
305 | "kernelspec": {
306 | "display_name": "Python 3",
307 | "language": "python",
308 | "name": "python3"
309 | },
310 | "language_info": {
311 | "codemirror_mode": {
312 | "name": "ipython",
313 | "version": 3
314 | },
315 | "file_extension": ".py",
316 | "mimetype": "text/x-python",
317 | "name": "python",
318 | "nbconvert_exporter": "python",
319 | "pygments_lexer": "ipython3",
320 | "version": "3.8.11"
321 | }
322 | },
323 | "nbformat": 4,
324 | "nbformat_minor": 5
325 | }
326 |
--------------------------------------------------------------------------------
/code/Chapter-7 & 9/README.md:
--------------------------------------------------------------------------------
1 | # Code for Chapter 7 - Explaining a PyTorch Image Classifier
2 |
3 |
4 | | Section | Notebook |
5 | | :-----------------------------------------------: | ---------------------- |
6 | | Data Preparation | [1.Data Preparation.ipynb](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-7%20%26%209/1.Data%20Preparation.ipynb "1.Data Preparation.ipynb") |
7 | | Model Training | [2.Transfer learning-Stage_1.ipynb](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-7%20%26%209/2.Transfer%20learning-Stage_1.ipynb "2.Transfer learning-Stage_1.ipynb") |
8 | | | [3.Transfer learning-Stage_2.ipynb](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-7%20%26%209/3.Transfer%20learning-Stage_2.ipynb "3.Transfer learning-Stage_2.ipynb") |
9 | | Generating Post-Hoc Explanations Using Captum | [4.Post-Hoc Explanations.ipynb](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-7%20%26%209/4.Post-Hoc%20Explanations.ipynb "4.Post-Hoc Explanations.ipynb") |
10 | | Assessing the Robustness of Post-Hoc Explanations | [5.Adding Noise to images](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-7%20%26%209/5.Adding%20Noise%20to%20images%20.ipynb) |
11 | | | [6.Label Randomization](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-7%20%26%209/6.Label_Randomization.ipynb) |
12 | ---
13 |
14 | # Code for Chapter 9 - Debugging a PyTorch Image Classifier
15 |
16 | This chapter focuses on model debugging techniques for deep learning (DL) models, using the example pneumonia classifier trained in Chapter 7. We’ll start by discussing data quality and leakage issues in DL systems and why it is important to address them at the very beginning of a project. We’ll then explore some software testing methods and why software quality assurance (QA) is an essential component of debugging DL pipelines. We’ll also apply DL sensitivity analysis approaches, including testing the model on different distributions of pneumonia images and applying adversarial attacks. We’ll close the chapter by addressing our own data quality and leakage issues, discussing interesting new debugging tools for DL, and addressing the results of our own adversarial testing.
17 |
18 |
19 | * [Adversarial Example Attacks](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-7%20%26%209/Adversarial%20Example%20Attacks.ipynb)
20 | * [Noise Injection](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-7%20%26%209/Retraining%20on%20Gaussian%20Noise.ipynb)
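
As a rough illustration of the adversarial example attack listed above: the notebooks normalize images with ImageNet means and standard deviations, so a fast gradient sign method (FGSM) step is easiest to reason about when it is taken in the original [0, 1] pixel space. The sketch below is a framework-free NumPy version under that assumption. `fgsm_step_pixel_space` is a hypothetical helper, not a function from the notebook, and the random arrays only stand in for an image and for the loss gradient that PyTorch would supply.

```python
import numpy as np

# ImageNet statistics, matching the notebooks' transforms.Normalize call.
MEAN = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
STD = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)


def fgsm_step_pixel_space(x_norm, grad, epsilon, mean=MEAN, std=STD):
    """Hypothetical helper: one FGSM step taken in [0, 1] pixel space.

    x_norm : normalized image of shape (3, H, W)
    grad   : gradient of the loss with respect to x_norm, same shape
    """
    pixels = x_norm * std + mean                # undo the normalization
    pixels = pixels + epsilon * np.sign(grad)   # FGSM step of size epsilon
    pixels = np.clip(pixels, 0.0, 1.0)          # keep a valid image
    return (pixels - mean) / std                # normalize again for the model


# Made-up stand-ins for a ToTensor() image and for images.grad.
rng = np.random.default_rng(0)
raw = rng.uniform(0.0, 1.0, size=(3, 4, 4))
x_norm = (raw - MEAN) / STD
grad = rng.normal(size=x_norm.shape)

unchanged = fgsm_step_pixel_space(x_norm, grad, epsilon=0.0)
perturbed = fgsm_step_pixel_space(x_norm, grad, epsilon=0.05)
naive = np.clip(x_norm, 0.0, 1.0)  # clamp applied directly to normalized values

print("epsilon=0 leaves the input unchanged:", np.allclose(unchanged, x_norm))
print("clamping normalized values alters them:", bool(np.abs(naive - x_norm).max() > 0))
```

Because each standard deviation is positive, the sign of the gradient is the same in normalized and pixel space, so the gradient with respect to the normalized input can be reused for the pixel-space step.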
21 |
22 |
--------------------------------------------------------------------------------
/code/Chapter-8/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Code for Chapter 8 - Debugging XGBoost
3 | This chapter introduces several methods that go beyond traditional model assessment to push models to their limits and find hidden problems and failure modes. The chapter starts with a concept refresher and then focuses on model debugging exercises: sensitivity analysis, which better simulates real-world stresses, and residual analysis, which uncovers model errors. The overarching goal of model debugging is to increase human users' trust in model performance, but in the process, we’ll also gain an increased level of transparency into models.
4 | # Code
5 | 1. Selecting a Better XGBoost Model [Open in Colab](https://githubtocolab.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-8/Selecting%20a%20Better%20XGBoost%20Model.ipynb) [View on GitHub](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-8/Selecting%20a%20Better%20XGBoost%20Model.ipynb)
6 | 2. Sensitivity Analysis for XGBoost - Stress Testing XGBoost [Open in Colab](https://githubtocolab.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-8/Stress_testing_XGBoost.ipynb) [View on GitHub](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-8/Stress_testing_XGBoost.ipynb)
7 | 3. Sensitivity Analysis for XGBoost - Adversarial Example Search [Open in Colab](https://githubtocolab.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-8/Sensitivity_Analysis_for_XGBoost_Adversarial_Example_Search.ipynb) [View on GitHub](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-8/Sensitivity_Analysis_for_XGBoost_Adversarial_Example_Search.ipynb)
8 | 4. Residual Analysis for XGBoost [Open in Colab](https://githubtocolab.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-8/Residual_Analysis_for_XGBoost.ipynb) [View on GitHub](https://github.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/blob/main/code/Chapter-8/Residual_Analysis_for_XGBoost.ipynb)
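
To give a flavor of residual analysis, the sketch below computes per-row log loss residuals for a binary classifier, lists the rows with the largest residuals, and averages the residuals by segment. It is a generic NumPy illustration rather than the notebook's code: the labels, predicted probabilities, and `segment` values are made up, and `logloss_residuals` is a hypothetical helper name.

```python
import numpy as np


def logloss_residuals(y_true, p_hat, eps=1e-15):
    """Per-row log loss: large values flag rows the model gets badly wrong."""
    p = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


# Made-up labels, predicted probabilities, and a grouping column.
y_true = np.array([0, 0, 1, 1, 1, 0])
p_hat = np.array([0.1, 0.8, 0.9, 0.3, 0.6, 0.2])
segment = np.array(["a", "a", "a", "b", "b", "b"])

res = logloss_residuals(y_true, p_hat)

# Indices of the two rows with the largest residuals.
worst = np.argsort(res)[::-1][:2]

# Mean residual per segment, to spot groups where the model underperforms.
by_segment = {str(s): float(res[segment == s].mean()) for s in np.unique(segment)}

print("largest residual rows:", worst.tolist())
print("mean residual by segment:", by_segment)
```

Sorting rows by residual points to individual predictions worth inspecting, while the per-segment means hint at whether errors concentrate in a particular slice of the data.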
9 |
10 |
11 |
12 |
--------------------------------------------------------------------------------
/code/Data.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ml-for-high-risk-apps-book/Machine-Learning-for-High-Risk-Applications-Book/0fbf704b8b2d84579ce9473cdecabf46747a7ac2/code/Data.zip
--------------------------------------------------------------------------------