├── BasicClassification.ipynb
├── CNN.ipynb
├── ClassifyingImages.ipynb
├── Clustering.ipynb
├── EMSegmentation.ipynb
├── EMTopicModel.ipynb
├── GLMnet.ipynb
├── HiDimClassification.ipynb
├── MeanField.ipynb
├── PCA.ipynb
├── README.md
├── Regression.ipynb
└── SGDSVM.ipynb
/BasicClassification.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# * Prerequisites"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "You should familiarize yourself with the `numpy.ndarray` class of Python's `numpy` library.\n",
15 | "\n",
16 | "You should be able to answer the following questions before starting this assignment. Let's assume `a` is a numpy array.\n",
17 | "* What is an array's shape (e.g., what is the meaning of `a.shape`)? \n",
18 | "* What is numpy's reshaping operation? How much computational overhead does it incur?\n",
19 | "* What is numpy's transpose operation, and how is it different from reshaping? Does it cause computational overhead?\n",
20 | "* What is the meaning of the commands `a.reshape(-1, 1)` and `a.reshape(-1)`?\n",
21 | "* What happens to the variable `a` after we call `b = a.reshape(-1)`? Do any of `a`'s attributes change?\n",
22 | "* How do assignments in python and numpy work in general?\n",
23 | " * Does the `b=a` statement copy by value or by reference?\n",
24 | " * Does the answer to the previous question change depending on whether `a` is a numpy array or a scalar value?\n",
25 | " \n",
26 | "You can answer all of these questions by\n",
27 | "\n",
28 | " 1. Reading numpy's documentation from https://numpy.org/doc/stable/.\n",
29 | " 2. Experimenting with dummy variables."
30 | ]
31 | },
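The questions above can be explored with a few dummy variables. The following is a minimal sketch, not part of the assignment; the array names (`a`, `col`, `flat`, `m`, `b`, `c`) are made up for illustration, and the view behaviour shown assumes a contiguous array such as the one produced by `np.arange`.

```python
import numpy as np

a = np.arange(6)            # shape (6,)
col = a.reshape(-1, 1)      # shape (6, 1); -1 lets numpy infer that dimension
flat = col.reshape(-1)      # back to shape (6,)

# Reshaping a contiguous array returns a view: no data is copied.
assert np.shares_memory(a, col)
assert a.shape == (6,)      # a itself is unchanged by a.reshape(...)

# Transposing also returns a view, but it swaps axes instead of
# re-reading the same element order with a new shape.
m = a.reshape(2, 3)
assert m.T.shape == (3, 2)
assert np.shares_memory(m, m.T)
assert not np.array_equal(m.T, a.reshape(3, 2))

# Plain assignment binds a second name to the same array object.
b = a
b[0] = 100
assert a[0] == 100          # the change is visible through both names

# An explicit copy is independent of the original.
c = a.copy()
c[1] = -1
assert a[1] == 1
```

Because `reshape` and `.T` return views here, they cost almost nothing regardless of array size; only `copy()` duplicates the data.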
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "# * Assignment Summary"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "The UC Irvine machine learning data repository hosts a famous collection of data on whether a patient has diabetes (the Pima Indians dataset), originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito. You can find this data at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. This data has a set of attributes of patients, and a categorical variable telling whether the patient is diabetic or not. For several attributes in this data set, a value of 0 may indicate a missing value of the variable.\n",
44 | "\n",
45 | "* **Part 1-A)** Build a simple naive Bayes classifier to classify this data set. We will use 20% of the data for evaluation and the other 80% for training. \n",
46 | "\n",
47 | " There are a total of 768 data points. You should use a normal distribution to model each of the class-conditional distributions. You should write this classifier yourself (it's quite straightforward).\n",
48 | "\n",
49 | " Report the accuracy of the classifier on the 20% evaluation data, where accuracy is the number of correct predictions as a fraction of total predictions.\n",
50 | "\n",
51 | "* **Part 1-B)** Now adjust your code so that, for attribute 3 (Diastolic blood pressure), attribute 4 (Triceps skin fold thickness), attribute 6 (Body mass index), and attribute 8 (Age), it regards a value of 0 as a missing value when estimating the class-conditional distributions, and the posterior.\n",
52 | "\n",
53 | " Report the accuracy of the classifier on the 20% that was held out for evaluation.\n",
54 | "\n",
55 | "* **Part 1-C)** Now install SVMLight, which you can find at http://svmlight.joachims.org, to train and evaluate an SVM to classify this data.\n",
56 | "\n",
57 | " You don't need to understand much about SVMs to do this, as we'll cover them in later exercises. You should NOT substitute NA values for zeros for attributes 3, 4, 6, and 8.\n",
58 | " \n",
59 | " Report the accuracy of the classifier on the held-out 20%."
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "# 0. Data"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "## 0.1 Description"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "The UC Irvine Machine Learning Repository hosts a famous collection of data on whether a patient has diabetes (the Pima Indians dataset), originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito. \n",
81 | "\n",
82 | "You can find this data at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. The Kaggle page offers useful visualizations of each data column in its dashboard. It is worth taking the time to explore the data there before applying any method to it."
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "## 0.2 Information Summary"
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "* **Input/Output**: This data has a set of attributes of patients, and a categorical variable telling whether the patient is diabetic or not. \n",
97 | "\n",
98 | "* **Missing Data**: For several attributes in this data set, a value of 0 may indicate a missing value of the variable.\n",
99 | "\n",
100 | "* **Final Goal**: We want to build a classifier that can predict whether a patient has diabetes or not. To do this, we will train several kinds of models and handle the missing data differently for each (i.e., some methods will ignore the missing values, while others will treat them explicitly)."
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "## 0.3 Loading"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 46,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "%matplotlib inline\n",
117 | "import pandas as pd\n",
118 | "import numpy as np\n",
119 | "import seaborn as sns\n",
120 | "import matplotlib.pyplot as plt\n",
121 | "\n",
122 | "from utils import test_case_checker"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 47,
128 | "metadata": {},
129 | "outputs": [
130 | {
131 | "data": {
227 | "text/plain": [
228 | " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n",
229 | "0 6 148 72 35 0 33.6 \n",
230 | "1 1 85 66 29 0 26.6 \n",
231 | "2 8 183 64 0 0 23.3 \n",
232 | "3 1 89 66 23 94 28.1 \n",
233 | "4 0 137 40 35 168 43.1 \n",
234 | "\n",
235 | " DiabetesPedigreeFunction Age Outcome \n",
236 | "0 0.627 50 1 \n",
237 | "1 0.351 31 0 \n",
238 | "2 0.672 32 1 \n",
239 | "3 0.167 21 0 \n",
240 | "4 2.288 33 1 "
241 | ]
242 | },
243 | "execution_count": 47,
244 | "metadata": {},
245 | "output_type": "execute_result"
246 | }
247 | ],
248 | "source": [
249 | "df = pd.read_csv('diabetes.csv')\n",
250 | "df.head()"
251 | ]
252 | },
253 | {
254 | "cell_type": "markdown",
255 | "metadata": {},
256 | "source": [
257 | "## 0.4 Splitting The Data"
258 | ]
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "First, we will randomly assign each row to either the training set or the evaluation set, so that the split does not depend on the order of rows in the original csv file. \n",
265 | "\n",
266 | "* The training and evaluation dataframes will be named ```train_df``` and ```eval_df```, respectively.\n",
267 | "\n",
268 | "* We will also create the 2-d numpy array `train_features`, whose number of rows is the number of training samples and whose number of columns is 8 (i.e., the number of features). We will define `eval_features` in a similar fashion.\n",
269 | "\n",
270 | "* We will also create the 1-d numpy arrays `train_labels` and `eval_labels`, which contain the training and evaluation labels, respectively."
271 | ]
272 | },
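One way to obtain an 80/20 split is to draw one uniform random number per row and cut at the 80th percentile of those draws. The sketch below illustrates the idea on masks only; the seed and the variable names (`rng`, `u`, `thresh`) are assumptions chosen for illustration and are not necessarily the ones used in this notebook.

```python
import numpy as np

rng = np.random.RandomState(seed=0)      # seed chosen only for illustration
n_rows = 768                             # number of data points in the dataset
u = rng.uniform(0, 1, size=n_rows)       # one uniform draw per row
thresh = np.percentile(u, 80)            # 80th percentile of the draws

train_mask = u < thresh                  # roughly 80% of the rows
eval_mask = u >= thresh                  # the remaining rows

n_train = int(train_mask.sum())
n_eval = int(eval_mask.sum())
print(n_train, n_eval)
```

Since every row satisfies exactly one of the two conditions, the masks are disjoint and together cover the whole dataset. Rows keep their original relative order inside each split; only the assignment to a split is random.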
273 | {
274 | "cell_type": "code",
275 | "execution_count": 48,
276 | "metadata": {},
277 | "outputs": [],
278 | "source": [
279 | "# Let's generate the split ourselves.\n",
280 | "np_random = np.random.RandomState(seed=12345)\n",
281 | "rand_unifs = np_random.uniform(0,1,size=df.shape[0])\n",
282 | "division_thresh = np.percentile(rand_unifs, 80)\n",
283 | "train_indicator = rand_unifs < division_thresh\n",
284 | "eval_indicator = rand_unifs >= division_thresh"
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": 49,
290 | "metadata": {},
291 | "outputs": [
292 | {
293 | "data": {
389 | "text/plain": [
390 | " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n",
391 | "0 1 85 66 29 0 26.6 \n",
392 | "1 8 183 64 0 0 23.3 \n",
393 | "2 1 89 66 23 94 28.1 \n",
394 | "3 0 137 40 35 168 43.1 \n",
395 | "4 5 116 74 0 0 25.6 \n",
396 | "\n",
397 | " DiabetesPedigreeFunction Age Outcome \n",
398 | "0 0.351 31 0 \n",
399 | "1 0.672 32 1 \n",
400 | "2 0.167 21 0 \n",
401 | "3 2.288 33 1 \n",
402 | "4 0.201 30 0 "
403 | ]
404 | },
405 | "execution_count": 49,
406 | "metadata": {},
407 | "output_type": "execute_result"
408 | }
409 | ],
410 | "source": [
411 | "train_df = df[train_indicator].reset_index(drop=True)\n",
412 | "train_features = train_df.loc[:, train_df.columns != 'Outcome'].values\n",
413 | "train_labels = train_df['Outcome'].values\n",
414 | "train_df.head()"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 50,
420 | "metadata": {},
421 | "outputs": [
422 | {
423 | "data": {
519 | "text/plain": [
520 | " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n",
521 | "0 6 148 72 35 0 33.6 \n",
522 | "1 3 78 50 32 88 31.0 \n",
523 | "2 10 168 74 0 0 38.0 \n",
524 | "3 0 118 84 47 230 45.8 \n",
525 | "4 7 107 74 0 0 29.6 \n",
526 | "\n",
527 | " DiabetesPedigreeFunction Age Outcome \n",
528 | "0 0.627 50 1 \n",
529 | "1 0.248 26 1 \n",
530 | "2 0.537 34 1 \n",
531 | "3 0.551 31 1 \n",
532 | "4 0.254 31 1 "
533 | ]
534 | },
535 | "execution_count": 50,
536 | "metadata": {},
537 | "output_type": "execute_result"
538 | }
539 | ],
540 | "source": [
541 | "eval_df = df[eval_indicator].reset_index(drop=True)\n",
542 | "eval_features = eval_df.loc[:, eval_df.columns != 'Outcome'].values\n",
543 | "eval_labels = eval_df['Outcome'].values\n",
544 | "eval_df.head()"
545 | ]
546 | },
547 | {
548 | "cell_type": "code",
549 | "execution_count": 51,
550 | "metadata": {},
551 | "outputs": [
552 | {
553 | "data": {
554 | "text/plain": [
555 | "((614, 8), (614,), (154, 8), (154,))"
556 | ]
557 | },
558 | "execution_count": 51,
559 | "metadata": {},
560 | "output_type": "execute_result"
561 | }
562 | ],
563 | "source": [
564 | "train_features.shape, train_labels.shape, eval_features.shape, eval_labels.shape"
565 | ]
566 | },
567 | {
568 | "cell_type": "markdown",
569 | "metadata": {},
570 | "source": [
571 | "## 0.5 Pre-processing The Data"
572 | ]
573 | },
574 | {
575 | "cell_type": "markdown",
576 | "metadata": {},
577 | "source": [
578 | "Some of the columns exhibit missing values. We will later use a Naive Bayes Classifier that treats such missing values in a special way. To be specific, for attribute 3 (Diastolic blood pressure), attribute 4 (Triceps skin fold thickness), attribute 6 (Body mass index), and attribute 8 (Age), we should regard a value of 0 as a missing value.\n",
579 | "\n",
580 | "Therefore, we will create the `train_features_with_nans` and `eval_features_with_nans` numpy arrays to be just like their `train_features` and `eval_features` counterparts, but with the zero values in these columns replaced with NaNs."
581 | ]
582 | },
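The same replacement can be done directly on a numpy array. This is a minimal sketch under the assumption that the eight features are stored in the column order shown by `df.head()`, so that attributes 3, 4, 6, and 8 sit at 0-based positions 2, 3, 5, and 7; the names `feats` and `missing_cols` are hypothetical.

```python
import numpy as np

# Two rows taken from the head of the dataset shown earlier.
feats = np.array([[6., 148., 72., 35., 0., 33.6, 0.627, 50.],
                  [8., 183., 64., 0., 0., 23.3, 0.672, 32.]])
missing_cols = [2, 3, 5, 7]   # assumed 0-based positions of attributes 3, 4, 6, 8

feats_with_nans = feats.copy()               # leave the original array untouched
block = feats_with_nans[:, missing_cols]     # fancy indexing returns a copy
block[block == 0] = np.nan                   # mark zeros as missing
feats_with_nans[:, missing_cols] = block     # write the block back

# NaN-aware reductions skip the missing entries.
col_means = np.nanmean(feats_with_nans, axis=0)
```

Note that the zero in the Insulin column (position 4) is left as is, because that column is not in the list of attributes treated as missing.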
583 | {
584 | "cell_type": "code",
585 | "execution_count": 52,
586 | "metadata": {},
587 | "outputs": [],
588 | "source": [
589 | "train_df_with_nans = train_df.copy(deep=True)\n",
590 | "eval_df_with_nans = eval_df.copy(deep=True)\n",
591 | "for col_with_nans in ['BloodPressure', 'SkinThickness', 'BMI', 'Age']:\n",
592 | " train_df_with_nans[col_with_nans] = train_df_with_nans[col_with_nans].replace(0, np.nan)\n",
593 | " eval_df_with_nans[col_with_nans] = eval_df_with_nans[col_with_nans].replace(0, np.nan)\n",
594 | "train_features_with_nans = train_df_with_nans.loc[:, train_df_with_nans.columns != 'Outcome'].values\n",
595 | "eval_features_with_nans = eval_df_with_nans.loc[:, eval_df_with_nans.columns != 'Outcome'].values"
596 | ]
597 | },
598 | {
599 | "cell_type": "code",
600 | "execution_count": 53,
601 | "metadata": {},
602 | "outputs": [
603 | {
604 | "name": "stdout",
605 | "output_type": "stream",
606 | "text": [
607 | "Here are the training rows with at least one missing value.\n",
608 | "\n",
609 | "You can see that such incomplete data points constitute a substantial part of the data.\n",
610 | "\n"
611 | ]
612 | },
613 | {
614 | "data": {
783 | "text/plain": [
784 | " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n",
785 | "1 8 183 64.0 NaN 0 23.3 \n",
786 | "4 5 116 74.0 NaN 0 25.6 \n",
787 | "5 10 115 NaN NaN 0 35.3 \n",
788 | "7 8 125 96.0 NaN 0 NaN \n",
789 | "8 4 110 92.0 NaN 0 37.6 \n",
790 | ".. ... ... ... ... ... ... \n",
791 | "598 6 162 62.0 NaN 0 24.3 \n",
792 | "599 4 136 70.0 NaN 0 31.2 \n",
793 | "605 1 106 76.0 NaN 0 37.5 \n",
794 | "606 6 190 92.0 NaN 0 35.5 \n",
795 | "612 1 126 60.0 NaN 0 30.1 \n",
796 | "\n",
797 | " DiabetesPedigreeFunction Age Outcome \n",
798 | "1 0.672 32 1 \n",
799 | "4 0.201 30 0 \n",
800 | "5 0.134 29 0 \n",
801 | "7 0.232 54 1 \n",
802 | "8 0.191 30 0 \n",
803 | ".. ... ... ... \n",
804 | "598 0.178 50 1 \n",
805 | "599 1.182 22 1 \n",
806 | "605 0.197 26 0 \n",
807 | "606 0.278 66 1 \n",
808 | "612 0.349 47 1 \n",
809 | "\n",
810 | "[186 rows x 9 columns]"
811 | ]
812 | },
813 | "execution_count": 53,
814 | "metadata": {},
815 | "output_type": "execute_result"
816 | }
817 | ],
818 | "source": [
819 | "print('Here are the training rows with at least one missing value.')\n",
820 | "print('')\n",
821 | "print('You can see that such incomplete data points constitute a substantial part of the data.')\n",
822 | "print('')\n",
823 | "nan_training_data = train_df_with_nans[train_df_with_nans.isna().any(axis=1)]\n",
824 | "nan_training_data"
825 | ]
826 | },
827 | {
828 | "cell_type": "markdown",
829 | "metadata": {},
830 | "source": [
831 | "# 1. Part 1 (Building a simple Naive Bayes Classifier)"
832 | ]
833 | },
834 | {
835 | "cell_type": "markdown",
836 | "metadata": {},
837 | "source": [
838 | "Consider a single sample $(\\mathbf{x}, y)$, where the feature vector is denoted by $\\mathbf{x}$ and the label by $y$. We will also denote the $j^{th}$ feature of $\\mathbf{x}$ by $x^{(j)}$.\n",
839 | "\n",
840 | "According to the textbook, the Naive Bayes Classifier uses the following decision rule:\n",
841 | "\n",
842 | "\"Choose $y$ such that $$\\bigg[\\log p(y) + \\sum_{j} \\log p(x^{(j)}|y) \\bigg]$$ is the largest\"\n",
843 | "\n",
844 | "However, we first need to define the probabilistic models of the prior $p(y)$ and the class-conditional feature distributions $p(x^{(j)}|y)$ using the training data.\n",
845 | "\n",
846 | "* **Modelling the prior $p(y)$**: We fit a Bernoulli distribution to the `Outcome` variable of `train_df`.\n",
847 | "* **Modelling the class-conditional feature distributions $p(x^{(j)}|y)$**: We fit Gaussian distributions, and infer the Gaussian mean and variance parameters from `train_df`."
848 | ]
849 | },
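The decision rule above can be sketched in a few lines once the prior and the Gaussian parameters are available. This is an illustrative sketch, not the graded solution: the helper names `gaussian_log_pdf` and `nb_predict` are hypothetical, and the toy parameters are made up. It assumes `log_py` has shape `(2, 1)` and the mean and standard deviation matrices have shape `(d, 2)`, with one column per class.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    """Elementwise log of the normal density N(x; mu, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def nb_predict(x, log_py, mu_y, sigma_y):
    """x: (d,), log_py: (2, 1), mu_y and sigma_y: (d, 2). Returns 0 or 1."""
    # Broadcasting x as a column evaluates every feature under both classes.
    log_lik = gaussian_log_pdf(x.reshape(-1, 1), mu_y, sigma_y)   # (d, 2)
    # log p(y) + sum_j log p(x^(j) | y), one score per class.
    scores = log_py.reshape(-1) + log_lik.sum(axis=0)             # (2,)
    return int(np.argmax(scores))

# Toy parameters, made up for illustration (two features, two classes).
log_py = np.log(np.array([[0.5], [0.5]]))
mu_y = np.array([[0.0, 5.0],
                 [0.0, 5.0]])
sigma_y = np.ones((2, 2))

pred_near_zero = nb_predict(np.array([0.2, -0.1]), log_py, mu_y, sigma_y)
pred_near_five = nb_predict(np.array([4.8, 5.3]), log_py, mu_y, sigma_y)
```

Working with sums of logs rather than products of probabilities avoids numerical underflow when there are many features.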
850 | {
851 | "cell_type": "markdown",
852 | "metadata": {},
853 | "source": [
854 | "# Task 1 "
855 | ]
856 | },
857 | {
858 | "cell_type": "markdown",
859 | "metadata": {},
860 | "source": [
861 | "Write a function `log_prior` that takes a numpy array `train_labels` as input, and outputs the following vector as a column numpy array (i.e., with shape $(2,1)$).\n",
862 | "\n",
863 | "$$\\log p_y =\\begin{bmatrix}\\log p(y=0)\\\\\\log p(y=1)\\end{bmatrix}$$\n",
864 | "\n",
865 | "Try to avoid loops as much as possible. No loops are necessary.\n",
866 | "\n",
867 | "**Hint**: Make sure all the array shapes are what you need and expect. You can reshape any numpy array without any tangible computational overhead."
868 | ]
869 | },
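As an illustration of the loop-free style the task asks for, the class frequencies can be computed with a single vectorized comparison. This is a sketch under the assumption that the labels contain only 0 and 1; the function name `log_prior_vectorized` is hypothetical and is not the name the grader expects.

```python
import numpy as np

def log_prior_vectorized(labels):
    """Loop-free log prior; assumes labels contain only 0 and 1."""
    p1 = np.mean(labels == 1)                      # fraction of positive labels
    log_py = np.log(np.array([[1.0 - p1], [p1]]))  # column vector, shape (2, 1)
    return log_py

labels = np.array([0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1])
out = log_prior_vectorized(labels)
```

If one class never appears in the labels, its prior is zero and the log is `-inf`; this sketch does not guard against that case.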
870 | {
871 | "cell_type": "code",
872 | "execution_count": 54,
873 | "metadata": {
874 | "deletable": false,
875 | "nbgrader": {
876 | "cell_type": "code",
877 | "checksum": "071dcd7013b592e1fc344ddc31bedc4e",
878 | "grade": false,
879 | "grade_id": "cell-540952d95c213032",
880 | "locked": false,
881 | "schema_version": 3,
882 | "solution": true,
883 | "task": false
884 | }
885 | },
886 | "outputs": [],
887 | "source": [
888 | "def log_prior(train_labels):\n",
889 | " \n",
890 | " # your code here\n",
891 | " num0 = 0\n",
892 | " num1 = 0\n",
893 | " for label in train_labels:\n",
894 | " if label == 0:\n",
895 | " num0 += 1\n",
896 | " else:\n",
897 | " num1 += 1\n",
898 | " log_py = np.log([[num0/len(train_labels)],[num1/len(train_labels)]])\n",
899 | " \n",
900 | " assert log_py.shape == (2,1)\n",
901 | " \n",
902 | " return log_py"
903 | ]
904 | },
905 | {
906 | "cell_type": "code",
907 | "execution_count": 55,
908 | "metadata": {
909 | "deletable": false,
910 | "editable": false,
911 | "nbgrader": {
912 | "cell_type": "code",
913 | "checksum": "20d512227df8d765b37255ebdca0bbde",
914 | "grade": true,
915 | "grade_id": "cell-7c3e85395bcc5892",
916 | "locked": true,
917 | "points": 1,
918 | "schema_version": 3,
919 | "solution": false,
920 | "task": false
921 | }
922 | },
923 | "outputs": [],
924 | "source": [
925 | "some_labels = np.array([0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1])\n",
926 | "some_log_py = log_prior(some_labels)\n",
927 | "assert np.array_equal(some_log_py.round(3), np.array([[-0.916], [-0.511]]))\n",
928 | "\n",
929 | "# Checking against the pre-computed test database\n",
930 | "test_results = test_case_checker(log_prior, task_id=1)\n",
931 | "assert test_results['passed'], test_results['message']"
932 | ]
933 | },
934 | {
935 | "cell_type": "code",
936 | "execution_count": 56,
937 | "metadata": {},
938 | "outputs": [
939 | {
940 | "data": {
941 | "text/plain": [
942 | "array([[-0.41610786],\n",
943 | " [-1.07766068]])"
944 | ]
945 | },
946 | "execution_count": 56,
947 | "metadata": {},
948 | "output_type": "execute_result"
949 | }
950 | ],
951 | "source": [
952 | "log_py = log_prior(train_labels)\n",
953 | "log_py"
954 | ]
955 | },
956 | {
957 | "cell_type": "markdown",
958 | "metadata": {},
959 | "source": [
960 | "# Task 2 "
961 | ]
962 | },
963 | {
964 | "cell_type": "markdown",
965 | "metadata": {},
966 | "source": [
967 | "Write a function `cc_mean_ignore_missing` that takes the numpy arrays `train_features` and `train_labels` as input, and outputs the following matrix with the shape $(8,2)$, where 8 is the number of features.\n",
968 | "\n",
969 | "$$\\mu_y = \\begin{bmatrix} \\mathbb{E}[x^{(0)}|y=0] & \\mathbb{E}[x^{(0)}|y=1]\\\\\n",
970 | "\\mathbb{E}[x^{(1)}|y=0] & \\mathbb{E}[x^{(1)}|y=1] \\\\\n",
971 | "\\cdots & \\cdots\\\\\n",
972 | "\\mathbb{E}[x^{(7)}|y=0] & \\mathbb{E}[x^{(7)}|y=1]\\end{bmatrix}$$\n",
973 | "\n",
974 | "Some points regarding this task:\n",
975 | "\n",
976 | "* The `train_features` numpy array has a shape of `(N,8)` where `N` is the number of training data points, and 8 is the number of features. \n",
977 | "\n",
978 | "* The `train_labels` numpy array has a shape of `(N,)`. \n",
979 | "\n",
980 | "* **You can assume that `train_features` has no missing elements in this task**.\n",
981 | "\n",
982 | "* Try to avoid loops as much as possible. No loops are necessary."
983 | ]
984 | },
985 | {
986 | "cell_type": "code",
987 | "execution_count": 57,
988 | "metadata": {
989 | "deletable": false,
990 | "nbgrader": {
991 | "cell_type": "code",
992 | "checksum": "48bacfdecfbecc35ccca01e2e264d3ef",
993 | "grade": false,
994 | "grade_id": "cell-9482e9412e53e401",
995 | "locked": false,
996 | "schema_version": 3,
997 | "solution": true,
998 | "task": false
999 | }
1000 | },
1001 | "outputs": [],
1002 | "source": [
1003 | "def cc_mean_ignore_missing(train_features, train_labels):\n",
1004 | " N, d = train_features.shape\n",
1005 | " \n",
1006 | " # your code here\n",
1007 | " pos = np.sum(train_labels)\n",
1008 | " neg = N - pos\n",
1009 | " train_labels = train_labels.reshape(-1,1)\n",
1010 | " \n",
1011 | " positives = train_features * train_labels\n",
1012 | " train_opps = (train_labels == 0)\n",
1013 | " pos_avgs = np.sum(positives, axis=0) / pos\n",
1014 | " pos_avgs = pos_avgs.reshape(-1,1)\n",
1015 | " \n",
1016 | " negatives = train_features * train_opps\n",
1017 | " neg_avgs = np.sum(negatives, axis=0) / neg\n",
1018 | " neg_avgs = neg_avgs.reshape(-1,1)\n",
1019 | " \n",
1020 | " mu_y = np.hstack((neg_avgs, pos_avgs))\n",
1021 | " assert mu_y.shape == (d, 2)\n",
1022 | " return mu_y"
1023 | ]
1024 | },
1025 | {
1026 | "cell_type": "code",
1027 | "execution_count": 58,
1028 | "metadata": {
1029 | "deletable": false,
1030 | "editable": false,
1031 | "nbgrader": {
1032 | "cell_type": "code",
1033 | "checksum": "af330da15d19ecdf406ae2d28bbf3a36",
1034 | "grade": true,
1035 | "grade_id": "cell-f3045a00bb2c1146",
1036 | "locked": true,
1037 | "points": 1,
1038 | "schema_version": 3,
1039 | "solution": false,
1040 | "task": false
1041 | }
1042 | },
1043 | "outputs": [],
1044 | "source": [
1045 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n",
1046 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n",
1047 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n",
1048 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n",
1049 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n",
1050 | "some_labels = np.array([0, 1, 0, 1, 0])\n",
1051 | "\n",
1052 | "some_mu_y = cc_mean_ignore_missing(some_feats, some_labels)\n",
1053 | "\n",
1054 | "assert np.array_equal(some_mu_y.round(2), np.array([[ 2.33, 4. ],\n",
1055 | " [ 96.67, 160. ],\n",
1056 | " [ 68.67, 52. ],\n",
1057 | " [ 17.33, 17.5 ],\n",
1058 | " [ 31.33, 84. ],\n",
1059 | " [ 26.77, 33.2 ],\n",
1060 | " [ 0.27, 1.5 ],\n",
1061 | " [ 27.33, 32.5 ]]))\n",
1062 | "\n",
1063 | "# Checking against the pre-computed test database\n",
1064 | "test_results = test_case_checker(cc_mean_ignore_missing, task_id=2)\n",
1065 | "assert test_results['passed'], test_results['message']"
1066 | ]
1067 | },
1068 | {
1069 | "cell_type": "code",
1070 | "execution_count": 59,
1071 | "metadata": {},
1072 | "outputs": [
1073 | {
1074 | "data": {
1075 | "text/plain": [
1076 | "array([[ 3.48641975, 4.91866029],\n",
1077 | " [109.99753086, 142.30143541],\n",
1078 | " [ 68.77037037, 70.66028708],\n",
1079 | " [ 19.51358025, 21.97129187],\n",
1080 | " [ 66.25679012, 100.55980861],\n",
1081 | " [ 30.31703704, 35.1492823 ],\n",
1082 | " [ 0.42825926, 0.55279904],\n",
1083 | " [ 31.57283951, 37.39712919]])"
1084 | ]
1085 | },
1086 | "execution_count": 59,
1087 | "metadata": {},
1088 | "output_type": "execute_result"
1089 | }
1090 | ],
1091 | "source": [
1092 | "mu_y = cc_mean_ignore_missing(train_features, train_labels)\n",
1093 | "mu_y"
1094 | ]
1095 | },
1096 | {
1097 | "cell_type": "markdown",
1098 | "metadata": {},
1099 | "source": [
1100 | "# Task 3 "
1101 | ]
1102 | },
1103 | {
1104 | "cell_type": "markdown",
1105 | "metadata": {},
1106 | "source": [
1107 | "Write a function `cc_std_ignore_missing` that takes the numpy arrays `train_features` and `train_labels` as input, and outputs the following matrix with the shape $(8,2)$, where 8 is the number of features.\n",
1108 | "\n",
1109 | "$$\\sigma_y = \\begin{bmatrix} \\text{std}[x^{(0)}|y=0] & \\text{std}[x^{(0)}|y=1]\\\\\n",
1110 | "\\text{std}[x^{(1)}|y=0] & \\text{std}[x^{(1)}|y=1] \\\\\n",
1111 | "\\cdots & \\cdots\\\\\n",
1112 | "\\text{std}[x^{(7)}|y=0] & \\text{std}[x^{(7)}|y=1]\\end{bmatrix}$$\n",
1113 | "\n",
1114 | "Some points regarding this task:\n",
1115 | "\n",
1116 | "* The `train_features` numpy array has a shape of `(N,8)` where `N` is the number of training data points, and 8 is the number of features. \n",
1117 | "\n",
1118 | "* The `train_labels` numpy array has a shape of `(N,)`. \n",
1119 | "\n",
1120 | "* **You can assume that `train_features` has no missing elements in this task**.\n",
1121 | "\n",
1122 | "* Try to avoid loops as much as possible. No loops are necessary."
1123 | ]
1124 | },
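When computing a standard deviation it matters whether the sum of squared deviations is divided by $N$ or by $N-1$. The short sketch below shows the two conventions in numpy on a made-up three-element array; `np.std` uses the divide-by-$N$ convention (`ddof=0`) unless told otherwise.

```python
import numpy as np

x = np.array([1.0, 1.0, 5.0])       # made-up values for illustration

pop_std = np.std(x)                  # ddof=0 (default): divide by N
sample_std = np.std(x, ddof=1)       # ddof=1: divide by N - 1

print(round(pop_std, 3), round(sample_std, 3))
```

The two estimates differ noticeably for small samples and converge as $N$ grows, so it is worth checking which convention a reference value assumes before comparing against it.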
1125 | {
1126 | "cell_type": "code",
1127 | "execution_count": 60,
1128 | "metadata": {
1129 | "deletable": false,
1130 | "nbgrader": {
1131 | "cell_type": "code",
1132 | "checksum": "865730f8532366280665d029bc3b2ce5",
1133 | "grade": false,
1134 | "grade_id": "cell-410ce572204e37df",
1135 | "locked": false,
1136 | "schema_version": 3,
1137 | "solution": true,
1138 | "task": false
1139 | }
1140 | },
1141 | "outputs": [],
1142 | "source": [
1143 | "def cc_std_ignore_missing(train_features, train_labels):\n",
1144 | " N, d = train_features.shape\n",
1145 | " \n",
1146 | " # your code here\n",
1147 | " positive_rows = train_labels == 1\n",
1148 | " negative_rows = train_labels == 0\n",
1149 | " positives = train_features[positive_rows,:]\n",
1150 | " negatives = train_features[negative_rows,:]\n",
1151 | "\n",
1152 | " pos = np.std(positives, axis = 0)\n",
1153 | " neg = np.std(negatives, axis = 0)\n",
1154 | " sigma_y = np.column_stack((neg, pos))\n",
1155 | " assert sigma_y.shape == (d, 2)\n",
1156 | " return sigma_y"
1157 | ]
1158 | },
1159 | {
1160 | "cell_type": "code",
1161 | "execution_count": 61,
1162 | "metadata": {
1163 | "deletable": false,
1164 | "editable": false,
1165 | "nbgrader": {
1166 | "cell_type": "code",
1167 | "checksum": "1d389f545cb36b2404ec6069b0d5801a",
1168 | "grade": true,
1169 | "grade_id": "cell-d91cfef7c658f962",
1170 | "locked": true,
1171 | "points": 1,
1172 | "schema_version": 3,
1173 | "solution": false,
1174 | "task": false
1175 | }
1176 | },
1177 | "outputs": [],
1178 | "source": [
1179 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n",
1180 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n",
1181 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n",
1182 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n",
1183 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n",
1184 | "some_labels = np.array([0, 1, 0, 1, 0])\n",
1185 | "\n",
1186 | "some_std_y = cc_std_ignore_missing(some_feats, some_labels)\n",
1187 | "\n",
1188 | "assert np.array_equal(some_std_y.round(3), np.array([[ 1.886, 4. ],\n",
1189 | " [13.768, 23. ],\n",
1190 | " [ 3.771, 12. ],\n",
1191 | " [12.499, 17.5 ],\n",
1192 | " [44.312, 84. ],\n",
1193 | " [ 1.027, 9.9 ],\n",
1194 | " [ 0.094, 0.8 ],\n",
1195 | " [ 4.497, 0.5 ]]))\n",
1196 | "\n",
1197 | "# Checking against the pre-computed test database\n",
1198 | "test_results = test_case_checker(cc_std_ignore_missing, task_id=3)\n",
1199 | "assert test_results['passed'], test_results['message']"
1200 | ]
1201 | },
1202 | {
1203 | "cell_type": "code",
1204 | "execution_count": 62,
1205 | "metadata": {},
1206 | "outputs": [
1207 | {
1208 | "data": {
1209 | "text/plain": [
1210 | "array([[ 3.1155426 , 3.75417931],\n",
1211 | " [ 25.96811899, 32.50910874],\n",
1212 | " [ 18.07540068, 21.69568568],\n",
1213 | " [ 15.02320635, 17.21685884],\n",
1214 | " [ 95.63339586, 139.24364214],\n",
1215 | " [ 7.50030986, 6.6625219 ],\n",
1216 | " [ 0.29438217, 0.37201494],\n",
1217 | " [ 11.67577435, 11.01543899]])"
1218 | ]
1219 | },
1220 | "execution_count": 62,
1221 | "metadata": {},
1222 | "output_type": "execute_result"
1223 | }
1224 | ],
1225 | "source": [
1226 | "sigma_y = cc_std_ignore_missing(train_features, train_labels)\n",
1227 | "sigma_y"
1228 | ]
1229 | },
1230 | {
1231 | "cell_type": "markdown",
1232 | "metadata": {},
1233 | "source": [
1234 | "# Task 4 "
1235 | ]
1236 | },
1237 | {
1238 | "cell_type": "markdown",
1239 | "metadata": {},
1240 | "source": [
1241 | "Write a function `log_prob` that takes the numpy arrays `train_features`, $\\mu_y$, $\\sigma_y$, and $\\log p_y$ as input, and outputs the following matrix with the shape $(N, 2)$\n",
1242 | "\n",
1243 | "$$\\log p_{x,y} = \\begin{bmatrix} \\bigg[\\log p(y=0) + \\sum_{j=0}^{7} \\log p(x_1^{(j)}|y=0) \\bigg] & \\bigg[\\log p(y=1) + \\sum_{j=0}^{7} \\log p(x_1^{(j)}|y=1) \\bigg] \\\\\n",
1244 | "\\bigg[\\log p(y=0) + \\sum_{j=0}^{7} \\log p(x_2^{(j)}|y=0) \\bigg] & \\bigg[\\log p(y=1) + \\sum_{j=0}^{7} \\log p(x_2^{(j)}|y=1) \\bigg] \\\\\n",
1245 | "\\cdots & \\cdots \\\\\n",
1246 | "\\bigg[\\log p(y=0) + \\sum_{j=0}^{7} \\log p(x_N^{(j)}|y=0) \\bigg] & \\bigg[\\log p(y=1) + \\sum_{j=0}^{7} \\log p(x_N^{(j)}|y=1) \\bigg] \\\\\n",
1247 | "\\end{bmatrix}$$\n",
1248 | "\n",
1249 | "where\n",
1250 | "* $N$ is the number of training data points.\n",
1251 | "* $x_i$ is the $i^{th}$ training data point.\n",
1252 | "\n",
1253 | "Try to avoid loops as much as possible; none are necessary."
1254 | ]
1255 | },
1256 | {
1257 | "cell_type": "markdown",
1258 | "metadata": {},
1259 | "source": [
1260 | "**Hint**: Remember that we are modelling $p(x_i^{(j)}|y)$ with a Gaussian whose parameters are defined inside $\\mu_y$ and $\\sigma_y$. Write the Gaussian PDF expression and take its natural log **on paper**, then implement it.\n",
1261 | "\n",
1262 | "**Important Note**: Do not use third-party or non-standard implementations for computing $\\log p(x_i^{(j)}|y)$. Evaluating the Gaussian PDF with a library function and then taking its log is **numerically unstable**: the PDF values can easily become so small that they underflow to zero in floating point. Taking the log of zero then yields `-inf` (numpy also emits a runtime warning), which makes the class scores meaningless. It is also unnecessary to compute and store $p(x_i^{(j)}|y)$ in order to find $\\log p(x_i^{(j)}|y)$; you can write $\\log p(x_i^{(j)}|y)$ directly as a function of $\\mu_y$, $\\sigma_y$ and the features. This direct approach is numerically stable and works even when the PDF values are far too small to be represented in floating point."
1263 | ]
1264 | },
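The note above can be illustrated with a minimal sketch on a single scalar. The function name `gaussian_log_pdf` and the numbers are illustrative assumptions, not part of the assignment's API. The formula used is the natural log of the Gaussian density, $-\tfrac{1}{2}\log(2\pi) - \log\sigma - \frac{(x-\mu)^2}{2\sigma^2}$.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    # Log of the Gaussian density, written directly (no intermediate PDF value).
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

x, mu, sigma = 50.0, 0.0, 1.0

# Naive route: evaluate the PDF first; exp(-1250) underflows to exactly 0.0.
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
with np.errstate(divide="ignore"):
    naive_log = np.log(pdf)  # log(0) gives -inf

# Direct route: the result stays finite.
stable_log = gaussian_log_pdf(x, mu, sigma)
print(pdf, naive_log, stable_log)
```

The same expression broadcasts over arrays, so it can be applied to a whole feature matrix at once without loops.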
1265 | {
1266 | "cell_type": "code",
1267 | "execution_count": 63,
1268 | "metadata": {
1269 | "deletable": false,
1270 | "nbgrader": {
1271 | "cell_type": "code",
1272 | "checksum": "335f5b8746a99280ca50e0d2dbca5375",
1273 | "grade": false,
1274 | "grade_id": "cell-773a3cddb6c45cf8",
1275 | "locked": false,
1276 | "schema_version": 3,
1277 | "solution": true,
1278 | "task": false
1279 | }
1280 | },
1281 | "outputs": [],
1282 | "source": [
1283 | "def log_prob(features, mu_y, sigma_y, log_py):\n",
1284 | " N, d = features.shape\n",
1285 | " \n",
1286 | " # your code here\n",
1287 | " part1 = np.sum(np.log(1 / (sigma_y.T * (np.sqrt(2 * np.pi)))), axis = 1)\n",
1288 | " part2_neg = np.power(features - mu_y.T[0,:],2) / (2* np.power( sigma_y.T[0,:],2)) \n",
1289 | " log_neg = np.sum(part2_neg, axis=1)\n",
1290 | " log_neg -= part1[0] + log_py[0]\n",
1291 | " part2_pos = np.power(features - mu_y.T[1,:],2) / (2* np.power( sigma_y.T[1,:],2)) \n",
1292 | " log_pos = np.sum(part2_pos, axis=1)\n",
1293 | " log_pos -= part1[1] + log_py[1]\n",
1294 | " log_p_x_y = -np.column_stack((log_neg, log_pos))\n",
1295 | " \n",
1296 | " assert log_p_x_y.shape == (N,2)\n",
1297 | " return log_p_x_y"
1298 | ]
1299 | },
1300 | {
1301 | "cell_type": "code",
1302 | "execution_count": 64,
1303 | "metadata": {
1304 | "deletable": false,
1305 | "editable": false,
1306 | "nbgrader": {
1307 | "cell_type": "code",
1308 | "checksum": "372496f883a755a2b19cd88b06cab33a",
1309 | "grade": true,
1310 | "grade_id": "cell-a8c2e2a5d88902b0",
1311 | "locked": true,
1312 | "points": 1,
1313 | "schema_version": 3,
1314 | "solution": false,
1315 | "task": false
1316 | }
1317 | },
1318 | "outputs": [],
1319 | "source": [
1320 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n",
1321 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n",
1322 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n",
1323 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n",
1324 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n",
1325 | "some_labels = np.array([0, 1, 0, 1, 0])\n",
1326 | "\n",
1327 | "some_mu_y = cc_mean_ignore_missing(some_feats, some_labels)\n",
1328 | "some_std_y = cc_std_ignore_missing(some_feats, some_labels)\n",
1329 | "some_log_py = log_prior(some_labels)\n",
1330 | "\n",
1331 | "some_log_p_x_y = log_prob(some_feats, some_mu_y, some_std_y, some_log_py)\n",
1332 | "\n",
1333 | "assert np.array_equal(some_log_p_x_y.round(3), np.array([[ -20.822, -36.606],\n",
1334 | " [ -60.879, -27.944],\n",
1335 | " [ -21.774, -295.68 ],\n",
1336 | " [-417.359, -27.944],\n",
1337 | " [ -23.2 , -42.6 ]]))\n",
1338 | "\n",
1339 | "# Checking against the pre-computed test database\n",
1340 | "test_results = test_case_checker(log_prob, task_id=4)\n",
1341 | "assert test_results['passed'], test_results['message']"
1342 | ]
1343 | },
1344 | {
1345 | "cell_type": "code",
1346 | "execution_count": 65,
1347 | "metadata": {},
1348 | "outputs": [
1349 | {
1350 | "data": {
1351 | "text/plain": [
1352 | "array([[-26.96647828, -31.00418408],\n",
1353 | " [-32.4755447 , -31.39530914],\n",
1354 | " [-27.14875996, -31.51999532],\n",
1355 | " ...,\n",
1356 | " [-26.29368771, -29.09161966],\n",
1357 | " [-28.19432943, -30.08324788],\n",
1358 | " [-26.98605248, -30.80571318]])"
1359 | ]
1360 | },
1361 | "execution_count": 65,
1362 | "metadata": {},
1363 | "output_type": "execute_result"
1364 | }
1365 | ],
1366 | "source": [
1367 | "log_p_x_y = log_prob(train_features, mu_y, sigma_y, log_py)\n",
1368 | "log_p_x_y"
1369 | ]
1370 | },
1371 | {
1372 | "cell_type": "markdown",
1373 | "metadata": {},
1374 | "source": [
1375 | "## 1.1. Writing the Simple Naive Bayes Classifier"
1376 | ]
1377 | },
1378 | {
1379 | "cell_type": "code",
1380 | "execution_count": 66,
1381 | "metadata": {},
1382 | "outputs": [],
1383 | "source": [
1384 | "class NBClassifier():\n",
1385 | "    def __init__(self, train_features, train_labels):\n",
1386 | "        self.train_features = train_features\n",
1387 | "        self.train_labels = train_labels\n",
1388 | "        self.log_py = log_prior(train_labels)\n",
1389 | "        self.mu_y = self.get_cc_means()\n",
1390 | "        self.sigma_y = self.get_cc_std()\n",
1391 | "        \n",
1392 | "    def get_cc_means(self):\n",
1393 | "        mu_y = cc_mean_ignore_missing(self.train_features, self.train_labels)\n",
1394 | "        return mu_y\n",
1395 | "        \n",
1396 | "    def get_cc_std(self):\n",
1397 | "        sigma_y = cc_std_ignore_missing(self.train_features, self.train_labels)\n",
1398 | "        return sigma_y\n",
1399 | "        \n",
1400 | "    def predict(self, features):\n",
1401 | "        log_p_x_y = log_prob(features, self.mu_y, self.sigma_y, self.log_py)\n",
1402 | "        return log_p_x_y.argmax(axis=1)"
1403 | ]
1404 | },
1405 | {
1406 | "cell_type": "code",
1407 | "execution_count": 67,
1408 | "metadata": {},
1409 | "outputs": [],
1410 | "source": [
1411 | "diabetes_classifier = NBClassifier(train_features, train_labels)\n",
1412 | "train_pred = diabetes_classifier.predict(train_features)\n",
1413 | "eval_pred = diabetes_classifier.predict(eval_features)"
1414 | ]
1415 | },
1416 | {
1417 | "cell_type": "code",
1418 | "execution_count": 68,
1419 | "metadata": {},
1420 | "outputs": [
1421 | {
1422 | "name": "stdout",
1423 | "output_type": "stream",
1424 | "text": [
1425 | "The training data accuracy of your trained model is 0.7671009771986971\n",
1426 | "The evaluation data accuracy of your trained model is 0.7532467532467533\n"
1427 | ]
1428 | }
1429 | ],
1430 | "source": [
1431 | "train_acc = (train_pred==train_labels).mean()\n",
1432 | "eval_acc = (eval_pred==eval_labels).mean()\n",
1433 | "print(f'The training data accuracy of your trained model is {train_acc}')\n",
1434 | "print(f'The evaluation data accuracy of your trained model is {eval_acc}')"
1435 | ]
1436 | },
1437 | {
1438 | "cell_type": "markdown",
1439 | "metadata": {},
1440 | "source": [
1441 | "## 1.2 Running an off-the-shelf implementation of Naive-Bayes For Comparison"
1442 | ]
1443 | },
1444 | {
1445 | "cell_type": "code",
1446 | "execution_count": 69,
1447 | "metadata": {},
1448 | "outputs": [
1449 | {
1450 | "name": "stdout",
1451 | "output_type": "stream",
1452 | "text": [
1453 | "The training data accuracy of your trained model is 0.7671009771986971\n",
1454 | "The evaluation data accuracy of your trained model is 0.7532467532467533\n"
1455 | ]
1456 | }
1457 | ],
1458 | "source": [
1459 | "from sklearn.naive_bayes import GaussianNB\n",
1460 | "gnb = GaussianNB().fit(train_features, train_labels)\n",
1461 | "train_pred_sk = gnb.predict(train_features)\n",
1462 | "eval_pred_sk = gnb.predict(eval_features)\n",
1463 | "print(f'The training data accuracy of your trained model is {(train_pred_sk == train_labels).mean()}')\n",
1464 | "print(f'The evaluation data accuracy of your trained model is {(eval_pred_sk == eval_labels).mean()}')"
1465 | ]
1466 | },
1467 | {
1468 | "cell_type": "markdown",
1469 | "metadata": {},
1470 | "source": [
1471 | "# Part 2 (Building a Naive Bayes Classifier Considering Missing Entries)"
1472 | ]
1473 | },
1474 | {
1475 | "cell_type": "markdown",
1476 | "metadata": {},
1477 | "source": [
1478 | "In this part, we will modify some of the parameter inference functions of the Naive Bayes classifier so that they ignore NaN entries when inferring the Gaussian means and standard deviations."
1479 | ]
1480 | },
1481 | {
1482 | "cell_type": "markdown",
1483 | "metadata": {},
1484 | "source": [
1485 | "# Task 5 "
1486 | ]
1487 | },
1488 | {
1489 | "cell_type": "markdown",
1490 | "metadata": {},
1491 | "source": [
1492 | "Write a function `cc_mean_consider_missing` that\n",
1493 | "* has exactly the same input and output types as the `cc_mean_ignore_missing` function,\n",
1494 | "* and has similar functionality to `cc_mean_ignore_missing` except that it can handle and ignore the NaN entries when computing the class conditional means.\n",
1495 | "\n",
1496 | "You can borrow most of the code from your `cc_mean_ignore_missing` implementation, but you should make it compatible with the existence of NaN values in the features.\n",
1497 | "\n",
1498 | "Try to avoid loops as much as possible; none are necessary."
1499 | ]
1500 | },
1501 | {
1502 | "cell_type": "markdown",
1503 | "metadata": {},
1504 | "source": [
1505 | "* **Hint**: You may find the `np.nanmean` function useful."
1506 | ]
1507 | },
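As a small illustration of the hint, the sketch below contrasts `np.mean` with `np.nanmean` on a toy array. The array `toy` is made up for this example and is unrelated to the assignment data.

```python
import numpy as np

# Toy feature matrix with one missing entry in each column.
toy = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [np.nan, 8.0]])

plain_mean = np.mean(toy, axis=0)         # NaN propagates into every column
nan_aware_mean = np.nanmean(toy, axis=0)  # NaN entries are skipped per column
print(plain_mean, nan_aware_mean)
```

Each column is averaged only over its non-missing entries, so the effective sample size can differ from one feature to the next.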
1508 | {
1509 | "cell_type": "code",
1510 | "execution_count": 70,
1511 | "metadata": {
1512 | "deletable": false,
1513 | "nbgrader": {
1514 | "cell_type": "code",
1515 | "checksum": "ed57e96c9d1d8044ce805a98adaacbbb",
1516 | "grade": false,
1517 | "grade_id": "cell-6ab8c367d427a588",
1518 | "locked": false,
1519 | "schema_version": 3,
1520 | "solution": true,
1521 | "task": false
1522 | }
1523 | },
1524 | "outputs": [],
1525 | "source": [
1526 | "def cc_mean_consider_missing(train_features_with_nans, train_labels):\n",
1527 | " N, d = train_features_with_nans.shape\n",
1528 | " \n",
1529 | " # your code here\n",
1530 | " train_labels = train_labels.T\n",
1531 | " pos_train_labels = train_labels == 1 \n",
1532 | " neg_train_labels = train_labels == 0\n",
1533 | " positives = train_features_with_nans[pos_train_labels, :]\n",
1534 | " pos_mean = np.nanmean(positives, axis=0) \n",
1535 | " pos_mean = pos_mean.reshape(-1,1)\n",
1536 | " negatives = train_features_with_nans[neg_train_labels, :]\n",
1537 | " neg_mean = np.nanmean(negatives, axis=0) \n",
1538 | " neg_mean = neg_mean.reshape(-1,1)\n",
1539 | " mu_y = np.hstack((neg_mean, pos_mean))\n",
1540 | " \n",
1541 | " assert not np.isnan(mu_y).any()\n",
1542 | " assert mu_y.shape == (d, 2)\n",
1543 | " return mu_y"
1544 | ]
1545 | },
1546 | {
1547 | "cell_type": "code",
1548 | "execution_count": 71,
1549 | "metadata": {
1550 | "deletable": false,
1551 | "editable": false,
1552 | "nbgrader": {
1553 | "cell_type": "code",
1554 | "checksum": "ec2c7c4cbb59a66bc04e3afcdd1d7701",
1555 | "grade": true,
1556 | "grade_id": "cell-b340557154da9804",
1557 | "locked": true,
1558 | "points": 1,
1559 | "schema_version": 3,
1560 | "solution": false,
1561 | "task": false
1562 | }
1563 | },
1564 | "outputs": [],
1565 | "source": [
1566 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n",
1567 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n",
1568 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n",
1569 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n",
1570 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n",
1571 | "some_labels = np.array([0, 1, 0, 1, 0])\n",
1572 | "\n",
1573 | "for i,j in [(0,0), (1,1), (2,3), (3,4), (4, 2)]:\n",
1574 | " some_feats[i,j] = np.nan\n",
1575 | "\n",
1576 | "some_mu_y = cc_mean_consider_missing(some_feats, some_labels)\n",
1577 | "\n",
1578 | "assert np.array_equal(some_mu_y.round(2), np.array([[ 3. , 4. ],\n",
1579 | " [ 96.67, 137. ],\n",
1580 | " [ 66. , 52. ],\n",
1581 | " [ 14.5 , 17.5 ],\n",
1582 | " [ 31.33, 0. ],\n",
1583 | " [ 26.77, 33.2 ],\n",
1584 | " [ 0.27, 1.5 ],\n",
1585 | " [ 27.33, 32.5 ]]))\n",
1586 | "\n",
1587 | "# Checking against the pre-computed test database\n",
1588 | "test_results = test_case_checker(cc_mean_consider_missing, task_id=5)\n",
1589 | "assert test_results['passed'], test_results['message']"
1590 | ]
1591 | },
1592 | {
1593 | "cell_type": "code",
1594 | "execution_count": 72,
1595 | "metadata": {},
1596 | "outputs": [
1597 | {
1598 | "data": {
1599 | "text/plain": [
1600 | "array([[ 3.48641975, 4.91866029],\n",
1601 | " [109.99753086, 142.30143541],\n",
1602 | " [ 71.41538462, 75.34693878],\n",
1603 | " [ 27.53658537, 32.11188811],\n",
1604 | " [ 66.25679012, 100.55980861],\n",
1605 | " [ 30.85025126, 35.31826923],\n",
1606 | " [ 0.42825926, 0.55279904],\n",
1607 | " [ 31.57283951, 37.39712919]])"
1608 | ]
1609 | },
1610 | "execution_count": 72,
1611 | "metadata": {},
1612 | "output_type": "execute_result"
1613 | }
1614 | ],
1615 | "source": [
1616 | "mu_y = cc_mean_consider_missing(train_features_with_nans, train_labels)\n",
1617 | "mu_y"
1618 | ]
1619 | },
1620 | {
1621 | "cell_type": "markdown",
1622 | "metadata": {},
1623 | "source": [
1624 | "# Task 6 "
1625 | ]
1626 | },
1627 | {
1628 | "cell_type": "markdown",
1629 | "metadata": {},
1630 | "source": [
1631 | "Write a function `cc_std_consider_missing` that\n",
1632 | "* has exactly the same input and output types as the `cc_std_ignore_missing` function,\n",
1633 | "* and has similar functionality to `cc_std_ignore_missing` except that it can handle and ignore the NaN entries when computing the class conditional standard deviations.\n",
1634 | "\n",
1635 | "You can borrow most of the code from your `cc_std_ignore_missing` implementation, but you should make it compatible with the existence of NaN values in the features.\n",
1636 | "\n",
1637 | "Try to avoid loops as much as possible; none are necessary."
1638 | ]
1639 | },
1640 | {
1641 | "cell_type": "markdown",
1642 | "metadata": {},
1643 | "source": [
1644 | "* **Hint**: You may find the `np.nanstd` function useful."
1645 | ]
1646 | },
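One detail worth checking when using the hint is which normalization `np.nanstd` applies. The sketch below uses a made-up column `col` and shows that the default (`ddof=0`) is the population standard deviation over the non-NaN entries, the same convention as `np.std`. The reference values in the checks of this notebook appear consistent with that default.

```python
import numpy as np

col = np.array([2.0, 4.0, np.nan, 6.0])

pop_std = np.nanstd(col)             # ddof=0: sqrt(((2-4)^2 + 0 + (6-4)^2) / 3)
sample_std = np.nanstd(col, ddof=1)  # ddof=1: sqrt(8 / 2)
same_as_std = np.std(col[~np.isnan(col)])  # np.std on the observed entries only
print(pop_std, sample_std, same_as_std)
```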
1647 | {
1648 | "cell_type": "code",
1649 | "execution_count": 73,
1650 | "metadata": {
1651 | "deletable": false,
1652 | "nbgrader": {
1653 | "cell_type": "code",
1654 | "checksum": "3001dfb41f62e3925b7edde741bb1776",
1655 | "grade": false,
1656 | "grade_id": "cell-927753c6215c5646",
1657 | "locked": false,
1658 | "schema_version": 3,
1659 | "solution": true,
1660 | "task": false
1661 | }
1662 | },
1663 | "outputs": [],
1664 | "source": [
1665 | "def cc_std_consider_missing(train_features_with_nans, train_labels):\n",
1666 | " N, d = train_features_with_nans.shape\n",
1667 | " \n",
1668 | " # your code here\n",
1669 | " positive_rows = train_labels == 1\n",
1670 | " negative_rows = train_labels == 0\n",
1671 | " positives = train_features_with_nans[positive_rows,:]\n",
1672 | " negatives = train_features_with_nans[negative_rows,:]\n",
1673 | "\n",
1674 | " pos = np.nanstd(positives, axis = 0)\n",
1675 | " neg = np.nanstd(negatives, axis = 0)\n",
1676 | " sigma_y = np.column_stack((neg, pos))\n",
1677 | " \n",
1678 | " assert not np.isnan(sigma_y).any()\n",
1679 | " assert sigma_y.shape == (d, 2)\n",
1680 | " return sigma_y"
1681 | ]
1682 | },
1683 | {
1684 | "cell_type": "code",
1685 | "execution_count": 74,
1686 | "metadata": {
1687 | "deletable": false,
1688 | "editable": false,
1689 | "nbgrader": {
1690 | "cell_type": "code",
1691 | "checksum": "2c8b25848847a54241cfbaba98b8d83d",
1692 | "grade": true,
1693 | "grade_id": "cell-d67179c6dea81502",
1694 | "locked": true,
1695 | "points": 1,
1696 | "schema_version": 3,
1697 | "solution": false,
1698 | "task": false
1699 | }
1700 | },
1701 | "outputs": [],
1702 | "source": [
1703 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n",
1704 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n",
1705 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n",
1706 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n",
1707 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n",
1708 | "some_labels = np.array([0, 1, 0, 1, 0])\n",
1709 | "\n",
1710 | "for i,j in [(0,0), (1,1), (2,3), (3,4), (4, 2)]:\n",
1711 | " some_feats[i,j] = np.nan\n",
1712 | "\n",
1713 | "some_std_y = cc_std_consider_missing(some_feats, some_labels)\n",
1714 | "\n",
1715 | "assert np.array_equal(some_std_y.round(2), np.array([[ 2. , 4. ],\n",
1716 | " [13.77, 0. ],\n",
1717 | " [ 0. , 12. ],\n",
1718 | " [14.5 , 17.5 ],\n",
1719 | " [44.31, 0. ],\n",
1720 | " [ 1.03, 9.9 ],\n",
1721 | " [ 0.09, 0.8 ],\n",
1722 | " [ 4.5 , 0.5 ]]))\n",
1723 | "\n",
1724 | "# Checking against the pre-computed test database\n",
1725 | "test_results = test_case_checker(cc_std_consider_missing, task_id=6)\n",
1726 | "assert test_results['passed'], test_results['message']"
1727 | ]
1728 | },
1729 | {
1730 | "cell_type": "code",
1731 | "execution_count": 75,
1732 | "metadata": {},
1733 | "outputs": [
1734 | {
1735 | "data": {
1736 | "text/plain": [
1737 | "array([[ 3.1155426 , 3.75417931],\n",
1738 | " [ 25.96811899, 32.50910874],\n",
1739 | " [ 12.26342359, 12.1982786 ],\n",
1740 | " [ 9.87753687, 10.37284304],\n",
1741 | " [ 95.63339586, 139.24364214],\n",
1742 | " [ 6.38703834, 6.21564813],\n",
1743 | " [ 0.29438217, 0.37201494],\n",
1744 | " [ 11.67577435, 11.01543899]])"
1745 | ]
1746 | },
1747 | "execution_count": 75,
1748 | "metadata": {},
1749 | "output_type": "execute_result"
1750 | }
1751 | ],
1752 | "source": [
1753 | "sigma_y = cc_std_consider_missing(train_features_with_nans, train_labels)\n",
1754 | "sigma_y"
1755 | ]
1756 | },
1757 | {
1758 | "cell_type": "markdown",
1759 | "metadata": {},
1760 | "source": [
1761 | "## 2.1. Writing the Naive Bayes Classifier With Missing Data Handling"
1762 | ]
1763 | },
1764 | {
1765 | "cell_type": "code",
1766 | "execution_count": 76,
1767 | "metadata": {},
1768 | "outputs": [],
1769 | "source": [
1770 | "class NBClassifierWithMissing(NBClassifier):\n",
1771 | "    def get_cc_means(self):\n",
1772 | "        mu_y = cc_mean_consider_missing(self.train_features, self.train_labels)\n",
1773 | "        return mu_y\n",
1774 | "        \n",
1775 | "    def get_cc_std(self):\n",
1776 | "        sigma_y = cc_std_consider_missing(self.train_features, self.train_labels)\n",
1777 | "        return sigma_y"
1778 | ]
1779 | },
1780 | {
1781 | "cell_type": "code",
1782 | "execution_count": 77,
1783 | "metadata": {},
1784 | "outputs": [],
1785 | "source": [
1786 | "diabetes_classifier_nans = NBClassifierWithMissing(train_features_with_nans, train_labels)\n",
1787 | "train_pred = diabetes_classifier_nans.predict(train_features_with_nans)\n",
1788 | "eval_pred = diabetes_classifier_nans.predict(eval_features_with_nans)"
1789 | ]
1790 | },
1791 | {
1792 | "cell_type": "code",
1793 | "execution_count": 78,
1794 | "metadata": {},
1795 | "outputs": [
1796 | {
1797 | "name": "stdout",
1798 | "output_type": "stream",
1799 | "text": [
1800 | "The training data accuracy of your trained model is 0.7182410423452769\n",
1801 | "The evaluation data accuracy of your trained model is 0.7142857142857143\n"
1802 | ]
1803 | }
1804 | ],
1805 | "source": [
1806 | "train_acc = (train_pred==train_labels).mean()\n",
1807 | "eval_acc = (eval_pred==eval_labels).mean()\n",
1808 | "print(f'The training data accuracy of your trained model is {train_acc}')\n",
1809 | "print(f'The evaluation data accuracy of your trained model is {eval_acc}')"
1810 | ]
1811 | },
1812 | {
1813 | "cell_type": "markdown",
1814 | "metadata": {},
1815 | "source": [
1816 | "# 3. Running SVMlight"
1817 | ]
1818 | },
1819 | {
1820 | "cell_type": "markdown",
1821 | "metadata": {},
1822 | "source": [
1823 | "In this section, we are going to investigate the support vector machine classification method. We will become familiar with this classification method in week 3. However, in this section, we are just going to observe how this method performs to set the stage for the third week.\n",
1824 | "\n",
1825 | "`SVMlight` (http://svmlight.joachims.org/) is a famous implementation of the SVM classifier. \n",
1826 | "\n",
1827 | "`SVMlight` can be called from a shell terminal, and there is no convenient wrapper for it in Python 3. Therefore:\n",
1828 | "1. We have to export the training data to a special format called `svmlight/libsvm`. This can be done using scikit-learn.\n",
1829 | "2. We have to run the `svm_learn` program to learn the model and then store it.\n",
1830 | "3. We have to import the model back to python."
1831 | ]
1832 | },
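To give a feel for step 1, here is a minimal sketch of what a line in the `svmlight/libsvm` text format looks like: a label followed by `index:value` pairs. The helper `to_svmlight_line` is hypothetical and written only for illustration. The notebook itself relies on scikit-learn's `dump_svmlight_file`, whose exact number formatting may differ. The sketch assumes 1-based feature indices (as with `zero_based=False`) and the sparse convention of omitting zero entries, and it also shows the `2*y-1` mapping from `{0, 1}` labels to `{-1, +1}`.

```python
def to_svmlight_line(label, features):
    # 1-based feature indices; zero-valued features are omitted (sparse format).
    pairs = [f"{j}:{v:g}" for j, v in enumerate(features, start=1) if v != 0]
    return " ".join([f"{label:d}"] + pairs)

labels01 = [0, 1]
rows = [[0.0, 2.5, 0.0, 1.0],
        [3.0, 0.0, 0.0, 0.5]]

# Map {0, 1} labels to {-1, +1} before writing, as SVM tools usually expect.
lines = [to_svmlight_line(2 * y - 1, row) for y, row in zip(labels01, rows)]
print("\n".join(lines))
```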
1833 | {
1834 | "cell_type": "markdown",
1835 | "metadata": {},
1836 | "source": [
1837 | "## 3.1 Exporting the training data to libsvm format"
1838 | ]
1839 | },
1840 | {
1841 | "cell_type": "code",
1842 | "execution_count": 79,
1843 | "metadata": {},
1844 | "outputs": [],
1845 | "source": [
1846 | "from sklearn.datasets import dump_svmlight_file\n",
1847 | "dump_svmlight_file(train_features, 2*train_labels-1, 'training_feats.data', \n",
1848 | " zero_based=False, comment=None, query_id=None, multilabel=False)"
1849 | ]
1850 | },
1851 | {
1852 | "cell_type": "markdown",
1853 | "metadata": {},
1854 | "source": [
1855 | "## 3.2 Training `SVMlight`"
1856 | ]
1857 | },
1858 | {
1859 | "cell_type": "code",
1860 | "execution_count": 80,
1861 | "metadata": {},
1862 | "outputs": [
1863 | {
1864 | "name": "stdout",
1865 | "output_type": "stream",
1866 | "text": [
1867 | "Scanning examples...done\n",
1868 | "Reading examples into memory...100..200..300..400..500..600..OK. (614 examples read)\n",
1869 | "Setting default regularization parameter C=0.0000\n",
1870 | "Optimizingdone. (1781 iterations)\n",
1871 | "Optimization finished (141 misclassified, maxdiff=0.00099).\n",
1872 | "Runtime in cpu-seconds: 0.19\n",
1873 | "Number of SV: 375 (including 369 at upper bound)\n",
1874 | "L1 loss: loss=335.23204\n",
1875 | "Norm of weight vector: |w|=0.03179\n",
1876 | "Norm of longest example vector: |x|=871.75350\n",
1877 | "Estimated VCdim of classifier: VCdim<=769.24695\n",
1878 | "Computing XiAlpha-estimates...done\n",
1879 | "Runtime for XiAlpha-estimates in cpu-seconds: 0.00\n",
1880 | "XiAlpha-estimate of the error: error<=60.75% (rho=1.00,depth=0)\n",
1881 | "XiAlpha-estimate of the recall: recall=>10.53% (rho=1.00,depth=0)\n",
1882 | "XiAlpha-estimate of the precision: precision=>10.58% (rho=1.00,depth=0)\n",
1883 | "Number of kernel evaluations: 71356\n",
1884 | "Writing model file...done\n",
1885 | "\n"
1886 | ]
1887 | }
1888 | ],
1889 | "source": [
1890 | "from subprocess import Popen, PIPE\n",
1891 | "process = Popen([\"./svmlight/svm_learn\", \"./training_feats.data\", \"svm_model.txt\"], stdout=PIPE, stderr=PIPE)\n",
1892 | "stdout, stderr = process.communicate()\n",
1893 | "print(stdout.decode(\"utf-8\"))"
1894 | ]
1895 | },
1896 | {
1897 | "cell_type": "markdown",
1898 | "metadata": {},
1899 | "source": [
1900 | "## 3.3 Importing the SVM Model"
1901 | ]
1902 | },
1903 | {
1904 | "cell_type": "code",
1905 | "execution_count": 81,
1906 | "metadata": {},
1907 | "outputs": [],
1908 | "source": [
1909 | "from svm2weight import get_svmlight_weights\n",
1910 | "svm_weights, thresh = get_svmlight_weights('svm_model.txt', printOutput=False)\n",
1911 | "\n",
1912 | "def svmlight_classifier(train_features):\n",
1913 | " return (train_features @ svm_weights - thresh).reshape(-1) >= 0."
1914 | ]
1915 | },
1916 | {
1917 | "cell_type": "code",
1918 | "execution_count": 82,
1919 | "metadata": {},
1920 | "outputs": [],
1921 | "source": [
1922 | "train_pred = svmlight_classifier(train_features)\n",
1923 | "eval_pred = svmlight_classifier(eval_features)"
1924 | ]
1925 | },
1926 | {
1927 | "cell_type": "code",
1928 | "execution_count": 83,
1929 | "metadata": {},
1930 | "outputs": [
1931 | {
1932 | "name": "stdout",
1933 | "output_type": "stream",
1934 | "text": [
1935 | "The training data accuracy of your trained model is 0.7703583061889251\n",
1936 | "The evaluation data accuracy of your trained model is 0.7402597402597403\n"
1937 | ]
1938 | }
1939 | ],
1940 | "source": [
1941 | "train_acc = (train_pred==train_labels).mean()\n",
1942 | "eval_acc = (eval_pred==eval_labels).mean()\n",
1943 | "print(f'The training data accuracy of your trained model is {train_acc}')\n",
1944 | "print(f'The evaluation data accuracy of your trained model is {eval_acc}')"
1945 | ]
1946 | },
1947 | {
1948 | "cell_type": "code",
1949 | "execution_count": null,
1950 | "metadata": {},
1951 | "outputs": [],
1952 | "source": []
1953 | },
1954 | {
1955 | "cell_type": "code",
1956 | "execution_count": null,
1957 | "metadata": {},
1958 | "outputs": [],
1959 | "source": []
1960 | }
1961 | ],
1962 | "metadata": {
1963 | "kernelspec": {
1964 | "display_name": "Python 3",
1965 | "language": "python",
1966 | "name": "python3"
1967 | },
1968 | "language_info": {
1969 | "codemirror_mode": {
1970 | "name": "ipython",
1971 | "version": 3
1972 | },
1973 | "file_extension": ".py",
1974 | "mimetype": "text/x-python",
1975 | "name": "python",
1976 | "nbconvert_exporter": "python",
1977 | "pygments_lexer": "ipython3",
1978 | "version": "3.7.6"
1979 | }
1980 | },
1981 | "nbformat": 4,
1982 | "nbformat_minor": 4
1983 | }
1984 |
--------------------------------------------------------------------------------
/MeanField.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 3,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "%matplotlib inline\n",
10 | "%load_ext autoreload\n",
11 | "%autoreload 2\n",
12 | "\n",
13 | "import matplotlib.pyplot as plt\n",
14 | "import numpy as np\n",
15 | "import os\n",
16 | "import pandas as pd\n",
17 | "\n",
18 | "from scipy.special import expit\n",
19 | "\n",
20 | "from utils import test_case_checker, perform_computation, show_test_cases"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "# 0. Data "
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "Since the MNIST data (http://yann.lecun.com/exdb/mnist/) is stored in a binary format, we would rather have an API handle the loading for us. \n",
35 | "\n",
36 | "PyTorch (https://pytorch.org/) is an automatic differentiation library that we may see and use later in the course. \n",
37 | "\n",
38 | "Torchvision (https://pytorch.org/docs/stable/torchvision/index.html?highlight=torchvision#module-torchvision) is an extension library for pytorch that can load many of the famous data sets painlessly. \n",
39 | "\n",
40 | "We already used Torchvision for downloading the MNIST data. It is stored in a numpy array file that we will load easily."
41 | ]
42 | },
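The "numpy array file" mentioned above is an `.npz` archive. As a minimal sketch of how such a file round-trips, the example below saves two made-up arrays under the hypothetical file name `toy.npz` in a temporary directory and reads them back by key.

```python
import os
import tempfile

import numpy as np

images = np.arange(12, dtype=np.uint8).reshape(3, 2, 2)  # three tiny 2x2 "images"
labels = np.array([0, 1, 2])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.npz")
    # Keyword names become the keys used to look the arrays up later.
    np.savez(path, images=images, labels=labels)
    with np.load(path) as npzfile:
        loaded_images = npzfile["images"]
        loaded_labels = npzfile["labels"]

print(loaded_images.shape, loaded_labels)
```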
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "## 0.1 Loading the Data"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 4,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": [
56 | "if os.path.exists('mnist.npz'):\n",
57 | " npzfile = np.load('mnist.npz')\n",
58 | " train_images_raw = npzfile['train_images_raw']\n",
59 | " train_labels = npzfile['train_labels']\n",
60 | " eval_images_raw = npzfile['eval_images_raw']\n",
61 | " eval_labels = npzfile['eval_labels']\n",
62 | "else:\n",
63 | " import torchvision\n",
64 | " download_ = not os.path.exists('./mnist')\n",
65 | " data_train = torchvision.datasets.MNIST('mnist', train=True, transform=None, target_transform=None, download=download_)\n",
66 | " data_eval = torchvision.datasets.MNIST('mnist', train=False, transform=None, target_transform=None, download=download_)\n",
67 | "\n",
68 | " train_images_raw = data_train.data.numpy()\n",
69 | " train_labels = data_train.targets.numpy()\n",
70 | " eval_images_raw = data_eval.data.numpy()\n",
71 | " eval_labels = data_eval.targets.numpy()\n",
72 | "\n",
73 | " np.savez('mnist.npz', train_images_raw=train_images_raw, train_labels=train_labels, \n",
74 | " eval_images_raw=eval_images_raw, eval_labels=eval_labels) "
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 5,
80 | "metadata": {},
81 | "outputs": [],
82 | "source": [
83 | "noise_flip_prob = 0.04"
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {},
89 | "source": [
90 | "# Task 1 "
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "Write the function `get_thresholded_and_noised` that thresholds an image and flips some of its pixels. More specifically, this function should apply exactly the following three steps, in order:\n",
98 | "\n",
99 | "1. **Thresholding**: First, given the input threshold argument, you must compute a thresholded image array. This array should indicate whether each element of `images_raw` is **greater than or equal to** the `threshold` argument. We will call the result of this step the thresholded image.\n",
100 | "2. **Noise Application (i.e., Flipping Pixels)**: After the image has been thresholded, you should use the `flip_flags` input argument and flip the pixels with a corresponding `True` entry in `flip_flags`. \n",
101 | "\n",
102 | "    * `flip_flags` mostly consists of `False` entries, which means you should not change their corresponding pixels. Whenever a pixel has a `True` entry in `flip_flags`, that pixel in the thresholded image must be flipped. This way you will obtain the noised image.\n",
103 | "3. **Mapping Pixels to -1/+1**: You need to make sure the output image pixels are mapped to -1 and 1 values (as opposed to 0/1 or True/False).\n",
104 | "\n",
105 | "`get_thresholded_and_noised` should take the following arguments:\n",
106 | "\n",
107 | "1. `images_raw`: A numpy array. Do not assume anything about its shape, dtype or range of values. Your function should not depend on these attributes.\n",
108 | "2. `threshold`: A scalar value.\n",
109 | "3. `flip_flags`: A numpy array with the same shape as `images_raw` and a boolean dtype (`np.bool_`). This array indicates whether each pixel should be flipped or not.\n",
110 | "\n",
111 | "and return the following:\n",
112 | "\n",
113 | "* `mapped_noised_image`: A numpy array with the same shape as `images_raw`. This array's entries should either be -1 or 1."
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": 21,
119 | "metadata": {
120 | "deletable": false,
121 | "nbgrader": {
122 | "cell_type": "code",
123 | "checksum": "d7a43224809fd8d0963612527c0b97c7",
124 | "grade": false,
125 | "grade_id": "cell-8537fe703ac9bd5d",
126 | "locked": false,
127 | "schema_version": 3,
128 | "solution": true,
129 | "task": false
130 | }
131 | },
132 | "outputs": [],
133 | "source": [
134 | "def get_thresholded_and_noised(images_raw, threshold, flip_flags):\n",
135 | " \n",
136 | " # your code here\n",
137 | " mapped_noised_image = np.where(np.logical_xor(images_raw >= threshold, flip_flags), 1, -1)\n",
138 | " \n",
139 | " assert (np.abs(mapped_noised_image)==1).all()\n",
140 | " return mapped_noised_image.astype(np.int32)"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 22,
146 | "metadata": {
147 | "deletable": false,
148 | "editable": false,
149 | "nbgrader": {
150 | "cell_type": "code",
151 | "checksum": "9d437b2f579e514e2afb64f057b9b7cc",
152 | "grade": true,
153 | "grade_id": "cell-a93db968174effe4",
154 | "locked": true,
155 | "points": 0.2,
156 | "schema_version": 3,
157 | "solution": false,
158 | "task": false
159 | }
160 | },
161 | "outputs": [
162 | {
163 | "name": "stdout",
164 | "output_type": "stream",
165 | "text": [
166 | "The reference and solution images are the same to a T! Well done on this test case.\n"
167 | ]
168 | },
169 | {
170 | "data": {
171 | "image/png": "\n",
172 | "text/plain": [
173 | ""
174 | ]
175 | },
176 | "metadata": {
177 | "needs_background": "light"
178 | },
179 | "output_type": "display_data"
180 | },
181 | {
182 | "name": "stdout",
183 | "output_type": "stream",
184 | "text": [
185 | " Enter nothing to go to the next image\n",
186 | "or\n",
187 | " Enter \"s\" when you are done to recieve the three images. \n",
188 | " **Don't forget to do this before continuing to the next step.**\n",
189 | "s\n"
190 | ]
191 | }
192 | ],
193 | "source": [
194 | "\n",
195 | "def test_thresh_noise(x, seed = 12345, p = noise_flip_prob, threshold = 128): \n",
196 | " np_random = np.random.RandomState(seed=seed)\n",
197 | " flip_flags = (np_random.uniform(0., 1., size=x.shape) < p)\n",
198 | " return get_thresholded_and_noised(x, threshold, flip_flags)\n",
199 | "\n",
200 | "(orig_image, ref_image, test_im, success_thr) = show_test_cases(test_thresh_noise, task_id='1_V')\n",
201 | "\n",
202 | "assert success_thr"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": 23,
208 | "metadata": {
209 | "deletable": false,
210 | "editable": false,
211 | "nbgrader": {
212 | "cell_type": "code",
213 | "checksum": "9ddfa2bdbc3cdfaccac1c378121cc61d",
214 | "grade": true,
215 | "grade_id": "cell-cad4a05d0f97d19d",
216 | "locked": true,
217 | "points": 0.8,
218 | "schema_version": 3,
219 | "solution": false,
220 | "task": false
221 | }
222 | },
223 | "outputs": [],
224 | "source": [
225 | "# Checking against the pre-computed test database\n",
226 | "test_results = test_case_checker(get_thresholded_and_noised, task_id=1)\n",
227 | "assert test_results['passed'], test_results['message']"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "## 0.2 Applying Thresholding and Noise to Data"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": 24,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": [
243 | "if perform_computation:\n",
244 | " X_true_grayscale = train_images_raw[:10, :, :]\n",
245 | "\n",
246 | " np_random = np.random.RandomState(seed=12345)\n",
247 | " flip_flags = (np_random.uniform(0., 1., size=X_true_grayscale.shape) < noise_flip_prob)\n",
248 | " initial_pi = np_random.uniform(0, 1, size=X_true_grayscale.shape) # Initial Random Pi values\n",
249 | "\n",
250 | " X_true = get_thresholded_and_noised(X_true_grayscale, threshold=128, flip_flags=flip_flags * 0)\n",
251 | " X_noised = get_thresholded_and_noised(X_true_grayscale, threshold=128, flip_flags=flip_flags)"
252 | ]
253 | },
254 | {
255 | "cell_type": "markdown",
256 | "metadata": {},
257 | "source": [
258 | "# Task 2 "
259 | ]
260 | },
261 | {
262 | "cell_type": "markdown",
263 | "metadata": {},
264 | "source": [
265 | "Write a function named `sigmoid_2x` that, given a variable $X$, computes the following:\n",
266 | "\n",
267 | "$$f(X) := \\frac{\\exp(X)}{\\exp(X) + \\exp(-X)}$$\n",
268 | "\n",
269 | "The input argument is a numpy array $X$, which could have any shape. Your output array must have the same shape as $X$.\n",
270 | "\n",
271 | "**Important Note**: Theoretically, $f$ satisfies the following equations:\n",
272 | "\n",
273 | "$$\\lim_{X\\rightarrow +\\infty} f(X) = 1$$\n",
274 | "$$\\lim_{X\\rightarrow -\\infty} f(X) = 0$$\n",
275 | "\n",
276 | "Your implementation must also work correctly even on these extreme edge cases. In other words, you must satisfy the following tests.\n",
277 | "* `sigmoid_2x(np.inf)==1` \n",
278 | "* `sigmoid_2x(-np.inf)==0`.\n",
279 | "\n",
280 | "**Hint**: You may find `scipy.special.expit` useful, since $f(X) = \\frac{1}{1 + \\exp(-2X)}$."
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": 25,
286 | "metadata": {
287 | "deletable": false,
288 | "nbgrader": {
289 | "cell_type": "code",
290 | "checksum": "9467e4711dfb00723c4a04f641d3a4bb",
291 | "grade": false,
292 | "grade_id": "cell-baba53895e886588",
293 | "locked": false,
294 | "schema_version": 3,
295 | "solution": true,
296 | "task": false
297 | }
298 | },
299 | "outputs": [],
300 | "source": [
301 | "def sigmoid_2x(X):\n",
302 | " \n",
303 | " # your code here\n",
304 | " output = expit(2*X)\n",
305 | " \n",
306 | " return output"
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": 26,
312 | "metadata": {
313 | "deletable": false,
314 | "editable": false,
315 | "nbgrader": {
316 | "cell_type": "code",
317 | "checksum": "05703316807b847b94777d150f813ef1",
318 | "grade": true,
319 | "grade_id": "cell-4e87b3b9548c3052",
320 | "locked": true,
321 | "points": 1,
322 | "schema_version": 3,
323 | "solution": false,
324 | "task": false
325 | }
326 | },
327 | "outputs": [],
328 | "source": [
329 | "assert sigmoid_2x(+np.inf) == 1.\n",
330 | "assert sigmoid_2x(-np.inf) == 0.\n",
331 | "assert np.array_equal(sigmoid_2x(np.array([0, 1])).round(3), np.array([0.5, 0.881]))\n",
332 | "\n",
333 | "\n",
334 | "# Checking against the pre-computed test database\n",
335 | "test_results = test_case_checker(sigmoid_2x, task_id=2)\n",
336 | "assert test_results['passed'], test_results['message']"
337 | ]
338 | },
339 | {
340 | "cell_type": "markdown",
341 | "metadata": {},
342 | "source": [
343 | "# 1. Applying Mean-field Approximation to the Boltzmann Machine's Variational Inference Problem"
344 | ]
345 | },
346 | {
347 | "cell_type": "markdown",
348 | "metadata": {},
349 | "source": [
350 | "# Task 3 "
351 | ]
352 | },
353 | {
354 | "cell_type": "markdown",
355 | "metadata": {},
356 | "source": [
357 | "Write a `boltzman_meanfield` function that applies the mean-field approximation to the Boltzmann machine. \n",
358 | "\n",
359 | "Recalling the textbook notation, $X_i$ is the observed value of pixel $i$, and $H_i$ is the true value of pixel $i$ (before applying noise). For instance, if we have a $3 \\times 3$ image, the corresponding Boltzmann machine looks like this: \n",
360 | "\n",
361 | "```\n",
362 | "    X_1        X_2        X_3\n",
363 | "   /          /          /\n",
364 | " H_1 ------ H_2 ------ H_3\n",
365 | "  |          |          |\n",
366 | "  |          |          |\n",
367 | "  |          |          |\n",
368 | "  | X_4      | X_5      | X_6\n",
369 | "  |/         |/         |/\n",
370 | " H_4 ------ H_5 ------ H_6\n",
371 | "  |          |          |\n",
372 | "  |          |          |\n",
373 | "  |          |          |\n",
374 | "  | X_7      | X_8      | X_9\n",
375 | "  |/         |/         |/\n",
376 | " H_7 ------ H_8 ------ H_9\n",
377 | "``` \n",
378 | "\n",
379 | "Here, we adopt a notation that is slightly simplified from the textbook's and define $\\mathcal{N}(i)$ to be the neighbors of pixel $i$ (the pixels adjacent to pixel $i$). For instance, in the above figure, we have $\\mathcal{N}(1) = \\{2,4\\}$, $\\mathcal{N}(2) = \\{1,3,5\\}$, and $\\mathcal{N}(5) = \\{2,4,6,8\\}$.\n",
380 | "\n",
381 | "\n",
382 | "With this, the process in the textbook can be summarized as follows:\n",
383 | "\n",
384 | "```\n",
385 | "1. for iteration = 1, 2, 3, ....,\n",
386 | " 2. Pick a random pixel i.\n",
387 | " 3. Find pixel i's new parameter as\n",
388 | "```\n",
389 | "$$\\pi_i^{\\text{new}} = \\frac{\\exp(\\theta_{ii}^{(2)} X_i + \\sum_{j\\in \\mathcal{N}(i)} \\theta_{ij}^{(1)} (2\\pi_j -1))}{\\exp(\\theta_{ii}^{(2)} X_i + \\sum_{j\\in \\mathcal{N}(i)} \\theta_{ij}^{(1)} (2\\pi_j -1)) + \\exp(-\\theta_{ii}^{(2)} X_i - \\sum_{j\\in \\mathcal{N}(i)} \\theta_{ij}^{(1)} (2\\pi_j -1))} .$$\n",
390 | "```\n",
391 | " 4. Replace the existing parameter for pixel i with the new one.\n",
392 | "```\n",
393 | "$$\\pi_i \\leftarrow \\pi_i^{\\text{new}}$$\n",
394 | "\n",
395 | "Since our computational tools are most efficient on vectorized operations, we make the following minor algorithmic modification and ask you to implement it instead:\n",
396 | "\n",
397 | "```\n",
398 | "1. for iteration = 1, 2, 3, ....,\n",
399 | " 2. for each pixel i:\n",
400 | " 3. Find pixel i's new parameter, but do not update the original parameter yet.\n",
401 | "```\n",
402 | "$$\\pi_i^{\\text{new}} = \\frac{\\exp(\\theta_{ii}^{(2)} X_i + \\sum_{j\\in \\mathcal{N}(i)} \\theta_{ij}^{(1)} (2\\pi_j -1))}{\\exp(\\theta_{ii}^{(2)} X_i + \\sum_{j\\in \\mathcal{N}(i)} \\theta_{ij}^{(1)} (2\\pi_j -1)) + \\exp(-\\theta_{ii}^{(2)} X_i - \\sum_{j\\in \\mathcal{N}(i)} \\theta_{ij}^{(1)} (2\\pi_j -1))} .$$\n",
403 | "```\n",
404 | " 4. Once you have computed all the new parameters, update all of them at the same time:\n",
405 | "```\n",
406 | "$$\\pi \\leftarrow \\pi^{\\text{new}}$$\n",
407 | "\n",
408 | "We assume that the parameters $\\theta_{ii}^{(2)}$ have the same value for all $i$ and denote their common value by scalar `theta_X`. Moreover, we assume that the parameters $\\theta_{ij}^{(1)}$ have the same value for all $i,j$ and denote their common value by scalar `theta_pi`.\n",
409 | "\n",
410 | "The `boltzman_meanfield` function must take the following input arguments:\n",
411 | "1. `images`: A numpy array with the shape `(N,height,width)`, where \n",
412 | " * `N` is the number of samples and could be anything,\n",
413 | " * `height` is each individual image's height in pixels (i.e., number of rows in each image),\n",
414 | " * and `width` is each individual image's width in pixels (i.e., number of columns in each image).\n",
415 | " * Do not assume anything about `images`'s dtype or the number of samples or the `height` or the `width`.\n",
416 | " * The entries of `images` are either -1 or 1.\n",
417 | "2. `initial_pi`: A numpy array with the same shape as `images` (i.e. `(N,height,width)`). This variable corresponds to the initial value of $\\pi$ in the textbook analysis and the above equations. Note that each of the $N$ images has its own $\\pi$ variable.\n",
418 | "\n",
419 | "3. `theta_X`: A scalar with a default value of `0.5*np.log(1/noise_flip_prob-1)`. This variable represents $\\theta_{ii}^{(2)}$ in the above update equation.\n",
420 | "\n",
421 | "4. `theta_pi`: A scalar with a default value of 2. This variable represents $\\theta_{ij}^{(1)}$ in the above update equation.\n",
422 | "\n",
423 | "5. `iterations`: A scalar with a default value of 100. This variable denotes the number of update iterations to perform.\n",
424 | "\n",
425 | "The `boltzman_meanfield` function must return the final $\\pi$ variable as a numpy array called `pi`, whose values must lie between 0 and 1. \n",
426 | "\n",
427 | "**Hint**: You may find the `sigmoid_2x` function, which you implemented earlier, useful.\n",
428 | "\n",
429 | "**Hint**: If you want to find the sum of the neighboring elements for every element of a 2-dimensional matrix, there is an easy and efficient way using matrix operations. You can initialize a zero matrix, and then add four shifted versions (i.e., left-, right-, up-, and down-shifted versions) of the original matrix to it. You will have to be careful with the assignment and selection indices, since you will have to drop one row/column for each shifted version of the matrix.\n",
430 | " * Do **not** use `np.roll` if you're taking this approach."
431 | ]
432 | },
433 | {
434 | "cell_type": "code",
435 | "execution_count": 27,
436 | "metadata": {
437 | "deletable": false,
438 | "nbgrader": {
439 | "cell_type": "code",
440 | "checksum": "0cd68da6ffa91cc0796f53f49b7ab71a",
441 | "grade": false,
442 | "grade_id": "cell-e47949a00f04759c",
443 | "locked": false,
444 | "schema_version": 3,
445 | "solution": true,
446 | "task": false
447 | }
448 | },
449 | "outputs": [],
450 | "source": [
451 | "def boltzman_meanfield(images, initial_pi, theta_X=0.5*np.log(1/noise_flip_prob-1), theta_pi=2, iterations=100):\n",
452 | "    if len(images.shape)==2:\n",
453 | "        # In case a 2d image was given as input, we'll add a dummy dimension to be consistent\n",
454 | "        X = images.reshape(1,*images.shape)\n",
455 | "    else:\n",
456 | "        # Otherwise, we'll just work with what's given\n",
457 | "        X = images\n",
458 | "    \n",
459 | "    pi = initial_pi.reshape(X.shape)  # same (N, height, width) layout as X\n",
460 | "    # your code here\n",
461 | "    for i in range(iterations):\n",
462 | "        # Map pi to the expected pixel values 2*pi-1 BEFORE padding, so a pi that is exactly 0 still counts as -1\n",
463 | "        s = 2 * pi - 1\n",
464 | "        # Shifted copies of s; the zero padding makes missing border neighbors contribute 0\n",
465 | "        L = theta_pi * np.pad(s, ((0,0), (0,0), (1,0)), mode='constant')[:, :, :-1]\n",
466 | "        R = theta_pi * np.pad(s, ((0,0), (0,0), (0,1)), mode='constant')[:, :, 1:]\n",
467 | "        U = theta_pi * np.pad(s, ((0,0), (1,0), (0,0)), mode='constant')[:, :-1, :]\n",
468 | "        D = theta_pi * np.pad(s, ((0,0), (0,1), (0,0)), mode='constant')[:, 1:, :]\n",
469 | "        \n",
470 | "        # Update all pixels at the same time\n",
471 | "        pi = sigmoid_2x(theta_X * X + (L+R+U+D))\n",
472 | "    \n",
473 | "    return pi.reshape(*images.shape)"
474 | ]
475 | },
476 | {
477 | "cell_type": "code",
478 | "execution_count": 28,
479 | "metadata": {
480 | "deletable": false,
481 | "editable": false,
482 | "nbgrader": {
483 | "cell_type": "code",
484 | "checksum": "5753989d3e0dad787a6833c39262fb89",
485 | "grade": true,
486 | "grade_id": "cell-6291d0a80ccca660",
487 | "locked": true,
488 | "points": 0.2,
489 | "schema_version": 3,
490 | "solution": false,
491 | "task": false
492 | }
493 | },
494 | "outputs": [
495 | {
496 | "name": "stdout",
497 | "output_type": "stream",
498 | "text": [
499 | "The reference and solution images are the same to a T! Well done on this test case.\n"
500 | ]
501 | },
502 | {
503 | "data": {
504 | "image/png": "\n",
505 | "text/plain": [
506 | ""
507 | ]
508 | },
509 | "metadata": {
510 | "needs_background": "light"
511 | },
512 | "output_type": "display_data"
513 | },
514 | {
515 | "name": "stdout",
516 | "output_type": "stream",
517 | "text": [
518 | " Enter nothing to go to the next image\n",
519 | "or\n",
520 | " Enter \"s\" when you are done to recieve the three images. \n",
521 | " **Don't forget to do this before continuing to the next step.**\n",
522 | "s\n"
523 | ]
524 | }
525 | ],
526 | "source": [
527 | "def test_boltzman(x, seed = 12345, theta_X=0.5*np.log(1/noise_flip_prob-1), theta_pi=2, iterations=100): \n",
528 | " np_random = np.random.RandomState(seed=seed)\n",
529 | " initial_pi = np_random.uniform(0,1, size=x.shape)\n",
530 | " return boltzman_meanfield(x, initial_pi, theta_X=theta_X, \n",
531 | " theta_pi=theta_pi, iterations=iterations)\n",
532 | " \n",
533 | "(orig_image, ref_image, test_im, success_is_row_inky) = show_test_cases(test_boltzman, task_id='3_V')\n",
534 | "\n",
535 | "assert success_is_row_inky"
536 | ]
537 | },
538 | {
539 | "cell_type": "code",
540 | "execution_count": 29,
541 | "metadata": {
542 | "deletable": false,
543 | "editable": false,
544 | "nbgrader": {
545 | "cell_type": "code",
546 | "checksum": "dfbdbd77c48ab7c23ce00913b90050b8",
547 | "grade": true,
548 | "grade_id": "cell-e7b59624d7ab9ec3",
549 | "locked": true,
550 | "points": 0.8,
551 | "schema_version": 3,
552 | "solution": false,
553 | "task": false
554 | }
555 | },
556 | "outputs": [],
557 | "source": [
558 | "# Checking against the pre-computed test database\n",
559 | "test_results = test_case_checker(boltzman_meanfield, task_id=3)\n",
560 | "assert test_results['passed'], test_results['message']"
561 | ]
562 | },
563 | {
564 | "cell_type": "markdown",
565 | "metadata": {},
566 | "source": [
567 | "## 2. Tuning the Boltzmann Machine's Hyper-Parameters"
568 | ]
569 | },
570 | {
571 | "cell_type": "markdown",
572 | "metadata": {},
573 | "source": [
574 | "Now, using the `boltzman_meanfield` function that you implemented above, we examine the effect of changing the hyper-parameters `theta_X` and `theta_pi`, which were defined in Task 3. \n",
575 | "\n",
576 | "- We set `theta_X` to be `0.5*np.log(1/noise_flip_prob-1)`, where `noise_flip_prob` is the probability of flipping each pixel. Try to think about why this is a reasonable choice. (This is also related to one of the questions in the follow-up quiz).\n",
577 | "- We try different values for `theta_pi`. \n",
578 | "\n",
579 | "For each value of `theta_pi`, we apply the denoising and compare the denoised images to the original ones. We adopt several statistical measures to compare the original and denoised images and to decide which value of `theta_pi` is best. Remember that during the noising process we chose some pixels and flipped them, and during the denoising process we essentially try to detect those pixels. Let `P` be the total number of pixels that we flip during the noise adding process, and `N` be the total number of pixels that we do not flip during the noise adding process. We can define:\n",
580 | "\n",
581 | "- True Positive (`TP`). Defined to be the total number of pixels that are flipped during the noise adding process and that we successfully detect during the denoising process. \n",
582 | "- True Positive Rate (`TPR`). Other names: sensitivity, recall. Defined to be the ratio `TP / P`.\n",
583 | "- False Positive (`FP`). Defined to be the number of pixels that were detected as being noisy during the denoising process, but were not really noisy. \n",
584 | "- False Positive Rate (`FPR`). Other name: fall-out. Defined to be the ratio `FP/N`.\n",
585 | "- Positive Predictive Value (`PPV`). Other name: precision. Defined to be the ratio `TP / (TP + FP)`.\n",
586 | "- `F1` score. Defined to be the harmonic mean of precision (`PPV`) and recall (`TPR`), or equivalently `2 TP / (2 TP + FP + FN)`, where `FN = P - TP` is the number of false negatives (flipped pixels that we fail to detect). \n",
587 | "\n",
588 | "Since we fix `theta_X` in this section and evaluate different values of `theta_pi`, in the plots, `theta` refers to `theta_pi`."
589 | ]
590 | },
591 | {
592 | "cell_type": "code",
593 | "execution_count": 15,
594 | "metadata": {},
595 | "outputs": [],
596 | "source": [
597 | "def get_tpr(preds, true_labels):\n",
598 | "    TP = (preds * (preds == true_labels)).sum()\n",
599 | "    P = true_labels.sum()\n",
600 | "    if P==0:\n",
601 | "        TPR = 1.\n",
602 | "    else:\n",
603 | "        TPR = TP / P\n",
604 | "    \n",
605 | "    return TPR\n",
606 | "\n",
607 | "def get_fpr(preds, true_labels):\n",
608 | "    FP = (preds * (preds != true_labels)).sum()\n",
609 | "    N = (1-true_labels).sum()\n",
610 | "    if N==0:\n",
611 | "        FPR=1\n",
612 | "    else:\n",
613 | "        FPR = FP / N\n",
614 | "    return FPR\n",
615 | "\n",
616 | "def get_ppv(preds, true_labels):\n",
617 | "    TP = (preds * (preds == true_labels)).sum()\n",
618 | "    FP = (preds * (preds != true_labels)).sum()\n",
619 | "    if (TP + FP) == 0:\n",
620 | "        PPV = 1\n",
621 | "    else:\n",
622 | "        PPV = TP / (TP + FP)\n",
623 | "    return PPV\n",
624 | "\n",
625 | "def get_f1(preds, true_labels):\n",
626 | "    TP = (preds * (preds == true_labels)).sum()\n",
627 | "    FP = (preds * (preds != true_labels)).sum()\n",
628 | "    FN = ((1-preds) * (preds != true_labels)).sum()\n",
629 | "    if (2 * TP + FP + FN) == 0:\n",
630 | "        F1 = 1\n",
631 | "    else:\n",
632 | "        F1 = (2 * TP) / (2 * TP + FP + FN)\n",
633 | "    return F1"
634 | ]
635 | },
636 | {
637 | "cell_type": "code",
638 | "execution_count": 16,
639 | "metadata": {},
640 | "outputs": [],
641 | "source": [
642 | "if perform_computation:\n",
643 | "    all_theta = np.arange(0, 10, 0.2).tolist() + np.arange(10, 100, 5).tolist()\n",
644 | "\n",
645 | "    tpr_list, fpr_list, ppv_list, f1_list = [], [], [], []\n",
646 | "\n",
647 | "    for theta in all_theta:\n",
648 | "        meanfield_pi = boltzman_meanfield(X_noised, initial_pi, theta_X=0.5*np.log(1/noise_flip_prob-1), theta_pi=theta, iterations=100)\n",
649 | "        X_denoised = 2 * (meanfield_pi > 0.5) - 1\n",
650 | "\n",
651 | "        predicted_noise_pixels = (X_denoised != X_noised)\n",
652 | "        tpr = get_tpr(predicted_noise_pixels, flip_flags)\n",
653 | "        fpr = get_fpr(predicted_noise_pixels, flip_flags)\n",
654 | "        ppv = get_ppv(predicted_noise_pixels, flip_flags)\n",
655 | "        f1 = get_f1(predicted_noise_pixels, flip_flags)\n",
656 | "\n",
657 | "        tpr_list.append(tpr)\n",
658 | "        fpr_list.append(fpr)\n",
659 | "        ppv_list.append(ppv)\n",
660 | "        f1_list.append(f1)"
661 | ]
662 | },
663 | {
664 | "cell_type": "code",
665 | "execution_count": 17,
666 | "metadata": {},
667 | "outputs": [
668 | {
669 | "data": {
670 | "image/png": "\n",
671 | "text/plain": [
672 | ""
673 | ]
674 | },
675 | "metadata": {
676 | "needs_background": "light"
677 | },
678 | "output_type": "display_data"
679 | }
680 | ],
681 | "source": [
682 | "if perform_computation:\n",
683 | " fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4), dpi=90)\n",
684 | "\n",
685 | " ax=axes[0]\n",
686 | " ax.plot(all_theta, tpr_list)\n",
687 | " ax.set_xlabel('Theta')\n",
688 | " ax.set_ylabel('True Positive Rate')\n",
689 | " ax.set_title('True Positive Rate Vs. Theta')\n",
690 | " ax.set_xscale('log')\n",
691 | "\n",
692 | " ax=axes[1]\n",
693 | " ax.plot(all_theta, fpr_list)\n",
694 | " ax.set_xlabel('Theta')\n",
695 | " ax.set_ylabel('False Positive Rate')\n",
696 | " ax.set_title('False Positive Rate Vs. Theta')\n",
697 | " ax.set_xscale('log')"
698 | ]
699 | },
700 | {
701 | "cell_type": "code",
702 | "execution_count": 18,
703 | "metadata": {},
704 | "outputs": [
705 | {
706 | "data": {
707 | "image/png": "\n",
708 | "text/plain": [
709 | ""
710 | ]
711 | },
712 | "metadata": {
713 | "needs_background": "light"
714 | },
715 | "output_type": "display_data"
716 | }
717 | ],
718 | "source": [
719 | "if perform_computation:\n",
720 | " fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12,3), dpi=90)\n",
721 | "\n",
722 | " ax=axes[0]\n",
723 | " ax.plot(fpr_list, tpr_list)\n",
724 | " ax.set_xlabel('False Positive Rate')\n",
725 | " ax.set_ylabel('True Positive Rate')\n",
726 | " ax.set_title('ROC Curve')\n",
727 | " ax.set_xlim(-0.05, 1.05)\n",
728 | " ax.set_ylim(-0.05, 1.05)\n",
729 | " ax.plot(np.arange(-0.05, 1.05, 0.01), np.arange(-0.05, 1.05, 0.01), ls='--', c='black')\n",
730 | "\n",
731 | " ax=axes[1]\n",
732 | " ax.plot(all_theta, f1_list)\n",
733 | " ax.set_xlabel('Theta')\n",
734 | " ax.set_ylabel('F1-statistic')\n",
735 | " ax.set_title('F1-score Vs. Theta')\n",
736 | " ax.set_xscale('log')\n",
737 | "\n",
738 | " ax=axes[2]\n",
739 | " ax.plot(tpr_list, ppv_list)\n",
740 | " ax.set_xlabel('Recall')\n",
741 | " ax.set_ylabel('Precision')\n",
742 | " ax.set_title('Precision Vs. Recall')\n",
743 | " ax.set_xlim(-0.05, 1.05)\n",
744 | " ax.set_ylim(-0.05, 1.05)\n",
745 | " ax.plot(np.arange(-0.05, 1.05, 0.01), 1-np.arange(-0.05, 1.05, 0.01), ls='--', c='black')\n",
746 | " None"
747 | ]
748 | },
749 | {
750 | "cell_type": "code",
751 | "execution_count": 19,
752 | "metadata": {},
753 | "outputs": [
754 | {
755 | "name": "stdout",
756 | "output_type": "stream",
757 | "text": [
758 | "Best theta w.r.t. the F-score is 0.8\n"
759 | ]
760 | }
761 | ],
762 | "source": [
763 | "if perform_computation:\n",
764 | " best_theta = all_theta[np.argmax(f1_list)]\n",
765 | " print(f'Best theta w.r.t. the F-score is {best_theta}')"
766 | ]
767 | },
768 | {
769 | "cell_type": "markdown",
770 | "metadata": {},
771 | "source": [
772 | "Now let's try the tuned hyper-parameter and check whether it visibly improves the Boltzmann machine's denoising."
773 | ]
774 | },
775 | {
776 | "cell_type": "code",
777 | "execution_count": 20,
778 | "metadata": {
779 | "deletable": false,
780 | "editable": false,
781 | "nbgrader": {
782 | "cell_type": "code",
783 | "checksum": "9ffddbff83f528b6590d47a23414c8dd",
784 | "grade": true,
785 | "grade_id": "cell-00c232dc99ca3fdd",
786 | "locked": true,
787 | "points": 0,
788 | "schema_version": 3,
789 | "solution": false,
790 | "task": false
791 | }
792 | },
793 | "outputs": [
794 | {
795 | "name": "stdout",
796 | "output_type": "stream",
797 | "text": [
798 | "The reference and solution images are the same to a T! Well done on this test case.\n"
799 | ]
800 | },
801 | {
802 | "data": {
803 | "image/png": "\n",
804 | "text/plain": [
805 | ""
806 | ]
807 | },
808 | "metadata": {
809 | "needs_background": "light"
810 | },
811 | "output_type": "display_data"
812 | },
813 | {
814 | "name": "stdout",
815 | "output_type": "stream",
816 | "text": [
817 | " Enter nothing to go to the next image\n",
818 | "or\n",
819 | " Enter \"s\" when you are done to recieve the three images. \n",
820 | " **Don't forget to do this before continuing to the next step.**\n",
821 | "s\n"
822 | ]
823 | }
824 | ],
825 | "source": [
826 | "if perform_computation:\n",
827 | "    def test_boltzman(x, seed = 12345, theta_X=0.5*np.log(1/noise_flip_prob-1), theta_pi=best_theta, iterations=100): \n",
828 | "        np_random = np.random.RandomState(seed=seed)\n",
829 | "        initial_pi = np_random.uniform(0,1, size=x.shape)\n",
830 | "        return boltzman_meanfield(x, initial_pi, theta_X=theta_X, \n",
831 | "                                  theta_pi=theta_pi, iterations=iterations) > 0.5\n",
832 | "\n",
833 | "    (orig_image, ref_image, test_im, success_is_row_inky) = show_test_cases(test_boltzman, task_id='4_V')"
834 | ]
835 | },
836 | {
837 | "cell_type": "code",
838 | "execution_count": null,
839 | "metadata": {},
840 | "outputs": [],
841 | "source": []
842 | },
843 | {
844 | "cell_type": "code",
845 | "execution_count": null,
846 | "metadata": {},
847 | "outputs": [],
848 | "source": []
849 | }
850 | ],
851 | "metadata": {
852 | "kernelspec": {
853 | "display_name": "Python 3",
854 | "language": "python",
855 | "name": "python3"
856 | },
857 | "language_info": {
858 | "codemirror_mode": {
859 | "name": "ipython",
860 | "version": 3
861 | },
862 | "file_extension": ".py",
863 | "mimetype": "text/x-python",
864 | "name": "python",
865 | "nbconvert_exporter": "python",
866 | "pygments_lexer": "ipython3",
867 | "version": "3.7.6"
868 | }
869 | },
870 | "nbformat": 4,
871 | "nbformat_minor": 4
872 | }
873 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ### Applied-Machine-Learning
2 |
--------------------------------------------------------------------------------