├── .DS_Store
├── BasicClassification.ipynb
├── CNN.ipynb
├── ClassifyingImages.ipynb
├── Clustering.ipynb
├── EMSegmentation-lib
│   ├── .DS_Store
│   ├── EMSegmentation.pdf
│   ├── README.md
│   ├── aml_utils.py
│   ├── payload_requirements.json
│   ├── pics
│   │   ├── RobertMixed03.jpg
│   │   ├── smallstrelitzia.jpg
│   │   └── smallsunset.jpg
│   ├── requirements1.txt
│   └── test_db
│       ├── .DS_Store
│       ├── task_1.npz
│       ├── task_2.npz
│       ├── task_3.npz
│       └── task_4.npz
├── EMSegmentation.ipynb
├── EMTopicModel-lib
│   ├── .DS_Store
│   ├── EMTopicModel.pdf
│   ├── README.md
│   ├── aml_utils.py
│   ├── payload_requirements.json
│   ├── requirements1.txt
│   ├── test_db
│   │   ├── .DS_Store
│   │   ├── task_1.npz
│   │   ├── task_2.npz
│   │   └── task_3.npz
│   └── words
│       ├── .DS_Store
│       ├── docword.nips.txt
│       └── vocab.nips.txt
├── EMTopicModel.ipynb
├── GLMnet.ipynb
├── HiDimClassification.ipynb
├── MeanField.ipynb
├── PCA.ipynb
├── README.md
├── Regression.ipynb
└── SGDSVM.ipynb
/BasicClassification.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# * Prerequisites"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "In this assignment you will implement the Naive Bayes Classifier. Before starting this assignment, make sure you understand the concepts discussed in the videos in Week 2 about Naive Bayes. You can also find it useful to read Chapter 1 of the textbook.\n",
15 | "\n",
16 | "\n",
17 | "Also, make sure that you are familiar with the `numpy.ndarray` class of python's `numpy` library and that you are able to answer the following questions:\n",
18 | "\n",
19 | "Let's assume `a` is a numpy array.\n",
20 | "* What is an array's shape (e.g., what is the meaning of `a.shape`)? \n",
21 | "* What is numpy's reshaping operation? How much computational over-head would it induce? \n",
22 | "* What is numpy's transpose operation, and how it is different from reshaping? Does it cause computation overhead?\n",
23 | "* What is the meaning of the commands `a.reshape(-1, 1)` and `a.reshape(-1)`?\n",
24 | "* Would happens to the variable `a` after we call `b = a.reshape(-1)`? Does any of the attributes of `a` change?\n",
25 | "* How do assignments in python and numpy work in general?\n",
26 | " * Does the `b=a` statement use copying by value? Or is it copying by reference?\n",
27 | " * Would the answer to the previous question change depending on whether `a` is a numpy array or a scalar value?\n",
28 | " \n",
29 | "You can answer all of these questions by\n",
30 | "\n",
31 | " 1. Reading numpy's documentation from https://numpy.org/doc/stable/.\n",
32 | " 2. Making trials using dummy variables."
33 | ]
34 | },
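35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "For example, here is a minimal sketch (using a dummy array `a`; paste it into a scratch cell if you like) that touches on the reshape and assignment questions above:\n",
40 | "\n",
41 | "```python\n",
42 | "import numpy as np\n",
43 | "\n",
44 | "a = np.arange(6)      # a.shape == (6,)\n",
45 | "b = a.reshape(-1, 1)  # shape (6, 1); a view on the same data here, so no copying overhead\n",
46 | "c = a.reshape(-1)     # shape (6,); also a view\n",
47 | "b[0, 0] = 100         # writing through the view...\n",
48 | "print(a[0])           # ...changes a as well (prints 100), though a.shape itself is unchanged\n",
49 | "\n",
50 | "d = a                 # plain assignment binds a new name to the same array (reference, not copy)\n",
51 | "d[1] = -1             # so this mutates a too: a[1] is now -1\n",
52 | "s = 5\n",
53 | "t = s                 # scalars simply rebind; changing t later cannot affect s\n",
54 | "```"
55 | ]
56 | },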
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "# *Assignment Summary"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "The UC Irvine machine learning data repository hosts a famous dataset, the Pima Indians dataset, on whether a patient has diabetes originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito. You can find it at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. This data has a set of attributes of patients, and a categorical variable telling whether the patient is diabetic or not. For several attributes in this data set, a value of 0 may indicate a missing value of the variable. It has a total of 768 data-points. \n",
47 | "\n",
48 | "* **Part 1-A)** First, you will build a simple naive Bayes classifier to classify this data set. We will use 20% of the data for evaluation and the other 80% for training. \n",
49 | "\n",
50 | " You should use a normal distribution to model each of the class-conditional distributions.\n",
51 | "\n",
52 | " Report the accuracy of the classifier on the 20% evaluation data, where accuracy is the number of correct predictions as a fraction of total predictions.\n",
53 | "\n",
54 | "* **Part 1-B)** Next, you will adjust your code so that, for attributes 3 (Diastolic blood pressure), 4 (Triceps skin fold thickness), 6 (Body mass index), and 8 (Age), it regards a value of 0 as a missing value when estimating the class-conditional distributions, and the posterior.\n",
55 | "\n",
56 | " Report the accuracy of the classifier on the 20% that was held out for evaluation.\n",
57 | "\n",
58 | "* **Part 1-C)** Last, you will have some experience with SVMLight, an off-the-shelf implementation of Support Vector Machines or SVMs. For now, you don't need to understand much about SVM's, we will explore them in more depth in the following exercises. You will install SVMLight, which you can find at http://svmlight.joachims.org, to train and evaluate an SVM to classify this data.\n",
59 | "\n",
60 | " You should NOT substitute NA values for zeros for attributes 3, 4, 6, and 8.\n",
61 | " \n",
62 | " Report the accuracy of the classifier on the held out 20%"
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "# 0. Data"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "## 0.1 Description"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "The UC Irvine's Machine Learning Data Repository Department hosts a Kaggle Competition with famous collection of data on whether a patient has diabetes (the Pima Indians dataset), originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito. \n",
84 | "\n",
85 | "You can find this data at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. The Kaggle website offers valuable visualizations of the original data dimensions in its dashboard. It is quite insightful to take the time and make sense of the data using their dashboard before applying any method to the data."
86 | ]
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "## 0.2 Information Summary"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "* **Input/Output**: This data has a set of attributes of patients, and a categorical variable telling whether the patient is diabetic or not. \n",
100 | "\n",
101 | "* **Missing Data**: For several attributes in this data set, a value of 0 may indicate a missing value of the variable.\n",
102 | "\n",
103 | "* **Final Goal**: We want to build a classifier that can predict whether a patient has diabetes or not. To do this, we will train multiple kinds of models, and will be handing the missing data with different approaches for each method (i.e., some methods will ignore their existence, while others may do something about the missing data)."
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "## 0.3 Loading"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": 46,
116 | "metadata": {},
117 | "outputs": [],
118 | "source": [
119 | "%matplotlib inline\n",
120 | "import pandas as pd\n",
121 | "import numpy as np\n",
122 | "import seaborn as sns\n",
123 | "import matplotlib.pyplot as plt\n",
124 | "\n",
125 | "from aml_utils import test_case_checker"
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": 47,
131 | "metadata": {},
132 | "outputs": [
133 | {
134 | "data": {
135 | "text/html": [
136 | "
\n",
137 | "\n",
150 | "
\n",
151 | " \n",
152 | " \n",
153 | " | \n",
154 | " Pregnancies | \n",
155 | " Glucose | \n",
156 | " BloodPressure | \n",
157 | " SkinThickness | \n",
158 | " Insulin | \n",
159 | " BMI | \n",
160 | " DiabetesPedigreeFunction | \n",
161 | " Age | \n",
162 | " Outcome | \n",
163 | "
\n",
164 | " \n",
165 | " \n",
166 | " \n",
167 | " 0 | \n",
168 | " 6 | \n",
169 | " 148 | \n",
170 | " 72 | \n",
171 | " 35 | \n",
172 | " 0 | \n",
173 | " 33.6 | \n",
174 | " 0.627 | \n",
175 | " 50 | \n",
176 | " 1 | \n",
177 | "
\n",
178 | " \n",
179 | " 1 | \n",
180 | " 1 | \n",
181 | " 85 | \n",
182 | " 66 | \n",
183 | " 29 | \n",
184 | " 0 | \n",
185 | " 26.6 | \n",
186 | " 0.351 | \n",
187 | " 31 | \n",
188 | " 0 | \n",
189 | "
\n",
190 | " \n",
191 | " 2 | \n",
192 | " 8 | \n",
193 | " 183 | \n",
194 | " 64 | \n",
195 | " 0 | \n",
196 | " 0 | \n",
197 | " 23.3 | \n",
198 | " 0.672 | \n",
199 | " 32 | \n",
200 | " 1 | \n",
201 | "
\n",
202 | " \n",
203 | " 3 | \n",
204 | " 1 | \n",
205 | " 89 | \n",
206 | " 66 | \n",
207 | " 23 | \n",
208 | " 94 | \n",
209 | " 28.1 | \n",
210 | " 0.167 | \n",
211 | " 21 | \n",
212 | " 0 | \n",
213 | "
\n",
214 | " \n",
215 | " 4 | \n",
216 | " 0 | \n",
217 | " 137 | \n",
218 | " 40 | \n",
219 | " 35 | \n",
220 | " 168 | \n",
221 | " 43.1 | \n",
222 | " 2.288 | \n",
223 | " 33 | \n",
224 | " 1 | \n",
225 | "
\n",
226 | " \n",
227 | "
\n",
228 | "
"
229 | ],
230 | "text/plain": [
231 | " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n",
232 | "0 6 148 72 35 0 33.6 \n",
233 | "1 1 85 66 29 0 26.6 \n",
234 | "2 8 183 64 0 0 23.3 \n",
235 | "3 1 89 66 23 94 28.1 \n",
236 | "4 0 137 40 35 168 43.1 \n",
237 | "\n",
238 | " DiabetesPedigreeFunction Age Outcome \n",
239 | "0 0.627 50 1 \n",
240 | "1 0.351 31 0 \n",
241 | "2 0.672 32 1 \n",
242 | "3 0.167 21 0 \n",
243 | "4 2.288 33 1 "
244 | ]
245 | },
246 | "execution_count": 47,
247 | "metadata": {},
248 | "output_type": "execute_result"
249 | }
250 | ],
251 | "source": [
252 | "df = pd.read_csv('../BasicClassification-lib/diabetes.csv')\n",
253 | "df.head()"
254 | ]
255 | },
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {},
259 | "source": [
260 | "## 0.1 Splitting The Data"
261 | ]
262 | },
263 | {
264 | "cell_type": "markdown",
265 | "metadata": {},
266 | "source": [
267 | "First, we will shuffle the data completely, and forget about the order in the original csv file. \n",
268 | "\n",
269 | "* The training and evaluation dataframes will be named ```train_df``` and ```eval_df```, respectively.\n",
270 | "\n",
271 | "* We will also create the 2-d numpy array `train_features` whose number of rows is the number of training samples, and the number of columns is 8 (i.e., the number of features). We will define `eval_features` in a similar fashion\n",
272 | "\n",
273 | "* We would also create the 1-d numpy arrays `train_labels` and `eval_labels` which contain the training and evaluation labels, respectively."
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": 48,
279 | "metadata": {},
280 | "outputs": [],
281 | "source": [
282 | "# Let's generate the split ourselves.\n",
283 | "np_random = np.random.RandomState(seed=12345)\n",
284 | "rand_unifs = np_random.uniform(0,1,size=df.shape[0])\n",
285 | "division_thresh = np.percentile(rand_unifs, 80)\n",
286 | "train_indicator = rand_unifs < division_thresh\n",
287 | "eval_indicator = rand_unifs >= division_thresh"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": 49,
293 | "metadata": {},
294 | "outputs": [
295 | {
296 | "data": {
297 | "text/html": [
298 | "\n",
299 | "\n",
312 | "
\n",
313 | " \n",
314 | " \n",
315 | " | \n",
316 | " Pregnancies | \n",
317 | " Glucose | \n",
318 | " BloodPressure | \n",
319 | " SkinThickness | \n",
320 | " Insulin | \n",
321 | " BMI | \n",
322 | " DiabetesPedigreeFunction | \n",
323 | " Age | \n",
324 | " Outcome | \n",
325 | "
\n",
326 | " \n",
327 | " \n",
328 | " \n",
329 | " 0 | \n",
330 | " 1 | \n",
331 | " 85 | \n",
332 | " 66 | \n",
333 | " 29 | \n",
334 | " 0 | \n",
335 | " 26.6 | \n",
336 | " 0.351 | \n",
337 | " 31 | \n",
338 | " 0 | \n",
339 | "
\n",
340 | " \n",
341 | " 1 | \n",
342 | " 8 | \n",
343 | " 183 | \n",
344 | " 64 | \n",
345 | " 0 | \n",
346 | " 0 | \n",
347 | " 23.3 | \n",
348 | " 0.672 | \n",
349 | " 32 | \n",
350 | " 1 | \n",
351 | "
\n",
352 | " \n",
353 | " 2 | \n",
354 | " 1 | \n",
355 | " 89 | \n",
356 | " 66 | \n",
357 | " 23 | \n",
358 | " 94 | \n",
359 | " 28.1 | \n",
360 | " 0.167 | \n",
361 | " 21 | \n",
362 | " 0 | \n",
363 | "
\n",
364 | " \n",
365 | " 3 | \n",
366 | " 0 | \n",
367 | " 137 | \n",
368 | " 40 | \n",
369 | " 35 | \n",
370 | " 168 | \n",
371 | " 43.1 | \n",
372 | " 2.288 | \n",
373 | " 33 | \n",
374 | " 1 | \n",
375 | "
\n",
376 | " \n",
377 | " 4 | \n",
378 | " 5 | \n",
379 | " 116 | \n",
380 | " 74 | \n",
381 | " 0 | \n",
382 | " 0 | \n",
383 | " 25.6 | \n",
384 | " 0.201 | \n",
385 | " 30 | \n",
386 | " 0 | \n",
387 | "
\n",
388 | " \n",
389 | "
\n",
390 | "
"
391 | ],
392 | "text/plain": [
393 | " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n",
394 | "0 1 85 66 29 0 26.6 \n",
395 | "1 8 183 64 0 0 23.3 \n",
396 | "2 1 89 66 23 94 28.1 \n",
397 | "3 0 137 40 35 168 43.1 \n",
398 | "4 5 116 74 0 0 25.6 \n",
399 | "\n",
400 | " DiabetesPedigreeFunction Age Outcome \n",
401 | "0 0.351 31 0 \n",
402 | "1 0.672 32 1 \n",
403 | "2 0.167 21 0 \n",
404 | "3 2.288 33 1 \n",
405 | "4 0.201 30 0 "
406 | ]
407 | },
408 | "execution_count": 49,
409 | "metadata": {},
410 | "output_type": "execute_result"
411 | }
412 | ],
413 | "source": [
414 | "train_df = df[train_indicator].reset_index(drop=True)\n",
415 | "train_features = train_df.loc[:, train_df.columns != 'Outcome'].values\n",
416 | "train_labels = train_df['Outcome'].values\n",
417 | "train_df.head()"
418 | ]
419 | },
420 | {
421 | "cell_type": "code",
422 | "execution_count": 50,
423 | "metadata": {},
424 | "outputs": [
425 | {
426 | "data": {
427 | "text/html": [
428 | "\n",
429 | "\n",
442 | "
\n",
443 | " \n",
444 | " \n",
445 | " | \n",
446 | " Pregnancies | \n",
447 | " Glucose | \n",
448 | " BloodPressure | \n",
449 | " SkinThickness | \n",
450 | " Insulin | \n",
451 | " BMI | \n",
452 | " DiabetesPedigreeFunction | \n",
453 | " Age | \n",
454 | " Outcome | \n",
455 | "
\n",
456 | " \n",
457 | " \n",
458 | " \n",
459 | " 0 | \n",
460 | " 6 | \n",
461 | " 148 | \n",
462 | " 72 | \n",
463 | " 35 | \n",
464 | " 0 | \n",
465 | " 33.6 | \n",
466 | " 0.627 | \n",
467 | " 50 | \n",
468 | " 1 | \n",
469 | "
\n",
470 | " \n",
471 | " 1 | \n",
472 | " 3 | \n",
473 | " 78 | \n",
474 | " 50 | \n",
475 | " 32 | \n",
476 | " 88 | \n",
477 | " 31.0 | \n",
478 | " 0.248 | \n",
479 | " 26 | \n",
480 | " 1 | \n",
481 | "
\n",
482 | " \n",
483 | " 2 | \n",
484 | " 10 | \n",
485 | " 168 | \n",
486 | " 74 | \n",
487 | " 0 | \n",
488 | " 0 | \n",
489 | " 38.0 | \n",
490 | " 0.537 | \n",
491 | " 34 | \n",
492 | " 1 | \n",
493 | "
\n",
494 | " \n",
495 | " 3 | \n",
496 | " 0 | \n",
497 | " 118 | \n",
498 | " 84 | \n",
499 | " 47 | \n",
500 | " 230 | \n",
501 | " 45.8 | \n",
502 | " 0.551 | \n",
503 | " 31 | \n",
504 | " 1 | \n",
505 | "
\n",
506 | " \n",
507 | " 4 | \n",
508 | " 7 | \n",
509 | " 107 | \n",
510 | " 74 | \n",
511 | " 0 | \n",
512 | " 0 | \n",
513 | " 29.6 | \n",
514 | " 0.254 | \n",
515 | " 31 | \n",
516 | " 1 | \n",
517 | "
\n",
518 | " \n",
519 | "
\n",
520 | "
"
521 | ],
522 | "text/plain": [
523 | " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n",
524 | "0 6 148 72 35 0 33.6 \n",
525 | "1 3 78 50 32 88 31.0 \n",
526 | "2 10 168 74 0 0 38.0 \n",
527 | "3 0 118 84 47 230 45.8 \n",
528 | "4 7 107 74 0 0 29.6 \n",
529 | "\n",
530 | " DiabetesPedigreeFunction Age Outcome \n",
531 | "0 0.627 50 1 \n",
532 | "1 0.248 26 1 \n",
533 | "2 0.537 34 1 \n",
534 | "3 0.551 31 1 \n",
535 | "4 0.254 31 1 "
536 | ]
537 | },
538 | "execution_count": 50,
539 | "metadata": {},
540 | "output_type": "execute_result"
541 | }
542 | ],
543 | "source": [
544 | "eval_df = df[eval_indicator].reset_index(drop=True)\n",
545 | "eval_features = eval_df.loc[:, eval_df.columns != 'Outcome'].values\n",
546 | "eval_labels = eval_df['Outcome'].values\n",
547 | "eval_df.head()"
548 | ]
549 | },
550 | {
551 | "cell_type": "code",
552 | "execution_count": 51,
553 | "metadata": {},
554 | "outputs": [
555 | {
556 | "data": {
557 | "text/plain": [
558 | "((614, 8), (614,), (154, 8), (154,))"
559 | ]
560 | },
561 | "execution_count": 51,
562 | "metadata": {},
563 | "output_type": "execute_result"
564 | }
565 | ],
566 | "source": [
567 | "train_features.shape, train_labels.shape, eval_features.shape, eval_labels.shape"
568 | ]
569 | },
570 | {
571 | "cell_type": "markdown",
572 | "metadata": {},
573 | "source": [
574 | "## 0.2 Pre-processing The Data"
575 | ]
576 | },
577 | {
578 | "cell_type": "markdown",
579 | "metadata": {},
580 | "source": [
581 | "Some of the columns exhibit missing values. We will use a Naive Bayes Classifier later that will treat such missing values in a special way. To be specific, for attribute 3 (Diastolic blood pressure), attribute 4 (Triceps skin fold thickness), attribute 6 (Body mass index), and attribute 8 (Age), we should regard a value of 0 as a missing value.\n",
582 | "\n",
583 | "Therefore, we will be creating the `train_featues_with_nans` and `eval_features_with_nans` numpy arrays to be just like their `train_features` and `eval_features` counter-parts, but with the zero-values in such columns replaced with nans."
584 | ]
585 | },
586 | {
587 | "cell_type": "code",
588 | "execution_count": 52,
589 | "metadata": {},
590 | "outputs": [],
591 | "source": [
592 | "train_df_with_nans = train_df.copy(deep=True)\n",
593 | "eval_df_with_nans = eval_df.copy(deep=True)\n",
594 | "for col_with_nans in ['BloodPressure', 'SkinThickness', 'BMI', 'Age']:\n",
595 | " train_df_with_nans[col_with_nans] = train_df_with_nans[col_with_nans].replace(0, np.nan)\n",
596 | " eval_df_with_nans[col_with_nans] = eval_df_with_nans[col_with_nans].replace(0, np.nan)\n",
597 | "train_features_with_nans = train_df_with_nans.loc[:, train_df_with_nans.columns != 'Outcome'].values\n",
598 | "eval_features_with_nans = eval_df_with_nans.loc[:, eval_df_with_nans.columns != 'Outcome'].values"
599 | ]
600 | },
601 | {
602 | "cell_type": "code",
603 | "execution_count": 53,
604 | "metadata": {},
605 | "outputs": [
606 | {
607 | "name": "stdout",
608 | "output_type": "stream",
609 | "text": [
610 | "Here are the training rows with at least one missing values.\n",
611 | "\n",
612 | "You can see that such incomplete data points constitute a substantial part of the data.\n",
613 | "\n"
614 | ]
615 | },
616 | {
617 | "data": {
618 | "text/html": [
619 | "\n",
620 | "\n",
633 | "
\n",
634 | " \n",
635 | " \n",
636 | " | \n",
637 | " Pregnancies | \n",
638 | " Glucose | \n",
639 | " BloodPressure | \n",
640 | " SkinThickness | \n",
641 | " Insulin | \n",
642 | " BMI | \n",
643 | " DiabetesPedigreeFunction | \n",
644 | " Age | \n",
645 | " Outcome | \n",
646 | "
\n",
647 | " \n",
648 | " \n",
649 | " \n",
650 | " 1 | \n",
651 | " 8 | \n",
652 | " 183 | \n",
653 | " 64.0 | \n",
654 | " NaN | \n",
655 | " 0 | \n",
656 | " 23.3 | \n",
657 | " 0.672 | \n",
658 | " 32 | \n",
659 | " 1 | \n",
660 | "
\n",
661 | " \n",
662 | " 4 | \n",
663 | " 5 | \n",
664 | " 116 | \n",
665 | " 74.0 | \n",
666 | " NaN | \n",
667 | " 0 | \n",
668 | " 25.6 | \n",
669 | " 0.201 | \n",
670 | " 30 | \n",
671 | " 0 | \n",
672 | "
\n",
673 | " \n",
674 | " 5 | \n",
675 | " 10 | \n",
676 | " 115 | \n",
677 | " NaN | \n",
678 | " NaN | \n",
679 | " 0 | \n",
680 | " 35.3 | \n",
681 | " 0.134 | \n",
682 | " 29 | \n",
683 | " 0 | \n",
684 | "
\n",
685 | " \n",
686 | " 7 | \n",
687 | " 8 | \n",
688 | " 125 | \n",
689 | " 96.0 | \n",
690 | " NaN | \n",
691 | " 0 | \n",
692 | " NaN | \n",
693 | " 0.232 | \n",
694 | " 54 | \n",
695 | " 1 | \n",
696 | "
\n",
697 | " \n",
698 | " 8 | \n",
699 | " 4 | \n",
700 | " 110 | \n",
701 | " 92.0 | \n",
702 | " NaN | \n",
703 | " 0 | \n",
704 | " 37.6 | \n",
705 | " 0.191 | \n",
706 | " 30 | \n",
707 | " 0 | \n",
708 | "
\n",
709 | " \n",
710 | " ... | \n",
711 | " ... | \n",
712 | " ... | \n",
713 | " ... | \n",
714 | " ... | \n",
715 | " ... | \n",
716 | " ... | \n",
717 | " ... | \n",
718 | " ... | \n",
719 | " ... | \n",
720 | "
\n",
721 | " \n",
722 | " 598 | \n",
723 | " 6 | \n",
724 | " 162 | \n",
725 | " 62.0 | \n",
726 | " NaN | \n",
727 | " 0 | \n",
728 | " 24.3 | \n",
729 | " 0.178 | \n",
730 | " 50 | \n",
731 | " 1 | \n",
732 | "
\n",
733 | " \n",
734 | " 599 | \n",
735 | " 4 | \n",
736 | " 136 | \n",
737 | " 70.0 | \n",
738 | " NaN | \n",
739 | " 0 | \n",
740 | " 31.2 | \n",
741 | " 1.182 | \n",
742 | " 22 | \n",
743 | " 1 | \n",
744 | "
\n",
745 | " \n",
746 | " 605 | \n",
747 | " 1 | \n",
748 | " 106 | \n",
749 | " 76.0 | \n",
750 | " NaN | \n",
751 | " 0 | \n",
752 | " 37.5 | \n",
753 | " 0.197 | \n",
754 | " 26 | \n",
755 | " 0 | \n",
756 | "
\n",
757 | " \n",
758 | " 606 | \n",
759 | " 6 | \n",
760 | " 190 | \n",
761 | " 92.0 | \n",
762 | " NaN | \n",
763 | " 0 | \n",
764 | " 35.5 | \n",
765 | " 0.278 | \n",
766 | " 66 | \n",
767 | " 1 | \n",
768 | "
\n",
769 | " \n",
770 | " 612 | \n",
771 | " 1 | \n",
772 | " 126 | \n",
773 | " 60.0 | \n",
774 | " NaN | \n",
775 | " 0 | \n",
776 | " 30.1 | \n",
777 | " 0.349 | \n",
778 | " 47 | \n",
779 | " 1 | \n",
780 | "
\n",
781 | " \n",
782 | "
\n",
783 | "
186 rows × 9 columns
\n",
784 | "
"
785 | ],
786 | "text/plain": [
787 | " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n",
788 | "1 8 183 64.0 NaN 0 23.3 \n",
789 | "4 5 116 74.0 NaN 0 25.6 \n",
790 | "5 10 115 NaN NaN 0 35.3 \n",
791 | "7 8 125 96.0 NaN 0 NaN \n",
792 | "8 4 110 92.0 NaN 0 37.6 \n",
793 | ".. ... ... ... ... ... ... \n",
794 | "598 6 162 62.0 NaN 0 24.3 \n",
795 | "599 4 136 70.0 NaN 0 31.2 \n",
796 | "605 1 106 76.0 NaN 0 37.5 \n",
797 | "606 6 190 92.0 NaN 0 35.5 \n",
798 | "612 1 126 60.0 NaN 0 30.1 \n",
799 | "\n",
800 | " DiabetesPedigreeFunction Age Outcome \n",
801 | "1 0.672 32 1 \n",
802 | "4 0.201 30 0 \n",
803 | "5 0.134 29 0 \n",
804 | "7 0.232 54 1 \n",
805 | "8 0.191 30 0 \n",
806 | ".. ... ... ... \n",
807 | "598 0.178 50 1 \n",
808 | "599 1.182 22 1 \n",
809 | "605 0.197 26 0 \n",
810 | "606 0.278 66 1 \n",
811 | "612 0.349 47 1 \n",
812 | "\n",
813 | "[186 rows x 9 columns]"
814 | ]
815 | },
816 | "execution_count": 53,
817 | "metadata": {},
818 | "output_type": "execute_result"
819 | }
820 | ],
821 | "source": [
822 | "print('Here are the training rows with at least one missing values.')\n",
823 | "print('')\n",
824 | "print('You can see that such incomplete data points constitute a substantial part of the data.')\n",
825 | "print('')\n",
826 | "nan_training_data = train_df_with_nans[train_df_with_nans.isna().any(axis=1)]\n",
827 | "nan_training_data"
828 | ]
829 | },
830 | {
831 | "cell_type": "markdown",
832 | "metadata": {},
833 | "source": [
834 | "# 1. Part 1 (Building a simple Naive Bayes Classifier)"
835 | ]
836 | },
837 | {
838 | "cell_type": "markdown",
839 | "metadata": {},
840 | "source": [
841 | "Consider a single sample $(\\mathbf{x}, y)$, where the feature vector is denoted with $\\mathbf{x}$, and the label is denoted with $y$. We will also denote the $j^{th}$ feature of $\\mathbf{x}$ with $x^{(j)}$.\n",
842 | "\n",
843 | "According to the textbook, the Naive Bayes Classifier uses the following decision rule:\n",
844 | "\n",
845 | "\"Choose $y$ such that $$\\bigg[\\log p(y) + \\sum_{j} \\log p(x^{(j)}|y) \\bigg]$$ is the largest\"\n",
846 | "\n",
847 | "However, we first need to define the probabilistic models of the prior $p(y)$ and the class-conditional feature distributions $p(x^{(j)}|y)$ using the training data.\n",
848 | "\n",
849 | "* **Modelling the prior $p(y)$**: We fit a Bernoulli distribution to the `Outcome` variable of `train_df`.\n",
850 | "* **Modelling the class-conditional feature distributions $p(x^{(j)}|y)$**: We fit Gaussian distributions, and infer the Gaussian mean and variance parameters from `train_df`."
851 | ]
852 | },
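853 | {
854 | "cell_type": "markdown",
855 | "metadata": {},
856 | "source": [
857 | "As a toy numeric illustration of this rule (with made-up numbers, not taken from this dataset): if $\\log p(y=0)=-0.4$ and $\\log p(y=1)=-1.1$, and a sample's feature log-likelihoods sum to $-28.0$ under $y=0$ and $-26.0$ under $y=1$, then the two scores are $-28.4$ and $-27.1$, so the classifier predicts $y=1$ even though class $1$ has the smaller prior."
858 | ]
859 | },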
853 | {
854 | "cell_type": "markdown",
855 | "metadata": {},
856 | "source": [
857 | "# Task 1"
858 | ]
859 | },
860 | {
861 | "cell_type": "markdown",
862 | "metadata": {},
863 | "source": [
864 | "Write a function `log_prior` that takes a numpy array `train_labels` as input, and outputs the following vector as a column numpy array (i.e., with shape $(2,1)$).\n",
865 | "\n",
866 | "$$\\log p_y =\\begin{bmatrix}\\log p(y=0)\\\\\\log p(y=1)\\end{bmatrix}$$\n",
867 | "\n",
868 | "Try and avoid the utilization of loops as much as possible. No loops are necessary.\n",
869 | "\n",
870 | "**Hint**: Make sure all the array shapes are what you need and expect. You can reshape any numpy array without any tangible computational over-head."
871 | ]
872 | },
873 | {
874 | "cell_type": "code",
875 | "execution_count": 128,
876 | "metadata": {
877 | "deletable": false
878 | },
879 | "outputs": [],
880 | "source": [
881 | "def log_prior(train_labels):\n",
882 | " \n",
883 | " # your code here\n",
884 | "# raise NotImplementedError\n",
885 | " \n",
886 | " num_of_one = np.sum(train_labels)\n",
887 | " num_of_zero = np.shape(train_labels)[0] - num_of_one\n",
888 | " \n",
889 | " log_py = np.array([np.log(num_of_zero/np.shape(train_labels)[0]), np.log(num_of_one/np.shape(train_labels)[0])]).reshape(2, 1)\n",
890 | "\n",
891 | " return log_py\n"
892 | ]
893 | },
894 | {
895 | "cell_type": "code",
896 | "execution_count": 129,
897 | "metadata": {
898 | "deletable": false,
899 | "editable": false,
900 | "nbgrader": {
901 | "cell_type": "code",
902 | "checksum": "58446a9c6b83fc53b43a0d41fce9f93b",
903 | "grade": false,
904 | "grade_id": "cell-89c12d30b6fb44cb",
905 | "locked": true,
906 | "schema_version": 3,
907 | "solution": false,
908 | "task": false
909 | }
910 | },
911 | "outputs": [],
912 | "source": [
913 | "# Performing sanity checks on your implementation\n",
914 | "some_labels = np.array([0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1])\n",
915 | "some_log_py = log_prior(some_labels)\n",
916 | "assert np.array_equal(some_log_py.round(3), np.array([[-0.916], [-0.511]]))\n",
917 | "\n",
918 | "# Checking against the pre-computed test database\n",
919 | "test_results = test_case_checker(log_prior, task_id=1)\n",
920 | "assert test_results['passed'], test_results['message']"
921 | ]
922 | },
923 | {
924 | "cell_type": "code",
925 | "execution_count": 130,
926 | "metadata": {
927 | "code_folding": [
928 | 73,
929 | 135
930 | ],
931 | "deletable": false,
932 | "editable": false,
933 | "nbgrader": {
934 | "cell_type": "code",
935 | "checksum": "14ff380a49035c9250c4323e60337bc3",
936 | "grade": true,
937 | "grade_id": "cell-e263f2b1878b37bc",
938 | "locked": true,
939 | "points": 1,
940 | "schema_version": 3,
941 | "solution": false,
942 | "task": false
943 | }
944 | },
945 | "outputs": [],
946 | "source": [
947 | "# This cell is left empty as a seperator. You can leave this cell as it is, and you should not delete it.\n"
948 | ]
949 | },
950 | {
951 | "cell_type": "code",
952 | "execution_count": 131,
953 | "metadata": {},
954 | "outputs": [
955 | {
956 | "data": {
957 | "text/plain": [
958 | "array([[-0.41610786],\n",
959 | " [-1.07766068]])"
960 | ]
961 | },
962 | "execution_count": 131,
963 | "metadata": {},
964 | "output_type": "execute_result"
965 | }
966 | ],
967 | "source": [
968 | "log_py = log_prior(train_labels)\n",
969 | "log_py"
970 | ]
971 | },
972 | {
973 | "cell_type": "markdown",
974 | "metadata": {},
975 | "source": [
976 | "# Task 2"
977 | ]
978 | },
979 | {
980 | "cell_type": "markdown",
981 | "metadata": {},
982 | "source": [
983 | "Write a function `cc_mean_ignore_missing` that takes the numpy arrays `train_features` and `train_labels` as input, and outputs the following matrix with the shape $(8,2)$, where 8 is the number of features.\n",
984 | "\n",
985 | "$$\\mu_y = \\begin{bmatrix} \\mathbb{E}[x^{(0)}|y=0] & \\mathbb{E}[x^{(0)}|y=1]\\\\\n",
986 | "\\mathbb{E}[x^{(1)}|y=0] & \\mathbb{E}[x^{(1)}|y=1] \\\\\n",
987 | "\\cdots & \\cdots\\\\\n",
988 | "\\mathbb{E}[x^{(7)}|y=0] & \\mathbb{E}[x^{(7)}|y=1]\\end{bmatrix}$$\n",
989 | "\n",
990 | "Some points regarding this task:\n",
991 | "\n",
992 | "* The `train_features` numpy array has a shape of `(N,8)` where `N` is the number of training data points, and 8 is the number of the features. \n",
993 | "\n",
994 | "* The `train_labels` numpy array has a shape of `(N,)`. \n",
995 | "\n",
996 | "* **You can assume that `train_features` has no missing elements in this task**.\n",
997 | "\n",
998 | "* Try and avoid the utilization of loops as much as possible. No loops are necessary."
999 | ]
1000 | },
1001 | {
1002 | "cell_type": "code",
1003 | "execution_count": 132,
1004 | "metadata": {
1005 | "deletable": false
1006 | },
1007 | "outputs": [],
1008 | "source": [
1009 | "def cc_mean_ignore_missing(train_features, train_labels):\n",
1010 | " N, d = train_features.shape\n",
1011 | " # your code here\n",
1012 | " \n",
1013 | " # Fist calculate the second column:\n",
1014 | "# dotProduct = np.matmul(train_labels.reshape(1, N), train_features).reshape(d, 1)\n",
1015 | "# secondColumn = dotProduct / np.sum(train_labels)\n",
1016 | " \n",
1017 | "# dotProduct2 = np.matmul(np.ones((1, N)), train_features).reshape(d, 1)\n",
1018 | "# firstColumn = (dotProduct2 - dotProduct) / (N - np.sum(train_labels))\n",
1019 | " \n",
1020 | " \n",
1021 | " # Extract the index of zeros and ones seperately\n",
1022 | " allZeros = np.where(train_labels == 0)[0]\n",
1023 | " allOnes = np.where(train_labels == 1)[0]\n",
1024 | " \n",
1025 | " # Convert the 2-d numpy array to DataFrame\n",
1026 | " df_features = pd.DataFrame(train_features)\n",
1027 | " \n",
1028 | " \n",
1029 | " dfZeros = df_features.loc[allZeros]\n",
1030 | " dfOnes = df_features.loc[allOnes]\n",
1031 | " \n",
1032 | " \n",
1033 | " firstCol = dfZeros.mean(axis=0).to_numpy().reshape((d, 1))\n",
1034 | " secondCol = dfOnes.mean(axis=0).to_numpy().reshape((d, 1))\n",
1035 | " \n",
1036 | " mu_y = np.concatenate((firstCol, secondCol), axis=1)\n",
1037 | " assert mu_y.shape == (d, 2)\n",
1038 | " return mu_y"
1039 | ]
1040 | },
1041 | {
1042 | "cell_type": "code",
1043 | "execution_count": 133,
1044 | "metadata": {
1045 | "deletable": false,
1046 | "editable": false,
1047 | "nbgrader": {
1048 | "cell_type": "code",
1049 | "checksum": "101072d0656f58c95247f1efe296a85b",
1050 | "grade": false,
1051 | "grade_id": "cell-feae5e6e77107267",
1052 | "locked": true,
1053 | "schema_version": 3,
1054 | "solution": false,
1055 | "task": false
1056 | }
1057 | },
1058 | "outputs": [],
1059 | "source": [
1060 | "# Performing sanity checks on your implementation\n",
1061 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n",
1062 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n",
1063 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n",
1064 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n",
1065 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n",
1066 | "some_labels = np.array([0, 1, 0, 1, 0])\n",
1067 | "\n",
1068 | "some_mu_y = cc_mean_ignore_missing(some_feats, some_labels)\n",
1069 | "\n",
1070 | "assert np.array_equal(some_mu_y.round(2), np.array([[ 2.33, 4. ],\n",
1071 | " [ 96.67, 160. ],\n",
1072 | " [ 68.67, 52. ],\n",
1073 | " [ 17.33, 17.5 ],\n",
1074 | " [ 31.33, 84. ],\n",
1075 | " [ 26.77, 33.2 ],\n",
1076 | " [ 0.27, 1.5 ],\n",
1077 | " [ 27.33, 32.5 ]]))\n",
1078 | "\n",
1079 | "# Checking against the pre-computed test database\n",
1080 | "test_results = test_case_checker(cc_mean_ignore_missing, task_id=2)\n",
1081 | "assert test_results['passed'], test_results['message']"
1082 | ]
1083 | },
1084 | {
1085 | "cell_type": "code",
1086 | "execution_count": 134,
1087 | "metadata": {
1088 | "code_folding": [],
1089 | "deletable": false,
1090 | "editable": false,
1091 | "nbgrader": {
1092 | "cell_type": "code",
1093 | "checksum": "a98f1ecc45f43b138e415573f0408bab",
1094 | "grade": true,
1095 | "grade_id": "cell-e263f2b1878b37bj",
1096 | "locked": true,
1097 | "points": 1,
1098 | "schema_version": 3,
1099 | "solution": false,
1100 | "task": false
1101 | }
1102 | },
1103 | "outputs": [],
1104 | "source": [
1105 | "# This cell is left empty as a seperator. You can leave this cell as it is, and you should not delete it.\n"
1106 | ]
1107 | },
1108 | {
1109 | "cell_type": "code",
1110 | "execution_count": 135,
1111 | "metadata": {},
1112 | "outputs": [
1113 | {
1114 | "data": {
1115 | "text/plain": [
1116 | "array([[ 3.48641975, 4.91866029],\n",
1117 | " [109.99753086, 142.30143541],\n",
1118 | " [ 68.77037037, 70.66028708],\n",
1119 | " [ 19.51358025, 21.97129187],\n",
1120 | " [ 66.25679012, 100.55980861],\n",
1121 | " [ 30.31703704, 35.1492823 ],\n",
1122 | " [ 0.42825926, 0.55279904],\n",
1123 | " [ 31.57283951, 37.39712919]])"
1124 | ]
1125 | },
1126 | "execution_count": 135,
1127 | "metadata": {},
1128 | "output_type": "execute_result"
1129 | }
1130 | ],
1131 | "source": [
1132 | "mu_y = cc_mean_ignore_missing(train_features, train_labels)\n",
1133 | "mu_y"
1134 | ]
1135 | },
1136 | {
1137 | "cell_type": "markdown",
1138 | "metadata": {},
1139 | "source": [
1140 | "# Task 3"
1141 | ]
1142 | },
1143 | {
1144 | "cell_type": "markdown",
1145 | "metadata": {},
1146 | "source": [
1147 | "Write a function `cc_std_ignore_missing` that takes the numpy arrays `train_features` and `train_labels` as input, and outputs the following matrix with the shape $(8,2)$, where 8 is the number of features.\n",
1148 | "\n",
1149 | "$$\\sigma_y = \\begin{bmatrix} \\text{std}[x^{(0)}|y=0] & \\text{std}[x^{(0)}|y=1]\\\\\n",
1150 | "\\text{std}[x^{(1)}|y=0] & \\text{std}[x^{(1)}|y=1] \\\\\n",
1151 | "\\cdots & \\cdots\\\\\n",
1152 | "\\text{std}[x^{(7)}|y=0] & \\text{std}[x^{(7)}|y=1]\\end{bmatrix}$$\n",
1153 | "\n",
1154 | "Some points regarding this task:\n",
1155 | "\n",
1156 | "* The `train_features` numpy array has a shape of `(N,8)` where `N` is the number of training data points, and 8 is the number of the features. \n",
1157 | "\n",
1158 | "* The `train_labels` numpy array has a shape of `(N,)`. \n",
1159 | "\n",
1160 | "* **You can assume that `train_features` has no missing elements in this task**.\n",
1161 | "\n",
1162 | "* Try and avoid the utilization of loops as much as possible. No loops are necessary."
1163 | ]
1164 | },
1165 | {
1166 | "cell_type": "code",
1167 | "execution_count": 136,
1168 | "metadata": {
1169 | "deletable": false
1170 | },
1171 | "outputs": [],
1172 | "source": [
1173 | "def cc_std_ignore_missing(train_features, train_labels):\n",
1174 | " N, d = train_features.shape\n",
1175 | " \n",
1176 | " # Extract the index of zeros and ones seperately\n",
1177 | " allZeros = np.where(train_labels == 0)[0]\n",
1178 | " allOnes = np.where(train_labels == 1)[0]\n",
1179 | " \n",
1180 | " # Convert the 2-d numpy array to DataFrame\n",
1181 | " df_features = pd.DataFrame(train_features)\n",
1182 | " \n",
1183 | " \n",
1184 | " dfZeros = df_features.loc[allZeros]\n",
1185 | " dfOnes = df_features.loc[allOnes]\n",
1186 | " \n",
1187 | " \n",
1188 | " firstCol = dfZeros.std(axis=0, ddof=0).to_numpy().reshape((d, 1))\n",
1189 | " secondCol = dfOnes.std(axis=0, ddof=0).to_numpy().reshape((d, 1))\n",
1190 | " \n",
1191 | " sigma_y = np.concatenate((firstCol, secondCol), axis=1)\n",
1192 | " \n",
1193 | " assert sigma_y.shape == (d, 2)\n",
1194 | "\n",
1195 | " return sigma_y"
1196 | ]
1197 | },
1198 | {
1199 | "cell_type": "code",
1200 | "execution_count": 137,
1201 | "metadata": {
1202 | "deletable": false,
1203 | "editable": false,
1204 | "nbgrader": {
1205 | "cell_type": "code",
1206 | "checksum": "9a6eeb9ba6ff8c69ef9ec7a78800d904",
1207 | "grade": false,
1208 | "grade_id": "cell-347ad2c612aa195e",
1209 | "locked": true,
1210 | "schema_version": 3,
1211 | "solution": false,
1212 | "task": false
1213 | }
1214 | },
1215 | "outputs": [],
1216 | "source": [
1217 | "# Performing sanity checks on your implementation\n",
1218 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n",
1219 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n",
1220 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n",
1221 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n",
1222 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n",
1223 | "some_labels = np.array([0, 1, 0, 1, 0])\n",
1224 | "\n",
1225 | "some_std_y = cc_std_ignore_missing(some_feats, some_labels)\n",
1226 | "\n",
1227 | "assert np.array_equal(some_std_y.round(3), np.array([[ 1.886, 4. ],\n",
1228 | " [13.768, 23. ],\n",
1229 | " [ 3.771, 12. ],\n",
1230 | " [12.499, 17.5 ],\n",
1231 | " [44.312, 84. ],\n",
1232 | " [ 1.027, 9.9 ],\n",
1233 | " [ 0.094, 0.8 ],\n",
1234 | " [ 4.497, 0.5 ]]))\n",
1235 | "\n",
1236 | "# Checking against the pre-computed test database\n",
1237 | "test_results = test_case_checker(cc_std_ignore_missing, task_id=3)\n",
1238 | "assert test_results['passed'], test_results['message']"
1239 | ]
1240 | },
1241 | {
1242 | "cell_type": "code",
1243 | "execution_count": 138,
1244 | "metadata": {
1245 | "code_folding": [],
1246 | "deletable": false,
1247 | "editable": false,
1248 | "nbgrader": {
1249 | "cell_type": "code",
1250 | "checksum": "468e750631e9197a917b4f8fc41f7a92",
1251 | "grade": true,
1252 | "grade_id": "cell-e263f2b1878b37bg",
1253 | "locked": true,
1254 | "points": 1,
1255 | "schema_version": 3,
1256 | "solution": false,
1257 | "task": false
1258 | }
1259 | },
1260 | "outputs": [],
1261 | "source": [
1262 | "# This cell is left empty as a seperator. You can leave this cell as it is, and you should not delete it.\n"
1263 | ]
1264 | },
1265 | {
1266 | "cell_type": "code",
1267 | "execution_count": 139,
1268 | "metadata": {},
1269 | "outputs": [
1270 | {
1271 | "data": {
1272 | "text/plain": [
1273 | "array([[ 3.1155426 , 3.75417931],\n",
1274 | " [ 25.96811899, 32.50910874],\n",
1275 | " [ 18.07540068, 21.69568568],\n",
1276 | " [ 15.02320635, 17.21685884],\n",
1277 | " [ 95.63339586, 139.24364214],\n",
1278 | " [ 7.50030986, 6.6625219 ],\n",
1279 | " [ 0.29438217, 0.37201494],\n",
1280 | " [ 11.67577435, 11.01543899]])"
1281 | ]
1282 | },
1283 | "execution_count": 139,
1284 | "metadata": {},
1285 | "output_type": "execute_result"
1286 | }
1287 | ],
1288 | "source": [
1289 | "sigma_y = cc_std_ignore_missing(train_features, train_labels)\n",
1290 | "sigma_y"
1291 | ]
1292 | },
1293 | {
1294 | "cell_type": "markdown",
1295 | "metadata": {},
1296 | "source": [
1297 | "# Task 4"
1298 | ]
1299 | },
1300 | {
1301 | "cell_type": "markdown",
1302 | "metadata": {},
1303 | "source": [
1304 | "Write a function `log_prob` that takes the numpy arrays `train_features`, $\\mu_y$, $\\sigma_y$, and $\\log p_y$ as input, and outputs the following matrix with the shape $(N, 2)$\n",
1305 | "\n",
1306 | "$$\\log p_{x,y} = \\begin{bmatrix} \\bigg[\\log p(y=0) + \\sum_{j=0}^{7} \\log p(x_1^{(j)}|y=0) \\bigg] & \\bigg[\\log p(y=1) + \\sum_{j=0}^{7} \\log p(x_1^{(j)}|y=1) \\bigg] \\\\\n",
1307 | "\\bigg[\\log p(y=0) + \\sum_{j=0}^{7} \\log p(x_2^{(j)}|y=0) \\bigg] & \\bigg[\\log p(y=1) + \\sum_{j=0}^{7} \\log p(x_2^{(j)}|y=1) \\bigg] \\\\\n",
1308 | "\\cdots & \\cdots \\\\\n",
1309 | "\\bigg[\\log p(y=0) + \\sum_{j=0}^{7} \\log p(x_N^{(j)}|y=0) \\bigg] & \\bigg[\\log p(y=1) + \\sum_{j=0}^{7} \\log p(x_N^{(j)}|y=1) \\bigg] \\\\\n",
1310 | "\\end{bmatrix}$$\n",
1311 | "\n",
1312 | "where\n",
1313 | "* $N$ is the number of training data points.\n",
1314 | "* $x_i$ is the $i^{th}$ training data point.\n",
1315 | "\n",
1316 | "Try and avoid the utilization of loops as much as possible. No loops are necessary."
1317 | ]
1318 | },
1319 | {
1320 | "cell_type": "markdown",
1321 | "metadata": {},
1322 | "source": [
1323 | "**Hint**: Remember that we are modelling $p(x_i^{(j)}|y)$ with a Gaussian whose parameters are defined inside $\\mu_y$ and $\\sigma_y$. Write the Gaussian PDF expression and take its natural log **on paper**, then implement it.\n",
1324 | "\n",
1325 | "**Important Note**: Do not use third-party and non-standard implementations for computing $\\log p(x_i^{(j)}|y)$. Using functions that find the Gaussian PDF, and then taking their log is **numerically unstable**; the Gaussian PDF values can easily become extremely small numbers that cannot be represented using floating point standards and thus would be stored as zero. Taking the log of a zero value will throw an error. On the other hand, it is unnecessary to compute and store $p(x_i^{(j)}|y)$ in order to find $\\log p(x_i^{(j)}|y)$; you can write $\\log p(x_i^{(j)}|y)$ as a direct function of $\\mu_y$, $\\sigma_y$ and the features. This latter approach is numerically stable, and can be applied when the PDF values are much smaller than could be stored using the common standards."
1326 | ]
1327 | },
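1328 | {
1329 | "cell_type": "markdown",
1330 | "metadata": {},
1331 | "source": [
1332 | "For reference, expanding the Gaussian PDF and taking its natural log gives the numerically stable form (with $\\mu_y^{(j)}$ and $\\sigma_y^{(j)}$ denoting the class-conditional parameters of feature $j$):\n",
1333 | "\n",
1334 | "$$\\log p(x^{(j)}|y) = -\\log\\left(\\sigma_y^{(j)}\\sqrt{2\\pi}\\right) - \\frac{\\left(x^{(j)}-\\mu_y^{(j)}\\right)^2}{2\\left(\\sigma_y^{(j)}\\right)^2},$$\n",
1335 | "\n",
1336 | "which can be evaluated directly from $\\mu_y$, $\\sigma_y$, and the features without ever forming the PDF values themselves."
1337 | ]
1338 | },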
1328 | {
1329 | "cell_type": "code",
1330 | "execution_count": 140,
1331 | "metadata": {
1332 | "deletable": false
1333 | },
1334 | "outputs": [],
1335 | "source": [
1336 | "def log_prob(train_features, mu_y, sigma_y, log_py):\n",
1337 | " N, d = train_features.shape\n",
1338 | " \n",
1339 | " # Extract the index of zeros and ones seperately\n",
1340 | " mu0 = mu_y[:, 0]\n",
1341 | " sigma0 = sigma_y[:, 0]\n",
1342 | " \n",
1343 | " mu1 = mu_y[:, 1]\n",
1344 | " sigma1 = sigma_y[:, 1]\n",
1345 | "\n",
1346 | " firstCol = np.sum(np.log(1/(sigma0.reshape(1, d)*np.sqrt(2*np.pi)))-(1/2)*((train_features-mu0.reshape(1, d))/sigma0.reshape(1, d))**2, axis=1)+log_py[0]\n",
1347 | " secondCol = np.sum(np.log(1/(sigma1.reshape(1, d)*np.sqrt(2*np.pi)))-(1/2)*((train_features-mu1.reshape(1, d))/sigma1.reshape(1, d))**2, axis=1)+log_py[1]\n",
1348 | "\n",
1349 | " log_p_x_y = np.concatenate((firstCol.reshape(N, 1), secondCol.reshape(N,1)), axis=1)\n",
1350 | " assert log_p_x_y.shape == (N,2)\n",
1351 | " return log_p_x_y\n"
1352 | ]
1353 | },
1354 | {
1355 | "cell_type": "code",
1356 | "execution_count": 141,
1357 | "metadata": {
1358 | "deletable": false,
1359 | "editable": false,
1360 | "nbgrader": {
1361 | "cell_type": "code",
1362 | "checksum": "1381c37cc128fcc5502da552ceace2e6",
1363 | "grade": false,
1364 | "grade_id": "cell-86fb4d0c1943d700",
1365 | "locked": true,
1366 | "schema_version": 3,
1367 | "solution": false,
1368 | "task": false
1369 | }
1370 | },
1371 | "outputs": [],
1372 | "source": [
1373 | "# Performing sanity checks on your implementation\n",
1374 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n",
1375 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n",
1376 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n",
1377 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n",
1378 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n",
1379 | "some_labels = np.array([0, 1, 0, 1, 0])\n",
1380 | "\n",
1381 | "some_mu_y = cc_mean_ignore_missing(some_feats, some_labels)\n",
1382 | "some_std_y = cc_std_ignore_missing(some_feats, some_labels)\n",
1383 | "some_log_py = log_prior(some_labels)\n",
1384 | "\n",
1385 | "some_log_p_x_y = log_prob(some_feats, some_mu_y, some_std_y, some_log_py)\n",
1386 | "\n",
1387 | "assert np.array_equal(some_log_p_x_y.round(3), np.array([[ -20.822, -36.606],\n",
1388 | " [ -60.879, -27.944],\n",
1389 | " [ -21.774, -295.68 ],\n",
1390 | " [-417.359, -27.944],\n",
1391 | " [ -23.2 , -42.6 ]]))\n",
1392 | "\n",
1393 | "# Checking against the pre-computed test database\n",
1394 | "test_results = test_case_checker(log_prob, task_id=4)\n",
1395 | "assert test_results['passed'], test_results['message']"
1396 | ]
1397 | },
1398 | {
1399 | "cell_type": "code",
1400 | "execution_count": 142,
1401 | "metadata": {
1402 | "code_folding": [],
1403 | "deletable": false,
1404 | "editable": false,
1405 | "nbgrader": {
1406 | "cell_type": "code",
1407 | "checksum": "df1a1330fdef96dd92cc192241e744fc",
1408 | "grade": true,
1409 | "grade_id": "cell-e263f2b1878b37bh",
1410 | "locked": true,
1411 | "points": 1,
1412 | "schema_version": 3,
1413 | "solution": false,
1414 | "task": false
1415 | }
1416 | },
1417 | "outputs": [],
1418 | "source": [
1419 | "# This cell is left empty as a seperator. You can leave this cell as it is, and you should not delete it.\n"
1420 | ]
1421 | },
1422 | {
1423 | "cell_type": "code",
1424 | "execution_count": 143,
1425 | "metadata": {},
1426 | "outputs": [
1427 | {
1428 | "data": {
1429 | "text/plain": [
1430 | "array([[-26.96647828, -31.00418408],\n",
1431 | " [-32.4755447 , -31.39530914],\n",
1432 | " [-27.14875996, -31.51999532],\n",
1433 | " ...,\n",
1434 | " [-26.29368771, -29.09161966],\n",
1435 | " [-28.19432943, -30.08324788],\n",
1436 | " [-26.98605248, -30.80571318]])"
1437 | ]
1438 | },
1439 | "execution_count": 143,
1440 | "metadata": {},
1441 | "output_type": "execute_result"
1442 | }
1443 | ],
1444 | "source": [
1445 | "log_p_x_y = log_prob(train_features, mu_y, sigma_y, log_py)\n",
1446 | "log_p_x_y"
1447 | ]
1448 | },
1449 | {
1450 | "cell_type": "markdown",
1451 | "metadata": {},
1452 | "source": [
1453 | "## 1.1. Writing the Simple Naive Bayes Classifier"
1454 | ]
1455 | },
1456 | {
1457 | "cell_type": "code",
1458 | "execution_count": 144,
1459 | "metadata": {},
1460 | "outputs": [],
1461 | "source": [
1462 | "class NBClassifier():\n",
1463 | " def __init__(self, train_features, train_labels):\n",
1464 | " self.train_features = train_features\n",
1465 | " self.train_labels = train_labels\n",
1466 | " self.log_py = log_prior(train_labels)\n",
1467 | " self.mu_y = self.get_cc_means()\n",
1468 | " self.sigma_y = self.get_cc_std()\n",
1469 | " \n",
1470 | " def get_cc_means(self):\n",
1471 | " mu_y = cc_mean_ignore_missing(self.train_features, self.train_labels)\n",
1472 | " return mu_y\n",
1473 | " \n",
1474 | " def get_cc_std(self):\n",
1475 | " sigma_y = cc_std_ignore_missing(self.train_features, self.train_labels)\n",
1476 | " return sigma_y\n",
1477 | " \n",
1478 | " def predict(self, features):\n",
1479 | " log_p_x_y = log_prob(features, self.mu_y, self.sigma_y, self.log_py)\n",
1480 | " return log_p_x_y.argmax(axis=1)"
1481 | ]
1482 | },
1483 | {
1484 | "cell_type": "code",
1485 | "execution_count": 145,
1486 | "metadata": {},
1487 | "outputs": [],
1488 | "source": [
1489 | "diabetes_classifier = NBClassifier(train_features, train_labels)\n",
1490 | "train_pred = diabetes_classifier.predict(train_features)\n",
1491 | "eval_pred = diabetes_classifier.predict(eval_features)"
1492 | ]
1493 | },
1494 | {
1495 | "cell_type": "code",
1496 | "execution_count": 146,
1497 | "metadata": {},
1498 | "outputs": [
1499 | {
1500 | "name": "stdout",
1501 | "output_type": "stream",
1502 | "text": [
1503 | "The training data accuracy of your trained model is 0.7671009771986971\n",
1504 | "The evaluation data accuracy of your trained model is 0.7532467532467533\n"
1505 | ]
1506 | }
1507 | ],
1508 | "source": [
1509 | "train_acc = (train_pred==train_labels).mean()\n",
1510 | "eval_acc = (eval_pred==eval_labels).mean()\n",
1511 | "print(f'The training data accuracy of your trained model is {train_acc}')\n",
1512 | "print(f'The evaluation data accuracy of your trained model is {eval_acc}')"
1513 | ]
1514 | },
1515 | {
1516 | "cell_type": "markdown",
1517 | "metadata": {},
1518 | "source": [
1519 | "## 1.2 Running an off-the-shelf implementation of Naive-Bayes For Comparison"
1520 | ]
1521 | },
1522 | {
1523 | "cell_type": "code",
1524 | "execution_count": 147,
1525 | "metadata": {},
1526 | "outputs": [
1527 | {
1528 | "name": "stdout",
1529 | "output_type": "stream",
1530 | "text": [
1531 | "The training data accuracy of your trained model is 0.7671009771986971\n",
1532 | "The evaluation data accuracy of your trained model is 0.7532467532467533\n"
1533 | ]
1534 | }
1535 | ],
1536 | "source": [
1537 | "from sklearn.naive_bayes import GaussianNB\n",
1538 | "gnb = GaussianNB().fit(train_features, train_labels)\n",
1539 | "train_pred_sk = gnb.predict(train_features)\n",
1540 | "eval_pred_sk = gnb.predict(eval_features)\n",
1541 | "print(f'The training data accuracy of your trained model is {(train_pred_sk == train_labels).mean()}')\n",
1542 | "print(f'The evaluation data accuracy of your trained model is {(eval_pred_sk == eval_labels).mean()}')"
1543 | ]
1544 | },
1545 | {
1546 | "cell_type": "markdown",
1547 | "metadata": {},
1548 | "source": [
1549 | "# Part 2 (Building a Naive Bayes Classifier Considering Missing Entries)"
1550 | ]
1551 | },
1552 | {
1553 | "cell_type": "markdown",
1554 | "metadata": {},
1555 | "source": [
1556 | "In this part, we will modify some of the parameter inference functions of the Naive Bayes classifier to make it able to ignore the NaN entries when inferring the Gaussian mean and stds."
1557 | ]
1558 | },
1559 | {
1560 | "cell_type": "markdown",
1561 | "metadata": {},
1562 | "source": [
1563 | "# Task 5"
1564 | ]
1565 | },
1566 | {
1567 | "cell_type": "markdown",
1568 | "metadata": {},
1569 | "source": [
1570 | "Write a function `cc_mean_consider_missing` that\n",
1571 | "* has exactly the same input and output types as the `cc_mean_ignore_missing` function,\n",
1572 | "* and has similar functionality to `cc_mean_ignore_missing` except that it can handle and ignore the NaN entries when computing the class conditional means.\n",
1573 | "\n",
1574 | "You can borrow most of the code from your `cc_mean_ignore_missing` implementation, but you should make it compatible with the existence of NaN values in the features.\n",
1575 | "\n",
1576 | "Try and avoid the utilization of loops as much as possible. No loops are necessary."
1577 | ]
1578 | },
1579 | {
1580 | "cell_type": "markdown",
1581 | "metadata": {},
1582 | "source": [
1583 | "* **Hint**: You may find the `np.nanmean` function useful."
1584 | ]
1585 | },
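1586 | {
1587 | "cell_type": "markdown",
1588 | "metadata": {},
1589 | "source": [
1590 | "As a quick illustration (not part of the graded tasks): `np.nanmean(np.array([1.0, np.nan, 3.0]))` returns `2.0`, averaging only the non-NaN entries, whereas `np.mean` would return `nan`. With `axis=0`, `np.nanmean` applies the same rule column by column, which is exactly what is needed here; `np.nanstd` behaves analogously for Task 6."
1591 | ]
1592 | },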
1586 | {
1587 | "cell_type": "code",
1588 | "execution_count": 148,
1589 | "metadata": {
1590 | "deletable": false
1591 | },
1592 | "outputs": [],
1593 | "source": [
1594 | "def cc_mean_consider_missing(train_features_with_nans, train_labels):\n",
1595 | " N, d = train_features_with_nans.shape\n",
1596 | " \n",
1597 | " # your code here\n",
1598 | " \n",
1599 | " # Extract the index of zeros and ones seperately\n",
1600 | " allZeros = np.where(train_labels == 0)[0]\n",
1601 | " allOnes = np.where(train_labels == 1)[0]\n",
1602 | " \n",
1603 | " # Convert the 2-d numpy array to DataFrame\n",
1604 | " df_features = pd.DataFrame(train_features_with_nans)\n",
1605 | " \n",
1606 | " \n",
1607 | " dfZeros = df_features.loc[allZeros]\n",
1608 | " dfOnes = df_features.loc[allOnes]\n",
1609 | " \n",
1610 | " \n",
1611 | " firstCol = np.nanmean(dfZeros, axis=0).reshape((d, 1))\n",
1612 | " secondCol = np.nanmean(dfOnes, axis=0).reshape((d, 1))\n",
1613 | " \n",
1614 | " mu_y = np.concatenate((firstCol, secondCol), axis=1)\n",
1615 | " \n",
1616 | " assert not np.isnan(mu_y).any()\n",
1617 | " assert mu_y.shape == (d, 2)\n",
1618 | " return mu_y"
1619 | ]
1620 | },
1621 | {
1622 | "cell_type": "code",
1623 | "execution_count": 149,
1624 | "metadata": {
1625 | "deletable": false,
1626 | "editable": false,
1627 | "nbgrader": {
1628 | "cell_type": "code",
1629 | "checksum": "6303d5da34d9e332d33292b47c7bf113",
1630 | "grade": false,
1631 | "grade_id": "cell-ca4af11e9d8a7fdb",
1632 | "locked": true,
1633 | "schema_version": 3,
1634 | "solution": false,
1635 | "task": false
1636 | }
1637 | },
1638 | "outputs": [],
1639 | "source": [
1640 | "# Performing sanity checks on your implementation\n",
1641 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n",
1642 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n",
1643 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n",
1644 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n",
1645 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n",
1646 | "some_labels = np.array([0, 1, 0, 1, 0])\n",
1647 | "\n",
1648 | "for i,j in [(0,0), (1,1), (2,3), (3,4), (4, 2)]:\n",
1649 | " some_feats[i,j] = np.nan\n",
1650 | "\n",
1651 | "some_mu_y = cc_mean_consider_missing(some_feats, some_labels)\n",
1652 | "\n",
1653 | "assert np.array_equal(some_mu_y.round(2), np.array([[ 3. , 4. ],\n",
1654 | " [ 96.67, 137. ],\n",
1655 | " [ 66. , 52. ],\n",
1656 | " [ 14.5 , 17.5 ],\n",
1657 | " [ 31.33, 0. ],\n",
1658 | " [ 26.77, 33.2 ],\n",
1659 | " [ 0.27, 1.5 ],\n",
1660 | " [ 27.33, 32.5 ]]))\n",
1661 | "\n",
1662 | "# Checking against the pre-computed test database\n",
1663 | "test_results = test_case_checker(cc_mean_consider_missing, task_id=5)\n",
1664 | "assert test_results['passed'], test_results['message']"
1665 | ]
1666 | },
1667 | {
1668 | "cell_type": "code",
1669 | "execution_count": 150,
1670 | "metadata": {
1671 | "code_folding": [],
1672 | "deletable": false,
1673 | "editable": false,
1674 | "nbgrader": {
1675 | "cell_type": "code",
1676 | "checksum": "fe73c92df82cf2b5b4d938c18671be6b",
1677 | "grade": true,
1678 | "grade_id": "cell-e263f2b1878b37bf",
1679 | "locked": true,
1680 | "points": 1,
1681 | "schema_version": 3,
1682 | "solution": false,
1683 | "task": false
1684 | }
1685 | },
1686 | "outputs": [],
1687 | "source": [
1688 | "# This cell is left empty as a seperator. You can leave this cell as it is, and you should not delete it.\n"
1689 | ]
1690 | },
1691 | {
1692 | "cell_type": "code",
1693 | "execution_count": 151,
1694 | "metadata": {},
1695 | "outputs": [
1696 | {
1697 | "data": {
1698 | "text/plain": [
1699 | "array([[ 3.48641975, 4.91866029],\n",
1700 | " [109.99753086, 142.30143541],\n",
1701 | " [ 71.41538462, 75.34693878],\n",
1702 | " [ 27.53658537, 32.11188811],\n",
1703 | " [ 66.25679012, 100.55980861],\n",
1704 | " [ 30.85025126, 35.31826923],\n",
1705 | " [ 0.42825926, 0.55279904],\n",
1706 | " [ 31.57283951, 37.39712919]])"
1707 | ]
1708 | },
1709 | "execution_count": 151,
1710 | "metadata": {},
1711 | "output_type": "execute_result"
1712 | }
1713 | ],
1714 | "source": [
1715 | "mu_y = cc_mean_consider_missing(train_features_with_nans, train_labels)\n",
1716 | "mu_y"
1717 | ]
1718 | },
1719 | {
1720 | "cell_type": "markdown",
1721 | "metadata": {},
1722 | "source": [
1723 | "# Task 6"
1724 | ]
1725 | },
1726 | {
1727 | "cell_type": "markdown",
1728 | "metadata": {},
1729 | "source": [
1730 | "Write a function `cc_std_consider_missing` that\n",
1731 | "* has exactly the same input and output types as the `cc_std_ignore_missing` function,\n",
1732 | "* and has similar functionality to `cc_std_ignore_missing` except that it can handle and ignore the NaN entries when computing the class conditional means.\n",
1733 | "\n",
1734 | "You can borrow most of the code from your `cc_std_ignore_missing` implementation, but you should make it compatible with the existence of NaN values in the features.\n",
1735 | "\n",
1736 | "Try and avoid the utilization of loops as much as possible. No loops are necessary."
1737 | ]
1738 | },
1739 | {
1740 | "cell_type": "markdown",
1741 | "metadata": {},
1742 | "source": [
1743 | "* **Hint**: You may find the `np.nanstd` function useful."
1744 | ]
1745 | },
1746 | {
1747 | "cell_type": "code",
1748 | "execution_count": 152,
1749 | "metadata": {
1750 | "deletable": false
1751 | },
1752 | "outputs": [],
1753 | "source": [
1754 | "def cc_std_consider_missing(train_features_with_nans, train_labels):\n",
1755 | " N, d = train_features_with_nans.shape\n",
1756 | " \n",
1757 | " # your code here\n",
1758 | " # Extract the index of zeros and ones seperately\n",
1759 | " allZeros = np.where(train_labels == 0)[0]\n",
1760 | " allOnes = np.where(train_labels == 1)[0]\n",
1761 | " \n",
1762 | " # Convert the 2-d numpy array to DataFrame\n",
1763 | " df_features = pd.DataFrame(train_features_with_nans)\n",
1764 | " \n",
1765 | " \n",
1766 | " dfZeros = df_features.loc[allZeros]\n",
1767 | " dfOnes = df_features.loc[allOnes]\n",
1768 | " \n",
1769 | " firstCol = np.nanstd(dfZeros, axis=0, ddof=0).reshape((d, 1))\n",
1770 | " secondCol = np.nanstd(dfOnes, axis=0, ddof=0).reshape((d, 1))\n",
1771 | " \n",
1772 | " sigma_y = np.concatenate((firstCol, secondCol), axis=1)\n",
1773 | " \n",
1774 | " assert not np.isnan(sigma_y).any()\n",
1775 | " assert sigma_y.shape == (d, 2)\n",
1776 | " return sigma_y"
1777 | ]
1778 | },
1779 | {
1780 | "cell_type": "code",
1781 | "execution_count": 153,
1782 | "metadata": {
1783 | "deletable": false,
1784 | "editable": false,
1785 | "nbgrader": {
1786 | "cell_type": "code",
1787 | "checksum": "6062b0a65e131aa86909fefa7e7b88fc",
1788 | "grade": false,
1789 | "grade_id": "cell-2821b980896856b7",
1790 | "locked": true,
1791 | "schema_version": 3,
1792 | "solution": false,
1793 | "task": false
1794 | }
1795 | },
1796 | "outputs": [],
1797 | "source": [
1798 | "# Performing sanity checks on your implementation\n",
1799 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n",
1800 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n",
1801 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n",
1802 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n",
1803 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n",
1804 | "some_labels = np.array([0, 1, 0, 1, 0])\n",
1805 | "\n",
1806 | "for i,j in [(0,0), (1,1), (2,3), (3,4), (4, 2)]:\n",
1807 | " some_feats[i,j] = np.nan\n",
1808 | "\n",
1809 | "some_std_y = cc_std_consider_missing(some_feats, some_labels)\n",
1810 | "\n",
1811 | "assert np.array_equal(some_std_y.round(2), np.array([[ 2. , 4. ],\n",
1812 | " [13.77, 0. ],\n",
1813 | " [ 0. , 12. ],\n",
1814 | " [14.5 , 17.5 ],\n",
1815 | " [44.31, 0. ],\n",
1816 | " [ 1.03, 9.9 ],\n",
1817 | " [ 0.09, 0.8 ],\n",
1818 | " [ 4.5 , 0.5 ]]))\n",
1819 | "\n",
1820 | "# Checking against the pre-computed test database\n",
1821 | "test_results = test_case_checker(cc_std_consider_missing, task_id=6)\n",
1822 | "assert test_results['passed'], test_results['message']"
1823 | ]
1824 | },
1825 | {
1826 | "cell_type": "code",
1827 | "execution_count": 154,
1828 | "metadata": {
1829 | "code_folding": [],
1830 | "deletable": false,
1831 | "editable": false,
1832 | "nbgrader": {
1833 | "cell_type": "code",
1834 | "checksum": "4919f542a3a80c75bbe8f533c2fe35b6",
1835 | "grade": true,
1836 | "grade_id": "cell-e263f2b1878b37bz",
1837 | "locked": true,
1838 | "points": 1,
1839 | "schema_version": 3,
1840 | "solution": false,
1841 | "task": false
1842 | }
1843 | },
1844 | "outputs": [],
1845 | "source": [
1846 |     "# This cell is left empty as a separator. You can leave this cell as it is, and you should not delete it.\n"
1847 | ]
1848 | },
1849 | {
1850 | "cell_type": "code",
1851 | "execution_count": 155,
1852 | "metadata": {},
1853 | "outputs": [
1854 | {
1855 | "data": {
1856 | "text/plain": [
1857 | "array([[ 3.1155426 , 3.75417931],\n",
1858 | " [ 25.96811899, 32.50910874],\n",
1859 | " [ 12.26342359, 12.1982786 ],\n",
1860 | " [ 9.87753687, 10.37284304],\n",
1861 | " [ 95.63339586, 139.24364214],\n",
1862 | " [ 6.38703834, 6.21564813],\n",
1863 | " [ 0.29438217, 0.37201494],\n",
1864 | " [ 11.67577435, 11.01543899]])"
1865 | ]
1866 | },
1867 | "execution_count": 155,
1868 | "metadata": {},
1869 | "output_type": "execute_result"
1870 | }
1871 | ],
1872 | "source": [
1873 | "sigma_y = cc_std_consider_missing(train_features_with_nans, train_labels)\n",
1874 | "sigma_y"
1875 | ]
1876 | },
1877 | {
1878 | "cell_type": "markdown",
1879 | "metadata": {},
1880 | "source": [
1881 | "## 2.1. Writing the Naive Bayes Classifier With Missing Data Handling"
1882 | ]
1883 | },
1884 | {
1885 | "cell_type": "code",
1886 | "execution_count": 156,
1887 | "metadata": {},
1888 | "outputs": [],
1889 | "source": [
1890 | "class NBClassifierWithMissing(NBClassifier):\n",
1891 | " def get_cc_means(self):\n",
1892 | " mu_y = cc_mean_consider_missing(self.train_features, self.train_labels)\n",
1893 | " return mu_y\n",
1894 | " \n",
1895 | " def get_cc_std(self):\n",
1896 | " sigma_y = cc_std_consider_missing(self.train_features, self.train_labels)\n",
1897 | " return sigma_y\n",
1898 | " \n",
1899 | " def predict(self, features):\n",
1900 | " preds = []\n",
1901 | " for feature in features:\n",
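1902 |     "            # Boolean mask of the observed (non-NaN) coordinates of this sample\n",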
1902 | " is_num = np.logical_not(np.isnan(feature))\n",
1903 | " mu_y_not_nan = self.mu_y[is_num,:]\n",
1904 | " std_y_not_nan = self.sigma_y[is_num,:]\n",
1905 | " feats_not_nan = feature[is_num].reshape(1,-1)\n",
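1906 |     "            # Score only the observed coordinates; under naive Bayes, the missing ones are marginalized out\n",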
1906 | " log_p_x_y = log_prob(feats_not_nan, mu_y_not_nan, std_y_not_nan, self.log_py)\n",
1907 | " preds.append(log_p_x_y.argmax(axis=1).item())\n",
1908 | "\n",
1909 | " return np.array(preds)"
1910 | ]
1911 | },
1912 | {
1913 | "cell_type": "code",
1914 | "execution_count": 157,
1915 | "metadata": {},
1916 | "outputs": [],
1917 | "source": [
1918 | "diabetes_classifier_nans = NBClassifierWithMissing(train_features_with_nans, train_labels)\n",
1919 | "train_pred = diabetes_classifier_nans.predict(train_features_with_nans)\n",
1920 | "eval_pred = diabetes_classifier_nans.predict(eval_features_with_nans)"
1921 | ]
1922 | },
1923 | {
1924 | "cell_type": "code",
1925 | "execution_count": 158,
1926 | "metadata": {},
1927 | "outputs": [
1928 | {
1929 | "name": "stdout",
1930 | "output_type": "stream",
1931 | "text": [
1932 | "The training data accuracy of your trained model is 0.747557003257329\n",
1933 | "The evaluation data accuracy of your trained model is 0.7532467532467533\n"
1934 | ]
1935 | }
1936 | ],
1937 | "source": [
1938 | "train_acc = (train_pred==train_labels).mean()\n",
1939 | "eval_acc = (eval_pred==eval_labels).mean()\n",
1940 | "print(f'The training data accuracy of your trained model is {train_acc}')\n",
1941 | "print(f'The evaluation data accuracy of your trained model is {eval_acc}')"
1942 | ]
1943 | },
1944 | {
1945 | "cell_type": "markdown",
1946 | "metadata": {},
1947 | "source": [
1948 | "# 3. Running SVMlight"
1949 | ]
1950 | },
1951 | {
1952 | "cell_type": "markdown",
1953 | "metadata": {},
1954 | "source": [
1955 |     "In this section, we are going to investigate the support vector machine (SVM) classification method. We will study this method properly in week 3; for now, we simply observe how it performs in order to set the stage.\n",
1956 | "\n",
1957 | "`SVMlight` (http://svmlight.joachims.org/) is a famous implementation of the SVM classifier. \n",
1958 | "\n",
1959 |     "`SVMlight` can be called from a shell terminal, and there is no convenient Python 3 wrapper for it. Therefore:\n",
1960 | "1. We have to export the training data to a special format called `svmlight/libsvm`. This can be done using scikit-learn.\n",
1961 | "2. We have to run the `svm_learn` program to learn the model and then store it.\n",
1962 | "3. We have to import the model back to python."
1963 | ]
1964 | },
1965 | {
1966 | "cell_type": "markdown",
1967 | "metadata": {},
1968 | "source": [
1969 | "## 3.1 Exporting the training data to libsvm format"
1970 | ]
1971 | },
1972 | {
1973 | "cell_type": "code",
1974 | "execution_count": 159,
1975 | "metadata": {},
1976 | "outputs": [],
1977 | "source": [
1978 | "from sklearn.datasets import dump_svmlight_file\n",
1979 | "dump_svmlight_file(train_features, 2*train_labels-1, 'training_feats.data', \n",
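1980 |     "# SVMlight expects binary labels in {-1, +1}, so 2*train_labels-1 maps {0, 1} onto {-1, +1}\n",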
1980 | " zero_based=False, comment=None, query_id=None, multilabel=False)"
1981 | ]
1982 | },
1983 | {
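1984 |   {
1985 |    "cell_type": "markdown",
1986 |    "metadata": {},
1987 |    "source": [
1988 |     "Each line of the exported file encodes one sample in the `svmlight/libsvm` format, i.e., `label index:value index:value ...` (for illustration, something like `-1 1:6.0 2:148.0 ...`); since `zero_based=False`, the feature indices start at 1."
1989 |    ]
1990 |   },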
1984 | "cell_type": "markdown",
1985 | "metadata": {},
1986 | "source": [
1987 | "## 3.2 Training `SVMlight`"
1988 | ]
1989 | },
1990 | {
1991 | "cell_type": "code",
1992 | "execution_count": 160,
1993 | "metadata": {},
1994 | "outputs": [
1995 | {
1996 | "name": "stdout",
1997 | "output_type": "stream",
1998 | "text": [
1999 | "chmod: changing permissions of '../BasicClassification-lib/svmlight/svm_learn': Operation not permitted\n",
2000 | "Scanning examples...done\n",
2001 | "Reading examples into memory...100..200..300..400..500..600..OK. (614 examples read)\n",
2002 | "Setting default regularization parameter C=0.0000\n",
2003 |       "Optimizing....................................done. (1781 iterations)\n",
2004 | "Optimization finished (141 misclassified, maxdiff=0.00099).\n",
2005 | "Runtime in cpu-seconds: 0.19\n",
2006 | "Number of SV: 375 (including 369 at upper bound)\n",
2007 | "L1 loss: loss=335.23204\n",
2008 | "Norm of weight vector: |w|=0.03179\n",
2009 | "Norm of longest example vector: |x|=871.75350\n",
2010 | "Estimated VCdim of classifier: VCdim<=769.24695\n",
2011 | "Computing XiAlpha-estimates...done\n",
2012 | "Runtime for XiAlpha-estimates in cpu-seconds: 0.00\n",
2013 | "XiAlpha-estimate of the error: error<=60.75% (rho=1.00,depth=0)\n",
2014 | "XiAlpha-estimate of the recall: recall=>10.53% (rho=1.00,depth=0)\n",
2015 | "XiAlpha-estimate of the precision: precision=>10.58% (rho=1.00,depth=0)\n",
2016 | "Number of kernel evaluations: 71356\n",
2017 | "Writing model file...done\n",
2018 | "\n"
2019 | ]
2020 | }
2021 | ],
2022 | "source": [
2023 | "!chmod +x ../BasicClassification-lib/svmlight/svm_learn\n",
2024 | "from subprocess import Popen, PIPE\n",
2025 | "process = Popen([\"../BasicClassification-lib/svmlight/svm_learn\", \"./training_feats.data\", \"svm_model.txt\"], stdout=PIPE, stderr=PIPE)\n",
2026 | "stdout, stderr = process.communicate()\n",
2027 | "print(stdout.decode(\"utf-8\"))"
2028 | ]
2029 | },
2030 | {
2031 | "cell_type": "markdown",
2032 | "metadata": {},
2033 | "source": [
2034 | "## 3.3 Importing the SVM Model"
2035 | ]
2036 | },
2037 | {
2038 | "cell_type": "code",
2039 | "execution_count": 161,
2040 | "metadata": {},
2041 | "outputs": [],
2042 | "source": [
2043 | "from svm2weight import get_svmlight_weights\n",
2044 | "svm_weights, thresh = get_svmlight_weights('svm_model.txt', printOutput=False)\n",
2045 | "\n",
2046 | "def svmlight_classifier(train_features):\n",
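2047 |     "    # Linear decision rule: predict the positive class iff w.x - thresh >= 0\n",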
2047 | " return (train_features @ svm_weights - thresh).reshape(-1) >= 0."
2048 | ]
2049 | },
2050 | {
2051 | "cell_type": "code",
2052 | "execution_count": 162,
2053 | "metadata": {},
2054 | "outputs": [],
2055 | "source": [
2056 | "train_pred = svmlight_classifier(train_features)\n",
2057 | "eval_pred = svmlight_classifier(eval_features)"
2058 | ]
2059 | },
2060 | {
2061 | "cell_type": "code",
2062 | "execution_count": 163,
2063 | "metadata": {},
2064 | "outputs": [
2065 | {
2066 | "name": "stdout",
2067 | "output_type": "stream",
2068 | "text": [
2069 | "The training data accuracy of your trained model is 0.7703583061889251\n",
2070 | "The evaluation data accuracy of your trained model is 0.7402597402597403\n"
2071 | ]
2072 | }
2073 | ],
2074 | "source": [
2075 | "train_acc = (train_pred==train_labels).mean()\n",
2076 | "eval_acc = (eval_pred==eval_labels).mean()\n",
2077 | "print(f'The training data accuracy of your trained model is {train_acc}')\n",
2078 | "print(f'The evaluation data accuracy of your trained model is {eval_acc}')"
2079 | ]
2080 | },
2081 | {
2082 | "cell_type": "code",
2083 | "execution_count": 164,
2084 | "metadata": {},
2085 | "outputs": [],
2086 | "source": [
2087 | "# Cleaning up after our work is done\n",
2088 | "!rm -rf svm_model.txt training_feats.data"
2089 | ]
2090 | },
2091 | {
2092 | "cell_type": "code",
2093 | "execution_count": null,
2094 | "metadata": {},
2095 | "outputs": [],
2096 | "source": []
2097 | }
2098 | ],
2099 | "metadata": {
2100 | "illinois_payload": {
2101 | "b64z": "",
2102 | "nb_path": "release/BasicClassification/BasicClassification.ipynb"
2103 | },
2104 | "kernelspec": {
2105 | "display_name": "Python 3 (Threads: 2)",
2106 | "language": "python",
2107 | "name": "python3"
2108 | },
2109 | "language_info": {
2110 | "codemirror_mode": {
2111 | "name": "ipython",
2112 | "version": 3
2113 | },
2114 | "file_extension": ".py",
2115 | "mimetype": "text/x-python",
2116 | "name": "python",
2117 | "nbconvert_exporter": "python",
2118 | "pygments_lexer": "ipython3",
2119 | "version": "3.8.12"
2120 | }
2121 | },
2122 | "nbformat": 4,
2123 | "nbformat_minor": 4
2124 | }
2125 |
--------------------------------------------------------------------------------
/EMSegmentation-lib/EMSegmentation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/EMSegmentation.pdf
--------------------------------------------------------------------------------
/EMSegmentation-lib/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Assignment-specific public libraries and large data files (visible, read-only)
3 |
4 | This directory is for data and script files that are specific to one homework.
5 | This directory will be part of the Python path for all users. Libraries placed here should generally be those written by staff; Python packages can instead be installed system-wide via the Dockerfiles or the `requirements.txt` loading system (see below).
6 |
7 | Please don't confuse this directory with the `work/course-lib` directory that may contain data to be used by multiple homework assignments.
8 |
9 | ## Placement
10 |
11 | In the lab container, the contents of this directory will be placed in:
12 |
13 | ```
14 | work/release/[hwid]-lib
15 | ```
16 |
17 | Where `hwid` matches the homework ID for that assignment. For example, data files associated with `work/source/HW1/HW1.ipynb` should be placed in `work/source/HW1-lib/`. These files will end up in `work/release/HW1-lib` in the container.
18 |
19 | These files will be read-only. These will be available for all users, including students. However, it's better not to refer to these files using absolute paths; see best practices below.
20 |
21 | ## Special files
22 |
23 | ### `payload_requirements.json`
24 |
25 | **Only staff can configure this file.**
26 |
27 | The `payload_requirements.json` file, if present, specifies additional files that will be submitted along with the notebook. The file should contain an object with a `"files"` property that is a list of strings that are relative paths to files under the current homework notebook's working directory. For example:
28 |
29 | ```json
30 | {
31 | "files": ["some-file.db", "inner_directory/nested_file.txt"]
32 | }
33 | ```
34 |
35 | If the homework ID for this homework is "HW1", the above example would specify these additional files to be collected:
36 |
37 | - `work/release/HW1/some-file.db`
38 | - `work/release/HW1/inner_directory/nested_file.txt`
39 |
40 | ### `requirements.txt` sequence
41 |
42 | **Only staff can configure these files.**
43 |
44 | If some of the Python packages for a particular assignment need to be frozen, you can specify packages and versions in one or more files named `requirements*.txt`, where you may want to put a number before the extension such as `requirements1.txt`. These will be processed one at a time in natural version order (like `sort -V` in Linux) during the Docker build. For example, `requirements1.txt`, `requirements2.txt`, and `requirements10.txt` would be processed in that numeric order.
45 |
46 | ## Best practices
47 |
48 | It's important to put public staff libraries and large data files in this directory to prevent editing and ensure efficient use of the cloud storage. Large files will also have improved access time from this directory compared to the notebook directory.
49 |
50 | Python files in this directory will be on the Python system path in the container, so you may want to write a Python loader for data you need and refer to the data with relative paths under the library directory (rather than using ".." to refer to a directory above). However, if students try to work on files offline, this may complicate things and require shims to adjust the paths.
51 |
52 | Staff members should refer to additional notes in the staff library directories.
53 |
--------------------------------------------------------------------------------
/EMSegmentation-lib/aml_utils.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 | import pandas as pd
4 | import os
5 | import sys
6 | import traceback
7 | from pygments import formatters, highlight, lexers
8 | import re
9 | import inspect
10 | import types
11 | import copy
12 |
13 | visualize = True
14 | perform_computation = True
15 | test_db_dir = os.path.dirname(os.path.realpath(__file__)) + '/test_db'
16 | NO_DASHES = 55
17 |
18 | Fore_BLUE_LIGHT = u'\u001b[38;5;19m'
19 | Fore_RED_LIGHT = u'\u001b[38;5;196m'
20 | Fore_BLUE = u'\u001b[38;5;34m'
21 | Fore_RED = '\x1b[1;31m'
22 | FORE_GREEN_DARK = u'\u001b[38;5;22m'
23 | Fore_DARKRED = u'\u001b[38;5;124m'
24 | Fore_MAGENTA = '\x1b[1m' + u'\u001b[38;5;92m'
25 | Fore_GREEN = u'\u001b[38;5;32m'
26 | Fore_BLACK = '\x1b[0m' + u'\u001b[38;5;0m'
27 |
28 |
29 | ########################################################################################
30 | ######################## Utilities for traceback processing ############################
31 | ########################################################################################
32 | def keep_tb_rule(tb):
33 | tb_file_path = tb.tb_frame.f_code.co_filename
34 | if os.path.realpath(__file__) == os.path.realpath(tb_file_path):
35 | return False
36 | else:
37 | return True
38 |
39 | def censor_exc_traceback(exc_traceback):
40 | original_tb_list = []
41 | tb_next = exc_traceback
42 | while tb_next is not None:
43 | original_tb_list.append(tb_next)
44 | tb_next = tb_next.tb_next
45 |
46 | censored_tb_list = [tb for tb in original_tb_list if keep_tb_rule(tb)]
47 |
48 | for i, tb in enumerate(censored_tb_list[:-1]):
49 | tb.tb_next = censored_tb_list[i+1]
50 |
51 | if len(censored_tb_list) > 0:
52 | return censored_tb_list[0]
53 | else:
54 | return exc_traceback
55 |
56 | try:
57 | import IPython
58 | ultratb = IPython.core.ultratb.VerboseTB(include_vars=False)
59 | def get_tb_colored_str(exc_type, exc_value, exc_traceback):
60 | manipulated_exc_traceback = censor_exc_traceback(exc_traceback)
61 | tb_text = ultratb.text(exc_type, exc_value, manipulated_exc_traceback)
62 | tb_text = re.sub( r"/tmp/ipykernel_.*.py", "/Jupyter/Notebook/Student/Task/Implementation/Cells", tb_text)
63 | tb_text = re.sub( r"\s{20,}Traceback", " Traceback", tb_text)
64 | s_split = tb_text.split('\n')
65 | if len(s_split) > 0:
66 | c_s_split = s_split[1:]
67 | tb_text = '\n'.join(c_s_split) + '\n'
68 | tb_text = tb_text.replace('\x1b[0;36m', '\x1b[1m \x1b[1;34m')
69 | return tb_text
70 | except:
71 | def get_tb_colored_str(exc_type, exc_value, exc_traceback):
72 | manipulated_exc_traceback = censor_exc_traceback(exc_traceback)
73 | tb_text = traceback.format_exception(exc_type, exc_value, manipulated_exc_traceback, limit=None, chain=True)
74 | tb_text = ''.join(tb_text)
75 | tb_text = re.sub( r"\"/tmp/ipykernel_.*\"", "\"/Jupyter/Notebook/Student/Task/Implementation/Cells\"", tb_text)
76 | lexer = lexers.get_lexer_by_name("pytb", stripall=True)
77 | formatter = formatters.get_formatter_by_name("terminal16m")
78 | tb_colored = highlight(tb_text, lexer, formatter)
79 | return tb_colored
80 |
81 | try:
82 | from IPython.utils import PyColorize
83 | color_parser = PyColorize.Parser(color_table=None, out="str", parent=None, style='Linux')
84 | def code_color_parser(code_str):
85 | return color_parser.format(code_str)
86 | except:
87 | def code_color_parser(code_str):
88 | return code_str
89 |
90 | def get_num_indents(src_list):
91 | assert len(src_list) > 0
92 | a = [line + 20 * ' ' for line in src_list]
93 | b = [len(line) - len(line.lstrip()) for line in a]
94 | assert b[0] == 0
95 | c = min(b[1:])
96 | return c
97 |
98 | def code_snippet_maker(stu_function, args, kwargs):
99 | test_kwargs_str_lst = []
100 | test_kwargs_str_lst.append('from copy import deepcopy')
101 | test_kwargs_str_lst.append("failed_arguments = deepcopy(test_results['test_kwargs'])")
102 | for arg_ in args:
103 | test_kwargs_str_lst.append(arg_)
104 | for key,val in kwargs.items():
105 | test_kwargs_str_lst.append(f"{key} = failed_arguments['{key}']")
106 | test_kwargs_str = ', '.join(test_kwargs_str_lst)
107 |
108 | if hasattr(stu_function, '__name__'):
109 | stu_func_name = stu_function.__name__
110 | else:
111 | stu_func_name = 'YOUR_FUNCTION_NAME'
112 |
113 | check_list_code = []
114 | check_list_code.append(f"correct_sol = test_results['correct_sol'] # The Reference Solution")
115 | check_list_code.append(f"if isinstance(correct_sol, np.ndarray):")
116 | check_list_code.append(f" assert isinstance(my_solution, np.ndarray)")
117 | check_list_code.append(f" assert my_solution.dtype is correct_sol.dtype")
118 | check_list_code.append(f" assert my_solution.shape == correct_sol.shape")
119 | check_list_code.append(f" assert np.allclose(my_solution, correct_sol)")
120 | check_list_code.append(f" print('If you passed the above assertions, it probably means that you have fixed the issue! Well Done!')")
121 | check_list_code.append(f" print('Now you have to do 3 things:')")
122 | check_list_code.append(f" print(' 1) Carefully copy the fixed code body back to the {stu_func_name} function.')")
123 | check_list_code.append(f" print(' 2) If you copied any \"returned_var = \" lines, convert them back to return statements.')")
124 | check_list_code.append(f" print(' 3) Carefully remove this cell (i.e., the cell you inserted and modified) once you are done.')")
125 |
126 | try:
127 | src = inspect.getsource(stu_function)
128 | src_list = src.split('\n')
129 | src_list = [line for line in src_list if not (line.strip().startswith('#'))]
130 | no_indents = get_num_indents(src_list)
131 | mod_src_list = []
132 | src_gen = src_list[1:] if src_list[0].startswith('def') else src_list
133 | for line in src_gen:
134 | if len(line) > no_indents:
135 | shifted_left_line = line[no_indents:]
136 | else:
137 | shifted_left_line = line
138 |
139 | return_statement = 'return '
140 | if not shifted_left_line.lstrip().startswith(return_statement):
141 | mod_src_list.append(shifted_left_line)
142 | else:
143 | i = shifted_left_line.index(return_statement)
144 | shifted_left_line = shifted_left_line[:i] + 'returned_var = ' + shifted_left_line[i+len(return_statement):] + ' # returned variable'
145 | mod_src_list.append(shifted_left_line)
146 |
147 | mod_bodysrc_list = '\n'.join(mod_src_list).strip().split('\n')
148 |
149 | mod_src_list = []
150 | mod_src_list = mod_src_list + ['### You can copy the following auto-generated snippet into a new cell to reproduce the issue.']
151 | mod_src_list = mod_src_list + ['### Use the + button on the top left of the screen to insert a new cell below.']
152 | mod_src_list = mod_src_list + ['']
153 | mod_src_list = mod_src_list + ['#'*7 + ' Test Arguments ' + '#'*7] + test_kwargs_str_lst
154 | mod_src_list = mod_src_list + ['\n' + '#'*7 + ' Your Code Body ' + '#'*7] + mod_bodysrc_list
155 | mod_src_list.append('\n' + '#'*5 + ' Checking Solutions '+ '#'*6)
156 | mod_src_list.append(f"my_solution = returned_var # Your Solution")
157 | mod_src_list = mod_src_list + check_list_code
158 | processed_code = '\n'.join(mod_src_list)
159 | except:
160 | mod_src_list = []
161 | mod_src_list.append(f"my_solution = {stu_func_name}({test_kwargs_str})")
162 | mod_src_list = mod_src_list + check_list_code
163 | processed_code = '\n'.join(mod_src_list)
164 |
165 | return processed_code
166 |
167 |
168 | ########################################################################################
169 | ####################### Utilities for comparison processing ############################
170 | ########################################################################################
171 | def retrieve_item(item_name, ptr_, test_idx, npz_file):
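172 |     # Test items are stored flattened and concatenated across test cases; the
173 |     # companion 'shape_<item_name>' array records each test case's shape. Slice
174 |     # out this test's block, reshape it, and return it with the advanced pointer.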
172 | item_shape = npz_file[f'shape_{item_name}'][test_idx]
173 | item_size = int(np.prod(item_shape))
174 | item = npz_file[item_name][ptr_:(ptr_+item_size)].reshape(item_shape)
175 | return item, ptr_+item_size
176 |
177 | class NPStrListCoder:
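178 |     # Packs a list of short strings into a fixed-length numpy array of
179 |     # character codes (and decodes such an array back into the string list).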
178 | def __init__(self):
179 | self.filler = '?'
180 | self.spacer = ':'
181 | self.max_len = 100
182 |
183 | def encode(self, str_list):
184 | my_str_ = self.spacer.join(str_list)
185 | str_hex_data = [ord(c) for c in my_str_]
186 |         assert_msg = f'Increase max_len; too many characters: {len(str_hex_data)} > {self.max_len}'
187 | assert len(str_hex_data) <= self.max_len, assert_msg
188 | str_hex_data = str_hex_data + [ord(self.filler) for _ in range(self.max_len - len(str_hex_data))]
189 | str_hex_np = np.array(str_hex_data)
190 | return str_hex_np
191 |
192 | def decode(self, np_arr):
193 | a = ''.join([chr(i) for i in np_arr])
194 | recovered_list = a.replace(self.filler, '').split(self.spacer)
195 | return recovered_list
196 |
197 | str2np_coder = NPStrListCoder()
198 |
199 | def test_case_loader(test_file):
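200 |     # Yields one (args, kwargs, expected_output) triple per stored test case.
201 |     # Entries named 'dfcarg_*'/'dfckwarg_*' carry encoded column names for
202 |     # arguments that should be reconstructed as pandas DataFrames.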
200 | npz_file = np.load(test_file)
201 | arg_id_list = sorted([int(key[4:]) for key in npz_file.keys() if key.startswith('arg_')])
202 | kwarg_names_list = sorted([key[6:] for key in npz_file.keys() if key.startswith('kwarg_')])
203 |
204 | arg_ptr_list = [0 for _ in range(len(arg_id_list))]
205 | dfcarg_ptr_list = [0 for _ in range(len(arg_id_list))]
206 | kwarg_ptr_list = [0 for _ in range(len(kwarg_names_list))]
207 | dfckwarg_ptr_list = [0 for _ in range(len(kwarg_names_list))]
208 | out_ptr = 0
209 | for i in np.arange(npz_file['num_tests']):
210 | args_list = []
211 | for arg_id, arg_id_ in enumerate(arg_id_list):
212 | arg_item, arg_ptr_list[arg_id] = retrieve_item(f'arg_{arg_id_}', arg_ptr_list[arg_id], i, npz_file)
213 | if f'dfcarg_{arg_id_}' in npz_file.keys():
214 | col_list_code, dfcarg_ptr_list[arg_id] = retrieve_item(f'dfcarg_{arg_id_}', dfcarg_ptr_list[arg_id], i, npz_file)
215 | arg_item = pd.DataFrame(arg_item, columns=str2np_coder.decode(col_list_code))
216 | args_list.append(arg_item)
217 | args = tuple(args_list)
218 |
219 | kwargs = {}
220 | for kwarg_id, kwarg_name in enumerate(kwarg_names_list):
221 | kwarg_item, kwarg_ptr_list[kwarg_id] = retrieve_item(f'kwarg_{kwarg_name}', kwarg_ptr_list[kwarg_id], i, npz_file)
222 | if f'dfckwarg_{kwarg_name}' in npz_file.keys():
223 | col_list_code, dfckwarg_ptr_list[kwarg_id] = retrieve_item(f'dfckwarg_{kwarg_name}', dfckwarg_ptr_list[kwarg_id], i, npz_file)
224 | kwarg_item = pd.DataFrame(kwarg_item, columns=str2np_coder.decode(col_list_code))
225 | kwargs[kwarg_name]=kwarg_item
226 |
227 | output, out_ptr = retrieve_item(f'output', out_ptr, i, npz_file)
228 |
229 | yield args, kwargs, output
230 |
231 | def arg2str(args, kwargs, adv_user_msg=False, stu_func=None):
232 | msg = ''
233 |
234 | for arg_ in args:
235 | msg += f'{arg_}\n'
236 | for key,val in kwargs.items():
237 | try:
238 | val_str = np.array_repr(val)
239 | except:
240 | val_str = val
241 | new_line = f'{Fore_MAGENTA}{key}{Fore_BLACK} = {val_str}\n'
242 | new_line = new_line.replace(' = array(',' = np.array(')
243 | new_line = new_line.replace('nan,','np.nan,')
244 | msg += new_line
245 |
246 |
247 | if adv_user_msg:
248 | try:
249 | is_stu_func_lambda = isinstance(stu_func, types.LambdaType)
250 | if is_stu_func_lambda:
251 |                 is_stu_func_lambda = stu_func.__name__ == "<lambda>"
252 | if not is_stu_func_lambda:
253 | code_title_ = f'\n' + '-' * (NO_DASHES-1) + f'{Fore_RED} Reproducing Code Snippet {Fore_BLACK}' + '-' * (NO_DASHES-2) + '\n'
254 | code = code_snippet_maker(stu_func, args, kwargs)
255 | msg += code_title_ + code_color_parser(code)
256 | except:
257 | pass
258 | return msg
259 |
260 |
261 | def test_case_checker(stu_func, task_id=0):
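262 |     # Runs stu_func on every stored test case for this task and compares the
263 |     # result against the reference solution (type, then shape/dtype, then
264 |     # values); returns a dict with a pass flag and a diagnostic message.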
262 | out_dict = {}
263 | out_dict['task_number'] = task_id
264 | out_dict['exception'] = None
265 | out_dict['exception_info'] = None
266 | test_db_npz = f'{test_db_dir}/task_{task_id}.npz'
267 | if not os.path.exists(test_db_npz):
268 | out_dict['message'] = f'Test database test_db/task_{task_id}.npz does not exist... aborting!'
269 | out_dict['passed'] = False
270 | out_dict['test_args'] = None
271 | out_dict['test_kwargs'] = None
272 | out_dict['stu_sol'] = None
273 | out_dict['correct_sol'] = None
274 | return out_dict
275 |
276 | if hasattr(stu_func, '__name__'):
277 | stu_func_name = stu_func.__name__
278 | else:
279 | stu_func_name = None
280 |
281 | done = False
282 | err_title = f'\n' + '*' * NO_DASHES + f'{Fore_RED} Error in Task {task_id} {Fore_BLACK}' + '*' * NO_DASHES + f'\n'
283 | test_case_title = '\n' + '-' * NO_DASHES + f'{Fore_RED} Test Case Arguments {Fore_BLACK}' + '-' * NO_DASHES + '\n'
284 | summary_title = '-' * NO_DASHES + f' {Fore_RED} Summary {Fore_BLACK}' + '-' * NO_DASHES + '\n'
285 | for (test_args, test_kwargs, correct_sol) in test_case_loader(test_db_npz):
286 | try:
287 | stu_args_copy = copy.deepcopy(test_args)
288 | stu_kwargs_copy = copy.deepcopy(test_kwargs)
289 | stu_sol = stu_func(*stu_args_copy, **stu_kwargs_copy)
290 | except Exception as stu_exception:
291 | stu_sol = None
292 | stu_exception_info = sys.exc_info()
293 | message = err_title + summary_title
294 | message += f'Your code {Fore_RED}crashed{Fore_BLACK} during the evaluation of a test case argument.'
295 | message += f' The rest of this message gives you the following material:\n'
296 |             message += f'  1. The exception traceback detailing how the error occurred.\n'
297 |             message += f'  2. The specific test case arguments that caused the error.\n'
298 |             message += f'  3. A code snippet that can conveniently reproduce the error.\n'
299 | message += f' -> You can {Fore_RED}copy and paste{Fore_BLACK} the {Fore_RED}code snippet{Fore_BLACK} into a {Fore_RED}new cell{Fore_BLACK}, and run the cell to reproduce the error.\n\n'
300 | message += '-' * NO_DASHES + f'{Fore_RED} Exception Traceback {Fore_BLACK}' + '-' * NO_DASHES + '\n'
301 | message += get_tb_colored_str(*stu_exception_info)
302 | message += test_case_title
303 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
304 | out_dict['test_args'] = test_args
305 | out_dict['test_kwargs'] = test_kwargs
306 | out_dict['stu_sol'] = stu_sol
307 | out_dict['correct_sol'] = correct_sol
308 | out_dict['message'] = message
309 | out_dict['passed'] = False
310 | out_dict['exception'] = stu_exception
311 | out_dict['exception_info'] = stu_exception_info
312 | return out_dict
313 |
314 | if isinstance(correct_sol, np.ndarray) and np.isscalar(stu_sol):
315 | # This is handling a special case: When scalar numpy objects are stored,
316 | # they will be converted to a numpy array upon loading.
317 | # In this case, we'll give students the benefit of the doubt,
318 | # and assume the correct solution already was a scalar.
319 | if correct_sol.size == 1:
320 | correct_sol = np.float64(correct_sol.item())
321 | stu_sol = np.float64(np.float64(stu_sol).item())
322 |
323 | #Type Sanity check
324 | if type(stu_sol) is not type(correct_sol):
325 | message = err_title + summary_title
326 | message += f'Your solution\'s {Fore_RED}output type{Fore_BLACK} is not the same as '
327 | message += f'the reference solution\'s data type.\n'
328 | message += f' Your solution\'s type --> {Fore_RED}{type(stu_sol)}{Fore_BLACK}\n'
329 | message += f' Correct solution\'s type --> {Fore_RED}{type(correct_sol)}{Fore_BLACK}\n'
330 | message += test_case_title
331 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
332 | out_dict['test_args'] = test_args
333 | out_dict['test_kwargs'] = test_kwargs
334 | out_dict['stu_sol'] = stu_sol
335 | out_dict['correct_sol'] = correct_sol
336 | out_dict['message'] = message
337 | out_dict['passed'] = False
338 | return out_dict
339 |
340 | if isinstance(correct_sol, np.ndarray):
341 | if not np.all(np.array(correct_sol.shape) == np.array(stu_sol.shape)):
342 | message = err_title + summary_title
343 | message += f'Your solution\'s {Fore_RED}output numpy shape{Fore_BLACK} is not the same as '
344 | message += f'the reference solution\'s shape.\n'
345 | message += f' Your solution\'s shape --> {Fore_RED}{stu_sol.shape}{Fore_BLACK}\n'
346 | message += f' Correct solution\'s shape --> {Fore_RED}{correct_sol.shape}{Fore_BLACK}\n'
347 | message += test_case_title
348 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
349 | out_dict['test_args'] = test_args
350 | out_dict['test_kwargs'] = test_kwargs
351 | out_dict['stu_sol'] = stu_sol
352 | out_dict['correct_sol'] = correct_sol
353 | out_dict['message'] = message
354 | out_dict['passed'] = False
355 | return out_dict
356 |
357 | if not(stu_sol.dtype is correct_sol.dtype):
358 | message = err_title + summary_title
359 |             message += f'Your solution\'s {Fore_RED}output numpy dtype{Fore_BLACK} is not the same as '
360 | message += f'the reference solution\'s dtype.\n'
361 | message += f' Your solution\'s dtype --> {Fore_RED}np.{stu_sol.dtype}{Fore_BLACK}\n'
362 | message += f' Correct solution\'s dtype --> {Fore_RED}np.{correct_sol.dtype}{Fore_BLACK}\n'
363 | message += test_case_title
364 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
365 | out_dict['test_args'] = test_args
366 | out_dict['test_kwargs'] = test_kwargs
367 | out_dict['stu_sol'] = stu_sol
368 | out_dict['correct_sol'] = correct_sol
369 | out_dict['message'] = message
370 | out_dict['passed'] = False
371 | return out_dict
372 |
373 | if isinstance(correct_sol, np.ndarray):
374 | equality_array = np.isclose(stu_sol, correct_sol, rtol=1e-05, atol=1e-08, equal_nan=True)
375 | if not equality_array.all():
376 | message = err_title + summary_title
377 | message += f'Your solution is {Fore_RED}not the same{Fore_BLACK} as the correct solution.\n'
378 | whr_ = np.array(np.where(np.logical_not(equality_array)))
379 | ineq_idxs = whr_[:,0].tolist()
380 | message += f' your_solution{ineq_idxs}={stu_sol[tuple(ineq_idxs)]}\n'
381 | message += f' correct_solution{ineq_idxs}={correct_sol[tuple(ineq_idxs)]}\n'
382 | message += test_case_title
383 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
384 | out_dict['test_args'] = test_args
385 | out_dict['test_kwargs'] = test_kwargs
386 | out_dict['stu_sol'] = stu_sol
387 | out_dict['correct_sol'] = correct_sol
388 | out_dict['message'] = message
389 | out_dict['passed'] = False
390 | return out_dict
391 |
392 | elif np.isscalar(correct_sol):
393 | equality_array = np.isclose(stu_sol, correct_sol, rtol=1e-05, atol=1e-08, equal_nan=True)
394 | if not equality_array.all():
395 | message = err_title + summary_title
396 | message += f'Your solution is {Fore_RED}not the same{Fore_BLACK} as the correct solution.\n'
397 | message += f' your_solution={stu_sol}\n'
398 | message += f' correct_solution={correct_sol}\n'
399 | message += test_case_title
400 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
401 | out_dict['test_args'] = test_args
402 | out_dict['test_kwargs'] = test_kwargs
403 | out_dict['stu_sol'] = stu_sol
404 | out_dict['correct_sol'] = correct_sol
405 | out_dict['message'] = message
406 | out_dict['passed'] = False
407 | return out_dict
408 |
409 | elif isinstance(correct_sol, tuple):
410 | if not correct_sol==stu_sol:
411 | message = err_title + summary_title
412 | message += f'Your solution is {Fore_RED}not the same{Fore_BLACK} as the correct solution.\n'
413 | message += f' your_solution={stu_sol}\n'
414 | message += f' correct_solution={correct_sol}\n'
415 | message += test_case_title
416 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
417 | out_dict['test_args'] = test_args
418 | out_dict['test_kwargs'] = test_kwargs
419 | out_dict['stu_sol'] = stu_sol
420 | out_dict['correct_sol'] = correct_sol
421 | out_dict['message'] = message
422 | out_dict['passed'] = False
423 | return out_dict
424 |
425 | else:
426 |         raise Exception('Comparison for this output data type is not implemented, sorry!')
427 |
428 | out_dict['test_args'] = None
429 | out_dict['test_kwargs'] = None
430 | out_dict['stu_sol'] = None
431 | out_dict['correct_sol'] = None
432 | out_dict['message'] = 'Well Done!'
433 | out_dict['passed'] = True
434 | return out_dict
435 |
436 | def show_test_cases(test_func, task_id=0):
437 | from IPython.display import clear_output
438 | file_path = f'{test_db_dir}/task_{task_id}.npz'
439 | npz_file = np.load(file_path)
440 | orig_images = npz_file['raw_images']
441 | ref_images = npz_file['ref_images']
442 | test_images = test_func(orig_images)
443 |
444 | visualize_ = visualize and perform_computation
445 |
446 | if not np.all(np.array(test_images.shape) == np.array(ref_images.shape)):
447 | print(f'Error: It seems the test images and the ref images have different shapes. Modify your function so that they both have the same shape.')
448 | print(f' test_images shape: {test_images.shape}')
449 | print(f' ref_images shape: {ref_images.shape}')
450 | return None, None, None, False
451 |
452 | if not np.all(np.array(test_images.dtype) == np.array(ref_images.dtype)):
453 | print(f'Error: It seems the test images and the ref images have different dtype. Modify your function so that they both have the same dtype.')
454 | print(f' test_images dtype: {test_images.dtype}')
455 | print(f' ref_images dtype: {ref_images.dtype}')
456 | return None, None, None, False
457 |
458 | for i in range(ref_images.shape[0]):
459 | if visualize_:
460 | nrows, ncols = 1, 3
461 | ax_w, ax_h = 5, 5
462 | fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*ax_w, nrows*ax_h))
463 | axes = np.array(axes).reshape(nrows, ncols)
464 |
465 | orig_image = orig_images[i]
466 | ref_image = ref_images[i]
467 | test_image = test_images[i]
468 |
469 | if visualize_:
470 | ax = axes[0,0]
471 | ax.pcolormesh(orig_image, edgecolors='k', linewidth=0.01, cmap='Greys')
472 | ax.xaxis.tick_top()
473 | ax.invert_yaxis()
474 |
475 | x_ticks = ax.get_xticks(minor=False).astype(np.int)
476 | x_ticks = x_ticks[x_ticks < orig_image.shape[1]]
477 | ax.set_xticks(x_ticks + 0.5)
478 | ax.set_xticklabels((x_ticks).astype(np.int))
479 |
480 | y_ticks = ax.get_yticks(minor=False).astype(np.int)
481 | y_ticks = y_ticks[y_ticks < orig_image.shape[0]]
482 | ax.set_yticks(y_ticks + 0.5)
483 | ax.set_yticklabels((y_ticks).astype(np.int))
484 |
485 | ax.set_aspect('equal')
486 | ax.set_title('Raw Image')
487 |
488 | ax = axes[0,1]
489 | ax.pcolormesh(ref_image, edgecolors='k', linewidth=0.01, cmap='Greys')
490 | ax.xaxis.tick_top()
491 | ax.invert_yaxis()
492 |
493 | x_ticks = ax.get_xticks(minor=False).astype(np.int)
494 | x_ticks = x_ticks[x_ticks < ref_image.shape[1]]
495 | ax.set_xticks(x_ticks+0.5)
496 | ax.set_xticklabels((x_ticks).astype(np.int))
497 |
498 | y_ticks = ax.get_yticks(minor=False).astype(np.int)
499 | y_ticks = y_ticks[y_ticks < ref_image.shape[0]]
500 | ax.set_yticks(y_ticks+0.5)
501 | ax.set_yticklabels((y_ticks).astype(np.int))
502 |
503 | ax.set_aspect('equal')
504 | ax.set_title('Reference Solution Image')
505 |
506 | ax = axes[0,2]
507 | ax.pcolormesh(test_image, edgecolors='k', linewidth=0.01, cmap='Greys')
508 | ax.xaxis.tick_top()
509 | ax.invert_yaxis()
510 |
511 | x_ticks = ax.get_xticks(minor=False).astype(np.int)
512 | x_ticks = x_ticks[x_ticks < test_image.shape[1]]
513 | ax.set_xticks(x_ticks + 0.5)
514 | ax.set_xticklabels((x_ticks).astype(np.int))
515 |
516 | y_ticks = ax.get_yticks(minor=False).astype(np.int)
517 | y_ticks = y_ticks[y_ticks < test_image.shape[0]]
518 | ax.set_yticks(y_ticks + 0.5)
519 | ax.set_yticklabels((y_ticks).astype(np.int))
520 |
521 | ax.set_aspect('equal')
522 | ax.set_title('Your Solution Image')
523 |
524 | if np.allclose(ref_image, test_image):
525 | if visualize_:
526 | print('The reference and solution images are the same to a T! Well done on this test case.')
527 | else:
528 | print('The reference and solution images are not the same...')
529 | ineq_idxs = np.array(np.where(np.logical_not(np.isclose(ref_image, test_image))))[:,0].tolist()
530 | print(f'ref_image{ineq_idxs}={ref_image[tuple(ineq_idxs)]}')
531 | print(f'test_image{ineq_idxs}={test_image[tuple(ineq_idxs)]}')
532 | if visualize_:
533 | print('I will return the images so that you will be able to diagnose the issue and resolve it...')
534 | return (orig_image, ref_image, test_image, False)
535 |
536 | if visualize_:
537 | plt.show()
538 |             input_prompt = '  Enter nothing to go to the next image\nor\n  Enter "s" when you are done to receive the three images. \n'
539 | input_prompt += ' **Don\'t forget to do this before continuing to the next step.**\n'
540 |
541 | try:
542 | cmd = input(input_prompt)
543 | except KeyboardInterrupt:
544 | cmd = 's'
545 |
546 | if cmd.lower().startswith('s'):
547 | return (orig_image, ref_image, test_image, True)
548 | else:
549 | clear_output(wait=True)
550 |
551 | return (orig_image, ref_image, test_image, True)
--------------------------------------------------------------------------------
/EMSegmentation-lib/payload_requirements.json:
--------------------------------------------------------------------------------
1 | {
2 | "comment": "The files property is a list of additional file paths (relative to this homework's notebook dir) that will be bundled into the student's ipynb metadata each time they save.",
3 | "files": []
4 | }
5 |
--------------------------------------------------------------------------------
/EMSegmentation-lib/pics/RobertMixed03.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/pics/RobertMixed03.jpg
--------------------------------------------------------------------------------
/EMSegmentation-lib/pics/smallstrelitzia.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/pics/smallstrelitzia.jpg
--------------------------------------------------------------------------------
/EMSegmentation-lib/pics/smallsunset.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/pics/smallsunset.jpg
--------------------------------------------------------------------------------
/EMSegmentation-lib/requirements1.txt:
--------------------------------------------------------------------------------
1 | # Stage 1: Normal Pypi index packages that have no reliance on PyTorch.
2 | # The Dockerfiles also may have specified some initial packages.
3 |
4 | scikit-learn==0.24
5 | scikit-image==0.18.3
6 | seaborn==0.10.0
7 | pandas==1.1.3
8 | numpy==1.18.1
9 | matplotlib==3.1.3
10 | Pillow==8.2.0
11 | jupyter_client==6.1.11
--------------------------------------------------------------------------------
/EMSegmentation-lib/test_db/task_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/test_db/task_1.npz
--------------------------------------------------------------------------------
/EMSegmentation-lib/test_db/task_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/test_db/task_2.npz
--------------------------------------------------------------------------------
/EMSegmentation-lib/test_db/task_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/test_db/task_3.npz
--------------------------------------------------------------------------------
/EMSegmentation-lib/test_db/task_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/test_db/task_4.npz
--------------------------------------------------------------------------------
/EMTopicModel-lib/EMTopicModel.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMTopicModel-lib/EMTopicModel.pdf
--------------------------------------------------------------------------------
/EMTopicModel-lib/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Assignment-specific public libraries and large data files (visible, read-only)
3 |
4 | This directory is for data and script files that are specific to one homework.
5 | This directory will be part of the Python path for all users. Libraries placed here should generally be those written by staff; Python packages can instead be installed system-wide via the Dockerfiles or the `requirements.txt` loading system (see below).
6 |
7 | Please don't confuse this directory with the `work/course-lib` directory that may contain data to be used by multiple homework assignments.
8 |
9 | ## Placement
10 |
11 | In the lab container, the contents of this directory will be placed in:
12 |
13 | ```
14 | work/release/[hwid]-lib
15 | ```
16 |
17 | Where `hwid` matches the homework ID for that assignment. For example, data files associated with `work/source/HW1/HW1.ipynb` should be placed in `work/source/HW1-lib/`. These files will end up in `work/release/HW1-lib` in the container.
18 |
19 | These files will be read-only. These will be available for all users, including students. However, it's better not to refer to these files using absolute paths; see best practices below.
20 |
21 | ## Special files
22 |
23 | ### `payload_requirements.json`
24 |
25 | **Only staff can configure this file.**
26 |
27 | The `payload_requirements.json` file, if present, specifies additional files that will be submitted along with the notebook. The file should contain an object with a `"files"` property that is a list of strings that are relative paths to files under the current homework notebook's working directory. For example:
28 |
29 | ```json
30 | {
31 | "files": ["some-file.db", "inner_directory/nested_file.txt"]
32 | }
33 | ```
34 |
35 | If the homework ID for this homework is "HW1", the above example would specify these additional files to be collected:
36 |
37 | - `work/release/HW1/some-file.db`
38 | - `work/release/HW1/inner_directory/nested_file.txt`
39 |
40 | ### `requirements.txt` sequence
41 |
42 | **Only staff can configure these files.**
43 |
44 | If some of the Python packages for a particular assignment need to be frozen, you can specify packages and versions in one or more files named `requirements*.txt`, where you may want to put a number before the extension such as `requirements1.txt`. These will be processed one at a time in natural version order (like `sort -V` in Linux) during the Docker build. For example, `requirements1.txt`, `requirements2.txt`, and `requirements10.txt` would be processed in that numeric order.
45 |
46 | ## Best practices
47 |
48 | It's important to put public staff libraries and large data files in this directory to prevent editing and ensure efficient use of the cloud storage. Large files will also have improved access time from this directory compared to the notebook directory.
49 |
50 | Python files in this directory will be on the Python system path in the container, so you may want to write a Python loader for data you need and refer to the data with relative paths under the library directory (rather than using ".." to refer to a directory above). However, if students try to work on files offline, this may complicate things and require shims to adjust the paths.
51 |
52 | Staff members should refer to additional notes in the staff library directories.
53 |
--------------------------------------------------------------------------------
/EMTopicModel-lib/aml_utils.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 | import pandas as pd
4 | import os
5 | import sys
6 | import traceback
7 | from pygments import formatters, highlight, lexers
8 | import re
9 | import inspect
10 | import types
11 | import copy
12 |
13 | visualize = True
14 | perform_computation = True
15 | test_db_dir = os.path.dirname(os.path.realpath(__file__)) + '/test_db'
16 | NO_DASHES = 55
17 |
18 | Fore_BLUE_LIGHT = u'\u001b[38;5;19m'
19 | Fore_RED_LIGHT = u'\u001b[38;5;196m'
20 | Fore_BLUE = u'\u001b[38;5;34m'
21 | Fore_RED = '\x1b[1;31m'
22 | FORE_GREEN_DARK = u'\u001b[38;5;22m'
23 | Fore_DARKRED = u'\u001b[38;5;124m'
24 | Fore_MAGENTA = '\x1b[1m' + u'\u001b[38;5;92m'
25 | Fore_GREEN = u'\u001b[38;5;32m'
26 | Fore_BLACK = '\x1b[0m' + u'\u001b[38;5;0m'
27 |
28 |
29 | ########################################################################################
30 | ######################## Utilities for traceback processing ############################
31 | ########################################################################################
32 | def keep_tb_rule(tb):
33 | tb_file_path = tb.tb_frame.f_code.co_filename
34 | if os.path.realpath(__file__) == os.path.realpath(tb_file_path):
35 | return False
36 | else:
37 | return True
38 |
39 | def censor_exc_traceback(exc_traceback):
40 | original_tb_list = []
41 | tb_next = exc_traceback
42 | while tb_next is not None:
43 | original_tb_list.append(tb_next)
44 | tb_next = tb_next.tb_next
45 |
46 | censored_tb_list = [tb for tb in original_tb_list if keep_tb_rule(tb)]
47 |
48 | for i, tb in enumerate(censored_tb_list[:-1]):
49 | tb.tb_next = censored_tb_list[i+1]
50 |
51 | if len(censored_tb_list) > 0:
52 | return censored_tb_list[0]
53 | else:
54 | return exc_traceback
55 |
56 | try:
57 | import IPython
58 | ultratb = IPython.core.ultratb.VerboseTB(include_vars=False)
59 | def get_tb_colored_str(exc_type, exc_value, exc_traceback):
60 | manipulated_exc_traceback = censor_exc_traceback(exc_traceback)
61 | tb_text = ultratb.text(exc_type, exc_value, manipulated_exc_traceback)
62 | tb_text = re.sub( r"/tmp/ipykernel_.*.py", "/Jupyter/Notebook/Student/Task/Implementation/Cells", tb_text)
63 | tb_text = re.sub( r"\s{20,}Traceback", " Traceback", tb_text)
64 | s_split = tb_text.split('\n')
65 | if len(s_split) > 0:
66 | c_s_split = s_split[1:]
67 | tb_text = '\n'.join(c_s_split) + '\n'
68 | tb_text = tb_text.replace('\x1b[0;36m', '\x1b[1m \x1b[1;34m')
69 | return tb_text
70 | except:
71 | def get_tb_colored_str(exc_type, exc_value, exc_traceback):
72 | manipulated_exc_traceback = censor_exc_traceback(exc_traceback)
73 | tb_text = traceback.format_exception(exc_type, exc_value, manipulated_exc_traceback, limit=None, chain=True)
74 | tb_text = ''.join(tb_text)
75 | tb_text = re.sub( r"\"/tmp/ipykernel_.*\"", "\"/Jupyter/Notebook/Student/Task/Implementation/Cells\"", tb_text)
76 | lexer = lexers.get_lexer_by_name("pytb", stripall=True)
77 | formatter = formatters.get_formatter_by_name("terminal16m")
78 | tb_colored = highlight(tb_text, lexer, formatter)
79 | return tb_colored
80 |
81 | try:
82 | from IPython.utils import PyColorize
83 | color_parser = PyColorize.Parser(color_table=None, out="str", parent=None, style='Linux')
84 | def code_color_parser(code_str):
85 | return color_parser.format(code_str)
86 | except:
87 | def code_color_parser(code_str):
88 | return code_str
89 |
90 | def get_num_indents(src_list):
91 | assert len(src_list) > 0
92 | a = [line + 20 * ' ' for line in src_list]
93 | b = [len(line) - len(line.lstrip()) for line in a]
94 | assert b[0] == 0
95 | c = min(b[1:])
96 | return c
97 |
98 | def code_snippet_maker(stu_function, args, kwargs):
99 | test_kwargs_str_lst = []
100 | test_kwargs_str_lst.append('from copy import deepcopy')
101 | test_kwargs_str_lst.append("failed_arguments = deepcopy(test_results['test_kwargs'])")
102 | for arg_ in args:
103 | test_kwargs_str_lst.append(arg_)
104 | for key,val in kwargs.items():
105 | test_kwargs_str_lst.append(f"{key} = failed_arguments['{key}']")
106 | test_kwargs_str = ', '.join(test_kwargs_str_lst)
107 |
108 | if hasattr(stu_function, '__name__'):
109 | stu_func_name = stu_function.__name__
110 | else:
111 | stu_func_name = 'YOUR_FUNCTION_NAME'
112 |
113 | check_list_code = []
114 | check_list_code.append(f"correct_sol = test_results['correct_sol'] # The Reference Solution")
115 | check_list_code.append(f"if isinstance(correct_sol, np.ndarray):")
116 | check_list_code.append(f" assert isinstance(my_solution, np.ndarray)")
117 | check_list_code.append(f" assert my_solution.dtype is correct_sol.dtype")
118 | check_list_code.append(f" assert my_solution.shape == correct_sol.shape")
119 | check_list_code.append(f" assert np.allclose(my_solution, correct_sol)")
120 | check_list_code.append(f" print('If you passed the above assertions, it probably means that you have fixed the issue! Well Done!')")
121 | check_list_code.append(f" print('Now you have to do 3 things:')")
122 | check_list_code.append(f" print(' 1) Carefully copy the fixed code body back to the {stu_func_name} function.')")
123 | check_list_code.append(f" print(' 2) If you copied any \"returned_var = \" lines, convert them back to return statements.')")
124 | check_list_code.append(f" print(' 3) Carefully remove this cell (i.e., the cell you inserted and modified) once you are done.')")
125 |
126 | try:
127 | src = inspect.getsource(stu_function)
128 | src_list = src.split('\n')
129 | src_list = [line for line in src_list if not (line.strip().startswith('#'))]
130 | no_indents = get_num_indents(src_list)
131 | mod_src_list = []
132 | src_gen = src_list[1:] if src_list[0].startswith('def') else src_list
133 | for line in src_gen:
134 | if len(line) > no_indents:
135 | shifted_left_line = line[no_indents:]
136 | else:
137 | shifted_left_line = line
138 |
139 | return_statement = 'return '
140 | if not shifted_left_line.lstrip().startswith(return_statement):
141 | mod_src_list.append(shifted_left_line)
142 | else:
143 | i = shifted_left_line.index(return_statement)
144 | shifted_left_line = shifted_left_line[:i] + 'returned_var = ' + shifted_left_line[i+len(return_statement):] + ' # returned variable'
145 | mod_src_list.append(shifted_left_line)
146 |
147 | mod_bodysrc_list = '\n'.join(mod_src_list).strip().split('\n')
148 |
149 | mod_src_list = []
150 | mod_src_list = mod_src_list + ['### You can copy the following auto-generated snippet into a new cell to reproduce the issue.']
151 | mod_src_list = mod_src_list + ['### Use the + button on the top left of the screen to insert a new cell below.']
152 | mod_src_list = mod_src_list + ['']
153 | mod_src_list = mod_src_list + ['#'*7 + ' Test Arguments ' + '#'*7] + test_kwargs_str_lst
154 | mod_src_list = mod_src_list + ['\n' + '#'*7 + ' Your Code Body ' + '#'*7] + mod_bodysrc_list
155 | mod_src_list.append('\n' + '#'*5 + ' Checking Solutions '+ '#'*6)
156 | mod_src_list.append(f"my_solution = returned_var # Your Solution")
157 | mod_src_list = mod_src_list + check_list_code
158 | processed_code = '\n'.join(mod_src_list)
159 | except:
160 | mod_src_list = []
161 | mod_src_list.append(f"my_solution = {stu_func_name}({test_kwargs_str})")
162 | mod_src_list = mod_src_list + check_list_code
163 | processed_code = '\n'.join(mod_src_list)
164 |
165 | return processed_code
166 |
167 |
168 | ########################################################################################
169 | ####################### Utilities for comparison processing ############################
170 | ########################################################################################
171 | def retrieve_item(item_name, ptr_, test_idx, npz_file):
172 | item_shape = npz_file[f'shape_{item_name}'][test_idx]
173 | item_size = int(np.prod(item_shape))
174 | item = npz_file[item_name][ptr_:(ptr_+item_size)].reshape(item_shape)
175 | return item, ptr_+item_size
176 |
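# --- Illustrative sketch (added for clarity; not part of the original helpers). ---
# The test databases store each argument as one flat array plus a matching
# `shape_*` array holding one shape row per test case; `retrieve_item` walks the
# flat array with a running pointer. The dict below is a stand-in for a loaded
# npz file, with hypothetical contents.
def _demo_retrieve_item():
    fake_npz = {
        'arg_0': np.concatenate([np.arange(6.), np.arange(4.)]),  # two flattened test cases, back to back
        'shape_arg_0': np.array([[2, 3], [2, 2]]),                # one shape row per test case
    }
    ptr = 0
    first, ptr = retrieve_item('arg_0', ptr, 0, fake_npz)   # (2, 3) array; ptr advances to 6
    second, ptr = retrieve_item('arg_0', ptr, 1, fake_npz)  # (2, 2) array; ptr advances to 10
    assert first.shape == (2, 3) and second.shape == (2, 2)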
177 | class NPStrListCoder:
178 | def __init__(self):
179 | self.filler = '?'
180 | self.spacer = ':'
181 | self.max_len = 100
182 |
183 | def encode(self, str_list):
184 | my_str_ = self.spacer.join(str_list)
185 | str_hex_data = [ord(c) for c in my_str_]
186 | assert_msg = f'Increase max_len; too many characters: {len(str_hex_data)} > {self.max_len}'
187 | assert len(str_hex_data) <= self.max_len, assert_msg
188 | str_hex_data = str_hex_data + [ord(self.filler) for _ in range(self.max_len - len(str_hex_data))]
189 | str_hex_np = np.array(str_hex_data)
190 | return str_hex_np
191 |
192 | def decode(self, np_arr):
193 | a = ''.join([chr(i) for i in np_arr])
194 | recovered_list = a.replace(self.filler, '').split(self.spacer)
195 | return recovered_list
196 |
197 | str2np_coder = NPStrListCoder()
198 |
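# --- Illustrative sketch (added for clarity; not part of the original helpers). ---
# A round trip through NPStrListCoder: a list of DataFrame column names is
# packed into a fixed-length (100,) array of character codes, padded with '?',
# and recovered verbatim on decode (assuming the names contain no '?' or ':').
def _demo_str_coder():
    cols = ['height', 'weight', 'age']
    packed = str2np_coder.encode(cols)
    assert str2np_coder.decode(packed) == cols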
199 | def test_case_loader(test_file):
200 | npz_file = np.load(test_file)
201 | arg_id_list = sorted([int(key[4:]) for key in npz_file.keys() if key.startswith('arg_')])
202 | kwarg_names_list = sorted([key[6:] for key in npz_file.keys() if key.startswith('kwarg_')])
203 |
204 | arg_ptr_list = [0 for _ in range(len(arg_id_list))]
205 | dfcarg_ptr_list = [0 for _ in range(len(arg_id_list))]
206 | kwarg_ptr_list = [0 for _ in range(len(kwarg_names_list))]
207 | dfckwarg_ptr_list = [0 for _ in range(len(kwarg_names_list))]
208 | out_ptr = 0
209 | for i in np.arange(npz_file['num_tests']):
210 | args_list = []
211 | for arg_id, arg_id_ in enumerate(arg_id_list):
212 | arg_item, arg_ptr_list[arg_id] = retrieve_item(f'arg_{arg_id_}', arg_ptr_list[arg_id], i, npz_file)
213 | if f'dfcarg_{arg_id_}' in npz_file.keys():
214 | col_list_code, dfcarg_ptr_list[arg_id] = retrieve_item(f'dfcarg_{arg_id_}', dfcarg_ptr_list[arg_id], i, npz_file)
215 | arg_item = pd.DataFrame(arg_item, columns=str2np_coder.decode(col_list_code))
216 | args_list.append(arg_item)
217 | args = tuple(args_list)
218 |
219 | kwargs = {}
220 | for kwarg_id, kwarg_name in enumerate(kwarg_names_list):
221 | kwarg_item, kwarg_ptr_list[kwarg_id] = retrieve_item(f'kwarg_{kwarg_name}', kwarg_ptr_list[kwarg_id], i, npz_file)
222 | if f'dfckwarg_{kwarg_name}' in npz_file.keys():
223 | col_list_code, dfckwarg_ptr_list[kwarg_id] = retrieve_item(f'dfckwarg_{kwarg_name}', dfckwarg_ptr_list[kwarg_id], i, npz_file)
224 | kwarg_item = pd.DataFrame(kwarg_item, columns=str2np_coder.decode(col_list_code))
225 | kwargs[kwarg_name]=kwarg_item
226 |
227 | output, out_ptr = retrieve_item(f'output', out_ptr, i, npz_file)
228 |
229 | yield args, kwargs, output
230 |
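# --- Illustrative sketch (added for clarity; not part of the original helpers). ---
# Consuming the generator directly; `task_1.npz` stands for whichever test
# database exists for the assignment at hand. Each iteration yields one stored
# test case: the positional args, the keyword args (rebuilt as DataFrames when
# column codes were stored alongside them), and the reference output.
def _demo_test_case_loader():
    for args, kwargs, expected in test_case_loader(f'{test_db_dir}/task_1.npz'):
        print(len(args), sorted(kwargs), expected.shape)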
231 | def arg2str(args, kwargs, adv_user_msg=False, stu_func=None):
232 | msg = ''
233 |
234 | for arg_ in args:
235 | msg += f'{arg_}\n'
236 | for key,val in kwargs.items():
237 | try:
238 | val_str = np.array_repr(val)
239 | except:
240 | val_str = val
241 | new_line = f'{Fore_MAGENTA}{key}{Fore_BLACK} = {val_str}\n'
242 | new_line = new_line.replace(' = array(',' = np.array(')
243 | new_line = new_line.replace('nan,','np.nan,')
244 | msg += new_line
245 |
246 |
247 | if adv_user_msg:
248 | try:
249 | is_stu_func_lambda = isinstance(stu_func, types.LambdaType)
250 | if is_stu_func_lambda:
251 | is_stu_func_lambda = stu_func.__name__ == "<lambda>"  # all lambdas share the name "<lambda>"
252 | if not is_stu_func_lambda:
253 | code_title_ = f'\n' + '-' * (NO_DASHES-1) + f'{Fore_RED} Reproducing Code Snippet {Fore_BLACK}' + '-' * (NO_DASHES-2) + '\n'
254 | code = code_snippet_maker(stu_func, args, kwargs)
255 | msg += code_title_ + code_color_parser(code)
256 | except:
257 | pass
258 | return msg
259 |
260 |
261 | def test_case_checker(stu_func, task_id=0):
262 | out_dict = {}
263 | out_dict['task_number'] = task_id
264 | out_dict['exception'] = None
265 | out_dict['exception_info'] = None
266 | test_db_npz = f'{test_db_dir}/task_{task_id}.npz'
267 | if not os.path.exists(test_db_npz):
268 | out_dict['message'] = f'Test database test_db/task_{task_id}.npz does not exist... aborting!'
269 | out_dict['passed'] = False
270 | out_dict['test_args'] = None
271 | out_dict['test_kwargs'] = None
272 | out_dict['stu_sol'] = None
273 | out_dict['correct_sol'] = None
274 | return out_dict
275 |
276 | if hasattr(stu_func, '__name__'):
277 | stu_func_name = stu_func.__name__
278 | else:
279 | stu_func_name = None
280 |
281 | done = False
282 | err_title = f'\n' + '*' * NO_DASHES + f'{Fore_RED} Error in Task {task_id} {Fore_BLACK}' + '*' * NO_DASHES + f'\n'
283 | test_case_title = '\n' + '-' * NO_DASHES + f'{Fore_RED} Test Case Arguments {Fore_BLACK}' + '-' * NO_DASHES + '\n'
284 | summary_title = '-' * NO_DASHES + f' {Fore_RED} Summary {Fore_BLACK}' + '-' * NO_DASHES + '\n'
285 | for (test_args, test_kwargs, correct_sol) in test_case_loader(test_db_npz):
286 | try:
287 | stu_args_copy = copy.deepcopy(test_args)
288 | stu_kwargs_copy = copy.deepcopy(test_kwargs)
289 | stu_sol = stu_func(*stu_args_copy, **stu_kwargs_copy)
290 | except Exception as stu_exception:
291 | stu_sol = None
292 | stu_exception_info = sys.exc_info()
293 | message = err_title + summary_title
294 | message += f'Your code {Fore_RED}crashed{Fore_BLACK} while being evaluated on a test case.'
295 | message += f' The rest of this message provides the following material:\n'
296 | message += f' 1. The exception traceback detailing how the error occurred.\n'
297 | message += f' 2. The specific test case arguments that caused the error.\n'
298 | message += f' 3. A code snippet that can conveniently reproduce the error.\n'
299 | message += f' -> You can {Fore_RED}copy and paste{Fore_BLACK} the {Fore_RED}code snippet{Fore_BLACK} into a {Fore_RED}new cell{Fore_BLACK}, and run the cell to reproduce the error.\n\n'
300 | message += '-' * NO_DASHES + f'{Fore_RED} Exception Traceback {Fore_BLACK}' + '-' * NO_DASHES + '\n'
301 | message += get_tb_colored_str(*stu_exception_info)
302 | message += test_case_title
303 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
304 | out_dict['test_args'] = test_args
305 | out_dict['test_kwargs'] = test_kwargs
306 | out_dict['stu_sol'] = stu_sol
307 | out_dict['correct_sol'] = correct_sol
308 | out_dict['message'] = message
309 | out_dict['passed'] = False
310 | out_dict['exception'] = stu_exception
311 | out_dict['exception_info'] = stu_exception_info
312 | return out_dict
313 |
314 | if isinstance(correct_sol, np.ndarray) and np.isscalar(stu_sol):
315 | # This is handling a special case: When scalar numpy objects are stored,
316 | # they will be converted to a numpy array upon loading.
317 | # In this case, we'll give students the benefit of the doubt,
318 | # and assume the correct solution already was a scalar.
319 | if correct_sol.size == 1:
320 | correct_sol = np.float64(correct_sol.item())
321 | stu_sol = np.float64(np.float64(stu_sol).item())
322 |
323 | #Type Sanity check
324 | if type(stu_sol) is not type(correct_sol):
325 | message = err_title + summary_title
326 | message += f'Your solution\'s {Fore_RED}output type{Fore_BLACK} is not the same as '
327 | message += f'the reference solution\'s data type.\n'
328 | message += f' Your solution\'s type --> {Fore_RED}{type(stu_sol)}{Fore_BLACK}\n'
329 | message += f' Correct solution\'s type --> {Fore_RED}{type(correct_sol)}{Fore_BLACK}\n'
330 | message += test_case_title
331 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
332 | out_dict['test_args'] = test_args
333 | out_dict['test_kwargs'] = test_kwargs
334 | out_dict['stu_sol'] = stu_sol
335 | out_dict['correct_sol'] = correct_sol
336 | out_dict['message'] = message
337 | out_dict['passed'] = False
338 | return out_dict
339 |
340 | if isinstance(correct_sol, np.ndarray):
341 | if correct_sol.shape != stu_sol.shape:
342 | message = err_title + summary_title
343 | message += f'Your solution\'s {Fore_RED}output numpy shape{Fore_BLACK} is not the same as '
344 | message += f'the reference solution\'s shape.\n'
345 | message += f' Your solution\'s shape --> {Fore_RED}{stu_sol.shape}{Fore_BLACK}\n'
346 | message += f' Correct solution\'s shape --> {Fore_RED}{correct_sol.shape}{Fore_BLACK}\n'
347 | message += test_case_title
348 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
349 | out_dict['test_args'] = test_args
350 | out_dict['test_kwargs'] = test_kwargs
351 | out_dict['stu_sol'] = stu_sol
352 | out_dict['correct_sol'] = correct_sol
353 | out_dict['message'] = message
354 | out_dict['passed'] = False
355 | return out_dict
356 |
357 | if stu_sol.dtype != correct_sol.dtype:
358 | message = err_title + summary_title
359 | message += f'Your solution\'s {Fore_RED}output numpy dtype{Fore_BLACK} is not the same as '
360 | message += f'the reference solution\'s dtype.\n'
361 | message += f' Your solution\'s dtype --> {Fore_RED}np.{stu_sol.dtype}{Fore_BLACK}\n'
362 | message += f' Correct solution\'s dtype --> {Fore_RED}np.{correct_sol.dtype}{Fore_BLACK}\n'
363 | message += test_case_title
364 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
365 | out_dict['test_args'] = test_args
366 | out_dict['test_kwargs'] = test_kwargs
367 | out_dict['stu_sol'] = stu_sol
368 | out_dict['correct_sol'] = correct_sol
369 | out_dict['message'] = message
370 | out_dict['passed'] = False
371 | return out_dict
372 |
373 | if isinstance(correct_sol, np.ndarray):
374 | equality_array = np.isclose(stu_sol, correct_sol, rtol=1e-05, atol=1e-08, equal_nan=True)
375 | if not equality_array.all():
376 | message = err_title + summary_title
377 | message += f'Your solution is {Fore_RED}not the same{Fore_BLACK} as the correct solution.\n'
378 | whr_ = np.array(np.where(np.logical_not(equality_array)))
379 | ineq_idxs = whr_[:,0].tolist()
380 | message += f' your_solution{ineq_idxs}={stu_sol[tuple(ineq_idxs)]}\n'
381 | message += f' correct_solution{ineq_idxs}={correct_sol[tuple(ineq_idxs)]}\n'
382 | message += test_case_title
383 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
384 | out_dict['test_args'] = test_args
385 | out_dict['test_kwargs'] = test_kwargs
386 | out_dict['stu_sol'] = stu_sol
387 | out_dict['correct_sol'] = correct_sol
388 | out_dict['message'] = message
389 | out_dict['passed'] = False
390 | return out_dict
391 |
392 | elif np.isscalar(correct_sol):
393 | equality_array = np.isclose(stu_sol, correct_sol, rtol=1e-05, atol=1e-08, equal_nan=True)
394 | if not equality_array.all():
395 | message = err_title + summary_title
396 | message += f'Your solution is {Fore_RED}not the same{Fore_BLACK} as the correct solution.\n'
397 | message += f' your_solution={stu_sol}\n'
398 | message += f' correct_solution={correct_sol}\n'
399 | message += test_case_title
400 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
401 | out_dict['test_args'] = test_args
402 | out_dict['test_kwargs'] = test_kwargs
403 | out_dict['stu_sol'] = stu_sol
404 | out_dict['correct_sol'] = correct_sol
405 | out_dict['message'] = message
406 | out_dict['passed'] = False
407 | return out_dict
408 |
409 | elif isinstance(correct_sol, tuple):
410 | if not correct_sol==stu_sol:
411 | message = err_title + summary_title
412 | message += f'Your solution is {Fore_RED}not the same{Fore_BLACK} as the correct solution.\n'
413 | message += f' your_solution={stu_sol}\n'
414 | message += f' correct_solution={correct_sol}\n'
415 | message += test_case_title
416 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
417 | out_dict['test_args'] = test_args
418 | out_dict['test_kwargs'] = test_kwargs
419 | out_dict['stu_sol'] = stu_sol
420 | out_dict['correct_sol'] = correct_sol
421 | out_dict['message'] = message
422 | out_dict['passed'] = False
423 | return out_dict
424 |
425 | else:
426 | raise NotImplementedError(f'Comparison is not implemented for outputs of type {type(correct_sol)}.')
427 |
428 | out_dict['test_args'] = None
429 | out_dict['test_kwargs'] = None
430 | out_dict['stu_sol'] = None
431 | out_dict['correct_sol'] = None
432 | out_dict['message'] = 'Well Done!'
433 | out_dict['passed'] = True
434 | return out_dict
435 |
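# --- Illustrative sketch (added for clarity; not part of the original helpers). ---
# The intended call pattern from a notebook; `my_answer` is a hypothetical
# student function for a hypothetical task 1. The checker deep-copies each
# stored test case, runs the function, and compares type, shape, dtype, and
# values against the reference solution, stopping at the first failure.
def _demo_test_case_checker():
    def my_answer(x):
        return x * 2
    result = test_case_checker(my_answer, task_id=1)
    print(result['passed'])
    print(result['message'])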
436 | def show_test_cases(test_func, task_id=0):
437 | from IPython.display import clear_output
438 | file_path = f'{test_db_dir}/task_{task_id}.npz'
439 | npz_file = np.load(file_path)
440 | orig_images = npz_file['raw_images']
441 | ref_images = npz_file['ref_images']
442 | test_images = test_func(orig_images)
443 |
444 | visualize_ = visualize and perform_computation
445 |
446 | if test_images.shape != ref_images.shape:
447 | print(f'Error: It seems the test images and the ref images have different shapes. Modify your function so that they both have the same shape.')
448 | print(f' test_images shape: {test_images.shape}')
449 | print(f' ref_images shape: {ref_images.shape}')
450 | return None, None, None, False
451 |
452 | if test_images.dtype != ref_images.dtype:
453 | print(f'Error: It seems the test images and the ref images have different dtype. Modify your function so that they both have the same dtype.')
454 | print(f' test_images dtype: {test_images.dtype}')
455 | print(f' ref_images dtype: {ref_images.dtype}')
456 | return None, None, None, False
457 |
458 | for i in range(ref_images.shape[0]):
459 | if visualize_:
460 | nrows, ncols = 1, 3
461 | ax_w, ax_h = 5, 5
462 | fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*ax_w, nrows*ax_h))
463 | axes = np.array(axes).reshape(nrows, ncols)
464 |
465 | orig_image = orig_images[i]
466 | ref_image = ref_images[i]
467 | test_image = test_images[i]
468 |
469 | if visualize_:
470 | ax = axes[0,0]
471 | ax.pcolormesh(orig_image, edgecolors='k', linewidth=0.01, cmap='Greys')
472 | ax.xaxis.tick_top()
473 | ax.invert_yaxis()
474 |
475 | x_ticks = ax.get_xticks(minor=False).astype(int)
476 | x_ticks = x_ticks[x_ticks < orig_image.shape[1]]
477 | ax.set_xticks(x_ticks + 0.5)
478 | ax.set_xticklabels((x_ticks).astype(int))
479 |
480 | y_ticks = ax.get_yticks(minor=False).astype(int)
481 | y_ticks = y_ticks[y_ticks < orig_image.shape[0]]
482 | ax.set_yticks(y_ticks + 0.5)
483 | ax.set_yticklabels((y_ticks).astype(int))
484 |
485 | ax.set_aspect('equal')
486 | ax.set_title('Raw Image')
487 |
488 | ax = axes[0,1]
489 | ax.pcolormesh(ref_image, edgecolors='k', linewidth=0.01, cmap='Greys')
490 | ax.xaxis.tick_top()
491 | ax.invert_yaxis()
492 |
493 | x_ticks = ax.get_xticks(minor=False).astype(int)
494 | x_ticks = x_ticks[x_ticks < ref_image.shape[1]]
495 | ax.set_xticks(x_ticks+0.5)
496 | ax.set_xticklabels((x_ticks).astype(int))
497 |
498 | y_ticks = ax.get_yticks(minor=False).astype(int)
499 | y_ticks = y_ticks[y_ticks < ref_image.shape[0]]
500 | ax.set_yticks(y_ticks+0.5)
501 | ax.set_yticklabels((y_ticks).astype(int))
502 |
503 | ax.set_aspect('equal')
504 | ax.set_title('Reference Solution Image')
505 |
506 | ax = axes[0,2]
507 | ax.pcolormesh(test_image, edgecolors='k', linewidth=0.01, cmap='Greys')
508 | ax.xaxis.tick_top()
509 | ax.invert_yaxis()
510 |
511 | x_ticks = ax.get_xticks(minor=False).astype(int)
512 | x_ticks = x_ticks[x_ticks < test_image.shape[1]]
513 | ax.set_xticks(x_ticks + 0.5)
514 | ax.set_xticklabels((x_ticks).astype(int))
515 |
516 | y_ticks = ax.get_yticks(minor=False).astype(int)
517 | y_ticks = y_ticks[y_ticks < test_image.shape[0]]
518 | ax.set_yticks(y_ticks + 0.5)
519 | ax.set_yticklabels((y_ticks).astype(int))
520 |
521 | ax.set_aspect('equal')
522 | ax.set_title('Your Solution Image')
523 |
524 | if np.allclose(ref_image, test_image):
525 | if visualize_:
526 | print('The reference and solution images are the same to a T! Well done on this test case.')
527 | else:
528 | print('The reference and solution images are not the same...')
529 | ineq_idxs = np.array(np.where(np.logical_not(np.isclose(ref_image, test_image))))[:,0].tolist()
530 | print(f'ref_image{ineq_idxs}={ref_image[tuple(ineq_idxs)]}')
531 | print(f'test_image{ineq_idxs}={test_image[tuple(ineq_idxs)]}')
532 | if visualize_:
533 | print('I will return the images so that you will be able to diagnose the issue and resolve it...')
534 | return (orig_image, ref_image, test_image, False)
535 |
536 | if visualize_:
537 | plt.show()
538 | input_prompt = ' Enter nothing to go to the next image\nor\n Enter "s" when you are done to receive the three images. \n'
539 | input_prompt += ' **Don\'t forget to do this before continuing to the next step.**\n'
540 |
541 | try:
542 | cmd = input(input_prompt)
543 | except KeyboardInterrupt:
544 | cmd = 's'
545 |
546 | if cmd.lower().startswith('s'):
547 | return (orig_image, ref_image, test_image, True)
548 | else:
549 | clear_output(wait=True)
550 |
551 | return (orig_image, ref_image, test_image, True)
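# --- Illustrative sketch (added for clarity; not part of the original helpers). ---
# Driving the interactive viewer; `my_image_func` is a hypothetical function
# that maps the whole stack of raw images to processed images at once. The
# loop shows raw/reference/solution panels side by side and stops early with
# diagnostics on the first mismatching image.
def _demo_show_test_cases():
    def my_image_func(images):
        return images  # identity, just to exercise the plotting loop
    orig, ref, test, done = show_test_cases(my_image_func, task_id=1)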
--------------------------------------------------------------------------------
/EMTopicModel-lib/payload_requirements.json:
--------------------------------------------------------------------------------
1 | {
2 | "comment": "The files property is a list of additional file paths (relative to this homework's notebook dir) that will be bundled into the student's ipynb metadata each time they save.",
3 | "files": []
4 | }
5 |
--------------------------------------------------------------------------------
/EMTopicModel-lib/requirements1.txt:
--------------------------------------------------------------------------------
1 | scipy==1.5.3
2 | jupyter_client==6.1.11
3 |
--------------------------------------------------------------------------------
/EMTopicModel-lib/test_db/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMTopicModel-lib/test_db/.DS_Store
--------------------------------------------------------------------------------
/EMTopicModel-lib/test_db/task_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMTopicModel-lib/test_db/task_1.npz
--------------------------------------------------------------------------------
/EMTopicModel-lib/test_db/task_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMTopicModel-lib/test_db/task_2.npz
--------------------------------------------------------------------------------
/EMTopicModel-lib/test_db/task_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMTopicModel-lib/test_db/task_3.npz
--------------------------------------------------------------------------------
/EMTopicModel-lib/words/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMTopicModel-lib/words/.DS_Store
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CS441-AppliedMachineLearning
2 |
3 | These are my coding assignments for UIUC CS441 (Applied Machine Learning).
4 |
5 | Topics:
6 | - Basic Classification
7 | - Classifying Images
8 | - SVM using Stochastic Gradient Descent
9 | - Regression
10 | - GLMnet
11 | - PCA and PCoA
12 | - Clustering
13 | - High-Dimensional Classification
14 | - Expectation-Maximization for the Topic Model
15 | - Expectation-Maximization for Mixtures of Normals
16 | - Mean Field
17 | - Convolutional Neural Networks
18 |
19 |
--------------------------------------------------------------------------------