├── .DS_Store ├── BasicClassification.ipynb ├── CNN.ipynb ├── ClassifyingImages.ipynb ├── Clustering.ipynb ├── EMSegmentation-lib ├── .DS_Store ├── EMSegmentation.pdf ├── README.md ├── aml_utils.py ├── payload_requirements.json ├── pics │ ├── RobertMixed03.jpg │ ├── smallstrelitzia.jpg │ └── smallsunset.jpg ├── requirements1.txt └── test_db │ ├── .DS_Store │ ├── task_1.npz │ ├── task_2.npz │ ├── task_3.npz │ └── task_4.npz ├── EMSegmentation.ipynb ├── EMTopicModel-lib ├── .DS_Store ├── EMTopicModel.pdf ├── README.md ├── aml_utils.py ├── payload_requirements.json ├── requirements1.txt ├── test_db │ ├── .DS_Store │ ├── task_1.npz │ ├── task_2.npz │ └── task_3.npz └── words │ ├── .DS_Store │ ├── docword.nips.txt │ └── vocab.nips.txt ├── EMTopicModel.ipynb ├── GLMnet.ipynb ├── HiDimClassification.ipynb ├── MeanField.ipynb ├── PCA.ipynb ├── README.md ├── Regression.ipynb └── SGDSVM.ipynb /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/.DS_Store -------------------------------------------------------------------------------- /BasicClassification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# * Prerequisites" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In this assignment you will implement the Naive Bayes Classifier. Before starting this assignment, make sure you understand the concepts discussed in the videos in Week 2 about Naive Bayes. You can also find it useful to read Chapter 1 of the textbook.\n", 15 | "\n", 16 | "\n", 17 | "Also, make sure that you are familiar with the `numpy.ndarray` class of python's `numpy` library and that you are able to answer the following questions:\n", 18 | "\n", 19 | "Let's assume `a` is a numpy array.\n", 20 | "* What is an array's shape (e.g., what is the meaning of `a.shape`)? \n", 21 | "* What is numpy's reshaping operation? How much computational over-head would it induce? \n", 22 | "* What is numpy's transpose operation, and how it is different from reshaping? Does it cause computation overhead?\n", 23 | "* What is the meaning of the commands `a.reshape(-1, 1)` and `a.reshape(-1)`?\n", 24 | "* Would happens to the variable `a` after we call `b = a.reshape(-1)`? Does any of the attributes of `a` change?\n", 25 | "* How do assignments in python and numpy work in general?\n", 26 | " * Does the `b=a` statement use copying by value? Or is it copying by reference?\n", 27 | " * Would the answer to the previous question change depending on whether `a` is a numpy array or a scalar value?\n", 28 | " \n", 29 | "You can answer all of these questions by\n", 30 | "\n", 31 | " 1. Reading numpy's documentation from https://numpy.org/doc/stable/.\n", 32 | " 2. Making trials using dummy variables." 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "# *Assignment Summary" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "The UC Irvine machine learning data repository hosts a famous dataset, the Pima Indians dataset, on whether a patient has diabetes originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito. 
You can find it at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. This data has a set of attributes of patients, and a categorical variable telling whether the patient is diabetic or not. For several attributes in this data set, a value of 0 may indicate a missing value of the variable. It has a total of 768 data points. \n", 47 | "\n", 48 | "* **Part 1-A)** First, you will build a simple naive Bayes classifier to classify this data set. We will use 20% of the data for evaluation and the other 80% for training. \n", 49 | "\n", 50 | "  You should use a normal distribution to model each of the class-conditional distributions.\n", 51 | "\n", 52 | "  Report the accuracy of the classifier on the 20% evaluation data, where accuracy is the number of correct predictions as a fraction of total predictions.\n", 53 | "\n", 54 | "* **Part 1-B)** Next, you will adjust your code so that, for attributes 3 (Diastolic blood pressure), 4 (Triceps skin fold thickness), 6 (Body mass index), and 8 (Age), it regards a value of 0 as a missing value when estimating the class-conditional distributions and the posterior.\n", 55 | "\n", 56 | "  Report the accuracy of the classifier on the 20% that was held out for evaluation.\n", 57 | "\n", 58 | "* **Part 1-C)** Last, you will get some experience with SVMLight, an off-the-shelf implementation of Support Vector Machines (SVMs). For now, you don't need to understand much about SVMs; we will explore them in more depth in the following exercises. You will install SVMLight, which you can find at http://svmlight.joachims.org, to train and evaluate an SVM to classify this data.\n", 59 | "\n", 60 | "  You should NOT substitute NaN values for zeros in attributes 3, 4, 6, and 8.\n", 61 | "  \n", 62 | "  Report the accuracy of the classifier on the held-out 20%." 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "# 0. Data" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "## 0.1 Description" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "The UC Irvine Machine Learning Repository's famous collection of data on whether a patient has diabetes (the Pima Indians dataset) is hosted as a Kaggle competition. It was originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito. \n", 84 | "\n", 85 | "You can find this data at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. The Kaggle website offers valuable visualizations of the original data dimensions in its dashboard. It is worth taking the time to make sense of the data using their dashboard before applying any method to it." 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "## 0.2 Information Summary" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "* **Input/Output**: This data has a set of attributes of patients, and a categorical variable telling whether the patient is diabetic or not. \n", 100 | "\n", 101 | "* **Missing Data**: For several attributes in this data set, a value of 0 may indicate a missing value of the variable.\n", 102 | "\n", 103 | "* **Final Goal**: We want to build a classifier that can predict whether a patient has diabetes or not.
To do this, we will train multiple kinds of models, and we will be handling the missing data with a different approach for each method (i.e., some methods will ignore the missing values, while others will handle them explicitly)." 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "## 0.3 Loading" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 46, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "%matplotlib inline\n", 120 | "import pandas as pd\n", 121 | "import numpy as np\n", 122 | "import seaborn as sns\n", 123 | "import matplotlib.pyplot as plt\n", 124 | "\n", 125 | "from aml_utils import test_case_checker" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 47, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "data": { 135 | "text/html": [ 136 | "<div>
(the styled HTML rendering of df.head() has been omitted; the identical table appears in the text/plain output below)
" 229 | ], 230 | "text/plain": [ 231 | " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n", 232 | "0 6 148 72 35 0 33.6 \n", 233 | "1 1 85 66 29 0 26.6 \n", 234 | "2 8 183 64 0 0 23.3 \n", 235 | "3 1 89 66 23 94 28.1 \n", 236 | "4 0 137 40 35 168 43.1 \n", 237 | "\n", 238 | " DiabetesPedigreeFunction Age Outcome \n", 239 | "0 0.627 50 1 \n", 240 | "1 0.351 31 0 \n", 241 | "2 0.672 32 1 \n", 242 | "3 0.167 21 0 \n", 243 | "4 2.288 33 1 " 244 | ] 245 | }, 246 | "execution_count": 47, 247 | "metadata": {}, 248 | "output_type": "execute_result" 249 | } 250 | ], 251 | "source": [ 252 | "df = pd.read_csv('../BasicClassification-lib/diabetes.csv')\n", 253 | "df.head()" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "## 0.1 Splitting The Data" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "First, we will shuffle the data completely, and forget about the order in the original csv file. \n", 268 | "\n", 269 | "* The training and evaluation dataframes will be named ```train_df``` and ```eval_df```, respectively.\n", 270 | "\n", 271 | "* We will also create the 2-d numpy array `train_features` whose number of rows is the number of training samples, and the number of columns is 8 (i.e., the number of features). We will define `eval_features` in a similar fashion\n", 272 | "\n", 273 | "* We would also create the 1-d numpy arrays `train_labels` and `eval_labels` which contain the training and evaluation labels, respectively." 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": 48, 279 | "metadata": {}, 280 | "outputs": [], 281 | "source": [ 282 | "# Let's generate the split ourselves.\n", 283 | "np_random = np.random.RandomState(seed=12345)\n", 284 | "rand_unifs = np_random.uniform(0,1,size=df.shape[0])\n", 285 | "division_thresh = np.percentile(rand_unifs, 80)\n", 286 | "train_indicator = rand_unifs < division_thresh\n", 287 | "eval_indicator = rand_unifs >= division_thresh" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": 49, 293 | "metadata": {}, 294 | "outputs": [ 295 | { 296 | "data": { 297 | "text/html": [ 298 | "
(the styled HTML rendering of train_df.head() has been omitted; the identical table appears in the text/plain output below)
" 391 | ], 392 | "text/plain": [ 393 | " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n", 394 | "0 1 85 66 29 0 26.6 \n", 395 | "1 8 183 64 0 0 23.3 \n", 396 | "2 1 89 66 23 94 28.1 \n", 397 | "3 0 137 40 35 168 43.1 \n", 398 | "4 5 116 74 0 0 25.6 \n", 399 | "\n", 400 | " DiabetesPedigreeFunction Age Outcome \n", 401 | "0 0.351 31 0 \n", 402 | "1 0.672 32 1 \n", 403 | "2 0.167 21 0 \n", 404 | "3 2.288 33 1 \n", 405 | "4 0.201 30 0 " 406 | ] 407 | }, 408 | "execution_count": 49, 409 | "metadata": {}, 410 | "output_type": "execute_result" 411 | } 412 | ], 413 | "source": [ 414 | "train_df = df[train_indicator].reset_index(drop=True)\n", 415 | "train_features = train_df.loc[:, train_df.columns != 'Outcome'].values\n", 416 | "train_labels = train_df['Outcome'].values\n", 417 | "train_df.head()" 418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "execution_count": 50, 423 | "metadata": {}, 424 | "outputs": [ 425 | { 426 | "data": { 427 | "text/html": [ 428 | "
(the styled HTML rendering of eval_df.head() has been omitted; the identical table appears in the text/plain output below)
" 521 | ], 522 | "text/plain": [ 523 | " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n", 524 | "0 6 148 72 35 0 33.6 \n", 525 | "1 3 78 50 32 88 31.0 \n", 526 | "2 10 168 74 0 0 38.0 \n", 527 | "3 0 118 84 47 230 45.8 \n", 528 | "4 7 107 74 0 0 29.6 \n", 529 | "\n", 530 | " DiabetesPedigreeFunction Age Outcome \n", 531 | "0 0.627 50 1 \n", 532 | "1 0.248 26 1 \n", 533 | "2 0.537 34 1 \n", 534 | "3 0.551 31 1 \n", 535 | "4 0.254 31 1 " 536 | ] 537 | }, 538 | "execution_count": 50, 539 | "metadata": {}, 540 | "output_type": "execute_result" 541 | } 542 | ], 543 | "source": [ 544 | "eval_df = df[eval_indicator].reset_index(drop=True)\n", 545 | "eval_features = eval_df.loc[:, eval_df.columns != 'Outcome'].values\n", 546 | "eval_labels = eval_df['Outcome'].values\n", 547 | "eval_df.head()" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 51, 553 | "metadata": {}, 554 | "outputs": [ 555 | { 556 | "data": { 557 | "text/plain": [ 558 | "((614, 8), (614,), (154, 8), (154,))" 559 | ] 560 | }, 561 | "execution_count": 51, 562 | "metadata": {}, 563 | "output_type": "execute_result" 564 | } 565 | ], 566 | "source": [ 567 | "train_features.shape, train_labels.shape, eval_features.shape, eval_labels.shape" 568 | ] 569 | }, 570 | { 571 | "cell_type": "markdown", 572 | "metadata": {}, 573 | "source": [ 574 | "## 0.2 Pre-processing The Data" 575 | ] 576 | }, 577 | { 578 | "cell_type": "markdown", 579 | "metadata": {}, 580 | "source": [ 581 | "Some of the columns exhibit missing values. We will use a Naive Bayes Classifier later that will treat such missing values in a special way. To be specific, for attribute 3 (Diastolic blood pressure), attribute 4 (Triceps skin fold thickness), attribute 6 (Body mass index), and attribute 8 (Age), we should regard a value of 0 as a missing value.\n", 582 | "\n", 583 | "Therefore, we will be creating the `train_featues_with_nans` and `eval_features_with_nans` numpy arrays to be just like their `train_features` and `eval_features` counter-parts, but with the zero-values in such columns replaced with nans." 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": 52, 589 | "metadata": {}, 590 | "outputs": [], 591 | "source": [ 592 | "train_df_with_nans = train_df.copy(deep=True)\n", 593 | "eval_df_with_nans = eval_df.copy(deep=True)\n", 594 | "for col_with_nans in ['BloodPressure', 'SkinThickness', 'BMI', 'Age']:\n", 595 | " train_df_with_nans[col_with_nans] = train_df_with_nans[col_with_nans].replace(0, np.nan)\n", 596 | " eval_df_with_nans[col_with_nans] = eval_df_with_nans[col_with_nans].replace(0, np.nan)\n", 597 | "train_features_with_nans = train_df_with_nans.loc[:, train_df_with_nans.columns != 'Outcome'].values\n", 598 | "eval_features_with_nans = eval_df_with_nans.loc[:, eval_df_with_nans.columns != 'Outcome'].values" 599 | ] 600 | }, 601 | { 602 | "cell_type": "code", 603 | "execution_count": 53, 604 | "metadata": {}, 605 | "outputs": [ 606 | { 607 | "name": "stdout", 608 | "output_type": "stream", 609 | "text": [ 610 | "Here are the training rows with at least one missing values.\n", 611 | "\n", 612 | "You can see that such incomplete data points constitute a substantial part of the data.\n", 613 | "\n" 614 | ] 615 | }, 616 | { 617 | "data": { 618 | "text/html": [ 619 | "
(the styled HTML rendering of the 186 rows × 9 columns of incomplete training data has been omitted; the identical table appears in the text/plain output below)
" 785 | ], 786 | "text/plain": [ 787 | " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n", 788 | "1 8 183 64.0 NaN 0 23.3 \n", 789 | "4 5 116 74.0 NaN 0 25.6 \n", 790 | "5 10 115 NaN NaN 0 35.3 \n", 791 | "7 8 125 96.0 NaN 0 NaN \n", 792 | "8 4 110 92.0 NaN 0 37.6 \n", 793 | ".. ... ... ... ... ... ... \n", 794 | "598 6 162 62.0 NaN 0 24.3 \n", 795 | "599 4 136 70.0 NaN 0 31.2 \n", 796 | "605 1 106 76.0 NaN 0 37.5 \n", 797 | "606 6 190 92.0 NaN 0 35.5 \n", 798 | "612 1 126 60.0 NaN 0 30.1 \n", 799 | "\n", 800 | " DiabetesPedigreeFunction Age Outcome \n", 801 | "1 0.672 32 1 \n", 802 | "4 0.201 30 0 \n", 803 | "5 0.134 29 0 \n", 804 | "7 0.232 54 1 \n", 805 | "8 0.191 30 0 \n", 806 | ".. ... ... ... \n", 807 | "598 0.178 50 1 \n", 808 | "599 1.182 22 1 \n", 809 | "605 0.197 26 0 \n", 810 | "606 0.278 66 1 \n", 811 | "612 0.349 47 1 \n", 812 | "\n", 813 | "[186 rows x 9 columns]" 814 | ] 815 | }, 816 | "execution_count": 53, 817 | "metadata": {}, 818 | "output_type": "execute_result" 819 | } 820 | ], 821 | "source": [ 822 | "print('Here are the training rows with at least one missing values.')\n", 823 | "print('')\n", 824 | "print('You can see that such incomplete data points constitute a substantial part of the data.')\n", 825 | "print('')\n", 826 | "nan_training_data = train_df_with_nans[train_df_with_nans.isna().any(axis=1)]\n", 827 | "nan_training_data" 828 | ] 829 | }, 830 | { 831 | "cell_type": "markdown", 832 | "metadata": {}, 833 | "source": [ 834 | "# 1. Part 1 (Building a simple Naive Bayes Classifier)" 835 | ] 836 | }, 837 | { 838 | "cell_type": "markdown", 839 | "metadata": {}, 840 | "source": [ 841 | "Consider a single sample $(\\mathbf{x}, y)$, where the feature vector is denoted with $\\mathbf{x}$, and the label is denoted with $y$. We will also denote the $j^{th}$ feature of $\\mathbf{x}$ with $x^{(j)}$.\n", 842 | "\n", 843 | "According to the textbook, the Naive Bayes Classifier uses the following decision rule:\n", 844 | "\n", 845 | "\"Choose $y$ such that $$\\bigg[\\log p(y) + \\sum_{j} \\log p(x^{(j)}|y) \\bigg]$$ is the largest\"\n", 846 | "\n", 847 | "However, we first need to define the probabilistic models of the prior $p(y)$ and the class-conditional feature distributions $p(x^{(j)}|y)$ using the training data.\n", 848 | "\n", 849 | "* **Modelling the prior $p(y)$**: We fit a Bernoulli distribution to the `Outcome` variable of `train_df`.\n", 850 | "* **Modelling the class-conditional feature distributions $p(x^{(j)}|y)$**: We fit Gaussian distributions, and infer the Gaussian mean and variance parameters from `train_df`." 851 | ] 852 | }, 853 | { 854 | "cell_type": "markdown", 855 | "metadata": {}, 856 | "source": [ 857 | "# Task 1" 858 | ] 859 | }, 860 | { 861 | "cell_type": "markdown", 862 | "metadata": {}, 863 | "source": [ 864 | "Write a function `log_prior` that takes a numpy array `train_labels` as input, and outputs the following vector as a column numpy array (i.e., with shape $(2,1)$).\n", 865 | "\n", 866 | "$$\\log p_y =\\begin{bmatrix}\\log p(y=0)\\\\\\log p(y=1)\\end{bmatrix}$$\n", 867 | "\n", 868 | "Try and avoid the utilization of loops as much as possible. No loops are necessary.\n", 869 | "\n", 870 | "**Hint**: Make sure all the array shapes are what you need and expect. You can reshape any numpy array without any tangible computational over-head." 
871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": 128, 876 | "metadata": { 877 | "deletable": false 878 | }, 879 | "outputs": [], 880 | "source": [ 881 | "def log_prior(train_labels):\n", 882 | " \n", 883 | " # your code here\n", 884 | "# raise NotImplementedError\n", 885 | " \n", 886 | " num_of_one = np.sum(train_labels)\n", 887 | " num_of_zero = np.shape(train_labels)[0] - num_of_one\n", 888 | " \n", 889 | " log_py = np.array([np.log(num_of_zero/np.shape(train_labels)[0]), np.log(num_of_one/np.shape(train_labels)[0])]).reshape(2, 1)\n", 890 | "\n", 891 | " return log_py\n" 892 | ] 893 | }, 894 | { 895 | "cell_type": "code", 896 | "execution_count": 129, 897 | "metadata": { 898 | "deletable": false, 899 | "editable": false, 900 | "nbgrader": { 901 | "cell_type": "code", 902 | "checksum": "58446a9c6b83fc53b43a0d41fce9f93b", 903 | "grade": false, 904 | "grade_id": "cell-89c12d30b6fb44cb", 905 | "locked": true, 906 | "schema_version": 3, 907 | "solution": false, 908 | "task": false 909 | } 910 | }, 911 | "outputs": [], 912 | "source": [ 913 | "# Performing sanity checks on your implementation\n", 914 | "some_labels = np.array([0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1])\n", 915 | "some_log_py = log_prior(some_labels)\n", 916 | "assert np.array_equal(some_log_py.round(3), np.array([[-0.916], [-0.511]]))\n", 917 | "\n", 918 | "# Checking against the pre-computed test database\n", 919 | "test_results = test_case_checker(log_prior, task_id=1)\n", 920 | "assert test_results['passed'], test_results['message']" 921 | ] 922 | }, 923 | { 924 | "cell_type": "code", 925 | "execution_count": 130, 926 | "metadata": { 927 | "code_folding": [ 928 | 73, 929 | 135 930 | ], 931 | "deletable": false, 932 | "editable": false, 933 | "nbgrader": { 934 | "cell_type": "code", 935 | "checksum": "14ff380a49035c9250c4323e60337bc3", 936 | "grade": true, 937 | "grade_id": "cell-e263f2b1878b37bc", 938 | "locked": true, 939 | "points": 1, 940 | "schema_version": 3, 941 | "solution": false, 942 | "task": false 943 | } 944 | }, 945 | "outputs": [], 946 | "source": [ 947 | "# This cell is left empty as a seperator. 
You can leave this cell as it is, and you should not delete it.\n" 948 | ] 949 | }, 950 | { 951 | "cell_type": "code", 952 | "execution_count": 131, 953 | "metadata": {}, 954 | "outputs": [ 955 | { 956 | "data": { 957 | "text/plain": [ 958 | "array([[-0.41610786],\n", 959 | " [-1.07766068]])" 960 | ] 961 | }, 962 | "execution_count": 131, 963 | "metadata": {}, 964 | "output_type": "execute_result" 965 | } 966 | ], 967 | "source": [ 968 | "log_py = log_prior(train_labels)\n", 969 | "log_py" 970 | ] 971 | }, 972 | { 973 | "cell_type": "markdown", 974 | "metadata": {}, 975 | "source": [ 976 | "# Task 2" 977 | ] 978 | }, 979 | { 980 | "cell_type": "markdown", 981 | "metadata": {}, 982 | "source": [ 983 | "Write a function `cc_mean_ignore_missing` that takes the numpy arrays `train_features` and `train_labels` as input, and outputs the following matrix with the shape $(8,2)$, where 8 is the number of features.\n", 984 | "\n", 985 | "$$\\mu_y = \\begin{bmatrix} \\mathbb{E}[x^{(0)}|y=0] & \\mathbb{E}[x^{(0)}|y=1]\\\\\n", 986 | "\\mathbb{E}[x^{(1)}|y=0] & \\mathbb{E}[x^{(1)}|y=1] \\\\\n", 987 | "\\cdots & \\cdots\\\\\n", 988 | "\\mathbb{E}[x^{(7)}|y=0] & \\mathbb{E}[x^{(7)}|y=1]\\end{bmatrix}$$\n", 989 | "\n", 990 | "Some points regarding this task:\n", 991 | "\n", 992 | "* The `train_features` numpy array has a shape of `(N,8)` where `N` is the number of training data points, and 8 is the number of the features. \n", 993 | "\n", 994 | "* The `train_labels` numpy array has a shape of `(N,)`. \n", 995 | "\n", 996 | "* **You can assume that `train_features` has no missing elements in this task**.\n", 997 | "\n", 998 | "* Try and avoid the utilization of loops as much as possible. No loops are necessary." 999 | ] 1000 | }, 1001 | { 1002 | "cell_type": "code", 1003 | "execution_count": 132, 1004 | "metadata": { 1005 | "deletable": false 1006 | }, 1007 | "outputs": [], 1008 | "source": [ 1009 | "def cc_mean_ignore_missing(train_features, train_labels):\n", 1010 | " N, d = train_features.shape\n", 1011 | " # your code here\n", 1012 | " \n", 1013 | " # Fist calculate the second column:\n", 1014 | "# dotProduct = np.matmul(train_labels.reshape(1, N), train_features).reshape(d, 1)\n", 1015 | "# secondColumn = dotProduct / np.sum(train_labels)\n", 1016 | " \n", 1017 | "# dotProduct2 = np.matmul(np.ones((1, N)), train_features).reshape(d, 1)\n", 1018 | "# firstColumn = (dotProduct2 - dotProduct) / (N - np.sum(train_labels))\n", 1019 | " \n", 1020 | " \n", 1021 | " # Extract the index of zeros and ones seperately\n", 1022 | " allZeros = np.where(train_labels == 0)[0]\n", 1023 | " allOnes = np.where(train_labels == 1)[0]\n", 1024 | " \n", 1025 | " # Convert the 2-d numpy array to DataFrame\n", 1026 | " df_features = pd.DataFrame(train_features)\n", 1027 | " \n", 1028 | " \n", 1029 | " dfZeros = df_features.loc[allZeros]\n", 1030 | " dfOnes = df_features.loc[allOnes]\n", 1031 | " \n", 1032 | " \n", 1033 | " firstCol = dfZeros.mean(axis=0).to_numpy().reshape((d, 1))\n", 1034 | " secondCol = dfOnes.mean(axis=0).to_numpy().reshape((d, 1))\n", 1035 | " \n", 1036 | " mu_y = np.concatenate((firstCol, secondCol), axis=1)\n", 1037 | " assert mu_y.shape == (d, 2)\n", 1038 | " return mu_y" 1039 | ] 1040 | }, 1041 | { 1042 | "cell_type": "code", 1043 | "execution_count": 133, 1044 | "metadata": { 1045 | "deletable": false, 1046 | "editable": false, 1047 | "nbgrader": { 1048 | "cell_type": "code", 1049 | "checksum": "101072d0656f58c95247f1efe296a85b", 1050 | "grade": false, 1051 | "grade_id": "cell-feae5e6e77107267", 1052 
| "locked": true, 1053 | "schema_version": 3, 1054 | "solution": false, 1055 | "task": false 1056 | } 1057 | }, 1058 | "outputs": [], 1059 | "source": [ 1060 | "# Performing sanity checks on your implementation\n", 1061 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n", 1062 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n", 1063 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n", 1064 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n", 1065 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n", 1066 | "some_labels = np.array([0, 1, 0, 1, 0])\n", 1067 | "\n", 1068 | "some_mu_y = cc_mean_ignore_missing(some_feats, some_labels)\n", 1069 | "\n", 1070 | "assert np.array_equal(some_mu_y.round(2), np.array([[ 2.33, 4. ],\n", 1071 | " [ 96.67, 160. ],\n", 1072 | " [ 68.67, 52. ],\n", 1073 | " [ 17.33, 17.5 ],\n", 1074 | " [ 31.33, 84. ],\n", 1075 | " [ 26.77, 33.2 ],\n", 1076 | " [ 0.27, 1.5 ],\n", 1077 | " [ 27.33, 32.5 ]]))\n", 1078 | "\n", 1079 | "# Checking against the pre-computed test database\n", 1080 | "test_results = test_case_checker(cc_mean_ignore_missing, task_id=2)\n", 1081 | "assert test_results['passed'], test_results['message']" 1082 | ] 1083 | }, 1084 | { 1085 | "cell_type": "code", 1086 | "execution_count": 134, 1087 | "metadata": { 1088 | "code_folding": [], 1089 | "deletable": false, 1090 | "editable": false, 1091 | "nbgrader": { 1092 | "cell_type": "code", 1093 | "checksum": "a98f1ecc45f43b138e415573f0408bab", 1094 | "grade": true, 1095 | "grade_id": "cell-e263f2b1878b37bj", 1096 | "locked": true, 1097 | "points": 1, 1098 | "schema_version": 3, 1099 | "solution": false, 1100 | "task": false 1101 | } 1102 | }, 1103 | "outputs": [], 1104 | "source": [ 1105 | "# This cell is left empty as a seperator. 
You can leave this cell as it is, and you should not delete it.\n" 1106 | ] 1107 | }, 1108 | { 1109 | "cell_type": "code", 1110 | "execution_count": 135, 1111 | "metadata": {}, 1112 | "outputs": [ 1113 | { 1114 | "data": { 1115 | "text/plain": [ 1116 | "array([[ 3.48641975, 4.91866029],\n", 1117 | " [109.99753086, 142.30143541],\n", 1118 | " [ 68.77037037, 70.66028708],\n", 1119 | " [ 19.51358025, 21.97129187],\n", 1120 | " [ 66.25679012, 100.55980861],\n", 1121 | " [ 30.31703704, 35.1492823 ],\n", 1122 | " [ 0.42825926, 0.55279904],\n", 1123 | " [ 31.57283951, 37.39712919]])" 1124 | ] 1125 | }, 1126 | "execution_count": 135, 1127 | "metadata": {}, 1128 | "output_type": "execute_result" 1129 | } 1130 | ], 1131 | "source": [ 1132 | "mu_y = cc_mean_ignore_missing(train_features, train_labels)\n", 1133 | "mu_y" 1134 | ] 1135 | }, 1136 | { 1137 | "cell_type": "markdown", 1138 | "metadata": {}, 1139 | "source": [ 1140 | "# Task 3" 1141 | ] 1142 | }, 1143 | { 1144 | "cell_type": "markdown", 1145 | "metadata": {}, 1146 | "source": [ 1147 | "Write a function `cc_std_ignore_missing` that takes the numpy arrays `train_features` and `train_labels` as input, and outputs the following matrix with the shape $(8,2)$, where 8 is the number of features.\n", 1148 | "\n", 1149 | "$$\\sigma_y = \\begin{bmatrix} \\text{std}[x^{(0)}|y=0] & \\text{std}[x^{(0)}|y=1]\\\\\n", 1150 | "\\text{std}[x^{(1)}|y=0] & \\text{std}[x^{(1)}|y=1] \\\\\n", 1151 | "\\cdots & \\cdots\\\\\n", 1152 | "\\text{std}[x^{(7)}|y=0] & \\text{std}[x^{(7)}|y=1]\\end{bmatrix}$$\n", 1153 | "\n", 1154 | "Some points regarding this task:\n", 1155 | "\n", 1156 | "* The `train_features` numpy array has a shape of `(N,8)` where `N` is the number of training data points, and 8 is the number of the features. \n", 1157 | "\n", 1158 | "* The `train_labels` numpy array has a shape of `(N,)`. \n", 1159 | "\n", 1160 | "* **You can assume that `train_features` has no missing elements in this task**.\n", 1161 | "\n", 1162 | "* Try and avoid the utilization of loops as much as possible. No loops are necessary." 
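As a point of comparison before the graded implementation below, the class-conditional statistics can also be computed with plain boolean-mask indexing; here is a minimal sketch (`cc_std_sketch` is a hypothetical helper, assuming binary labels and no missing entries):

```python
import numpy as np

def cc_std_sketch(train_features, train_labels):
    # Select each class's rows with a boolean mask and reduce over rows.
    # numpy's std uses ddof=0 (the population std) by default.
    sigma0 = train_features[train_labels == 0].std(axis=0)
    sigma1 = train_features[train_labels == 1].std(axis=0)
    return np.stack([sigma0, sigma1], axis=1)  # shape (d, 2)
```

Replacing `std` with `mean` gives the analogous computation for Task 2.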
1163 | ] 1164 | }, 1165 | { 1166 | "cell_type": "code", 1167 | "execution_count": 136, 1168 | "metadata": { 1169 | "deletable": false 1170 | }, 1171 | "outputs": [], 1172 | "source": [ 1173 | "def cc_std_ignore_missing(train_features, train_labels):\n", 1174 | " N, d = train_features.shape\n", 1175 | " \n", 1176 | " # Extract the index of zeros and ones seperately\n", 1177 | " allZeros = np.where(train_labels == 0)[0]\n", 1178 | " allOnes = np.where(train_labels == 1)[0]\n", 1179 | " \n", 1180 | " # Convert the 2-d numpy array to DataFrame\n", 1181 | " df_features = pd.DataFrame(train_features)\n", 1182 | " \n", 1183 | " \n", 1184 | " dfZeros = df_features.loc[allZeros]\n", 1185 | " dfOnes = df_features.loc[allOnes]\n", 1186 | " \n", 1187 | " \n", 1188 | " firstCol = dfZeros.std(axis=0, ddof=0).to_numpy().reshape((d, 1))\n", 1189 | " secondCol = dfOnes.std(axis=0, ddof=0).to_numpy().reshape((d, 1))\n", 1190 | " \n", 1191 | " sigma_y = np.concatenate((firstCol, secondCol), axis=1)\n", 1192 | " \n", 1193 | " assert sigma_y.shape == (d, 2)\n", 1194 | "\n", 1195 | " return sigma_y" 1196 | ] 1197 | }, 1198 | { 1199 | "cell_type": "code", 1200 | "execution_count": 137, 1201 | "metadata": { 1202 | "deletable": false, 1203 | "editable": false, 1204 | "nbgrader": { 1205 | "cell_type": "code", 1206 | "checksum": "9a6eeb9ba6ff8c69ef9ec7a78800d904", 1207 | "grade": false, 1208 | "grade_id": "cell-347ad2c612aa195e", 1209 | "locked": true, 1210 | "schema_version": 3, 1211 | "solution": false, 1212 | "task": false 1213 | } 1214 | }, 1215 | "outputs": [], 1216 | "source": [ 1217 | "# Performing sanity checks on your implementation\n", 1218 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n", 1219 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n", 1220 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n", 1221 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n", 1222 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n", 1223 | "some_labels = np.array([0, 1, 0, 1, 0])\n", 1224 | "\n", 1225 | "some_std_y = cc_std_ignore_missing(some_feats, some_labels)\n", 1226 | "\n", 1227 | "assert np.array_equal(some_std_y.round(3), np.array([[ 1.886, 4. ],\n", 1228 | " [13.768, 23. ],\n", 1229 | " [ 3.771, 12. ],\n", 1230 | " [12.499, 17.5 ],\n", 1231 | " [44.312, 84. ],\n", 1232 | " [ 1.027, 9.9 ],\n", 1233 | " [ 0.094, 0.8 ],\n", 1234 | " [ 4.497, 0.5 ]]))\n", 1235 | "\n", 1236 | "# Checking against the pre-computed test database\n", 1237 | "test_results = test_case_checker(cc_std_ignore_missing, task_id=3)\n", 1238 | "assert test_results['passed'], test_results['message']" 1239 | ] 1240 | }, 1241 | { 1242 | "cell_type": "code", 1243 | "execution_count": 138, 1244 | "metadata": { 1245 | "code_folding": [], 1246 | "deletable": false, 1247 | "editable": false, 1248 | "nbgrader": { 1249 | "cell_type": "code", 1250 | "checksum": "468e750631e9197a917b4f8fc41f7a92", 1251 | "grade": true, 1252 | "grade_id": "cell-e263f2b1878b37bg", 1253 | "locked": true, 1254 | "points": 1, 1255 | "schema_version": 3, 1256 | "solution": false, 1257 | "task": false 1258 | } 1259 | }, 1260 | "outputs": [], 1261 | "source": [ 1262 | "# This cell is left empty as a seperator. 
You can leave this cell as it is, and you should not delete it.\n" 1263 | ] 1264 | }, 1265 | { 1266 | "cell_type": "code", 1267 | "execution_count": 139, 1268 | "metadata": {}, 1269 | "outputs": [ 1270 | { 1271 | "data": { 1272 | "text/plain": [ 1273 | "array([[ 3.1155426 , 3.75417931],\n", 1274 | " [ 25.96811899, 32.50910874],\n", 1275 | " [ 18.07540068, 21.69568568],\n", 1276 | " [ 15.02320635, 17.21685884],\n", 1277 | " [ 95.63339586, 139.24364214],\n", 1278 | " [ 7.50030986, 6.6625219 ],\n", 1279 | " [ 0.29438217, 0.37201494],\n", 1280 | " [ 11.67577435, 11.01543899]])" 1281 | ] 1282 | }, 1283 | "execution_count": 139, 1284 | "metadata": {}, 1285 | "output_type": "execute_result" 1286 | } 1287 | ], 1288 | "source": [ 1289 | "sigma_y = cc_std_ignore_missing(train_features, train_labels)\n", 1290 | "sigma_y" 1291 | ] 1292 | }, 1293 | { 1294 | "cell_type": "markdown", 1295 | "metadata": {}, 1296 | "source": [ 1297 | "# Task 4" 1298 | ] 1299 | }, 1300 | { 1301 | "cell_type": "markdown", 1302 | "metadata": {}, 1303 | "source": [ 1304 | "Write a function `log_prob` that takes the numpy arrays `train_features`, $\\mu_y$, $\\sigma_y$, and $\\log p_y$ as input, and outputs the following matrix with the shape $(N, 2)$\n", 1305 | "\n", 1306 | "$$\\log p_{x,y} = \\begin{bmatrix} \\bigg[\\log p(y=0) + \\sum_{j=0}^{7} \\log p(x_1^{(j)}|y=0) \\bigg] & \\bigg[\\log p(y=1) + \\sum_{j=0}^{7} \\log p(x_1^{(j)}|y=1) \\bigg] \\\\\n", 1307 | "\\bigg[\\log p(y=0) + \\sum_{j=0}^{7} \\log p(x_2^{(j)}|y=0) \\bigg] & \\bigg[\\log p(y=1) + \\sum_{j=0}^{7} \\log p(x_2^{(j)}|y=1) \\bigg] \\\\\n", 1308 | "\\cdots & \\cdots \\\\\n", 1309 | "\\bigg[\\log p(y=0) + \\sum_{j=0}^{7} \\log p(x_N^{(j)}|y=0) \\bigg] & \\bigg[\\log p(y=1) + \\sum_{j=0}^{7} \\log p(x_N^{(j)}|y=1) \\bigg] \\\\\n", 1310 | "\\end{bmatrix}$$\n", 1311 | "\n", 1312 | "where\n", 1313 | "* $N$ is the number of training data points.\n", 1314 | "* $x_i$ is the $i^{th}$ training data point.\n", 1315 | "\n", 1316 | "Try and avoid the utilization of loops as much as possible. No loops are necessary." 1317 | ] 1318 | }, 1319 | { 1320 | "cell_type": "markdown", 1321 | "metadata": {}, 1322 | "source": [ 1323 | "**Hint**: Remember that we are modelling $p(x_i^{(j)}|y)$ with a Gaussian whose parameters are defined inside $\\mu_y$ and $\\sigma_y$. Write the Gaussian PDF expression and take its natural log **on paper**, then implement it.\n", 1324 | "\n", 1325 | "**Important Note**: Do not use third-party and non-standard implementations for computing $\\log p(x_i^{(j)}|y)$. Using functions that find the Gaussian PDF, and then taking their log is **numerically unstable**; the Gaussian PDF values can easily become extremely small numbers that cannot be represented using floating point standards and thus would be stored as zero. Taking the log of a zero value will throw an error. On the other hand, it is unnecessary to compute and store $p(x_i^{(j)}|y)$ in order to find $\\log p(x_i^{(j)}|y)$; you can write $\\log p(x_i^{(j)}|y)$ as a direct function of $\\mu_y$, $\\sigma_y$ and the features. This latter approach is numerically stable, and can be applied when the PDF values are much smaller than could be stored using the common standards." 
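For reference, the "on paper" step the hint asks for: writing the Gaussian PDF $p(x_i^{(j)}|y) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x_i^{(j)}-\mu)^2}{2\sigma^2}\right)$ and taking its natural log gives the numerically stable form

$$\log p(x_i^{(j)}|y) = -\log\left(\sigma\sqrt{2\pi}\right) - \frac{(x_i^{(j)}-\mu)^2}{2\sigma^2},$$

where $\mu$ and $\sigma$ are the entries of $\mu_y$ and $\sigma_y$ for feature $j$ and class $y$. This expression never forms the (possibly underflowing) PDF value itself, which is exactly what the implementation below evaluates.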
1326 | ] 1327 | }, 1328 | { 1329 | "cell_type": "code", 1330 | "execution_count": 140, 1331 | "metadata": { 1332 | "deletable": false 1333 | }, 1334 | "outputs": [], 1335 | "source": [ 1336 | "def log_prob(train_features, mu_y, sigma_y, log_py):\n", 1337 | " N, d = train_features.shape\n", 1338 | " \n", 1339 | " # Extract the index of zeros and ones seperately\n", 1340 | " mu0 = mu_y[:, 0]\n", 1341 | " sigma0 = sigma_y[:, 0]\n", 1342 | " \n", 1343 | " mu1 = mu_y[:, 1]\n", 1344 | " sigma1 = sigma_y[:, 1]\n", 1345 | "\n", 1346 | " firstCol = np.sum(np.log(1/(sigma0.reshape(1, d)*np.sqrt(2*np.pi)))-(1/2)*((train_features-mu0.reshape(1, d))/sigma0.reshape(1, d))**2, axis=1)+log_py[0]\n", 1347 | " secondCol = np.sum(np.log(1/(sigma1.reshape(1, d)*np.sqrt(2*np.pi)))-(1/2)*((train_features-mu1.reshape(1, d))/sigma1.reshape(1, d))**2, axis=1)+log_py[1]\n", 1348 | "\n", 1349 | " log_p_x_y = np.concatenate((firstCol.reshape(N, 1), secondCol.reshape(N,1)), axis=1)\n", 1350 | " assert log_p_x_y.shape == (N,2)\n", 1351 | " return log_p_x_y\n" 1352 | ] 1353 | }, 1354 | { 1355 | "cell_type": "code", 1356 | "execution_count": 141, 1357 | "metadata": { 1358 | "deletable": false, 1359 | "editable": false, 1360 | "nbgrader": { 1361 | "cell_type": "code", 1362 | "checksum": "1381c37cc128fcc5502da552ceace2e6", 1363 | "grade": false, 1364 | "grade_id": "cell-86fb4d0c1943d700", 1365 | "locked": true, 1366 | "schema_version": 3, 1367 | "solution": false, 1368 | "task": false 1369 | } 1370 | }, 1371 | "outputs": [], 1372 | "source": [ 1373 | "# Performing sanity checks on your implementation\n", 1374 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n", 1375 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n", 1376 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n", 1377 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n", 1378 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n", 1379 | "some_labels = np.array([0, 1, 0, 1, 0])\n", 1380 | "\n", 1381 | "some_mu_y = cc_mean_ignore_missing(some_feats, some_labels)\n", 1382 | "some_std_y = cc_std_ignore_missing(some_feats, some_labels)\n", 1383 | "some_log_py = log_prior(some_labels)\n", 1384 | "\n", 1385 | "some_log_p_x_y = log_prob(some_feats, some_mu_y, some_std_y, some_log_py)\n", 1386 | "\n", 1387 | "assert np.array_equal(some_log_p_x_y.round(3), np.array([[ -20.822, -36.606],\n", 1388 | " [ -60.879, -27.944],\n", 1389 | " [ -21.774, -295.68 ],\n", 1390 | " [-417.359, -27.944],\n", 1391 | " [ -23.2 , -42.6 ]]))\n", 1392 | "\n", 1393 | "# Checking against the pre-computed test database\n", 1394 | "test_results = test_case_checker(log_prob, task_id=4)\n", 1395 | "assert test_results['passed'], test_results['message']" 1396 | ] 1397 | }, 1398 | { 1399 | "cell_type": "code", 1400 | "execution_count": 142, 1401 | "metadata": { 1402 | "code_folding": [], 1403 | "deletable": false, 1404 | "editable": false, 1405 | "nbgrader": { 1406 | "cell_type": "code", 1407 | "checksum": "df1a1330fdef96dd92cc192241e744fc", 1408 | "grade": true, 1409 | "grade_id": "cell-e263f2b1878b37bh", 1410 | "locked": true, 1411 | "points": 1, 1412 | "schema_version": 3, 1413 | "solution": false, 1414 | "task": false 1415 | } 1416 | }, 1417 | "outputs": [], 1418 | "source": [ 1419 | "# This cell is left empty as a seperator. 
You can leave this cell as it is, and you should not delete it.\n" 1420 | ] 1421 | }, 1422 | { 1423 | "cell_type": "code", 1424 | "execution_count": 143, 1425 | "metadata": {}, 1426 | "outputs": [ 1427 | { 1428 | "data": { 1429 | "text/plain": [ 1430 | "array([[-26.96647828, -31.00418408],\n", 1431 | " [-32.4755447 , -31.39530914],\n", 1432 | " [-27.14875996, -31.51999532],\n", 1433 | " ...,\n", 1434 | " [-26.29368771, -29.09161966],\n", 1435 | " [-28.19432943, -30.08324788],\n", 1436 | " [-26.98605248, -30.80571318]])" 1437 | ] 1438 | }, 1439 | "execution_count": 143, 1440 | "metadata": {}, 1441 | "output_type": "execute_result" 1442 | } 1443 | ], 1444 | "source": [ 1445 | "log_p_x_y = log_prob(train_features, mu_y, sigma_y, log_py)\n", 1446 | "log_p_x_y" 1447 | ] 1448 | }, 1449 | { 1450 | "cell_type": "markdown", 1451 | "metadata": {}, 1452 | "source": [ 1453 | "## 1.1. Writing the Simple Naive Bayes Classifier" 1454 | ] 1455 | }, 1456 | { 1457 | "cell_type": "code", 1458 | "execution_count": 144, 1459 | "metadata": {}, 1460 | "outputs": [], 1461 | "source": [ 1462 | "class NBClassifier():\n", 1463 | " def __init__(self, train_features, train_labels):\n", 1464 | " self.train_features = train_features\n", 1465 | " self.train_labels = train_labels\n", 1466 | " self.log_py = log_prior(train_labels)\n", 1467 | " self.mu_y = self.get_cc_means()\n", 1468 | " self.sigma_y = self.get_cc_std()\n", 1469 | " \n", 1470 | " def get_cc_means(self):\n", 1471 | " mu_y = cc_mean_ignore_missing(self.train_features, self.train_labels)\n", 1472 | " return mu_y\n", 1473 | " \n", 1474 | " def get_cc_std(self):\n", 1475 | " sigma_y = cc_std_ignore_missing(self.train_features, self.train_labels)\n", 1476 | " return sigma_y\n", 1477 | " \n", 1478 | " def predict(self, features):\n", 1479 | " log_p_x_y = log_prob(features, self.mu_y, self.sigma_y, self.log_py)\n", 1480 | " return log_p_x_y.argmax(axis=1)" 1481 | ] 1482 | }, 1483 | { 1484 | "cell_type": "code", 1485 | "execution_count": 145, 1486 | "metadata": {}, 1487 | "outputs": [], 1488 | "source": [ 1489 | "diabetes_classifier = NBClassifier(train_features, train_labels)\n", 1490 | "train_pred = diabetes_classifier.predict(train_features)\n", 1491 | "eval_pred = diabetes_classifier.predict(eval_features)" 1492 | ] 1493 | }, 1494 | { 1495 | "cell_type": "code", 1496 | "execution_count": 146, 1497 | "metadata": {}, 1498 | "outputs": [ 1499 | { 1500 | "name": "stdout", 1501 | "output_type": "stream", 1502 | "text": [ 1503 | "The training data accuracy of your trained model is 0.7671009771986971\n", 1504 | "The evaluation data accuracy of your trained model is 0.7532467532467533\n" 1505 | ] 1506 | } 1507 | ], 1508 | "source": [ 1509 | "train_acc = (train_pred==train_labels).mean()\n", 1510 | "eval_acc = (eval_pred==eval_labels).mean()\n", 1511 | "print(f'The training data accuracy of your trained model is {train_acc}')\n", 1512 | "print(f'The evaluation data accuracy of your trained model is {eval_acc}')" 1513 | ] 1514 | }, 1515 | { 1516 | "cell_type": "markdown", 1517 | "metadata": {}, 1518 | "source": [ 1519 | "## 1.2 Running an off-the-shelf implementation of Naive-Bayes For Comparison" 1520 | ] 1521 | }, 1522 | { 1523 | "cell_type": "code", 1524 | "execution_count": 147, 1525 | "metadata": {}, 1526 | "outputs": [ 1527 | { 1528 | "name": "stdout", 1529 | "output_type": "stream", 1530 | "text": [ 1531 | "The training data accuracy of your trained model is 0.7671009771986971\n", 1532 | "The evaluation data accuracy of your trained model is 0.7532467532467533\n" 
1533 | ] 1534 | } 1535 | ], 1536 | "source": [ 1537 | "from sklearn.naive_bayes import GaussianNB\n", 1538 | "gnb = GaussianNB().fit(train_features, train_labels)\n", 1539 | "train_pred_sk = gnb.predict(train_features)\n", 1540 | "eval_pred_sk = gnb.predict(eval_features)\n", 1541 | "print(f'The training data accuracy of your trained model is {(train_pred_sk == train_labels).mean()}')\n", 1542 | "print(f'The evaluation data accuracy of your trained model is {(eval_pred_sk == eval_labels).mean()}')" 1543 | ] 1544 | }, 1545 | { 1546 | "cell_type": "markdown", 1547 | "metadata": {}, 1548 | "source": [ 1549 | "# Part 2 (Building a Naive Bayes Classifier Considering Missing Entries)" 1550 | ] 1551 | }, 1552 | { 1553 | "cell_type": "markdown", 1554 | "metadata": {}, 1555 | "source": [ 1556 | "In this part, we will modify some of the parameter inference functions of the Naive Bayes classifier to make it able to ignore the NaN entries when inferring the Gaussian mean and stds." 1557 | ] 1558 | }, 1559 | { 1560 | "cell_type": "markdown", 1561 | "metadata": {}, 1562 | "source": [ 1563 | "# Task 5" 1564 | ] 1565 | }, 1566 | { 1567 | "cell_type": "markdown", 1568 | "metadata": {}, 1569 | "source": [ 1570 | "Write a function `cc_mean_consider_missing` that\n", 1571 | "* has exactly the same input and output types as the `cc_mean_ignore_missing` function,\n", 1572 | "* and has similar functionality to `cc_mean_ignore_missing` except that it can handle and ignore the NaN entries when computing the class conditional means.\n", 1573 | "\n", 1574 | "You can borrow most of the code from your `cc_mean_ignore_missing` implementation, but you should make it compatible with the existence of NaN values in the features.\n", 1575 | "\n", 1576 | "Try and avoid the utilization of loops as much as possible. No loops are necessary." 1577 | ] 1578 | }, 1579 | { 1580 | "cell_type": "markdown", 1581 | "metadata": {}, 1582 | "source": [ 1583 | "* **Hint**: You may find the `np.nanmean` function useful." 
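A quick standalone illustration of that behavior (a toy array, not the assignment data):

```python
import numpy as np

x = np.array([[1.0, np.nan],
              [3.0, 4.0]])
# NaN entries are simply excluded from each column's average:
print(np.nanmean(x, axis=0))  # [2. 4.]
```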
1584 | ] 1585 | }, 1586 | { 1587 | "cell_type": "code", 1588 | "execution_count": 148, 1589 | "metadata": { 1590 | "deletable": false 1591 | }, 1592 | "outputs": [], 1593 | "source": [ 1594 | "def cc_mean_consider_missing(train_features_with_nans, train_labels):\n", 1595 | " N, d = train_features_with_nans.shape\n", 1596 | " \n", 1597 | " # your code here\n", 1598 | " \n", 1599 | " # Extract the index of zeros and ones seperately\n", 1600 | " allZeros = np.where(train_labels == 0)[0]\n", 1601 | " allOnes = np.where(train_labels == 1)[0]\n", 1602 | " \n", 1603 | " # Convert the 2-d numpy array to DataFrame\n", 1604 | " df_features = pd.DataFrame(train_features_with_nans)\n", 1605 | " \n", 1606 | " \n", 1607 | " dfZeros = df_features.loc[allZeros]\n", 1608 | " dfOnes = df_features.loc[allOnes]\n", 1609 | " \n", 1610 | " \n", 1611 | " firstCol = np.nanmean(dfZeros, axis=0).reshape((d, 1))\n", 1612 | " secondCol = np.nanmean(dfOnes, axis=0).reshape((d, 1))\n", 1613 | " \n", 1614 | " mu_y = np.concatenate((firstCol, secondCol), axis=1)\n", 1615 | " \n", 1616 | " assert not np.isnan(mu_y).any()\n", 1617 | " assert mu_y.shape == (d, 2)\n", 1618 | " return mu_y" 1619 | ] 1620 | }, 1621 | { 1622 | "cell_type": "code", 1623 | "execution_count": 149, 1624 | "metadata": { 1625 | "deletable": false, 1626 | "editable": false, 1627 | "nbgrader": { 1628 | "cell_type": "code", 1629 | "checksum": "6303d5da34d9e332d33292b47c7bf113", 1630 | "grade": false, 1631 | "grade_id": "cell-ca4af11e9d8a7fdb", 1632 | "locked": true, 1633 | "schema_version": 3, 1634 | "solution": false, 1635 | "task": false 1636 | } 1637 | }, 1638 | "outputs": [], 1639 | "source": [ 1640 | "# Performing sanity checks on your implementation\n", 1641 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n", 1642 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n", 1643 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n", 1644 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n", 1645 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n", 1646 | "some_labels = np.array([0, 1, 0, 1, 0])\n", 1647 | "\n", 1648 | "for i,j in [(0,0), (1,1), (2,3), (3,4), (4, 2)]:\n", 1649 | " some_feats[i,j] = np.nan\n", 1650 | "\n", 1651 | "some_mu_y = cc_mean_consider_missing(some_feats, some_labels)\n", 1652 | "\n", 1653 | "assert np.array_equal(some_mu_y.round(2), np.array([[ 3. , 4. ],\n", 1654 | " [ 96.67, 137. ],\n", 1655 | " [ 66. , 52. ],\n", 1656 | " [ 14.5 , 17.5 ],\n", 1657 | " [ 31.33, 0. ],\n", 1658 | " [ 26.77, 33.2 ],\n", 1659 | " [ 0.27, 1.5 ],\n", 1660 | " [ 27.33, 32.5 ]]))\n", 1661 | "\n", 1662 | "# Checking against the pre-computed test database\n", 1663 | "test_results = test_case_checker(cc_mean_consider_missing, task_id=5)\n", 1664 | "assert test_results['passed'], test_results['message']" 1665 | ] 1666 | }, 1667 | { 1668 | "cell_type": "code", 1669 | "execution_count": 150, 1670 | "metadata": { 1671 | "code_folding": [], 1672 | "deletable": false, 1673 | "editable": false, 1674 | "nbgrader": { 1675 | "cell_type": "code", 1676 | "checksum": "fe73c92df82cf2b5b4d938c18671be6b", 1677 | "grade": true, 1678 | "grade_id": "cell-e263f2b1878b37bf", 1679 | "locked": true, 1680 | "points": 1, 1681 | "schema_version": 3, 1682 | "solution": false, 1683 | "task": false 1684 | } 1685 | }, 1686 | "outputs": [], 1687 | "source": [ 1688 | "# This cell is left empty as a seperator. 
You can leave this cell as it is, and you should not delete it.\n" 1689 | ] 1690 | }, 1691 | { 1692 | "cell_type": "code", 1693 | "execution_count": 151, 1694 | "metadata": {}, 1695 | "outputs": [ 1696 | { 1697 | "data": { 1698 | "text/plain": [ 1699 | "array([[ 3.48641975, 4.91866029],\n", 1700 | " [109.99753086, 142.30143541],\n", 1701 | " [ 71.41538462, 75.34693878],\n", 1702 | " [ 27.53658537, 32.11188811],\n", 1703 | " [ 66.25679012, 100.55980861],\n", 1704 | " [ 30.85025126, 35.31826923],\n", 1705 | " [ 0.42825926, 0.55279904],\n", 1706 | " [ 31.57283951, 37.39712919]])" 1707 | ] 1708 | }, 1709 | "execution_count": 151, 1710 | "metadata": {}, 1711 | "output_type": "execute_result" 1712 | } 1713 | ], 1714 | "source": [ 1715 | "mu_y = cc_mean_consider_missing(train_features_with_nans, train_labels)\n", 1716 | "mu_y" 1717 | ] 1718 | }, 1719 | { 1720 | "cell_type": "markdown", 1721 | "metadata": {}, 1722 | "source": [ 1723 | "# Task 6" 1724 | ] 1725 | }, 1726 | { 1727 | "cell_type": "markdown", 1728 | "metadata": {}, 1729 | "source": [ 1730 | "Write a function `cc_std_consider_missing` that\n", 1731 | "* has exactly the same input and output types as the `cc_std_ignore_missing` function,\n", 1732 | "* and has similar functionality to `cc_std_ignore_missing` except that it can handle and ignore the NaN entries when computing the class conditional means.\n", 1733 | "\n", 1734 | "You can borrow most of the code from your `cc_std_ignore_missing` implementation, but you should make it compatible with the existence of NaN values in the features.\n", 1735 | "\n", 1736 | "Try and avoid the utilization of loops as much as possible. No loops are necessary." 1737 | ] 1738 | }, 1739 | { 1740 | "cell_type": "markdown", 1741 | "metadata": {}, 1742 | "source": [ 1743 | "* **Hint**: You may find the `np.nanstd` function useful." 
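As with `np.nanmean`, a quick standalone illustration (a toy array, not the assignment data); note that `np.nanstd` also defaults to `ddof=0`, matching the population standard deviation used in Task 3:

```python
import numpy as np

x = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [5.0, 8.0]])
# NaNs are dropped per column before the (ddof=0) std is computed:
print(np.nanstd(x, axis=0))  # [1.63299316 2.        ]
```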
1744 | ] 1745 | }, 1746 | { 1747 | "cell_type": "code", 1748 | "execution_count": 152, 1749 | "metadata": { 1750 | "deletable": false 1751 | }, 1752 | "outputs": [], 1753 | "source": [ 1754 | "def cc_std_consider_missing(train_features_with_nans, train_labels):\n", 1755 | " N, d = train_features_with_nans.shape\n", 1756 | " \n", 1757 | " # your code here\n", 1758 | " # Extract the index of zeros and ones seperately\n", 1759 | " allZeros = np.where(train_labels == 0)[0]\n", 1760 | " allOnes = np.where(train_labels == 1)[0]\n", 1761 | " \n", 1762 | " # Convert the 2-d numpy array to DataFrame\n", 1763 | " df_features = pd.DataFrame(train_features_with_nans)\n", 1764 | " \n", 1765 | " \n", 1766 | " dfZeros = df_features.loc[allZeros]\n", 1767 | " dfOnes = df_features.loc[allOnes]\n", 1768 | " \n", 1769 | " firstCol = np.nanstd(dfZeros, axis=0, ddof=0).reshape((d, 1))\n", 1770 | " secondCol = np.nanstd(dfOnes, axis=0, ddof=0).reshape((d, 1))\n", 1771 | " \n", 1772 | " sigma_y = np.concatenate((firstCol, secondCol), axis=1)\n", 1773 | " \n", 1774 | " assert not np.isnan(sigma_y).any()\n", 1775 | " assert sigma_y.shape == (d, 2)\n", 1776 | " return sigma_y" 1777 | ] 1778 | }, 1779 | { 1780 | "cell_type": "code", 1781 | "execution_count": 153, 1782 | "metadata": { 1783 | "deletable": false, 1784 | "editable": false, 1785 | "nbgrader": { 1786 | "cell_type": "code", 1787 | "checksum": "6062b0a65e131aa86909fefa7e7b88fc", 1788 | "grade": false, 1789 | "grade_id": "cell-2821b980896856b7", 1790 | "locked": true, 1791 | "schema_version": 3, 1792 | "solution": false, 1793 | "task": false 1794 | } 1795 | }, 1796 | "outputs": [], 1797 | "source": [ 1798 | "# Performing sanity checks on your implementation\n", 1799 | "some_feats = np.array([[ 1. , 85. , 66. , 29. , 0. , 26.6, 0.4, 31. ],\n", 1800 | " [ 8. , 183. , 64. , 0. , 0. , 23.3, 0.7, 32. ],\n", 1801 | " [ 1. , 89. , 66. , 23. , 94. , 28.1, 0.2, 21. ],\n", 1802 | " [ 0. , 137. , 40. , 35. , 168. , 43.1, 2.3, 33. ],\n", 1803 | " [ 5. , 116. , 74. , 0. , 0. , 25.6, 0.2, 30. ]])\n", 1804 | "some_labels = np.array([0, 1, 0, 1, 0])\n", 1805 | "\n", 1806 | "for i,j in [(0,0), (1,1), (2,3), (3,4), (4, 2)]:\n", 1807 | " some_feats[i,j] = np.nan\n", 1808 | "\n", 1809 | "some_std_y = cc_std_consider_missing(some_feats, some_labels)\n", 1810 | "\n", 1811 | "assert np.array_equal(some_std_y.round(2), np.array([[ 2. , 4. ],\n", 1812 | " [13.77, 0. ],\n", 1813 | " [ 0. , 12. ],\n", 1814 | " [14.5 , 17.5 ],\n", 1815 | " [44.31, 0. ],\n", 1816 | " [ 1.03, 9.9 ],\n", 1817 | " [ 0.09, 0.8 ],\n", 1818 | " [ 4.5 , 0.5 ]]))\n", 1819 | "\n", 1820 | "# Checking against the pre-computed test database\n", 1821 | "test_results = test_case_checker(cc_std_consider_missing, task_id=6)\n", 1822 | "assert test_results['passed'], test_results['message']" 1823 | ] 1824 | }, 1825 | { 1826 | "cell_type": "code", 1827 | "execution_count": 154, 1828 | "metadata": { 1829 | "code_folding": [], 1830 | "deletable": false, 1831 | "editable": false, 1832 | "nbgrader": { 1833 | "cell_type": "code", 1834 | "checksum": "4919f542a3a80c75bbe8f533c2fe35b6", 1835 | "grade": true, 1836 | "grade_id": "cell-e263f2b1878b37bz", 1837 | "locked": true, 1838 | "points": 1, 1839 | "schema_version": 3, 1840 | "solution": false, 1841 | "task": false 1842 | } 1843 | }, 1844 | "outputs": [], 1845 | "source": [ 1846 | "# This cell is left empty as a seperator. 
You can leave this cell as it is, and you should not delete it.\n" 1847 | ] 1848 | }, 1849 | { 1850 | "cell_type": "code", 1851 | "execution_count": 155, 1852 | "metadata": {}, 1853 | "outputs": [ 1854 | { 1855 | "data": { 1856 | "text/plain": [ 1857 | "array([[ 3.1155426 , 3.75417931],\n", 1858 | " [ 25.96811899, 32.50910874],\n", 1859 | " [ 12.26342359, 12.1982786 ],\n", 1860 | " [ 9.87753687, 10.37284304],\n", 1861 | " [ 95.63339586, 139.24364214],\n", 1862 | " [ 6.38703834, 6.21564813],\n", 1863 | " [ 0.29438217, 0.37201494],\n", 1864 | " [ 11.67577435, 11.01543899]])" 1865 | ] 1866 | }, 1867 | "execution_count": 155, 1868 | "metadata": {}, 1869 | "output_type": "execute_result" 1870 | } 1871 | ], 1872 | "source": [ 1873 | "sigma_y = cc_std_consider_missing(train_features_with_nans, train_labels)\n", 1874 | "sigma_y" 1875 | ] 1876 | }, 1877 | { 1878 | "cell_type": "markdown", 1879 | "metadata": {}, 1880 | "source": [ 1881 | "## 2.1. Writing the Naive Bayes Classifier With Missing Data Handling" 1882 | ] 1883 | }, 1884 | { 1885 | "cell_type": "code", 1886 | "execution_count": 156, 1887 | "metadata": {}, 1888 | "outputs": [], 1889 | "source": [ 1890 | "class NBClassifierWithMissing(NBClassifier):\n", 1891 | " def get_cc_means(self):\n", 1892 | " mu_y = cc_mean_consider_missing(self.train_features, self.train_labels)\n", 1893 | " return mu_y\n", 1894 | " \n", 1895 | " def get_cc_std(self):\n", 1896 | " sigma_y = cc_std_consider_missing(self.train_features, self.train_labels)\n", 1897 | " return sigma_y\n", 1898 | " \n", 1899 | " def predict(self, features):\n", 1900 | " preds = []\n", 1901 | " for feature in features:\n", 1902 | " is_num = np.logical_not(np.isnan(feature))\n", 1903 | " mu_y_not_nan = self.mu_y[is_num,:]\n", 1904 | " std_y_not_nan = self.sigma_y[is_num,:]\n", 1905 | " feats_not_nan = feature[is_num].reshape(1,-1)\n", 1906 | " log_p_x_y = log_prob(feats_not_nan, mu_y_not_nan, std_y_not_nan, self.log_py)\n", 1907 | " preds.append(log_p_x_y.argmax(axis=1).item())\n", 1908 | "\n", 1909 | " return np.array(preds)" 1910 | ] 1911 | }, 1912 | { 1913 | "cell_type": "code", 1914 | "execution_count": 157, 1915 | "metadata": {}, 1916 | "outputs": [], 1917 | "source": [ 1918 | "diabetes_classifier_nans = NBClassifierWithMissing(train_features_with_nans, train_labels)\n", 1919 | "train_pred = diabetes_classifier_nans.predict(train_features_with_nans)\n", 1920 | "eval_pred = diabetes_classifier_nans.predict(eval_features_with_nans)" 1921 | ] 1922 | }, 1923 | { 1924 | "cell_type": "code", 1925 | "execution_count": 158, 1926 | "metadata": {}, 1927 | "outputs": [ 1928 | { 1929 | "name": "stdout", 1930 | "output_type": "stream", 1931 | "text": [ 1932 | "The training data accuracy of your trained model is 0.747557003257329\n", 1933 | "The evaluation data accuracy of your trained model is 0.7532467532467533\n" 1934 | ] 1935 | } 1936 | ], 1937 | "source": [ 1938 | "train_acc = (train_pred==train_labels).mean()\n", 1939 | "eval_acc = (eval_pred==eval_labels).mean()\n", 1940 | "print(f'The training data accuracy of your trained model is {train_acc}')\n", 1941 | "print(f'The evaluation data accuracy of your trained model is {eval_acc}')" 1942 | ] 1943 | }, 1944 | { 1945 | "cell_type": "markdown", 1946 | "metadata": {}, 1947 | "source": [ 1948 | "# 3. Running SVMlight" 1949 | ] 1950 | }, 1951 | { 1952 | "cell_type": "markdown", 1953 | "metadata": {}, 1954 | "source": [ 1955 | "In this section, we are going to investigate the support vector machine classification method. 
We will study this method in detail in Week 3; for now, we simply observe how it performs in order to set the stage.\n", 1956 | "\n", 1957 | "`SVMlight` (http://svmlight.joachims.org/) is a well-known implementation of the SVM classifier. \n", 1958 | "\n", 1959 | "`SVMlight` is called from a shell terminal, and there is no convenient Python 3 wrapper for it. Therefore:\n", 1960 | "1. We have to export the training data to a special format called `svmlight/libsvm`. This can be done with scikit-learn.\n", 1961 | "2. We have to run the `svm_learn` program to learn the model and then store it.\n", 1962 | "3. We have to import the model back into Python." 1963 | ] 1964 | }, 1965 | { 1966 | "cell_type": "markdown", 1967 | "metadata": {}, 1968 | "source": [ 1969 | "## 3.1 Exporting the training data to libsvm format" 1970 | ] 1971 | }, 1972 | { 1973 | "cell_type": "code", 1974 | "execution_count": 159, 1975 | "metadata": {}, 1976 | "outputs": [], 1977 | "source": [ 1978 | "from sklearn.datasets import dump_svmlight_file\n", 1979 | "dump_svmlight_file(train_features, 2*train_labels-1, 'training_feats.data', \n", 1980 | " zero_based=False, comment=None, query_id=None, multilabel=False)" 1981 | ] 1982 | }, 1983 | { 1984 | "cell_type": "markdown", 1985 | "metadata": {}, 1986 | "source": [ 1987 | "## 3.2 Training `SVMlight`" 1988 | ] 1989 | }, 1990 | { 1991 | "cell_type": "code", 1992 | "execution_count": 160, 1993 | "metadata": {}, 1994 | "outputs": [ 1995 | { 1996 | "name": "stdout", 1997 | "output_type": "stream", 1998 | "text": [ 1999 | "chmod: changing permissions of '../BasicClassification-lib/svmlight/svm_learn': Operation not permitted\n", 2000 | "Scanning examples...done\n", 2001 | "Reading examples into memory...100..200..300..400..500..600..OK. 
(614 examples read)\n", 2002 | "Setting default regularization parameter C=0.0000\n", 2003 | "Optimizing....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................done. 
(1781 iterations)\n", 2004 | "Optimization finished (141 misclassified, maxdiff=0.00099).\n", 2005 | "Runtime in cpu-seconds: 0.19\n", 2006 | "Number of SV: 375 (including 369 at upper bound)\n", 2007 | "L1 loss: loss=335.23204\n", 2008 | "Norm of weight vector: |w|=0.03179\n", 2009 | "Norm of longest example vector: |x|=871.75350\n", 2010 | "Estimated VCdim of classifier: VCdim<=769.24695\n", 2011 | "Computing XiAlpha-estimates...done\n", 2012 | "Runtime for XiAlpha-estimates in cpu-seconds: 0.00\n", 2013 | "XiAlpha-estimate of the error: error<=60.75% (rho=1.00,depth=0)\n", 2014 | "XiAlpha-estimate of the recall: recall=>10.53% (rho=1.00,depth=0)\n", 2015 | "XiAlpha-estimate of the precision: precision=>10.58% (rho=1.00,depth=0)\n", 2016 | "Number of kernel evaluations: 71356\n", 2017 | "Writing model file...done\n", 2018 | "\n" 2019 | ] 2020 | } 2021 | ], 2022 | "source": [ 2023 | "!chmod +x ../BasicClassification-lib/svmlight/svm_learn\n", 2024 | "from subprocess import Popen, PIPE\n", 2025 | "process = Popen([\"../BasicClassification-lib/svmlight/svm_learn\", \"./training_feats.data\", \"svm_model.txt\"], stdout=PIPE, stderr=PIPE)\n", 2026 | "stdout, stderr = process.communicate()\n", 2027 | "print(stdout.decode(\"utf-8\"))" 2028 | ] 2029 | }, 2030 | { 2031 | "cell_type": "markdown", 2032 | "metadata": {}, 2033 | "source": [ 2034 | "## 3.3 Importing the SVM Model" 2035 | ] 2036 | }, 2037 | { 2038 | "cell_type": "code", 2039 | "execution_count": 161, 2040 | "metadata": {}, 2041 | "outputs": [], 2042 | "source": [ 2043 | "from svm2weight import get_svmlight_weights\n", 2044 | "svm_weights, thresh = get_svmlight_weights('svm_model.txt', printOutput=False)\n", 2045 | "\n", 2046 | "def svmlight_classifier(train_features):\n", 2047 | " return (train_features @ svm_weights - thresh).reshape(-1) >= 0." 
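The function above is just the linear SVM decision rule: compute the score w·x - b and threshold it at zero. Because the labels were exported as ±1 (via `2*train_labels-1`), a non-negative score corresponds to the positive class, and the returned boolean array compares correctly against the original {0, 1} labels, since `True == 1` in numpy. A small sketch of the same rule with made-up weights (the real `svm_weights` and `thresh` come from parsing `svm_model.txt`):

```python
import numpy as np

# Hypothetical weight vector and bias standing in for the parsed model file.
w = np.array([0.2, -0.1, 0.05])
b = 0.03

x = np.array([[1.0, 2.0, 3.0]])    # one example per row
score = x @ w - b                  # SVM decision value per example
pred = (score >= 0).astype(int)    # boolean -> {0, 1} class labels
print(score, pred)                 # [0.12] [1]
```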
2048 | ] 2049 | }, 2050 | { 2051 | "cell_type": "code", 2052 | "execution_count": 162, 2053 | "metadata": {}, 2054 | "outputs": [], 2055 | "source": [ 2056 | "train_pred = svmlight_classifier(train_features)\n", 2057 | "eval_pred = svmlight_classifier(eval_features)" 2058 | ] 2059 | }, 2060 | { 2061 | "cell_type": "code", 2062 | "execution_count": 163, 2063 | "metadata": {}, 2064 | "outputs": [ 2065 | { 2066 | "name": "stdout", 2067 | "output_type": "stream", 2068 | "text": [ 2069 | "The training data accuracy of your trained model is 0.7703583061889251\n", 2070 | "The evaluation data accuracy of your trained model is 0.7402597402597403\n" 2071 | ] 2072 | } 2073 | ], 2074 | "source": [ 2075 | "train_acc = (train_pred==train_labels).mean()\n", 2076 | "eval_acc = (eval_pred==eval_labels).mean()\n", 2077 | "print(f'The training data accuracy of your trained model is {train_acc}')\n", 2078 | "print(f'The evaluation data accuracy of your trained model is {eval_acc}')" 2079 | ] 2080 | }, 2081 | { 2082 | "cell_type": "code", 2083 | "execution_count": 164, 2084 | "metadata": {}, 2085 | "outputs": [], 2086 | "source": [ 2087 | "# Cleaning up after our work is done\n", 2088 | "!rm -rf svm_model.txt training_feats.data" 2089 | ] 2090 | }, 2091 | { 2092 | "cell_type": "code", 2093 | "execution_count": null, 2094 | "metadata": {}, 2095 | "outputs": [], 2096 | "source": [] 2097 | } 2098 | ], 2099 | "metadata": { 2100 | "illinois_payload": { 2101 | "b64z": "", 2102 | "nb_path": "release/BasicClassification/BasicClassification.ipynb" 2103 | }, 2104 | "kernelspec": { 2105 | "display_name": "Python 3 (Threads: 2)", 2106 | "language": "python", 2107 | "name": "python3" 2108 | }, 2109 | "language_info": { 2110 | "codemirror_mode": { 2111 | "name": "ipython", 2112 | "version": 3 2113 | }, 2114 | "file_extension": ".py", 2115 | "mimetype": "text/x-python", 2116 | "name": "python", 2117 | "nbconvert_exporter": "python", 2118 | "pygments_lexer": "ipython3", 2119 | "version": "3.8.12" 2120 | } 2121 | }, 2122 | "nbformat": 4, 2123 | "nbformat_minor": 4 2124 | } 2125 | -------------------------------------------------------------------------------- /EMSegmentation-lib/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/.DS_Store -------------------------------------------------------------------------------- /EMSegmentation-lib/EMSegmentation.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/EMSegmentation.pdf -------------------------------------------------------------------------------- /EMSegmentation-lib/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Assignment-specific public libraries and large data files (visible, read-only) 3 | 4 | This directory is for data and script files that are specific to one homework. 5 | This directory will be part of the Python path for all users. Libraries placed here should generally be those written by staff; Python packages can instead be installed system-wide in the Dockerfiles or `requirements.txt` loading system (see below). 6 | 7 | Please don't confuse this directory with the `work/course-lib` directory that may contain data to be used by multiple homework assignments. 
8 | 9 | ## Placement 10 | 11 | In the lab container, the contents of this directory will be placed in: 12 | 13 | ``` 14 | work/release/[hwid]-lib 15 | ``` 16 | 17 | Where `hwid` matches the homework ID for that assignment. For example, data files associated with `work/source/HW1/HW1.ipynb` should be placed in `work/source/HW1-lib/`. These files will end up in `work/release/HW1-lib` in the container. 18 | 19 | These files will be read-only. These will be available for all users, including students. However, it's better not to refer to these files using absolute paths; see best practices below. 20 | 21 | ## Special files 22 | 23 | ### `payload_requirements.json` 24 | 25 | **Only staff can configure this file.** 26 | 27 | The `payload_requirements.json` file, if present, specifies additional files that will be submitted along with the notebook. The file should contain an object with a `"files"` property that is a list of strings that are relative paths to files under the current homework notebook's working directory. For example: 28 | 29 | ```json 30 | { 31 | "files": ["some-file.db", "inner_directory/nested_file.txt"] 32 | } 33 | ``` 34 | 35 | If the homework ID for this homework is "HW1", the above example would specify these additional files to be collected: 36 | 37 | - `work/release/HW1/some-file.db` 38 | - `work/release/HW1/inner_directory/nested_file.txt` 39 | 40 | ### `requirements.txt` sequence 41 | 42 | **Only staff can configure these files.** 43 | 44 | If some of the Python packages for a particular assignment need to be frozen, you can specify packages and versions in one or more files named `requirements*.txt`, where you may want to put a number before the extension such as `requirements1.txt`. These will be processed one at a time in natural version order (like `sort -V` in Linux) during the Docker build. 45 | 46 | ## Best practices 47 | 48 | It's important to put public staff libraries and large data files in this directory to prevent editing and ensure efficient use of the cloud storage. Large files will also have improved access time from this directory compared to the notebook directory. 49 | 50 | Python files in this directory will be on the Python system path in the container, so you may want to write a Python loader for data you need and refer to the data with relative paths under the library directory (rather than using ".." to refer to a directory above). However, if students try to work on files offline this may complicate things and require shims to adjust the paths. 51 | 52 | Staff members should refer to additional notes in the staff library directories. 
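As an aside, "natural version order" above means that numeric runs compare as numbers, so `requirements10.txt` sorts after `requirements2.txt`. A quick sketch of that ordering (the file names are hypothetical; `sort -V` in Linux behaves the same way):

```python
import re

def natural_key(name):
    # Split into digit and non-digit runs so "10" compares as the number 10.
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r'(\d+)', name)]

files = ['requirements10.txt', 'requirements2.txt', 'requirements1.txt']
print(sorted(files, key=natural_key))
# ['requirements1.txt', 'requirements2.txt', 'requirements10.txt']
```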
53 | -------------------------------------------------------------------------------- /EMSegmentation-lib/aml_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | import pandas as pd 4 | import os 5 | import sys 6 | import traceback 7 | from pygments import formatters, highlight, lexers 8 | import re 9 | import inspect 10 | import types 11 | import copy 12 | 13 | visualize = True 14 | perform_computation = True 15 | test_db_dir = os.path.dirname(os.path.realpath(__file__)) + '/test_db' 16 | NO_DASHES = 55 17 | 18 | Fore_BLUE_LIGHT = u'\u001b[38;5;19m' 19 | Fore_RED_LIGHT = u'\u001b[38;5;196m' 20 | Fore_BLUE = u'\u001b[38;5;34m' 21 | Fore_RED = '\x1b[1;31m' 22 | FORE_GREEN_DARK = u'\u001b[38;5;22m' 23 | Fore_DARKRED = u'\u001b[38;5;124m' 24 | Fore_MAGENTA = '\x1b[1m' + u'\u001b[38;5;92m' 25 | Fore_GREEN = u'\u001b[38;5;32m' 26 | Fore_BLACK = '\x1b[0m' + u'\u001b[38;5;0m' 27 | 28 | 29 | ######################################################################################## 30 | ######################## Utilities for traceback processing ############################ 31 | ######################################################################################## 32 | def keep_tb_rule(tb): 33 | tb_file_path = tb.tb_frame.f_code.co_filename 34 | if os.path.realpath(__file__) == os.path.realpath(tb_file_path): 35 | return False 36 | else: 37 | return True 38 | 39 | def censor_exc_traceback(exc_traceback): 40 | original_tb_list = [] 41 | tb_next = exc_traceback 42 | while tb_next is not None: 43 | original_tb_list.append(tb_next) 44 | tb_next = tb_next.tb_next 45 | 46 | censored_tb_list = [tb for tb in original_tb_list if keep_tb_rule(tb)] 47 | 48 | for i, tb in enumerate(censored_tb_list[:-1]): 49 | tb.tb_next = censored_tb_list[i+1] 50 | 51 | if len(censored_tb_list) > 0: 52 | return censored_tb_list[0] 53 | else: 54 | return exc_traceback 55 | 56 | try: 57 | import IPython 58 | ultratb = IPython.core.ultratb.VerboseTB(include_vars=False) 59 | def get_tb_colored_str(exc_type, exc_value, exc_traceback): 60 | manipulated_exc_traceback = censor_exc_traceback(exc_traceback) 61 | tb_text = ultratb.text(exc_type, exc_value, manipulated_exc_traceback) 62 | tb_text = re.sub( r"/tmp/ipykernel_.*.py", "/Jupyter/Notebook/Student/Task/Implementation/Cells", tb_text) 63 | tb_text = re.sub( r"\s{20,}Traceback", " Traceback", tb_text) 64 | s_split = tb_text.split('\n') 65 | if len(s_split) > 0: 66 | c_s_split = s_split[1:] 67 | tb_text = '\n'.join(c_s_split) + '\n' 68 | tb_text = tb_text.replace('\x1b[0;36m', '\x1b[1m \x1b[1;34m') 69 | return tb_text 70 | except: 71 | def get_tb_colored_str(exc_type, exc_value, exc_traceback): 72 | manipulated_exc_traceback = censor_exc_traceback(exc_traceback) 73 | tb_text = traceback.format_exception(exc_type, exc_value, manipulated_exc_traceback, limit=None, chain=True) 74 | tb_text = ''.join(tb_text) 75 | tb_text = re.sub( r"\"/tmp/ipykernel_.*\"", "\"/Jupyter/Notebook/Student/Task/Implementation/Cells\"", tb_text) 76 | lexer = lexers.get_lexer_by_name("pytb", stripall=True) 77 | formatter = formatters.get_formatter_by_name("terminal16m") 78 | tb_colored = highlight(tb_text, lexer, formatter) 79 | return tb_colored 80 | 81 | try: 82 | from IPython.utils import PyColorize 83 | color_parser = PyColorize.Parser(color_table=None, out="str", parent=None, style='Linux') 84 | def code_color_parser(code_str): 85 | return color_parser.format(code_str) 86 | except: 87 | def 
code_color_parser(code_str): 88 | return code_str 89 | 90 | def get_num_indents(src_list): 91 | assert len(src_list) > 0 92 | a = [line + 20 * ' ' for line in src_list] 93 | b = [len(line) - len(line.lstrip()) for line in a] 94 | assert b[0] == 0 95 | c = min(b[1:]) 96 | return c 97 | 98 | def code_snippet_maker(stu_function, args, kwargs): 99 | test_kwargs_str_lst = [] 100 | test_kwargs_str_lst.append('from copy import deepcopy') 101 | test_kwargs_str_lst.append("failed_arguments = deepcopy(test_results['test_kwargs'])") 102 | for arg_ in args: 103 | test_kwargs_str_lst.append(arg_) 104 | for key,val in kwargs.items(): 105 | test_kwargs_str_lst.append(f"{key} = failed_arguments['{key}']") 106 | test_kwargs_str = ', '.join(test_kwargs_str_lst) 107 | 108 | if hasattr(stu_function, '__name__'): 109 | stu_func_name = stu_function.__name__ 110 | else: 111 | stu_func_name = 'YOUR_FUNCTION_NAME' 112 | 113 | check_list_code = [] 114 | check_list_code.append(f"correct_sol = test_results['correct_sol'] # The Reference Solution") 115 | check_list_code.append(f"if isinstance(correct_sol, np.ndarray):") 116 | check_list_code.append(f" assert isinstance(my_solution, np.ndarray)") 117 | check_list_code.append(f" assert my_solution.dtype is correct_sol.dtype") 118 | check_list_code.append(f" assert my_solution.shape == correct_sol.shape") 119 | check_list_code.append(f" assert np.allclose(my_solution, correct_sol)") 120 | check_list_code.append(f" print('If you passed the above assertions, it probably means that you have fixed the issue! Well Done!')") 121 | check_list_code.append(f" print('Now you have to do 3 things:')") 122 | check_list_code.append(f" print(' 1) Carefully copy the fixed code body back to the {stu_func_name} function.')") 123 | check_list_code.append(f" print(' 2) If you copied any \"returned_var = \" lines, convert them back to return statements.')") 124 | check_list_code.append(f" print(' 3) Carefully remove this cell (i.e., the cell you inserted and modified) once you are done.')") 125 | 126 | try: 127 | src = inspect.getsource(stu_function) 128 | src_list = src.split('\n') 129 | src_list = [line for line in src_list if not (line.strip().startswith('#'))] 130 | no_indents = get_num_indents(src_list) 131 | mod_src_list = [] 132 | src_gen = src_list[1:] if src_list[0].startswith('def') else src_list 133 | for line in src_gen: 134 | if len(line) > no_indents: 135 | shifted_left_line = line[no_indents:] 136 | else: 137 | shifted_left_line = line 138 | 139 | return_statement = 'return ' 140 | if not shifted_left_line.lstrip().startswith(return_statement): 141 | mod_src_list.append(shifted_left_line) 142 | else: 143 | i = shifted_left_line.index(return_statement) 144 | shifted_left_line = shifted_left_line[:i] + 'returned_var = ' + shifted_left_line[i+len(return_statement):] + ' # returned variable' 145 | mod_src_list.append(shifted_left_line) 146 | 147 | mod_bodysrc_list = '\n'.join(mod_src_list).strip().split('\n') 148 | 149 | mod_src_list = [] 150 | mod_src_list = mod_src_list + ['### You can copy the following auto-generated snippet into a new cell to reproduce the issue.'] 151 | mod_src_list = mod_src_list + ['### Use the + button on the top left of the screen to insert a new cell below.'] 152 | mod_src_list = mod_src_list + [''] 153 | mod_src_list = mod_src_list + ['#'*7 + ' Test Arguments ' + '#'*7] + test_kwargs_str_lst 154 | mod_src_list = mod_src_list + ['\n' + '#'*7 + ' Your Code Body ' + '#'*7] + mod_bodysrc_list 155 | mod_src_list.append('\n' + '#'*5 + ' Checking Solutions '+ 
'#'*6) 156 | mod_src_list.append(f"my_solution = returned_var # Your Solution") 157 | mod_src_list = mod_src_list + check_list_code 158 | processed_code = '\n'.join(mod_src_list) 159 | except: 160 | mod_src_list = [] 161 | mod_src_list.append(f"my_solution = {stu_func_name}({test_kwargs_str})") 162 | mod_src_list = mod_src_list + check_list_code 163 | processed_code = '\n'.join(mod_src_list) 164 | 165 | return processed_code 166 | 167 | 168 | ######################################################################################## 169 | ####################### Utilities for comparison processing ############################ 170 | ######################################################################################## 171 | def retrieve_item(item_name, ptr_, test_idx, npz_file): 172 | item_shape = npz_file[f'shape_{item_name}'][test_idx] 173 | item_size = int(np.prod(item_shape)) 174 | item = npz_file[item_name][ptr_:(ptr_+item_size)].reshape(item_shape) 175 | return item, ptr_+item_size 176 | 177 | class NPStrListCoder: 178 | def __init__(self): 179 | self.filler = '?' 180 | self.spacer = ':' 181 | self.max_len = 100 182 | 183 | def encode(self, str_list): 184 | my_str_ = self.spacer.join(str_list) 185 | str_hex_data = [ord(c) for c in my_str_] 186 | assert_msg = f'Increase max len; you have so many characters: {len(str_hex_data)}>{self.max_len}' 187 | assert len(str_hex_data) <= self.max_len, assert_msg 188 | str_hex_data = str_hex_data + [ord(self.filler) for _ in range(self.max_len - len(str_hex_data))] 189 | str_hex_np = np.array(str_hex_data) 190 | return str_hex_np 191 | 192 | def decode(self, np_arr): 193 | a = ''.join([chr(i) for i in np_arr]) 194 | recovered_list = a.replace(self.filler, '').split(self.spacer) 195 | return recovered_list 196 | 197 | str2np_coder = NPStrListCoder() 198 | 199 | def test_case_loader(test_file): 200 | npz_file = np.load(test_file) 201 | arg_id_list = sorted([int(key[4:]) for key in npz_file.keys() if key.startswith('arg_')]) 202 | kwarg_names_list = sorted([key[6:] for key in npz_file.keys() if key.startswith('kwarg_')]) 203 | 204 | arg_ptr_list = [0 for _ in range(len(arg_id_list))] 205 | dfcarg_ptr_list = [0 for _ in range(len(arg_id_list))] 206 | kwarg_ptr_list = [0 for _ in range(len(kwarg_names_list))] 207 | dfckwarg_ptr_list = [0 for _ in range(len(kwarg_names_list))] 208 | out_ptr = 0 209 | for i in np.arange(npz_file['num_tests']): 210 | args_list = [] 211 | for arg_id, arg_id_ in enumerate(arg_id_list): 212 | arg_item, arg_ptr_list[arg_id] = retrieve_item(f'arg_{arg_id_}', arg_ptr_list[arg_id], i, npz_file) 213 | if f'dfcarg_{arg_id_}' in npz_file.keys(): 214 | col_list_code, dfcarg_ptr_list[arg_id] = retrieve_item(f'dfcarg_{arg_id_}', dfcarg_ptr_list[arg_id], i, npz_file) 215 | arg_item = pd.DataFrame(arg_item, columns=str2np_coder.decode(col_list_code)) 216 | args_list.append(arg_item) 217 | args = tuple(args_list) 218 | 219 | kwargs = {} 220 | for kwarg_id, kwarg_name in enumerate(kwarg_names_list): 221 | kwarg_item, kwarg_ptr_list[kwarg_id] = retrieve_item(f'kwarg_{kwarg_name}', kwarg_ptr_list[kwarg_id], i, npz_file) 222 | if f'dfckwarg_{kwarg_name}' in npz_file.keys(): 223 | col_list_code, dfckwarg_ptr_list[kwarg_id] = retrieve_item(f'dfckwarg_{kwarg_name}', dfckwarg_ptr_list[kwarg_id], i, npz_file) 224 | kwarg_item = pd.DataFrame(kwarg_item, columns=str2np_coder.decode(col_list_code)) 225 | kwargs[kwarg_name]=kwarg_item 226 | 227 | output, out_ptr = retrieve_item(f'output', out_ptr, i, npz_file) 228 | 229 | yield args, kwargs, output 230 
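# Note (illustrative sketch, not part of the staff tooling): test_case_loader expects
# each stored item to be one flat array plus a per-test-case "shape_*" table, with
# "num_tests" driving the iteration. A toy database for a single argument "arg_0"
# could look like this, where a plain dict stands in for the loaded npz file object:
#
#     a0 = np.arange(6).reshape(2, 3)                        # test case 0
#     a1 = np.arange(4).reshape(2, 2)                        # test case 1
#     toy_db = {
#         'num_tests': np.array(2),
#         'arg_0': np.concatenate([a0.ravel(), a1.ravel()]),
#         'shape_arg_0': np.array([[2, 3], [2, 2]]),
#     }
#     ptr = 0
#     for i in range(int(toy_db['num_tests'])):
#         item, ptr = retrieve_item('arg_0', ptr, i, toy_db)  # advances the flat pointer
#         # item.shape -> (2, 3) for test case 0, then (2, 2) for test case 1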
| 231 | def arg2str(args, kwargs, adv_user_msg=False, stu_func=None): 232 | msg = '' 233 | 234 | for arg_ in args: 235 | msg += f'{arg_}\n' 236 | for key,val in kwargs.items(): 237 | try: 238 | val_str = np.array_repr(val) 239 | except: 240 | val_str = val 241 | new_line = f'{Fore_MAGENTA}{key}{Fore_BLACK} = {val_str}\n' 242 | new_line = new_line.replace(' = array(',' = np.array(') 243 | new_line = new_line.replace('nan,','np.nan,') 244 | msg += new_line 245 | 246 | 247 | if adv_user_msg: 248 | try: 249 | is_stu_func_lambda = isinstance(stu_func, types.LambdaType) 250 | if is_stu_func_lambda: 251 | is_stu_func_lambda = stu_func.__name__ == "" 252 | if not is_stu_func_lambda: 253 | code_title_ = f'\n' + '-' * (NO_DASHES-1) + f'{Fore_RED} Reproducing Code Snippet {Fore_BLACK}' + '-' * (NO_DASHES-2) + '\n' 254 | code = code_snippet_maker(stu_func, args, kwargs) 255 | msg += code_title_ + code_color_parser(code) 256 | except: 257 | pass 258 | return msg 259 | 260 | 261 | def test_case_checker(stu_func, task_id=0): 262 | out_dict = {} 263 | out_dict['task_number'] = task_id 264 | out_dict['exception'] = None 265 | out_dict['exception_info'] = None 266 | test_db_npz = f'{test_db_dir}/task_{task_id}.npz' 267 | if not os.path.exists(test_db_npz): 268 | out_dict['message'] = f'Test database test_db/task_{task_id}.npz does not exist... aborting!' 269 | out_dict['passed'] = False 270 | out_dict['test_args'] = None 271 | out_dict['test_kwargs'] = None 272 | out_dict['stu_sol'] = None 273 | out_dict['correct_sol'] = None 274 | return out_dict 275 | 276 | if hasattr(stu_func, '__name__'): 277 | stu_func_name = stu_func.__name__ 278 | else: 279 | stu_func_name = None 280 | 281 | done = False 282 | err_title = f'\n' + '*' * NO_DASHES + f'{Fore_RED} Error in Task {task_id} {Fore_BLACK}' + '*' * NO_DASHES + f'\n' 283 | test_case_title = '\n' + '-' * NO_DASHES + f'{Fore_RED} Test Case Arguments {Fore_BLACK}' + '-' * NO_DASHES + '\n' 284 | summary_title = '-' * NO_DASHES + f' {Fore_RED} Summary {Fore_BLACK}' + '-' * NO_DASHES + '\n' 285 | for (test_args, test_kwargs, correct_sol) in test_case_loader(test_db_npz): 286 | try: 287 | stu_args_copy = copy.deepcopy(test_args) 288 | stu_kwargs_copy = copy.deepcopy(test_kwargs) 289 | stu_sol = stu_func(*stu_args_copy, **stu_kwargs_copy) 290 | except Exception as stu_exception: 291 | stu_sol = None 292 | stu_exception_info = sys.exc_info() 293 | message = err_title + summary_title 294 | message += f'Your code {Fore_RED}crashed{Fore_BLACK} during the evaluation of a test case argument.' 295 | message += f' The rest of this message gives you the following material:\n' 296 | message += f' 1. The exception traceback detailing how the error occured.\n' 297 | message += f' 2. The specific test case arguments that caused the error.\n' 298 | message += f' 3. 
A code snippet that can conviniently reproduce the error.\n' 299 | message += f' -> You can {Fore_RED}copy and paste{Fore_BLACK} the {Fore_RED}code snippet{Fore_BLACK} into a {Fore_RED}new cell{Fore_BLACK}, and run the cell to reproduce the error.\n\n' 300 | message += '-' * NO_DASHES + f'{Fore_RED} Exception Traceback {Fore_BLACK}' + '-' * NO_DASHES + '\n' 301 | message += get_tb_colored_str(*stu_exception_info) 302 | message += test_case_title 303 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func) 304 | out_dict['test_args'] = test_args 305 | out_dict['test_kwargs'] = test_kwargs 306 | out_dict['stu_sol'] = stu_sol 307 | out_dict['correct_sol'] = correct_sol 308 | out_dict['message'] = message 309 | out_dict['passed'] = False 310 | out_dict['exception'] = stu_exception 311 | out_dict['exception_info'] = stu_exception_info 312 | return out_dict 313 | 314 | if isinstance(correct_sol, np.ndarray) and np.isscalar(stu_sol): 315 | # This is handling a special case: When scalar numpy objects are stored, 316 | # they will be converted to a numpy array upon loading. 317 | # In this case, we'll give students the benefit of the doubt, 318 | # and assume the correct solution already was a scalar. 319 | if correct_sol.size == 1: 320 | correct_sol = np.float64(correct_sol.item()) 321 | stu_sol = np.float64(np.float64(stu_sol).item()) 322 | 323 | #Type Sanity check 324 | if type(stu_sol) is not type(correct_sol): 325 | message = err_title + summary_title 326 | message += f'Your solution\'s {Fore_RED}output type{Fore_BLACK} is not the same as ' 327 | message += f'the reference solution\'s data type.\n' 328 | message += f' Your solution\'s type --> {Fore_RED}{type(stu_sol)}{Fore_BLACK}\n' 329 | message += f' Correct solution\'s type --> {Fore_RED}{type(correct_sol)}{Fore_BLACK}\n' 330 | message += test_case_title 331 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func) 332 | out_dict['test_args'] = test_args 333 | out_dict['test_kwargs'] = test_kwargs 334 | out_dict['stu_sol'] = stu_sol 335 | out_dict['correct_sol'] = correct_sol 336 | out_dict['message'] = message 337 | out_dict['passed'] = False 338 | return out_dict 339 | 340 | if isinstance(correct_sol, np.ndarray): 341 | if not np.all(np.array(correct_sol.shape) == np.array(stu_sol.shape)): 342 | message = err_title + summary_title 343 | message += f'Your solution\'s {Fore_RED}output numpy shape{Fore_BLACK} is not the same as ' 344 | message += f'the reference solution\'s shape.\n' 345 | message += f' Your solution\'s shape --> {Fore_RED}{stu_sol.shape}{Fore_BLACK}\n' 346 | message += f' Correct solution\'s shape --> {Fore_RED}{correct_sol.shape}{Fore_BLACK}\n' 347 | message += test_case_title 348 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func) 349 | out_dict['test_args'] = test_args 350 | out_dict['test_kwargs'] = test_kwargs 351 | out_dict['stu_sol'] = stu_sol 352 | out_dict['correct_sol'] = correct_sol 353 | out_dict['message'] = message 354 | out_dict['passed'] = False 355 | return out_dict 356 | 357 | if not(stu_sol.dtype is correct_sol.dtype): 358 | message = err_title + summary_title 359 | message += f'Your solution\'s {Fore_RED}output numpy dtype{Fore_BLACK} is not the same as' 360 | message += f'the reference solution\'s dtype.\n' 361 | message += f' Your solution\'s dtype --> {Fore_RED}np.{stu_sol.dtype}{Fore_BLACK}\n' 362 | message += f' Correct solution\'s dtype --> {Fore_RED}np.{correct_sol.dtype}{Fore_BLACK}\n' 363 | message += 
test_case_title 364 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func) 365 | out_dict['test_args'] = test_args 366 | out_dict['test_kwargs'] = test_kwargs 367 | out_dict['stu_sol'] = stu_sol 368 | out_dict['correct_sol'] = correct_sol 369 | out_dict['message'] = message 370 | out_dict['passed'] = False 371 | return out_dict 372 | 373 | if isinstance(correct_sol, np.ndarray): 374 | equality_array = np.isclose(stu_sol, correct_sol, rtol=1e-05, atol=1e-08, equal_nan=True) 375 | if not equality_array.all(): 376 | message = err_title + summary_title 377 | message += f'Your solution is {Fore_RED}not the same{Fore_BLACK} as the correct solution.\n' 378 | whr_ = np.array(np.where(np.logical_not(equality_array))) 379 | ineq_idxs = whr_[:,0].tolist() 380 | message += f' your_solution{ineq_idxs}={stu_sol[tuple(ineq_idxs)]}\n' 381 | message += f' correct_solution{ineq_idxs}={correct_sol[tuple(ineq_idxs)]}\n' 382 | message += test_case_title 383 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func) 384 | out_dict['test_args'] = test_args 385 | out_dict['test_kwargs'] = test_kwargs 386 | out_dict['stu_sol'] = stu_sol 387 | out_dict['correct_sol'] = correct_sol 388 | out_dict['message'] = message 389 | out_dict['passed'] = False 390 | return out_dict 391 | 392 | elif np.isscalar(correct_sol): 393 | equality_array = np.isclose(stu_sol, correct_sol, rtol=1e-05, atol=1e-08, equal_nan=True) 394 | if not equality_array.all(): 395 | message = err_title + summary_title 396 | message += f'Your solution is {Fore_RED}not the same{Fore_BLACK} as the correct solution.\n' 397 | message += f' your_solution={stu_sol}\n' 398 | message += f' correct_solution={correct_sol}\n' 399 | message += test_case_title 400 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func) 401 | out_dict['test_args'] = test_args 402 | out_dict['test_kwargs'] = test_kwargs 403 | out_dict['stu_sol'] = stu_sol 404 | out_dict['correct_sol'] = correct_sol 405 | out_dict['message'] = message 406 | out_dict['passed'] = False 407 | return out_dict 408 | 409 | elif isinstance(correct_sol, tuple): 410 | if not correct_sol==stu_sol: 411 | message = err_title + summary_title 412 | message += f'Your solution is {Fore_RED}not the same{Fore_BLACK} as the correct solution.\n' 413 | message += f' your_solution={stu_sol}\n' 414 | message += f' correct_solution={correct_sol}\n' 415 | message += test_case_title 416 | message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func) 417 | out_dict['test_args'] = test_args 418 | out_dict['test_kwargs'] = test_kwargs 419 | out_dict['stu_sol'] = stu_sol 420 | out_dict['correct_sol'] = correct_sol 421 | out_dict['message'] = message 422 | out_dict['passed'] = False 423 | return out_dict 424 | 425 | else: 426 | raise Exception(f'Not implemented comparison for other data types. sorry!') 427 | 428 | out_dict['test_args'] = None 429 | out_dict['test_kwargs'] = None 430 | out_dict['stu_sol'] = None 431 | out_dict['correct_sol'] = None 432 | out_dict['message'] = 'Well Done!' 
433 | out_dict['passed'] = True 434 | return out_dict 435 | 436 | def show_test_cases(test_func, task_id=0): 437 | from IPython.display import clear_output 438 | file_path = f'{test_db_dir}/task_{task_id}.npz' 439 | npz_file = np.load(file_path) 440 | orig_images = npz_file['raw_images'] 441 | ref_images = npz_file['ref_images'] 442 | test_images = test_func(orig_images) 443 | 444 | visualize_ = visualize and perform_computation 445 | 446 | if not np.all(np.array(test_images.shape) == np.array(ref_images.shape)): 447 | print(f'Error: It seems the test images and the ref images have different shapes. Modify your function so that they both have the same shape.') 448 | print(f' test_images shape: {test_images.shape}') 449 | print(f' ref_images shape: {ref_images.shape}') 450 | return None, None, None, False 451 | 452 | if not np.all(np.array(test_images.dtype) == np.array(ref_images.dtype)): 453 | print(f'Error: It seems the test images and the ref images have different dtype. Modify your function so that they both have the same dtype.') 454 | print(f' test_images dtype: {test_images.dtype}') 455 | print(f' ref_images dtype: {ref_images.dtype}') 456 | return None, None, None, False 457 | 458 | for i in range(ref_images.shape[0]): 459 | if visualize_: 460 | nrows, ncols = 1, 3 461 | ax_w, ax_h = 5, 5 462 | fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*ax_w, nrows*ax_h)) 463 | axes = np.array(axes).reshape(nrows, ncols) 464 | 465 | orig_image = orig_images[i] 466 | ref_image = ref_images[i] 467 | test_image = test_images[i] 468 | 469 | if visualize_: 470 | ax = axes[0,0] 471 | ax.pcolormesh(orig_image, edgecolors='k', linewidth=0.01, cmap='Greys') 472 | ax.xaxis.tick_top() 473 | ax.invert_yaxis() 474 | 475 | x_ticks = ax.get_xticks(minor=False).astype(np.int) 476 | x_ticks = x_ticks[x_ticks < orig_image.shape[1]] 477 | ax.set_xticks(x_ticks + 0.5) 478 | ax.set_xticklabels((x_ticks).astype(np.int)) 479 | 480 | y_ticks = ax.get_yticks(minor=False).astype(np.int) 481 | y_ticks = y_ticks[y_ticks < orig_image.shape[0]] 482 | ax.set_yticks(y_ticks + 0.5) 483 | ax.set_yticklabels((y_ticks).astype(np.int)) 484 | 485 | ax.set_aspect('equal') 486 | ax.set_title('Raw Image') 487 | 488 | ax = axes[0,1] 489 | ax.pcolormesh(ref_image, edgecolors='k', linewidth=0.01, cmap='Greys') 490 | ax.xaxis.tick_top() 491 | ax.invert_yaxis() 492 | 493 | x_ticks = ax.get_xticks(minor=False).astype(np.int) 494 | x_ticks = x_ticks[x_ticks < ref_image.shape[1]] 495 | ax.set_xticks(x_ticks+0.5) 496 | ax.set_xticklabels((x_ticks).astype(np.int)) 497 | 498 | y_ticks = ax.get_yticks(minor=False).astype(np.int) 499 | y_ticks = y_ticks[y_ticks < ref_image.shape[0]] 500 | ax.set_yticks(y_ticks+0.5) 501 | ax.set_yticklabels((y_ticks).astype(np.int)) 502 | 503 | ax.set_aspect('equal') 504 | ax.set_title('Reference Solution Image') 505 | 506 | ax = axes[0,2] 507 | ax.pcolormesh(test_image, edgecolors='k', linewidth=0.01, cmap='Greys') 508 | ax.xaxis.tick_top() 509 | ax.invert_yaxis() 510 | 511 | x_ticks = ax.get_xticks(minor=False).astype(np.int) 512 | x_ticks = x_ticks[x_ticks < test_image.shape[1]] 513 | ax.set_xticks(x_ticks + 0.5) 514 | ax.set_xticklabels((x_ticks).astype(np.int)) 515 | 516 | y_ticks = ax.get_yticks(minor=False).astype(np.int) 517 | y_ticks = y_ticks[y_ticks < test_image.shape[0]] 518 | ax.set_yticks(y_ticks + 0.5) 519 | ax.set_yticklabels((y_ticks).astype(np.int)) 520 | 521 | ax.set_aspect('equal') 522 | ax.set_title('Your Solution Image') 523 | 524 | if np.allclose(ref_image, test_image): 525 | if 
visualize_: 526 | print('The reference and solution images are the same to a T! Well done on this test case.') 527 | else: 528 | print('The reference and solution images are not the same...') 529 | ineq_idxs = np.array(np.where(np.logical_not(np.isclose(ref_image, test_image))))[:,0].tolist() 530 | print(f'ref_image{ineq_idxs}={ref_image[tuple(ineq_idxs)]}') 531 | print(f'test_image{ineq_idxs}={test_image[tuple(ineq_idxs)]}') 532 | if visualize_: 533 | print('I will return the images so that you will be able to diagnose the issue and resolve it...') 534 | return (orig_image, ref_image, test_image, False) 535 | 536 | if visualize_: 537 | plt.show() 538 | input_prompt = ' Enter nothing to go to the next image\nor\n Enter "s" when you are done to recieve the three images. \n' 539 | input_prompt += ' **Don\'t forget to do this before continuing to the next step.**\n' 540 | 541 | try: 542 | cmd = input(input_prompt) 543 | except KeyboardInterrupt: 544 | cmd = 's' 545 | 546 | if cmd.lower().startswith('s'): 547 | return (orig_image, ref_image, test_image, True) 548 | else: 549 | clear_output(wait=True) 550 | 551 | return (orig_image, ref_image, test_image, True) -------------------------------------------------------------------------------- /EMSegmentation-lib/payload_requirements.json: -------------------------------------------------------------------------------- 1 | { 2 | "comment": "The files property is a list of additional file paths (relative to this homework's notebook dir) that will be bundled into the student's ipynb metadata each time they save.", 3 | "files": [] 4 | } 5 | -------------------------------------------------------------------------------- /EMSegmentation-lib/pics/RobertMixed03.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/pics/RobertMixed03.jpg -------------------------------------------------------------------------------- /EMSegmentation-lib/pics/smallstrelitzia.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/pics/smallstrelitzia.jpg -------------------------------------------------------------------------------- /EMSegmentation-lib/pics/smallsunset.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/pics/smallsunset.jpg -------------------------------------------------------------------------------- /EMSegmentation-lib/requirements1.txt: -------------------------------------------------------------------------------- 1 | # Stage 1: Normal Pypi index packages that have no reliance on PyTorch. 2 | # The Dockerfiles also may have specified some initial packages. 
3 | 4 | scikit-learn==0.24 5 | scikit-image==0.18.3 6 | seaborn==0.10.0 7 | pandas==1.1.3 8 | numpy==1.18.1 9 | matplotlib==3.1.3 10 | Pillow==8.2.0 11 | jupyter_client==6.1.11 -------------------------------------------------------------------------------- /EMSegmentation-lib/test_db/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/test_db/.DS_Store -------------------------------------------------------------------------------- /EMSegmentation-lib/test_db/task_1.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/test_db/task_1.npz -------------------------------------------------------------------------------- /EMSegmentation-lib/test_db/task_2.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/test_db/task_2.npz -------------------------------------------------------------------------------- /EMSegmentation-lib/test_db/task_3.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/test_db/task_3.npz -------------------------------------------------------------------------------- /EMSegmentation-lib/test_db/task_4.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMSegmentation-lib/test_db/task_4.npz -------------------------------------------------------------------------------- /EMTopicModel-lib/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMTopicModel-lib/.DS_Store -------------------------------------------------------------------------------- /EMTopicModel-lib/EMTopicModel.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMTopicModel-lib/EMTopicModel.pdf -------------------------------------------------------------------------------- /EMTopicModel-lib/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Assignment-specific public libraries and large data files (visible, read-only) 3 | 4 | This directory is for data and script files that are specific to one homework. 5 | This directory will be part of the Python path for all users. Libraries placed here should generally be those written by staff; Python packages can instead be installed system-wide in the Dockerfiles or `requirements.txt` loading system (see below). 6 | 7 | Please don't confuse this directory with the `work/course-lib` directory that may contain data to be used by multiple homework assignments. 
8 | 9 | ## Placement 10 | 11 | In the lab container, the contents of this directory will be placed in: 12 | 13 | ``` 14 | work/release/[hwid]-lib 15 | ``` 16 | 17 | Where `hwid` matches the homework ID for that assignment. For example, data files associated with `work/source/HW1/HW1.ipynb` should be placed in `work/source/HW1-lib/`. These files will end up in `work/release/HW1-lib` in the container. 18 | 19 | These files will be read-only. These will be available for all users, including students. However, it's better not to refer to these files using absolute paths; see best practices below. 20 | 21 | ## Special files 22 | 23 | ### `payload_requirements.json` 24 | 25 | **Only staff can configure this file.** 26 | 27 | The `payload_requirements.json` file, if present, specifies additional files that will be submitted along with the notebook. The file should contain an object with a `"files"` property that is a list of strings that are relative paths to files under the current homework notebook's working directory. For example: 28 | 29 | ```json 30 | { 31 | "files": ["some-file.db", "inner_directory/nested_file.txt"] 32 | } 33 | ``` 34 | 35 | If the homework ID for this homework is "HW1", the above example would specify these additional files to be collected: 36 | 37 | - `work/release/HW1/some-file.db` 38 | - `work/release/HW1/inner_directory/nested_file.txt` 39 | 40 | ### `requirements.txt` sequence 41 | 42 | **Only staff can configure these files.** 43 | 44 | If some of the Python packages for a particular assignment need to be frozen, you can specify packages and versions in one or more files named `requirements*.txt`, where you may want to put a number before the extension such as `requirements1.txt`. These will be processed one at a time in natural version order (like `sort -V` in Linux) during the Docker build. 45 | 46 | ## Best practices 47 | 48 | It's important to put public staff libraries and large data files in this directory to prevent editing and ensure efficient use of the cloud storage. Large files will also have improved access time from this directory compared to the notebook directory. 49 | 50 | Python files in this directory will be on the Python system path in the container, so you may want to write a Python loader for data you need and refer to the data with relative paths under the library directory (rather than using ".." to refer to a directory above). However, if students try to work on files offline this may complicate things and require shims to adjust the paths. 51 | 52 | Staff members should refer to additional notes in the staff library directories. 
53 | -------------------------------------------------------------------------------- /EMTopicModel-lib/aml_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | import pandas as pd 4 | import os 5 | import sys 6 | import traceback 7 | from pygments import formatters, highlight, lexers 8 | import re 9 | import inspect 10 | import types 11 | import copy 12 | 13 | visualize = True 14 | perform_computation = True 15 | test_db_dir = os.path.dirname(os.path.realpath(__file__)) + '/test_db' 16 | NO_DASHES = 55 17 | 18 | Fore_BLUE_LIGHT = u'\u001b[38;5;19m' 19 | Fore_RED_LIGHT = u'\u001b[38;5;196m' 20 | Fore_BLUE = u'\u001b[38;5;34m' 21 | Fore_RED = '\x1b[1;31m' 22 | FORE_GREEN_DARK = u'\u001b[38;5;22m' 23 | Fore_DARKRED = u'\u001b[38;5;124m' 24 | Fore_MAGENTA = '\x1b[1m' + u'\u001b[38;5;92m' 25 | Fore_GREEN = u'\u001b[38;5;32m' 26 | Fore_BLACK = '\x1b[0m' + u'\u001b[38;5;0m' 27 | 28 | 29 | ######################################################################################## 30 | ######################## Utilities for traceback processing ############################ 31 | ######################################################################################## 32 | def keep_tb_rule(tb): 33 | tb_file_path = tb.tb_frame.f_code.co_filename 34 | if os.path.realpath(__file__) == os.path.realpath(tb_file_path): 35 | return False 36 | else: 37 | return True 38 | 39 | def censor_exc_traceback(exc_traceback): 40 | original_tb_list = [] 41 | tb_next = exc_traceback 42 | while tb_next is not None: 43 | original_tb_list.append(tb_next) 44 | tb_next = tb_next.tb_next 45 | 46 | censored_tb_list = [tb for tb in original_tb_list if keep_tb_rule(tb)] 47 | 48 | for i, tb in enumerate(censored_tb_list[:-1]): 49 | tb.tb_next = censored_tb_list[i+1] 50 | 51 | if len(censored_tb_list) > 0: 52 | return censored_tb_list[0] 53 | else: 54 | return exc_traceback 55 | 56 | try: 57 | import IPython 58 | ultratb = IPython.core.ultratb.VerboseTB(include_vars=False) 59 | def get_tb_colored_str(exc_type, exc_value, exc_traceback): 60 | manipulated_exc_traceback = censor_exc_traceback(exc_traceback) 61 | tb_text = ultratb.text(exc_type, exc_value, manipulated_exc_traceback) 62 | tb_text = re.sub( r"/tmp/ipykernel_.*.py", "/Jupyter/Notebook/Student/Task/Implementation/Cells", tb_text) 63 | tb_text = re.sub( r"\s{20,}Traceback", " Traceback", tb_text) 64 | s_split = tb_text.split('\n') 65 | if len(s_split) > 0: 66 | c_s_split = s_split[1:] 67 | tb_text = '\n'.join(c_s_split) + '\n' 68 | tb_text = tb_text.replace('\x1b[0;36m', '\x1b[1m \x1b[1;34m') 69 | return tb_text 70 | except: 71 | def get_tb_colored_str(exc_type, exc_value, exc_traceback): 72 | manipulated_exc_traceback = censor_exc_traceback(exc_traceback) 73 | tb_text = traceback.format_exception(exc_type, exc_value, manipulated_exc_traceback, limit=None, chain=True) 74 | tb_text = ''.join(tb_text) 75 | tb_text = re.sub( r"\"/tmp/ipykernel_.*\"", "\"/Jupyter/Notebook/Student/Task/Implementation/Cells\"", tb_text) 76 | lexer = lexers.get_lexer_by_name("pytb", stripall=True) 77 | formatter = formatters.get_formatter_by_name("terminal16m") 78 | tb_colored = highlight(tb_text, lexer, formatter) 79 | return tb_colored 80 | 81 | try: 82 | from IPython.utils import PyColorize 83 | color_parser = PyColorize.Parser(color_table=None, out="str", parent=None, style='Linux') 84 | def code_color_parser(code_str): 85 | return color_parser.format(code_str) 86 | except: 87 | def 
code_color_parser(code_str): 88 | return code_str 89 | 90 | def get_num_indents(src_list): 91 | assert len(src_list) > 0 92 | a = [line + 20 * ' ' for line in src_list] 93 | b = [len(line) - len(line.lstrip()) for line in a] 94 | assert b[0] == 0 95 | c = min(b[1:]) 96 | return c 97 | 98 | def code_snippet_maker(stu_function, args, kwargs): 99 | test_kwargs_str_lst = [] 100 | test_kwargs_str_lst.append('from copy import deepcopy') 101 | test_kwargs_str_lst.append("failed_arguments = deepcopy(test_results['test_kwargs'])") 102 | for arg_ in args: 103 | test_kwargs_str_lst.append(arg_) 104 | for key,val in kwargs.items(): 105 | test_kwargs_str_lst.append(f"{key} = failed_arguments['{key}']") 106 | test_kwargs_str = ', '.join(test_kwargs_str_lst) 107 | 108 | if hasattr(stu_function, '__name__'): 109 | stu_func_name = stu_function.__name__ 110 | else: 111 | stu_func_name = 'YOUR_FUNCTION_NAME' 112 | 113 | check_list_code = [] 114 | check_list_code.append(f"correct_sol = test_results['correct_sol'] # The Reference Solution") 115 | check_list_code.append(f"if isinstance(correct_sol, np.ndarray):") 116 | check_list_code.append(f" assert isinstance(my_solution, np.ndarray)") 117 | check_list_code.append(f" assert my_solution.dtype is correct_sol.dtype") 118 | check_list_code.append(f" assert my_solution.shape == correct_sol.shape") 119 | check_list_code.append(f" assert np.allclose(my_solution, correct_sol)") 120 | check_list_code.append(f" print('If you passed the above assertions, it probably means that you have fixed the issue! Well Done!')") 121 | check_list_code.append(f" print('Now you have to do 3 things:')") 122 | check_list_code.append(f" print(' 1) Carefully copy the fixed code body back to the {stu_func_name} function.')") 123 | check_list_code.append(f" print(' 2) If you copied any \"returned_var = \" lines, convert them back to return statements.')") 124 | check_list_code.append(f" print(' 3) Carefully remove this cell (i.e., the cell you inserted and modified) once you are done.')") 125 | 126 | try: 127 | src = inspect.getsource(stu_function) 128 | src_list = src.split('\n') 129 | src_list = [line for line in src_list if not (line.strip().startswith('#'))] 130 | no_indents = get_num_indents(src_list) 131 | mod_src_list = [] 132 | src_gen = src_list[1:] if src_list[0].startswith('def') else src_list 133 | for line in src_gen: 134 | if len(line) > no_indents: 135 | shifted_left_line = line[no_indents:] 136 | else: 137 | shifted_left_line = line 138 | 139 | return_statement = 'return ' 140 | if not shifted_left_line.lstrip().startswith(return_statement): 141 | mod_src_list.append(shifted_left_line) 142 | else: 143 | i = shifted_left_line.index(return_statement) 144 | shifted_left_line = shifted_left_line[:i] + 'returned_var = ' + shifted_left_line[i+len(return_statement):] + ' # returned variable' 145 | mod_src_list.append(shifted_left_line) 146 | 147 | mod_bodysrc_list = '\n'.join(mod_src_list).strip().split('\n') 148 | 149 | mod_src_list = [] 150 | mod_src_list = mod_src_list + ['### You can copy the following auto-generated snippet into a new cell to reproduce the issue.'] 151 | mod_src_list = mod_src_list + ['### Use the + button on the top left of the screen to insert a new cell below.'] 152 | mod_src_list = mod_src_list + [''] 153 | mod_src_list = mod_src_list + ['#'*7 + ' Test Arguments ' + '#'*7] + test_kwargs_str_lst 154 | mod_src_list = mod_src_list + ['\n' + '#'*7 + ' Your Code Body ' + '#'*7] + mod_bodysrc_list 155 | mod_src_list.append('\n' + '#'*5 + ' Checking Solutions '+ 
'#'*6) 156 | mod_src_list.append(f"my_solution = returned_var # Your Solution") 157 | mod_src_list = mod_src_list + check_list_code 158 | processed_code = '\n'.join(mod_src_list) 159 | except: 160 | mod_src_list = [] 161 | mod_src_list.append(f"my_solution = {stu_func_name}({test_kwargs_str})") 162 | mod_src_list = mod_src_list + check_list_code 163 | processed_code = '\n'.join(mod_src_list) 164 | 165 | return processed_code 166 | 167 | 168 | ######################################################################################## 169 | ####################### Utilities for comparison processing ############################ 170 | ######################################################################################## 171 | def retrieve_item(item_name, ptr_, test_idx, npz_file): 172 | item_shape = npz_file[f'shape_{item_name}'][test_idx] 173 | item_size = int(np.prod(item_shape)) 174 | item = npz_file[item_name][ptr_:(ptr_+item_size)].reshape(item_shape) 175 | return item, ptr_+item_size 176 | 177 | class NPStrListCoder: 178 | def __init__(self): 179 | self.filler = '?' 180 | self.spacer = ':' 181 | self.max_len = 100 182 | 183 | def encode(self, str_list): 184 | my_str_ = self.spacer.join(str_list) 185 | str_hex_data = [ord(c) for c in my_str_] 186 | assert_msg = f'Increase max len; you have so many characters: {len(str_hex_data)}>{self.max_len}' 187 | assert len(str_hex_data) <= self.max_len, assert_msg 188 | str_hex_data = str_hex_data + [ord(self.filler) for _ in range(self.max_len - len(str_hex_data))] 189 | str_hex_np = np.array(str_hex_data) 190 | return str_hex_np 191 | 192 | def decode(self, np_arr): 193 | a = ''.join([chr(i) for i in np_arr]) 194 | recovered_list = a.replace(self.filler, '').split(self.spacer) 195 | return recovered_list 196 | 197 | str2np_coder = NPStrListCoder() 198 | 199 | def test_case_loader(test_file): 200 | npz_file = np.load(test_file) 201 | arg_id_list = sorted([int(key[4:]) for key in npz_file.keys() if key.startswith('arg_')]) 202 | kwarg_names_list = sorted([key[6:] for key in npz_file.keys() if key.startswith('kwarg_')]) 203 | 204 | arg_ptr_list = [0 for _ in range(len(arg_id_list))] 205 | dfcarg_ptr_list = [0 for _ in range(len(arg_id_list))] 206 | kwarg_ptr_list = [0 for _ in range(len(kwarg_names_list))] 207 | dfckwarg_ptr_list = [0 for _ in range(len(kwarg_names_list))] 208 | out_ptr = 0 209 | for i in np.arange(npz_file['num_tests']): 210 | args_list = [] 211 | for arg_id, arg_id_ in enumerate(arg_id_list): 212 | arg_item, arg_ptr_list[arg_id] = retrieve_item(f'arg_{arg_id_}', arg_ptr_list[arg_id], i, npz_file) 213 | if f'dfcarg_{arg_id_}' in npz_file.keys(): 214 | col_list_code, dfcarg_ptr_list[arg_id] = retrieve_item(f'dfcarg_{arg_id_}', dfcarg_ptr_list[arg_id], i, npz_file) 215 | arg_item = pd.DataFrame(arg_item, columns=str2np_coder.decode(col_list_code)) 216 | args_list.append(arg_item) 217 | args = tuple(args_list) 218 | 219 | kwargs = {} 220 | for kwarg_id, kwarg_name in enumerate(kwarg_names_list): 221 | kwarg_item, kwarg_ptr_list[kwarg_id] = retrieve_item(f'kwarg_{kwarg_name}', kwarg_ptr_list[kwarg_id], i, npz_file) 222 | if f'dfckwarg_{kwarg_name}' in npz_file.keys(): 223 | col_list_code, dfckwarg_ptr_list[kwarg_id] = retrieve_item(f'dfckwarg_{kwarg_name}', dfckwarg_ptr_list[kwarg_id], i, npz_file) 224 | kwarg_item = pd.DataFrame(kwarg_item, columns=str2np_coder.decode(col_list_code)) 225 | kwargs[kwarg_name]=kwarg_item 226 | 227 | output, out_ptr = retrieve_item(f'output', out_ptr, i, npz_file) 228 | 229 | yield args, kwargs, output 230 
231 | def arg2str(args, kwargs, adv_user_msg=False, stu_func=None):
232 |     msg = ''
233 | 
234 |     for arg_ in args:
235 |         msg += f'{arg_}\n'
236 |     for key, val in kwargs.items():
237 |         try:
238 |             val_str = np.array_repr(val)
239 |         except Exception:
240 |             val_str = val
241 |         new_line = f'{Fore_MAGENTA}{key}{Fore_BLACK} = {val_str}\n'
242 |         new_line = new_line.replace(' = array(', ' = np.array(')
243 |         new_line = new_line.replace('nan,', 'np.nan,')
244 |         msg += new_line
245 | 
246 | 
247 |     if adv_user_msg:
248 |         try:
249 |             is_stu_func_lambda = isinstance(stu_func, types.LambdaType)
250 |             if is_stu_func_lambda:
251 |                 is_stu_func_lambda = stu_func.__name__ == "<lambda>"
252 |             if not is_stu_func_lambda:
253 |                 code_title_ = f'\n' + '-' * (NO_DASHES-1) + f'{Fore_RED} Reproducing Code Snippet {Fore_BLACK}' + '-' * (NO_DASHES-2) + '\n'
254 |                 code = code_snippet_maker(stu_func, args, kwargs)
255 |                 msg += code_title_ + code_color_parser(code)
256 |         except Exception:
257 |             pass
258 |     return msg
259 | 
260 | 
261 | def test_case_checker(stu_func, task_id=0):
262 |     out_dict = {}
263 |     out_dict['task_number'] = task_id
264 |     out_dict['exception'] = None
265 |     out_dict['exception_info'] = None
266 |     test_db_npz = f'{test_db_dir}/task_{task_id}.npz'
267 |     if not os.path.exists(test_db_npz):
268 |         out_dict['message'] = f'Test database test_db/task_{task_id}.npz does not exist... aborting!'
269 |         out_dict['passed'] = False
270 |         out_dict['test_args'] = None
271 |         out_dict['test_kwargs'] = None
272 |         out_dict['stu_sol'] = None
273 |         out_dict['correct_sol'] = None
274 |         return out_dict
275 | 
276 |     if hasattr(stu_func, '__name__'):
277 |         stu_func_name = stu_func.__name__
278 |     else:
279 |         stu_func_name = None
280 | 
281 |     done = False
282 |     err_title = f'\n' + '*' * NO_DASHES + f'{Fore_RED} Error in Task {task_id} {Fore_BLACK}' + '*' * NO_DASHES + f'\n'
283 |     test_case_title = '\n' + '-' * NO_DASHES + f'{Fore_RED} Test Case Arguments {Fore_BLACK}' + '-' * NO_DASHES + '\n'
284 |     summary_title = '-' * NO_DASHES + f' {Fore_RED} Summary {Fore_BLACK}' + '-' * NO_DASHES + '\n'
285 |     for (test_args, test_kwargs, correct_sol) in test_case_loader(test_db_npz):
286 |         try:
287 |             stu_args_copy = copy.deepcopy(test_args)  # deep copies keep a mutating student function from corrupting later cases
288 |             stu_kwargs_copy = copy.deepcopy(test_kwargs)
289 |             stu_sol = stu_func(*stu_args_copy, **stu_kwargs_copy)
290 |         except Exception as stu_exception:
291 |             stu_sol = None
292 |             stu_exception_info = sys.exc_info()
293 |             message = err_title + summary_title
294 |             message += f'Your code {Fore_RED}crashed{Fore_BLACK} during the evaluation of a test case argument.'
295 |             message += f' The rest of this message gives you the following material:\n'
296 |             message += f' 1. The exception traceback detailing how the error occurred.\n'
297 |             message += f' 2. The specific test case arguments that caused the error.\n'
298 |             message += f' 3. A code snippet that can conveniently reproduce the error.\n'
299 |             message += f' -> You can {Fore_RED}copy and paste{Fore_BLACK} the {Fore_RED}code snippet{Fore_BLACK} into a {Fore_RED}new cell{Fore_BLACK}, and run the cell to reproduce the error.\n\n'
300 |             message += '-' * NO_DASHES + f'{Fore_RED} Exception Traceback {Fore_BLACK}' + '-' * NO_DASHES + '\n'
301 |             message += get_tb_colored_str(*stu_exception_info)
302 |             message += test_case_title
303 |             message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
304 |             out_dict['test_args'] = test_args
305 |             out_dict['test_kwargs'] = test_kwargs
306 |             out_dict['stu_sol'] = stu_sol
307 |             out_dict['correct_sol'] = correct_sol
308 |             out_dict['message'] = message
309 |             out_dict['passed'] = False
310 |             out_dict['exception'] = stu_exception
311 |             out_dict['exception_info'] = stu_exception_info
312 |             return out_dict
313 | 
314 |         if isinstance(correct_sol, np.ndarray) and np.isscalar(stu_sol):
315 |             # This is handling a special case: when scalar numpy objects are stored,
316 |             # they will be converted to a numpy array upon loading.
317 |             # In this case, we'll give students the benefit of the doubt,
318 |             # and assume the correct solution already was a scalar.
319 |             if correct_sol.size == 1:
320 |                 correct_sol = np.float64(correct_sol.item())
321 |                 stu_sol = np.float64(np.float64(stu_sol).item())
322 | 
323 |         # Type sanity check
324 |         if type(stu_sol) is not type(correct_sol):
325 |             message = err_title + summary_title
326 |             message += f'Your solution\'s {Fore_RED}output type{Fore_BLACK} is not the same as '
327 |             message += f'the reference solution\'s data type.\n'
328 |             message += f' Your solution\'s type --> {Fore_RED}{type(stu_sol)}{Fore_BLACK}\n'
329 |             message += f' Correct solution\'s type --> {Fore_RED}{type(correct_sol)}{Fore_BLACK}\n'
330 |             message += test_case_title
331 |             message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
332 |             out_dict['test_args'] = test_args
333 |             out_dict['test_kwargs'] = test_kwargs
334 |             out_dict['stu_sol'] = stu_sol
335 |             out_dict['correct_sol'] = correct_sol
336 |             out_dict['message'] = message
337 |             out_dict['passed'] = False
338 |             return out_dict
339 | 
340 |         if isinstance(correct_sol, np.ndarray):
341 |             if not np.all(np.array(correct_sol.shape) == np.array(stu_sol.shape)):
342 |                 message = err_title + summary_title
343 |                 message += f'Your solution\'s {Fore_RED}output numpy shape{Fore_BLACK} is not the same as '
344 |                 message += f'the reference solution\'s shape.\n'
345 |                 message += f' Your solution\'s shape --> {Fore_RED}{stu_sol.shape}{Fore_BLACK}\n'
346 |                 message += f' Correct solution\'s shape --> {Fore_RED}{correct_sol.shape}{Fore_BLACK}\n'
347 |                 message += test_case_title
348 |                 message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
349 |                 out_dict['test_args'] = test_args
350 |                 out_dict['test_kwargs'] = test_kwargs
351 |                 out_dict['stu_sol'] = stu_sol
352 |                 out_dict['correct_sol'] = correct_sol
353 |                 out_dict['message'] = message
354 |                 out_dict['passed'] = False
355 |                 return out_dict
356 | 
357 |             if stu_sol.dtype != correct_sol.dtype:
358 |                 message = err_title + summary_title
359 |                 message += f'Your solution\'s {Fore_RED}output numpy dtype{Fore_BLACK} is not the same as '
360 |                 message += f'the reference solution\'s dtype.\n'
361 |                 message += f' Your solution\'s dtype --> {Fore_RED}np.{stu_sol.dtype}{Fore_BLACK}\n'
362 |                 message += f' Correct solution\'s dtype --> {Fore_RED}np.{correct_sol.dtype}{Fore_BLACK}\n'
363 |                 message += test_case_title
364 |                 message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
365 |                 out_dict['test_args'] = test_args
366 |                 out_dict['test_kwargs'] = test_kwargs
367 |                 out_dict['stu_sol'] = stu_sol
368 |                 out_dict['correct_sol'] = correct_sol
369 |                 out_dict['message'] = message
370 |                 out_dict['passed'] = False
371 |                 return out_dict
372 | 
373 |         if isinstance(correct_sol, np.ndarray):
374 |             equality_array = np.isclose(stu_sol, correct_sol, rtol=1e-05, atol=1e-08, equal_nan=True)
375 |             if not equality_array.all():
376 |                 message = err_title + summary_title
377 |                 message += f'Your solution is {Fore_RED}not the same{Fore_BLACK} as the correct solution.\n'
378 |                 whr_ = np.array(np.where(np.logical_not(equality_array)))
379 |                 ineq_idxs = whr_[:,0].tolist()  # index of the first mismatching element
380 |                 message += f' your_solution{ineq_idxs}={stu_sol[tuple(ineq_idxs)]}\n'
381 |                 message += f' correct_solution{ineq_idxs}={correct_sol[tuple(ineq_idxs)]}\n'
382 |                 message += test_case_title
383 |                 message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
384 |                 out_dict['test_args'] = test_args
385 |                 out_dict['test_kwargs'] = test_kwargs
386 |                 out_dict['stu_sol'] = stu_sol
387 |                 out_dict['correct_sol'] = correct_sol
388 |                 out_dict['message'] = message
389 |                 out_dict['passed'] = False
390 |                 return out_dict
391 | 
392 |         elif np.isscalar(correct_sol):
393 |             equality_array = np.isclose(stu_sol, correct_sol, rtol=1e-05, atol=1e-08, equal_nan=True)
394 |             if not equality_array.all():
395 |                 message = err_title + summary_title
396 |                 message += f'Your solution is {Fore_RED}not the same{Fore_BLACK} as the correct solution.\n'
397 |                 message += f' your_solution={stu_sol}\n'
398 |                 message += f' correct_solution={correct_sol}\n'
399 |                 message += test_case_title
400 |                 message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
401 |                 out_dict['test_args'] = test_args
402 |                 out_dict['test_kwargs'] = test_kwargs
403 |                 out_dict['stu_sol'] = stu_sol
404 |                 out_dict['correct_sol'] = correct_sol
405 |                 out_dict['message'] = message
406 |                 out_dict['passed'] = False
407 |                 return out_dict
408 | 
409 |         elif isinstance(correct_sol, tuple):
410 |             if correct_sol != stu_sol:
411 |                 message = err_title + summary_title
412 |                 message += f'Your solution is {Fore_RED}not the same{Fore_BLACK} as the correct solution.\n'
413 |                 message += f' your_solution={stu_sol}\n'
414 |                 message += f' correct_solution={correct_sol}\n'
415 |                 message += test_case_title
416 |                 message += arg2str(test_args, test_kwargs, adv_user_msg=True, stu_func=stu_func)
417 |                 out_dict['test_args'] = test_args
418 |                 out_dict['test_kwargs'] = test_kwargs
419 |                 out_dict['stu_sol'] = stu_sol
420 |                 out_dict['correct_sol'] = correct_sol
421 |                 out_dict['message'] = message
422 |                 out_dict['passed'] = False
423 |                 return out_dict
424 | 
425 |         else:
426 |             raise Exception('Comparison is not implemented for this output type, sorry!')
427 | 
428 |     out_dict['test_args'] = None
429 |     out_dict['test_kwargs'] = None
430 |     out_dict['stu_sol'] = None
431 |     out_dict['correct_sol'] = None
432 |     out_dict['message'] = 'Well Done!'
433 |     out_dict['passed'] = True
434 |     return out_dict
435 | 
436 | def show_test_cases(test_func, task_id=0):
437 |     from IPython.display import clear_output
438 |     file_path = f'{test_db_dir}/task_{task_id}.npz'
439 |     npz_file = np.load(file_path)
440 |     orig_images = npz_file['raw_images']
441 |     ref_images = npz_file['ref_images']
442 |     test_images = test_func(orig_images)
443 | 
444 |     visualize_ = visualize and perform_computation
445 | 
446 |     if test_images.shape != ref_images.shape:
447 |         print(f'Error: It seems the test images and the ref images have different shapes. Modify your function so that they both have the same shape.')
448 |         print(f' test_images shape: {test_images.shape}')
449 |         print(f' ref_images shape: {ref_images.shape}')
450 |         return None, None, None, False
451 | 
452 |     if test_images.dtype != ref_images.dtype:
453 |         print(f'Error: It seems the test images and the ref images have different dtypes. Modify your function so that they both have the same dtype.')
454 |         print(f' test_images dtype: {test_images.dtype}')
455 |         print(f' ref_images dtype: {ref_images.dtype}')
456 |         return None, None, None, False
457 | 
458 |     for i in range(ref_images.shape[0]):
459 |         if visualize_:
460 |             nrows, ncols = 1, 3
461 |             ax_w, ax_h = 5, 5
462 |             fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*ax_w, nrows*ax_h))
463 |             axes = np.array(axes).reshape(nrows, ncols)
464 | 
465 |         orig_image = orig_images[i]
466 |         ref_image = ref_images[i]
467 |         test_image = test_images[i]
468 | 
469 |         if visualize_:
470 |             ax = axes[0,0]
471 |             ax.pcolormesh(orig_image, edgecolors='k', linewidth=0.01, cmap='Greys')
472 |             ax.xaxis.tick_top()
473 |             ax.invert_yaxis()
474 | 
475 |             x_ticks = ax.get_xticks(minor=False).astype(int)
476 |             x_ticks = x_ticks[x_ticks < orig_image.shape[1]]
477 |             ax.set_xticks(x_ticks + 0.5)
478 |             ax.set_xticklabels(x_ticks.astype(int))
479 | 
480 |             y_ticks = ax.get_yticks(minor=False).astype(int)
481 |             y_ticks = y_ticks[y_ticks < orig_image.shape[0]]
482 |             ax.set_yticks(y_ticks + 0.5)
483 |             ax.set_yticklabels(y_ticks.astype(int))
484 | 
485 |             ax.set_aspect('equal')
486 |             ax.set_title('Raw Image')
487 | 
488 |             ax = axes[0,1]
489 |             ax.pcolormesh(ref_image, edgecolors='k', linewidth=0.01, cmap='Greys')
490 |             ax.xaxis.tick_top()
491 |             ax.invert_yaxis()
492 | 
493 |             x_ticks = ax.get_xticks(minor=False).astype(int)
494 |             x_ticks = x_ticks[x_ticks < ref_image.shape[1]]
495 |             ax.set_xticks(x_ticks+0.5)
496 |             ax.set_xticklabels(x_ticks.astype(int))
497 | 
498 |             y_ticks = ax.get_yticks(minor=False).astype(int)
499 |             y_ticks = y_ticks[y_ticks < ref_image.shape[0]]
500 |             ax.set_yticks(y_ticks+0.5)
501 |             ax.set_yticklabels(y_ticks.astype(int))
502 | 
503 |             ax.set_aspect('equal')
504 |             ax.set_title('Reference Solution Image')
505 | 
506 |             ax = axes[0,2]
507 |             ax.pcolormesh(test_image, edgecolors='k', linewidth=0.01, cmap='Greys')
508 |             ax.xaxis.tick_top()
509 |             ax.invert_yaxis()
510 | 
511 |             x_ticks = ax.get_xticks(minor=False).astype(int)
512 |             x_ticks = x_ticks[x_ticks < test_image.shape[1]]
513 |             ax.set_xticks(x_ticks + 0.5)
514 |             ax.set_xticklabels(x_ticks.astype(int))
515 | 
516 |             y_ticks = ax.get_yticks(minor=False).astype(int)
517 |             y_ticks = y_ticks[y_ticks < test_image.shape[0]]
518 |             ax.set_yticks(y_ticks + 0.5)
519 |             ax.set_yticklabels(y_ticks.astype(int))
520 | 
521 |             ax.set_aspect('equal')
522 |             ax.set_title('Your Solution Image')
523 | 
524 |         if np.allclose(ref_image, test_image):
525 |             if visualize_:
526 |                 print('The reference and solution images are the same to a T! Well done on this test case.')
527 |         else:
528 |             print('The reference and solution images are not the same...')
529 |             ineq_idxs = np.array(np.where(np.logical_not(np.isclose(ref_image, test_image))))[:,0].tolist()
530 |             print(f'ref_image{ineq_idxs}={ref_image[tuple(ineq_idxs)]}')
531 |             print(f'test_image{ineq_idxs}={test_image[tuple(ineq_idxs)]}')
532 |             if visualize_:
533 |                 print('The images will be returned so that you can diagnose the issue and resolve it...')
534 |             return (orig_image, ref_image, test_image, False)
535 | 
536 |         if visualize_:
537 |             plt.show()
538 |             input_prompt = ' Enter nothing to go to the next image\nor\n Enter "s" when you are done to receive the three images. \n'
539 |             input_prompt += ' **Don\'t forget to do this before continuing to the next step.**\n'
540 | 
541 |             try:
542 |                 cmd = input(input_prompt)
543 |             except KeyboardInterrupt:
544 |                 cmd = 's'
545 | 
546 |             if cmd.lower().startswith('s'):
547 |                 return (orig_image, ref_image, test_image, True)
548 |             else:
549 |                 clear_output(wait=True)
550 | 
551 |     return (orig_image, ref_image, test_image, True)
--------------------------------------------------------------------------------
/EMTopicModel-lib/payload_requirements.json:
--------------------------------------------------------------------------------
1 | {
2 |     "comment": "The files property is a list of additional file paths (relative to this homework's notebook dir) that will be bundled into the student's ipynb metadata each time they save.",
3 |     "files": []
4 | }
5 | 
--------------------------------------------------------------------------------
/EMTopicModel-lib/requirements1.txt:
--------------------------------------------------------------------------------
1 | scipy==1.5.3
2 | jupyter_client==6.1.11
3 | 
--------------------------------------------------------------------------------
/EMTopicModel-lib/test_db/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMTopicModel-lib/test_db/.DS_Store
--------------------------------------------------------------------------------
/EMTopicModel-lib/test_db/task_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMTopicModel-lib/test_db/task_1.npz
--------------------------------------------------------------------------------
/EMTopicModel-lib/test_db/task_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMTopicModel-lib/test_db/task_2.npz
--------------------------------------------------------------------------------
/EMTopicModel-lib/test_db/task_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMTopicModel-lib/test_db/task_3.npz
--------------------------------------------------------------------------------
/EMTopicModel-lib/words/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Keyuan125/CS441-AppliedMachineLearning/623c5307e6412e9a2fc59dd6213fc07fa412a861/EMTopicModel-lib/words/.DS_Store
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CS441-AppliedMachineLearning
2 | 
3 | These are my coding assignments for UIUC CS441: Applied Machine Learning.
4 | 
5 | Topics:
6 | - Basic Classification
7 | - Classifying Images
8 | - SVM using Stochastic Gradient Descent
9 | - Regression
10 | - GLMnet
11 | - PCA and PCoA
12 | - Clustering
13 | - High-Dimensional Classification
14 | - Expectation-Maximization for the Topic Model
15 | - Expectation-Maximization for Mixtures of Normals
16 | - Mean Field
17 | - Convolutional Neural Networks
18 | 
19 | 
--------------------------------------------------------------------------------
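
Editor's note: a typical grading flow, as wired through the `aml_utils.py` utilities above (a minimal sketch; the task function and task id below are hypothetical, and it assumes the notebook sits next to the matching `*-lib` folder so that `test_db/task_*.npz` resolves):

    from aml_utils import test_case_checker

    def double_it(x):
        # Hypothetical stand-in for a homework task solution; the checker deep-copies
        # the stored arguments, calls this function, and compares against the reference.
        return x * 2

    result = test_case_checker(double_it, task_id=1)
    assert result['passed'], result['message']  # on failure, the message embeds a reproducing snippet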