├── 1 SVM_DecisionTree.ipynb ├── 1_reg.ipynb ├── Credit_Card_Fraud_Detection_using_Scikit_Learn_and_Snap_ML.ipynb ├── Descision_Tree.ipynb ├── Faster_Taxi_Tip_Prediction_using_Snap_ML.ipynb ├── GridSearchCV_Hyperparameter_Tuning_in_Machine_Learning.ipynb ├── Introduction_Machine Learning.pdf ├── K_Means_Clustering.ipynb ├── Logistic_Regression.ipynb ├── ML0101EN_SkillUp_FinalAssignment.ipynb ├── ML0101EN_SkillUp_FinalAssignment.jupyterlite (1) (2).ipynb ├── ML0101EN_SkillUp_FinalAssignment.jupyterlite.ipynb ├── README.md ├── Regression_Tree.ipynb ├── Softmax_Regression,_One_vs_All_and_One_vs_One_for_Multi_class_Classificationipynb.ipynb ├── Support_Vector_Machines.ipynb └── ml_knn_0.ipynb /GridSearchCV_Hyperparameter_Tuning_in_Machine_Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "**GridSearchCV: Hyperparameter Tuning in Machine Learning**\n", 21 | "\n", 22 | "---\n", 23 | "\n" 24 | ], 25 | "metadata": { 26 | "id": "3f7f-SxgQkFt" 27 | } 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "source": [ 32 | "Introduction to GridSearchCV :" 33 | ], 34 | "metadata": { 35 | "id": "h2H8lOzMQsCm" 36 | } 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "source": [ 41 | "GridSearchCV in Scikit-Learn is a vital tool for hyperparameter tuning, performing an exhaustive search over specified parameter values for an estimator. It systematically evaluates each combination using cross-validation to identify the optimal settings that maximize model performance based on a scoring metric like accuracy or F1-score. Hyperparameter tuning is crucial as it significantly impacts model performance, preventing underfitting or overfitting. GridSearchCV automates this process, ensuring robust generalization on unseen data. It helps data scientists efficiently find the best hyperparameters, saving time and resources while optimizing model performance, making it an essential tool in the machine learning pipeline." 42 | ], 43 | "metadata": { 44 | "id": "CY0y1-e7Qn0S" 45 | } 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "source": [ 50 | "Parameters of GridSearchCV :" 51 | ], 52 | "metadata": { 53 | "id": "a76iGl6LQqBS" 54 | } 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "source": [ 59 | "\n", 60 | "\n", 61 | "**Estimator:** The model or pipeline to be optimized. This can be any Scikit-Learn estimator like LogisticRegression(),SVC(), RandomForestClassifier(), etc.\n", 62 | "\n", 63 | "**param_grid:** A dictionary or list of dictionaries with parameter names (as strings) as keys and lists of parameter settings to try as values. Using param_grid, you can specify the hyperparameters for various models to find the optimal combination.\n", 64 | "\n", 65 | "Examples of various models hyperparameters for the param_grid parameter.\n", 66 | "\n", 67 | "**Logistic Regression:** When tuning a logistic regression model, GridSearchCV can search through different values of C, penalty, and solver to find the best parameters." 
68 | ], 69 | "metadata": { 70 | "id": "9MoVTnGIQwQE" 71 | } 72 | }, 73 | { 74 | "cell_type": "code", 75 | "source": [ 76 | "parameters = {'C': [0.01, 0.1, 1],\n", 77 | " 'penalty': ['l2'],\n", 78 | " 'solver': ['lbfgs']}" 79 | ], 80 | "metadata": { 81 | "id": "44neLi1WQ5is" 82 | }, 83 | "execution_count": 2, 84 | "outputs": [] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "source": [ 89 | "C: Inverse of regularization strength; smaller values specify stronger regularization.\n", 90 | "penalty: Specifies the norm of the penalty; 'l2' is ridge regression.\n", 91 | "solver: Algorithm to use in the optimization problem.\n", 92 | "\n", 93 | "**Support Vector Machine**: For SVM, GridSearchCV can explore different kernels, C values, and gamma settings to optimize the model.\n" 94 | ], 95 | "metadata": { 96 | "id": "0ZcK8AQFQ9nA" 97 | } 98 | }, 99 | { 100 | "cell_type": "code", 101 | "source": [ 102 | "import numpy as np\n", 103 | "parameters = {'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],\n", 104 | " 'C': np.logspace(-3, 3, 5),\n", 105 | " 'gamma': np.logspace(-3, 3, 5)}" 106 | ], 107 | "metadata": { 108 | "id": "6eOejsn8RE7m" 109 | }, 110 | "execution_count": 4, 111 | "outputs": [] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "source": [ 116 | "**kernel:** Specifies the kernel type to be used in the algorithm.\n", 117 | "**C:** Regularization parameter.\n", 118 | "gamma: Kernel coefficient.\n", 119 | "\n", 120 | "**Decision Tree Classifier:** In the case of a decision tree, GridSearchCV can test various criteria, splitters, depths, and other parameters to find the best configuration." 121 | ], 122 | "metadata": { 123 | "id": "wHgkH5_MRM7a" 124 | } 125 | }, 126 | { 127 | "cell_type": "code", 128 | "source": [ 129 | "parameters = {'criterion': ['gini', 'entropy'],\n", 130 | " 'splitter': ['best', 'random'],\n", 131 | " 'max_depth': [2*n for n in range(1, 10)],\n", 132 | " 'max_features': ['auto', 'sqrt'],\n", 133 | " 'min_samples_leaf': [1, 2, 4],\n", 134 | " 'min_samples_split': [2, 5, 10]}" 135 | ], 136 | "metadata": { 137 | "id": "NpFfhzuPRTf_" 138 | }, 139 | "execution_count": 5, 140 | "outputs": [] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "source": [ 145 | "**criterion:** The function to measure the quality of a split.\n", 146 | "**splitter:** The strategy used to choose the split at each node.\n", 147 | "**max_depth**: The maximum depth of the tree.\n", 148 | "**max_features**: The number of features to consider when looking for the best split.\n", 149 | "**min_samples_leaf:** The minimum number of samples required to be at a leaf node.\n", 150 | "**min_samples_split**: The minimum number of samples required to split an internal node.\n", 151 | "\n", 152 | "**K-Nearest Neighbors**: For KNN, GridSearchCV can try different numbers of neighbors, algorithms, and power parameters to determine the best model." 
153 | ], 154 | "metadata": { 155 | "id": "M4hKKtz0RXGO" 156 | } 157 | }, 158 | { 159 | "cell_type": "code", 160 | "source": [ 161 | "parameters = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n", 162 | " 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],\n", 163 | " 'p': [1, 2]}" 164 | ], 165 | "metadata": { 166 | "id": "2-Kjeq0JRdSd" 167 | }, 168 | "execution_count": 6, 169 | "outputs": [] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "source": [ 174 | "n_neighbors: Number of neighbors to use.\n", 175 | "algorithm: Algorithm used to compute the nearest neighbors.\n", 176 | "p: Power parameter for the Minkowski metric.\n", 177 | "\n", 178 | "scoring: A single string or callable to evaluate the predictions on the test set. Common options include accuracy, f1, roc_auc, etc. If none, the estimator's default scorer is used.\n", 179 | "\n", 180 | "n_jobs: The number of jobs to run in parallel. -1 means using all processors.\n", 181 | "\n", 182 | "pre_dispatch: Controls the number of jobs that get dispatched during parallel execution. It can be an integer or expressions like 2n_jobs, 3n_jobs, etc., to limit the number of jobs dispatched at once.\n", 183 | "\n", 184 | "refit: If True, refits the best estimator with the entire dataset. The best estimator is stored in the best_estimator_ attribute. Default is True.\n", 185 | "\n", 186 | "cv: Determines the cross-validation splitting strategy. It can be an integer to specify the number of folds, a cross-validation generator, or an iterable. Default is 5-fold cross-validation.\n", 187 | "\n", 188 | "verbose: Controls the verbosity level. Higher values indicate more messages. verbose=0 is silent, verbose=1 shows some messages, and verbose=2 shows more messages.\n", 189 | "\n", 190 | "return_train_score: If False, the cv_results_ attribute will not include training scores. Default is False.\n", 191 | "\n", 192 | "error_score: Value to assign to the score if an error occurs in estimator fitting. np.nan is the default, but it can be set to a specific value." 
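,
"\n",
"\n",
"Putting several of these parameters together in a single call (an illustrative sketch, not part of the original notebook; the estimator and grid here are placeholders):\n",
"\n",
"```python\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# Hedged example wiring up the parameters described above\n",
"search = GridSearchCV(estimator=LogisticRegression(max_iter=1000),\n",
"                      param_grid={'C': [0.01, 0.1, 1]},\n",
"                      scoring='accuracy',      # evaluation metric\n",
"                      n_jobs=-1,               # use all processors\n",
"                      refit=True,              # refit the best estimator on the full training data\n",
"                      cv=5,                    # 5-fold cross-validation\n",
"                      verbose=1,\n",
"                      return_train_score=True)\n",
"# search.fit(X_train, y_train) would then run the exhaustive search\n",
"```"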
193 | ], 194 | "metadata": { 195 | "id": "r_gNfRFrRgSC" 196 | } 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "source": [ 201 | "Applications and Advantages of GridSearchCV\n", 202 | "\n", 203 | "---\n", 204 | "**Model Selection:** GridSearchCV enables the comparison of multiple models and facilitates the selection of the best-performing one for a given data set.\n", 205 | "\n", 206 | "**Hyperparameter Tuning:** It automates the process of finding the optimal hyperparameters, which can significantly improve the performance of machine learning models.\n", 207 | "\n", 208 | "**Pipeline Optimization: **GridSearchCV can be applied to complex pipelines involving multiple preprocessing steps and models to optimize the entire workflow.\n", 209 | "\n", 210 | "**Cross-Validation:** It incorporates cross-validation in the parameter search process, ensuring that the model's performance is robust and not overfitted to a particular train-test split.\n", 211 | "\n", 212 | "**Exhaustive Search:** GridSearchCV performs an exhaustive search over the specified parameter grid, ensuring that the best combination of parameters is found.\n", 213 | "\n", 214 | "**Parallel Execution:** With the n_jobs parameter, it can leverage multiple processors to speed up the search process.\n", 215 | "\n", 216 | "**Automatic Refit:** By setting refit=True, GridSearchCV automatically refits the model with the best parameters on the entire data set, making it ready for use.\n", 217 | "\n", 218 | "**Detailed Output:** The cv_results_ attribute provides detailed information about the performance of each parameter combination, including training and validation scores, which helps in understanding the model's\n" 219 | ], 220 | "metadata": { 221 | "id": "3jcn4ODQRhDm" 222 | } 223 | }, 224 | { 225 | "cell_type": "code", 226 | "source": [ 227 | "import numpy as np\n", 228 | "import pandas as pd\n", 229 | "from sklearn.datasets import load_iris\n", 230 | "from sklearn.model_selection import train_test_split, GridSearchCV\n", 231 | "from sklearn.svm import SVC\n", 232 | "from sklearn.metrics import classification_report\n", 233 | "import warnings\n", 234 | "# Ignore warnings\n", 235 | "warnings.filterwarnings('ignore')" 236 | ], 237 | "metadata": { 238 | "id": "nTfdJOiURzmC" 239 | }, 240 | "execution_count": 7, 241 | "outputs": [] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "source": [ 246 | "Load the Iris data set: The Iris data set is a classic data set in machine learning. Load it using the load_iris function from Scikit-Learn." 247 | ], 248 | "metadata": { 249 | "id": "ytglajJuR2en" 250 | } 251 | }, 252 | { 253 | "cell_type": "code", 254 | "source": [ 255 | "iris = load_iris()\n", 256 | "X = iris.data\n", 257 | "y = iris.target" 258 | ], 259 | "metadata": { 260 | "id": "4wfz6Lg2R23o" 261 | }, 262 | "execution_count": 8, 263 | "outputs": [] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "source": [ 268 | "**X:** Features of the Iris dataset (sepal length, sepal width, petal length, petal width).\n", 269 | "\n", 270 | "**y:** Target labels representing the three species of Iris (setosa, versicolor, virginica).\n", 271 | "\n", 272 | "* **Splitting the data into training and test set:** Divide data set into training and test sets to evaluate how well the model performs on data it has not been trained on." 
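,
"\n",
"\n",
"For classification data such as Iris, a stratified split keeps class proportions similar in both subsets. A minimal variant (illustrative only; the notebook itself uses a plain split below):\n",
"\n",
"```python\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# stratify=y preserves the per-class ratios of y in both splits\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)\n",
"```"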
273 | ], 274 | "metadata": { 275 | "id": "UIY6sHE_R6Sm" 276 | } 277 | }, 278 | { 279 | "cell_type": "code", 280 | "source": [ 281 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)" 282 | ], 283 | "metadata": { 284 | "id": "sEpFA766SCXQ" 285 | }, 286 | "execution_count": 9, 287 | "outputs": [] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "source": [ 292 | "* test_size=0.2: 20% of the data is used for testing.\n", 293 | "* random_state=42: Ensures reproducibility of the random split.\n", 294 | "\n", 295 | "**Define the parameter grid:** Specify a grid of hyperparameters for the SVM model to search over. The grid includes different values for C, gamma, and kernel." 296 | ], 297 | "metadata": { 298 | "id": "UxOu8nYuSHea" 299 | } 300 | }, 301 | { 302 | "cell_type": "code", 303 | "source": [ 304 | "param_grid = {\n", 305 | " 'C': [0.1, 1, 10, 100],\n", 306 | " 'gamma': [1, 0.1, 0.01, 0.001],\n", 307 | " 'kernel': ['linear', 'rbf', 'poly']\n", 308 | "}" 309 | ], 310 | "metadata": { 311 | "id": "I7QTJTsTSO35" 312 | }, 313 | "execution_count": 10, 314 | "outputs": [] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "source": [ 319 | "**C:** Regularization parameter.\n", 320 | "\n", 321 | "\n", 322 | "**gamma:** Kernel coefficient.\n", 323 | "\n", 324 | "\n", 325 | "**kernel:** Specifies the type of kernel to be used in the algorithm.\n", 326 | "\n", 327 | "**Initialize the SVC model:** Create an instance of the support vector classifier (SVC)." 328 | ], 329 | "metadata": { 330 | "id": "tQUgxTKYSRKY" 331 | } 332 | }, 333 | { 334 | "cell_type": "code", 335 | "source": [ 336 | "svc = SVC()" 337 | ], 338 | "metadata": { 339 | "id": "-IBz3NZJST8u" 340 | }, 341 | "execution_count": 11, 342 | "outputs": [] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "source": [ 347 | "**Initialize GridSearchCV:** Set up the GridSearchCV with the SVC model, the parameter grid, and the desired configuration." 348 | ], 349 | "metadata": { 350 | "id": "OIqAGwAsSZXF" 351 | } 352 | }, 353 | { 354 | "cell_type": "code", 355 | "source": [ 356 | "grid_search = GridSearchCV(estimator=svc, param_grid=param_grid, scoring='accuracy', cv=5, n_jobs=-1, verbose=2)" 357 | ], 358 | "metadata": { 359 | "id": "LvgdI5xZSawU" 360 | }, 361 | "execution_count": 12, 362 | "outputs": [] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "source": [ 367 | "**estimator:** The model to optimize (SVC).\n", 368 | "\n", 369 | "**param_grid:** The grid of hyperparameters.\n", 370 | "\n", 371 | "**scoring='accuracy'**: The metric used to evaluate the model's performance.\n", 372 | "\n", 373 | "**cv=5:** 5-fold cross-validation.\n", 374 | "\n", 375 | "\n", 376 | "**n_jobs=-1:** Use all available processors.\n", 377 | "\n", 378 | "\n", 379 | "**verbose=2:** Show detailed output during the search.\n", 380 | "\n", 381 | "**Fit GridSearchCV to the training data:** Perform the grid search on the training data." 
382 | ], 383 | "metadata": { 384 | "id": "C8Q4D7jZSeD5" 385 | } 386 | }, 387 | { 388 | "cell_type": "code", 389 | "source": [ 390 | "grid_search.fit(X_train, y_train)" 391 | ], 392 | "metadata": { 393 | "colab": { 394 | "base_uri": "https://localhost:8080/", 395 | "height": 134 396 | }, 397 | "id": "QVpSXkKrSp42", 398 | "outputId": "bf353c6f-5371-4d38-9b8c-456265e5a506" 399 | }, 400 | "execution_count": 13, 401 | "outputs": [ 402 | { 403 | "output_type": "stream", 404 | "name": "stdout", 405 | "text": [ 406 | "Fitting 5 folds for each of 48 candidates, totalling 240 fits\n" 407 | ] 408 | }, 409 | { 410 | "output_type": "execute_result", 411 | "data": { 412 | "text/plain": [ 413 | "GridSearchCV(cv=5, estimator=SVC(), n_jobs=-1,\n", 414 | " param_grid={'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001],\n", 415 | " 'kernel': ['linear', 'rbf', 'poly']},\n", 416 | " scoring='accuracy', verbose=2)" 417 | ], 418 | "text/html": [ 419 | "
GridSearchCV(cv=5, estimator=SVC(), n_jobs=-1,\n",
420 |               "             param_grid={'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001],\n",
421 |               "                         'kernel': ['linear', 'rbf', 'poly']},\n",
422 |               "             scoring='accuracy', verbose=2)
" 426 | ] 427 | }, 428 | "metadata": {}, 429 | "execution_count": 13 430 | } 431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "source": [ 436 | "**Check the best parameters and estimator:** After fitting, print the best parameters and the best estimator found during the grid search." 437 | ], 438 | "metadata": { 439 | "id": "wmYw7XBsStBo" 440 | } 441 | }, 442 | { 443 | "cell_type": "code", 444 | "source": [ 445 | "print(\"Best parameters found: \", grid_search.best_params_)\n", 446 | "print(\"Best estimator: \", grid_search.best_estimator_)" 447 | ], 448 | "metadata": { 449 | "colab": { 450 | "base_uri": "https://localhost:8080/" 451 | }, 452 | "id": "1sgiwPZoSu_5", 453 | "outputId": "b28c4266-21f5-4469-f65e-9d87d8cd2010" 454 | }, 455 | "execution_count": 14, 456 | "outputs": [ 457 | { 458 | "output_type": "stream", 459 | "name": "stdout", 460 | "text": [ 461 | "Best parameters found: {'C': 0.1, 'gamma': 0.1, 'kernel': 'poly'}\n", 462 | "Best estimator: SVC(C=0.1, gamma=0.1, kernel='poly')\n" 463 | ] 464 | } 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "source": [ 470 | "**Make predictions with the best estimator:** Use the best estimator to make predictions on the test set." 471 | ], 472 | "metadata": { 473 | "id": "nkWaMm4SSx6C" 474 | } 475 | }, 476 | { 477 | "cell_type": "code", 478 | "source": [ 479 | "\n", 480 | "y_pred = grid_search.best_estimator_.predict(X_test)" 481 | ], 482 | "metadata": { 483 | "id": "RtUmVrLQSzIp" 484 | }, 485 | "execution_count": 15, 486 | "outputs": [] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "source": [ 491 | "**Evaluate the performance**: Evaluate the model's performance on the test set using the classification_report function, which provides precision, recall, F1-score, and support for each class." 492 | ], 493 | "metadata": { 494 | "id": "wn49nWYvS2M-" 495 | } 496 | }, 497 | { 498 | "cell_type": "code", 499 | "source": [ 500 | "\n", 501 | "print(classification_report(y_test, y_pred))" 502 | ], 503 | "metadata": { 504 | "colab": { 505 | "base_uri": "https://localhost:8080/" 506 | }, 507 | "id": "7tpHrOS2S7h1", 508 | "outputId": "8cddc298-d5ba-43b7-ce27-4b640066b262" 509 | }, 510 | "execution_count": 16, 511 | "outputs": [ 512 | { 513 | "output_type": "stream", 514 | "name": "stdout", 515 | "text": [ 516 | " precision recall f1-score support\n", 517 | "\n", 518 | " 0 1.00 1.00 1.00 10\n", 519 | " 1 1.00 1.00 1.00 9\n", 520 | " 2 1.00 1.00 1.00 11\n", 521 | "\n", 522 | " accuracy 1.00 30\n", 523 | " macro avg 1.00 1.00 1.00 30\n", 524 | "weighted avg 1.00 1.00 1.00 30\n", 525 | "\n" 526 | ] 527 | } 528 | ] 529 | }, 530 | { 531 | "cell_type": "markdown", 532 | "source": [ 533 | "**Key Points**\n", 534 | "\n", 535 | "---\n", 536 | "\n", 537 | "\n", 538 | "* GridSearchCV conducts a thorough exploration across a defined parameter grid.\n", 539 | "* Parameters include the estimator to optimize, parameter grid, scoring method, number of jobs for parallel execution, cross-validation strategy, and verbosity.\n", 540 | "* Practical example demonstrated using GridSearchCV to find the optimal parameters for an SVC model on the Iris data set.\n", 541 | "* GridSearchCV helps in selecting the best model by evaluating multiple combinations of hyperparameters." 
542 | ], 543 | "metadata": { 544 | "id": "KPatSgDHS8h2" 545 | } 546 | } 547 | ] 548 | } -------------------------------------------------------------------------------- /Introduction_Machine Learning.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SID41214/Machine_Learning/a7ed7c845c5020a578e2ae30e1b6ce67bb7fb732/Introduction_Machine Learning.pdf -------------------------------------------------------------------------------- /ML0101EN_SkillUp_FinalAssignment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "kernelspec": { 4 | "name": "python", 5 | "display_name": "Python (Pyodide)", 6 | "language": "python" 7 | }, 8 | "language_info": { 9 | "codemirror_mode": { 10 | "name": "python", 11 | "version": 3 12 | }, 13 | "file_extension": ".py", 14 | "mimetype": "text/x-python", 15 | "name": "python", 16 | "nbconvert_exporter": "python", 17 | "pygments_lexer": "ipython3", 18 | "version": "3.8" 19 | }, 20 | "prev_pub_hash": "ba039b1c59dfa11e53b73e3fc8c403e1e8b43c7aedf6f7e0b1d1e7914b44d98a" 21 | }, 22 | "nbformat_minor": 4, 23 | "nbformat": 4, 24 | "cells": [ 25 | { 26 | "cell_type": "markdown", 27 | "source": "
Skills Network Logo\n\n# Final Project: Classification with Python\n", 28 | "metadata": {} 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "source": "
## Table of Contents\n\n**Estimated Time Needed:** 180 min
\n", 33 | "metadata": {} 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "source": "# Instructions\n", 38 | "metadata": {} 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "source": "In this notebook, you will practice all the classification algorithms that we have learned in this course.\n\n\nBelow, is where we are going to use the classification algorithms to create a model based on our training data and evaluate our testing data using evaluation metrics learned in the course.\n\nWe will use some of the algorithms taught in the course, specifically:\n\n1. Linear Regression\n2. KNN\n3. Decision Trees\n4. Logistic Regression\n5. SVM\n\nWe will evaluate our models using:\n\n1. Accuracy Score\n2. Jaccard Index\n3. F1-Score\n4. LogLoss\n5. Mean Absolute Error\n6. Mean Squared Error\n7. R2-Score\n\nFinally, you will use your models to generate the report at the end. \n", 43 | "metadata": {} 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "source": "# About The Dataset\n", 48 | "metadata": {} 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "source": "The original source of the data is Australian Government's Bureau of Meteorology and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).\n\nThe dataset to be used has extra columns like 'RainToday' and our target is 'RainTomorrow', which was gathered from the Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)\n\n\n", 53 | "metadata": {} 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "source": "This dataset contains observations of weather metrics for each day from 2008 to 2017. 
The **weatherAUS.csv** dataset includes the following fields:\n\n| Field | Description | Unit | Type |\n| ------------- | ----------------------------------------------------- | --------------- | ------ |\n| Date | Date of the Observation in YYYY-MM-DD | Date | object |\n| Location | Location of the Observation | Location | object |\n| MinTemp | Minimum temperature | Celsius | float |\n| MaxTemp | Maximum temperature | Celsius | float |\n| Rainfall | Amount of rainfall | Millimeters | float |\n| Evaporation | Amount of evaporation | Millimeters | float |\n| Sunshine | Amount of bright sunshine | hours | float |\n| WindGustDir | Direction of the strongest gust | Compass Points | object |\n| WindGustSpeed | Speed of the strongest gust | Kilometers/Hour | object |\n| WindDir9am | Wind direction averaged of 10 minutes prior to 9am | Compass Points | object |\n| WindDir3pm | Wind direction averaged of 10 minutes prior to 3pm | Compass Points | object |\n| WindSpeed9am | Wind speed averaged of 10 minutes prior to 9am | Kilometers/Hour | float |\n| WindSpeed3pm | Wind speed averaged of 10 minutes prior to 3pm | Kilometers/Hour | float |\n| Humidity9am | Humidity at 9am | Percent | float |\n| Humidity3pm | Humidity at 3pm | Percent | float |\n| Pressure9am | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal | float |\n| Pressure3pm | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal | float |\n| Cloud9am | Fraction of the sky obscured by cloud at 9am | Eights | float |\n| Cloud3pm | Fraction of the sky obscured by cloud at 3pm | Eights | float |\n| Temp9am | Temperature at 9am | Celsius | float |\n| Temp3pm | Temperature at 3pm | Celsius | float |\n| RainToday | If there was rain today | Yes/No | object |\n| RainTomorrow | If there is rain tomorrow | Yes/No | float |\n\nColumn definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)\n\n", 58 | "metadata": {} 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "source": "## **Import the required libraries**\n", 63 | "metadata": {} 64 | }, 65 | { 66 | "cell_type": "code", 67 | "source": "# All Libraries required for this lab are listed below. 
The libraries pre-installed on Skills Network Labs are commented.\n# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1\n# Note: If your environment doesn't support \"!mamba install\", use \"!pip install\"", 68 | "metadata": { 69 | "trusted": true 70 | }, 71 | "outputs": [], 72 | "execution_count": null 73 | }, 74 | { 75 | "cell_type": "code", 76 | "source": "# Surpress warnings:\ndef warn(*args, **kwargs):\n pass\nimport warnings\nwarnings.warn = warn", 77 | "metadata": { 78 | "trusted": true 79 | }, 80 | "outputs": [], 81 | "execution_count": null 82 | }, 83 | { 84 | "cell_type": "code", 85 | "source": "import pandas as pd\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.linear_model import LinearRegression\nfrom sklearn import preprocessing\nimport numpy as np\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn import svm\nfrom sklearn.metrics import jaccard_score\nfrom sklearn.metrics import f1_score\nfrom sklearn.metrics import log_loss\nfrom sklearn.metrics import confusion_matrix, accuracy_score\nimport sklearn.metrics as metrics", 86 | "metadata": { 87 | "trusted": true 88 | }, 89 | "outputs": [], 90 | "execution_count": null 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "source": "### Importing the Dataset\n", 95 | "metadata": {} 96 | }, 97 | { 98 | "cell_type": "code", 99 | "source": "from pyodide.http import pyfetch\n\nasync def download(url, filename):\n response = await pyfetch(url)\n if response.status == 200:\n with open(filename, \"wb\") as f:\n f.write(await response.bytes())", 100 | "metadata": { 101 | "trusted": true 102 | }, 103 | "outputs": [], 104 | "execution_count": null 105 | }, 106 | { 107 | "cell_type": "code", 108 | "source": "path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'", 109 | "metadata": { 110 | "trusted": true 111 | }, 112 | "outputs": [], 113 | "execution_count": null 114 | }, 115 | { 116 | "cell_type": "code", 117 | "source": "await download(path, \"Weather_Data.csv\")\nfilename =\"Weather_Data.csv\"", 118 | "metadata": { 119 | "trusted": true 120 | }, 121 | "outputs": [], 122 | "execution_count": null 123 | }, 124 | { 125 | "cell_type": "code", 126 | "source": "df = pd.read_csv(\"Weather_Data.csv\")", 127 | "metadata": { 128 | "trusted": true 129 | }, 130 | "outputs": [], 131 | "execution_count": null 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "source": "> Note: This version of the lab is designed for JupyterLite, which necessitates downloading the dataset to the interface. However, when working with the downloaded version of this notebook on your local machines (Jupyter Anaconda), you can simply **skip the steps above of \"Importing the Dataset\"** and use the URL directly in the `pandas.read_csv()` function. 
You can uncomment and run the statements in the cell below.\n", 136 | "metadata": {} 137 | }, 138 | { 139 | "cell_type": "code", 140 | "source": "#filepath = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv\"\n#df = pd.read_csv(filepath)", 141 | "metadata": { 142 | "trusted": true 143 | }, 144 | "outputs": [], 145 | "execution_count": null 146 | }, 147 | { 148 | "cell_type": "code", 149 | "source": "df.head()", 150 | "metadata": { 151 | "trusted": true 152 | }, 153 | "outputs": [], 154 | "execution_count": null 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "source": "### Data Preprocessing\n", 159 | "metadata": {} 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "source": "#### One Hot Encoding\n", 164 | "metadata": {} 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "source": "First, we need to perform one hot encoding to convert categorical variables to binary variables.\n", 169 | "metadata": {} 170 | }, 171 | { 172 | "cell_type": "code", 173 | "source": "df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])", 174 | "metadata": { 175 | "trusted": true 176 | }, 177 | "outputs": [], 178 | "execution_count": null 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "source": "Next, we replace the values of the 'RainTomorrow' column changing them from a categorical column to a binary column. We do not use the `get_dummies` method because we would end up with two columns for 'RainTomorrow' and we do not want, since 'RainTomorrow' is our target.\n", 183 | "metadata": {} 184 | }, 185 | { 186 | "cell_type": "code", 187 | "source": "df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)", 188 | "metadata": { 189 | "trusted": true 190 | }, 191 | "outputs": [], 192 | "execution_count": null 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "source": "### Training Data and Test Data\n", 197 | "metadata": {} 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "source": "Now, we set our 'features' or x values and our Y or target variable.\n", 202 | "metadata": {} 203 | }, 204 | { 205 | "cell_type": "code", 206 | "source": "df_sydney_processed.drop('Date',axis=1,inplace=True)", 207 | "metadata": { 208 | "trusted": true 209 | }, 210 | "outputs": [], 211 | "execution_count": null 212 | }, 213 | { 214 | "cell_type": "code", 215 | "source": "df_sydney_processed = df_sydney_processed.astype(float)", 216 | "metadata": { 217 | "trusted": true 218 | }, 219 | "outputs": [], 220 | "execution_count": null 221 | }, 222 | { 223 | "cell_type": "code", 224 | "source": "features = df_sydney_processed.drop(columns='RainTomorrow', axis=1)\nY = df_sydney_processed['RainTomorrow']", 225 | "metadata": { 226 | "trusted": true 227 | }, 228 | "outputs": [], 229 | "execution_count": null 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "source": "### Linear Regression\n", 234 | "metadata": {} 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "source": "#### Q1) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.\n", 239 | "metadata": {} 240 | }, 241 | { 242 | "cell_type": "code", 243 | "source": "# Split the dataset into training and testing sets\nx_train, y_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)\n\n# Output the shapes of the resulting datasets to confirm the split\nprint(\"X_train 
shape:\", X_train.shape)\nprint(\"X_test shape:\", X_test.shape)\nprint(\"y_train shape:\", y_train.shape)\nprint(\"y_test shape:\", y_test.shape)\n\n\n", 244 | "metadata": { 245 | "trusted": true 246 | }, 247 | "outputs": [], 248 | "execution_count": null 249 | }, 250 | { 251 | "cell_type": "code", 252 | "source": "x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)", 253 | "metadata": { 254 | "trusted": true 255 | }, 256 | "outputs": [], 257 | "execution_count": null 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "source": "#### Q2) Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).\n", 262 | "metadata": {} 263 | }, 264 | { 265 | "cell_type": "code", 266 | "source": "# Step 1: Import the LinearRegression model from sklearn\nfrom sklearn.linear_model import LinearRegression\n\n# Step 2: Create an instance of the LinearRegression model\nLinearReg = LinearRegression()\n\n# Step 3: Train the Linear Regression model using the training data\nLinearReg.fit(x_train, y_train)\n\n# Output the coefficients and intercept of the trained model\nprint(\"Coefficients:\", LinearReg.coef_)\nprint(\"Intercept:\", LinearReg.intercept_)\n", 267 | "metadata": { 268 | "trusted": true 269 | }, 270 | "outputs": [], 271 | "execution_count": null 272 | }, 273 | { 274 | "cell_type": "code", 275 | "source": "LinearReg = LinearRegression()", 276 | "metadata": { 277 | "trusted": true 278 | }, 279 | "outputs": [], 280 | "execution_count": null 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "source": "#### Q3) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n", 285 | "metadata": {} 286 | }, 287 | { 288 | "cell_type": "code", 289 | "source": "# Step 1: Use the predict method on the testing data\npredictions = LinearReg.predict(X_test)\n\n# Step 2: Output the predictions\nprint(predictions)\n", 290 | "metadata": { 291 | "trusted": true 292 | }, 293 | "outputs": [], 294 | "execution_count": null 295 | }, 296 | { 297 | "cell_type": "code", 298 | "source": "predictions = LinearReg.predict(x_test)", 299 | "metadata": { 300 | "trusted": true 301 | }, 302 | "outputs": [], 303 | "execution_count": null 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "source": "#### Q4) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n", 308 | "metadata": {} 309 | }, 310 | { 311 | "cell_type": "code", 312 | "source": "from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n\n# Calculate the Mean Absolute Error\nmae = mean_absolute_error(y_test, predictions)\n\n# Calculate the Mean Squared Error\nmse = mean_squared_error(y_test, predictions)\n\n# Calculate the R2 Score\nr2 = r2_score(y_test, predictions)\n\n# Output the metrics\nprint(\"Mean Absolute Error (MAE):\", mae)\nprint(\"Mean Squared Error (MSE):\", mse)\nprint(\"R2 Score:\", r2)\n", 313 | "metadata": { 314 | "trusted": true 315 | }, 316 | "outputs": [], 317 | "execution_count": null 318 | }, 319 | { 320 | "cell_type": "code", 321 | "source": "LinearRegression_MAE = mae\nLinearRegression_MSE = mse\nLinearRegression_R2 = r2", 322 | "metadata": { 323 | "trusted": true 324 | }, 325 | "outputs": [], 326 | "execution_count": null 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "source": "#### Q5) Show the MAE, MSE, and R2 in a tabular format using data frame for the linear model.\n", 331 | "metadata": {} 332 | }, 333 | { 334 
| "cell_type": "code", 335 | "source": "import pandas as pd\n\n# Create a dictionary with the metrics\nmetrics_dict = {\n 'Metric': ['Mean Absolute Error (MAE)', 'Mean Squared Error (MSE)', 'R2 Score'],\n 'Value': [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]\n}\n\n# Convert the dictionary to a DataFrame\nmetrics_df = pd.DataFrame(metrics_dict)\n\n# Display the DataFrame\nmetrics_df", 336 | "metadata": { 337 | "trusted": true 338 | }, 339 | "outputs": [], 340 | "execution_count": null 341 | }, 342 | { 343 | "cell_type": "code", 344 | "source": "Report = metrics_df", 345 | "metadata": { 346 | "trusted": true 347 | }, 348 | "outputs": [], 349 | "execution_count": null 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "source": "### KNN\n", 354 | "metadata": {} 355 | }, 356 | { 357 | "cell_type": "markdown", 358 | "source": "#### Q6) Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.\n", 359 | "metadata": {} 360 | }, 361 | { 362 | "cell_type": "code", 363 | "source": "from sklearn.neighbors import KNeighborsClassifier\n\n# Create the KNN model with n_neighbors set to 4\nKNN = KNeighborsClassifier(n_neighbors=4)\n\n# Train the model using the training data\nKNN.fit(x_train, y_train)\n", 364 | "metadata": { 365 | "trusted": true 366 | }, 367 | "outputs": [], 368 | "execution_count": null 369 | }, 370 | { 371 | "cell_type": "code", 372 | "source": "KNN = KNeighborsClassifier(n_neighbors=4)\n", 373 | "metadata": { 374 | "trusted": true 375 | }, 376 | "outputs": [], 377 | "execution_count": null 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "source": "#### Q7) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n", 382 | "metadata": {} 383 | }, 384 | { 385 | "cell_type": "code", 386 | "source": "# Use the predict method to make predictions on the testing data\npredictions_knn = KNN.predict(x_test)\n", 387 | "metadata": { 388 | "trusted": true 389 | }, 390 | "outputs": [], 391 | "execution_count": null 392 | }, 393 | { 394 | "cell_type": "code", 395 | "source": "predictions = predictions_knn ", 396 | "metadata": { 397 | "trusted": true 398 | }, 399 | "outputs": [], 400 | "execution_count": null 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "source": "#### Q8) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n", 405 | "metadata": {} 406 | }, 407 | { 408 | "cell_type": "code", 409 | "source": "# Calculate accuracy\nknn_accuracy = accuracy_score(y_test, predictions)\n\n# Calculate F1-score\nknn_f1_score = f1_score(y_test, predictions)\n\n# Calculate Jaccard index\nknn_jaccard_index = jaccard_score(y_test, predictions)\n", 410 | "metadata": { 411 | "trusted": true 412 | }, 413 | "outputs": [], 414 | "execution_count": null 415 | }, 416 | { 417 | "cell_type": "code", 418 | "source": "KNN_Accuracy_Score = knn_accuracy\nKNN_JaccardIndex =knn_jaccard_index \nKNN_F1_Score = knn_f1_score", 419 | "metadata": { 420 | "trusted": true 421 | }, 422 | "outputs": [], 423 | "execution_count": null 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "source": "### Decision Tree\n", 428 | "metadata": {} 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "source": "#### Q9) Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).\n", 433 | "metadata": {} 434 | }, 435 | { 436 | "cell_type": "code", 437 | "source": "from sklearn.tree import 
DecisionTreeClassifier\n\n# Create the Decision Tree model\nTree = DecisionTreeClassifier()\n\n# Train the model using the training data\nTree.fit(x_train, y_train)\n", 438 | "metadata": { 439 | "trusted": true 440 | }, 441 | "outputs": [], 442 | "execution_count": null 443 | }, 444 | { 445 | "cell_type": "code", 446 | "source": "Tree = DecisionTreeClassifier()\n# Train the model using the training data\nTree.fit(x_train, y_train)", 447 | "metadata": { 448 | "trusted": true 449 | }, 450 | "outputs": [], 451 | "execution_count": null 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "source": "#### Q10) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n", 456 | "metadata": {} 457 | }, 458 | { 459 | "cell_type": "code", 460 | "source": "# Use the predict method on the testing data\npredictions = Tree.predict(x_test)\n", 461 | "metadata": { 462 | "trusted": true 463 | }, 464 | "outputs": [], 465 | "execution_count": null 466 | }, 467 | { 468 | "cell_type": "code", 469 | "source": "predictions = Tree.predict(x_test)\n", 470 | "metadata": { 471 | "trusted": true 472 | }, 473 | "outputs": [], 474 | "execution_count": null 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "source": "#### Q11) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n", 479 | "metadata": {} 480 | }, 481 | { 482 | "cell_type": "code", 483 | "source": "# Calculate the accuracy score\ntree_accuracy = accuracy_score(y_test, predictions)\n\n# Calculate the Jaccard Index\ntree_jaccard_index = jaccard_score(y_test, predictions)\n\n# Calculate the F1-Score\ntree_f1_score = f1_score(y_test, predictions)\n\n# Display the results\nprint(f\"Accuracy Score: {tree_accuracy}\")\nprint(f\"Jaccard Index: {tree_jaccard_index}\")\nprint(f\"F1-Score: {tree_f1_score}\")\n", 484 | "metadata": { 485 | "trusted": true 486 | }, 487 | "outputs": [], 488 | "execution_count": null 489 | }, 490 | { 491 | "cell_type": "code", 492 | "source": "Tree_Accuracy_Score = tree_accuracy\nTree_JaccardIndex = tree_jaccard_index\nTree_F1_Score = tree_f1_score ", 493 | "metadata": { 494 | "trusted": true 495 | }, 496 | "outputs": [], 497 | "execution_count": null 498 | }, 499 | { 500 | "cell_type": "markdown", 501 | "source": "### Logistic Regression\n", 502 | "metadata": {} 503 | }, 504 | { 505 | "cell_type": "markdown", 506 | "source": "#### Q12) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.\n", 507 | "metadata": {} 508 | }, 509 | { 510 | "cell_type": "code", 511 | "source": "from sklearn.model_selection import train_test_split\n\n# Split the data\nx_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)\n", 512 | "metadata": { 513 | "trusted": true 514 | }, 515 | "outputs": [], 516 | "execution_count": null 517 | }, 518 | { 519 | "cell_type": "code", 520 | "source": "x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)", 521 | "metadata": { 522 | "trusted": true 523 | }, 524 | "outputs": [], 525 | "execution_count": null 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "source": "#### Q13) Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.\n", 530 | "metadata": {} 531 | }, 532 | { 533 | "cell_type": "code", 534 | "source": "from sklearn.linear_model import 
LogisticRegression\n\n# Create the LogisticRegression model\nLR = LogisticRegression(solver='liblinear')\n\n# Train the model using the training data\nLR.fit(x_train, y_train)\n", 535 | "metadata": { 536 | "trusted": true 537 | }, 538 | "outputs": [], 539 | "execution_count": null 540 | }, 541 | { 542 | "cell_type": "code", 543 | "source": "LR = LogisticRegression(solver='liblinear')\n\n# Train the model using the training data\nLR.fit(x_train, y_train)\n", 544 | "metadata": { 545 | "trusted": true 546 | }, 547 | "outputs": [], 548 | "execution_count": null 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "source": "#### Q14) Now, use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save it as 2 arrays `predictions` and `predict_proba`.\n", 553 | "metadata": {} 554 | }, 555 | { 556 | "cell_type": "code", 557 | "source": "# Use the predict method to make predictions on the testing data\npredictions = LR.predict(x_test)\n\n# Use the predict_proba method to get the probability estimates\npredict_proba = LR.predict_proba(x_test)\n\n# Output the predictions and probabilities\nprint(\"Predictions:\", predictions)\nprint(\"Predicted Probabilities:\", predict_proba)\n", 558 | "metadata": { 559 | "trusted": true 560 | }, 561 | "outputs": [], 562 | "execution_count": null 563 | }, 564 | { 565 | "cell_type": "code", 566 | "source": "predictions = LR.predict(x_test)", 567 | "metadata": { 568 | "trusted": true 569 | }, 570 | "outputs": [], 571 | "execution_count": null 572 | }, 573 | { 574 | "cell_type": "code", 575 | "source": "predict_proba = LR.predict_proba(x_test)", 576 | "metadata": { 577 | "trusted": true 578 | }, 579 | "outputs": [], 580 | "execution_count": null 581 | }, 582 | { 583 | "cell_type": "markdown", 584 | "source": "#### Q15) Using the `predictions`, `predict_proba` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n", 585 | "metadata": {} 586 | }, 587 | { 588 | "cell_type": "code", 589 | "source": "from sklearn.metrics import accuracy_score, f1_score, jaccard_score, log_loss\n\n# Calculate accuracy score\naccuracy = accuracy_score(y_test, predictions)\n\n# Calculate F1-score\nf1 = f1_score(y_test, predictions)\n\n# Calculate Jaccard index\njaccard = jaccard_score(y_test, predictions)\n\n# Calculate Log Loss using the predicted probabilities\n# For binary classification, use the probabilities for the positive class\nlogloss = log_loss(y_test, predict_proba)\n\n# Output the metrics\nprint(\"Accuracy Score:\", accuracy)\nprint(\"F1 Score:\", f1)\nprint(\"Jaccard Index:\", jaccard)\nprint(\"Log Loss:\", logloss)\n", 590 | "metadata": { 591 | "trusted": true 592 | }, 593 | "outputs": [], 594 | "execution_count": null 595 | }, 596 | { 597 | "cell_type": "code", 598 | "source": "LR_Accuracy_Score = accuracy\nLR_JaccardIndex = jaccard\nLR_F1_Score = f1\nLR_Log_Loss = logloss", 599 | "metadata": { 600 | "trusted": true 601 | }, 602 | "outputs": [], 603 | "execution_count": null 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "source": "### SVM\n", 608 | "metadata": {} 609 | }, 610 | { 611 | "cell_type": "markdown", 612 | "source": "#### Q16) Create and train a SVM model called SVM using the training data (`x_train`, `y_train`).\n", 613 | "metadata": {} 614 | }, 615 | { 616 | "cell_type": "code", 617 | "source": "from sklearn import svm\n\n# Create the SVM model\nSVM = svm.SVC()\n\n# Train the model using the training data\nSVM.fit(x_train, y_train)\n", 618 | "metadata": { 619 | "trusted": true 620 | }, 621 | 
"outputs": [], 622 | "execution_count": null 623 | }, 624 | { 625 | "cell_type": "code", 626 | "source": "SVM = svm.SVC()\nSVM.fit(x_train, y_train)", 627 | "metadata": { 628 | "trusted": true 629 | }, 630 | "outputs": [], 631 | "execution_count": null 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "source": "#### Q17) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n", 636 | "metadata": {} 637 | }, 638 | { 639 | "cell_type": "code", 640 | "source": "# Use the predict method on the testing data\npredictions_svm = SVM.predict(x_test)\n\n# Output the predictions\nprint(predictions_svm)\n", 641 | "metadata": { 642 | "trusted": true 643 | }, 644 | "outputs": [], 645 | "execution_count": null 646 | }, 647 | { 648 | "cell_type": "code", 649 | "source": "predictions = SVM.predict(x_test)", 650 | "metadata": { 651 | "trusted": true 652 | }, 653 | "outputs": [], 654 | "execution_count": null 655 | }, 656 | { 657 | "cell_type": "markdown", 658 | "source": "#### Q18) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n", 659 | "metadata": {} 660 | }, 661 | { 662 | "cell_type": "code", 663 | "source": "from sklearn.metrics import accuracy_score, f1_score, jaccard_score, log_loss\n\n# Calculate accuracy score\naccuracy_svm = accuracy_score(y_test, predictions_svm)\n\n# Calculate F1-score\nf1_svm = f1_score(y_test, predictions_svm)\n\n# Calculate Jaccard index\njaccard_svm = jaccard_score(y_test, predictions_svm)\n\n# Note: Log Loss requires probability estimates, so if using SVM you need to use predict_proba for this metric\n# For SVM, you need to use a decision function for probabilities or a model that supports probability estimates.\n# Assuming SVM with probability=True for demonstration\n# If `SVM` was trained with `probability=True` use the following:\npredict_proba_svm = SVM.decision_function(x_test)\nlogloss_svm = log_loss(y_test, predict_proba_svm)\n\n# Output the metrics\nprint(\"SVM Accuracy Score:\", accuracy_svm)\nprint(\"SVM F1 Score:\", f1_svm)\nprint(\"SVM Jaccard Index:\", jaccard_svm)\nprint(\"SVM Log Loss:\", logloss_svm)\n\n\nSVM_Accuracy_Score = accuracy_svm\nSVM_JaccardIndex = jaccard_svm\nSVM_F1_Score = f1_svm", 664 | "metadata": { 665 | "trusted": true 666 | }, 667 | "outputs": [], 668 | "execution_count": null 669 | }, 670 | { 671 | "cell_type": "markdown", 672 | "source": "### Report\n", 673 | "metadata": {} 674 | }, 675 | { 676 | "cell_type": "markdown", 677 | "source": "#### Q19) Show the Accuracy,Jaccard Index,F1-Score and LogLoss in a tabular format using data frame for all of the above models.\n\n\\*LogLoss is only for Logistic Regression Model\n", 678 | "metadata": {} 679 | }, 680 | { 681 | "cell_type": "code", 682 | "source": "import pandas as pd\n\n# Create a dictionary with the metrics for all models\nmetrics_dict = {\n 'Model': ['Linear Regression', 'KNN', 'Decision Tree', 'Logistic Regression', 'SVM'],\n 'Accuracy': [LinearRegression_Accuracy_Score, KNN_Accuracy_Score, Tree_Accuracy_Score, LogisticRegression_Accuracy_Score, accuracy_svm],\n 'Jaccard Index': [None, KNN_JaccardIndex, Tree_JaccardIndex, LogisticRegression_JaccardIndex, jaccard_svm],\n 'F1 Score': [None, KNN_F1_Score, Tree_F1_Score, LogisticRegression_F1_Score, f1_svm],\n 'Log Loss': [None, None, None, LogisticRegression_Log_Loss, None] # Log Loss is only for Logistic Regression\n}\n\n# Convert the dictionary to a DataFrame\nmetrics_df = pd.DataFrame(metrics_dict)\n\n# Display 
the DataFrame\nprint(metrics_df)\n\n\nReport = metrics_df", 683 | "metadata": { 684 | "trusted": true 685 | }, 686 | "outputs": [], 687 | "execution_count": null 688 | }, 689 | { 690 | "cell_type": "markdown", 691 | "source": "
## How to submit\n\nOnce you complete your notebook, you will have to share it. You can download the notebook by navigating to \"File\" and clicking on the \"Download\" button.\n\nThis will save the (.ipynb) file on your computer. Once saved, you can upload this file in the \"My Submission\" tab of the \"Peer-graded Assignment\" section.\n", 692 | "metadata": {} 693 | }, 694 | { 695 | "cell_type": "markdown", 696 | "source": "
## About the Authors:
\n\nJoseph Santarcangelo has a PhD in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.\n\n### Other Contributors\n\n[Svitlana Kramar](https://www.linkedin.com/in/svitlana-kramar/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01)\n", 697 | "metadata": {} 698 | }, 699 | { 700 | "cell_type": "markdown", 701 | "source": "## Change Log\n\n| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n| ----------------- | ------- | ------------- | --------------------------- |\n| 2022-06-22 | 2.0 | Svitlana K. | Deleted GridSearch and Mock |\n\n##
© IBM Corporation 2020. All rights reserved.
\n", 702 | "metadata": {} 703 | }, 704 | { 705 | "cell_type": "code", 706 | "source": "", 707 | "metadata": { 708 | "trusted": true 709 | }, 710 | "outputs": [], 711 | "execution_count": null 712 | } 713 | ] 714 | } -------------------------------------------------------------------------------- /ML0101EN_SkillUp_FinalAssignment.jupyterlite (1) (2).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "d635514d-3cf2-449a-aea8-5e4fa9a207ba", 6 | "metadata": {}, 7 | "source": [ 8 | "
\n", 9 | " \n", 10 | " \"Skills\n", 11 | " \n", 12 | "

\n", 13 | "\n", 14 | "

Final Project: Classification with Python

\n" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "id": "ba96d793-1aa1-4070-b541-9cccd07b073f", 20 | "metadata": {}, 21 | "source": [ 22 | "
## Table of Contents\n", 24 | "\n", 25 | "\n", 34 | "**Estimated Time Needed:** 180 min\n", 35 | "\n", 36 | "\n", 37 | "
\n" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "id": "62e2bfb2-08bd-41de-a704-c72028242793", 43 | "metadata": {}, 44 | "source": [ 45 | "# Instructions\n" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "id": "c06281fa-8a02-493f-9b69-4b6738e6c8eb", 51 | "metadata": {}, 52 | "source": [ 53 | "In this notebook, you will practice all the classification algorithms that we have learned in this course.\n", 54 | "\n", 55 | "\n", 56 | "Below, is where we are going to use the classification algorithms to create a model based on our training data and evaluate our testing data using evaluation metrics learned in the course.\n", 57 | "\n", 58 | "We will use some of the algorithms taught in the course, specifically:\n", 59 | "\n", 60 | "1. Linear Regression\n", 61 | "2. KNN\n", 62 | "3. Decision Trees\n", 63 | "4. Logistic Regression\n", 64 | "5. SVM\n", 65 | "\n", 66 | "We will evaluate our models using:\n", 67 | "\n", 68 | "1. Accuracy Score\n", 69 | "2. Jaccard Index\n", 70 | "3. F1-Score\n", 71 | "4. LogLoss\n", 72 | "5. Mean Absolute Error\n", 73 | "6. Mean Squared Error\n", 74 | "7. R2-Score\n", 75 | "\n", 76 | "Finally, you will use your models to generate the report at the end. \n" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "id": "9d4ee051-f50c-4ce5-aba1-167ffc8f5648", 82 | "metadata": {}, 83 | "source": [ 84 | "# About The Dataset\n" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "id": "4e4d2b57-e9af-4a7d-a7f9-b8c25660ba78", 90 | "metadata": {}, 91 | "source": [ 92 | "The original source of the data is Australian Government's Bureau of Meteorology and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).\n", 93 | "\n", 94 | "The dataset to be used has extra columns like 'RainToday' and our target is 'RainTomorrow', which was gathered from the Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)\n", 95 | "\n", 96 | "\n" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "id": "4b2d517d-9973-438d-8d32-dff28ad6ce84", 102 | "metadata": {}, 103 | "source": [ 104 | "This dataset contains observations of weather metrics for each day from 2008 to 2017. 
The **weatherAUS.csv** dataset includes the following fields:\n", 105 | "\n", 106 | "| Field | Description | Unit | Type |\n", 107 | "| ------------- | ----------------------------------------------------- | --------------- | ------ |\n", 108 | "| Date | Date of the Observation in YYYY-MM-DD | Date | object |\n", 109 | "| Location | Location of the Observation | Location | object |\n", 110 | "| MinTemp | Minimum temperature | Celsius | float |\n", 111 | "| MaxTemp | Maximum temperature | Celsius | float |\n", 112 | "| Rainfall | Amount of rainfall | Millimeters | float |\n", 113 | "| Evaporation | Amount of evaporation | Millimeters | float |\n", 114 | "| Sunshine | Amount of bright sunshine | hours | float |\n", 115 | "| WindGustDir | Direction of the strongest gust | Compass Points | object |\n", 116 | "| WindGustSpeed | Speed of the strongest gust | Kilometers/Hour | object |\n", 117 | "| WindDir9am | Wind direction averaged of 10 minutes prior to 9am | Compass Points | object |\n", 118 | "| WindDir3pm | Wind direction averaged of 10 minutes prior to 3pm | Compass Points | object |\n", 119 | "| WindSpeed9am | Wind speed averaged of 10 minutes prior to 9am | Kilometers/Hour | float |\n", 120 | "| WindSpeed3pm | Wind speed averaged of 10 minutes prior to 3pm | Kilometers/Hour | float |\n", 121 | "| Humidity9am | Humidity at 9am | Percent | float |\n", 122 | "| Humidity3pm | Humidity at 3pm | Percent | float |\n", 123 | "| Pressure9am | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal | float |\n", 124 | "| Pressure3pm | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal | float |\n", 125 | "| Cloud9am | Fraction of the sky obscured by cloud at 9am | Eights | float |\n", 126 | "| Cloud3pm | Fraction of the sky obscured by cloud at 3pm | Eights | float |\n", 127 | "| Temp9am | Temperature at 9am | Celsius | float |\n", 128 | "| Temp3pm | Temperature at 3pm | Celsius | float |\n", 129 | "| RainToday | If there was rain today | Yes/No | object |\n", 130 | "| RainTomorrow | If there is rain tomorrow | Yes/No | float |\n", 131 | "\n", 132 | "Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)\n", 133 | "\n" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "id": "3ad995f0-a174-49a5-aaef-76294021d5d4", 139 | "metadata": {}, 140 | "source": [ 141 | "## **Import the required libraries**\n" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "id": "38dca360-78ed-407c-9f48-26b405bf8695", 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "# All Libraries required for this lab are listed below. 
The libraries pre-installed on Skills Network Labs are commented.\n", 152 | "# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1\n", 153 | "# Note: If your environment doesn't support \"!mamba install\", use \"!pip install\"" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "id": "ece29267-503d-4de0-8c69-f905815d57a3", 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "# Suppress warnings:\n", 164 | "def warn(*args, **kwargs):\n", 165 | " pass\n", 166 | "import warnings\n", 167 | "warnings.warn = warn" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "id": "2344f678-d444-4a13-bd7f-d730f954116f", 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "import pandas as pd\n", 178 | "from sklearn.linear_model import LogisticRegression\n", 179 | "from sklearn.linear_model import LinearRegression\n", 180 | "from sklearn import preprocessing\n", 181 | "import numpy as np\n", 182 | "from sklearn.neighbors import KNeighborsClassifier\n", 183 | "from sklearn.model_selection import train_test_split\n", 185 | "from sklearn.tree import DecisionTreeClassifier\n", 186 | "from sklearn import svm\n", 187 | "from sklearn.metrics import jaccard_score\n", 188 | "from sklearn.metrics import f1_score\n", 189 | "from sklearn.metrics import log_loss\n", 190 | "from sklearn.metrics import confusion_matrix, accuracy_score\n", 191 | "import sklearn.metrics as metrics" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "id": "2bdad242-edb6-4a5b-8471-f5918f3ecab7", 197 | "metadata": {}, 198 | "source": [ 199 | "### Importing the Dataset\n" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "id": "fa02651f-de37-4666-8ad4-b4ddc8d0b0ac", 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "from pyodide.http import pyfetch\n", 210 | "\n", 211 | "async def download(url, filename):\n", 212 | " response = await pyfetch(url)\n", 213 | " if response.status == 200:\n", 214 | " with open(filename, \"wb\") as f:\n", 215 | " f.write(await response.bytes())" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "id": "4d127aff-9339-42f9-bdfc-f2e17a19dd4c", 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "id": "5f770f7d-f967-4ae4-a96a-a6c8c0a02bcc", 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [ 235 | "await download(path, \"Weather_Data.csv\")\n", 236 | "filename = \"Weather_Data.csv\"" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": null, 242 | "id": "f87df4cd-160e-4878-bff0-73f1289f5b90", 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [ 246 | "df = pd.read_csv(\"Weather_Data.csv\")" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "id": "6a7acb6d-0e0e-4321-acd2-8096b272c39d", 252 | "metadata": {}, 253 | "source": [ 254 | "> Note: This version of the lab is designed for JupyterLite, which necessitates downloading the dataset to the interface. 
However, when working with the downloaded version of this notebook on your local machines (Jupyter Anaconda), you can simply **skip the steps above of \"Importing the Dataset\"** and use the URL directly in the `pandas.read_csv()` function. You can uncomment and run the statements in the cell below.\n" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "id": "f9c77ad8-8b85-4e82-af47-f63163f889c3", 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "#filepath = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv\"\n", 265 | "#df = pd.read_csv(filepath)" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "id": "2fd31b01-a2bf-4263-8ff8-6fdce1990047", 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "df.head()" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "id": "eb2f4134-ab8b-48d8-aaab-85ce530aee65", 281 | "metadata": {}, 282 | "source": [ 283 | "### Data Preprocessing\n" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "id": "c70975f9-cae8-4cc6-be94-dbc1a880d3c9", 289 | "metadata": {}, 290 | "source": [ 291 | "#### One Hot Encoding\n" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "id": "cfadd018-a23d-4985-9eb3-8ed0d30abd52", 297 | "metadata": {}, 298 | "source": [ 299 | "First, we need to perform one hot encoding to convert categorical variables to binary variables.\n" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "id": "55968fd3-0422-4766-98fd-9397e0006e3e", 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "id": "e354a6fc-8c8b-499d-8444-5011a5146b1a", 315 | "metadata": {}, 316 | "source": [ 317 | "Next, we replace the values of the 'RainTomorrow' column changing them from a categorical column to a binary column. 
We do not use the `get_dummies` method because we would end up with two columns for 'RainTomorrow' and we do not want, since 'RainTomorrow' is our target.\n" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "id": "77f75277-a3ca-4ccc-a5b7-f95b2491cc9b", 324 | "metadata": {}, 325 | "outputs": [], 326 | "source": [ 327 | "df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "id": "88ab6c18-f36a-408c-8510-fdf831abc53b", 333 | "metadata": {}, 334 | "source": [ 335 | "### Training Data and Test Data\n" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "id": "2a25156c-4080-4b15-a45c-9d4884ddce06", 341 | "metadata": {}, 342 | "source": [ 343 | "Now, we set our 'features' or x values and our Y or target variable.\n" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "id": "3077604d-a2f3-4e24-88ff-64b5ce69d6f0", 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [ 353 | "df_sydney_processed.drop('Date',axis=1,inplace=True)" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": null, 359 | "id": "e1b66dd7-5bb7-4739-96b8-374d8e89269e", 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "df_sydney_processed = df_sydney_processed.astype(float)" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "id": "29857426-177a-4c87-8982-4fdca4ff4d78", 370 | "metadata": {}, 371 | "outputs": [], 372 | "source": [ 373 | "features = df_sydney_processed.drop(columns='RainTomorrow', axis=1)\n", 374 | "Y = df_sydney_processed['RainTomorrow']" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "id": "1be81f61-64c3-43d0-89a8-22ff1a9339cf", 380 | "metadata": {}, 381 | "source": [ 382 | "### Linear Regression\n" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "id": "60256d7f-4d49-4aed-b2df-dc44a5b0791c", 388 | "metadata": {}, 389 | "source": [ 390 | "#### Q1) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.\n" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": null, 396 | "id": "843e32d6-7bdd-4d12-b10a-de70bed0e974", 397 | "metadata": {}, 398 | "outputs": [], 399 | "source": [ 400 | "#Enter Your Code and Execute\n", 401 | "from sklearn.model_selection import train_test_split\n", 402 | "\n", 403 | "# Splitting the data into training and testing sets\n", 404 | "X_train, X_test, Y_train, Y_test = train_test_split(features, Y, test_size=0.2, random_state=10)" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": null, 410 | "id": "c38f2196-bb38-4322-a012-9a27e6a9d8d8", 411 | "metadata": {}, 412 | "outputs": [], 413 | "source": [ 414 | "x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)\n" 415 | ] 416 | }, 417 | { 418 | "cell_type": "markdown", 419 | "id": "b144e3a9-6bd2-4e75-8cae-50e1e3b3fb17", 420 | "metadata": {}, 421 | "source": [ 422 | "#### Q2) Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).\n" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "id": "22b77c93-ecf2-4c54-8ad4-496ef5396695", 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "#Enter Your Code and Execute" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": null, 438 | 
"id": "a4293ee5-572a-45cb-a090-5f30af826630", 439 | "metadata": {}, 440 | "outputs": [], 441 | "source": [ 442 | "from sklearn import linear_model\n", 443 | "LinearReg = linear_model.LinearRegression()\n", 444 | "LinearReg.fit(x_train,y_train)" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "id": "1aa0f086-fefc-4e44-8aa2-822aa7828ad6", 450 | "metadata": {}, 451 | "source": [ 452 | "#### Q3) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": null, 458 | "id": "a63dfa3d-a957-48dc-a067-93e9b3d11431", 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [ 462 | "#Enter Your Code and Execute" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": null, 468 | "id": "d6bdf734-7705-4590-aa54-6b93e20a9da7", 469 | "metadata": {}, 470 | "outputs": [], 471 | "source": [ 472 | "predictions = LinearReg.predict(x_test)" 473 | ] 474 | }, 475 | { 476 | "cell_type": "markdown", 477 | "id": "ea13e307-0ac9-4c7c-874e-440cee795d95", 478 | "metadata": {}, 479 | "source": [ 480 | "#### Q4) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n" 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": null, 486 | "id": "cf2408d4-4932-487d-85f2-913c2efce09f", 487 | "metadata": {}, 488 | "outputs": [], 489 | "source": [ 490 | "#Enter Your Code and Execute" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": null, 496 | "id": "aba34a58-a974-467b-8bad-10a0fd0e88d3", 497 | "metadata": {}, 498 | "outputs": [], 499 | "source": [ 500 | "import numpy as np\n", 501 | "from sklearn.metrics import r2_score\n", 502 | "\n", 503 | "LinearRegression_MAE = np.mean(np.absolute(predictions - y_test)\n", 504 | "LinearRegression_MSE =np.mean((predictions - y_test)**2)\n", 505 | "LinearRegression_R2 = r2_score(y_test , predictions) " 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "id": "4552ab70-ec8a-4455-8c5f-f75f6e2e2771", 511 | "metadata": {}, 512 | "source": [ 513 | "#### Q5) Show the MAE, MSE, and R2 in a tabular format using data frame for the linear model.\n" 514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": null, 519 | "id": "cc932bbd-9528-45b5-b85a-8c9a0b9560ed", 520 | "metadata": {}, 521 | "outputs": [], 522 | "source": [ 523 | "#Enter Your Code and Execute" 524 | ] 525 | }, 526 | { 527 | "cell_type": "code", 528 | "execution_count": null, 529 | "id": "edd964f0-d1aa-4e3b-9a52-577656477438", 530 | "metadata": {}, 531 | "outputs": [], 532 | "source": [ 533 | "from pandas import pd\n", 534 | "Report =pd.DataFrame({'Metric': ['MAE', 'MSE', 'R²'],'Linear Regression': [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]})" 535 | ] 536 | }, 537 | { 538 | "cell_type": "markdown", 539 | "id": "55351393-abee-4af8-8944-f0079265cd56", 540 | "metadata": {}, 541 | "source": [ 542 | "### KNN\n" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "id": "b7e38ebb-7442-4980-b883-e89c6de0d351", 548 | "metadata": {}, 549 | "source": [ 550 | "#### Q6) Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.\n" 551 | ] 552 | }, 553 | { 554 | "cell_type": "code", 555 | "execution_count": null, 556 | "id": "213be9cb-8c88-4099-b023-def7c0e31c6a", 557 | "metadata": {}, 558 | "outputs": [], 559 | "source": [ 560 | "#Enter Your Code and 
Execute\n", 561 | "from sklearn.neighbors import KNeighborsClassifier" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": null, 567 | "id": "f03fe0e5-ac9f-42cd-a107-d5d46b16ae4d", 568 | "metadata": {}, 569 | "outputs": [], 570 | "source": [ 571 | "k = 4\n", 572 | "#Train Model and Predict \n", 573 | "KNN = KNeighborsClassifier(n_neighbors = k).fit(x_train,y_train)\n" 574 | ] 575 | }, 576 | { 577 | "cell_type": "markdown", 578 | "id": "0ef93a31-0d67-4fa5-809b-a765b13f5888", 579 | "metadata": {}, 580 | "source": [ 581 | "#### Q7) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": null, 587 | "id": "cf386d08-521b-418b-a5d6-601318bd8e93", 588 | "metadata": {}, 589 | "outputs": [], 590 | "source": [ 591 | "#Enter Your Code and Execute" 592 | ] 593 | }, 594 | { 595 | "cell_type": "code", 596 | "execution_count": null, 597 | "id": "8f88815e-92fd-4920-983c-326502b8bc29", 598 | "metadata": {}, 599 | "outputs": [], 600 | "source": [ 601 | "predictions = KNN.predict(x_test)\n", 602 | "predictions[0:]" 603 | ] 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "id": "9913f102-99a8-4af3-858d-1a55a2d49259", 608 | "metadata": {}, 609 | "source": [ 610 | "#### Q8) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n" 611 | ] 612 | }, 613 | { 614 | "cell_type": "code", 615 | "execution_count": null, 616 | "id": "02eb3dd1-0c03-407a-9042-d446bd5ce557", 617 | "metadata": {}, 618 | "outputs": [], 619 | "source": [ 620 | "#Enter Your Code and Execute\n", 621 | "from sklearn import metrics" 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": null, 627 | "id": "47ddb040-001c-4c7f-907a-e0fe69d2bfb7", 628 | "metadata": {}, 629 | "outputs": [], 630 | "source": [ 631 | "KNN_Accuracy_Score = metrics.accuracy_score(y_test, predictions) \n", 632 | "KNN_JaccardIndex = metrics.jaccard_score(y_test, predictions)\n", 633 | "KNN_F1_Score = metrics.f1_score(y_test, predictions)" 634 | ] 635 | }, 636 | { 637 | "cell_type": "markdown", 638 | "id": "b1b49d21-0f4b-4737-a15c-88574afb6dc5", 639 | "metadata": {}, 640 | "source": [ 641 | "### Decision Tree\n" 642 | ] 643 | }, 644 | { 645 | "cell_type": "markdown", 646 | "id": "07aedd58-1090-48ed-8fe9-38c64f4492ad", 647 | "metadata": {}, 648 | "source": [ 649 | "#### Q9) Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).\n" 650 | ] 651 | }, 652 | { 653 | "cell_type": "code", 654 | "execution_count": null, 655 | "id": "b61b238f-7880-4cd9-9764-b69b5c63f352", 656 | "metadata": {}, 657 | "outputs": [], 658 | "source": [ 659 | "#Enter Your Code and Execute" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": null, 665 | "id": "f4a9ea81-7336-4cab-aa98-22cb6c6e79ca", 666 | "metadata": {}, 667 | "outputs": [], 668 | "source": [ 669 | "\n", 670 | "from sklearn.tree import DecisionTreeClassifier\n", 671 | "Tree = DecisionTreeClassifier()\n", 672 | "Tree.fit(x_train, y_train)" 673 | ] 674 | }, 675 | { 676 | "cell_type": "markdown", 677 | "id": "d79279ba-1e22-45c9-a6a7-e1a39a61c512", 678 | "metadata": {}, 679 | "source": [ 680 | "#### Q10) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n" 681 | ] 682 | }, 683 | { 684 | "cell_type": "code", 685 | "execution_count": null, 686 | "id": "5c99f9a8-2fa2-48f3-83e9-8781dc1b0d61", 687 | 
"metadata": {}, 688 | "outputs": [], 689 | "source": [ 690 | "#Enter Your Code and Execute" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": null, 696 | "id": "4f6ea989-4778-47c8-9bc0-f6fae679f3c3", 697 | "metadata": {}, 698 | "outputs": [], 699 | "source": [ 700 | "predictions = Tree.predict(x_test)\n", 701 | "predictions[0:]" 702 | ] 703 | }, 704 | { 705 | "cell_type": "markdown", 706 | "id": "f19bae36-1072-4baf-bcf8-3c098eeb1915", 707 | "metadata": {}, 708 | "source": [ 709 | "#### Q11) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n" 710 | ] 711 | }, 712 | { 713 | "cell_type": "code", 714 | "execution_count": null, 715 | "id": "0c47c2e2-14c7-4e5a-aaea-86212a0fa032", 716 | "metadata": {}, 717 | "outputs": [], 718 | "source": [ 719 | "#Enter Your Code and Execute\n", 720 | "from sklearn import metrics" 721 | ] 722 | }, 723 | { 724 | "cell_type": "code", 725 | "execution_count": null, 726 | "id": "edb09a5f-c3e2-4c4f-924b-85f6ffe96729", 727 | "metadata": {}, 728 | "outputs": [], 729 | "source": [ 730 | "Tree_Accuracy_Score = metrics.accuracy_score(y_test, predictions)\n", 731 | "Tree_JaccardIndex = metrics.jaccard_score(y_test, predictions)\n", 732 | "Tree_F1_Score = metrics.f1_score(y_test, predictions)" 733 | ] 734 | }, 735 | { 736 | "cell_type": "markdown", 737 | "id": "f2905933-3b27-4ece-a80a-b35744f54b5f", 738 | "metadata": {}, 739 | "source": [ 740 | "### Logistic Regression\n" 741 | ] 742 | }, 743 | { 744 | "cell_type": "markdown", 745 | "id": "490cdcfd-14aa-417d-b3ed-29577d13f7da", 746 | "metadata": {}, 747 | "source": [ 748 | "#### Q12) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.\n" 749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": null, 754 | "id": "31bb6aa1-c399-4aed-89eb-4800b1854f0f", 755 | "metadata": {}, 756 | "outputs": [], 757 | "source": [ 758 | "#Enter Your Code and Execute\n", 759 | "from sklearn.model_selection import train_test_split" 760 | ] 761 | }, 762 | { 763 | "cell_type": "code", 764 | "execution_count": null, 765 | "id": "f7f13536-93e4-4edd-9c75-49a6d342f446", 766 | "metadata": {}, 767 | "outputs": [], 768 | "source": [ 769 | "x_train, x_test, y_train, y_test = (features, Y, test_size=0.2, random_state=1 )" 770 | ] 771 | }, 772 | { 773 | "cell_type": "markdown", 774 | "id": "cd2f53d8-3983-4581-8363-f11950b80b85", 775 | "metadata": {}, 776 | "source": [ 777 | "#### Q13) Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.\n" 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": null, 783 | "id": "d8bf2bf5-8c9b-4250-beca-591e1a62dd1d", 784 | "metadata": {}, 785 | "outputs": [], 786 | "source": [ 787 | "#Enter Your Code and Execute" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": null, 793 | "id": "57ef08e5-9b64-4337-92a6-9af7f98a81a4", 794 | "metadata": {}, 795 | "outputs": [], 796 | "source": [ 797 | "from sklearn.linear_model import LogisticRegression\n", 798 | "from sklearn.metrics import confusion_matrix\n", 799 | "LR = LogisticRegression(C=0.01, solver='liblinear').fit(x_train,y_train)\n", 800 | "LR" 801 | ] 802 | }, 803 | { 804 | "cell_type": "markdown", 805 | "id": "cdaf1cdd-61de-46e4-a252-6641aa13998b", 806 | "metadata": {}, 807 | "source": [ 808 | "#### Q14) Now, use the `predict` and 
803 | { 804 | "cell_type": "markdown", 805 | "id": "cdaf1cdd-61de-46e4-a252-6641aa13998b", 806 | "metadata": {}, 807 | "source": [ 808 | "#### Q14) Now, use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save it as 2 arrays `predictions` and `predict_proba`.\n" 809 | ] 810 | }, 811 | { 812 | "cell_type": "code", 813 | "execution_count": null, 814 | "id": "421725a3-8a77-4239-b6ee-3f7053ad6807", 815 | "metadata": {}, 816 | "outputs": [], 817 | "source": [ 818 | "#Enter Your Code and Execute" 819 | ] 820 | }, 821 | { 822 | "cell_type": "code", 823 | "execution_count": null, 824 | "id": "b2aad2f1-d2b7-4267-ab2e-1913d6b681a8", 825 | "metadata": {}, 826 | "outputs": [], 827 | "source": [ 828 | "predictions = LR.predict(x_test)" 829 | ] 830 | }, 831 | { 832 | "cell_type": "code", 833 | "execution_count": null, 834 | "id": "855f934b-4098-4e23-b5dd-b1ebff6c02b8", 835 | "metadata": {}, 836 | "outputs": [], 837 | "source": [ 838 | "predict_proba = LR.predict_proba(x_test)" 839 | ] 840 | }, 841 | { 842 | "cell_type": "markdown", 843 | "id": "08f3dc13-8be2-4d1e-9866-ff97141ae692", 844 | "metadata": {}, 845 | "source": [ 846 | "#### Q15) Using the `predictions`, `predict_proba` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n" 847 | ] 848 | }, 849 | { 850 | "cell_type": "code", 851 | "execution_count": null, 852 | "id": "0be54747-7fa3-46e6-a4cf-bd5e820ac616", 853 | "metadata": {}, 854 | "outputs": [], 855 | "source": [ 856 | "#Enter Your Code and Execute" 857 | ] 858 | }, 859 | { 860 | "cell_type": "code", 861 | "execution_count": null, 862 | "id": "6dc992b9-a3f8-49b5-a988-90118971013f", 863 | "metadata": {}, 864 | "outputs": [], 865 | "source": [ 866 | "from sklearn.metrics import accuracy_score, jaccard_score, f1_score, log_loss\n", 867 | "LR_Accuracy_Score = accuracy_score(y_test, predictions)\n", 868 | "LR_JaccardIndex = jaccard_score(y_test, predictions)\n", 869 | "LR_F1_Score = f1_score(y_test, predictions)\n", 870 | "LR_Log_Loss = log_loss(y_test, predict_proba)\n" 871 | ] 872 | }, 
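{ "cell_type": "markdown", "metadata": {}, "source": [ "As an optional aside (a sketch, not one of the graded questions): `confusion_matrix` is imported at the top of this notebook but never used. Assuming `y_test` and the logistic-regression `predictions` from Q14 are still in scope, it gives a quick breakdown of where the model errs.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rows are actual classes (0 = no rain tomorrow, 1 = rain), columns are predicted classes.\n", "cm = confusion_matrix(y_test, predictions)\n", "print(cm)" ] },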
873 | { 874 | "cell_type": "markdown", 875 | "id": "0c7326ae-5aa6-4666-b4d6-0705e5bcb771", 876 | "metadata": {}, 877 | "source": [ 878 | "### SVM\n" 879 | ] 880 | }, 881 | { 882 | "cell_type": "markdown", 883 | "id": "920bae21-8886-4705-b6b1-85c1ca4506ee", 884 | "metadata": {}, 885 | "source": [ 886 | "#### Q16) Create and train a SVM model called SVM using the training data (`x_train`, `y_train`).\n" 887 | ] 888 | }, 889 | { 890 | "cell_type": "code", 891 | "execution_count": null, 892 | "id": "4ed2651e-3dd8-46bd-8a31-e7efa095a5dc", 893 | "metadata": {}, 894 | "outputs": [], 895 | "source": [ 896 | "#Enter Your Code and Execute" 897 | ] 898 | }, 899 | { 900 | "cell_type": "code", 901 | "execution_count": null, 902 | "id": "55d94ee3-60bb-4307-8fae-8d87dfd0f5ad", 903 | "metadata": {}, 904 | "outputs": [], 905 | "source": [ 906 | "from sklearn.svm import SVC\n", 907 | "SVM = SVC()\n", 908 | "SVM.fit(x_train, y_train)" 909 | ] 910 | }, 911 | { 912 | "cell_type": "markdown", 913 | "id": "755cb519-2721-4674-9d21-85a154fde994", 914 | "metadata": {}, 915 | "source": [ 916 | "#### Q17) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n" 917 | ] 918 | }, 919 | { 920 | "cell_type": "code", 921 | "execution_count": null, 922 | "id": "de56e316-aaca-4ed9-89eb-1d69140ff04c", 923 | "metadata": {}, 924 | "outputs": [], 925 | "source": [ 926 | "#Enter Your Code and Execute" 927 | ] 928 | }, 929 | { 930 | "cell_type": "code", 931 | "execution_count": null, 932 | "id": "cb98d313-75b6-4bea-b79c-efec4b9e412c", 933 | "metadata": {}, 934 | "outputs": [], 935 | "source": [ 936 | "predictions = SVM.predict(x_test)" 937 | ] 938 | }, 939 | { 940 | "cell_type": "markdown", 941 | "id": "961ccca3-1fac-476a-93d2-39d6c8b3905b", 942 | "metadata": {}, 943 | "source": [ 944 | "#### Q18) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n" 945 | ] 946 | }, 947 | { 948 | "cell_type": "code", 949 | "execution_count": null, 950 | "id": "34922618-6a7d-494c-a1b6-5f515f29a801", 951 | "metadata": {}, 952 | "outputs": [], 953 | "source": [ 954 | "from sklearn.metrics import accuracy_score, jaccard_score, f1_score\n", 955 | "SVM_Accuracy_Score = accuracy_score(y_test, predictions)\n", 956 | "SVM_JaccardIndex = jaccard_score(y_test, predictions)\n", 957 | "SVM_F1_Score = f1_score(y_test, predictions)\n" 958 | ] 959 | }, 960 | { 961 | "cell_type": "markdown", 962 | "id": "4e02f921-2696-4a0b-b9b6-cfc89a55f77d", 963 | "metadata": {}, 964 | "source": [ 965 | "### Report\n" 966 | ] 967 | }, 968 | { 969 | "cell_type": "markdown", 970 | "id": "1f696bf7-a40a-404b-af35-b9a66f6304d6", 971 | "metadata": {}, 972 | "source": [ 973 | "#### Q19) Show the Accuracy, Jaccard Index, F1-Score and LogLoss in a tabular format using a data frame for all of the above models.\n", 974 | "\n", 975 | "\\*LogLoss is only for the Logistic Regression model\n" 976 | ] 977 | }, 978 | { 979 | "cell_type": "code", 980 | "execution_count": null, 981 | "id": "f7cc9f99-9da8-48e1-916e-fd642e28b773", 982 | "metadata": {}, 983 | "outputs": [], 984 | "source": [ 985 | "Report = pd.DataFrame({\n", 986 | " 'Model': ['KNN', 'Decision Tree', 'Logistic Regression', 'SVM'],\n", 987 | " 'Accuracy': [KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],\n", 988 | " 'Jaccard Index': [KNN_JaccardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],\n", 989 | " 'F1 Score': [KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],\n", 990 | " 'Log Loss': [None, None, LR_Log_Loss, None] # Log Loss only applies to Logistic Regression\n", 991 | "})\n", "Report" 992 | ] 993 | }, 
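{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick, optional cross-check of the table above (a sketch, not one of the graded questions): for a binary positive class, the Jaccard index and the F1-score are tied together as J = F1 / (2 - F1), because J = TP/(TP+FP+FN) while F1 = 2TP/(2TP+FP+FN). Assuming the metric variables computed above are still in scope:\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Verify the identity J = F1 / (2 - F1) for each classifier's scores.\n", "for name, j, f1 in [('KNN', KNN_JaccardIndex, KNN_F1_Score),\n", "                    ('Decision Tree', Tree_JaccardIndex, Tree_F1_Score),\n", "                    ('Logistic Regression', LR_JaccardIndex, LR_F1_Score),\n", "                    ('SVM', SVM_JaccardIndex, SVM_F1_Score)]:\n", "    print(name, np.isclose(j, f1 / (2 - f1)))" ] },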
994 | { 995 | "cell_type": "markdown", 996 | "id": "d7463336-6b5d-4e9e-97a2-86fdf095a9f0", 997 | "metadata": {}, 998 | "source": [ 999 | "### How to submit\n", 1000 | "\n", 1001 | "Once you complete your notebook you will have to share it. You can download the notebook by navigating to \"File\" and clicking on the \"Download\" button.\n", 1002 | "\n", 1003 | "This will save the (.ipynb) file on your computer. Once saved, you can upload this file in the \"My Submission\" tab of the \"Peer-graded Assignment\" section. \n" 1004 | ] 1005 | }, 1006 | { 1007 | "cell_type": "markdown", 1008 | "id": "b7708c87-cdca-4b2c-9edb-829d8ea8a477", 1009 | "metadata": {}, 1010 | "source": [ 1011 | "
### About the Authors:
\n", 1012 | "\n", 1013 | "Joseph Santarcangelo has a PhD in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.\n", 1014 | "\n", 1015 | "### Other Contributors\n", 1016 | "\n", 1017 | "[Svitlana Kramar](https://www.linkedin.com/in/svitlana-kramar/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01)\n" 1018 | ] 1019 | }, 1020 | { 1021 | "cell_type": "markdown", 1022 | "id": "a993db4f-58c6-4192-a296-294459698ae3", 1023 | "metadata": {}, 1024 | "source": [ 1025 | "## Change Log\n", 1026 | "\n", 1027 | "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n", 1028 | "| ----------------- | ------- | ------------- | --------------------------- |\n", 1029 | "| 2022-06-22 | 2.0 | Svitlana K. | Deleted GridSearch and Mock |\n", 1030 | "\n", 1031 | "##
 © IBM Corporation 2020. All rights reserved.
\n" 1032 | ] 1033 | } 1034 | ], 1035 | "metadata": { 1036 | "kernelspec": { 1037 | "display_name": "Python 3 (ipykernel)", 1038 | "language": "python", 1039 | "name": "python3" 1040 | }, 1041 | "language_info": { 1042 | "codemirror_mode": { 1043 | "name": "ipython", 1044 | "version": 3 1045 | }, 1046 | "file_extension": ".py", 1047 | "mimetype": "text/x-python", 1048 | "name": "python", 1049 | "nbconvert_exporter": "python", 1050 | "pygments_lexer": "ipython3", 1051 | "version": "3.9.13" 1052 | }, 1053 | "prev_pub_hash": "ba039b1c59dfa11e53b73e3fc8c403e1e8b43c7aedf6f7e0b1d1e7914b44d98a" 1054 | }, 1055 | "nbformat": 4, 1056 | "nbformat_minor": 4 1057 | } 1058 | -------------------------------------------------------------------------------- /ML0101EN_SkillUp_FinalAssignment.jupyterlite.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","id":"d635514d-3cf2-449a-aea8-5e4fa9a207ba","metadata":{},"source":["
# Final Project: Classification with Python
\n"]},{"cell_type":"markdown","id":"ba96d793-1aa1-4070-b541-9cccd07b073f","metadata":{},"source":["
## Table of Contents\n","\n","Estimated Time Needed: 180 min\n","
\n"]},{"cell_type":"markdown","id":"62e2bfb2-08bd-41de-a704-c72028242793","metadata":{},"source":["# Instructions\n"]},{"cell_type":"markdown","id":"c06281fa-8a02-493f-9b69-4b6738e6c8eb","metadata":{},"source":["In this notebook, you will practice all the classification algorithms that we have learned in this course.\n","\n","\n","Below, is where we are going to use the classification algorithms to create a model based on our training data and evaluate our testing data using evaluation metrics learned in the course.\n","\n","We will use some of the algorithms taught in the course, specifically:\n","\n","1. Linear Regression\n","2. KNN\n","3. Decision Trees\n","4. Logistic Regression\n","5. SVM\n","\n","We will evaluate our models using:\n","\n","1. Accuracy Score\n","2. Jaccard Index\n","3. F1-Score\n","4. LogLoss\n","5. Mean Absolute Error\n","6. Mean Squared Error\n","7. R2-Score\n","\n","Finally, you will use your models to generate the report at the end. \n"]},{"cell_type":"markdown","id":"9d4ee051-f50c-4ce5-aba1-167ffc8f5648","metadata":{},"source":["# About The Dataset\n"]},{"cell_type":"markdown","id":"4e4d2b57-e9af-4a7d-a7f9-b8c25660ba78","metadata":{},"source":["The original source of the data is Australian Government's Bureau of Meteorology and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).\n","\n","The dataset to be used has extra columns like 'RainToday' and our target is 'RainTomorrow', which was gathered from the Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)\n","\n","\n"]},{"cell_type":"markdown","id":"4b2d517d-9973-438d-8d32-dff28ad6ce84","metadata":{},"source":["This dataset contains observations of weather metrics for each day from 2008 to 2017. 
The **weatherAUS.csv** dataset includes the following fields:\n","\n","| Field | Description | Unit | Type |\n","| ------------- | ----------------------------------------------------- | --------------- | ------ |\n","| Date | Date of the Observation in YYYY-MM-DD | Date | object |\n","| Location | Location of the Observation | Location | object |\n","| MinTemp | Minimum temperature | Celsius | float |\n","| MaxTemp | Maximum temperature | Celsius | float |\n","| Rainfall | Amount of rainfall | Millimeters | float |\n","| Evaporation | Amount of evaporation | Millimeters | float |\n","| Sunshine | Amount of bright sunshine | hours | float |\n","| WindGustDir | Direction of the strongest gust | Compass Points | object |\n","| WindGustSpeed | Speed of the strongest gust | Kilometers/Hour | object |\n","| WindDir9am | Wind direction averaged of 10 minutes prior to 9am | Compass Points | object |\n","| WindDir3pm | Wind direction averaged of 10 minutes prior to 3pm | Compass Points | object |\n","| WindSpeed9am | Wind speed averaged of 10 minutes prior to 9am | Kilometers/Hour | float |\n","| WindSpeed3pm | Wind speed averaged of 10 minutes prior to 3pm | Kilometers/Hour | float |\n","| Humidity9am | Humidity at 9am | Percent | float |\n","| Humidity3pm | Humidity at 3pm | Percent | float |\n","| Pressure9am | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal | float |\n","| Pressure3pm | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal | float |\n","| Cloud9am | Fraction of the sky obscured by cloud at 9am | Eights | float |\n","| Cloud3pm | Fraction of the sky obscured by cloud at 3pm | Eights | float |\n","| Temp9am | Temperature at 9am | Celsius | float |\n","| Temp3pm | Temperature at 3pm | Celsius | float |\n","| RainToday | If there was rain today | Yes/No | object |\n","| RainTomorrow | If there is rain tomorrow | Yes/No | float |\n","\n","Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)\n","\n"]},{"cell_type":"markdown","id":"3ad995f0-a174-49a5-aaef-76294021d5d4","metadata":{},"source":["## **Import the required libraries**\n"]},{"cell_type":"code","execution_count":null,"id":"38dca360-78ed-407c-9f48-26b405bf8695","metadata":{},"outputs":[],"source":["# All Libraries required for this lab are listed below. 
The libraries pre-installed on Skills Network Labs are commented.\n","# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1\n","# Note: If your environment doesn't support \"!mamba install\", use \"!pip install\""]},{"cell_type":"code","execution_count":null,"id":"ece29267-503d-4de0-8c69-f905815d57a3","metadata":{},"outputs":[],"source":["# Surpress warnings:\n","def warn(*args, **kwargs):\n"," pass\n","import warnings\n","warnings.warn = warn"]},{"cell_type":"code","execution_count":null,"id":"2344f678-d444-4a13-bd7f-d730f954116f","metadata":{},"outputs":[],"source":["import pandas as pd\n","from sklearn.linear_model import LogisticRegression\n","from sklearn.linear_model import LinearRegression\n","from sklearn import preprocessing\n","import numpy as np\n","from sklearn.neighbors import KNeighborsClassifier\n","from sklearn.model_selection import train_test_split\n","from sklearn.neighbors import KNeighborsClassifier\n","from sklearn.tree import DecisionTreeClassifier\n","from sklearn import svm\n","from sklearn.metrics import jaccard_score\n","from sklearn.metrics import f1_score\n","from sklearn.metrics import log_loss\n","from sklearn.metrics import confusion_matrix, accuracy_score\n","import sklearn.metrics as metrics"]},{"cell_type":"markdown","metadata":{},"source":[]},{"cell_type":"markdown","id":"2bdad242-edb6-4a5b-8471-f5918f3ecab7","metadata":{},"source":["### Importing the Dataset\n"]},{"cell_type":"code","execution_count":null,"id":"fa02651f-de37-4666-8ad4-b4ddc8d0b0ac","metadata":{},"outputs":[],"source":["from pyodide.http import pyfetch\n","\n","async def download(url, filename):\n"," response = await pyfetch(url)\n"," if response.status == 200:\n"," with open(filename, \"wb\") as f:\n"," f.write(await response.bytes())"]},{"cell_type":"code","execution_count":null,"id":"4d127aff-9339-42f9-bdfc-f2e17a19dd4c","metadata":{},"outputs":[],"source":["path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'"]},{"cell_type":"code","execution_count":null,"id":"5f770f7d-f967-4ae4-a96a-a6c8c0a02bcc","metadata":{},"outputs":[],"source":["await download(path, \"Weather_Data.csv\")\n","filename =\"Weather_Data.csv\""]},{"cell_type":"code","execution_count":null,"id":"f87df4cd-160e-4878-bff0-73f1289f5b90","metadata":{},"outputs":[],"source":["df = pd.read_csv(\"Weather_Data.csv\")"]},{"cell_type":"markdown","id":"6a7acb6d-0e0e-4321-acd2-8096b272c39d","metadata":{},"source":["> Note: This version of the lab is designed for JupyterLite, which necessitates downloading the dataset to the interface. However, when working with the downloaded version of this notebook on your local machines (Jupyter Anaconda), you can simply **skip the steps above of \"Importing the Dataset\"** and use the URL directly in the `pandas.read_csv()` function. 
You can uncomment and run the statements in the cell below.\n"]},{"cell_type":"code","execution_count":null,"id":"f9c77ad8-8b85-4e82-af47-f63163f889c3","metadata":{},"outputs":[],"source":["#filepath = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv\"\n","#df = pd.read_csv(filepath)"]},{"cell_type":"code","execution_count":null,"id":"2fd31b01-a2bf-4263-8ff8-6fdce1990047","metadata":{},"outputs":[],"source":["df.head()"]},{"cell_type":"markdown","id":"eb2f4134-ab8b-48d8-aaab-85ce530aee65","metadata":{},"source":["### Data Preprocessing\n"]},{"cell_type":"markdown","id":"c70975f9-cae8-4cc6-be94-dbc1a880d3c9","metadata":{},"source":["#### One Hot Encoding\n"]},{"cell_type":"markdown","id":"cfadd018-a23d-4985-9eb3-8ed0d30abd52","metadata":{},"source":["First, we need to perform one hot encoding to convert categorical variables to binary variables.\n"]},{"cell_type":"code","execution_count":null,"id":"55968fd3-0422-4766-98fd-9397e0006e3e","metadata":{},"outputs":[],"source":["df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])"]},{"cell_type":"markdown","id":"e354a6fc-8c8b-499d-8444-5011a5146b1a","metadata":{},"source":["Next, we replace the values of the 'RainTomorrow' column changing them from a categorical column to a binary column. We do not use the `get_dummies` method because we would end up with two columns for 'RainTomorrow' and we do not want, since 'RainTomorrow' is our target.\n"]},{"cell_type":"code","execution_count":null,"id":"77f75277-a3ca-4ccc-a5b7-f95b2491cc9b","metadata":{},"outputs":[],"source":["df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)"]},{"cell_type":"markdown","id":"88ab6c18-f36a-408c-8510-fdf831abc53b","metadata":{},"source":["### Training Data and Test Data\n"]},{"cell_type":"markdown","id":"2a25156c-4080-4b15-a45c-9d4884ddce06","metadata":{},"source":["Now, we set our 'features' or x values and our Y or target variable.\n"]},{"cell_type":"code","execution_count":null,"id":"3077604d-a2f3-4e24-88ff-64b5ce69d6f0","metadata":{},"outputs":[],"source":["df_sydney_processed.drop('Date',axis=1,inplace=True)"]},{"cell_type":"code","execution_count":null,"id":"e1b66dd7-5bb7-4739-96b8-374d8e89269e","metadata":{},"outputs":[],"source":["df_sydney_processed = df_sydney_processed.astype(float)"]},{"cell_type":"code","execution_count":null,"id":"29857426-177a-4c87-8982-4fdca4ff4d78","metadata":{},"outputs":[],"source":["features = df_sydney_processed.drop(columns='RainTomorrow', axis=1)\n","Y = df_sydney_processed['RainTomorrow']"]},{"cell_type":"markdown","id":"1be81f61-64c3-43d0-89a8-22ff1a9339cf","metadata":{},"source":["### Linear Regression\n"]},{"cell_type":"markdown","id":"60256d7f-4d49-4aed-b2df-dc44a5b0791c","metadata":{},"source":["#### Q1) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.\n"]},{"cell_type":"code","execution_count":null,"id":"843e32d6-7bdd-4d12-b10a-de70bed0e974","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute\n","\n"]},{"cell_type":"code","execution_count":null,"id":"c38f2196-bb38-4322-a012-9a27e6a9d8d8","metadata":{},"outputs":[{"ename":"","evalue":"","output_type":"error","traceback":["\u001b[1;31mRunning cells with 'Python 3.11.4' requires the ipykernel package.\n","\u001b[1;31mRun the following command to install 'ipykernel' into the Python environment. 
\n","\u001b[1;31mCommand: 'c:/Users/sid41/AppData/Local/Programs/Python/Python311/python.exe -m pip install ipykernel -U --user --force-reinstall'"]}],"source":["x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)"]},{"cell_type":"markdown","id":"b144e3a9-6bd2-4e75-8cae-50e1e3b3fb17","metadata":{},"source":["#### Q2) Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).\n"]},{"cell_type":"code","execution_count":null,"id":"22b77c93-ecf2-4c54-8ad4-496ef5396695","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"a4293ee5-572a-45cb-a090-5f30af826630","metadata":{},"outputs":[],"source":["LinearReg = LinearRegression()\n","LinearReg.fit(x_train, y_train)"]},{"cell_type":"markdown","id":"1aa0f086-fefc-4e44-8aa2-822aa7828ad6","metadata":{},"source":["#### Q3) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n"]},{"cell_type":"code","execution_count":null,"id":"a63dfa3d-a957-48dc-a067-93e9b3d11431","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"d6bdf734-7705-4590-aa54-6b93e20a9da7","metadata":{},"outputs":[],"source":["predictions = LinearReg.predict(x_test)"]},{"cell_type":"markdown","id":"ea13e307-0ac9-4c7c-874e-440cee795d95","metadata":{},"source":["#### Q4) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n"]},{"cell_type":"code","execution_count":null,"id":"cf2408d4-4932-487d-85f2-913c2efce09f","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"aba34a58-a974-467b-8bad-10a0fd0e88d3","metadata":{},"outputs":[],"source":["from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n","\n","LinearRegression_MAE = mean_absolute_error(y_test, predictions)\n","LinearRegression_MSE = mean_squared_error(y_test, predictions)\n","LinearRegression_R2 = r2_score(y_test, predictions)\n"]},{"cell_type":"markdown","id":"4552ab70-ec8a-4455-8c5f-f75f6e2e2771","metadata":{},"source":["#### Q5) Show the MAE, MSE, and R2 in a tabular format using data frame for the linear model.\n"]},{"cell_type":"code","execution_count":null,"id":"cc932bbd-9528-45b5-b85a-8c9a0b9560ed","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"edd964f0-d1aa-4e3b-9a52-577656477438","metadata":{},"outputs":[],"source":["report_data = {\n"," 'Model': ['Linear Regression', 'KNN', 'Decision Tree', 'Logistic Regression', 'SVM'],\n"," 'Accuracy': [LinearReg_Accuracy, KNN_Accuracy, Tree_Accuracy, LR_Accuracy, SVM_Accuracy],\n"," 'Jaccard Index': [LinearReg_Jaccard, KNN_Jaccard, Tree_Jaccard, LR_Jaccard, SVM_Jaccard],\n"," 'F1 Score': [LinearReg_F1, KNN_F1, Tree_F1, LR_F1, SVM_F1],\n"," 'Log Loss': [None, None, None, LR_LogLoss, None] # Only for Logistic Regression\n","}\n","\n","report_df = pd.DataFrame(report_data)\n","print(report_df)"]},{"cell_type":"markdown","id":"55351393-abee-4af8-8944-f0079265cd56","metadata":{},"source":["### KNN\n"]},{"cell_type":"markdown","id":"b7e38ebb-7442-4980-b883-e89c6de0d351","metadata":{},"source":["#### Q6) Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to 
`4`.\n"]},{"cell_type":"code","execution_count":null,"id":"213be9cb-8c88-4099-b023-def7c0e31c6a","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"f03fe0e5-ac9f-42cd-a107-d5d46b16ae4d","metadata":{},"outputs":[],"source":["from sklearn.neighbors import KNeighborsClassifier\n","\n","KNN = KNeighborsClassifier(n_neighbors=4)\n","KNN.fit(x_train, y_train)\n"]},{"cell_type":"markdown","id":"0ef93a31-0d67-4fa5-809b-a765b13f5888","metadata":{},"source":["#### Q7) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n"]},{"cell_type":"code","execution_count":null,"id":"cf386d08-521b-418b-a5d6-601318bd8e93","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"8f88815e-92fd-4920-983c-326502b8bc29","metadata":{},"outputs":[],"source":["predictions = KNN.predict(x_test)"]},{"cell_type":"markdown","id":"9913f102-99a8-4af3-858d-1a55a2d49259","metadata":{},"source":["#### Q8) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n"]},{"cell_type":"code","execution_count":null,"id":"02eb3dd1-0c03-407a-9042-d446bd5ce557","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"47ddb040-001c-4c7f-907a-e0fe69d2bfb7","metadata":{},"outputs":[],"source":["from sklearn.metrics import accuracy_score, jaccard_score, f1_score\n","\n","KNN_Accuracy_Score = accuracy_score(y_test, predictions)\n","KNN_JaccardIndex = jaccard_score(y_test, predictions)\n","KNN_F1_Score = f1_score(y_test, predictions)\n","\n","print(f\"KNN Accuracy Score: {KNN_Accuracy_Score}\")\n","print(f\"KNN Jaccard Index: {KNN_JaccardIndex}\")\n","print(f\"KNN F1 Score: {KNN_F1_Score}\")\n"]},{"cell_type":"markdown","id":"b1b49d21-0f4b-4737-a15c-88574afb6dc5","metadata":{},"source":["### Decision Tree\n"]},{"cell_type":"markdown","id":"07aedd58-1090-48ed-8fe9-38c64f4492ad","metadata":{},"source":["#### Q9) Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).\n"]},{"cell_type":"code","execution_count":null,"id":"b61b238f-7880-4cd9-9764-b69b5c63f352","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"f4a9ea81-7336-4cab-aa98-22cb6c6e79ca","metadata":{},"outputs":[],"source":["from sklearn.tree import DecisionTreeClassifier\n","\n","Tree = DecisionTreeClassifier()\n","Tree.fit(x_train, y_train)\n"]},{"cell_type":"markdown","id":"d79279ba-1e22-45c9-a6a7-e1a39a61c512","metadata":{},"source":["#### Q10) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n"]},{"cell_type":"code","execution_count":null,"id":"5c99f9a8-2fa2-48f3-83e9-8781dc1b0d61","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":2,"id":"4f6ea989-4778-47c8-9bc0-f6fae679f3c3","metadata":{},"outputs":[{"ename":"NameError","evalue":"name 'Tree' is not defined","output_type":"error","traceback":["\u001b[1;31m---------------------------------------------------------------------------\u001b[0m","\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)","Cell \u001b[1;32mIn[2], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m predictions \u001b[38;5;241m=\u001b[39m \u001b[43mTree\u001b[49m\u001b[38;5;241m.\u001b[39mpredict(x_test)\n","\u001b[1;31mNameError\u001b[0m: name 
'Tree' is not defined"]}],"source":["predictions = Tree.predict(x_test)"]},{"cell_type":"markdown","id":"f19bae36-1072-4baf-bcf8-3c098eeb1915","metadata":{},"source":["#### Q11) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n"]},{"cell_type":"code","execution_count":null,"id":"0c47c2e2-14c7-4e5a-aaea-86212a0fa032","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"edb09a5f-c3e2-4c4f-924b-85f6ffe96729","metadata":{},"outputs":[],"source":["Tree_Accuracy_Score = accuracy_score(y_test, predictions)\n","Tree_JaccardIndex = jaccard_score(y_test, predictions)\n","Tree_F1_Score = f1_score(y_test, predictions)\n","\n","print(f\"Decision Tree Accuracy Score: {Tree_Accuracy_Score}\")\n","print(f\"Decision Tree Jaccard Index: {Tree_JaccardIndex}\")\n","print(f\"Decision Tree F1 Score: {Tree_F1_Score}\")"]},{"cell_type":"markdown","id":"f2905933-3b27-4ece-a80a-b35744f54b5f","metadata":{},"source":["### Logistic Regression\n"]},{"cell_type":"markdown","id":"490cdcfd-14aa-417d-b3ed-29577d13f7da","metadata":{},"source":["#### Q12) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.\n"]},{"cell_type":"code","execution_count":null,"id":"31bb6aa1-c399-4aed-89eb-4800b1854f0f","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"f7f13536-93e4-4edd-9c75-49a6d342f446","metadata":{},"outputs":[],"source":["from sklearn.model_selection import train_test_split\n","\n","x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)\n"]},{"cell_type":"markdown","id":"cd2f53d8-3983-4581-8363-f11950b80b85","metadata":{},"source":["#### Q13) Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.\n"]},{"cell_type":"code","execution_count":null,"id":"d8bf2bf5-8c9b-4250-beca-591e1a62dd1d","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"57ef08e5-9b64-4337-92a6-9af7f98a81a4","metadata":{},"outputs":[],"source":["from sklearn.linear_model import LogisticRegression\n","\n","LR = LogisticRegression(solver='liblinear')\n","LR.fit(x_train, y_train)\n"]},{"cell_type":"markdown","id":"cdaf1cdd-61de-46e4-a252-6641aa13998b","metadata":{},"source":["#### Q14) Now, use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save it as 2 arrays `predictions` and `predict_proba`.\n"]},{"cell_type":"code","execution_count":null,"id":"421725a3-8a77-4239-b6ee-3f7053ad6807","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"b2aad2f1-d2b7-4267-ab2e-1913d6b681a8","metadata":{},"outputs":[],"source":["predictions = LR.predict(x_test)"]},{"cell_type":"code","execution_count":null,"id":"855f934b-4098-4e23-b5dd-b1ebff6c02b8","metadata":{},"outputs":[],"source":["predict_proba = LR.predict_proba(x_test)\n"]},{"cell_type":"markdown","id":"08f3dc13-8be2-4d1e-9866-ff97141ae692","metadata":{},"source":["#### Q15) Using the `predictions`, `predict_proba` and the `y_test` dataframe calculate the value for each metric using the appropriate 
function.\n"]},{"cell_type":"code","execution_count":null,"id":"0be54747-7fa3-46e6-a4cf-bd5e820ac616","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"6dc992b9-a3f8-49b5-a988-90118971013f","metadata":{},"outputs":[],"source":["from sklearn.metrics import log_loss\n","\n","LR_Accuracy_Score = accuracy_score(y_test, predictions)\n","LR_JaccardIndex = jaccard_score(y_test, predictions)\n","LR_F1_Score = f1_score(y_test, predictions)\n","LR_Log_Loss = log_loss(y_test, predict_proba)\n","\n","print(f\"Logistic Regression Accuracy Score: {LR_Accuracy_Score}\")\n","print(f\"Logistic Regression Jaccard Index: {LR_JaccardIndex}\")\n","print(f\"Logistic Regression F1 Score: {LR_F1_Score}\")\n","print(f\"Logistic Regression Log Loss: {LR_Log_Loss}\")\n"]},{"cell_type":"markdown","id":"0c7326ae-5aa6-4666-b4d6-0705e5bcb771","metadata":{},"source":["### SVM\n"]},{"cell_type":"markdown","id":"920bae21-8886-4705-b6b1-85c1ca4506ee","metadata":{},"source":["#### Q16) Create and train a SVM model called SVM using the training data (`x_train`, `y_train`).\n"]},{"cell_type":"code","execution_count":null,"id":"4ed2651e-3dd8-46bd-8a31-e7efa095a5dc","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"55d94ee3-60bb-4307-8fae-8d87dfd0f5ad","metadata":{},"outputs":[],"source":["from sklearn.svm import SVC\n","\n","SVM = SVC()\n","SVM.fit(x_train, y_train)\n"]},{"cell_type":"markdown","id":"755cb519-2721-4674-9d21-85a154fde994","metadata":{},"source":["#### Q17) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.\n"]},{"cell_type":"code","execution_count":null,"id":"de56e316-aaca-4ed9-89eb-1d69140ff04c","metadata":{},"outputs":[],"source":["#Enter Your Code and Execute"]},{"cell_type":"code","execution_count":null,"id":"cb98d313-75b6-4bea-b79c-efec4b9e412c","metadata":{},"outputs":[],"source":["predictions = SVM.predict(x_test)"]},{"cell_type":"markdown","id":"961ccca3-1fac-476a-93d2-39d6c8b3905b","metadata":{},"source":["#### Q18) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.\n"]},{"cell_type":"code","execution_count":null,"id":"34922618-6a7d-494c-a1b6-5f515f29a801","metadata":{},"outputs":[],"source":["SVM_Accuracy_Score = accuracy_score(y_test, predictions)\n","SVM_JaccardIndex = jaccard_score(y_test, predictions)\n","SVM_F1_Score = f1_score(y_test, predictions)\n","\n","print(f\"SVM Accuracy Score: {SVM_Accuracy_Score}\")\n","print(f\"SVM Jaccard Index: {SVM_JaccardIndex}\")\n","print(f\"SVM F1 Score: {SVM_F1_Score}\")\n"]},{"cell_type":"markdown","id":"4e02f921-2696-4a0b-b9b6-cfc89a55f77d","metadata":{},"source":["### Report\n"]},{"cell_type":"markdown","id":"1f696bf7-a40a-404b-af35-b9a66f6304d6","metadata":{},"source":["#### Q19) Show the Accuracy,Jaccard Index,F1-Score and LogLoss in a tabular format using data frame for all of the above models.\n","\n","\\*LogLoss is only for Logistic Regression Model\n"]},{"cell_type":"code","execution_count":null,"id":"f7cc9f99-9da8-48e1-916e-fd642e28b773","metadata":{},"outputs":[],"source":["import pandas as pd\n","\n","Report = pd.DataFrame({\n"," 'Model': ['KNN', 'Decision Tree', 'Logistic Regression', 'SVM'],\n"," 'Accuracy': [KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],\n"," 'Jaccard Index': [KNN_JaccardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],\n"," 
'F1-Score': [KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],\n"," 'LogLoss': [None, None, LR_Log_Loss, None] # Only Logistic Regression uses LogLoss\n","})\n","\n","print(Report)\n"]},{"cell_type":"markdown","id":"d7463336-6b5d-4e9e-97a2-86fdf095a9f0","metadata":{},"source":["
### How to submit\n","\n","Once you complete your notebook you will have to share it. You can download the notebook by navigating to \"File\" and clicking on the \"Download\" button.\n","\n","This will save the (.ipynb) file on your computer. Once saved, you can upload this file in the \"My Submission\" tab of the \"Peer-graded Assignment\" section. \n"]},{"cell_type":"markdown","id":"b7708c87-cdca-4b2c-9edb-829d8ea8a477","metadata":{},"source":["
### About the Authors:
\n","\n","Joseph Santarcangelo has a PhD in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.\n","\n","### Other Contributors\n","\n","[Svitlana Kramar](https://www.linkedin.com/in/svitlana-kramar/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0232ENSkillsNetwork30654641-2022-01-01)\n"]},{"cell_type":"markdown","id":"a993db4f-58c6-4192-a296-294459698ae3","metadata":{},"source":["## Change Log\n","\n","| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","| ----------------- | ------- | ------------- | --------------------------- |\n","| 2022-06-22 | 2.0 | Svitlana K. | Deleted GridSearch and Mock |\n","\n","##

© IBM Corporation 2020. All rights reserved.

\n"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.11.4"},"prev_pub_hash":"ba039b1c59dfa11e53b73e3fc8c403e1e8b43c7aedf6f7e0b1d1e7914b44d98a"},"nbformat":4,"nbformat_minor":4} 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
3 | # Machine Learning Wonders
5 | Welcome to Machine Learning Wonders!
8 | **Unlock the Magic of Machine Learning 🌟**
12 | About • 13 | Contents • 14 | Getting Started • 15 | Contributions • 16 | Community
21 | ## About 22 | 23 | 👋 Welcome, curious minds and future data wizards! 🧙‍♂️ 24 | 25 | Ever wanted to teach computers to think, learn, and make predictions? You're in the right place! Machine Learning is like magic, but with data and algorithms. ✨ 26 | 27 | Machine Learning Wonders is your enchanted portal to the mesmerizing world of ML. Whether you're an AI sorcerer or a newbie wizard, our spells...uh, we mean, notebooks and resources, will guide you through the mystical realm of data-driven enchantment. 28 | 29 | ## Contents 30 | 31 | 📦 Here's what's brewing in our mystical cauldron: 32 | 33 | 1. **Notebooks**: Interactive Jupyter scrolls filled with spells—oops, we meant code! 📜 34 | 2. **Datasets**: Treasure chests of data to power your ML spells. 📊 35 | 3. **Resources**: Scrolls of wisdom—books, courses, articles—to level up your magical skills. 📚 36 | 37 | ## Getting Started 38 | 39 | Ready to create magic? Just clone this repository and let the adventure begin! 🧙‍♀️ 40 | 42 | # Cast your first spell with Jupyter! 43 | ## Contributions Welcome! 🚀 44 | 46 | Magic is best when shared! Contribute your own spells, potions, or magical creatures—oops, we meant code, insights, or resources—to enhance our magical repository. 47 | 48 | ## Join Our Magical Guild! 49 | 54 | Kudos to all the spellcasters, AI alchemists, and magic learners like you who make this magical repository possible. Together, we shall uncover the secrets of Machine Learning! 🌌🔮 55 | 58 | May your algorithms always converge, and your data always be abundant! 🚀✨ 59 | 63 | Disclaimer: No magical creatures were harmed in the making of this repository. 🦄🧙‍♂️ 64 | -------------------------------------------------------------------------------- /Regression_Tree.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "id": "zMidR9HSaf62" 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "# Pandas will allow us to create a dataframe of the data so it can be used and manipulated\n", 26 | "import pandas as pd\n", 27 | "# Regression Tree Algorithm\n", 28 | "from sklearn.tree import DecisionTreeRegressor\n", 29 | "# Split our data into training and testing sets\n", 30 | "from sklearn.model_selection import train_test_split" 31 | ] 32 | }, 
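{ "cell_type": "markdown", "source": [ "Before the lab begins, a brief orientation (a minimal sketch on made-up toy arrays, not the lab data): unlike the `DecisionTreeClassifier` used elsewhere in this repo, `DecisionTreeRegressor` predicts a *continuous* target, returning the mean of the training samples that land in each leaf.\n" ], "metadata": {} },
{ "cell_type": "code", "source": [ "# Toy sketch only (hypothetical values); the real Boston housing data is loaded below.\n", "toy_X = [[1.0], [2.0], [3.0], [4.0]]  # one feature\n", "toy_y = [10.0, 20.0, 30.0, 40.0]      # continuous target\n", "toy_tree = DecisionTreeRegressor(max_depth=2)\n", "toy_tree.fit(toy_X, toy_y)\n", "print(toy_tree.predict([[2.5]]))      # the prediction is a leaf mean\n" ], "metadata": {}, "execution_count": null, "outputs": [] },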
You have collected information about various areas of Boston and are tasked with creating a model that can predict the median price of houses for that area so it can be used to make offers.\n", 42 | "\n", 43 | "The dataset has information on areas/towns, not individual houses; the features are:\n", 44 | "\n", 45 | "* CRIM: Crime per capita\n", 46 | "\n", 47 | "* ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.\n", 48 | "\n", 49 | "* INDUS: Proportion of non-retail business acres per town\n", 50 | "\n", 51 | "* CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n", 52 | "\n", 53 | "* NOX: Nitric oxides concentration (parts per 10 million)\n", 54 | "\n", 55 | "* RM: Average number of rooms per dwelling\n", 56 | "\n", 57 | "* AGE: Proportion of owner-occupied units built prior to 1940\n", 58 | "\n", 59 | "* DIS: Weighted distances to five Boston employment centers\n", 60 | "\n", 61 | "* RAD: Index of accessibility to radial highways\n", 62 | "\n", 63 | "* TAX: Full-value property-tax rate per $10,000\n", 64 | "\n", 65 | "* PTRATIO: Pupil-teacher ratio by town\n", 66 | "\n", 67 | "* LSTAT: Percent lower status of the population\n", 68 | "\n", 69 | "* MEDV: Median value of owner-occupied homes in $1000s" 70 | ], 71 | "metadata": { 72 | "id": "q6tDUM2Ga_ds" 73 | } 74 | }, 75 | { 76 | "cell_type": "code", 77 | "source": [ 78 | "data = pd.read_csv(\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/real_estate_data.csv\")\n", 79 | "print(data.head())\n", 80 | "data.shape" 81 | ], 82 | "metadata": { 83 | "colab": { 84 | "base_uri": "https://localhost:8080/" 85 | }, 86 | "id": "ogVnrBkJbSFG", 87 | "outputId": "f97219cd-9cce-44aa-a860-5c8defca3944" 88 | }, 89 | "execution_count": 3, 90 | "outputs": [ 91 | { 92 | "output_type": "stream", 93 | "name": "stdout", 94 | "text": [ 95 | " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \\\n", 96 | "0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296 15.3 \n", 97 | "1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242 17.8 \n", 98 | "2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242 17.8 \n", 99 | "3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222 18.7 \n", 100 | "4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3 222 18.7 \n", 101 | "\n", 102 | " LSTAT MEDV \n", 103 | "0 4.98 24.0 \n", 104 | "1 9.14 21.6 \n", 105 | "2 4.03 34.7 \n", 106 | "3 2.94 33.4 \n", 107 | "4 NaN 36.2 \n" 108 | ] 109 | }, 110 | { 111 | "output_type": "execute_result", 112 | "data": { 113 | "text/plain": [ 114 | "(506, 13)" 115 | ] 116 | }, 117 | "metadata": {}, 118 | "execution_count": 3 119 | } 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "source": [ 125 | "Most of the data is valid, but there are rows with missing values, which we will deal with during pre-processing" 126 | ], 127 | "metadata": { 128 | "id": "r2cJ4LFobdSA" 129 | } 130 | }, 131 | { 132 | "cell_type": "code", 133 | "source": [ 134 | "data.isna().sum()" 135 | ], 136 | "metadata": { 137 | "colab": { 138 | "base_uri": "https://localhost:8080/", 139 | "height": 491 140 | }, 141 | "id": "QmyA9d7dbbwS", 142 | "outputId": "1ea1c8b8-c196-4be9-b74f-b26e1511e561" 143 | }, 144 | "execution_count": 4, 145 | "outputs": [ 146 | { 147 | "output_type": "execute_result", 148 | "data": { 149 | "text/plain": [ 150 | "CRIM 20\n", 151 | "ZN 20\n", 152 | "INDUS 20\n", 153 | "CHAS 20\n", 154 | "NOX 0\n", 155 | "RM 0\n", 156 | "AGE 20\n", 157 | "DIS 0\n", 158 | "RAD 0\n", 159 | "TAX 0\n", 160 | "PTRATIO 0\n", 161 | "LSTAT 20\n", 162 | "MEDV 0\n", 163 | "dtype: int64" 164 | ] 244 | }, 245 | "metadata": {}, 246 | "execution_count": 4 247 | } 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "source": [ 253 | "**Data Pre-Processing**\n", 254 | "\n", 255 | "---\n", 256 | "\n" 257 | ], 258 | "metadata": { 259 | "id": "R5RCEzPobh7-" 260 | } 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "source": [ 265 | "First, let's drop the rows with missing values, because we have enough data in our dataset" 266 | ], 267 | "metadata": { 268 | "id": "WLpDyutEbkr4" 269 | } 270 | }, 271 | { 272 | "cell_type": "code", 273 | "source": [ 274 | "data.dropna(inplace=True)" 275 | ], 276 | "metadata": { 277 | "id": "n7neafK2bjXO" 278 | }, 279 | "execution_count": 5, 280 | "outputs": [] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "source": [ 285 | "Now we can see that our dataset has no missing values" 286 | ], 287 | "metadata": { 288 | "id": "z1gf6-qObmqv" 289 | } 290 | }, 291 | { 292 | "cell_type": "code", 293 | "source": [ 294 | "data.isna().sum()" 295 | ], 296 | "metadata": { 297 | "colab": { 298 | "base_uri": "https://localhost:8080/", 299 | "height": 491 300 | }, 301 | "id": "6ENn8K0Cbo-Q", 302 | "outputId": "25fc3ef8-6a43-418c-e39e-b861342d841f" 303 | }, 304 | "execution_count": 6, 305 | "outputs": [ 306 | { 307 | "output_type": "execute_result", 308 | "data": { 309 | "text/plain": [ 310 | "CRIM 0\n", 311 | "ZN 0\n", 312 | "INDUS 0\n", 313 | "CHAS 0\n", 314 | "NOX 0\n", 315 | "RM 0\n", 316 | "AGE 0\n", 317 | "DIS 0\n", 318 | "RAD 0\n", 319 | "TAX 0\n", 320 | "PTRATIO 0\n", 321 | "LSTAT 0\n", 322 | "MEDV 0\n", 323 | "dtype: int64" 324 | ]
404 | }, 405 | "metadata": {}, 406 | "execution_count": 6 407 | } 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "source": [ 413 | "Let's split the dataset into our features and what we are predicting (the target)" 414 | ], 415 | "metadata": { 416 | "id": "6GgviVY6bsEY" 417 | } 418 | }, 419 | { 420 | "cell_type": "code", 421 | "source": [ 422 | "X = data.drop(columns=[\"MEDV\"])\n", 423 | "Y = data[\"MEDV\"]" 424 | ], 425 | "metadata": { 426 | "id": "gleeodcYbsiv" 427 | }, 428 | "execution_count": 7, 429 | "outputs": [] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "source": [ 434 | "X.head()" 435 | ], 436 | "metadata": { 437 | "colab": { 438 | "base_uri": "https://localhost:8080/", 439 | "height": 206 440 | }, 441 | "id": "u-CGjmtQbvIU", 442 | "outputId": "6fb919ac-4294-4766-a89a-f08caceeb1d0" 443 | }, 444 | "execution_count": 8, 445 | "outputs": [ 446 | { 447 | "output_type": "execute_result", 448 | "data": { 449 | "text/plain": [ 450 | " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \\\n", 451 | "0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296 15.3 \n", 452 | "1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242 17.8 \n", 453 | "2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242 17.8 \n", 454 | "3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222 18.7 \n", 455 | "5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3 222 18.7 \n", 456 | "\n", 457 | " LSTAT \n", 458 | "0 4.98 \n", 459 | "1 9.14 \n", 460 | "2 4.03 \n", 461 | "3 2.94 \n", 462 | "5 5.21 " 463 | ]
793 | }, 794 | "metadata": {}, 795 | "execution_count": 8 796 | } 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | "source": [ 802 | "Y.head()" 803 | ], 804 | "metadata": { 805 | "colab": { 806 | "base_uri": "https://localhost:8080/", 807 | "height": 241 808 | }, 809 | "id": "88c8QaYZbwuk", 810 | "outputId": "01225303-c21d-44e8-a0a2-a11438ff8e82" 811 | }, 812 | "execution_count": 9, 813 | "outputs": [ 814 | { 815 | "output_type": "execute_result", 816 | "data": { 817 | "text/plain": [ 818 | "0 24.0\n", 819 | "1 21.6\n", 820 | "2 34.7\n", 821 | "3 33.4\n", 822 | "5 28.7\n", 823 | "Name: MEDV, dtype: float64" 824 | ]
872 | }, 873 | "metadata": {}, 874 | "execution_count": 9 875 | } 876 | ] 877 | }, 878 | { 879 | "cell_type": "markdown", 880 | "source": [ 881 | "Finally, let's split our data into training and testing datasets using train_test_split from sklearn.model_selection" 882 | ], 883 | "metadata": { 884 | "id": "MOgldE4Cbzca" 885 | } 886 | }, 887 | { 888 | "cell_type": "code", 889 | "source": [ 890 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=1)" 891 | ], 892 | "metadata": { 893 | "id": "ac3sCRS7b0zZ" 894 | }, 895 | "execution_count": 10, 896 | "outputs": [] 897 | }, 898 | { 899 | "cell_type": "markdown", 900 | "source": [ 901 | "**Create Regression Tree**\n", 902 | "\n", 903 | "---\n", 904 | "\n", 905 | "* Regression Trees are implemented using DecisionTreeRegressor from sklearn.tree\n", 906 | "\n", 907 | "The important parameters of DecisionTreeRegressor are:\n", 908 | "\n", 909 | "* criterion: {\"squared_error\", \"friedman_mse\", \"absolute_error\", \"poisson\"} - The function used to measure error (\"squared_error\" and \"absolute_error\" were called \"mse\" and \"mae\" in older scikit-learn versions)\n", 910 | "\n", 911 | "* max_depth - The maximum depth the tree is allowed to grow to\n", 912 | "\n", 913 | "* min_samples_split - The minimum number of samples required to split a node\n", 914 | "\n", 915 | "* min_samples_leaf - The minimum number of samples that a leaf can contain\n", 916 | "\n", 917 | "* max_features: {\"auto\", \"sqrt\", \"log2\"} - The number of features we examine looking for the best split, used to speed up training" 918 | ], 919 | "metadata": { 920 | "id": "e5HuJdOZb3d4" 921 | } 922 | }, 923 | { 924 | "cell_type": "markdown", 925 | "source": [ 926 | "First, let's create a DecisionTreeRegressor object, setting the criterion parameter to squared_error for Mean Squared Error" 927 | ], 928 | "metadata": { 929 | "id": "Go5P-LqvcF0q" 930 | } 931 | }, 932 | { 933 | "cell_type": "code", 934 | "source": [ 935 | "# regression_tree = DecisionTreeRegressor(criterion = 'mse')\n", 936 | "regression_tree = DecisionTreeRegressor(criterion = 'squared_error')  # 'mse' was renamed to 'squared_error' in scikit-learn 1.0" 937 | ], 938 | "metadata": { 939 | "id": "9iuZaWyxcGuL" 940 | }, 941 | "execution_count": 21, 942 | "outputs": [] 943 | }, 944 | { 945 | "cell_type": "markdown", 946 | "source": [], 947 | "metadata": { 948 | "id": "xqdIjV1ecckQ" 949 | } 950 | }, 951 | { 952 | "cell_type": "markdown", 953 | "source": [ 954 | "**Training**\n", 955 | "\n", 956 | "---\n", 957 | "\n" 958 | ], 959 | "metadata": { 960 | "id": "kr8OIoVFcKJo" 961 | } 962 | }, 963 | { 964 | "cell_type": "markdown", 965 | "source": [ 966 | "Now let's train our model using the fit method of the DecisionTreeRegressor object, providing our training data" 967 | ], 968 | "metadata": { 969 | "id": "7s6YlYl8cM6u" 970 | } 971 | }, 972 | { 973 | "cell_type": "code", 974 | "source": [ 975 | "regression_tree.fit(X_train, Y_train)" 976 | ], 977 | "metadata": { 978 | "colab": { 979 | "base_uri": "https://localhost:8080/", 980 | "height": 74 981 | }, 982 | "id": "zdbto3UOcO7P", 983 | "outputId": "ca433c9c-53e3-41a9-df90-0054433d6fa2" 984 | }, 985 | "execution_count": 22, 986 | "outputs": [ 987 | { 988 | "output_type": "execute_result", 989 | "data": { 990 | "text/plain": [ 991 | "DecisionTreeRegressor()" 992 | ]
996 | }, 997 | "metadata": {}, 998 | "execution_count": 22 999 | } 1000 | ] 1001 | }, 1002 | { 1003 | "cell_type": "markdown", 1004 | "source": [ 1005 | "**Evaluation**\n", 1006 | "\n", 1007 | "---\n", 1008 | "\n" 1009 | ], 1010 | "metadata": { 1011 | "id": "LN29otUQcNkp" 1012 | } 1013 | }, 1014 | { 1015 | "cell_type": "markdown", 1016 | "source": [ 1017 | "To evaluate our model we will use the score method of the DecisionTreeRegressor object, providing our testing data. The returned number is the $R^2$\n", 1018 | " value, the coefficient of determination: $R^2 = 1 - \\frac{\\sum_i (y_i - \\hat{y}_i)^2}{\\sum_i (y_i - \\bar{y})^2}$, where $\\hat{y}_i$ are the predictions and $\\bar{y}$ is the mean of the observed values" 1019 | ], 1020 | "metadata": { 1021 | "id": "r74V6LtgcrCT" 1022 | } 1023 | }, 1024 | { 1025 | "cell_type": "code", 1026 | "source": [ 1027 | "regression_tree.score(X_test, Y_test)" 1028 | ], 1029 | "metadata": { 1030 | "colab": { 1031 | "base_uri": "https://localhost:8080/" 1032 | }, 1033 | "id": "tJTSpgmLcscO", 1034 | "outputId": "7d5dd295-609b-43f4-80f3-0c76ab01a635" 1035 | }, 1036 | "execution_count": 23, 1037 | "outputs": [ 1038 | { 1039 | "output_type": "execute_result", 1040 | "data": { 1041 | "text/plain": [ 1042 | "0.8293745783581222" 1043 | ] 1044 | }, 1045 | "metadata": {}, 1046 | "execution_count": 23 1047 | } 1048 | ] 1049 | }, 1050 | { 1051 | "cell_type": "markdown", 1052 | "source": [ 1053 | "We can also find the average error on our testing set, i.e., the mean absolute error of the median home value predictions (multiplied by 1000, since MEDV is in $1000s)" 1054 | ], 1055 | "metadata": { 1056 | "id": "gDQyvDrvcvWM" 1057 | } 1058 | }, 1059 | { 1060 | "cell_type": "code", 1061 | "source": [ 1062 | "prediction = regression_tree.predict(X_test)\n", 1063 | "\n", 1064 | "print(\"$\",(prediction - Y_test).abs().mean()*1000)  # mean absolute error, converted to dollars" 1065 | ], 1066 | "metadata": { 1067 | "colab": { 1068 | "base_uri": "https://localhost:8080/" 1069 | }, 1070 | "id": "YPucr30ocv5w", 1071 | "outputId": "44283130-0b34-4b34-937f-62067ebba6f6" 1072 | }, 1073 | "execution_count": 24, 1074 | "outputs": [ 1075 | { 1076 | "output_type": "stream", 1077 | "name": "stdout", 1078 | "text": [ 1079 | "$ 2869.620253164557\n" 1080 | ] 1081 | } 1082 | ] 1083 | } 1084 | ] 1085 | } --------------------------------------------------------------------------------