├── README.md ├── 1_1_residuals.ipynb ├── 1_3_linear_regression.ipynb ├── 1_2_sum_of_squares.ipynb ├── 4_1_decision_tree.ipynb ├── 2_2_multiple_logistic_regression.ipynb ├── 3_exercise.ipynb ├── 4_2_random_forest.ipynb ├── 1b_exercise.ipynb ├── 3_1_naive_bayes.ipynb ├── 1a_exercise.ipynb ├── 1_4_train_test_split_linear_regression.ipynb ├── 4_exercise.ipynb ├── 5_1_neural_network_classification.ipynb ├── 2_1_logistic_regression.ipynb ├── 2_3_confusion_matrix_roc_auc.ipynb ├── 2_exercise.ipynb ├── 5_2_neural_network_mnist.ipynb ├── 3_2_naive_bayes_user_input.ipynb ├── 5a_exercise.ipynb └── 5b_exercise.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Intro to Machine Learning 2 | 3 | Why is it important to understand machine learning? Many machine learning techniques can solve interesting problems, from identifying email spam to classifying images. It is also important to understand how libraries like scikit-learn work under the hood, to know their strengths and weaknesses, and to determine whether they should be used for a given problem. 4 | 5 | ## What you'll learn and how you can apply it 6 | 7 | By the end of this course, you’ll understand: 8 | 9 | * What machine learning is and the various algorithms under its umbrella 10 | 11 | * How regression and classification techniques work, from linear regression to neural networks 12 | 13 | * The API patterns of scikit-learn 14 | 15 | And you'll be able to: 16 | 17 | * Use scikit-learn to make predictions on different types of problems, from email spam to image recognition 18 | * Match supervised machine learning algorithms appropriately to a given problem and context 19 | * Explain how different machine learning algorithms work, including decision trees and neural networks 20 | 21 | ## This training is for you because... 22 | * You’re a data science professional wanting to learn about ML and how it works. 23 | * You work with data science teams and want to understand their ML capabilities. 24 | * You want to become a machine learning or data engineer and want to take the first step in that career path. 25 | 26 | ## Prerequisites 27 | * Basic proficiency with Python (variables, loops, installing and importing packages, collections, list comprehensions, declaring NumPy arrays).
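The README above promises familiarity with scikit-learn's API patterns. As a quick orientation before the notebooks, here is a minimal sketch of the fit/predict pattern that every estimator in this course follows; the tiny dataset is made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: X is 2D (rows = samples, columns = features), y is 1D
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.2, 1.25, 2.0])

model = LinearRegression()     # 1. construct an estimator
model.fit(X, y)                # 2. learn parameters from training data
print(model.predict([[4.0]]))  # 3. predict on unseen inputs
```

The same construct/fit/predict (or fit/score) shape recurs with `LogisticRegression`, `DecisionTreeClassifier`, `RandomForestClassifier`, `MultinomialNB`, and `MLPClassifier` throughout these notebooks.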
28 | 29 | -------------------------------------------------------------------------------- /1_1_residuals.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "04a5dfbf", 6 | "metadata": {}, 7 | "source": [ 8 | "## Calculating Residuals" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "5bec5246", 14 | "metadata": {}, 15 | "source": [ 16 | "Declare 3 data points as training data" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "fb1be73a", 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "x_data = [ 1.0, 2.0, 3.0 ]\n", 27 | "y_actuals = [ 1.2, 1.25, 2.0 ]" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "id": "5f710efc", 33 | "metadata": {}, 34 | "source": [ 35 | "Declare slope and intercept coefficients" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "id": "e7413e5e", 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "m = .368421\n", 46 | "b = .587719" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "e7a2beea", 52 | "metadata": {}, 53 | "source": [ 54 | "Plot the scatterplot with the line" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "id": "c720de27", 61 | "metadata": { 62 | "scrolled": true 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "import matplotlib.pyplot as plt\n", 67 | "\n", 68 | "# show in chart\n", 69 | "plt.plot(x_data, y_actuals, 'o') # scatterplot\n", 70 | "plt.plot(x_data, [m*x+b for x in x_data]) # line\n", 71 | "plt.show()" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "id": "afcb0105", 77 | "metadata": {}, 78 | "source": [ 79 | "Calculate predicted y-values" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "id": "195a0f06", 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "y_predicts = [ m*x + b for x in x_data ] " 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "id": "dc396dba", 95 | "metadata": {}, 96 | "source": [ 97 | "Print residuals" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "id": "5d3387ec", 104 | "metadata": { 105 | "scrolled": true 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "for (y_actual, y_predict) in zip(y_actuals, y_predicts): \n", 110 | " residual = y_actual - y_predict\n", 111 | " print(\"RESIDUAL: \", residual)" 112 | ] 113 | } 114 | ], 115 | "metadata": { 116 | "kernelspec": { 117 | "display_name": "Python 3 (ipykernel)", 118 | "language": "python", 119 | "name": "python3" 120 | }, 121 | "language_info": { 122 | "codemirror_mode": { 123 | "name": "ipython", 124 | "version": 3 125 | }, 126 | "file_extension": ".py", 127 | "mimetype": "text/x-python", 128 | "name": "python", 129 | "nbconvert_exporter": "python", 130 | "pygments_lexer": "ipython3", 131 | "version": "3.9.7" 132 | } 133 | }, 134 | "nbformat": 4, 135 | "nbformat_minor": 5 136 | } 137 | -------------------------------------------------------------------------------- /1_3_linear_regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "4a81d7b3", 6 | "metadata": {}, 7 | "source": [ 8 | "# Simple Linear Regression\n", 9 | "\n", 10 | "Import the following dependencies" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "id": "e3f7b3cb", 17 | "metadata": {}, 18 
| "outputs": [], 19 | "source": [ 20 | "import pandas as pd\n", 21 | "import matplotlib.pyplot as plt\n", 22 | "from sklearn.linear_model import LinearRegression\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "id": "bf46ef9c", 28 | "metadata": {}, 29 | "source": [ 30 | "Import the small dataset containing input variable `x` and output variable `y`. " 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "id": "8ed6c9c6", 37 | "metadata": { 38 | "scrolled": true 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "# Import points\n", 43 | "df = pd.read_csv('https://bit.ly/3goOAnt', delimiter=\",\")\n", 44 | "df\n" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "id": "0dd40ef2", 50 | "metadata": {}, 51 | "source": [ 52 | "Extract the two columns. " 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "id": "45e5d7f2", 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "# Extract input variables (all rows, all columns but last column)\n", 63 | "X = df.values[:, :-1]\n", 64 | "\n", 65 | "# Extract output column (all rows, last column)\n", 66 | "Y = df.values[:, -1]\n" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "bbd9f781", 72 | "metadata": {}, 73 | "source": [ 74 | "Fit the `LinearRegression` model and extract the two coefficients. " 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "id": "bd01c48d", 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "# Fit a line to the points\n", 85 | "fit = LinearRegression().fit(X, Y)\n", 86 | "\n", 87 | "# m = 1.7867224, b = -16.51923513\n", 88 | "m = fit.coef_.flatten()\n", 89 | "b = fit.intercept_.flatten()\n", 90 | "print(\"m = {0}\".format(m))\n", 91 | "print(\"b = {0}\".format(b))\n" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "id": "ac6ad7c6", 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "\n", 102 | "# show in chart\n", 103 | "plt.plot(X, Y, 'o') # scatterplot\n", 104 | "plt.plot(X, m*X+b) # line\n", 105 | "plt.show()" 106 | ] 107 | } 108 | ], 109 | "metadata": { 110 | "kernelspec": { 111 | "display_name": "Python 3 (ipykernel)", 112 | "language": "python", 113 | "name": "python3" 114 | }, 115 | "language_info": { 116 | "codemirror_mode": { 117 | "name": "ipython", 118 | "version": 3 119 | }, 120 | "file_extension": ".py", 121 | "mimetype": "text/x-python", 122 | "name": "python", 123 | "nbconvert_exporter": "python", 124 | "pygments_lexer": "ipython3", 125 | "version": "3.9.7" 126 | } 127 | }, 128 | "nbformat": 4, 129 | "nbformat_minor": 5 130 | } 131 | -------------------------------------------------------------------------------- /1_2_sum_of_squares.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "04a5dfbf", 6 | "metadata": {}, 7 | "source": [ 8 | "## Calculating Sum of Squares" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "5bec5246", 14 | "metadata": {}, 15 | "source": [ 16 | "Declare 3 data points as training data" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "fb1be73a", 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "x_data = [ 1.0, 2.0, 3.0 ]\n", 27 | "y_actuals = [ 1.2, 1.25, 2.0 ]" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "id": "5f710efc", 33 | "metadata": {}, 34 | "source": [ 35 | "Declare slope and intercept coefficients" 36 | ] 37 
| }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "id": "e7413e5e", 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "m = .368421\n", 46 | "b = .587719" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "e7a2beea", 52 | "metadata": {}, 53 | "source": [ 54 | "Plot the scatterplot with the line" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "id": "c720de27", 61 | "metadata": { 62 | "scrolled": false 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "import matplotlib.pyplot as plt\n", 67 | "\n", 68 | "# show in chart\n", 69 | "plt.plot(x_data, y_actuals, 'o') # scatterplot\n", 70 | "plt.plot(x_data, [m*x+b for x in x_data]) # line\n", 71 | "plt.show()" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "id": "afcb0105", 77 | "metadata": {}, 78 | "source": [ 79 | "Calculate predicted y-values" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "id": "195a0f06", 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "y_predicts = [ m*x + b for x in x_data ] " 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "id": "dc396dba", 95 | "metadata": {}, 96 | "source": [ 97 | "Print sum of squares" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "id": "5d3387ec", 104 | "metadata": { 105 | "scrolled": true 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "sum_of_squares = 0\n", 110 | "\n", 111 | "for (y_actual, y_predict) in zip(y_actuals, y_predicts): \n", 112 | " residual = y_actual - y_predict\n", 113 | " sum_of_squares += residual ** 2\n", 114 | "\n", 115 | "print(\"SUM OF SQUARES: \", sum_of_squares)" 116 | ] 117 | } 118 | ], 119 | "metadata": { 120 | "kernelspec": { 121 | "display_name": "Python 3 (ipykernel)", 122 | "language": "python", 123 | "name": "python3" 124 | }, 125 | "language_info": { 126 | "codemirror_mode": { 127 | "name": "ipython", 128 | "version": 3 129 | }, 130 | "file_extension": ".py", 131 | "mimetype": "text/x-python", 132 | "name": "python", 133 | "nbconvert_exporter": "python", 134 | "pygments_lexer": "ipython3", 135 | "version": "3.9.7" 136 | } 137 | }, 138 | "nbformat": 4, 139 | "nbformat_minor": 5 140 | } 141 | -------------------------------------------------------------------------------- /4_1_decision_tree.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "d5cba8d4", 6 | "metadata": {}, 7 | "source": [ 8 | "# Decision Trees\n", 9 | "\n", 10 | "Import dependencies." 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "id": "32cceb26", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import pandas as pd\n", 21 | "import numpy as np \n", 22 | "\n", 23 | "from sklearn.metrics import confusion_matrix\n", 24 | "from sklearn.model_selection import train_test_split\n", 25 | "from sklearn.tree import DecisionTreeClassifier\n", 26 | "\n", 27 | "np.set_printoptions(suppress=True)" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "id": "7f6b0ddb", 33 | "metadata": {}, 34 | "source": [ 35 | "Import good/bad weather data." 
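The decision tree fitted later in this notebook uses `criterion='gini'`. As a rough sketch of what that means: the Gini impurity of a node is $ 1 - \sum_i p_i^2 $ over the class proportions $ p_i $, and the splitter favors splits that reduce it. The labels below are hypothetical, not from the weather dataset.

```python
def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini([1, 1, 1, 1]))  # 0.0 -> a pure node
print(gini([1, 1, 0, 0]))  # 0.5 -> maximally mixed for two classes
```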
36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "id": "29f37ab9", 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "df = pd.read_csv('https://bit.ly/3zVspy4')\n", 46 | "df" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "63ae23d3", 52 | "metadata": {}, 53 | "source": [ 54 | "Separate the input and output variables, and set aside $ 1/3 $ of the data for testing. " 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "id": "90f1ad16", 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "X = df.values[:, :-1]\n", 65 | "Y = df.values[:, -1]\n", 66 | "\n", 67 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=10)\n" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "id": "7b351c61", 73 | "metadata": {}, 74 | "source": [ 75 | "Declare the decision tree and fit the training data. " 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "id": "b4725b68", 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "model = DecisionTreeClassifier(max_depth=10, criterion='gini')\n", 86 | "model.fit(X_train, Y_train)\n" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "id": "97f7dfd4", 92 | "metadata": {}, 93 | "source": [ 94 | "Score the accuracy of the model. " 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "id": "df1fee2e", 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "results = model.score(X_test, Y_test)\n", 105 | "print(results)\n" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "id": "c6f0ab43", 111 | "metadata": {}, 112 | "source": [ 113 | "Show the confusion matrix. " 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "id": "835e178d", 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "matrix = confusion_matrix(y_true=Y_test, y_pred=model.predict(X_test))\n", 124 | "print(matrix)\n" 125 | ] 126 | } 127 | ], 128 | "metadata": { 129 | "kernelspec": { 130 | "display_name": "Python 3 (ipykernel)", 131 | "language": "python", 132 | "name": "python3" 133 | }, 134 | "language_info": { 135 | "codemirror_mode": { 136 | "name": "ipython", 137 | "version": 3 138 | }, 139 | "file_extension": ".py", 140 | "mimetype": "text/x-python", 141 | "name": "python", 142 | "nbconvert_exporter": "python", 143 | "pygments_lexer": "ipython3", 144 | "version": "3.9.7" 145 | } 146 | }, 147 | "nbformat": 4, 148 | "nbformat_minor": 5 149 | } 150 | -------------------------------------------------------------------------------- /2_2_multiple_logistic_regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "04a5dfbf", 6 | "metadata": {}, 7 | "source": [ 8 | "## Multiple Logistic Regression" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "32c04dee", 14 | "metadata": {}, 15 | "source": [ 16 | "Import Pandas and scikit-learn dependencies, particularly the `LogisticRegression` linear model." 
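Before the scikit-learn walkthrough continues, it may help to see the function being fitted: multiple logistic regression passes a linear combination of the inputs through the logistic curve, $ p = 1 / (1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}) $. The coefficient values below are placeholders for illustration, not the fitted ones.

```python
import numpy as np

def predict_good_weather(x, b0, bx):
    # logistic function applied to a linear combination of inputs
    return 1.0 / (1.0 + np.exp(-(b0 + np.dot(bx, x))))

# hypothetical coefficients for rain, lightning, cloudy, temperature
x = np.array([0, 0, 1, 76])
print(predict_good_weather(x, b0=-3.2, bx=np.array([-1.1, -2.0, 0.3, 0.05])))
```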
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "0e5b3e6e", 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import pandas as pd\n", 27 | "import numpy as np \n", 28 | "from sklearn.linear_model import LogisticRegression" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "5bec5246", 34 | "metadata": {}, 35 | "source": [ 36 | "Read and display the five columns of data, and save to a `df` variable. " 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "id": "fb1be73a", 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "df = pd.read_csv('https://bit.ly/3SMHvPa', delimiter=\",\")\n", 47 | "df" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "id": "5f710efc", 53 | "metadata": {}, 54 | "source": [ 55 | "Extract the independent `X` variables and dependent `Y` variable as two separate datasets. " 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "id": "e7413e5e", 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "X = df.values[:, :-1]\n", 66 | "Y = df.values[:, -1]" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "afcb0105", 72 | "metadata": {}, 73 | "source": [ 74 | "Create the `LogisticRegression` model and train it with the `X` and `Y` data. To keep things simple, be sure to set `penalty` to `none` to maximize fitting. " 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "id": "195a0f06", 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "model = LogisticRegression(penalty='none')\n", 85 | "model.fit(X, Y) " 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "id": "dc396dba", 91 | "metadata": {}, 92 | "source": [ 93 | "Print the coefficient values. 
" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "id": "5d3387ec", 100 | "metadata": { 101 | "scrolled": false 102 | }, 103 | "outputs": [], 104 | "source": [ 105 | "bx = model.coef_\n", 106 | "b0 = model.intercept_\n", 107 | "print(b0, bx)" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "id": "f0e13098", 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "# Make a prediction on whether an observation is good weather\n", 118 | "# rain, lightning, cloudy, temperature\n", 119 | "prediction = model.predict_proba([[0, 0, 1, 76]])\n", 120 | "print(\"GOOD WEATHER PREDICTION: \", prediction)" 121 | ] 122 | } 123 | ], 124 | "metadata": { 125 | "kernelspec": { 126 | "display_name": "Python 3 (ipykernel)", 127 | "language": "python", 128 | "name": "python3" 129 | }, 130 | "language_info": { 131 | "codemirror_mode": { 132 | "name": "ipython", 133 | "version": 3 134 | }, 135 | "file_extension": ".py", 136 | "mimetype": "text/x-python", 137 | "name": "python", 138 | "nbconvert_exporter": "python", 139 | "pygments_lexer": "ipython3", 140 | "version": "3.9.7" 141 | } 142 | }, 143 | "nbformat": 4, 144 | "nbformat_minor": 5 145 | } 146 | -------------------------------------------------------------------------------- /3_exercise.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "a6d04f10", 6 | "metadata": {}, 7 | "source": [ 8 | "# EXERCISE - Naive Bayes" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "3d096e1a", 14 | "metadata": {}, 15 | "source": [ 16 | "In the runnable cell below, replace the question marks \"?\" with the proper Python code to perform a Naive Bayes classification on a banking transaction dataset (https://bit.ly/3QJhJd4).\n", 17 | "\n", 18 | "Use $ 1 / 4 $ of the data for testing and evaluate the prediction performance with a confusion matrix. " 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "id": "0d149dc1", 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "import pandas as pd\n", 29 | "from sklearn.model_selection import train_test_split\n", 30 | "from sklearn.feature_extraction.text import CountVectorizer\n", 31 | "from sklearn.naive_bayes import MultinomialNB\n", 32 | "from sklearn.metrics import confusion_matrix\n", 33 | "\n", 34 | "df = pd.read_csv('https://bit.ly/3QJhJd4')\n", 35 | "\n", 36 | "cv = ?\n", 37 | "X = cv.fit_transform(df['MEMO'])\n", 38 | "Y = df['CATEGORY']\n", 39 | "\n", 40 | "x_train, x_test, y_train, y_test = train_test_split(?, ?, test_size=?, random_state=7)\n", 41 | "\n", 42 | "model = MultinomialNB().fit(?, ?)\n", 43 | "\n", 44 | "result = model.score(x_test,?)\n", 45 | "\n", 46 | "print(result)\n", 47 | "\n", 48 | "confusion_matrix(y_true=?, y_pred=model.predict(x_test))" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "c41eaed7", 54 | "metadata": {}, 55 | "source": [ 56 | "### SCROLL DOWN FOR ANSWER\n", 57 | "|
\n", 58 | "|
\n", 59 | "|
\n", 60 | "|
\n", 61 | "|
\n", 62 | "|
\n", 63 | "|
\n", 64 | "|
\n", 65 | "|
\n", 66 | "|
\n", 67 | "|
\n", 68 | "|
\n", 69 | "|
\n", 70 | "|
\n", 71 | "|
\n", 72 | "|
\n", 73 | "|
\n", 74 | "|
\n", 75 | "|
\n", 76 | "|
\n", 77 | "|
\n", 78 | "|
\n", 79 | "|
\n", 80 | "v \n", 81 | "\n", 82 | "```python\n", 83 | "import pandas as pd\n", 84 | "from sklearn.model_selection import train_test_split\n", 85 | "from sklearn.feature_extraction.text import CountVectorizer\n", 86 | "from sklearn.naive_bayes import MultinomialNB\n", 87 | "from sklearn.metrics import confusion_matrix\n", 88 | "\n", 89 | "df = pd.read_csv('https://bit.ly/3QJhJd4')\n", 90 | "\n", 91 | "cv = CountVectorizer()\n", 92 | "X = cv.fit_transform(df['MEMO'])\n", 93 | "Y = df['CATEGORY']\n", 94 | "\n", 95 | "x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=1.0/4.0, random_state=7)\n", 96 | "\n", 97 | "model = MultinomialNB().fit(x_train, y_train)\n", 98 | "\n", 99 | "result = model.score(x_test,y_test)\n", 100 | "\n", 101 | "print(result)\n", 102 | "\n", 103 | "confusion_matrix(y_true=y_test, y_pred=model.predict(x_test))\n", 104 | "```" 105 | ] 106 | } 107 | ], 108 | "metadata": { 109 | "kernelspec": { 110 | "display_name": "Python 3 (ipykernel)", 111 | "language": "python", 112 | "name": "python3" 113 | }, 114 | "language_info": { 115 | "codemirror_mode": { 116 | "name": "ipython", 117 | "version": 3 118 | }, 119 | "file_extension": ".py", 120 | "mimetype": "text/x-python", 121 | "name": "python", 122 | "nbconvert_exporter": "python", 123 | "pygments_lexer": "ipython3", 124 | "version": "3.9.7" 125 | } 126 | }, 127 | "nbformat": 4, 128 | "nbformat_minor": 5 129 | } 130 | -------------------------------------------------------------------------------- /4_2_random_forest.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "14f7c037", 6 | "metadata": {}, 7 | "source": [ 8 | "# Random Forest\n", 9 | "\n", 10 | "Import dependencies." 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "id": "799028e9", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import pandas as pd\n", 21 | "from sklearn.ensemble import RandomForestClassifier\n", 22 | "from sklearn.metrics import confusion_matrix\n", 23 | "from sklearn.model_selection import train_test_split" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "id": "430883bc", 29 | "metadata": {}, 30 | "source": [ 31 | "Import good/bad weather data and display in a dataframe." 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "id": "321ec5a1", 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "df = pd.read_csv('https://bit.ly/3zVspy4')\n", 42 | "df" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "id": "31e402ae", 48 | "metadata": {}, 49 | "source": [ 50 | "Separate the input and output data, and set aside $ 1/3 $ of the data for training. " 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "id": "5738b730", 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "X = df.values[:, :-1]\n", 61 | "Y = df.values[:, -1]\n", 62 | "\n", 63 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=10)\n" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "id": "9687f8bd", 69 | "metadata": {}, 70 | "source": [ 71 | "Create a random forest classifier model with 300 trees. Limit their depth to 10 nodes and only allow a maximum of 4 features to be used for each tree. 
" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "id": "e9e6b8f3", 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "model = RandomForestClassifier(n_estimators=300, max_depth=10, max_features=4, criterion='gini')\n", 82 | "model.fit(X_train, Y_train)\n" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "id": "f9dc9e86", 88 | "metadata": {}, 89 | "source": [ 90 | "Score the accuracy of the decision tree. " 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "id": "af342c54", 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "results = model.score(X_test, Y_test)\n", 101 | "print(results)\n" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "id": "317f302a", 107 | "metadata": {}, 108 | "source": [ 109 | "View the confusion matrix to see the accuracy of true predictions and false predictions. " 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "id": "1ded585d", 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "matrix = confusion_matrix(y_true=Y_test, y_pred=model.predict(X_test))\n", 120 | "print(matrix)\n" 121 | ] 122 | } 123 | ], 124 | "metadata": { 125 | "kernelspec": { 126 | "display_name": "Python 3 (ipykernel)", 127 | "language": "python", 128 | "name": "python3" 129 | }, 130 | "language_info": { 131 | "codemirror_mode": { 132 | "name": "ipython", 133 | "version": 3 134 | }, 135 | "file_extension": ".py", 136 | "mimetype": "text/x-python", 137 | "name": "python", 138 | "nbconvert_exporter": "python", 139 | "pygments_lexer": "ipython3", 140 | "version": "3.9.7" 141 | } 142 | }, 143 | "nbformat": 4, 144 | "nbformat_minor": 5 145 | } 146 | -------------------------------------------------------------------------------- /1b_exercise.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "a6d04f10", 6 | "metadata": {}, 7 | "source": [ 8 | "# EXERCISE - Linear Regression" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "3d096e1a", 14 | "metadata": {}, 15 | "source": [ 16 | "In the runnable cells below, replace the question marks \"?\" with the proper Python code to perform a linear regression with $ 1 / 3 $ of the data used for testing. 
" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "0d149dc1", 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import pandas as pd\n", 27 | "from sklearn.linear_model import LinearRegression\n", 28 | "from sklearn.model_selection import train_test_split\n", 29 | "\n", 30 | "# Load the data\n", 31 | "df = pd.read_csv('https://bit.ly/3pBKSuN', delimiter=\",\")\n", 32 | "\n", 33 | "# Extract input variables (all rows, all columns but last column)\n", 34 | "X = df.values[:, :-1]\n", 35 | "\n", 36 | "# Extract output column (all rows, last column)\n", 37 | "Y = df.values[:, -1]\n", 38 | "\n", 39 | "# Separate training and testing data to evaluate performance and reduce overfitting\n", 40 | "X_train, X_test, Y_train, Y_test = train_test_split(?, ?, test_size=?, random_state=10)\n", 41 | "\n", 42 | "# Train the model \n", 43 | "model = LinearRegression()\n", 44 | "model.fit(?, ?)\n", 45 | "result = model.score(?, ?)\n", 46 | "\n", 47 | "# Print the Score \n", 48 | "print(\"R^2: %.3f\" % result)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "c41eaed7", 54 | "metadata": {}, 55 | "source": [ 56 | "### SCROLL DOWN FOR ANSWER\n", 57 | "|
\n", 58 | "|
\n", 59 | "|
\n", 60 | "|
\n", 61 | "|
\n", 62 | "|
\n", 63 | "|
\n", 64 | "|
\n", 65 | "|
\n", 66 | "|
\n", 67 | "|
\n", 68 | "|
\n", 69 | "|
\n", 70 | "|
\n", 71 | "|
\n", 72 | "|
\n", 73 | "|
\n", 74 | "|
\n", 75 | "|
\n", 76 | "|
\n", 77 | "|
\n", 78 | "|
\n", 79 | "|
\n", 80 | "v \n", 81 | "\n", 82 | "```python\n", 83 | "import pandas as pd\n", 84 | "from sklearn.linear_model import LinearRegression\n", 85 | "from sklearn.model_selection import train_test_split\n", 86 | "\n", 87 | "# Load the data\n", 88 | "df = pd.read_csv('https://bit.ly/3pBKSuN', delimiter=\",\")\n", 89 | "\n", 90 | "# Extract input variables (all rows, all columns but last column)\n", 91 | "X = df.values[:, :-1]\n", 92 | "\n", 93 | "# Extract output column (all rows, last column)\n", 94 | "Y = df.values[:, -1]\n", 95 | "\n", 96 | "# Separate training and testing data to evaluate performance and reduce overfitting\n", 97 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=10)\n", 98 | "\n", 99 | "# Train the model \n", 100 | "model = LinearRegression()\n", 101 | "model.fit(X_train, Y_train)\n", 102 | "result = model.score(X_test, Y_test)\n", 103 | "\n", 104 | "# Print the Score \n", 105 | "print(\"R^2: %.3f\" % result)\n", 106 | "# R^2: 0.182\n", 107 | "```" 108 | ] 109 | } 110 | ], 111 | "metadata": { 112 | "kernelspec": { 113 | "display_name": "Python 3 (ipykernel)", 114 | "language": "python", 115 | "name": "python3" 116 | }, 117 | "language_info": { 118 | "codemirror_mode": { 119 | "name": "ipython", 120 | "version": 3 121 | }, 122 | "file_extension": ".py", 123 | "mimetype": "text/x-python", 124 | "name": "python", 125 | "nbconvert_exporter": "python", 126 | "pygments_lexer": "ipython3", 127 | "version": "3.9.7" 128 | } 129 | }, 130 | "nbformat": 4, 131 | "nbformat_minor": 5 132 | } 133 | -------------------------------------------------------------------------------- /3_1_naive_bayes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f5178e18", 6 | "metadata": {}, 7 | "source": [ 8 | "# Naive Bayes Spam Email Classifier\n", 9 | "\n", 10 | "Import dependencies, turn off scientfic notation." 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "id": "f15900ff", 17 | "metadata": { 18 | "scrolled": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "import pandas as pd\n", 23 | "import numpy as np \n", 24 | "from sklearn.model_selection import train_test_split\n", 25 | "from sklearn.feature_extraction.text import CountVectorizer\n", 26 | "from sklearn.naive_bayes import MultinomialNB\n", 27 | "from sklearn.metrics import confusion_matrix\n", 28 | "\n", 29 | "np.set_printoptions(suppress=True)" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "id": "fb21b67b", 35 | "metadata": {}, 36 | "source": [ 37 | "Bring in the small email dataset and display it. " 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "id": "65df9399", 44 | "metadata": { 45 | "scrolled": false 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "df = pd.read_csv('https://bit.ly/3zQBV5y')\n", 50 | "df" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "id": "c2f3c208", 56 | "metadata": {}, 57 | "source": [ 58 | "Vectorize the message in each email by counting each word occurrence, and break it up into input X and output Y columns. 
" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "id": "db334c31", 65 | "metadata": { 66 | "scrolled": false 67 | }, 68 | "outputs": [], 69 | "source": [ 70 | "cv = CountVectorizer()\n", 71 | "X = cv.fit_transform(df['msg'])\n", 72 | "Y = df['spam_ind']\n", 73 | "\n", 74 | "# Print count vectorizer as a table \n", 75 | "pd.DataFrame(X.toarray(),columns= cv.get_feature_names_out())\n" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "id": "2d4cacd5", 81 | "metadata": {}, 82 | "source": [ 83 | "Break up the emails into train/test datasets " 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "id": "84248f2e", 90 | "metadata": { 91 | "scrolled": true 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=7)\n", 96 | "\n", 97 | "model = MultinomialNB().fit(x_train, y_train)\n" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "id": "bd08426e", 103 | "metadata": {}, 104 | "source": [ 105 | "Score the accuracy of the model." 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "id": "77fc131e", 112 | "metadata": { 113 | "scrolled": false 114 | }, 115 | "outputs": [], 116 | "source": [ 117 | "result = model.score(x_test,y_test)\n", 118 | "\n", 119 | "print(result)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "id": "07e3bf17", 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "confusion_matrix(y_true=y_test, y_pred=model.predict(x_test))" 130 | ] 131 | } 132 | ], 133 | "metadata": { 134 | "kernelspec": { 135 | "display_name": "Python 3 (ipykernel)", 136 | "language": "python", 137 | "name": "python3" 138 | }, 139 | "language_info": { 140 | "codemirror_mode": { 141 | "name": "ipython", 142 | "version": 3 143 | }, 144 | "file_extension": ".py", 145 | "mimetype": "text/x-python", 146 | "name": "python", 147 | "nbconvert_exporter": "python", 148 | "pygments_lexer": "ipython3", 149 | "version": "3.9.7" 150 | } 151 | }, 152 | "nbformat": 4, 153 | "nbformat_minor": 5 154 | } 155 | -------------------------------------------------------------------------------- /1a_exercise.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "04a5dfbf", 6 | "metadata": {}, 7 | "source": [ 8 | "## EXERCISE - Sum of Squares" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "5bec5246", 14 | "metadata": {}, 15 | "source": [ 16 | "In the Python code below, replace the question marks \"?\" with the proper code to calculate the sum of squares for the given dataset. 
" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 9, 22 | "id": "fb1be73a", 23 | "metadata": {}, 24 | "outputs": [ 25 | { 26 | "name": "stdout", 27 | "output_type": "stream", 28 | "text": [ 29 | "SUM OF SQUARES: [4568.04002146]\n" 30 | ] 31 | } 32 | ], 33 | "source": [ 34 | "import pandas as pd\n", 35 | "\n", 36 | "# import dataframe containing two columns of data\n", 37 | "df = pd.read_csv('https://bit.ly/3pBKSuN')\n", 38 | "\n", 39 | "\n", 40 | "# declare line with slope and intercept\n", 41 | "m = 1.86305\n", 42 | "b = -0.299037\n", 43 | "\n", 44 | "# Extract input variables (all rows, all columns but last column)\n", 45 | "X = df.values[:, :-1]\n", 46 | "\n", 47 | "# Extract output column (all rows, last column)\n", 48 | "Y_actuals = df.values[:, -1]\n", 49 | "\n", 50 | "# calculate y_predictions \n", 51 | "Y_predicts = ?*X + ?\n", 52 | "\n", 53 | "sum_of_squares = 0\n", 54 | "\n", 55 | "for (y_actual, y_predict) in zip(Y_actuals, Y_predicts): \n", 56 | " residual = ? - ?\n", 57 | " sum_of_squares += ? ** 2\n", 58 | "\n", 59 | "print(\"SUM OF SQUARES: \", sum_of_squares)\n" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "id": "80aeaa1c", 65 | "metadata": {}, 66 | "source": [ 67 | "### SCROLL DOWN FOR ANSWER\n", 68 | "|
\n", 69 | "|
\n", 70 | "|
\n", 71 | "|
\n", 72 | "|
\n", 73 | "|
\n", 74 | "|
\n", 75 | "|
\n", 76 | "|
\n", 77 | "|
\n", 78 | "|
\n", 79 | "|
\n", 80 | "|
\n", 81 | "|
\n", 82 | "|
\n", 83 | "|
\n", 84 | "|
\n", 85 | "|
\n", 86 | "|
\n", 87 | "|
\n", 88 | "|
\n", 89 | "|
\n", 90 | "|
\n", 91 | "v \n", 92 | "\n", 93 | "```python\n", 94 | "import pandas as pd\n", 95 | "\n", 96 | "# import dataframe containing two columns of data\n", 97 | "df = pd.read_csv('https://bit.ly/3pBKSuN')\n", 98 | "\n", 99 | "\n", 100 | "# declare line with slope and intercept\n", 101 | "m = 1.86305\n", 102 | "b = -0.299037\n", 103 | "\n", 104 | "# Extract input variables (all rows, all columns but last column)\n", 105 | "X = df.values[:, :-1]\n", 106 | "\n", 107 | "# Extract output column (all rows, last column)\n", 108 | "Y_actuals = df.values[:, -1]\n", 109 | "\n", 110 | "# calculate y_predictions \n", 111 | "y_predicts = m*X + b \n", 112 | "\n", 113 | "sum_of_squares = 0\n", 114 | "\n", 115 | "for (y_actual, y_predict) in zip(Y_actuals, y_predicts): \n", 116 | " residual = y_actual - y_predict\n", 117 | " sum_of_squares += residual ** 2\n", 118 | "\n", 119 | "print(\"SUM OF SQUARES: \", sum_of_squares)\n", 120 | "# SUM OF SQUARES: [4568.04002146]\n", 121 | "```" 122 | ] 123 | } 124 | ], 125 | "metadata": { 126 | "kernelspec": { 127 | "display_name": "Python 3 (ipykernel)", 128 | "language": "python", 129 | "name": "python3" 130 | }, 131 | "language_info": { 132 | "codemirror_mode": { 133 | "name": "ipython", 134 | "version": 3 135 | }, 136 | "file_extension": ".py", 137 | "mimetype": "text/x-python", 138 | "name": "python", 139 | "nbconvert_exporter": "python", 140 | "pygments_lexer": "ipython3", 141 | "version": "3.9.7" 142 | } 143 | }, 144 | "nbformat": 4, 145 | "nbformat_minor": 5 146 | } 147 | -------------------------------------------------------------------------------- /1_4_train_test_split_linear_regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "a6d04f10", 6 | "metadata": {}, 7 | "source": [ 8 | "# Linear Regression Train/Test Split\n", 9 | "\n" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "id": "0d149dc1", 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "import pandas as pd\n", 20 | "from sklearn.linear_model import LinearRegression\n", 21 | "from sklearn.model_selection import train_test_split\n" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "id": "17730ca6", 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "# Load the data\n", 32 | "df = pd.read_csv('https://bit.ly/3cIH97A', delimiter=\",\")\n", 33 | "df" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "id": "201c6248", 39 | "metadata": {}, 40 | "source": [ 41 | "Extract the $ X $ and $ Y $ columns from the dataframe." 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "id": "30c98d98", 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "# Extract input variables (all rows, all columns but last column)\n", 52 | "X = df.values[:, :-1]\n", 53 | "\n", 54 | "# Extract output column (all rows, last column)\n", 55 | "Y = df.values[:, -1]\n" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "id": "cf69454c", 61 | "metadata": {}, 62 | "source": [ 63 | "Separate out the training and testing data." 
64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "id": "322afb6f", 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "# Separate training and testing data to evaluate performance and reduce overfitting\n", 74 | "# This leaves a third of the data out for testing\n", 75 | "# Set a random seed just to make the randomly selected split consistent\n", 76 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=10)\n" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "id": "a0e09fc8", 82 | "metadata": {}, 83 | "source": [ 84 | "Fit a linear regression with the training data and calculate the $ R^2 $ for the test dataset. " 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "id": "07061775", 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "model = LinearRegression()\n", 95 | "model.fit(X_train, Y_train)\n", 96 | "result = model.score(X_test, Y_test)\n", 97 | "print(\"R^2: %.3f\" % result)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "id": "483967ea", 103 | "metadata": {}, 104 | "source": [ 105 | "Plot the linear regression against the scatterplot." 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "id": "2532442c", 112 | "metadata": { 113 | "scrolled": true 114 | }, 115 | "outputs": [], 116 | "source": [ 117 | "import matplotlib.pyplot as plt\n", 118 | "\n", 119 | "# show in chart\n", 120 | "plt.plot(X, Y, 'o') # scatterplot\n", 121 | "plt.plot(X, model.coef_.flatten()*X+model.intercept_.flatten()) # line\n", 122 | "plt.show()" 123 | ] 124 | } 125 | ], 126 | "metadata": { 127 | "kernelspec": { 128 | "display_name": "Python 3 (ipykernel)", 129 | "language": "python", 130 | "name": "python3" 131 | }, 132 | "language_info": { 133 | "codemirror_mode": { 134 | "name": "ipython", 135 | "version": 3 136 | }, 137 | "file_extension": ".py", 138 | "mimetype": "text/x-python", 139 | "name": "python", 140 | "nbconvert_exporter": "python", 141 | "pygments_lexer": "ipython3", 142 | "version": "3.9.7" 143 | } 144 | }, 145 | "nbformat": 4, 146 | "nbformat_minor": 5 147 | } 148 | -------------------------------------------------------------------------------- /4_exercise.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "a6d04f10", 6 | "metadata": {}, 7 | "source": [ 8 | "# EXERCISE - Decision Trees and Random Forests" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "3d096e1a", 14 | "metadata": {}, 15 | "source": [ 16 | "In the runnable cell below, replace the question marks \"?\" with the proper Python code to perform a random forest classification on a maintenance part replacement prediction dataset (https://bit.ly/3QHvclX). Use 300 decision trees with a max depth of 10 and a maximum of 3 features per tree. \n", 17 | "\n", 18 | "Use $ 1 / 3 $ of the data for testing and evaluate the prediction performance with a confusion matrix. 
" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "id": "0d149dc1", 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "import pandas as pd\n", 29 | "from sklearn.metrics import confusion_matrix\n", 30 | "from sklearn.model_selection import train_test_split\n", 31 | "from sklearn.tree import DecisionTreeClassifier\n", 32 | "from sklearn.ensemble import RandomForestClassifier\n", 33 | "\n", 34 | "\n", 35 | "df = pd.read_csv('https://bit.ly/3QHvclX')\n", 36 | "\n", 37 | "X = df.values[:, :-1]\n", 38 | "Y = df.values[:, -1]\n", 39 | "\n", 40 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=10)\n", 41 | "\n", 42 | "model = RandomForestClassifier(n_estimators=?, max_depth=?, max_features=?, criterion='gini')\n", 43 | "model.fit(?, ?)\n", 44 | "\n", 45 | "\n", 46 | "results = model.score(?, Y_test)\n", 47 | "print(results)\n", 48 | "\n", 49 | "matrix = confusion_matrix(y_true=?, y_pred=model.predict(?))\n", 50 | "print(matrix)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "id": "c41eaed7", 56 | "metadata": {}, 57 | "source": [ 58 | "### SCROLL DOWN FOR ANSWER\n", 59 | "|
\n", 60 | "|
\n", 61 | "|
\n", 62 | "|
\n", 63 | "|
\n", 64 | "|
\n", 65 | "|
\n", 66 | "|
\n", 67 | "|
\n", 68 | "|
\n", 69 | "|
\n", 70 | "|
\n", 71 | "|
\n", 72 | "|
\n", 73 | "|
\n", 74 | "|
\n", 75 | "|
\n", 76 | "|
\n", 77 | "|
\n", 78 | "|
\n", 79 | "|
\n", 80 | "|
\n", 81 | "|
\n", 82 | "v \n", 83 | "\n", 84 | "```python\n", 85 | "import pandas as pd\n", 86 | "from sklearn.metrics import confusion_matrix\n", 87 | "from sklearn.model_selection import train_test_split\n", 88 | "from sklearn.tree import DecisionTreeClassifier\n", 89 | "from sklearn.ensemble import RandomForestClassifier\n", 90 | "\n", 91 | "\n", 92 | "df = pd.read_csv('https://bit.ly/3QHvclX')\n", 93 | "\n", 94 | "X = df.values[:, :-1]\n", 95 | "Y = df.values[:, -1]\n", 96 | "\n", 97 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=10)\n", 98 | "\n", 99 | "model = RandomForestClassifier(n_estimators=300, max_depth=10, max_features=3, criterion='gini')\n", 100 | "model.fit(X_train, Y_train)\n", 101 | "\n", 102 | "\n", 103 | "results = model.score(X_test, Y_test)\n", 104 | "print(results)\n", 105 | "\n", 106 | "matrix = confusion_matrix(y_true=Y_test, y_pred=model.predict(X_test))\n", 107 | "print(matrix)\n", 108 | "```" 109 | ] 110 | } 111 | ], 112 | "metadata": { 113 | "kernelspec": { 114 | "display_name": "Python 3 (ipykernel)", 115 | "language": "python", 116 | "name": "python3" 117 | }, 118 | "language_info": { 119 | "codemirror_mode": { 120 | "name": "ipython", 121 | "version": 3 122 | }, 123 | "file_extension": ".py", 124 | "mimetype": "text/x-python", 125 | "name": "python", 126 | "nbconvert_exporter": "python", 127 | "pygments_lexer": "ipython3", 128 | "version": "3.9.7" 129 | } 130 | }, 131 | "nbformat": 4, 132 | "nbformat_minor": 5 133 | } 134 | -------------------------------------------------------------------------------- /5_1_neural_network_classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "fbcf111f", 6 | "metadata": {}, 7 | "source": [ 8 | "# Neural Networks\n", 9 | "\n", 10 | "Import dependencies." 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "id": "59e4ed99", 17 | "metadata": { 18 | "scrolled": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "import pandas as pd\n", 23 | "from sklearn.model_selection import train_test_split\n", 24 | "from sklearn.neural_network import MLPClassifier\n", 25 | "from sklearn.metrics import confusion_matrix\n" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "id": "b109689c", 31 | "metadata": {}, 32 | "source": [ 33 | "Bring in data and present in dataframe." 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "id": "b2dfae00", 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "df = pd.read_csv('https://bit.ly/3GsNzGt', delimiter=\",\")\n", 44 | "df\n" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "id": "1dfa3d3a", 50 | "metadata": {}, 51 | "source": [ 52 | "Extract out input and output variable columns, scale down input variables by 255. Separate training and testing dataset by $ 1/3 $. 
" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "id": "17be5754", 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "\n", 63 | "# Extract input variables (all rows, all columns but last column)\n", 64 | "# Note we should do some linear scaling here\n", 65 | "X = (df.values[:, :-1] / 255.0)\n", 66 | "\n", 67 | "# Extract output column (all rows, last column)\n", 68 | "Y = df.values[:, -1]\n", 69 | "\n", 70 | "# Separate training and testing data\n", 71 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/3, random_state=7)\n", 72 | "\n" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "id": "50530737", 78 | "metadata": {}, 79 | "source": [ 80 | "Fit a neural network (multi-layer perceptron) classifier with a middle ReLU function, 3 nodes in a hidden layer, 100000 iterations, and a .05 learning rate. Fit the training data to the model. " 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "id": "0dbb5b6e", 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "\n", 91 | "nn = MLPClassifier(solver='sgd',\n", 92 | " hidden_layer_sizes=(3, ),\n", 93 | " activation='relu',\n", 94 | " max_iter=100_000,\n", 95 | " learning_rate_init=.05)\n", 96 | "\n", 97 | "nn.fit(X_train, Y_train)\n" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "id": "3e4764b6", 103 | "metadata": {}, 104 | "source": [ 105 | "Print out the coefficients and test score, as well as the confusion matrix. " 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "id": "43b53768", 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "\n", 116 | "# Print weights and biases\n", 117 | "print(nn.coefs_ )\n", 118 | "print(nn.intercepts_)\n", 119 | "\n", 120 | "print(\"Test set score: %f\" % nn.score(X_test, Y_test))\n", 121 | "\n", 122 | "cf = confusion_matrix(y_true=Y_test, y_pred=nn.predict(X_test))\n", 123 | "print(cf)" 124 | ] 125 | } 126 | ], 127 | "metadata": { 128 | "kernelspec": { 129 | "display_name": "Python 3 (ipykernel)", 130 | "language": "python", 131 | "name": "python3" 132 | }, 133 | "language_info": { 134 | "codemirror_mode": { 135 | "name": "ipython", 136 | "version": 3 137 | }, 138 | "file_extension": ".py", 139 | "mimetype": "text/x-python", 140 | "name": "python", 141 | "nbconvert_exporter": "python", 142 | "pygments_lexer": "ipython3", 143 | "version": "3.9.7" 144 | } 145 | }, 146 | "nbformat": 4, 147 | "nbformat_minor": 5 148 | } 149 | -------------------------------------------------------------------------------- /2_1_logistic_regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "04a5dfbf", 6 | "metadata": {}, 7 | "source": [ 8 | "## Simple Logistic Regression" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "32c04dee", 14 | "metadata": {}, 15 | "source": [ 16 | "Import Pandas and scikit-learn dependencies, particularly the `LogisticRegression` linear model." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "0e5b3e6e", 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import pandas as pd\n", 27 | "import numpy as np \n", 28 | "from sklearn.linear_model import LogisticRegression" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "5bec5246", 34 | "metadata": {}, 35 | "source": [ 36 | "Read and display the two columns of data, and save to a `df` variable. " 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "id": "fb1be73a", 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "df = pd.read_csv('https://bit.ly/33ebs2R', delimiter=\",\")\n", 47 | "df" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "id": "5f710efc", 53 | "metadata": {}, 54 | "source": [ 55 | "Extract the independent `X` variables and dependent `Y` variables as two separate columns. " 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "id": "e7413e5e", 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "X = df.values[:, :-1]\n", 66 | "Y = df.values[:, -1]" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "e7a2beea", 72 | "metadata": {}, 73 | "source": [ 74 | "Plot the data with `matplotlib`. " 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "id": "c720de27", 81 | "metadata": { 82 | "scrolled": false 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "import matplotlib.pyplot as plt\n", 87 | "\n", 88 | "# show in chart\n", 89 | "plt.plot(X, Y, 'o') # scatterplot\n", 90 | "plt.show()" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "id": "afcb0105", 96 | "metadata": {}, 97 | "source": [ 98 | "Create the `LogisticRegression` model and train it with the `X` and `Y` data. To keep things simple, be sure to set `penalty` to `none` to maximize fitting. " 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "id": "195a0f06", 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "model = LogisticRegression(penalty='none')\n", 109 | "model.fit(X, Y) " 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "id": "dc396dba", 115 | "metadata": {}, 116 | "source": [ 117 | "Print the coefficient values of $ \\beta_0 $ and $ \\beta_1 $. " 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "id": "5d3387ec", 124 | "metadata": { 125 | "scrolled": false 126 | }, 127 | "outputs": [], 128 | "source": [ 129 | "b1 = model.coef_.flatten()[0]\n", 130 | "b0 = model.intercept_.flatten()[0]\n", 131 | "print(b0, b1)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "id": "d77fb79c", 137 | "metadata": {}, 138 | "source": [ 139 | "Finally, let's plot the logistic regression." 
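One way to sanity-check the fitted coefficients before plotting: in logistic regression, $ \beta_1 $ is the change in the log-odds of $ Y = 1 $ per unit of $ x $, so $ e^{\beta_1} $ acts as an odds multiplier. A small sketch, using example values in place of the `b0` and `b1` printed above:

```python
import numpy as np

b0, b1 = -3.2, 0.69  # example values; substitute the fitted coefficients

# odds = p / (1 - p) = exp(b0 + b1 * x), so +1 in x multiplies the odds by exp(b1)
print("Each unit increase in x multiplies the odds of Y=1 by", np.exp(b1))
```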
140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "id": "5ae5bb09", 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "plt.plot(X, 1.0 / (1.0 + np.exp(-(b0 + b1*X)))) # curve\n", 150 | "plt.plot(X, Y, 'o') # scatterplot\n", 151 | "plt.show()" 152 | ] 153 | } 154 | ], 155 | "metadata": { 156 | "kernelspec": { 157 | "display_name": "Python 3 (ipykernel)", 158 | "language": "python", 159 | "name": "python3" 160 | }, 161 | "language_info": { 162 | "codemirror_mode": { 163 | "name": "ipython", 164 | "version": 3 165 | }, 166 | "file_extension": ".py", 167 | "mimetype": "text/x-python", 168 | "name": "python", 169 | "nbconvert_exporter": "python", 170 | "pygments_lexer": "ipython3", 171 | "version": "3.9.7" 172 | } 173 | }, 174 | "nbformat": 4, 175 | "nbformat_minor": 5 176 | } 177 | -------------------------------------------------------------------------------- /2_3_confusion_matrix_roc_auc.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "04a5dfbf", 6 | "metadata": {}, 7 | "source": [ 8 | "## Confusion Matrix, ROC, and AUC " 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "32c04dee", 14 | "metadata": {}, 15 | "source": [ 16 | "Import Pandas and scikit-learn dependencies, particularly the `LogisticRegression` linear model." 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "0e5b3e6e", 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import pandas as pd\n", 27 | "import numpy as np \n", 28 | "from sklearn.linear_model import LogisticRegression\n", 29 | "from sklearn.metrics import confusion_matrix\n", 30 | "from sklearn.model_selection import train_test_split" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "5bec5246", 36 | "metadata": {}, 37 | "source": [ 38 | "Read and display the five columns of data, and save to a `df` variable. This dataset contains weather measurements across four independent variables, and an output variable indicating whether this is good weather (1) or not (0). " 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "id": "fb1be73a", 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "df = pd.read_csv('https://bit.ly/3SMHvPa', delimiter=\",\")\n", 49 | "df" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "id": "5f710efc", 55 | "metadata": {}, 56 | "source": [ 57 | "Extract the independent `X` variables and dependent `Y` variable as two separate columns. " 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "id": "e7413e5e", 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "X = df.values[:, :-1]\n", 68 | "Y = df.values[:, -1]" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "id": "afcb0105", 74 | "metadata": {}, 75 | "source": [ 76 | "Split up the training and testing dataset so we leave out $ 1/3 $ of the data for testing. Then create the `LogisticRegression` model, train it with the `X_train` and `Y_train` data, and extract the predictions for the test dataset. 
" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "id": "195a0f06", 83 | "metadata": { 84 | "scrolled": true 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.33, random_state=10)\n", 89 | "\n", 90 | "model = LogisticRegression(penalty='none')\n", 91 | "model.fit(X_train, Y_train)\n", 92 | "\n", 93 | "prediction = model.predict(X_test)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "id": "f619acf7", 99 | "metadata": {}, 100 | "source": [ 101 | "Finally, let's evaluate the test dataset predictions in a confusion matrix. We want the diagonal values from top-right to bottom-left to be as high as possible. The other values ideally should be 0, which would indicate perfect performance. " 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "id": "62a36921", 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "'''\n", 112 | "[[truepositives falsenegatives]\n", 113 | " [falsepositives truenegatives]]\n", 114 | " '''\n", 115 | "matrix = confusion_matrix(y_true=Y_test, y_pred=prediction, normalize=None)\n", 116 | "print(matrix)" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "id": "4d7cc8c7", 122 | "metadata": {}, 123 | "source": [ 124 | "Sure enough, this test dataset does really well and identified all positive and negative cases correctly. If we needed to consolidate perforamcne to a single number to compare to other models, we can use the ROC/AUC for that. This should give us a value of 1.0. " 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "id": "3ecca17a", 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "from sklearn.metrics import roc_auc_score \n", 135 | "results = roc_auc_score(prediction, Y_test)\n", 136 | "print(\"AUC: %.3f\" % results)" 137 | ] 138 | } 139 | ], 140 | "metadata": { 141 | "kernelspec": { 142 | "display_name": "Python 3 (ipykernel)", 143 | "language": "python", 144 | "name": "python3" 145 | }, 146 | "language_info": { 147 | "codemirror_mode": { 148 | "name": "ipython", 149 | "version": 3 150 | }, 151 | "file_extension": ".py", 152 | "mimetype": "text/x-python", 153 | "name": "python", 154 | "nbconvert_exporter": "python", 155 | "pygments_lexer": "ipython3", 156 | "version": "3.9.7" 157 | } 158 | }, 159 | "nbformat": 4, 160 | "nbformat_minor": 5 161 | } 162 | -------------------------------------------------------------------------------- /2_exercise.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "04a5dfbf", 6 | "metadata": {}, 7 | "source": [ 8 | "## EXERCISE - Logistic Regression and Classification" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "5bec5246", 14 | "metadata": {}, 15 | "source": [ 16 | "In the Python code below, replace the question marks \"?\" with the proper code to perform a logistic regression on this dataset predicting light (0) or dark (1) font for a given background color (specified by three R, G, and B values). \n", 17 | "\n", 18 | "Set aside 1/3 of the data for testing, then evaluate the confusion matrix and the AUC. \n", 19 | "\n", 20 | "Is the logistic regression a good predictor for a light or dark font? Why or why not? 
" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "id": "fb1be73a", 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import pandas as pd\n", 31 | "import numpy as np \n", 32 | "from sklearn.linear_model import LogisticRegression\n", 33 | "from sklearn.metrics import confusion_matrix\n", 34 | "from sklearn.model_selection import train_test_split\n", 35 | "from sklearn.metrics import roc_auc_score \n", 36 | "\n", 37 | "\n", 38 | "df = pd.read_csv('https://bit.ly/3GsNzGt', delimiter=\",\")\n", 39 | "\n", 40 | "# Extract input variables (all rows, all columns but last column)\n", 41 | "# Scale the R,G,B values to be between 0 and 1, not 0 and 255. \n", 42 | "X = (df.values[:, :-1] / 255.0)\n", 43 | "\n", 44 | "# Extract output column (all rows, last column)\n", 45 | "Y = df.values[:, -1]\n", 46 | "\n", 47 | "X_train, X_test, Y_train, Y_test = train_test_split(?, ?, test_size=?, random_state=10)\n", 48 | "\n", 49 | "model = LogisticRegression(penalty='none')\n", 50 | "model.fit(?, ?)\n", 51 | "\n", 52 | "prediction = model.predict(?)\n", 53 | "\n", 54 | "'''\n", 55 | "[[truepositives falsenegatives]\n", 56 | " [falsepositives truenegatives]]\n", 57 | " '''\n", 58 | "matrix = confusion_matrix(y_true=?, y_pred=?, normalize=None)\n", 59 | "print(matrix)\n", 60 | "\n", 61 | "\n", 62 | "results = roc_auc_score(?, ?)\n", 63 | "print(\"AUC: %.3f\" % results)" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "id": "80aeaa1c", 69 | "metadata": {}, 70 | "source": [ 71 | "### SCROLL DOWN FOR ANSWER\n", 72 | "|
\n", 73 | "|
\n", 74 | "|
\n", 75 | "|
\n", 76 | "|
\n", 77 | "|
\n", 78 | "|
\n", 79 | "|
\n", 80 | "|
\n", 81 | "|
\n", 82 | "|
\n", 83 | "|
\n", 84 | "|
\n", 85 | "|
\n", 86 | "|
\n", 87 | "|
\n", 88 | "|
\n", 89 | "|
\n", 90 | "|
\n", 91 | "|
\n", 92 | "|
\n", 93 | "|
\n", 94 | "|
\n", 95 | "v \n", 96 | "\n", 97 | "```python\n", 98 | "import pandas as pd\n", 99 | "import numpy as np \n", 100 | "from sklearn.linear_model import LogisticRegression\n", 101 | "from sklearn.metrics import confusion_matrix\n", 102 | "from sklearn.model_selection import train_test_split\n", 103 | "\n", 104 | "df = pd.read_csv('https://bit.ly/3GsNzGt', delimiter=\",\")\n", 105 | "\n", 106 | "# Extract input variables (all rows, all columns but last column)\n", 107 | "# Scale the R,G,B values to be between 0 and 1, not 0 and 255. \n", 108 | "X = (df.values[:, :-1] / 255.0)\n", 109 | "\n", 110 | "# Extract output column (all rows, last column)\n", 111 | "Y = df.values[:, -1]\n", 112 | "\n", 113 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=10)\n", 114 | "\n", 115 | "model = LogisticRegression(penalty='none')\n", 116 | "model.fit(X_train, Y_train)\n", 117 | "\n", 118 | "prediction = model.predict(X_test)\n", 119 | "\n", 120 | "'''\n", 121 | "[[truepositives falsenegatives]\n", 122 | " [falsepositives truenegatives]]\n", 123 | " '''\n", 124 | "matrix = confusion_matrix(y_true=Y_test, y_pred=prediction, normalize=None)\n", 125 | "print(matrix)\n", 126 | "\n", 127 | "\n", 128 | "from sklearn.metrics import roc_auc_score \n", 129 | "results = roc_auc_score(prediction, Y_test)\n", 130 | "print(\"AUC: %.3f\" % results)\n", 131 | "```" 132 | ] 133 | } 134 | ], 135 | "metadata": { 136 | "kernelspec": { 137 | "display_name": "Python 3 (ipykernel)", 138 | "language": "python", 139 | "name": "python3" 140 | }, 141 | "language_info": { 142 | "codemirror_mode": { 143 | "name": "ipython", 144 | "version": 3 145 | }, 146 | "file_extension": ".py", 147 | "mimetype": "text/x-python", 148 | "name": "python", 149 | "nbconvert_exporter": "python", 150 | "pygments_lexer": "ipython3", 151 | "version": "3.9.7" 152 | } 153 | }, 154 | "nbformat": 4, 155 | "nbformat_minor": 5 156 | } 157 | -------------------------------------------------------------------------------- /5_2_neural_network_mnist.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3234cf43", 6 | "metadata": {}, 7 | "source": [ 8 | "# Neural Networks on MNIST Dataset\n", 9 | "\n", 10 | "Import dependencies." 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "id": "1a220160", 17 | "metadata": { 18 | "scrolled": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "import numpy as np\n", 23 | "import pandas as pd\n", 24 | "from sklearn.model_selection import train_test_split\n", 25 | "from sklearn.neural_network import MLPClassifier\n", 26 | "from sklearn.metrics import confusion_matrix\n" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "id": "f8e1555d", 32 | "metadata": {}, 33 | "source": [ 34 | "Read MNIST handwritten digits data from Pandas. Note the data is somewhat large so it is stored as a zipped CSV. " 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "id": "f6b6e7cf", 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "df = pd.read_csv('https://bit.ly/3ilJc2C', compression='zip', delimiter=\",\")\n", 45 | "df" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "id": "ee434097", 51 | "metadata": {}, 52 | "source": [ 53 | "Separate the input and output variables. 
" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "id": "82ea38d8", 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "X = df.values[:, :-1] / 255.0\n", 64 | "Y = df.values[:, -1]\n" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "id": "4266519e", 70 | "metadata": {}, 71 | "source": [ 72 | "Print out the number of instances of each class. Stratify so that each class is sampled equally. " 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "id": "25505fa2", 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "\n", 83 | "# Get a count of each group to ensure samples are equitably balanced\n", 84 | "print(df.groupby([\"class\"]).agg({\"class\" : [np.size]}))\n", 85 | "\n", 86 | "# Separate training and testing data\n", 87 | "# Note that I use the 'stratify' parameter to ensure\n", 88 | "# each class is proportionally represented in both sets\n", 89 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y,\n", 90 | " test_size=.33, random_state=10, stratify=Y)\n" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "id": "d8fbd711", 96 | "metadata": {}, 97 | "source": [ 98 | "Train a neural network using the logistic function as the hidden layer activation function with 100 nodes. Set a higher learning rate of .01. " 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "id": "2d401c7b", 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "\n", 109 | "nn = MLPClassifier(solver='sgd',\n", 110 | " hidden_layer_sizes=(100, ),\n", 111 | " activation='logistic',\n", 112 | " max_iter=480,\n", 113 | " learning_rate_init=.1)\n", 114 | "\n", 115 | "nn.fit(X_train, Y_train)\n", 116 | "\n", 117 | "print(\"Test set score: %f\" % nn.score(X_test, Y_test))\n", 118 | "\n", 119 | "cf = confusion_matrix(y_true=Y_test, y_pred=nn.predict(X_test))\n", 120 | "print(cf)" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "id": "c0a15920", 126 | "metadata": {}, 127 | "source": [ 128 | "Display heat map for each digit character weights. 
" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "id": "39e47ab3", 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "# Display heat map\n", 139 | "import matplotlib.pyplot as plt\n", 140 | "fig, axes = plt.subplots(4, 4)\n", 141 | "\n", 142 | "# use global min / max to ensure all weights are shown on the same scale\n", 143 | "vmin, vmax = nn.coefs_[0].min(), nn.coefs_[0].max()\n", 144 | "for coef, ax in zip(nn.coefs_[0].T, axes.ravel()):\n", 145 | " ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray, vmin=.5 * vmin, vmax=.5 * vmax)\n", 146 | " ax.set_xticks(())\n", 147 | " ax.set_yticks(())" 148 | ] 149 | } 150 | ], 151 | "metadata": { 152 | "kernelspec": { 153 | "display_name": "Python 3 (ipykernel)", 154 | "language": "python", 155 | "name": "python3" 156 | }, 157 | "language_info": { 158 | "codemirror_mode": { 159 | "name": "ipython", 160 | "version": 3 161 | }, 162 | "file_extension": ".py", 163 | "mimetype": "text/x-python", 164 | "name": "python", 165 | "nbconvert_exporter": "python", 166 | "pygments_lexer": "ipython3", 167 | "version": "3.9.7" 168 | } 169 | }, 170 | "nbformat": 4, 171 | "nbformat_minor": 5 172 | } 173 | -------------------------------------------------------------------------------- /3_2_naive_bayes_user_input.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f5178e18", 6 | "metadata": {}, 7 | "source": [ 8 | "# Naive Bayes Spam Email Classifier\n", 9 | "\n", 10 | "## with User Test Input \n", 11 | "\n", 12 | "Import dependencies" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "id": "f15900ff", 19 | "metadata": { 20 | "scrolled": true 21 | }, 22 | "outputs": [], 23 | "source": [ 24 | "import pandas as pd\n", 25 | "import numpy as np\n", 26 | "from sklearn.model_selection import train_test_split\n", 27 | "from sklearn.feature_extraction.text import CountVectorizer\n", 28 | "from sklearn.naive_bayes import MultinomialNB\n", 29 | "\n", 30 | "np.set_printoptions(suppress=True)" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "fb21b67b", 36 | "metadata": {}, 37 | "source": [ 38 | "Use this `messsage` variable to create a test message of your choosing. Note that if the entire message contains words the classifier has never seen before, it will be on the fence whether or not it is spam. " 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "id": "1e0b12ce", 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "message = \"Meet hot singles now\"" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "251633b3", 54 | "metadata": {}, 55 | "source": [ 56 | "Read the training data into a dataframe, but append the test message above so it gets included in the vectorization. We will omit it from trainining afterwards. 
" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "id": "e6495007", 63 | "metadata": { 64 | "scrolled": true 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "# read training data, add input to DataFrame\n", 69 | "df = pd.read_csv('https://bit.ly/3zQBV5y')\n", 70 | "df.loc[len(df.index)] = [message, 1] # add record\n", 71 | "df\n" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "id": "512cefc6", 77 | "metadata": {}, 78 | "source": [ 79 | "Vectorize the training data and the test input together, counting each word occurrence for every email, and break out the `X_all` column containing the input data. " 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "id": "5a8533b1", 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "# vectorize training data along with user input\n", 90 | "cv = CountVectorizer()\n", 91 | "X_all = cv.fit_transform(df['msg'])\n", 92 | "\n", 93 | "# Print count vectorizer as a table \n", 94 | "pd.DataFrame(X_all.toarray(),columns= cv.get_feature_names_out())" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "id": "8340bcbc", 100 | "metadata": {}, 101 | "source": [ 102 | "Extract the training `X_train` and `Y_train` data, omitting the test record we appended earlier. That will be extracted as the test input as `X_test`. " 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "id": "56a06f86", 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "# extract the vectorized training data\n", 113 | "X_train = X_all[:-1,:]\n", 114 | "Y_train = df[\"spam_ind\"].iloc[:-1]\n", 115 | "\n", 116 | "# extract out the test input\n", 117 | "X_test = X_all[-1:, :]" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "id": "b680ac31", 123 | "metadata": {}, 124 | "source": [ 125 | "Fit the `MulinomialNB` model to the training data, and predict the probability of being spam for the test email. Note after we predict the probability with `predict_proba()` it will return two values, one for the probability of not being spam and the other for being spam. We want the second value so we extract it. 
" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "id": "1a46a970", 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "# Create multinomial Naive Bayes and train model\n", 136 | "model = MultinomialNB().fit(X_train, Y_train)\n", 137 | "\n", 138 | "# Test the user input for spam\n", 139 | "probability_of_spam = model.predict_proba(X_test).flatten()[1]\n", 140 | "print(\"Spam probability: {0}%\".format(probability_of_spam))" 141 | ] 142 | } 143 | ], 144 | "metadata": { 145 | "kernelspec": { 146 | "display_name": "Python 3 (ipykernel)", 147 | "language": "python", 148 | "name": "python3" 149 | }, 150 | "language_info": { 151 | "codemirror_mode": { 152 | "name": "ipython", 153 | "version": 3 154 | }, 155 | "file_extension": ".py", 156 | "mimetype": "text/x-python", 157 | "name": "python", 158 | "nbconvert_exporter": "python", 159 | "pygments_lexer": "ipython3", 160 | "version": "3.9.7" 161 | } 162 | }, 163 | "nbformat": 4, 164 | "nbformat_minor": 5 165 | } 166 | -------------------------------------------------------------------------------- /5a_exercise.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "04a5dfbf", 6 | "metadata": {}, 7 | "source": [ 8 | "## EXERCISE - Neural Networks" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "5bec5246", 14 | "metadata": {}, 15 | "source": [ 16 | "In the Python code below, replace the question marks \"?\" with the proper code to perform a neural network prediction on a maintenance dataset on whether a part needs replacement (1) or not (0). \n", 17 | "\n", 18 | "Use 3 nodes in the hidden layer, and ReLU as the activation function.\n", 19 | "\n", 20 | "Experiment with learning rate and iterations to optimize training. Set aside 1/3 of the data for testing, then evaluate performance with a confusion matrix.\n" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "id": "fb1be73a", 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import pandas as pd\n", 31 | "from sklearn.model_selection import train_test_split\n", 32 | "from sklearn.neural_network import MLPClassifier\n", 33 | "from sklearn.metrics import confusion_matrix\n", 34 | "\n", 35 | "df = pd.read_csv('https://bit.ly/3wlFsb4')\n", 36 | "\n", 37 | "# Extract input variables (all rows, all columns but last column)\n", 38 | "# Note we should do some linear scaling here\n", 39 | "X = df.values[:, :-1] / 1000.0\n", 40 | "\n", 41 | "# Extract output column (all rows, last column)\n", 42 | "Y = df.values[:, -1]\n", 43 | "\n", 44 | "# Separate training and testing data\n", 45 | "X_train, X_test, Y_train, Y_test = train_test_split(?, ?, test_size=?, random_state=7)\n", 46 | "\n", 47 | "nn = MLPClassifier(solver='sgd',\n", 48 | " hidden_layer_sizes=(?, ),\n", 49 | " activation='relu',\n", 50 | " max_iter=?,\n", 51 | " learning_rate_init=?)\n", 52 | "\n", 53 | "nn.fit(X_train, Y_train)\n", 54 | "\n", 55 | "# Print weights and biases\n", 56 | "print(nn.coefs_ )\n", 57 | "print(nn.intercepts_)\n", 58 | "\n", 59 | "print(\"Test set score: %f\" % nn.score(?, ?))\n", 60 | "\n", 61 | "print(\"Confusion Matrix:\")\n", 62 | "cf = confusion_matrix(y_true=Y_test, y_pred=nn.predict(X_test))\n", 63 | "print(cf)" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "id": "80aeaa1c", 69 | "metadata": {}, 70 | "source": [ 71 | "### SCROLL DOWN FOR ANSWER\n", 72 | "|
\n", 73 | "|
\n", 74 | "|
\n", 75 | "|
\n", 76 | "|
\n", 77 | "|
\n", 78 | "|
\n", 79 | "|
\n", 80 | "|
\n", 81 | "|
\n", 82 | "|
\n", 83 | "|
\n", 84 | "|
\n", 85 | "|
\n", 86 | "|
\n", 87 | "|
\n", 88 | "|
\n", 89 | "|
\n", 90 | "|
\n", 91 | "|
\n", 92 | "|
\n", 93 | "|
\n", 94 | "|
\n", 95 | "v \n", 96 | "\n", 97 | "```python\n", 98 | "import pandas as pd\n", 99 | "# load data\n", 100 | "from sklearn.model_selection import train_test_split\n", 101 | "from sklearn.neural_network import MLPClassifier\n", 102 | "from sklearn.metrics import confusion_matrix\n", 103 | "\n", 104 | "df = pd.read_csv('https://bit.ly/3wlFsb4')\n", 105 | "\n", 106 | "# Extract input variables (all rows, all columns but last column)\n", 107 | "# Note we should do some linear scaling here\n", 108 | "X = df.values[:, :-1] / 1000.0\n", 109 | "\n", 110 | "# Extract output column (all rows, last column)\n", 111 | "Y = df.values[:, -1]\n", 112 | "\n", 113 | "# Separate training and testing data\n", 114 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1.0/3.0, random_state=7)\n", 115 | "\n", 116 | "nn = MLPClassifier(solver='sgd',\n", 117 | " hidden_layer_sizes=(3, ),\n", 118 | " activation='relu',\n", 119 | " max_iter=100_000,\n", 120 | " learning_rate_init=.01)\n", 121 | "\n", 122 | "nn.fit(X_train, Y_train)\n", 123 | "\n", 124 | "# Print weights and biases\n", 125 | "print(nn.coefs_ )\n", 126 | "print(nn.intercepts_)\n", 127 | "\n", 128 | "print(\"Test set score: %f\" % nn.score(X_test, Y_test))\n", 129 | "\n", 130 | "print(\"Confusion Matrix:\")\n", 131 | "cf = confusion_matrix(y_true=Y_test, y_pred=nn.predict(X_test))\n", 132 | "print(cf)\n", 133 | "```" 134 | ] 135 | } 136 | ], 137 | "metadata": { 138 | "kernelspec": { 139 | "display_name": "Python 3 (ipykernel)", 140 | "language": "python", 141 | "name": "python3" 142 | }, 143 | "language_info": { 144 | "codemirror_mode": { 145 | "name": "ipython", 146 | "version": 3 147 | }, 148 | "file_extension": ".py", 149 | "mimetype": "text/x-python", 150 | "name": "python", 151 | "nbconvert_exporter": "python", 152 | "pygments_lexer": "ipython3", 153 | "version": "3.9.7" 154 | } 155 | }, 156 | "nbformat": 4, 157 | "nbformat_minor": 5 158 | } 159 | -------------------------------------------------------------------------------- /5b_exercise.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "04a5dfbf", 6 | "metadata": {}, 7 | "source": [ 8 | "## EXERCISE - MNIST Neural Networks" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "5bec5246", 14 | "metadata": {}, 15 | "source": [ 16 | "In the Python code below, replace the question marks \"?\" with the proper code to perform a neural network prediction on the MNIST dataset predicting the digits 0-9. \n", 17 | "\n", 18 | "Find a sufficient number of nodes in the hidden layer, and use ReLU as the activation function. Make sure to balance the samples of each class so each are represented equally!\n", 19 | "\n", 20 | "Use a learning rate of .1 and a max of 480 iterations for stochastic gradient descent. 
Set aside 1/3 of the data for testing, then evaluate performance with a confusion matrix.\n" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "id": "fb1be73a", 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import numpy as np\n", 31 | "import pandas as pd\n", 32 | "from sklearn.model_selection import train_test_split\n", 33 | "from sklearn.neural_network import MLPClassifier\n", 34 | "from sklearn.metrics import confusion_matrix\n", 35 | "\n", 36 | "df = pd.read_csv('https://bit.ly/3ilJc2C', compression='zip', delimiter=\",\")\n", 37 | "\n", 38 | "X = df.values[:, :-1] / 1000.0 # this rescale helps the training \n", 39 | "Y = df.values[:, -1]\n", 40 | "\n", 41 | "\n", 42 | "# Separate training and testing data\n", 43 | "# Note that I use the 'stratify' parameter to ensure\n", 44 | "# each class is proportionally represented in both sets\n", 45 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y,\n", 46 | " test_size=1.0/3.0, random_state=10, stratify=?)\n", 47 | "\n", 48 | "# Fit a neural network classifier \n", 49 | "nn = MLPClassifier(solver='sgd',\n", 50 | " hidden_layer_sizes=(?, ),\n", 51 | " activation=?,\n", 52 | " max_iter=?,\n", 53 | " learning_rate_init=.1)\n", 54 | "\n", 55 | "nn.fit(X_train, Y_train)\n", 56 | "\n", 57 | "# Evaluate the test dataset\n", 58 | "print(\"Test set score: %f\" % nn.score(?, ?))\n", 59 | "\n", 60 | "cf = confusion_matrix(y_true=?, y_pred=nn.predict(?))\n", 61 | "print(cf)" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "id": "80aeaa1c", 67 | "metadata": {}, 68 | "source": [ 69 | "### SCROLL DOWN FOR ANSWER\n", 70 | "|
\n", 71 | "|
\n", 72 | "|
\n", 73 | "|
\n", 74 | "|
\n", 75 | "|
\n", 76 | "|
\n", 77 | "|
\n", 78 | "|
\n", 79 | "|
\n", 80 | "|
\n", 81 | "|
\n", 82 | "|
\n", 83 | "|
\n", 84 | "|
\n", 85 | "|
\n", 86 | "|
\n", 87 | "|
\n", 88 | "|
\n", 89 | "|
\n", 90 | "|
\n", 91 | "|
\n", 92 | "|
\n", 93 | "v \n", 94 | "\n", 95 | "```python\n", 96 | "import numpy as np\n", 97 | "import pandas as pd\n", 98 | "from sklearn.model_selection import train_test_split\n", 99 | "from sklearn.neural_network import MLPClassifier\n", 100 | "from sklearn.metrics import confusion_matrix\n", 101 | "\n", 102 | "df = pd.read_csv('https://bit.ly/3ilJc2C', compression='zip', delimiter=\",\")\n", 103 | "\n", 104 | "X = df.values[:, :-1] / 1000.0 # this rescale helps the training \n", 105 | "Y = df.values[:, -1]\n", 106 | "\n", 107 | "\n", 108 | "# Separate training and testing data\n", 109 | "# Note that I use the 'stratify' parameter to ensure\n", 110 | "# each class is proportionally represented in both sets\n", 111 | "X_train, X_test, Y_train, Y_test = train_test_split(X, Y,\n", 112 | " test_size=.33, random_state=10, stratify=Y)\n", 113 | "\n", 114 | "# Fit a neural network classifier \n", 115 | "nn = MLPClassifier(solver='sgd',\n", 116 | " hidden_layer_sizes=(100, ),\n", 117 | " activation='relu',\n", 118 | " max_iter=480,\n", 119 | " learning_rate_init=.1)\n", 120 | "\n", 121 | "nn.fit(X_train, Y_train)\n", 122 | "\n", 123 | "# Evaluate the test dataset\n", 124 | "print(\"Test set score: %f\" % nn.score(X_test, Y_test))\n", 125 | "\n", 126 | "cf = confusion_matrix(y_true=Y_test, y_pred=nn.predict(X_test))\n", 127 | "print(cf)\n", 128 | "```" 129 | ] 130 | } 131 | ], 132 | "metadata": { 133 | "kernelspec": { 134 | "display_name": "Python 3 (ipykernel)", 135 | "language": "python", 136 | "name": "python3" 137 | }, 138 | "language_info": { 139 | "codemirror_mode": { 140 | "name": "ipython", 141 | "version": 3 142 | }, 143 | "file_extension": ".py", 144 | "mimetype": "text/x-python", 145 | "name": "python", 146 | "nbconvert_exporter": "python", 147 | "pygments_lexer": "ipython3", 148 | "version": "3.9.7" 149 | } 150 | }, 151 | "nbformat": 4, 152 | "nbformat_minor": 5 153 | } 154 | --------------------------------------------------------------------------------