├── .ipynb_checkpoints ├── Binary Naive Bayes-checkpoint.ipynb ├── Gaussian Discriminant Analyses-checkpoint.ipynb ├── Gaussian Naive Bayes-checkpoint.ipynb ├── KMeans-checkpoint.ipynb ├── Linear Regression Implementation from scratch-checkpoint.ipynb ├── Logistic Regression-checkpoint.ipynb ├── Momentum in ML-checkpoint.ipynb ├── Multi Class Gaussian Discriminant Analyses-checkpoint.ipynb ├── Multinomial Naive Bayes-checkpoint.ipynb └── Naive Bayes Implementation-checkpoint.ipynb ├── BayesianClassifier.ipynb ├── Binary Naive Bayes.ipynb ├── Gaussian Discriminant Analyses.ipynb ├── Gaussian Naive Bayes.ipynb ├── ID3.ipynb ├── KMeans.ipynb ├── Linear Regression Implementation from scratch.ipynb ├── Linear Regression with Newtons Method.ipynb ├── Linear_reg.ipynb ├── Logistic Regression with Newtons Method.ipynb ├── Logistic Regression.ipynb ├── Momentum in ML.ipynb ├── Multi Class Gaussian Discriminant Analyses.ipynb ├── Multi Class Logistic Regression with Newtons Method.ipynb ├── Multi Class Logistic Regression.ipynb ├── Multinomial Naive Bayes.ipynb ├── Naive Bayes Implementation.ipynb ├── Perceptron.ipynb ├── README.md ├── README_files ├── 7457498 ├── 15320635 ├── 1f334.png ├── 1f3af.png ├── 1f3e0.png ├── 1f912.png ├── compat-bootstrap-f87ad7f1.js ├── decision_tree_predictions.png ├── frameworks-481a47a96965f6706fb41bae0d14b09a.css ├── frameworks-e4aa17e0.js ├── github-bootstrap-d08823e5.js ├── github-eabfbaded2e91939e805d1a3af34018a.css ├── image_preprocessing.png ├── octocat-spinner-128.gif └── search-key-slash.svg ├── Wrapper Method For Feature Selection - Forward and Backward .ipynb ├── book_rec_knn.ipynb ├── figures ├── MM.png ├── bayes-theorem.png ├── decision_tree.png ├── decision_tree_predictions.png ├── image_preprocessing.png ├── linear_regression.jpg ├── logistic_regression.jpg ├── neural_net.png ├── perceptron_hyperplane.png ├── preprocessing.png ├── regression_tree.png └── softmax_regression.jpg ├── gaussian_naive_bayes.py ├── gda.py ├── 
kmeans.ipynb ├── kmeans.py ├── lwlr.ipynb ├── naive_bayes.py ├── spam filter.ipynb ├── svm.ipynb ├── train_test_split.pyc ├── utils.py └── utils.pyc /.ipynb_checkpoints/Binary Naive Bayes-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Naive Bayes Classifier Implementation using numpy\n", 8 | "\n", 9 | "Naive Bayes is another supervised machine learning algorithm for classification problems. It makes a strong assumption about the data: that **each feature is independent of the value of any other feature**, given the class. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.\n", 10 | "\n", 11 | "In a naive Bayes classifier, we try to find the probability that a given data point belongs to a specific class. We compute this probability for every class in the target and predict the most probable one.\n", 12 | "\n", 13 | "\n", 14 | "![title](figures/bayes-theorem.png)\n", 15 | "\n", 16 | "This is a Bernoulli naive Bayes implementation, which expects each feature to be binary: true or false (1 or 0)." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 72, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "# %load naive_bayes.py\n", 26 | "import numpy as np\n", 27 | "\n", 28 | "class NaiveBayesBinaryClassifier:\n", 29 | " \n", 30 | " def fit(self, X, y):\n", 31 | " self.y_classes, y_counts = np.unique(y, return_counts=True)\n", 32 | " self.phi_y = 1.0 * y_counts/y_counts.sum()\n", 33 | " self.phi_x = [1.0 * X[y==k].mean(axis=0) for k in self.y_classes]\n", 34 | " return self\n", 35 | " \n", 36 | " def predict(self, X):\n", 37 | " return np.apply_along_axis(lambda x: self.compute_probs(x), 1, X)\n", 38 | " \n", 39 | " def compute_probs(self, x):\n", 40 | " probs = [self.compute_prob(x, y) for y in range(len(self.y_classes))]\n", 41 | " return self.y_classes[np.argmax(probs)]\n", 42 | " \n", 43 | " def compute_prob(self, x, y):\n", 44 | " res = 1\n", 45 | " for j in range(len(x)):\n", 46 | " Pxy = self.phi_x[y][j] # p(xj=1|y)\n", 47 | " res *= (Pxy**x[j])*((1-Pxy)**(1-x[j])) # p(xj=0|y)\n", 48 | " return res * self.phi_y[y]\n", 49 | " \n", 50 | " def score(self, X, y):\n", 51 | " return (self.predict(X) == y).mean()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [] 60 | } 61 | ], 62 | "metadata": { 63 | "kernelspec": { 64 | "display_name": "python3", 65 | "language": "python", 66 | "name": "python3" 67 | }, 68 | "language_info": { 69 | "codemirror_mode": { 70 | "name": "ipython", 71 | "version": 2 72 | }, 73 | "file_extension": ".py", 74 | "mimetype": "text/x-python", 75 | "name": "python", 76 | "nbconvert_exporter": "python", 77 | "pygments_lexer": "ipython2", 78 | "version": "2.7.16" 79 | } 80 | }, 81 | "nbformat": 4, 82 | "nbformat_minor": 2 83 | } 84 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Gaussian Discriminant Analyses-checkpoint.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Gaussian Discriminant Analysis Implementation using numpy\n", 8 | "\n", 9 | "When we have a classification problem in which the input features x are\n", 10 | "continuous-valued random variables, we can then use the Gaussian Discriminant Analysis (GDA) model, which models p(x|y) using a multivariate normal distribution.\n", 11 | "\n", 12 | "GDA belongs to the family of generative algorithms, which try to model p(x|y) (and p(y)). After modeling p(y) (called the class priors) and p(x|y), our algorithm can then use Bayes' rule to derive the posterior distribution on y given x:\n", 13 | "\n", 14 | "\n", 15 | "![title](figures/bayes-theorem.png)\n", 16 | "\n", 17 | "GDA makes stronger modeling assumptions, and is more data efficient (i.e., requires less training data to learn “well”) when the modeling assumptions are correct or at least approximately correct. 
Logistic regression, by comparison, makes weaker assumptions and is significantly more robust to deviations from the modeling assumptions.\n", 18 | "\n", 19 | "### Our motivation\n", 20 | "To gain a better understanding of how the algorithm works." 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 4, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "# %load gda.py\n", 30 | "import numpy as np\n", 31 | "\n", 32 | "class GDABinaryClassifier:\n", 33 | " \n", 34 | " def fit(self, X, y):\n", 35 | " self.fi = y.mean()\n", 36 | " self.u = np.array([ X[y==k].mean(axis=0) for k in [0,1]])\n", 37 | " X_u = X.copy()\n", 38 | " for k in [0,1]: X_u[y==k] -= self.u[k]\n", 39 | " self.E = X_u.T.dot(X_u) / len(y)\n", 40 | " self.invE = np.linalg.pinv(self.E)\n", 41 | " return self\n", 42 | " \n", 43 | " def predict(self, X):\n", 44 | " return np.argmax([self.compute_prob(X, i) for i in range(len(self.u))], axis=0)\n", 45 | " \n", 46 | " def compute_prob(self, X, i):\n", 47 | " u, phi = self.u[i], ((self.fi)**i * (1 - self.fi)**(1 - i))\n", 48 | " return np.exp(-1.0 * np.sum((X-u).dot(self.invE)*(X-u), axis=1)) * phi\n", 49 | " \n", 50 | " def score(self, X, y):\n", 51 | " return (self.predict(X) == y).mean()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 5, 57 | "metadata": {}, 58 | "outputs": [ 59 | { 60 | "data": { 61 | "text/plain": [ 62 | "0.9666080843585237" 63 | ] 64 | }, 65 | "execution_count": 5, 66 | "metadata": {}, 67 | "output_type": "execute_result" 68 | } 69 | ], 70 | "source": [ 71 | "from sklearn.datasets import load_breast_cancer\n", 72 | "X,y = load_breast_cancer(return_X_y=True)\n", 73 | "model = GDABinaryClassifier().fit(X,y)\n", 74 | "pre = model.predict(X)\n", 75 | "model.score(X,y)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [] 84 | } 85 | ], 86 | "metadata": { 87 | "kernelspec": { 88 | "display_name": "python3", 89 
| "language": "python", 90 | "name": "python3" 91 | }, 92 | "language_info": { 93 | "codemirror_mode": { 94 | "name": "ipython", 95 | "version": 2 96 | }, 97 | "file_extension": ".py", 98 | "mimetype": "text/x-python", 99 | "name": "python", 100 | "nbconvert_exporter": "python", 101 | "pygments_lexer": "ipython2", 102 | "version": "2.7.16" 103 | } 104 | }, 105 | "nbformat": 4, 106 | "nbformat_minor": 2 107 | } 108 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Gaussian Naive Bayes-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "# %load gaussian_naive_bayes.py\n", 10 | "import numpy as np\n", 11 | " \n", 12 | "class GaussianNB:\n", 13 | " \n", 14 | " def fit(self, X, y, epsilon = 1e-10):\n", 15 | " self.y_classes, y_counts = np.unique(y, return_counts=True)\n", 16 | " self.x_classes = np.array([np.unique(x) for x in X.T])\n", 17 | " self.phi_y = 1.0 * y_counts/y_counts.sum()\n", 18 | " self.u = np.array([X[y==k].mean(axis=0) for k in self.y_classes])\n", 19 | " self.var_x = np.array([X[y==k].var(axis=0) + epsilon for k in self.y_classes])\n", 20 | " return self\n", 21 | " \n", 22 | " def predict(self, X):\n", 23 | " return np.apply_along_axis(lambda x: self.compute_probs(x), 1, X)\n", 24 | " \n", 25 | " def compute_probs(self, x):\n", 26 | " probs = np.array([self.compute_prob(x, y) for y in range(len(self.y_classes))])\n", 27 | " return self.y_classes[np.argmax(probs)]\n", 28 | " \n", 29 | " def compute_prob(self, x, y):\n", 30 | " c = 1.0 /np.sqrt(2.0 * np.pi * (self.var_x[y]))\n", 31 | " return np.prod(c * np.exp(-1.0 * np.square(x - self.u[y]) / (2.0 * self.var_x[y])))\n", 32 | " \n", 33 | " def evaluate(self, X, y):\n", 34 | " return (self.predict(X) == y).mean()" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | 
"execution_count": 11, 40 | "metadata": {}, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/plain": [ 45 | "0.96" 46 | ] 47 | }, 48 | "execution_count": 11, 49 | "metadata": {}, 50 | "output_type": "execute_result" 51 | } 52 | ], 53 | "source": [ 54 | "from sklearn import datasets\n", 55 | "from utils import accuracy_score\n", 56 | "iris = datasets.load_iris()\n", 57 | "X = iris.data \n", 58 | "y = iris.target\n", 59 | "GaussianNB().fit(X, y).evaluate(X, y)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 12, 65 | "metadata": {}, 66 | "outputs": [ 67 | { 68 | "name": "stdout", 69 | "output_type": "stream", 70 | "text": [ 71 | "[1 1 1 2 2 2]\n" 72 | ] 73 | } 74 | ], 75 | "source": [ 76 | "X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])\n", 77 | "Y = np.array([1, 1, 1, 2, 2, 2])\n", 78 | "clf = GaussianNB().fit(X, Y)\n", 79 | "print(clf.predict(X))" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 13, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/plain": [ 90 | "0.8091263216471898" 91 | ] 92 | }, 93 | "execution_count": 13, 94 | "metadata": {}, 95 | "output_type": "execute_result" 96 | } 97 | ], 98 | "source": [ 99 | "from sklearn import datasets\n", 100 | "digits = datasets.load_digits()\n", 101 | "X = digits.data\n", 102 | "y = digits.target\n", 103 | "GaussianNB().fit(X, y).evaluate(X, y)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [] 112 | } 113 | ], 114 | "metadata": { 115 | "kernelspec": { 116 | "display_name": "python3", 117 | "language": "python", 118 | "name": "python3" 119 | }, 120 | "language_info": { 121 | "codemirror_mode": { 122 | "name": "ipython", 123 | "version": 2 124 | }, 125 | "file_extension": ".py", 126 | "mimetype": "text/x-python", 127 | "name": "python", 128 | "nbconvert_exporter": "python", 129 | "pygments_lexer": "ipython2", 130 | "version": 
"2.7.16" 131 | } 132 | }, 133 | "nbformat": 4, 134 | "nbformat_minor": 2 135 | } 136 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Logistic Regression-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Logistic Regression in Python and Numpy" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In logistic regression, we are trying to model the outcome of a **binary variable** given a **linear combination of input features**. For example, we could try to predict the outcome of an election (win/lose) using information about how much money a candidate spent campaigning, how much time she/he spent campaigning, etc.\n", 15 | "\n", 16 | "### Model \n", 17 | "\n", 18 | "Logistic regression works as follows.\n", 19 | "\n", 20 | "**Given:** \n", 21 | "- dataset $\\{(\\boldsymbol{x}^{(1)}, y^{(1)}), ..., (\\boldsymbol{x}^{(m)}, y^{(m)})\\}$\n", 22 | "- with $\\boldsymbol{x}^{(i)}$ being a $d$-dimensional vector $\\boldsymbol{x}^{(i)} = (x^{(i)}_1, ..., x^{(i)}_d)$\n", 23 | "- $y^{(i)}$ being a binary target variable, $y^{(i)} \\in \\{0,1\\}$\n", 24 | "\n", 25 | "The logistic regression model can be interpreted as a very **simple neural network:**\n", 26 | "- it has a real-valued weight vector $\\boldsymbol{w} = (w_1, ..., w_d)$\n", 27 | "- it has a real-valued bias $b$\n", 28 | "- it uses a sigmoid function as its activation function\n", 29 | "\n", 30 | "![title](figures/logistic_regression.jpg)" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "### Training\n", 38 | "\n", 39 | "Unlike [linear regression](linear_regression.ipynb), logistic regression has no closed-form solution. But the cost function is convex, so we can train the model using gradient descent. 
In fact, on a convex cost **gradient descent** has no poor local minima to get stuck in, so it approaches the global minimum if the learning rate is small enough and enough training iterations are used. \n", 40 | "\n", 41 | "Training a logistic regression model proceeds in several steps. In the beginning (step 0) the parameters are initialized. The other steps are repeated for a specified number of training iterations or until convergence of the parameters.\n", 42 | "\n", 43 | "* * * \n", 44 | "**Step 0:** Initialize the weight vector and bias with zeros (or small random values).\n", 45 | "* * *\n", 46 | "\n", 47 | "**Step 1:** Compute a linear combination of the input features and weights. This can be done in one step for all training examples, using vectorization and broadcasting:\n", 48 | "$\\boldsymbol{a} = \\boldsymbol{X} \\cdot \\boldsymbol{w} + b $\n", 49 | "\n", 50 | "where $\\boldsymbol{X}$ is a matrix of shape $(n_{samples}, n_{features})$, i.e. $(m, d)$, that holds all training examples, and $\\cdot$ denotes the dot product.\n", 51 | "* * *\n", 52 | "\n", 53 | "**Step 2:** Apply the sigmoid activation function, which returns values between 0 and 1:\n", 54 | "\n", 55 | "$\\boldsymbol{\\hat{y}} = \\sigma(\\boldsymbol{a}) = \\frac{1}{1 + \\exp(-\\boldsymbol{a})}$\n", 56 | "* * *\n", 57 | "\n", 58 | "**Step 3:** Compute the cost over the whole training set. We want to model the probability that the target value is 1. So during training we want to adapt our parameters such that our model outputs high values for examples with a positive label (true label being 1) and small values for examples with a negative label (true label being 0). This is reflected in the cost function:\n", 59 | "\n", 60 | "$J(\\boldsymbol{w},b) = - \\frac{1}{m} \\sum_{i=1}^m \\Big[ y^{(i)} \\log(\\hat{y}^{(i)}) + (1 - y^{(i)}) \\log(1 - \\hat{y}^{(i)}) \\Big]$\n", 61 | "* * *\n", 62 | "\n", 63 | "**Step 4:** Compute the gradient of the cost function with respect to the weight vector and bias. 
A detailed explanation of this derivation can be found [here](https://stats.stackexchange.com/questions/278771/how-is-the-cost-function-from-logistic-regression-derivated).\n", 64 | "\n", 65 | "The general formula is given by:\n", 66 | "\n", 67 | "$ \\frac{\\partial J}{\\partial w_j} = \\frac{1}{m}\\sum_{i=1}^m\\left[\\hat{y}^{(i)}-y^{(i)}\\right]\\,x_j^{(i)}$\n", 68 | "\n", 69 | "For the bias, the input $x_j^{(i)}$ is set to 1, so $\\frac{\\partial J}{\\partial b} = \\frac{1}{m}\\sum_{i=1}^m\\left[\\hat{y}^{(i)}-y^{(i)}\\right]$.\n", 70 | "* * *\n", 71 | "\n", 72 | "**Step 5:** Update the weights and bias:\n", 73 | "\n", 74 | "$\\boldsymbol{w} = \\boldsymbol{w} - \\eta \\, \\nabla_w J$ \n", 75 | "\n", 76 | "$b = b - \\eta \\, \\nabla_b J$\n", 77 | "\n", 78 | "where $\\eta$ is the learning rate." 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 18, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "import numpy as np\n", 88 | "class LogisticRegression:\n", 89 | " \n", 90 | " def fit(self, X, y, lr = 0.001, epochs=10000, verbose=True, batch_size=1):\n", 91 | " self.classes = np.unique(y)\n", 92 | " y = (y==self.classes[1]) * 1\n", 93 | " X = self.add_bias(X)\n", 94 | " self.weights = np.zeros(X.shape[1])\n", 95 | " self.loss = []\n", 96 | " for i in range(epochs):\n", 97 | " self.loss.append(self.cross_entropy(X,y))\n", 98 | " if i % 1000 == 0 and verbose: \n", 99 | " print('Iterations: %d - Error : %.4f' %(i, self.loss[i]))\n", 100 | " idx = np.random.choice(X.shape[0], batch_size)\n", 101 | " X_batch, y_batch = X[idx], y[idx]\n", 102 | " self.weights -= lr * self.get_gradient(X_batch, y_batch)\n", 103 | " return self\n", 104 | " \n", 105 | " def get_gradient(self, X, y):\n", 106 | " return -1.0 * (y - self.predict_(X)).dot(X) / len(X)\n", 107 | " \n", 108 | " def predict_(self, X):\n", 109 | " return self.sigmoid(np.dot(X, self.weights))\n", 110 | " \n", 111 | " def predict(self, X):\n", 112 | " return self.predict_(self.add_bias(X))\n", 113 | " \n", 114 | " def sigmoid(self, z):\n", 115 | " return 1.0/(1 + np.exp(-z))\n", 
116 | " \n", 117 | " def predict_classes(self, X):\n", 118 | " return self.predict_classes_(self.add_bias(X))\n", 119 | "\n", 120 | " def predict_classes_(self, X):\n", 121 | " return np.vectorize(lambda c: self.classes[1] if c>=0.5 else self.classes[0])(self.predict_(X))\n", 122 | " \n", 123 | " def cross_entropy(self, X, y):\n", 124 | " p = self.predict_(X)\n", 125 | " return (-1 / len(y)) * (y * np.log(p)).sum()\n", 126 | "\n", 127 | " def add_bias(self,X):\n", 128 | " return np.insert(X, 0, 1, axis=1)\n", 129 | "\n", 130 | " def score(self, X, y):\n", 131 | " return self.cross_entropy(self.add_bias(X), y)\n" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 19, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "from sklearn.metrics import accuracy_score\n", 141 | "def train_model(X, y, model):\n", 142 | " model.fit(X, y, lr=0.1)\n", 143 | " pre = model.predict_classes(X)\n", 144 | " print('Accuracy :: ', accuracy_score(y, pre))" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 20, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "name": "stdout", 154 | "output_type": "stream", 155 | "text": [ 156 | "Iterations: 0 - Error : 69.3147\n", 157 | "Iterations: 1000 - Error : 0.4746\n", 158 | "Iterations: 2000 - Error : 0.3847\n", 159 | "Iterations: 3000 - Error : 0.1645\n", 160 | "Iterations: 4000 - Error : 0.1280\n", 161 | "Iterations: 5000 - Error : 0.1126\n", 162 | "Iterations: 6000 - Error : 0.0783\n", 163 | "Iterations: 7000 - Error : 0.0674\n", 164 | "Iterations: 8000 - Error : 0.0621\n", 165 | "Iterations: 9000 - Error : 0.0664\n", 166 | "('Accuracy :: ', 1.0)\n" 167 | ] 168 | } 169 | ], 170 | "source": [ 171 | "from sklearn.datasets import load_iris\n", 172 | "X, y = load_iris(return_X_y=True)\n", 173 | "lr = LogisticRegression()\n", 174 | "train_model(X,(y !=0 )*1, lr)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 21, 180 | "metadata": {}, 181 | 
"outputs": [ 182 | { 183 | "data": { 184 | "image/png": "\n", 185 | "text/plain": [ 186 | "
" 187 | ] 188 | }, 189 | "metadata": { 190 | "needs_background": "light" 191 | }, 192 | "output_type": "display_data" 193 | } 194 | ], 195 | "source": [ 196 | "import matplotlib.pyplot as plt\n", 197 | "fig = plt.figure(figsize=(8,6))\n", 198 | "plt.plot(np.arange(len(lr.loss)), lr.loss)\n", 199 | "plt.title(\"Development of cost over training\")\n", 200 | "plt.xlabel(\"Number of iterations\")\n", 201 | "plt.ylabel(\"Cost\")\n", 202 | "plt.show()" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [] 211 | } 212 | ], 213 | "metadata": { 214 | "kernelspec": { 215 | "display_name": "python3", 216 | "language": "python", 217 | "name": "python3" 218 | }, 219 | "language_info": { 220 | "codemirror_mode": { 221 | "name": "ipython", 222 | "version": 2 223 | }, 224 | "file_extension": ".py", 225 | "mimetype": "text/x-python", 226 | "name": "python", 227 | "nbconvert_exporter": "python", 228 | "pygments_lexer": "ipython2", 229 | "version": "2.7.16" 230 | } 231 | }, 232 | "nbformat": 4, 233 | "nbformat_minor": 2 234 | } 235 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Multi Class Gaussian Discriminant Analyses-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np \n", 10 | "\n", 11 | "class GDAClassifier:\n", 12 | " \n", 13 | " def fit(self, X, y, epsilon = 1e-10):\n", 14 | " self.y_classes, y_counts = np.unique(y, return_counts=True)\n", 15 | " self.phi_y = 1.0 * y_counts/len(y)\n", 16 | " self.u = np.array([ X[y==k].mean(axis=0) for k in self.y_classes])\n", 17 | " self.E = self.compute_sigma(X, y)\n", 18 | " self.E += np.ones_like(self.E) * epsilon # fix zero overflow\n", 19 | " self.invE = np.linalg.pinv(self.E)\n", 20 | " 
return self\n", 21 | " \n", 22 | " def compute_sigma(self,X, y):\n", 23 | " X_u = X.copy().astype('float64')\n", 24 | " for i in range(len(self.u)):\n", 25 | " X_u[y==self.y_classes[i]] -= self.u[i]\n", 26 | " return X_u.T.dot(X_u) / len(y)\n", 27 | "\n", 28 | " def predict(self, X):\n", 29 | " return np.apply_along_axis(self.get_prob, 1, X)\n", 30 | " \n", 31 | " def score(self, X, y):\n", 32 | " return (self.predict(X) == y).mean()\n", 33 | " \n", 34 | " def get_prob(self, x):\n", 35 | " p = np.exp(-1.0 * np.sum((x - self.u).dot(self.invE) * (x - self.u), axis =1)) * self.phi_y\n", 36 | " return np.argmax(p)" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 2, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "from utils import train_test_split\n", 46 | "from sklearn.datasets import load_iris\n", 47 | "X,y = load_iris(return_X_y=True)\n", 48 | "X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.8)\n", 49 | "model = GDAClassifier().fit(X_train,y_train)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 3, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "0.975" 61 | ] 62 | }, 63 | "execution_count": 3, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "model.score(X_test,y_test)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 5, 75 | "metadata": {}, 76 | "outputs": [ 77 | { 78 | "data": { 79 | "text/plain": [ 80 | "0.9296703296703297" 81 | ] 82 | }, 83 | "execution_count": 5, 84 | "metadata": {}, 85 | "output_type": "execute_result" 86 | } 87 | ], 88 | "source": [ 89 | "from sklearn.datasets import load_breast_cancer\n", 90 | "X,y = load_breast_cancer(return_X_y=True)\n", 91 | "X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.8)\n", 92 | "model = GDAClassifier().fit(X_train,y_train)\n", 93 | "model.score(X_test,y_test)" 94 | ] 95 | }, 96 | { 97 | "cell_type": 
"code", 98 | "execution_count": 6, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "data": { 103 | "text/plain": [ 104 | "0.9510022271714922" 105 | ] 106 | }, 107 | "execution_count": 6, 108 | "metadata": {}, 109 | "output_type": "execute_result" 110 | } 111 | ], 112 | "source": [ 113 | "from sklearn.datasets import load_digits\n", 114 | "digits = load_digits()\n", 115 | "X = digits.data\n", 116 | "y = digits.target\n", 117 | "X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.5)\n", 118 | "model = GDAClassifier().fit(X_train,y_train)\n", 119 | "model.score(X_test,y_test)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [] 128 | } 129 | ], 130 | "metadata": { 131 | "kernelspec": { 132 | "display_name": "python3", 133 | "language": "python", 134 | "name": "python3" 135 | }, 136 | "language_info": { 137 | "codemirror_mode": { 138 | "name": "ipython", 139 | "version": 2 140 | }, 141 | "file_extension": ".py", 142 | "mimetype": "text/x-python", 143 | "name": "python", 144 | "nbconvert_exporter": "python", 145 | "pygments_lexer": "ipython2", 146 | "version": "2.7.16" 147 | } 148 | }, 149 | "nbformat": 4, 150 | "nbformat_minor": 2 151 | } 152 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Multinomial Naive Bayes-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 4, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "# %load naive_bayes.py\n", 10 | "import numpy as np\n", 11 | "\n", 12 | "class NaiveBayesBinaryClassifier:\n", 13 | " \n", 14 | " def fit(self, X, y):\n", 15 | " self.y_classes, y_counts = np.unique(y, return_counts=True)\n", 16 | " self.phi_y = 1.0 * y_counts/y_counts.sum()\n", 17 | " self.phi_x = [1.0 * X[y==k].mean(axis=0) for k in self.y_classes]\n", 18 | 
" return self\n", 19 | " \n", 20 | " def predict(self, X):\n", 21 | " return np.apply_along_axis(lambda x: self.compute_probs(x), 1, X)\n", 22 | " \n", 23 | " def compute_probs(self, x):\n", 24 | " probs = np.array([self.compute_prob(x, y) for y in range(len(self.y_classes))])\n", 25 | " return self.y_classes[np.argmax(probs)]\n", 26 | " \n", 27 | " def compute_prob(self, x, y):\n", 28 | " res = 1\n", 29 | " for j in range(len(x)):\n", 30 | " Pxy = self.phi_x[y][j] # p(xj=1|y)\n", 31 | " res *= (Pxy**x[j])*((1-Pxy)**(1-x[j])) # p(xj=0|y)\n", 32 | " return res * self.phi_y[y]\n", 33 | " \n", 34 | " def evaluate(self, X, y):\n", 35 | " return (self.predict(X) == y).mean()\n", 36 | " \n", 37 | "class MultinomialNB:\n", 38 | " \n", 39 | " def fit(self, X, y):\n", 40 | " self.y_classes, y_counts = np.unique(y, return_counts=True)\n", 41 | " self.x_classes = np.array([np.unique(x) for x in X.T])\n", 42 | " self.phi_y = 1.0 * y_counts/y_counts.sum()\n", 43 | " self.phi_x = self.mean_x(X, y)\n", 44 | " return self\n", 45 | " \n", 46 | " def mean_x(self, X, y):\n", 47 | " return [[(X[:,j][y==k].reshape(-1,1) == self.x_classes[j]).mean(axis=0)\n", 48 | " for j in range(len(self.x_classes))]\n", 49 | " for k in self.y_classes]\n", 50 | " \n", 51 | " def predict(self, X):\n", 52 | " return np.apply_along_axis(lambda x: self.compute_probs(x), 1, X)\n", 53 | " \n", 54 | " def compute_probs(self, x):\n", 55 | " probs = np.array([self.compute_prob(x, y) for y in range(len(self.y_classes))])\n", 56 | " return self.y_classes[np.argmax(probs)]\n", 57 | " \n", 58 | " def compute_prob(self, x, y):\n", 59 | " Pxy = 1\n", 60 | " for j in range(len(x)):\n", 61 | " i = list(self.x_classes[j]).index(x[j])\n", 62 | " Pxy *= self.phi_x[y][j][i] # p(xj|y)\n", 63 | " return Pxy * self.phi_y[y]\n", 64 | " \n", 65 | " def evaluate(self, X, y):\n", 66 | " return (self.predict(X) == y).mean()\n", 67 | " \n", 68 | " \n", 69 | "class GaussianNB:\n", 70 | " \n", 71 | " def fit(self, X, y, epsilon = 
1e-10):\n", 72 | " self.y_classes, y_counts = np.unique(y, return_counts=True)\n", 73 | " self.x_classes = np.array([np.unique(x) for x in X.T])\n", 74 | " self.phi_y = 1.0 * y_counts/y_counts.sum()\n", 75 | " self.u = np.array([X[y==k].mean(axis=0) for k in self.y_classes])\n", 76 | " self.var_x = np.array([X[y==k].var(axis=0) + epsilon for k in self.y_classes])\n", 77 | " return self\n", 78 | " \n", 79 | " def predict(self, X):\n", 80 | " return np.apply_along_axis(lambda x: self.compute_probs(x), 1, X)\n", 81 | " \n", 82 | " def compute_probs(self, x):\n", 83 | " probs = np.array([self.compute_prob(x, y) for y in range(len(self.y_classes))])\n", 84 | " return self.y_classes[np.argmax(probs)]\n", 85 | " \n", 86 | " def compute_prob(self, x, y):\n", 87 | " c = 1.0 /np.sqrt(2.0 * np.pi * (self.var_x[y]))\n", 88 | " return np.prod(c * np.exp(-1.0 * np.square(x - self.u[y]) / (2.0 * self.var_x[y])))\n", 89 | " \n", 90 | " def evaluate(self, X, y):\n", 91 | " return (self.predict(X) == y).mean()\n", 92 | " " 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 5, 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "data": { 102 | "text/plain": [ 103 | "0.9666666666666667" 104 | ] 105 | }, 106 | "execution_count": 5, 107 | "metadata": {}, 108 | "output_type": "execute_result" 109 | } 110 | ], 111 | "source": [ 112 | "from sklearn import datasets\n", 113 | "from utils import accuracy_score\n", 114 | "iris = datasets.load_iris()\n", 115 | "X = iris.data \n", 116 | "y = iris.target\n", 117 | "MultinomialNB().fit(X, y).evaluate(X, y)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 72, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "data": { 127 | "text/plain": [ 128 | "array([5.1, 3.5, 1.4, 0.2])" 129 | ] 130 | }, 131 | "execution_count": 72, 132 | "metadata": {}, 133 | "output_type": "execute_result" 134 | } 135 | ], 136 | "source": [ 137 | "X [0]" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | 
"execution_count": 65, 143 | "metadata": {}, 144 | "outputs": [ 145 | { 146 | "data": { 147 | "text/plain": [ 148 | "0.96" 149 | ] 150 | }, 151 | "execution_count": 65, 152 | "metadata": {}, 153 | "output_type": "execute_result" 154 | } 155 | ], 156 | "source": [ 157 | "from sklearn import datasets\n", 158 | "from utils import accuracy_score\n", 159 | "iris = datasets.load_iris()\n", 160 | "X = iris.data\n", 161 | "y = iris.target\n", 162 | "GaussianNB().fit(X, y).evaluate(X, y)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 66, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "data": { 172 | "text/plain": [ 173 | "0.7533333333333333" 174 | ] 175 | }, 176 | "execution_count": 66, 177 | "metadata": {}, 178 | "output_type": "execute_result" 179 | } 180 | ], 181 | "source": [ 182 | "from sklearn import datasets\n", 183 | "from utils import accuracy_score\n", 184 | "\n", 185 | "iris = datasets.load_iris()\n", 186 | "X = iris.data \n", 187 | "X = (X > X.mean(axis=0))*1 # turn our feature to categorical using mean\n", 188 | "y = iris.target\n", 189 | "NaiveBayesBinaryClassifier().fit(X, y).evaluate(X, y)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": 67, 195 | "metadata": {}, 196 | "outputs": [ 197 | { 198 | "name": "stdout", 199 | "output_type": "stream", 200 | "text": [ 201 | "[1 1 1 2 2 2]\n" 202 | ] 203 | } 204 | ], 205 | "source": [ 206 | "X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])\n", 207 | "Y = np.array([1, 1, 1, 2, 2, 2])\n", 208 | "clf = MultinomialNB().fit(X, Y)\n", 209 | "print(clf.predict(X))" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 68, 215 | "metadata": {}, 216 | "outputs": [ 217 | { 218 | "data": { 219 | "text/plain": [ 220 | "0.9833055091819699" 221 | ] 222 | }, 223 | "execution_count": 68, 224 | "metadata": {}, 225 | "output_type": "execute_result" 226 | } 227 | ], 228 | "source": [ 229 | "from sklearn import datasets\n", 230 
| "digits = datasets.load_digits()\n", 231 | "X = digits.data\n", 232 | "y = digits.target\n", 233 | "MultinomialNB().fit(X, y).evaluate(X, y)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 69, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "data": { 243 | "text/plain": [ 244 | "0.8091263216471898" 245 | ] 246 | }, 247 | "execution_count": 69, 248 | "metadata": {}, 249 | "output_type": "execute_result" 250 | } 251 | ], 252 | "source": [ 253 | "from sklearn import datasets\n", 254 | "digits = datasets.load_digits()\n", 255 | "X = digits.data\n", 256 | "y = digits.target\n", 257 | "GaussianNB().fit(X, y).evaluate(X, y)" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 70, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "b = np.array([0, 1])\n", 267 | "a = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 268 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 269 | " 0, 0, 0, 0, 0, 0])" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 61, 275 | "metadata": {}, 276 | "outputs": [ 277 | { 278 | "data": { 279 | "text/plain": [ 280 | "0.9666666666666667" 281 | ] 282 | }, 283 | "execution_count": 61, 284 | "metadata": {}, 285 | "output_type": "execute_result" 286 | } 287 | ], 288 | "source": [ 289 | "from sklearn import datasets\n", 290 | "from utils import accuracy_score, train_test_split\n", 291 | "iris = datasets.load_iris()\n", 292 | "X = iris.data\n", 293 | "# X = (X-X.min(axis=0)).astype(int)\n", 294 | "y = iris.target\n", 295 | "X_tr, X_ts, y_tr, y_ts = train_test_split(X,y)\n", 296 | "GaussianNB().fit(X_tr, y_tr).evaluate(X_ts, y_ts)" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "### Naive Bayes" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 40, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "data": { 313 | 
"text/plain": [ 314 | "array([[0, 1, 0, 0],\n", 315 | " [2, 1, 3, 1],\n", 316 | " [2, 1, 3, 1],\n", 317 | " [0, 1, 0, 0],\n", 318 | " [1, 0, 3, 1],\n", 319 | " [2, 0, 3, 1],\n", 320 | " [2, 1, 4, 2],\n", 321 | " [2, 1, 3, 1],\n", 322 | " [1, 0, 3, 1],\n", 323 | " [2, 1, 5, 2]])" 324 | ] 325 | }, 326 | "execution_count": 40, 327 | "metadata": {}, 328 | "output_type": "execute_result" 329 | } 330 | ], 331 | "source": [ 332 | "X_tr[:10]" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": 41, 338 | "metadata": {}, 339 | "outputs": [ 340 | { 341 | "data": { 342 | "text/plain": [ 343 | "array([0, 1, 1, 0, 1, 1, 2, 1, 1, 2])" 344 | ] 345 | }, 346 | "execution_count": 41, 347 | "metadata": {}, 348 | "output_type": "execute_result" 349 | } 350 | ], 351 | "source": [ 352 | "y_classes = [0 1 2]\n", 353 | "P(y) = []" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": null, 359 | "metadata": {}, 360 | "outputs": [], 361 | "source": [ 362 | "p(y|x) = p(x|y) * p(y)\n", 363 | "x_uniques = [0 1 2]" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "metadata": {}, 370 | "outputs": [], 371 | "source": [ 372 | "p(y|x) = p(x|y) * p(y) / p(x)" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "p(x|y) " 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": null, 387 | "metadata": {}, 388 | "outputs": [], 389 | "source": [ 390 | "p(y) = [0.2, 0.6, 0.2]" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": null, 396 | "metadata": {}, 397 | "outputs": [], 398 | "source": [ 399 | "p(x0|y=0) = [1 0 0]\n", 400 | "p(x0|y=1) = [0 0.33 0.6]\n", 401 | "p(x0|y=2) = [0 0.33 0.6]" 402 | ] 403 | } 404 | ], 405 | "metadata": { 406 | "kernelspec": { 407 | "display_name": "python3", 408 | "language": "python", 409 | "name": "python3" 410 | }, 411 | "language_info": { 412 
| "codemirror_mode": { 413 | "name": "ipython", 414 | "version": 2 415 | }, 416 | "file_extension": ".py", 417 | "mimetype": "text/x-python", 418 | "name": "python", 419 | "nbconvert_exporter": "python", 420 | "pygments_lexer": "ipython2", 421 | "version": "2.7.16" 422 | } 423 | }, 424 | "nbformat": 4, 425 | "nbformat_minor": 2 426 | } 427 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Naive Bayes Implementation-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Naive Bayes Classifier Implementation using numpy\n", 8 | "\n", 9 | "Naive Bayes is another supervised machine learning algorithm for classification problems. It makes a strong assumption about the data: **each feature is independent of the value of any other feature, given the class**. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. 
A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.\n", 10 | "\n", 11 | "In a naive Bayes classifier, we want the probability that a given data point belongs to each class, so we compute a prediction score for every class in the target.\n", 12 | "\n", 13 | "\n", 14 | "![title](figures/bayes-theorem.png)\n", 15 | "\n", 16 | "### Our motivation\n", 17 | "To gain a better understanding of how the algorithm works" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 3, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "from sklearn import datasets\n", 27 | "digits = datasets.load_digits()\n", 28 | "X = digits.data\n", 29 | "y = digits.target" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "# p(y|x) is proportional to p(x|y) * p(y)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "class MultinomialNB:\n", 48 | " \n", 49 | " def fit(self, X, y):\n", 50 | " unique_y, y_counts = np.unique(y, return_counts=True)\n", 51 | " self.prob_y = y_counts / len(y)\n", 52 | " self.unique_y = unique_y\n", 53 | " self.unique_x = np.array([np.unique(x_t) for x_t in X.T])\n", 54 | " self.prob_x = self.get_prob_x(X, y)\n", 55 | " return self\n", 56 | " \n", 57 | " def get_prob_x(self, X, y):\n", 58 | " return []\n", 59 | " \n", 60 | " def predict(self, X):\n", 61 | " pass\n", 62 | " \n", 63 | " def score(self, X, y):\n", 64 | " return (self.predict(X) == y).mean()" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "model = MultinomialNB().fit(X, y)\n", 74 | "model.score(X, y)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 13, 80 
| "metadata": {}, 81 | "outputs": [ 82 | { 83 | "data": { 84 | "text/plain": [ 85 | "array([[ 0., 0., 5.],\n", 86 | " [ 0., 0., 0.],\n", 87 | " [ 0., 0., 0.],\n", 88 | " [ 0., 0., 7.],\n", 89 | " [ 0., 0., 0.],\n", 90 | " [ 0., 0., 12.],\n", 91 | " [ 0., 0., 0.],\n", 92 | " [ 0., 0., 7.],\n", 93 | " [ 0., 0., 9.],\n", 94 | " [ 0., 0., 11.]])" 95 | ] 96 | }, 97 | "execution_count": 13, 98 | "metadata": {}, 99 | "output_type": "execute_result" 100 | } 101 | ], 102 | "source": [ 103 | "X[:10, :3]" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 8, 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "data": { 113 | "text/plain": [ 114 | "array([0, 1])" 115 | ] 116 | }, 117 | "execution_count": 8, 118 | "metadata": {}, 119 | "output_type": "execute_result" 120 | } 121 | ], 122 | "source": [ 123 | "y[:2]" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": {}, 130 | "outputs": [], 131 | "source": [] 132 | } 133 | ], 134 | "metadata": { 135 | "kernelspec": { 136 | "display_name": "python3", 137 | "language": "python", 138 | "name": "python3" 139 | }, 140 | "language_info": { 141 | "codemirror_mode": { 142 | "name": "ipython", 143 | "version": 2 144 | }, 145 | "file_extension": ".py", 146 | "mimetype": "text/x-python", 147 | "name": "python", 148 | "nbconvert_exporter": "python", 149 | "pygments_lexer": "ipython2", 150 | "version": "2.7.16" 151 | } 152 | }, 153 | "nbformat": 4, 154 | "nbformat_minor": 2 155 | } 156 | -------------------------------------------------------------------------------- /BayesianClassifier.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "COLUMN\tVALUE\tOUTPUT\tPROBABILITY\n", 13 | "Temp--->Hot--->Rainy--->0.4\n", 14 | 
"Temp--->Mild--->Rainy--->0.4\n", 15 | "Temp--->Cool--->Rainy--->0.2\n", 16 | "Humidity--->High--->Rainy--->0.6\n", 17 | "Humidity--->Normal--->Rainy--->0.4\n", 18 | "Windy--->Low--->Rainy--->0.6\n", 19 | "Windy--->High--->Rainy--->0.4\n", 20 | "Cloudy--->Yes--->Rainy--->0.6\n", 21 | "Cloudy--->No--->Rainy--->0.4\n", 22 | "Temp--->Hot--->Overcast--->0.5\n", 23 | "Temp--->Mild--->Overcast--->0.25\n", 24 | "Temp--->Cool--->Overcast--->0.25\n", 25 | "Humidity--->High--->Overcast--->0.5\n", 26 | "Humidity--->Normal--->Overcast--->0.5\n", 27 | "Windy--->Low--->Overcast--->0.5\n", 28 | "Windy--->High--->Overcast--->0.5\n", 29 | "Cloudy--->Yes--->Overcast--->0.5\n", 30 | "Cloudy--->No--->Overcast--->0.5\n", 31 | "Temp--->Hot--->Sunny--->0.0\n", 32 | "Temp--->Mild--->Sunny--->0.6\n", 33 | "Temp--->Cool--->Sunny--->0.4\n", 34 | "Humidity--->High--->Sunny--->0.4\n", 35 | "Humidity--->Normal--->Sunny--->0.6\n", 36 | "Windy--->Low--->Sunny--->0.6\n", 37 | "Windy--->High--->Sunny--->0.4\n", 38 | "Cloudy--->Yes--->Sunny--->0.2\n", 39 | "Cloudy--->No--->Sunny--->0.8\n", 40 | "{'Rainy': 5, 'Overcast': 4, 'Sunny': 5}\n", 41 | "{'Rainy': 0.6162624821683309, 'Overcast': 0.17831669044222537, 'Sunny': 0.20542082738944364}\n" 42 | ] 43 | } 44 | ], 45 | "source": [ 46 | "import pandas as pd\n", 47 | "\n", 48 | "class BayesianClassifier:\n", 49 | "\n", 50 | " def __init__(self, path, col):\n", 51 | " self.cols = {}\n", 52 | " self.op_cols = {}\n", 53 | " self.col = col\n", 54 | " self.data = pd.read_csv(path)\n", 55 | " self.dataM = self.data\n", 56 | " self.op = self.data[col]\n", 57 | " self.data = self.data.drop(col, axis=1)\n", 58 | " self.total = len(self.dataM)\n", 59 | " \n", 60 | " def get_probability_table(self):\n", 61 | " for i in self.op.unique():\n", 62 | " self.cols[i] = {}\n", 63 | " for j in self.data.columns:\n", 64 | " self.cols[i][j] = {}\n", 65 | " for k in self.data[j]:\n", 66 | " if k not in self.cols[i][j]:\n", 67 | " self.cols[i][j][k] = \"\"\n", 68 | " dfs = 
[]\n", 69 | " for i in self.op.unique():\n", 70 | " for j in self.data.columns:\n", 71 | " for k in self.data[j].unique():\n", 72 | " dfs.append(self.dataM[self.dataM[self.col] == i])\n", 73 | "\n", 74 | " \n", 75 | " print(\"COLUMN\\tVALUE\\tOUTPUT\\tPROBABILITY\")\n", 76 | " for x in self.cols:\n", 77 | " for y in self.cols[x]:\n", 78 | " for z in self.cols[x][y]:\n", 79 | " self.op_cols[x] = len(self.dataM[self.dataM[self.col] == x])\n", 80 | " total = self.op_cols[x] # number of rows whose class is x\n", 81 | " p = len(self.dataM[(self.dataM[y] == z) & (self.dataM[self.col] == x)]) / total\n", 82 | " self.cols[x][y][z] = p\n", 83 | " print(\"{}--->{}--->{}--->{}\".format(y,z,x,p))\n", 84 | " print(self.op_cols)\n", 85 | " \n", 86 | " def classify(self, values):\n", 87 | " p = {}\n", 88 | " tot = 0\n", 89 | " for i in self.op.unique():\n", 90 | " a = 1\n", 91 | " for key, value in values.items():\n", 92 | " a *= self.cols[i][key][value]\n", 93 | " p[i] = a*self.op_cols[i]/self.total\n", 94 | " tot += a*self.op_cols[i]/self.total\n", 95 | " \n", 96 | " for i in p:\n", 97 | " p[i] = p[i] / tot\n", 98 | " \n", 99 | " return p\n", 100 | " \n", 101 | "b = BayesianClassifier(\"new.csv\", \"Weather\")\n", 102 | "\n", 103 | "b.get_probability_table()\n", 104 | "print(b.classify({\"Temp\": 'Mild', \"Windy\": \"Low\", \"Humidity\": \"High\", \"Cloudy\": \"Yes\"}))" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [] 113 | } 114 | ], 115 | "metadata": { 116 | "kernelspec": { 117 | "display_name": "Python 3", 118 | "language": "python", 119 | "name": "python3" 120 | }, 121 | "language_info": { 122 | "codemirror_mode": { 123 | "name": "ipython", 124 | "version": 3 125 | }, 126 | "file_extension": ".py", 127 | "mimetype": "text/x-python", 128 | "name": "python", 129 | "nbconvert_exporter": "python", 130 | "pygments_lexer": "ipython3", 131 | "version": "3.7.3" 132 | } 133 
| }, 134 | "nbformat": 4, 135 | "nbformat_minor": 2 136 | } 137 | -------------------------------------------------------------------------------- /Binary Naive Bayes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Naive Bayes Classifier Implementation using numpy\n", 8 | "\n", 9 | "Naive Bayes is another supervised machine learning algorithm for classification problems. It makes a strong assumption about the data: **each feature is independent of the value of any other feature, given the class**. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.\n", 10 | "\n", 11 | "In a naive Bayes classifier, we want the probability that a given data point belongs to each class, so we compute a prediction score for every class in the target.\n", 12 | "\n", 13 | "\n", 14 | "![title](figures/bayes-theorem.png)\n", 15 | "\n", 16 | "This is a Bernoulli naive Bayes implementation, which expects each feature to be true or false (1 or 0)." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 72, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "# %load naive_bayes.py\n", 26 | "import numpy as np\n", 27 | "\n", 28 | "class NaiveBayesBinaryClassifier:\n", 29 | " \n", 30 | " def fit(self, X, y):\n", 31 | " self.y_classes, y_counts = np.unique(y, return_counts=True)\n", 32 | " self.phi_y = 1.0 * y_counts/y_counts.sum()\n", 33 | " self.phi_x = [1.0 * X[y==k].mean(axis=0) for k in self.y_classes]\n", 34 | " return self\n", 35 | " \n", 36 | " def predict(self, X):\n", 37 | " return np.apply_along_axis(lambda x: self.compute_probs(x), 1, X)\n", 38 | " \n", 39 | " def compute_probs(self, x):\n", 40 | " probs = [self.compute_prob(x, y) for y in range(len(self.y_classes))]\n", 41 | " return self.y_classes[np.argmax(probs)]\n", 42 | " \n", 43 | " def compute_prob(self, x, y):\n", 44 | " res = 1\n", 45 | " for j in range(len(x)):\n", 46 | " Pxy = self.phi_x[y][j] # p(xj=1|y)\n", 47 | " res *= (Pxy**x[j])*((1-Pxy)**(1-x[j])) # p(xj=0|y)\n", 48 | " return res * self.phi_y[y]\n", 49 | " \n", 50 | " def score(self, X, y):\n", 51 | " return (self.predict(X) == y).mean()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [] 60 | } 61 | ], 62 | "metadata": { 63 | "kernelspec": { 64 | "display_name": "python3", 65 | "language": "python", 66 | "name": "python3" 67 | }, 68 | "language_info": { 69 | "codemirror_mode": { 70 | "name": "ipython", 71 | "version": 2 72 | }, 73 | "file_extension": ".py", 74 | "mimetype": "text/x-python", 75 | "name": "python", 76 | "nbconvert_exporter": "python", 77 | "pygments_lexer": "ipython2", 78 | "version": "2.7.16" 79 | } 80 | }, 81 | "nbformat": 4, 82 | "nbformat_minor": 2 83 | } 84 | -------------------------------------------------------------------------------- /Gaussian Discriminant Analyses.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Gaussian Discriminant Analysis Implementation using numpy\n", 8 | "\n", 9 | "When we have a classification problem in which the input features x are\n", 10 | "continuous-valued random variables, we can then use the Gaussian Discriminant Analysis (GDA) model, which models p(x|y) using a multivariate normal distribution.\n", 11 | "\n", 12 | "GDA is a generative learning algorithm: it tries to model p(x|y) (and p(y)). After modeling p(y) (called the class priors) and p(x|y), our algorithm can then use Bayes' rule to derive the posterior distribution on y given x:\n", 13 | "\n", 14 | "\n", 15 | "![title](figures/bayes-theorem.png)\n", 16 | "\n", 17 | "GDA makes stronger modeling assumptions, and is more data efficient (i.e., requires less training data to learn “well”) when the modeling assumptions are correct or at least approximately correct. 
By comparison, logistic regression makes weaker assumptions and is significantly more robust to deviations from the modeling assumptions.\n", 18 | "\n", 19 | "### Our motivation\n", 20 | "To gain a better understanding of how the algorithm works" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 4, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "# %load gda.py\n", 30 | "import numpy as np\n", 31 | "\n", 32 | "class GDABinaryClassifier:\n", 33 | " \n", 34 | " def fit(self, X, y):\n", 35 | " self.fi = y.mean()\n", 36 | " self.u = np.array([ X[y==k].mean(axis=0) for k in [0,1]])\n", 37 | " X_u = X.copy()\n", 38 | " for k in [0,1]: X_u[y==k] -= self.u[k]\n", 39 | " self.E = X_u.T.dot(X_u) / len(y)\n", 40 | " self.invE = np.linalg.pinv(self.E)\n", 41 | " return self\n", 42 | " \n", 43 | " def predict(self, X):\n", 44 | " return np.argmax([self.compute_prob(X, i) for i in range(len(self.u))], axis=0)\n", 45 | " \n", 46 | " def compute_prob(self, X, i):\n", 47 | " u, phi = self.u[i], ((self.fi)**i * (1 - self.fi)**(1 - i))\n", 48 | " return np.exp(-1.0 * np.sum((X-u).dot(self.invE)*(X-u), axis=1)) * phi\n", 49 | " \n", 50 | " def score(self, X, y):\n", 51 | " return (self.predict(X) == y).mean()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 5, 57 | "metadata": {}, 58 | "outputs": [ 59 | { 60 | "data": { 61 | "text/plain": [ 62 | "0.9666080843585237" 63 | ] 64 | }, 65 | "execution_count": 5, 66 | "metadata": {}, 67 | "output_type": "execute_result" 68 | } 69 | ], 70 | "source": [ 71 | "from sklearn.datasets import load_breast_cancer\n", 72 | "X,y = load_breast_cancer(return_X_y=True)\n", 73 | "model = GDABinaryClassifier().fit(X,y)\n", 74 | "pre = model.predict(X)\n", 75 | "model.score(X,y)" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [] 84 | } 85 | ], 86 | "metadata": { 87 | "kernelspec": { 88 | "display_name": "python3", 89 
| "language": "python", 90 | "name": "python3" 91 | }, 92 | "language_info": { 93 | "codemirror_mode": { 94 | "name": "ipython", 95 | "version": 2 96 | }, 97 | "file_extension": ".py", 98 | "mimetype": "text/x-python", 99 | "name": "python", 100 | "nbconvert_exporter": "python", 101 | "pygments_lexer": "ipython2", 102 | "version": "2.7.16" 103 | } 104 | }, 105 | "nbformat": 4, 106 | "nbformat_minor": 2 107 | } 108 | -------------------------------------------------------------------------------- /Gaussian Naive Bayes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "# %load gaussian_naive_bayes.py\n", 10 | "import numpy as np\n", 11 | " \n", 12 | "class GaussianNB:\n", 13 | " \n", 14 | " def fit(self, X, y, epsilon = 1e-10):\n", 15 | " self.y_classes, y_counts = np.unique(y, return_counts=True)\n", 16 | " self.x_classes = np.array([np.unique(x) for x in X.T])\n", 17 | " self.phi_y = 1.0 * y_counts/y_counts.sum()\n", 18 | " self.u = np.array([X[y==k].mean(axis=0) for k in self.y_classes])\n", 19 | " self.var_x = np.array([X[y==k].var(axis=0) + epsilon for k in self.y_classes])\n", 20 | " return self\n", 21 | " \n", 22 | " def predict(self, X):\n", 23 | " return np.apply_along_axis(lambda x: self.compute_probs(x), 1, X)\n", 24 | " \n", 25 | " def compute_probs(self, x):\n", 26 | " probs = np.array([self.compute_prob(x, y) for y in range(len(self.y_classes))])\n", 27 | " return self.y_classes[np.argmax(probs)]\n", 28 | " \n", 29 | " def compute_prob(self, x, y):\n", 30 | " c = 1.0 /np.sqrt(2.0 * np.pi * (self.var_x[y]))\n", 31 | " return np.prod(c * np.exp(-1.0 * np.square(x - self.u[y]) / (2.0 * self.var_x[y])))\n", 32 | " \n", 33 | " def evaluate(self, X, y):\n", 34 | " return (self.predict(X) == y).mean()" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 11, 40 | 
"metadata": {}, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/plain": [ 45 | "0.96" 46 | ] 47 | }, 48 | "execution_count": 11, 49 | "metadata": {}, 50 | "output_type": "execute_result" 51 | } 52 | ], 53 | "source": [ 54 | "from sklearn import datasets\n", 55 | "from utils import accuracy_score\n", 56 | "iris = datasets.load_iris()\n", 57 | "X = iris.data \n", 58 | "y = iris.target\n", 59 | "GaussianNB().fit(X, y).evaluate(X, y)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 12, 65 | "metadata": {}, 66 | "outputs": [ 67 | { 68 | "name": "stdout", 69 | "output_type": "stream", 70 | "text": [ 71 | "[1 1 1 2 2 2]\n" 72 | ] 73 | } 74 | ], 75 | "source": [ 76 | "X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])\n", 77 | "Y = np.array([1, 1, 1, 2, 2, 2])\n", 78 | "clf = GaussianNB().fit(X, Y)\n", 79 | "print(clf.predict(X))" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 13, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/plain": [ 90 | "0.8091263216471898" 91 | ] 92 | }, 93 | "execution_count": 13, 94 | "metadata": {}, 95 | "output_type": "execute_result" 96 | } 97 | ], 98 | "source": [ 99 | "from sklearn import datasets\n", 100 | "digits = datasets.load_digits()\n", 101 | "X = digits.data\n", 102 | "y = digits.target\n", 103 | "GaussianNB().fit(X, y).evaluate(X, y)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [] 112 | } 113 | ], 114 | "metadata": { 115 | "kernelspec": { 116 | "display_name": "python3", 117 | "language": "python", 118 | "name": "python3" 119 | }, 120 | "language_info": { 121 | "codemirror_mode": { 122 | "name": "ipython", 123 | "version": 2 124 | }, 125 | "file_extension": ".py", 126 | "mimetype": "text/x-python", 127 | "name": "python", 128 | "nbconvert_exporter": "python", 129 | "pygments_lexer": "ipython2", 130 | "version": "2.7.16" 131 | } 132 | }, 133 | 
"nbformat": 4, 134 | "nbformat_minor": 2 135 | } 136 | -------------------------------------------------------------------------------- /ID3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "data": { 10 | "text/plain": [ 11 | "'\\nID3 Algorithm\\nMuskan Pandey\\n'" 12 | ] 13 | }, 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "output_type": "execute_result" 17 | } 18 | ], 19 | "source": [ 20 | "'''\n", 21 | "ID3 Algorithm\n", 22 | "Muskan Pandey\n", 23 | "'''" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 2, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "import csv\n", 33 | "import math\n", 34 | "import os" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 3, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "def load_csv_to_header_data(filename):\n", 44 | " fpath = os.path.join(os.getcwd(), filename)\n", 45 | " fs = csv.reader(open(fpath, newline='\\n'))\n", 46 | "\n", 47 | " all_row = []\n", 48 | " for r in fs:\n", 49 | " all_row.append(r)\n", 50 | "\n", 51 | " headers = all_row[0]\n", 52 | " idx_to_name, name_to_idx = get_header_name_to_idx_maps(headers)\n", 53 | "\n", 54 | " data = {\n", 55 | " 'header': headers,\n", 56 | " 'rows': all_row[1:],\n", 57 | " 'name_to_idx': name_to_idx,\n", 58 | " 'idx_to_name': idx_to_name\n", 59 | " }\n", 60 | " return data" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 4, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "def get_header_name_to_idx_maps(headers):\n", 70 | " name_to_idx = {}\n", 71 | " idx_to_name = {}\n", 72 | " for i in range(0, len(headers)):\n", 73 | " name_to_idx[headers[i]] = i\n", 74 | " idx_to_name[i] = headers[i]\n", 75 | " return idx_to_name, name_to_idx" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | 
"execution_count": 5, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "def project_columns(data, columns_to_project):\n", 85 | " data_h = list(data['header'])\n", 86 | " data_r = [list(r) for r in data['rows']] # copy each row so the caller's data is not mutated\n", 87 | "\n", 88 | " all_cols = list(range(0, len(data_h)))\n", 89 | "\n", 90 | " columns_to_project_ix = [data['name_to_idx'][name] for name in columns_to_project]\n", 91 | " columns_to_remove = [cidx for cidx in all_cols if cidx not in columns_to_project_ix]\n", 92 | "\n", 93 | " for delc in sorted(columns_to_remove, reverse=True):\n", 94 | " del data_h[delc]\n", 95 | " for r in data_r:\n", 96 | " del r[delc]\n", 97 | "\n", 98 | " idx_to_name, name_to_idx = get_header_name_to_idx_maps(data_h)\n", 99 | "\n", 100 | " return {'header': data_h, 'rows': data_r,\n", 101 | " 'name_to_idx': name_to_idx,\n", 102 | " 'idx_to_name': idx_to_name}" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 6, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "def get_uniq_values(data):\n", 112 | " idx_to_name = data['idx_to_name']\n", 113 | " idxs = idx_to_name.keys()\n", 114 | "\n", 115 | " val_map = {}\n", 116 | " for idx in iter(idxs):\n", 117 | " val_map[idx_to_name[idx]] = set()\n", 118 | "\n", 119 | " for data_row in data['rows']:\n", 120 | " for idx in idx_to_name.keys():\n", 121 | " att_name = idx_to_name[idx]\n", 122 | " val = data_row[idx]\n", 123 | " if val not in val_map[att_name]: # compare against this attribute's values, not the attribute names\n", 124 | " val_map[att_name].add(val)\n", 125 | " return val_map" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 7, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "def get_class_labels(data, target_attribute):\n", 135 | " rows = data['rows']\n", 136 | " col_idx = data['name_to_idx'][target_attribute]\n", 137 | " labels = {}\n", 138 | " for r in rows:\n", 139 | " val = r[col_idx]\n", 140 | " if val in labels:\n", 141 | " labels[val] = labels[val] + 1\n", 142 | " else:\n", 143 | " labels[val] = 
1\n", 144 | " return labels" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 8, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "def entropy(n, labels):\n", 154 | " ent = 0\n", 155 | " for label in labels.keys():\n", 156 | " p_x = labels[label] / n\n", 157 | " ent += - p_x * math.log(p_x, 2)\n", 158 | " return ent" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 9, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "def partition_data(data, group_att):\n", 168 | " partitions = {}\n", 169 | " data_rows = data['rows']\n", 170 | " partition_att_idx = data['name_to_idx'][group_att]\n", 171 | " for row in data_rows:\n", 172 | " row_val = row[partition_att_idx]\n", 173 | " if row_val not in partitions.keys():\n", 174 | " partitions[row_val] = {\n", 175 | " 'name_to_idx': data['name_to_idx'],\n", 176 | " 'idx_to_name': data['idx_to_name'],\n", 177 | " 'rows': list()\n", 178 | " }\n", 179 | " partitions[row_val]['rows'].append(row)\n", 180 | " return partitions" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 10, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "def avg_entropy_w_partitions(data, splitting_att, target_attribute):\n", 190 | " # find uniq values of splitting att\n", 191 | " data_rows = data['rows']\n", 192 | " n = len(data_rows)\n", 193 | " partitions = partition_data(data, splitting_att)\n", 194 | "\n", 195 | " avg_ent = 0\n", 196 | "\n", 197 | " for partition_key in partitions.keys():\n", 198 | " partitioned_data = partitions[partition_key]\n", 199 | " partition_n = len(partitioned_data['rows'])\n", 200 | " partition_labels = get_class_labels(partitioned_data, target_attribute)\n", 201 | " partition_entropy = entropy(partition_n, partition_labels)\n", 202 | " avg_ent += partition_n / n * partition_entropy\n", 203 | "\n", 204 | " return avg_ent, partitions" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | 
"execution_count": 11, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "def most_common_label(labels):\n", 214 | " mcl = max(labels, key=lambda k: labels[k])\n", 215 | " return mcl" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 12, 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [ 224 | "def id3(data, uniqs, remaining_atts, target_attribute):\n", 225 | " labels = get_class_labels(data, target_attribute)\n", 226 | "\n", 227 | " node = {}\n", 228 | "\n", 229 | " if len(labels.keys()) == 1:\n", 230 | " node['label'] = next(iter(labels.keys()))\n", 231 | " return node\n", 232 | "\n", 233 | " if len(remaining_atts) == 0:\n", 234 | " node['label'] = most_common_label(labels)\n", 235 | " return node\n", 236 | "\n", 237 | " n = len(data['rows'])\n", 238 | " ent = entropy(n, labels)\n", 239 | "\n", 240 | " max_info_gain = None\n", 241 | " max_info_gain_att = None\n", 242 | " max_info_gain_partitions = None\n", 243 | "\n", 244 | " for remaining_att in remaining_atts:\n", 245 | " avg_ent, partitions = avg_entropy_w_partitions(data, remaining_att, target_attribute)\n", 246 | " info_gain = ent - avg_ent\n", 247 | " if max_info_gain is None or info_gain > max_info_gain:\n", 248 | " max_info_gain = info_gain\n", 249 | " max_info_gain_att = remaining_att\n", 250 | " max_info_gain_partitions = partitions\n", 251 | "\n", 252 | " if max_info_gain is None:\n", 253 | " node['label'] = most_common_label(labels)\n", 254 | " return node\n", 255 | "\n", 256 | " node['attribute'] = max_info_gain_att\n", 257 | " node['nodes'] = {}\n", 258 | "\n", 259 | " remaining_atts_for_subtrees = set(remaining_atts)\n", 260 | " remaining_atts_for_subtrees.discard(max_info_gain_att)\n", 261 | "\n", 262 | " uniq_att_values = uniqs[max_info_gain_att]\n", 263 | "\n", 264 | " for att_value in uniq_att_values:\n", 265 | " if att_value not in max_info_gain_partitions.keys():\n", 266 | " node['nodes'][att_value] = {'label': 
most_common_label(labels)}\n", 267 | " continue\n", 268 | " partition = max_info_gain_partitions[att_value]\n", 269 | " node['nodes'][att_value] = id3(partition, uniqs, remaining_atts_for_subtrees, target_attribute)\n", 270 | "\n", 271 | " return node" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": 13, 277 | "metadata": {}, 278 | "outputs": [], 279 | "source": [ 280 | "def pretty_print_tree(root):\n", 281 | " stack = []\n", 282 | " rules = set()\n", 283 | "\n", 284 | " def traverse(node, stack, rules):\n", 285 | " if 'label' in node:\n", 286 | " stack.append(' THEN ' + node['label'])\n", 287 | " rules.add(''.join(stack))\n", 288 | " stack.pop()\n", 289 | " elif 'attribute' in node:\n", 290 | " ifnd = 'IF ' if not stack else ' AND '\n", 291 | " stack.append(ifnd + node['attribute'] + ' EQUALS ')\n", 292 | " for subnode_key in node['nodes']:\n", 293 | " stack.append(subnode_key)\n", 294 | " traverse(node['nodes'][subnode_key], stack, rules)\n", 295 | " stack.pop()\n", 296 | " stack.pop()\n", 297 | "\n", 298 | " traverse(root, stack, rules)\n", 299 | " print(os.linesep.join(rules))" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": 16, 305 | "metadata": {}, 306 | "outputs": [ 307 | { 308 | "name": "stdout", 309 | "output_type": "stream", 310 | "text": [ 311 | "IF Outlook EQUALS Rainy AND Windy EQUALS False THEN Yes\n", 312 | "IF Outlook EQUALS Rainy AND Windy EQUALS True THEN No\n", 313 | "IF Outlook EQUALS Sunny AND Humidity EQUALS Normal THEN Yes\n", 314 | "IF Outlook EQUALS Sunny AND Humidity EQUALS High THEN No\n", 315 | "IF Outlook EQUALS Overcast THEN Yes\n" 316 | ] 317 | } 318 | ], 319 | "source": [ 320 | "def main():\n", 321 | " config = {'data_file': 'tennis.csv', 'data_mappers': [], 'data_project_columns': ['Outlook', 'Temperature', 'Humidity', 'Windy', 'PlayTennis'], 'target_attribute': 'PlayTennis'}\n", 322 | "\n", 323 | " data = load_csv_to_header_data(config['data_file'])\n", 324 | " data = 
project_columns(data, config['data_project_columns'])\n", 325 | "\n", 326 | " target_attribute = config['target_attribute']\n", 327 | " remaining_attributes = set(data['header'])\n", 328 | " remaining_attributes.remove(target_attribute)\n", 329 | "\n", 330 | " uniqs = get_uniq_values(data)\n", 331 | "\n", 332 | " root = id3(data, uniqs, remaining_attributes, target_attribute)\n", 333 | "\n", 334 | " pretty_print_tree(root)\n", 335 | "\n", 336 | "\n", 337 | "if __name__ == \"__main__\": main()" 338 | ] 339 | } 340 | ], 341 | "metadata": { 342 | "kernelspec": { 343 | "display_name": "Python 3", 344 | "language": "python", 345 | "name": "python3" 346 | }, 347 | "language_info": { 348 | "codemirror_mode": { 349 | "name": "ipython", 350 | "version": 3 351 | }, 352 | "file_extension": ".py", 353 | "mimetype": "text/x-python", 354 | "name": "python", 355 | "nbconvert_exporter": "python", 356 | "pygments_lexer": "ipython3", 357 | "version": "3.7.3" 358 | } 359 | }, 360 | "nbformat": 4, 361 | "nbformat_minor": 2 362 | } 363 | -------------------------------------------------------------------------------- /Linear_reg.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "# Implement linear regression and perform the following operations:\n", 10 | "# 1. Generate a proper 2D dataset of n points, split it into training and testing data and perform linear regression using R2 method." 
11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 2, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "import pandas as pd\n", 20 | "import numpy as np" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 3, 26 | "metadata": { 27 | "scrolled": true 28 | }, 29 | "outputs": [ 30 | { 31 | "name": "stdout", 32 | "output_type": "stream", 33 | "text": [ 34 | "(84, 2)\n", 35 | "Index(['SAT', 'GPA'], dtype='object')\n" 36 | ] 37 | } 38 | ], 39 | "source": [ 40 | "data = pd.read_csv('slr.csv')\n", 41 | "print(data.shape)\n", 42 | "print(data.columns)" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 4, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "msk = np.random.rand(len(data)) < 0.8\n", 52 | "train = data[msk]\n", 53 | "test = data[~msk]" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 5, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "data": { 63 | "text/html": [ 64 | "
\n", 65 | "\n", 78 | "\n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | "
   SAT   GPA
0  1714  2.40
1  1664  2.52
2  1760  2.54
3  1685  2.74
6  1764  3.00
\n", 114 | "
" 115 | ], 116 | "text/plain": [ 117 | " SAT GPA\n", 118 | "0 1714 2.40\n", 119 | "1 1664 2.52\n", 120 | "2 1760 2.54\n", 121 | "3 1685 2.74\n", 122 | "6 1764 3.00" 123 | ] 124 | }, 125 | "execution_count": 5, 126 | "metadata": {}, 127 | "output_type": "execute_result" 128 | } 129 | ], 130 | "source": [ 131 | "train.head()" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 6, 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | "63 21\n" 144 | ] 145 | } 146 | ], 147 | "source": [ 148 | "print(len(train), len(test))" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 7, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "X_train = train['SAT']\n", 158 | "Y_train = train['GPA']\n", 159 | "X_test = test['SAT']\n", 160 | "Y_test = test['GPA']" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 8, 166 | "metadata": {}, 167 | "outputs": [ 168 | { 169 | "name": "stdout", 170 | "output_type": "stream", 171 | "text": [ 172 | "1843.936507936508 3.314603174603175\n" 173 | ] 174 | } 175 | ], 176 | "source": [ 177 | "x_mean = sum(X_train)/len(X_train)\n", 178 | "y_mean = sum(Y_train)/len(Y_train)\n", 179 | "print(x_mean, y_mean)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 9, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "b1 = sum((X_train - x_mean)*(Y_train - y_mean)) / sum((X_train - x_mean)**2)\n", 189 | "b0 = y_mean - b1*x_mean" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": 10, 195 | "metadata": { 196 | "scrolled": true 197 | }, 198 | "outputs": [ 199 | { 200 | "name": "stdout", 201 | "output_type": "stream", 202 | "text": [ 203 | "-0.16705420536473436 0.0018881655441944278\n" 204 | ] 205 | } 206 | ], 207 | "source": [ 208 | "print(b0,b1)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": 11, 214 | 
"metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "import matplotlib.pyplot as plt" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 12, 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "data": { 227 | "text/plain": [ 228 | "" 229 | ] 230 | }, 231 | "execution_count": 12, 232 | "metadata": {}, 233 | "output_type": "execute_result" 234 | }, 235 | { 236 | "data": { 237 | "image/png": "\n", 238 | "text/plain": [ 239 | "
" 240 | ] 241 | }, 242 | "metadata": { 243 | "needs_background": "light" 244 | }, 245 | "output_type": "display_data" 246 | } 247 | ], 248 | "source": [ 249 | "plt.scatter(X_train, Y_train, color = \"b\", marker = \"o\")" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 13, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "Y_pred = b0 + b1*X_test" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 14, 264 | "metadata": {}, 265 | "outputs": [ 266 | { 267 | "data": { 268 | "image/png": "\n", 269 | "text/plain": [ 270 | "
" 271 | ] 272 | }, 273 | "metadata": { 274 | "needs_background": "light" 275 | }, 276 | "output_type": "display_data" 277 | } 278 | ], 279 | "source": [ 280 | "plt.plot(X_test, Y_pred, color=\"g\")\n", 281 | "plt.plot()\n", 282 | "plt.plot(X_test, Y_test, color=\"r\")\n", 283 | "plt.plot()\n", 284 | "plt.show()" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": 15, 290 | "metadata": {}, 291 | "outputs": [ 292 | { 293 | "name": "stdout", 294 | "output_type": "stream", 295 | "text": [ 296 | "0.8829116848754562\n" 297 | ] 298 | } 299 | ], 300 | "source": [ 301 | "R_2 = sum((Y_pred - y_mean)**2) / sum((Y_test - y_mean)**2)\n", 302 | "print(R_2)" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": {}, 309 | "outputs": [], 310 | "source": [] 311 | } 312 | ], 313 | "metadata": { 314 | "kernelspec": { 315 | "display_name": "Python 3", 316 | "language": "python", 317 | "name": "python3" 318 | }, 319 | "language_info": { 320 | "codemirror_mode": { 321 | "name": "ipython", 322 | "version": 3 323 | }, 324 | "file_extension": ".py", 325 | "mimetype": "text/x-python", 326 | "name": "python", 327 | "nbconvert_exporter": "python", 328 | "pygments_lexer": "ipython3", 329 | "version": "3.7.3" 330 | } 331 | }, 332 | "nbformat": 4, 333 | "nbformat_minor": 2 334 | } 335 | -------------------------------------------------------------------------------- /Logistic Regression with Newtons Method.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 4, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "class LogisticRegression:\n", 11 | " \n", 12 | " def fit(self, X, y, epochs=10, lr=0.1):\n", 13 | " X = np.insert(X, 0, 1, axis=1)\n", 14 | " self.weights = np.zeros(X.shape[1])\n", 15 | " self.loss = []\n", 16 | " Xs = X.T.dot(X)\n", 17 | " for i in range(epochs):\n", 18 | 
" h = self.sigmoid(X.dot(self.weights))\n", 19 | " self.loss.append(self.get_loss(h,y))\n", 20 | " invH = np.linalg.pinv(Xs * h.dot(1-h))\n", 21 | " gradient = (h - y).dot(X)\n", 22 | " self.weights -= invH.dot(gradient)\n", 23 | " return self\n", 24 | " \n", 25 | " def predict(self, X):\n", 26 | " return self.sigmoid(np.insert(X, 0, 1, axis=1).dot(self.weights))\n", 27 | " \n", 28 | " def get_loss(self, h, y):\n", 29 | " return np.abs(y.dot(np.log(h)) + (1 - y).dot(np.log(1 - h)))\n", 30 | " \n", 31 | " def sigmoid(self, z):\n", 32 | " return 1/(1 + np.exp(-1 * z))\n", 33 | " \n", 34 | " def predict_classes(self, X):\n", 35 | " return (self.predict(X) >= 0.5) * 1" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 7, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "from sklearn.metrics import accuracy_score, f1_score\n", 52 | "def train_model(X, y, model):\n", 53 | " model.fit(X, y, lr=0.1)\n", 54 | " print(model.predict(X[:2, :]))\n", 55 | " print(model.predict_classes(X[:2, :]))\n", 56 | " pre = model.predict_classes(X)\n", 57 | " print('Accuracy :: ', model.get_loss(model.predict(X), y))\n", 58 | " print('F1 Score :: ', f1_score(y, pre))" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 8, 64 | "metadata": {}, 65 | "outputs": [ 66 | { 67 | "name": "stdout", 68 | "output_type": "stream", 69 | "text": [ 70 | "[0.46901795 0.47771344]\n", 71 | "[0 0]\n", 72 | "('Accuracy :: ', 95.3393291345393)\n", 73 | "('F1 Score :: ', 1.0)\n" 74 | ] 75 | } 76 | ], 77 | "source": [ 78 | "from sklearn.datasets import load_iris\n", 79 | "X, y = load_iris(return_X_y=True)\n", 80 | "train_model(X,(y !=0 )*1, LogisticRegression())" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 256, 86 | "metadata": {}, 87 | "outputs": [ 88 | { 89 | "name": "stdout", 90 | "output_type": 
"stream", 91 | "text": [ 92 | "[0.49048766 0.49403437]\n", 93 | "[0 0]\n", 94 | "('Accuracy :: ', 0.9648506151142355)\n", 95 | "('F1 Score :: ', 0.9726027397260274)\n" 96 | ] 97 | } 98 | ], 99 | "source": [ 100 | "from sklearn.datasets import load_breast_cancer\n", 101 | "X,y = load_breast_cancer(return_X_y=True)\n", 102 | "model = LogisticRegression()\n", 103 | "train_model(X,y, model)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 258, 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "data": { 113 | "text/plain": [ 114 | "array([ 1.75896854e-01, 1.51896300e-02, -3.17047148e-04, -1.65585845e-03,\n", 115 | " -2.21690168e-05, -5.90715949e-03, 2.94487743e-01, -9.75106585e-02,\n", 116 | " -1.49393191e-01, -7.16398499e-03, -2.31986997e-03, -3.03382659e-02,\n", 117 | " 4.71406827e-04, 1.57079149e-03, 6.43946929e-05, -1.10584177e+00,\n", 118 | " -4.52706299e-03, 2.48692110e-01, -7.37116644e-01, -1.18389729e-01,\n", 119 | " 4.98465538e-01, -1.36140562e-02, -4.99368104e-04, 1.69845087e-04,\n", 120 | " 7.05330073e-05, -3.78643487e-02, -4.68430831e-03, -2.65881514e-02,\n", 121 | " -3.23856876e-02, -3.88360534e-02, -3.00168736e-01])" 122 | ] 123 | }, 124 | "execution_count": 258, 125 | "metadata": {}, 126 | "output_type": "execute_result" 127 | } 128 | ], 129 | "source": [ 130 | "model.weights" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [] 139 | } 140 | ], 141 | "metadata": { 142 | "kernelspec": { 143 | "display_name": "python3", 144 | "language": "python", 145 | "name": "python3" 146 | }, 147 | "language_info": { 148 | "codemirror_mode": { 149 | "name": "ipython", 150 | "version": 2 151 | }, 152 | "file_extension": ".py", 153 | "mimetype": "text/x-python", 154 | "name": "python", 155 | "nbconvert_exporter": "python", 156 | "pygments_lexer": "ipython2", 157 | "version": "2.7.16" 158 | } 159 | }, 160 | "nbformat": 4, 161 | "nbformat_minor": 2 
162 | } 163 | -------------------------------------------------------------------------------- /Logistic Regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Logistic Regression in Python and NumPy" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In logistic regression, we model the outcome of a **binary variable** given a **linear combination of input features**. For example, we could try to predict the outcome of an election (win/lose) from how much money a candidate spent campaigning, how much time they spent campaigning, etc.\n", 15 | "\n", 16 | "### Model \n", 17 | "\n", 18 | "Logistic regression works as follows.\n", 19 | "\n", 20 | "**Given:** \n", 21 | "- dataset $\\{(\\boldsymbol{x}^{(1)}, y^{(1)}), ..., (\\boldsymbol{x}^{(m)}, y^{(m)})\\}$\n", 22 | "- with $\\boldsymbol{x}^{(i)}$ being a $d$-dimensional vector $\\boldsymbol{x}^{(i)} = (x^{(i)}_1, ..., x^{(i)}_d)$\n", 23 | "- $y^{(i)}$ being a binary target variable, $y^{(i)} \\in \\{0,1\\}$\n", 24 | "\n", 25 | "The logistic regression model can be interpreted as a very **simple neural network:**\n", 26 | "- it has a real-valued weight vector $\\boldsymbol{w}= (w_1, ..., w_d)$\n", 27 | "- it has a real-valued bias $b$\n", 28 | "- it uses a sigmoid function as its activation function\n", 29 | "\n", 30 | "![title](figures/logistic_regression.jpg)" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "### Training\n", 38 | "\n", 39 | "Unlike [linear regression](linear_regression.ipynb), logistic regression has no closed-form solution. However, the cost function is convex, so we can train the model using gradient descent. 
Because of this convexity, **gradient descent** converges toward the global minimum, provided the learning rate is small enough and enough training iterations are used. \n", 40 | "\n", 41 | "Training proceeds in several steps. First (step 0) the parameters are initialized. The remaining steps are repeated for a specified number of training iterations or until the parameters converge.\n", 42 | "\n", 43 | "* * * \n", 44 | "**Step 0:** Initialize the weight vector and bias with zeros (or small random values).\n", 45 | "* * *\n", 46 | "\n", 47 | "**Step 1:** Compute a linear combination of the input features and weights. This can be done in one step for all training examples, using vectorization and broadcasting:\n", 48 | "$\\boldsymbol{a} = \\boldsymbol{X} \\cdot \\boldsymbol{w} + b $\n", 49 | "\n", 50 | "where $\\boldsymbol{X}$ is a matrix of shape $(m, d)$ (samples by features) that holds all training examples, and $\\cdot$ denotes the dot product.\n", 51 | "* * *\n", 52 | "\n", 53 | "**Step 2:** Apply the sigmoid activation function, which returns values between 0 and 1:\n", 54 | "\n", 55 | "$\\boldsymbol{\\hat{y}} = \\sigma(\\boldsymbol{a}) = \\frac{1}{1 + \\exp(-\\boldsymbol{a})}$\n", 56 | "* * *\n", 57 | "\n", 58 | "**Step 3:** Compute the cost over the whole training set. We want to model the probability of the target value being 1. During training we therefore adapt the parameters so that the model outputs high values for examples with a positive label (true label 1) and low values for examples with a negative label (true label 0). This is reflected in the cost function:\n", 59 | "\n", 60 | "$J(\\boldsymbol{w},b) = - \\frac{1}{m} \\sum_{i=1}^m \\Big[ y^{(i)} \\log(\\hat{y}^{(i)}) + (1 - y^{(i)}) \\log(1 - \\hat{y}^{(i)}) \\Big]$\n", 61 | "* * *\n", 62 | "\n", 63 | "**Step 4:** Compute the gradient of the cost function with respect to the weight vector and bias. 
A detailed explanation of this derivation can be found [here](https://stats.stackexchange.com/questions/278771/how-is-the-cost-function-from-logistic-regression-derivated).\n", 64 | "\n", 65 | "The general formula is given by:\n", 66 | "\n", 67 | "$ \\frac{\\partial J}{\\partial w_j} = \\frac{1}{m}\\sum_{i=1}^m\\left[\\hat{y}^{(i)}-y^{(i)}\\right]\\,x_j^{(i)}$\n", 68 | "\n", 69 | "For the bias, the input $x_j^{(i)}$ is replaced by 1, which gives $\\frac{\\partial J}{\\partial b} = \\frac{1}{m}\\sum_{i=1}^m\\left[\\hat{y}^{(i)}-y^{(i)}\\right]$.\n", 70 | "* * *\n", 71 | "\n", 72 | "**Step 5:** Update the weights and bias:\n", 73 | "\n", 74 | "$\\boldsymbol{w} = \\boldsymbol{w} - \\eta \\, \\nabla_w J$ \n", 75 | "\n", 76 | "$b = b - \\eta \\, \\nabla_b J$\n", 77 | "\n", 78 | "where $\\eta$ is the learning rate." 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 18, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "import numpy as np\n", 88 | "class LogisticRegression:\n", 89 | " \n", 90 | " def fit(self, X, y, lr = 0.001, epochs=10000, verbose=True, batch_size=1):\n", 91 | " self.classes = np.unique(y)\n", 92 | " y = (y==self.classes[1]) * 1\n", 93 | " X = self.add_bias(X)\n", 94 | " self.weights = np.zeros(X.shape[1])\n", 95 | " self.loss = []\n", 96 | " for i in range(epochs):\n", 97 | " self.loss.append(self.cross_entropy(X,y))\n", 98 | " if i % 1000 == 0 and verbose: \n", 99 | " print('Iterations: %d - Error : %.4f' %(i, self.loss[i]))\n", 100 | " idx = np.random.choice(X.shape[0], batch_size)\n", 101 | " X_batch, y_batch = X[idx], y[idx]\n", 102 | " self.weights -= lr * self.get_gradient(X_batch, y_batch)\n", 103 | " return self\n", 104 | " \n", 105 | " def get_gradient(self, X, y):\n", 106 | " return -1.0 * (y - self.predict_(X)).dot(X) / len(X)\n", 107 | " \n", 108 | " def predict_(self, X):\n", 109 | " return self.sigmoid(np.dot(X, self.weights))\n", 110 | " \n", 111 | " def predict(self, X):\n", 112 | " return self.predict_(self.add_bias(X))\n", 113 | " \n", 114 | " def sigmoid(self, z):\n", 115 | " return 1.0/(1 + np.exp(-z))\n", 
116 | " \n", 117 | " def predict_classes(self, X):\n", 118 | " return self.predict_classes_(self.add_bias(X))\n", 119 | "\n", 120 | " def predict_classes_(self, X):\n", 121 | " return np.vectorize(lambda c: self.classes[1] if c>=0.5 else self.classes[0])(self.predict_(X))\n", 122 | " \n", 123 | " def cross_entropy(self, X, y):\n", 124 | " p = self.predict_(X)\n", 125 | " return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)) # mean binary cross-entropy J(w, b)\n", 126 | "\n", 127 | " def add_bias(self,X):\n", 128 | " return np.insert(X, 0, 1, axis=1)\n", 129 | "\n", 130 | " def score(self, X, y):\n", 131 | " return self.cross_entropy(self.add_bias(X), y)\n" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 19, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "from sklearn.metrics import accuracy_score\n", 141 | "def train_model(X, y, model):\n", 142 | " model.fit(X, y, lr=0.1)\n", 143 | " pre = model.predict_classes(X)\n", 144 | " print('Accuracy :: ', accuracy_score(y, pre))" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 20, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "name": "stdout", 154 | "output_type": "stream", 155 | "text": [ 156 | "Iterations: 0 - Error : 69.3147\n", 157 | "Iterations: 1000 - Error : 0.4746\n", 158 | "Iterations: 2000 - Error : 0.3847\n", 159 | "Iterations: 3000 - Error : 0.1645\n", 160 | "Iterations: 4000 - Error : 0.1280\n", 161 | "Iterations: 5000 - Error : 0.1126\n", 162 | "Iterations: 6000 - Error : 0.0783\n", 163 | "Iterations: 7000 - Error : 0.0674\n", 164 | "Iterations: 8000 - Error : 0.0621\n", 165 | "Iterations: 9000 - Error : 0.0664\n", 166 | "('Accuracy :: ', 1.0)\n" 167 | ] 168 | } 169 | ], 170 | "source": [ 171 | "from sklearn.datasets import load_iris\n", 172 | "X, y = load_iris(return_X_y=True)\n", 173 | "lr = LogisticRegression()\n", 174 | "train_model(X,(y !=0 )*1, lr)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 21, 180 | "metadata": {}, 181 | 
"outputs": [ 182 | { 183 | "data": { 184 | "image/png": "\n", 185 | "text/plain": [ 186 | "
" 187 | ] 188 | }, 189 | "metadata": { 190 | "needs_background": "light" 191 | }, 192 | "output_type": "display_data" 193 | } 194 | ], 195 | "source": [ 196 | "import matplotlib.pyplot as plt\n", 197 | "fig = plt.figure(figsize=(8,6))\n", 198 | "plt.plot(np.arange(len(lr.loss)), lr.loss)\n", 199 | "plt.title(\"Development of cost over training\")\n", 200 | "plt.xlabel(\"Number of iterations\")\n", 201 | "plt.ylabel(\"Cost\")\n", 202 | "plt.show()" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [] 211 | } 212 | ], 213 | "metadata": { 214 | "kernelspec": { 215 | "display_name": "python3", 216 | "language": "python", 217 | "name": "python3" 218 | }, 219 | "language_info": { 220 | "codemirror_mode": { 221 | "name": "ipython", 222 | "version": 2 223 | }, 224 | "file_extension": ".py", 225 | "mimetype": "text/x-python", 226 | "name": "python", 227 | "nbconvert_exporter": "python", 228 | "pygments_lexer": "ipython2", 229 | "version": "2.7.16" 230 | } 231 | }, 232 | "nbformat": 4, 233 | "nbformat_minor": 2 234 | } 235 | -------------------------------------------------------------------------------- /Multinomial Naive Bayes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 31, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | " \n", 11 | "# class MultinomialNB:\n", 12 | " \n", 13 | "# def fit(self, X, y):\n", 14 | "# self.y_classes, y_counts = np.unique(y, return_counts=True)\n", 15 | "# self.x_classes = np.array([np.unique(x) for x in X.T])\n", 16 | "# self.phi_y = 1.0 * y_counts/y_counts.sum()\n", 17 | "# self.phi_x = self.mean_x(X, y)\n", 18 | "# return self\n", 19 | " \n", 20 | "# def mean_x(self, X, y):\n", 21 | "# return [[(X[:,j][y==k].reshape(-1,1) == self.x_classes[j]).mean(axis=0)\n", 22 | "# for j in 
range(len(self.x_classes))]\n", 23 | "# for k in self.y_classes]\n", 24 | " \n", 25 | "# def predict(self, X):\n", 26 | "# return np.apply_along_axis(lambda x: self.compute_probs(x), 1, X)\n", 27 | " \n", 28 | "# def compute_probs(self, x):\n", 29 | "# probs = np.array([self.compute_prob(x, y) for y in range(len(self.y_classes))])\n", 30 | "# return self.y_classes[np.argmax(probs)]\n", 31 | " \n", 32 | "# def compute_prob(self, x, y):\n", 33 | "# Pxy = 1\n", 34 | "# for j in range(len(x)):\n", 35 | "# i = list(self.x_classes[j]).index(x[j])\n", 36 | "# Pxy *= self.phi_x[y][j][i] # p(xj|y)\n", 37 | "# return Pxy * self.phi_y[y]\n", 38 | " \n", 39 | "# def evaluate(self, X, y):\n", 40 | "# return (self.predict(X) == y).mean()\n", 41 | "\n", 42 | "class MultinomialNB:\n", 43 | " \n", 44 | " def fit(self, X, y, ls=0.01):\n", 45 | " self.ls = ls\n", 46 | " self.y_classes, y_counts = np.unique(y, return_counts=True)\n", 47 | " self.x_classes = [np.unique(x) for x in X.T]\n", 48 | " self.phi_y = 1.0 * y_counts/y_counts.sum()\n", 49 | " self.phi_x = self.mean_X(X, y)\n", 50 | " self.c_x = self.count_x(X, y)\n", 51 | " return self\n", 52 | " \n", 53 | " def mean_X(self, X, y):\n", 54 | " return [[self.ls_mean_x(X, y, k, j) for j in range(len(self.x_classes))] for k in self.y_classes]\n", 55 | " \n", 56 | " def ls_mean_x(self, X, y, k, j):\n", 57 | " x_data = (X[:,j][y==k].reshape(-1,1) == self.x_classes[j])\n", 58 | " return (x_data.sum(axis=0) + self.ls ) / (len(x_data) + (len(self.x_classes) * self.ls))\n", 59 | " \n", 60 | " def get_mean_x(self, y, j):\n", 61 | " return self.ls / (self.c_x[y][j] + (len(self.x_classes) * self.ls)) # smoothed p(xj|y) for an unseen value (count 0)\n", 62 | " \n", 63 | " def count_x(self, X, y):\n", 64 | " return [[len(X[:,j][y==k].reshape(-1,1) == self.x_classes[j])\n", 65 | " for j in range(len(self.x_classes))]\n", 66 | " for k in self.y_classes]\n", 67 | "\n", 68 | " def predict(self, X):\n", 69 | " return np.apply_along_axis(lambda x: self.compute_probs(x), 1, X)\n", 70 | " \n", 
71 | " def compute_probs(self, x):\n", 72 | " probs = np.array([self.compute_prob(x, y) for y in range(len(self.y_classes))])\n", 73 | " return self.y_classes[np.argmax(probs)]\n", 74 | " \n", 75 | " def compute_prob(self, x, y):\n", 76 | " Pxy = 1\n", 77 | " for j in range(len(x)):\n", 78 | " x_clas = self.x_classes[j]\n", 79 | " if x[j] in x_clas:\n", 80 | " i = list(x_clas).index(x[j])\n", 81 | " p_x_j_y = self.phi_x[y][j][i] # p(xj|y)\n", 82 | " Pxy *= p_x_j_y\n", 83 | " else:\n", 84 | " Pxy *= self.get_mean_x(y, j)\n", 85 | " return Pxy * self.phi_y[y]\n", 86 | " \n", 87 | " def evaluate(self, X, y):\n", 88 | " return (self.predict(X) == y).mean()" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 32, 94 | "metadata": {}, 95 | "outputs": [ 96 | { 97 | "data": { 98 | "text/plain": [ 99 | "0.9666666666666667" 100 | ] 101 | }, 102 | "execution_count": 32, 103 | "metadata": {}, 104 | "output_type": "execute_result" 105 | } 106 | ], 107 | "source": [ 108 | "from sklearn import datasets\n", 109 | "from utils import accuracy_score\n", 110 | "iris = datasets.load_iris()\n", 111 | "X = iris.data \n", 112 | "y = iris.target\n", 113 | "nb = MultinomialNB().fit(X, y)\n", 114 | "nb.evaluate(X, y)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 33, 120 | "metadata": {}, 121 | "outputs": [ 122 | { 123 | "data": { 124 | "text/plain": [ 125 | "array([0.33333333, 0.33333333, 0.33333333])" 126 | ] 127 | }, 128 | "execution_count": 33, 129 | "metadata": {}, 130 | "output_type": "execute_result" 131 | } 132 | ], 133 | "source": [ 134 | "nb.phi_y" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 34, 140 | "metadata": {}, 141 | "outputs": [ 142 | { 143 | "data": { 144 | "text/plain": [ 145 | "[[array([0.02018385, 0.06015188, 0.02018385, 0.08013589, 0.04016787,\n", 146 | " 0.1001199 , 0.08013589, 0.16007194, 0.16007194, 0.06015188,\n", 147 | " 0.02018385, 0.1001199 , 0.04016787, 0.00019984, 0.04016787,\n", 148 
| " 0.02018385, 0.00019984, 0.00019984, 0.00019984, 0.00019984,\n", 149 | " 0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984,\n", 150 | " 0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984,\n", 151 | " 0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984]),\n", 152 | " array([0.00019984, 0.00019984, 0.02018385, 0.00019984, 0.00019984,\n", 153 | " 0.00019984, 0.00019984, 0.00019984, 0.02018385, 0.12010392,\n", 154 | " 0.08013589, 0.1001199 , 0.04016787, 0.18005596, 0.12010392,\n", 155 | " 0.06015188, 0.06015188, 0.08013589, 0.04016787, 0.02018385,\n", 156 | " 0.02018385, 0.02018385, 0.02018385]),\n", 157 | " array([2.01838529e-02, 2.01838529e-02, 4.01678657e-02, 1.40087930e-01,\n", 158 | " 2.59992006e-01, 2.59992006e-01, 1.40087930e-01, 8.01358913e-02,\n", 159 | " 4.01678657e-02, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 160 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 161 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 162 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 163 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 164 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 165 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 166 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 167 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04]),\n", 168 | " array([1.00119904e-01, 5.79736211e-01, 1.40087930e-01, 1.40087930e-01,\n", 169 | " 2.01838529e-02, 2.01838529e-02, 1.99840128e-04, 1.99840128e-04,\n", 170 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 171 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 172 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 173 | " 1.99840128e-04, 1.99840128e-04])],\n", 174 | " [array([0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984,\n", 175 | " 0.00019984, 
0.02018385, 0.04016787, 0.02018385, 0.02018385,\n", 176 | " 0.00019984, 0.02018385, 0.1001199 , 0.1001199 , 0.1001199 ,\n", 177 | " 0.06015188, 0.04016787, 0.08013589, 0.08013589, 0.04016787,\n", 178 | " 0.06015188, 0.04016787, 0.02018385, 0.04016787, 0.06015188,\n", 179 | " 0.02018385, 0.02018385, 0.02018385, 0.00019984, 0.00019984,\n", 180 | " 0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984]),\n", 181 | " array([0.02018385, 0.04016787, 0.06015188, 0.06015188, 0.08013589,\n", 182 | " 0.06015188, 0.1001199 , 0.12010392, 0.14008793, 0.16007194,\n", 183 | " 0.06015188, 0.06015188, 0.02018385, 0.02018385, 0.00019984,\n", 184 | " 0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984,\n", 185 | " 0.00019984, 0.00019984, 0.00019984]),\n", 186 | " array([0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984,\n", 187 | " 0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.02018385,\n", 188 | " 0.04016787, 0.04016787, 0.02018385, 0.02018385, 0.02018385,\n", 189 | " 0.06015188, 0.1001199 , 0.06015188, 0.08013589, 0.04016787,\n", 190 | " 0.08013589, 0.14008793, 0.06015188, 0.1001199 , 0.04016787,\n", 191 | " 0.04016787, 0.02018385, 0.02018385, 0.00019984, 0.00019984,\n", 192 | " 0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984,\n", 193 | " 0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984,\n", 194 | " 0.00019984, 0.00019984, 0.00019984]),\n", 195 | " array([1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 196 | " 1.99840128e-04, 1.99840128e-04, 1.40087930e-01, 6.01518785e-02,\n", 197 | " 1.00119904e-01, 2.59992006e-01, 1.40087930e-01, 2.00039968e-01,\n", 198 | " 6.01518785e-02, 2.01838529e-02, 2.01838529e-02, 1.99840128e-04,\n", 199 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 200 | " 1.99840128e-04, 1.99840128e-04])],\n", 201 | " [array([0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984,\n", 202 | " 0.00019984, 0.02018385, 0.00019984, 0.00019984, 0.00019984,\n", 203 | " 0.00019984, 
0.00019984, 0.00019984, 0.02018385, 0.02018385,\n", 204 | " 0.06015188, 0.02018385, 0.04016787, 0.04016787, 0.04016787,\n", 205 | " 0.12010392, 0.1001199 , 0.08013589, 0.00019984, 0.1001199 ,\n", 206 | " 0.04016787, 0.06015188, 0.00019984, 0.02018385, 0.06015188,\n", 207 | " 0.02018385, 0.02018385, 0.02018385, 0.08013589, 0.02018385]),\n", 208 | " array([1.99840128e-04, 2.01838529e-02, 1.99840128e-04, 1.99840128e-04,\n", 209 | " 8.01358913e-02, 4.01678657e-02, 8.01358913e-02, 1.60071942e-01,\n", 210 | " 4.01678657e-02, 2.40007994e-01, 8.01358913e-02, 1.00119904e-01,\n", 211 | " 6.01518785e-02, 4.01678657e-02, 1.99840128e-04, 2.01838529e-02,\n", 212 | " 1.99840128e-04, 4.01678657e-02, 1.99840128e-04, 1.99840128e-04,\n", 213 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04]),\n", 214 | " array([0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984,\n", 215 | " 0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984,\n", 216 | " 0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984,\n", 217 | " 0.00019984, 0.00019984, 0.00019984, 0.00019984, 0.00019984,\n", 218 | " 0.00019984, 0.02018385, 0.00019984, 0.00019984, 0.04016787,\n", 219 | " 0.06015188, 0.06015188, 0.14008793, 0.04016787, 0.04016787,\n", 220 | " 0.04016787, 0.06015188, 0.12010392, 0.06015188, 0.06015188,\n", 221 | " 0.04016787, 0.04016787, 0.06015188, 0.02018385, 0.02018385,\n", 222 | " 0.02018385, 0.04016787, 0.02018385]),\n", 223 | " array([1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 224 | " 1.99840128e-04, 1.99840128e-04, 1.99840128e-04, 1.99840128e-04,\n", 225 | " 1.99840128e-04, 1.99840128e-04, 2.01838529e-02, 4.01678657e-02,\n", 226 | " 2.01838529e-02, 2.01838529e-02, 2.20023981e-01, 1.00119904e-01,\n", 227 | " 1.20103917e-01, 1.20103917e-01, 6.01518785e-02, 1.60071942e-01,\n", 228 | " 6.01518785e-02, 6.01518785e-02])]]" 229 | ] 230 | }, 231 | "execution_count": 34, 232 | "metadata": {}, 233 | "output_type": "execute_result" 234 | } 235 | ], 236 | "source": [ 237 
| "nb.phi_x" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 28, 243 | "metadata": {}, 244 | "outputs": [ 245 | { 246 | "name": "stdout", 247 | "output_type": "stream", 248 | "text": [ 249 | "[1 1 1 2 2 2]\n" 250 | ] 251 | } 252 | ], 253 | "source": [ 254 | "X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])\n", 255 | "Y = np.array([1, 1, 1, 2, 2, 2])\n", 256 | "clf = MultinomialNB().fit(X, Y)\n", 257 | "print(clf.predict(X))" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 30, 263 | "metadata": {}, 264 | "outputs": [ 265 | { 266 | "data": { 267 | "text/plain": [ 268 | "0.9755147468002225" 269 | ] 270 | }, 271 | "execution_count": 30, 272 | "metadata": {}, 273 | "output_type": "execute_result" 274 | } 275 | ], 276 | "source": [ 277 | "from sklearn import datasets\n", 278 | "digits = datasets.load_digits()\n", 279 | "X = digits.data\n", 280 | "y = digits.target\n", 281 | "MultinomialNB().fit(X, y).evaluate(X, y)" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [] 290 | } 291 | ], 292 | "metadata": { 293 | "kernelspec": { 294 | "display_name": "python3", 295 | "language": "python", 296 | "name": "python3" 297 | }, 298 | "language_info": { 299 | "codemirror_mode": { 300 | "name": "ipython", 301 | "version": 2 302 | }, 303 | "file_extension": ".py", 304 | "mimetype": "text/x-python", 305 | "name": "python", 306 | "nbconvert_exporter": "python", 307 | "pygments_lexer": "ipython2", 308 | "version": "2.7.16" 309 | } 310 | }, 311 | "nbformat": 4, 312 | "nbformat_minor": 2 313 | } 314 | -------------------------------------------------------------------------------- /Naive Bayes Implementation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Naive Bayes Classifier 
Implementation using numpy\n", 8 | "\n", 9 | "Naive Bayes is another supervised machine learning algorithm for classification problems. It makes a strong assumption about the data: **each feature is independent of the value of any other feature**. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.\n", 10 | "\n", 11 | "In a Naive Bayes classifier, we want the probability that a given data point belongs to a specific class; we compute this probability for every class in the target.\n", 12 | "\n", 13 | "\n", 14 | "![title](figures/bayes-theorem.png)\n", 15 | "\n", 16 | "### Our motivation\n", 17 | "To gain a better understanding of how the algorithm works" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 3, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "from sklearn import datasets\n", 27 | "digits = datasets.load_digits()\n", 28 | "X = digits.data\n", 29 | "y = digits.target" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "# p(y|x) is proportional to p(x|y) * p(y)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "class MultinomialNB:\n", 48 | " \n", 49 | " def fit(self, X, y):\n", 50 | " unique_y, y_counts = np.unique(y, return_counts=True)\n", 51 | " self.prob_y = y_counts / len(y)\n", 52 | " self.unique_y = unique_y\n", 53 | " self.unique_x = np.array([np.unique(x_t) for x_t in X.T])\n", 54 | " self.prob_x = self.get_prob_x(X, y)\n", 55 | " return self\n", 56 | " \n", 57 | " def get_prob_x(self, X, y):\n", 58 | " return []\n", 59 | " \n", 60 | " def predict(self, 
X):\n", 61 | " pass\n", 62 | " \n", 63 | " def score(self, X, y):\n", 64 | " " 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "model = MultinomialNB().fit(X, y)\n", 74 | "model.score(X, y)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 13, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "data": { 84 | "text/plain": [ 85 | "array([[ 0., 0., 5.],\n", 86 | " [ 0., 0., 0.],\n", 87 | " [ 0., 0., 0.],\n", 88 | " [ 0., 0., 7.],\n", 89 | " [ 0., 0., 0.],\n", 90 | " [ 0., 0., 12.],\n", 91 | " [ 0., 0., 0.],\n", 92 | " [ 0., 0., 7.],\n", 93 | " [ 0., 0., 9.],\n", 94 | " [ 0., 0., 11.]])" 95 | ] 96 | }, 97 | "execution_count": 13, 98 | "metadata": {}, 99 | "output_type": "execute_result" 100 | } 101 | ], 102 | "source": [ 103 | "X[:10, :3]" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 8, 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "data": { 113 | "text/plain": [ 114 | "array([0, 1])" 115 | ] 116 | }, 117 | "execution_count": 8, 118 | "metadata": {}, 119 | "output_type": "execute_result" 120 | } 121 | ], 122 | "source": [ 123 | "y[:2]" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": {}, 130 | "outputs": [], 131 | "source": [] 132 | } 133 | ], 134 | "metadata": { 135 | "kernelspec": { 136 | "display_name": "python3", 137 | "language": "python", 138 | "name": "python3" 139 | }, 140 | "language_info": { 141 | "codemirror_mode": { 142 | "name": "ipython", 143 | "version": 2 144 | }, 145 | "file_extension": ".py", 146 | "mimetype": "text/x-python", 147 | "name": "python", 148 | "nbconvert_exporter": "python", 149 | "pygments_lexer": "ipython2", 150 | "version": "2.7.16" 151 | } 152 | }, 153 | "nbformat": 4, 154 | "nbformat_minor": 2 155 | } 156 | -------------------------------------------------------------------------------- /README.md: 
-------------------------------------------------------------------------------- 1 | # Basic Machine Learning Implementation with Python and NumPy 2 | 3 | This repository contains implementations of basic machine learning algorithms in Python and NumPy. All algorithms are implemented from scratch without using additional machine learning libraries. The intention of these notebooks is to provide a basic understanding of the algorithms and their underlying structure, not to provide the most efficient implementations. 4 | 5 | 1. [Linear Regression](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Linear%20Regression%20Implementation%20from%20scratch.ipynb) 6 | 7 | 2. [Logistic Regression](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Logistic%20Regression.ipynb) 8 | 9 | 3. [Multiclass Logistic Regression](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Multi%20Class%20Logistic%20Regression.ipynb) 10 | 11 | 4. [Linear Regression with Newton's Method](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Linear%20Regression%20with%20Newtons%20Method.ipynb) 12 | 13 | 5. [Logistic Regression with Newton's Method](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Logistic%20Regression%20with%20Newtons%20Method.ipynb) 14 | 15 | 6. [Multiclass Logistic Regression with Newton's Method](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Multi%20Class%20Logistic%20Regression%20with%20Newtons%20Method.ipynb) 16 | 17 | 7. [Perceptron](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Perceptron.ipynb) 18 | 19 | 8. [Binary Naive Bayes](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Binary%20Naive%20Bayes.ipynb) 20 | 21 | 9. 
[Multinomial Naive Bayes](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Multinomial%20Naive%20Bayes.ipynb) 22 | 23 | 10. [Gaussian Naive Bayes](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Gaussian%20Naive%20Bayes.ipynb) 24 | 25 | 11. [Gaussian Discriminant Analysis](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Gaussian%20Discriminant%20Analyses.ipynb) 26 | 27 | 12. [KMeans](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/KMeans.ipynb) 28 | 29 | 13. [Wrapper methods implementation - Forward and Backward](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Wrapper%20Method%20For%20Feature%20Selection%20-%20Forward%20and%20Backward%20.ipynb) 30 | 31 | 14. [Multiclass Gaussian Discriminant Analysis](https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Multi%20Class%20Gaussian%20Discriminant%20Analyses.ipynb) 32 | -------------------------------------------------------------------------------- /README_files/15320635: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/README_files/15320635 -------------------------------------------------------------------------------- /README_files/1f334.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/README_files/1f334.png -------------------------------------------------------------------------------- /README_files/1f3af.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/README_files/1f3af.png 
-------------------------------------------------------------------------------- /README_files/1f3e0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/README_files/1f3e0.png -------------------------------------------------------------------------------- /README_files/1f912.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/README_files/1f912.png -------------------------------------------------------------------------------- /README_files/7457498: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/README_files/7457498 -------------------------------------------------------------------------------- /README_files/decision_tree_predictions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/README_files/decision_tree_predictions.png -------------------------------------------------------------------------------- /README_files/image_preprocessing.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/README_files/image_preprocessing.png -------------------------------------------------------------------------------- /README_files/octocat-spinner-128.gif: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/README_files/octocat-spinner-128.gif -------------------------------------------------------------------------------- /README_files/search-key-slash.svg: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | -------------------------------------------------------------------------------- /Wrapper Method For Feature Selection - Forward and Backward .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "from sklearn.linear_model import LogisticRegression\n", 10 | "from sklearn.metrics import accuracy_score\n", 11 | "class FeatureSelection:\n", 12 | " \n", 13 | " def forward_feature_selection(self, X, y, n_features=2):\n", 14 | " results = {'best':[]}\n", 15 | " feature_list = [] # indices of the features selected so far\n", 16 | " while len(feature_list) < n_features:\n", 17 | " best = None\n", 18 | " for i in range(X.shape[1]):\n", 19 | " if i not in feature_list: \n", 20 | " cur_model_result = train_model(X, y, feature_list + [i])\n", 21 | " best = get_best(best, cur_model_result)\n", 22 | " feature_list = best['features']\n", 23 | " results['best'].append(best)\n", 24 | " self.result = compute_best_features(results, n_features)\n", 25 | " return results\n", 26 | "\n", 27 | " def backward_feature_selection(self, X, y, n_features=2):\n", 28 | " results = {'best':[]}\n", 29 | " feature_list = list(range(X.shape[1]))\n", 30 | " while len(feature_list) > 1:\n", 31 | " best = None\n", 32 | " for i in range(len(feature_list)):\n", 33 | " idx = feature_list[0:i] + feature_list[i+1:]\n", 34 | " best = get_best(best, train_model(X, y, idx))\n", 35 | " feature_list = best['features']\n", 36 | " results['best'].append(best)\n", 37 | " self.result = compute_best_features(results, n_features)\n", 38 | " return results\n", 39 | 
"\n", 40 | "def compute_best_features(results, n_features):\n", 41 | " best_result = None\n", 42 | " for result in results['best']:\n", 43 | " if len(result['features']) <= n_features:\n", 44 | " best_result = get_best(best_result, result)\n", 45 | " results['best_features'] = best_result['features']\n", 46 | " \n", 47 | "def train_model(X, y, idx):\n", 48 | " X_ = X[:, idx]\n", 49 | " model = LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial').fit(X_, y)\n", 50 | " acc = accuracy_score(y, model.predict(X_))\n", 51 | " return {'score': acc, 'features': idx}\n", 52 | "\n", 53 | "def get_best(b1, b2):\n", 54 | " return b2 if b1 == None or b2['score'] > b1['score'] else b1" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 3, 60 | "metadata": {}, 61 | "outputs": [ 62 | { 63 | "name": "stdout", 64 | "output_type": "stream", 65 | "text": [ 66 | "**********Forward*********\n", 67 | "{'best_features': [3, 2], 'best': [{'score': 0.96, 'features': [3]}, {'score': 0.9666666666666667, 'features': [3, 2]}]}\n", 68 | "\n", 69 | "\n", 70 | "**********Backward*********\n", 71 | "{'best_features': [2, 3], 'best': [{'score': 0.9733333333333334, 'features': [1, 2, 3]}, {'score': 0.9666666666666667, 'features': [2, 3]}, {'score': 0.96, 'features': [3]}]}\n" 72 | ] 73 | } 74 | ], 75 | "source": [ 76 | "from sklearn import datasets\n", 77 | "iris = datasets.load_iris()\n", 78 | "X = iris.data \n", 79 | "Y = iris.target\n", 80 | "print('**********Forward*********')\n", 81 | "print(FeatureSelection().forward_feature_selection(X, Y))\n", 82 | "print('\\n')\n", 83 | "print('**********Backward*********')\n", 84 | "print(FeatureSelection().backward_feature_selection(X, Y))" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [] 93 | } 94 | ], 95 | "metadata": { 96 | "kernelspec": { 97 | "display_name": "python3", 98 | "language": "python", 99 | "name": "python3" 
100 | }, 101 | "language_info": { 102 | "codemirror_mode": { 103 | "name": "ipython", 104 | "version": 2 105 | }, 106 | "file_extension": ".py", 107 | "mimetype": "text/x-python", 108 | "name": "python", 109 | "nbconvert_exporter": "python", 110 | "pygments_lexer": "ipython2", 111 | "version": "2.7.16" 112 | } 113 | }, 114 | "nbformat": 4, 115 | "nbformat_minor": 2 116 | } 117 | -------------------------------------------------------------------------------- /book_rec_knn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 6, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "The Karate Kid\n", 13 | "A Brilliant Young Mind\n", 14 | "Finding Forrester\n" 15 | ] 16 | } 17 | ], 18 | "source": [ 19 | "from collections import Counter\n", 20 | "import math\n", 21 | "\n", 22 | "def knn(data, query, k, choice_fn):\n", 23 | " neighbor_distances_and_indices = []\n", 24 | " \n", 25 | " for index, example in enumerate(data):\n", 26 | " \n", 27 | " distance = euclidean_distance(example[:-1], query)\n", 28 | " \n", 29 | " neighbor_distances_and_indices.append((distance, index))\n", 30 | " \n", 31 | " sorted_neighbor_distances_and_indices = sorted(neighbor_distances_and_indices)\n", 32 | " \n", 33 | " k_nearest_distances_and_indices = sorted_neighbor_distances_and_indices[:k]\n", 34 | " \n", 35 | " k_nearest_labels = [data[i][1] for distance, i in k_nearest_distances_and_indices]\n", 36 | "\n", 37 | " return k_nearest_distances_and_indices , choice_fn(k_nearest_labels)\n", 38 | "\n", 39 | "def mean(labels):\n", 40 | " return sum(labels) / len(labels)\n", 41 | "\n", 42 | "def mode(labels):\n", 43 | " return Counter(labels).most_common(1)[0][0]\n", 44 | "\n", 45 | "def euclidean_distance(point1, point2):\n", 46 | " sum_squared_distance = 0\n", 47 | " for i in range(len(point1)):\n", 48 | " 
sum_squared_distance += math.pow(point1[i] - point2[i], 2)\n", 49 | " return math.sqrt(sum_squared_distance)\n", 50 | "def recommend_books(book_query, k_recommendations):\n", 51 | " raw_books_data = []\n", 52 | " with open('books_recommendation_data.csv', 'r') as md:\n", 53 | " \n", 54 | " next(md)\n", 55 | "\n", 56 | " \n", 57 | " for line in md.readlines():\n", 58 | " data_row = line.strip().split(',')\n", 59 | " raw_books_data.append(data_row)\n", 60 | "\n", 61 | " \n", 62 | " books_recommendation_data = []\n", 63 | " for row in raw_books_data:\n", 64 | " data_row = list(map(float, row[2:]))\n", 65 | " books_recommendation_data.append(data_row)\n", 66 | "\n", 67 | "\n", 68 | " recommendation_indices, _ = knn(\n", 69 | " books_recommendation_data, book_query, k=k_recommendations, choice_fn=lambda x: None\n", 70 | " )\n", 71 | "\n", 72 | " book_recommendations = []\n", 73 | " for _, index in recommendation_indices:\n", 74 | " book_recommendations.append(raw_books_data[index])\n", 75 | "\n", 76 | " return book_recommendations\n", 77 | "\n", 78 | "if __name__ == '__main__':\n", 79 | " the_post = [6.5, 0, 1, 0, 1, 0, 0, 0, 0] # feature vector for The Post\n", 80 | " # Rating,Biography,Drama,Thriller,Comedy,Crime,Mystery,History,Label\n", 81 | " recommended_books = recommend_books(book_query=the_post, k_recommendations=3)\n", 82 | "\n", 83 | " # Print recommended book titles\n", 84 | " for recommendation in recommended_books:\n", 85 | " print(recommendation[1])\n" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [] 94 | } 95 | ], 96 | "metadata": { 97 | "kernelspec": { 98 | "display_name": "Python 3", 99 | "language": "python", 100 | "name": "python3" 101 | }, 102 | "language_info": { 103 | "codemirror_mode": { 104 | "name": "ipython", 105 | "version": 3 106 | }, 107 | "file_extension": ".py", 108 | "mimetype": "text/x-python", 109 | "name": "python", 110 | "nbconvert_exporter": 
"python", 111 | "pygments_lexer": "ipython3", 112 | "version": "3.7.3" 113 | } 114 | }, 115 | "nbformat": 4, 116 | "nbformat_minor": 2 117 | } 118 | -------------------------------------------------------------------------------- /figures/MM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/figures/MM.png -------------------------------------------------------------------------------- /figures/bayes-theorem.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/figures/bayes-theorem.png -------------------------------------------------------------------------------- /figures/decision_tree.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/figures/decision_tree.png -------------------------------------------------------------------------------- /figures/decision_tree_predictions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/figures/decision_tree_predictions.png -------------------------------------------------------------------------------- /figures/image_preprocessing.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/figures/image_preprocessing.png -------------------------------------------------------------------------------- /figures/linear_regression.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/figures/linear_regression.jpg -------------------------------------------------------------------------------- /figures/logistic_regression.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/figures/logistic_regression.jpg -------------------------------------------------------------------------------- /figures/neural_net.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/figures/neural_net.png -------------------------------------------------------------------------------- /figures/perceptron_hyperplane.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/figures/perceptron_hyperplane.png -------------------------------------------------------------------------------- /figures/preprocessing.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/figures/preprocessing.png -------------------------------------------------------------------------------- /figures/regression_tree.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/figures/regression_tree.png 
-------------------------------------------------------------------------------- /figures/softmax_regression.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/figures/softmax_regression.jpg -------------------------------------------------------------------------------- /gaussian_naive_bayes.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | class GaussianNB: 4 | 5 | def fit(self, X, y, epsilon = 1e-10): 6 | self.y_classes, y_counts = np.unique(y, return_counts=True) 7 | self.x_classes = [np.unique(x) for x in X.T] # plain list: columns may have different numbers of unique values 8 | self.phi_y = 1.0 * y_counts/y_counts.sum() 9 | self.u = np.array([X[y==k].mean(axis=0) for k in self.y_classes]) 10 | self.var_x = np.array([X[y==k].var(axis=0) + epsilon for k in self.y_classes]) 11 | return self 12 | 13 | def predict(self, X): 14 | return np.apply_along_axis(lambda x: self.compute_probs(x), 1, X) 15 | 16 | def compute_probs(self, x): 17 | probs = np.array([self.compute_prob(x, y) for y in range(len(self.y_classes))]) 18 | return self.y_classes[np.argmax(probs)] 19 | 20 | def compute_prob(self, x, y): 21 | c = 1.0 /np.sqrt(2.0 * np.pi * (self.var_x[y])) 22 | return self.phi_y[y] * np.prod(c * np.exp(-1.0 * np.square(x - self.u[y]) / (2.0 * self.var_x[y]))) # class prior p(y) times the per-feature Gaussian likelihoods 23 | 24 | def evaluate(self, X, y): 25 | return (self.predict(X) == y).mean() -------------------------------------------------------------------------------- /gda.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | class GDABinaryClassifier: 4 | 5 | def fit(self, X, y): 6 | self.fi = y.mean() 7 | self.u = np.array([ X[y==k].mean(axis=0) for k in [0,1]]) 8 | X_u = X.copy() 9 | for k in [0,1]: X_u[y==k] -= self.u[k] 10 | self.E = X_u.T.dot(X_u) / len(y) 11 | self.invE = np.linalg.pinv(self.E) 12 | return 
self 13 | 14 | def predict(self, X): 15 | return np.argmax([self.compute_prob(X, i) for i in range(len(self.u))], axis=0) 16 | 17 | def compute_prob(self, X, i): 18 | u, phi = self.u[i], ((self.fi)**i * (1 - self.fi)**(1 - i)) 19 | return np.exp(-0.5 * np.sum((X-u).dot(self.invE)*(X-u), axis=1)) * phi 20 | 21 | def score(self, X, y): 22 | return (self.predict(X) == y).mean() -------------------------------------------------------------------------------- /kmeans.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | class KMeans: 3 | 4 | def __init__(self, n_clusters=4): 5 | self.K = n_clusters 6 | 7 | def fit(self, X): 8 | self.centroids = X[np.random.choice(len(X), self.K, replace=False)] 9 | self.intial_centroids = self.centroids 10 | self.prev_label, self.labels = None, np.zeros(len(X)) 11 | while not np.all(self.labels == self.prev_label) : 12 | self.prev_label = self.labels 13 | self.labels = self.predict(X) 14 | self.update_centroid(X) 15 | return self 16 | 17 | def predict(self, X): 18 | return np.apply_along_axis(self.compute_label, 1, X) 19 | 20 | def compute_label(self, x): 21 | return np.argmin(np.sqrt(np.sum((self.centroids - x)**2, axis=1))) 22 | 23 | def update_centroid(self, X): 24 | self.centroids = np.array([np.mean(X[self.labels == k], axis=0) for k in range(self.K)]) -------------------------------------------------------------------------------- /naive_bayes.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | class NaiveBayesBinaryClassifier: 4 | 5 | def fit(self, X, y): 6 | self.y_classes, y_counts = np.unique(y, return_counts=True) 7 | self.phi_y = 1.0 * y_counts/y_counts.sum() 8 | self.phi_x = [1.0 * X[y==k].mean(axis=0) for k in self.y_classes] 9 | return self 10 | 11 | def predict(self, X): 12 | return np.apply_along_axis(lambda x: self.compute_probs(x), 1, X) 13 | 14 | def compute_probs(self, x): 15 | probs = 
[self.compute_prob(x, y) for y in range(len(self.y_classes))] 16 | return self.y_classes[np.argmax(probs)] 17 | 18 | def compute_prob(self, x, y): 19 | res = 1 20 | for j in range(len(x)): 21 | Pxy = self.phi_x[y][j] # p(xj=1|y) 22 | res *= (Pxy**x[j])*((1-Pxy)**(1-x[j])) # p(xj=0|y) 23 | return res * self.phi_y[y] 24 | 25 | def evaluate(self, X, y): 26 | return (self.predict(X) == y).mean() -------------------------------------------------------------------------------- /spam filter.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | " MAIL fil\n", 13 | "0 send send us send us us your password spam\n", 14 | "1 send us your review ham\n", 15 | "2 review your password ham\n", 16 | "3 review us spam\n", 17 | "4 send us your password spam\n", 18 | "5 send us your account spam\n", 19 | "{'send': 6, 'us': 7, 'your': 5, 'password': 3, 'review': 3, 'account': 1}\n", 20 | "{'spam': 4, 'ham': 2}\n", 21 | "Total words: 25\n", 22 | "{'spam': {'send': 0.75, 'us': 1.0, 'your': 0.75, 'password': 0.5, 'review': 0.25, 'account': 0.25}, 'ham': {'send': 0.5, 'us': 0.5, 'your': 1.0, 'review': 1.0, 'password': 0.5}}\n", 23 | "{'spam': 0.6666666666666666, 'ham': 0.3333333333333333}\n", 24 | "0.75\n", 25 | "0.3333333333333333\n" 26 | ] 27 | } 28 | ], 29 | "source": [ 30 | "import pandas as pd\n", 31 | "\n", 32 | "data = pd.read_csv(\"spam.csv\")\n", 33 | "print(data)\n", 34 | "\n", 35 | "bow = {}\n", 36 | "mails = data[\"MAIL\"]\n", 37 | "tot_word_count = 0\n", 38 | "\n", 39 | "for mail in mails:\n", 40 | " mail = mail.split()\n", 41 | " for word in mail:\n", 42 | " if word in bow:\n", 43 | " bow[word] += 1\n", 44 | " tot_word_count += 1\n", 45 | " else:\n", 46 | " bow[word] = 1\n", 47 | " tot_word_count += 1\n", 48 | " \n", 49 | "print(bow)\n", 50 | "\n", 51 | 
"fil = {'spam': 0, 'ham': 0}\n", 52 | "for fils in data[\"fil\"]:\n", 53 | " fil[fils] += 1\n", 54 | "print(fil)\n", 55 | "print(\"Total words: \", tot_word_count)\n", 56 | "\n", 57 | "prob = {'spam': {}, 'ham': {}}\n", 58 | "for index, row in data.iterrows():\n", 59 | " mail = row['MAIL']\n", 60 | " res = row['fil']\n", 61 | " done = []\n", 62 | " for word in mail.split():\n", 63 | " if word not in done:\n", 64 | " done.append(word)\n", 65 | " if word not in prob[res]:\n", 66 | " prob[res][word] = 1\n", 67 | " else:\n", 68 | " prob[res][word] += 1\n", 69 | "target_prob = {'spam': fil['spam']/len(data), 'ham': fil['ham']/len(data)}\n", 70 | "\n", 71 | "for res in prob:\n", 72 | " for word in prob[res]:\n", 73 | " prob[res][word] = prob[res][word] / fil[res]\n", 74 | "print(prob)\n", 75 | "print(target_prob)\n", 76 | "\n", 77 | "word = \"send\"\n", 78 | "\n", 79 | "p_ws = prob['spam'].get(word, 0)\n", 80 | "p_s = target_prob['spam']\n", 81 | "p_wh = prob['ham'].get(word, 0)\n", 82 | "p_h = target_prob['ham']\n", 83 | "\n", 84 | "p_sw = p_ws * p_s / (p_ws * p_s + p_wh * p_h)\n", 85 | "print(p_sw)\n", 86 | "\n", 87 | "word1 = \"review\"\n", 88 | "word2 = \"password\"\n", 89 | "\n", 90 | "p_w1s = prob['spam'].get(word1, 0)\n", 91 | "p_s = target_prob['spam']\n", 92 | "p_w1h = prob['ham'].get(word1, 0)\n", 93 | "p_h = target_prob['ham']\n", 94 | "\n", 95 | "p_w2s = prob['spam'].get(word2, 0)\n", 96 | "p_w2h = prob['ham'].get(word2, 0)\n", 97 | "\n", 98 | "p_sww = p_w1s * p_w2s * p_s / (p_w1s * p_w2s * p_s + p_w1h * p_w2h * p_h)\n", 99 | "print(p_sww)\n" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [] 108 | } 109 | ], 110 | "metadata": { 111 | "kernelspec": { 112 | "display_name": "Python 3", 113 | "language": "python", 114 | "name": "python3" 115 | }, 116 | "language_info": { 117 | "codemirror_mode": { 118 | "name": "ipython", 119 | "version": 3 120 | }, 121 | 
"file_extension": ".py", 122 | "mimetype": "text/x-python", 123 | "name": "python", 124 | "nbconvert_exporter": "python", 125 | "pygments_lexer": "ipython3", 126 | "version": "3.7.3" 127 | } 128 | }, 129 | "nbformat": 4, 130 | "nbformat_minor": 2 131 | } 132 | -------------------------------------------------------------------------------- /svm.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 15, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "# Muskan Pandey" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 10, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import matplotlib.pyplot as plt\n", 19 | "from matplotlib import style\n", 20 | "import numpy as np\n", 21 | "import pandas as pd\n", 22 | "style.use('ggplot')" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 11, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "class SVM(object):\n", 32 | " def __init__(self,visualization=True):\n", 33 | " self.visualization = visualization\n", 34 | " self.colors = {1:'r',-1:'b'}\n", 35 | " if self.visualization:\n", 36 | " self.fig = plt.figure()\n", 37 | " self.ax = self.fig.add_subplot(1,1,1)\n", 38 | " \n", 39 | " def fit(self,data):\n", 40 | " #train with data\n", 41 | " self.data = data\n", 42 | " # { |\\w\\|:{w,b}}\n", 43 | " opt_dict = {}\n", 44 | " \n", 45 | " transforms = [[1,1],[-1,1],[-1,-1],[1,-1]]\n", 46 | " \n", 47 | " all_data = np.array([])\n", 48 | " for yi in self.data:\n", 49 | " all_data = np.append(all_data,self.data[yi])\n", 50 | " \n", 51 | " self.max_feature_value = max(all_data) \n", 52 | " self.min_feature_value = min(all_data)\n", 53 | " all_data = None\n", 54 | " \n", 55 | " #with smaller steps our margins and db will be more precise\n", 56 | " step_sizes = [self.max_feature_value * 0.1,\n", 57 | " self.max_feature_value * 0.01,\n", 58 | " #point 
of expense\n", 59 | " self.max_feature_value * 0.001,]\n", 60 | " \n", 61 | " #extremely expensive\n", 62 | " b_range_multiple = 5\n", 63 | " #we don't need to take steps as small for b as for w\n", 64 | " b_multiple = 5\n", 65 | " \n", 66 | " latest_optimum = self.max_feature_value*10\n", 67 | " \n", 68 | " \"\"\"\n", 69 | " objective is to satisfy yi(xi.w+b)>=1 for every training point such that ||w|| is minimum\n", 70 | " for this we start with a large w, shrink it step by step, and scan a range of b values for each w\n", 71 | " \"\"\"\n", 72 | " #making step smaller and smaller to get precise value\n", 73 | " for step in step_sizes:\n", 74 | " w = np.array([latest_optimum,latest_optimum])\n", 75 | " \n", 76 | " #we can do this because the problem is convex\n", 77 | " optimized = False\n", 78 | " while not optimized:\n", 79 | " for b in np.arange(-1*self.max_feature_value*b_range_multiple,\n", 80 | " self.max_feature_value*b_range_multiple,\n", 81 | " step*b_multiple):\n", 82 | " for transformation in transforms:\n", 83 | " w_t = w*transformation\n", 84 | " found_option = True\n", 85 | " \n", 86 | " #weakest link in SVM fundamentally\n", 87 | " #SMO attempts to fix this a bit\n", 88 | " # yi(xi.w+b) >=1\n", 89 | " for i in self.data:\n", 90 | " for xi in self.data[i]:\n", 91 | " yi=i\n", 92 | " if not yi*(np.dot(w_t,xi)+b)>=1:\n", 93 | " found_option=False\n", 94 | " if found_option:\n", 95 | " \"\"\"\n", 96 | " all points in dataset satisfy y(w.x+b)>=1 for this current w_t, b\n", 97 | " then put w,b in dict with ||w|| as key\n", 98 | " \"\"\"\n", 99 | " opt_dict[np.linalg.norm(w_t)]=[w_t,b]\n", 100 | " \n", 101 | " #once w[0] or w[1]<0 the values of w start repeating because of the transformations\n", 102 | " #Think about it, it is easy\n", 103 | " #print(w,len(opt_dict)) Try printing to understand\n", 104 | " if w[0]<0:\n", 105 | " optimized=True\n", 106 | " print(\"optimized a step\")\n", 107 | " else:\n", 108 | " w = w-step\n", 109 | " \n", 110 | " # sorting ||w|| to put the smallest 
||w|| at position 0 \n", 111 | " norms = sorted([n for n in opt_dict])\n", 112 | " #optimal values of w,b\n", 113 | " opt_choice = opt_dict[norms[0]]\n", 114 | "\n", 115 | " self.w=opt_choice[0]\n", 116 | " self.b=opt_choice[1]\n", 117 | " \n", 118 | " #start with new latest_optimum (initial values for w)\n", 119 | " latest_optimum = opt_choice[0][0]+step*2\n", 120 | " \n", 121 | " def predict(self,features):\n", 122 | " #sign(x.w+b)\n", 123 | " classification = np.sign(np.dot(np.array(features),self.w)+self.b)\n", 124 | " if classification!=0 and self.visualization:\n", 125 | " self.ax.scatter(features[0],features[1],s=200,marker='*',c=self.colors[classification])\n", 126 | " return (classification,np.dot(np.array(features),self.w)+self.b)\n", 127 | " \n", 128 | " def visualize(self):\n", 129 | " #plot the training data stored by fit() instead of relying on a global data_dict\n", 130 | " [[self.ax.scatter(x[0],x[1],s=100,c=self.colors[i]) for x in self.data[i]] for i in self.data]\n", 130 | " \n", 131 | " # hyperplane = x.w+b (actually it's a line)\n", 132 | " # v = x0.w0+x1.w1+b -> x1 = (v-w[0].x[0]-b)/w1\n", 133 | " #psv = 1 psv line -> x.w+b = 1\n", 134 | " #nsv = -1 nsv line -> x.w+b = -1\n", 135 | " # dec = 0 db line -> x.w+b = 0\n", 136 | " def hyperplane(x,w,b,v):\n", 137 | " #returns the x2 value on the line for a given x1\n", 138 | " return (-w[0]*x-b+v)/w[1]\n", 139 | " \n", 140 | " hyp_x_min= self.min_feature_value*0.9\n", 141 | " hyp_x_max = self.max_feature_value*1.1\n", 142 | " \n", 143 | " # (w.x+b)=1\n", 144 | " # positive support vector hyperplane\n", 145 | " pav1 = hyperplane(hyp_x_min,self.w,self.b,1)\n", 146 | " pav2 = hyperplane(hyp_x_max,self.w,self.b,1)\n", 147 | " self.ax.plot([hyp_x_min,hyp_x_max],[pav1,pav2],'k')\n", 148 | " \n", 149 | " # (w.x+b)=-1\n", 150 | " # negative support vector hyperplane\n", 151 | " nav1 = hyperplane(hyp_x_min,self.w,self.b,-1)\n", 152 | " nav2 = hyperplane(hyp_x_max,self.w,self.b,-1)\n", 153 | " self.ax.plot([hyp_x_min,hyp_x_max],[nav1,nav2],'k')\n", 154 | " \n", 155 
| " # (w.x+b)=0\n", 156 | " # db support vector hyperplane\n", 157 | " db1 = hyperplane(hyp_x_min,self.w,self.b,0)\n", 158 | " db2 = hyperplane(hyp_x_max,self.w,self.b,0)\n", 159 | " self.ax.plot([hyp_x_min,hyp_x_max],[db1,db2],'y--')" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 12, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "#defining a basic data\n", 169 | "data_dict = {-1:np.array([[1,7],[2,8],[3,8]]),1:np.array([[5,1],[6,-1],[7,3]])}" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 13, 175 | "metadata": {}, 176 | "outputs": [ 177 | { 178 | "name": "stdout", 179 | "output_type": "stream", 180 | "text": [ 181 | "optimized a step\n", 182 | "optimized a step\n", 183 | "optimized a step\n" 184 | ] 185 | }, 186 | { 187 | "data": { 188 | "image/png": "\n", 189 | "text/plain": [ 190 | "
" 191 | ] 192 | }, 193 | "metadata": {}, 194 | "output_type": "display_data" 195 | } 196 | ], 197 | "source": [ 198 | "\n", 199 | "svm = SVM() # Linear Kernel\n", 200 | "svm.fit(data=data_dict)\n", 201 | "svm.visualize()" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 14, 207 | "metadata": {}, 208 | "outputs": [ 209 | { 210 | "data": { 211 | "text/plain": [ 212 | "(-1.0, -1.000000000000098)" 213 | ] 214 | }, 215 | "execution_count": 14, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "svm.predict([3,8])" 222 | ] 223 | } 224 | ], 225 | "metadata": { 226 | "kernelspec": { 227 | "display_name": "Python 3", 228 | "language": "python", 229 | "name": "python3" 230 | }, 231 | "language_info": { 232 | "codemirror_mode": { 233 | "name": "ipython", 234 | "version": 3 235 | }, 236 | "file_extension": ".py", 237 | "mimetype": "text/x-python", 238 | "name": "python", 239 | "nbconvert_exporter": "python", 240 | "pygments_lexer": "ipython3", 241 | "version": "3.7.3" 242 | } 243 | }, 244 | "nbformat": 4, 245 | "nbformat_minor": 2 246 | } 247 | -------------------------------------------------------------------------------- /train_test_split.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/train_test_split.pyc -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | import random 2 | import math 3 | 4 | def Kfold(X, folds=5, shuffle=False): 5 | X = list(range(X)) if isinstance(X, int) else list(range(len(X))) 6 | if shuffle: random.shuffle(X) 7 | i = 0 8 | fold_size = int(math.ceil(1.0 * len(X)/folds)) 9 | while i < folds: 10 | yield X[0:fold_size*i] + X[fold_size*(i + 1):], X[fold_size*i:fold_size*(i + 1)] 11 | i += 1 12 | 13 | 
def train_test_split(X, y, test_size=0.2, shuffle=True):
14 |     # X and y are expected to be NumPy arrays (indexed below with index lists)
15 |     idx = list(range(len(y)))  # list() so random.shuffle can reorder it in place on Python 3
16 |     if shuffle: random.shuffle(idx)
17 |     test_idx = int(test_size * len(y))
18 |     return X[idx[test_idx:]], X[idx[:test_idx]], y[idx[test_idx:]], y[idx[:test_idx]]
19 |
20 | def accuracy_score(pre, y):
21 |     # fraction of matching labels; the squared difference only counts mismatches correctly for 0/1 labels
22 |     return 1 - sum(1.0 * (pre - y)**2)/len(y)
--------------------------------------------------------------------------------
/utils.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bamtak/machine-learning-implemetation-python/1d36b0d152a23c2f1cf7186e4f8a208b9c0768fc/utils.pyc
--------------------------------------------------------------------------------
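The index logic used by the helpers in utils.py can be tried without NumPy. The sketch below re-declares `Kfold` as it appears in utils.py so that it runs on its own. `split_indices` is a hypothetical helper, not part of the repository: it mirrors the shuffling and cut-off arithmetic of `train_test_split` but returns plain index lists, and its `seed` parameter is an assumption added here for reproducibility.

```python
import math
import random


def Kfold(X, folds=5, shuffle=False):
    """Yield (train_idx, test_idx) index lists, one pair per fold (as in utils.py)."""
    X = list(range(X)) if isinstance(X, int) else list(range(len(X)))
    if shuffle:
        random.shuffle(X)
    fold_size = int(math.ceil(1.0 * len(X) / folds))
    for i in range(folds):
        test = X[fold_size * i:fold_size * (i + 1)]
        train = X[0:fold_size * i] + X[fold_size * (i + 1):]
        yield train, test


def split_indices(n, test_size=0.2, shuffle=True, seed=None):
    """Hypothetical helper: same index arithmetic as train_test_split,
    but returns (train_idx, test_idx) instead of slicing arrays."""
    idx = list(range(n))  # a list, so it can be shuffled in place on Python 3
    if shuffle:
        random.Random(seed).shuffle(idx)
    cut = int(test_size * n)  # the first `cut` shuffled indices form the test set
    return idx[cut:], idx[:cut]


folds = list(Kfold(10, folds=5))
train_idx, test_idx = split_indices(10, test_size=0.2, seed=0)

for train, test in folds:
    print("test fold:", test, "train size:", len(train))
print("hold-out split sizes:", len(train_idx), len(test_idx))
```

With NumPy arrays, the returned index lists can be used directly for indexing, e.g. `X[train_idx]` and `y[test_idx]`, which is what `train_test_split` does internally.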