├── .gitattributes ├── KNN.py ├── README.md └── some-algorithms-from-scratch.ipynb /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /KNN.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Algorithms_from_scratch! 2 | Algorithms_from_scratch! 3 | 4 | 5 | ## Author 6 | 7 | - Tushar Aggarwal 8 | - LinkedIn: https://www.linkedin.com/in/tusharaggarwalinseec/ 9 | - Website: https://www.tushar-aggarwal.com/ 10 | - Kaggle: https://www.kaggle.com/tusharaggarwal27 11 | 12 | 13 | 14 | ## Feedback 15 | 16 | If you have any feedback, please reach out to me at tushar.inseec@gmail.com or info@tushar-aggarwal.com 17 | -------------------------------------------------------------------------------- /some-algorithms-from-scratch.ipynb: -------------------------------------------------------------------------------- 1 | {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat_minor":4,"nbformat":4,"cells":[{"source":"\"Kaggle\"","metadata":{},"cell_type":"markdown"},{"cell_type":"markdown","source":"
\n

\n 🦿📜🪛🤖Some algorithms from scratch!📜🪛🤖🦿\n \n

\n \n \n
\n\n

I brewed this notebook from scratch. If it helped, please consider upvoting, and cite me if you share it. Thank you!

\n

\n    Let's connect on LinkedIn!\n    \n

\n

\nFollow me on GitHub too!

\n

\n    Also check out my Medium posts!\n    \n

","metadata":{}},{"cell_type":"markdown","source":"
\n    In this notebook, I am showing common ML algorithms written from scratch in Python, with some explanations.\n

Please note: I am not writing out every complex algorithm (for example, the full GBM family), as it is much more efficient to use the respective libraries, and the complete code would be too long and complex for this notebook. \n \n
","metadata":{}},{"cell_type":"markdown","source":"

\n 🤖Linear Regression\n

\n

\n 1. Desc - A simple algorithm that models a linear relationship between inputs and a continuous numerical output variable\n
2. Use cases - Stock price prediction, Predicting housing price, Predicting customer lifetime value\n
3. Pros - Explainable method, Interpretable results via its output coefficients, Faster to train than most other machine learning models\n
4. Cons - Assumes linearity between inputs and output, Sensitive to outliers, Can underfit with small, high-dimensional data\n\n

\n","metadata":{}},{"cell_type":"code","source":"import numpy as np\n\nclass LinearRegression:\n def __init__(self):\n self.w = None\n\n def fit(self, X, y):\n X = np.hstack((np.ones((X.shape[0], 1)), X))\n self.w = np.linalg.inv(X.T @ X) @ X.T @ y\n\n def predict(self, X):\n X = np.hstack((np.ones((X.shape[0], 1)), X))\n return X @ self.w\n","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:10.361866Z","iopub.execute_input":"2023-01-27T21:49:10.362342Z","iopub.status.idle":"2023-01-27T21:49:10.398517Z","shell.execute_reply.started":"2023-01-27T21:49:10.362237Z","shell.execute_reply":"2023-01-27T21:49:10.39745Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

\n In this example, the LinearRegression class has a fit method that takes in the feature matrix X and the target vector y, and uses the normal equation to calculate the weight vector w that minimizes the mean squared error. The predict method takes in a feature matrix X and returns the predicted target values using the calculated weight vector. \n
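For reference, with a column of ones prepended for the intercept, the fit method above solves the normal equation

$$ w = (X^\top X)^{-1} X^\top y $$

which is the closed-form minimizer of the mean squared error.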

","metadata":{}},{"cell_type":"code","source":"X = np.array([[1], [2], [3], [4], [5]])\ny = np.array([5, 7, 9, 11, 13])\nreg = LinearRegression()\nreg.fit(X, y)\nprint(reg.predict(X))\n\n#This code is a basic implementation and doesn't include any regularization or handling of edge cases such as singular matrix, \n#it's also not optimized for large dataset.\n","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:10.400537Z","iopub.execute_input":"2023-01-27T21:49:10.401615Z","iopub.status.idle":"2023-01-27T21:49:10.418833Z","shell.execute_reply.started":"2023-01-27T21:49:10.401576Z","shell.execute_reply":"2023-01-27T21:49:10.41726Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

\n 🤖Logistic Regression\n

\n

\n 1. Desc - A simple algorithm that models a linear relationship between inputs and a categorical output (1 or 0)\n
2. Use cases - Credit risk score prediction, Customer churn prediction\n
3. Pros - Interpretable and explainable, Less prone to overfitting when using regularization, Applicable for multi-class predictions\n
4. Cons - Assumes linearity between inputs and output, Can overfit with small and high-dimensional data \n\n

\n","metadata":{}},{"cell_type":"code","source":"import numpy as np\n\nclass LogisticRegression:\n def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True):\n self.lr = lr\n self.num_iter = num_iter\n self.fit_intercept = fit_intercept\n\n def __add_intercept(self, X):\n intercept = np.ones((X.shape[0], 1))\n return np.concatenate((intercept, X), axis=1)\n\n def sigmoid(self, z):\n return 1 / (1 + np.exp(-z))\n\n def fit(self, X, y):\n if self.fit_intercept:\n X = self.__add_intercept(X)\n\n self.theta = np.zeros(X.shape[1])\n\n for i in range(self.num_iter):\n z = np.dot(X, self.theta)\n h = self.sigmoid(z)\n gradient = np.dot(X.T, (h - y)) / y.size\n self.theta -= self.lr * gradient\n\n def predict_prob(self, X):\n if self.fit_intercept:\n X = self.__add_intercept(X)\n\n return self.sigmoid(np.dot(X, self.theta))\n\n def predict(self, X, threshold=0.5):\n return self.predict_prob(X) >= threshold\n","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:10.421683Z","iopub.execute_input":"2023-01-27T21:49:10.422804Z","iopub.status.idle":"2023-01-27T21:49:10.443848Z","shell.execute_reply.started":"2023-01-27T21:49:10.422737Z","shell.execute_reply":"2023-01-27T21:49:10.442806Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"\n

In this example, the LogisticRegression class has a fit method that takes in the feature matrix X and the target vector y, and uses gradient descent to calculate the weight vector theta that maximizes the likelihood of the data. The predict_prob method takes in a feature matrix X and returns the predicted probability of each sample belonging to class 1. The predict method takes in a feature matrix X and a threshold value, and returns the predicted class label for each sample, based on whether the predicted probability exceeds the threshold.\n
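For reference, each iteration of fit performs one gradient-descent step on the log-loss, using the sigmoid $\sigma(z) = 1 / (1 + e^{-z})$:

$$ \theta \leftarrow \theta - \text{lr} \cdot \frac{1}{m} X^\top\big(\sigma(X\theta) - y\big) $$

where m is the number of training samples.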
Here's an example of how you could use this class:\n

","metadata":{}},{"cell_type":"code","source":"X = np.array([[1, 2], [3, 4], [5, 6]])\ny = np.array([0, 0, 1])\nclf = LogisticRegression()\nclf.fit(X, y)\nprint(clf.predict(X))\n#This code is a basic implementation and doesn't include any regularization or handling of edge cases such as numerical stability and overfitting, \n#it's also not optimized for large dataset.","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:10.446982Z","iopub.execute_input":"2023-01-27T21:49:10.448558Z","iopub.status.idle":"2023-01-27T21:49:11.766145Z","shell.execute_reply.started":"2023-01-27T21:49:10.448488Z","shell.execute_reply":"2023-01-27T21:49:11.765066Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

\n 🤖KNN\n

\n

\n    1. Desc - The K-nearest neighbors (KNN) algorithm uses feature similarity to predict the value of a new data point: the new point is assigned a value based on how closely it matches the points in the training set.\n
2. Use cases - Handwriting detection, Image recognition, and Video recognition\n
3. Pros - Intuitive and simple, Makes no distributional assumptions, No training step, Adapts as new data is added, Easy to apply to multi-class problems\n
4. Cons - Slow at prediction time, Needs homogeneous (comparably scaled) features, Requires choosing the number of neighbors, Struggles with imbalanced data, Sensitive to outliers\n\n

\n","metadata":{}},{"cell_type":"code","source":"import numpy as np\n\nclass KNN:\n def __init__(self, k=5):\n self.k = k\n\n def fit(self, X, y):\n self.X_train = X\n self.y_train = y\n\n def predict(self, X):\n y_pred = np.empty(X.shape[0])\n for i, x in enumerate(X):\n distances = np.linalg.norm(self.X_train - x, axis=1)\n k_nearest_indices = distances.argsort()[:self.k]\n k_nearest_labels = self.y_train[k_nearest_indices]\n y_pred[i] = np.bincount(k_nearest_labels).argmax()\n return y_pred\n","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:11.76747Z","iopub.execute_input":"2023-01-27T21:49:11.767999Z","iopub.status.idle":"2023-01-27T21:49:11.776687Z","shell.execute_reply.started":"2023-01-27T21:49:11.767965Z","shell.execute_reply":"2023-01-27T21:49:11.774955Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

In this example, the KNN class has a fit method that takes in the feature matrix X and the target vector y, and stores them as the training data. The predict method takes in a feature matrix X and returns the predicted class labels for each sample. It does this by computing the Euclidean distance between each sample in X and each sample in the training data, finding the k-nearest neighbors to each sample in X, and then classifying the sample based on the majority class of the k-nearest neighbors.\n
\nHere's an example of how you could use this class:

","metadata":{}},{"cell_type":"code","source":"X_train = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])\ny_train = np.array([0, 0, 1, 1, 1])\nX_test = np.array([[2, 3], [4, 5], [6, 7]])\n\nknn = KNN(k=3)\nknn.fit(X_train, y_train)\ny_pred = knn.predict(X_test)\nprint(y_pred)\n#This is a basic implementation of KNN, it doesn't consider the distance metric other than euclidean distance and also doesn't consider the case where the number of classes are more than two.\n#It's also not optimized for large dataset.","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:11.778573Z","iopub.execute_input":"2023-01-27T21:49:11.779132Z","iopub.status.idle":"2023-01-27T21:49:11.797933Z","shell.execute_reply.started":"2023-01-27T21:49:11.779079Z","shell.execute_reply":"2023-01-27T21:49:11.796372Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

\n 🤖KMeans\n

\n

\n    1. Desc - K-Means is the most widely used clustering approach; it determines K clusters based on Euclidean distances\n
2. Use cases - Customer segmentation, Recommendation systems\n
3. Pros - Scales to large datasets, Simple to implement and interpret, Results in tight clusters\n
4. Cons - Requires the expected number of clusters to be specified up front, Has trouble with varying cluster sizes and densities\n\n

\n","metadata":{}},{"cell_type":"code","source":"import numpy as np\n\nclass KMeans:\n def __init__(self, k=2, max_iter=100):\n self.k = k\n self.max_iter = max_iter\n\n def fit(self, X):\n # randomly initialize the centroids\n self.centroids = X[np.random.choice(X.shape[0], self.k, replace=False)]\n for _ in range(self.max_iter):\n # calculate the distances to each centroid\n distances = np.array([np.linalg.norm(X - centroid, axis=1) for centroid in self.centroids])\n # assign each point to the closest centroid\n self.labels = np.argmin(distances, axis=0)\n # update the centroids\n for i in range(self.k):\n points = X[self.labels == i]\n self.centroids[i] = points.mean(axis=0)\n\n def predict(self, X):\n distances = np.array([np.linalg.norm(X - centroid, axis=1) for centroid in self.centroids])\n return np.argmin(distances, axis=0)\n","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:11.799698Z","iopub.execute_input":"2023-01-27T21:49:11.800069Z","iopub.status.idle":"2023-01-27T21:49:11.815582Z","shell.execute_reply.started":"2023-01-27T21:49:11.800035Z","shell.execute_reply":"2023-01-27T21:49:11.814082Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

In this example, the KMeans class takes the number of clusters k and a maximum number of iterations max_iter as arguments. The fit method takes in a data set X and assigns each sample to the closest centroid. The centroids are then updated by taking the mean of all samples assigned to that centroid. This process repeats for max_iter iterations (this simple version does not check for convergence).\n
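For reference, each iteration of fit alternates the two standard K-Means steps: assign every point $x_i$ to its nearest centroid, then move each centroid to the mean of its assigned points $S_j$:

$$ \text{label}_i = \arg\min_j \lVert x_i - \mu_j \rVert_2, \qquad \mu_j = \frac{1}{|S_j|}\sum_{x_i \in S_j} x_i $$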
Here is an example of how to use the class\n

\n","metadata":{}},{"cell_type":"code","source":"X = np.array([[1, 2], [1, 4], [1, 0],\n [4, 2], [4, 4], [4, 0]])\nkmeans = KMeans(k=2, max_iter=100)\nkmeans.fit(X)\n#The class will return the centroids of the cluster after it has been fit to the data","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:11.817841Z","iopub.execute_input":"2023-01-27T21:49:11.818734Z","iopub.status.idle":"2023-01-27T21:49:11.847795Z","shell.execute_reply.started":"2023-01-27T21:49:11.818691Z","shell.execute_reply":"2023-01-27T21:49:11.846286Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

\n 🤖Ridge Regression\n

\n

\n 1. Desc - Part of the regression family — it penalizes features that have low predictive outcomes by shrinking their coefficients closer to zero, can be used for classification or regression\n
2. Use cases - Predictive maintenance for automobile, Sales revenue prediction\n
3. Pros - Less prone to overfitting, Well suited when the data suffer from multicollinearity, Explainable and interpretable\n
4. Cons - All the predictors are kept in the final model, Doesn't perform feature selection\n\n

\n","metadata":{}},{"cell_type":"code","source":"import numpy as np\n\nclass RidgeRegression:\n def __init__(self, alpha=1.0):\n self.alpha = alpha\n\n def fit(self, X, y):\n n_samples, n_features = X.shape\n self.w = np.linalg.inv(X.T @ X + self.alpha * np.eye(n_features)) @ X.T @ y\n\n def predict(self, X):\n return X @ self.w\n","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:11.849021Z","iopub.execute_input":"2023-01-27T21:49:11.850353Z","iopub.status.idle":"2023-01-27T21:49:11.863456Z","shell.execute_reply.started":"2023-01-27T21:49:11.850314Z","shell.execute_reply":"2023-01-27T21:49:11.862347Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

In this example, the RidgeRegression class takes an alpha value as input, which is a regularization term used to prevent overfitting. The fit method takes in a data set X and target y, and finds the weights of the linear model that minimizes the regularized mean squared error. The predict method takes in a new data set X and returns the predictions of the linear model for that data set.\n
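For reference, the fit method computes the closed-form ridge solution

$$ w = (X^\top X + \alpha I)^{-1} X^\top y $$

Note that, unlike the LinearRegression class above, no intercept column is added here.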
Here is an example of how to use the class:

","metadata":{}},{"cell_type":"code","source":"X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])\ny = np.array([1, 2, 3, 4, 5, 6])\nridge_regression = RidgeRegression(alpha=1.0)\nridge_regression.fit(X, y)\nridge_regression.predict(X)\n#The fit method will return None, and the predict method will return the predictions of the linear model for the input data set.\n\n#Note that Ridge Regression is a linear model, which means it assumes linear relationship between features and target variable. It also assumes that data is normalized or standardized.","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:11.86656Z","iopub.execute_input":"2023-01-27T21:49:11.867464Z","iopub.status.idle":"2023-01-27T21:49:11.882378Z","shell.execute_reply.started":"2023-01-27T21:49:11.867422Z","shell.execute_reply":"2023-01-27T21:49:11.881092Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

\n 🤖Lasso Regression\n

\n

\n 1. Desc - Part of the regression family — it penalizes features that have low predictive outcomes by shrinking their coefficients to zero, can be used for classification or regression\n
2. Use cases - Predicting housing price, Predicting clinical outcomes based on health data\n
3. Pros - Less prone to overfitting, Can handle high-dimensional data, Performs implicit feature selection (no separate feature-selection step needed)\n
4. Cons - Can lead to poor interpretability as it can keep highly correlated variables\n\n

\n","metadata":{}},{"cell_type":"code","source":"import numpy as np\n\nclass LassoRegression:\n def __init__(self, alpha=1.0, max_iter=1000, tol=1e-4):\n self.alpha = alpha\n self.max_iter = max_iter\n self.tol = tol\n\n def soft_threshold(self, rho, lamda):\n '''Soft threshold function'''\n if rho < -lamda:\n return (rho + lamda)\n elif rho > lamda:\n return (rho - lamda)\n else: \n return 0\n\n def fit(self, X, y):\n n_samples, n_features = X.shape\n # Initialize the parameters\n self.w = np.random.randn(n_features)\n self.intercept = np.random.randn(1)\n # Coordinate descent\n for _ in range(self.max_iter):\n for j in range(n_features):\n Xj = X[:, j]\n y_pred = X @ self.w + self.intercept\n # calculate rho\n rho = np.sum(Xj * (y - y_pred + self.w[j]*Xj))\n # update w\n self.w[j] = self.soft_threshold(rho, self.alpha) / (np.sum(Xj**2) + 1)\n # calculate intercept\n self.intercept = np.mean(y - X @ self.w)\n \n def predict(self, X):\n return X @ self.w + self.intercept\n\n","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:11.884109Z","iopub.execute_input":"2023-01-27T21:49:11.884878Z","iopub.status.idle":"2023-01-27T21:49:11.900497Z","shell.execute_reply.started":"2023-01-27T21:49:11.884817Z","shell.execute_reply":"2023-01-27T21:49:11.898593Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

In this example, the LassoRegression class takes in an alpha value, which is the regularization strength used to prevent overfitting. It also takes in max_iter, the maximum number of coordinate-descent iterations, and tol, a convergence tolerance (accepted but not used in this simple implementation). The fit method takes in a data set X and target y, and finds the weights of the linear model that minimize the mean squared error plus the L1 regularization term. The predict method takes in a new data set X and returns the predictions of the linear model for that data set.\n
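For reference, the soft_threshold helper implements the soft-thresholding operator used by the coordinate-descent update of each coefficient:

$$ S(\rho, \lambda) = \begin{cases} \rho + \lambda & \text{if } \rho < -\lambda \\ 0 & \text{if } |\rho| \le \lambda \\ \rho - \lambda & \text{if } \rho > \lambda \end{cases} $$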
Here is an example of how to use the class:

","metadata":{}},{"cell_type":"code","source":"X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])\ny = np.array([1, 2, 3, 4, 5, 6])\nlasso_regression = LassoRegression(alpha=1.0)\nlasso_regression.fit(X, y)\nlasso_regression.predict(X)\n#The fit method will return None, and the predict method will return the predictions of the linear model for the input data set.\n\n#Note that Lasso Regression is a linear model, which means it assumes linear relationship between features and target variable. It also assumes that data is normalized or standardized. Lasso Regression is similar to Ridge Regression but it uses L1 regularization term which causes some features to have coefficient zero, this gives an advantage of feature selection","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:11.902827Z","iopub.execute_input":"2023-01-27T21:49:11.903506Z","iopub.status.idle":"2023-01-27T21:49:11.985772Z","shell.execute_reply.started":"2023-01-27T21:49:11.903458Z","shell.execute_reply":"2023-01-27T21:49:11.984414Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

\n 🤖Decision Tree\n

\n

\n 1. Desc - Decision Tree models make decision rules on the features to produce predictions. It can be used for classification or regression\n
2. Use cases - Customer churn prediction, Credit score modeling, Disease prediction\n
3. Pros - Explainable and interpretable, Can handle missing values\n
4. Cons - Prone to overfitting, Sensitive to outliers\n\n

\n","metadata":{}},{"cell_type":"code","source":"# Data wrangling \nimport pandas as pd \n\n# Array math\nimport numpy as np \n\n# Quick value count calculator\nfrom collections import Counter\n\n\nclass Node: \n \"\"\"\n Class for creating the nodes for a decision tree \n \"\"\"\n def __init__(\n self, \n Y: list,\n X: pd.DataFrame,\n min_samples_split=None,\n max_depth=None,\n depth=None,\n node_type=None,\n rule=None\n ):\n # Saving the data to the node \n self.Y = Y \n self.X = X\n\n # Saving the hyper parameters\n self.min_samples_split = min_samples_split if min_samples_split else 20\n self.max_depth = max_depth if max_depth else 5\n\n # Default current depth of node \n self.depth = depth if depth else 0\n\n # Extracting all the features\n self.features = list(self.X.columns)\n\n # Type of node \n self.node_type = node_type if node_type else 'root'\n\n # Rule for spliting \n self.rule = rule if rule else \"\"\n\n # Calculating the counts of Y in the node \n self.counts = Counter(Y)\n\n # Getting the GINI impurity based on the Y distribution\n self.gini_impurity = self.get_GINI()\n\n # Sorting the counts and saving the final prediction of the node \n counts_sorted = list(sorted(self.counts.items(), key=lambda item: item[1]))\n\n # Getting the last item\n yhat = None\n if len(counts_sorted) > 0:\n yhat = counts_sorted[-1][0]\n\n # Saving to object attribute. This node will predict the class with the most frequent class\n self.yhat = yhat \n\n # Saving the number of observations in the node \n self.n = len(Y)\n\n # Initiating the left and right nodes as empty nodes\n self.left = None \n self.right = None \n\n # Default values for splits\n self.best_feature = None \n self.best_value = None \n\n def GINI_impurity(y1_count: int, y2_count: int) -> float:\n \"\"\"\n Given the observations of a binary class calculate the GINI impurity\n \"\"\"\n # Ensuring the correct types\n if y1_count is None:\n y1_count = 0\n\n if y2_count is None:\n y2_count = 0\n\n # Getting the total observations\n n = y1_count + y2_count\n \n # If n is 0 then we return the lowest possible gini impurity\n if n == 0:\n return 0.0\n\n # Getting the probability to see each of the classes\n p1 = y1_count / n\n p2 = y2_count / n\n \n # Calculating GINI \n gini = 1 - (p1 ** 2 + p2 ** 2)\n \n # Returning the gini impurity\n return gini\n\n @staticmethod\n def ma(x: np.array, window: int) -> np.array:\n \"\"\"\n Calculates the moving average of the given list. 
\n        \"\"\"\n        return np.convolve(x, np.ones(window), 'valid') / window\n\n    def get_GINI(self):\n        \"\"\"\n        Function to calculate the GINI impurity of a node \n        \"\"\"\n        # Getting the 0 and 1 counts\n        y1_count, y2_count = self.counts.get(0, 0), self.counts.get(1, 0)\n\n        # Getting the GINI impurity (GINI_impurity takes only the two counts, so call it via the class)\n        return Node.GINI_impurity(y1_count, y2_count)\n\n    def best_split(self) -> tuple:\n        \"\"\"\n        Given the X features and Y targets calculates the best split \n        for a decision tree\n        \"\"\"\n        # Creating a dataset for splitting\n        df = self.X.copy()\n        df['Y'] = self.Y\n\n        # Getting the GINI impurity for the base input \n        GINI_base = self.get_GINI()\n\n        # Finding which split yields the best GINI gain \n        max_gain = 0\n\n        # Default best feature and split\n        best_feature = None\n        best_value = None\n\n        for feature in self.features:\n            # Dropping missing values\n            Xdf = df.dropna().sort_values(feature)\n\n            # Sorting the values and getting the rolling average\n            xmeans = self.ma(Xdf[feature].unique(), 2)\n\n            for value in xmeans:\n                # Splitting the dataset into left (<= value) and right (> value)\n                left_counts = Counter(Xdf[Xdf[feature] <= value]['Y'])\n                right_counts = Counter(Xdf[Xdf[feature] > value]['Y'])\n\n                # Getting the Y distribution from the dicts\n                y0_left, y1_left, y0_right, y1_right = left_counts.get(0, 0), left_counts.get(1, 0), right_counts.get(0, 0), right_counts.get(1, 0)\n\n                # Getting the left and right gini impurities\n                gini_left = Node.GINI_impurity(y0_left, y1_left)\n                gini_right = Node.GINI_impurity(y0_right, y1_right)\n\n                # Getting the obs count from the left and the right data splits\n                n_left = y0_left + y1_left\n                n_right = y0_right + y1_right\n\n                # Calculating the weights for each of the nodes\n                w_left = n_left / (n_left + n_right)\n                w_right = n_right / (n_left + n_right)\n\n                # Calculating the weighted GINI impurity\n                wGINI = w_left * gini_left + w_right * gini_right\n\n                # Calculating the GINI gain \n                GINIgain = GINI_base - wGINI\n\n                # Checking if this is the best split so far \n                if GINIgain > max_gain:\n                    best_feature = feature\n                    best_value = value \n\n                    # Setting the best gain to the current one \n                    max_gain = GINIgain\n\n        return (best_feature, best_value)\n\n    def grow_tree(self):\n        \"\"\"\n        Recursive method to create the decision tree\n        \"\"\"\n        # Making a df from the data \n        df = self.X.copy()\n        df['Y'] = self.Y\n\n        # If there is GINI to be gained, we split further \n        if (self.depth < self.max_depth) and (self.n >= self.min_samples_split):\n\n            # Getting the best split \n            best_feature, best_value = self.best_split()\n\n            if best_feature is not None:\n                # Saving the best split to the current node \n                self.best_feature = best_feature\n                self.best_value = best_value\n\n                # Getting the left and right nodes\n                left_df, right_df = df[df[best_feature]<=best_value].copy(), df[df[best_feature]>best_value].copy()\n\n                # Creating the left and right nodes\n                left = Node(\n                    left_df['Y'].values.tolist(), \n                    left_df[self.features], \n                    depth=self.depth + 1, \n                    max_depth=self.max_depth, \n                    min_samples_split=self.min_samples_split, \n                    node_type='left_node',\n                    rule=f\"{best_feature} <= {round(best_value, 3)}\"\n                )\n\n                self.left = left \n                self.left.grow_tree()\n\n                right = Node(\n                    right_df['Y'].values.tolist(), \n                    right_df[self.features], \n                    depth=self.depth + 1, \n                    max_depth=self.max_depth, \n                    min_samples_split=self.min_samples_split,\n                    node_type='right_node',\n                    rule=f\"{best_feature} > {round(best_value, 3)}\"\n                )\n\n                self.right = right\n                self.right.grow_tree()\n\n    def print_info(self, width=4):\n        \"\"\"\n        Method to print the information about the tree\n        \"\"\"\n        # Defining the number of spaces \n        const = int(self.depth * 
width ** 1.5)\n        spaces = \"-\" * const\n        \n        if self.node_type == 'root':\n            print(\"Root\")\n        else:\n            print(f\"|{spaces} Split rule: {self.rule}\")\n        print(f\"{' ' * const}   | GINI impurity of the node: {round(self.gini_impurity, 2)}\")\n        print(f\"{' ' * const}   | Class distribution in the node: {dict(self.counts)}\")\n        print(f\"{' ' * const}   | Predicted class: {self.yhat}\")\n\n    def print_tree(self):\n        \"\"\"\n        Prints the whole tree from the current node to the bottom\n        \"\"\"\n        self.print_info() \n        \n        if self.left is not None: \n            self.left.print_tree()\n        \n        if self.right is not None:\n            self.right.print_tree()\n\n    def predict(self, X: pd.DataFrame):\n        \"\"\"\n        Batch prediction method\n        \"\"\"\n        predictions = []\n\n        for _, x in X.iterrows():\n            values = {}\n            for feature in self.features:\n                values.update({feature: x[feature]})\n\n            predictions.append(self.predict_obs(values))\n\n        return predictions\n\n    def predict_obs(self, values: dict) -> int:\n        \"\"\"\n        Method to predict the class given a set of features\n        \"\"\"\n        cur_node = self\n        while cur_node.depth < cur_node.max_depth:\n            # Traversing the nodes all the way to the bottom\n            best_feature = cur_node.best_feature\n            best_value = cur_node.best_value\n\n            if cur_node.n < cur_node.min_samples_split:\n                break \n\n            # Stop at nodes that could not be split any further\n            if best_feature is None:\n                break\n\n            if values.get(best_feature) <= best_value:\n                if cur_node.left is not None:\n                    cur_node = cur_node.left\n                else:\n                    break\n            else:\n                if cur_node.right is not None:\n                    cur_node = cur_node.right\n                else:\n                    break\n\n        return cur_node.yhat","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:11.988298Z","iopub.execute_input":"2023-01-27T21:49:11.988786Z","iopub.status.idle":"2023-01-27T21:49:12.029554Z","shell.execute_reply.started":"2023-01-27T21:49:11.988733Z","shell.execute_reply":"2023-01-27T21:49:12.02828Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"
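The Node class above has no usage cell of its own, so here is a minimal, hypothetical sketch on a small made-up dataset (the feature names and values are invented purely for illustration):

```python
import pandas as pd

# Made-up toy dataset for illustration only
X = pd.DataFrame({
    "age":    [22, 25, 47, 52, 46, 56, 55, 60, 34, 28],
    "income": [25, 32, 60, 58, 52, 61, 58, 70, 40, 33],
})
Y = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]

# Grow a shallow tree and print its structure
root = Node(Y, X, max_depth=3, min_samples_split=2)
root.grow_tree()
root.print_tree()

# Batch prediction on the training frame
print(root.predict(X))
```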

\n 🤖Random forest\n

\n

\n 1. Desc - An ensemble learning method that combines the output of multiple decision trees\n
2. Use cases - Credit score modeling, Predicting housing prices\n
3. Pros - Reduces overfitting, Higher accuracy compared to other models\n
4. Cons - Training complexity can be high, Not very interpretable\n\n

\n","metadata":{}},{"cell_type":"code","source":"import numpy as np\n\nclass RandomForest:\n def __init__(self, n_estimators=10, max_depth=None, min_samples_split=2, random_state=0):\n self.n_estimators = n_estimators\n self.max_depth = max_depth\n self.min_samples_split = min_samples_split\n self.random_state = random_state\n self.trees = []\n\n def fit(self, X, y):\n for i in range(self.n_estimators):\n np.random.seed(self.random_state + i)\n tree = DecisionTree(max_depth=self.max_depth, min_samples_split=self.min_samples_split)\n idx = np.random.choice(X.shape[0], X.shape[0], replace=True)\n X_sample = X[idx]\n y_sample = y[idx]\n tree.fit(X_sample, y_sample)\n self.trees.append(tree)\n\n def predict(self, X):\n predictions = []\n for tree in self.trees:\n predictions.append(tree.predict(X))\n\n predictions = np.array(predictions)\n return np.mean(predictions, axis=0)\n\nclass DecisionTree:\n def __init__(self, max_depth=None, min_samples_split=2):\n self.max_depth = max_depth\n self.min_samples_split = min_samples_split\n self.tree = None\n \n def fit(self, X, y):\n self.tree = self._grow_tree(X, y)\n\n def _grow_tree(self, X, y, depth=0):\n n_samples, n_features = X.shape\n n_classes = len(np.unique(y))\n \n if depth >= self.max_depth or n_samples < self.min_samples_split:\n return self._leaf_value(y)\n \n feature_index = self._best_feature_index(X, y)\n feature = X[:, feature_index]\n values = np.unique(feature)\n branches = []\n for value in values:\n mask = feature == value\n X_branch, y_branch = X[mask], y[mask]\n\n branch = self._grow_tree(X_branch, y_branch, depth + 1)\n branches.append((value, branch))\n\n return (feature_index, branches)\n\n def _best_feature_index(self, X, y):\n # Compute the Gini impurity of the data\n current_impurity = self._gini_impurity(y)\n best_gain = 0\n best_index = None\n \n for index in range(n_features):\n feature = X[:, index]\n values = np.unique(feature)\n feature_impurity = 0\n for value in values:\n mask = feature == value\n y_subset\n","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:12.031488Z","iopub.execute_input":"2023-01-27T21:49:12.031942Z","iopub.status.idle":"2023-01-27T21:49:12.049527Z","shell.execute_reply.started":"2023-01-27T21:49:12.031896Z","shell.execute_reply":"2023-01-27T21:49:12.048442Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

This is a simple random forest class that takes n_estimators, max_depth, min_samples_split, and random_state as inputs. Its fit method builds n_estimators decision trees, each trained on a bootstrap sample of the rows (random sampling with replacement). The predict method takes X and returns the mean prediction across all the trees.\nNote that the accompanying DecisionTree class is only sketched and left unfinished here, so a complete decision tree implementation must be available before this random forest can actually be run.\n
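As a sketch of the intended usage, assuming the DecisionTree helper has been completed so that its fit and predict methods work on NumPy arrays (it is left unfinished in the cell above), the bootstrap-aggregated ensemble would be used like this:

```python
import numpy as np

# Hypothetical toy data; the values are made up for illustration
X_train = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y_train = np.array([0, 0, 0, 1, 1, 1])

rf = RandomForest(n_estimators=5, max_depth=3, min_samples_split=2, random_state=0)
rf.fit(X_train, y_train)      # each tree is fit on its own bootstrap sample of the rows
print(rf.predict(X_train))    # mean of the individual tree predictions
```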

","metadata":{}},{"cell_type":"markdown","source":"

\n 🤖Gradient boosting regression\n

\n

\n 1. Desc - Gradient Boosting Regression employs boosting to make predictive models from an ensemble of weak predictive learners\n
2. Use cases - Predicting car emissions, Predicting ride hailing fare amount\n
3. Pros - Better accuracy compared to other regression models, It can handle multicollinearity, It can handle non-linear relationships\n
4. Cons - Sensitive to outliers and can therefore cause overfitting, Computationally expensive and has high complexity\n\n

\n","metadata":{}},{"cell_type":"code","source":"import numpy as np\n\nclass GradientBoostingRegressor:\n def __init__(self, n_estimators=100, learning_rate=0.1):\n self.n_estimators = n_estimators\n self.learning_rate = learning_rate\n self.trees = []\n\n def fit(self, X, y):\n y_pred = np.zeros(len(y))\n for i in range(self.n_estimators):\n tree = DecisionTreeRegressor()\n gradient = y - y_pred\n tree.fit(X, gradient)\n self.trees.append(tree)\n y_pred += self.learning_rate * tree.predict(X)\n\n def predict(self, X):\n y_pred = np.zeros(len(X))\n for tree in self.trees:\n y_pred += self.learning_rate * tree.predict(X)\n return y_pred\n","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:12.05135Z","iopub.execute_input":"2023-01-27T21:49:12.05172Z","iopub.status.idle":"2023-01-27T21:49:12.066477Z","shell.execute_reply.started":"2023-01-27T21:49:12.051689Z","shell.execute_reply":"2023-01-27T21:49:12.065456Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

This is a simple gradient boosting regressor class that takes n_estimators and learning_rate as inputs. Its fit method builds n_estimators decision tree regressors, each fitted to the residuals (the negative gradient of the squared-error loss) of the current ensemble prediction. The predict method takes X and returns the accumulated prediction of all the trees.\nA DecisionTreeRegressor class is assumed here; it needs to be implemented (or imported) before this gradient boosting regressor can be run.
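As a quick usage sketch: the class only needs a regressor with fit and predict, so for a smoke test scikit-learn's DecisionTreeRegressor can stand in for a from-scratch one (this import is an assumption for the demo, not part of the notebook's own implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in weak learner for this demo only

# Hypothetical toy data: y is exactly 2*x, for illustration
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])

gbr = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1)
gbr.fit(X, y)            # each tree fits the residuals of the current ensemble
print(gbr.predict(X))    # should be close to y after enough boosting rounds
```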

","metadata":{}},{"cell_type":"markdown","source":"

\n 🤖Apriori Algorithms\n

\n

\n 1. Desc - Rule based approach that identifies the most frequent itemset in a given dataset where prior knowledge of frequent itemset properties is used\n
2. Use cases - Product placements, Recommendation engines, Promotion optimization\n
3. Pros - Results are intuitive and Interpretable, Exhaustive approach as it finds all rules based on the confidence and support\n
4. Cons - Generates many uninteresting itemsets, Computationally and memory intensive, Results in many overlapping item sets\n

\n","metadata":{}},{"cell_type":"code","source":"def apriori(transactions, min_support):\n items = set()\n for transaction in transactions:\n for item in transaction:\n items.add(item)\n\n items = list(items)\n item_sets = []\n for i in range(1, len(items)+1):\n for subset in itertools.combinations(items, i):\n item_sets.append(frozenset(subset))\n\n item_sets = list(filter(lambda x: len(x)>=min_support, item_sets))\n item_sets.sort()\n\n frequent_item_sets = []\n for item_set in item_sets:\n count = 0\n for transaction in transactions:\n if item_set.issubset(transaction):\n count += 1\n if count/len(transactions) >= min_support:\n frequent_item_sets.append((item_set, count))\n\n return frequent_item_sets\n","metadata":{"execution":{"iopub.status.busy":"2023-01-27T21:49:12.067861Z","iopub.execute_input":"2023-01-27T21:49:12.068555Z","iopub.status.idle":"2023-01-27T21:49:12.081812Z","shell.execute_reply.started":"2023-01-27T21:49:12.068521Z","shell.execute_reply":"2023-01-27T21:49:12.080688Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"

In this code, the apriori function takes two arguments: transactions and min_support. transactions is a list of lists, where each sub-list represents a transaction and the items it contains; min_support is the minimum fraction of transactions in which an itemset must appear to be counted as frequent. The function returns the frequent itemsets together with their counts.
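Here is a small usage sketch with a made-up set of transactions:

```python
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk", "butter"],
]

# Keep itemsets that appear in at least half of the transactions
print(apriori(transactions, min_support=0.5))
```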

","metadata":{}},{"cell_type":"markdown","source":"

\nComing soon: more algorithms and optimizations of the above, with examples.

","metadata":{}},{"cell_type":"markdown","source":"

I brewed this notebook from scratch. If it helped, please consider upvoting, and cite me if you share it. Thank you!

\n

\n    Let's connect on LinkedIn!\n    \n

\n

\nFollow me on GitHub too!

\n

\n    Also check out my Medium posts!\n    \n

","metadata":{}}]} --------------------------------------------------------------------------------