├── .gitignore ├── README.md ├── decision_tree.ipynb ├── gmm.ipynb ├── gmm.py ├── kmeans.ipynb ├── kmeans ├── __init__.py ├── kmeans.py └── test_kmeans .py ├── knn.ipynb ├── linear_regression.py ├── linear_regression_gradient_descent.ipynb ├── linear_regression_map.ipynb ├── linear_regression_mle.ipynb ├── pca.ipynb ├── pca.py ├── svm.ipynb └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # poetry 98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 99 | # This is especially recommended for binary packages to ensure reproducibility, and is more 100 | # commonly ignored for libraries. 101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 102 | #poetry.lock 103 | 104 | # pdm 105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 106 | #pdm.lock 107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 108 | # in version control. 109 | # https://pdm.fming.dev/#use-with-ide 110 | .pdm.toml 111 | 112 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 113 | __pypackages__/ 114 | 115 | # Celery stuff 116 | celerybeat-schedule 117 | celerybeat.pid 118 | 119 | # SageMath parsed files 120 | *.sage.py 121 | 122 | # Environments 123 | .env 124 | .venv 125 | env/ 126 | venv/ 127 | ENV/ 128 | env.bak/ 129 | venv.bak/ 130 | 131 | # Spyder project settings 132 | .spyderproject 133 | .spyproject 134 | 135 | # Rope project settings 136 | .ropeproject 137 | 138 | # mkdocs documentation 139 | /site 140 | 141 | # mypy 142 | .mypy_cache/ 143 | .dmypy.json 144 | dmypy.json 145 | 146 | # Pyre type checker 147 | .pyre/ 148 | 149 | # pytype static type analyzer 150 | .pytype/ 151 | 152 | # Cython debug symbols 153 | cython_debug/ 154 | 155 | # PyCharm 156 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 157 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 158 | # and can be added to the global gitignore or merged into this file. For a more nuclear 159 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 160 | #.idea/ 161 | 162 | .vscode/ 163 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # lazy-ml 2 | Implementations of ML algorithms that are good for learning the underlying principles 3 | -------------------------------------------------------------------------------- /decision_tree.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Training\n", 8 | "\n", 9 | "Given the whole dataset:\n", 10 | "\n", 11 | "1. Calculate the information gain of each possible split (which feature to split on and at which threshold value)\n", 12 | "2. Divide the set with the feature and the threshold that give the most information gain\n", 13 | "3. Repeat for all sub-branches until a stopping criterion is reached\n", 14 | "\n", 15 | "# Inference\n", 16 | "\n", 17 | "1. Traverse the tree with the features of the data point until a leaf is reached\n", 18 | "2. Return the label of the leaf" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "# Model\n", 26 | "\n", 27 | "## Information gain\n", 28 | "\n", 29 | "Information gain is defined as $$IG = Entropy(\\text{parent}) - \\sum_{child}\\frac{N_{child}}{N_{parent}}Entropy(child)$$\n", 30 | "\n", 31 | "$$Entropy(node) = -\\sum_{i=1}^{c}p_i\\log_2p_i$$\n", 32 | "$$p_i = \\frac{N_i}{N}$$\n", 33 | "\n", 34 | "## Stopping criteria\n", 35 | "\n", 36 | "1. Maximum depth: when the tree reaches a maximum depth\n", 37 | "2. Minimum number of samples per node: when the number of samples in a node falls below a threshold\n", 38 | "3. 
Minimum impurity decrease: stop splitting a node when the best split would decrease the impurity by less than this value" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 1, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "import numpy as np\n", 48 | "from sklearn import datasets\n", 49 | "from sklearn.model_selection import train_test_split\n", 50 | "import random\n", 51 | "from tqdm import tqdm" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 2, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "class TreeNode:\n", 61 | "\n", 62 | " def __init__(self, feature_idx, threshold, left_child=None, right_child=None, *, leaf_value=None):\n", 63 | " self.feature_idx = feature_idx\n", 64 | " self.threshold = threshold\n", 65 | " self.left_child = left_child\n", 66 | " self.right_child = right_child\n", 67 | " self.leaf_value = leaf_value\n", 68 | " \n", 69 | " def is_leaf_node(self):\n", 70 | " return self.leaf_value is not None\n", 71 | "\n", 72 | "class DecisionTree:\n", 73 | "\n", 74 | " def __init__(self, min_samples_leaf=2, max_depth=100, n_features=None):\n", 75 | " self.min_samples_leaf = min_samples_leaf\n", 76 | " self.max_depth = max_depth\n", 77 | " self.n_features = n_features\n", 78 | " self.root = None\n", 79 | "\n", 80 | " def fit(self, X, y):\n", 81 | " self.n_features = X.shape[1] if not self.n_features else min(self.n_features, X.shape[1])\n", 82 | " self.root = self._grow_tree(X, y, level=0)\n", 83 | "\n", 84 | " def _grow_tree(self, X, y, level):\n", 85 | " n_samples, n_features = X.shape\n", 86 | " n_labels = len(np.unique(y))\n", 87 | "\n", 88 | " # If we have reached a stopping criterion, return a leaf node\n", 89 | " if (level == self.max_depth or n_labels == 1 or n_samples < self.min_samples_leaf):\n", 90 | " leaf_value = self._most_common_label(y)\n", 91 | " return TreeNode(None, None, leaf_value=leaf_value)\n", 92 | "\n", 93 | " # Select random features to consider (we don't have to in the case of a Decision Tree, but it's mandatory in the case of a Random Forest)\n", 94 | " feature_idxs = np.random.choice(n_features, self.n_features, replace=False)\n", 95 | "\n", 96 | " # Greedily select the best split according to information gain\n", 97 | " best_feature_idx, best_threshold = self._best_split_criteria(X, y, feature_idxs)\n", 98 | " # Split the data using the threshold\n", 99 | " left_idxs, right_idxs = self._split(X[:, best_feature_idx], best_threshold)\n", 100 | "\n", 101 | " # Assign data points to the left or right child\n", 102 | " left_child = self._grow_tree(X[left_idxs, :], y[left_idxs], level+1)\n", 103 | " right_child = self._grow_tree(X[right_idxs, :], y[right_idxs], level+1)\n", 104 | " return TreeNode(best_feature_idx, best_threshold, left_child, right_child)\n", 105 | "\n", 106 | " def _best_split_criteria(self, X, y, feature_idxs):\n", 107 | " best_gain = -1\n", 108 | " split_criteria = None\n", 109 | " # Iterate through all possible splits and select the pair (feature, threshold) that maximizes information gain\n", 110 | " for feature_idx in feature_idxs:\n", 111 | " feature = X[:, feature_idx]\n", 112 | " thresholds = np.unique(feature)\n", 113 | " for threshold in thresholds:\n", 114 | " gain = self._information_gain(y, feature, threshold)\n", 115 | " if gain > best_gain:\n", 116 | " best_gain = gain\n", 117 | " split_criteria = (feature_idx, threshold)\n", 118 | " return split_criteria\n", 119 | " \n", 120 | " def _information_gain(self, y, feature, threshold):\n", 121 | " # Calculate the 
entropy of the parent node\n", 122 | " parent_entropy = self._entropy(y)\n", 123 | "\n", 124 | " # Calculate the entropy of the left and right child nodes\n", 125 | " left_idxs, right_idxs = self._split(feature, threshold)\n", 126 | " left_entropy = self._entropy(y[left_idxs])\n", 127 | " right_entropy = self._entropy(y[right_idxs])\n", 128 | " \n", 129 | " # Calculate the information gain\n", 130 | " information_gain = parent_entropy - (len(left_idxs)/len(y))*left_entropy - (len(right_idxs)/len(y))*right_entropy\n", 131 | " return information_gain\n", 132 | " \n", 133 | " def _entropy(self, y):\n", 134 | " # Calculate the entropy of a node according to the formula -sum(p_i*log(p_i)) where p_i is the probability of the i-th class\n", 135 | " _, counts = np.unique(y, return_counts=True)\n", 136 | " probabilities = counts / counts.sum()\n", 137 | " entropy = sum(probabilities * -np.log2(probabilities))\n", 138 | " return entropy\n", 139 | "\n", 140 | " def _split(self, X_column, split_thresh):\n", 141 | " # The index of the data points that are less than or equal to the threshold\n", 142 | " left_idxs = np.argwhere(X_column <= split_thresh).flatten()\n", 143 | " # The index of the data points that are greater than the threshold\n", 144 | " right_idxs = np.argwhere(X_column > split_thresh).flatten()\n", 145 | " return left_idxs, right_idxs\n", 146 | " \n", 147 | " def _most_common_label(self, y):\n", 148 | " # Return the most common label in the data\n", 149 | " return np.bincount(y).argmax()\n", 150 | " \n", 151 | " def predict(self, X):\n", 152 | " return np.array([self._predict(x) for x in X])\n", 153 | " \n", 154 | " def _predict(self, x):\n", 155 | " # Traverse the tree to find the leaf node that the data point belongs to\n", 156 | " node = self.root\n", 157 | " while not node.is_leaf_node():\n", 158 | " if x[node.feature_idx] <= node.threshold:\n", 159 | " node = node.left_child\n", 160 | " else:\n", 161 | " node = node.right_child\n", 162 | " return node.leaf_value\n", 163 | "\n" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 3, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "def get_data_split():\n", 173 | " MAGIC = 42\n", 174 | " random.seed(MAGIC)\n", 175 | " np.random.seed(MAGIC)\n", 176 | " data = datasets.load_breast_cancer()\n", 177 | " X, y = data.data, data.target\n", 178 | "\n", 179 | " X_train, X_test, y_train, y_test = train_test_split(\n", 180 | " X, y, test_size=0.2, random_state=MAGIC\n", 181 | " )\n", 182 | "\n", 183 | " return data, X_train, X_test, y_train, y_test\n", 184 | "\n", 185 | "def accuracy(y_test, y_pred):\n", 186 | " return np.sum(y_test == y_pred) / len(y_test)" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 4, 192 | "metadata": {}, 193 | "outputs": [ 194 | { 195 | "name": "stdout", 196 | "output_type": "stream", 197 | "text": [ 198 | "Training data shape: (455, 30)\n", 199 | "Single tree accuracy: 0.9035087719298246\n", 200 | "mean concave points\n" 201 | ] 202 | } 203 | ], 204 | "source": [ 205 | "# Load a sample dataset\n", 206 | "data, X_train, X_test, y_train, y_test = get_data_split()\n", 207 | "\n", 208 | "print(f'Training data shape: {X_train.shape}')\n", 209 | "\n", 210 | "tree_settings = {\n", 211 | " 'max_depth': 2,\n", 212 | "}\n", 213 | "\n", 214 | "tree = DecisionTree(**tree_settings)\n", 215 | "tree.fit(X_train, y_train)\n", 216 | "tree_predictions = tree.predict(X_test)\n", 217 | "\n", 218 | "tree_acc = accuracy(y_test, tree_predictions)\n", 219 | 
"print(f'Single tree accuracy: {tree_acc}')\n", 220 | "\n", 221 | "# Let's check which feature is the most important one\n", 222 | "feature_names = data.feature_names\n", 223 | "print(feature_names[tree.root.feature_idx])\n" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "# Random Forest\n", 231 | "\n", 232 | "A random forest is a model that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. Each tree is trained on a random subset of the training data. The random subsets are the same size as the original training set, but are created by sampling with replacement. The idea is that by training each tree on different samples, although each tree might have high variance with respect to a particular set of the training data, overall, the entire forest will have lower variance but not at the cost of increasing the bias." 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 5, 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "class RandomForest:\n", 242 | "\n", 243 | " def __init__(self, n_trees=10, min_samples_leaf=2, max_depth=100, n_features=None):\n", 244 | " self.min_samples_leaf = min_samples_leaf\n", 245 | " self.max_depth = max_depth\n", 246 | " self.n_features = n_features\n", 247 | " self.n_trees = n_trees\n", 248 | "\n", 249 | " def fit(self, X, Y):\n", 250 | " self.trees = []\n", 251 | " tree_iterator = tqdm(range(self.n_trees), desc='Training trees')\n", 252 | " for _ in tree_iterator:\n", 253 | " tree = DecisionTree(min_samples_leaf=self.min_samples_leaf, max_depth=self.max_depth, n_features=self.n_features)\n", 254 | " X_sample, Y_sample = self._sample(X, Y)\n", 255 | " tree.fit(X_sample, Y_sample)\n", 256 | " self.trees.append(tree)\n", 257 | " \n", 258 | " def _sample(self, X, Y):\n", 259 | " n_samples = X.shape[0]\n", 260 | " indices = np.random.choice(n_samples, n_samples, replace=True)\n", 261 | " return X[indices], Y[indices]\n", 262 | " \n", 263 | " def _most_common(self, Y):\n", 264 | " return np.bincount(Y).argmax()\n", 265 | " \n", 266 | " def predict(self, X, agg_type='mean'):\n", 267 | " predictions = []\n", 268 | " tree_iterator = tqdm(self.trees, desc='Predicting')\n", 269 | " for tree in tree_iterator:\n", 270 | " predictions.append(tree.predict(X))\n", 271 | " predictions = np.array(predictions)\n", 272 | " if agg_type == 'mean':\n", 273 | " return np.mean(predictions, axis=0)\n", 274 | " elif agg_type == 'majority':\n", 275 | " return np.apply_along_axis(self._most_common, axis=0, arr=predictions)\n", 276 | " else:\n", 277 | " raise ValueError('agg_type must be either mean or majority')\n" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 6, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "name": "stdout", 287 | "output_type": "stream", 288 | "text": [ 289 | "Training data shape: (455, 30)\n" 290 | ] 291 | }, 292 | { 293 | "name": "stderr", 294 | "output_type": "stream", 295 | "text": [ 296 | "Training trees: 100%|██████████| 10/10 [00:09<00:00, 1.10it/s]\n", 297 | "Predicting: 100%|██████████| 10/10 [00:00<00:00, 18774.86it/s]" 298 | ] 299 | }, 300 | { 301 | "name": "stdout", 302 | "output_type": "stream", 303 | "text": [ 304 | "Random forest accuracy: 0.956140350877193\n" 305 | ] 306 | }, 307 | { 308 | "name": "stderr", 309 | "output_type": "stream", 310 | "text": [ 311 | "\n" 312 | ] 313 | } 314 | ], 315 | "source": [ 316 | 
"data, X_train, X_test, y_train, y_test = get_data_split()\n", 317 | "\n", 318 | "print(f'Training data shape: {X_train.shape}')\n", 319 | "\n", 320 | "forest = RandomForest(n_trees=10, **tree_settings)\n", 321 | "forest.fit(X_train, y_train)\n", 322 | "forest_predictions = forest.predict(X_test, agg_type='majority')\n", 323 | "\n", 324 | "tree_acc = accuracy(y_test, forest_predictions)\n", 325 | "print(f'Random forest accuracy: {tree_acc}')" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "# Ada Boosting\n", 333 | "\n", 334 | "AdaBoosting (which stands for Adaptive Boosting) is an algorithm that allows to combine multiple weak classifiers into a weighted sum to produce a boosted classifier. It is usually used for binary classification, but can be extended to multi-class classification.\n", 335 | "\n" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": 7, 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [ 344 | "class Stump:\n", 345 | "\n", 346 | " def __init__(self):\n", 347 | " self.feature_idx = None\n", 348 | " self.threshold = None\n", 349 | " self.alpha = None\n", 350 | " self.p = None\n", 351 | "\n", 352 | " def predict(self, X):\n", 353 | " n_samples = X.shape[0]\n", 354 | " X_column = X[:, self.feature_idx]\n", 355 | " # Initially all the samples are classified as 1\n", 356 | " predictions = np.ones(n_samples)\n", 357 | " if self.p == 1:\n", 358 | " predictions[X_column < self.threshold] = -1\n", 359 | " else:\n", 360 | " predictions[X_column > self.threshold] = -1\n", 361 | " return predictions\n", 362 | " \n", 363 | "class AdaBoost:\n", 364 | "\n", 365 | " def __init__(self, num_stumps=5):\n", 366 | " self.num_stumps = num_stumps\n", 367 | " self.stumps = []\n", 368 | "\n", 369 | " def fit(self, X, y):\n", 370 | " n_samples, n_features = X.shape\n", 371 | "\n", 372 | " # Initial weights are 1 / n_samples\n", 373 | " w = np.full(n_samples, (1 / n_samples))\n", 374 | "\n", 375 | " for _ in range(self.num_stumps):\n", 376 | " stump = Stump()\n", 377 | " min_error = float('inf')\n", 378 | "\n", 379 | " # Iterate through every feature and threshold to find the best decision stump\n", 380 | " for feature_idx in range(n_features):\n", 381 | " X_column = X[:, feature_idx]\n", 382 | " thresholds = np.unique(X_column)\n", 383 | "\n", 384 | " for threshold in thresholds:\n", 385 | " p = 1\n", 386 | " prediction = np.ones(n_samples)\n", 387 | " prediction[X_column < threshold] = -1\n", 388 | "\n", 389 | " # Error = sum of weights of misclassified samples\n", 390 | " error = sum(w[y != prediction])\n", 391 | "\n", 392 | " # If error is over 50% we flip the prediction so that error will be 1 - error\n", 393 | " if error > 0.5:\n", 394 | " error = 1 - error\n", 395 | " p = -1\n", 396 | "\n", 397 | " # Store the best configuration\n", 398 | " if error < min_error:\n", 399 | " stump.feature_idx = feature_idx\n", 400 | " stump.threshold = threshold\n", 401 | " min_error = error\n", 402 | " stump.p = p\n", 403 | "\n", 404 | " # Calculate alpha\n", 405 | " EPS = 1e-10\n", 406 | " stump.alpha = 0.5 * np.log((1.0 - min_error + EPS) / (min_error + EPS))\n", 407 | "\n", 408 | " # Calculate predictions and update weights\n", 409 | " predictions = stump.predict(X)\n", 410 | "\n", 411 | " # Only misclassified samples have non-zero weights\n", 412 | " w *= np.exp(-stump.alpha * y * predictions)\n", 413 | " w /= np.sum(w)\n", 414 | "\n", 415 | " # Save stump\n", 416 | " self.stumps.append(stump)\n", 417 | " \n", 418 | " 
def predict(self, X):\n", 419 | " preds = [stump.alpha * stump.predict(X) for stump in self.stumps]\n", 420 | " preds = np.sum(preds, axis=0)\n", 421 | " preds = np.sign(preds)\n", 422 | " return preds\n", 423 | " " 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 8, 429 | "metadata": {}, 430 | "outputs": [ 431 | { 432 | "name": "stdout", 433 | "output_type": "stream", 434 | "text": [ 435 | "Training data shape: (455, 30)\n", 436 | "Adaboost accuracy: 0.6052631578947368\n" 437 | ] 438 | } 439 | ], 440 | "source": [ 441 | "data, X_train, X_test, y_train, y_test = get_data_split()\n", 442 | "\n", 443 | "# Change labels from {0, 1} to {-1, 1} for both the training and the test set, since AdaBoost works with targets in {-1, +1}\n", 444 | "y_train[y_train == 0] = -1\n", "y_test[y_test == 0] = -1\n", 445 | "\n", 446 | "print(f'Training data shape: {X_train.shape}')\n", 447 | "\n", 448 | "adaboost = AdaBoost(num_stumps = 5)\n", 449 | "adaboost.fit(X_train, y_train)\n", 450 | "adaboost_predictions = adaboost.predict(X_test)\n", 451 | "\n", 452 | "adaboost_acc = accuracy(y_test, adaboost_predictions)\n", 453 | "print(f'Adaboost accuracy: {adaboost_acc}')" 454 | ] 455 | } 456 | ], 457 | "metadata": { 458 | "kernelspec": { 459 | "display_name": "lazy-ml", 460 | "language": "python", 461 | "name": "python3" 462 | }, 463 | "language_info": { 464 | "codemirror_mode": { 465 | "name": "ipython", 466 | "version": 3 467 | }, 468 | "file_extension": ".py", 469 | "mimetype": "text/x-python", 470 | "name": "python", 471 | "nbconvert_exporter": "python", 472 | "pygments_lexer": "ipython3", 473 | "version": "3.11.3" 474 | }, 475 | "orig_nbformat": 4 476 | }, 477 | "nbformat": 4, 478 | "nbformat_minor": 2 479 | } 480 | -------------------------------------------------------------------------------- /gmm.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import utils 3 | 4 | class GaussianMixtureModel(): 5 | 6 | def __init__(self, k: int, X: np.ndarray) -> None: 7 | self.k = k 8 | 9 | self.n = X.shape[0] 10 | self.d = X.shape[1] 11 | self.X = X 12 | 13 | # Initialize the parameters 14 | self.mu = np.random.rand(k, self.d) 15 | self.cov_matrix = np.array([np.eye(self.d)] * k) 16 | self.pi = np.random.rand(k) 17 | # Normalize pi 18 | self.pi = self.pi / np.sum(self.pi) 19 | # Responsibility matrix 20 | self.r = np.zeros((self.n, k)) 21 | 22 | def _e_step(self) -> None: 23 | for i in range(self.n): 24 | for j in range(self.k): 25 | # Compute the pdf for the ith data point and the jth gaussian 26 | self.r[i, j] = self.pi[j] * utils.pdf_multivariate_normal_distribution(self.X[i], self.mu[j], self.cov_matrix[j]) 27 | self.r = self.r / np.sum(self.r, axis=1, keepdims=True) 28 | 29 | def _m_step(self) -> None: 30 | # Compute the new parameters 31 | for j in range(self.k): 32 | N_j = np.sum(self.r[:, j]) # Sum of the responsibilities for the jth gaussian 33 | self.mu[j] = np.sum(self.r[:, j].reshape(-1, 1) * self.X, axis=0) / N_j 34 | self.cov_matrix[j] = (1 / N_j) * np.sum(self.r[:, j].reshape(-1, 1, 1) * np.matmul((self.X - self.mu[j]).reshape(-1, self.d, 1), (self.X - self.mu[j]).reshape(-1, 1, self.d)), axis=0) 35 | self.pi[j] = N_j / self.n 36 | 37 | def fit(self) -> None: 38 | """ 39 | Fits the model to the data. 
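        Note: this performs a single EM iteration (one E-step followed by one M-step); call it repeatedly until the parameters stop changing.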
40 | """ 41 | # E-step 42 | self._e_step() 43 | # M-step 44 | self._m_step() -------------------------------------------------------------------------------- /kmeans/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hkproj/lazy-ml/5b8dbfe913af580edfd281c3b1e0c67539e8d630/kmeans/__init__.py -------------------------------------------------------------------------------- /kmeans/kmeans.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from enum import Enum, Flag, auto 3 | from typing import Optional 4 | 5 | class InitializationMethod(Enum): 6 | Random = 1 7 | KMeansPlusPLus = 2 8 | 9 | class StoppingCriteria(Flag): 10 | Convergence = auto() 11 | MaxIterations = auto() 12 | CentroidsChangeThreshold = auto() 13 | 14 | class KMeans: 15 | 16 | def __init__(self, k: int, init_method: InitializationMethod): 17 | assert k > 0, "k should be greater than 0" 18 | self.k = k 19 | self.init_method = init_method 20 | 21 | def _initialize_centroids(self, X: np.array) -> np.array: 22 | # X: [N, d] where N is the number of data points and d is the number of features 23 | 24 | if self.init_method == InitializationMethod.Random: 25 | # Randomly choose K distinct data points from the dataset as centroids 26 | chosen_centroids_idx = np.random.choice(X.shape[0], size=self.k, replace=False) 27 | centroids = X[chosen_centroids_idx] # [k, d] 28 | elif self.init_method == InitializationMethod.KMeansPlusPLus: 29 | # The algorithm is as follows: 30 | # 1. Choose one center uniformly at random from among the data points. 31 | # 2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen. 32 | # 3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x). 33 | # 4. Repeat Steps 2 and 3 until k centers have been chosen. 
34 | 35 | chosen_centroids_idx = np.random.choice(X.shape[0], size=1, replace=False) 36 | centroids = X[chosen_centroids_idx] # [1, d] 37 | for _ in range(self.k-1): 38 | # Compute the squared distance between each data point and each centroid 39 | diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :] # [N, k, d] 40 | assert diffs.shape == (X.shape[0], centroids.shape[0], X.shape[1]), f"Expected diffs to have shape (N, k, d) but got {diffs.shape}" 41 | distances = np.sum(diffs**2, axis=2, keepdims=False) # [N, k] 42 | assert distances.shape == (X.shape[0], centroids.shape[0]), f"Expected distances to have shape (N, k) but got {distances.shape}" 43 | # Compute the min distance between each data point and its nearest centroid 44 | min_distances = np.min(distances, axis=1, keepdims=False) # [N] 45 | assert min_distances.shape == (X.shape[0],), f"Expected min_distances to have shape {(X.shape[0])} but got {min_distances.shape}" 46 | # Compute the probability distribution 47 | prob = min_distances / np.sum(min_distances) 48 | # Choose a new centroid 49 | chosen_centroids_idx = np.concatenate([chosen_centroids_idx, np.random.choice(X.shape[0], size=1, replace=False, p=prob)], axis=0) 50 | centroids = X[chosen_centroids_idx] 51 | else: 52 | raise ValueError(f"Invalid initialization method: {self.init_method}") 53 | 54 | assert centroids is not None, "Centroids should not be None" 55 | assert centroids.shape == (self.k, X.shape[1]), f"Expected centroids to have shape ({self.k}, {X.shape[1]}) but got {centroids.shape}" 56 | return centroids 57 | 58 | def _assign_clusters(self, X: np.array, centroids: np.array) -> np.array: 59 | # X: [N, d] 60 | # centroids: [k, d] 61 | diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :] # [N, k, d] 62 | assert diffs.shape == (X.shape[0], centroids.shape[0], X.shape[1]), f"Expected diffs to have shape (N, k, d) but got {diffs.shape}" 63 | distances = np.sum(diffs**2, axis=2, keepdims=False) # [N, k] 64 | assignments = np.argmin(distances, axis=1) # [N] 65 | return assignments 66 | 67 | def _update_centroids(self, X: np.array, assignments: np.array) -> np.array: 68 | # X: [N, d] 69 | # assignments: [N] 70 | new_centroids = np.empty((self.k, X.shape[1])) 71 | for i in range(self.k): 72 | mask = assignments == i 73 | new_centroids[i] = np.mean(X[mask], axis=0) # [d] 74 | assert new_centroids.shape == (self.k, X.shape[1]), f"Expected new_centroids to have shape ({self.k}, {X.shape[1]}) but got {new_centroids.shape}" 75 | return new_centroids 76 | 77 | def _distance_threshold_convergence(self, centroids: np.array, old_centroids: np.array, threshold: float) -> bool: 78 | # centroids: [k, d] 79 | # old_centroids: [k, d] 80 | # threshold: float 81 | distances = np.linalg.norm(centroids - old_centroids, axis=1) # [k] 82 | return np.all(distances < threshold) 83 | 84 | def _check_stop_criteria(self, stop_criteria: StoppingCriteria, max_iterations: int, dist_threshold: float, old_centroids: np.array, new_centroids: np.array, it: int) -> bool: 85 | if stop_criteria & StoppingCriteria.MaxIterations: 86 | if it >= max_iterations: 87 | return True 88 | if stop_criteria & StoppingCriteria.CentroidsChangeThreshold: 89 | if old_centroids is not None and self._distance_threshold_convergence(new_centroids, old_centroids, dist_threshold): 90 | return True 91 | return False 92 | 93 | def fit(self, X: np.array, stop_criteria: StoppingCriteria, max_iterations: Optional[int] = None, centroids_change_threshold: Optional[float] = None) -> tuple[np.array, np.array]: 94 | 95 | if 
self.k > X.shape[0]: 96 | raise ValueError(f"Number of clusters k={self.k} should be less than or equal to the number of data points N={X.shape[0]}") 97 | 98 | # By default, the convergence criterion is set 99 | if StoppingCriteria.Convergence not in stop_criteria: 100 | stop_criteria |= StoppingCriteria.Convergence 101 | 102 | # Check stopping criteria requirements 103 | if stop_criteria & StoppingCriteria.MaxIterations: 104 | assert max_iterations is not None, "max_iterations should be provided" 105 | if stop_criteria & StoppingCriteria.CentroidsChangeThreshold: 106 | assert centroids_change_threshold is not None, "centroids_change_threshold should be provided" 107 | 108 | old_centroids = None 109 | old_assignments = None 110 | 111 | # X: [N, d] 112 | new_centroids = self._initialize_centroids(X) 113 | 114 | # Initialize the cluster assignments 115 | new_assignments = self._assign_clusters(X, new_centroids) 116 | 117 | it = 0 118 | 119 | while not self._check_stop_criteria(stop_criteria, max_iterations, centroids_change_threshold, old_centroids, new_centroids, it): 120 | old_centroids, old_assignments = new_centroids, new_assignments 121 | 122 | # Update the centroids 123 | new_centroids = self._update_centroids(X, new_assignments) 124 | # Re-assign the clusters 125 | new_assignments = self._assign_clusters(X, new_centroids) 126 | # Check if the assignments have changed 127 | if StoppingCriteria.Convergence in stop_criteria and np.array_equal(old_assignments, new_assignments): 128 | break 129 | it += 1 130 | 131 | return new_assignments, new_centroids 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | -------------------------------------------------------------------------------- /kmeans/test_kmeans .py: -------------------------------------------------------------------------------- 1 | import pytest 2 | import numpy as np 3 | from kmeans.kmeans import KMeans, InitializationMethod, StoppingCriteria 4 | from sklearn.datasets import make_blobs 5 | 6 | def test_initialization_random(): 7 | # Randomly generate N data points with d features 8 | N = 100 9 | d = 2 10 | k = 3 11 | X = np.random.rand(N, d) 12 | for init_method in [InitializationMethod.Random, InitializationMethod.KMeansPlusPLus]: 13 | kmeans = KMeans(k=k, init_method=init_method) 14 | assert kmeans.k == k 15 | assert kmeans.init_method == init_method 16 | centroids = kmeans._initialize_centroids(X) 17 | assert centroids.shape == (k, d) 18 | 19 | def test_assign_clusters(): 20 | N = 100 21 | d = 32 22 | k = 5 23 | 24 | # Generate data points from gaussian clusters 25 | X, y = make_blobs(n_samples=N, n_features=d, centers=k, cluster_std=0.60, random_state=0, shuffle=True) 26 | 27 | for init_method in [InitializationMethod.Random, InitializationMethod.KMeansPlusPLus]: 28 | kmeans = KMeans(k=k, init_method=init_method) 29 | assignments, centroids = kmeans.fit(X, stop_criteria=StoppingCriteria.Convergence) 30 | assert assignments.shape == (N,) 31 | assert centroids.shape == (k, d) 32 | -------------------------------------------------------------------------------- /knn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# K-nearest neighbors algorithm\n", 8 | "\n", 9 | "As the name implies, the k-nearest neighbors algorithm works by finding the nearest neighbors of some given data. For instance, let’s say we have a binary classification problem. 
If we set k to 10, the KNN model will look for the 10 nearest points to the presented data point. If, among the 10 observed neighbors, 8 have the label 0 and 2 are labeled 1, the KNN algorithm will conclude that the label of the provided data point is most likely also 0. As we can see, the KNN algorithm is extremely simple, but if we have enough data to feed it, it can produce highly accurate predictions.\n", 10 | "\n", 11 | "It can also be used for regression: for example, we can take the k nearest neighbors and average their values to predict the value of the presented data point for a particular feature.\n" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "name": "stderr", 21 | "output_type": "stream", 22 | "text": [ 23 | "/tmp/ipykernel_972/3349027751.py:6: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-