├── 01 Unsupervised Learning.pdf
├── 02 Clustering.pdf
├── 03_Distance_Metrics_in_ML.ipynb
├── 04 K Means Clustering.pdf
├── 05 Elbow Method.pdf
├── 06_K_Means_Clustering.ipynb
├── 07 Hierarchical Clustering.pdf
├── 08 Dendogram.pdf
├── 09_Hierarchical_Clustering.ipynb
├── 10 DBScan Clustering.pdf
├── 11_DBScan_Clustering.ipynb
├── 12 GMM Clustering.pdf
├── 13_Gaussian_Mixture_Model.ipynb
├── 14 Cluster Adjustment .pdf
├── 15 Silhouette Coefficient - Cluster Validation.pdf
├── 16 Disadvantage & Choosing Right Clustering .pdf
├── 17 Clustering Revision.pdf
├── 18 Clustering Interview Questions .pdf
├── 19 K Modes.pdf
├── 20_K_Modes.ipynb
└── README.md
/01 Unsupervised Learning.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/01 Unsupervised Learning.pdf -------------------------------------------------------------------------------- /02 Clustering.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/02 Clustering.pdf -------------------------------------------------------------------------------- /03_Distance_Metrics_in_ML.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "03 Distance Metrics in ML.ipynb", 7 | "provenance": [], 8 | "authorship_tag": "ABX9TyOIKYef8Rk3gjr3ltV1P/EU", 9 | "include_colab_link": true 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "language_info": { 16 | "name": "python" 17 | } 18 | }, 19 | "cells": [ 20 | { 21 | "cell_type": "markdown", 22 | "metadata": { 23 | "id": "view-in-github", 24 | "colab_type": "text" 25 | }, 26 | "source": [ 27 | "\"Open" 28 | ] 29 | }, 30 | { 31 
| "cell_type": "markdown", 32 | "metadata": { 33 | "id": "UmICao6hMfZS" 34 | }, 35 | "source": [ 36 | "### **Distance Metrics** \n", 37 | "Distance metrics are used in both supervised and unsupervised learning, **generally to calculate the similarity between data points.**\n", 38 | "\n", 39 | "**Types of Distance Metrics in Machine Learning**\n", 40 | "1. Euclidean Distance\n", 41 | "2. Manhattan Distance\n", 42 | "3. Minkowski Distance\n", 43 | "4. Hamming Distance\n", 44 | "\n", 45 | "**A few machine learning algorithms that use Distance Metrics**\n", 46 | "1. Clustering Algorithms\n", 47 | "2. Classification - KNN Classification" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": { 53 | "id": "wF5fKIpsNebO" 54 | }, 55 | "source": [ 56 | "#### **1. Euclidean Distance**\n", 57 | "Euclidean Distance represents the **shortest distance between two points.**\n", 58 | "\n", 59 | "Many machine learning algorithms, including **K-Means, use this distance metric to measure the similarity between observations.**" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "metadata": { 65 | "id": "bSGqH1KKMp34" 66 | }, 67 | "source": [ 68 | "# Importing necessary libraries \n", 69 | "from scipy.spatial import distance # To calculate distances\n", 70 | "from google.colab import files\n", 71 | "from IPython.display import Image\n", 72 | "#uploaded = files.upload() # To import image from computer/tab" 73 | ], 74 | "execution_count": 1, 75 | "outputs": [] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "metadata": { 80 | "colab": { 81 | "base_uri": "https://localhost:8080/", 82 | "height": 474 83 | }, 84 | "id": "9KaTXQZbOJu6", 85 | "outputId": "d6237fc9-239c-4d4b-a844-82bbe4312cfb" 86 | }, 87 | "source": [ 88 | "print('Euclidean Distance Theory Overview : Calculate by Pythagorean Theorem\\n')\n", 89 | "Image('1A876F28-65F9-4243-B0C2-E4FBC2DDB56D.jpeg', width = 500)" 90 | ], 91 | "execution_count": null, 92 | "outputs": [ 93 | { 94 | "output_type": "stream", 95 | "name": "stdout", 96 | "text": [ 
97 | "Euclidean Distance Theory Overview : Calculate by Pythagorean Theorem\n", 98 | "\n" 99 | ] 100 | }, 101 | { 102 | "output_type": "execute_result", 103 | "data": { 104 | "image/jpeg": "\n", 105 | "text/plain": [ 106 | "" 107 | ] 108 | }, 109 | "metadata": { 110 | "image/jpeg": { 111 | "width": 500 112 | } 113 | }, 114 | "execution_count": 23 115 | } 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "metadata": { 121 | "colab": { 122 | "base_uri": "https://localhost:8080/" 123 | }, 124 | "id": "6kn0KZCDP1l5", 125 | "outputId": "297798c5-ef74-4501-b481-8e3b06e0bfe4" 126 | }, 127 | "source": [ 128 | "# Euclidean Distance\n", 129 | "# defining the points\n", 130 | "point_1 = (1, 2, 3)\n", 131 | "point_2 = (4, 5, 6)\n", 132 | "\n", 133 | "print('First Data Points :',point_1)\n", 134 | "print('Second Data Points :', point_2)\n", 135 | "\n", 136 | "#computing the euclidean distance\n", 137 | "euclidean_distance = distance.euclidean(point_1, point_2)\n", 138 | "print('\\nEuclidean Distance b/w', point_1, 'and', point_2, 'is: ', euclidean_distance)" 139 | ], 140 | "execution_count": 2, 141 | "outputs": [ 142 | { 143 | "output_type": "stream", 144 | "name": "stdout", 145 | "text": [ 146 | "First Data Points : (1, 2, 3)\n", 147 | "Second Data Points : (4, 5, 6)\n", 148 | "\n", 149 | "Euclidean Distance b/w (1, 2, 3) and (4, 5, 6) is: 5.196152422706632\n" 150 | ] 151 | } 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": { 157 | "id": "ZWkrHAHUna41" 158 | }, 159 | "source": [ 160 | "#### **2. Manhattan Distance**\n", 161 | "##### Manhattan Distance is the sum of absolute differences between points across all the dimensions." 
162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "metadata": { 167 | "colab": { 168 | "base_uri": "https://localhost:8080/", 169 | "height": 400 170 | }, 171 | "id": "Gooz82NHUNGD", 172 | "outputId": "70c1d2ec-ee0d-47cc-b33d-3a08041fe83d" 173 | }, 174 | "source": [ 175 | "print('Manhattan Distance vs Euclidean Distance')\n", 176 | "Image('98236A32-D4FA-4359-B42C-377B289C90BD.png', width = 400)" 177 | ], 178 | "execution_count": 14, 179 | "outputs": [ 180 | { 181 | "output_type": "stream", 182 | "name": "stdout", 183 | "text": [ 184 | "Manhattan Distance vs Euclidean Distance\n" 185 | ] 186 | }, 187 | { 188 | "output_type": "execute_result", 189 | "data": { 190 | "image/png": "\n",
191 | "text/plain": [ 192 | "" 193 | ] 194 | }, 195 | "metadata": { 196 | "image/png": { 197 | "width": 400 198 | } 199 | }, 200 | "execution_count": 14 201 | } 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "metadata": { 207 | "colab": { 208 | "base_uri": "https://localhost:8080/" 209 | }, 210 | "id": "fbQQMjfcoARI", 211 | "outputId": "35acc408-d758-4207-bef5-ff0263289007" 212 | }, 213 | "source": [ 214 | "# Manhattan Distance vs Euclidean Distance\n", 215 | "# defining the points\n", 216 | "point_1 = (1, 1)\n", 217 | "point_2 = (5, 4)\n", 218 | "\n", 219 | "print('First Data Points :',point_1)\n", 220 | "print('Second Data Points :', point_2)\n", 221 | "\n", 222 | "#computing the euclidean distance\n", 223 | "euclidean_distance = distance.euclidean(point_1, point_2)\n", 224 | "print('\\nEuclidean Distance b/w', point_1, 'and', point_2, 'is: ', euclidean_distance)\n", 225 | "\n", 226 | "# computing the manhattan distance\n", 227 | "manhattan_distance = distance.cityblock(point_1, point_2)\n", 228 | "print('Manhattan Distance b/w', point_1, 'and', point_2, 'is: ', manhattan_distance)" 229 | ], 230 | "execution_count": 3, 231 | "outputs": [ 232 | { 233 | "output_type": "stream", 234 | "name": "stdout", 235 | "text": [ 236 | "First Data Points : (1, 1)\n", 237 | "Second Data Points : (5, 4)\n", 238 | "\n", 239 | "Euclidean Distance b/w (1, 1) and (5, 4) is: 5.0\n", 240 | "Manhattan Distance b/w (1, 1) and (5, 4) is: 7\n" 241 | ] 242 | } 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": { 248 | "id": "9NvuapSFykzT" 249 | }, 250 | "source": [ 251 | "#### **3.
Minkowski Distance**\n", 252 | "\n", 253 | "**Minkowski Distance is the generalized form of Euclidean and Manhattan Distance**\n", 254 | "- If Lambda = 1, it reduces to Manhattan Distance\n", 255 | "- If Lambda = 2, it reduces to Euclidean Distance\n", 256 | "\n", 257 | "**In the SciPy package, the order (Lambda) is set through the p parameter of the Minkowski Distance function** \n", 258 | "1. When the order (p) = 1, it represents Manhattan Distance \n", 259 | "2. When the order (p) = 2, it represents Euclidean Distance." 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "metadata": { 265 | "id": "RCLL3UdooVRB", 266 | "colab": { 267 | "base_uri": "https://localhost:8080/", 268 | "height": 306 269 | }, 270 | "outputId": "3976e8e5-5aa4-470e-bb0e-50a8625cfcc3" 271 | }, 272 | "source": [ 273 | "print('Manhattan Distance vs Euclidean Distance\\n')\n", 274 | "Image('11E4B729-8426-4200-92D6-C61D3307D082.png', width = 600)" 275 | ], 276 | "execution_count": 10, 277 | "outputs": [ 278 | { 279 | "output_type": "stream", 280 | "name": "stdout", 281 | "text": [ 282 | "Manhattan Distance vs Euclidean Distance\n", 283 | "\n" 284 | ] 285 | }, 286 | { 287 | "output_type": "execute_result", 288 | "data": { 289 | "image/png": "\n", 290 | "text/plain": [ 291 | "" 292 | ] 293 | }, 294 | "metadata": { 295 | "image/png": { 296 | "width": 600 297 | } 298 | }, 299 | "execution_count": 10 300 | } 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "metadata": { 306 | "colab": { 307 | "base_uri": "https://localhost:8080/" 308 | }, 309 | "id": "L8bNZuTMyZeF", 310 | "outputId": "fbe6aef7-087b-482c-cc05-de00860f9f74" 311 | }, 312 | "source": [ 313 | "# Minkowski Distance vs Manhattan and Euclidean Distance\n", 314 | "# defining the points\n", 315 | "point_1 = (1, 1)\n", 316 | "point_2 = (5, 4)\n", 317 | "\n", 318 | "print('First Data Points :',point_1)\n", 319 | "print('Second Data Points :', point_2)\n", 320 | "\n", 321 | "#computing the euclidean distance\n", 322 | "euclidean_distance = 
distance.euclidean(point_1, point_2)\n", 323 | "print('\\nEuclidean Distance b/w', point_1, 'and', point_2, 'is: ', euclidean_distance)\n", 324 | "\n", 325 | "# computing the manhattan distance\n", 326 | "manhattan_distance = distance.cityblock(point_1, point_2)\n", 327 | "print('\\nManhattan Distance b/w', point_1, 'and', point_2, 'is: ', manhattan_distance)\n", 328 | "\n", 329 | "# computing the minkowski distance\n", 330 | "minkowski_distance = distance.minkowski(point_1, point_2, p=1)\n", 331 | "print('\\nMinkowski Distance (p=1, Manhattan) b/w', point_1, 'and', point_2, 'is: ', minkowski_distance)\n", 332 | "\n", 333 | "minkowski_distance = distance.minkowski(point_1, point_2, p=2)\n", 334 | "print('\\nMinkowski Distance (p=2, Euclidean) b/w', point_1, 'and', point_2, 'is: ', minkowski_distance)\n", 335 | "\n", 336 | "minkowski_distance = distance.minkowski(point_1, point_2, p=3)\n", 337 | "print('\\nMinkowski Distance (p=3) b/w', point_1, 'and', point_2, 'is: ', minkowski_distance)" 338 | ], 339 | "execution_count": 4, 340 | "outputs": [ 341 | { 342 | "output_type": "stream", 343 | "name": "stdout", 344 | "text": [ 345 | "First Data Points : (1, 1)\n", 346 | "Second Data Points : (5, 4)\n", 347 | "\n", 348 | "Euclidean Distance b/w (1, 1) and (5, 4) is: 5.0\n", 349 | "\n", 350 | "Manhattan Distance b/w (1, 1) and (5, 4) is: 7\n", 351 | "\n", 352 | "Minkowski Distance (p=1, Manhattan) b/w (1, 1) and (5, 4) is: 7.0\n", 353 | "\n", 354 | "Minkowski Distance (p=2, Euclidean) b/w (1, 1) and (5, 4) is: 5.0\n", 355 | "\n", 356 | "Minkowski Distance (p=3) b/w (1, 1) and (5, 4) is: 4.497941445275415\n" 357 | ] 358 | } 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": { 364 | "id": "aOeTLnhw2GWO" 365 | }, 366 | "source": [ 367 | "#### **4. 
Hamming Distance**\n", 368 | "- Hamming Distance measures the **similarity between two strings of the same length** \n", 369 | "\n", 370 | "- The Hamming Distance **between two strings of the same length is the number of positions at which the corresponding characters are different.**\n", 371 | "\n", 372 | "Let’s say we have two strings:\n", 373 | "\n", 374 | "**“euclidean” and “manhattan”**\n", 375 | "\n", 376 | "- Since the length of these strings is equal, we can calculate the Hamming Distance. \n", 377 | "- We will go character by character and match the strings.\n", 378 | "- The first characters of the two strings (e and m respectively) are different. Similarly, the second characters (u and a) are different, and so on.\n", 379 | "- **Look carefully** – seven characters are different, whereas two characters (the last two) are the same:\n", 380 | "\n", 381 | "**“euclide--an” and “manhatt--an”**\n", 382 | "\n", 383 | "- Hence, the **Hamming Distance here will be 7.**\n", 384 | "- **Note that the larger the Hamming Distance between two strings, the more dissimilar those strings are (and vice versa).**\n", 385 | "\n" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "metadata": { 391 | "colab": { 392 | "base_uri": "https://localhost:8080/" 393 | }, 394 | "id": "Mjg8Vhnez1fO", 395 | "outputId": "8ec6a00e-4c6b-4b8c-f37f-c7be88ba13b2" 396 | }, 397 | "source": [ 398 | "# Hamming Distance\n", 399 | "# defining two strings\n", 400 | "string_1 = 'euclidean'\n", 401 | "string_2 = 'manhattan'\n", 402 | "\n", 403 | "# computing the hamming distance (scipy returns the fraction of mismatched positions, so scale by the length)\n", 404 | "hamming_distance = distance.hamming(list(string_1), list(string_2))*len(string_1)\n", 405 | "print('Hamming Distance b/w', string_1, 'and', string_2, 'is: ', hamming_distance)" 406 | ], 407 | "execution_count": 5, 408 | "outputs": [ 409 | { 410 | "output_type": "stream", 411 | "name": "stdout", 412 | "text": [ 413 | "Hamming Distance b/w euclidean and manhattan is: 7.0\n" 414 | ] 415 | } 416 
| ] 417 | } 418 | ] 419 | } -------------------------------------------------------------------------------- /04 K Means Clustering.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/04 K Means Clustering.pdf -------------------------------------------------------------------------------- /05 Elbow Method.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/05 Elbow Method.pdf -------------------------------------------------------------------------------- /07 Hierarchical Clustering.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/07 Hierarchical Clustering.pdf -------------------------------------------------------------------------------- /08 Dendogram.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/08 Dendogram.pdf -------------------------------------------------------------------------------- /10 DBScan Clustering.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/10 DBScan Clustering.pdf -------------------------------------------------------------------------------- /12 GMM Clustering.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/12 GMM Clustering.pdf 
-------------------------------------------------------------------------------- /14 Cluster Adjustment .pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/14 Cluster Adjustment .pdf -------------------------------------------------------------------------------- /15 Silhouette Coefficient - Cluster Validation.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/15 Silhouette Coefficient - Cluster Validation.pdf -------------------------------------------------------------------------------- /16 Disadvantage & Choosing Right Clustering .pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/16 Disadvantage & Choosing Right Clustering .pdf -------------------------------------------------------------------------------- /17 Clustering Revision.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/17 Clustering Revision.pdf -------------------------------------------------------------------------------- /18 Clustering Interview Questions .pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/18 Clustering Interview Questions .pdf -------------------------------------------------------------------------------- /19 K Modes.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/sandipanpaul21/Clustering-in-Python/0ad56526cd8d59c14387dbae855c0eafd09e26a5/19 K Modes.pdf -------------------------------------------------------------------------------- /20_K_Modes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "20 K Modes.ipynb", 7 | "provenance": [], 8 | "authorship_tag": "ABX9TyMrk2Mr4iYgvzKIUk9Vxnx5", 9 | "include_colab_link": true 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "language_info": { 16 | "name": "python" 17 | } 18 | }, 19 | "cells": [ 20 | { 21 | "cell_type": "markdown", 22 | "metadata": { 23 | "id": "view-in-github", 24 | "colab_type": "text" 25 | }, 26 | "source": [ 27 | "\"Open" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "source": [ 33 | "**Clustering** is an *unsupervised learning* method whose task is to *divide the population or data points into a number of groups*, such that data points in a group are more similar to other data points in the same group and dissimilar to the data points in other groups. \n", 34 | "\n", 35 | "In other words, it groups objects based on the similarity and dissimilarity between them.\n", 36 | "\n", 37 | "**KMeans** uses mathematical measures (distance) to cluster continuous data. The smaller the distance, the more similar our data points are. Centroids are updated using means.\n", 38 | "\n", 39 | "**But for categorical data points**, we cannot calculate such a distance. So we use the **KModes algorithm**. It uses the dissimilarities (total mismatches) between the data points. The fewer the dissimilarities, the more similar our data points are. It uses modes instead of means.\n", 40 | "\n", 41 | "**Steps in K-Modes**:\n", 42 | "1. Pick K observations at random and use them as leaders/clusters (choose K using the Elbow Method)\n", 43 | "2. 
Calculate the dissimilarities and assign each observation to its closest cluster\n", 44 | "3. Define new modes for the clusters\n", 45 | "4. Repeat steps 2–3 until no re-assignment is required" 46 | ], 47 | "metadata": { 48 | "id": "LbLS41Inmb-S" 49 | } 50 | }, 51 | { 52 | "cell_type": "code", 53 | "source": [ 54 | "# importing necessary libraries\n", 55 | "import pandas as pd\n", 56 | "import numpy as np\n", 57 | "#!pip install kmodes\n", 58 | "from kmodes.kmodes import KModes\n", 59 | "import matplotlib.pyplot as plt\n", 60 | "%matplotlib inline" 61 | ], 62 | "metadata": { 63 | "id": "uMcZ58AknJth" 64 | }, 65 | "execution_count": 1, 66 | "outputs": [] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "source": [ 71 | "# Create toy dataset\n", 72 | "hair_color = np.array(['blonde', 'brunette', 'red', 'black', 'brunette', 'black', 'red', 'black'])\n", 73 | "eye_color = np.array(['amber', 'gray', 'green', 'hazel', 'amber', 'gray', 'green', 'hazel'])\n", 74 | "skin_color = np.array(['fair', 'brown', 'brown', 'brown', 'fair', 'brown', 'fair', 'fair'])\n", 75 | "person = ['P1','P2','P3','P4','P5','P6','P7','P8']\n", 76 | "data = pd.DataFrame({'person':person, 'hair_color':hair_color, 'eye_color':eye_color, 'skin_color':skin_color})\n", 77 | "data = data.set_index('person')\n", 78 | "data" 79 | ], 80 | "metadata": { 81 | "colab": { 82 | "base_uri": "https://localhost:8080/", 83 | "height": 328 84 | }, 85 | "id": "iFFnIJRJnRbU", 86 | "outputId": "073896f8-66d6-4940-c21d-fd1fa93a47a3" 87 | }, 88 | "execution_count": 2, 89 | "outputs": [ 90 | { 91 | "output_type": "execute_result", 92 | "data": { 93 | "text/html": [ 94 | "
" 175 | ], 176 | "text/plain": [ 177 | " hair_color eye_color skin_color\n", 178 | "person \n", 179 | "P1 blonde amber fair\n", 180 | "P2 brunette gray brown\n", 181 | "P3 red green brown\n", 182 | "P4 black hazel brown\n", 183 | "P5 brunette amber fair\n", 184 | "P6 black gray brown\n", 185 | "P7 red green fair\n", 186 | "P8 black hazel fair" 187 | ] 188 | }, 189 | "metadata": {}, 190 | "execution_count": 2 191 | } 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "source": [ 197 | "**Scree Plot or Elbow curve to find optimal K value**\n", 198 | "- For KModes, plot cost for a range of K values. Cost is the sum of the dissimilarities between each data point and the mode of its assigned cluster.\n", 199 | "- Select the K at which the curve shows an elbow-like bend, beyond which the cost decreases only slowly.\n", 200 | "\n" 201 | ], 202 | "metadata": { 203 | "id": "sXNaiUl2njHF" 204 | } 205 | }, 206 | { 207 | "cell_type": "code", 208 | "source": [ 209 | "# Elbow curve to find optimal K\n", 210 | "cost = []\n", 211 | "K = range(1,5)\n", 212 | "for num_clusters in list(K):\n", 213 | " kmode = KModes(n_clusters=num_clusters, init = \"random\", n_init = 5, verbose=1)\n", 214 | " kmode.fit_predict(data)\n", 215 | " cost.append(kmode.cost_)\n", 216 | " \n", 217 | "plt.plot(K, cost, 'bx-')\n", 218 | "plt.xlabel('No. 
of clusters')\n", 219 | "plt.ylabel('Cost')\n", 220 | "plt.title('Elbow Method For Optimal k')\n", 221 | "plt.show()" 222 | ], 223 | "metadata": { 224 | "colab": { 225 | "base_uri": "https://localhost:8080/", 226 | "height": 1000 227 | }, 228 | "id": "4YBfEadpndMK", 229 | "outputId": "a1473766-621f-4827-c57e-d12ea37fa1a0" 230 | }, 231 | "execution_count": 3, 232 | "outputs": [ 233 | { 234 | "output_type": "stream", 235 | "name": "stdout", 236 | "text": [ 237 | "Init: initializing centroids\n", 238 | "Init: initializing clusters\n", 239 | "Starting iterations...\n", 240 | "Run 1, iteration: 1/100, moves: 0, cost: 15.0\n", 241 | "Init: initializing centroids\n", 242 | "Init: initializing clusters\n", 243 | "Starting iterations...\n", 244 | "Run 2, iteration: 1/100, moves: 0, cost: 15.0\n", 245 | "Init: initializing centroids\n", 246 | "Init: initializing clusters\n", 247 | "Starting iterations...\n", 248 | "Run 3, iteration: 1/100, moves: 0, cost: 15.0\n", 249 | "Init: initializing centroids\n", 250 | "Init: initializing clusters\n", 251 | "Starting iterations...\n", 252 | "Run 4, iteration: 1/100, moves: 0, cost: 15.0\n", 253 | "Init: initializing centroids\n", 254 | "Init: initializing clusters\n", 255 | "Starting iterations...\n", 256 | "Run 5, iteration: 1/100, moves: 0, cost: 15.0\n", 257 | "Best run was number 1\n", 258 | "Init: initializing centroids\n", 259 | "Init: initializing clusters\n", 260 | "Starting iterations...\n", 261 | "Run 1, iteration: 1/100, moves: 1, cost: 9.0\n", 262 | "Init: initializing centroids\n", 263 | "Init: initializing clusters\n", 264 | "Starting iterations...\n", 265 | "Run 2, iteration: 1/100, moves: 0, cost: 9.0\n", 266 | "Init: initializing centroids\n", 267 | "Init: initializing clusters\n", 268 | "Starting iterations...\n", 269 | "Run 3, iteration: 1/100, moves: 0, cost: 9.0\n", 270 | "Init: initializing centroids\n", 271 | "Init: initializing clusters\n", 272 | "Starting iterations...\n", 273 | "Run 4, iteration: 1/100, 
moves: 1, cost: 9.0\n", 274 | "Init: initializing centroids\n", 275 | "Init: initializing clusters\n", 276 | "Starting iterations...\n", 277 | "Run 5, iteration: 1/100, moves: 3, cost: 9.0\n", 278 | "Run 5, iteration: 2/100, moves: 0, cost: 9.0\n", 279 | "Best run was number 1\n", 280 | "Init: initializing centroids\n", 281 | "Init: initializing clusters\n", 282 | "Starting iterations...\n", 283 | "Run 1, iteration: 1/100, moves: 1, cost: 7.0\n", 284 | "Run 1, iteration: 2/100, moves: 1, cost: 7.0\n", 285 | "Init: initializing centroids\n", 286 | "Init: initializing clusters\n", 287 | "Starting iterations...\n", 288 | "Run 2, iteration: 1/100, moves: 0, cost: 9.0\n", 289 | "Init: initializing centroids\n", 290 | "Init: initializing clusters\n", 291 | "Starting iterations...\n", 292 | "Run 3, iteration: 1/100, moves: 0, cost: 7.0\n", 293 | "Init: initializing centroids\n", 294 | "Init: initializing clusters\n", 295 | "Starting iterations...\n", 296 | "Run 4, iteration: 1/100, moves: 0, cost: 6.0\n", 297 | "Init: initializing centroids\n", 298 | "Init: initializing clusters\n", 299 | "Starting iterations...\n", 300 | "Run 5, iteration: 1/100, moves: 0, cost: 8.0\n", 301 | "Best run was number 4\n", 302 | "Init: initializing centroids\n", 303 | "Init: initializing clusters\n", 304 | "Starting iterations...\n", 305 | "Run 1, iteration: 1/100, moves: 0, cost: 6.0\n", 306 | "Init: initializing centroids\n", 307 | "Init: initializing clusters\n", 308 | "Starting iterations...\n", 309 | "Run 2, iteration: 1/100, moves: 1, cost: 6.0\n", 310 | "Init: initializing centroids\n", 311 | "Init: initializing clusters\n", 312 | "Starting iterations...\n", 313 | "Run 3, iteration: 1/100, moves: 2, cost: 4.0\n", 314 | "Init: initializing centroids\n", 315 | "Init: initializing clusters\n", 316 | "Starting iterations...\n", 317 | "Run 4, iteration: 1/100, moves: 1, cost: 6.0\n", 318 | "Init: initializing centroids\n", 319 | "Init: initializing clusters\n", 320 | "Starting 
iterations...\n", 321 | "Run 5, iteration: 1/100, moves: 0, cost: 6.0\n", 322 | "Best run was number 3\n" 323 | ] 324 | }, 325 | { 326 | "output_type": "display_data", 327 | "data": { 328 | "image/png": "\n", 329 | "text/plain": [ 330 | "
" 331 | ] 332 | }, 333 | "metadata": { 334 | "needs_background": "light" 335 | } 336 | } 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "source": [ 342 | "# Elbow Curve\n", 343 | "# We can see a bend at K=3 in the above graph, indicating that 3 is the optimal number of clusters.\n", 344 | "# Build a model with 3 clusters\n", 345 | "\n", 346 | "# Fit KModes with 3 clusters and get one cluster label per row\n", 347 | "kmode = KModes(n_clusters=3, init=\"random\", n_init=5, verbose=1)\n", 348 | "clusters = kmode.fit_predict(data)\n", 349 | "clusters" 350 | ], 351 | "metadata": { 352 | "colab": { 353 | "base_uri": "https://localhost:8080/" 354 | }, 355 | "id": "LuQVfC6SnvcK", 356 | "outputId": "66163801-106e-483a-9b5d-3d5af5657af8" 357 | }, 358 | "execution_count": 4, 359 | "outputs": [ 360 | { 361 | "output_type": "stream", 362 | "name": "stdout", 363 | "text": [ 364 | "Init: initializing centroids\n", 365 | "Init: initializing clusters\n", 366 | "Starting iterations...\n", 367 | "Run 1, iteration: 1/100, moves: 3, cost: 7.0\n", 368 | "Run 1, iteration: 2/100, moves: 0, cost: 7.0\n", 369 | "Init: initializing centroids\n", 370 | "Init: initializing clusters\n", 371 | "Starting iterations...\n", 372 | "Run 2, iteration: 1/100, moves: 0, cost: 7.0\n", 373 | "Init: initializing centroids\n", 374 | "Init: initializing clusters\n", 375 | "Starting iterations...\n", 376 | "Run 3, iteration: 1/100, moves: 2, cost: 6.0\n", 377 | "Run 3, iteration: 2/100, moves: 0, cost: 6.0\n", 378 | "Init: initializing centroids\n", 379 | "Init: initializing clusters\n", 380 | "Starting iterations...\n", 381 | "Run 4, iteration: 1/100, moves: 1, cost: 8.0\n", 382 | "Run 4, iteration: 2/100, moves: 1, cost: 8.0\n", 383 | "Init: initializing centroids\n", 384 | "Init: initializing clusters\n", 385 | "Starting iterations...\n", 386 | "Run 5, iteration: 1/100, moves: 1, cost: 8.0\n", 387 | "Best run was number 3\n" 388 | ] 389 | }, 390 | { 391 | "output_type": "execute_result", 392 | "data": { 393 | "text/plain":
[ 394 | "array([1, 0, 2, 0, 1, 0, 2, 0], dtype=uint16)" 395 | ] 396 | }, 397 | "metadata": {}, 398 | "execution_count": 4 399 | } 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "source": [ 405 | "# Finally, insert the predicted cluster values in our original dataset.\n", 406 | "\n", 407 | "data.insert(0, \"Cluster\", clusters, True)\n", 408 | "data" 409 | ], 410 | "metadata": { 411 | "colab": { 412 | "base_uri": "https://localhost:8080/", 413 | "height": 328 414 | }, 415 | "id": "-22ofNEXn5KI", 416 | "outputId": "6baff152-5e67-4645-81ec-150b560323f0" 417 | }, 418 | "execution_count": 5, 419 | "outputs": [ 420 | { 421 | "output_type": "execute_result", 422 | "data": { 423 | "text/html": [ 424 | "
\n", 425 | "
Cluster hair_color eye_color skin_color
person
P1 1 blonde amber fair
P2 0 brunette gray brown
P3 2 red green brown
P4 0 black hazel brown
P5 1 brunette amber fair
P6 0 black gray brown
P7 2 red green fair
P8 0 black hazel fair
\n", 514 | "
" 515 | ], 516 | "text/plain": [ 517 | " Cluster hair_color eye_color skin_color\n", 518 | "person \n", 519 | "P1 1 blonde amber fair\n", 520 | "P2 0 brunette gray brown\n", 521 | "P3 2 red green brown\n", 522 | "P4 0 black hazel brown\n", 523 | "P5 1 brunette amber fair\n", 524 | "P6 0 black gray brown\n", 525 | "P7 2 red green fair\n", 526 | "P8 0 black hazel fair" 527 | ] 528 | }, 529 | "metadata": {}, 530 | "execution_count": 5 531 | } 532 | ] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "source": [ 537 | "Inference from the model predictions shown above: P1, P5 form one cluster; P3, P7 form another; and P2, P4, P6, P8 form the third. Because `init` is \"random\", the assignment can differ between runs.\n", 538 | "\n", 539 | "The results of our theoretical approach are in line with the model predictions. 🙌" 540 | ], 541 | "metadata": { 542 | "id": "Ms9uTc-boMTg" 543 | } 544 | } 545 | ] 546 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Welcome to Clustering (Theory & Code) 2 | 3 | ### 01 Unsupervised Learning (Theory) 4 | * What is Unsupervised Learning & Goals of Unsupervised Learning 5 | * Types of Unsupervised Learning: 1.Clustering, 2.Association Rule & 3.Dimensionality Reduction 6 | 7 | ### 02 Clustering (Theory) 8 | * Definition and Applications of Clustering 9 | * 4 methods: 1.K Means 2.Hierarchical 3.DBScan & 4.Gaussian Mixture 10 | 11 | ### 03 Euclidean & Manhattan Distance (Theory) 12 | * If two points are near each other, chances are they are similar 13 | * Distance Measure between two points 14 | 1. Euclidean Distance: Square root of the sum of squared coordinate differences between two points 15 | 2.
Manhattan Distance: Sum of absolute coordinate differences between two points 16 | 17 | ### 04 K-Means Clustering (Theory) 18 | * How the Algorithm works (Step Wise Calculation) 19 | * Pre-processing required for K Means 20 | * Determining the optimal value of K: 1.Profiling Approach & 2.Elbow Method 21 | 22 | ### 05 Elbow Method (Theory) 23 | * Working of Elbow Method with Example 24 | * 3 concepts: 1.Total Error, 2.Variance/Total Squared Error & 3.Within-Cluster Sum of Squares (WCSS) 25 | 26 | ### 06 K Means Clustering (Python Code) 27 | * Define number of clusters, take centroids and measure distance 28 | * Euclidean Distance : Measure distance between points 29 | * Number of Clusters defined by Elbow Method 30 | * Elbow Method : WCSS vs Number of Clusters 31 | * Silhouette Score : Goodness of Clustering 32 | 33 | ### 07 Hierarchical Clustering (Theory) 34 | * Two Approaches: 1.Agglomerative(Bottom-Up) & 2.Divisive(Top-Down) 35 | * Types of Linkages: 36 | 1. Single Linkage - Nearest Neighbour (Minimal intercluster dissimilarity) 37 | 2. Complete Linkage - Farthest Neighbour (Maximal intercluster dissimilarity) 38 | 3. Average Linkage - Average Distance (Mean intercluster dissimilarity) 39 | * Steps in Agglomerative Hierarchical Clustering with Single Linkage 40 | * Determining optimal number of Clusters: Dendrogram 41 | 42 | ### 08 Dendrogram (Theory) 43 | * Hierarchical relationship between objects 44 | * Optimal number of Clusters for Hierarchical Clustering 45 | 46 | ### 09 Hierarchical Clustering (Python Code) 47 | * Types of HC 48 | 1. Agglomerative : Bottom Up approach 49 | 2. Divisive : Top Down approach 50 | * Number of Clusters defined by Dendrogram 51 | * Dendrogram : Joining datapoints based on distance & creating clusters 52 | * Linkage : How the distance between two clusters is calculated from their points 53 | 1. Single linkage : Minimum Distance between two clusters 54 | 2. Complete linkage : Maximum Distance between two clusters 55 | 3.
Average linkage : Average Distance between two clusters 56 | 57 | ### 10 DBScan Clustering (Theory) 58 | * Density Based Clustering 59 | * K Means & Hierarchical are good for compact & well separated Data 60 | * Both are sensitive to Outliers & Noise 61 | * DBScan overcomes these issues & works well with Outliers 62 | * 2 important parameters - 63 | 1. eps: If the distance between 2 points is lower than or equal to eps, they are neighbours 64 | 2. MinPts: Minimum number of neighbours/data points within eps radius 65 | 66 | ### 11 DBScan Clustering (Python Code) 67 | * No need to pre-define the number of clusters 68 | * Distance metric is Euclidean Distance 69 | * Need to give 2 parameters 70 | 1. eps : Radius of the circle 71 | 2. min_samples : Minimum number of data points within eps to consider it a cluster 72 | 73 | ### 12 GMM Clustering (Theory) 74 | * Weakness of K Means 75 | * Expectation Maximization(EM) method 76 | 77 | ### 13 Gaussian Mixture Model Clustering (Python Code) 78 | * Probabilistic Model 79 | * Uses Expectation-Maximization (EM) steps: 80 | 1. E Step : Probability of each datapoint belonging to each cluster 81 | 2. M Step : For each cluster, revise parameters based on those probabilities 82 | 83 | ### 14 Cluster Adjustment (Theory) 84 | * 2 Steps we normally do for Cluster Adjustment 85 | 1. Quality of Clustering (Cardinality & Magnitude) 86 | 2. Performance of Similarity Measure (Euclidean Distance) 87 | 88 | ### 15 Silhouette Coefficient - Cluster Validation (Theory) 89 | * The closer the silhouette score is to 1, the better separated the clusters are 90 | * It is a metric used to calculate the goodness of a clustering technique 91 | * Its value ranges from -1 to 1. 92 | 1. 1: Means clusters are well apart from each other and clearly distinguished 93 | 2. 0: Means clusters are indifferent, or distance between clusters is not significant 94 | 3.
-1: Means clusters are assigned in the wrong way 95 | 96 | ### 16 Disadvantage & Choosing Right Clustering Method (Theory) 97 | * Disadvantages of each clustering technique 98 | * Based on the data, which is the right clustering method 99 | 100 | ### 17 Clustering Revision (Theory) 101 | * Short Description of Each Clustering Algorithm 102 | * Advantages, Disadvantages 103 | * When to use what 104 | 105 | ### 18 Interview Questions on Clustering (Theory) 106 | * Commonly asked questions on Clustering 107 | 108 | ### 19 K Modes (Theory) 109 | * For Categorical variable clustering, use K Modes 110 | * It uses the dissimilarities (total mismatches) between data points 111 | * The fewer the dissimilarities, the more similar the data points are 112 | * It uses the Mode (the most frequent value in each column) as the cluster centre 113 | 114 | ### 20 K Modes (Python Code) 115 | * K Modes code in Python 116 | --------------------------------------------------------------------------------
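The K Modes points in sections 19 and 20 of the README (dissimilarity as total mismatches, the mode as cluster centre) can be sketched in plain Python. This is a minimal illustration, not the `kmodes` library's implementation: the helper names `matching_dissimilarity`, `cluster_mode` and `clustering_cost` are made up for this example, and the data and cluster labels are copied from the notebook output shown earlier.

```python
from collections import Counter

# Toy data copied from the K Modes notebook output: (hair_color, eye_color, skin_color)
people = {
    "P1": ("blonde", "amber", "fair"),
    "P2": ("brunette", "gray", "brown"),
    "P3": ("red", "green", "brown"),
    "P4": ("black", "hazel", "brown"),
    "P5": ("brunette", "amber", "fair"),
    "P6": ("black", "gray", "brown"),
    "P7": ("red", "green", "fair"),
    "P8": ("black", "hazel", "fair"),
}

# Cluster labels as printed by the notebook run (a random-init run; other runs may differ)
labels = {"P1": 1, "P2": 0, "P3": 2, "P4": 0, "P5": 1, "P6": 0, "P7": 2, "P8": 0}


def matching_dissimilarity(a, b):
    """Hypothetical helper: count the attributes on which two records disagree."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(rows):
    """Hypothetical helper: most frequent value per column (ties go to the first value seen)."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*rows))


def clustering_cost(records, assignment):
    """Hypothetical helper: sum of mismatches between each record and its cluster's mode."""
    clusters = {}
    for name, row in records.items():
        clusters.setdefault(assignment[name], []).append(row)
    modes = {label: cluster_mode(rows) for label, rows in clusters.items()}
    cost = sum(
        matching_dissimilarity(row, modes[assignment[name]])
        for name, row in records.items()
    )
    return modes, cost


modes, total_cost = clustering_cost(people, labels)
print(modes)       # one mode tuple per cluster label
print(total_cost)  # 6
```

For these labels the total comes to 6, which agrees with the `cost: 6.0` reported for the best run in the notebook log. Ties in a column (for example gray vs. hazel in cluster 0) do not change the cost, because either choice leaves the same number of mismatches.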