├── Executing K Means in Python
│   ├── Clustering+Python+Lab.ipynb
│   ├── Hopkins+Statistic.ipynb
│   └── init
├── K Means Clustering
│   └── init
├── Other Forms of Clustering
│   ├── K-Mode+Bank+Marketing.ipynb
│   ├── K-Prototype+clustering (1).ipynb
│   └── init
└── README.md

/Executing K Means in Python/Hopkins+Statistic.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Hopkins Statistic\n",
8 | "The Hopkins statistic measures the cluster tendency of a dataset, in other words, how well the data can be clustered.\n",
9 | "\n",
10 | "- If the value is between 0.01 and 0.3, the data is regularly spaced.\n",
11 | "\n",
12 | "- If the value is around 0.5, the data is random.\n",
13 | "\n",
14 | "- If the value is between 0.7 and 0.99, the data has a high tendency to cluster."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "Some useful links for understanding the Hopkins statistic:\n",
22 | "- [Wikipedia](https://en.wikipedia.org/wiki/Hopkins_statistic)\n",
23 | "- [Article](http://www.sthda.com/english/articles/29-cluster-validation-essentials/95-assessing-clustering-tendency-essentials/)"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 2,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "from sklearn.neighbors import NearestNeighbors\n",
33 | "from random import sample\n",
34 | "from numpy.random import uniform\n",
35 | "import numpy as np\n",
36 | "from math import isnan\n",
37 | "\n",
38 | "def hopkins(X):\n",
39 | "    # X is expected to be a pandas DataFrame of numeric features\n",
40 | "    d = X.shape[1]  # columns\n",
41 | "    n = len(X)  # rows\n",
42 | "    m = int(0.1 * n)  # size of the random sample\n",
43 | "    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)\n",
44 | "\n",
45 | "    rand_X = sample(range(0, n, 1), m)\n",
46 | "\n",
47 | "    ujd = []  # nearest-neighbour distances of uniform random points\n",
48 | "    wjd = []  # nearest-neighbour distances of sampled data points\n",
49 | "    for j in range(0, m):\n",
50 | "        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X.values, axis=0), np.amax(X.values, axis=0), d).reshape(1, -1), 1, return_distance=True)\n",
51 | "        ujd.append(u_dist[0][0])  # a random point is not in X, so take its first neighbour\n",
52 | "        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)\n",
53 | "        wjd.append(w_dist[0][1])  # skip index 0, which is the point itself\n",
54 | "\n",
55 | "    H = sum(ujd) / (sum(ujd) + sum(wjd))\n",
56 | "    if isnan(H):\n",
57 | "        print(ujd, wjd)\n",
58 | "        H = 0\n",
59 | "\n",
60 | "    return H"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": null,
66 | "metadata": {},
67 | "outputs": [],
68 | "source": [
69 | "# First convert the NumPy array that you have to a DataFrame\n",
70 | "import pandas as pd\n",
71 | "rfm_df_scaled = pd.DataFrame(rfm_df_scaled)\n",
72 | "rfm_df_scaled.columns = ['amount', 'frequency', 'recency']"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "# Use the Hopkins statistic function by passing the above DataFrame as a parameter\n",
82 | "hopkins(rfm_df_scaled)"
83 | ]
84 | }
85 | ],
86 | "metadata": {
87 | "kernelspec": {
88 | "display_name": "Python 3",
89 | "language": "python",
90 | "name": "python3"
91 | },
92 | "language_info": {
93 | "codemirror_mode": {
94 | "name": "ipython",
95 | "version": 3
96 | },
97 | "file_extension": ".py",
98 | "mimetype": "text/x-python",
99 | "name": "python",
100 | "nbconvert_exporter": "python",
101 | "pygments_lexer": "ipython3",
102 | "version": "3.6.5"
103 | }
104 | },
105 | "nbformat": 4,
106 | "nbformat_minor": 2
107 | }
108 |
--------------------------------------------------------------------------------
/Executing K Means in Python/init:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/K Means Clustering/init:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Other Forms of Clustering/K-Mode+Bank+Marketing.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# K-Mode Clustering on Bank Marketing Dataset"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no'). "
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "**Attribute Information (Categorical):**\n",
22 | "\n",
23 | "- age (numeric)\n",
24 | "- job: type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')\n",
25 | "- marital: marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)\n",
26 | "- education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')\n",
27 | "- default: has credit in default? (categorical: 'no','yes','unknown')\n",
28 | "- housing: has housing loan? (categorical: 'no','yes','unknown')\n",
29 | "- loan: has personal loan? (categorical: 'no','yes','unknown')\n",
30 | "- contact: contact communication type (categorical: 'cellular','telephone') \n",
31 | "- month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')\n",
32 | "- day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')\n",
33 | "- poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')\n",
34 | "- UCI Repository: "
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 24,
40 | "metadata": {},
41 | "outputs": [],
42 | "source": [
43 | "# Importing Libraries\n",
44 | "import pandas as pd\n",
45 | "import numpy as np\n",
46 | "%matplotlib inline\n",
47 | "import matplotlib.pyplot as plt\n",
48 | "import seaborn as sns\n",
49 | "from kmodes.kmodes import KModes\n",
50 | "import warnings\n",
51 | "warnings.filterwarnings(\"ignore\") "
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": null,
57 | "metadata": {},
58 | "outputs": [],
59 | "source": [
60 | "help(KModes)"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": 25,
66 | "metadata": {},
67 | "outputs": [],
68 | "source": [
69 | "bank = pd.read_csv('bankmarketing.csv')"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": 26,
75 | "metadata": {},
76 | "outputs": [
77 | {
78 | "data": {
79 | "text/html": [
80 | "
\n", 81 | "\n", 94 | "\n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | "
\n", 244 | "


\n", 245 | "
" 246 | ], 247 | "text/plain": [ 248 | " age job marital education default housing loan contact \\\n", 249 | "0 56 housemaid married basic.4y no no no telephone \n", 250 | "1 57 services married high.school unknown no no telephone \n", 251 | "2 37 services married high.school no yes no telephone \n", 252 | "3 40 admin. married basic.6y no no no telephone \n", 253 | "4 56 services married high.school no no yes telephone \n", 254 | "\n", 255 | " month day_of_week ... campaign pdays previous poutcome emp.var.rate \\\n", 256 | "0 may mon ... 1 999 0 nonexistent 1.1 \n", 257 | "1 may mon ... 1 999 0 nonexistent 1.1 \n", 258 | "2 may mon ... 1 999 0 nonexistent 1.1 \n", 259 | "3 may mon ... 1 999 0 nonexistent 1.1 \n", 260 | "4 may mon ... 1 999 0 nonexistent 1.1 \n", 261 | "\n", 262 | " cons.price.idx cons.conf.idx euribor3m nr.employed y \n", 263 | "0 93.994 -36.4 4.857 5191.0 no \n", 264 | "1 93.994 -36.4 4.857 5191.0 no \n", 265 | "2 93.994 -36.4 4.857 5191.0 no \n", 266 | "3 93.994 -36.4 4.857 5191.0 no \n", 267 | "4 93.994 -36.4 4.857 5191.0 no \n", 268 | "\n", 269 | "[5 rows x 21 columns]" 270 | ] 271 | }, 272 | "execution_count": 26, 273 | "metadata": {}, 274 | "output_type": "execute_result" 275 | } 276 | ], 277 | "source": [ 278 | "bank.head()" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 27, 284 | "metadata": {}, 285 | "outputs": [ 286 | { 287 | "data": { 288 | "text/plain": [ 289 | "Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',\n", 290 | " 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',\n", 291 | " 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',\n", 292 | " 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],\n", 293 | " dtype='object')" 294 | ] 295 | }, 296 | "execution_count": 27, 297 | "metadata": {}, 298 | "output_type": "execute_result" 299 | } 300 | ], 301 | "source": [ 302 | "bank.columns" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": 
28,
308 | "metadata": {},
309 | "outputs": [],
310 | "source": [
311 | "bank_cust = bank[['age','job', 'marital', 'education', 'default', 'housing', 'loan','contact','month','day_of_week','poutcome']].copy()"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": 29,
317 | "metadata": {},
318 | "outputs": [
319 | {
320 | "data": {
321 | "text/html": [
322 | "
\n", 323 | "\n", 336 | "\n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | "
\n", 426 | "
" 427 | ], 428 | "text/plain": [ 429 | " age job marital education default housing loan contact \\\n", 430 | "0 56 housemaid married basic.4y no no no telephone \n", 431 | "1 57 services married high.school unknown no no telephone \n", 432 | "2 37 services married high.school no yes no telephone \n", 433 | "3 40 admin. married basic.6y no no no telephone \n", 434 | "4 56 services married high.school no no yes telephone \n", 435 | "\n", 436 | " month day_of_week poutcome \n", 437 | "0 may mon nonexistent \n", 438 | "1 may mon nonexistent \n", 439 | "2 may mon nonexistent \n", 440 | "3 may mon nonexistent \n", 441 | "4 may mon nonexistent " 442 | ] 443 | }, 444 | "execution_count": 29, 445 | "metadata": {}, 446 | "output_type": "execute_result" 447 | } 448 | ], 449 | "source": [ 450 | "bank_cust.head()" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 30, 456 | "metadata": {}, 457 | "outputs": [], 458 | "source": [ 459 | "bank_cust['age_bin'] = pd.cut(bank_cust['age'], [0, 20, 30, 40, 50, 60, 70, 80, 90, 100], \n", 460 | " labels=['0-20', '20-30', '30-40', '40-50','50-60','60-70','70-80', '80-90','90-100'])" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 31, 466 | "metadata": {}, 467 | "outputs": [ 468 | { 469 | "data": { 470 | "text/html": [ 471 | "
\n", 472 | "\n", 485 | "\n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | "
\n", 581 | "
" 582 | ], 583 | "text/plain": [ 584 | " age job marital education default housing loan contact \\\n", 585 | "0 56 housemaid married basic.4y no no no telephone \n", 586 | "1 57 services married high.school unknown no no telephone \n", 587 | "2 37 services married high.school no yes no telephone \n", 588 | "3 40 admin. married basic.6y no no no telephone \n", 589 | "4 56 services married high.school no no yes telephone \n", 590 | "\n", 591 | " month day_of_week poutcome age_bin \n", 592 | "0 may mon nonexistent 50-60 \n", 593 | "1 may mon nonexistent 50-60 \n", 594 | "2 may mon nonexistent 30-40 \n", 595 | "3 may mon nonexistent 30-40 \n", 596 | "4 may mon nonexistent 50-60 " 597 | ] 598 | }, 599 | "execution_count": 31, 600 | "metadata": {}, 601 | "output_type": "execute_result" 602 | } 603 | ], 604 | "source": [ 605 | "bank_cust.head()" 606 | ] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "execution_count": 32, 611 | "metadata": {}, 612 | "outputs": [], 613 | "source": [ 614 | "bank_cust = bank_cust.drop('age',axis = 1)" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 33, 620 | "metadata": {}, 621 | "outputs": [ 622 | { 623 | "data": { 624 | "text/html": [ 625 | "
\n", 626 | "\n", 639 | "\n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | "
\n", 729 | "
" 730 | ], 731 | "text/plain": [ 732 | " job marital education default housing loan contact month \\\n", 733 | "0 housemaid married basic.4y no no no telephone may \n", 734 | "1 services married high.school unknown no no telephone may \n", 735 | "2 services married high.school no yes no telephone may \n", 736 | "3 admin. married basic.6y no no no telephone may \n", 737 | "4 services married high.school no no yes telephone may \n", 738 | "\n", 739 | " day_of_week poutcome age_bin \n", 740 | "0 mon nonexistent 50-60 \n", 741 | "1 mon nonexistent 50-60 \n", 742 | "2 mon nonexistent 30-40 \n", 743 | "3 mon nonexistent 30-40 \n", 744 | "4 mon nonexistent 50-60 " 745 | ] 746 | }, 747 | "execution_count": 33, 748 | "metadata": {}, 749 | "output_type": "execute_result" 750 | } 751 | ], 752 | "source": [ 753 | "bank_cust.head()" 754 | ] 755 | }, 756 | { 757 | "cell_type": "code", 758 | "execution_count": 34, 759 | "metadata": {}, 760 | "outputs": [ 761 | { 762 | "name": "stdout", 763 | "output_type": "stream", 764 | "text": [ 765 | "\n", 766 | "RangeIndex: 41188 entries, 0 to 41187\n", 767 | "Data columns (total 11 columns):\n", 768 | "job 41188 non-null object\n", 769 | "marital 41188 non-null object\n", 770 | "education 41188 non-null object\n", 771 | "default 41188 non-null object\n", 772 | "housing 41188 non-null object\n", 773 | "loan 41188 non-null object\n", 774 | "contact 41188 non-null object\n", 775 | "month 41188 non-null object\n", 776 | "day_of_week 41188 non-null object\n", 777 | "poutcome 41188 non-null object\n", 778 | "age_bin 41188 non-null category\n", 779 | "dtypes: category(1), object(10)\n", 780 | "memory usage: 3.2+ MB\n" 781 | ] 782 | } 783 | ], 784 | "source": [ 785 | "bank_cust.info()" 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": 35, 791 | "metadata": {}, 792 | "outputs": [ 793 | { 794 | "data": { 795 | "text/html": [ 796 | "
\n", 797 | "\n", 810 | "\n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | "
\n",
900 | ""
901 | ],
902 | "text/plain": [
903 | " job marital education default housing loan contact month \\\n",
904 | "0 3 1 0 0 0 0 1 6 \n",
905 | "1 7 1 3 1 0 0 1 6 \n",
906 | "2 7 1 3 0 2 0 1 6 \n",
907 | "3 0 1 1 0 0 0 1 6 \n",
908 | "4 7 1 3 0 0 2 1 6 \n",
909 | "\n",
910 | " day_of_week poutcome age_bin \n",
911 | "0 1 1 4 \n",
912 | "1 1 1 4 \n",
913 | "2 1 1 2 \n",
914 | "3 1 1 2 \n",
915 | "4 1 1 4 "
916 | ]
917 | },
918 | "execution_count": 35,
919 | "metadata": {},
920 | "output_type": "execute_result"
921 | }
922 | ],
923 | "source": [
924 | "from sklearn import preprocessing\n",
925 | "le = preprocessing.LabelEncoder()\n",
926 | "bank_cust = bank_cust.apply(le.fit_transform)\n",
927 | "bank_cust.head()"
928 | ]
929 | },
930 | {
931 | "cell_type": "code",
932 | "execution_count": 38,
933 | "metadata": {},
934 | "outputs": [],
935 | "source": [
936 | "# Checking the count per category\n",
937 | "job_df = pd.DataFrame(bank_cust['job'].value_counts())"
938 | ]
939 | },
940 | {
941 | "cell_type": "code",
942 | "execution_count": 39,
943 | "metadata": {},
944 | "outputs": [
945 | {
946 | "data": {
947 | "text/plain": [
948 | ""
949 | ]
950 | },
951 | "execution_count": 39,
952 | "metadata": {},
953 | "output_type": "execute_result"
954 | },
955 | {
956 | "data": {
957 | "image/png": "\n",
958 | "text/plain": [
959 | ""
960 | ]
961 | },
962 | "metadata": {},
963 | "output_type": "display_data"
964 | }
965 | ],
966 | "source": [
967 | "sns.barplot(x=job_df.index, y=job_df['job'])"
968 | ]
969 | },
970 | {
971 | "cell_type": "code",
972 | "execution_count": 40,
973 | "metadata": {},
974 | "outputs": [],
975 | "source": [
976 | "# Checking the count per category\n",
977 | "age_df = pd.DataFrame(bank_cust['age_bin'].value_counts())"
978 | ]
979 | },
980 | {
981 | "cell_type": "code",
982 | "execution_count": 41,
983 | "metadata": {
984 | "scrolled": true
985 | },
986 | "outputs": [
987 | {
988 | "data": {
989 | "text/plain": [
990 | ""
991 | ]
992 | },
993 | "execution_count": 41,
994 | "metadata": {},
995 | "output_type": "execute_result"
996 | },
997 | {
998 | "data": {
999 | "image/png": "\n",
1000 | "text/plain": [
1001 | ""
1002 | ]
1003 | },
1004 | "metadata": {},
1005 | "output_type": "display_data"
1006 | }
1007 | ],
1008 | "source": [
1009 | "sns.barplot(x=age_df.index, y=age_df['age_bin'])"
1010 | ]
1011 | },
1012 | {
1013 | "cell_type": "markdown",
1014 | "metadata": {},
1015 | "source": [
1016 | "## Using K-Mode with \"Cao\" initialization"
1017 | ]
1018 | },
1019 | {
1020 | "cell_type": "code",
1021 | "execution_count": 42,
1022 | "metadata": {
1023 | "scrolled": true
1024 | },
1025 | "outputs": [
1026 | {
1027 | "name": "stdout",
1028 | "output_type": "stream",
1029 | "text": [
1030 | "Init: initializing centroids\n",
1031 | "Init: initializing clusters\n",
1032 | "Starting iterations...\n",
1033 | "Run 1, iteration: 1/100, moves: 5322, cost: 192203.0\n",
1034 | "Run 1, iteration: 2/100, moves: 1160, cost: 192203.0\n"
1035 | ]
1036 | }
1037 | ],
1038 | "source": [
1039 | "km_cao = KModes(n_clusters=2, init = \"Cao\", n_init = 1, verbose=1)\n",
1040 | "fitClusters_cao = km_cao.fit_predict(bank_cust)"
1041 | ]
1042 | },
1043 | {
1044 | "cell_type": "code",
1045 | "execution_count": 43,
1046 | "metadata": {},
1047 | "outputs": [
1048 | {
1049 | "data": {
1050 | "text/plain": [
1051 | "array([1, 1, 0, ..., 0, 1, 0], dtype=uint8)"
1052 | ]
1053 | },
1054 | "execution_count": 43,
1055 | "metadata": {},
1056 | "output_type": "execute_result"
1057 | }
1058 | ],
1059 | "source": [
1060 | "# Predicted Clusters\n",
1061 | "fitClusters_cao"
1062 | ]
1063 | },
1064 | {
1065 | "cell_type": "code",
1066 | "execution_count": 44,
1067 | "metadata": {},
1068 | "outputs": [],
1069 | "source": [
1070 | "clusterCentroidsDf = pd.DataFrame(km_cao.cluster_centroids_)\n",
1071 | "clusterCentroidsDf.columns = bank_cust.columns"
1072 | ]
1073 | },
1074 | {
1075 | "cell_type": "code",
1076 | "execution_count": 45,
1077 | "metadata": {},
1078 | "outputs": [
1079 | {
1080 | "data": {
1081 | "text/html": [
1082 | "
\n", 1144 | "
" 1145 | ], 1146 | "text/plain": [ 1147 | " job marital education default housing loan contact month \\\n", 1148 | "0 0 1 6 0 2 0 0 6 \n", 1149 | "1 1 1 3 0 0 0 1 6 \n", 1150 | "\n", 1151 | " day_of_week poutcome age_bin \n", 1152 | "0 2 1 2 \n", 1153 | "1 0 1 3 " 1154 | ] 1155 | }, 1156 | "execution_count": 45, 1157 | "metadata": {}, 1158 | "output_type": "execute_result" 1159 | } 1160 | ], 1161 | "source": [ 1162 | "# Mode of the clusters\n", 1163 | "clusterCentroidsDf" 1164 | ] 1165 | }, 1166 | { 1167 | "cell_type": "markdown", 1168 | "metadata": {}, 1169 | "source": [ 1170 | "## Using K-Mode with \"Huang\" initialization" 1171 | ] 1172 | }, 1173 | { 1174 | "cell_type": "code", 1175 | "execution_count": 46, 1176 | "metadata": {}, 1177 | "outputs": [ 1178 | { 1179 | "name": "stdout", 1180 | "output_type": "stream", 1181 | "text": [ 1182 | "Init: initializing centroids\n", 1183 | "Init: initializing clusters\n", 1184 | "Starting iterations...\n", 1185 | "Run 1, iteration: 1/100, moves: 8403, cost: 195645.0\n", 1186 | "Run 1, iteration: 2/100, moves: 0, cost: 195645.0\n" 1187 | ] 1188 | } 1189 | ], 1190 | "source": [ 1191 | "km_huang = KModes(n_clusters=2, init = \"Huang\", n_init = 1, verbose=1)\n", 1192 | "fitClusters_huang = km_huang.fit_predict(bank_cust)" 1193 | ] 1194 | }, 1195 | { 1196 | "cell_type": "code", 1197 | "execution_count": 47, 1198 | "metadata": {}, 1199 | "outputs": [ 1200 | { 1201 | "data": { 1202 | "text/plain": [ 1203 | "array([0, 0, 1, ..., 0, 0, 1], dtype=uint8)" 1204 | ] 1205 | }, 1206 | "execution_count": 47, 1207 | "metadata": {}, 1208 | "output_type": "execute_result" 1209 | } 1210 | ], 1211 | "source": [ 1212 | "# Predicted clusters\n", 1213 | "fitClusters_huang" 1214 | ] 1215 | }, 1216 | { 1217 | "cell_type": "markdown", 1218 | "metadata": {}, 1219 | "source": [ 1220 | "## Choosing K by comparing Cost against each K" 1221 | ] 1222 | }, 1223 | { 1224 | "cell_type": "code", 1225 | "execution_count": 68, 1226 | "metadata": {}, 1227 | 
"outputs": [ 1228 | { 1229 | "name": "stdout", 1230 | "output_type": "stream", 1231 | "text": [ 1232 | "Init: initializing centroids\n", 1233 | "Init: initializing clusters\n", 1234 | "Starting iterations...\n", 1235 | "Run 1, iteration: 1/100, moves: 0, cost: 258139.0\n", 1236 | "Init: initializing centroids\n", 1237 | "Init: initializing clusters\n", 1238 | "Starting iterations...\n", 1239 | "Run 1, iteration: 1/100, moves: 5319, cost: 233390.0\n", 1240 | "Run 1, iteration: 2/100, moves: 1165, cost: 233389.0\n", 1241 | "Run 1, iteration: 3/100, moves: 0, cost: 233389.0\n", 1242 | "Init: initializing centroids\n", 1243 | "Init: initializing clusters\n", 1244 | "Starting iterations...\n", 1245 | "Run 1, iteration: 1/100, moves: 4991, cost: 226325.0\n", 1246 | "Run 1, iteration: 2/100, moves: 1369, cost: 226323.0\n", 1247 | "Run 1, iteration: 3/100, moves: 0, cost: 226323.0\n", 1248 | "Init: initializing centroids\n", 1249 | "Init: initializing clusters\n", 1250 | "Starting iterations...\n", 1251 | "Run 1, iteration: 1/100, moves: 0, cost: 223701.0\n" 1252 | ] 1253 | } 1254 | ], 1255 | "source": [ 1256 | "cost = []\n", 1257 | "for num_clusters in range(1, 5):\n", 1258 | "    kmode = KModes(n_clusters=num_clusters, init = \"Cao\", n_init = 1, verbose=1)\n", 1259 | "    kmode.fit_predict(bank_cust)\n", 1260 | "    cost.append(kmode.cost_)" 1261 | ] 1262 | }, 1263 | { 1264 | "cell_type": "code", 1265 | "execution_count": 69, 1266 | "metadata": {}, 1267 | "outputs": [ 1268 | { 1269 | "data": { 1270 | "text/plain": [ 1271 | "[]" 1272 | ] 1273 | }, 1274 | "execution_count": 69, 1275 | "metadata": {}, 1276 | "output_type": "execute_result" 1277 | }, 1278 | { 1279 | "data": { 1280 | "image/png": "\n", 1281 | "text/plain": [ 1282 | "" 1283 | ] 1284 | }, 1285 | "metadata": {}, 1286 | "output_type": "display_data" 1287 | } 1288 | ], 1289 | "source": [ 1290 | "k_values = np.arange(1, 5)\n", 1291 | "plt.plot(k_values, cost)" 1292 | ] 1293 | }, 1294 | { 1295 | "cell_type": 
"code", 1296 | "execution_count": 50, 1297 | "metadata": {}, 1298 | "outputs": [], 1299 | "source": [ 1300 | "## Choosing K=2" 1301 | ] 1302 | }, 1303 | { 1304 | "cell_type": "code", 1305 | "execution_count": 51, 1306 | "metadata": {}, 1307 | "outputs": [ 1308 | { 1309 | "name": "stdout", 1310 | "output_type": "stream", 1311 | "text": [ 1312 | "Init: initializing centroids\n", 1313 | "Init: initializing clusters\n", 1314 | "Starting iterations...\n", 1315 | "Run 1, iteration: 1/100, moves: 5322, cost: 192203.0\n", 1316 | "Run 1, iteration: 2/100, moves: 1160, cost: 192203.0\n" 1317 | ] 1318 | } 1319 | ], 1320 | "source": [ 1321 | "km_cao = KModes(n_clusters=2, init = \"Cao\", n_init = 1, verbose=1)\n", 1322 | "fitClusters_cao = km_cao.fit_predict(bank_cust)" 1323 | ] 1324 | }, 1325 | { 1326 | "cell_type": "code", 1327 | "execution_count": 52, 1328 | "metadata": {}, 1329 | "outputs": [ 1330 | { 1331 | "data": { 1332 | "text/plain": [ 1333 | "array([1, 1, 0, ..., 0, 1, 0], dtype=uint8)" 1334 | ] 1335 | }, 1336 | "execution_count": 52, 1337 | "metadata": {}, 1338 | "output_type": "execute_result" 1339 | } 1340 | ], 1341 | "source": [ 1342 | "fitClusters_cao" 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "markdown", 1347 | "metadata": {}, 1348 | "source": [ 1349 | "### Combining the predicted clusters with the original DF." 1350 | ] 1351 | }, 1352 | { 1353 | "cell_type": "code", 1354 | "execution_count": 53, 1355 | "metadata": {}, 1356 | "outputs": [], 1357 | "source": [ 1358 | "bank_cust = bank_cust.reset_index()\n", 1359 | "clustersDf = pd.DataFrame(fitClusters_cao)\n", 1360 | "clustersDf.columns = ['cluster_predicted']\n", 1361 | "combinedDf = pd.concat([bank_cust, clustersDf], axis = 1).reset_index()\n", 1362 | "combinedDf = combinedDf.drop(['index', 'level_0'], axis = 1)" 1363 | ] 1364 | }, 1365 | { 1366 | "cell_type": "code", 1367 | "execution_count": 54, 1368 | "metadata": {}, 1369 | "outputs": [ 1370 | { 1371 | "data": { 1372 | "text/html": [ 1373 | "
\n", 1483 | "
" 1484 | ], 1485 | "text/plain": [ 1486 | " job marital education default housing loan contact month \\\n", 1487 | "0 3 1 0 0 0 0 1 6 \n", 1488 | "1 7 1 3 1 0 0 1 6 \n", 1489 | "2 7 1 3 0 2 0 1 6 \n", 1490 | "3 0 1 1 0 0 0 1 6 \n", 1491 | "4 7 1 3 0 0 2 1 6 \n", 1492 | "\n", 1493 | " day_of_week poutcome age_bin cluster_predicted \n", 1494 | "0 1 1 4 1 \n", 1495 | "1 1 1 4 1 \n", 1496 | "2 1 1 2 0 \n", 1497 | "3 1 1 2 0 \n", 1498 | "4 1 1 4 1 " 1499 | ] 1500 | }, 1501 | "execution_count": 54, 1502 | "metadata": {}, 1503 | "output_type": "execute_result" 1504 | } 1505 | ], 1506 | "source": [ 1507 | "combinedDf.head()" 1508 | ] 1509 | }, 1510 | { 1511 | "cell_type": "code", 1512 | "execution_count": 55, 1513 | "metadata": {}, 1514 | "outputs": [], 1515 | "source": [ 1516 | "# Data for Cluster1\n", 1517 | "cluster1 = combinedDf[combinedDf.cluster_predicted==1]" 1518 | ] 1519 | }, 1520 | { 1521 | "cell_type": "code", 1522 | "execution_count": 56, 1523 | "metadata": {}, 1524 | "outputs": [], 1525 | "source": [ 1526 | "# Data for Cluster0\n", 1527 | "cluster0 = combinedDf[combinedDf.cluster_predicted==0]" 1528 | ] 1529 | }, 1530 | { 1531 | "cell_type": "code", 1532 | "execution_count": 57, 1533 | "metadata": {}, 1534 | "outputs": [ 1535 | { 1536 | "name": "stdout", 1537 | "output_type": "stream", 1538 | "text": [ 1539 | "\n", 1540 | "Int64Index: 12895 entries, 0 to 41186\n", 1541 | "Data columns (total 12 columns):\n", 1542 | "job 12895 non-null int64\n", 1543 | "marital 12895 non-null int64\n", 1544 | "education 12895 non-null int64\n", 1545 | "default 12895 non-null int64\n", 1546 | "housing 12895 non-null int64\n", 1547 | "loan 12895 non-null int64\n", 1548 | "contact 12895 non-null int64\n", 1549 | "month 12895 non-null int64\n", 1550 | "day_of_week 12895 non-null int64\n", 1551 | "poutcome 12895 non-null int64\n", 1552 | "age_bin 12895 non-null int64\n", 1553 | "cluster_predicted 12895 non-null uint8\n", 1554 | "dtypes: int64(11), uint8(1)\n", 1555 | "memory usage: 
1.2 MB\n" 1556 | ] 1557 | } 1558 | ], 1559 | "source": [ 1560 | "cluster1.info()" 1561 | ] 1562 | }, 1563 | { 1564 | "cell_type": "code", 1565 | "execution_count": 58, 1566 | "metadata": {}, 1567 | "outputs": [ 1568 | { 1569 | "name": "stdout", 1570 | "output_type": "stream", 1571 | "text": [ 1572 | "\n", 1573 | "Int64Index: 28293 entries, 2 to 41187\n", 1574 | "Data columns (total 12 columns):\n", 1575 | "job 28293 non-null int64\n", 1576 | "marital 28293 non-null int64\n", 1577 | "education 28293 non-null int64\n", 1578 | "default 28293 non-null int64\n", 1579 | "housing 28293 non-null int64\n", 1580 | "loan 28293 non-null int64\n", 1581 | "contact 28293 non-null int64\n", 1582 | "month 28293 non-null int64\n", 1583 | "day_of_week 28293 non-null int64\n", 1584 | "poutcome 28293 non-null int64\n", 1585 | "age_bin 28293 non-null int64\n", 1586 | "cluster_predicted 28293 non-null uint8\n", 1587 | "dtypes: int64(11), uint8(1)\n", 1588 | "memory usage: 2.6 MB\n" 1589 | ] 1590 | } 1591 | ], 1592 | "source": [ 1593 | "cluster0.info()" 1594 | ] 1595 | }, 1596 | { 1597 | "cell_type": "code", 1598 | "execution_count": 79, 1599 | "metadata": {}, 1600 | "outputs": [], 1601 | "source": [ 1602 | "# Checking the count per category for JOB\n", 1603 | "job1_df = pd.DataFrame(cluster1['job'].value_counts())\n", 1604 | "job0_df = pd.DataFrame(cluster0['job'].value_counts())" 1605 | ] 1606 | }, 1607 | { 1608 | "cell_type": "code", 1609 | "execution_count": 80, 1610 | "metadata": {}, 1611 | "outputs": [ 1612 | { 1613 | "data": { 1614 | "image/png": "\n", 1615 | "text/plain": [ 1616 | "" 1617 | ] 1618 | }, 1619 | "metadata": {}, 1620 | "output_type": "display_data" 1621 | } 1622 | ], 1623 | "source": [ 1624 | "fig, ax =plt.subplots(1,2,figsize=(20,5))\n", 1625 | "sns.barplot(x=job1_df.index, y=job1_df['job'], ax=ax[0])\n", 1626 | "sns.barplot(x=job0_df.index, y=job0_df['job'], ax=ax[1])\n", 1627 | "fig.show()" 1628 | ] 1629 | }, 1630 | { 1631 | "cell_type": "code", 1632 | 
"execution_count": 81, 1633 | "metadata": {}, 1634 | "outputs": [], 1635 | "source": [ 1636 | "age1_df = pd.DataFrame(cluster1['age_bin'].value_counts())\n", 1637 | "age0_df = pd.DataFrame(cluster0['age_bin'].value_counts())" 1638 | ] 1639 | }, 1640 | { 1641 | "cell_type": "code", 1642 | "execution_count": 83, 1643 | "metadata": {}, 1644 | "outputs": [ 1645 | { 1646 | "data": { 1647 | "image/png": "\n", 1648 | "text/plain": [ 1649 | "" 1650 | ] 1651 | }, 1652 | "metadata": {}, 1653 | "output_type": "display_data" 1654 | } 1655 | ], 1656 | "source": [ 1657 | "fig, ax =plt.subplots(1,2,figsize=(20,5))\n", 1658 | "sns.barplot(x=age1_df.index, y=age1_df['age_bin'], ax=ax[0])\n", 1659 | "sns.barplot(x=age0_df.index, y=age0_df['age_bin'], ax=ax[1])\n", 1660 | "fig.show()" 1661 | ] 1662 | }, 1663 | { 1664 | "cell_type": "code", 1665 | "execution_count": 63, 1666 | "metadata": {}, 1667 | "outputs": [ 1668 | { 1669 | "name": "stdout", 1670 | "output_type": "stream", 1671 | "text": [ 1672 | "1 8636\n", 1673 | "2 2732\n", 1674 | "0 1501\n", 1675 | "3 26\n", 1676 | "Name: marital, dtype: int64\n", 1677 | "1 16292\n", 1678 | "2 8836\n", 1679 | "0 3111\n", 1680 | "3 54\n", 1681 | "Name: marital, dtype: int64\n" 1682 | ] 1683 | } 1684 | ], 1685 | "source": [ 1686 | "print(cluster1['marital'].value_counts())\n", 1687 | "print(cluster0['marital'].value_counts())" 1688 | ] 1689 | }, 1690 | { 1691 | "cell_type": "code", 1692 | "execution_count": 65, 1693 | "metadata": {}, 1694 | "outputs": [ 1695 | { 1696 | "name": "stdout", 1697 | "output_type": "stream", 1698 | "text": [ 1699 | "3 4186\n", 1700 | "2 2572\n", 1701 | "0 1981\n", 1702 | "5 1459\n", 1703 | "1 1033\n", 1704 | "6 977\n", 1705 | "7 680\n", 1706 | "4 7\n", 1707 | "Name: education, dtype: int64\n", 1708 | "6 11191\n", 1709 | "3 5329\n", 1710 | "5 3784\n", 1711 | "2 3473\n", 1712 | "0 2195\n", 1713 | "1 1259\n", 1714 | "7 1051\n", 1715 | "4 11\n", 1716 | "Name: education, dtype: int64\n" 1717 | ] 1718 | } 1719 | ], 1720 | 
"source": [ 1721 | "print(cluster1['education'].value_counts())\n", 1722 | "print(cluster0['education'].value_counts())" 1723 | ] 1724 | } 1725 | ], 1726 | "metadata": { 1727 | "kernelspec": { 1728 | "display_name": "Python 3", 1729 | "language": "python", 1730 | "name": "python3" 1731 | }, 1732 | "language_info": { 1733 | "codemirror_mode": { 1734 | "name": "ipython", 1735 | "version": 3 1736 | }, 1737 | "file_extension": ".py", 1738 | "mimetype": "text/x-python", 1739 | "name": "python", 1740 | "nbconvert_exporter": "python", 1741 | "pygments_lexer": "ipython3", 1742 | "version": "3.6.4" 1743 | } 1744 | }, 1745 | "nbformat": 4, 1746 | "nbformat_minor": 2 1747 | } 1748 | -------------------------------------------------------------------------------- /Other Forms of Clustering/K-Prototype+clustering (1).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "kernelspec": { 4 | "name": "python", 5 | "display_name": "Pyolite", 6 | "language": "python" 7 | }, 8 | "language_info": { 9 | "codemirror_mode": { 10 | "name": "python", 11 | "version": 3 12 | }, 13 | "file_extension": ".py", 14 | "mimetype": "text/x-python", 15 | "name": "python", 16 | "nbconvert_exporter": "python", 17 | "pygments_lexer": "ipython3", 18 | "version": "3.8" 19 | } 20 | }, 21 | "nbformat_minor": 4, 22 | "nbformat": 4, 23 | "cells": [ 24 | { 25 | "cell_type": "markdown", 26 | "source": "# K-Prototype Clustering", 27 | "metadata": {} 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "source": "**About Blood Transfusion dataset**

\nTo demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. About every three months, the center sends its blood transfusion service bus to a university in Hsin-Chu City to collect donated blood. To build an RFMTC model, we selected 748 donors at random from the donor database. Each of the 748 donor records includes R (Recency - months since last donation), F (Frequency - total number of donations), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).", 32 | "metadata": {} 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "source": "**Attribute Information:**\n\n- R (Recency - months since last donation)\n- F (Frequency - total number of donations)\n- M (Monetary - total blood donated in c.c.)\n- T (Time - months since first donation)\n- A binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood)\n\n| Variable | Data Type | Measurement | Description | min | max | mean | std |\n|---|---|---|---|---|---|---|---|\n| Recency | quantitative | Months | Input | 0.03 | 74.4 | 9.74 | 8.07 |\n| Frequency | quantitative | Times | Input | 1 | 50 | 5.51 | 5.84 |\n| Monetary | quantitative | c.c. blood | Input | 250 | 12500 | 1378.68 | 1459.83 |\n| Time | quantitative | Months | Input | 2.27 | 98.3 | 34.42 | 24.32 |\n| Whether he/she donated blood in March 2007 | binary | 1=yes 0=no | Output | 0 | 1 | 1 (24%) 0 (76%) | |", 37 | "metadata": {} 38 | }, 39 | { 40 | "cell_type": "code", 41 | "source": "pip install kmodes", 42 | "metadata": {}, 43 | "execution_count": null, 44 | "outputs": [] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "source": "# Importing Libraries\nimport numpy as np\nimport pandas as pd\nfrom kmodes.kprototypes import KPrototypes\n%matplotlib inline\nimport matplotlib.pyplot as plt\nimport seaborn as sns", 49 | "metadata": {}, 50 | "execution_count": 1, 51 | "outputs": [] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "source": "help(KPrototypes)", 56 | "metadata": {}, 57 | "execution_count": 2, 58 | "outputs": [ 59 | { 60 | "name": "stdout", 61 | "output_type": "stream", 62 | "text": "Help on class KPrototypes in module kmodes.kprototypes:\n\n\n\nclass KPrototypes(kmodes.kmodes.KModes)\n\n | k-protoypes clustering algorithm for mixed numerical/categorical data.\n\n | \n\n | Parameters\n\n | -----------\n\n | n_clusters : int, optional, default: 8\n\n | The number of clusters to form as well as the number of\n\n | centroids to generate.\n\n | \n\n | max_iter : int, default: 300\n\n | Maximum number of iterations of the k-modes algorithm for a\n\n | single run.\n\n | \n\n | num_dissim : func, default: euclidian_dissim\n\n | Dissimilarity function used by the algorithm for numerical variables.\n\n | Defaults to the Euclidian dissimilarity function.\n\n | \n\n | cat_dissim : func, default: matching_dissim\n\n | Dissimilarity function used by the kmodes algorithm for categorical variables.\n\n | Defaults to the matching dissimilarity function.\n\n | \n\n | n_init : int, default: 10\n\n | Number of time the k-modes algorithm will be run with different\n\n | centroid seeds. 
The final results will be the best output of\n\n | n_init consecutive runs in terms of cost.\n\n | \n\n | init : {'Huang', 'Cao', 'random' or a list of ndarrays}, default: 'Cao'\n\n | Method for initialization:\n\n | 'Huang': Method in Huang [1997, 1998]\n\n | 'Cao': Method in Cao et al. [2009]\n\n | 'random': choose 'n_clusters' observations (rows) at random from\n\n | data for the initial centroids.\n\n | If a list of ndarrays is passed, it should be of length 2, with\n\n | shapes (n_clusters, n_features) for numerical and categorical\n\n | data respectively. These are the initial centroids.\n\n | \n\n | gamma : float, default: None\n\n | Weighing factor that determines relative importance of numerical vs.\n\n | categorical attributes (see discussion in Huang [1997]). By default,\n\n | automatically calculated from data.\n\n | \n\n | verbose : integer, optional\n\n | Verbosity mode.\n\n | \n\n | Attributes\n\n | ----------\n\n | cluster_centroids_ : array, [n_clusters, n_features]\n\n | Categories of cluster centroids\n\n | \n\n | labels_ :\n\n | Labels of each point\n\n | \n\n | cost_ : float\n\n | Clustering cost, defined as the sum distance of all points to\n\n | their respective cluster centroids.\n\n | \n\n | n_iter_ : int\n\n | The number of iterations the algorithm ran for.\n\n | \n\n | gamma : float\n\n | The (potentially calculated) weighing factor.\n\n | \n\n | Notes\n\n | -----\n\n | See:\n\n | Huang, Z.: Extensions to the k-modes algorithm for clustering large\n\n | data sets with categorical values, Data Mining and Knowledge\n\n | Discovery 2(3), 1998.\n\n | \n\n | Method resolution order:\n\n | KPrototypes\n\n | kmodes.kmodes.KModes\n\n | sklearn.base.BaseEstimator\n\n | sklearn.base.ClusterMixin\n\n | builtins.object\n\n | \n\n | Methods defined here:\n\n | \n\n | __init__(self, n_clusters=8, max_iter=100, num_dissim=, cat_dissim=, init='Huang', n_init=10, gamma=None, verbose=0)\n\n | Initialize self. 
See help(type(self)) for accurate signature.\n\n | \n\n | fit(self, X, y=None, categorical=None)\n\n | Compute k-prototypes clustering.\n\n | \n\n | Parameters\n\n | ----------\n\n | X : array-like, shape=[n_samples, n_features]\n\n | categorical : Index of columns that contain categorical data\n\n | \n\n | predict(self, X, categorical=None)\n\n | Predict the closest cluster each sample in X belongs to.\n\n | \n\n | Parameters\n\n | ----------\n\n | X : array-like, shape = [n_samples, n_features]\n\n | New data to predict.\n\n | categorical : Index of columns that contain categorical data\n\n | \n\n | Returns\n\n | -------\n\n | labels : array, shape [n_samples,]\n\n | Index of the cluster each sample belongs to.\n\n | \n\n | ----------------------------------------------------------------------\n\n | Data descriptors defined here:\n\n | \n\n | cluster_centroids_\n\n | \n\n | ----------------------------------------------------------------------\n\n | Methods inherited from kmodes.kmodes.KModes:\n\n | \n\n | fit_predict(self, X, y=None, **kwargs)\n\n | Compute cluster centroids and predict cluster index for each sample.\n\n | \n\n | Convenience method; equivalent to calling fit(X) followed by\n\n | predict(X).\n\n | \n\n | ----------------------------------------------------------------------\n\n | Methods inherited from sklearn.base.BaseEstimator:\n\n | \n\n | __getstate__(self)\n\n | \n\n | __repr__(self)\n\n | Return repr(self).\n\n | \n\n | __setstate__(self, state)\n\n | \n\n | get_params(self, deep=True)\n\n | Get parameters for this estimator.\n\n | \n\n | Parameters\n\n | ----------\n\n | deep : boolean, optional\n\n | If True, will return the parameters for this estimator and\n\n | contained subobjects that are estimators.\n\n | \n\n | Returns\n\n | -------\n\n | params : mapping of string to any\n\n | Parameter names mapped to their values.\n\n | \n\n | set_params(self, **params)\n\n | Set the parameters of this estimator.\n\n | \n\n | The method works on 
simple estimators as well as on nested objects\n\n | (such as pipelines). The latter have parameters of the form\n\n | ``__`` so that it's possible to update each\n\n | component of a nested object.\n\n | \n\n | Returns\n\n | -------\n\n | self\n\n | \n\n | ----------------------------------------------------------------------\n\n | Data descriptors inherited from sklearn.base.BaseEstimator:\n\n | \n\n | __dict__\n\n | dictionary for instance variables (if defined)\n\n | \n\n | __weakref__\n\n | list of weak references to the object (if defined)\n\n\n" 63 | } 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "source": "# Reading Dataset\nblood = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data', sep=\",\",engine = 'python')", 69 | "metadata": {}, 70 | "execution_count": 3, 71 | "outputs": [] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "source": "#Sanity Check\nblood.head()", 76 | "metadata": {}, 77 | "execution_count": 4, 78 | "outputs": [ 79 | { 80 | "execution_count": 4, 81 | "output_type": "execute_result", 82 | "data": { 83 | "text/html": [ 84 | "
\n", 152 | "
" 153 | ], 154 | "text/plain": [ 155 | " Recency (months) Frequency (times) Monetary (c.c. blood) Time (months) \\\n", 156 | "0 2 50 12500 98 \n", 157 | "1 0 13 3250 28 \n", 158 | "2 1 16 4000 35 \n", 159 | "3 2 20 5000 45 \n", 160 | "4 1 24 6000 77 \n", 161 | "\n", 162 | " whether he/she donated blood in March 2007 \n", 163 | "0 1 \n", 164 | "1 1 \n", 165 | "2 1 \n", 166 | "3 1 \n", 167 | "4 0 " 168 | ] 169 | }, 170 | "metadata": {} 171 | } 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "source": "# standardizing data\ncolumns_to_normalize = ['Recency (months)','Frequency (times)','Monetary (c.c. blood)','Time (months)']\nblood[columns_to_normalize] = blood[columns_to_normalize].apply(lambda x: (x - x.mean()) / np.std(x))", 177 | "metadata": {}, 178 | "execution_count": 5, 179 | "outputs": [] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "source": "# Re-check after standardizing data\nblood.head()", 184 | "metadata": {}, 185 | "execution_count": 6, 186 | "outputs": [ 187 | { 188 | "execution_count": 6, 189 | "output_type": "execute_result", 190 | "data": { 191 | "text/html": [ 192 | "
\n", 260 | "
" 261 | ], 262 | "text/plain": [ 263 | " Recency (months) Frequency (times) Monetary (c.c. blood) Time (months) \\\n", 264 | "0 -0.927899 7.623346 7.623346 2.615633 \n", 265 | "1 -1.175118 1.282738 1.282738 -0.257881 \n", 266 | "2 -1.051508 1.796842 1.796842 0.029471 \n", 267 | "3 -0.927899 2.482313 2.482313 0.439973 \n", 268 | "4 -1.051508 3.167784 3.167784 1.753579 \n", 269 | "\n", 270 | " whether he/she donated blood in March 2007 \n", 271 | "0 1 \n", 272 | "1 1 \n", 273 | "2 1 \n", 274 | "3 1 \n", 275 | "4 0 " 276 | ] 277 | }, 278 | "metadata": {} 279 | } 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "source": "# Converting the dataset into a matrix\nblood_matrix = blood.values", 285 | "metadata": { 286 | "trusted": true 287 | }, 288 | "execution_count": 1, 289 | "outputs": [] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "source": "# Matrix for analysis\nblood_matrix", 306 | "metadata": {}, 307 | "execution_count": 8, 308 | "outputs": [ 309 | { 310 | "execution_count": 8, 311 | "output_type": "execute_result", 312 | "data": { 313 | "text/plain": [ 314 | "array([[-0.92789873, 7.62334626, 7.62334626, 2.61563344, 1. ],\n", 315 | " [-1.17511806, 1.28273826, 1.28273826, -0.2578809 , 1. ],\n", 316 | " [-1.0515084 , 1.79684161, 1.79684161, 0.02947053, 1. 
],\n", 317 | " ...,\n", 318 | " [ 1.66790417, -0.43093957, -0.43093957, 1.13782607, 0. ],\n", 319 | " [ 3.64565877, -0.77367514, -0.77367514, 0.19367135, 0. ],\n", 320 | " [ 7.72477762, -0.77367514, -0.77367514, 1.54832812, 0. ]])" 321 | ] 322 | }, 323 | "metadata": {} 324 | } 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "source": "# Running K-Prototype clustering\nkproto = KPrototypes(n_clusters=5, init='Cao')\nclusters = kproto.fit_predict(blood_matrix, categorical=[4])", 330 | "metadata": {}, 331 | "execution_count": 9, 332 | "outputs": [] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "source": "kproto.cluster_centroids_", 337 | "metadata": {}, 338 | "execution_count": 12, 339 | "outputs": [ 340 | { 341 | "execution_count": 12, 342 | "output_type": "execute_result", 343 | "data": { 344 | "text/plain": [ 345 | "[array([[ 1.20155863, -0.0434147 , -0.0434147 , 1.16721428],\n", 346 | " [-0.5217527 , 4.8324995 , 4.8324995 , 2.12009883],\n", 347 | " [ 0.77649881, -0.51171646, -0.51171646, -0.41737993],\n", 348 | " [-0.77483521, -0.39947752, -0.39947752, -0.73541024],\n", 349 | " [-0.46834379, 0.94841338, 0.94841338, 0.92401242]]), array([[0.],\n", 350 | " [0.],\n", 351 | " [0.],\n", 352 | " [0.],\n", 353 | " [0.]])]" 354 | ] 355 | }, 356 | "metadata": {} 357 | } 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "source": "# Checking the cost of the clusters created.\nkproto.cost_", 363 | "metadata": {}, 364 | "execution_count": 9, 365 | "outputs": [ 366 | { 367 | "execution_count": 9, 368 | "output_type": "execute_result", 369 | "data": { 370 | "text/plain": [ 371 | "915.5424569187537" 372 | ] 373 | }, 374 | "metadata": {} 375 | } 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "source": "# Adding the predicted clusters to the main dataset\nblood['cluster_id'] = clusters", 381 | "metadata": {}, 382 | "execution_count": 10, 383 | "outputs": [] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "source": "# Re-check\nblood.head()", 388 
| "metadata": {}, 389 | "execution_count": 11, 390 | "outputs": [ 391 | { 392 | "execution_count": 11, 393 | "output_type": "execute_result", 394 | "data": { 395 | "text/html": [ 396 | "
\n", 397 | "\n", 410 | "\n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | "
   Recency (months)  Frequency (times)  Monetary (c.c. blood)  Time (months)  whether he/she donated blood in March 2007  cluster_id
0         -0.927899           7.623346               7.623346       2.615633                                           1           4
1         -1.175118           1.282738               1.282738      -0.257881                                           1           0
2         -1.051508           1.796842               1.796842       0.029471                                           1           3
3         -0.927899           2.482313               2.482313       0.439973                                           1           3
4         -1.051508           3.167784               3.167784       1.753579                                           0           3
\n", 470 | "
" 471 | ], 472 | "text/plain": [ 473 | " Recency (months) Frequency (times) Monetary (c.c. blood) Time (months) \\\n", 474 | "0 -0.927899 7.623346 7.623346 2.615633 \n", 475 | "1 -1.175118 1.282738 1.282738 -0.257881 \n", 476 | "2 -1.051508 1.796842 1.796842 0.029471 \n", 477 | "3 -0.927899 2.482313 2.482313 0.439973 \n", 478 | "4 -1.051508 3.167784 3.167784 1.753579 \n", 479 | "\n", 480 | " whether he/she donated blood in March 2007 cluster_id \n", 481 | "0 1 4 \n", 482 | "1 1 0 \n", 483 | "2 1 3 \n", 484 | "3 1 3 \n", 485 | "4 0 3 " 486 | ] 487 | }, 488 | "metadata": {} 489 | } 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "source": "# Checking the clusters created\nblooddf = pd.DataFrame(blood['cluster_id'].value_counts())\nblooddf", 495 | "metadata": {}, 496 | "execution_count": 14, 497 | "outputs": [ 498 | { 499 | "execution_count": 14, 500 | "output_type": "execute_result", 501 | "data": { 502 | "text/html": [ 503 | "
\n", 504 | "\n", 517 | "\n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | "
   cluster_id
1         267
0         213
2         180
3          80
4           8
\n", 547 | "
" 548 | ], 549 | "text/plain": [ 550 | " cluster_id\n", 551 | "1 267\n", 552 | "0 213\n", 553 | "2 180\n", 554 | "3 80\n", 555 | "4 8" 556 | ] 557 | }, 558 | "metadata": {} 559 | } 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "source": "sns.barplot(x=blooddf.index, y=blooddf['cluster_id'])", 565 | "metadata": {}, 566 | "execution_count": 16, 567 | "outputs": [ 568 | { 569 | "execution_count": 16, 570 | "output_type": "execute_result", 571 | "data": { 572 | "text/plain": [ 573 | "" 574 | ] 575 | }, 576 | "metadata": {} 577 | }, 578 | { 579 | "output_type": "display_data", 580 | "data": { 581 | "image/png": "[base64 PNG data omitted: bar plot of the number of rows in each cluster_id]\n", 582 | "text/plain": [ 583 | "" 584 | ] 585 | }, 586 | "metadata": {} 587 | } 588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "source": "# Choosing the optimal K: fit K-Prototypes for k = 1..13 and record the cost\ncost = []\nk_values = range(1, 14)\nfor num_clusters in k_values:\n    kproto = KPrototypes(n_clusters=num_clusters, init='Cao')\n    kproto.fit_predict(blood_matrix, categorical=[4])\n    cost.append(kproto.cost_)\n\n# Plot cost against k so the x-axis shows the actual number of clusters\nplt.plot(k_values, cost)", 593 | "metadata": {}, 594 | "execution_count": 24, 595 | "outputs": [ 596 | { 597 | "execution_count": 24, 598 | "output_type": "execute_result", 599 | "data": { 600 | "text/plain": [ 601 | "[]" 602 | ] 603 | }, 604 | "metadata": {} 605 | }, 606 | { 607 | "output_type": "display_data", 608 | "data": { 609 | "image/png": "\n", 610 | "text/plain": [ 611 | "" 612 | ] 613 | }, 614 | "metadata": {} 615 | } 616 | ] 617 | } 618 | ] 619 | } -------------------------------------------------------------------------------- /Other Forms of Clustering/init: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Unsupervised-Learning-Clustering --------------------------------------------------------------------------------