├── Week 1 Dataset.xlsx ├── Week 1- Breast Cancer detection proejct.ipynb ├── Week 2- Breast Cancer detection using Machine Learning.ipynb ├── Week 3 Breast Cancer Detection .ipynb ├── Week 4-Breast Cancer detection.ipynb └── cancer dataset.csv /Week 1 Dataset.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aimlcommunity/Breast-Cancer-Detection-using-Machine-Learning/ba38d2b8fa694a285a12e0934f5ed3028a03d937/Week 1 Dataset.xlsx -------------------------------------------------------------------------------- /Week 1- Breast Cancer detection proejct.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Week 1- Getting the basics right" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Welcome to this guided project. It is part of AIML Community's Data Science for social good initiative, in which we are trying to make an impact using Data Science. Our first project is Breast Cancer prediction using Machine Learning. \n", 15 | "\n", 16 | "This notebook contains the material for week 1.\n", 17 | "\n", 18 | "This week, we are going to familiarize ourselves with the basics we will need in the project. This is a Jupyter notebook. Coding in Python in a Jupyter notebook is preferred, but you can use PyCharm as well. \n", 19 | "\n", 20 | "#### What will you learn after completing week 1?\n", 21 | "1. A basic introduction to major Python libraries\n", 22 | "2. Learn basic commands associated with those libraries\n", 23 | "3. Run some code using a simple dataset before we move to a larger dataset for our project\n", 24 | "\n", 25 | "#### How to use this notebook?\n", 26 | "This notebook contains both text and code. There are questions along the way that require you to write code. 
The dataset for week 1 is attached in the same folder under the name \"Week 1 Dataset\". Please download it. \n", 27 | "You can create a separate Python notebook on Google Colab/PyCharm/Jupyter and either send it to us or upload it to GitHub and send us the link. This is part of your evaluation, which will be used to award your certificate. " 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "## Learning the basic libraries of Python " 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### 1. NumPy" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "We will not be using NumPy much in our Breast Cancer detection project, but it is an important library. Pandas is built on NumPy, and Pandas is an essential library for loading, manipulating and analyzing data. \n", 49 | "\n", 50 | "NumPy is the fundamental package for scientific computing with Python. It contains among other things:\n", 51 | "\n", 52 | "1. a powerful N-dimensional array object\n", 53 | "\n", 54 | "2. sophisticated (broadcasting) functions\n", 55 | "\n", 56 | "3. tools for integrating C/C++ and Fortran code\n", 57 | "\n", 58 | "4. useful linear algebra, Fourier transform, and random number capabilities\n", 59 | "\n", 60 | "Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases."
61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "#### What is an array?\n", 68 | "\n", 69 | "An array data structure, or simply an array, is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key.\n", 70 | "\n", 71 | "A two-dimensional array has rows (which run horizontally) and columns (which run vertically). We write its shape in the form (rows, columns), so a shape of (4,5) means 4 rows and 5 columns. \n", 72 | "\n", 73 | "Most of us have studied matrices in school or college. A matrix has two dimensions and a size of m*n. A NumPy array can have any number of dimensions: a 1-D array has size m, a 2-D array has size m*n, and so on, so a matrix is the two-dimensional special case of an array.\n", 74 | "\n", 75 | "NumPy helps us perform linear algebra operations quickly on our machines. We will now proceed and learn how to use NumPy. \n", 76 | "\n", 77 | "First we will import the NumPy module in our notebook." 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "import numpy as np #we import it as np to avoid writing \"numpy\" every time we use it" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "We will first learn how to create an array. There are different methods for creating an array:" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 5, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "name": "stdout", 103 | "output_type": "stream", 104 | "text": [ 105 | "arr1:[[1. 2. 4.]\n", 106 | " [5. 8. 7.]]\n", 107 | "arr2:[1 3 2]\n", 108 | "arr3:[[0. 0. 0.]\n", 109 | " [0. 0. 0.]]\n", 110 | "arr4:[ 0 2 4 6 8 10 12 14 16 18]\n", 111 | "arr5[ 0. 
0.52631579 1.05263158 1.57894737 2.10526316 2.63157895\n", 112 | " 3.15789474 3.68421053 4.21052632 4.73684211 5.26315789 5.78947368\n", 113 | " 6.31578947 6.84210526 7.36842105 7.89473684 8.42105263 8.94736842\n", 114 | " 9.47368421 10. ]\n", 115 | "arr6:[[ 0 1 2 3 4]\n", 116 | " [ 5 6 7 8 9]\n", 117 | " [10 11 12 13 14]]\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "# Creating array from list \n", 123 | "arr1 = np.array([[1, 2, 4], [5, 8, 7]], dtype = 'float') \n", 124 | "print(\"arr1:{}\".format(arr1)) \n", 125 | " \n", 126 | "# Creating array from tuple \n", 127 | "arr2 = np.array((1 , 3, 2)) \n", 128 | "print(\"arr2:{}\".format(arr2))\n", 129 | " \n", 130 | "# Creating a 2X3 array with all zeros \n", 131 | "arr3 = np.zeros((2, 3)) \n", 132 | "print (\"arr3:{}\".format(arr3)) \n", 133 | " \n", 134 | "\n", 135 | "# Create a sequence of integers from 0 up to (but not including) 20 in steps of 2 \n", 136 | "arr4 = np.arange(0, 20, 2) \n", 137 | "print (\"arr4:{}\".format(arr4)) \n", 138 | " \n", 139 | "# Create a sequence of 20 evenly spaced values from 0 to 10 (both ends included) \n", 140 | "arr5 = np.linspace(0, 10, 20) \n", 141 | "print (\"arr5{}\".format(arr5)) \n", 142 | "\n", 143 | "# Making a 3x5 array of consecutive integers starting from 0\n", 144 | "arr6 = np.arange(15).reshape(3, 5)\n", 145 | "print(\"arr6:{}\".format(arr6))" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "### Q1. Create an array having consecutive natural numbers of size 6x3 and print it. " 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "Since our project will not use NumPy much, we will not cover it further. You can read about more NumPy features at: https://numpy.org/devdocs/user/quickstart.html" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "## 2. 
Pandas " 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.\n", 174 | "\n", 175 | "Pandas is well suited for many different kinds of data:\n", 176 | "\n", 177 | "1. Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet\n", 178 | "\n", 179 | "2. Ordered and unordered (not necessarily fixed-frequency) time series data.\n", 180 | "\n", 181 | "3. Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels\n", 182 | "\n", 183 | "4. Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure\n", 184 | "\n", 185 | "The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries." 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "### Why use Pandas?\n", 193 | "#### These are some of the things that Pandas does well:\n", 194 | "\n", 195 | "1. Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data\n", 196 | "\n", 197 | "2. 
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects\n", 198 | "\n", 199 | "3. Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations\n", 200 | "\n", 201 | "4. Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data\n", 202 | "\n", 203 | "5. Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects\n", 204 | "\n", 205 | "6. Intelligent label-based slicing, fancy indexing, and subsetting of large data sets\n", 206 | "\n", 207 | "7. Intuitive merging and joining data sets\n", 208 | "\n", 209 | "8. Flexible reshaping and pivoting of data sets\n", 210 | "\n", 211 | "9. Hierarchical labeling of axes (possible to have multiple labels per tick)\n", 212 | "\n", 213 | "10. Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format\n", 214 | "\n", 215 | "11. Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging." 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "#### We have attached a dataset named \"Week 1 Dataset\". Please download it, locate the file in its folder, and Shift+right-click it to find the \"Copy as path\" option. Clicking that option copies the file's path on your system, which we will use to open the file in this notebook. \n", 223 | "\n", 224 | "## Objective of our dataset:\n", 225 | "We are trying to find out the factors affecting the number of siblings. 
" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "A dataset can be saved in several formats; the most common are CSV (comma-separated values) and Excel files. \n", 233 | "We will create a DataFrame from our data. To open a CSV file or an Excel file, use the commands given below:" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 40, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [ 242 | "import pandas as pd\n", 243 | "#df=pd.read_csv()\n", 244 | "#df=pd.read_excel()" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "Our file is in Excel format, so we will use the second command:" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 41, 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "name": "stdout", 261 | "output_type": "stream", 262 | "text": [ 263 | " Name Income per month State Age Sex Number of siblings\n", 264 | "0 A 50 Tamil Nadu 23 M 0\n", 265 | "1 B 34 Karnataka 37 F 3\n", 266 | "2 C 54 Tamil Nadu 35 F 1\n", 267 | "3 D 24 Uttar Pradesh 26 F 4\n", 268 | "4 E 19 Bihar 59 M 5\n", 269 | "5 F 21 Bihar 48 M 6\n", 270 | "6 G 40 Delhi 30 F 2\n", 271 | "7 H 20 Uttar Pradesh 72 M 10\n", 272 | "8 I 30 Telangana 22 F 0\n", 273 | "9 J 18 Chattisgarh 42 M 7\n" 274 | ] 275 | } 276 | ], 277 | "source": [ 278 | "df=pd.read_excel(r\"C:\\Users\\srtpa\\Desktop\\Week 1 Dataset.xlsx\")\n", 279 | "print(df)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "This is a small dataset that we created to make the basics easier to understand. It contains demographic and geographic information about 10 individuals. " 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 42, 292 | "metadata": {}, 293 | "outputs": [ 294 | { 295 | "data": { 296 | "text/html": [ 297 | "
" 417 | ], 418 | "text/plain": [ 419 | " Name Income per month State Age Sex Number of siblings\n", 420 | "0 A 50 Tamil Nadu 23 M 0\n", 421 | "1 B 34 Karnataka 37 F 3\n", 422 | "2 C 54 Tamil Nadu 35 F 1\n", 423 | "3 D 24 Uttar Pradesh 26 F 4\n", 424 | "4 E 19 Bihar 59 M 5\n", 425 | "5 F 21 Bihar 48 M 6\n", 426 | "6 G 40 Delhi 30 F 2\n", 427 | "7 H 20 Uttar Pradesh 72 M 10\n", 428 | "8 I 30 Telangana 22 F 0\n", 429 | "9 J 18 Chattisgarh 42 M 7" 430 | ] 431 | }, 432 | "execution_count": 42, 433 | "metadata": {}, 434 | "output_type": "execute_result" 435 | } 436 | ], 437 | "source": [ 438 | "df #using this command prints out the DataFrame" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": 43, 444 | "metadata": {}, 445 | "outputs": [ 446 | { 447 | "data": { 448 | "text/html": [ 449 | "
" 561 | ], 562 | "text/plain": [ 563 | " 0 1 2 3 4 \\\n", 564 | "Name A B C D E \n", 565 | "Income per month 50 34 54 24 19 \n", 566 | "State Tamil Nadu Karnataka Tamil Nadu Uttar Pradesh Bihar \n", 567 | "Age 23 37 35 26 59 \n", 568 | "Sex M F F F M \n", 569 | "Number of siblings 0 3 1 4 5 \n", 570 | "\n", 571 | " 5 6 7 8 9 \n", 572 | "Name F G H I J \n", 573 | "Income per month 21 40 20 30 18 \n", 574 | "State Bihar Delhi Uttar Pradesh Telangana Chattisgarh \n", 575 | "Age 48 30 72 22 42 \n", 576 | "Sex M F M F M \n", 577 | "Number of siblings 6 2 10 0 7 " 578 | ] 579 | }, 580 | "execution_count": 43, 581 | "metadata": {}, 582 | "output_type": "execute_result" 583 | } 584 | ], 585 | "source": [ 586 | "#we can transpose the DataFrame (swap rows and columns) using:\n", 587 | "df.T" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "A real dataset can be too large to print in full like this, so we print a limited number of rows to assess the dataset and the columns it contains. To print only a few rows, we use:" 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": 44, 600 | "metadata": {}, 601 | "outputs": [ 602 | { 603 | "data": { 604 | "text/html": [ 605 | "
" 662 | ], 663 | "text/plain": [ 664 | " Name Income per month State Age Sex Number of siblings\n", 665 | "0 A 50 Tamil Nadu 23 M 0\n", 666 | "1 B 34 Karnataka 37 F 3\n", 667 | "2 C 54 Tamil Nadu 35 F 1" 668 | ] 669 | }, 670 | "execution_count": 44, 671 | "metadata": {}, 672 | "output_type": "execute_result" 673 | } 674 | ], 675 | "source": [ 676 | "df.head(3) #from beginning" 677 | ] 678 | }, 679 | { 680 | "cell_type": "code", 681 | "execution_count": 45, 682 | "metadata": {}, 683 | "outputs": [ 684 | { 685 | "data": { 686 | "text/html": [ 687 | "
" 735 | ], 736 | "text/plain": [ 737 | " Name Income per month State Age Sex Number of siblings\n", 738 | "8 I 30 Telangana 22 F 0\n", 739 | "9 J 18 Chattisgarh 42 M 7" 740 | ] 741 | }, 742 | "execution_count": 45, 743 | "metadata": {}, 744 | "output_type": "execute_result" 745 | } 746 | ], 747 | "source": [ 748 | "df.tail(2) #from the end" 749 | ] 750 | }, 751 | { 752 | "cell_type": "markdown", 753 | "metadata": {}, 754 | "source": [ 755 | "To get an overview of the data (column names, non-null counts and data types), we use:" 756 | ] 757 | }, 758 | { 759 | "cell_type": "code", 760 | "execution_count": 46, 761 | "metadata": {}, 762 | "outputs": [ 763 | { 764 | "name": "stdout", 765 | "output_type": "stream", 766 | "text": [ 767 | "<class 'pandas.core.frame.DataFrame'>\n", 768 | "RangeIndex: 10 entries, 0 to 9\n", 769 | "Data columns (total 6 columns):\n", 770 | "Name 10 non-null object\n", 771 | "Income per month 10 non-null int64\n", 772 | "State 10 non-null object\n", 773 | "Age 10 non-null int64\n", 774 | "Sex 10 non-null object\n", 775 | "Number of siblings 10 non-null int64\n", 776 | "dtypes: int64(3), object(3)\n", 777 | "memory usage: 608.0+ bytes\n" 778 | ] 779 | } 780 | ], 781 | "source": [ 782 | "df.info() " 783 | ] 784 | }, 785 | { 786 | "cell_type": "markdown", 787 | "metadata": {}, 788 | "source": [ 789 | "### Renaming columns:\n", 790 | "We often need to rename the columns of a dataset for our convenience. For that we use the following command:\n" 791 | ] 792 | }, 793 | { 794 | "cell_type": "code", 795 | "execution_count": 47, 796 | "metadata": {}, 797 | "outputs": [ 798 | { 799 | "data": { 800 | "text/html": [ 801 | "
" 921 | ], 922 | "text/plain": [ 923 | " Name Income per month State_name Age Sex Number of siblings\n", 924 | "0 A 50 Tamil Nadu 23 M 0\n", 925 | "1 B 34 Karnataka 37 F 3\n", 926 | "2 C 54 Tamil Nadu 35 F 1\n", 927 | "3 D 24 Uttar Pradesh 26 F 4\n", 928 | "4 E 19 Bihar 59 M 5\n", 929 | "5 F 21 Bihar 48 M 6\n", 930 | "6 G 40 Delhi 30 F 2\n", 931 | "7 H 20 Uttar Pradesh 72 M 10\n", 932 | "8 I 30 Telangana 22 F 0\n", 933 | "9 J 18 Chattisgarh 42 M 7" 934 | ] 935 | }, 936 | "execution_count": 47, 937 | "metadata": {}, 938 | "output_type": "execute_result" 939 | } 940 | ], 941 | "source": [ 942 | "df1=df.rename(columns={\"State\":\"State_name\"})\n", 943 | "df1" 944 | ] 945 | }, 946 | { 947 | "cell_type": "markdown", 948 | "metadata": {}, 949 | "source": [ 950 | "## Q2: Rename all the columns in the dataset as you wish and save the result in another dataframe named df2" 951 | ] 952 | }, 953 | { 954 | "cell_type": "markdown", 955 | "metadata": {}, 956 | "source": [ 957 | "If we want to rename the columns in our original dataframe, without making a new dataframe such as df1, we use inplace=True" 958 | ] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "execution_count": 48, 963 | "metadata": {}, 964 | "outputs": [ 965 | { 966 | "data": { 967 | "text/html": [ 968 | "
" 1088 | ], 1089 | "text/plain": [ 1090 | " Name Income per month State_name Age Sex Number of siblings\n", 1091 | "0 A 50 Tamil Nadu 23 M 0\n", 1092 | "1 B 34 Karnataka 37 F 3\n", 1093 | "2 C 54 Tamil Nadu 35 F 1\n", 1094 | "3 D 24 Uttar Pradesh 26 F 4\n", 1095 | "4 E 19 Bihar 59 M 5\n", 1096 | "5 F 21 Bihar 48 M 6\n", 1097 | "6 G 40 Delhi 30 F 2\n", 1098 | "7 H 20 Uttar Pradesh 72 M 10\n", 1099 | "8 I 30 Telangana 22 F 0\n", 1100 | "9 J 18 Chattisgarh 42 M 7" 1101 | ] 1102 | }, 1103 | "execution_count": 48, 1104 | "metadata": {}, 1105 | "output_type": "execute_result" 1106 | } 1107 | ], 1108 | "source": [ 1109 | "df.rename(columns={\"State\":\"State_name\"}, inplace=True)\n", 1110 | "df" 1111 | ] 1112 | }, 1113 | { 1114 | "cell_type": "markdown", 1115 | "metadata": {}, 1116 | "source": [ 1117 | "As you can see, our original dataframe has been changed. \n", 1118 | "\n", 1119 | "## Q3: After you have created df2, make the same changes in the original dataframe using inplace=True. " 1120 | ] 1121 | }, 1122 | { 1123 | "cell_type": "markdown", 1124 | "metadata": {}, 1125 | "source": [ 1126 | "Now suppose we only want the Name, State_name and Age columns. Then we use the following command:" 1127 | ] 1128 | }, 1129 | { 1130 | "cell_type": "code", 1131 | "execution_count": 50, 1132 | "metadata": {}, 1133 | "outputs": [ 1134 | { 1135 | "data": { 1136 | "text/html": [ 1137 | "
" 1224 | ], 1225 | "text/plain": [ 1226 | " Name State_name Age\n", 1227 | "0 A Tamil Nadu 23\n", 1228 | "1 B Karnataka 37\n", 1229 | "2 C Tamil Nadu 35\n", 1230 | "3 D Uttar Pradesh 26\n", 1231 | "4 E Bihar 59\n", 1232 | "5 F Bihar 48\n", 1233 | "6 G Delhi 30\n", 1234 | "7 H Uttar Pradesh 72\n", 1235 | "8 I Telangana 22\n", 1236 | "9 J Chattisgarh 42" 1237 | ] 1238 | }, 1239 | "execution_count": 50, 1240 | "metadata": {}, 1241 | "output_type": "execute_result" 1242 | } 1243 | ], 1244 | "source": [ 1245 | "df[[\"Name\",\"State_name\",\"Age\"]]" 1246 | ] 1247 | }, 1248 | { 1249 | "cell_type": "markdown", 1250 | "metadata": {}, 1251 | "source": [ 1252 | "## Q4: Show only the Sex and Number of siblings columns. \n", 1253 | "\n", 1254 | "There is one more method we can use to select rows and columns: the iloc method, in which we specify the row and column positions. Always remember that rows and columns are counted from 0 in Python. \n", 1255 | "\n", 1256 | "We give the row range first and then the column range. In our case we use 0:3 for the rows. Note that the end value 3 is excluded, so to select rows 0 to 3 we would write 0:4 " 1257 | ] 1258 | }, 1259 | { 1260 | "cell_type": "code", 1261 | "execution_count": 60, 1262 | "metadata": {}, 1263 | "outputs": [ 1264 | { 1265 | "data": { 1266 | "text/html": [ 1267 | "
" 1308 | ], 1309 | "text/plain": [ 1310 | " Name Income per month\n", 1311 | "0 A 50\n", 1312 | "1 B 34\n", 1313 | "2 C 54" 1314 | ] 1315 | }, 1316 | "execution_count": 60, 1317 | "metadata": {}, 1318 | "output_type": "execute_result" 1319 | } 1320 | ], 1321 | "source": [ 1322 | "df.iloc[0:3,0:2]" 1323 | ] 1324 | }, 1325 | { 1326 | "cell_type": "markdown", 1327 | "metadata": {}, 1328 | "source": [ 1329 | "## Q5: Use the iloc method to print data from columns 1 to 4 and rows 3 to 9" 1330 | ] 1331 | }, 1332 | { 1333 | "cell_type": "markdown", 1334 | "metadata": {}, 1335 | "source": [ 1336 | "This dataset is small, so we can check for missing values by eye. In a large dataset that is not possible, so we count the missing values in each column with the following command:" 1337 | ] 1338 | }, 1339 | { 1340 | "cell_type": "code", 1341 | "execution_count": 51, 1342 | "metadata": {}, 1343 | "outputs": [ 1344 | { 1345 | "data": { 1346 | "text/plain": [ 1347 | "Name 0\n", 1348 | "Income per month 0\n", 1349 | "State_name 0\n", 1350 | "Age 0\n", 1351 | "Sex 0\n", 1352 | "Number of siblings 0\n", 1353 | "dtype: int64" 1354 | ] 1355 | }, 1356 | "execution_count": 51, 1357 | "metadata": {}, 1358 | "output_type": "execute_result" 1359 | } 1360 | ], 1361 | "source": [ 1362 | "df.isna().sum()" 1363 | ] 1364 | }, 1365 | { 1366 | "cell_type": "markdown", 1367 | "metadata": {}, 1368 | "source": [ 1369 | "### Basic Statistics with Pandas" 1370 | ] 1371 | }, 1372 | { 1373 | "cell_type": "markdown", 1374 | "metadata": {}, 1375 | "source": [ 1376 | "When we look at a dataset, we need to understand it in statistical terms. For this we use the describe function, alongside the info function we saw earlier: " 1377 | ] 1378 | }, 1379 | { 1380 | "cell_type": "code", 1381 | "execution_count": 52, 1382 | "metadata": {}, 1383 | "outputs": [ 1384 | { 1385 | "data": { 1386 | "text/html": [ 1387 | "
\n", 1388 | "\n", 1401 | "\n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | " \n", 1422 | " \n", 1423 | " \n", 1424 | " \n", 1425 | " \n", 1426 | " \n", 1427 | " \n", 1428 | " \n", 1429 | " \n", 1430 | " \n", 1431 | " \n", 1432 | " \n", 1433 | " \n", 1434 | " \n", 1435 | " \n", 1436 | " \n", 1437 | " \n", 1438 | " \n", 1439 | " \n", 1440 | " \n", 1441 | " \n", 1442 | " \n", 1443 | " \n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " \n", 1456 | " \n", 1457 | " \n", 1458 | " \n", 1459 | " \n", 1460 | "
       Income per month        Age  Number of siblings
count          10.00000  10.000000           10.000000
mean           31.00000  39.400000            3.800000
std            13.18248  16.304055            3.259175
min            18.00000  22.000000            0.000000
25%            20.25000  27.000000            1.250000
50%            27.00000  36.000000            3.500000
75%            38.50000  46.500000            5.750000
max            54.00000  72.000000           10.000000
\n", 1461 | "
" 1462 | ], 1463 | "text/plain": [ 1464 | " Income per month Age Number of siblings\n", 1465 | "count 10.00000 10.000000 10.000000\n", 1466 | "mean 31.00000 39.400000 3.800000\n", 1467 | "std 13.18248 16.304055 3.259175\n", 1468 | "min 18.00000 22.000000 0.000000\n", 1469 | "25% 20.25000 27.000000 1.250000\n", 1470 | "50% 27.00000 36.000000 3.500000\n", 1471 | "75% 38.50000 46.500000 5.750000\n", 1472 | "max 54.00000 72.000000 10.000000" 1473 | ] 1474 | }, 1475 | "execution_count": 52, 1476 | "metadata": {}, 1477 | "output_type": "execute_result" 1478 | } 1479 | ], 1480 | "source": [ 1481 | "df.describe()" 1482 | ] 1483 | }, 1484 | { 1485 | "cell_type": "markdown", 1486 | "metadata": {}, 1487 | "source": [ 1488 | "This gives us the total count, mean, standard deviation, minimum and maximum values, interquartile values. This helps us to get more insights into data. If you are not familiar with these terms, I advise you to read more about them as machine learning makes use of statistics to a very great degree. \n", 1489 | "\n", 1490 | "Suppose we have to find how many males and females are there in our dataset, we will use the following command:" 1491 | ] 1492 | }, 1493 | { 1494 | "cell_type": "code", 1495 | "execution_count": 53, 1496 | "metadata": {}, 1497 | "outputs": [ 1498 | { 1499 | "data": { 1500 | "text/plain": [ 1501 | "F 5\n", 1502 | "M 5\n", 1503 | "Name: Sex, dtype: int64" 1504 | ] 1505 | }, 1506 | "execution_count": 53, 1507 | "metadata": {}, 1508 | "output_type": "execute_result" 1509 | } 1510 | ], 1511 | "source": [ 1512 | "df[\"Sex\"].value_counts()" 1513 | ] 1514 | }, 1515 | { 1516 | "cell_type": "markdown", 1517 | "metadata": {}, 1518 | "source": [ 1519 | "## Q6: Find the total value count of number of siblings in the given dataset" 1520 | ] 1521 | }, 1522 | { 1523 | "cell_type": "markdown", 1524 | "metadata": {}, 1525 | "source": [ 1526 | "# 3. 
Learning Seaborn" 1527 | ] 1528 | }, 1529 | { 1530 | "cell_type": "markdown", 1531 | "metadata": {}, 1532 | "source": [ 1533 | "Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.\n", 1534 | "\n", 1535 | "Here is some of the functionality that seaborn offers:\n", 1536 | "\n", 1537 | "1. A dataset-oriented API for examining relationships between multiple variables\n", 1538 | "\n", 1539 | "2. Specialized support for using categorical variables to show observations or aggregate statistics\n", 1540 | "\n", 1541 | "3. Options for visualizing univariate or bivariate distributions and for comparing them between subsets of data\n", 1542 | "\n", 1543 | "4. Automatic estimation and plotting of linear regression models for different kinds of dependent variables\n", 1544 | "\n", 1545 | "5. Convenient views onto the overall structure of complex datasets\n", 1546 | "\n", 1547 | "6. High-level abstractions for structuring multi-plot grids that let you easily build complex visualizations\n", 1548 | "\n", 1549 | "7. Concise control over matplotlib figure styling with several built-in themes\n", 1550 | "\n", 1551 | "8. Tools for choosing color palettes that faithfully reveal patterns in your data\n", 1552 | "\n", 1553 | "Seaborn aims to make visualization a central part of exploring and understanding data. Its dataset-oriented plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots." 1554 | ] 1555 | }, 1556 | { 1557 | "cell_type": "markdown", 1558 | "metadata": {}, 1559 | "source": [ 1560 | "Visualization is often the quickest way to get insights from the data we are using, because a plot shows patterns that are hard to spot in a table. Seaborn can draw many kinds of plots, but we will focus on the pairplot and the correlation matrix. 
You can read about the rest of them at: https://seaborn.pydata.org/introduction.html" 1561 | ] 1562 | }, 1563 | { 1564 | "cell_type": "markdown", 1565 | "metadata": {}, 1566 | "source": [ 1567 | "The pairplot function creates a grid of Axes such that each variable in the data is shared across the y-axes of a single row and the x-axes of a single column. It helps us analyze trends in our data by plotting every pair of variables against each other, colored by our dependent variable (which we pass as the hue argument in our code). \n", 1568 | "\n", 1569 | "## Q7: What are dependent and independent variables in our dataset? " 1570 | ] 1571 | }, 1572 | { 1573 | "cell_type": "code", 1574 | "execution_count": 56, 1575 | "metadata": {}, 1576 | "outputs": [ 1577 | { 1578 | "data": { 1579 | "text/plain": [ 1580 | "" 1581 | ] 1582 | }, 1583 | "execution_count": 56, 1584 | "metadata": {}, 1585 | "output_type": "execute_result" 1586 | }, 1587 | { 1588 | "data": { 1589 | "image/png": "\n", 1590 | "text/plain": [ 1591 | "
" 1592 | ] 1593 | }, 1594 | "metadata": { 1595 | "needs_background": "light" 1596 | }, 1597 | "output_type": "display_data" 1598 | } 1599 | ], 1600 | "source": [ 1601 | "import seaborn as sns\n", 1602 | "sns.pairplot(df, hue=\"Number of siblings\")" 1603 | ] 1604 | }, 1605 | { 1606 | "cell_type": "markdown", 1607 | "metadata": {}, 1608 | "source": [ 1609 | "The plot suggests two trends: \n", 1610 | "1. As income per month increases, the number of siblings tends to decrease \n", 1611 | "2. People who are older tend to have more siblings than younger ones. " 1612 | ] 1613 | }, 1614 | { 1615 | "cell_type": "markdown", 1616 | "metadata": {}, 1617 | "source": [ 1618 | "### We will be getting similar insights in our week 2 module wherein we will analyze our breast cancer dataset. " 1619 | ] 1620 | }, 1621 | { 1622 | "cell_type": "markdown", 1623 | "metadata": {}, 1624 | "source": [ 1625 | "Now we will try to make a correlation matrix of our dataframe. It gives us a numerical measure of how strongly each pair of variables is correlated. \n", 1626 | "\n", 1627 | "Objectives of making a correlation matrix: \n", 1628 | "\n", 1629 | "1. To summarize a large amount of data where the goal is to see patterns. \n", 1630 | "2. To input into other analyses. For example, people commonly use correlation matrices as inputs for exploratory factor analysis, confirmatory factor analysis, structural equation models, and linear regression when excluding missing values pairwise.\n", 1631 | "3. As a diagnostic when checking other analyses. " 1632 | ] 1633 | }, 1634 | { 1635 | "cell_type": "code", 1636 | "execution_count": 57, 1637 | "metadata": {}, 1638 | "outputs": [ 1639 | { 1640 | "data": { 1641 | "text/html": [ 1642 | "
\n", 1643 | "\n", 1656 | "\n", 1657 | " \n", 1658 | " \n", 1659 | " \n", 1660 | " \n", 1661 | " \n", 1662 | " \n", 1663 | " \n", 1664 | " \n", 1665 | " \n", 1666 | " \n", 1667 | " \n", 1668 | " \n", 1669 | " \n", 1670 | " \n", 1671 | " \n", 1672 | " \n", 1673 | " \n", 1674 | " \n", 1675 | " \n", 1676 | " \n", 1677 | " \n", 1678 | " \n", 1679 | " \n", 1680 | " \n", 1681 | " \n", 1682 | " \n", 1683 | " \n", 1684 | " \n", 1685 | "
                    Income per month       Age  Number of siblings
Income per month            1.000000 -0.567917           -0.773257
Age                        -0.567917  1.000000            0.855771
Number of siblings         -0.773257  0.855771            1.000000
\n", 1686 | "
" 1687 | ], 1688 | "text/plain": [ 1689 | " Income per month Age Number of siblings\n", 1690 | "Income per month 1.000000 -0.567917 -0.773257\n", 1691 | "Age -0.567917 1.000000 0.855771\n", 1692 | "Number of siblings -0.773257 0.855771 1.000000" 1693 | ] 1694 | }, 1695 | "execution_count": 57, 1696 | "metadata": {}, 1697 | "output_type": "execute_result" 1698 | } 1699 | ], 1700 | "source": [ 1701 | "df.corr(method=\"pearson\")" 1702 | ] 1703 | }, 1704 | { 1705 | "cell_type": "markdown", 1706 | "metadata": {}, 1707 | "source": [ 1708 | "Whatever we inferred from the visualization above, we can now see in numbers in this correlation matrix. We can also make a heatmap to see this matrix in visual form:" 1709 | ] 1710 | }, 1711 | { 1712 | "cell_type": "code", 1713 | "execution_count": 58, 1714 | "metadata": {}, 1715 | "outputs": [ 1716 | { 1717 | "data": { 1718 | "text/plain": [ 1719 | "" 1720 | ] 1721 | }, 1722 | "execution_count": 58, 1723 | "metadata": {}, 1724 | "output_type": "execute_result" 1725 | }, 1726 | { 1727 | "data": { 1728 | "image/png": "\n", 1729 | "text/plain": [ 1730 | "
" 1731 | ] 1732 | }, 1733 | "metadata": { 1734 | "needs_background": "light" 1735 | }, 1736 | "output_type": "display_data" 1737 | } 1738 | ], 1739 | "source": [ 1740 | "sns.heatmap(df.corr(), annot=True, fmt=\"0.0%\")" 1741 | ] 1742 | }, 1743 | { 1744 | "cell_type": "markdown", 1745 | "metadata": {}, 1746 | "source": [ 1747 | "## Q8: Mention all your findings from the heatmap above. " 1748 | ] 1749 | }, 1750 | { 1751 | "cell_type": "markdown", 1752 | "metadata": {}, 1753 | "source": [ 1754 | "That is all for week 1, see you next week! " 1755 | ] 1756 | } 1757 | ], 1758 | "metadata": { 1759 | "kernelspec": { 1760 | "display_name": "Python 3", 1761 | "language": "python", 1762 | "name": "python3" 1763 | }, 1764 | "language_info": { 1765 | "codemirror_mode": { 1766 | "name": "ipython", 1767 | "version": 3 1768 | }, 1769 | "file_extension": ".py", 1770 | "mimetype": "text/x-python", 1771 | "name": "python", 1772 | "nbconvert_exporter": "python", 1773 | "pygments_lexer": "ipython3", 1774 | "version": "3.7.4" 1775 | } 1776 | }, 1777 | "nbformat": 4, 1778 | "nbformat_minor": 2 1779 | } 1780 | -------------------------------------------------------------------------------- /Week 2- Breast Cancer detection using Machine Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Week 2: Breast Cancer detection using Machine Learning" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In week 2, we will apply the tools we learnt in week 1 to the breast cancer dataset. In the second half, we will also learn some basic commands from the scikit-learn library. 
" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "The Breast Cancer Dataset is available at: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "If you have trouble downloading the dataset from the website, it will also be uploaded to the project folder and you can download it directly from there. In the following few cells, I will write all the instructions step by step; your assignment for this week is to write the code for each step in your own Python file and take time to analyze the results. \n", 29 | "In future projects, you will have to come up with these steps yourself along with writing the code. " 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "#### Step 1: Open the file in your Python notebook, print the first 5 rows of the dataset and state which are the dependent and independent variables in the data.\n", 37 | "\n", 38 | "#### Step 2: Find the statistical parameters of the data that you have\n", 39 | "\n", 40 | "#### Step 3: Find the shape of the dataset. \n", 41 | "\n", 42 | "#### Step 4: Find missing values in the dataset\n", 43 | "\n", 44 | "#### Step 5: Find the value count of B (Benign) and M (Malignant) cancer cells in the column \"diagnosis\"" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "Before proceeding to step 6, we will learn an important concept called label encoding. While navigating through the data, you will have seen that the classification is given as the letters B and M for benign and malignant cancer cells. We need to convert these to numbers so that this column can also be used statistically. 
" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "from sklearn.preprocessing import LabelEncoder\n", 61 | "labelencoder_Y=LabelEncoder()\n", 62 | "df.iloc[:,1]=labelencoder_Y.fit_transform(df.iloc[:,1].values)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "In the above code, we have applied label encoding to all the rows of one column (the diagnosis column at index 1). LabelEncoder assigns integers to the classes in sorted order, so B becomes 0 and M becomes 1. Write this code in your notebook and remember that whenever you need to convert a textual column into numbers, you can use this pattern; the process is called label encoding. " 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "#### Step 6: Create a pairplot and mention the findings\n", 77 | "Before creating a pairplot of all the columns, note that the dataset has a large number of columns. A pairplot of all of them would be very slow to draw and could exhaust memory, and even if it finished, the grid would be so large that it would be virtually impossible to make sense of it. Therefore, we will only create a pairplot of the first 5 columns and write the findings. \n", 78 | "\n", 79 | "We can do better analysis with the help of a correlation matrix. " 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "#### Step 7: Create a correlation matrix and mention strongly, weakly and negatively correlated quantities. " 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "We tend to drop one of a pair of features that are very strongly correlated with each other. Why? Think and try to write the answer. I will write the answer to this question in the week 3 module. 
" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "#### (Optional) Step 8: Create a heatmap of the correlated features (helps in visualization) " 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "That is all for week 2! We will learn about various algorithms in week 3 and implement them. See you next week!" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "All the best!" 115 | ] 116 | } 117 | ], 118 | "metadata": { 119 | "kernelspec": { 120 | "display_name": "Python 3", 121 | "language": "python", 122 | "name": "python3" 123 | }, 124 | "language_info": { 125 | "codemirror_mode": { 126 | "name": "ipython", 127 | "version": 3 128 | }, 129 | "file_extension": ".py", 130 | "mimetype": "text/x-python", 131 | "name": "python", 132 | "nbconvert_exporter": "python", 133 | "pygments_lexer": "ipython3", 134 | "version": "3.7.4" 135 | } 136 | }, 137 | "nbformat": 4, 138 | "nbformat_minor": 2 139 | } 140 | -------------------------------------------------------------------------------- /Week 3 Breast Cancer Detection .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import pandas as pd\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "import seaborn as sns" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 3, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "df=pd.read_csv(r\"C:\\Users\\srtpa\\Downloads\\cancer dataset.csv\")" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 4, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "from sklearn.preprocessing import LabelEncoder\n", 31 | "labelencoder_Y=LabelEncoder()\n", 32 | 
"df.iloc[:,1]=labelencoder_Y.fit_transform(df.iloc[:,1].values)" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 5, 38 | "metadata": {}, 39 | "outputs": [ 40 | { 41 | "data": { 42 | "text/html": [ 43 | "
\n", 44 | "\n", 57 | "\n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | "
   id  diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  compactness_mean  concavity_mean  concave points_mean  ...  radius_worst  texture_worst  perimeter_worst  area_worst  smoothness_worst  compactness_worst  concavity_worst  concave points_worst  symmetry_worst  fractal_dimension_worst
0  842302  1  17.99  10.38  122.80  1001.0  0.11840  0.27760  0.3001  0.14710  ...  25.38  17.33  184.60  2019.0  0.1622  0.6656  0.7119  0.2654  0.4601  0.11890
1  842517  1  20.57  17.77  132.90  1326.0  0.08474  0.07864  0.0869  0.07017  ...  24.99  23.41  158.80  1956.0  0.1238  0.1866  0.2416  0.1860  0.2750  0.08902
2  84300903  1  19.69  21.25  130.00  1203.0  0.10960  0.15990  0.1974  0.12790  ...  23.57  25.53  152.50  1709.0  0.1444  0.4245  0.4504  0.2430  0.3613  0.08758
3  84348301  1  11.42  20.38  77.58  386.1  0.14250  0.28390  0.2414  0.10520  ...  14.91  26.50  98.87  567.7  0.2098  0.8663  0.6869  0.2575  0.6638  0.17300
4  84358402  1  20.29  14.34  135.10  1297.0  0.10030  0.13280  0.1980  0.10430  ...  22.54  16.67  152.20  1575.0  0.1374  0.2050  0.4000  0.1625  0.2364  0.07678
\n", 207 | "

5 rows × 32 columns

\n", 208 | "
" 209 | ], 210 | "text/plain": [ 211 | " id diagnosis radius_mean texture_mean perimeter_mean area_mean \\\n", 212 | "0 842302 1 17.99 10.38 122.80 1001.0 \n", 213 | "1 842517 1 20.57 17.77 132.90 1326.0 \n", 214 | "2 84300903 1 19.69 21.25 130.00 1203.0 \n", 215 | "3 84348301 1 11.42 20.38 77.58 386.1 \n", 216 | "4 84358402 1 20.29 14.34 135.10 1297.0 \n", 217 | "\n", 218 | " smoothness_mean compactness_mean concavity_mean concave points_mean \\\n", 219 | "0 0.11840 0.27760 0.3001 0.14710 \n", 220 | "1 0.08474 0.07864 0.0869 0.07017 \n", 221 | "2 0.10960 0.15990 0.1974 0.12790 \n", 222 | "3 0.14250 0.28390 0.2414 0.10520 \n", 223 | "4 0.10030 0.13280 0.1980 0.10430 \n", 224 | "\n", 225 | " ... radius_worst texture_worst perimeter_worst area_worst \\\n", 226 | "0 ... 25.38 17.33 184.60 2019.0 \n", 227 | "1 ... 24.99 23.41 158.80 1956.0 \n", 228 | "2 ... 23.57 25.53 152.50 1709.0 \n", 229 | "3 ... 14.91 26.50 98.87 567.7 \n", 230 | "4 ... 22.54 16.67 152.20 1575.0 \n", 231 | "\n", 232 | " smoothness_worst compactness_worst concavity_worst concave points_worst \\\n", 233 | "0 0.1622 0.6656 0.7119 0.2654 \n", 234 | "1 0.1238 0.1866 0.2416 0.1860 \n", 235 | "2 0.1444 0.4245 0.4504 0.2430 \n", 236 | "3 0.2098 0.8663 0.6869 0.2575 \n", 237 | "4 0.1374 0.2050 0.4000 0.1625 \n", 238 | "\n", 239 | " symmetry_worst fractal_dimension_worst \n", 240 | "0 0.4601 0.11890 \n", 241 | "1 0.2750 0.08902 \n", 242 | "2 0.3613 0.08758 \n", 243 | "3 0.6638 0.17300 \n", 244 | "4 0.2364 0.07678 \n", 245 | "\n", 246 | "[5 rows x 32 columns]" 247 | ] 248 | }, 249 | "execution_count": 5, 250 | "metadata": {}, 251 | "output_type": "execute_result" 252 | } 253 | ], 254 | "source": [ 255 | "df.head()" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "We will now split the original data into the dependent variable and the independent variables. 
" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 7, 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "X=df.iloc[:,2:31].values #features (columns 2 to 30) that help us determine if patient has cancer or not\n", 272 | "Y=df.iloc[:,1].values #target variable (column 1), which indicates the diagnosis" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "The data we use is usually split into training data and test data. The training set contains a known output and the model learns on this data in order to be generalized to other data later on. We have the test dataset (or subset) in order to test our model’s prediction on this subset.\n", 280 | "\n", 281 | "From scikit-learn's model_selection module, we will import train_test_split so we can split the data into training and test sets. The test_size argument is the fraction of the data held out for testing. The train/test split is usually around 80/20 or 70/30, which leaves enough data to train on while keeping enough unseen data to check whether the model overfits or underfits. \n", 282 | "Let us first understand what overfitting and underfitting mean:" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "#### Overfitting\n", 290 | "Overfitting means that the model we trained has trained “too well” and is now fit too closely to the training dataset. This usually happens when the model is too complex, i.e. it has too many features/variables compared to the number of observations. Such a model will be very accurate on the training data but will probably be much less accurate on unseen or new data. This is because the model is not generalized, meaning you cannot generalize the results or make inferences on other data, which is, ultimately, what you are trying to do. 
Basically, when this happens, the model learns or describes the “noise” in the training data instead of the actual relationships between variables in the data. This noise, obviously, isn’t part of any new dataset, and cannot be applied to it.\n", 291 | "\n", 292 | "#### Underfitting\n", 293 | "In contrast to overfitting, when a model is underfitted, it means that the model does not fit the training data and therefore misses the trends in the data. It also means the model cannot be generalized to new data. This is usually the result of a very simple model which does not have enough predictors/independent variables. It could also happen when, for example, we fit a linear model, like linear regression, to data that is not linear. It almost goes without saying that this model will have poor predictive ability on training data and can’t be generalized to other data.\n" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 9, 299 | "metadata": {}, 300 | "outputs": [], 301 | "source": [ 302 | "from sklearn.model_selection import train_test_split\n", 303 | "X_train, X_test, Y_train, Y_test= train_test_split(X,Y, test_size=0.25, random_state=0)" 304 | ] 305 | }, 306 | { 307 | "cell_type": "markdown", 308 | "metadata": {}, 309 | "source": [ 310 | "This is the code you can use to split data into training and test sets using scikit-learn" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "#### Q1: Now, there is a small homework on statistics: read about the statistical parameters, define them in brief, and write about two main types of distributions: \n", 318 | "#### 1. Gaussian distribution \n", 319 | "#### 2. Binomial distribution \n", 320 | "#### Differentiate between both as well. 
" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "I have written about this in the group as well, and I will say it once again: statistics is a very important element of machine learning. While we work in Python in this project, you should also work on the statistics and try to draw statistical conclusions from the project. " 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "### Fit Transform" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "To standardize the data (make it have zero mean and unit standard deviation), you subtract the mean and then divide the result by the standard deviation.\n", 342 | "\n", 343 | "x′=(x−μ)/σ.\n", 344 | "\n", 345 | "You do that on the training set of data. But then you have to apply the same transformation to your testing set (e.g. in cross-validation), or to newly obtained examples before prediction. You have to use the same two parameters μ and σ (values) that you used for standardizing the training set.\n", 346 | "\n", 347 | "Hence, the fit() method of every scikit-learn transformer just calculates the parameters (e.g. μ and σ in the case of StandardScaler) and saves them as internal object state. Afterwards, you can call its transform() method to apply the transformation to a particular set of examples.\n", 348 | "\n", 349 | "fit_transform() joins these two steps and is used for the initial fitting of parameters on the training set x, but it also returns a transformed x′. Internally, it just calls first fit() and then transform() on the same data." 
350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 10, 355 | "metadata": {}, 356 | "outputs": [], 357 | "source": [ 358 | "from sklearn.preprocessing import StandardScaler\n", 359 | "sc=StandardScaler()\n", 360 | "X_train=sc.fit_transform(X_train)\n", 361 | "X_test=sc.transform(X_test)" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "We have fitted the scaler on the training data and applied the same transformation to both the training and test data. Now it is time to read about various models that can be used to predict whether a cancer cell is benign or malignant. " 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "Model selection is an important part of solving a machine learning problem, but not as important as data cleaning! Your model will only ever be as good as your data, so the focus should always be on getting high-quality data and cleaning it properly. \n", 376 | "\n", 377 | "There are a number of machine learning models available which can be employed to reach meaningful conclusions, and selecting the right model depends on a variety of factors such as:\n", 378 | "\n", 379 | "1. The accuracy of the model.\n", 380 | "2. The interpretability of the model.\n", 381 | "3. The complexity of the model.\n", 382 | "4. The scalability of the model.\n", 383 | "5. How long does it take to build, train, and test the model?\n", 384 | "6. How long does it take to make predictions using the model?\n", 385 | "7. Does the model meet the business goal?" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": {}, 391 | "source": [ 392 | "In this project, we will mainly focus on three algorithms which can be used to model our dataset:\n", 393 | " 1. Logistic regression\n", 394 | " 2. Decision tree classifier \n", 395 | " 3. 
Random Forest classifier " 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "### Q2: Write about 5 lines on each of the above three algorithms so that even a rather inexperienced person can understand it, while still covering the technical details. " 403 | ] 404 | }, 405 | { 406 | "cell_type": "markdown", 407 | "metadata": {}, 408 | "source": [ 409 | "This week's assignment is more theory oriented because communication skills are as important in Data Science as coding and Statistics (Mathematics). These questions will help you gain a command over explaining things. " 410 | ] 411 | }, 412 | { 413 | "cell_type": "markdown", 414 | "metadata": {}, 415 | "source": [ 416 | "That is all for week 3. We will apply these algorithms in week 4 and calculate the accuracy of the models; the final submission and evaluation will be done after that week. All the best!" 417 | ] 418 | } 419 | ], 420 | "metadata": { 421 | "kernelspec": { 422 | "display_name": "Python 3", 423 | "language": "python", 424 | "name": "python3" 425 | }, 426 | "language_info": { 427 | "codemirror_mode": { 428 | "name": "ipython", 429 | "version": 3 430 | }, 431 | "file_extension": ".py", 432 | "mimetype": "text/x-python", 433 | "name": "python", 434 | "nbconvert_exporter": "python", 435 | "pygments_lexer": "ipython3", 436 | "version": "3.7.4" 437 | } 438 | }, 439 | "nbformat": 4, 440 | "nbformat_minor": 2 441 | } 442 | -------------------------------------------------------------------------------- /Week 4-Breast Cancer detection.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "I hope you have all finished the week 3 assignments with zeal, as it is important to know about the algorithms before we actually start applying them to our dataset. 
This is the final week of our assignment and after this, we shall be doing final evaluation of your submissions and then providing you respective certificates.\n", 8 | "\n", 9 | "We will start with applying the models, I will aplly logistic regression and explain things accordingly, your assignment for this week is to apply the rest two classifier models to the data. " 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import numpy as np\n", 19 | "import pandas as pd\n", 20 | "import matplotlib.pyplot as plt\n", 21 | "import seaborn as sns" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "df=pd.read_csv(r\"C:\\Users\\srtpa\\Downloads\\cancer dataset.csv\")" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 3, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "from sklearn.preprocessing import LabelEncoder\n", 40 | "labelencoder_Y=LabelEncoder()\n", 41 | "df.iloc[:,1]=labelencoder_Y.fit_transform(df.iloc[:,1].values)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 4, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "X=df.iloc[:,2:31].values \n", 51 | "Y=df.iloc[:,1].values " 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 5, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "from sklearn.model_selection import train_test_split\n", 61 | "X_train, X_test, Y_train, Y_test= train_test_split(X,Y, test_size=0.25, random_state=0)" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 6, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "from sklearn.preprocessing import StandardScaler\n", 71 | "sc=StandardScaler()\n", 72 | "X_train=sc.fit_transform(X_train)\n", 73 | "X_test=sc.fit_transform(X_test)" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | 
"metadata": {}, 79 | "source": [ 80 | "We will start off by creating a function for applying logistic regression to our data. We will use the built-in modules of the scikit-learn library. " 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 15, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "def logreg(X_train, Y_train):\n", 90 | "    from sklearn.linear_model import LogisticRegression\n", 91 | "    log=LogisticRegression(random_state=0)\n", 92 | "    log.fit(X_train, Y_train)\n", 93 | "    print(\"Logistic Regression Training Accuracy:\", log.score(X_train, Y_train))\n", 94 | "    return log" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "You will see random_state in the code. What is it used for?\n", 102 | "\n", 103 | "If no random_state is provided, the system uses a random seed that is generated internally. So, when you run the program multiple times, you might get different train/test data points and the results will vary from run to run. If you then have an issue with your model, you will not be able to reproduce it, because you do not know the seed that was used when you ran the program. Fixing random_state (here to 0) makes the runs repeatable.\n", 104 | "\n", 105 | "### If this code is a bit overwhelming at the moment, do not worry: you will be doing the Decision Tree classifier and Random Forest classifier on your own. Then it will be clearer!" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 17, 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "name": "stdout", 115 | "output_type": "stream", 116 | "text": [ 117 | "Logistic Regression Training Accuracy: 0.9906103286384976\n" 118 | ] 119 | }, 120 | { 121 | "name": "stderr", 122 | "output_type": "stream", 123 | "text": [ 124 | "C:\\Users\\srtpa\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.\n", 125 | " FutureWarning)\n" 126 | ] 127 | } 128 | ], 129 | "source": [ 130 | "logrex=logreg(X_train, Y_train)" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "The accuracy above is measured on the same data that we used for training, so it is no surprise that it is very close to 100%: the model has already seen this data. We will find the accuracy on the testing data with the help of a confusion matrix. " 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "Now we will apply the trained model to our testing data, as we did earlier with the training set, and create a confusion matrix. \n", 145 | "\n", 146 | "### Confusion Matrix \n", 147 | "\n", 148 | "A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized as counts and broken down by class. This is the key to the confusion matrix. The confusion matrix shows the ways in which your classification model is confused when it makes predictions. It gives us insight not only into the errors being made by a classifier but, more importantly, into the types of errors that are being made.\n", 149 | "\n", 150 | "#### Definition of the Terms:\n", 151 | "\n", 152 | "- Positive (P): Observation is positive (for example: is an apple).\n", 153 | "- Negative (N): Observation is not positive (for example: is not an apple).\n", 154 | "- True Positive (TP): Observation is positive, and is predicted to be positive.\n", 155 | "- False Negative (FN): Observation is positive, but is predicted negative.\n", 156 | "- True Negative (TN): Observation is negative, and is predicted to be negative.\n", 157 | "- False Positive (FP): Observation is negative, but is predicted positive." 
158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "Rates that are often computed from a confusion matrix for a binary classifier:\n", 165 | "\n", 166 | "Accuracy: Overall, how often is the classifier correct?\n", 167 | "(TP+TN)/total, where total = TP+TN+FP+FN\n", 168 | "\n", 169 | "Misclassification Rate: Overall, how often is it wrong?\n", 170 | "(FP+FN)/total" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 20, 176 | "metadata": {}, 177 | "outputs": [ 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | "[[86  4]\n", 183 | " [ 3 50]]\n" 184 | ] 185 | } 186 | ], 187 | "source": [ 188 | "from sklearn.metrics import confusion_matrix\n", 189 | "cm = confusion_matrix(Y_test, logrex.predict(X_test))\n", 190 | "print(cm)" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "In this matrix the rows are the actual classes and the columns are the predicted classes. Taking class 1 as the positive class (LabelEncoder sorts the labels, so with diagnosis values B and M, malignant becomes 1):\n", 198 | "1. True negative=86\n", 199 | "2. True positive=50\n", 200 | "3. False positive=4\n", 201 | "4. False negative=3" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 21, 207 | "metadata": {}, 208 | "outputs": [ 209 | { 210 | "name": "stdout", 211 | "output_type": "stream", 212 | "text": [ 213 | "Testing accuracy of logistic regression model= 0.951048951048951\n" 214 | ] 215 | } 216 | ], 217 | "source": [ 218 | "TN=cm[0][0]  # actual 0, predicted 0\n", 219 | "TP=cm[1][1]  # actual 1, predicted 1\n", 220 | "FN=cm[1][0]  # actual 1, predicted 0\n", 221 | "FP=cm[0][1]  # actual 0, predicted 1\n", 222 | "print(\"Testing accuracy of logistic regression model=\", (TP+TN)/(TP+TN+FN+FP))" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "In this way, I have applied the logistic regression model to our data. Now it is your turn to apply the Decision Tree classifier and the Random Forest classifier to this data, find the accuracy of each model using a confusion matrix as shown above, and comment on which model had the higher accuracy and why. 
You can find code for applying these models online; please use it!\n", 230 | "With this, we come to the end of the project. In week 5, we will do your final evaluation, take feedback from you all and provide your certificates. \n", 231 | "All the best, and we hope to do more projects in the future! " 232 | ] 233 | } 234 | ], 235 | "metadata": { 236 | "kernelspec": { 237 | "display_name": "Python 3", 238 | "language": "python", 239 | "name": "python3" 240 | }, 241 | "language_info": { 242 | "codemirror_mode": { 243 | "name": "ipython", 244 | "version": 3 245 | }, 246 | "file_extension": ".py", 247 | "mimetype": "text/x-python", 248 | "name": "python", 249 | "nbconvert_exporter": "python", 250 | "pygments_lexer": "ipython3", 251 | "version": "3.7.4" 252 | } 253 | }, 254 | "nbformat": 4, 255 | "nbformat_minor": 2 256 | } 257 | --------------------------------------------------------------------------------