├── Ditching Excel for Python!.ipynb └── README.md /Ditching Excel for Python!.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#
Ditching Excel for Python
" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "After spending almost a decade with my first love Excel, its time to move on and search for a better half who in thick and thin of my daily tasks is with me and is much better and faster and who can give me a cutting edge in the challenging technological times where new technology is getting ditched by something new at a very rapid pace.\n", 15 | "The idea is to replicate almost all excel functionalities in Python, be it using a simple filter or a complex task of creating an array of data from the rows and crunching them to get fancy results" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "The approach followed here is to start from simple tasks and move to complex computational tasks.\n", 23 | "I've tried and designed it such a way that this can be used a universal notebook and you just just have to change the input file and you can get the same result.\n", 24 | "However I will encourage you to please replicate the steps yourself for your better understanding." 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "The inspiration to create something like this came from the non-availablity of a free tutorial which literally gives all. I heavily read and follow Python documentation and you will find a lot of inspiration from that site." 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "_Our Input and Ouput is both an Excel file :)_" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "
------------------------------------------------------------------------------------------
" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## Importing Excel Files into a Pandas DataFrame" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "Initial step is to import excel files into dtaframe so we can perform all our tasks on it.\n", 60 | "
I will be demonstrating the __read_excel__ method of Pandas which supports __xls__ and __xlsx__ file extensions.\n", 61 | "
__read_csv__ is same as using read_excel, we wont go in depth but I will share an example." 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "Though __read_excel__ method includes million arguments but I will make you familiarise with the most common ones that will come very handy in day to day operations" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "I'll be using the Iris sample dataset which is freely available online for educational purpose.\n", 76 | "
Please follow the link below to download the dataset, and be sure to save it in the same folder as your Python file" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "https://archive.ics.uci.edu/ml/datasets/iris" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "## The first step is to import the necessary libraries in Python" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 1, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "import pandas as pd\n", 100 | "import numpy as np" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "We can import spreadsheet data into Python using read_excel, whose full signature is:" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, parse_cols=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skip_footer=0, skipfooter=0, convert_float=True, mangle_dupe_cols=True, **kwds)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "Since there's a plethora of arguments available, let's look at the most used ones." 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "### Important Pandas read_excel Options" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "|\tArgument\t|\tDescription\t|\n", 136 | "|\t---\t|\t---\t|\n", 137 | "|\tio\t|\tA string containing the pathname of the given Excel file.\t|\n", 138 | "|\tsheet_name\t|\tThe Excel sheet name, or sheet number, of the data you want to import. The sheet number can be an integer where 0 is the first sheet, 1 is the second, etc. If a list of sheet names/numbers is given, then the output will be a dictionary of DataFrames. The default is 0, i.e. the first sheet; pass None to read all sheets and get a dictionary of DataFrames.\t|\n", 139 | "|\theader\t|\tRow number to use for the list of column labels. The default is 0, indicating that the first row is assumed to contain the column labels. If the data does not have a row of column labels, None should be used.\t|\n", 140 | "|\tnames\t|\tA separate Python list input of column names. This option is None by default. This option is the equivalent of assigning a list of column names to the columns attribute of the output DataFrame.\t|\n", 141 | "|\tindex_col\t|\tSpecifies which column should be used for row indices. The default option is None, meaning that all columns are included in the data, and a range of numbers is used as the row indices.\t|\n", 142 | "|\tusecols\t|\tAn integer, list of integers, or string that specifies the columns to be imported into the DataFrame. The default is to import all columns. If a string is given, then Pandas uses the standard Excel format to select columns (e.g. \"A:C,F,G\" will import columns A, B, C, F, and G).\t|\n", 143 | "|\tskiprows\t|\tThe number of rows to skip at the top of the Excel sheet. Default is 0. 
This option is useful for skipping rows in Excel that contain explanatory information about the data below it.\t|\n" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "On Windows, a local file path by default uses \"\\\" as the separator; however, Python accepts \"/\",\n", 151 | "so make sure to change the slashes, or simply put the file in the same folder as your Python file.\n", 152 | "Should you require a detailed explanation of the above, refer to the Medium article below.\n", 153 | "https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "We can use Python to scan files in a directory and pick out the ones we want." 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": { 167 | "scrolled": true 168 | }, 169 | "outputs": [], 170 | "source": [ 171 | "import os\n", "from glob import glob\n", "\n", "# glob + os.path.join builds the path for us, regardless of slash direction\n", "wkbks = glob(os.path.join(os.pardir, 'input', 'xlsx_files_all', 'Ir*.xls'))\n", 172 | "sorted(wkbks)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [ 181 | "filename = 'Iris.xlsx'" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": null, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "df = pd.read_excel(filename)" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "print(df)" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [ 206 | "## Import a specific sheet" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "By default, the first sheet in the file is imported to the dataframe as is.\n", 214 | "
Using the sheet_name argument we can explicitly mention the sheet that we want to import. The default value is 0, i.e. the first sheet in the file.\n", 215 | "
We can either mention the name of the sheet(s) or pass an integer value to refer to the index of the sheet." 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [ 224 | "df1 = pd.read_excel(filename,sheet_name='Sheet2')" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "print(df1)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "## Using a column from the sheet as an Index" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "Unless explicitly mentioned, an index column is added to the dataframe, which by default starts from 0.\n", 248 | "
Using the index_col argument we can control the index column of our dataframe; if we change the value from None to 0, the first column is used as our index." 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "df = pd.read_excel(filename,sheet_name='Sheet1', index_col=0)" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "print(df)" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "## Skip rows and columns" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "The default read_excel parameters assume that the first row is a list of column names, which is incorporated automatically as column labels within the DataFrame.\n", 281 | "
Using arguments like skiprows and header we can manipulate the behaviour of the imported dataframe." 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [ 290 | "df = pd.read_excel(filename, sheet_name='Sheet1', header=None, skiprows=1, index_col=0)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "print(df)" 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": {}, 305 | "source": [ 306 | "## Import specific column(s)" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "Using the usecols argument we can specify which columns to import into our dataframe" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "df = pd.read_excel(filename, sheet_name='Sheet1', header=None, skiprows=1, usecols='B,D')" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": {}, 329 | "outputs": [], 330 | "source": [ 331 | "print(df)" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "df = pd.read_excel(filename)\n", 341 | "# Importing the file again into the dataframe in its original shape, to use for further analysis" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": {}, 347 | "source": [ 348 | "_This is not the end of the features available; however, it's a start and you can play around with them as per your requirements_" 349 | ] 350 | },
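{ "cell_type": "markdown", "metadata": {}, "source": [ "As promised earlier, a quick sketch of __read_csv__: it mirrors read_excel, it just reads delimited text instead of a workbook. This is a hedged example; it assumes a hypothetical comma-separated copy of the dataset named Iris.csv sitting in the same folder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal read_csv sketch; 'Iris.csv' is a hypothetical CSV copy of our dataset\n", "df_csv = pd.read_csv('Iris.csv', header=0, usecols=['SepalLength', 'Name'])\n", "df_csv.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "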
------------------------------------------------------------------------------------------
" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "## Lets have a look at the data from 10,000 feet" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "As now we have our dataframe, lets look at the data from multiple angles just to get a hang of it/\n", 370 | "
Pandas have plenty of functions available that we can use. We'll use some of them to have a glimpse of our dataset." 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "## \"Head\" to \"Tail\": \n", 378 | "To view the first or last __five__ rows.\n", 379 | "
_Default is five, however the argument allows us to use a specific number_" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": null, 385 | "metadata": {}, 386 | "outputs": [], 387 | "source": [ 388 | "df.head(10)" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": null, 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "df.tail()" 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": {}, 403 | "source": [ 404 | "## View specific column's data" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": null, 410 | "metadata": { 411 | "scrolled": true 412 | }, 413 | "outputs": [], 414 | "source": [ 415 | "df['SepalLength'].head()" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "## Getting the name of all columns" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "metadata": {}, 429 | "outputs": [], 430 | "source": [ 431 | "df.columns" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "## Info Method\n", 439 | "Gives a summary of Dataframe" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": null, 445 | "metadata": { 446 | "scrolled": true 447 | }, 448 | "outputs": [], 449 | "source": [ 450 | "df.info()" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "## Shape Method\n", 458 | "Returns the dimensions of Dataframe" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "metadata": {}, 465 | "outputs": [], 466 | "source": [ 467 | "df.shape[0]" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": null, 473 | "metadata": {}, 474 | "outputs": [], 475 | "source": [ 476 | "print('Total rows in Dataframe is: ', df.shape[0])\n", 477 | "print('Total columns in Dataframe is: ', df.shape[0])" 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "metadata": {}, 483 | "source": [ 484 | "## Look at the datatypes in Dataframe" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": null, 490 | "metadata": {}, 491 | "outputs": [], 492 | "source": [ 493 | "df.dtypes" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "
------------------------------------------------------------------------------------------
" 501 | ] 502 | }, 503 | { 504 | "cell_type": "markdown", 505 | "metadata": {}, 506 | "source": [ 507 | "# Slice and Dice i.e. Excel filters" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": {}, 513 | "source": [ 514 | "Descriptive reporting is all about data subsets and aggregations, the moment we are to understand our data a little bit we start using filters to look at the smaller sets of data or view a particular column maybe to have a better understanding.\n", 515 | "
Python offers a lot of different methods to slice and dice the dataframes, we'll play around with a couple of them to have an understanding of how it works" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "metadata": {}, 521 | "source": [ 522 | "## View a specific column" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "There exists three main methods to select columns:\n", 530 | "\n", 531 | "* Use dot notation: e.g. data.column_name\n", 532 | "* Use square braces and the name of the column:, e.g. data['column_name']\n", 533 | "* Use numeric indexing and the iloc selector data.loc[:, 'column_number']" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": null, 539 | "metadata": {}, 540 | "outputs": [], 541 | "source": [ 542 | "df['Name'].head()" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": null, 548 | "metadata": {}, 549 | "outputs": [], 550 | "source": [ 551 | "df.iloc[:,[4]].head()" 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": null, 557 | "metadata": {}, 558 | "outputs": [], 559 | "source": [ 560 | "df.loc[:,['Name']].head()" 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "metadata": {}, 566 | "source": [ 567 | "## View multiple columns" 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": null, 573 | "metadata": {}, 574 | "outputs": [], 575 | "source": [ 576 | "df[['Name', 'PetalLength']].head()" 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": null, 582 | "metadata": { 583 | "scrolled": true 584 | }, 585 | "outputs": [], 586 | "source": [ 587 | "#Pass a variable as a list\n", 588 | "SpecificColumnList = ['Name', 'PetalLength']\n", 589 | "df[SpecificColumnList].head()" 590 | ] 591 | }, 592 | { 593 | "cell_type": "markdown", 594 | "metadata": {}, 595 | "source": [ 596 | "## View specific row's data\n", 597 | "The method used here is slicing using the loc function, where we can specify the start and end row separated by colon\n", 598 | "
Remember, __index starts from a 0 and not 1__" 599 | ] 600 | }, 601 | { 602 | "cell_type": "code", 603 | "execution_count": null, 604 | "metadata": {}, 605 | "outputs": [], 606 | "source": [ 607 | "df.loc[20:30] " 608 | ] 609 | }, 610 | { 611 | "cell_type": "markdown", 612 | "metadata": {}, 613 | "source": [ 614 | "## Slice rows and columns together" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": null, 620 | "metadata": {}, 621 | "outputs": [], 622 | "source": [ 623 | "df.loc[20:30, ['Name']]" 624 | ] 625 | }, 626 | { 627 | "cell_type": "markdown", 628 | "metadata": {}, 629 | "source": [ 630 | "## Filter data in a column" 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": null, 636 | "metadata": {}, 637 | "outputs": [], 638 | "source": [ 639 | "df[df['Name'] == 'Iris-versicolor'].head()" 640 | ] 641 | }, 642 | { 643 | "cell_type": "markdown", 644 | "metadata": {}, 645 | "source": [ 646 | "## Filter multiple values" 647 | ] 648 | }, 649 | { 650 | "cell_type": "code", 651 | "execution_count": null, 652 | "metadata": {}, 653 | "outputs": [], 654 | "source": [ 655 | "df[df['Name'].isin(['Iris-versicolor', 'Iris-virginica'])]" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "## Filter multiple values using a list" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": null, 668 | "metadata": {}, 669 | "outputs": [], 670 | "source": [ 671 | "Filter_Value = ['Iris-versicolor', 'Iris-virginica']" 672 | ] 673 | }, 674 | { 675 | "cell_type": "code", 676 | "execution_count": null, 677 | "metadata": {}, 678 | "outputs": [], 679 | "source": [ 680 | "df[df['Name'].isin(Filter_Value)]" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": {}, 686 | "source": [ 687 | "## Filter values NOT in list or not equal to in Excel" 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": null, 693 | "metadata": {}, 694 | "outputs": [], 695 | "source": [ 696 | "df[~df['Name'].isin(Filter_Value)]" 697 | ] 698 | }, 699 | { 700 | "cell_type": "markdown", 701 | "metadata": {}, 702 | "source": [ 703 | "## Filter usinng using multiple conditions in multiple columns\n", 704 | "__The input should always be a list__\n", 705 | "
We can use this method to replicate advanced filter function in excel" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": null, 711 | "metadata": {}, 712 | "outputs": [], 713 | "source": [ 714 | "width = [2]\n", 715 | "Flower_Name = ['Iris-setosa']\n", 716 | "df[~df['Name'].isin(Flower_Name) & df['PetalWidth'].isin(width)]" 717 | ] 718 | }, 719 | { 720 | "cell_type": "markdown", 721 | "metadata": {}, 722 | "source": [ 723 | "## Filter using numeric conditions" 724 | ] 725 | }, 726 | { 727 | "cell_type": "code", 728 | "execution_count": null, 729 | "metadata": {}, 730 | "outputs": [], 731 | "source": [ 732 | "df[df['SepalLength'] == 5.1].head()" 733 | ] 734 | }, 735 | { 736 | "cell_type": "code", 737 | "execution_count": null, 738 | "metadata": {}, 739 | "outputs": [], 740 | "source": [ 741 | "df[df['SepalLength'] > 5.1].head()" 742 | ] 743 | }, 744 | { 745 | "cell_type": "markdown", 746 | "metadata": {}, 747 | "source": [ 748 | "## Replicate the custom filter in Excel" 749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": null, 754 | "metadata": {}, 755 | "outputs": [], 756 | "source": [ 757 | "df[df['Name'].map(lambda x: x.endswith('sa'))]" 758 | ] 759 | }, 760 | { 761 | "cell_type": "markdown", 762 | "metadata": {}, 763 | "source": [ 764 | "## Combine two filters to get the result" 765 | ] 766 | }, 767 | { 768 | "cell_type": "code", 769 | "execution_count": null, 770 | "metadata": {}, 771 | "outputs": [], 772 | "source": [ 773 | "df[df['Name'].map(lambda x: x.endswith('sa')) & (df['SepalLength'] > 5.1)]" 774 | ] 775 | }, 776 | { 777 | "cell_type": "markdown", 778 | "metadata": {}, 779 | "source": [ 780 | "## Contains function in Excel" 781 | ] 782 | }, 783 | { 784 | "cell_type": "code", 785 | "execution_count": null, 786 | "metadata": {}, 787 | "outputs": [], 788 | "source": [ 789 | "df[df['Name'].str.contains('set')]" 790 | ] 791 | }, 792 | { 793 | "cell_type": "markdown", 794 | "metadata": {}, 795 | "source": [ 796 | "## Get the unique values from dataframe" 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | "execution_count": null, 802 | "metadata": {}, 803 | "outputs": [], 804 | "source": [ 805 | "df['SepalLength'].unique()" 806 | ] 807 | }, 808 | { 809 | "cell_type": "markdown", 810 | "metadata": {}, 811 | "source": [ 812 | "If we want to view the entire dataframe with the unique values, we can use the drop_duplicates method" 813 | ] 814 | }, 815 | { 816 | "cell_type": "code", 817 | "execution_count": null, 818 | "metadata": {}, 819 | "outputs": [], 820 | "source": [ 821 | "df.drop_duplicates(subset=['Name'])" 822 | ] 823 | }, 824 | { 825 | "cell_type": "code", 826 | "execution_count": null, 827 | "metadata": {}, 828 | "outputs": [], 829 | "source": [ 830 | "df.drop_duplicates(subset=['Name']).iloc[:,[3,4]]" 831 | ] 832 | }, 833 | { 834 | "cell_type": "markdown", 835 | "metadata": {}, 836 | "source": [ 837 | "## Sort Values" 838 | ] 839 | }, 840 | { 841 | "cell_type": "markdown", 842 | "metadata": {}, 843 | "source": [ 844 | "Sort data by a certain column, by default the sorting is ascending" 845 | ] 846 | }, 847 | { 848 | "cell_type": "code", 849 | "execution_count": null, 850 | "metadata": {}, 851 | "outputs": [], 852 | "source": [ 853 | "df.sort_values(by = ['SepalLength'])" 854 | ] 855 | }, 856 | { 857 | "cell_type": "code", 858 | "execution_count": null, 859 | "metadata": {}, 860 | "outputs": [], 861 | "source": [ 862 | "df.sort_values(by = ['SepalLength'], ascending = False)" 863 | ] 864 | }, 865 | { 866 | "cell_type": 
"markdown", 867 | "metadata": {}, 868 | "source": [ 869 | "
------------------------------------------------------------------------------------------
" 870 | ] 871 | }, 872 | { 873 | "cell_type": "markdown", 874 | "metadata": {}, 875 | "source": [ 876 | "# Statistical summary of data" 877 | ] 878 | }, 879 | { 880 | "cell_type": "markdown", 881 | "metadata": {}, 882 | "source": [ 883 | "## __DataFrame Describe method:__ \n", 884 | "_Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values._" 885 | ] 886 | }, 887 | { 888 | "cell_type": "code", 889 | "execution_count": null, 890 | "metadata": { 891 | "scrolled": true 892 | }, 893 | "outputs": [], 894 | "source": [ 895 | "df.describe()" 896 | ] 897 | }, 898 | { 899 | "cell_type": "markdown", 900 | "metadata": {}, 901 | "source": [ 902 | "Summary stats of character columns" 903 | ] 904 | }, 905 | { 906 | "cell_type": "code", 907 | "execution_count": null, 908 | "metadata": {}, 909 | "outputs": [], 910 | "source": [ 911 | "df.describe(include = ['object'])" 912 | ] 913 | }, 914 | { 915 | "cell_type": "code", 916 | "execution_count": null, 917 | "metadata": {}, 918 | "outputs": [], 919 | "source": [ 920 | "df.describe(include = 'all')" 921 | ] 922 | }, 923 | { 924 | "cell_type": "markdown", 925 | "metadata": {}, 926 | "source": [ 927 | "
------------------------------------------------------------------------------------------
" 928 | ] 929 | }, 930 | { 931 | "cell_type": "markdown", 932 | "metadata": {}, 933 | "source": [ 934 | "# Data Aggregation" 935 | ] 936 | }, 937 | { 938 | "cell_type": "markdown", 939 | "metadata": {}, 940 | "source": [ 941 | "## Counting the unique values of a particular column. \n", 942 | "_Resulting output is a Series. You can refer it as a Single column Pivot Table_" 943 | ] 944 | }, 945 | { 946 | "cell_type": "code", 947 | "execution_count": null, 948 | "metadata": {}, 949 | "outputs": [], 950 | "source": [ 951 | "pd.value_counts(df['Name'])" 952 | ] 953 | }, 954 | { 955 | "cell_type": "markdown", 956 | "metadata": {}, 957 | "source": [ 958 | "## Count cells\n", 959 | "Count non-NA cells for each column or row." 960 | ] 961 | }, 962 | { 963 | "cell_type": "code", 964 | "execution_count": null, 965 | "metadata": { 966 | "scrolled": true 967 | }, 968 | "outputs": [], 969 | "source": [ 970 | "df.count(axis=0)" 971 | ] 972 | }, 973 | { 974 | "cell_type": "markdown", 975 | "metadata": {}, 976 | "source": [ 977 | "## Sum\n", 978 | "Summarising the data to get a snapshot of either by rows or columns" 979 | ] 980 | }, 981 | { 982 | "cell_type": "code", 983 | "execution_count": null, 984 | "metadata": {}, 985 | "outputs": [], 986 | "source": [ 987 | "df.sum(axis = 0) # 0 for column wise total" 988 | ] 989 | }, 990 | { 991 | "cell_type": "markdown", 992 | "metadata": {}, 993 | "source": [ 994 | "Its replicates the method of adding a total column against each row" 995 | ] 996 | }, 997 | { 998 | "cell_type": "code", 999 | "execution_count": null, 1000 | "metadata": {}, 1001 | "outputs": [], 1002 | "source": [ 1003 | "df.sum(axis =1) # row wise sum" 1004 | ] 1005 | }, 1006 | { 1007 | "cell_type": "markdown", 1008 | "metadata": {}, 1009 | "source": [ 1010 | "## Add a total column to the existing dataset" 1011 | ] 1012 | }, 1013 | { 1014 | "cell_type": "code", 1015 | "execution_count": null, 1016 | "metadata": {}, 1017 | "outputs": [], 1018 | "source": [ 1019 | "df['Total'] = df.sum(axis =1)" 1020 | ] 1021 | }, 1022 | { 1023 | "cell_type": "code", 1024 | "execution_count": null, 1025 | "metadata": {}, 1026 | "outputs": [], 1027 | "source": [ 1028 | "df.head()" 1029 | ] 1030 | }, 1031 | { 1032 | "cell_type": "markdown", 1033 | "metadata": {}, 1034 | "source": [ 1035 | "## Sum of specific columns, use the loc methos and pass the column names" 1036 | ] 1037 | }, 1038 | { 1039 | "cell_type": "code", 1040 | "execution_count": null, 1041 | "metadata": {}, 1042 | "outputs": [], 1043 | "source": [ 1044 | "df['Total_loc']=df.loc[:,['SepalLength', 'SepalWidth']].sum(axis=1)" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "code", 1049 | "execution_count": null, 1050 | "metadata": { 1051 | "scrolled": true 1052 | }, 1053 | "outputs": [], 1054 | "source": [ 1055 | "df.head()" 1056 | ] 1057 | }, 1058 | { 1059 | "cell_type": "markdown", 1060 | "metadata": {}, 1061 | "source": [ 1062 | "## Or, we can use the below method" 1063 | ] 1064 | }, 1065 | { 1066 | "cell_type": "code", 1067 | "execution_count": null, 1068 | "metadata": {}, 1069 | "outputs": [], 1070 | "source": [ 1071 | "df['Total_DFSum']= df['SepalLength'] + df['SepalWidth']" 1072 | ] 1073 | }, 1074 | { 1075 | "cell_type": "code", 1076 | "execution_count": null, 1077 | "metadata": {}, 1078 | "outputs": [], 1079 | "source": [ 1080 | "df.head()" 1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "markdown", 1085 | "metadata": {}, 1086 | "source": [ 1087 | "### Don't like the new column, delete it using drop method" 1088 | ] 1089 | }, 1090 | { 1091 | 
"cell_type": "code", 1092 | "execution_count": null, 1093 | "metadata": {}, 1094 | "outputs": [], 1095 | "source": [ 1096 | "df.drop(['Total_DFSum'], axis = 1)" 1097 | ] 1098 | }, 1099 | { 1100 | "cell_type": "markdown", 1101 | "metadata": {}, 1102 | "source": [ 1103 | "## Adding sum-total beneath each column" 1104 | ] 1105 | }, 1106 | { 1107 | "cell_type": "code", 1108 | "execution_count": null, 1109 | "metadata": {}, 1110 | "outputs": [], 1111 | "source": [ 1112 | "Sum_Total = df[['SepalLength', 'SepalWidth', 'Total']].sum()" 1113 | ] 1114 | }, 1115 | { 1116 | "cell_type": "code", 1117 | "execution_count": null, 1118 | "metadata": {}, 1119 | "outputs": [], 1120 | "source": [ 1121 | "Sum_Total" 1122 | ] 1123 | }, 1124 | { 1125 | "cell_type": "code", 1126 | "execution_count": null, 1127 | "metadata": {}, 1128 | "outputs": [], 1129 | "source": [ 1130 | "T_Sum = pd.DataFrame(data=Sum_Total).T" 1131 | ] 1132 | }, 1133 | { 1134 | "cell_type": "code", 1135 | "execution_count": null, 1136 | "metadata": {}, 1137 | "outputs": [], 1138 | "source": [ 1139 | "T_Sum" 1140 | ] 1141 | }, 1142 | { 1143 | "cell_type": "code", 1144 | "execution_count": null, 1145 | "metadata": {}, 1146 | "outputs": [], 1147 | "source": [ 1148 | "T_Sum = T_Sum.reindex(columns=df.columns)" 1149 | ] 1150 | }, 1151 | { 1152 | "cell_type": "code", 1153 | "execution_count": null, 1154 | "metadata": {}, 1155 | "outputs": [], 1156 | "source": [ 1157 | "T_Sum" 1158 | ] 1159 | }, 1160 | { 1161 | "cell_type": "code", 1162 | "execution_count": null, 1163 | "metadata": {}, 1164 | "outputs": [], 1165 | "source": [ 1166 | "Row_Total = df.append(T_Sum,ignore_index=True)" 1167 | ] 1168 | }, 1169 | { 1170 | "cell_type": "code", 1171 | "execution_count": null, 1172 | "metadata": {}, 1173 | "outputs": [], 1174 | "source": [ 1175 | "Row_Total" 1176 | ] 1177 | }, 1178 | { 1179 | "cell_type": "markdown", 1180 | "metadata": {}, 1181 | "source": [ 1182 | "A lot has been done above, the approach that we are using is:\n", 1183 | "* Sum_Total: Do the sum of columns\n", 1184 | "* T_Sum: Convert the series output to dataframe and transpose\n", 1185 | "* Re-index to add missing columns\n", 1186 | "* Row_Total: append T_Sum to existing dataframe" 1187 | ] 1188 | }, 1189 | { 1190 | "cell_type": "markdown", 1191 | "metadata": {}, 1192 | "source": [ 1193 | "## Sum based on criteria i.e. 
Sumif in Excel" 1194 | ] 1195 | }, 1196 | { 1197 | "cell_type": "code", 1198 | "execution_count": null, 1199 | "metadata": {}, 1200 | "outputs": [], 1201 | "source": [ 1202 | "df[df['Name'] == 'Iris-versicolor'].sum()" 1203 | ] 1204 | }, 1205 | { 1206 | "cell_type": "markdown", 1207 | "metadata": {}, 1208 | "source": [ 1209 | "## Sumifs" 1210 | ] 1211 | }, 1212 | { 1213 | "cell_type": "code", 1214 | "execution_count": null, 1215 | "metadata": {}, 1216 | "outputs": [], 1217 | "source": [ 1218 | "df[df['Name'].map(lambda x: x.endswith('sa')) & (df['SepalLength'] > 5.1)].sum()" 1219 | ] 1220 | }, 1221 | { 1222 | "cell_type": "markdown", 1223 | "metadata": {}, 1224 | "source": [ 1225 | "## Averageif" 1226 | ] 1227 | }, 1228 | { 1229 | "cell_type": "code", 1230 | "execution_count": null, 1231 | "metadata": {}, 1232 | "outputs": [], 1233 | "source": [ 1234 | "df[df['Name'] == 'Iris-versicolor'].mean()" 1235 | ] 1236 | }, 1237 | { 1238 | "cell_type": "markdown", 1239 | "metadata": {}, 1240 | "source": [ 1241 | "## Averageifs" 1242 | ] 1243 | }, 1244 | { 1245 | "cell_type": "code", 1246 | "execution_count": null, 1247 | "metadata": {}, 1248 | "outputs": [], 1249 | "source": [ 1250 | "df[df['Name'].map(lambda x: x.endswith('sa')) & (df['SepalLength'] > 5.1)].mean()" 1251 | ] 1252 | }, 1253 | { 1254 | "cell_type": "markdown", 1255 | "metadata": {}, 1256 | "source": [ 1257 | "## Max" 1258 | ] 1259 | }, 1260 | { 1261 | "cell_type": "code", 1262 | "execution_count": null, 1263 | "metadata": {}, 1264 | "outputs": [], 1265 | "source": [ 1266 | "df[df['Name'] == 'Iris-versicolor'].max()" 1267 | ] 1268 | }, 1269 | { 1270 | "cell_type": "markdown", 1271 | "metadata": {}, 1272 | "source": [ 1273 | "## Min" 1274 | ] 1275 | }, 1276 | { 1277 | "cell_type": "code", 1278 | "execution_count": null, 1279 | "metadata": {}, 1280 | "outputs": [], 1281 | "source": [ 1282 | "df[df['Name'] == 'Iris-versicolor'].min()" 1283 | ] 1284 | }, 1285 | { 1286 | "cell_type": "markdown", 1287 | "metadata": {}, 1288 | "source": [ 1289 | "# Groupby i.e. Subtotals in Excel" 1290 | ] 1291 | }, 1292 | { 1293 | "cell_type": "code", 1294 | "execution_count": null, 1295 | "metadata": {}, 1296 | "outputs": [], 1297 | "source": [ 1298 | "df[['Name','SepalLength']].groupby('Name').sum()" 1299 | ] 1300 | }, 1301 | { 1302 | "cell_type": "code", 1303 | "execution_count": null, 1304 | "metadata": {}, 1305 | "outputs": [], 1306 | "source": [ 1307 | "GroupBy = df.groupby('Name').sum()" 1308 | ] 1309 | }, 1310 | { 1311 | "cell_type": "code", 1312 | "execution_count": null, 1313 | "metadata": {}, 1314 | "outputs": [], 1315 | "source": [ 1316 | "Group_By.append(pd.DataFrame(df[['SepalLength','SepalWidth','PetalLength','PetalWidth']].sum()).T)" 1317 | ] 1318 | }, 1319 | { 1320 | "cell_type": "markdown", 1321 | "metadata": {}, 1322 | "source": [ 1323 | "
------------------------------------------------------------------------------------------
" 1324 | ] 1325 | }, 1326 | { 1327 | "cell_type": "markdown", 1328 | "metadata": {}, 1329 | "source": [ 1330 | "# Pivot Tables in Dataframes i.e. Pivot Tables in Excel" 1331 | ] 1332 | }, 1333 | { 1334 | "cell_type": "markdown", 1335 | "metadata": {}, 1336 | "source": [ 1337 | "Who doesn'l love a Pivot Table in Excel, its one the best ways to analyse your data, have a quick overview of the information, helps you slice and dice the data with a super easy interface, helps you plots graphs basis on the data, add calculative columns etc.\n", 1338 | "
No, we wont have an interface to work, we'll have to explicitly write the code to get the output, No, it wont generate charts for you, but I don't think we can complete a tutorial without learning about the Pivot tables." 1339 | ] 1340 | }, 1341 | { 1342 | "cell_type": "code", 1343 | "execution_count": null, 1344 | "metadata": {}, 1345 | "outputs": [], 1346 | "source": [ 1347 | "pd.pivot_table(df, index= 'Name')#Same as Groupby" 1348 | ] 1349 | }, 1350 | { 1351 | "cell_type": "code", 1352 | "execution_count": null, 1353 | "metadata": {}, 1354 | "outputs": [], 1355 | "source": [ 1356 | "pd.pivot_table(df, values='SepalWidth', index= 'SepalLength',columns='Name', aggfunc = np.sum)" 1357 | ] 1358 | }, 1359 | { 1360 | "cell_type": "markdown", 1361 | "metadata": {}, 1362 | "source": [ 1363 | "A simple Pivot table showing us the sum of SepalWidth in values, SepalLength in Row Column and Name in Column Labels" 1364 | ] 1365 | }, 1366 | { 1367 | "cell_type": "markdown", 1368 | "metadata": {}, 1369 | "source": [ 1370 | "Lets see if we can complicate it a bit." 1371 | ] 1372 | }, 1373 | { 1374 | "cell_type": "code", 1375 | "execution_count": null, 1376 | "metadata": {}, 1377 | "outputs": [], 1378 | "source": [ 1379 | "pd.pivot_table(df, values='SepalWidth', index= 'SepalLength',columns='Name', aggfunc = np.sum, fill_value=0)" 1380 | ] 1381 | }, 1382 | { 1383 | "cell_type": "markdown", 1384 | "metadata": {}, 1385 | "source": [ 1386 | "Blanks are now replaced with 0's by using the fill_value argument" 1387 | ] 1388 | }, 1389 | { 1390 | "cell_type": "code", 1391 | "execution_count": null, 1392 | "metadata": {}, 1393 | "outputs": [], 1394 | "source": [ 1395 | "pd.pivot_table(df, values=['SepalWidth', 'PetalWidth'], index= 'SepalLength',columns='Name', aggfunc = np.sum, fill_value=0)" 1396 | ] 1397 | }, 1398 | { 1399 | "cell_type": "code", 1400 | "execution_count": null, 1401 | "metadata": {}, 1402 | "outputs": [], 1403 | "source": [ 1404 | "pd.pivot_table(df, values=['SepalWidth', 'PetalWidth'], index= ['SepalLength', 'PetalLength'],columns='Name', aggfunc = np.sum, fill_value=0)" 1405 | ] 1406 | }, 1407 | { 1408 | "cell_type": "code", 1409 | "execution_count": null, 1410 | "metadata": {}, 1411 | "outputs": [], 1412 | "source": [ 1413 | "pd.pivot_table(df, values=['SepalWidth', 'PetalWidth'], index= 'SepalLength',columns='Name', \n", 1414 | " aggfunc = {'SepalWidth': np.sum, 'PetalWidth': np.mean}, fill_value=0)" 1415 | ] 1416 | }, 1417 | { 1418 | "cell_type": "markdown", 1419 | "metadata": {}, 1420 | "source": [ 1421 | "We can have individual calculations on values using dictionary method and can also have multiple calculations on values" 1422 | ] 1423 | }, 1424 | { 1425 | "cell_type": "code", 1426 | "execution_count": null, 1427 | "metadata": {}, 1428 | "outputs": [], 1429 | "source": [ 1430 | "pd.pivot_table(df, values=['SepalWidth', 'PetalWidth'], index= 'SepalLength',columns='Name', \n", 1431 | " aggfunc = {'SepalWidth': np.sum, 'PetalWidth': np.mean}, fill_value=0, margins=True)" 1432 | ] 1433 | }, 1434 | { 1435 | "cell_type": "markdown", 1436 | "metadata": {}, 1437 | "source": [ 1438 | "If we use margins argument, we can have total row added" 1439 | ] 1440 | }, 1441 | { 1442 | "cell_type": "markdown", 1443 | "metadata": {}, 1444 | "source": [ 1445 | "
------------------------------------------------------------------------------------------
" 1446 | ] 1447 | }, 1448 | { 1449 | "cell_type": "markdown", 1450 | "metadata": {}, 1451 | "source": [ 1452 | "# Vlookup" 1453 | ] 1454 | }, 1455 | { 1456 | "cell_type": "markdown", 1457 | "metadata": {}, 1458 | "source": [ 1459 | "What a magical formula is vlookup in Excel, I think its the first thing that everyone wants to learn before learning how to even add. Looks fascinating when someone is applying vlookup, looks like magic when we get the output. Makes life easy. I can with very much confidence can say its the backbone of every data wrangling action performed on the spreadsheet.\n", 1460 | "
\n", 1461 | "
__Unfortunately__ we don't have a vlookup function in Pandas!\n", 1462 | "
\n", 1463 | "
Since we dont have a \"Vlookup\" function in Pandas, Merge is used as an alternate which is same as SQL. There are a total of four merge options available:\n", 1464 | "* ‘left’ — Use the shared column from the left dataframe and match to right dataframe. Fill in any N/A as NaN\n", 1465 | "\n", 1466 | "* ‘right’ — Use the shared column from the right dataframe and match to left dataframe. Fill in any N/A as NaN\n", 1467 | "\n", 1468 | "* ‘inner’ — Only show data where the two shared columns overlap. Default method.\n", 1469 | "\n", 1470 | "* ‘outer’ — Return all records when there is a match in either left or right dataframe.\n", 1471 | "
" 1472 | ] 1473 | }, 1474 | { 1475 | "cell_type": "code", 1476 | "execution_count": null, 1477 | "metadata": {}, 1478 | "outputs": [], 1479 | "source": [ 1480 | "df1 = pd.read_excel(filename)" 1481 | ] 1482 | }, 1483 | { 1484 | "cell_type": "code", 1485 | "execution_count": null, 1486 | "metadata": {}, 1487 | "outputs": [], 1488 | "source": [ 1489 | "lookup = df.merge(df,on='Name')" 1490 | ] 1491 | }, 1492 | { 1493 | "cell_type": "code", 1494 | "execution_count": null, 1495 | "metadata": {}, 1496 | "outputs": [], 1497 | "source": [ 1498 | "lookup" 1499 | ] 1500 | }, 1501 | { 1502 | "cell_type": "markdown", 1503 | "metadata": {}, 1504 | "source": [ 1505 | "The above might not be the best example to suppport the concept, however the working is the same." 1506 | ] 1507 | }, 1508 | { 1509 | "cell_type": "markdown", 1510 | "metadata": {}, 1511 | "source": [ 1512 | "
------------------------------------------------------------------------------------------
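" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We said our input and output are both Excel files, yet we never wrote anything back. A minimal to_excel sketch, assuming we want a hypothetical output file named Iris_out.xlsx:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write the dataframe back to Excel; 'Iris_out.xlsx' is a hypothetical output name\n", "df.to_excel('Iris_out.xlsx', sheet_name='Output', index=False)" ] },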
" 1513 | ] 1514 | }, 1515 | { 1516 | "cell_type": "markdown", 1517 | "metadata": {}, 1518 | "source": [ 1519 | "I am hoping this tutorial made some sense, though I agree this could have been more elaborative which I will be pursuing soon.\n", 1520 | "
__Watch this space for more.__" 1521 | ] 1522 | } 1523 | ], 1524 | "metadata": { 1525 | "kernelspec": { 1526 | "display_name": "Python 3", 1527 | "language": "python", 1528 | "name": "python3" 1529 | }, 1530 | "language_info": { 1531 | "codemirror_mode": { 1532 | "name": "ipython", 1533 | "version": 3 1534 | }, 1535 | "file_extension": ".py", 1536 | "mimetype": "text/x-python", 1537 | "name": "python", 1538 | "nbconvert_exporter": "python", 1539 | "pygments_lexer": "ipython3", 1540 | "version": "3.7.1" 1541 | } 1542 | }, 1543 | "nbformat": 4, 1544 | "nbformat_minor": 2 1545 | } 1546 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Ditching Excel for Python 2 | Functionalities in Excel translated to Python. 3 | 4 | The very first thought of creating this tutorial came when I had exhausted many hours online searching for a one-stop shop that replicates the most common Excel functions using Pandas and Numpy. There are many tutorials/articles/blogs out there, but I had to traverse between multiple tabs; to avoid wasting time moving between tabs, and because there might be at least one person like me searching for the same thing, I thought of creating this tutorial. 5 | 6 | In this tutorial, we'll use Excel as input and output only; all other Excel tasks will be carried out using the Python libraries Pandas and Numpy. 7 | 8 | This is surely not the best tutorial out there, but it should help you start your journey. I am working on another, more exhaustive one, which should be out soon. 9 | 10 | Happy Learning! 11 | 12 | [](https://deepnote.com/launch?url=https://github.com/ank0409/Ditching-Excel-for-Python/blob/master/Ditching%20Excel%20for%20Python!.ipynb) 13 | 14 | If my tutorial helped you at all, consider buying me a Ko-fi! 15 |
[![ko-fi](https://www.ko-fi.com/img/donate_sm.png)](https://ko-fi.com/I2I0PGQA) 16 | --------------------------------------------------------------------------------