├── 01-DA_Numpy_arrays_creation.ipynb
├── 02-DA_Numpy_array_maths.ipynb
├── 03-DA_Numpy_matplotlib.ipynb
├── 04-DA_Numpy_indexing.ipynb
├── 05-DA_Numpy_combining_arrays.ipynb
├── 06-DA_Pandas_introduction.ipynb
├── 07-DA_Pandas_structures.ipynb
├── 08-DA_Pandas_import_plotting.ipynb
├── 09-DA_Pandas_operations.ipynb
├── 10-DA_Pandas_combine.ipynb
├── 11-DA_Pandas_splitting.ipynb
├── 12-DA_Pandas_realworld.ipynb
├── 98-DA_Numpy_Exercises.ipynb
├── 98-DA_Numpy_Solutions.ipynb
├── 99-DA_Pandas_Exercises.ipynb
├── 99-DA_Pandas_Solutions.ipynb
├── Data
│   ├── AB_NYC_2019.csv
│   ├── P3_GrantExport.csv
│   ├── P3_PersonExport.csv
│   ├── composers.xlsx
│   └── ny_boroughs.xlsx
├── LICENSE
├── README.md
├── binder
│   ├── environment.yml
│   └── postBuild
├── colab
│   ├── automate_colab_editing.ipynb
│   └── colab_data.sh
└── svg.py

/01-DA_Numpy_arrays_creation.ipynb:
--------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 1. Creating Numpy arrays" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Python has many different types of data \"containers\": lists, dictionaries, tuples etc. However, none of them allows for efficient numerical calculation, in particular not in multi-dimensional cases (think e.g. of operations on images). Numpy has been developed exactly to fill this gap. It provides a new data structure, the **numpy array**, and a large library of operations that allow one to: \n", 15 | "- generate such arrays\n", 16 | "- combine arrays in different ways (concatenation, stacking etc.)\n", 17 | "- modify such arrays (projection, extraction of sub-arrays etc.)\n", 18 | "- apply mathematical operations on them\n", 19 | "\n", 20 | "Numpy is the base of almost the entire Python scientific programming stack. Many libraries build on top of Numpy, either by providing specialized functions to operate on arrays (e.g. scikit-image for image processing) or by creating more complex data containers on top of it. The data science library Pandas, which will also be presented in this course, is a good example of this with its dataframe structures.\n" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "import numpy as np\n", 30 | "from svg import numpy_to_svg" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "## 1.1 What is an array ?"
38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "Let us create the simplest example of an array by transforming a regular Python list into an array (we will see more advanced ways of creating arrays in the next chapters):" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "mylist = [2,5,3,9,5,2]" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 3, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "data": { 63 | "text/plain": [ 64 | "[2, 5, 3, 9, 5, 2]" 65 | ] 66 | }, 67 | "execution_count": 3, 68 | "metadata": {}, 69 | "output_type": "execute_result" 70 | } 71 | ], 72 | "source": [ 73 | "mylist" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 4, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "myarray = np.array(mylist)" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 5, 88 | "metadata": {}, 89 | "outputs": [ 90 | { 91 | "data": { 92 | "text/plain": [ 93 | "array([2, 5, 3, 9, 5, 2])" 94 | ] 95 | }, 96 | "execution_count": 5, 97 | "metadata": {}, 98 | "output_type": "execute_result" 99 | } 100 | ], 101 | "source": [ 102 | "myarray" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 6, 108 | "metadata": {}, 109 | "outputs": [ 110 | { 111 | "data": { 112 | "text/plain": [ 113 | "numpy.ndarray" 114 | ] 115 | }, 116 | "execution_count": 6, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "type(myarray)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "We see that ```myarray``` is a Numpy array thanks to the ```array``` specification in the output. The type also says that we have a numpy ndarray (n-dimensional). At this point we don't see a big difference with regular lists, but we'll see in the following sections all the operations we can do with these objects.\n", 130 | "\n", 131 | "We can already see a difference with two basic attributes of arrays: their type and shape." 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "### 1.1.1 Array Type" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "Just like when we create regular variables in Python, arrays receive a type when created. Unlike regular lists, **all** elements of an array always have the same type. The type of an array can be recovered through the ```.dtype``` attribute:" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 7, 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "data": { 155 | "text/plain": [ 156 | "dtype('int64')" 157 | ] 158 | }, 159 | "execution_count": 7, 160 | "metadata": {}, 161 | "output_type": "execute_result" 162 | } 163 | ], 164 | "source": [ 165 | "myarray.dtype" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "Depending on the content of the list, the array will have different types, but the logic of \"maximal complexity\" is kept: every element is promoted to the most general type present. For example if we mix integers and floats, we get a float array:" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 8, 178 | "metadata": {}, 179 | "outputs": [ 180 | { 181 | "data": { 182 | "text/plain": [ 183 | "array([1.2, 6. , 7.6, 5. 
])" 184 | ] 185 | }, 186 | "execution_count": 8, 187 | "metadata": {}, 188 | "output_type": "execute_result" 189 | } 190 | ], 191 | "source": [ 192 | "myarray2 = np.array([1.2, 6, 7.6, 5])\n", 193 | "myarray2" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 9, 199 | "metadata": {}, 200 | "outputs": [ 201 | { 202 | "data": { 203 | "text/plain": [ 204 | "dtype('float64')" 205 | ] 206 | }, 207 | "execution_count": 9, 208 | "metadata": {}, 209 | "output_type": "execute_result" 210 | } 211 | ], 212 | "source": [ 213 | "myarray2.dtype" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "In general, we have the possibility to assign a type to an array. This is true here, as well as later when we'll create more complex arrays, and is done via the ```dtype``` option: " 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 10, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "data": { 230 | "text/plain": [ 231 | "array([ 1, 6, 7, 244], dtype=uint8)" 232 | ] 233 | }, 234 | "execution_count": 10, 235 | "metadata": {}, 236 | "output_type": "execute_result" 237 | } 238 | ], 239 | "source": [ 240 | "myarray2 = np.array([1.2, 6, 7.6, 500], dtype=np.uint8)\n", 241 | "myarray2" 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "metadata": {}, 247 | "source": [ 248 | "The type of the array can also be changed after creation using the ```.astype()``` method:" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 11, 254 | "metadata": {}, 255 | "outputs": [ 256 | { 257 | "data": { 258 | "text/plain": [ 259 | "dtype('float64')" 260 | ] 261 | }, 262 | "execution_count": 11, 263 | "metadata": {}, 264 | "output_type": "execute_result" 265 | } 266 | ], 267 | "source": [ 268 | "myfloat_array = np.array([1.2, 6, 7.6, 500], dtype=np.float)\n", 269 | "myfloat_array.dtype" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 12, 275 | "metadata": {}, 276 | "outputs": [ 277 | { 278 | "data": { 279 | "text/plain": [ 280 | "dtype('int8')" 281 | ] 282 | }, 283 | "execution_count": 12, 284 | "metadata": {}, 285 | "output_type": "execute_result" 286 | } 287 | ], 288 | "source": [ 289 | "myint_array = myfloat_array.astype(np.int8)\n", 290 | "myint_array.dtype" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "### 1.1.2 Array shape\n", 298 | "\n", 299 | "A very important property of an array is its **shape** or in other words the dimensions of each axis. That property can be accessed via the ```.shape``` property:" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": 13, 305 | "metadata": {}, 306 | "outputs": [ 307 | { 308 | "data": { 309 | "text/plain": [ 310 | "array([2, 5, 3, 9, 5, 2])" 311 | ] 312 | }, 313 | "execution_count": 13, 314 | "metadata": {}, 315 | "output_type": "execute_result" 316 | } 317 | ], 318 | "source": [ 319 | "myarray" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 14, 325 | "metadata": {}, 326 | "outputs": [ 327 | { 328 | "data": { 329 | "text/plain": [ 330 | "(6,)" 331 | ] 332 | }, 333 | "execution_count": 14, 334 | "metadata": {}, 335 | "output_type": "execute_result" 336 | } 337 | ], 338 | "source": [ 339 | "myarray.shape" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "We see that our simple array has only one dimension of length 6. 
Now of course we can create more complex arrays. Let's create for example a *list of two lists*:" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 15, 352 | "metadata": {}, 353 | "outputs": [ 354 | { 355 | "data": { 356 | "text/plain": [ 357 | "array([[1, 2, 3],\n", 358 | " [4, 5, 6]])" 359 | ] 360 | }, 361 | "execution_count": 15, 362 | "metadata": {}, 363 | "output_type": "execute_result" 364 | } 365 | ], 366 | "source": [ 367 | "my2d_list = [[1,2,3], [4,5,6]]\n", 368 | "\n", 369 | "my2d_array = np.array(my2d_list)\n", 370 | "my2d_array" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 16, 376 | "metadata": {}, 377 | "outputs": [ 378 | { 379 | "data": { 380 | "text/plain": [ 381 | "(2, 3)" 382 | ] 383 | }, 384 | "execution_count": 16, 385 | "metadata": {}, 386 | "output_type": "execute_result" 387 | } 388 | ], 389 | "source": [ 390 | "my2d_array.shape" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "metadata": {}, 396 | "source": [ 397 | "We see now that this array is *two-dimensional*. We also see that we have 2 lists of 3 elements. In fact at this point we should forget that we have a list of lists and simply consider this object as a *matrix* with *two rows and three columns*. We'll use the following graphical representation to clarify some concepts:" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 17, 403 | "metadata": {}, 404 | "outputs": [ 405 | { 406 | "data": { 407 | "text/html": [ 408 | "\n", 409 | "\n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | "\n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | "\n", 421 | " \n", 422 | " \n", 423 | "\n", 424 | " \n", 425 | " 3\n", 426 | " 2\n", 427 | "" 428 | ], 429 | "text/plain": [ 430 | "" 431 | ] 432 | }, 433 | "execution_count": 17, 434 | "metadata": {}, 435 | "output_type": "execute_result" 436 | } 437 | ], 438 | "source": [ 439 | "numpy_to_svg(my2d_array)" 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": {}, 445 | "source": [ 446 | "## 1.2 Creating arrays\n", 447 | "\n", 448 | "We have seen that we can turn regular lists into arrays. However, this quickly becomes impractical for larger arrays. Numpy offers several functions to create particular arrays.
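For instance, beyond the arrays of zeros and ones shown in the next section, ```np.full``` (a brief sketch; this function is not used elsewhere in this course) fills an array of any given shape with an arbitrary constant value:

```python
# a 2x3 array filled with the value 7.5
np.full((2, 3), 7.5)
```
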
" 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": {}, 454 | "source": [ 455 | "### 1.2.1 Common simple arrays\n", 456 | "For example an array full of zeros or ones:" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": 18, 462 | "metadata": {}, 463 | "outputs": [ 464 | { 465 | "data": { 466 | "text/plain": [ 467 | "array([[1., 1., 1.],\n", 468 | " [1., 1., 1.]])" 469 | ] 470 | }, 471 | "execution_count": 18, 472 | "metadata": {}, 473 | "output_type": "execute_result" 474 | } 475 | ], 476 | "source": [ 477 | "one_array = np.ones((2,3))\n", 478 | "one_array" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": 19, 484 | "metadata": {}, 485 | "outputs": [ 486 | { 487 | "data": { 488 | "text/plain": [ 489 | "array([[0., 0., 0.],\n", 490 | " [0., 0., 0.]])" 491 | ] 492 | }, 493 | "execution_count": 19, 494 | "metadata": {}, 495 | "output_type": "execute_result" 496 | } 497 | ], 498 | "source": [ 499 | "zero_array = np.zeros((2,3))\n", 500 | "zero_array" 501 | ] 502 | }, 503 | { 504 | "cell_type": "markdown", 505 | "metadata": {}, 506 | "source": [ 507 | "One can also create diagonal matrix:" 508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": 20, 513 | "metadata": {}, 514 | "outputs": [ 515 | { 516 | "data": { 517 | "text/plain": [ 518 | "array([[1., 0., 0.],\n", 519 | " [0., 1., 0.],\n", 520 | " [0., 0., 1.]])" 521 | ] 522 | }, 523 | "execution_count": 20, 524 | "metadata": {}, 525 | "output_type": "execute_result" 526 | } 527 | ], 528 | "source": [ 529 | "np.eye(3)" 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": {}, 535 | "source": [ 536 | "By default Numpy creates float arrays:" 537 | ] 538 | }, 539 | { 540 | "cell_type": "code", 541 | "execution_count": 21, 542 | "metadata": {}, 543 | "outputs": [ 544 | { 545 | "data": { 546 | "text/plain": [ 547 | "dtype('float64')" 548 | ] 549 | }, 550 | "execution_count": 21, 551 | "metadata": {}, 552 | "output_type": "execute_result" 553 | } 554 | ], 555 | "source": [ 556 | "one_array.dtype" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "However as mentioned before, one can impose a type usine the ```dtype``` option:" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 22, 569 | "metadata": {}, 570 | "outputs": [ 571 | { 572 | "data": { 573 | "text/plain": [ 574 | "array([[1, 1, 1],\n", 575 | " [1, 1, 1]], dtype=int8)" 576 | ] 577 | }, 578 | "execution_count": 22, 579 | "metadata": {}, 580 | "output_type": "execute_result" 581 | } 582 | ], 583 | "source": [ 584 | "one_array_int = np.ones((2,3), dtype=np.int8)\n", 585 | "one_array_int" 586 | ] 587 | }, 588 | { 589 | "cell_type": "code", 590 | "execution_count": 23, 591 | "metadata": {}, 592 | "outputs": [ 593 | { 594 | "data": { 595 | "text/plain": [ 596 | "dtype('int8')" 597 | ] 598 | }, 599 | "execution_count": 23, 600 | "metadata": {}, 601 | "output_type": "execute_result" 602 | } 603 | ], 604 | "source": [ 605 | "one_array_int.dtype" 606 | ] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "metadata": {}, 611 | "source": [ 612 | "### 1.2.2 Copying the shape\n", 613 | "Often one needs to create arrays of same shape. 
This can be done with \"like-functions\":" 614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "execution_count": 24, 619 | "metadata": {}, 620 | "outputs": [ 621 | { 622 | "data": { 623 | "text/plain": [ 624 | "array([[0., 0., 0.],\n", 625 | " [0., 0., 0.]])" 626 | ] 627 | }, 628 | "execution_count": 24, 629 | "metadata": {}, 630 | "output_type": "execute_result" 631 | } 632 | ], 633 | "source": [ 634 | "same_shape_array = np.zeros_like(one_array)\n", 635 | "same_shape_array" 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": 25, 641 | "metadata": {}, 642 | "outputs": [ 643 | { 644 | "data": { 645 | "text/plain": [ 646 | "(2, 3)" 647 | ] 648 | }, 649 | "execution_count": 25, 650 | "metadata": {}, 651 | "output_type": "execute_result" 652 | } 653 | ], 654 | "source": [ 655 | "one_array.shape" 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": 26, 661 | "metadata": {}, 662 | "outputs": [ 663 | { 664 | "data": { 665 | "text/plain": [ 666 | "(2, 3)" 667 | ] 668 | }, 669 | "execution_count": 26, 670 | "metadata": {}, 671 | "output_type": "execute_result" 672 | } 673 | ], 674 | "source": [ 675 | "same_shape_array.shape" 676 | ] 677 | }, 678 | { 679 | "cell_type": "code", 680 | "execution_count": 27, 681 | "metadata": {}, 682 | "outputs": [ 683 | { 684 | "data": { 685 | "text/plain": [ 686 | "array([[1., 1., 1.],\n", 687 | " [1., 1., 1.]])" 688 | ] 689 | }, 690 | "execution_count": 27, 691 | "metadata": {}, 692 | "output_type": "execute_result" 693 | } 694 | ], 695 | "source": [ 696 | "np.ones_like(one_array)" 697 | ] 698 | }, 699 | { 700 | "cell_type": "markdown", 701 | "metadata": {}, 702 | "source": [ 703 | "### 1.2.3 Complex arrays\n", 704 | "\n", 705 | "We are not limited to create arrays containing ones or zeros. Very common operations involve e.g. the creation of arrays containing regularly arrange numbers. For example a \"from-to-by-step\" list:" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": 28, 711 | "metadata": {}, 712 | "outputs": [ 713 | { 714 | "data": { 715 | "text/plain": [ 716 | "array([0, 2, 4, 6, 8])" 717 | ] 718 | }, 719 | "execution_count": 28, 720 | "metadata": {}, 721 | "output_type": "execute_result" 722 | } 723 | ], 724 | "source": [ 725 | "np.arange(0, 10, 2)" 726 | ] 727 | }, 728 | { 729 | "cell_type": "markdown", 730 | "metadata": {}, 731 | "source": [ 732 | "Or equidistant numbers between boundaries:" 733 | ] 734 | }, 735 | { 736 | "cell_type": "code", 737 | "execution_count": 29, 738 | "metadata": {}, 739 | "outputs": [ 740 | { 741 | "data": { 742 | "text/plain": [ 743 | "array([0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444,\n", 744 | " 0.55555556, 0.66666667, 0.77777778, 0.88888889, 1. ])" 745 | ] 746 | }, 747 | "execution_count": 29, 748 | "metadata": {}, 749 | "output_type": "execute_result" 750 | } 751 | ], 752 | "source": [ 753 | "np.linspace(0,1, 10)" 754 | ] 755 | }, 756 | { 757 | "cell_type": "markdown", 758 | "metadata": {}, 759 | "source": [ 760 | "Numpy offers in particular a ```random``` submodules that allows one to create arrays containing values from a wide array of distributions. 
For example, normally distributed:" 761 | ] 762 | }, 763 | { 764 | "cell_type": "code", 765 | "execution_count": 30, 766 | "metadata": {}, 767 | "outputs": [ 768 | { 769 | "data": { 770 | "text/plain": [ 771 | "array([[16.64156121, 13.38970093, 11.32772287, 7.93713055],\n", 772 | " [ 8.33365707, 11.27817138, 9.81766403, 11.11541451],\n", 773 | " [12.97743479, 7.1622948 , 12.02417108, 8.64402656]])" 774 | ] 775 | }, 776 | "execution_count": 30, 777 | "metadata": {}, 778 | "output_type": "execute_result" 779 | } 780 | ], 781 | "source": [ 782 | "normal_array = np.random.normal(loc=10, scale=2, size=(3,4))\n", 783 | "normal_array" 784 | ] 785 | }, 786 | { 787 | "cell_type": "code", 788 | "execution_count": 31, 789 | "metadata": {}, 790 | "outputs": [ 791 | { 792 | "data": { 793 | "text/plain": [ 794 | "array([[4, 4, 2, 4],\n", 795 | " [3, 7, 6, 3],\n", 796 | " [6, 5, 5, 4]])" 797 | ] 798 | }, 799 | "execution_count": 31, 800 | "metadata": {}, 801 | "output_type": "execute_result" 802 | } 803 | ], 804 | "source": [ 805 | "np.random.poisson(lam=5, size=(3,4))" 806 | ] 807 | }, 808 | { 809 | "cell_type": "markdown", 810 | "metadata": {}, 811 | "source": [ 812 | "### 1.2.4 Higher dimensions" 813 | ] 814 | }, 815 | { 816 | "cell_type": "markdown", 817 | "metadata": {}, 818 | "source": [ 819 | "Until now we have almost only dealt with 1D or 2D arrays that look like a simple grid:" 820 | ] 821 | }, 822 | { 823 | "cell_type": "code", 824 | "execution_count": 32, 825 | "metadata": {}, 826 | "outputs": [ 827 | { 828 | "data": { 829 | "text/html": [ 830 | "\n", 831 | "\n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | "\n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | "\n", 853 | " \n", 854 | " \n", 855 | "\n", 856 | " \n", 857 | " 10\n", 858 | " 5\n", 859 | "" 860 | ], 861 | "text/plain": [ 862 | "" 863 | ] 864 | }, 865 | "execution_count": 32, 866 | "metadata": {}, 867 | "output_type": "execute_result" 868 | } 869 | ], 870 | "source": [ 871 | "myarray = np.ones((5,10))\n", 872 | "numpy_to_svg(myarray)" 873 | ] 874 | }, 875 | { 876 | "cell_type": "markdown", 877 | "metadata": {}, 878 | "source": [ 879 | "We are not limited to create 1 or 2 dimensional arrays. We can basically create any-dimension array. For example in microscopy, images can be volumetric and thus they are 3D arrays in Numpy. 
For example if we acquired 5 planes of a 10px by 10px image, we would have something like:" 880 | ] 881 | }, 882 | { 883 | "cell_type": "code", 884 | "execution_count": 33, 885 | "metadata": {}, 886 | "outputs": [], 887 | "source": [ 888 | "array3D = np.ones((10,10,5))" 889 | ] 890 | }, 891 | { 892 | "cell_type": "code", 893 | "execution_count": 34, 894 | "metadata": {}, 895 | "outputs": [ 896 | { 897 | "data": { 898 | "text/html": [ 899 | "\n", 900 | "\n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | "\n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | "\n", 927 | " \n", 928 | " \n", 929 | "\n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | "\n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | "\n", 951 | " \n", 952 | " \n", 953 | "\n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | "\n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | "\n", 975 | " \n", 976 | " \n", 977 | "\n", 978 | " \n", 979 | " 5\n", 980 | " 10\n", 981 | " 10\n", 982 | "" 983 | ], 984 | "text/plain": [ 985 | "" 986 | ] 987 | }, 988 | "execution_count": 34, 989 | "metadata": {}, 990 | "output_type": "execute_result" 991 | } 992 | ], 993 | "source": [ 994 | "numpy_to_svg(array3D)" 995 | ] 996 | }, 997 | { 998 | "cell_type": "markdown", 999 | "metadata": {}, 1000 | "source": [ 1001 | "All the functions and properties that we have seen until now are N-dimensional, i.e. they work in the same way irrespective of the array size." 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "markdown", 1006 | "metadata": {}, 1007 | "source": [ 1008 | "## 1.3 Importing arrays\n", 1009 | "\n", 1010 | "We have seen until now multiple ways to create arrays. However, most of the time, you will *import* data from some source, either directly as arrays or as lists, and use these data in your analysis." 1011 | ] 1012 | }, 1013 | { 1014 | "cell_type": "markdown", 1015 | "metadata": {}, 1016 | "source": [ 1017 | "### 1.3.1 Loading and saving arrays\n", 1018 | "\n", 1019 | "Numpy can efficiently save and load arrays in its own format ```.npy```. 
Let's create an array and save it:" 1020 | ] 1021 | }, 1022 | { 1023 | "cell_type": "code", 1024 | "execution_count": 35, 1025 | "metadata": {}, 1026 | "outputs": [ 1027 | { 1028 | "data": { 1029 | "text/plain": [ 1030 | "array([[ 5.41052227, 11.78370736, 9.22402365, 9.91645679, 9.48495895],\n", 1031 | " [10.10853493, 8.75839699, 8.26026504, 12.51736441, 9.80407577],\n", 1032 | " [10.09084097, 7.27962072, 11.05963249, 14.37978527, 9.00654627],\n", 1033 | " [ 6.01521954, 10.25115807, 10.28647927, 10.12389832, 8.91184397]])" 1034 | ] 1035 | }, 1036 | "execution_count": 35, 1037 | "metadata": {}, 1038 | "output_type": "execute_result" 1039 | } 1040 | ], 1041 | "source": [ 1042 | "array_to_save = np.random.normal(10, 2, (4,5))\n", 1043 | "array_to_save" 1044 | ] 1045 | }, 1046 | { 1047 | "cell_type": "code", 1048 | "execution_count": 36, 1049 | "metadata": {}, 1050 | "outputs": [], 1051 | "source": [ 1052 | "np.save('my_saved_array.npy', array_to_save)" 1053 | ] 1054 | }, 1055 | { 1056 | "cell_type": "code", 1057 | "execution_count": 37, 1058 | "metadata": {}, 1059 | "outputs": [ 1060 | { 1061 | "name": "stdout", 1062 | "output_type": "stream", 1063 | "text": [ 1064 | "01-DA_Numpy_arrays_creation.ipynb 98-DA_Numpy_Solutions.ipynb\n", 1065 | "02-DA_Numpy_array_maths.ipynb 99-DA_Pandas_Exercises.ipynb\n", 1066 | "03-DA_Numpy_matplotlib.ipynb 99-DA_Pandas_Solutions.ipynb\n", 1067 | "04-DA_Numpy_indexing.ipynb My_first_plot.png\n", 1068 | "05-DA_Numpy_combining_arrays.ipynb SNSF_data.ipynb\n", 1069 | "06-DA_Pandas_introduction.ipynb Untitled.ipynb\n", 1070 | "07-DA_Pandas_structures.ipynb \u001b[34m__pycache__\u001b[m\u001b[m/\n", 1071 | "08-DA_Pandas_import.ipynb ipyleaflet.ipynb\n", 1072 | "09-DA_Pandas_operations.ipynb multiple_arrays.npz\n", 1073 | "10-DA_Pandas_combine.ipynb my_saved_array.npy\n", 1074 | "11-DA_Pandas_splitting.ipynb \u001b[34mraw.githubusercontent.com\u001b[m\u001b[m/\n", 1075 | "12-DA_Pandas_plotting.ipynb svg.py\n", 1076 | "13-DA_Pandas_ML.ipynb \u001b[34munused\u001b[m\u001b[m/\n", 1077 | "98-DA_Numpy_Exercises.ipynb\n" 1078 | ] 1079 | } 1080 | ], 1081 | "source": [ 1082 | "ls" 1083 | ] 1084 | }, 1085 | { 1086 | "cell_type": "markdown", 1087 | "metadata": {}, 1088 | "source": [ 1089 | "Now that this array is saved on disk, we can load it again using ```np.load```:" 1090 | ] 1091 | }, 1092 | { 1093 | "cell_type": "code", 1094 | "execution_count": 38, 1095 | "metadata": {}, 1096 | "outputs": [ 1097 | { 1098 | "data": { 1099 | "text/plain": [ 1100 | "array([[ 5.41052227, 11.78370736, 9.22402365, 9.91645679, 9.48495895],\n", 1101 | " [10.10853493, 8.75839699, 8.26026504, 12.51736441, 9.80407577],\n", 1102 | " [10.09084097, 7.27962072, 11.05963249, 14.37978527, 9.00654627],\n", 1103 | " [ 6.01521954, 10.25115807, 10.28647927, 10.12389832, 8.91184397]])" 1104 | ] 1105 | }, 1106 | "execution_count": 38, 1107 | "metadata": {}, 1108 | "output_type": "execute_result" 1109 | } 1110 | ], 1111 | "source": [ 1112 | "new_array = np.load('my_saved_array.npy')\n", 1113 | "new_array" 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "markdown", 1118 | "metadata": {}, 1119 | "source": [ 1120 | "If you have several arrays that belong together, you can also save them in a single file using ```np.savez``` in ```npz``` format. 
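A variant worth knowing, sketched here with the same keyword-based interface: ```np.savez_compressed``` writes the same ```npz``` container in compressed form, which can substantially reduce file size for large arrays.

```python
# identical call signature to np.savez, but compressed on disk
np.savez_compressed('multiple_arrays_compressed.npz', array_to_save=array_to_save)
```
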
Let's create a second array:" 1121 | ] 1122 | }, 1123 | { 1124 | "cell_type": "code", 1125 | "execution_count": 39, 1126 | "metadata": {}, 1127 | "outputs": [ 1128 | { 1129 | "data": { 1130 | "text/plain": [ 1131 | "array([[14.57759687, 7.62340049]])" 1132 | ] 1133 | }, 1134 | "execution_count": 39, 1135 | "metadata": {}, 1136 | "output_type": "execute_result" 1137 | } 1138 | ], 1139 | "source": [ 1140 | "array_to_save2 = np.random.normal(10, 2, (1,2))\n", 1141 | "array_to_save2" 1142 | ] 1143 | }, 1144 | { 1145 | "cell_type": "code", 1146 | "execution_count": 40, 1147 | "metadata": {}, 1148 | "outputs": [], 1149 | "source": [ 1150 | "np.savez('multiple_arrays.npz', array_to_save=array_to_save, array_to_save2=array_to_save2)" 1151 | ] 1152 | }, 1153 | { 1154 | "cell_type": "code", 1155 | "execution_count": 41, 1156 | "metadata": {}, 1157 | "outputs": [ 1158 | { 1159 | "name": "stdout", 1160 | "output_type": "stream", 1161 | "text": [ 1162 | "01-DA_Numpy_arrays_creation.ipynb 98-DA_Numpy_Solutions.ipynb\n", 1163 | "02-DA_Numpy_array_maths.ipynb 99-DA_Pandas_Exercises.ipynb\n", 1164 | "03-DA_Numpy_matplotlib.ipynb 99-DA_Pandas_Solutions.ipynb\n", 1165 | "04-DA_Numpy_indexing.ipynb My_first_plot.png\n", 1166 | "05-DA_Numpy_combining_arrays.ipynb SNSF_data.ipynb\n", 1167 | "06-DA_Pandas_introduction.ipynb Untitled.ipynb\n", 1168 | "07-DA_Pandas_structures.ipynb \u001b[34m__pycache__\u001b[m\u001b[m/\n", 1169 | "08-DA_Pandas_import.ipynb ipyleaflet.ipynb\n", 1170 | "09-DA_Pandas_operations.ipynb multiple_arrays.npz\n", 1171 | "10-DA_Pandas_combine.ipynb my_saved_array.npy\n", 1172 | "11-DA_Pandas_splitting.ipynb \u001b[34mraw.githubusercontent.com\u001b[m\u001b[m/\n", 1173 | "12-DA_Pandas_plotting.ipynb svg.py\n", 1174 | "13-DA_Pandas_ML.ipynb \u001b[34munused\u001b[m\u001b[m/\n", 1175 | "98-DA_Numpy_Exercises.ipynb\n" 1176 | ] 1177 | } 1178 | ], 1179 | "source": [ 1180 | "ls" 1181 | ] 1182 | }, 1183 | { 1184 | "cell_type": "markdown", 1185 | "metadata": {}, 1186 | "source": [ 1187 | "And when we load it again:" 1188 | ] 1189 | }, 1190 | { 1191 | "cell_type": "code", 1192 | "execution_count": 42, 1193 | "metadata": {}, 1194 | "outputs": [ 1195 | { 1196 | "data": { 1197 | "text/plain": [ 1198 | "numpy.lib.npyio.NpzFile" 1199 | ] 1200 | }, 1201 | "execution_count": 42, 1202 | "metadata": {}, 1203 | "output_type": "execute_result" 1204 | } 1205 | ], 1206 | "source": [ 1207 | "load_multiple = np.load('multiple_arrays.npz')\n", 1208 | "type(load_multiple)" 1209 | ] 1210 | }, 1211 | { 1212 | "cell_type": "markdown", 1213 | "metadata": {}, 1214 | "source": [ 1215 | "We get here an ```NpzFile``` *object* from which we can read our data. Note that when we load an ```npz``` file, it is only loaded *lazily*, i.e. data are not actually read, but the content is parsed. This is very useful if you need to store large amounts of data but don't always need to re-load all of them. 
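Besides the methods shown below, an ```NpzFile``` also supports dictionary-style access, which reads just the requested array from disk:

```python
# equivalent to the .get() call demonstrated below
load_multiple['array_to_save2']
```
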
We can use methods to actually access the data:" 1216 | ] 1217 | }, 1218 | { 1219 | "cell_type": "code", 1220 | "execution_count": 43, 1221 | "metadata": {}, 1222 | "outputs": [ 1223 | { 1224 | "data": { 1225 | "text/plain": [ 1226 | "['array_to_save', 'array_to_save2']" 1227 | ] 1228 | }, 1229 | "execution_count": 43, 1230 | "metadata": {}, 1231 | "output_type": "execute_result" 1232 | } 1233 | ], 1234 | "source": [ 1235 | "load_multiple.files" 1236 | ] 1237 | }, 1238 | { 1239 | "cell_type": "code", 1240 | "execution_count": 44, 1241 | "metadata": {}, 1242 | "outputs": [ 1243 | { 1244 | "data": { 1245 | "text/plain": [ 1246 | "array([[14.57759687, 7.62340049]])" 1247 | ] 1248 | }, 1249 | "execution_count": 44, 1250 | "metadata": {}, 1251 | "output_type": "execute_result" 1252 | } 1253 | ], 1254 | "source": [ 1255 | "load_multiple.get('array_to_save2')" 1256 | ] 1257 | }, 1258 | { 1259 | "cell_type": "markdown", 1260 | "metadata": {}, 1261 | "source": [ 1262 | "### 1.3.2 Importing data as arrays\n", 1263 | "\n", 1264 | "Images are a typical example of data that are array-like (matrix of pixels) and that can be imported directly as arrays. Of course, each domain will have its own *importing libraries*. For example in the area of imaging, the scikit-image package is one of the main libraries, and it offers an importer that reads images as arrays and works both with local files and web addresses:" 1265 | ] 1266 | }, 1267 | { 1268 | "cell_type": "code", 1269 | "execution_count": 45, 1270 | "metadata": {}, 1271 | "outputs": [], 1272 | "source": [ 1273 | "import skimage.io\n", 1274 | "\n", 1275 | "image = skimage.io.imread('https://upload.wikimedia.org/wikipedia/commons/f/fd/%27%C3%9Cbermut_Exub%C3%A9rance%27_by_Paul_Klee%2C_1939.jpg')" 1276 | ] 1277 | }, 1278 | { 1279 | "cell_type": "markdown", 1280 | "metadata": {}, 1281 | "source": [ 1282 | "We can briefly explore that image:" 1283 | ] 1284 | }, 1285 | { 1286 | "cell_type": "code", 1287 | "execution_count": 46, 1288 | "metadata": {}, 1289 | "outputs": [ 1290 | { 1291 | "data": { 1292 | "text/plain": [ 1293 | "numpy.ndarray" 1294 | ] 1295 | }, 1296 | "execution_count": 46, 1297 | "metadata": {}, 1298 | "output_type": "execute_result" 1299 | } 1300 | ], 1301 | "source": [ 1302 | "type(image)" 1303 | ] 1304 | }, 1305 | { 1306 | "cell_type": "code", 1307 | "execution_count": 47, 1308 | "metadata": {}, 1309 | "outputs": [ 1310 | { 1311 | "data": { 1312 | "text/plain": [ 1313 | "dtype('uint8')" 1314 | ] 1315 | }, 1316 | "execution_count": 47, 1317 | "metadata": {}, 1318 | "output_type": "execute_result" 1319 | } 1320 | ], 1321 | "source": [ 1322 | "image.dtype" 1323 | ] 1324 | }, 1325 | { 1326 | "cell_type": "code", 1327 | "execution_count": 48, 1328 | "metadata": {}, 1329 | "outputs": [ 1330 | { 1331 | "data": { 1332 | "text/plain": [ 1333 | "(584, 756, 3)" 1334 | ] 1335 | }, 1336 | "execution_count": 48, 1337 | "metadata": {}, 1338 | "output_type": "execute_result" 1339 | } 1340 | ], 1341 | "source": [ 1342 | "image.shape" 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "markdown", 1347 | "metadata": {}, 1348 | "source": [ 1349 | "We see that we have an array of integers with 3 dimensions. Since we imported a jpg image, we know that the third dimension corresponds to the three color channels Red, Green and Blue (RGB)." 1350 | ] 1351 | }, 1352 | { 1353 | "cell_type": "markdown", 1354 | "metadata": {}, 1355 | "source": [ 1356 | "You can also read regular CSV files directly as Numpy arrays. 
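For purely numeric tables ```np.loadtxt``` is sufficient; its more tolerant sibling ```np.genfromtxt```, sketched here with a hypothetical file name and arguments, additionally copes with missing entries:

```python
# hypothetical call: missing values in some_table.csv become np.nan
data = np.genfromtxt('some_table.csv', delimiter=',', skip_header=1, filling_values=np.nan)
```
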
This is more commonly done using Pandas, so we don't spend much time on this, but here is an example on importing data from the web:" 1357 | ] 1358 | }, 1359 | { 1360 | "cell_type": "code", 1361 | "execution_count": 49, 1362 | "metadata": {}, 1363 | "outputs": [], 1364 | "source": [ 1365 | "oilprice = np.loadtxt('https://raw.githubusercontent.com/guiwitz/Rdatasets/master/csv/quantreg/gasprice.csv',\n", 1366 | " delimiter=',', usecols=range(2,3), skiprows=1)" 1367 | ] 1368 | }, 1369 | { 1370 | "cell_type": "code", 1371 | "execution_count": 50, 1372 | "metadata": {}, 1373 | "outputs": [ 1374 | { 1375 | "data": { 1376 | "text/plain": [ 1377 | "array([126.6, 127.2, 132.1, 133.3, 133.9, 134.5, 133.9, 133.4, 132.8,\n", 1378 | " 132.3, 131.1, 134.1, 119.2, 116.8, 113.9, 110.6, 107.8, 105.4,\n", 1379 | " 102.5, 104.5, 104.3, 104.7, 105.2, 106.6, 106.9, 109. , 110.4,\n", 1380 | " 111.3, 112.1, 112.9, 114. , 113.8, 113.5, 112.6, 111.4, 110.4,\n", 1381 | " 109.8, 109.4, 109.1, 109.1, 109.9, 111.2, 112.4, 112.4, 112.7,\n", 1382 | " 112. , 111. , 109.7, 109.2, 108.9, 108.4, 108.8, 109.1, 109.1,\n", 1383 | " 110.2, 110.4, 109.9, 109.9, 109.1, 107.5, 106.3, 105.3, 104.2,\n", 1384 | " 102.6, 101.4, 100.6, 99.5, 100.4, 101.1, 101.4, 101.2, 101.3,\n", 1385 | " 101. , 101.5, 101.3, 102.6, 105.1, 105.8, 107.2, 108.9, 110.2,\n", 1386 | " 111.8, 112. , 112.8, 114.3, 115.1, 115.3, 114.9, 114.7, 113.9,\n", 1387 | " 113.2, 112.8, 112.6, 112.3, 111.6, 112.3, 112.1, 112.1, 112.4,\n", 1388 | " 112.3, 111.8, 111.5, 111.5, 111.3, 111.3, 112. , 112. , 111.2,\n", 1389 | " 110.6, 109.8, 108.9, 107.8, 107.4, 106.9, 106.5, 106.6, 106.1,\n", 1390 | " 105.5, 105.5, 106.2, 105.3, 104.7, 104.2, 104.8, 105.8, 105.6,\n", 1391 | " 105.7, 106.8, 107.9, 107.9, 108.6, 108.6, 109.7, 110.6, 110.6,\n", 1392 | " 110.7, 110.4, 110.1, 109.5, 108.9, 108.6, 108.1, 107.5, 106.9,\n", 1393 | " 106.2, 106. , 105.9, 106.5, 106.2, 105.5, 105.1, 104.5, 104.7,\n", 1394 | " 109.2, 109. , 109.3, 109.2, 108.4, 107.5, 106.4, 105.8, 105.1,\n", 1395 | " 103.6, 101.8, 100.3, 99.9, 99.2, 99.5, 100.1, 99.9, 100.5,\n", 1396 | " 100.7, 101.6, 100.9, 100.4, 100.7, 100.5, 100.7, 101.2, 101.1,\n", 1397 | " 102.8, 103.3, 103.7, 104. , 104.5, 104.6, 105. , 105.6, 106.5,\n", 1398 | " 107.3, 107.9, 109.5, 109.7, 110.3, 110.9, 111.4, 113. , 115.7,\n", 1399 | " 116.1, 116.5, 116.1, 115.6, 115. , 114. , 112.9, 112. , 111.4,\n", 1400 | " 110.6, 110.7, 112.1, 112.3, 112.2, 111.3, 108.2, 107.5, 106.4,\n", 1401 | " 105.6, 104.4, 106.3, 107. , 106.2, 106.8, 106.8, 106.2, 105.8,\n", 1402 | " 105.2, 106. , 106.3, 105.6, 105.5, 106.3, 107.7, 109.4, 111. ,\n", 1403 | " 113.3, 114.1, 116.4, 117.3, 119.1, 119.3, 119.4, 119. , 118.3,\n", 1404 | " 117.7, 116.9, 115.9, 114.8, 113.8, 112.6, 112.4, 112.1, 112.2,\n", 1405 | " 111.3, 111.1, 110.7, 110.6, 110.6, 110. , 109.2, 108.1, 107.3,\n", 1406 | " 106.2, 106. , 105.9, 105.6, 105.7, 105.8, 105.7, 107.2, 107.5,\n", 1407 | " 107.7, 108.6, 109.2, 108.4, 107.9, 107.6, 107.3, 107.8, 109.9,\n", 1408 | " 111.5, 111.6, 112.8, 115.8, 117.2, 119.5, 123.4, 124.3, 125.7,\n", 1409 | " 125.9, 126.2, 126.9, 126. , 125.2, 124.7, 124.1, 123. , 121.9,\n", 1410 | " 121.7, 121.5, 121.5, 120.9, 119.9, 119.6, 119.9, 120.1, 119.3,\n", 1411 | " 120.1, 120.3, 120.3, 119.9, 119.1, 120.3, 120.5, 121.7, 122.5,\n", 1412 | " 122.9, 123.8, 124.6, 124.2, 124.1, 123.3, 122.7, 122.4, 122. ,\n", 1413 | " 123.5, 123.6, 123.2, 123. , 122.7, 122. 
, 121.7, 120.8, 119.9,\n", 1414 | " 119.1, 119.6, 119.1, 119.2, 118.7, 118.8, 118.5, 118.2, 118.2,\n", 1415 | " 119.5, 120.4, 120.6, 119.8, 118.9, 117.9, 117.1, 116.9, 116.5,\n", 1416 | " 117. , 116.4, 118.5, 121.9, 121.8, 123. , 122.9, 122.7, 121.9,\n", 1417 | " 120.8, 119.5, 119.5, 118.7, 117.8, 116.8, 116.3, 116.4, 115.6,\n", 1418 | " 115. , 114. , 112.8, 111.8, 110.8, 109.9, 108.9, 108.3, 107.2,\n", 1419 | " 105.5, 105.1, 104.5, 103.2, 103.8, 102.5, 101.7, 100.6, 99.8,\n", 1420 | " 102.6, 102.3, 101.8, 102.1, 103.2, 103.8, 105.2, 105.5, 105.2,\n", 1421 | " 104.7, 106. , 104.9, 104.1, 104.2, 104.1, 103.7, 104.4, 103.5,\n", 1422 | " 102.3, 101.8, 101.1, 100.4, 99.8, 99.1, 98.7, 99.9, 99.9,\n", 1423 | " 100.6, 101. , 100.7, 100.1, 99.7, 99.4, 98.1, 97.1, 95.4,\n", 1424 | " 93.3, 92.3, 92.1, 91.4, 91.3, 92. , 92.1, 91.3, 90.8,\n", 1425 | " 90.7, 89.9, 88.5, 89.1, 90. , 95.8, 99.9, 105.5, 108.7,\n", 1426 | " 110.7, 110.3, 109.9, 110.7, 110.9, 111.2, 110.1, 108.8, 109.2,\n", 1427 | " 108.8, 110.5, 109.5, 111. , 112.3, 114.8, 117.2, 117.2, 118.3,\n", 1428 | " 121.4, 121.2, 121.4, 122.3, 123.4, 125.2, 124.8, 124.2, 123.4,\n", 1429 | " 122. , 122.5, 121.8, 122.2, 124. , 125.8, 126.2, 126. , 126.3,\n", 1430 | " 125.7, 126.3, 126. , 125.2, 126.8, 130.7, 130.7, 131.9, 135. ,\n", 1431 | " 140. , 141.3, 149. , 151.1, 150.8, 148.4, 147.8, 144.7, 141.5,\n", 1432 | " 140.6, 138.6, 142.7, 146.6, 149.4, 150.9, 153.5, 160.7, 166.4,\n", 1433 | " 164.1, 160.6, 157.1, 152.1, 149.9, 144.7, 143.7, 142. , 144.4,\n", 1434 | " 145.6, 150.2, 153.5, 153.9, 152.5, 149.8, 147.3, 151.6, 153.2,\n", 1435 | " 152.3, 150.2, 150.1, 148.7, 148.9, 146.4, 142.5, 139.6, 138.8,\n", 1436 | " 137.7, 140. , 145.8, 145.6, 144.6, 142.6, 146. , 142.9, 141. ,\n", 1437 | " 139.3, 138.7, 137.7, 137.9, 141.1, 146.9, 153.5, 158.6, 158.5,\n", 1438 | " 165.9, 166.3, 163.7, 165.6, 163. , 158. , 152.6, 145.4, 138.4,\n", 1439 | " 135. , 133. , 131.8, 131.9, 131.9, 134.7, 139.9, 148. , 153.8,\n", 1440 | " 151.1, 151.6, 146. , 138.1, 131. , 126.4, 122.1, 119.3, 117. ,\n", 1441 | " 114.7, 114. , 109.7, 108.4, 107.5, 104.2, 106.3, 109.6, 110.9,\n", 1442 | " 109.9, 108.7, 108.1, 109.8, 108.5, 108.9, 108.7, 111.8, 119.4,\n", 1443 | " 126.2, 130.8, 133.9, 138.2, 136.8, 136.7, 135.3, 135.6, 134.9,\n", 1444 | " 136. , 134.8, 135.3, 133.2, 133.5, 134.2, 135.7, 134.5, 136.1,\n", 1445 | " 138.1, 137.6, 135.5, 135.5, 135.7, 136.5, 135.3, 135.5, 136.7,\n", 1446 | " 135.7, 138.5, 141.6, 142.2, 144.3, 142.7, 142.7, 140.6, 137. ,\n", 1447 | " 133.6, 131.6, 131.6, 132.2, 137.1, 141.7, 141.2, 142.3, 142.2,\n", 1448 | " 143.7, 149.9, 158.2, 163. , 161.7, 164.1, 166.3, 167.3, 162.6,\n", 1449 | " 157.7, 155.7, 152.1, 150.4, 148.6, 144.1, 142.7, 144.4, 143.9,\n", 1450 | " 142.8, 145.6, 148. , 145.1, 144.3, 144.8, 148.9, 149.6, 148.8,\n", 1451 | " 151.6, 155. , 159.4, 169.3, 168.8, 165.3, 163.6, 158. 
, 152.4,\n", 1452 | " 151.1, 151.5, 152.7, 149.9, 149.4, 146.4, 145.9, 147.8, 145.4,\n", 1453 | " 144.1, 143.3, 145.9, 145.4, 149.2, 154.4, 157.9, 160.4, 159.1,\n", 1454 | " 160.9, 161.7])" 1455 | ] 1456 | }, 1457 | "execution_count": 50, 1458 | "metadata": {}, 1459 | "output_type": "execute_result" 1460 | } 1461 | ], 1462 | "source": [ 1463 | "oilprice" 1464 | ] 1465 | }, 1466 | { 1467 | "cell_type": "code", 1468 | "execution_count": null, 1469 | "metadata": {}, 1470 | "outputs": [], 1471 | "source": [] 1472 | } 1473 | ], 1474 | "metadata": { 1475 | "kernelspec": { 1476 | "display_name": "Python 3", 1477 | "language": "python", 1478 | "name": "python3" 1479 | }, 1480 | "language_info": { 1481 | "codemirror_mode": { 1482 | "name": "ipython", 1483 | "version": 3 1484 | }, 1485 | "file_extension": ".py", 1486 | "mimetype": "text/x-python", 1487 | "name": "python", 1488 | "nbconvert_exporter": "python", 1489 | "pygments_lexer": "ipython3", 1490 | "version": "3.8.2" 1491 | }, 1492 | "nteract": { 1493 | "version": "0.23.1" 1494 | }, 1495 | "toc": { 1496 | "base_numbering": 1, 1497 | "nav_menu": {}, 1498 | "number_sections": false, 1499 | "sideBar": true, 1500 | "skip_h1_title": false, 1501 | "title_cell": "Table of Contents", 1502 | "title_sidebar": "Contents", 1503 | "toc_cell": false, 1504 | "toc_position": {}, 1505 | "toc_section_display": true, 1506 | "toc_window_display": true 1507 | } 1508 | }, 1509 | "nbformat": 4, 1510 | "nbformat_minor": 4 1511 | } 1512 | -------------------------------------------------------------------------------- /02-DA_Numpy_array_maths.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 2. Mathematics with arrays\n", 8 | "\n", 9 | "One of the great advantages of Numpy arrays is that they allow one to very easily apply mathematical operations to entire arrays effortlessly. We are presenting here 3 ways in which this can be done." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import numpy as np" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "## 2.1 Simple calculus\n", 26 | "\n", 27 | "To illustrate how arrays are useful, let's first consider the following problem. You have a list:" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 2, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "mylist = [1,2,3,4,5]" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "And now you wish to add to each element of that list the value 3. 
If we write:" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 3, 49 | "metadata": {}, 50 | "outputs": [ 51 | { 52 | "ename": "TypeError", 53 | "evalue": "can only concatenate list (not \"int\") to list", 54 | "output_type": "error", 55 | "traceback": [ 56 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 57 | "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", 58 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mmylist\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;36m3\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 59 | "\u001b[0;31mTypeError\u001b[0m: can only concatenate list (not \"int\") to list" 60 | ] 61 | } 62 | ], 63 | "source": [ 64 | "mylist + 3" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "We receive an error because Python doesn't know how to combine a list with a simple integer. In this case we would have to use a for loop or a comprehension list, which is cumbersome." 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 4, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "[4, 5, 6, 7, 8]" 83 | ] 84 | }, 85 | "execution_count": 4, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "[x + 3 for x in mylist]" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "Let's see now how this works for an array:" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 5, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "myarray = np.array(mylist)" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 6, 113 | "metadata": {}, 114 | "outputs": [ 115 | { 116 | "data": { 117 | "text/plain": [ 118 | "array([4, 5, 6, 7, 8])" 119 | ] 120 | }, 121 | "execution_count": 6, 122 | "metadata": {}, 123 | "output_type": "execute_result" 124 | } 125 | ], 126 | "source": [ 127 | "myarray + 3" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "Numpy understands without trouble that our goal is to add the value 3 to *each element* in our list. 
Naturally this is dimension independent e.g.:" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 7, 140 | "metadata": {}, 141 | "outputs": [ 142 | { 143 | "data": { 144 | "text/plain": [ 145 | "array([[1., 1., 1., 1., 1., 1.],\n", 146 | " [1., 1., 1., 1., 1., 1.],\n", 147 | " [1., 1., 1., 1., 1., 1.]])" 148 | ] 149 | }, 150 | "execution_count": 7, 151 | "metadata": {}, 152 | "output_type": "execute_result" 153 | } 154 | ], 155 | "source": [ 156 | "my2d_array = np.ones((3,6))\n", 157 | "my2d_array" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 8, 163 | "metadata": {}, 164 | "outputs": [ 165 | { 166 | "data": { 167 | "text/plain": [ 168 | "array([[4., 4., 4., 4., 4., 4.],\n", 169 | " [4., 4., 4., 4., 4., 4.],\n", 170 | " [4., 4., 4., 4., 4., 4.]])" 171 | ] 172 | }, 173 | "execution_count": 8, 174 | "metadata": {}, 175 | "output_type": "execute_result" 176 | } 177 | ], 178 | "source": [ 179 | "my2d_array + 3" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "Of course as long as we don't reassign this new state to our variable it remains unchanged:" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 9, 192 | "metadata": {}, 193 | "outputs": [ 194 | { 195 | "data": { 196 | "text/plain": [ 197 | "array([[1., 1., 1., 1., 1., 1.],\n", 198 | " [1., 1., 1., 1., 1., 1.],\n", 199 | " [1., 1., 1., 1., 1., 1.]])" 200 | ] 201 | }, 202 | "execution_count": 9, 203 | "metadata": {}, 204 | "output_type": "execute_result" 205 | } 206 | ], 207 | "source": [ 208 | "my2d_array" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "We have to write:" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 10, 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [ 224 | "my2d_array = my2d_array + 3" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": 11, 230 | "metadata": {}, 231 | "outputs": [ 232 | { 233 | "data": { 234 | "text/plain": [ 235 | "array([[4., 4., 4., 4., 4., 4.],\n", 236 | " [4., 4., 4., 4., 4., 4.],\n", 237 | " [4., 4., 4., 4., 4., 4.]])" 238 | ] 239 | }, 240 | "execution_count": 11, 241 | "metadata": {}, 242 | "output_type": "execute_result" 243 | } 244 | ], 245 | "source": [ 246 | "my2d_array" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "Naturally all basic operations work:" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": 12, 259 | "metadata": {}, 260 | "outputs": [ 261 | { 262 | "data": { 263 | "text/plain": [ 264 | "array([[16., 16., 16., 16., 16., 16.],\n", 265 | " [16., 16., 16., 16., 16., 16.],\n", 266 | " [16., 16., 16., 16., 16., 16.]])" 267 | ] 268 | }, 269 | "execution_count": 12, 270 | "metadata": {}, 271 | "output_type": "execute_result" 272 | } 273 | ], 274 | "source": [ 275 | "my2d_array * 4" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 13, 281 | "metadata": {}, 282 | "outputs": [ 283 | { 284 | "data": { 285 | "text/plain": [ 286 | "array([[0.8, 0.8, 0.8, 0.8, 0.8, 0.8],\n", 287 | " [0.8, 0.8, 0.8, 0.8, 0.8, 0.8],\n", 288 | " [0.8, 0.8, 0.8, 0.8, 0.8, 0.8]])" 289 | ] 290 | }, 291 | "execution_count": 13, 292 | "metadata": {}, 293 | "output_type": "execute_result" 294 | } 295 | ], 296 | "source": [ 297 | "my2d_array / 5" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 14, 303 | "metadata": {}, 304 | 
"outputs": [ 305 | { 306 | "data": { 307 | "text/plain": [ 308 | "array([[1024., 1024., 1024., 1024., 1024., 1024.],\n", 309 | " [1024., 1024., 1024., 1024., 1024., 1024.],\n", 310 | " [1024., 1024., 1024., 1024., 1024., 1024.]])" 311 | ] 312 | }, 313 | "execution_count": 14, 314 | "metadata": {}, 315 | "output_type": "execute_result" 316 | } 317 | ], 318 | "source": [ 319 | "my2d_array ** 5" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "## 2.2 Mathematical functions\n", 327 | "\n", 328 | "In addition to simple arithmetic, Numpy offers a vast choice of functions that can be directly applied to arrays. For example trigonometry:" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 15, 334 | "metadata": {}, 335 | "outputs": [ 336 | { 337 | "data": { 338 | "text/plain": [ 339 | "array([ 0.54030231, -0.41614684, -0.9899925 , -0.65364362, 0.28366219])" 340 | ] 341 | }, 342 | "execution_count": 15, 343 | "metadata": {}, 344 | "output_type": "execute_result" 345 | } 346 | ], 347 | "source": [ 348 | "np.cos(myarray)" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [ 355 | "Exponentials and logs:" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 16, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "data": { 365 | "text/plain": [ 366 | "array([ 2.71828183, 7.3890561 , 20.08553692, 54.59815003,\n", 367 | " 148.4131591 ])" 368 | ] 369 | }, 370 | "execution_count": 16, 371 | "metadata": {}, 372 | "output_type": "execute_result" 373 | } 374 | ], 375 | "source": [ 376 | "np.exp(myarray)" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 17, 382 | "metadata": {}, 383 | "outputs": [ 384 | { 385 | "data": { 386 | "text/plain": [ 387 | "array([0. , 0.30103 , 0.47712125, 0.60205999, 0.69897 ])" 388 | ] 389 | }, 390 | "execution_count": 17, 391 | "metadata": {}, 392 | "output_type": "execute_result" 393 | } 394 | ], 395 | "source": [ 396 | "np.log10(myarray)" 397 | ] 398 | }, 399 | { 400 | "cell_type": "markdown", 401 | "metadata": {}, 402 | "source": [ 403 | "## 2.3 Logical operations" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "If we use a logical comparison on a regular variable, the output is a *boolean* (True or False) that describes the outcome of the comparison:" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 18, 416 | "metadata": {}, 417 | "outputs": [ 418 | { 419 | "data": { 420 | "text/plain": [ 421 | "False" 422 | ] 423 | }, 424 | "execution_count": 18, 425 | "metadata": {}, 426 | "output_type": "execute_result" 427 | } 428 | ], 429 | "source": [ 430 | "a = 3\n", 431 | "b = 2\n", 432 | "a > 3" 433 | ] 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "We can do exactly the same thing with arrays. When we added 3 to an array, that value was automatically added to each element of the array. 
With logical operations, the comparison is also done for each element in the array, resulting in a boolean array:" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 19, 445 | "metadata": {}, 446 | "outputs": [ 447 | { 448 | "data": { 449 | "text/plain": [ 450 | "array([[0., 0., 0., 0.],\n", 451 | " [0., 0., 0., 0.],\n", 452 | " [0., 0., 0., 1.],\n", 453 | " [0., 0., 0., 0.]])" 454 | ] 455 | }, 456 | "execution_count": 19, 457 | "metadata": {}, 458 | "output_type": "execute_result" 459 | } 460 | ], 461 | "source": [ 462 | "myarray = np.zeros((4,4))\n", 463 | "myarray[2,3] = 1\n", 464 | "myarray" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 20, 470 | "metadata": {}, 471 | "outputs": [ 472 | { 473 | "data": { 474 | "text/plain": [ 475 | "array([[False, False, False, False],\n", 476 | " [False, False, False, False],\n", 477 | " [False, False, False, True],\n", 478 | " [False, False, False, False]])" 479 | ] 480 | }, 481 | "execution_count": 20, 482 | "metadata": {}, 483 | "output_type": "execute_result" 484 | } 485 | ], 486 | "source": [ 487 | "myarray > 0" 488 | ] 489 | }, 490 | { 491 | "cell_type": "markdown", 492 | "metadata": {}, 493 | "source": [ 494 | "Exactly as for simple variables, we can assign this boolean array to a new variable directly:" 495 | ] 496 | }, 497 | { 498 | "cell_type": "code", 499 | "execution_count": 21, 500 | "metadata": {}, 501 | "outputs": [], 502 | "source": [ 503 | "myboolean = myarray > 0" 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": 22, 509 | "metadata": {}, 510 | "outputs": [ 511 | { 512 | "data": { 513 | "text/plain": [ 514 | "array([[False, False, False, False],\n", 515 | " [False, False, False, False],\n", 516 | " [False, False, False, True],\n", 517 | " [False, False, False, False]])" 518 | ] 519 | }, 520 | "execution_count": 22, 521 | "metadata": {}, 522 | "output_type": "execute_result" 523 | } 524 | ], 525 | "source": [ 526 | "myboolean" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": {}, 532 | "source": [ 533 | "## 2.4 Methods modifying array dimensions" 534 | ] 535 | }, 536 | { 537 | "cell_type": "markdown", 538 | "metadata": {}, 539 | "source": [ 540 | "The operations described above were applied *element-wise*. However, sometimes we need operations that act either on the whole array or along some of its axes. 
For example, we very commonly need statistics on an array (mean, sum, etc.):" 541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "execution_count": 23, 546 | "metadata": {}, 547 | "outputs": [ 548 | { 549 | "data": { 550 | "text/plain": [ 551 | "array([[ 8.22235922, 10.86316749, 8.97190654, 12.16211971],\n", 552 | " [11.31745909, 9.80774793, 11.2873836 , 6.77945745],\n", 553 | " [10.20776894, 8.78011512, 6.96723135, 11.77819806]])" 554 | ] 555 | }, 556 | "execution_count": 23, 557 | "metadata": {}, 558 | "output_type": "execute_result" 559 | } 560 | ], 561 | "source": [ 562 | "nd_array = np.random.normal(10, 2, (3,4))\n", 563 | "nd_array" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 24, 569 | "metadata": {}, 570 | "outputs": [ 571 | { 572 | "data": { 573 | "text/plain": [ 574 | "9.762076209457817" 575 | ] 576 | }, 577 | "execution_count": 24, 578 | "metadata": {}, 579 | "output_type": "execute_result" 580 | } 581 | ], 582 | "source": [ 583 | "np.mean(nd_array)" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": 25, 589 | "metadata": {}, 590 | "outputs": [ 591 | { 592 | "data": { 593 | "text/plain": [ 594 | "1.747626512794281" 595 | ] 596 | }, 597 | "execution_count": 25, 598 | "metadata": {}, 599 | "output_type": "execute_result" 600 | } 601 | ], 602 | "source": [ 603 | "np.std(nd_array)" 604 | ] 605 | }, 606 | { 607 | "cell_type": "markdown", 608 | "metadata": {}, 609 | "source": [ 610 | "Or the maximum value:" 611 | ] 612 | }, 613 | { 614 | "cell_type": "code", 615 | "execution_count": 26, 616 | "metadata": {}, 617 | "outputs": [ 618 | { 619 | "data": { 620 | "text/plain": [ 621 | "12.162119714449235" 622 | ] 623 | }, 624 | "execution_count": 26, 625 | "metadata": {}, 626 | "output_type": "execute_result" 627 | } 628 | ], 629 | "source": [ 630 | "np.max(nd_array)" 631 | ] 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "metadata": {}, 636 | "source": [ 637 | "Note that several of these functions can be called as array methods instead of numpy functions:" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": 27, 643 | "metadata": {}, 644 | "outputs": [ 645 | { 646 | "data": { 647 | "text/plain": [ 648 | "9.762076209457817" 649 | ] 650 | }, 651 | "execution_count": 27, 652 | "metadata": {}, 653 | "output_type": "execute_result" 654 | } 655 | ], 656 | "source": [ 657 | "nd_array.mean()" 658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": 28, 663 | "metadata": {}, 664 | "outputs": [ 665 | { 666 | "data": { 667 | "text/plain": [ 668 | "12.162119714449235" 669 | ] 670 | }, 671 | "execution_count": 28, 672 | "metadata": {}, 673 | "output_type": "execute_result" 674 | } 675 | ], 676 | "source": [ 677 | "nd_array.max()" 678 | ] 679 | }, 680 | { 681 | "cell_type": "markdown", 682 | "metadata": {}, 683 | "source": [ 684 | "Note that most functions can be applied to specific axes. 
Let's remember that our arrays is:" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": 29, 690 | "metadata": {}, 691 | "outputs": [ 692 | { 693 | "data": { 694 | "text/plain": [ 695 | "array([[ 8.22235922, 10.86316749, 8.97190654, 12.16211971],\n", 696 | " [11.31745909, 9.80774793, 11.2873836 , 6.77945745],\n", 697 | " [10.20776894, 8.78011512, 6.96723135, 11.77819806]])" 698 | ] 699 | }, 700 | "execution_count": 29, 701 | "metadata": {}, 702 | "output_type": "execute_result" 703 | } 704 | ], 705 | "source": [ 706 | "nd_array" 707 | ] 708 | }, 709 | { 710 | "cell_type": "markdown", 711 | "metadata": {}, 712 | "source": [ 713 | "We can for example do a maximum projection along the first axis (rows): the maximum value of eadch column is kept:" 714 | ] 715 | }, 716 | { 717 | "cell_type": "code", 718 | "execution_count": 30, 719 | "metadata": {}, 720 | "outputs": [ 721 | { 722 | "data": { 723 | "text/plain": [ 724 | "array([11.31745909, 10.86316749, 11.2873836 , 12.16211971])" 725 | ] 726 | }, 727 | "execution_count": 30, 728 | "metadata": {}, 729 | "output_type": "execute_result" 730 | } 731 | ], 732 | "source": [ 733 | "proj0 = nd_array.max(axis=0)\n", 734 | "proj0" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 31, 740 | "metadata": {}, 741 | "outputs": [ 742 | { 743 | "data": { 744 | "text/plain": [ 745 | "(4,)" 746 | ] 747 | }, 748 | "execution_count": 31, 749 | "metadata": {}, 750 | "output_type": "execute_result" 751 | } 752 | ], 753 | "source": [ 754 | "proj0.shape" 755 | ] 756 | }, 757 | { 758 | "cell_type": "markdown", 759 | "metadata": {}, 760 | "source": [ 761 | "We can of course do the same operation for the second axis:" 762 | ] 763 | }, 764 | { 765 | "cell_type": "code", 766 | "execution_count": 32, 767 | "metadata": {}, 768 | "outputs": [ 769 | { 770 | "data": { 771 | "text/plain": [ 772 | "array([12.16211971, 11.31745909, 11.77819806])" 773 | ] 774 | }, 775 | "execution_count": 32, 776 | "metadata": {}, 777 | "output_type": "execute_result" 778 | } 779 | ], 780 | "source": [ 781 | "proj1 = nd_array.max(axis=1)\n", 782 | "proj1" 783 | ] 784 | }, 785 | { 786 | "cell_type": "code", 787 | "execution_count": 33, 788 | "metadata": {}, 789 | "outputs": [ 790 | { 791 | "data": { 792 | "text/plain": [ 793 | "(3,)" 794 | ] 795 | }, 796 | "execution_count": 33, 797 | "metadata": {}, 798 | "output_type": "execute_result" 799 | } 800 | ], 801 | "source": [ 802 | "proj1.shape" 803 | ] 804 | }, 805 | { 806 | "cell_type": "markdown", 807 | "metadata": {}, 808 | "source": [ 809 | "There are of course more advanced functions. 
For example, a cumulative sum:" 810 | ] 811 | }, 812 | { 813 | "cell_type": "code", 814 | "execution_count": 34, 815 | "metadata": {}, 816 | "outputs": [ 817 | { 818 | "data": { 819 | "text/plain": [ 820 | "array([ 8.22235922, 19.08552671, 28.05743325, 40.21955296,\n", 821 | " 51.53701205, 61.34475998, 72.63214358, 79.41160103,\n", 822 | " 89.61936998, 98.3994851 , 105.36671645, 117.14491451])" 823 | ] 824 | }, 825 | "execution_count": 34, 826 | "metadata": {}, 827 | "output_type": "execute_result" 828 | } 829 | ], 830 | "source": [ 831 | "np.cumsum(nd_array)" 832 | ] 833 | } 834 | ], 835 | "metadata": { 836 | "kernelspec": { 837 | "display_name": "Python 3", 838 | "language": "python", 839 | "name": "python3" 840 | }, 841 | "language_info": { 842 | "codemirror_mode": { 843 | "name": "ipython", 844 | "version": 3 845 | }, 846 | "file_extension": ".py", 847 | "mimetype": "text/x-python", 848 | "name": "python", 849 | "nbconvert_exporter": "python", 850 | "pygments_lexer": "ipython3", 851 | "version": "3.8.2" 852 | } 853 | }, 854 | "nbformat": 4, 855 | "nbformat_minor": 4 856 | } 857 | -------------------------------------------------------------------------------- /07-DA_Pandas_structures.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 7. Pandas objects" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import numpy as np\n", 17 | "import pandas as pd\n", 18 | "import matplotlib.pyplot as plt" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Python has a series of data containers (lists, dicts etc.) and Numpy offers multi-dimensional arrays; however, none of these structures offers a simple way to handle tabular data or to easily perform standard database operations. This is why Pandas exists: it offers a complete ecosystem of structures and functions dedicated to handling large tables with inhomogeneous contents.\n", 26 | "\n", 27 | "In this first chapter, we are going to learn about the two main structures of Pandas: Series and Dataframes." 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "## 7.1 Series" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### 7.1.1 Simple series" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "Series are the Pandas version of 1-D Numpy arrays. We are rarely going to use them directly, but they often appear implicitly when handling data from the more general Dataframe structure. We therefore only cover the basics here. \n", 49 | "\n", 50 | "To understand Series' specificities, let's create one. 
Usually Pandas structures (Series and Dataframes) are created from other simpler structures like Numpy arrays or dictionaries:" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 2, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "numpy_array = np.array([4,8,38,1,6])\n" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "The function ```pd.Series()``` allows us to convert objects into Series:" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 3, 72 | "metadata": {}, 73 | "outputs": [ 74 | { 75 | "data": { 76 | "text/plain": [ 77 | "0 4\n", 78 | "1 8\n", 79 | "2 38\n", 80 | "3 1\n", 81 | "4 6\n", 82 | "dtype: int64" 83 | ] 84 | }, 85 | "execution_count": 3, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "pd_series = pd.Series(numpy_array)\n", 92 | "pd_series" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "The underlying structure can be recovered with the ```.values``` attribute: " 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 4, 105 | "metadata": {}, 106 | "outputs": [ 107 | { 108 | "data": { 109 | "text/plain": [ 110 | "array([ 4, 8, 38, 1, 6])" 111 | ] 112 | }, 113 | "execution_count": 4, 114 | "metadata": {}, 115 | "output_type": "execute_result" 116 | } 117 | ], 118 | "source": [ 119 | "pd_series.values" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "Otherwise, indexing works as for regular arrays:" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 5, 132 | "metadata": {}, 133 | "outputs": [ 134 | { 135 | "data": { 136 | "text/plain": [ 137 | "8" 138 | ] 139 | }, 140 | "execution_count": 5, 141 | "metadata": {}, 142 | "output_type": "execute_result" 143 | } 144 | ], 145 | "source": [ 146 | "pd_series[1]" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "### 7.1.2 Indexing" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "On top of accessing values in a series by regular indexing, one can create custom indices for each element in the series:" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 6, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "pd_series2 = pd.Series(numpy_array, index=['a', 'b', 'c', 'd','e'])" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 7, 175 | "metadata": {}, 176 | "outputs": [ 177 | { 178 | "data": { 179 | "text/plain": [ 180 | "a 4\n", 181 | "b 8\n", 182 | "c 38\n", 183 | "d 1\n", 184 | "e 6\n", 185 | "dtype: int64" 186 | ] 187 | }, 188 | "execution_count": 7, 189 | "metadata": {}, 190 | "output_type": "execute_result" 191 | } 192 | ], 193 | "source": [ 194 | "pd_series2" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "Now a given element can be accessed either by using its regular index:" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 8, 207 | "metadata": {}, 208 | "outputs": [ 209 | { 210 | "data": { 211 | "text/plain": [ 212 | "8" 213 | ] 214 | }, 215 | "execution_count": 8, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "pd_series2[1]" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | 
"or its chosen index:" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 9, 234 | "metadata": {}, 235 | "outputs": [ 236 | { 237 | "data": { 238 | "text/plain": [ 239 | "8" 240 | ] 241 | }, 242 | "execution_count": 9, 243 | "metadata": {}, 244 | "output_type": "execute_result" 245 | } 246 | ], 247 | "source": [ 248 | "pd_series2['b']" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "A more direct way to create specific indexes is to transform as dictionary into a Series:" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 10, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "composer_birth = {'Mahler': 1860, 'Beethoven': 1770, 'Puccini': 1858, 'Shostakovich': 1906}" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 11, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "data": { 274 | "text/plain": [ 275 | "Mahler 1860\n", 276 | "Beethoven 1770\n", 277 | "Puccini 1858\n", 278 | "Shostakovich 1906\n", 279 | "dtype: int64" 280 | ] 281 | }, 282 | "execution_count": 11, 283 | "metadata": {}, 284 | "output_type": "execute_result" 285 | } 286 | ], 287 | "source": [ 288 | "pd_composer_birth = pd.Series(composer_birth)\n", 289 | "pd_composer_birth" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 12, 295 | "metadata": {}, 296 | "outputs": [ 297 | { 298 | "data": { 299 | "text/plain": [ 300 | "1858" 301 | ] 302 | }, 303 | "execution_count": 12, 304 | "metadata": {}, 305 | "output_type": "execute_result" 306 | } 307 | ], 308 | "source": [ 309 | "pd_composer_birth['Puccini']" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "## 7.2 Dataframes" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "In most cases, one has to deal with more than just one variable, e.g. one has the birth year and the death year of a list of composers. Also one might have different types of information, e.g. in addition to numerical variables (year) one might have string variables like the city of birth. The Pandas structure that allow one to deal with such complex data is called a Dataframe, which can somehow be seen as an aggregation of Series with a common index." 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "### 7.2.1 Creating a Dataframe" 331 | ] 332 | }, 333 | { 334 | "cell_type": "markdown", 335 | "metadata": {}, 336 | "source": [ 337 | "To see how to construct such a Dataframe, let's create some more information about composers:" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 13, 343 | "metadata": {}, 344 | "outputs": [], 345 | "source": [ 346 | "composer_death = pd.Series({'Mahler': 1911, 'Beethoven': 1827, 'Puccini': 1924, 'Shostakovich': 1975})\n", 347 | "composer_city_birth = pd.Series({'Mahler': 'Kaliste', 'Beethoven': 'Bonn', 'Puccini': 'Lucques', 'Shostakovich': 'Saint-Petersburg'})\n" 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "Now we can combine multiple series into a Dataframe by precising a variable name for each series. Note that all our series need to have the same indices (here the composers' name):" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 14, 360 | "metadata": {}, 361 | "outputs": [ 362 | { 363 | "data": { 364 | "text/html": [ 365 | "
\n", 366 | "\n", 379 | "\n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | "
birthdeathcity
Mahler18601911Kaliste
Beethoven17701827Bonn
Puccini18581924Lucques
Shostakovich19061975Saint-Petersburg
\n", 415 | "
" 416 | ], 417 | "text/plain": [ 418 | " birth death city\n", 419 | "Mahler 1860 1911 Kaliste\n", 420 | "Beethoven 1770 1827 Bonn\n", 421 | "Puccini 1858 1924 Lucques\n", 422 | "Shostakovich 1906 1975 Saint-Petersburg" 423 | ] 424 | }, 425 | "execution_count": 14, 426 | "metadata": {}, 427 | "output_type": "execute_result" 428 | } 429 | ], 430 | "source": [ 431 | "composers_df = pd.DataFrame({'birth': pd_composer_birth, 'death': composer_death, 'city': composer_city_birth})\n", 432 | "composers_df" 433 | ] 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "A more common way of creating a Dataframe is to construct it directly from a dictionary of lists where each element of the dictionary turns into a column:" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 15, 445 | "metadata": {}, 446 | "outputs": [], 447 | "source": [ 448 | "dict_of_list = {'birth': [1860, 1770, 1858, 1906], 'death':[1911, 1827, 1924, 1975], \n", 449 | " 'city':['Kaliste', 'Bonn', 'Lucques', 'Saint-Petersburg']}" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": 16, 455 | "metadata": {}, 456 | "outputs": [ 457 | { 458 | "data": { 459 | "text/html": [ 460 | "
\n", 461 | "\n", 474 | "\n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | "
birthdeathcity
018601911Kaliste
117701827Bonn
218581924Lucques
319061975Saint-Petersburg
\n", 510 | "
" 511 | ], 512 | "text/plain": [ 513 | " birth death city\n", 514 | "0 1860 1911 Kaliste\n", 515 | "1 1770 1827 Bonn\n", 516 | "2 1858 1924 Lucques\n", 517 | "3 1906 1975 Saint-Petersburg" 518 | ] 519 | }, 520 | "execution_count": 16, 521 | "metadata": {}, 522 | "output_type": "execute_result" 523 | } 524 | ], 525 | "source": [ 526 | "pd.DataFrame(dict_of_list)" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": {}, 532 | "source": [ 533 | "However we now lost the composers name. We can enforce it by providing, as we did before for the Series, a list of indices:" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 17, 539 | "metadata": {}, 540 | "outputs": [ 541 | { 542 | "data": { 543 | "text/html": [ 544 | "
\n", 545 | "\n", 558 | "\n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | "
birthdeathcity
Mahler18601911Kaliste
Beethoven17701827Bonn
Puccini18581924Lucques
Shostakovich19061975Saint-Petersburg
\n", 594 | "
" 595 | ], 596 | "text/plain": [ 597 | " birth death city\n", 598 | "Mahler 1860 1911 Kaliste\n", 599 | "Beethoven 1770 1827 Bonn\n", 600 | "Puccini 1858 1924 Lucques\n", 601 | "Shostakovich 1906 1975 Saint-Petersburg" 602 | ] 603 | }, 604 | "execution_count": 17, 605 | "metadata": {}, 606 | "output_type": "execute_result" 607 | } 608 | ], 609 | "source": [ 610 | "pd.DataFrame(dict_of_list, index=['Mahler', 'Beethoven', 'Puccini', 'Shostakovich'])" 611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "metadata": {}, 616 | "source": [ 617 | "### 7.2.2 Accessing values" 618 | ] 619 | }, 620 | { 621 | "cell_type": "markdown", 622 | "metadata": {}, 623 | "source": [ 624 | "There are multiple ways of accessing values or series of values in a Dataframe. Unlike in Series, a simple bracket gives access to a column and not an index, for example:" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": 18, 630 | "metadata": {}, 631 | "outputs": [ 632 | { 633 | "data": { 634 | "text/plain": [ 635 | "Mahler Kaliste\n", 636 | "Beethoven Bonn\n", 637 | "Puccini Lucques\n", 638 | "Shostakovich Saint-Petersburg\n", 639 | "Name: city, dtype: object" 640 | ] 641 | }, 642 | "execution_count": 18, 643 | "metadata": {}, 644 | "output_type": "execute_result" 645 | } 646 | ], 647 | "source": [ 648 | "composers_df['city']" 649 | ] 650 | }, 651 | { 652 | "cell_type": "markdown", 653 | "metadata": {}, 654 | "source": [ 655 | "returns a Series. Alternatively one can also use the *attributes* synthax and access columns by using:" 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": 19, 661 | "metadata": {}, 662 | "outputs": [ 663 | { 664 | "data": { 665 | "text/plain": [ 666 | "Mahler Kaliste\n", 667 | "Beethoven Bonn\n", 668 | "Puccini Lucques\n", 669 | "Shostakovich Saint-Petersburg\n", 670 | "Name: city, dtype: object" 671 | ] 672 | }, 673 | "execution_count": 19, 674 | "metadata": {}, 675 | "output_type": "execute_result" 676 | } 677 | ], 678 | "source": [ 679 | "composers_df.city" 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": {}, 685 | "source": [ 686 | "The attributes synthax has some limitations, so in case something does not work as expected, revert to the brackets notation.\n", 687 | "\n", 688 | "When specifiying multiple columns, a DataFrame is returned:" 689 | ] 690 | }, 691 | { 692 | "cell_type": "code", 693 | "execution_count": 20, 694 | "metadata": {}, 695 | "outputs": [ 696 | { 697 | "data": { 698 | "text/html": [ 699 | "
\n", 700 | "\n", 713 | "\n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | "
citybirth
MahlerKaliste1860
BeethovenBonn1770
PucciniLucques1858
ShostakovichSaint-Petersburg1906
\n", 744 | "
" 745 | ], 746 | "text/plain": [ 747 | " city birth\n", 748 | "Mahler Kaliste 1860\n", 749 | "Beethoven Bonn 1770\n", 750 | "Puccini Lucques 1858\n", 751 | "Shostakovich Saint-Petersburg 1906" 752 | ] 753 | }, 754 | "execution_count": 20, 755 | "metadata": {}, 756 | "output_type": "execute_result" 757 | } 758 | ], 759 | "source": [ 760 | "composers_df[['city', 'birth']]" 761 | ] 762 | }, 763 | { 764 | "cell_type": "markdown", 765 | "metadata": {}, 766 | "source": [ 767 | "One of the important differences with a regular Numpy array is that here, regular indexing doesn't work:" 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": 21, 773 | "metadata": {}, 774 | "outputs": [], 775 | "source": [ 776 | "#composers_df[0,0]" 777 | ] 778 | }, 779 | { 780 | "cell_type": "markdown", 781 | "metadata": {}, 782 | "source": [ 783 | "Instead one has to use either the ```.iloc[]``` or the ```.loc[]``` method. ```.ìloc[]``` can be used to recover the regular indexing:" 784 | ] 785 | }, 786 | { 787 | "cell_type": "code", 788 | "execution_count": 22, 789 | "metadata": {}, 790 | "outputs": [ 791 | { 792 | "data": { 793 | "text/plain": [ 794 | "1911" 795 | ] 796 | }, 797 | "execution_count": 22, 798 | "metadata": {}, 799 | "output_type": "execute_result" 800 | } 801 | ], 802 | "source": [ 803 | " composers_df.iloc[0,1]" 804 | ] 805 | }, 806 | { 807 | "cell_type": "markdown", 808 | "metadata": {}, 809 | "source": [ 810 | "While ```.loc[]``` allows one to recover elements by using the **explicit** index, on our case the composers name:" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": 23, 816 | "metadata": {}, 817 | "outputs": [ 818 | { 819 | "data": { 820 | "text/plain": [ 821 | "1911" 822 | ] 823 | }, 824 | "execution_count": 23, 825 | "metadata": {}, 826 | "output_type": "execute_result" 827 | } 828 | ], 829 | "source": [ 830 | "composers_df.loc['Mahler','death']" 831 | ] 832 | }, 833 | { 834 | "cell_type": "markdown", 835 | "metadata": {}, 836 | "source": [ 837 | "**Remember that ```loc``` and ``ìloc``` use brackets [] and not parenthesis ().**\n", 838 | "\n", 839 | "Numpy style indexing works here too" 840 | ] 841 | }, 842 | { 843 | "cell_type": "code", 844 | "execution_count": 24, 845 | "metadata": {}, 846 | "outputs": [ 847 | { 848 | "data": { 849 | "text/html": [ 850 | "
\n", 851 | "\n", 864 | "\n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | "
birthdeathcity
Beethoven17701827Bonn
Puccini18581924Lucques
\n", 888 | "
" 889 | ], 890 | "text/plain": [ 891 | " birth death city\n", 892 | "Beethoven 1770 1827 Bonn\n", 893 | "Puccini 1858 1924 Lucques" 894 | ] 895 | }, 896 | "execution_count": 24, 897 | "metadata": {}, 898 | "output_type": "execute_result" 899 | } 900 | ], 901 | "source": [ 902 | "composers_df.iloc[1:3,:]" 903 | ] 904 | }, 905 | { 906 | "cell_type": "markdown", 907 | "metadata": {}, 908 | "source": [ 909 | "If you are working with a large table, it might be useful to sometimes have a list of all the columns. This is given by the ```.keys()``` attribute:" 910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": 25, 915 | "metadata": {}, 916 | "outputs": [ 917 | { 918 | "data": { 919 | "text/plain": [ 920 | "Index(['birth', 'death', 'city'], dtype='object')" 921 | ] 922 | }, 923 | "execution_count": 25, 924 | "metadata": {}, 925 | "output_type": "execute_result" 926 | } 927 | ], 928 | "source": [ 929 | "composers_df.keys()" 930 | ] 931 | }, 932 | { 933 | "cell_type": "markdown", 934 | "metadata": {}, 935 | "source": [ 936 | "### 7.2.3 Adding columns" 937 | ] 938 | }, 939 | { 940 | "cell_type": "markdown", 941 | "metadata": {}, 942 | "source": [ 943 | "It is very simple to add a column to a Dataframe. One can e.g. just create a column a give it a default value that we can change later:" 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": 26, 949 | "metadata": {}, 950 | "outputs": [], 951 | "source": [ 952 | "composers_df['country'] = 'default'" 953 | ] 954 | }, 955 | { 956 | "cell_type": "code", 957 | "execution_count": 27, 958 | "metadata": {}, 959 | "outputs": [ 960 | { 961 | "data": { 962 | "text/html": [ 963 | "
\n", 964 | "\n", 977 | "\n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | "
birthdeathcitycountry
Mahler18601911Kalistedefault
Beethoven17701827Bonndefault
Puccini18581924Lucquesdefault
Shostakovich19061975Saint-Petersburgdefault
\n", 1018 | "
" 1019 | ], 1020 | "text/plain": [ 1021 | " birth death city country\n", 1022 | "Mahler 1860 1911 Kaliste default\n", 1023 | "Beethoven 1770 1827 Bonn default\n", 1024 | "Puccini 1858 1924 Lucques default\n", 1025 | "Shostakovich 1906 1975 Saint-Petersburg default" 1026 | ] 1027 | }, 1028 | "execution_count": 27, 1029 | "metadata": {}, 1030 | "output_type": "execute_result" 1031 | } 1032 | ], 1033 | "source": [ 1034 | "composers_df" 1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "markdown", 1039 | "metadata": {}, 1040 | "source": [ 1041 | "Or one can use an existing list:" 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "code", 1046 | "execution_count": 28, 1047 | "metadata": {}, 1048 | "outputs": [], 1049 | "source": [ 1050 | "country = ['Austria','Germany','Italy','Russia']" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "execution_count": 29, 1056 | "metadata": {}, 1057 | "outputs": [], 1058 | "source": [ 1059 | "composers_df['country2'] = country" 1060 | ] 1061 | }, 1062 | { 1063 | "cell_type": "code", 1064 | "execution_count": 30, 1065 | "metadata": {}, 1066 | "outputs": [ 1067 | { 1068 | "data": { 1069 | "text/html": [ 1070 | "
\n", 1071 | "\n", 1084 | "\n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | " \n", 1092 | " \n", 1093 | " \n", 1094 | " \n", 1095 | " \n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | "
birthdeathcitycountrycountry2
Mahler18601911KalistedefaultAustria
Beethoven17701827BonndefaultGermany
Puccini18581924LucquesdefaultItaly
Shostakovich19061975Saint-PetersburgdefaultRussia
\n", 1130 | "
" 1131 | ], 1132 | "text/plain": [ 1133 | " birth death city country country2\n", 1134 | "Mahler 1860 1911 Kaliste default Austria\n", 1135 | "Beethoven 1770 1827 Bonn default Germany\n", 1136 | "Puccini 1858 1924 Lucques default Italy\n", 1137 | "Shostakovich 1906 1975 Saint-Petersburg default Russia" 1138 | ] 1139 | }, 1140 | "execution_count": 30, 1141 | "metadata": {}, 1142 | "output_type": "execute_result" 1143 | } 1144 | ], 1145 | "source": [ 1146 | "composers_df" 1147 | ] 1148 | } 1149 | ], 1150 | "metadata": { 1151 | "kernelspec": { 1152 | "display_name": "Python 3", 1153 | "language": "python", 1154 | "name": "python3" 1155 | }, 1156 | "language_info": { 1157 | "codemirror_mode": { 1158 | "name": "ipython", 1159 | "version": 3 1160 | }, 1161 | "file_extension": ".py", 1162 | "mimetype": "text/x-python", 1163 | "name": "python", 1164 | "nbconvert_exporter": "python", 1165 | "pygments_lexer": "ipython3", 1166 | "version": "3.8.2" 1167 | } 1168 | }, 1169 | "nbformat": 4, 1170 | "nbformat_minor": 4 1171 | } 1172 | -------------------------------------------------------------------------------- /09-DA_Pandas_operations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 9. Operations with Pandas objects" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import pandas as pd\n", 17 | "import numpy as np" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "One of the great advantages of using Pandas to handle tabular data is how simple it is to extract valuable information from them. Here we are going to see various types of operations that are available for this." 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "## 9.1 Matrix types of operations" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "The strength of Numpy is its natural way of handling matrix operations, and Pandas reuses a lot of these features. For example one can use simple mathematical operations to operate at the cell level: " 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "metadata": {}, 45 | "outputs": [ 46 | { 47 | "data": { 48 | "text/html": [ 49 | "
\n", 50 | "\n", 63 | "\n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | "
composerbirthdeathcity
0Mahler18601911Kaliste
1Beethoven17701827Bonn
2Puccini18581924Lucques
3Shostakovich19061975Saint-Petersburg
\n", 104 | "
" 105 | ], 106 | "text/plain": [ 107 | " composer birth death city\n", 108 | "0 Mahler 1860 1911 Kaliste\n", 109 | "1 Beethoven 1770 1827 Bonn\n", 110 | "2 Puccini 1858 1924 Lucques\n", 111 | "3 Shostakovich 1906 1975 Saint-Petersburg" 112 | ] 113 | }, 114 | "execution_count": 2, 115 | "metadata": {}, 116 | "output_type": "execute_result" 117 | } 118 | ], 119 | "source": [ 120 | "compo_pd = pd.read_excel('Data/composers.xlsx')\n", 121 | "compo_pd" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 3, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "0 3720\n", 133 | "1 3540\n", 134 | "2 3716\n", 135 | "3 3812\n", 136 | "Name: birth, dtype: int64" 137 | ] 138 | }, 139 | "execution_count": 3, 140 | "metadata": {}, 141 | "output_type": "execute_result" 142 | } 143 | ], 144 | "source": [ 145 | "compo_pd['birth']*2" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 4, 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "data": { 155 | "text/plain": [ 156 | "0 7.528332\n", 157 | "1 7.478735\n", 158 | "2 7.527256\n", 159 | "3 7.552762\n", 160 | "Name: birth, dtype: float64" 161 | ] 162 | }, 163 | "execution_count": 4, 164 | "metadata": {}, 165 | "output_type": "execute_result" 166 | } 167 | ], 168 | "source": [ 169 | "np.log(compo_pd['birth'])" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "Here we applied functions only to series. Indeed, since our Dataframe contains e.g. strings, no operation can be done on it:" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 5, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "#compo_pd+1" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "If however we have a homogenous Dataframe, this is possible:" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 6, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "data": { 202 | "text/html": [ 203 | "
\n", 204 | "\n", 217 | "\n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | "
birthdeath
018601911
117701827
218581924
319061975
\n", 248 | "
" 249 | ], 250 | "text/plain": [ 251 | " birth death\n", 252 | "0 1860 1911\n", 253 | "1 1770 1827\n", 254 | "2 1858 1924\n", 255 | "3 1906 1975" 256 | ] 257 | }, 258 | "execution_count": 6, 259 | "metadata": {}, 260 | "output_type": "execute_result" 261 | } 262 | ], 263 | "source": [ 264 | "compo_pd[['birth','death']]" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 7, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "data": { 274 | "text/html": [ 275 | "
\n", 276 | "\n", 289 | "\n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | "
birthdeath
037203822
135403654
237163848
338123950
\n", 320 | "
" 321 | ], 322 | "text/plain": [ 323 | " birth death\n", 324 | "0 3720 3822\n", 325 | "1 3540 3654\n", 326 | "2 3716 3848\n", 327 | "3 3812 3950" 328 | ] 329 | }, 330 | "execution_count": 7, 331 | "metadata": {}, 332 | "output_type": "execute_result" 333 | } 334 | ], 335 | "source": [ 336 | "compo_pd[['birth','death']]*2" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "## 9.2 Column operations" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "There are other types of functions whose purpose is to summarize the data. For example the mean or standard deviation. Pandas by default applies such functions column-wise and returns a series containing e.g. the mean of each column:" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 8, 356 | "metadata": {}, 357 | "outputs": [ 358 | { 359 | "data": { 360 | "text/plain": [ 361 | "birth 1848.50\n", 362 | "death 1909.25\n", 363 | "dtype: float64" 364 | ] 365 | }, 366 | "execution_count": 8, 367 | "metadata": {}, 368 | "output_type": "execute_result" 369 | } 370 | ], 371 | "source": [ 372 | "np.mean(compo_pd)" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "Note that columns for which a mean does not make sense, like the city are discarded.\n", 380 | "A series of common functions like mean or standard deviation are directly implemented as methods and can be accessed in the alternative form:" 381 | ] 382 | }, 383 | { 384 | "cell_type": "code", 385 | "execution_count": 9, 386 | "metadata": {}, 387 | "outputs": [ 388 | { 389 | "data": { 390 | "text/html": [ 391 | "
\n", 392 | "\n", 405 | "\n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | "
birthdeath
count4.0000004.000000
mean1848.5000001909.250000
std56.83602161.396933
min1770.0000001827.000000
25%1836.0000001890.000000
50%1859.0000001917.500000
75%1871.5000001936.750000
max1906.0000001975.000000
\n", 456 | "
" 457 | ], 458 | "text/plain": [ 459 | " birth death\n", 460 | "count 4.000000 4.000000\n", 461 | "mean 1848.500000 1909.250000\n", 462 | "std 56.836021 61.396933\n", 463 | "min 1770.000000 1827.000000\n", 464 | "25% 1836.000000 1890.000000\n", 465 | "50% 1859.000000 1917.500000\n", 466 | "75% 1871.500000 1936.750000\n", 467 | "max 1906.000000 1975.000000" 468 | ] 469 | }, 470 | "execution_count": 9, 471 | "metadata": {}, 472 | "output_type": "execute_result" 473 | } 474 | ], 475 | "source": [ 476 | "compo_pd.describe()" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": 10, 482 | "metadata": {}, 483 | "outputs": [ 484 | { 485 | "data": { 486 | "text/plain": [ 487 | "birth 56.836021\n", 488 | "death 61.396933\n", 489 | "dtype: float64" 490 | ] 491 | }, 492 | "execution_count": 10, 493 | "metadata": {}, 494 | "output_type": "execute_result" 495 | } 496 | ], 497 | "source": [ 498 | "compo_pd.std()" 499 | ] 500 | }, 501 | { 502 | "cell_type": "markdown", 503 | "metadata": {}, 504 | "source": [ 505 | "If you need the mean of only a single column you can of course chains operations:" 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": 11, 511 | "metadata": {}, 512 | "outputs": [ 513 | { 514 | "data": { 515 | "text/plain": [ 516 | "1848.5" 517 | ] 518 | }, 519 | "execution_count": 11, 520 | "metadata": {}, 521 | "output_type": "execute_result" 522 | } 523 | ], 524 | "source": [ 525 | "compo_pd.birth.mean()" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "## 9.3 Operations between Series" 533 | ] 534 | }, 535 | { 536 | "cell_type": "markdown", 537 | "metadata": {}, 538 | "source": [ 539 | "We can also do computations with multiple series as we would do with Numpy arrays:" 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": 12, 545 | "metadata": {}, 546 | "outputs": [ 547 | { 548 | "data": { 549 | "text/plain": [ 550 | "0 51\n", 551 | "1 57\n", 552 | "2 66\n", 553 | "3 69\n", 554 | "dtype: int64" 555 | ] 556 | }, 557 | "execution_count": 12, 558 | "metadata": {}, 559 | "output_type": "execute_result" 560 | } 561 | ], 562 | "source": [ 563 | "compo_pd['death']-compo_pd['birth']" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": {}, 569 | "source": [ 570 | "We can even use the result of this computation to create a new column in our Dataframe:" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": 13, 576 | "metadata": {}, 577 | "outputs": [ 578 | { 579 | "data": { 580 | "text/html": [ 581 | "
\n", 582 | "\n", 595 | "\n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | "
composerbirthdeathcity
0Mahler18601911Kaliste
1Beethoven17701827Bonn
2Puccini18581924Lucques
3Shostakovich19061975Saint-Petersburg
\n", 636 | "
" 637 | ], 638 | "text/plain": [ 639 | " composer birth death city\n", 640 | "0 Mahler 1860 1911 Kaliste\n", 641 | "1 Beethoven 1770 1827 Bonn\n", 642 | "2 Puccini 1858 1924 Lucques\n", 643 | "3 Shostakovich 1906 1975 Saint-Petersburg" 644 | ] 645 | }, 646 | "execution_count": 13, 647 | "metadata": {}, 648 | "output_type": "execute_result" 649 | } 650 | ], 651 | "source": [ 652 | "compo_pd" 653 | ] 654 | }, 655 | { 656 | "cell_type": "code", 657 | "execution_count": 14, 658 | "metadata": {}, 659 | "outputs": [], 660 | "source": [ 661 | "compo_pd['age'] = compo_pd['death']-compo_pd['birth']" 662 | ] 663 | }, 664 | { 665 | "cell_type": "code", 666 | "execution_count": 15, 667 | "metadata": {}, 668 | "outputs": [ 669 | { 670 | "data": { 671 | "text/html": [ 672 | "
\n", 673 | "\n", 686 | "\n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | "
composerbirthdeathcityage
0Mahler18601911Kaliste51
1Beethoven17701827Bonn57
2Puccini18581924Lucques66
3Shostakovich19061975Saint-Petersburg69
\n", 732 | "
" 733 | ], 734 | "text/plain": [ 735 | " composer birth death city age\n", 736 | "0 Mahler 1860 1911 Kaliste 51\n", 737 | "1 Beethoven 1770 1827 Bonn 57\n", 738 | "2 Puccini 1858 1924 Lucques 66\n", 739 | "3 Shostakovich 1906 1975 Saint-Petersburg 69" 740 | ] 741 | }, 742 | "execution_count": 15, 743 | "metadata": {}, 744 | "output_type": "execute_result" 745 | } 746 | ], 747 | "source": [ 748 | "compo_pd" 749 | ] 750 | }, 751 | { 752 | "cell_type": "markdown", 753 | "metadata": {}, 754 | "source": [ 755 | "## 9.4 Other functions" 756 | ] 757 | }, 758 | { 759 | "cell_type": "markdown", 760 | "metadata": {}, 761 | "source": [ 762 | "Sometimes one needs to apply to a column a very specific function that is not provided by default. In that case we can use one of the different ```apply``` methods of Pandas.\n", 763 | "\n", 764 | "The simplest case is to apply a function to a column, or Series of a DataFrame. Let's say for example that we want to define the the age >60 as 'old' and <60 as 'young'. We can define the following general function:" 765 | ] 766 | }, 767 | { 768 | "cell_type": "code", 769 | "execution_count": 16, 770 | "metadata": {}, 771 | "outputs": [], 772 | "source": [ 773 | "def define_age(x):\n", 774 | " if x>60:\n", 775 | " return 'old'\n", 776 | " else:\n", 777 | " return 'young'" 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": 17, 783 | "metadata": {}, 784 | "outputs": [ 785 | { 786 | "data": { 787 | "text/plain": [ 788 | "'young'" 789 | ] 790 | }, 791 | "execution_count": 17, 792 | "metadata": {}, 793 | "output_type": "execute_result" 794 | } 795 | ], 796 | "source": [ 797 | "define_age(30)" 798 | ] 799 | }, 800 | { 801 | "cell_type": "code", 802 | "execution_count": 18, 803 | "metadata": {}, 804 | "outputs": [ 805 | { 806 | "data": { 807 | "text/plain": [ 808 | "'old'" 809 | ] 810 | }, 811 | "execution_count": 18, 812 | "metadata": {}, 813 | "output_type": "execute_result" 814 | } 815 | ], 816 | "source": [ 817 | "define_age(70)" 818 | ] 819 | }, 820 | { 821 | "cell_type": "markdown", 822 | "metadata": {}, 823 | "source": [ 824 | "We can now apply this function on an entire Series:" 825 | ] 826 | }, 827 | { 828 | "cell_type": "code", 829 | "execution_count": 19, 830 | "metadata": {}, 831 | "outputs": [ 832 | { 833 | "data": { 834 | "text/plain": [ 835 | "0 young\n", 836 | "1 young\n", 837 | "2 old\n", 838 | "3 old\n", 839 | "Name: age, dtype: object" 840 | ] 841 | }, 842 | "execution_count": 19, 843 | "metadata": {}, 844 | "output_type": "execute_result" 845 | } 846 | ], 847 | "source": [ 848 | "compo_pd.age.apply(define_age)" 849 | ] 850 | }, 851 | { 852 | "cell_type": "code", 853 | "execution_count": 20, 854 | "metadata": {}, 855 | "outputs": [ 856 | { 857 | "data": { 858 | "text/plain": [ 859 | "0 2601\n", 860 | "1 3249\n", 861 | "2 4356\n", 862 | "3 4761\n", 863 | "Name: age, dtype: int64" 864 | ] 865 | }, 866 | "execution_count": 20, 867 | "metadata": {}, 868 | "output_type": "execute_result" 869 | } 870 | ], 871 | "source": [ 872 | "compo_pd.age.apply(lambda x: x**2)" 873 | ] 874 | }, 875 | { 876 | "cell_type": "markdown", 877 | "metadata": {}, 878 | "source": [ 879 | "And again, if we want, we can directly use this output to create a new column:" 880 | ] 881 | }, 882 | { 883 | "cell_type": "code", 884 | "execution_count": 21, 885 | "metadata": {}, 886 | "outputs": [ 887 | { 888 | "data": { 889 | "text/html": [ 890 | "
\n", 891 | "\n", 904 | "\n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | "
composerbirthdeathcityageage_def
0Mahler18601911Kaliste51young
1Beethoven17701827Bonn57young
2Puccini18581924Lucques66old
3Shostakovich19061975Saint-Petersburg69old
\n", 955 | "
" 956 | ], 957 | "text/plain": [ 958 | " composer birth death city age age_def\n", 959 | "0 Mahler 1860 1911 Kaliste 51 young\n", 960 | "1 Beethoven 1770 1827 Bonn 57 young\n", 961 | "2 Puccini 1858 1924 Lucques 66 old\n", 962 | "3 Shostakovich 1906 1975 Saint-Petersburg 69 old" 963 | ] 964 | }, 965 | "execution_count": 21, 966 | "metadata": {}, 967 | "output_type": "execute_result" 968 | } 969 | ], 970 | "source": [ 971 | "compo_pd['age_def'] = compo_pd.age.apply(define_age)\n", 972 | "compo_pd" 973 | ] 974 | }, 975 | { 976 | "cell_type": "markdown", 977 | "metadata": {}, 978 | "source": [ 979 | "We can also apply a function to an entire DataFrame. For example we can ask how many composers have birth and death dates within the XIXth century:" 980 | ] 981 | }, 982 | { 983 | "cell_type": "code", 984 | "execution_count": 22, 985 | "metadata": {}, 986 | "outputs": [], 987 | "source": [ 988 | "def nineteen_century_count(x):\n", 989 | " return np.sum((x>=1800)&(x<1900))\n" 990 | ] 991 | }, 992 | { 993 | "cell_type": "code", 994 | "execution_count": 23, 995 | "metadata": {}, 996 | "outputs": [ 997 | { 998 | "data": { 999 | "text/plain": [ 1000 | "birth 2\n", 1001 | "death 1\n", 1002 | "dtype: int64" 1003 | ] 1004 | }, 1005 | "execution_count": 23, 1006 | "metadata": {}, 1007 | "output_type": "execute_result" 1008 | } 1009 | ], 1010 | "source": [ 1011 | "compo_pd[['birth','death']].apply(nineteen_century_count)" 1012 | ] 1013 | }, 1014 | { 1015 | "cell_type": "markdown", 1016 | "metadata": {}, 1017 | "source": [ 1018 | "The function is applied column-wise and returns a single number for each in the form of a series." 1019 | ] 1020 | }, 1021 | { 1022 | "cell_type": "code", 1023 | "execution_count": 24, 1024 | "metadata": {}, 1025 | "outputs": [], 1026 | "source": [ 1027 | "def nineteen_century_true(x):\n", 1028 | " return (x>=1800)&(x<1900)\n" 1029 | ] 1030 | }, 1031 | { 1032 | "cell_type": "code", 1033 | "execution_count": 25, 1034 | "metadata": {}, 1035 | "outputs": [ 1036 | { 1037 | "data": { 1038 | "text/html": [ 1039 | "
\n", 1040 | "\n", 1053 | "\n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | "
birthdeath
0TrueFalse
1FalseTrue
2TrueFalse
3FalseFalse
\n", 1084 | "
" 1085 | ], 1086 | "text/plain": [ 1087 | " birth death\n", 1088 | "0 True False\n", 1089 | "1 False True\n", 1090 | "2 True False\n", 1091 | "3 False False" 1092 | ] 1093 | }, 1094 | "execution_count": 25, 1095 | "metadata": {}, 1096 | "output_type": "execute_result" 1097 | } 1098 | ], 1099 | "source": [ 1100 | "compo_pd[['birth','death']].apply(nineteen_century_true)" 1101 | ] 1102 | }, 1103 | { 1104 | "cell_type": "markdown", 1105 | "metadata": {}, 1106 | "source": [ 1107 | "Here the operation is again applied column-wise but the output is a Series.\n", 1108 | "\n", 1109 | "There are more combinations of what can be the in- and output of the apply function and in what order (column- or row-wise) they are applied that cannot be covered here." 1110 | ] 1111 | }, 1112 | { 1113 | "cell_type": "markdown", 1114 | "metadata": {}, 1115 | "source": [ 1116 | "## 9.5 Logical indexing" 1117 | ] 1118 | }, 1119 | { 1120 | "cell_type": "markdown", 1121 | "metadata": {}, 1122 | "source": [ 1123 | "Just like with Numpy, it is possible to subselect parts of a Dataframe using logical indexing. Let's have a look again at an example:" 1124 | ] 1125 | }, 1126 | { 1127 | "cell_type": "code", 1128 | "execution_count": 26, 1129 | "metadata": {}, 1130 | "outputs": [ 1131 | { 1132 | "data": { 1133 | "text/html": [ 1134 | "
\n", 1135 | "\n", 1148 | "\n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | "
composerbirthdeathcityageage_def
0Mahler18601911Kaliste51young
1Beethoven17701827Bonn57young
2Puccini18581924Lucques66old
3Shostakovich19061975Saint-Petersburg69old
\n", 1199 | "
" 1200 | ], 1201 | "text/plain": [ 1202 | " composer birth death city age age_def\n", 1203 | "0 Mahler 1860 1911 Kaliste 51 young\n", 1204 | "1 Beethoven 1770 1827 Bonn 57 young\n", 1205 | "2 Puccini 1858 1924 Lucques 66 old\n", 1206 | "3 Shostakovich 1906 1975 Saint-Petersburg 69 old" 1207 | ] 1208 | }, 1209 | "execution_count": 26, 1210 | "metadata": {}, 1211 | "output_type": "execute_result" 1212 | } 1213 | ], 1214 | "source": [ 1215 | "compo_pd" 1216 | ] 1217 | }, 1218 | { 1219 | "cell_type": "markdown", 1220 | "metadata": {}, 1221 | "source": [ 1222 | "If we use a logical comparison on a series, this yields a **logical Series**:" 1223 | ] 1224 | }, 1225 | { 1226 | "cell_type": "code", 1227 | "execution_count": 27, 1228 | "metadata": {}, 1229 | "outputs": [ 1230 | { 1231 | "data": { 1232 | "text/plain": [ 1233 | "0 1860\n", 1234 | "1 1770\n", 1235 | "2 1858\n", 1236 | "3 1906\n", 1237 | "Name: birth, dtype: int64" 1238 | ] 1239 | }, 1240 | "execution_count": 27, 1241 | "metadata": {}, 1242 | "output_type": "execute_result" 1243 | } 1244 | ], 1245 | "source": [ 1246 | "compo_pd['birth']" 1247 | ] 1248 | }, 1249 | { 1250 | "cell_type": "code", 1251 | "execution_count": 28, 1252 | "metadata": {}, 1253 | "outputs": [ 1254 | { 1255 | "data": { 1256 | "text/plain": [ 1257 | "0 True\n", 1258 | "1 False\n", 1259 | "2 False\n", 1260 | "3 True\n", 1261 | "Name: birth, dtype: bool" 1262 | ] 1263 | }, 1264 | "execution_count": 28, 1265 | "metadata": {}, 1266 | "output_type": "execute_result" 1267 | } 1268 | ], 1269 | "source": [ 1270 | "compo_pd['birth'] > 1859" 1271 | ] 1272 | }, 1273 | { 1274 | "cell_type": "markdown", 1275 | "metadata": {}, 1276 | "source": [ 1277 | "Just like in Numpy we can use this logical Series as an index to select elements in the Dataframe:" 1278 | ] 1279 | }, 1280 | { 1281 | "cell_type": "code", 1282 | "execution_count": 29, 1283 | "metadata": {}, 1284 | "outputs": [ 1285 | { 1286 | "data": { 1287 | "text/plain": [ 1288 | "0 True\n", 1289 | "1 False\n", 1290 | "2 False\n", 1291 | "3 True\n", 1292 | "Name: birth, dtype: bool" 1293 | ] 1294 | }, 1295 | "execution_count": 29, 1296 | "metadata": {}, 1297 | "output_type": "execute_result" 1298 | } 1299 | ], 1300 | "source": [ 1301 | "log_indexer = compo_pd['birth'] > 1859\n", 1302 | "log_indexer" 1303 | ] 1304 | }, 1305 | { 1306 | "cell_type": "code", 1307 | "execution_count": 30, 1308 | "metadata": {}, 1309 | "outputs": [ 1310 | { 1311 | "data": { 1312 | "text/html": [ 1313 | "
\n", 1314 | "\n", 1327 | "\n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | " \n", 1370 | " \n", 1371 | " \n", 1372 | " \n", 1373 | " \n", 1374 | " \n", 1375 | " \n", 1376 | " \n", 1377 | "
composerbirthdeathcityageage_def
0Mahler18601911Kaliste51young
1Beethoven17701827Bonn57young
2Puccini18581924Lucques66old
3Shostakovich19061975Saint-Petersburg69old
\n", 1378 | "
" 1379 | ], 1380 | "text/plain": [ 1381 | " composer birth death city age age_def\n", 1382 | "0 Mahler 1860 1911 Kaliste 51 young\n", 1383 | "1 Beethoven 1770 1827 Bonn 57 young\n", 1384 | "2 Puccini 1858 1924 Lucques 66 old\n", 1385 | "3 Shostakovich 1906 1975 Saint-Petersburg 69 old" 1386 | ] 1387 | }, 1388 | "execution_count": 30, 1389 | "metadata": {}, 1390 | "output_type": "execute_result" 1391 | } 1392 | ], 1393 | "source": [ 1394 | "compo_pd" 1395 | ] 1396 | }, 1397 | { 1398 | "cell_type": "code", 1399 | "execution_count": 31, 1400 | "metadata": {}, 1401 | "outputs": [ 1402 | { 1403 | "data": { 1404 | "text/plain": [ 1405 | "0 False\n", 1406 | "1 True\n", 1407 | "2 True\n", 1408 | "3 False\n", 1409 | "Name: birth, dtype: bool" 1410 | ] 1411 | }, 1412 | "execution_count": 31, 1413 | "metadata": {}, 1414 | "output_type": "execute_result" 1415 | } 1416 | ], 1417 | "source": [ 1418 | "~log_indexer" 1419 | ] 1420 | }, 1421 | { 1422 | "cell_type": "code", 1423 | "execution_count": 32, 1424 | "metadata": {}, 1425 | "outputs": [ 1426 | { 1427 | "data": { 1428 | "text/html": [ 1429 | "
\n", 1430 | "\n", 1443 | "\n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " \n", 1456 | " \n", 1457 | " \n", 1458 | " \n", 1459 | " \n", 1460 | " \n", 1461 | " \n", 1462 | " \n", 1463 | " \n", 1464 | " \n", 1465 | " \n", 1466 | " \n", 1467 | " \n", 1468 | " \n", 1469 | " \n", 1470 | " \n", 1471 | " \n", 1472 | " \n", 1473 | " \n", 1474 | " \n", 1475 | "
composerbirthdeathcityageage_def
1Beethoven17701827Bonn57young
2Puccini18581924Lucques66old
\n", 1476 | "
" 1477 | ], 1478 | "text/plain": [ 1479 | " composer birth death city age age_def\n", 1480 | "1 Beethoven 1770 1827 Bonn 57 young\n", 1481 | "2 Puccini 1858 1924 Lucques 66 old" 1482 | ] 1483 | }, 1484 | "execution_count": 32, 1485 | "metadata": {}, 1486 | "output_type": "execute_result" 1487 | } 1488 | ], 1489 | "source": [ 1490 | "compo_pd[~log_indexer]" 1491 | ] 1492 | }, 1493 | { 1494 | "cell_type": "markdown", 1495 | "metadata": {}, 1496 | "source": [ 1497 | "We can also create more complex logical indexings: " 1498 | ] 1499 | }, 1500 | { 1501 | "cell_type": "code", 1502 | "execution_count": 33, 1503 | "metadata": {}, 1504 | "outputs": [ 1505 | { 1506 | "data": { 1507 | "text/plain": [ 1508 | "0 False\n", 1509 | "1 False\n", 1510 | "2 False\n", 1511 | "3 True\n", 1512 | "dtype: bool" 1513 | ] 1514 | }, 1515 | "execution_count": 33, 1516 | "metadata": {}, 1517 | "output_type": "execute_result" 1518 | } 1519 | ], 1520 | "source": [ 1521 | "(compo_pd['birth'] > 1859)&(compo_pd['age']>60)" 1522 | ] 1523 | }, 1524 | { 1525 | "cell_type": "code", 1526 | "execution_count": 34, 1527 | "metadata": {}, 1528 | "outputs": [ 1529 | { 1530 | "data": { 1531 | "text/html": [ 1532 | "
\n", 1533 | "\n", 1546 | "\n", 1547 | " \n", 1548 | " \n", 1549 | " \n", 1550 | " \n", 1551 | " \n", 1552 | " \n", 1553 | " \n", 1554 | " \n", 1555 | " \n", 1556 | " \n", 1557 | " \n", 1558 | " \n", 1559 | " \n", 1560 | " \n", 1561 | " \n", 1562 | " \n", 1563 | " \n", 1564 | " \n", 1565 | " \n", 1566 | " \n", 1567 | " \n", 1568 | " \n", 1569 | "
composerbirthdeathcityageage_def
3Shostakovich19061975Saint-Petersburg69old
\n", 1570 | "
" 1571 | ], 1572 | "text/plain": [ 1573 | " composer birth death city age age_def\n", 1574 | "3 Shostakovich 1906 1975 Saint-Petersburg 69 old" 1575 | ] 1576 | }, 1577 | "execution_count": 34, 1578 | "metadata": {}, 1579 | "output_type": "execute_result" 1580 | } 1581 | ], 1582 | "source": [ 1583 | "compo_pd[(compo_pd['birth'] > 1859)&(compo_pd['age']>60)]" 1584 | ] 1585 | }, 1586 | { 1587 | "cell_type": "markdown", 1588 | "metadata": {}, 1589 | "source": [ 1590 | "And we can create new arrays containing only these subselections:" 1591 | ] 1592 | }, 1593 | { 1594 | "cell_type": "code", 1595 | "execution_count": 35, 1596 | "metadata": {}, 1597 | "outputs": [], 1598 | "source": [ 1599 | "compos_sub = compo_pd[compo_pd['birth'] > 1859]" 1600 | ] 1601 | }, 1602 | { 1603 | "cell_type": "code", 1604 | "execution_count": 36, 1605 | "metadata": {}, 1606 | "outputs": [ 1607 | { 1608 | "data": { 1609 | "text/html": [ 1610 | "
\n", 1611 | "\n", 1624 | "\n", 1625 | " \n", 1626 | " \n", 1627 | " \n", 1628 | " \n", 1629 | " \n", 1630 | " \n", 1631 | " \n", 1632 | " \n", 1633 | " \n", 1634 | " \n", 1635 | " \n", 1636 | " \n", 1637 | " \n", 1638 | " \n", 1639 | " \n", 1640 | " \n", 1641 | " \n", 1642 | " \n", 1643 | " \n", 1644 | " \n", 1645 | " \n", 1646 | " \n", 1647 | " \n", 1648 | " \n", 1649 | " \n", 1650 | " \n", 1651 | " \n", 1652 | " \n", 1653 | " \n", 1654 | " \n", 1655 | " \n", 1656 | "
composerbirthdeathcityageage_def
0Mahler18601911Kaliste51young
3Shostakovich19061975Saint-Petersburg69old
\n", 1657 | "
" 1658 | ], 1659 | "text/plain": [ 1660 | " composer birth death city age age_def\n", 1661 | "0 Mahler 1860 1911 Kaliste 51 young\n", 1662 | "3 Shostakovich 1906 1975 Saint-Petersburg 69 old" 1663 | ] 1664 | }, 1665 | "execution_count": 36, 1666 | "metadata": {}, 1667 | "output_type": "execute_result" 1668 | } 1669 | ], 1670 | "source": [ 1671 | "compos_sub" 1672 | ] 1673 | }, 1674 | { 1675 | "cell_type": "markdown", 1676 | "metadata": {}, 1677 | "source": [ 1678 | "We can then modify the new array:" 1679 | ] 1680 | }, 1681 | { 1682 | "cell_type": "code", 1683 | "execution_count": 37, 1684 | "metadata": {}, 1685 | "outputs": [ 1686 | { 1687 | "name": "stderr", 1688 | "output_type": "stream", 1689 | "text": [ 1690 | "/Users/gw18g940/miniconda3/envs/danalytics/lib/python3.8/site-packages/pandas/core/indexing.py:966: SettingWithCopyWarning: \n", 1691 | "A value is trying to be set on a copy of a slice from a DataFrame.\n", 1692 | "Try using .loc[row_indexer,col_indexer] = value instead\n", 1693 | "\n", 1694 | "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", 1695 | " self.obj[item] = s\n" 1696 | ] 1697 | } 1698 | ], 1699 | "source": [ 1700 | "compos_sub.loc[0,'birth'] = 3000" 1701 | ] 1702 | }, 1703 | { 1704 | "cell_type": "markdown", 1705 | "metadata": {}, 1706 | "source": [ 1707 | "Note that we get this SettingWithCopyWarning warning. This is a very common problem hand has to do with how new arrays are created when making subselections. Simply stated, did we create an entirely new array or a \"view\" of the old one? This will be very case-dependent and to avoid this, if we want to create a new array we can just enforce it using the ```copy()``` method (for more information on the topic see for example this [explanation](https://www.dataquest.io/blog/settingwithcopywarning/):" 1708 | ] 1709 | }, 1710 | { 1711 | "cell_type": "code", 1712 | "execution_count": 38, 1713 | "metadata": {}, 1714 | "outputs": [], 1715 | "source": [ 1716 | "compos_sub2 = compo_pd[compo_pd['birth'] > 1859].copy()\n", 1717 | "compos_sub2.loc[0,'birth'] = 3000" 1718 | ] 1719 | }, 1720 | { 1721 | "cell_type": "code", 1722 | "execution_count": 39, 1723 | "metadata": {}, 1724 | "outputs": [ 1725 | { 1726 | "data": { 1727 | "text/html": [ 1728 | "
\n", 1729 | "\n", 1742 | "\n", 1743 | " \n", 1744 | " \n", 1745 | " \n", 1746 | " \n", 1747 | " \n", 1748 | " \n", 1749 | " \n", 1750 | " \n", 1751 | " \n", 1752 | " \n", 1753 | " \n", 1754 | " \n", 1755 | " \n", 1756 | " \n", 1757 | " \n", 1758 | " \n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | " \n", 1763 | " \n", 1764 | " \n", 1765 | " \n", 1766 | " \n", 1767 | " \n", 1768 | " \n", 1769 | " \n", 1770 | " \n", 1771 | " \n", 1772 | " \n", 1773 | " \n", 1774 | "
composerbirthdeathcityageage_def
0Mahler30001911Kaliste51young
3Shostakovich19061975Saint-Petersburg69old
\n", 1775 | "
" 1776 | ], 1777 | "text/plain": [ 1778 | " composer birth death city age age_def\n", 1779 | "0 Mahler 3000 1911 Kaliste 51 young\n", 1780 | "3 Shostakovich 1906 1975 Saint-Petersburg 69 old" 1781 | ] 1782 | }, 1783 | "execution_count": 39, 1784 | "metadata": {}, 1785 | "output_type": "execute_result" 1786 | } 1787 | ], 1788 | "source": [ 1789 | "compos_sub2" 1790 | ] 1791 | } 1792 | ], 1793 | "metadata": { 1794 | "kernelspec": { 1795 | "display_name": "Python 3", 1796 | "language": "python", 1797 | "name": "python3" 1798 | }, 1799 | "language_info": { 1800 | "codemirror_mode": { 1801 | "name": "ipython", 1802 | "version": 3 1803 | }, 1804 | "file_extension": ".py", 1805 | "mimetype": "text/x-python", 1806 | "name": "python", 1807 | "nbconvert_exporter": "python", 1808 | "pygments_lexer": "ipython3", 1809 | "version": "3.8.2" 1810 | } 1811 | }, 1812 | "nbformat": 4, 1813 | "nbformat_minor": 4 1814 | } 1815 | -------------------------------------------------------------------------------- /98-DA_Numpy_Exercises.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import matplotlib.pyplot as plt" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "# Exercice Numpy" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "## 1. Array creation" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "- Create a 1D array called ```xarray``` with values from 0 to 10 and in steps of 0.1. Check the shape of the array:" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "- Create an array of normally distributed numbers with mean $\\mu=0$ and standard deviation $\\sigma=0.5$. It should have 20 rows and as many columns as there are elements in ```xarray```. Call it ```normal_array```:" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "- Check the type of ```normal_array```:" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "## 2. 
Array mathematics" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "- Using ```xarray``` as x-variable, create a new array ```yarray``` as y-variable using the function $y = 10 \cos(x) e^{-0.1x}$:" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "- Create ```array_abs``` by taking the absolute value of ```yarray```:" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "- Create a boolean array (logical array) where all positions $>0.3$ in ```array_abs``` are ```True``` and the others ```False```:" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "- Create a standard deviation projection along the second dimension (columns) of ```normal_array```. Check that the dimensions are the ones you expected. Also, are the values around the value you expect?" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "## 3. Plotting\n", 137 | "\n", 138 | "- Use a line plot to plot ```yarray``` vs ```xarray```:" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "- Try to change the color of the plot to red and to have markers on top of the line as squares:" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "- Plot the ```normal_array``` as an image and change the colormap to 'gray':" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "- Assemble the two plots above in a figure with a grid of one row and two columns:" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "## 4. Indexing\n", 195 | "\n", 196 | "- Create new arrays where you select every second element from ```xarray``` and ```yarray```. Plot them on top of ```xarray``` and ```yarray```." 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "- Select all values of ```yarray``` that are larger than 0. Plot those on top of the regular ```xarray``` and ```yarray``` plot."
211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "- Flip the order of ```xarray``` and use it to plot ```yarray```:" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "## 5. Combining arrays\n", 239 | "\n", 240 | "- Create an array filled with ones with the same shape as ```normal_array```. Concatenate it to ```normal_array``` along the first dimension and plot the result:" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "metadata": {}, 247 | "outputs": [], 248 | "source": [] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "- ```yarray``` represents a signal. Each line of ```normal_array``` represents a possible random noise for that signal. Using broadcasting, try to create an array of noisy versions of ```yarray``` using ```normal_array```. Finally, plot it:" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [] 263 | } 264 | ], 265 | "metadata": { 266 | "kernelspec": { 267 | "display_name": "Python 3", 268 | "language": "python", 269 | "name": "python3" 270 | }, 271 | "language_info": { 272 | "codemirror_mode": { 273 | "name": "ipython", 274 | "version": 3 275 | }, 276 | "file_extension": ".py", 277 | "mimetype": "text/x-python", 278 | "name": "python", 279 | "nbconvert_exporter": "python", 280 | "pygments_lexer": "ipython3", 281 | "version": "3.8.5" 282 | }, 283 | "toc": { 284 | "base_numbering": 1, 285 | "nav_menu": {}, 286 | "number_sections": false, 287 | "sideBar": true, 288 | "skip_h1_title": false, 289 | "title_cell": "Table of Contents", 290 | "title_sidebar": "Contents", 291 | "toc_cell": false, 292 | "toc_position": {}, 293 | "toc_section_display": true, 294 | "toc_window_display": true 295 | }, 296 | "varInspector": { 297 | "cols": { 298 | "lenName": 16, 299 | "lenType": 16, 300 | "lenVar": 40 301 | }, 302 | "kernels_config": { 303 | "python": { 304 | "delete_cmd_postfix": "", 305 | "delete_cmd_prefix": "del ", 306 | "library": "var_list.py", 307 | "varRefreshCmd": "print(var_dic_list())" 308 | }, 309 | "r": { 310 | "delete_cmd_postfix": ") ", 311 | "delete_cmd_prefix": "rm(", 312 | "library": "var_list.r", 313 | "varRefreshCmd": "cat(var_dic_list()) " 314 | } 315 | }, 316 | "types_to_exclude": [ 317 | "module", 318 | "function", 319 | "builtin_function_or_method", 320 | "instance", 321 | "_Feature" 322 | ], 323 | "window_display": false 324 | } 325 | }, 326 | "nbformat": 4, 327 | "nbformat_minor": 4 328 | } 329 | -------------------------------------------------------------------------------- /99-DA_Pandas_Exercises.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 21, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "import numpy as np\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "import seaborn as sns" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "# Exercise Pandas" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | 
"metadata": {}, 25 | "source": [ 26 | "For these exercices we are using a [dataset](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/kernels) provided by Airbnb for a Kaggle competition. It describes its offer for New York City in 2019, including types of appartments, price, location etc." 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "## 1. Create a dataframe \n", 34 | "Create a dataframe of a few lines with objects and their poperties (e.g fruits, their weight and colour).\n", 35 | "Calculate the mean of your Dataframe." 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "## 2. Import\n", 43 | "- Import the table called ```AB_NYC_2019.csv``` as a dataframe. It is located in the Datasets folder. Have a look at the beginning of the table (head).\n", 44 | "\n", 45 | "- Create a histogram of prices" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## 3. Operations" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "Create a new column in the dataframe by multiplying the \"price\" and \"availability_365\" columns to get an estimate of the maximum yearly income." 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "## 3b. Subselection and plotting\n", 67 | "Create a new Dataframe by first subselecting yearly incomes between 1 and 100'000. Then make a scatter plot of yearly income versus number of reviews " 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "## 4. Combine" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "We provide below and additional table that contains the number of inhabitants of each of New York's boroughs (\"neighbourhood_group\" in the table). Use ```merge``` to add this population information to each element in the original dataframe." 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "## 5. Groups" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "- Using ```groupby``` calculate the average price for each type of room (room_type) in each neighbourhood_group. What is the average price for an entire home in Brooklyn ?\n", 96 | "- Unstack the multi-level Dataframe into a regular Dataframe with ```unstack()``` and create a bar plot with the resulting table\n" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "## 6. Advanced plotting" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "Using Seaborn, create a scatter plot where x and y positions are longitude and lattitude, the color reflects price and the shape of the marker the borough (neighbourhood_group). Can you recognize parts of new york ? Does the map make sense ?" 
111 | ] 112 | } 113 | ], 114 | "metadata": { 115 | "kernelspec": { 116 | "display_name": "Python 3", 117 | "language": "python", 118 | "name": "python3" 119 | }, 120 | "language_info": { 121 | "codemirror_mode": { 122 | "name": "ipython", 123 | "version": 3 124 | }, 125 | "file_extension": ".py", 126 | "mimetype": "text/x-python", 127 | "name": "python", 128 | "nbconvert_exporter": "python", 129 | "pygments_lexer": "ipython3", 130 | "version": "3.8.2" 131 | } 132 | }, 133 | "nbformat": 4, 134 | "nbformat_minor": 4 135 | } 136 | -------------------------------------------------------------------------------- /Data/composers.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/guiwitz/NumpyPandas_course/63506c8e1229483512786323539fbcf853ae8495/Data/composers.xlsx -------------------------------------------------------------------------------- /Data/ny_boroughs.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/guiwitz/NumpyPandas_course/63506c8e1229483512786323539fbcf853ae8495/Data/ny_boroughs.xlsx -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2020, Guillaume Witz 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
30 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/guiwitz/NumpyPandas_course/54488164b462644baf601875be69cc911eda9615?urlpath=lab) 2 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/guiwitz/NumpyPandas_course/blob/colab) 3 | 4 | 5 | # Introduction to Numpy and Pandas 6 | 7 | This repository contains Jupyter notebooks introducing beginners to the Python packages Numpy and Pandas. The material has been designed for people already familiar with Python but not with its "scientific stack". 8 | 9 | This material has been created by Guillaume Witz (Science IT Support, Microscopy Imaging Center, Bern University) as part of the [courses offered by ScITS](https://www.scits.unibe.ch/). 10 | 11 | ## Content 12 | The course has the following content: 13 | 14 | ### Numpy 15 | - [Numpy arrays](01-DA_Numpy_arrays_creation.ipynb): what they are and how to create, import and save them 16 | - [Maths with Numpy arrays](02-DA_Numpy_array_maths.ipynb): applying functions to arrays, doing basic statistics with arrays 17 | - [Numpy and Matplotlib](03-DA_Numpy_matplotlib.ipynb): Basics of plotting Numpy arrays with Matplotlib 18 | - [Recovering parts of arrays](04-DA_Numpy_indexing.ipynb): Using array coordinates to extract information (indexing, slicing) 19 | - [Combining arrays](05-DA_Numpy_combining_arrays.ipynb): Assembling arrays by concatenation, stacking etc. Combining arrays of different sizes (broadcasting) 20 | 21 | ### Pandas 22 | - [Introduction to Pandas](06-DA_Pandas_introduction.ipynb): What does Pandas offer? 23 | - [Pandas data structures](07-DA_Pandas_structures.ipynb): Series and dataframes 24 | - [Importing data to Pandas](08-DA_Pandas_import_plotting.ipynb): Importing data tables into Pandas (from Excel, CSV) and plotting them 25 | - [Pandas operations](09-DA_Pandas_operations.ipynb): Applying functions to the contents of Pandas dataframes (classical statistics, ```apply``` function etc.) 26 | - [Combining Pandas dataframes](10-DA_Pandas_combine.ipynb): Using concatenation or join operations to combine dataframes 27 | - [Analyzing Pandas dataframes](11-DA_Pandas_splitting.ipynb): Split dataframes into groups (```groupby```) for category-based analysis 28 | - [A real-world example](12-DA_Pandas_realworld.ipynb): Complete pipeline including data import, cleaning, analysis and plotting, showing the nitty-gritty issues one often faces with real data 29 | 30 | ## Running the course 31 | 32 | ### Live sessions 33 | 34 | During live sessions of the course, you are given access to a private Jupyter session and don't need to install anything on your computer. 35 | 36 | ### Without installation 37 | Outside live sessions, this entire course can still be run interactively without any local installation thanks to the [mybinder](https://mybinder.org) service. For that just click on the mybinder badge at the top of this Readme. This will open a Jupyter session for you with all packages, notebooks and data available to run. 38 | 39 | Alternatively you can also run the course on Google Colab. For that just click on the Colab badge at the top of this file. 40 | 41 | ### Local installation 42 | For a local installation, we recommend using conda to create a specific environment to run the code. If you don't yet have conda, you can e.g. 
install miniconda, see [here](https://docs.conda.io/en/latest/miniconda.html) for instructions. Then: 43 | 44 | 1. Clone the repository to your computer using [this link](https://github.com/guiwitz/NumpyPandas_course/archive/master.zip) and unzip it 45 | 2. Open a terminal and move to the ```NumpyPandas_course-master/binder``` folder 46 | 3. Here you find an ```environment.yml``` file that you can use to create a conda environment. Choose an environment name e.g. ```numpypandas``` and type: 47 | ``` 48 | conda env create -n numpypandas -f environment.yml 49 | ``` 50 | 4. When you want to run the material, activate the environment and start jupyter: 51 | ``` 52 | conda activate numpypandas 53 | jupyter lab 54 | ``` 55 | Note that the top folder of your directory in Jupyter is the folder from where you started Jupyter. So if you are e.g. in the ```binder``` folder, move one level up to have access to the notebooks 56 | 57 | ## Note on the data used 58 | 59 | In the Pandas part, we use some data provided publicly by the Swiss National Science foundation at this link: http://p3.snf.ch/Pages/DataAndDocumentation.aspx#DataDownload. The examples of analysis on these data **are in no way confirmed or validated by the SNSF and are entirely the work of Guillaume Witz, Science IT Support, Bern University**. 60 | 61 | -------------------------------------------------------------------------------- /binder/environment.yml: -------------------------------------------------------------------------------- 1 | channels: 2 | - conda-forge 3 | dependencies: 4 | - numpy 5 | - matplotlib 6 | - scikit-learn 7 | - scikit-image 8 | - pandas 9 | - jupyter 10 | - jupyterlab=1.2.* 11 | - jupyter_contrib_nbextensions 12 | - tqdm 13 | - seaborn 14 | - pip 15 | - nodejs 16 | - ipywidgets 17 | - pip: 18 | - plotnine 19 | - xlrd -------------------------------------------------------------------------------- /binder/postBuild: -------------------------------------------------------------------------------- 1 | jupyter labextension install @jupyterlab/toc --no-build 2 | jupyter labextension install @jupyter-widgets/jupyterlab-manager --no-build 3 | jupyter labextension install @lckr/jupyterlab_variableinspector --no-build 4 | 5 | jupyter lab build -------------------------------------------------------------------------------- /colab/automate_colab_editing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import os, re, glob" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "## Collect notebooks from regular branch" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 4, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "notebooks_or = glob.glob('/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/*.ipynb')\n" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "## Find which packages to add in each notebook by looking for \"special\" packages" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 6, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "external_packages = ['aicsimageio','ipyvolume','mrc','trackpy','stardist','cellpose']\n", 42 | "new_packages = []\n", 43 | "for noteb in notebooks_or:\n", 44 | " with open(noteb) as n:\n", 45 | " all_lines = 
n.readlines()\n", 46 | " to_add = []\n", 47 | " for a in all_lines:\n", 48 | " if len(a) < 1000:\n", 49 | " for e in external_packages:\n", 50 | " if a.find(e) > 0:\n", 51 | " if e not in to_add:\n", 52 | " to_add.append(e)\n", 53 | " new_packages.append(to_add)" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 7, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "data": { 63 | "text/plain": [ 64 | "[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]" 65 | ] 66 | }, 67 | "execution_count": 7, 68 | "metadata": {}, 69 | "output_type": "execute_result" 70 | } 71 | ], 72 | "source": [ 73 | "new_packages" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "## Define basic cells to add to notebook" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 18, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "data_import = \"\"\" {\n", 90 | " \"cell_type\": \"code\",\n", 91 | " \"execution_count\": null,\n", 92 | " \"metadata\": {},\n", 93 | " \"outputs\": [],\n", 94 | " \"source\": [\n", 95 | " \"import sys, os\\\\n\",\n", 96 | " \"if 'google.colab' in sys.modules:\\\\n\",\n", 97 | " \" if not os.path.isdir('Data'):\\\\n\",\n", 98 | " \" !curl https://raw.githubusercontent.com/guiwitz/NumpyPandas_course/master/colab/colab_data.sh -o colab_data.sh\\\\n\",\n", 99 | " \" !curl https://raw.githubusercontent.com/guiwitz/NumpyPandas_course/master/svg.py -o svg.py\\\\n\",\n", 100 | " \" !sh colab_data.sh\"\n", 101 | " ]\n", 102 | " },\\n\"\"\"" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "## Define where to save new notebooks (colab branch)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 19, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "newpath = '/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course_colab\\\n", 119 | "/NumpyPandas_course/'\n" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "## Add Google drive import and package installation to each notebook" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 23, 132 | "metadata": {}, 133 | "outputs": [ 134 | { 135 | "name": "stdout", 136 | "output_type": "stream", 137 | "text": [ 138 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/05-DA_Numpy_combining_arrays.ipynb\n", 139 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/99-DA_Pandas_Exercises.ipynb\n", 140 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/98-DA_Numpy_Exercises.ipynb\n", 141 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/01-DA_Numpy_arrays_creation.ipynb\n", 142 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/09-DA_Pandas_operations.ipynb\n", 143 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/11-DA_Pandas_splitting.ipynb\n", 144 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/08-DA_Pandas_import_plotting.ipynb\n", 145 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/06-DA_Pandas_introduction.ipynb\n", 146 | "/Users/gw18g940/OneDrive - Universitaet 
Bern/Courses/DataAnalytics_course/DataAnalytics_course/02-DA_Numpy_array_maths.ipynb\n", 147 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/03-DA_Numpy_matplotlib.ipynb\n", 148 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/10-DA_Pandas_combine.ipynb\n", 149 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/12-DA_Pandas_realworld.ipynb\n", 150 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/99-DA_Pandas_Solutions.ipynb\n", 151 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/04-DA_Numpy_indexing.ipynb\n", 152 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/07-DA_Pandas_structures.ipynb\n", 153 | "/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DataAnalytics_course/DataAnalytics_course/98-DA_Numpy_Solutions.ipynb\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "for ind, n in enumerate(notebooks_or):\n", 159 | " print(n)\n", 160 | " fh = newpath + n.split('/')[-1]\n", 161 | " counter = 0\n", 162 | "\n", 163 | "\n", 164 | " with open(fh,'w') as new_file:\n", 165 | " with open(n) as old_file:\n", 166 | " for line in old_file:\n", 167 | " if counter == 2:\n", 168 | " new_file.write(data_import)\n", 169 | " new_file.write(line)\n", 170 | " else:\n", 171 | " new_file.write(line)\n", 172 | " counter +=1\n" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [] 181 | } 182 | ], 183 | "metadata": { 184 | "kernelspec": { 185 | "display_name": "Python 3", 186 | "language": "python", 187 | "name": "python3" 188 | }, 189 | "language_info": { 190 | "codemirror_mode": { 191 | "name": "ipython", 192 | "version": 3 193 | }, 194 | "file_extension": ".py", 195 | "mimetype": "text/x-python", 196 | "name": "python", 197 | "nbconvert_exporter": "python", 198 | "pygments_lexer": "ipython3", 199 | "version": "3.8.2" 200 | }, 201 | "toc": { 202 | "base_numbering": 1, 203 | "nav_menu": {}, 204 | "number_sections": false, 205 | "sideBar": true, 206 | "skip_h1_title": false, 207 | "title_cell": "Table of Contents", 208 | "title_sidebar": "Contents", 209 | "toc_cell": false, 210 | "toc_position": {}, 211 | "toc_section_display": true, 212 | "toc_window_display": true 213 | }, 214 | "varInspector": { 215 | "cols": { 216 | "lenName": 16, 217 | "lenType": 16, 218 | "lenVar": 40 219 | }, 220 | "kernels_config": { 221 | "python": { 222 | "delete_cmd_postfix": "", 223 | "delete_cmd_prefix": "del ", 224 | "library": "var_list.py", 225 | "varRefreshCmd": "print(var_dic_list())" 226 | }, 227 | "r": { 228 | "delete_cmd_postfix": ") ", 229 | "delete_cmd_prefix": "rm(", 230 | "library": "var_list.r", 231 | "varRefreshCmd": "cat(var_dic_list()) " 232 | } 233 | }, 234 | "types_to_exclude": [ 235 | "module", 236 | "function", 237 | "builtin_function_or_method", 238 | "instance", 239 | "_Feature" 240 | ], 241 | "window_display": false 242 | } 243 | }, 244 | "nbformat": 4, 245 | "nbformat_minor": 4 246 | } 247 | -------------------------------------------------------------------------------- /colab/colab_data.sh: -------------------------------------------------------------------------------- 1 | git clone https://github.com/guiwitz/NumpyPandas_course.git 2 | cp -r NumpyPandas_course/Data /content 3 | rm -r NumpyPandas_course/ 
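The ```automate_colab_editing.ipynb``` notebook above injects the Colab setup cell by treating each notebook as raw text and writing the new cell after a fixed line position (the ```counter == 2``` logic). The same idea can be expressed with the ```nbformat``` package, which manipulates notebooks as structured objects and guarantees the result stays valid notebook JSON. The sketch below is illustrative only: ```source_nbs``` and ```colab_nbs``` are hypothetical folder names, not the paths used above.

```python
# Sketch (assumes the nbformat package is installed): insert the Colab
# setup cell as the first cell of each notebook, mirroring the raw-text
# "counter == 2" insertion done in automate_colab_editing.ipynb.
import glob
import os
import nbformat

setup_source = (
    "import sys, os\n"
    "if 'google.colab' in sys.modules:\n"
    "    if not os.path.isdir('Data'):\n"
    "        !curl https://raw.githubusercontent.com/guiwitz/NumpyPandas_course/master/colab/colab_data.sh -o colab_data.sh\n"
    "        !curl https://raw.githubusercontent.com/guiwitz/NumpyPandas_course/master/svg.py -o svg.py\n"
    "        !sh colab_data.sh"
)

for path in glob.glob('source_nbs/*.ipynb'):  # hypothetical input folder
    nb = nbformat.read(path, as_version=4)
    # make the setup cell the first cell of the notebook
    nb.cells.insert(0, nbformat.v4.new_code_cell(setup_source))
    # hypothetical output folder for the colab branch
    nbformat.write(nb, os.path.join('colab_nbs', os.path.basename(path)))
```

The trade-off is an extra dependency, but the insertion no longer depends on the exact line layout of the JSON file.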
-------------------------------------------------------------------------------- /svg.py: -------------------------------------------------------------------------------- 1 | #This module is taken from the Dask project and can be found here: 2 | #https://github.com/dask/dask/blob/master/dask/array/svg.py 3 | #It has been slightly modified to allow for the representation of numpy arrays. 4 | #Here is the accompanying license: 5 | 6 | ''' 7 | Copyright (c) 2014-2018, Anaconda, Inc. and contributors 8 | All rights reserved. 9 | 10 | Redistribution and use in source and binary forms, with or without modification, 11 | are permitted provided that the following conditions are met: 12 | 13 | Redistributions of source code must retain the above copyright notice, 14 | this list of conditions and the following disclaimer. 15 | 16 | Redistributions in binary form must reproduce the above copyright notice, 17 | this list of conditions and the following disclaimer in the documentation 18 | and/or other materials provided with the distribution. 19 | 20 | Neither the name of Anaconda nor the names of any contributors may be used to 21 | endorse or promote products derived from this software without specific prior 22 | written permission. 23 | 24 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 25 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 26 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 27 | ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 28 | LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 29 | CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 30 | SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 31 | INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 32 | CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 33 | ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF 34 | THE POSSIBILITY OF SUCH DAMAGE. 
35 | ''' 36 | 37 | import math 38 | import re 39 | 40 | import numpy as np 41 | from IPython.display import HTML 42 | 43 | def svg(chunks, size=200, **kwargs): 44 | """ Convert chunks from Dask Array into an SVG Image 45 | 46 | Parameters 47 | ---------- 48 | chunks: tuple 49 | size: int 50 | Rough size of the image 51 | 52 | Returns 53 | ------- 54 | text: An svg string depicting the array as a grid of chunks 55 | """ 56 | shape = tuple(map(sum, chunks)) 57 | if np.isnan(shape).any(): # don't support unknown sizes 58 | raise NotImplementedError( 59 | "Can't generate SVG with unknown chunk sizes.\n\n" 60 | " A possible solution is with x.compute_chunk_sizes()" 61 | ) 62 | if not all(shape): 63 | raise NotImplementedError("Can't generate SVG with 0-length dimensions") 64 | if len(chunks) == 0: 65 | raise NotImplementedError("Can't generate SVG with 0 dimensions") 66 | if len(chunks) == 1: 67 | return svg_1d(chunks, size=size, **kwargs) 68 | elif len(chunks) == 2: 69 | return svg_2d(chunks, size=size, **kwargs) 70 | elif len(chunks) == 3: 71 | return svg_3d(chunks, size=size, **kwargs) 72 | else: 73 | return svg_nd(chunks, size=size, **kwargs) 74 | 75 | 76 | text_style = 'font-size="1.0rem" font-weight="100" text-anchor="middle"' 77 | 78 | 79 | def svg_2d(chunks, offset=(0, 0), skew=(0, 0), size=200, sizes=None): 80 | shape = tuple(map(sum, chunks)) 81 | sizes = sizes or draw_sizes(shape, size=size) 82 | y, x = grid_points(chunks, sizes) 83 | 84 | lines, (min_x, max_x, min_y, max_y) = svg_grid(x, y, offset=offset, skew=skew) 85 | 86 | header = ( 87 | '\n' 88 | % (max_x + 50, max_y + 50) 89 | ) 90 | footer = "\n" 91 | 92 | if shape[0] >= 100: 93 | rotate = -90 94 | else: 95 | rotate = 0 96 | 97 | text = [ 98 | "", 99 | " ", 100 | ' %d' 101 | % (max_x / 2, max_y + 20, text_style, shape[1]), 102 | ' %d' 103 | % (max_x + 20, max_y / 2, text_style, rotate, max_x + 20, max_y / 2, shape[0]), 104 | ] 105 | 106 | return header + "\n".join(lines + text) + footer 107 | 108 | 109 | def svg_3d(chunks, size=200, sizes=None, offset=(0, 0)): 110 | shape = tuple(map(sum, chunks)) 111 | sizes = sizes or draw_sizes(shape, size=size) 112 | x, y, z = grid_points(chunks, sizes) 113 | ox, oy = offset 114 | 115 | xy, (mnx, mxx, mny, mxy) = svg_grid( 116 | x / 1.7, y, offset=(ox + 10, oy + 0), skew=(1, 0) 117 | ) 118 | 119 | zx, (_, _, _, max_x) = svg_grid(z, x / 1.7, offset=(ox + 10, oy + 0), skew=(0, 1)) 120 | zy, (min_z, max_z, min_y, max_y) = svg_grid( 121 | z, y, offset=(ox + max_x + 10, oy + max_x), skew=(0, 0) 122 | ) 123 | 124 | header = ( 125 | '\n' 126 | % (max_z + 50, max_y + 50) 127 | ) 128 | footer = "\n" 129 | 130 | if shape[1] >= 100: 131 | rotate = -90 132 | else: 133 | rotate = 0 134 | 135 | text = [ 136 | "", 137 | " ", 138 | ' %d' 139 | % ((min_z + max_z) / 2, max_y + 20, text_style, shape[2]), 140 | ' %d' 141 | % ( 142 | max_z + 20, 143 | (min_y + max_y) / 2, 144 | text_style, 145 | rotate, 146 | max_z + 20, 147 | (min_y + max_y) / 2, 148 | shape[1], 149 | ), 150 | ' %d' 151 | % ( 152 | (mnx + mxx) / 2 - 10, 153 | mxy - (mxx - mnx) / 2 + 20, 154 | text_style, 155 | (mnx + mxx) / 2 - 10, 156 | mxy - (mxx - mnx) / 2 + 20, 157 | shape[0], 158 | ), 159 | ] 160 | 161 | return header + "\n".join(xy + zx + zy + text) + footer 162 | 163 | 164 | def svg_nd(chunks, size=200): 165 | if len(chunks) % 3 == 1: 166 | chunks = ((1,),) + chunks 167 | shape = tuple(map(sum, chunks)) 168 | sizes = draw_sizes(shape, size=size) 169 | 170 | chunks2 = chunks 171 | sizes2 = sizes 172 | out = [] 173 | left = 0 174 | 
total_height = 0 175 | while chunks2: 176 | n = len(chunks2) % 3 or 3 177 | o = svg(chunks2[:n], sizes=sizes2[:n], offset=(left, 0)) 178 | chunks2 = chunks2[n:] 179 | sizes2 = sizes2[n:] 180 | 181 | lines = o.split("\n") 182 | header = lines[0] 183 | height = float(re.search(r'height="(\d*\.?\d*)"', header).groups()[0]) 184 | total_height = max(total_height, height) 185 | width = float(re.search(r'width="(\d*\.?\d*)"', header).groups()[0]) 186 | left += width + 10 187 | o = "\n".join(lines[1:-1]) # remove header and footer 188 | 189 | out.append(o) 190 | 191 | header = ( 192 | '\n' 193 | % (left, total_height) 194 | ) 195 | footer = "\n" 196 | return header + "\n\n".join(out) + footer 197 | 198 | 199 | def svg_lines(x1, y1, x2, y2): 200 | """ Convert points into lines of text for an SVG plot 201 | 202 | Examples 203 | -------- 204 | >>> svg_lines([0, 1], [0, 0], [10, 11], [1, 1]) # doctest: +NORMALIZE_WHITESPACE 205 | [' ', 206 | ' '] 207 | """ 208 | n = len(x1) 209 | lines = [ 210 | ' ' % (x1[i], y1[i], x2[i], y2[i]) 211 | for i in range(n) 212 | ] 213 | 214 | lines[0] = lines[0].replace(" /", ' style="stroke-width:2" /') 215 | lines[-1] = lines[-1].replace(" /", ' style="stroke-width:2" /') 216 | return lines 217 | 218 | 219 | def svg_grid(x, y, offset=(0, 0), skew=(0, 0)): 220 | """ Create lines of SVG text that show a grid 221 | 222 | Parameters 223 | ---------- 224 | x: numpy.ndarray 225 | y: numpy.ndarray 226 | offset: tuple 227 | translational displacement of the grid in SVG coordinates 228 | skew: tuple 229 | """ 230 | # Horizontal lines 231 | x1 = np.zeros_like(y) + offset[0] 232 | y1 = y + offset[1] 233 | x2 = np.full_like(y, x[-1]) + offset[0] 234 | y2 = y + offset[1] 235 | 236 | if skew[0]: 237 | y2 += x.max() * skew[0] 238 | if skew[1]: 239 | x1 += skew[1] * y 240 | x2 += skew[1] * y 241 | 242 | min_x = min(x1.min(), x2.min()) 243 | min_y = min(y1.min(), y2.min()) 244 | max_x = max(x1.max(), x2.max()) 245 | max_y = max(y1.max(), y2.max()) 246 | 247 | h_lines = ["", " "] + svg_lines(x1, y1, x2, y2) 248 | 249 | # Vertical lines 250 | x1 = x + offset[0] 251 | y1 = np.zeros_like(x) + offset[1] 252 | x2 = x + offset[0] 253 | y2 = np.full_like(x, y[-1]) + offset[1] 254 | 255 | if skew[0]: 256 | y1 += skew[0] * x 257 | y2 += skew[0] * x 258 | if skew[1]: 259 | x2 += skew[1] * y.max() 260 | 261 | v_lines = ["", " "] + svg_lines(x1, y1, x2, y2) 262 | 263 | rect = [ 264 | "", 265 | " ", 266 | ' ' 267 | % (x1[0], y1[0], x1[-1], y1[-1], x2[-1], y2[-1], x2[0], y2[0]), 268 | ] 269 | 270 | return h_lines + v_lines + rect, (min_x, max_x, min_y, max_y) 271 | 272 | 273 | def svg_1d(chunks, sizes=None, **kwargs): 274 | return svg_2d(((1,),) + chunks, **kwargs) 275 | 276 | 277 | def grid_points(chunks, sizes): 278 | cumchunks = [np.cumsum((0,) + c) for c in chunks] 279 | points = [x * size / x[-1] for x, size in zip(cumchunks, sizes)] 280 | return points 281 | 282 | 283 | def draw_sizes(shape, size=200): 284 | """ Get size in pixels for all dimensions """ 285 | mx = max(shape) 286 | ratios = [mx / max(0.1, d) for d in shape] 287 | ratios = [ratio_response(r) for r in ratios] 288 | return tuple(size / r for r in ratios) 289 | 290 | 291 | def ratio_response(x): 292 | """ How we display actual size ratios 293 | 294 | Common ratios in sizes span several orders of magnitude, 295 | which is hard for us to perceive. 296 | 297 | We keep ratios in the 1-3 range accurate, and then apply a logarithm to 298 | values up until about 100 or so, at which point we stop scaling. 
299 | """ 300 | if x < math.e: 301 | return x 302 | elif x <= 100: 303 | return math.log(x + 12.4) # f(e) == e 304 | else: 305 | return math.log(100 + 12.4) 306 | 307 | def numpy_to_svg(array): 308 | 309 | return HTML(svg(tuple((tuple(np.ones(x)) for x in array.shape)))) 310 | 311 | 312 | 313 | --------------------------------------------------------------------------------
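For reference, a minimal usage sketch of ```svg.py``` (assuming a Jupyter environment, since ```numpy_to_svg``` returns an IPython ```HTML``` object that only renders in a notebook):

```python
# Minimal sketch: display the shape of a Numpy array as an SVG grid.
# Only the shape of the array is used, so its values do not matter.
import numpy as np
from svg import numpy_to_svg

a = np.zeros((20, 100))
numpy_to_svg(a)  # as the last line of a Jupyter cell, renders a 20 x 100 grid
```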