├── LICENSE ├── README.md ├── git_etc_notes.md ├── notebooks ├── 1-intro-to-python.ipynb ├── 2-numpy_and_pandas.ipynb ├── 3-visualization.ipynb ├── data_ny_temperatures.csv ├── haarcascade_frontalface_default.xml ├── image.jpg ├── movies_stats.csv ├── songs_metadata.json └── text_file.txt └── scripts ├── TODOs ├── cl_example.py ├── cl_example_simple.py ├── dash_demo.py ├── listings.csv ├── matplotlib_simple_script.py └── requirements.txt /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Theodoros Giannakopoulos 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # python-data-science 2 | (updated 2023) 3 | This repo contains Python examples for the "Data Programming" introductory 4 | course of the [MSc Program of Data Science](http://msc-data-science.iit.demokritos.gr), organized by NCSR Demokritos and the University of the Peloponnese. 5 | 6 | ## Notebooks 7 | The following notebooks can be used as short tutorials on the respective subjects: 8 | 9 | 1. Python basics: [1-intro-to-python.ipynb](https://nbviewer.jupyter.org/github/tyiannak/python-data-science/blob/master/notebooks/1-intro-to-python.ipynb) 10 | 2. Intro to numpy and pandas for basic data handling: [2-numpy_and_pandas.ipynb](https://nbviewer.jupyter.org/github/tyiannak/python-data-science/blob/master/notebooks/2-numpy_and_pandas.ipynb) 11 | 3. Visualization using matplotlib and plotly: [3-visualization.ipynb](https://nbviewer.jupyter.org/github/tyiannak/python-data-science/blob/master/notebooks/3-visualization.ipynb) 12 | 13 | ## Scripts 14 | The following scripts are included: 15 | 1. `scripts/cl_example.py` demonstrates how to use `argparse` to handle command-line arguments in a Python script. 16 | The particular example generates a visualization of a list of Gaussian distributions. Usage example: 17 | ```bash 18 | python3 cl_example.py -m 0 5 10 15 20 -s 1 1 1 7 2 -n 5000 5000 5000 10000 5000 --names "x1" "x2" "x3" "x4" "x5" -b 100 --normalize 19 | ``` 20 | 21 | ## TODOs 22 | 1. NLP example (no ML) 23 | 2. docker 24 | 3. This is a git demo! -------------------------------------------------------------------------------- /git_etc_notes.md: -------------------------------------------------------------------------------- 1 | 2 | # 1. 
Virtual environments 3 | ## 1.1 using the existing requirements: 4 | ``` 5 | virtualenv -p python3.9 venv 6 | source venv/bin/activate 7 | pip3 install numpy==1.20.3 8 | python3 deleteme.py 9 | ``` 10 | 11 | ## 1.2 Save your own requirements 12 | To save our own requirements, we either edit the requirements file manually or we run: 13 | ``` 14 | pipreqs . 15 | ``` 16 | 17 | Notes: 18 | * this assumes we have first created the virtual environment and activated it as above, and then manually pip-installed the requirements one by one 19 | * the result is saved in `requirements.txt` by default 20 | * we share the `requirements.txt` in the repo along with our code 21 | * to exit the virtual environment, simply run `deactivate` 22 | 23 | # 2 git / github 24 | 25 | ## 2.1 download the repo code (clone repo): 26 | ``` 27 | git clone https://github.com/tyiannak/python-data-science.git 28 | ``` 29 | 30 | ## 2.2 then we do changes in the code locally (using our IDE and testing) 31 | 32 | ## 2.3 see the changes we have done locally: 33 | ``` 34 | git status 35 | ``` 36 | or 37 | ``` 38 | git status . 39 | ``` 40 | 41 | or see changes in particular files: 42 | ``` 43 | git diff 44 | ``` 45 | ## 2.4 steps to add a new file in the repo: 46 | * first we create and/or edit a file (say, `TODOs`) 47 | * `git add TODOs` (can add more than one file) 48 | * `git commit -m "added a TODOs list and initiated it with a first TODO message"` 49 | * `git push` (this sends the changes to the repo) 50 | 51 | ## 2.5 to get the changes locally: 52 | * `git pull`: get latest version of current branch (see below for branches!) 53 | * `git checkout <branch_name>` locally to "enable" the selected branch 63 | 7. work locally to do the required changes on the code of the selected branch and repeat the aforementioned procedure (git add, git commit, git push ON THE BRANCH). 64 | 8. as soon as I am sure the work is done --> pull request on github page (also mention coworkers to do the QA. 
Also drag the issue from Ongoing to QA in the project view) 65 | 9. The person(s) that have been tagged for QA check out (git checkout) the branch locally, test it, and as soon as they are sure the code works without any bugs or new issues, they merge it to master (through the github page). After this, automatically: the issue is closed and the task is moved to "Done" in the project view 66 | 67 | Summary of steps for a single task/issue: 68 | task / issue --> branch --> work on branch locally --> push changes to branch and P(ull) R(equest) when ready --> the person(s) that do the QA do the final merging of the branch to master 69 | 70 | ## 2.7 Example of a small exercise for two-person teams: 71 | * person A creates a new github repo. Adds person B as "collaborator". 72 | * persons A and B create a "project", add a QA column before DONE 73 | * persons A and B create two dummy tasks and convert them to "issues" in the project view 74 | * person A works on the first dummy issue: (drag to ongoing, create branch, checkout branch locally, add a new dummy file, git add file, git commit, git push). And finally creates a PR (pull request) in github and tags person B 75 | * person B works on the second dummy issue: (drag to ongoing, create branch, checkout branch locally, add a new dummy file, git add file, git commit, git push). 
And finally creates a PR (pull request) in github and tags person A 76 | * persons A and B check the second and first issues (PRs) respectively, and if they agree merge the changes to master 77 | 78 | 79 | 80 | -------------------------------------------------------------------------------- /notebooks/1-intro-to-python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Python basics" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Printing, variables, strings " 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 4, 20 | "metadata": {}, 21 | "outputs": [ 22 | { 23 | "name": "stdout", 24 | "output_type": "stream", 25 | "text": [ 26 | "hello data world!\n", 27 | "5.5\n" 28 | ] 29 | } 30 | ], 31 | "source": [ 32 | "# let's start\n", 33 | "print(\"hello data world!\")\n", 34 | "\n", 35 | "# we don't (usually) specify types to variables:\n", 36 | "n = 2\n", 37 | "x = 2.5\n", 38 | "y = 1.0\n", 39 | "print(x + y + n)" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 6, 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "name": "stdout", 49 | "output_type": "stream", 50 | "text": [ 51 | "this is ' a name\n" 52 | ] 53 | } 54 | ], 55 | "source": [ 56 | "# use double or single quotes for strings\n", 57 | "name1 = \"this is '\"\n", 58 | "name2 = \" a name\"\n", 59 | "# string concatenation is simple\n", 60 | "print(name1 + name2)" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 10, 66 | "metadata": {}, 67 | "outputs": [ 68 | { 69 | "name": "stdout", 70 | "output_type": "stream", 71 | "text": [ 72 | "hi there this is a test\n", 73 | "2.5\n", 74 | "and this is how you can print numbers with strings 2.5\n", 75 | "and this test is how to use f-strings 2.5\n" 76 | ] 77 | } 78 | ], 79 | "source": [ 80 | "# printing\n", 81 | "x = 2.5\n", 82 | "name 
= \"test\"\n", 83 | "print(\"hi there this is a \" + name)\n", 84 | "print(x)\n", 85 | "print(\"and this is how you can print numbers with strings \" + str(x))\n", 86 | "print(f\"and this {name} is how to use f-strings {x}\")" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "## Tuples\n", 94 | "A tuple is a collection of elements that is immutable (you cannot add and remove elements from it). " 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 11, 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/plain": [ 105 | "True" 106 | ] 107 | }, 108 | "execution_count": 11, 109 | "metadata": {}, 110 | "output_type": "execute_result" 111 | } 112 | ], 113 | "source": [ 114 | "# tuple example:\n", 115 | "t1 = (1, 2, 3)\n", 116 | "t2 = 1, 2, 3 # parentheses are not required\n", 117 | "t1 == t2" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 20, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "data": { 127 | "text/plain": [ 128 | "3" 129 | ] 130 | }, 131 | "execution_count": 20, 132 | "metadata": {}, 133 | "output_type": "execute_result" 134 | } 135 | ], 136 | "source": [ 137 | "len(t1) # same as list" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 21, 143 | "metadata": {}, 144 | "outputs": [ 145 | { 146 | "data": { 147 | "text/plain": [ 148 | "2" 149 | ] 150 | }, 151 | "execution_count": 21, 152 | "metadata": {}, 153 | "output_type": "execute_result" 154 | } 155 | ], 156 | "source": [ 157 | "t1[1] # same as list" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 22, 163 | "metadata": {}, 164 | "outputs": [ 165 | { 166 | "data": { 167 | "text/plain": [ 168 | "(1, 2)" 169 | ] 170 | }, 171 | "execution_count": 22, 172 | "metadata": {}, 173 | "output_type": "execute_result" 174 | } 175 | ], 176 | "source": [ 177 | "t1[:2] # same as list" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 
| "metadata": {}, 183 | "source": [ 184 | "## Lists" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 19, 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | "[1, -1, 10]\n", 197 | "list has 3 elements\n", 198 | "[1, ['name', 2], -0.51]\n" 199 | ] 200 | }, 201 | { 202 | "data": { 203 | "text/plain": [ 204 | "'name'" 205 | ] 206 | }, 207 | "execution_count": 19, 208 | "metadata": {}, 209 | "output_type": "execute_result" 210 | } 211 | ], 212 | "source": [ 213 | "# lists: structures that form collection of variables\n", 214 | "numbers = [1, -1, 10]\n", 215 | "print(numbers)\n", 216 | "print(f'list has {len(numbers)} elements') # len() is a function that takes list objects as argument and returns number of elements\n", 217 | "\n", 218 | "# can contain different types (or even lists!)\n", 219 | "complex_list = [1, [\"name\", 2], -0.51]\n", 220 | "print(complex_list)\n", 221 | "complex_list[1][0]" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 26, 227 | "metadata": {}, 228 | "outputs": [ 229 | { 230 | "name": "stdout", 231 | "output_type": "stream", 232 | "text": [ 233 | "this is the 2nd element of the list: 4\n", 234 | "t3st\n" 235 | ] 236 | } 237 | ], 238 | "source": [ 239 | "# simple list indexing\n", 240 | "numbers = [1, 4, 10]\n", 241 | "print(f\"this is the 2nd element of the list: {numbers[1]}\")\n", 242 | "\n", 243 | "s = \"test\"\n", 244 | "# strings are also lists but they cannot be modified (immutable)\n", 245 | "# s[1] = \"3\" # this returns an error\n", 246 | "# instead you can do this: \n", 247 | "print(s.replace(\"e\", \"3\"))\n", 248 | "# replace is a METHOD of string OBJECTS" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 29, 254 | "metadata": {}, 255 | "outputs": [ 256 | { 257 | "name": "stdout", 258 | "output_type": "stream", 259 | "text": [ 260 | "-10\n", 261 | "[2, 3]\n", 262 | "[1, 2, 3, 
10]\n", 263 | "[1, 2, 3, 10, 15]\n", 264 | "[1, 3, 15, -10]\n", 265 | "[1, 3, 15]\n", 266 | "[-10, 2, 15, 10, 3, 2]\n" 267 | ] 268 | } 269 | ], 270 | "source": [ 271 | "# negative list indices\n", 272 | "numbers = [1, 2, 3, 10, 15, 2, -10]\n", 273 | "print(numbers[-1])\n", 274 | "\n", 275 | "# ... and list slicing\n", 276 | "print(numbers[1:3])\n", 277 | "print(numbers[:4])\n", 278 | "print(numbers[:-2])\n", 279 | "print(numbers[::2])\n", 280 | "print(numbers[0:5:2])\n", 281 | "print(numbers[-1:0:-1])" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 34, 287 | "metadata": {}, 288 | "outputs": [ 289 | { 290 | "name": "stdout", 291 | "output_type": "stream", 292 | "text": [ 293 | "[1, 2, -10, -10, 15, 2, -10]\n", 294 | "[1, 2, -10, -10, 15, 2, -10, -20, 2]\n", 295 | "[1, 2, -10, -10, 15, 2, -10, -20, 2, -100]\n" 296 | ] 297 | } 298 | ], 299 | "source": [ 300 | "# list manipulation\n", 301 | "# element replacement\n", 302 | "numbers = [1, 2, 3, 10, 15, 2, -10]\n", 303 | "numbers[2:4] = [-10, -10]\n", 304 | "print(numbers)\n", 305 | "# list concatenation\n", 306 | "numbers += [-20, 2]\n", 307 | "print(numbers)\n", 308 | "# can also use the append method for appending a single element\n", 309 | "# (DOES NOT RETURN list but appends on existing object):\n", 310 | "numbers.append(-100)\n", 311 | "print(numbers)" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 37, 317 | "metadata": {}, 318 | "outputs": [ 319 | { 320 | "name": "stdout", 321 | "output_type": "stream", 322 | "text": [ 323 | "[1, 0, 3]\n", 324 | "[1, 0, 3]\n", 325 | "Copy:\n", 326 | "[1, 0, 3]\n", 327 | "[1, 2, 3]\n" 328 | ] 329 | } 330 | ], 331 | "source": [ 332 | "# copy lists vs by reference\n", 333 | "x = [1, 2, 3]\n", 334 | "y = x\n", 335 | "x[1] = 0\n", 336 | "print(x)\n", 337 | "print(y)\n", 338 | "# y has also changed\n", 339 | "\n", 340 | "print(\"Copy:\")\n", 341 | "x = [1, 2, 3]\n", 342 | "y = x[:]\n", 343 | "x[1] = 0\n", 344 | "print(x)\n", 
345 | "print(y) \n", 346 | "# (y has not changed)" 347 | ] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "metadata": {}, 352 | "source": [ 353 | "## If statements" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": 42, 359 | "metadata": {}, 360 | "outputs": [ 361 | { 362 | "name": "stdout", 363 | "output_type": "stream", 364 | "text": [ 365 | "10 is the element no 3\n", 366 | "-2 is not found in list\n" 367 | ] 368 | } 369 | ], 370 | "source": [ 371 | "# if statements\n", 372 | "my_list = [1, 2, -4, 10]\n", 373 | "number_to_find_1 =10\n", 374 | "\n", 375 | "if number_to_find_1 in my_list:\n", 376 | " # call METHOD \"index\" of the list OBJECT:\n", 377 | " print(f'{number_to_find_1} is the element no {my_list.index(number_to_find_1)}') \n", 378 | "else:\n", 379 | " print(f'{number_to_find_1} is not found in list')\n", 380 | " \n", 381 | "number_to_find_2 = -2\n", 382 | "if number_to_find_2 in my_list:\n", 383 | " # call METHOD \"index\" of the list OBJECT:\n", 384 | " print(f'{number_to_find_2} is the element no {my_list.index(number_to_find_2)}') \n", 385 | "else:\n", 386 | " print(f'{number_to_find_2} is not found in list')\n", 387 | " " 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 44, 393 | "metadata": {}, 394 | "outputs": [ 395 | { 396 | "name": "stdout", 397 | "output_type": "stream", 398 | "text": [ 399 | "hot\n" 400 | ] 401 | } 402 | ], 403 | "source": [ 404 | "# more if\n", 405 | "temperature = 29\n", 406 | "if temperature < 10: \n", 407 | " print(\"cold\")\n", 408 | "elif temperature < 25:\n", 409 | " print(\"normal\")\n", 410 | "elif temperature < 35:\n", 411 | " print('hot')\n", 412 | "else:\n", 413 | " print(\"very hot\")" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 1, 419 | "metadata": {}, 420 | "outputs": [ 421 | { 422 | "name": "stdout", 423 | "output_type": "stream", 424 | "text": [ 425 | "False\n", 426 | "True\n", 427 | "True\n", 428 | "True\n" 429 | ] 
430 | } 431 | ], 432 | "source": [ 433 | "# operators\n", 434 | "a = 10\n", 435 | "print(a == 3)\n", 436 | "print(a == 10)\n", 437 | "print(a < 20)\n", 438 | "\n", 439 | "# not operator:\n", 440 | "print(not False)" 441 | ] 442 | }, 443 | { 444 | "cell_type": "markdown", 445 | "metadata": {}, 446 | "source": [ 447 | "## For loops" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 50, 453 | "metadata": {}, 454 | "outputs": [ 455 | { 456 | "name": "stdout", 457 | "output_type": "stream", 458 | "text": [ 459 | "1\n", 460 | "5\n", 461 | "100\n" 462 | ] 463 | } 464 | ], 465 | "source": [ 466 | "l = [1, 5, 100]\n", 467 | "for i in l:\n", 468 | " print(i)" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": 47, 474 | "metadata": {}, 475 | "outputs": [ 476 | { 477 | "name": "stdout", 478 | "output_type": "stream", 479 | "text": [ 481 | "1\n", 482 | "2\n", 483 | "3\n", 484 | "4\n", 485 | "5\n" 486 | ] 487 | } 488 | ], 489 | "source": [ 490 | "for i in range(5):\n", 491 | " print(i+1)" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": 32, 497 | "metadata": {}, 498 | "outputs": [ 499 | { 500 | "name": "stdout", 501 | "output_type": "stream", 502 | "text": [ 503 | "0\n", 504 | "5\n", 505 | "10\n", 506 | "15\n" 507 | ] 508 | } 509 | ], 510 | "source": [ 511 | "for i in range(0, 20, 5):\n", 512 | " print(i)" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": 2, 518 | "metadata": {}, 519 | "outputs": [ 520 | { 521 | "name": "stdout", 522 | "output_type": "stream", 523 | "text": [ 524 | "George\n", 525 | "Alan\n", 526 | "Mary\n" 527 | ] 528 | } 529 | ], 530 | "source": [ 531 | "names = [\"George\", \"Alan\", \"Mary\"]\n", 532 | "for n in names:\n", 533 | " print(n)" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 3, 539 | "metadata": {}, 540 | "outputs": [ 541 | { 542 | "name": "stdout", 543 | "output_type": "stream", 544 
| "text": [ 545 | "name 0: George\n", 546 | "name 1: Alan\n", 547 | "name 2: Mary\n" 548 | ] 549 | } 550 | ], 551 | "source": [ 552 | "# with indices:\n", 553 | "names = [\"George\", \"Alan\", \"Mary\"]\n", 554 | "for i, n in enumerate(names):\n", 555 | " print(f\"name {i}: {n}\")" 556 | ] 557 | }, 558 | { 559 | "cell_type": "code", 560 | "execution_count": 53, 561 | "metadata": {}, 562 | "outputs": [ 563 | { 564 | "name": "stdout", 565 | "output_type": "stream", 566 | "text": [ 567 | "[0, 0, 2, 6, 12, 20, 30, 42, 56, 72]\n" 568 | ] 569 | } 570 | ], 571 | "source": [ 572 | "# list comprehensions\n", 573 | "my_list = [i**2 - i for i in range(0, 10)]\n", 574 | "# this is equivalent to:\n", 575 | "# my_list = []\n", 576 | "# for i in range(0, 10):\n", 577 | "#     my_list.append(i**2 - i)\n", 578 | "print(my_list)" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": 59, 584 | "metadata": {}, 585 | "outputs": [ 586 | { 587 | "name": "stdout", 588 | "output_type": "stream", 589 | "text": [ 590 | "list comprehensions are 11.2 % faster\n" 591 | ] 592 | } 593 | ], 594 | "source": [ 595 | "# list comprehensions are faster than for loops\n", 596 | "import time\n", 597 | "n = 1000000\n", 598 | "\n", 599 | "# 1st method to create the list using list comprehensions:\n", 600 | "t1 = time.time()\n", 601 | "my_list = [i**2 - i for i in range(0, n)]\n", 602 | "t2 = time.time()\n", 603 | "\n", 604 | "# 2nd method to create the list (the traditional way)\n", 605 | "my_list = []\n", 606 | "for i in range(0, n):\n", 607 | " my_list.append(i**2 - i)\n", 608 | "t3 = time.time()\n", 609 | "t_loop = t3 - t2\n", 610 | "t_com = t2 - t1\n", 611 | "print(f\"list comprehensions are {100*(t_loop-t_com)/t_loop:.1f} % faster\")" 612 | ] 613 | }, 614 | { 615 | "cell_type": "markdown", 616 | "metadata": {}, 617 | "source": [ 618 | "## Dictionaries" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": 61, 624 | "metadata": {}, 625 | "outputs": [ 626 | { 627 
| "name": "stdout", 628 | "output_type": "stream", 629 | "text": [ 630 | "type-r\n" 636 | ] 637 | } 638 | ], 639 | "source": [ 640 | "# dictionaries are mutable key-value mappings (insertion-ordered since Python 3.7)\n", 641 | "car = {\n", 642 | " \"brand\": \"honda\", \n", 643 | " \"model\": \"civic\",\n", 644 | " \"type\": \"type-r\",\n", 645 | " \"engine\": 1997,\n", 646 | " \"power\": 320\n", 647 | "}\n", 648 | "\n", 649 | "print(car['type'])\n", 650 | "\n" 651 | ] 652 | }, 653 | { 654 | "cell_type": "code", 655 | "execution_count": 57, 656 | "metadata": {}, 657 | "outputs": [ 658 | { 659 | "name": "stdout", 660 | "output_type": "stream", 661 | "text": [ 662 | "brand:honda\n", 663 | "model:civic\n", 664 | "type:type-r\n", 665 | "engine:1997\n", 666 | "power:320\n", 667 | "dict_keys(['brand', 'model', 'type', 'engine', 'power'])\n", 668 | "dict_values(['honda', 'civic', 'type-r', 1997, 320])\n", 669 | "dict_items([('brand', 'honda'), ('model', 'civic'), ('type', 'type-r'), ('engine', 1997), ('power', 320)])\n", 670 | "honda\n", 671 | "{'brand': 'honda', 'model': 'civic', 'type': 'type-r', 'engine': 1997, 'power': 320, 'year': 2017}\n", 672 | "engine found!\n", 673 | "{'brand': 'honda', 'model': 'civic', 'type': 'type-r', 'power': 320, 'year': 2017}\n", 674 | "engine not found!\n" 675 | ] 676 | } 677 | ], 678 | "source": [ 679 | "# dictionaries are mutable key-value mappings (insertion-ordered since Python 3.7)\n", 680 | "car = {\n", 681 | " \"brand\": \"honda\", \n", 682 | " \"model\": \"civic\",\n", 683 | " \"type\": \"type-r\",\n", 684 | " \"engine\": 1997,\n", 685 | " \"power\": 320\n", 686 | "}\n", 687 | "\n", 688 | "\n", 689 | "# to iterate the (key,val) pairs of a dict:\n", 690 | "for key in car:\n", 691 | " print(f\"{key}:{car[key]}\")\n", 692 | " \n", 693 | "# ... 
or:\n", 694 | "for key, value in car.items():\n", 695 | " print(key, value)\n", 696 | "\n", 697 | "print(car.keys())\n", 698 | "print(car.values())\n", 699 | "print(car.items())\n", 700 | "\n", 701 | "# print specific value:\n", 702 | "print(car[\"brand\"])\n", 703 | "\n", 704 | "# insert value:\n", 705 | "car[\"year\"] = 2017\n", 706 | "print(car)\n", 707 | "\n", 708 | "# check if key in dict:\n", 709 | "if \"engine\" in car:\n", 710 | " print(\"engine found!\")\n", 711 | "\n", 712 | "\n", 713 | "# delete a key-value\n", 714 | "del car[\"engine\"]\n", 715 | "print(car)\n", 716 | "if \"engine\" not in car:\n", 717 | " print(\"engine not found!\")\n", 718 | "\n" 719 | ] 720 | }, 721 | { 722 | "cell_type": "code", 723 | "execution_count": 63, 724 | "metadata": {}, 725 | "outputs": [ 726 | { 727 | "name": "stdout", 728 | "output_type": "stream", 729 | "text": [ 730 | "without copy:\n", 731 | "{'l1': 20, 'l2': 20}\n", 732 | "{'l1': 20, 'l2': 20}\n", 733 | "simple dict-copy:\n", 734 | "{'l1': 20, 'l2': 20}\n", 735 | "{'l1': 10, 'l2': 20}\n", 736 | "simple dict-copy - 2:\n", 737 | "{'l1': [1000, 20], 'l2': 20}\n", 738 | "{'l1': [1000, 20], 'l2': 20}\n", 739 | "with deep copy - 2:\n", 740 | "{'l1': [1000, 20], 'l2': 20}\n", 741 | "{'l1': [10, 20], 'l2': 20}\n" 742 | ] 743 | } 744 | ], 745 | "source": [ 746 | "# deep and shallow copy:\n", 747 | "# example 1:\n", 748 | "# without copy:\n", 749 | "print(\"without copy:\")\n", 750 | "d1 = {\"l1\": 10, \"l2\": 20}\n", 751 | "d2 = d1\n", 752 | "d1[\"l1\"] = 20\n", 753 | "print(d1)\n", 754 | "print(d2)\n", 755 | "\n", 756 | "# dict contains simple types (not lists, dicts, tuples etc)\n", 757 | "print(\"simple dict-copy:\")\n", 758 | "d1 = {\"l1\": 10, \"l2\": 20}\n", 759 | "d2 = d1.copy()\n", 760 | "d1[\"l1\"] = 20\n", 761 | "print(d1)\n", 762 | "print(d2)\n", 763 | "\n", 764 | "print(\"simple dict-copy - 2:\")\n", 765 | "d1 = {\"l1\": [10, 20], \"l2\": 20}\n", 766 | "d2 = d1.copy()\n", 767 | "d1[\"l1\"][0] = 1000\n", 768 | 
"print(d1)\n", 769 | "print(d2)\n", 770 | "\n", 771 | "print(\"with deep copy - 2:\")\n", 772 | "import copy\n", 773 | "d1 = {\"l1\": [10, 20], \"l2\": 20}\n", 774 | "d2 = copy.deepcopy(d1)\n", 775 | "d1[\"l1\"][0] = 1000\n", 776 | "print(d1)\n", 777 | "print(d2)" 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": 65, 783 | "metadata": {}, 784 | "outputs": [ 785 | { 786 | "name": "stdout", 787 | "output_type": "stream", 788 | "text": [ 789 | "{'greece': 10, 'germany': 85, 'france': 65, 'uk': 68, 'italy': 60, 'spain': 46, 'austria': 9, 'ireland': 5, 'cyprus': 1}\n", 790 | "{'germany': 85, 'france': 65, 'uk': 68, 'italy': 60, 'spain': 46}\n", 791 | "{'germany': 85, 'france': 65, 'uk': 68, 'italy': 60, 'spain': 46}\n" 792 | ] 793 | } 794 | ], 795 | "source": [ 796 | "# dictionaries comprehension\n", 797 | "countries_pop = dict(greece=10, germany=85, france=65, uk=68, italy=60, spain=46, austria=9, ireland=5, cyprus=1)\n", 798 | "print(countries_pop)\n", 799 | "\n", 800 | "large_countries = {}\n", 801 | "for c, p in countries_pop.items():\n", 802 | " if p > 10:\n", 803 | " large_countries[c] = p \n", 804 | "print(large_countries)\n", 805 | "\n", 806 | "large_countries = {c:p for c, p in countries_pop.items() if p > 10}\n", 807 | "print(large_countries)" 808 | ] 809 | }, 810 | { 811 | "cell_type": "markdown", 812 | "metadata": {}, 813 | "source": [ 814 | "## Reading files" 815 | ] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "execution_count": 70, 820 | "metadata": {}, 821 | "outputs": [ 822 | { 823 | "name": "stdout", 824 | "output_type": "stream", 825 | "text": [ 826 | "But I had forgotten Bombadil, if indeed this is still the same that walked the woods and hills long ago, and even then was older than the old. \n", 827 | "That was not then his name. Iarwain Ben-adar we called him, oldest and fatherless. 
\n", 828 | "But many another name he has since been given by other folk: Forn by the Dwarves, Orald by Northern Men, and other names beside. He is a strange creature...\n", 829 | "\n", 830 | "385 characters\n" 831 | ] 832 | } 833 | ], 834 | "source": [ 835 | "# text files reading\n", 836 | "import os.path\n", 837 | "if os.path.exists(\"text_file.txt\"):\n", 838 | " with open(\"text_file.txt\", \"r\") as file_reader:\n", 839 | " data = file_reader.read()\n", 840 | " print(data)\n", 841 | " print(f'{len(data)} characters')" 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "execution_count": 41, 847 | "metadata": {}, 848 | "outputs": [ 849 | { 850 | "name": "stdout", 851 | "output_type": "stream", 852 | "text": [ 853 | "['But I had forgotten Bombadil, if indeed this is still the same that walked the woods and hills long ago, and even then was older than the old. \\n', 'That was not then his name. Iarwain Ben-adar we called him, oldest and fatherless. \\n', 'But many another name he has since been given by other folk: Forn by the Dwarves, Orald by Northern Men, and other names beside. 
He is a strange creature...\\n']\n", 854 | "3 lines\n" 855 | ] 856 | } 857 | ], 858 | "source": [ 859 | "# read lines\n", 860 | "with open(\"text_file.txt\", \"r\") as file_reader:\n", 861 | " data = file_reader.readlines()\n", 862 | "print(data)\n", 863 | "print(f'{len(data)} lines')" 864 | ] 865 | }, 866 | { 867 | "cell_type": "code", 868 | "execution_count": 79, 869 | "metadata": {}, 870 | "outputs": [ 871 | { 872 | "name": "stdout", 873 | "output_type": "stream", 874 | "text": [ 875 | "5000\n", 876 | "Bing Crosby\n", 877 | "Sean Kingston\n", 878 | "Band Aid\n", 879 | "David Guetta & Akon\n", 880 | "Phil Collins\n", 881 | "Elvis Presley\n", 882 | "Tony Bennett\n", 883 | "Cyndi Lauper\n", 884 | "The Jackson 5\n", 885 | "The Rolling Stones\n", 886 | "Les Paul & Mary Ford\n", 887 | "Johnny Mercer\n", 888 | "Millie Small\n", 889 | "Queen\n", 890 | "Notorious BIG & P Diddy\n", 891 | "Barbra Streisand & Bryan Adams\n", 892 | "The Doors\n", 893 | "The Pussycat Dolls & Will.I.Am\n", 894 | "Elvis Presley\n", 895 | "Bill Withers\n", 896 | "The Young Rascals\n", 897 | "Robin Gibb\n", 898 | "Mary J Blige & U2\n", 899 | "Faith Hill\n", 900 | "Def Leppard\n", 901 | "Peter, Paul & Mary\n", 902 | "Avril Lavigne\n", 903 | "Debbie Gibson\n", 904 | "Kim Wilde\n", 905 | "Glen Campbell\n", 906 | "Metallica\n", 907 | "The Mills Brothers\n", 908 | "Bryan Adams & Melanie C\n", 909 | "Michael Buble\n", 910 | "Elton John\n", 911 | "Offspring\n", 912 | "Francoise Hardy\n", 913 | "First Class\n", 914 | "Ray Conniff\n", 915 | "Seal\n", 916 | "McFadden & Whitehead\n", 917 | "T-Bone Walker Quintet\n", 918 | "Good Charlotte\n", 919 | "Avril Lavigne\n", 920 | "Dolly Parton\n", 921 | "Radiohead\n", 922 | "Belinda Carlisle\n", 923 | "Frankie Vaughan\n", 924 | "The Beatles\n", 925 | "Ce Ce Peniston\n" 926 | ] 927 | } 928 | ], 929 | "source": [ 930 | "# read file from json (as a simple text file) and convert string to dict using json.loads() function\n", 931 | "with open('songs_metadata.json') as 
reader_json:\n", 932 | "json_string = reader_json.read()\n", 933 | "import json\n", 934 | "list_of_dicts = json.loads(json_string)\n", 935 | "print(len(list_of_dicts))\n", 936 | "for d in list_of_dicts[::100]:\n", 937 | " if 'artist' in d:\n", 938 | " print(d['artist'])" 939 | ] 940 | }, 941 | { 942 | "cell_type": "markdown", 943 | "metadata": {}, 944 | "source": [ 945 | "## Functions in Python" 946 | ] 947 | }, 948 | { 949 | "cell_type": "code", 950 | "execution_count": 81, 951 | "metadata": {}, 952 | "outputs": [ 953 | { 954 | "name": "stdout", 955 | "output_type": "stream", 956 | "text": [ 957 | "Hello!\n" 958 | ] 959 | } 960 | ], 961 | "source": [ 962 | "def this_is_a_function_with_no_args():\n", 963 | " print(\"Hello!\") \n", 964 | "this_is_a_function_with_no_args()" 965 | ] 966 | }, 967 | { 968 | "cell_type": "code", 969 | "execution_count": 10, 970 | "metadata": {}, 971 | "outputs": [ 972 | { 973 | "name": "stdout", 974 | "output_type": "stream", 975 | "text": [ 976 | "hello!\n" 977 | ] 978 | } 979 | ], 980 | "source": [ 981 | "def concatenate_strings_and_print(string1, string2):\n", 982 | " print(string1 + string2)\n", 983 | "concatenate_strings_and_print(\"hello\", \"!\")" 984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": 84, 989 | "metadata": {}, 990 | "outputs": [ 991 | { 992 | "name": "stdout", 993 | "output_type": "stream", 994 | "text": [ 995 | "4\n", 996 | "This is the second argument: IS\n" 997 | ] 998 | } 999 | ], 1000 | "source": [ 1001 | "def fun_with_unknown_num_of_arguments(*args):\n", 1002 | " print(len(args))\n", 1003 | " print(f\"This is the second argument: {args[1]}\")\n", 1004 | "fun_with_unknown_num_of_arguments(\"THIS\", \"IS\", \"A\", \"TEST\")" 1005 | ] 1006 | }, 1007 | { 1008 | "cell_type": "code", 1009 | "execution_count": 87, 1010 | "metadata": {}, 1011 | "outputs": [ 1012 | { 1013 | "name": "stdout", 1014 | "output_type": "stream", 1015 | "text": [ 1016 | "Maria\n", 1017 | "Helen\n", "George\n" 1018 | ] 1019 | } 1020 | ], 1021 | "source": [ 1022 | "# default parameter values: (name is optional)\n", 1023 | "def fun(name=\"George\"):\n", 1024 | " print(name)\n", 1025 | " \n", 1026 | "fun(\"Maria\")\n", 1027 | "fun(\"Helen\")\n", 1028 | "fun() # if the arg is not provided --> default param value is used" 1029 | ] 1030 | }, 1031 | { 1032 | "cell_type": "code", 1033 | "execution_count": 46, 1034 | "metadata": {}, 1035 | "outputs": [ 1036 | { 1037 | "name": "stdout", 1038 | "output_type": "stream", 1039 | "text": [ 1040 | "9\n" 1041 | ] 1042 | } 1043 | ], 1044 | "source": [ 1045 | "# this is how to return a value:\n", 1046 | "def find_max(some_list):\n", 1047 | " m = some_list[0]\n", 1048 | " for x in some_list[1:]:\n", 1049 | " if m < x:\n", 1050 | " m = x\n", 1051 | " return m\n", 1052 | "\n", 1053 | "print(find_max([2,4,1,5,6,7,8,9,1]))" 1054 | ] 1055 | }, 1056 | { 1057 | "cell_type": "code", 1058 | "execution_count": 88, 1059 | "metadata": {}, 1060 | "outputs": [ 1061 | { 1062 | "name": "stdout", 1063 | "output_type": "stream", 1064 | "text": [ 1065 | "1\n", 1066 | "9\n" 1067 | ] 1068 | } 1069 | ], 1070 | "source": [ 1071 | "# and return more than one value:\n", 1072 | "def find_max_min(some_list):\n", 1073 | " max_val = some_list[0]\n", 1074 | " min_val = some_list[0]\n", 1075 | " for x in some_list[1:]:\n", 1076 | " if max_val < x:\n", 1077 | " max_val = x\n", 1078 | " if min_val > x:\n", 1079 | " min_val = x \n", 1080 | " return min_val, max_val\n", 1081 | "\n", 1082 | "mymin, mymax = find_max_min([2,4,1,5,6,7,8,9,1])\n", 1083 | "print(mymin)\n", 1084 | "print(mymax)" 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "markdown", 1089 | "metadata": {}, 1090 | 
"source": [ 1091 | "## Lamda functions\n", 1092 | "Python lambdas are short and anonymous functions. Lambda expressions in Python and other programming languages have their roots in lambda calculus, a model of computation used in several programming languages. " 1093 | ] 1094 | }, 1095 | { 1096 | "cell_type": "code", 1097 | "execution_count": 48, 1098 | "metadata": {}, 1099 | "outputs": [ 1100 | { 1101 | "data": { 1102 | "text/plain": [ 1103 | "20" 1104 | ] 1105 | }, 1106 | "execution_count": 48, 1107 | "metadata": {}, 1108 | "output_type": "execute_result" 1109 | } 1110 | ], 1111 | "source": [ 1112 | "# this lamda expression\n", 1113 | "double_number = lambda x: 2 * x\n", 1114 | "double_number(10)" 1115 | ] 1116 | }, 1117 | { 1118 | "cell_type": "code", 1119 | "execution_count": 49, 1120 | "metadata": {}, 1121 | "outputs": [ 1122 | { 1123 | "data": { 1124 | "text/plain": [ 1125 | "20" 1126 | ] 1127 | }, 1128 | "execution_count": 49, 1129 | "metadata": {}, 1130 | "output_type": "execute_result" 1131 | } 1132 | ], 1133 | "source": [ 1134 | "# is equivalent to this function:\n", 1135 | "def double_number(x):\n", 1136 | " return 2 * x\n", 1137 | "double_number(10)" 1138 | ] 1139 | }, 1140 | { 1141 | "cell_type": "code", 1142 | "execution_count": 17, 1143 | "metadata": {}, 1144 | "outputs": [ 1145 | { 1146 | "name": "stdout", 1147 | "output_type": "stream", 1148 | "text": [ 1149 | "12\n" 1150 | ] 1151 | } 1152 | ], 1153 | "source": [ 1154 | "# lamda with two arguments:\n", 1155 | "y = lambda a, b : a * b\n", 1156 | "print(y(3, 4))" 1157 | ] 1158 | }, 1159 | { 1160 | "cell_type": "code", 1161 | "execution_count": null, 1162 | "metadata": {}, 1163 | "outputs": [], 1164 | "source": [] 1165 | } 1166 | ], 1167 | "metadata": { 1168 | "kernelspec": { 1169 | "display_name": "Python 3", 1170 | "language": "python", 1171 | "name": "python3" 1172 | }, 1173 | "language_info": { 1174 | "codemirror_mode": { 1175 | "name": "ipython", 1176 | "version": 3 1177 | }, 1178 | 
"file_extension": ".py", 1179 | "mimetype": "text/x-python", 1180 | "name": "python", 1181 | "nbconvert_exporter": "python", 1182 | "pygments_lexer": "ipython3", 1183 | "version": "3.10.5" 1184 | } 1185 | }, 1186 | "nbformat": 4, 1187 | "nbformat_minor": 4 1188 | } 1189 | -------------------------------------------------------------------------------- /notebooks/2-numpy_and_pandas.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NumPy and pandas\n", 8 | "## General\n", 9 | "### NumPy\n", 10 | "NumPy is used for performing numerical computations on arrays and matrices, such as mean, median, percentiles and linear algebra computations. Simply install numpy with pip `pip3 install numpy`. \n", 11 | "\n", 12 | "### Pandas\n", 13 | "Pandas is used for handling tabular datasets that usually combine different types of data columns (integer, float, nominals, etc). Pandas requires NumPy. To install: `pip3 install pandas`." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "## Numpy examples" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### The basics" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 1, 33 | "metadata": {}, 34 | "outputs": [ 35 | { 36 | "ename": "AttributeError", 37 | "evalue": "module 'numpy' has no attribute 'zeros'", 38 | "output_type": "error", 39 | "traceback": [ 40 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 41 | "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", 42 | "Input \u001b[0;32mIn [1]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# zeros and ones. 
array shape\u001b[39;00m\n\u001b[1;32m 2\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mnumpy\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mnp\u001b[39;00m\n\u001b[0;32m----> 3\u001b[0m a \u001b[38;5;241m=\u001b[39m \u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mzeros\u001b[49m((\u001b[38;5;241m2\u001b[39m, \u001b[38;5;241m4\u001b[39m))\n\u001b[1;32m 4\u001b[0m b \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39mones((\u001b[38;5;241m2\u001b[39m, \u001b[38;5;241m4\u001b[39m))\n\u001b[1;32m 5\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124ma:\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;132;01m{\u001b[39;00ma\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n", 43 | "\u001b[0;31mAttributeError\u001b[0m: module 'numpy' has no attribute 'zeros'" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "# zeros and ones. array shape\n", 49 | "import numpy as np\n", 50 | "a = np.zeros((2, 4))\n", 51 | "b = np.ones((2, 4))\n", 52 | "print(f\"a:\\n{a}\")\n", 53 | "print(f\"b:\\n{b}\")\n", 54 | "print(f\"a+b:\\n{a+b}\")\n", 55 | "print(f\"a-2b:\\n{a-2*b}\")\n", 56 | "print(f\"shape:\\n{a.shape}\")" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "# creating arrays from lists and array types\n", 66 | "import numpy as np\n", 67 | "a = np.array([1, 2, 5])\n", 68 | "b = np.array([2.0, 10, -1])\n", 69 | "print(f\"a+b{a + b}\")\n", 70 | "print(a.dtype)\n", 71 | "print(b.dtype)\n", 72 | "print((a+b).dtype)" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "# to change the type of a numpy array use astype():\n", 82 | "import numpy as np\n", 83 | "b = np.array([2.1, 10, -5])\n", 84 | "b_reduced = b.astype('uint8')\n", 85 | "print(b.dtype)\n", 86 | 
"print(b_reduced.dtype)\n", 87 | "print(b_reduced)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "import numpy as np\n", 97 | "x = np.array([[200, 100], [100, 200]]).astype('uint8')\n", 98 | "y = np.array([[255, 100], [100, 255]]).astype('uint8')\n", 99 | "print(x)\n", 100 | "print(y)\n", 101 | "print(x + y)\n", 102 | "print(\"(Result is overfloated!)\")\n", 103 | "print(\"\\n Results with type conversion:\")\n", 104 | "print(x.astype('int32') + y.astype('int32'))" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "# numpy.arange. basic operations\n", 114 | "import numpy as np\n", 115 | "a = np.arange(0, 20, 5)\n", 116 | "b = np.arange(0, 20, 5) - 10\n", 117 | "print(f\"a:{a}\")\n", 118 | "print(f\"a-10:{a-10}\")\n", 119 | "print(f\"a^2:{a ** 2}\")\n", 120 | "print(f\"a-b:{a-b}\")\n", 121 | "print(f\"cos(b * pi / 20):{np.cos(b * np.pi / 20.0)}\")" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "# element-wise product matrix product\n", 131 | "import numpy as np\n", 132 | "A = np.array([[0, 2], [1, 1]])\n", 133 | "B = np.array([[-1, 1], [1, 1]])\n", 134 | "print(f\"A .* B =\\n {A * B}\") # element-wise\n", 135 | "print(f\"A * B =\\n {A.dot(B)}\") # matrix product" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "# reshaping arrays\n", 145 | "import numpy as np\n", 146 | "x = np.arange(10)\n", 147 | "print(x)\n", 148 | "print(x.reshape(2, 5))" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "### Numpy statistics\n", 156 | "The following example reads the temperatures from NYC in the last 150 years on the same day (9th 
April). The csv file contains 3 columns, namely date, max day temp and min day temp. Date is saved in a list and the two temperatures in numpy arrays. The following code extracts some basic statistics including mean value, median value, max, min and the 10th and 90th percentiles. " 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "import csv\n", 166 | "import numpy as np\n", 167 | "\n", 168 | "years = []\n", 169 | "max_t, min_t = np.array([]), np.array([])\n", 170 | "# read the csv file of New York min and max temperatures of 9th April for the last 150 years:\n", 171 | "with open('data_ny_temperatures.csv', newline='') as csvfile:\n", 172 | " reader = csv.reader(csvfile, delimiter=',', quotechar='|')\n", 173 | " for ir, row in enumerate(reader):\n", 174 | " if ir>0:\n", 175 | " max_t = np.append(max_t, float(row[1]))\n", 176 | " min_t = np.append(min_t, float(row[2]))\n", 177 | " years.append(int(row[0].split('-')[0]))\n", 178 | "\n", 179 | "print(f\"Average max-day temperature is {max_t.mean():.1f}\")\n", 180 | "print(f\"Median max-day temperature is {np.median(max_t):.1f}\")\n", 181 | "print(f\"Average min-day temperature is {min_t.mean():.1f}\")\n", 182 | "print(f\"Median min-day temperature is {np.median(min_t):.1f}\")\n", 183 | "\n", 184 | "print(f\"The maximum max-day temp was {np.max(max_t):.1f} in {years[np.argmax(max_t)]}\")\n", 185 | "print(f\"The maximum min-day temp was {np.max(min_t):.1f} in {years[np.argmax(min_t)]}\")\n", 186 | "print(f\"The minimum max-day temp was {np.min(max_t):.1f} in {years[np.argmin(max_t)]}\")\n", 187 | "print(f\"The minimum min-day temp was {np.min(min_t):.1f} in {years[np.argmin(min_t)]}\")\n", 188 | "\n", 189 | "max_t_p_10 = np.percentile(max_t, 10)\n", 190 | "max_t_p_90 = np.percentile(max_t, 90)\n", 191 | "years_max_10 = [y for i, y in enumerate(years) if max_t[i] < max_t_p_10]\n", 192 | "# Note: this is equivalent to the 
following:\n", 193 | "# years_max_10 = []\n", 194 | "#for i, y in enumerate(years):\n", 195 | "# if max_t[i] < max_t_p_10:\n", 196 | "# years_max_10.append(y)\n", 197 | "print(years_max_10)\n", 198 | "years_max_90 = [y for i, y in enumerate(years) if max_t[i] > max_t_p_90]\n", 199 | "print(years_max_90)\n", 200 | "min_t_p_10 = np.percentile(min_t, 10)\n", 201 | "min_t_p_90 = np.percentile(min_t, 90)\n", 202 | "years_min_10 = [y for i, y in enumerate(years) if min_t[i] < min_t_p_10]\n", 203 | "print(years_min_10)\n", 204 | "years_min_90 = [y for i, y in enumerate(years) if min_t[i] > min_t_p_90]\n", 205 | "print(years_min_90)" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "A note on speed: if you need to append a large number of elements to a numpy array, it is much faster to append them to a list and then convert the list to a numpy array (instead of using the numpy.append() method). And list comprehension is obviously even faster. " 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": {}, 219 | "outputs": [], 220 | "source": [ 221 | "import numpy as np\n", 222 | "import time\n", 223 | "\n", 224 | "t1 = time.time()\n", 225 | "a = np.array([])\n", 226 | "for i in range(1, 10000):\n", 227 | " a = np.append(a, i)\n", 228 | "t2 = time.time()\n", 229 | "print(f\"numpy.append(): {1000 * (t2 - t1):.2f} msecs\")\n", 230 | "\n", 231 | "t1 = time.time()\n", 232 | "a = []\n", 233 | "for i in range(1, 10000):\n", 234 | " a.append(i)\n", 235 | "a = np.array(a)\n", 236 | "t2 = time.time()\n", 237 | "print(f\"list append and numpy array conversion: {1000 * (t2 - t1):.2f} msecs\")\n", 238 | "\n", 239 | "t1 = time.time()\n", 240 | "a = [i for i in range(1, 10000)]\n", 241 | "a = np.array(a)\n", 242 | "t2 = time.time()\n", 243 | "print(f\"list comprehension and numpy array conversion: {1000 * (t2 - t1):.2f} msecs\")" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | 
"metadata": {}, 249 | "source": [ 250 | "Talking about statistics, two of the most important quantities used in random variable statistics (whatever quantity they measure) are mean and standard deviation. We've already seen mean in some examples above. Standard deviation, which measures how close the values of the variable are to their mean value. Below, we are showing how to compute mean and std of a sequence and how to standardize the values of the sequence into having a standard deviation of 1 and mean value equal to 0. This is a very important process, used in machine learning and data science before training models and before predicting. An alternative is the max / min normalization, not shown here. " 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "import numpy as np\n", 260 | "import numpy.random\n", 261 | "m, s, n_samples = 10, 5, 1000\n", 262 | "x = numpy.random.normal(m, s, n_samples)\n", 263 | "m_est = x.mean()\n", 264 | "s_est = x.std()\n", 265 | "\n", 266 | "print(f\"mean is {m_est:.3f} and std is {s_est:.3f}\")\n", 267 | "# z = (x - m) / s\n", 268 | "x_norm = (x - m_est) / s_est\n", 269 | "print(f\"after standardization mean is {x_norm.mean():.3f} and std is {x_norm.std():.3f}\")" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": {}, 275 | "source": [ 276 | "### Numpy slicing and row - column operations" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "import numpy as np\n", 286 | "x = np.array([[1,2,3], [4,5,6], [7, 8, 9], [10, 11, 12]])\n", 287 | "print(\"x:\")\n", 288 | "print(x)\n", 289 | "print(\"\\nx[1:, :-1]:\")\n", 290 | "print(x[1:, :-1])" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "# global and row-wise or 
column-wise calculations\n", 300 | "import numpy as np\n", 301 | "x = np.array([[1,2,3], [4,5,6], [7, 8, 9], [10, 11, 12]])\n", 302 | "print(f\"global mean {x.mean()}\")\n", 303 | "print(f\"global min {x.min()}\")\n", 304 | "print(f\"global max {x.max()}\")\n", 305 | "print(f\"column-wise mean {x.mean(axis=0)}\")\n", 306 | "print(f\"row-wise mean {x.mean(axis=1)}\")" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "### Broadcasting\n", 319 | "Broadcasting in numpy is a very powerful mechanism that allows numpy operators to work on arrays of different shapes.\n", 320 | "\n", 321 | "We saw previously that element-to-element operations are possible in numpy when arrays have the same dimensions. However, operations on arrays that do not share the same shapes are also possible in numpy because of broadcasting. Broadcasting can be performed when the corresponding dimensions of the arrays are equal, or when one of them is equal to 1. 
Below are some broadcasting examples:" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": null, 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [ 330 | "# broadcasting examples\n", 331 | "import numpy as np\n", 332 | "# example 1:\n", 333 | "x = np.array([[1, 2], [3, 4]])\n", 334 | "print(x + 2) # scalar and 2D array broadcasting\n", 335 | "\n", 336 | "# example 2:\n", 337 | "x = np.array([[1,2,3], [4,5,6], [7, 8, 9], [10, 11, 12]])\n", 338 | "y = np.array([1, 2, 3])\n", 339 | "print(f\"add a {x.shape[0]}x{x.shape[1]} with a 1x{y.shape[0]} numpy array:\")\n", 340 | "print(x + y)\n", 341 | "\n", 342 | "# example 3:\n", 343 | "y = np.array([1, 2, 3, 4]).reshape(4,1)\n", 344 | "print(f\"add a {x.shape[0]}x{x.shape[1]} with a {y.shape[0]}x1 numpy array:\")\n", 345 | "print(x + y)" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [ 354 | "import numpy as np\n", 355 | "# A normalization example without looping (using numpy broadcasting)\n", 356 | "# initialize features (columns represent features and rows represent instances)\n", 357 | "X = np.array([[200,0.1],[220,0.15],[250,0.11],[300,0.15],[320,0.16],[240,0.14]])\n", 358 | "\n", 359 | "# get mean / std per feature (per column):\n", 360 | "m = X.mean(axis=0) \n", 361 | "s = X.std(axis=0)\n", 362 | "\n", 363 | "# normalize (without having to loop through different rows):\n", 364 | "X_norm = (X - m) / s\n", 365 | "# now X_norm is normalized with mean = 0 , std = 1:\n", 366 | "X_norm" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "## Pandas\n", 374 | "### Pandas data structures\n", 375 | "There are two basic data structures in pandas: *series* and *dataframes*.\n", 376 | "Series is a 1D labeled array that holds any data type (integers, strings, floats etc). To define a Series we need its data and its indices. 
Obviously the index must be of the same length as the data. If index is not defined, then the default value is \[0, ..., len(data) - 1\]. \n", 377 | "\n", 378 | "#### Series" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": null, 384 | "metadata": {}, 385 | "outputs": [], 386 | "source": [ 387 | "# series definition\n", 388 | "import pandas as pd\n", 389 | "import numpy as np\n", 390 | "s = pd.Series(np.random.randn(10), index=[f'index{i}' for i in range(10)])\n", 391 | "print(\"series:\"); print(s)\n", 392 | "print(\"s.index\"); print(s.index)" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": {}, 399 | "outputs": [], 400 | "source": [ 401 | "# one can also initialize series from dict:\n", 402 | "s = pd.Series({'a': 2.1, 'c': 1.9, 'b': 1, 'd': -1})\n", 403 | "print(\"series:\"); print(s)\n", 404 | "print(\"s.index\"); print(s.index)" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": null, 410 | "metadata": {}, 411 | "outputs": [], 412 | "source": [ 413 | "# indexing in a series can be done both with labels and with integer positions\n", 414 | "print(s[1], s['c'])" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": null, 420 | "metadata": {}, 421 | "outputs": [], 422 | "source": [ 423 | "# Series also shares functions with numpy arrays:\n", 424 | "s.mean(), s.median()" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "# ... 
and more functions:\n", 434 | "np.cos(s)" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": null, 440 | "metadata": {}, 441 | "outputs": [], 442 | "source": [ 443 | "# slicing similar to numpy arrays:\n", 444 | "s[:-2]" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "metadata": {}, 451 | "outputs": [], 452 | "source": [ 453 | "s[s>0.5]" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [ 462 | "# BUT, operations are not the same as numpy. E.g. + results in the union of the indices involved\n", 463 | "# NaN is assigned as the default value for indices that are not in both series \n", 464 | "a = pd.Series({'a': 2.1, 'b': 1, 'c': -1})\n", 465 | "b = pd.Series({'a': 1, 'd': 1, 'g': -1, 'c': -1})\n", 466 | "a + b" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": {}, 472 | "source": [ 473 | "#### DataFrame\n", 474 | "When your data is tabular with row index and column index, the go-to choice is pandas.DataFrame. DataFrame is a 2D data structure with columns of potentially different types. Conceptually, DataFrame can be considered as a data table stored in a spreadsheet, a csv, a json file or a database. 
\n", 475 | "\n", 476 | "There are several ways to construct a DataFrame object, below are two of the most frequent:" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": null, 482 | "metadata": {}, 483 | "outputs": [], 484 | "source": [ 485 | "# construct DataFrame from dict\n", 486 | "import pandas as pd\n", 487 | "d = {'name': [\"james\", \"theodore\", \"jane\", \"maria\"], \n", 488 | " 'score': [4., 3., 2., 5.]}\n", 489 | "df = pd.DataFrame(d)\n", 490 | "print(df.columns)\n", 491 | "print(df)" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": null, 497 | "metadata": {}, 498 | "outputs": [], 499 | "source": [ 500 | "# construct DataFrame from list of dicts\n", 501 | "# (note that \"sparse\" matrices - aka missing data - are more easily supported using this format)\n", 502 | "import pandas as pd\n", 503 | "d = [{'name': 'james', 'score': '4', 'note': 'this is a note'},\n", 504 | " {'name': 'theodore', 'score': '3'},\n", 505 | " {'name': 'jane', 'score': '2'},\n", 506 | " {'name': 'maria', 'score': '5'}]\n", 507 | "df = pd.DataFrame(d)\n", 508 | "print(df.columns)\n", 509 | "print(df)" 510 | ] 511 | }, 512 | { 513 | "cell_type": "markdown", 514 | "metadata": {}, 515 | "source": [ 516 | "#### More on DataFrames" 517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "execution_count": null, 522 | "metadata": {}, 523 | "outputs": [], 524 | "source": [ 525 | "# lets read the CSV file of temperatures again:\n", 526 | "import pandas as pd\n", 527 | "df = pd.read_csv(\"data_ny_temperatures.csv\")\n", 528 | "print(f\"{len(list(df.columns))} columns {list(df.columns)}\")\n", 529 | "print(f\"{len(df.index)} rows\")\n", 530 | "df" 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": null, 536 | "metadata": {}, 537 | "outputs": [], 538 | "source": [ 539 | "# SELECT a column:\n", 540 | "import pandas as pd\n", 541 | "df = pd.read_csv(\"data_ny_temperatures.csv\")\n", 542 | "df['date']" 543 | ] 544 | 
}, 545 | { 546 | "cell_type": "code", 547 | "execution_count": null, 548 | "metadata": {}, 549 | "outputs": [], 550 | "source": [ 551 | "# convert a column to numpy array:\n", 552 | "import pandas as pd\n", 553 | "df = pd.read_csv(\"data_ny_temperatures.csv\")\n", 554 | "df['maxt'].to_numpy()" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": null, 560 | "metadata": {}, 561 | "outputs": [], 562 | "source": [ 563 | "# ... or to list\n", 564 | "import pandas as pd\n", 565 | "df = pd.read_csv(\"data_ny_temperatures.csv\")\n", 566 | "df['mint'].to_list()[::20]" 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": null, 572 | "metadata": {}, 573 | "outputs": [], 574 | "source": [ 575 | "# you can also INSERT a new column e.g.\n", 576 | "import pandas as pd\n", 577 | "df = pd.read_csv(\"data_ny_temperatures.csv\")\n", 578 | "df['meant'] = (df['mint'] + df['maxt']) / 2\n", 579 | "# or you can insert a fixed (non-array) value (it will be added to ALL rows)\n", 580 | "df['note'] = 'this is a note'\n", 581 | "df" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": null, 587 | "metadata": {}, 588 | "outputs": [], 589 | "source": [ 590 | "# DELETE a column\n", 591 | "import pandas as pd\n", 592 | "df = pd.read_csv(\"data_ny_temperatures.csv\")\n", 593 | "del df['maxt']\n", 594 | "df" 595 | ] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "metadata": {}, 600 | "source": [ 601 | "We've seen that indexing columns is done like in dicts e.g. df['maxt']. 
What about indexing rows and assigning values to individual cells:" 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": null, 607 | "metadata": {}, 608 | "outputs": [], 609 | "source": [ 610 | "import pandas as pd\n", 611 | "df = pd.read_csv(\"data_ny_temperatures.csv\")\n", 612 | "print(df.iloc[0]) # index rows\n", 613 | "# you can also use df.loc and provide a LABEL instead of an integer\n", 614 | "# (if the dataframe has been defined with labels in rows, see next examples)\n", 615 | "\n", 616 | "# ASSIGN a value to a specific CELL:\n", 617 | "df.loc[0, 'maxt'] = -10\n", 618 | "\n", 619 | "df.iloc[0, df.columns.get_loc(\"maxt\")] = -10 # the same, using iloc with an integer column position\n", 620 | "print(df)" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "In the following example\n", 628 | " * we set the index of the temperatures matrix from default (integers) to the date\n", 629 | " * we demonstrate how to use the loc method to index when non-integer indices are used" 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": null, 635 | "metadata": {}, 636 | "outputs": [], 637 | "source": [ 638 | "import pandas as pd\n", 639 | "df = pd.read_csv(\"data_ny_temperatures.csv\")\n", 640 | "df = df.set_index(\"date\")\n", 641 | "print(df)\n", 642 | "# you can now use the loc method\n", 643 | "df.loc[\"2018-04-09\"]" 644 | ] 645 | }, 646 | { 647 | "cell_type": "code", 648 | "execution_count": null, 649 | "metadata": {}, 650 | "outputs": [], 651 | "source": [ 652 | "# SLICING\n", 653 | "import pandas as pd\n", 654 | "df = pd.read_csv(\"data_ny_temperatures.csv\")\n", 655 | "df = df.set_index(\"date\")\n", 656 | "print(df[::20]) # print every 20 rows\n", 657 | "print(df[2:4]) # print rows 2 to 3\n", 658 | "print(df[\"1929-04-09\": \"1944-04-09\"]) # use non-integer indices in slicing" 659 | ] 660 | }, 661 | { 662 | "cell_type": "code", 663 | "execution_count": null, 664 | "metadata": {}, 665 | "outputs": [], 666 | "source": 
[ 667 | "# SELECTION\n", 668 | "print(df[df['maxt'] > 80]) # select rows with maxt>80\n", 669 | "print(df[df['maxt'] - df['mint'] < 5]) # select rows with less than 5 difference between maxt and mint\n", 670 | "print(df[(df['maxt'] > 80) | (df['mint'] < 28)]) # select rows with very high max or very low min temperatures" 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": null, 676 | "metadata": {}, 677 | "outputs": [], 678 | "source": [ 679 | "# SORTING\n", 680 | "import pandas as pd\n", 681 | "df = pd.read_csv(\"data_ny_temperatures.csv\")\n", 682 | "df = df.set_index(\"date\")\n", 683 | "df = df.sort_values(by='mint') \n", 684 | "print(df)" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": null, 690 | "metadata": {}, 691 | "outputs": [], 692 | "source": [] 693 | } 694 | ], 695 | "metadata": { 696 | "kernelspec": { 697 | "display_name": "Python 3 (ipykernel)", 698 | "language": "python", 699 | "name": "python3" 700 | }, 701 | "language_info": { 702 | "codemirror_mode": { 703 | "name": "ipython", 704 | "version": 3 705 | }, 706 | "file_extension": ".py", 707 | "mimetype": "text/x-python", 708 | "name": "python", 709 | "nbconvert_exporter": "python", 710 | "pygments_lexer": "ipython3", 711 | "version": "3.9.13" 712 | } 713 | }, 714 | "nbformat": 4, 715 | "nbformat_minor": 4 716 | } 717 | -------------------------------------------------------------------------------- /notebooks/data_ny_temperatures.csv: -------------------------------------------------------------------------------- 1 | date,maxt,mint 2 | 1869-04-09,47,33 3 | 1870-04-09,66,44 4 | 1871-04-09,83,62 5 | 1872-04-09,54,41 6 | 1873-04-09,50,39 7 | 1874-04-09,42,36 8 | 1875-04-09,47,37 9 | 1876-04-09,43,29 10 | 1877-04-09,49,36 11 | 1878-04-09,53,43 12 | 1879-04-09,69,43 13 | 1880-04-09,49,31 14 | 1881-04-09,56,37 15 | 1882-04-09,53,44 16 | 1883-04-09,62,42 17 | 1884-04-09,41,37 18 | 1885-04-09,39,27 19 | 1886-04-09,65,34 20 | 1887-04-09,63,37 21 | 
1888-04-09,51,30 22 | 1889-04-09,59,39 23 | 1890-04-09,46,40 24 | 1891-04-09,48,31 25 | 1892-04-09,51,33 26 | 1893-04-09,56,41 27 | 1894-04-09,48,32 28 | 1895-04-09,64,53 29 | 1896-04-09,51,34 30 | 1897-04-09,47,44 31 | 1898-04-09,57,40 32 | 1899-04-09,46,39 33 | 1900-04-09,42,31 34 | 1901-04-09,49,39 35 | 1902-04-09,48,43 36 | 1903-04-09,64,47 37 | 1904-04-09,57,44 38 | 1905-04-09,56,39 39 | 1906-04-09,51,38 40 | 1907-04-09,41,36 41 | 1908-04-09,64,46 42 | 1909-04-09,51,39 43 | 1910-04-09,54,43 44 | 1911-04-09,42,34 45 | 1912-04-09,63,33 46 | 1913-04-09,45,35 47 | 1914-04-09,44,35 48 | 1915-04-09,68,51 49 | 1916-04-09,43,32 50 | 1917-04-09,40,28 51 | 1918-04-09,51,34 52 | 1919-04-09,54,44 53 | 1920-04-09,45,29 54 | 1921-04-09,72,49 55 | 1922-04-09,65,51 56 | 1923-04-09,50,32 57 | 1924-04-09,53,37 58 | 1925-04-09,56,39 59 | 1926-04-09,61,43 60 | 1927-04-09,48,34 61 | 1928-04-09,52,39 62 | 1929-04-09,71,53 63 | 1930-04-09,50,33 64 | 1931-04-09,63,46 65 | 1932-04-09,45,41 66 | 1933-04-09,60,40 67 | 1934-04-09,69,44 68 | 1935-04-09,42,37 69 | 1936-04-09,47,36 70 | 1937-04-09,45,36 71 | 1938-04-09,49,37 72 | 1939-04-09,46,33 73 | 1940-04-09,61,43 74 | 1941-04-09,59,39 75 | 1942-04-09,49,33 76 | 1943-04-09,58,39 77 | 1944-04-09,59,45 78 | 1945-04-09,71,46 79 | 1946-04-09,57,36 80 | 1947-04-09,48,42 81 | 1948-04-09,54,37 82 | 1949-04-09,50,40 83 | 1950-04-09,46,27 84 | 1951-04-09,68,46 85 | 1952-04-09,54,42 86 | 1953-04-09,60,48 87 | 1954-04-09,56,35 88 | 1955-04-09,63,39 89 | 1956-04-09,52,34 90 | 1957-04-09,45,34 91 | 1958-04-09,52,29 92 | 1959-04-09,71,52 93 | 1960-04-09,47,36 94 | 1961-04-09,52,35 95 | 1962-04-09,61,47 96 | 1963-04-09,55,40 97 | 1964-04-09,47,38 98 | 1965-04-09,58,42 99 | 1966-04-09,53,40 100 | 1967-04-09,59,38 101 | 1968-04-09,74,55 102 | 1969-04-09,64,41 103 | 1970-04-09,76,51 104 | 1971-04-09,67,42 105 | 1972-04-09,51,29 106 | 1973-04-09,55,37 107 | 1974-04-09,40,32 108 | 1975-04-09,52,33 109 | 1976-04-09,51,35 110 | 1977-04-09,44,25 111 | 
1978-04-09,57,34 112 | 1979-04-09,41,37 113 | 1980-04-09,57,55 114 | 1981-04-09,75,55 115 | 1982-04-09,39,34 116 | 1983-04-09,58,48 117 | 1984-04-09,58,35 118 | 1985-04-09,47,32 119 | 1986-04-09,55,43 120 | 1987-04-09,61,46 121 | 1988-04-09,66,43 122 | 1989-04-09,56,41 123 | 1990-04-09,62,40 124 | 1991-04-09,86,68 125 | 1992-04-09,60,45 126 | 1993-04-09,63,44 127 | 1994-04-09,55,42 128 | 1995-04-09,68,41 129 | 1996-04-09,40,33 130 | 1997-04-09,42,30 131 | 1998-04-09,50,37 132 | 1999-04-09,64,42 133 | 2000-04-09,50,30 134 | 2001-04-09,78,45 135 | 2002-04-09,76,52 136 | 2003-04-09,39,35 137 | 2004-04-09,62,45 138 | 2005-04-09,63,45 139 | 2006-04-09,58,37 140 | 2007-04-09,49,32 141 | 2008-04-09,58,40 142 | 2009-04-09,63,41 143 | 2010-04-09,68,44 144 | 2011-04-09,58,40 145 | 2012-04-09,64,49 146 | 2013-04-09,82,51 147 | 2014-04-09,61,45 148 | 2015-04-09,43,37 149 | 2016-04-09,43,36 150 | 2017-04-09,67,45 151 | 2018-04-09,48.9,32 152 | 2019-04-09,51.1,46 153 | -------------------------------------------------------------------------------- /notebooks/image.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tyiannak/python-data-science/ac6bc7209ba55ffc28f38c6d38b713e27491e468/notebooks/image.jpg -------------------------------------------------------------------------------- /notebooks/movies_stats.csv: -------------------------------------------------------------------------------- 1 | Director,Title,CountBodies,CountFwords 2 | Spielberg,War of the Worlds,52,0 3 | Spielberg,Raiders of the Lost Ark,84,0 4 | Spielberg,Indiana Jones and the Kingdom of the Crystal Skull,76,0 5 | Scorsese,Casino,25,343 6 | Scorsese,The Departed,21,266 7 | Scorsese,Taxi Driver,4,34 8 | Tarantino,Kill Bill I,95,17 9 | Tarantino,Reservoir Dogs,12,250 10 | Tarantino,Pulp Fiction,10,249 11 | -------------------------------------------------------------------------------- /notebooks/text_file.txt: 
-------------------------------------------------------------------------------- 1 | But I had forgotten Bombadil, if indeed this is still the same that walked the woods and hills long ago, and even then was older than the old. 2 | That was not then his name. Iarwain Ben-adar we called him, oldest and fatherless. 3 | But many another name he has since been given by other folk: Forn by the Dwarves, Orald by Northern Men, and other names beside. He is a strange creature... 4 | -------------------------------------------------------------------------------- /scripts/TODOs: -------------------------------------------------------------------------------- 1 | Add a docker example! -------------------------------------------------------------------------------- /scripts/cl_example.py: -------------------------------------------------------------------------------- 1 | """ 2 | Command line example 3 | This example plots a list of Gaussian distributions 4 | Usage example: 5 | 6 | python3 cl_example.py -m 0 5 10 15 20 -s 1 1 1 7 2 -n 5000 5000 5000 10000 5000 7 | --names "x1" "x2" "x3" "x4" "x5" -b 100 --normalize 8 | """ 9 | 10 | import argparse 11 | import numpy as np 12 | import numpy.random 13 | import plotly.subplots 14 | import plotly.graph_objs as go 15 | 16 | colors = ["red", "green", "blue", "orange", "gray"] 17 | 18 | 19 | def generate_distributions(means, stds, n_samples): 20 | """ 21 | Generates a list of Gaussian distributions: 22 | :param means: list of mean values of the N distributions 23 | :param stds: list of std values of the N distributions 24 | :param n_samples: list of number of samples of the N distributions 25 | :return: list of random variables (each distribution a separate numpy array) 26 | """ 27 | return [numpy.random.normal(mean, std, ns) 28 | for (mean, std, ns) in zip(means, stds, n_samples)] 29 | 30 | 31 | def visualize_distributions(data, dist_names, norm=False, n_bins=10): 32 | """ 33 | Visualize histograms of a list of distributions 34 | 
:param data: list of Gaussian distributions as generated by 35 | generate_distributions() 36 | :param dist_names: distribution names 37 | :param norm: normalize histograms 38 | :param n_bins: number of bins to be used in each histogram 39 | """ 40 | fig = plotly.subplots.make_subplots(rows=1, cols=2) 41 | for i, x in enumerate(data): # for each distribution 42 | # (1) compute (and normalize) histogram 43 | h, bins = np.histogram(x, bins=n_bins) 44 | if norm: 45 | h = h / h.sum() 46 | # get histogram bin centers from the endpoints: 47 | bins_centers = (bins[0:-1] + bins[1:]) / 2 48 | 49 | # (2) Plot 50 | marker_style1 = dict(color=colors[i], size=2, 51 | line=dict(color=colors[i], width=2)) 52 | # append data to 1st subplot: 53 | fig.append_trace(go.Scatter(y=x, name=dist_names[i], 54 | marker=marker_style1), 1, 1) 55 | # append histogram to 2nd subplot: 56 | fig.append_trace(go.Scatter(x=bins_centers, y=h, 57 | name="hist " + dist_names[i], 58 | marker=marker_style1), 1, 2) 59 | fig.show() # works from a script (plotly.offline.iplot is notebook-only) 60 | 61 | 62 | def parse_arguments(): 63 | """Parse/check input arguments.""" 64 | parser = argparse.ArgumentParser(prog='PROG') 65 | parser.add_argument('-m', '--means', type=int, nargs="+", required=True, 66 | help="Mean value(s) of the distribution(s)") 67 | parser.add_argument('-s', '--stds', type=int, nargs="+", required=True, 68 | help="Standard deviation(s) of the distribution(s)") 69 | parser.add_argument('-n', '--num_of_samples', type=int, nargs="+", 70 | required=True, 71 | help="Number of samples of the distribution(s)") 72 | parser.add_argument('--names', nargs="+", 73 | help="Distribution names") 74 | parser.add_argument("-b", "--bins", nargs=None, default=10, type=int, 75 | help="Number of histogram bins") 76 | parser.add_argument('--normalize', action='store_true', 77 | help="Set true if histograms are to be normalized") 78 | arguments = parser.parse_args() 79 | return arguments 80 | 81 | 82 | if __name__ == "__main__": 83 | args = parse_arguments()
84 | m, s, n, normalize, b = args.means, args.stds, args.num_of_samples, \ 85 | args.normalize, args.bins 86 | if not(len(m) == len(s) == len(n)): 87 | print("Distribution parameters must be of the same length!") 88 | exit(1) 89 | if len(m) > 5: 90 | print("Maximum number of distributions is 5!") 91 | exit(1) 92 | names = args.names 93 | if not names or len(names) != len(m): 94 | names = [f"var_{i}" for i in range(len(m))] 95 | 96 | d = generate_distributions(m, s, n) 97 | visualize_distributions(d, names, norm=normalize, n_bins=b) 98 | -------------------------------------------------------------------------------- /scripts/cl_example_simple.py: -------------------------------------------------------------------------------- 1 | # This is a basic example of how to use argparse to create a command-line 2 | # Python script 3 | 4 | import argparse 5 | 6 | def parse_arguments(): 7 | """Parse/check input arguments.""" 8 | parser = argparse.ArgumentParser(prog='PROG') 9 | parser.add_argument('-a', '--num1', type=int, required=True, 10 | help="First number") 11 | parser.add_argument('-b', '--num2', type=int, required=True, 12 | help="Second number") 13 | arguments = parser.parse_args() 14 | return arguments 15 | 16 | if __name__ == "__main__": 17 | args = parse_arguments() 18 | print(args.num1 + args.num2) -------------------------------------------------------------------------------- /scripts/dash_demo.py: -------------------------------------------------------------------------------- 1 | """ 2 | Instructions: 3 | - Simply run python3 dash_demo.py 4 | - You need to generate a mapbox token and save it in a file called mapbox_token 5 | (https://docs.mapbox.com/help/getting-started/access-tokens/) 6 | 7 | Maintainer: Theodoros Giannakopoulos {tyiannak@gmail.com} 8 | """ 9 | 10 | # -*- coding: utf-8 -*- 11 | import dash 12 | import dash_core_components as dcc 13 | import dash_bootstrap_components as dbc 14 | import dash_html_components as html 15 | import
plotly.graph_objs as go 16 | import pandas as pd 17 | import dash_table 18 | import numpy as np 19 | 20 | 21 | colors = {'background': '#111111', 'text': '#7FDBDD'} 22 | def_font = dict(family="Courier New, Monospace", size=10, color='#000000') 23 | fonts_histogram = dict(family="Courier New, monospace", size=8, 24 | color="RebeccaPurple") 25 | 26 | # read airbnb data for Athens: 27 | csv_data = pd.read_csv("listings.csv") 28 | nei = csv_data["neighbourhood"].dropna().unique() 29 | nei_list = [{'label': c, 'value': c} for c in nei if c != "nan"] 30 | #for n in nei: 31 | # print(n, len(csv_data[csv_data["neighbourhood"]==n])) 32 | 33 | global min_price 34 | global max_price 35 | global min_rating 36 | global max_rating 37 | global csv_data_temp 38 | min_price, max_price = 0, 1000 39 | min_rating, max_rating = 3, 5 40 | 41 | print(len(csv_data)) 42 | csv_data = csv_data[csv_data.review_scores_rating.notnull()] 43 | print(len(csv_data)) 44 | csv_data.price = csv_data.price.str.replace('$', '', regex=False) # literal '$' (with regex=True it is an end-of-string anchor) 45 | csv_data.price = csv_data.price.str.replace(',', '', regex=False) 46 | csv_data.price = csv_data.price.astype("float") 47 | csv_data = csv_data[csv_data["price"] <= 1000] 48 | print(len(csv_data)) 49 | 50 | 51 | def get_statistics(d): 52 | df = pd.DataFrame(d.groupby('neighbourhood_cleansed').agg( 53 | { 54 | 'id': 'count', 55 | 'price': 'median', 56 | 'review_scores_rating': 'median', 57 | } 58 | )).round(2) 59 | data_t = df.reset_index().to_dict('records') 60 | return data_t 61 | 62 | 63 | def draw_data(): 64 | # filtering: 65 | global min_price 66 | global max_price 67 | global min_rating 68 | global max_rating 69 | global csv_data_temp 70 | csv_data_temp = csv_data[(csv_data['price'] <= max_price) & 71 | (csv_data['price'] >= min_price) & 72 | (csv_data['review_scores_rating'] <= max_rating) & 73 | (csv_data['review_scores_rating'] >= min_rating) 74 | ] 75 | 76 | print(f'value:{min_price}-{max_price}, ' 77 | f'rating:{min_rating}-{max_rating}, ' 78 |
f'data_samples={len(csv_data_temp)}') 79 | 80 | figure = {'data': [go.Scattermapbox(lat=csv_data_temp['latitude'], 81 | lon=csv_data_temp['longitude'], 82 | text=csv_data_temp['neighbourhood_cleansed'], 83 | mode='markers', marker_size=7, 84 | marker_color='rgba(22, 182, 255, .9)'), 85 | ], 86 | 'layout': go.Layout( 87 | height=500, 88 | hovermode='closest', 89 | autosize=False, 90 | margin={"l": 0, "r": 0, "b": 0, "t": 0, "pad": 0}, 91 | mapbox=dict(accesstoken=open("mapbox_token").read(), 92 | style='outdoors', bearing=0, 93 | center=go.layout.mapbox.Center( 94 | lat=np.mean(csv_data_temp['latitude']), 95 | lon=np.mean(csv_data_temp['longitude'])), 96 | pitch=0, zoom=13))} 97 | 98 | h, h_bins = np.histogram(csv_data_temp['price'], bins=30) 99 | h_bins = (h_bins[0:-1] + h_bins[1:]) / 2 100 | figure_2 = {'data': [go.Scatter(x=h_bins, y=h, 101 | marker_color='rgba(22, 182, 255, .9)'),], 102 | 'layout': go.Layout( 103 | title="Prices Distribution", 104 | xaxis_title="Price ($)", 105 | yaxis_title="Counts", 106 | font=fonts_histogram, 107 | hovermode='closest', 108 | autosize=False, 109 | height=200, 110 | margin={"l": 40, "r": 15, "b": 30, "t": 30, "pad": 0}, 111 | )} 112 | 113 | nbeds = csv_data_temp['beds'].dropna() 114 | h, h_bins = np.histogram(nbeds, bins=range(int(nbeds.min()), 115 | int(nbeds.max()))) 116 | h_bins = h_bins[:-1] # keep one (left) bin edge per count so x and y match 117 | figure_3 = {'data': [go.Scatter(x=h_bins, y=h, 118 | marker_color='rgba(22, 182, 255, .9)'),], 119 | 'layout': go.Layout( 120 | title="#Beds Distribution", 121 | xaxis_title="#Beds", 122 | yaxis_title="Counts", 123 | font=fonts_histogram, 124 | hovermode='closest', 125 | autosize=False, 126 | height=200, 127 | margin={"l": 40, "r": 15, "b": 30, "t": 30, "pad": 0}, 128 | )} 129 | 130 | nbedrooms = csv_data_temp['bedrooms'].dropna() 131 | h, h_bins = np.histogram(nbedrooms, bins=range(1, 7)) 132 | h_bins = h_bins[:-1] # keep one (left) bin edge per count so x and y match 133 | figure_4 = {'data': [go.Scatter(x=h_bins, y=h,
134 | marker_color='rgba(22, 182, 255, .9)'),], 135 | 'layout': go.Layout( 136 | title="#Bedrooms Distribution", 137 | xaxis_title="#Bedrooms", 138 | yaxis_title="Counts", 139 | font=fonts_histogram, 140 | hovermode='closest', 141 | autosize=False, 142 | height=200, 143 | margin={"l": 40, "r": 15, "b": 30, "t": 30, "pad": 0}, 144 | )} 145 | 146 | 147 | price_table = get_statistics(csv_data_temp) 148 | 149 | return dcc.Graph(figure=figure, 150 | config={'displayModeBar': False}), \ 151 | dcc.Graph(figure=figure_2, 152 | config={'displayModeBar': False}), \ 153 | dcc.Graph(figure=figure_3, 154 | config={'displayModeBar': False}), \ 155 | dcc.Graph(figure=figure_4, 156 | config={'displayModeBar': False}), \ 157 | price_table 158 | 159 | 160 | def get_layout(): 161 | """ 162 | Initialize the UI layout 163 | """ 164 | global data 165 | cols = [{"name": "neighbourhood_cleansed", "id": "neighbourhood_cleansed"}] 166 | cols.append({"name": "count", "id": "id"}) 167 | cols += [{"name": "median " + i, "id": i, } 168 | for i in 169 | ["price", "review_scores_rating"]] 170 | 171 | 172 | layout = dbc.Container([ 173 | # Title 174 | dbc.Row(dbc.Col(html.H2("Airbnb Data Visualization Example", 175 | style={'textAlign': 'center', 176 | 'color': colors['text']}))), 177 | 178 | # Main Graph 179 | dbc.Row( 180 | [ 181 | dbc.Col( 182 | dcc.Graph(figure={'data': [go.Scatter(x=[1], y=[1])],}, 183 | config=dict(displayModeBar=False)), 184 | width=9, id="map_graph"), 185 | 186 | dbc.Col( 187 | dash_table.DataTable( 188 | id='dataframe_output', 189 | fixed_rows={'headers': True}, 190 | style_cell={'fontSize': 8, 'font-family': 'sans-serif', 191 | 'minWidth': 10, 192 | 'width': 40, 193 | 'maxWidth': 95}, 194 | sort_action="native", 195 | sort_mode="multi", 196 | column_selectable="single", 197 | selected_columns=[], 198 | selected_rows=[], 199 | style_table={'height': 700}, 200 | virtualization=True, 201 | #page_size=20, 202 | columns=cols, 203 | style_header={ 204 | 'font-family':
'sans-serif', 205 | 'fontWeight': 'bold', 206 | 'font_size': '9px', 207 | 'color': 'white', 208 | 'backgroundColor': 'black' 209 | }, 210 | ), 211 | width=3 212 | ), 213 | ], className="h-75"), 214 | 215 | 216 | # Controls 217 | dbc.Row( 218 | [ 219 | dbc.Col( 220 | html.Div([ 221 | dcc.Slider(id='slider_price_min', 222 | min=0, max=500, step=1, value=0), 223 | dcc.Slider(id='slider_price_max', 224 | min=0, max=500, step=1, value=500), 225 | html.Div(id='slider_price_container'),] 226 | ), 227 | width=2, 228 | ), 229 | 230 | dbc.Col( 231 | html.Div([ 232 | dcc.Slider(id='slider_rating_min', 233 | min=3, max=5, step=0.1, value=3), 234 | dcc.Slider(id='slider_rating_max', 235 | min=3, max=5, step=0.1, value=5), 236 | html.Div(id='slider_rating_container')]), 237 | width=2, 238 | ), 239 | 240 | dbc.Col( 241 | dcc.Graph(figure={'data': [go.Scatter(x=[1], y=[1])]}), 242 | width=2, id="hist_graph_1"), 243 | 244 | dbc.Col( 245 | dcc.Graph(figure={'data': [go.Scatter(x=[1], y=[1])]}), 246 | width=2, id="hist_graph_2"), 247 | 248 | dbc.Col( 249 | dcc.Graph(figure={'data': [go.Scatter(x=[1], y=[1])]}), 250 | width=2, id="hist_graph_3"), 251 | 252 | dbc.Col(html.Button('Run', id='btn-next'), width=2), 253 | 254 | ], className="h-25"), 255 | ]) 256 | 257 | return layout 258 | 259 | 260 | if __name__ == '__main__': 261 | app = dash.Dash(external_stylesheets=[dbc.themes.BOOTSTRAP]) 262 | 263 | app.layout = get_layout() 264 | 265 | @app.callback( 266 | [dash.dependencies.Output('slider_price_container', 'children'), 267 | dash.dependencies.Output('slider_rating_container', 'children'), 268 | dash.dependencies.Output('map_graph', 'children'), 269 | dash.dependencies.Output('hist_graph_1', 'children'), 270 | dash.dependencies.Output('hist_graph_2', 'children'), 271 | dash.dependencies.Output('hist_graph_3', 'children'), 272 | dash.dependencies.Output('dataframe_output', 'data')], 273 | [dash.dependencies.Input('slider_price_min', 'value'), 274 | 
dash.dependencies.Input('slider_price_max', 'value'), 275 | dash.dependencies.Input('slider_rating_min', 'value'), 276 | dash.dependencies.Input('slider_rating_max', 'value'), 277 | dash.dependencies.Input('btn-next', 'n_clicks')]) 278 | def update_output(val1, val2, val3, val4, val5): 279 | changed_id = [p['prop_id'] for p in dash.callback_context.triggered][0] 280 | if 'slider_price_min' in changed_id: 281 | global min_price 282 | min_price = int(val1) 283 | elif 'slider_price_max' in changed_id: 284 | global max_price 285 | max_price = int(val2) 286 | elif 'slider_rating_min' in changed_id: 287 | global min_rating 288 | min_rating = float(val3) 289 | elif 'slider_rating_max' in changed_id: 290 | global max_rating 291 | max_rating = float(val4) 292 | elif 'btn-next' in changed_id: 293 | print("TODO") 294 | g1, g2, g3, g4, t = draw_data() 295 | return f'Price {min_price} - {max_price} $', \ 296 | f'Rating {min_rating} - {max_rating} Stars', \ 297 | g1, g2, g3, g4, t 298 | 299 | 300 | app.run_server(debug=True) 301 | -------------------------------------------------------------------------------- /scripts/matplotlib_simple_script.py: -------------------------------------------------------------------------------- 1 | # Draw y = cos(x^2) in [-5, 5] 2 | import matplotlib.pyplot as plt 3 | import numpy as np 4 | x = np.arange(-5, 5, 0.01) 5 | y = np.cos(x * x) 6 | plt.plot(x, y, 'g') 7 | plt.show() -------------------------------------------------------------------------------- /scripts/requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.23.1 2 | matplotlib==3.4.2 3 | plotly==5.11.0 4 | pandas==1.2.4 5 | dash_bootstrap_components==1.0.2 6 | dash==2.0.0 7 | dash_core_components==2.0.0 8 | dash_html_components==2.0.0 9 | dash_table==5.0.0 10 | --------------------------------------------------------------------------------
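Both `cl_example.py` and `dash_demo.py` use the same `np.histogram` idiom: the function returns one more bin edge than it returns counts, so adjacent edges are averaged to obtain one center per bin before plotting. A minimal standalone sketch of that pattern (variable names here are illustrative, not taken from the scripts):

```python
import numpy as np

# 1000 samples from a standard normal; fixed seed for reproducibility
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)

# np.histogram returns n_bins counts and n_bins + 1 edges
h, edges = np.histogram(x, bins=10)

# average adjacent edges to get one center per bin
centers = (edges[:-1] + edges[1:]) / 2

print(len(h), len(edges), len(centers))  # 10 11 10
print(int(h.sum()))                      # 1000
```

The `centers` array lines up one-to-one with the counts `h`, which is what makes it usable directly as the x-axis of a `go.Scatter` trace.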