├── requirements.txt
├── binder
│   └── requirements.txt
├── images
│   ├── error.jpg
│   ├── bmi_nan.png
│   ├── diff_hr.PNG
│   ├── lower_hr.PNG
│   ├── upper_hr.PNG
│   ├── conda_install.PNG
│   ├── download_repo.png
│   ├── typesOfData.png
│   ├── wids_heading.png
│   └── img_FunctionDef.PNG
├── README.md
├── Anaconda Installation Guide.ipynb
├── .ipynb_checkpoints
│   ├── Anaconda Installation Guide-checkpoint.ipynb
│   ├── 01_Intro_to_Jupyter-checkpoint.ipynb
│   └── 03_More_DataStructures-checkpoint.ipynb
├── 01_Intro_to_Jupyter.ipynb
└── 03_More_DataStructures.ipynb
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy
2 | matplotlib
3 | pandas
--------------------------------------------------------------------------------
/binder/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy
2 | matplotlib
3 | pandas
--------------------------------------------------------------------------------
/images/error.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keikokamei/WiDS_Datathon_Tutorials/HEAD/images/error.jpg
--------------------------------------------------------------------------------
/images/bmi_nan.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keikokamei/WiDS_Datathon_Tutorials/HEAD/images/bmi_nan.png
--------------------------------------------------------------------------------
/images/diff_hr.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keikokamei/WiDS_Datathon_Tutorials/HEAD/images/diff_hr.PNG
--------------------------------------------------------------------------------
/images/lower_hr.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keikokamei/WiDS_Datathon_Tutorials/HEAD/images/lower_hr.PNG
--------------------------------------------------------------------------------
/images/upper_hr.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keikokamei/WiDS_Datathon_Tutorials/HEAD/images/upper_hr.PNG
--------------------------------------------------------------------------------
/images/conda_install.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keikokamei/WiDS_Datathon_Tutorials/HEAD/images/conda_install.PNG
--------------------------------------------------------------------------------
/images/download_repo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keikokamei/WiDS_Datathon_Tutorials/HEAD/images/download_repo.png
--------------------------------------------------------------------------------
/images/typesOfData.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keikokamei/WiDS_Datathon_Tutorials/HEAD/images/typesOfData.png
--------------------------------------------------------------------------------
/images/wids_heading.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keikokamei/WiDS_Datathon_Tutorials/HEAD/images/wids_heading.png
--------------------------------------------------------------------------------
/images/img_FunctionDef.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keikokamei/WiDS_Datathon_Tutorials/HEAD/images/img_FunctionDef.PNG
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # WiDS_Datathon_Tutorials
2 |
3 |
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/keikokamei/WiDS_Datathon_Tutorials/HEAD)
4 |
5 | **Provided by: WiDS Datathon Team**
6 |
7 |
8 | ### Welcome to your repository of introductory data science tutorials!
9 |
10 | This tutorial aims to introduce basic data science skills and concepts to datathon participants with little to no background in statistics or computer science. In these [Jupyter Notebook](http://jupyter.org/) tutorials, you will learn to use the [Python](https://www.python.org/) programming language to explore fundamental statistical concepts and become familiar with various data values, structures, and manipulation techniques.
11 |
12 | If you do not already have Anaconda installed on your computer, **it is highly recommended that you first follow the "Anaconda Installation Guide" to download and install Anaconda to access your Jupyter Notebooks.** This will allow you to download and work interactively through your own copy of the tutorial notebooks, and it will set you up with an environment for tackling your datathon challenge, should this be the data science platform you choose to use.
13 |
14 | | Notebook                                   | Summary                                                                            |
15 | |--------------------------------------------|------------------------------------------------------------------------------------|
16 | | Anaconda Installation Guide                | Instructions for Anaconda installation & accessing tutorial notebooks on Jupyter   |
17 | | 00_WiDS_Datathon_Upskill_Workshop_10Jan23  | "From Couch to Jupyter" Upskill Workshop interactive notebook                      |
18 | | 01_Intro_to_Jupyter                        | Introduction to Jupyter Notebooks & Python                                         |
19 | | 02_Intro_to_DataStructures                 | Python packages, data structures & basic statistics                                |
20 | | 03_More_DataStructures                     | Dictionaries, Matrices & Pandas data manipulation                                  |
21 |
22 |

23 |
24 | #### Content adapted from:
25 | - Jupyter Notebook modules from the [UC Berkeley Data Science Modules Program](https://ds-modules.github.io/DS-Modules/) licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
26 | - [ESPM-163ac: Lab1-Introduction to Jupyter Notebook](https://github.com/ds-modules/ESPM-163ac/blob/master/Lab1/Lab1_Intro_to_Jupyter.ipynb) by Alleana Clark
27 | - [Data 8X Public Materials for 2022](https://github.com/ds-modules/materials-x22/) by Sean Morris
28 | - [LEGALST-123: Anaconda Installation Guide](https://github.com/ds-modules/LEGALST-190/blob/master/LEGALST-123/Anaconda%20Installation%20Guide.ipynb) by Keiko Kamei
29 | - [Composing Programs](https://www.composingprograms.com/) by John DeNero, based on the textbook [Structure and Interpretation of Computer Programs](https://mitpress.mit.edu/9780262510875/structure-and-interpretation-of-computer-programs/) by Harold Abelson and Gerald Jay Sussman, licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/)
30 |
31 | #### Content of tutorial notebooks includes references to:
32 | - [Computational and Inferential Thinking: The Foundations of Data Science](https://inferentialthinking.com/chapters/intro.html) by [Ani Adhikari](https://statistics.berkeley.edu/people/ani-adhikari), [John DeNero](http://denero.org/), and [David Wagner](https://people.eecs.berkeley.edu/~daw/) licensed under [CC BY-NC-ND 2.0](https://creativecommons.org/licenses/by-nc-nd/2.0/)
33 |
34 |
--------------------------------------------------------------------------------
/Anaconda Installation Guide.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Anaconda Installation & Accessing Tutorial Notebooks\n",
8 | "\n",
9 | "This is a guide to help you properly install and set up [Anaconda](https://www.anaconda.com/), a popular data science
platform. Follow the steps below to install Anaconda, set up your virtual environment, and begin using Jupyter Notebooks!\n",
10 | "\n",
11 | "---\n",
12 | "\n",
13 | "### I. Installation Steps:\n",
14 | "\n",
15 | "1. Go to https://www.anaconda.com/download#downloads\n",
16 | "\n",
17 | "2. Choose the appropriate operating system and download the **Python 3** installer:\n",
18 | "![image](./images/conda_install.PNG)\n",
19 | "\n",
20 | "\n",
21 | "3. Follow the installation instructions\n",
22 | "
\n", 23 | "IMPORTANT: If you are a Windows user, choose to also install Anaconda Prompt when prompted by the installer -- this will be your terminal from which you will activate virtual environments and access your Jupyter Notebook.\n", 24 | "
\n",
25 | "\n",
26 | "\n",
27 | "4. To verify that Anaconda has been installed, open up your terminal and \*run the following command: `conda --version` \n",
28 | "(\*type the exact command as written and hit enter)\n",
29 | "\n",
30 | "If your installation was successful, the output should show `conda` followed by the version number of your Anaconda installation. \n",
31 | "\n",
32 | "\n",
33 | "---\n",
34 | "### II. Set-Up: Creating a Virtual Environment\n",
35 | "\n",
36 | "We create virtual environments to use different versions of applications and libraries. Virtual environments allow you to use isolated Python environments to install different versions of libraries. Here, we will make a new virtual environment called `wids-datathon` and show you how to activate/deactivate it.\n",
37 | "\n",
38 | " \n",
39 | "\n",
40 | "**Steps:**\n",
41 | "1. **Open your Terminal** (for Windows, open Anaconda Prompt) \n",
42 | "2. In your terminal, **run the command**:\n",
43 | " `conda create -n wids-datathon python=3 anaconda` \n",
44 | " You've now created your virtual environment! \n",
45 | "3. To **activate** this virtual environment, run: \n",
46 | "   \- on Mac or Linux: `source activate wids-datathon` \n",
47 | "   \- on Windows: `activate wids-datathon` \n",
48 | "4. To **deactivate** the virtual environment: \n",
49 | "   \- Mac or Linux: `source deactivate` \n",
50 | "   \- Windows: `conda deactivate`\n",
51 | " \n",
52 | " \n",
53 | "Remember to always activate your virtual environment before you install packages or run a notebook. This prevents you from accidentally breaking your root Python/Anaconda installation.\n",
54 | "\n",
55 | "---\n",
56 | "\n",
57 | "### III. Navigating Your Directories\n",
58 | "\n",
59 | "At this point, you can run the command to start your Jupyter Notebook server. However, it will open in your home directory and you will have to click through your folders to find the file you want to open.
To prevent this, you can **navigate to the desired directory first** in the terminal, and then open the server from that directory.\n",
60 | "\n",
61 | "A \"directory\" is just another term for \"folder\" -- your Desktop folder is a directory, as are your Downloads, Documents, and OneDrive folders. All you are doing here is laying out the path you will take from your home directory to whichever folder you want to work from.\n",
62 | "\n",
63 | "Here are some basic commands in the command line:\n",
64 | "\n",
65 | "- `cd `: you can navigate through your directories from your root with the `cd` command by specifying a path to your desired directory.\n",
66 | " - e.g. If your home directory contains your `Desktop` folder, `cd Desktop` takes you from Home to your Desktop directory.\n",
67 | " - e.g. If your `Desktop` folder contains the folder `WiDS`, which contains the folder `Datathon`, the following command from your home directory takes you to the Datathon folder: `cd Desktop/WiDS/Datathon`\n",
68 | "- `cd ..`: this takes you up one level, back to the enclosing directory (called the parent directory)\n",
69 | " - e.g. `WiDS` is the parent folder of `Datathon`, so from the `Datathon` folder, `cd ..` will take you to the `WiDS` folder.\n",
70 | "- `ls` or `dir`: lists all folders/files in the current directory. This is a good way to check, for example, if your current folder contains your Desktop folder.\n",
71 | "\n",
72 | "Now you know how to navigate directories from your command line! Find your desired directory **before** you start the Jupyter Notebook server to avoid clicking through layers of folders. \n",
73 | "\n",
74 | "---\n",
75 | "### IV. Create Your First Notebook!\n",
76 | "\n",
77 | "Anaconda comes with Jupyter Notebook, which is what we will use throughout this tutorial. To create your first notebook:\n",
78 | "\n",
79 | "1. Open your terminal (for Windows users, use Anaconda Prompt)\n",
80 | "\n",
81 | "2. Activate your virtual environment\n",
82 | "\n",
83 | "3. 
Navigate to your desired directory\n", 84 | "\n", 85 | "4. Run the following command: `jupyter notebook`\n", 86 | "\n", 87 | "Your default browser window will open, and you should be in your specified directory. From here, you can create a new notebook, open and edit saved notebooks, and much, much more!\n", 88 | "\n", 89 | "To close the notebook server (and shut down all running notebooks), run the command: `jupyter notebook stop` OR simply hit `Ctrl + c` in your command line.\n", 90 | "\n", 91 | "---\n", 92 | "\n", 93 | "### V. Downloading the WiDS Datathon Tutorial\n", 94 | "\n", 95 | "Now you are ready to download and interact with our tutorial notebooks. To access them, simply:\n", 96 | "1. Go to: https://github.com/keikokamei/WiDS_Datathon_Tutorials\n", 97 | "2. Download a ZIP file of the repository, as pictured below \n", 98 | "3. Unzip the folder into your desired directory\n", 99 | "4. Navigate to this directory in your command line (step 3. from above) and start running your Jupyter Notebook server (step 4.)\n", 100 | "\n", 101 | "![download_repo](./images/download_repo.png) \n", 102 | "\n", 103 | "\n", 104 | "Once your default browser opens up to this directory, you will be able to open up and interact with the tutorial notebooks. Happy datathon prepping! 
:)\n",
105 | "\n",
106 | "---\n",
107 | "\n",
108 | "\n",
109 | "**Content adapted from** a Jupyter Notebook module from the [UC Berkeley Data Science Modules Program](https://ds-modules.github.io/DS-Modules/) licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/): \n",
110 | "- [LEGALST-123: Anaconda Installation Guide](https://github.com/ds-modules/LEGALST-190/blob/master/LEGALST-123/Anaconda%20Installation%20Guide.ipynb) by Keiko Kamei\n",
111 | "\n"
112 | ]
113 | }
114 | ],
115 | "metadata": {
116 | "kernelspec": {
117 | "display_name": "Python 3 (ipykernel)",
118 | "language": "python",
119 | "name": "python3"
120 | },
121 | "language_info": {
122 | "codemirror_mode": {
123 | "name": "ipython",
124 | "version": 3
125 | },
126 | "file_extension": ".py",
127 | "mimetype": "text/x-python",
128 | "name": "python",
129 | "nbconvert_exporter": "python",
130 | "pygments_lexer": "ipython3",
131 | "version": "3.11.5"
132 | }
133 | },
134 | "nbformat": 4,
135 | "nbformat_minor": 2
136 | }
137 |
--------------------------------------------------------------------------------
/01_Intro_to_Jupyter.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Part I: Introduction to Jupyter Notebook\n",
8 | "\n",
9 | "Welcome to the Jupyter Notebook! In this tutorial, we will take you step-by-step through basic coding concepts that we will leverage and build upon in subsequent data analysis tutorials. This is meant to be an introductory notebook, so coding experience is NOT required. We encourage you to work through the material with an experimental mindset - practice and curiosity are the keys to success! Feel free to add additional code blocks to try things out on your own.\n",
10 | "\n",
11 | "## Table of Contents:\n",
12 | "1. [The Jupyter Notebook](#jupyter)\n",
13 | "2. [Expressions](#expr)\n",
14 | "3. 
[Variables](#vars)\n", 15 | "4. [Variables vs. Strings](#str)\n", 16 | "5. [Boolean Values & Expressions](#bool)\n", 17 | "6. [Conditional Statements](#ifs)\n", 18 | "7. [Defining a Function](#func)\n", 19 | "8. [Understanding Errors](#error)\n", 20 | "\n", 21 | "---\n", 22 | "\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 1. The Jupyter Notebook\n", 30 | "\n", 31 | "\n", 32 | "A Jupyter Notebook is divided into what are called *cells*. You can navigate cells by clicking on them or by using the up and down arrows. Cells will be highlighted as you navigate them.\n", 33 | "\n", 34 | "### Markdown cells\n", 35 | "\n", 36 | "Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings. It can also render HTML since Markdown is a superset of HTML; you will often see HTML tags in the Markdown cells of this notebook. You don't need to learn Markdown, but know the difference between Text Cells and Code Cells.\n", 37 | "\n", 38 | "### Code cells\n", 39 | "Other cells, like the one below, contain code in the Python 3 language. The fundamental building block of Python code is an **expression**. Cells can contain multiple lines with multiple expressions. We'll explain what exactly we mean by \"expressions\" in just a moment - first, let's learn how to \"run\" cells." 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "# This is a code cell!\n", 49 | "print(\"Hello, World! \\N{EARTH GLOBE ASIA-AUSTRALIA}\")" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | ">*Note: code cells, like the one above, can contain **`# comments`**. Comments are not code but notes that we can leave inline with the code to help us while writing or reviewing code. 
In Python, you can enter a comment after a pound sign `#` - everything after the `#` will be considered a note.*\n",
57 | "\n",
58 | "\n",
59 | "### Running cells\n",
60 | "\n",
61 | "\"Running a cell\" is equivalent to pressing \"Enter\" on a calculator once you've typed in the expression you want to evaluate: it produces a result. When you run a text cell, it outputs clean, organized writing. When you run a code cell, it **computes** all of the expressions you want to evaluate, and can **output** the result of the computation if there is anything to return.\n",
62 | "\n",
63 | "

\n", 64 | "\n", 65 | "
\n", 66 | "   To run the code in a cell, first click on that cell. It'll become highlighted with a green or blue border. Next, you can either click the Run button above, or press Shift + Return / Shift + Enter. This will run the current cell and select the next one.
\n", 67 | "\n", 68 | "Text cells are useful for taking notes and keeping your notebook organized, but your data analysis will be done in code cells. We will focus on code cells for the remainder of the notebook.\n", 69 | "\n", 70 | "\n", 71 | "**Try running the code cell above, if you haven't already!**\n" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "### Adding / Deleting Cells\n", 79 | "\n", 80 | "You can **add** a cell above or below a currently highlighted cell, by navigating to **`Insert` → `Insert Cell Above`/`Below`**. The default cell type will be a code cell. You can change the cell type by keeping the cell highlighted and navigating to **`Cell` → `Cell Type` → `Code`/`Markdown`**.\n", 81 | "\n", 82 | "Alternatively, you can highlight a cell (without double-clicking into the cell itself) and hit the `a` key to insert a cell above, or `b` key to insert a cell below. To change the cell type, hit `m` while the cell is highlighted to change to a Markdown cell, or hit `y` to change to a code cell.\n", 83 | "\n", 84 | "
Try inserting a code cell...
\n", 85 | "between here ↓:
\n" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "
   ...and here ↑
\n",
93 | "You can **delete** a highlighted cell by going to `Edit` → `Delete Cells`, or simply hitting the `d` keyboard key twice. \n",
94 | "**Try deleting the cell you created just above.**\n",
95 | "\n",
96 | "You can **undo** a cell deletion by going to `Edit` → `Undo Delete Cells`, or simply hitting the `z` keyboard key. \n",
97 | "**Try bringing back the cell you've just deleted.**\n",
98 | "\n",
99 | "Now that we have a basic understanding of how to use a Jupyter Notebook, we'll shift our focus to coding within code cells. \n",
100 | "\n",
101 | "---\n",
102 | "\n"
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 | "\n",
110 | "## 2. Expressions\n",
111 | "\n",
112 | "An expression is a combination of numbers, variables, operators, and/or other Python elements that the language interprets and acts upon. Expressions act as a set of **instructions** to be followed, with the goal of generating specific outcomes.\n",
113 | "\n",
114 | "\n",
115 | "### Arithmetic\n",
116 | "You can start by thinking of code cells as fancy calculators that compute these expressions. 
For instance, code cells can evaluate simple arithmetic:"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "metadata": {},
123 | "outputs": [],
124 | "source": [
125 | "# Run me!\n",
126 | "# This is an expression\n",
127 | "10 + 10"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": null,
133 | "metadata": {},
134 | "outputs": [],
135 | "source": [
136 | "# Run me too!\n",
137 | "# This is another expression\n",
138 | "(10 + 10) / 5"
139 | ]
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "metadata": {},
144 | "source": [
145 | "Below are some basic arithmetic operators that are built into Python:\n",
146 | "\n",
147 | "|Operation|Operator|Example|Result|\n",
148 | "|:-|:-|:-|:-|\n",
149 | "|Addition|`+`|`1 + 2`|`3`|\n",
150 | "|Subtraction|`-`|`1 - 2`|`-1`|\n",
151 | "|Multiplication|`*`|`2 * 3`|`6`|\n",
152 | "|Division|`/`|`10 / 3`|`3.3333`|\n",
153 | "|Remainder|`%`|`10 % 3`|`1`|\n",
154 | "|Exponentiation|`**`|`2 ** 3`|`8`|\n",
155 | "\n",
156 | "The order of operations is the same as we learned in elementary math classes (PEMDAS). Just like in mathematical expressions, parentheses can be used to group together smaller expressions within larger expressions. 
Observe the difference in the results of the following two expressions:" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "# Expression 1\n", 166 | "1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 - 8 - 9 + 10 + 11 + 12" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "# Expression 2\n", 176 | "1 + 2 * (3 * 4 * 5 / 6) ** 3 + 7 - 8 - 9 + 10 + 11 + 12" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "This is what they would look like in standard notation:\n", 184 | "\n", 185 | "Expression 1: $1 + 2 \\times 3 \\times 4 \\times 5 \\div 6^3 + 7 - 8 - 9 + 10 + 11 + 12$\n", 186 | "\n", 187 | "Expression 2: $1 + 2 \\times (\\frac{3 \\times 4 \\times 5}{6})^3 + 7 - 8 - 9 + 10 + 11 + 12$" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "### Call Expressions\n", 195 | "\n", 196 | "Another important type of expression is the **call expression**. A call expression \"calls\" on a **function** to be executed on specified input value(s), and often returns a value depending on these inputs. We call the values we put into functions the **arguments** of a function. 
As we'll discuss more later on, a **function** is a computational process that is given a name, so that the process can easily be used.\n",
197 | "\n",
198 | "Here are some commonly used mathematical functions:\n",
199 | "\n",
200 | "|Function|Example|Value|Description|\n",
201 | "|:-:|:-|:-:|:-|\n",
202 | "|`abs`|`abs(-5)`|`5`| Takes the absolute value of the argument|\n",
203 | "|`max`|`max(5, 13, -9, 2)`|`13`| Finds the maximum value of all arguments|\n",
204 | "|`min`|`min(5, 13, -9, 2)`|`-9`| Finds the minimum value of all arguments|\n",
205 | "|`round`|`round(5.435)`|`5`| Rounds its argument to the nearest integer|\n",
206 | "\n",
207 | "Here are two call expressions that both evaluate to 3:"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": null,
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "abs(2 - 5)"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": null,
222 | "metadata": {},
223 | "outputs": [],
224 | "source": [
225 | "min(round(4.32), max(2, abs(3-4) + round(5/3)), 7)"
226 | ]
227 | },
228 | {
229 | "cell_type": "markdown",
230 | "metadata": {},
231 | "source": [
232 | "## 3. Variables\n",
233 | "\n",
234 | "---\n",
235 | "\n",
236 | "In natural language, we have terminology that lets us quickly reference very complicated concepts. We don't say, \"That's a large mammal with brown fur and sharp teeth!\" Instead, we just say, \"Bear!\"\n",
237 | "\n",
238 | "In Python, we do this with assignment statements. An assignment statement has a name on the left side of an `=` sign and an expression to be evaluated on the right. The name we assign to the expression is called a **variable**. Just like in your standard algebra class, you can assign the letter `x` to be 10. You can assign the letter `y` to be 5. You can then add variables `x` and `y` to get 15."
239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": {}, 245 | "outputs": [], 246 | "source": [ 247 | "#this won't output anything: you're just telling the cell to set x to 10\n", 248 | "x = 10" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "y = 5" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "#This will output the answer to the addition: you're asking it to compute the number\n", 267 | "x + y" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "In algebra class, you were limited to using the letters of the alphabet as variable names. Here, you can use any combination of words **as long as there are no spaces in the names:**" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "quarter = 1/4\n", 284 | "quarter" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "A previously assigned name can be used in the expression to the right of `=`. Python evaluates the expression to the right of `=` first, then assigns the resulting value to the variable." 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "half = 2 * quarter\n", 301 | "half" 302 | ] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "You can **redefine** variables you have used before to hold **new** values; in other words, you can overwrite the old values. 
If you run the following cell, `x` and `y` will now hold different values:" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "# You can put all of the different expressions above into one code cell.\n", 318 | "# When you run this code cell, everything will be evaluated in order, from top to bottom.\n", 319 | "x = 3\n", 320 | "y = 8\n", 321 | "x_plus_y = x + y\n", 322 | "x_plus_y" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "However, variables defined in terms of another variable (e.g. `half` defined in terms of `quarter`) will **not** change automatically just because a variable in a previously evaluated expression has later changed its value:" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | "# even though `quarter` now carries a new value, the value of `half` does not automatically change \n", 339 | "# here we set `quarter` to a new value\n", 340 | "quarter = 4\n", 341 | "\n", 342 | "# and return the value of `half`\n", 343 | "half" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "Recall that when `half` was defined earlier, `quarter` was assigned the value 0.25. Because expressions to the right of `=` are evaluated before variable assignment, `2 * quarter` was evaluated to 0.5, and `half` was assigned this value of 0.5. It does not remain dependent on the variable name `quarter`. \n", 351 | "\n", 352 | "So, even though `quarter` later changed its value, it doesn't change the fact that `half` was assigned 0.5, and `half` continues to represent 0.5." 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "\n", 360 | "\n", 361 | "
You Try: What should be the answer to \"6 times half plus x_plus_y\"?
" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "metadata": {}, 368 | "outputs": [], 369 | "source": [ 370 | "# YOUR CODE HERE" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "Does your answer make sense?" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "## 4. Variables vs. Strings\n", 385 | "\n", 386 | "---\n", 387 | "\n", 388 | "In the section above, we understood that `quarter` is a variable to which we have assigned a numeric value. However, as soon as we put quotes around the word, Python understands it as an entirely different object - _\"quarter\"_ is now a piece of textual data, or a **string**. A string is a type of value, just like numbers are values, that is made up of a sequence of characters. Strings can represent a single character, a word, a sentence, or the contents of an entire book.\n" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": null, 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "# This is a string, not a variable\n", 398 | "\"quarter\"" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": null, 404 | "metadata": {}, 405 | "outputs": [], 406 | "source": [ 407 | "# Another string\n", 408 | "\"Woohoo!\"" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": null, 414 | "metadata": {}, 415 | "outputs": [], 416 | "source": [ 417 | "\"Strings can capture long bodies of text\"" 418 | ] 419 | }, 420 | { 421 | "cell_type": "markdown", 422 | "metadata": {}, 423 | "source": [ 424 | "The meaning of an expression depends both upon its structure and the types of values that are being combined. So, for instance, adding two strings together produces another string. 
This expression is still an addition, but the result of adding strings is different from the result of adding numbers:" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "# I output a combined string:\n", 434 | "\"123\" + \"456\"" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": null, 440 | "metadata": {}, 441 | "outputs": [], 442 | "source": [ 443 | "# I output the result of adding two numbers:\n", 444 | "123 + 456" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "What if we try to type a random word in a code cell **without** putting it in quotes?" 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "metadata": {}, 458 | "outputs": [], 459 | "source": [ 460 | "#This will Error!\n", 461 | "Woohoo" 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "It throws an error! Why? Because code cells will consider any word **not** in quotes to be a **Python object**, like a variable, that stores some sort of information. In this notebook, we haven't told it what `Woohoo` means -- the name has never been assigned a value, so Python complains and says, \"I don't know what `Woohoo` is supposed to be.\"" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "## 5. Boolean Values & Expressions\n", 476 | "---\n", 477 | "\n", 478 | "A Boolean is another data type, and it can carry one of only two values - `True` or `False`. 
They often arise when two values are compared against each other:\n" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [ 487 | "# expression: '10 is greater than 1'\n", 488 | "10 > 1" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": null, 494 | "metadata": {}, 495 | "outputs": [], 496 | "source": [ 497 | "# expression: '10 is equal to 1'\n", 498 | "10 == 1" 499 | ] 500 | }, 501 | { 502 | "cell_type": "markdown", 503 | "metadata": {}, 504 | "source": [ 505 | "The value `True` indicates that the statement is accurate. In the above, Python has confirmed the simple statement that 10 is greater than 1; and that 10 is not, in fact, equal to 1.\n", 506 | "\n", 507 | "Here are some common comparison operators:\n", 508 | "\n", 509 | "|Operation|Operator|Result: True|Result: False|\n", 510 | "|-|:-:|:-:|:-:|\n", 511 | "|Equal to|==|1.3 == 1.3|1.3 == 1|\n", 512 | "|Not equal to|!=|1.3 != 1|1 != 1|\n", 513 | "|Less than|<|5 < 10|5 < 5|\n", 514 | "|Less than or equal|<=|5 <= 5|10 <= 5|\n", 515 | "|Greater than|>|10 > 5|5 > 10|\n", 516 | "|Greater or equal|>=|5 >= 5|5 >= 10|\n"
You Try: An apple, a lemon, and a pack of strawberries cost \$0.79, \$0.24, and \$3.49, respectively. If I set out to purchase 2 apples, 6 lemons, and 2 packs of strawberries, would \$10 be enough to cover the total cost? Use a comparison operator to output a True or False answer below:
" 524 | ] 525 | }, 526 | { 527 | "cell_type": "code", 528 | "execution_count": null, 529 | "metadata": {}, 530 | "outputs": [], 531 | "source": [ 532 | "apple = 0.79\n", 533 | "lemon = 0.24\n", 534 | "strawberries = 3.49\n", 535 | "\n", 536 | "# YOUR CODE HERE\n", 537 | "# ..." 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "metadata": {}, 543 | "source": [ 544 | "### Boolean Value of Other Data Types \n", 545 | "\n", 546 | "Values of other data types can also be interpreted as boolean values. For instance:\n", 547 | " \n", 548 | "- Numbers:
\n", 549 | "    `0` is evaluated to: `False` \n", 550 | "    any non-zero number is evaluated to: `True`\n", 551 | "
\n", 552 | " \n", 553 | "- Strings:
\n", 554 | "    `''` (an empty string) is evaluated to: `False` \n", 555 | "    any non-empty string is evaluated to: `True` \n", 556 | "\n" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "### Boolean Operators: And, Or, Not\n", 564 | "\n", 565 | "We can combine multiple True/False expressions to compose a broader True/False statement, much like we do in natural conversation: \n", 566 | ">Question:  \"You played tennis **AND** went running today?\" \n", 567 | ">Answer:    \"**No,** I didn't play tennis today.\" \n", 568 | ">Meaning: *The statement is False, because I didn't do both those things*\n", 569 | "\n", 570 | ">Question:  \"Do you want me to make you a smoothie **or** a milkshake?\" \n", 571 | ">Answer:    \"**Yes,** please; a smoothie sounds fantastic!\" \n", 572 | ">Meaning: *The statement is True, because I want at least one of those things*\n", 573 | "\n", 574 | "The question/answer format doesn't work well for the case of negation, but luckily this one is quite straightforward - putting a `not` in front of a True/False expression negates the expression - \"not True\" means False, and \"not False\" means True. \n", 575 | ">Simply:  \"That is **not** True.\" \n", 576 | ">Expression being negated: \"That is True\" \n", 577 | ">Meaning of the statement: \"That is False\"\n", 578 | "\n", 579 | "
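\n", "For example, the tennis exchange above can be written directly in Python (a small sketch; the variable names are our own):\n", "\n", "```python\n", "played_tennis = True\n", "went_running = False\n", "\n", "played_tennis and went_running   # False - we did not do both\n", "played_tennis or went_running    # True - we did at least one\n", "```\n", "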
\n", 580 | "\n", 581 | "Let's explore these in the code cells below:\n", 582 | "\n", 583 | "**1. `and, &` Operator:** \n", 584 | ">Evaluates to `True` if and only if all expressions evaluate to `True` - in other words: \n", 585 | ">    - Statement is True if **no expression** evaluates to False \n", 586 | ">    - Statement is False if **at least 1 expression** evaluates to False" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": null, 592 | "metadata": {}, 593 | "outputs": [], 594 | "source": [ 595 | "# False, because there is 1 False expression\n", 596 | "True and True and True and True and True and False and True" 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": null, 602 | "metadata": {}, 603 | "outputs": [], 604 | "source": [ 605 | "# first expression is false, so it is not the case that both expr1 and expr2 are True,\n", 606 | "# so this evaluates to False\n", 607 | "\n", 608 | "# False and True --> False\n", 609 | "(10<5) and (4%2==0)" 610 | ] 611 | }, 612 | { 613 | "cell_type": "code", 614 | "execution_count": null, 615 | "metadata": {}, 616 | "outputs": [], 617 | "source": [ 618 | "# first AND second AND third expressions are all True --> True\n", 619 | "# True and True and True\n", 620 | "(7%2!=0) & (5-9<0) & (9==3+7-1)" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "**2. 
`or, |` Operator:** \n", 628 | ">If one expression is `True`, then the entire expression evaluates to `True` - in other words: \n", 629 | ">    - Statement is True if **at least 1 expression** evaluates to True \n", 630 | ">    - Statement is False if **no expression** evaluates to True \n" 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": null, 636 | "metadata": {}, 637 | "outputs": [], 638 | "source": [ 639 | "# True, as long as there is 1 True statement\n", 640 | "False or False or False or False or False or True or False" 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": null, 646 | "metadata": {}, 647 | "outputs": [], 648 | "source": [ 649 | "# neither left nor right expressions are True; overall expression is False\n", 650 | "# False or False --> False\n", 651 | "(10<5) or (8==9)" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": null, 657 | "metadata": {}, 658 | "outputs": [], 659 | "source": [ 660 | "# second expression evaluates to True, so overall expression is True\n", 661 | "# False or True --> True\n", 662 | "(10<5) | (4%2==0)" 663 | ] 664 | }, 665 | { 666 | "cell_type": "markdown", 667 | "metadata": {}, 668 | "source": [ 669 | "**3. 
`not, ~` Operator:** \n", 670 | ">Negates the expression: \n", 671 | ">    - if the expression is True, `not` turns the expression to False \n", 672 | ">    - if the expression is False, `not` turns the expression to True\n" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": null, 678 | "metadata": {}, 679 | "outputs": [], 680 | "source": [ 681 | "not True" 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": null, 687 | "metadata": {}, 688 | "outputs": [], 689 | "source": [ 690 | "not False" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": null, 696 | "metadata": {}, 697 | "outputs": [], 698 | "source": [ 699 | "# not False --> True\n", 700 | "not 10<5" 701 | ] 702 | }, 703 | { 704 | "cell_type": "code", 705 | "execution_count": null, 706 | "metadata": {}, 707 | "outputs": [], 708 | "source": [ 709 | "# evaluates to False - note that ~10 is -11, and -11 > 5 is False\n", 710 | "~10>5" 711 | ] 712 | }, 713 | { 714 | "cell_type": "markdown", 715 | "metadata": {}, 716 | "source": [ 717 | "
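*Note: `~`, `&`, and `|` are technically Python's bitwise operators, while `not`, `and`, and `or` are the logical ones. On plain `True`/`False` values they usually agree, but on other integers `~` flips bits rather than truth values, which can surprise you. A small sketch:\n", "\n", "```python\n", "not (10 > 5)   # False: `not` flips the boolean True\n", "~(10 > 5)      # -2: `~` flips the bits of True (the integer 1)\n", "~10 > 5        # False, but only because ~10 is -11, and -11 > 5 is False\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "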
\n", 718 | "   Order of Operations:
\n", 719 | " Just like mathematical operations, boolean operations have an order of operations - from highest to lowest priority, the order of evaluation is:
\n", 720 | "     NOT, AND, then OR\n", 721 | " \n", 722 | "
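\n", "For instance, with plain booleans (a quick sketch):\n", "\n", "```python\n", "not False and False      # `not` binds first: True and False --> False\n", "True or False and False  # `and` before `or`: True or False --> True\n", "```\n", "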
\n", 723 | "\n", 724 | "The expression below is evaluated in the following steps:\n", 725 | "\n", 726 | "```\n", 727 | "(10==1) or ~(10<5) and (4%2==0) \n", 728 | " False or ~False and True # 1. evaluate inside parentheses first\n", 729 | " False or True and True # 2. then NOT (~)\n", 730 | " False or True # 3. then AND\n", 731 | " True # 4. then OR\n", 732 | "\n", 733 | "```" 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": null, 739 | "metadata": {}, 740 | "outputs": [], 741 | "source": [ 742 | "# evaluates to True\n", 743 | "(10==1) or ~(10<5) and (4%2==0)" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "
\n", 751 | "   Negating \"Or\" / \"And\" Statements:
\n", 752 | " Negation has the following effects on AND and OR statements:
\n", 753 | " \n", 754 | "- not (A or B) is equivalent to (not A) AND (not B) \n", 755 | "- not (A AND B) is equivalent to (not A) OR (not B) \n", 756 | " \n", 757 | "
" 758 | ] 759 | }, 760 | { 761 | "cell_type": "code", 762 | "execution_count": null, 763 | "metadata": {}, 764 | "outputs": [], 765 | "source": [ 766 | "# 1a. \"not (A or B)\" is the same as saying \"(not A) and (not B)\"\n", 767 | "not((10<5) | (4%2==0))" 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": null, 773 | "metadata": {}, 774 | "outputs": [], 775 | "source": [ 776 | "# 1b. this is the same as 1a.\n", 777 | "(not(10<5)) & (not(4%2==0))" 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": null, 783 | "metadata": {}, 784 | "outputs": [], 785 | "source": [ 786 | "# 2a. \"not (A and B)\" is the same as saying \"(not A) OR (not B)\"\n", 787 | "not((10<5) & (4%2==0))" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": null, 793 | "metadata": {}, 794 | "outputs": [], 795 | "source": [ 796 | "# 2b. same as 2a\n", 797 | "(not(10<5)) | (not(4%2==0))" 798 | ] 799 | }, 800 | { 801 | "cell_type": "markdown", 802 | "metadata": {}, 803 | "source": [ 804 | "## 6. Conditional Statements\n", 805 | "\n", 806 | "  If-Statements\n", 807 | "\n", 808 | "Boolean expressions are very useful when writing code, because they let you determine what course of action to take for a given scenario. One way to do this is by using **if-statements**, which allow you to execute an action only if a certain condition is met (if the condition evaluates to `True`). If the condition is not met - if it evaluates to `False` - then Python will ignore the instructions for the action altogether, and move on.\n", 809 | "\n", 810 | "```python\n", 811 | "if condition_1: # if condition_1 is true, execute action_1\n", 812 | " action_1\n", 813 | "```\n", 814 | "\n", 815 | "Notice that there is a colon `:` at the end of the if-statement, and that the action to be executed is **indented** underneath the statement. 
The colon completes the \"if-statement\" - it marks the end of the condition it is specifying; and the indentation indicates that everything captured in the indented block underneath the if-statement should **only be executed if the condition is met**. In Python, indentation determines when blocks of code should be run.\n", 816 | "\n", 817 | "The example below contains 2 if-statements. 3 actions are coded, but only 2 will be executed. Before running the cell, can you identify which of the 3 phrases will be printed?" 818 | ] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "execution_count": null, 823 | "metadata": {}, 824 | "outputs": [], 825 | "source": [ 826 | "# Cost of fruits, from before\n", 827 | "apple = 0.79\n", 828 | "lemon = 0.24\n", 829 | "strawberries = 3.49\n", 830 | "\n", 831 | "if apple < 1.00:\n", 832 | " print(\"Action 1: What a bargain!\") # action 1\n", 833 | "\n", 834 | "if strawberries >= 6.00:\n", 835 | " print(\"Action 2: Boy, these are expensive...\") # action 2\n", 836 | "\n", 837 | "print(\"Action 3: When will this be printed?\") # action 3" 838 | ] 839 | }, 840 | { 841 | "cell_type": "markdown", 842 | "metadata": {}, 843 | "source": [ 844 | "
Question: What would happen if we add an indent before the last print statement (\"action 3\"), and re-run the code block? Why?
" 845 | ] 846 | }, 847 | { 848 | "cell_type": "markdown", 849 | "metadata": {}, 850 | "source": [ 851 | "*Your answer here*" 852 | ] 853 | }, 854 | { 855 | "cell_type": "markdown", 856 | "metadata": {}, 857 | "source": [ 858 | "  \"Else if\" & \"Else\" Statements\n", 859 | "\n", 860 | "Python will always evaluate an if-statement. However, sometimes we want to evaluate other conditions **only if a prior condition has not been met**. This can be achieved by following an if-statement with **`elif`** (\"else, if\") and/or **`else`** statements:\n", 861 | "\n", 862 | "```python\n", 863 | "if (condition_1): # if condition_1 is true, execute action_1\n", 864 | " action_1\n", 865 | " \n", 866 | "elif (condition_2): # \"else, if\": if condition_1 is not true, then if condition_2 is true, execute action_2\n", 867 | " action_2\n", 868 | " \n", 869 | "else: # all other cases - if neither condition_1 nor condition_2 is true, execute action_3\n", 870 | " action_3\n", 871 | "```\n", 872 | "\n", 873 | "There can be multiple **`elif`** statements after an **`if`**-statement and before an **`else`**-statement, but there can only be one **`else`** statement. In this set-up, once a condition is satisfied, Python will execute the respective action, and ignore all other succeeding conditional statements.\n", 874 | "\n", 875 | "
Question 1: \n", 876 | "Let's say a pack of strawberries now costs $6.50. What would you expect the logic from above to output now?
" 877 | ] 878 | }, 879 | { 880 | "cell_type": "markdown", 881 | "metadata": {}, 882 | "source": [ 883 | "*Your answer here*" 884 | ] 885 | }, 886 | { 887 | "cell_type": "code", 888 | "execution_count": null, 889 | "metadata": {}, 890 | "outputs": [], 891 | "source": [ 892 | "strawberries = 6.50\n", 893 | "\n", 894 | "## same code as above\n", 895 | "if apple < 1.00:\n", 896 | " print(\"Action 1: What a bargain!\")\n", 897 | "\n", 898 | "if strawberries >= 6.00:\n", 899 | " print(\"Action 2: Boy, these are expensive...\")" 900 | ] 901 | }, 902 | { 903 | "cell_type": "markdown", 904 | "metadata": {}, 905 | "source": [ 906 | "
Question 2: \n", 907 | " The code above uses two if-statements. What happens if we change the second if to an elif? Why?
" 908 | ] 909 | }, 910 | { 911 | "cell_type": "markdown", 912 | "metadata": {}, 913 | "source": [ 914 | "*Your answer here*" 915 | ] 916 | }, 917 | { 918 | "cell_type": "code", 919 | "execution_count": null, 920 | "metadata": {}, 921 | "outputs": [], 922 | "source": [ 923 | "if apple < 1.00:\n", 924 | " print(\"Action 1: What a bargain!\")\n", 925 | "\n", 926 | "elif strawberries >= 6.00:\n", 927 | " print(\"Action 2: Boy, these are expensive...\")" 928 | ] 929 | }, 930 | { 931 | "cell_type": "markdown", 932 | "metadata": {}, 933 | "source": [ 934 | "---\n", 935 | "\n", 936 | "## 7. Defining a Function\n", 937 | "\n", 938 | "Functions are useful when you want to repeat a series of steps on multiple different objects, but don't want to type out the steps over and over again. Many functions are built into Python already, as we've already seen in the section on call expressions. In this section, we'll discuss how to **write and name our own functions**.\n", 939 | "\n", 940 | "Recall that when we call on a function, we must often provide one or more input values, or **arguments**, for the function to operate on. When we define a function, we need a way to let the function know what to do with which argument. We do this by setting up parameters for the function - **parameters** can be thought of as placeholder variables that are waiting to be assigned values, which happens when the function is called upon with specific arguments.\n", 941 | " \n", 942 | "Below is our first example, found in the UC Berkeley [Inferential Thinking](http://www.data8.org/zero-to-data-8/textbook.html) Textbook by Ani Adhikari and John DeNero:\n", 943 | "\n", 944 | " \n", 945 | ">The definition of the `double` function below simply doubles a number." 
946 | ] 947 | }, 948 | { 949 | "cell_type": "code", 950 | "execution_count": null, 951 | "metadata": {}, 952 | "outputs": [], 953 | "source": [ 954 | "# Our first function definition\n", 955 | "def double(x):\n", 956 | " \"\"\"Double x\"\"\"\n", 957 | " return 2*x" 958 | ] 959 | }, 960 | { 961 | "cell_type": "markdown", 962 | "metadata": {}, 963 | "source": [ 964 | ">We start any function definition by writing `def`. Here is a breakdown of the other parts (the *syntax*) of this small function:\n", 965 | "\n", 966 | "\n", 967 | "\n", 968 | "\n", 969 | "\n", 970 | ">When we run the cell above, no particular number is doubled, and the code inside the body of `double` is not yet evaluated. In this respect, our function is analogous to a *recipe*. Each time we follow the instructions in a recipe, we need to start with ingredients. Each time we want to use our function to double a number, we need to specify a number.1\n", 971 | "\n", 972 | "---\n", 973 | "\n", 974 | "
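Once `double` has been defined, it can be called just like a built-in function - for example:\n", "\n", "```python\n", "double(5)   # evaluates to 10\n", "```\n", "\n", "---\n", "\n", "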
You Try! \n", 975 | "The following function, `add_two`, has been set up with a more thorough docstring. Fill in the ... below with an expression that would satisfy the function's description.
" 976 | ] 977 | }, 978 | { 979 | "cell_type": "code", 980 | "execution_count": null, 981 | "metadata": {}, 982 | "outputs": [], 983 | "source": [ 984 | "def add_two(number):\n", 985 | " \"\"\"Adds 2 to the input.\n", 986 | " \n", 987 | " Parameters\n", 988 | " ----------\n", 989 | " number:\n", 990 | " The given number that 2 will be added to.\n", 991 | " \n", 992 | " Returns\n", 993 | " -------\n", 994 | " A number which is 2 greater than the original input.\n", 995 | " \n", 996 | " Example\n", 997 | " -------\n", 998 | " >>> add_two(4)\n", 999 | " 6\n", 1000 | " \"\"\"\n", 1001 | " return # ... YOUR CODE HERE" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "markdown", 1006 | "metadata": {}, 1007 | "source": [ 1008 | "Given what you understand from the docstring, what do you think this function does? Run the cells below to test it out:" 1009 | ] 1010 | }, 1011 | { 1012 | "cell_type": "code", 1013 | "execution_count": null, 1014 | "metadata": { 1015 | "scrolled": true 1016 | }, 1017 | "outputs": [], 1018 | "source": [ 1019 | "add_two(3)" 1020 | ] 1021 | }, 1022 | { 1023 | "cell_type": "code", 1024 | "execution_count": null, 1025 | "metadata": {}, 1026 | "outputs": [], 1027 | "source": [ 1028 | "add_two(-1)" 1029 | ] 1030 | }, 1031 | { 1032 | "cell_type": "markdown", 1033 | "metadata": {}, 1034 | "source": [ 1035 | "Functions often take advantage of conditional statements to carry out different procedures given different input values. 
In the example below, where we define our own absolute value function, notice how negative and non-negative arguments are handled differently: " 1036 | ] 1037 | }, 1038 | { 1039 | "cell_type": "code", 1040 | "execution_count": null, 1041 | "metadata": {}, 1042 | "outputs": [], 1043 | "source": [ 1044 | "def absolute_value_of(number):\n", 1045 | " \"\"\"Finds the absolute value of the input.\n", 1046 | " \n", 1047 | " Parameters\n", 1048 | " ----------\n", 1049 | " number:\n", 1050 | " Input value\n", 1051 | " \n", 1052 | " Returns\n", 1053 | " -------\n", 1054 | " The absolute value of the input number\n", 1055 | " \n", 1056 | " Example\n", 1057 | " -------\n", 1058 | " >>> absolute_value_of(-5)\n", 1059 | " 5 \n", 1060 | " \"\"\"\n", 1061 | "\n", 1062 | " if number < 0:\n", 1063 | " number = number * -1\n", 1064 | " \n", 1065 | " return number" 1066 | ] 1067 | }, 1068 | { 1069 | "cell_type": "markdown", 1070 | "metadata": {}, 1071 | "source": [ 1072 | "## 8. Understanding Errors\n", 1073 | "Python is a language, and like natural human languages, it has rules. It differs from natural language in two important ways:\n", 1074 | "1. The rules are **simple**. You can learn most of them in a few weeks and gain reasonable proficiency with the language in just a few months.\n", 1075 | "2. The rules are **rigid**. If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes. A computer running Python code is **not** smart enough to do that.\n", 1076 | "\n", 1077 | "Whenever you write code, you'll inevitably make mistakes. When you run a code cell that has errors, Python will usually produce error messages to tell you what you did wrong.\n", 1078 | "\n", 1079 | "Errors are okay; even experienced programmers make many errors. When you make an error, you just have to find the source of the problem, fix it, and move on.\n", 1080 | "\n", 1081 | "We have made an error in the next cell. Run it and see what happens." 
 1082 | ] 1083 | }, 1084 | { 1085 | "cell_type": "code", 1086 | "execution_count": null, 1087 | "metadata": {}, 1088 | "outputs": [], 1089 | "source": [ 1090 | "print(\"This line is missing something.\"" 1091 | ] 1092 | }, 1093 | { 1094 | "cell_type": "markdown", 1095 | "metadata": {}, 1096 | "source": [ 1097 | "We can break down the error message as follows:\n", 1098 | "\n", 1099 | "![error](./images/error.jpg)\n", 1100 | "\n", 1101 | "Fix this error in the cell below:" 1102 | ] 1103 | }, 1104 | { 1105 | "cell_type": "code", 1106 | "execution_count": null, 1107 | "metadata": {}, 1108 | "outputs": [], 1109 | "source": [ 1110 | "#Your Answer Here\n", 1111 | "..." 1112 | ] 1113 | }, 1114 | { 1115 | "cell_type": "markdown", 1116 | "metadata": {}, 1117 | "source": [ 1118 | "---\n", 1119 | "\n", 1120 | "**Congratulations!** You have completed the introduction to Jupyter Notebooks tutorial! In the next tutorial, we will use these skills to further explore data structures like lists and arrays, and statistical concepts like percentiles, histograms, and standard deviations.\n", 1121 | "\n", 1122 | "---\n", 1123 | "\n", 1124 | "\n", 1125 | "#### Content adapted from: \n", 1126 | "- Jupyter Notebook modules from the [UC Berkeley Data Science Modules Program](https://ds-modules.github.io/DS-Modules/) licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)\n", 1127 | " - [ESPM-163ac: Lab1-Introduction to Jupyter Notebook](https://github.com/ds-modules/ESPM-163ac/blob/master/Lab1/Lab1_Intro_to_Jupyter.ipynb) by Alleana Clark\n", 1128 | " - [Data 8X Public Materials for 2022](https://github.com/ds-modules/materials-x22/) by Sean Morris\n", 1129 | "- [Composing Programs](https://www.composingprograms.com/) by John DeNero based on the textbook [Structure and Interpretation of Computer Programs](https://mitpress.mit.edu/9780262510875/structure-and-interpretation-of-computer-programs/) by Harold Abelson and Gerald Jay Sussman, licensed under [CC 
BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) \n", 1130 | " \n", 1131 | "\n", 1132 | "#### Citations:\n", 1133 | "\n", 1134 | "\n", 1135 | "1. Ani Adhikari, et al, “8. Functions and Tables,” Computational and Inferential Thinking: The Foundations of Data Science, accessed 15 August 2023, https://inferentialthinking.com/chapters/08/Functions_and_Tables.html. \n" 1136 | ] 1137 | } 1138 | ], 1139 | "metadata": { 1140 | "anaconda-cloud": {}, 1141 | "kernelspec": { 1142 | "display_name": "Python 3 (ipykernel)", 1143 | "language": "python", 1144 | "name": "python3" 1145 | }, 1146 | "language_info": { 1147 | "codemirror_mode": { 1148 | "name": "ipython", 1149 | "version": 3 1150 | }, 1151 | "file_extension": ".py", 1152 | "mimetype": "text/x-python", 1153 | "name": "python", 1154 | "nbconvert_exporter": "python", 1155 | "pygments_lexer": "ipython3", 1156 | "version": "3.11.5" 1157 | } 1158 | }, 1159 | "nbformat": 4, 1160 | "nbformat_minor": 2 1161 | } 1162 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/01_Intro_to_Jupyter-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Part I: Introduction to Jupyter Notebook\n", 8 | "\n", 9 | "Welcome to the Jupyter Notebook! In this tutorial, we will take you step-by-step through basic coding concepts that we will leverage and build upon in subsequent data analysis tutorials. This is meant to be an introductory notebook, so coding experience is NOT required. We encourage you to work through the material with an experimental mindset - practice and curiosity are the keys to success! Feel free to add additional code blocks to try things out on your own.\n", 10 | "\n", 11 | "## Table of Contents:\n", 12 | "1. [The Jupyter Notebook](#jupyter)\n", 13 | "2. [Expressions](#expr)\n", 14 | "3. [Variables](#vars)\n", 15 | "4. 
[Variables vs. Strings](#str)\n", 16 | "5. [Boolean Values & Expressions](#bool)\n", 17 | "6. [Conditional Statements](#ifs)\n", 18 | "7. [Defining a Function](#func)\n", 19 | "8. [Understanding Errors](#error)\n", 20 | "\n", 21 | "---\n", 22 | "\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 1. The Jupyter Notebook\n", 30 | "\n", 31 | "\n", 32 | "A Jupyter Notebook is divided into what are called *cells*. You can navigate cells by clicking on them or by using the up and down arrows. Cells will be highlighted as you navigate them.\n", 33 | "\n", 34 | "### Markdown cells\n", 35 | "\n", 36 | "Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings. It can also render HTML since Markdown is a superset of HTML; you will often see HTML tags in the Markdown cells of this notebook. You don't need to learn Markdown, but know the difference between Text Cells and Code Cells.\n", 37 | "\n", 38 | "### Code cells\n", 39 | "Other cells, like the one below, contain code in the Python 3 language. The fundamental building block of Python code is an **expression**. Cells can contain multiple lines with multiple expressions. We'll explain what exactly we mean by \"expressions\" in just a moment - first, let's learn how to \"run\" cells." 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "# This is a code cell!\n", 49 | "print(\"Hello, World! \\N{EARTH GLOBE ASIA-AUSTRALIA}\")" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | ">*Note: code cells, like the one above, can contain **`# comments`**. Comments are not code but notes that we can leave inline with the code to help us while writing or reviewing code. 
In Python, you can enter a comment after a pound sign `#` - everything after the `#` will be considered a note.*\n", 57 | "\n", 58 | "\n", 59 | "### Running cells\n", 60 | "\n", 61 | "\"Running a cell\" is equivalent to pressing \"Enter\" on a calculator once you've typed in the expression you want to evaluate: it produces a result. When you run a text cell, it outputs clean, organized writing. When you run a code cell, it **computes** all of the expressions you want to evaluate, and can **output** the result of the computation if there is anything to return.\n", 62 | "\n", 63 | "
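One detail worth knowing up front: when a code cell contains several lines, every line is evaluated, but by default Jupyter displays only the value of the *last* expression as the cell's output (anything passed to `print` is shown separately). A small sketch of this behavior:

```python
# All three lines run when the cell is executed.
print("this line is printed")   # print() output appears on its own line
2 + 2                           # evaluated, but its value is not displayed
3 + 3                           # last expression: its value, 6, is the cell's output
```

This is why some cells below show a result with no `print` call at all.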

\n", 64 | "\n", 65 | "
\n", 66 | "   To run the code in a cell, first click on that cell. It'll become highlighted with a green or blue border. Next, you can either click the Run button above, or press Shift + Return / Shift + Enter. This will run the current cell and select the next one.
\n", 67 | "\n", 68 | "Text cells are useful for taking notes and keeping your notebook organized, but your data analysis will be done in code cells. We will focus on code cells for the remainder of the notebook.\n", 69 | "\n", 70 | "\n", 71 | "**Try running the code cell above, if you haven't already!**\n" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "### Adding / Deleting Cells\n", 79 | "\n", 80 | "You can **add** a cell above or below a currently highlighted cell, by navigating to **`Insert` → `Insert Cell Above`/`Below`**. The default cell type will be a code cell. You can change the cell type by keeping the cell highlighted and navigating to **`Cell` → `Cell Type` → `Code`/`Markdown`**.\n", 81 | "\n", 82 | "Alternatively, you can highlight a cell (without double-clicking into the cell itself) and hit the `a` key to insert a cell above, or `b` key to insert a cell below. To change the cell type, hit `m` while the cell is highlighted to change to a Markdown cell, or hit `y` to change to a code cell.\n", 83 | "\n", 84 | "
Try inserting a code cell...
\n", 85 | "between here ↓:
\n" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "
   ...and here ↑
\n", 93 | "\n", 94 | "You can **delete** a highlighted cell by going to `Edit` → `Delete Cells`, or simply hitting the `d` keyboard key twice. \n", 95 | "**Try deleting the cell you created just above.**\n", 96 | "\n", 97 | "You can **undo** a cell deletion by going to `Edit` → `Undo Delete Cells`, or simply hitting the `z` keyboard key. \n", 98 | "**Try bringing back the cell you've just deleted.**\n", 99 | "\n", 100 | "Now that we have a basic understanding of how to use a Jupyter Notebook, we'll shift our focus to coding within code cells. \n", 101 | "\n", 102 | "---\n", 103 | "\n" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "## 2. Expressions\n", 111 | "\n", 112 | "An expression is a combination of numbers, variables, operators, and/or other Python elements that the language interprets and acts upon. Expressions act as a set of **instructions** to be followed, with the goal of generating specific outcomes.\n", 113 | "\n", 114 | "\n", 115 | "### Arithmetic\n", 116 | "You can start by thinking of code cells as fancy calculators that computes these expressions. 
For instance, code cells can evaluate simple arithmetic:" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "# Run me!\n", 126 | "# This is an expression\n", 127 | "10 + 10" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "# Run me too!\n", 137 | "# This is another expression\n", 138 | "(10 + 10) / 5" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "Below are some basic arithmetic operators that are built into Python:\n", 146 | "\n", 147 | "|Operation|Operator|Example|Result|\n", 148 | "|:-|:-|:-|:-|\n", 149 | "|Addition|`+`|`1 + 2`|`3`|\n", 150 | "|Subtraction|`-`|`1 - 2`|`-1`|\n", 151 | "|Multiplication|`*`|`2 * 3`|`6`|\n", 152 | "|Division|`/`|`10 / 3`|`3.3333...`|\n", 153 | "|Remainder|`%`|`10 % 3`|`1`|\n", 154 | "|Exponentiation|`**`|`2 ** 3`|`8`|\n", 155 | "\n", 156 | "The order of operations is the same as we learned in elementary math classes (PEMDAS). Just like in mathematical expressions, parentheses can be used to group together smaller expressions within larger expressions. 
Observe the difference in the results of the following two expressions:" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "# Expression 1\n", 166 | "1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 - 8 - 9 + 10 + 11 + 12" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "# Expression 2\n", 176 | "1 + 2 * (3 * 4 * 5 / 6) ** 3 + 7 - 8 - 9 + 10 + 11 + 12" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "This is what they would look like in standard notation:\n", 184 | "\n", 185 | "Expression 1: $1 + 2 \\times 3 \\times 4 \\times 5 \\div 6^3 + 7 - 8 - 9 + 10 + 11 + 12$\n", 186 | "\n", 187 | "Expression 2: $1 + 2 \\times (\\frac{3 \\times 4 \\times 5}{6})^3 + 7 - 8 - 9 + 10 + 11 + 12$" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "### Call Expressions\n", 195 | "\n", 196 | "Another important type of expression is the **call expression**. A call expression \"calls\" on a **function** to be executed on specified input value(s), and often returns a value depending on these inputs. We call the values we put into functions the **arguments** of a function. 
As we'll discuss more later on, a **function** is a computational process that is given a name, so that the process can easily be reused.\n", 197 | "\n", 198 | "Here are some commonly used mathematical functions:\n", 199 | "\n", 200 | "|Function|Example|Value|Description|\n", 201 | "|:-:|:-|:-:|:-|\n", 202 | "|`abs`|`abs(-5)`|`5`| Takes the absolute value of the argument|\n", 203 | "|`max`|`max(5, 13, -9, 2)`|`13`| Finds the maximum value of all arguments|\n", 204 | "|`min`|`min(5, 13, -9, 2)`|`-9`| Finds the minimum value of all arguments|\n", 205 | "|`round`|`round(5.435)`|`5`| Rounds its argument to the nearest integer|\n", 206 | "\n", 207 | "Here are two call expressions that both evaluate to 3:" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "abs(2 - 5)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "min(round(4.32), max(2, abs(3-4) + round(5/3)), 7)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "## 3. Variables\n", 233 | "\n", 234 | "---\n", 235 | "\n", 236 | "In natural language, we have terminology that lets us quickly reference very complicated concepts. We don't say, \"That's a large mammal with brown fur and sharp teeth!\" Instead, we just say, \"Bear!\"\n", 237 | "\n", 238 | "In Python, we do this with assignment statements. An assignment statement has a name on the left side of an `=` sign and an expression to be evaluated on the right. The name we assign to the expression is called a **variable**. Just like in your standard algebra class, you can assign the letter `x` to be 10. You can assign the letter `y` to be 5. You can then add variables `x` and `y` to get 15." 
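Unlike algebra, variable names can also be whole descriptive words, which makes code read like the sentence it replaces. A small sketch (the quantities here are invented for illustration):

```python
# Descriptive names document themselves:
hours_per_day = 8
hourly_wage = 15.50
days_per_week = 5

# the expression on the right is evaluated, then assigned to the name on the left
weekly_pay = hours_per_day * hourly_wage * days_per_week
weekly_pay   # evaluates to 620.0
```

`weekly_pay = hours_per_day * hourly_wage * days_per_week` is far easier to read than `p = h * w * d`, even though Python treats the two identically.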
239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": {}, 245 | "outputs": [], 246 | "source": [ 247 | "#this won't output anything: you're just telling the cell to set x to 10\n", 248 | "x = 10" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "y = 5" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "#This will output the answer to the addition: you're asking it to compute the number\n", 267 | "x + y" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "In algebra class, you were limited to using the letters of the alphabet as variable names. Here, you can use any combination of words **as long as there are no spaces in the names:**" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "quarter = 1/4\n", 284 | "quarter" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "A previously assigned name can be used in the expression to the right of `=`. Python evaluates the expression to the right of `=` first, then assigns the resulting value to the variable." 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "half = 2 * quarter\n", 301 | "half" 302 | ] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "You can **redefine** variables you have used before to hold **new** values; in other words, you can overwrite the old values. 
If you run the following cell, `x` and `y` will now hold different values:" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "# You can put all of the different expressions above into one code cell.\n", 318 | "# When you run this code cell, everything will be evaluated in order, from top to bottom.\n", 319 | "x = 3\n", 320 | "y = 8\n", 321 | "x_plus_y = x + y\n", 322 | "x_plus_y" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "However, variables defined in terms of another variable (e.g. `half` defined in terms of `quarter`) will **not** change automatically just because a variable in a previously evaluated expression has later changed its value:" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | "# even though `quarter` now carries a new value, the value of `half` does not automatically change \n", 339 | "# here we set `quarter` to a new value\n", 340 | "quarter = 4\n", 341 | "\n", 342 | "# and return the value of `half`\n", 343 | "half" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "Recall that when `half` was defined earlier, `quarter` was assigned the value 0.25. Because expressions to the right of `=` are evaluated before variable assignment, `2 * quarter` was evaluated to 0.5, and `half` was assigned this value of 0.5. It does not remain dependent on the variable name `quarter`. \n", 351 | "\n", 352 | "So, even though `quarter` later changed its value, it doesn't change the fact that `half` was assigned 0.5, and `half` continues to represent 0.5." 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "\n", 360 | "\n", 361 | "
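The whole `quarter`/`half` story can be replayed in a single cell; this sketch simply restates the cells above, so you can see at a glance that rebinding `quarter` at the end leaves `half` untouched:

```python
quarter = 1/4        # quarter is bound to 0.25
half = 2 * quarter   # half is bound to the VALUE 0.5, not to the name quarter
quarter = 4          # rebinding quarter...
half                 # ...does not change half: it is still 0.5
```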
You Try: What should be the answer to \"6 times half plus x_plus_y\"?
" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "metadata": {}, 368 | "outputs": [], 369 | "source": [ 370 | "# YOUR CODE HERE" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "Does your answer make sense?" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "## 4. Variables vs. Strings\n", 385 | "\n", 386 | "---\n", 387 | "\n", 388 | "In the section above, we understood that `quarter` is a variable to which we have assigned a numeric value. However, as soon as we put quotes around the word, Python understands it as an entirely different object - _\"quarter\"_ is now a piece of textual data, or a **string**. A string is a type of value, just like numbers are values, that is made up of a sequence of characters. Strings can represent a single character, a word, a sentence, or the contents of an entire book.\n" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": null, 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "# This is a string, not a variable\n", 398 | "\"quarter\"" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": null, 404 | "metadata": {}, 405 | "outputs": [], 406 | "source": [ 407 | "# Another string\n", 408 | "\"Woohoo!\"" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": null, 414 | "metadata": {}, 415 | "outputs": [], 416 | "source": [ 417 | "\"Strings can capture long bodies of text\"" 418 | ] 419 | }, 420 | { 421 | "cell_type": "markdown", 422 | "metadata": {}, 423 | "source": [ 424 | "The meaning of an expression depends both upon its structure and the types of values that are being combined. So, for instance, adding two strings together produces another string. 
This expression is still an addition, but the result of adding strings is different from the result of adding numbers:" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "# I output a combined string:\n", 434 | "\"123\" + \"456\"" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": null, 440 | "metadata": {}, 441 | "outputs": [], 442 | "source": [ 443 | "# I output the result of adding two numbers:\n", 444 | "123 + 456" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "What if we try to type a random word in a code cell **without** putting it in quotes?" 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "metadata": {}, 458 | "outputs": [], 459 | "source": [ 460 | "# This will error!\n", 461 | "Woohoo" 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "It throws an error! Why? Because code cells treat any word **not** in quotes as a **Python object**, like a variable, that stores some sort of information. In this notebook, we haven't told it what `Woohoo` means -- the name was never assigned a value, so Python complains and says, \"I don't know what `Woohoo` is supposed to be.\"" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "## 5. Boolean Values & Expressions\n", 476 | "---\n", 477 | "\n", 478 | "A Boolean is another data type, and it can carry one of only two values - `True` or `False`. 
They often arise when two values are compared against each other:\n" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [ 487 | "# expression: '10 is greater than 1'\n", 488 | "10 > 1" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": null, 494 | "metadata": {}, 495 | "outputs": [], 496 | "source": [ 497 | "# expression: '10 is equal to 1'\n", 498 | "10 == 1" 499 | ] 500 | }, 501 | { 502 | "cell_type": "markdown", 503 | "metadata": {}, 504 | "source": [ 505 | "The value `True` indicates that the statement is accurate. In the above, Python has confirmed the simple statement that 10 is greater than 1; and that 10 is not, in fact, equal to 1.\n", 506 | "\n", 507 | "Here are some common comparison operators:\n", 508 | "\n", 509 | "|Operation|Operator|Result: True|Result: False|\n", 510 | "|-|:-:|:-:|:-:|\n", 511 | "|Equal to|==|1.3 == 1.3|1.3 == 1|\n", 512 | "|Not equal to|!=|1.3 != 1|1 != 1|\n", 513 | "|Less than|<|5 < 10|5 < 5|\n", 514 | "|Less than or equal|<=|5 <= 5|10 <= 5|\n", 515 | "|Greater than|>|10 > 5|5 > 10|\n", 516 | "|Greater or equal|>=|5 >= 5|5 >= 10|\n" 517 | ] 518 | }, 519 | { 520 | "cell_type": "markdown", 521 | "metadata": {}, 522 | "source": [ 523 | "
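Each operator in the table yields a boolean when evaluated; a quick sketch you can check against the "Result: True" and "Result: False" columns:

```python
print(1.3 == 1.3)   # True
print(1.3 != 1)     # True
print(5 < 10)       # True
print(10 <= 5)      # False
print(5 >= 5)       # True
```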
You Try: An apple, a lemon, and a pack of strawberries cost \$0.79, \$0.24, and \$3.49, respectively. If I set out to purchase 2 apples, 6 lemons, and 2 packs of strawberries, would \$10 be enough to cover the total cost? Use a comparison operator to output a True or False answer below:
" 524 | ] 525 | }, 526 | { 527 | "cell_type": "code", 528 | "execution_count": null, 529 | "metadata": {}, 530 | "outputs": [], 531 | "source": [ 532 | "apple = 0.79\n", 533 | "lemon = 0.24\n", 534 | "strawberries = 3.49\n", 535 | "\n", 536 | "# YOUR CODE HERE\n", 537 | "# ..." 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "metadata": {}, 543 | "source": [ 544 | "### Numeric value of Booleans \n", 545 | "\n", 546 | "Different data type values actually represent different boolean values. For instance:\n", 547 | " \n", 548 | "- Numbers:
\n", 549 | "    `0` is evaluated to: `False` \n", 550 | "    any non-zero number is evaluated to: `True`\n", 551 | "
\n", 552 | " \n", 553 | "- Strings:
\n", 554 | "    `''` (an empty string) is evaluated to: `False` \n", 555 | "    any non-empty string is evaluated to: `True` \n", 556 | "\n" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "### Boolean Operators: And, Or, Not\n", 564 | "\n", 565 | "We can combine multiple True/False expressions to compose a broader True/False statement, much like we do in natural conversation: \n", 566 | ">Question:  \"You played tennis **AND** went running today?\" \n", 567 | ">Answer:    \"**No,** I didn't play tennis today.\" \n", 568 | ">Meaning: *The statement is False, because I didn't do both those things*\n", 569 | "\n", 570 | ">Question:  \"Do you want me to make you a smoothie **or** a milkshake?\" \n", 571 | ">Answer:    \"**Yes,** please; a smoothie sounds fantastic!\" \n", 572 | ">Meaning: *The statement is True, because I want at least one of those things*\n", 573 | "\n", 574 | "The question/answer format doesn't work well for the case of negation, but luckily this one is quite straightforward - putting a `not` in front of a True/False expression negates the expression - \"not True\" means False, and \"not False\" means True. \n", 575 | ">Simply:  \"That is **not** True.\" \n", 576 | ">Expression being negated: \"That is True\" \n", 577 | ">Meaning of the statement: \"That is False\"\n", 578 | "\n", 579 | "
\n", 580 | "\n", 581 | "Let's explore these in the code cells below:\n", 582 | "\n", 583 | "**1. `and, &` Operator:** \n", 584 | ">Evaluates to `True` if and only if all expressions evaluate to `True` - in other words: \n", 585 | ">    - Statement is True if **no expression** evaluates to False \n", 586 | ">    - Statement is False if **at least 1 expression** evaluates to False" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": null, 592 | "metadata": {}, 593 | "outputs": [], 594 | "source": [ 595 | "# False, because there is 1 False expression\n", 596 | "True and True and True and True and True and False and True" 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": null, 602 | "metadata": {}, 603 | "outputs": [], 604 | "source": [ 605 | "# first expression is false, so it is not the case that both expr1 and expr2 are True,\n", 606 | "# so this evaluates to False\n", 607 | "\n", 608 | "# False and True --> False\n", 609 | "(10<5) and (4%2==0)" 610 | ] 611 | }, 612 | { 613 | "cell_type": "code", 614 | "execution_count": null, 615 | "metadata": {}, 616 | "outputs": [], 617 | "source": [ 618 | "# first AND second AND third expressions are all True --> True\n", 619 | "# True and True and True\n", 620 | "(7%2!=0) & (5-9<0) & (9==3+7-1)" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "**2. 
`or, |` Operator:** \n", 628 | ">If one expression is `True`, then the entire expression evaluates to `True` - in other words: \n", 629 | ">    - Statement is True if **at least 1 expression** evaluates to True \n", 630 | ">    - Statement is False if **no expression** evaluates to True \n" 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": null, 636 | "metadata": {}, 637 | "outputs": [], 638 | "source": [ 639 | "# True, as long as there is 1 True statement\n", 640 | "False or False or False or False or False or True or False" 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": null, 646 | "metadata": {}, 647 | "outputs": [], 648 | "source": [ 649 | "# neither left nor right expressions are True; overall expression is False\n", 650 | "# False or False --> False\n", 651 | "(10<5) or (8==9)" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": null, 657 | "metadata": {}, 658 | "outputs": [], 659 | "source": [ 660 | "# second expression evaluates to True, so overall expression is True\n", 661 | "# False or True --> True\n", 662 | "(10<5) | (4%2==0)" 663 | ] 664 | }, 665 | { 666 | "cell_type": "markdown", 667 | "metadata": {}, 668 | "source": [ 669 | "**3. 
`not, ~` Operator:** \n", 670 | ">Negates the expression: \n", 671 | ">    - if the expression is True, `not` turns the expression to False \n", 672 | ">    - if the expression is False, `not` turns the expression to True\n\n>*A note of caution: `~` is Python's bitwise-not operator, not a true synonym for `not` -- on a plain boolean, `~True` evaluates to `-2`. It works in the comparison examples below, and `~`, `&`, and `|` are worth knowing because they act as element-wise logical operators on the pandas data used in later tutorials.*\n" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": null, 678 | "metadata": {}, 679 | "outputs": [], 680 | "source": [ 681 | "not True" 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": null, 687 | "metadata": {}, 688 | "outputs": [], 689 | "source": [ 690 | "not False" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": null, 696 | "metadata": {}, 697 | "outputs": [], 698 | "source": [ 699 | "# not False --> True\n", 700 | "not 10<5" 701 | ] 702 | }, 703 | { 704 | "cell_type": "code", 705 | "execution_count": null, 706 | "metadata": {}, 707 | "outputs": [], 708 | "source": [ 709 | "# ~ is bitwise NOT: ~10 is -11, and -11 > 5 is False\n", 710 | "~10>5" 711 | ] 712 | }, 713 | { 714 | "cell_type": "markdown", 715 | "metadata": {}, 716 | "source": [ 717 | "
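One caution worth demonstrating: in plain Python, `~` is the *bitwise*-not operator, so it is not a drop-in replacement for `not` on ordinary booleans (it matters later because `~`, `&`, and `|` act element-wise on pandas/NumPy boolean data). This sketch shows where the two differ:

```python
print(not True)      # False: logical negation
print(~True)         # -2: bitwise NOT of the integer 1 -- and -2 is truthy!
print(~10 > 5)       # False: ~ binds before >, so this compares -11 > 5
print(not (10 > 5))  # False: the logical negation of True
```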
\n", 718 | "   Order of Operations:
\n", 719 | " Just like mathematical operations, boolean operations have an order of operations - from highest to lowest priority, the order of evaluation is:
\n", 720 | "     NOT, AND, then OR\n", 721 | " \n", 722 | "
\n", 723 | "\n", 724 | "The expression below is evaluated in the following steps:\n", 725 | "\n", 726 | "```\n", 727 | "(10==1) or ~(10<5) and (4%2==0) \n", 728 | " False or ~False and True # 1. evaluate inside parentheses first\n", 729 | " False or True and True # 2. then NOT (~)\n", 730 | " False or True # 3. then AND\n", 731 | " True # 4. then OR\n", 732 | "\n", 733 | "```" 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": null, 739 | "metadata": {}, 740 | "outputs": [], 741 | "source": [ 742 | "# evaluates to True\n", 743 | "(10==1) or ~(10<5) and (4%2==0)" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "
\n", 751 | "   Negating \"Or\" / \"And\" Statements:
\n", 752 | " Negation has the following effects on AND and OR statements:
\n", 753 | " \n", 754 | "- not (A or B) is equivalent to (not A) AND (not B) \n", 755 | "- not (A AND B) is equivalent to (not A) OR (not B) \n", 756 | " \n", 757 | "
" 758 | ] 759 | }, 760 | { 761 | "cell_type": "code", 762 | "execution_count": null, 763 | "metadata": {}, 764 | "outputs": [], 765 | "source": [ 766 | "# 1a. \"not (A or B)\"\" is the same as saying (not A) and (not B)\"\n", 767 | "not((10<5) | (4%2==0))" 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": null, 773 | "metadata": {}, 774 | "outputs": [], 775 | "source": [ 776 | "# 1b. this is the same as 1a.\n", 777 | "(not(10<5)) & (not(4%2==0))" 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": null, 783 | "metadata": {}, 784 | "outputs": [], 785 | "source": [ 786 | "# 2a. \"not (A and B) is the same as saying \"(not A) OR (not B)\"\n", 787 | "not((10<5) & (4%2==0))" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": null, 793 | "metadata": {}, 794 | "outputs": [], 795 | "source": [ 796 | "# 2b. same as 2a\n", 797 | "(not(10<5)) | (not(4%2==0))" 798 | ] 799 | }, 800 | { 801 | "cell_type": "markdown", 802 | "metadata": {}, 803 | "source": [ 804 | "## 6. Conditional Statements\n", 805 | "\n", 806 | "  If-Statements\n", 807 | "\n", 808 | "Boolean expressions are very useful when writing code, because it lets you determine what course of action to take for a given scenario. One way to do this is by using **if-statements**, which allows you to execute an action if it meets a certain condition (if the condition evaluates to `True`). If it does not meet the condition - if the condition evaluates to `False` - then it will ignore the instructions for the action altogether, and move on.\n", 809 | "\n", 810 | "```python\n", 811 | "if condition_1: # if condition_1 is true, execute action_1\n", 812 | " action_1\n", 813 | "```\n", 814 | "\n", 815 | "Notice that there is a colon `:` at the end of the if-statement, and that the action to be executed is **indented** underneath the statement. 
The colon completes the \"if-statement\" - it marks the end of the condition it is specifying; and the indentation indicates that everything captured in the indented block underneath the if-statement should **only be executed if the condition is met**. In Python, indentation determines when blocks of code should be run.\n", 816 | "\n", 817 | "The example below contains 2 if-statements. 3 actions are coded, but only 2 will be executed. Before running the cell, can you identify which of the 3 phrases will be printed?" 818 | ] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "execution_count": null, 823 | "metadata": {}, 824 | "outputs": [], 825 | "source": [ 826 | "# Cost of fruits, from before\n", 827 | "apple = 0.79\n", 828 | "lemon = 0.24\n", 829 | "strawberries = 3.49\n", 830 | "\n", 831 | "if apple < 1.00:\n", 832 | " print(\"Action 1: What a bargain!\") # action 1\n", 833 | "\n", 834 | "if strawberries >= 6.00:\n", 835 | " print(\"Action 2: Boy, these are expensive...\") # action 2\n", 836 | "\n", 837 | "print(\"Action 3: When will this be printed?\") # action 3" 838 | ] 839 | }, 840 | { 841 | "cell_type": "markdown", 842 | "metadata": {}, 843 | "source": [ 844 | "
Question: What would happen if we add an indent before the last print statement (\"action 3\"), and re-run the code block? Why?
" 845 | ] 846 | }, 847 | { 848 | "cell_type": "markdown", 849 | "metadata": {}, 850 | "source": [ 851 | "*Your answer here*" 852 | ] 853 | }, 854 | { 855 | "cell_type": "markdown", 856 | "metadata": {}, 857 | "source": [ 858 | "  \"Else if\" & \"Else\" Statements\n", 859 | "\n", 860 | "Python will always evaluate an if-statement. However, sometimes we want to evaluate other conditions **only if a prior condition has not been met**. This can be achieved by following an if-statement with an **`elif`** (\"else, if\") and/or **`else`** statements:\n", 861 | "\n", 862 | "```python\n", 863 | "if (condition_1): # if condition_1 is true, execute action_1\n", 864 | " return action_1\n", 865 | " \n", 866 | "elif (condition_2): # \"else, if\": if condition_1 is not true, then if condition_2 is true, execute action_2\n", 867 | " return action_2\n", 868 | " \n", 869 | "else: # all other cases - if neither condition_1 nor condition_2 are true, execute action_3\n", 870 | " return action_3\n", 871 | "```\n", 872 | "\n", 873 | "There can be multiple **`elif`** statements after an **`if`**-statement and before an **`else`**-statement, but there can only be one **`else`** statement. In this set-up, once a condition is satisfied, Python will execute each respective action, and ignore all other succeeding conditional statements.\n", 874 | "\n", 875 | "
Question 1: \n", 876 | "Let's say a pack of strawberries now costs $6.50. What would you expect the logic above to output now?
" 877 | ] 878 | }, 879 | { 880 | "cell_type": "markdown", 881 | "metadata": {}, 882 | "source": [ 883 | "*Your answer here*" 884 | ] 885 | }, 886 | { 887 | "cell_type": "code", 888 | "execution_count": null, 889 | "metadata": {}, 890 | "outputs": [], 891 | "source": [ 892 | "strawberries = 6.50\n", 893 | "\n", 894 | "## same code as above\n", 895 | "if apple < 1.00:\n", 896 | " print(\"Action 1: What a bargain!\")\n", 897 | "\n", 898 | "if strawberries >= 6.00:\n", 899 | " print(\"Action 2: Boy, these are expensive...\")" 900 | ] 901 | }, 902 | { 903 | "cell_type": "markdown", 904 | "metadata": {}, 905 | "source": [ 906 | "
Question 2: \n", 907 | " The code above uses two if-statements. What happens if we change the second if to an elif? Why?
" 908 | ] 909 | }, 910 | { 911 | "cell_type": "markdown", 912 | "metadata": {}, 913 | "source": [ 914 | "*Your answer here*" 915 | ] 916 | }, 917 | { 918 | "cell_type": "code", 919 | "execution_count": null, 920 | "metadata": {}, 921 | "outputs": [], 922 | "source": [ 923 | "if apple < 1.00:\n", 924 | " print(\"Action 1: What a bargain!\")\n", 925 | "\n", 926 | "elif strawberries >= 6.00:\n", 927 | " print(\"Action 2: Boy, these are expensive...\")" 928 | ] 929 | }, 930 | { 931 | "cell_type": "markdown", 932 | "metadata": {}, 933 | "source": [ 934 | "---\n", 935 | "\n", 936 | "## 7. Defining a Function\n", 937 | "\n", 938 | "Functions are useful when you want to repeat a series of steps on multiple different objects, but don't want to type out the steps over and over again. Many functions are built into Python already, as we've already seen in the section on call expressions. In this section, we'll discuss how to **write and name our own functions**.\n", 939 | "\n", 940 | "Recall that when we call on a function, we must often provide one or more input values, or **arguments**, for the function to operate on. When we define a function, we need a way to let the function know what to do with which argument. We do this by setting up parameters for the function - **parameters** can be thought of as placeholder variables that are waiting to be assigned values, which happens when the function is called upon with specific arguments.\n", 941 | " \n", 942 | "Below is our first example, found in the UC Berkeley [Inferential Thinking](http://www.data8.org/zero-to-data-8/textbook.html) Textbook by Ani Adhikari and John DeNero:\n", 943 | "\n", 944 | " \n", 945 | ">The definition of the `double` function below simply doubles a number." 
946 | ] 947 | }, 948 | { 949 | "cell_type": "code", 950 | "execution_count": null, 951 | "metadata": {}, 952 | "outputs": [], 953 | "source": [ 954 | "# Our first function definition\n", 955 | "def double(x):\n", 956 | " \"\"\"Double x\"\"\"\n", 957 | " return 2*x" 958 | ] 959 | }, 960 | { 961 | "cell_type": "markdown", 962 | "metadata": {}, 963 | "source": [ 964 | ">We start any function definition by writing `def`. Here is a breakdown of the other parts (the *syntax*) of this small function:\n", 965 | "\n", 966 | "\n", 967 | "\n", 968 | "\n", 969 | "\n", 970 | ">When we run the cell above, no particular number is doubled, and the code inside the body of `double` is not yet evaluated. In this respect, our function is analogous to a *recipe*. Each time we follow the instructions in a recipe, we need to start with ingredients. Each time we want to use our function to double a number, we need to specify a number.1\n", 971 | "\n", 972 | "---\n", 973 | "\n", 974 | "
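Following the recipe analogy, calling the function supplies the ingredient. A quick sketch of calling the `double` defined above with a few different arguments:

```python
def double(x):
    """Double x"""
    return 2 * x

# Each call binds the argument to the parameter x, then evaluates the body:
print(double(8))    # 16
print(double(-3))   # -6
print(double(0.5))  # 1.0
```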
You Try! \n", 975 | "The following function, `add_two`, has been set up with a more thorough docstring. Fill in the ... below with an expression that would satisfy the function's description.
" 976 | ] 977 | }, 978 | { 979 | "cell_type": "code", 980 | "execution_count": null, 981 | "metadata": {}, 982 | "outputs": [], 983 | "source": [ 984 | "def add_two(number):\n", 985 | " \"\"\"Adds 2 to the input.\n", 986 | " \n", 987 | " Parameters\n", 988 | " ----------\n", 989 | " number:\n", 990 | " The given number that 2 will be added to.\n", 991 | " \n", 992 | " Returns\n", 993 | " -------\n", 994 | " A number which is 2 greater than the original input.\n", 995 | " \n", 996 | " Example\n", 997 | " -------\n", 998 | " >>> add_two(4)\n", 999 | " 6\n", 1000 | " \"\"\"\n", 1001 | " return # ... YOUR CODE HERE" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "markdown", 1006 | "metadata": {}, 1007 | "source": [ 1008 | "Given what you understand from the docstring, what do you think this function does? Run the cells below to test it out:" 1009 | ] 1010 | }, 1011 | { 1012 | "cell_type": "code", 1013 | "execution_count": null, 1014 | "metadata": { 1015 | "scrolled": true 1016 | }, 1017 | "outputs": [], 1018 | "source": [ 1019 | "add_two(3)" 1020 | ] 1021 | }, 1022 | { 1023 | "cell_type": "code", 1024 | "execution_count": null, 1025 | "metadata": {}, 1026 | "outputs": [], 1027 | "source": [ 1028 | "add_two(-1)" 1029 | ] 1030 | }, 1031 | { 1032 | "cell_type": "markdown", 1033 | "metadata": {}, 1034 | "source": [ 1035 | "Functions often take advantage of conditional statements to carry out different procedures given different input values. 
In the example below, where we define our own absolute value function, notice how negative and non-negative arguments are handled differently: " 1036 | ] 1037 | }, 1038 | { 1039 | "cell_type": "code", 1040 | "execution_count": null, 1041 | "metadata": {}, 1042 | "outputs": [], 1043 | "source": [ 1044 | "def absolute_value_of(number):\n", 1045 | " \"\"\"Finds the absolute value of the input.\n", 1046 | " \n", 1047 | " Parameters\n", 1048 | " ----------\n", 1049 | " number:\n", 1050 | " Input value\n", 1051 | " \n", 1052 | " Returns\n", 1053 | " -------\n", 1054 | " The absolute value of the input number\n", 1055 | " \n", 1056 | " Example\n", 1057 | " -------\n", 1058 | " >>> absolute_value_of(-5)\n", 1059 | " 5 \n", 1060 | " \"\"\"\n", 1061 | "\n", 1062 | " if number < 0:\n", 1063 | " number = number * -1\n", 1064 | " \n", 1065 | " return number" 1066 | ] 1067 | }, 1068 | { 1069 | "cell_type": "markdown", 1070 | "metadata": {}, 1071 | "source": [ 1072 | "## 8. Understanding Errors\n", 1073 | "Python is a language, and like natural human languages, it has rules. It differs from natural language in two important ways:\n", 1074 | "1. The rules are **simple**. You can learn most of them in a few weeks and gain reasonable proficiency with the language in just a few months.\n", 1075 | "2. The rules are **rigid**. If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes. A computer running Python code is **not** smart enough to do that.\n", 1076 | "\n", 1077 | "Whenever you write code, you'll inevitably make mistakes. When you run a code cell that has errors, Python will usually produce error messages to tell you what you did wrong.\n", 1078 | "\n", 1079 | "Errors are okay; even experienced programmers make many errors. When you make an error, you just have to find the source of the problem, fix it, and move on.\n", 1080 | "\n", 1081 | "We have made an error in the next cell. Run it and see what happens." 
1082 | ] 1083 | }, 1084 | { 1085 | "cell_type": "code", 1086 | "execution_count": null, 1087 | "metadata": {}, 1088 | "outputs": [], 1089 | "source": [ 1090 | "print(\"This line is missing something.\"" 1091 | ] 1092 | }, 1093 | { 1094 | "cell_type": "markdown", 1095 | "metadata": {}, 1096 | "source": [ 1097 | "We can break down the error message as follows:\n", 1098 | "\n", 1099 | "![error](./images/error.jpg)\n", 1100 | "\n", 1101 | "Fix this error in the cell below:" 1102 | ] 1103 | }, 1104 | { 1105 | "cell_type": "code", 1106 | "execution_count": null, 1107 | "metadata": {}, 1108 | "outputs": [], 1109 | "source": [ 1110 | "#Your Answer Here\n", 1111 | "..." 1112 | ] 1113 | }, 1114 | { 1115 | "cell_type": "markdown", 1116 | "metadata": {}, 1117 | "source": [ 1118 | "---\n", 1119 | "\n", 1120 | "**Congratulations!** You have completed the introduction to Jupyter Notebooks tutorial! In the next tutorial, we will use these skills to further explore data structures like lists and arrays, and statistical concepts like percentiles, histograms, and standard deviations.\n", 1121 | "\n", 1122 | "---\n", 1123 | "\n", 1124 | "\n", 1125 | "#### Content adapted from: \n", 1126 | "- Jupyter Notebook modules from the [UC Berkeley Data Science Modules Program](https://ds-modules.github.io/DS-Modules/) licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)\n", 1127 | " - [ESPM-163ac: Lab1-Introduction to Jupyter Notebook](https://github.com/ds-modules/ESPM-163ac/blob/master/Lab1/Lab1_Intro_to_Jupyter.ipynb) by Alleana Clark\n", 1128 | " - [Data 8X Public Materials for 2022](https://github.com/ds-modules/materials-x22/) by Sean Morris\n", 1129 | "- [Composing Programs](https://www.composingprograms.com/) by John DeNero based on the textbook [Structure and Interpretation of Computer Programs](https://mitpress.mit.edu/9780262510875/structure-and-interpretation-of-computer-programs/) by Harold Abelson and Gerald Jay Sussman, licensed under [CC
BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) \n", 1130 | " \n", 1131 | "\n", 1132 | "#### Citations:\n", 1133 | "\n", 1134 | "\n", 1135 | "1. Ani Adhikari, et al, “8. Functions and Tables,” Computational and Inferential Thinking: The Foundations of Data Science, accessed 15 August 2023, https://inferentialthinking.com/chapters/08/Functions_and_Tables.html. \n" 1136 | ] 1137 | } 1138 | ], 1139 | "metadata": { 1140 | "anaconda-cloud": {}, 1141 | "kernelspec": { 1142 | "display_name": "Python 3 (ipykernel)", 1143 | "language": "python", 1144 | "name": "python3" 1145 | }, 1146 | "language_info": { 1147 | "codemirror_mode": { 1148 | "name": "ipython", 1149 | "version": 3 1150 | }, 1151 | "file_extension": ".py", 1152 | "mimetype": "text/x-python", 1153 | "name": "python", 1154 | "nbconvert_exporter": "python", 1155 | "pygments_lexer": "ipython3", 1156 | "version": "3.11.5" 1157 | } 1158 | }, 1159 | "nbformat": 4, 1160 | "nbformat_minor": 2 1161 | } 1162 | -------------------------------------------------------------------------------- /03_More_DataStructures.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Part 3: More Data Structures\n", 8 | "\n", 9 | "In the previous notebook, we looked at ways to store and work with a collection of data values within a sequence, like a list or an array. We also explored insights we can gain from computing summary statistics on those collections of data. Now we need a way to store and work with **multiple** separate collections of information about the same population - e.g. in addition to the test scores of the class, perhaps we also know the students' IDs, and results from one other exam they have taken in the past. 
In this notebook, we will work with data structures that allow us to store multiple pieces of information, or **features**, about a population; and we will explore a powerful library that allows us to read in, explore and manipulate datasets.\n", 10 | "\n", 11 | "## Table of Contents\n", 12 | " \n", 13 | "**1. [Collections of Multiple Features](#multFeat)** \n", 14 | "    **1.1.** [Dictionaries](#dict) \n", 15 | "    **1.2.** [Matrices](#matrix) \n", 16 | "**2. [Pandas Library](#pd)** \n", 17 | "    **2.1.** [Reading in Data](#import) \n", 18 | "    **2.2.** [Exploring the Pandas DataFrame](#df) \n", 19 | "    **2.3.** [Selecting Rows & Columns](#loc) \n", 20 | "    **2.4.** [Applying Functions](#apply) \n", 21 | "    **2.5.** [Data Aggregation](#group) \n", 22 | "\n", 23 | "\n", 24 | "---\n", 25 | "\n", 26 | "## 1. Collections of Multiple Features\n", 27 | "\n", 28 | "First, we will explore 2 important data structures that allow us to organize multiple features of a given population: dictionaries and matrices.\n", 29 | "\n", 30 | "### 1.1. Dictionaries\n", 31 | "\n", 32 | "We can use arrays to collect data points on a single feature of the population, like `test_scores`. When we have multiple arrays capturing different features of the same population, one way to store all of the different arrays is in a **dictionary**.\n", 33 | "\n", 34 | "Here are two examples of a dictionary:" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "# ex1. dictionary of single values\n", 44 | "numerals = {'I': 1, 'V': 5, \"X\": 10}\n", 45 | "numerals" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "# ex2. 
dictionary of collections\n", 55 | "class_dict = {\"student_ID\":np.arange(1, len(test_scores)+1), \n", 56 | " \"test_scores\":test_scores}\n", 57 | "class_dict" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "  **Definition:**\n", 65 | "
\n", 66 | "A dictionary organizes data into key-value pairs. This allows us to store and retrieve values indexed not by consecutive integers, but by descriptive keys. \n", 67 | "\n", 68 | "- Keys: Strings commonly serve as keys since they enable us to represent names of things. In the context of storing data, they are the column names, or names of the value(s) it represents. \n", 69 | "- Values: The data that we are storing. This can be a single value, or a collection of values. \n", 70 | "
\n", 71 | "\n", 72 | "
  **Accessing Dictionary Contents** \n", 73 | "\n", 74 | "1. Access a dictionary value by indexing the dictionary by the corresponding key:" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "# 1. get value associated with \"test_scores\"\n", 84 | "class_dict['test_scores']" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "2. Dictionaries have methods that give us access to a list of its keys, values, and key-value pairs: " 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "## a. list of dictionary keys:\n", 101 | "class_dict.keys()" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "## b. list of dictionary values:\n", 111 | "class_dict.values()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "## c. list of (key, value) pairs:\n", 121 | "class_dict.items()" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "
\n", 129 | "   Unlike lists and arrays, dictionary are unordered so the order in which the key:value pairs appear in the dictionary may change when you run code cells. \n", 130 | "
" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "
  **Adding key-value Pairs** \n", 138 | "\n", 139 | "You can add a new item (key-value pair) to an existing dictionary by assigning a value to a new key in the dictionary:\n", 140 | "\n", 141 | "```python\n", 142 | "dictionary['new_key'] = new_value\n", 143 | "```" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "# adding a new entry\n", 153 | "past_scores = np.array([89.0, 94.2, 78.0, 86.2, 81.2, 86.0, 88.3, 84.9, 88.1, 93.0, 82.2, 78.2, 96.1, 95.9, 98.2])\n", 154 | "\n", 155 | "class_dict[\"past_test_score\"] = past_scores\n", 156 | "class_dict" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": { 163 | "scrolled": true 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "class_dict.items()" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "
\n", 175 | "Note: There can only be 1 value per key. If you attempt to assign a new value to the dictionary but specify a key name that already exists in the dictionary, the existing values associated with that key will be overwritten.\n", 176 | "
\n", 177 | "\n", 178 | "---\n", 179 | "\n", 180 | "### 1.2. Matrices\n", 181 | "\n", 182 | "Dictionaries organize information by features - all data values capturing student IDs are boxed into one container and saved into a dictionary; and data values about test scores from last year are boxed up into a separate container and saved into the dictionary under a differet label.\n", 183 | "\n", 184 | "But when you think about it, that is not the most helpful way to organize data when you are trying to **make predictions** about a specific case. For instance, say you were trying to guess what animal each record (row) is, given the following features:\n", 185 | "\n", 186 | "||Opposable Thumbs|Class of Animal|Diet |Tail Length |Number of Legs | Flies|\n", 187 | "|:-:|:-:|:-:|:-:|:-:|:-:|:-:|\n", 188 | "|**0**|True|Mammal|Bananas| long | 2|False|\n", 189 | "|**1**|False|Anthropod|Insects| none |8| False|\n", 190 | "|**2**|False|Bird|Fish|short|2|False|\n", 191 | "\n", 192 | "We wouldn't want to look at the data one column at a time, the way dictionaries are organized, when we want to predict what animal record **0** might be. Instead, we'd want to look all the features of the one record at the same time:\n", 193 | "\n", 194 | "||Opposable Thumbs|Animal Class|Diet |Tail Length | Wings |Number of Legs | Flies|\n", 195 | "|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|\n", 196 | "|**0**|True|Mammal|Bananas| long | False | 2|False|\n", 197 | "\n", 198 | "To make a prediction about a particular record, we need to consider all its features - we want to organize information by record, not by column. This is exactly what a **matrix** is designed to do, and it is in this matrix form that we ultimately feed the data into machine learning models.\n", 199 | "\n", 200 | "  **Definition:**\n", 201 | "
\n", 202 | " A matrix is a rectangular list*, or a list of lists. We say that matrix $M$ has shape $m \\times n$:\n", 203 | " \n", 204 | "* It has **m** rows: each row is a list of all features that describe a single record;\n", 205 | "* It has **n** columns: each column is displayed as elements in the same position/index of every row, and represents a specific feature of the data\n", 206 | "\n", 207 | "
\n", 208 | "\n", 209 | "Our example above in matrix form would look like:" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "animal_matrix = [[ True, 'Mammal', 'Bananas', 'long', 2, False],\n", 219 | " [False, 'Anthropod', 'Insects', 'none', 9, False],\n", 220 | " [False, 'Bird', 'Fish', 'short', 2, False]]\n", 221 | "\n", 222 | "animal_matrix" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "Compare this to how the data would be represented in a dictionary:" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "animal_dict = {\"Opposable Thumbs\": [True, False, False],\n", 239 | " \"Class of Animal\": ['Mammal', 'Anthropod', 'Bird'],\n", 240 | " \"Diet\": ['Bananas', 'Insects', 'Fish'],\n", 241 | " \"Tail Length\": ['long', 'none', 'short'],\n", 242 | " \"Number of Legs\": [2, 8, 2],\n", 243 | " \"Flies\": [False, False, False]}\n", 244 | "animal_dict" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "
\n", 252 | "  *We use lists in this example to demonstrate what a matrix looks like, since the features are represented by different value types (and values in NumPy arrays must all be of the same type). However, NumPy's representation of the matrix ndarray, or the n-dimensional array, is usually preferred over using Python lists because NumPy arrays consume less memory and is able to handle operations much more efficiently than lists. Even though we have a mix of data types in our example, that does not mean we are stuck using lists. There are many ways to transform categorical features of datasets into numerical features; figuring out how best to handle categorical variables (like \"Diet\" and \"Animal Class\") is a big part of data wrangling for predictive modeling!\n", 253 | "
" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "---\n", 261 | "\n", 262 | "## 2. Pandas Library\n", 263 | "\n", 264 | "Now for the exciting part! Up to this point we have been fabricating data in the notebook to serve as our examples. With the introduction of the Pandas Library, we can **import** real data files into Jupyter Notebooks to explore. Let's do that now!\n", 265 | "\n", 266 | "
\n", 267 | "\n", 268 | "**Kaggle: our data source** \n", 269 | "We will use the \"Breast Cancer Wisconsin (Diagnostic) Dataset\" from **Kaggle**. Kaggle is a data science competition platform that hosts datathons, publish datasets, and support an online community of data scientists. Anyone is able to download the cleaned, published datasets to explore from the site and have access to an abundance of resources - from **data dictionaries** that detail data contents, to notebooks and code that other users of the data have posted. It's a great place to find interesting problems to explore and learn from others who have done/are doing the same.\n", 270 | "\n", 271 | "
\n", 272 | "\n", 273 | "**Pandas** \n", 274 | "Pandas is the standard tool for working with **dataframes**. A dataframe is a data structure that that represents data in a 2-dimensional table of rows and columns. We've seen a couple of examples of dataframes already, in the section on standard deviations, and just now in the matrix section. They are very useful for exploratory data analysis, data cleaning, and processing before turning them into matrices to be fed into machine learning models.\n", 275 | "\n", 276 | "
We've already imported the pandas library, but let's do that again here:
" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "import pandas as pd" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "---\n", 293 | "\n", 294 | "### 2.1. Reading in the Data\n", 295 | "\n", 296 | "Pandas allows us to easily \"read in\" data from a downloaded csv (comma separated values) file and save it as a variable in the Jupyter Notebook, with the `pd.read_csv()` function. It can take many different arguments depending on the desired specifications, but we can just accept the default for the optional parameters. The only required parameter is `filepath_or_buffer`, which asks for the **file path**, or location, of the data file on your computer so that it can find it and turn it into a Pandas dataframe. There are 2 ways to specify the file path:\n", 297 | "\n", 298 | "\n", 299 | "
  **Absolute File Path:** \n", 300 | "\n", 301 | "All of your files on the computer have a file path. If you go to the location of any file on your File Explorer, you can find its absolute file path by clicking the address bar at the top of the window. You'll see something like:\n", 302 | "\n", 303 | "C:/Users/username/folder/data_folder/filename.csv\n", 304 | "\n", 305 | "When you have all of the information needed to locate the file, all the way to the very first layer of folders, you have an absolute file path.\n", 306 | "\n", 307 | "\n", 308 | "
  **Relative File Path:**\n", 309 | "\n", 310 | "Your Jupyter notebook (ipynb) file that you are working on has a path, too. When you navigate from one folder to the next on the File Explorer, you often start at a file location (let's call that location A), back out of that folder, enter into another folder, and access the file in this new location (location B). We can do something similar with file paths, by specifying the path of location B **relative to** the location of A.\n", 311 | "\n", 312 | "Let's say this Jupyter notebook is found in location A, whose absolute path is: \n", 313 | "C:/Users/username/folder/myNotebook.ipynb\n", 314 | "\n", 315 | "So, this is the **current directory**, or the location we are starting from: \n", 316 | "C:/Users/username/folder/\n", 317 | "\n", 318 | "From here, we want to get to location B: \n", 319 | "C:/Users/username/folder/data_folder/filename.csv\n", 320 | "\n", 321 | "
To do this, we can specify the relative path:
\n", 322 | "./data_folder/filename.csv
\n", 323 | "\n", 324 | "



\n", 325 | "\n", 326 | "**Notation:** \n", 327 | "- The **`.`** in the relative path indicates we are **staying in the same, current directory**. Since the `data_folder` that contains the desired file is **inside** the current directory we started out in, we indicate that it is from here that we then move into another folder, or identify a file to point to.\n", 328 | "
\n", 329 | "\n", 330 | "- The **`..`** indicates we need to **back out of the current folder**. It's the equivalent of clicking the back button on File Explorer. \n", 331 | ">**Example**:
\n", 332 | ">We can back out of multiple folders - say there is another file in this location we want to get to:\n", 333 | ">`C:/Users/another_user/theirFile.csv` \n", 334 | ">\n", 335 | ">We can access this from location A with the relative path: \n", 336 | ">`../../another_user/theirFile.csv`\n", 337 | "\n", 338 | "\n", 339 | "Once we have the path, all we need to do it put it in string form, an input it as an argument!\n", 340 | "\n", 341 | "
Run the code below to read in our first dataset!
\n" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": null, 347 | "metadata": {}, 348 | "outputs": [], 349 | "source": [ 350 | "# using relative path!\n", 351 | "df = pd.read_csv(\"./data/data.csv\")\n", 352 | "df" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "This dataset captures measurements and characteristics of breast mass (e.g. mass radius, smoothness, symmery) and the actual diagnosis of the mass. The challenge here would be to predict the diagnosis from the features of the mass. The purpose of this section is to introduce data manipulation using Pandas dataframes and series so we will not be tackling the challenge in this tutorial, but the notebooks uploaded on the Kaggle page would be a great place to see what other people have done with this dataset!\n", 360 | "\n", 361 | "---\n", 362 | "\n", 363 | "### 2.2. Exploring the Pandas DataFrame\n", 364 | "\n", 365 | "The Pandas DataFrame data structure allows us to easily access both the rows (records) **and** columns (features). It can be created in many different ways: from scratch, from a dictionary of values, from a matrix, from reading in a dataset, etc.\n", 366 | "\n", 367 | "\n", 368 | "
 **From scratch**: \n", 369 | "\n", 370 | "Below is an empty DataFrame object - it has no column or row yet. Run the code to see what it looks like:" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": null, 376 | "metadata": {}, 377 | "outputs": [], 378 | "source": [ 379 | "df_fromScratch = pd.DataFrame()\n", 380 | "df_fromScratch" 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "We can add a column to the dataframe in the same way that we can add new key-value pairs to dictionaries:" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": null, 393 | "metadata": {}, 394 | "outputs": [], 395 | "source": [ 396 | "df_fromScratch['first_column'] = np.arange(1, 10)\n", 397 | "df_fromScratch['second_column']= np.arange(9, 0, -1)\n", 398 | "df_fromScratch" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "metadata": {}, 404 | "source": [ 405 | "However, once one column of a certain **length** is added to a dataframe, all other new columns must be of the same length:" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": null, 411 | "metadata": {}, 412 | "outputs": [], 413 | "source": [ 414 | "# this will throw an error\n", 415 | "df_fromScratch['short_column'] = np.array([7, 8, 9])" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "
  **From a dictionary:** \n", 423 | "\n", 424 | "We can also convert a dictionary of data into a Pandas DataFrame, as long as the number of elements captured in each dictionary value is the same:" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "# animal dictionary from earlier\n", 434 | "pd.DataFrame(animal_dict)" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": {}, 440 | "source": [ 441 | "
  **From a matrix:** \n", 442 | "\n", 443 | "...and the same goes for matrices. Since a matrix does not carry key names the way a dictionary does, we can include an argument to specify the column names:" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": null, 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "columnNames = ['Opposable Thumbs', 'Class of Animal', 'Diet', 'Tail Length', 'Number of Legs', 'Flies']\n", 453 | "pd.DataFrame(animal_matrix, columns = columnNames)" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "
  **Exploring data contents:** \n", 461 | "\n", 462 | "Let's explore the Pandas capabilities using the breast cancer data we read in earlier. The first step we'd want to take when exploring a dataset is to understand what the dataset contains. The DataFrame object has many attributes to help us with this task:\n", 463 | "\n", 464 | "1. Identify the number of rows (records) and columns (features) in the data\n", 465 | "2. Get info on the column names, their position on the dataframe, how many non-**null\*** values there are in each feature, and what the data type of each feature is\n", 466 | "3. Get a list of the columns in the dataset\n", 467 | "4. Create a table of summary statistics on all numeric features\n", 468 | "\n" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": null, 474 | "metadata": {}, 475 | "outputs": [], 476 | "source": [ 477 | "# 1. find the number of rows and columns (row, col) in the dataset\n", 478 | "df.shape" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [ 487 | "# 2. summary of features\n", 488 | "df.info()" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": null, 494 | "metadata": {}, 495 | "outputs": [], 496 | "source": [ 497 | "# 3. list of column names\n", 498 | "df.columns" 499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": null, 504 | "metadata": {}, 505 | "outputs": [], 506 | "source": [ 507 | "# 4. summary statistics of numeric features\n", 508 | "df.describe()" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "
\n", 516 | "  A null or nan value represents an unknown or missing value in the data - it is an empty entry. If a feature is riddled with missing values, we may need to drop the feature from the investigation since it may not capture enough valid data, or the valid data it does capture may be biased. If a feature has some missing values but still captures valuable information, we will want to clean the data so to replace these values with something more interpretable. We won't touch on this here, but you can find a guide on how to work with missing data using Pandas in their documentation.\n", 517 | "
\n", 518 | "\n" 519 | ] 520 | }, 521 | { 522 | "cell_type": "markdown", 523 | "metadata": {}, 524 | "source": [ 525 | "
 **Selecting a DataFrame feature - Series**:
\n", 526 | "\n", 527 | "We know from the `.info` output above that there is only one non-numeric field in the dataset, and that is the target variable - the diagnosis. Let's understand this target variable better.\n", 528 | "\n", 529 | "We can select a feature from the DataFrame in a similar way to how we would get the value of a dictionary - by indexing the dataframe by the column name:" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "# get the data values of `diagnosis` column\n", 539 | "df['diagnosis']" 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": {}, 545 | "source": [ 546 | "The extracted column is stored in a Pandas data structure called a **Pandas Series**. \n", 547 | "\n", 548 | "  **Definition:**\n", 549 | "
\n", 550 | " A Series is a Pandas data structure that behaves very similarly to NumPy arrays and will be a valid argument to most NumPy functions. Series are also similar to dictionaries, in that its values can have index labels and be indexed by these labels.\n", 551 | "
\n", 552 | "\n", 553 | "For instance:" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": null, 559 | "metadata": {}, 560 | "outputs": [], 561 | "source": [ 562 | "# this is an array\n", 563 | "array_a = np.arange(1, 6)\n", 564 | "\n", 565 | "# this is a Series, with non-numeric index labels, and a name\n", 566 | "series_a = pd.Series(array_a, index=['a', 'b', 'c', 'd', 'e'], name=\"Series_A\")\n", 567 | "\n", 568 | "print(\"array: \", array_a)\n", 569 | "print(\"\\nSeries: \")\n", 570 | "print(series_a)" 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": {}, 576 | "source": [ 577 | "`series_a` has non-numeric indices. If I want to extract a value from the structure, I can index using its positional index (like an array), or using its label index (like a dictionary):" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": null, 583 | "metadata": {}, 584 | "outputs": [], 585 | "source": [ 586 | "# extracting value like an array\n", 587 | "series_a[1]" 588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "execution_count": null, 593 | "metadata": { 594 | "scrolled": true 595 | }, 596 | "outputs": [], 597 | "source": [ 598 | "# extracting value like a dictionary \n", 599 | "series_a['b']" 600 | ] 601 | }, 602 | { 603 | "cell_type": "markdown", 604 | "metadata": {}, 605 | "source": [ 606 | "A Series can also have a `name` attribute, which is how Pandas knows to name the dataframe when a Series object is turned into a dataframe:" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "metadata": {}, 613 | "outputs": [], 614 | "source": [ 615 | "series_a.to_frame()" 616 | ] 617 | }, 618 | { 619 | "cell_type": "markdown", 620 | "metadata": {}, 621 | "source": [ 622 | "**Now back to our data.** If we were to predict the diagnosis based on the cancer mass attributes, it would be good to know how many categories of diagnoses there may be. 
We want to find the unique values of the variable.\n", 623 | "\n", 624 | "Like the DataFrame object, the Pandas Series object also has many useful attributes. Let's use a couple of them here to better understand the field:\n" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": null, 630 | "metadata": {}, 631 | "outputs": [], 632 | "source": [ 633 | "# Find all unique values of the field\n", 634 | "df['diagnosis'].unique()" 635 | ] 636 | }, 637 | { 638 | "cell_type": "markdown", 639 | "metadata": {}, 640 | "source": [ 641 | "There are only 2 possible values for the `diagnosis` variable - malignant (`M`) and benign (`B`). Use the `.value_counts()` method to count how many of each are in the dataset:" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": null, 647 | "metadata": {}, 648 | "outputs": [], 649 | "source": [ 650 | "df['diagnosis'].value_counts()" 651 | ] 652 | }, 653 | { 654 | "cell_type": "markdown", 655 | "metadata": {}, 656 | "source": [ 657 | "### 2.3 Selecting Rows & Columns\n", 658 | "\n", 659 | "What if we wanted to create subsets of our data, without pulling them out one by one into Pandas Series objects? We will often want to select a set of features (columns) to keep, or filter for data records (rows) that meet specific criteria. There are a number of ways to accomplish this:\n", 660 | "\n", 661 | "
 **Creating a dataset with fewer selected features:**
\n", 662 | "\n", 663 | "Often times we want to investigate just a couple of fields from the data. In these cases, we may want to create a smaller dataset for greater efficiency and run times. We can select fields to keep in a few ways:\n", 664 | "\n", 665 | "**1. Double square brackets `[]`** \n", 666 | "We can index the dataframe with a list of column names to create a dataset with just those columns (but with all the rows)." 667 | ] 668 | }, 669 | { 670 | "cell_type": "code", 671 | "execution_count": null, 672 | "metadata": {}, 673 | "outputs": [], 674 | "source": [ 675 | "df[['diagnosis', 'area_mean']]" 676 | ] 677 | }, 678 | { 679 | "cell_type": "markdown", 680 | "metadata": {}, 681 | "source": [ 682 | "**2. `.loc[]` attribute** \n", 683 | "We can do the same using the `.loc` attribute. This attribute allows us to specify the column names to keep **and** filter the rows at the same time.\n", 684 | "\n", 685 | "Just as we could slice (extract specific ranges of) sequences based their positional indices, we can slice the data rows and data columns by their index labels. \n", 686 | "\n", 687 | "The `.loc[]` attribute takes two ranges. 
The range for rows is specified first, and the range for columns second: \n", 688 | "\n", 689 | "df.loc\\[ startRowLabel : endRowLabel, startColName : endColName \\]" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": null, 695 | "metadata": {}, 696 | "outputs": [], 697 | "source": [ 698 | "# grabs all records, and all columns positioned between and including `diagnosis` and `area_mean`\n", 699 | "df.loc[:, \"diagnosis\":\"area_mean\"]" 700 | ] 701 | }, 702 | { 703 | "cell_type": "code", 704 | "execution_count": null, 705 | "metadata": {}, 706 | "outputs": [], 707 | "source": [ 708 | "# if we just want the 2 columns and not the columns in between, we leverage the double-bracket\n", 709 | "df.loc[:, [\"diagnosis\",\"area_mean\"]]" 710 | ] 711 | }, 712 | { 713 | "cell_type": "markdown", 714 | "metadata": {}, 715 | "source": [ 716 | "**3. `.iloc[]` attribute** \n", 717 | "This is very similar to `.loc[]`, but instead of using row and column labels, we specify index positions instead. We can see below that `diagnosis` is found at index `1`, and `area_mean` at index `5`. 
So, if we specify the range `1:6`, we should get the same table as before.\n", 718 | "\n", 719 | "*Remember that, when slicing with indices, the `stop` value in the `start`:`stop` range is excluded from the selection.*" 720 | ] 721 | }, 722 | { 723 | "cell_type": "code", 724 | "execution_count": null, 725 | "metadata": {}, 726 | "outputs": [], 727 | "source": [ 728 | "# run to see column positions\n", 729 | "df.columns" 730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": null, 735 | "metadata": {}, 736 | "outputs": [], 737 | "source": [ 738 | "# slicing using index positions\n", 739 | "df.iloc[:, 1:6]" 740 | ] 741 | }, 742 | { 743 | "cell_type": "code", 744 | "execution_count": null, 745 | "metadata": {}, 746 | "outputs": [], 747 | "source": [ 748 | "df.iloc[:, [1, 5]]" 749 | ] 750 | }, 751 | { 752 | "cell_type": "markdown", 753 | "metadata": {}, 754 | "source": [ 755 | "
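The inclusive/exclusive difference between `.loc[]` and `.iloc[]` can be seen side by side on a tiny made-up DataFrame (a sketch — not the tutorial's `df`):

```python
import pandas as pd

# A tiny stand-in DataFrame, just to contrast the two slicing styles
df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# .loc slices by label and INCLUDES the end label 'b'
by_label = df_demo.loc[:, 'a':'b']

# .iloc slices by position and EXCLUDES the stop position 2
by_position = df_demo.iloc[:, 0:2]

print(by_label.columns.tolist())
print(by_position.columns.tolist())
```

Both slices keep columns `a` and `b`, which is why the label range `'a':'b'` and the positional range `0:2` land on the same table.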
 **Slicing and Filtering Dataset Records:**
\n", 756 | "\n", 757 | "Just as we can create data with subsets of columns, we can create data with subsets of rows.\n", 758 | "\n", 759 | "**1. Regular indexing** \n", 760 | "When we specify a range of integer values, DataFrames know to slice the rows:" 761 | ] 762 | }, 763 | { 764 | "cell_type": "code", 765 | "execution_count": null, 766 | "metadata": {}, 767 | "outputs": [], 768 | "source": [ 769 | "# keep first 100 records\n", 770 | "df[:100]" 771 | ] 772 | }, 773 | { 774 | "cell_type": "markdown", 775 | "metadata": {}, 776 | "source": [ 777 | "**2. Filtering by criteria** \n", 778 | "We can also filter by a criteria in the data. For instance, what if we only wanted to check out the distributions of features for masses that are known to be \"benign\"? We would create what we call a **mask**, and apply it to the dataset, like this:\n" 779 | ] 780 | }, 781 | { 782 | "cell_type": "code", 783 | "execution_count": null, 784 | "metadata": {}, 785 | "outputs": [], 786 | "source": [ 787 | "# applying a mask to the dataset\n", 788 | "# to only keep records that are benign\n", 789 | "df[df['diagnosis']=='B']" 790 | ] 791 | }, 792 | { 793 | "cell_type": "markdown", 794 | "metadata": {}, 795 | "source": [ 796 | "Recall that a Series acts very much like a NumPy array. This mean that the expression `df['diagnosis']=='B'` would create a long array of `True` and `False`, depending on whether the element in the `diagnosis` field is `=='B'` or not. This sequence of boolean values acts as a mask on the dataset - the DataFrame knows only to keep records that contains a `True` value from the mask." 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | "execution_count": null, 802 | "metadata": {}, 803 | "outputs": [], 804 | "source": [ 805 | "# mask\n", 806 | "df['diagnosis']=='B'" 807 | ] 808 | }, 809 | { 810 | "cell_type": "markdown", 811 | "metadata": {}, 812 | "source": [ 813 | "**2. 
`.loc[]` and `.iloc[]` attributes** \n", 814 | "\n", 815 | "The `.loc[]` attribute supports slicing and filtering rows, as well:" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": null, 821 | "metadata": {}, 822 | "outputs": [], 823 | "source": [ 824 | "# slicing rows using index values (which in this case is same as index positions)\n", 825 | "df.loc[:100]" 826 | ] 827 | }, 828 | { 829 | "cell_type": "markdown", 830 | "metadata": {}, 831 | "source": [ 832 | "The nice thing about `.loc[]` is that it allows you to filter or slice for rows and columns at the same time:" 833 | ] 834 | }, 835 | { 836 | "cell_type": "code", 837 | "execution_count": null, 838 | "metadata": {}, 839 | "outputs": [], 840 | "source": [ 841 | "# filtering for benign records\n", 842 | "df.loc[df['diagnosis']=='B', 'diagnosis':'area_mean']" 843 | ] 844 | }, 845 | { 846 | "cell_type": "markdown", 847 | "metadata": {}, 848 | "source": [ 849 | "We can also slice rows and columns simultaneously with the `.iloc[]` attribute:" 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "execution_count": null, 855 | "metadata": {}, 856 | "outputs": [], 857 | "source": [ 858 | "df.iloc[:100, [1, 5]]" 859 | ] 860 | }, 861 | { 862 | "cell_type": "markdown", 863 | "metadata": {}, 864 | "source": [ 865 | "### 2.4 Feature Engineering & Applying Functions\n", 866 | "\n", 867 | "Often, we will want to engineer new features from existing ones or transform features in the dataset. We'll approach this in a couple of ways:\n", 868 | "\n", 869 | "1. Computing a new sequence of data with existing ones, and assigning it as a new column\n", 870 | "2. Using the `.apply()` DataFrame method\n", 871 | "\n", 872 | "
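A minimal sketch of the first approach, with made-up numbers rather than the tutorial's dataset — a new column is computed from existing ones and assigned by plain indexing:

```python
import pandas as pd

# Hypothetical mini-dataset; the column names are made up for illustration
df_demo = pd.DataFrame({'width': [2.0, 4.0], 'height': [3.0, 5.0]})

# Arithmetic on Series is element-wise, so one expression builds the new column
df_demo['area'] = df_demo['width'] * df_demo['height']
print(df_demo)
```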
  **1. Creating a new potential feature**
\n", 873 | "\n", 874 | "Looking at the available fields, it looks like there might be an opportunity to approximate how *irregularly shaped* a mass may have been. In particular, we are interested in these data fields:" 875 | ] 876 | }, 877 | { 878 | "cell_type": "code", 879 | "execution_count": null, 880 | "metadata": {}, 881 | "outputs": [], 882 | "source": [ 883 | "df_new = df.loc[:, ['area_worst', 'radius_worst', 'perimeter_worst', 'symmetry_worst', 'diagnosis']]\n", 884 | "df_new" 885 | ] 886 | }, 887 | { 888 | "cell_type": "markdown", 889 | "metadata": {}, 890 | "source": [ 891 | "We know that the area of a circle is found by the equation $A = \\pi r^2$, and that the circumference of a circle is given by $C = 2 \\pi r$. If we make the assumption that the mass is **not** irrecgularly shaped, i.e. the mass has a circular shape, then the measured perimeter of the mass and the calculated circumference should in theory be pretty similar. If the perimeter is larger than the circumference by a lot, that may be a good indicator that there is irregulary in the shape, which may be a good predictor of a malignant mass.\n", 892 | "\n", 893 | "
Let's calculate the circumference of the mass given its measured area and radius, and create a new field that captures the ratio of the calculated circumference to the measured perimeter:
\n", 894 | "\n" 895 | ] 896 | }, 897 | { 898 | "cell_type": "code", 899 | "execution_count": null, 900 | "metadata": {}, 901 | "outputs": [], 902 | "source": [ 903 | "# Series behave like NumPy arrays - the same rules of arithmetic operations apply here\n", 904 | "\n", 905 | "# C = 2*A/r : circumference = 2 x area / radius\n", 906 | "circumference = 2*df_new['area_worst']/df_new['radius_worst']\n", 907 | "\n", 908 | "# creating the new ratio field\n", 909 | "df_new['ratio_CtoP'] = circumference / df_new['perimeter_worst']\n", 910 | "\n", 911 | "df_new" 912 | ] 913 | }, 914 | { 915 | "cell_type": "markdown", 916 | "metadata": {}, 917 | "source": [ 918 | "Nice! We have engineered our first feature.\n", 919 | "\n", 920 | "
\n", 921 | "\n", 922 | "
  **2. `apply()`**
\n", 923 | "\n", 924 | "`.apply` allows us to take a function and apply it to the Pandas series or dataframe. \n", 925 | "\n", 926 | "
Let's standardize the columns area_worst, radius_worst, and perimeter_worst by applying the function we had defined earlier:
\n" 927 | ] 928 | }, 929 | { 930 | "cell_type": "code", 931 | "execution_count": null, 932 | "metadata": {}, 933 | "outputs": [], 934 | "source": [ 935 | "# check the docs for more details!\n", 936 | "df.apply?" 937 | ] 938 | }, 939 | { 940 | "cell_type": "code", 941 | "execution_count": null, 942 | "metadata": {}, 943 | "outputs": [], 944 | "source": [ 945 | "# function to standardize data\n", 946 | "def standard_units(numbers_array):\n", 947 | " \"Convert an array of numbers to standard units\"\n", 948 | " return (numbers_array - np.mean(numbers_array))/np.std(numbers_array)" 949 | ] 950 | }, 951 | { 952 | "cell_type": "code", 953 | "execution_count": null, 954 | "metadata": {}, 955 | "outputs": [], 956 | "source": [ 957 | "# applying the function to the 3 fields\n", 958 | "df.loc[:, ['area_worst', 'radius_worst', 'perimeter_worst']].apply(standard_units)" 959 | ] 960 | }, 961 | { 962 | "cell_type": "markdown", 963 | "metadata": {}, 964 | "source": [ 965 | "..so much more elegant than extracting each individual field as a Series, plugging them into the function, and setting each new output a as a new column in the dataset!\n", 966 | "\n", 967 | "### 2.5 Data Aggregation\n", 968 | "The final concept we will cover is the concept of **grouped operations**. Grouping datasets allow us to efficiently compute and compare aggregations of data values conducted separately for each group of fields.\n", 969 | "\n", 970 | "In the section above, we have just explored shape irregularity as a possible predictor of malignant vs. benign masses. One way to analyze whether we may be onto something is to compute the feature's summary statistic separately for the two groups, and see if we observe a notable difference. We can do this with the `.groupby()` method for Pandas DataFrames, which organizes the data into groups based on the values of the group-by variable, and computes an aggregation on the members of each group such that we are left with an aggregate value for each group. 
It takes the form:\n", 971 | "\n", 972 | "`df.groupby(group_variable).aggregation()`\n", 973 | "\n", 974 | "\n", 975 | "
Let's find the averages by diagnosis of the new and existing features in the df_new data:
\n" 976 | ] 977 | }, 978 | { 979 | "cell_type": "code", 980 | "execution_count": null, 981 | "metadata": {}, 982 | "outputs": [], 983 | "source": [ 984 | "df_new.groupby('diagnosis').mean()" 985 | ] 986 | }, 987 | { 988 | "cell_type": "markdown", 989 | "metadata": {}, 990 | "source": [ 991 | "Recall that there are only two possible diagnoses: **B**enign, or **M**alignant. We had seen earlier from looking at its `.value_counts()` that there are 357 benign records and 212 malignant records. The `groupby` operation above is calculating the averages for each of the 5 features in the `df_new` dataset, for both the group of 357 benign records and, separately, 212 malignant records. If we were to filter the dataset for benign records, isolate the \"area_worst\" field, and calculate its mean, we would arrive at the value in the upper left-most cell:" 992 | ] 993 | }, 994 | { 995 | "cell_type": "code", 996 | "execution_count": null, 997 | "metadata": {}, 998 | "outputs": [], 999 | "source": [ 1000 | "# verify that this matches the mean of \"area_worst\" for the benign diagnostic group:\n", 1001 | "print(\"Benign cases:\", df_new[df_new['diagnosis']==\"B\"]['area_worst'].mean())" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "markdown", 1006 | "metadata": {}, 1007 | "source": [ 1008 | "Aggregation allows us to quickly consolidate data by a specific category(ies). 
We can apply built-in aggregators (like `.mean()`) or user-defined aggregating functions, like below:" 1009 | ] 1010 | }, 1011 | { 1012 | "cell_type": "code", 1013 | "execution_count": null, 1014 | "metadata": {}, 1015 | "outputs": [], 1016 | "source": [ 1017 | "# mean of standard units should be very close to 0 \n", 1018 | "def meanOfStandardUnits(numbers_array):\n", 1019 | " return np.mean(standard_units(numbers_array))" 1020 | ] 1021 | }, 1022 | { 1023 | "cell_type": "code", 1024 | "execution_count": null, 1025 | "metadata": {}, 1026 | "outputs": [], 1027 | "source": [ 1028 | "df_new.groupby('diagnosis').aggregate(meanOfStandardUnits)" 1029 | ] 1030 | }, 1031 | { 1032 | "cell_type": "markdown", 1033 | "metadata": {}, 1034 | "source": [ 1035 | "\n", 1036 | "**Congratulations!** You've made it through the datathon tutorial notebooks. While there is a lot more to learn beyond what's covered in this tutorial when it comes to the art and science of working with data, you have begun to build a solid foundation from which you can dive into the world of data science. As long as you remain curious and leverage the many resources available to you (documentation sites, Kaggle community, Stack Overflow, WiDS workshops, etc.), you are bound to rapidly develop your data science repertoire. 
Good luck, and have fun!\n", 1037 | "\n", 1038 | "---\n", 1039 | "\n", 1040 | "#### Content adapted from: \n", 1041 | "- Jupyter Notebook modules from the [UC Berkeley Data Science Modules Program](https://ds-modules.github.io/DS-Modules/) licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)\n", 1042 | " - [Data 8X Public Materials for 2022](https://github.com/ds-modules/materials-x22/) by Sean Morris\n", 1043 | "- [Composing Programs](https://www.composingprograms.com/) by John DeNero based on the textbook [Structure and Interpretation of Computer Programs](https://mitpress.mit.edu/9780262510875/structure-and-interpretation-of-computer-programs/) by Harold Abelson and Gerald Jay Sussman, licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) \n" 1044 | ] 1045 | } 1046 | ], 1047 | "metadata": { 1048 | "anaconda-cloud": {}, 1049 | "kernelspec": { 1050 | "display_name": "Python 3 (ipykernel)", 1051 | "language": "python", 1052 | "name": "python3" 1053 | }, 1054 | "language_info": { 1055 | "codemirror_mode": { 1056 | "name": "ipython", 1057 | "version": 3 1058 | }, 1059 | "file_extension": ".py", 1060 | "mimetype": "text/x-python", 1061 | "name": "python", 1062 | "nbconvert_exporter": "python", 1063 | "pygments_lexer": "ipython3", 1064 | "version": "3.11.5" 1065 | } 1066 | }, 1067 | "nbformat": 4, 1068 | "nbformat_minor": 2 1069 | } 1070 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/03_More_DataStructures-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Part 3: More Data Structures\n", 8 | "\n", 9 | "In the previous notebook, we looked at ways to store and work with a collection of data values within a sequence, like a list or an array. 
We also explored insights we can gain from computing summary statistics on those collections of data. Now we need a way to store and work with **multiple** separate collections of information about the same population - e.g. in addition to the test scores of the class, perhaps we also know the students' IDs, and results from one other exam they have taken in the past. In this notebook, we will work with data structures that allow us to store multiple pieces of information, or **features**, about a population; and we will explore a powerful library that allows us to read in, explore and manipulate datasets.\n", 10 | "\n", 11 | "## Table of Contents\n", 12 | " \n", 13 | "**1. [Collections of Multiple Features](#multFeat)** \n", 14 | "    **1.1.** [Dictionaries](#dict) \n", 15 | "    **1.2.** [Matrices](#matrix) \n", 16 | "**2. [Pandas Library](#pd)** \n", 17 | "    **2.1.** [Reading in Data](#import) \n", 18 | "    **2.2.** [Exploring the Pandas DataFrame](#df) \n", 19 | "    **2.3.** [Selecting Rows & Columns](#loc) \n", 20 | "    **2.4.** [Applying Functions](#apply) \n", 21 | "    **2.5.** [Data Aggregation](#group) \n", 22 | "\n", 23 | "\n", 24 | "---\n", 25 | "\n", 26 | "## 1. Collections of Multiple Features\n", 27 | "\n", 28 | "First, we will explore 2 important data structures that allow us to organize multiple features of a given population: dictionaries and matrices.\n", 29 | "\n", 30 | "### 1.1. Dictionaries\n", 31 | "\n", 32 | "We can use arrays to collect data points on a single feature of the population, like `test_scores`. When we have multiple arrays capturing different features of the same population, one way to store all of the different arrays is in a **dictionary**.\n", 33 | "\n", 34 | "Here are two examples of a dictionary:" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "# ex1. 
dictionary of single values\n", 44 | "numerals = {'I': 1, 'V': 5, \"X\": 10}\n", 45 | "numerals" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "# ex2. dictionary of collections\n", 55 | "class_dict = {\"student_ID\":np.arange(1, len(test_scores)+1), \n", 56 | " \"test_scores\":test_scores}\n", 57 | "class_dict" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "  **Definition:**\n", 65 | "
\n", 66 | "A dictionary organizes data into key-value pairs. This allows us to store and retrieve values indexed not by consecutive integers, but by descriptive keys. \n", 67 | "\n", 68 | "- Keys: Strings commonly serve as keys since they enable us to represent names of things. In the context of storing data, they are the column names, or names of the value(s) it represents. \n", 69 | "- Values: The data that we are storing. This can be a single value, or a collection of values. \n", 70 | "
\n", 71 | "\n", 72 | "
  **Accessing Dictionary Contents** \n", 73 | "\n", 74 | "1. Access a dictionary value by indexing the dictionary by the corresponding key:" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "# 1. get value associated with \"test_scores\"\n", 84 | "class_dict['test_scores']" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "2. Dictionaries have methods that give us access to a list of its keys, values, and key-value pairs: " 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "## a. list of dictionary keys:\n", 101 | "class_dict.keys()" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "## b. list of dictionary values:\n", 111 | "class_dict.values()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "## c. list of (key, value) pairs:\n", 121 | "class_dict.items()" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "
\n", 129 | "   Unlike lists and arrays, dictionary are unordered so the order in which the key:value pairs appear in the dictionary may change when you run code cells. \n", 130 | "
" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "
  **Adding Key-Value Pairs** \n", 138 | "\n", 139 | "You can add a new item (key-value pair) into the existing dictionary by assigning a value to a new key in the dictionary:\n", 140 | "\n", 141 | "```python\n", 142 | "dictionary['new_key'] = new_value\n", 143 | "```" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "# adding a new entry\n", 153 | "past_scores = np.array([89.0, 94.2, 78.0, 86.2, 81.2, 86.0, 88.3, 84.9, 88.1, 93.0, 82.2, 78.2, 96.1, 95.9, 98.2])\n", 154 | "\n", 155 | "class_dict[\"past_test_score\"] = past_scores\n", 156 | "class_dict" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": { 163 | "scrolled": true 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "class_dict.items()" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "
\n", 175 | "Note: There can only be 1 value per key. If you attempt to assign a new value to the dictionary but specify a key name that already exists in the dictionary, the existing values associated with that key will be overwritten.\n", 176 | "
\n", 177 | "\n", 178 | "---\n", 179 | "\n", 180 | "### 1.2. Matrices\n", 181 | "\n", 182 | "Dictionaries organize information by features - all data values capturing student IDs are boxed into one container and saved into a dictionary; and data values about test scores from last year are boxed up into a separate container and saved into the dictionary under a differet label.\n", 183 | "\n", 184 | "But when you think about it, that is not the most helpful way to organize data when you are trying to **make predictions** about a specific case. For instance, say you were trying to guess what animal each record (row) is, given the following features:\n", 185 | "\n", 186 | "||Opposable Thumbs|Class of Animal|Diet |Tail Length |Number of Legs | Flies|\n", 187 | "|:-:|:-:|:-:|:-:|:-:|:-:|:-:|\n", 188 | "|**0**|True|Mammal|Bananas| long | 2|False|\n", 189 | "|**1**|False|Anthropod|Insects| none |8| False|\n", 190 | "|**2**|False|Bird|Fish|short|2|False|\n", 191 | "\n", 192 | "We wouldn't want to look at the data one column at a time, the way dictionaries are organized, when we want to predict what animal record **0** might be. Instead, we'd want to look all the features of the one record at the same time:\n", 193 | "\n", 194 | "||Opposable Thumbs|Animal Class|Diet |Tail Length | Wings |Number of Legs | Flies|\n", 195 | "|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|\n", 196 | "|**0**|True|Mammal|Bananas| long | False | 2|False|\n", 197 | "\n", 198 | "To make a prediction about a particular record, we need to consider all its features - we want to organize information by record, not by column. This is exactly what a **matrix** is designed to do, and it is in this matrix form that we ultimately feed the data into machine learning models.\n", 199 | "\n", 200 | "  **Definition:**\n", 201 | "
\n", 202 | " A matrix is a rectangular list*, or a list of lists. We say that matrix $M$ has shape $m \\times n$:\n", 203 | " \n", 204 | "* It has **m** rows: each row is a list of all features that describe a single record;\n", 205 | "* It has **n** columns: each column is displayed as elements in the same position/index of every row, and represents a specific feature of the data\n", 206 | "\n", 207 | "
\n", 208 | "\n", 209 | "Our example above in matrix form would look like:" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "animal_matrix = [[ True, 'Mammal', 'Bananas', 'long', 2, False],\n", 219 | " [False, 'Anthropod', 'Insects', 'none', 9, False],\n", 220 | " [False, 'Bird', 'Fish', 'short', 2, False]]\n", 221 | "\n", 222 | "animal_matrix" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "Compare this to how the data would be represented in a dictionary:" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "animal_dict = {\"Opposable Thumbs\": [True, False, False],\n", 239 | " \"Class of Animal\": ['Mammal', 'Anthropod', 'Bird'],\n", 240 | " \"Diet\": ['Bananas', 'Insects', 'Fish'],\n", 241 | " \"Tail Length\": ['long', 'none', 'short'],\n", 242 | " \"Number of Legs\": [2, 8, 2],\n", 243 | " \"Flies\": [False, False, False]}\n", 244 | "animal_dict" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "
\n", 252 | "  *We use lists in this example to demonstrate what a matrix looks like, since the features are represented by different value types (and values in NumPy arrays must all be of the same type). However, NumPy's representation of the matrix ndarray, or the n-dimensional array, is usually preferred over using Python lists because NumPy arrays consume less memory and is able to handle operations much more efficiently than lists. Even though we have a mix of data types in our example, that does not mean we are stuck using lists. There are many ways to transform categorical features of datasets into numerical features; figuring out how best to handle categorical variables (like \"Diet\" and \"Animal Class\") is a big part of data wrangling for predictive modeling!\n", 253 | "
" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "---\n", 261 | "\n", 262 | "## 2. Pandas Library\n", 263 | "\n", 264 | "Now for the exciting part! Up to this point we have been fabricating data in the notebook to serve as our examples. With the introduction of the Pandas Library, we can **import** real data files into Jupyter Notebooks to explore. Let's do that now!\n", 265 | "\n", 266 | "
\n", 267 | "\n", 268 | "**Kaggle: our data source** \n", 269 | "We will use the \"Breast Cancer Wisconsin (Diagnostic) Dataset\" from **Kaggle**. Kaggle is a data science competition platform that hosts datathons, publish datasets, and support an online community of data scientists. Anyone is able to download the cleaned, published datasets to explore from the site and have access to an abundance of resources - from **data dictionaries** that detail data contents, to notebooks and code that other users of the data have posted. It's a great place to find interesting problems to explore and learn from others who have done/are doing the same.\n", 270 | "\n", 271 | "
\n", 272 | "\n", 273 | "**Pandas** \n", 274 | "Pandas is the standard tool for working with **dataframes**. A dataframe is a data structure that that represents data in a 2-dimensional table of rows and columns. We've seen a couple of examples of dataframes already, in the section on standard deviations, and just now in the matrix section. They are very useful for exploratory data analysis, data cleaning, and processing before turning them into matrices to be fed into machine learning models.\n", 275 | "\n", 276 | "
We've already imported the pandas library, but let's do that again here:
" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "import pandas as pd" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "---\n", 293 | "\n", 294 | "### 2.1. Reading in the Data\n", 295 | "\n", 296 | "Pandas allows us to easily \"read in\" data from a downloaded csv (comma separated values) file and save it as a variable in the Jupyter Notebook, with the `pd.read_csv()` function. It can take many different arguments depending on the desired specifications, but we can just accept the default for the optional parameters. The only required parameter is `filepath_or_buffer`, which asks for the **file path**, or location, of the data file on your computer so that it can find it and turn it into a Pandas dataframe. There are 2 ways to specify the file path:\n", 297 | "\n", 298 | "\n", 299 | "
  **Absolute File Path:** \n", 300 | "\n", 301 | "All of your files on the computer have a file path. If you go to the location of any file on your File Explorer, you can find its absolute file path by clicking the address bar at the top of the window. You'll see something like:\n", 302 | "\n", 303 | "C:/Users/username/folder/data_folder/filename.csv\n", 304 | "\n", 305 | "When you have all of the information needed to locate the file, all the way to the very first layer of folders, you have an absolute file path.\n", 306 | "\n", 307 | "\n", 308 | "
  **Relative File Path:**\n", 309 | "\n", 310 | "Your Jupyter notebook (ipynb) file that you are working on has a path, too. When you navigate from one folder to the next on the File Explorer, you often start at a file location (let's call that location A), back out of that folder, enter into another folder, and access the file in this new location (location B). We can do something similar with file paths, by specifying the path of location B **relative to** the location of A.\n", 311 | "\n", 312 | "Let's say this Jupyter notebook is found in location A, whose absolute path is: \n", 313 | "C:/Users/username/folder/myNotebook.ipynb\n", 314 | "\n", 315 | "So, this is the **current directory**, or the location we are starting from: \n", 316 | "C:/Users/username/folder/\n", 317 | "\n", 318 | "From here, we want to get to location B: \n", 319 | "C:/Users/username/folder/data_folder/filename.csv\n", 320 | "\n", 321 | "
To do this, we can specify the relative path:
\n", 322 | "./data_folder/filename.csv
\n", 323 | "\n", 324 | "



\n", 325 | "\n", 326 | "**Notation:** \n", 327 | "- The **`.`** in the relative path indicates we are **staying in the same, current directory**. Since the `data_folder` that contains the desired file is **inside** the current directory we started out in, we indicate that it is from here that we then move into another folder, or identify a file to point to.\n", 328 | "
\n", 329 | "\n", 330 | "- The **`..`** indicates we need to **back out of the current folder**. It's the equivalent of clicking the back button on File Explorer. \n", 331 | ">**Example**:
\n", 332 | ">We can back out of multiple folders - say there is another file in this location we want to get to:\n", 333 | ">`C:/Users/another_user/theirFile.csv` \n", 334 | ">\n", 335 | ">We can access this from location A with the relative path: \n", 336 | ">`../../another_user/theirFile.csv`\n", 337 | "\n", 338 | "\n", 339 | "Once we have the path, all we need to do it put it in string form, an input it as an argument!\n", 340 | "\n", 341 | "
Run the code below to read in our first dataset!
\n" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": null, 347 | "metadata": {}, 348 | "outputs": [], 349 | "source": [ 350 | "# using relative path!\n", 351 | "df = pd.read_csv(\"./data/data.csv\")\n", 352 | "df" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "This dataset captures measurements and characteristics of breast mass (e.g. mass radius, smoothness, symmery) and the actual diagnosis of the mass. The challenge here would be to predict the diagnosis from the features of the mass. The purpose of this section is to introduce data manipulation using Pandas dataframes and series so we will not be tackling the challenge in this tutorial, but the notebooks uploaded on the Kaggle page would be a great place to see what other people have done with this dataset!\n", 360 | "\n", 361 | "---\n", 362 | "\n", 363 | "### 2.2. Exploring the Pandas DataFrame\n", 364 | "\n", 365 | "The Pandas DataFrame data structure allows us to easily access both the rows (records) **and** columns (features). It can be created in many different ways: from scratch, from a dictionary of values, from a matrix, from reading in a dataset, etc.\n", 366 | "\n", 367 | "\n", 368 | "
 **From scratch**: \n", 369 | "\n", 370 | "Below is an empty DataFrame object - it has no column or row yet. Run the code to see what it looks like:" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": null, 376 | "metadata": {}, 377 | "outputs": [], 378 | "source": [ 379 | "df_fromScratch = pd.DataFrame()\n", 380 | "df_fromScratch" 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "We can add a column to the dataframe in the same way that we can add new key-value pairs to dictionaries:" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": null, 393 | "metadata": {}, 394 | "outputs": [], 395 | "source": [ 396 | "df_fromScratch['first_column'] = np.arange(1, 10)\n", 397 | "df_fromScratch['second_column']= np.arange(9, 0, -1)\n", 398 | "df_fromScratch" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "metadata": {}, 404 | "source": [ 405 | "However, once one column of a certain **length** is added to a dataframe, all other new columns must be of the same length:" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": null, 411 | "metadata": {}, 412 | "outputs": [], 413 | "source": [ 414 | "# this will throw an error\n", 415 | "df_fromScratch['short_column'] = np.array([7, 8, 9])" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "
  **From a dictionary:** \n", 423 | "\n", 424 | "We can also convert a dictionary of data into a Pandas DataFrame, as long as the number of elements captured in each dictionary value is the same:" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "# animal dictionary from earlier\n", 434 | "pd.DataFrame(animal_dict)" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": {}, 440 | "source": [ 441 | "
  **From a matrix:** \n", 442 | "\n", 443 | "...and the same goes for matrices. Since a matrix does not carry names the way dictionary keys do, we can include an argument to specify the column names:" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": null, 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "columnNames = ['Opposable Thumbs', 'Class of Animal', 'Diet', 'Tail Length', 'Number of Legs', 'Flies']\n", 453 | "pd.DataFrame(animal_matrix, columns = columnNames)" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "
  **Exploring data contents:** \n", 461 | "\n", 462 | "Let's explore the Pandas capabilities using the breast cancer data we read in earlier. The first step we'd want to take when exploring a dataset is to understand what the dataset contains. The DataFrame object has many attributes to help us with this task:\n", 463 | "\n", 464 | "1. Identify the number of rows (records) and columns (features) in the data\n", 465 | "2. Get info on the column names, their position on the dataframe, how many non-**null\\*** values there are in each feature, and what the data type of each feature is\n", 466 | "3. Get a list of the columns in the dataset\n", 467 | "4. Create a table of summary statistics on all numeric features\n", 468 | "\n" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": null, 474 | "metadata": {}, 475 | "outputs": [], 476 | "source": [ 477 | "# 1. find the number of rows and columns (row, col) in the dataset\n", 478 | "df.shape" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [ 487 | "# 2. summary of features\n", 488 | "df.info()" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": null, 494 | "metadata": {}, 495 | "outputs": [], 496 | "source": [ 497 | "# 3. list of column names\n", 498 | "df.columns" 499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": null, 504 | "metadata": {}, 505 | "outputs": [], 506 | "source": [ 507 | "# 4. summary statistics of numeric features\n", 508 | "df.describe()" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "
\n", 516 | "  A null or nan value represents an unknown or missing value in the data - it is an empty entry. If a feature is riddled with missing values, we may need to drop the feature from the investigation since it may not capture enough valid data, or the valid data it does capture may be biased. If a feature has some missing values but still captures valuable information, we will want to clean the data so to replace these values with something more interpretable. We won't touch on this here, but you can find a guide on how to work with missing data using Pandas in their documentation.\n", 517 | "
\n", 518 | "\n" 519 | ] 520 | }, 521 | { 522 | "cell_type": "markdown", 523 | "metadata": {}, 524 | "source": [ 525 | "
 **Selecting a DataFrame feature - Series**:
\n", 526 | "\n", 527 | "We know from the `.info` output above that there is only one non-numeric field in the dataset, and that is the target variable - the diagnosis. Let's understand this target variable better.\n", 528 | "\n", 529 | "We can select a feature from the DataFrame in a similar way to how we would get the value of a dictionary - by indexing the dataframe by the column name:" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "# get the data values of `diagnosis` column\n", 539 | "df['diagnosis']" 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": {}, 545 | "source": [ 546 | "The extracted column is stored in a Pandas data structure called a **Pandas Series**. \n", 547 | "\n", 548 | "  **Definition:**\n", 549 | "
\n", 550 | " A Series is a Pandas data structure that behaves very similarly to NumPy arrays and will be a valid argument to most NumPy functions. Series are also similar to dictionaries, in that its values can have index labels and be indexed by these labels.\n", 551 | "
\n", 552 | "\n", 553 | "For instance:" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": null, 559 | "metadata": {}, 560 | "outputs": [], 561 | "source": [ 562 | "# this is an array\n", 563 | "array_a = np.arange(1, 6)\n", 564 | "\n", 565 | "# this is a Series, with non-numeric index labels, and a name\n", 566 | "series_a = pd.Series(array_a, index=['a', 'b', 'c', 'd', 'e'], name=\"Series_A\")\n", 567 | "\n", 568 | "print(\"array: \", array_a)\n", 569 | "print(\"\\nSeries: \")\n", 570 | "print(series_a)" 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": {}, 576 | "source": [ 577 | "`series_a` has non-numeric indices. If I want to extract a value from the structure, I can index using its positional index (like an array), or using its label index (like a dictionary):" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": null, 583 | "metadata": {}, 584 | "outputs": [], 585 | "source": [ 586 | "# extracting value like an array\n", 587 | "series_a[1]" 588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "execution_count": null, 593 | "metadata": { 594 | "scrolled": true 595 | }, 596 | "outputs": [], 597 | "source": [ 598 | "# extracting value like a dictionary \n", 599 | "series_a['b']" 600 | ] 601 | }, 602 | { 603 | "cell_type": "markdown", 604 | "metadata": {}, 605 | "source": [ 606 | "A Series can also have a `name` attribute, which is how Pandas knows to name the dataframe when a Series object is turned into a dataframe:" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "metadata": {}, 613 | "outputs": [], 614 | "source": [ 615 | "series_a.to_frame()" 616 | ] 617 | }, 618 | { 619 | "cell_type": "markdown", 620 | "metadata": {}, 621 | "source": [ 622 | "**Now back to our data.** If we were to predict the diagnosis based on the cancer mass attributes, it would be good to know how many categories of diagnoses there may be. 
We want to find the unique values of the variable.\n", 623 | "\n", 624 | "Like the DataFrame object, the Pandas Series object also has many useful attributes. Let's use a couple of them here to better understand the field:\n" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": null, 630 | "metadata": {}, 631 | "outputs": [], 632 | "source": [ 633 | "# Find all unique values of the field\n", 634 | "df['diagnosis'].unique()" 635 | ] 636 | }, 637 | { 638 | "cell_type": "markdown", 639 | "metadata": {}, 640 | "source": [ 641 | "There are only 2 possible values for the `diagnosis` variable - malignant (`M`) and benign (`B`). Use the `.value_counts()` method to count how many of each are in the dataset:" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": null, 647 | "metadata": {}, 648 | "outputs": [], 649 | "source": [ 650 | "df['diagnosis'].value_counts()" 651 | ] 652 | }, 653 | { 654 | "cell_type": "markdown", 655 | "metadata": {}, 656 | "source": [ 657 | "### 2.3 Selecting Rows & Columns\n", 658 | "\n", 659 | "What if we wanted to create subsets of our data, without pulling them out one by one into Pandas Series objects? We will often want to select a set of features (columns) to keep, or filter for data records (rows) that meet specific criteria. There are a number of ways to accomplish this:\n", 660 | "\n", 661 | "
 **Creating a dataset with fewer selected features:**
\n", 662 | "\n", 663 | "Often times we want to investigate just a couple of fields from the data. In these cases, we may want to create a smaller dataset for greater efficiency and run times. We can select fields to keep in a few ways:\n", 664 | "\n", 665 | "**1. Double square brackets `[]`** \n", 666 | "We can index the dataframe with a list of column names to create a dataset with just those columns (but with all the rows)." 667 | ] 668 | }, 669 | { 670 | "cell_type": "code", 671 | "execution_count": null, 672 | "metadata": {}, 673 | "outputs": [], 674 | "source": [ 675 | "df[['diagnosis', 'area_mean']]" 676 | ] 677 | }, 678 | { 679 | "cell_type": "markdown", 680 | "metadata": {}, 681 | "source": [ 682 | "**2. `.loc[]` attribute** \n", 683 | "We can do the same using the `.loc` attribute. This attribute allows us to specify the column names to keep **and** filter the rows at the same time.\n", 684 | "\n", 685 | "Just as we could slice (extract specific ranges of) sequences based their positional indices, we can slice the data rows and data columns by their index labels. \n", 686 | "\n", 687 | "The `.loc[]` attribute takes two ranges. 
The range for rows is specified first, and the range for columns second: \n", 688 | "\n", 689 | "df.loc\\[ startRowLabel : endRowLabel, startColName : endColName \\]" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": null, 695 | "metadata": {}, 696 | "outputs": [], 697 | "source": [ 698 | "# grabs all records, and all columns positioned between and including `diagnosis` and `area_mean`\n", 699 | "df.loc[:, \"diagnosis\":\"area_mean\"]" 700 | ] 701 | }, 702 | { 703 | "cell_type": "code", 704 | "execution_count": null, 705 | "metadata": {}, 706 | "outputs": [], 707 | "source": [ 708 | "# if we just want the 2 columns and not the columns in between, we leverage the double-bracket\n", 709 | "df.loc[:, [\"diagnosis\",\"area_mean\"]]" 710 | ] 711 | }, 712 | { 713 | "cell_type": "markdown", 714 | "metadata": {}, 715 | "source": [ 716 | "**3. `.iloc[]` attribute** \n", 717 | "This is very similar to `.loc[]`, but instead of using row and column labels, we specify index positions instead. We can see below that `diagnosis` is found at index `1`, and `area_mean` at index `5`. 
So, if we specify the range `1:6`, we should get the same table as before.\n", 718 | "\n", 719 | "*Remember that, when slicing with indices, the `stop` value in the `start`:`stop` range is excluded from the selection.*" 720 | ] 721 | }, 722 | { 723 | "cell_type": "code", 724 | "execution_count": null, 725 | "metadata": {}, 726 | "outputs": [], 727 | "source": [ 728 | "# run to see column positions\n", 729 | "df.columns" 730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": null, 735 | "metadata": {}, 736 | "outputs": [], 737 | "source": [ 738 | "# slicing using index positions\n", 739 | "df.iloc[:, 1:6]" 740 | ] 741 | }, 742 | { 743 | "cell_type": "code", 744 | "execution_count": null, 745 | "metadata": {}, 746 | "outputs": [], 747 | "source": [ 748 | "df.iloc[:, [1, 5]]" 749 | ] 750 | }, 751 | { 752 | "cell_type": "markdown", 753 | "metadata": {}, 754 | "source": [ 755 | "
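One subtlety worth remembering when choosing between the two: `.loc[]` slices by label and **includes** the stop label, while `.iloc[]` slices by position and **excludes** the stop position. A quick sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({"a": [10, 20, 30, 40]})

# positional slice 0:2 -> positions 0 and 1 only
print(len(toy.iloc[0:2]))   # 2 rows

# label slice 0:2 -> labels 0, 1 AND 2
print(len(toy.loc[0:2]))    # 3 rows
```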
 **Slicing and Filtering Dataset Records:**
\n", 756 | "\n", 757 | "Just as we can create data with subsets of columns, we can create data with subsets of rows.\n", 758 | "\n", 759 | "**1. Regular indexing** \n", 760 | "When we specify a range of integer values, DataFrames know to slice the rows:" 761 | ] 762 | }, 763 | { 764 | "cell_type": "code", 765 | "execution_count": null, 766 | "metadata": {}, 767 | "outputs": [], 768 | "source": [ 769 | "# keep first 100 records\n", 770 | "df[:100]" 771 | ] 772 | }, 773 | { 774 | "cell_type": "markdown", 775 | "metadata": {}, 776 | "source": [ 777 | "**2. Filtering by criteria** \n", 778 | "We can also filter by a criteria in the data. For instance, what if we only wanted to check out the distributions of features for masses that are known to be \"benign\"? We would create what we call a **mask**, and apply it to the dataset, like this:\n" 779 | ] 780 | }, 781 | { 782 | "cell_type": "code", 783 | "execution_count": null, 784 | "metadata": {}, 785 | "outputs": [], 786 | "source": [ 787 | "# applying a mask to the dataset\n", 788 | "# to only keep records that are benign\n", 789 | "df[df['diagnosis']=='B']" 790 | ] 791 | }, 792 | { 793 | "cell_type": "markdown", 794 | "metadata": {}, 795 | "source": [ 796 | "Recall that a Series acts very much like a NumPy array. This mean that the expression `df['diagnosis']=='B'` would create a long array of `True` and `False`, depending on whether the element in the `diagnosis` field is `=='B'` or not. This sequence of boolean values acts as a mask on the dataset - the DataFrame knows only to keep records that contains a `True` value from the mask." 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | "execution_count": null, 802 | "metadata": {}, 803 | "outputs": [], 804 | "source": [ 805 | "# mask\n", 806 | "df['diagnosis']=='B'" 807 | ] 808 | }, 809 | { 810 | "cell_type": "markdown", 811 | "metadata": {}, 812 | "source": [ 813 | "**2. 
`.loc[]` and `.iloc[]` attributes** \n", 814 | "\n", 815 | "The `.loc[]` attribute supports slicing and filtering rows, as well:" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": null, 821 | "metadata": {}, 822 | "outputs": [], 823 | "source": [ 824 | "# slicing rows using index labels (which here match the index positions; note the stop label 100 is included, so this keeps 101 rows)\n", 825 | "df.loc[:100]" 826 | ] 827 | }, 828 | { 829 | "cell_type": "markdown", 830 | "metadata": {}, 831 | "source": [ 832 | "The nice thing about `.loc[]` is that it allows you to filter or slice for rows and columns at the same time:" 833 | ] 834 | }, 835 | { 836 | "cell_type": "code", 837 | "execution_count": null, 838 | "metadata": {}, 839 | "outputs": [], 840 | "source": [ 841 | "# filtering for benign records\n", 842 | "df.loc[df['diagnosis']=='B', 'diagnosis':'area_mean']" 843 | ] 844 | }, 845 | { 846 | "cell_type": "markdown", 847 | "metadata": {}, 848 | "source": [ 849 | "We can also slice rows and columns simultaneously with the `.iloc[]` attribute:" 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "execution_count": null, 855 | "metadata": {}, 856 | "outputs": [], 857 | "source": [ 858 | "df.iloc[:100, [1, 5]]" 859 | ] 860 | }, 861 | { 862 | "cell_type": "markdown", 863 | "metadata": {}, 864 | "source": [ 865 | "### 2.4 Feature Engineering & Applying Functions\n", 866 | "\n", 867 | "Often, we will want to engineer new features from existing ones or transform features in the dataset. We'll approach this in a couple of ways:\n", 868 | "\n", 869 | "1. Computing a new sequence of data from existing ones, and assigning it as a new column\n", 870 | "2. Using the `.apply()` DataFrame method\n", 871 | "\n", 872 | "
  **1. Creating a new potential feature**
\n", 873 | "\n", 874 | "Looking at the available fields, it looks like there might be an opportunity to approximate how *irregularly shaped* a mass may have been. In particular, we are interested in these data fields:" 875 | ] 876 | }, 877 | { 878 | "cell_type": "code", 879 | "execution_count": null, 880 | "metadata": {}, 881 | "outputs": [], 882 | "source": [ 883 | "df_new = df.loc[:, ['area_worst', 'radius_worst', 'perimeter_worst', 'symmetry_worst', 'diagnosis']]\n", 884 | "df_new" 885 | ] 886 | }, 887 | { 888 | "cell_type": "markdown", 889 | "metadata": {}, 890 | "source": [ 891 | "We know that the area of a circle is found by the equation $A = \\pi r^2$, and that the circumference of a circle is given by $C = 2 \\pi r$. If we make the assumption that the mass is **not** irrecgularly shaped, i.e. the mass has a circular shape, then the measured perimeter of the mass and the calculated circumference should in theory be pretty similar. If the perimeter is larger than the circumference by a lot, that may be a good indicator that there is irregulary in the shape, which may be a good predictor of a malignant mass.\n", 892 | "\n", 893 | "
Let's calculate the circumference of the mass given its measured area and radius, and create a new field that captures the ratio of the calculated circumference to the measured perimeter:
\n", 894 | "\n" 895 | ] 896 | }, 897 | { 898 | "cell_type": "code", 899 | "execution_count": null, 900 | "metadata": {}, 901 | "outputs": [], 902 | "source": [ 903 | "# Series behave like NumPy arrays - the same rules of arithmetic operations apply here\n", 904 | "\n", 905 | "# C = 2*A/r : circumference = 2 x area / radius\n", 906 | "circumference = 2*df_new['area_worst']/df_new['radius_worst']\n", 907 | "\n", 908 | "# creating the new ratio field\n", 909 | "df_new['ratio_CtoP'] = circumference / df_new['perimeter_worst']\n", 910 | "\n", 911 | "df_new" 912 | ] 913 | }, 914 | { 915 | "cell_type": "markdown", 916 | "metadata": {}, 917 | "source": [ 918 | "Nice! We have engineered our first feature.\n", 919 | "\n", 920 | "
\n", 921 | "\n", 922 | "
  **2. `apply()`**
\n", 923 | "\n", 924 | "`.apply` allows us to take a function and apply it to the Pandas series or dataframe. \n", 925 | "\n", 926 | "
Let's standardize the columns area_worst, radius_worst, and perimeter_worst by applying the function we had defined earlier:
\n" 927 | ] 928 | }, 929 | { 930 | "cell_type": "code", 931 | "execution_count": null, 932 | "metadata": {}, 933 | "outputs": [], 934 | "source": [ 935 | "# check the docs for more details!\n", 936 | "df.apply?" 937 | ] 938 | }, 939 | { 940 | "cell_type": "code", 941 | "execution_count": null, 942 | "metadata": {}, 943 | "outputs": [], 944 | "source": [ 945 | "# function to standardize data\n", 946 | "def standard_units(numbers_array):\n", 947 | " \"Convert an array of numbers to standard units\"\n", 948 | " return (numbers_array - np.mean(numbers_array))/np.std(numbers_array)" 949 | ] 950 | }, 951 | { 952 | "cell_type": "code", 953 | "execution_count": null, 954 | "metadata": {}, 955 | "outputs": [], 956 | "source": [ 957 | "# applying the function to the 3 fields\n", 958 | "df.loc[:, ['area_worst', 'radius_worst', 'perimeter_worst']].apply(standard_units)" 959 | ] 960 | }, 961 | { 962 | "cell_type": "markdown", 963 | "metadata": {}, 964 | "source": [ 965 | "..so much more elegant than extracting each individual field as a Series, plugging them into the function, and setting each new output a as a new column in the dataset!\n", 966 | "\n", 967 | "### 2.5 Data Aggregation\n", 968 | "The final concept we will cover is the concept of **grouped operations**. Grouping datasets allow us to efficiently compute and compare aggregations of data values conducted separately for each group of fields.\n", 969 | "\n", 970 | "In the section above, we have just explored shape irregularity as a possible predictor of malignant vs. benign masses. One way to analyze whether we may be onto something is to compute the feature's summary statistic separately for the two groups, and see if we observe a notable difference. We can do this with the `.groupby()` method for Pandas DataFrames, which organizes the data into groups based on the values of the group-by variable, and computes an aggregation on the members of each group such that we are left with an aggregate value for each group. 
It takes the form:\n", 971 | "\n", 972 | "`df.groupby(group_variable).aggregation()`\n", 973 | "\n", 974 | "\n", 975 | "
Let's find the averages by diagnosis of the new and existing features in the df_new data:
\n" 976 | ] 977 | }, 978 | { 979 | "cell_type": "code", 980 | "execution_count": null, 981 | "metadata": {}, 982 | "outputs": [], 983 | "source": [ 984 | "df_new.groupby('diagnosis').mean()" 985 | ] 986 | }, 987 | { 988 | "cell_type": "markdown", 989 | "metadata": {}, 990 | "source": [ 991 | "Recall that there are only two possible diagnoses: **B**enign, or **M**alignant. We had seen earlier from looking at its `.value_counts()` that there are 357 benign records and 212 malignant records. The `groupby` operation above is calculating the averages for each of the 5 features in the `df_new` dataset, for both the group of 357 benign records and, separately, 212 malignant records. If we were to filter the dataset for benign records, isolate the \"area_worst\" field, and calculate its mean, we would arrive at the value in the upper left-most cell:" 992 | ] 993 | }, 994 | { 995 | "cell_type": "code", 996 | "execution_count": null, 997 | "metadata": {}, 998 | "outputs": [], 999 | "source": [ 1000 | "# verify that this matches the mean of \"area_worst\" for the benign diagnostic group:\n", 1001 | "print(\"Benign cases:\", df_new[df_new['diagnosis']==\"B\"]['area_worst'].mean())" 1002 | ] 1003 | }, 1004 | { 1005 | "cell_type": "markdown", 1006 | "metadata": {}, 1007 | "source": [ 1008 | "Aggregation allows us to quickly consolidate data by a specific category(ies). 
We can apply built-in aggregators (like `.mean()`) or user-defined aggregating functions, like below:" 1009 | ] 1010 | }, 1011 | { 1012 | "cell_type": "code", 1013 | "execution_count": null, 1014 | "metadata": {}, 1015 | "outputs": [], 1016 | "source": [ 1017 | "# mean of standard units should be very close to 0 \n", 1018 | "def meanOfStandardUnits(numbers_array):\n", 1019 | " return np.mean(standard_units(numbers_array))" 1020 | ] 1021 | }, 1022 | { 1023 | "cell_type": "code", 1024 | "execution_count": null, 1025 | "metadata": {}, 1026 | "outputs": [], 1027 | "source": [ 1028 | "df_new.groupby('diagnosis').aggregate(meanOfStandardUnits)" 1029 | ] 1030 | }, 1031 | { 1032 | "cell_type": "markdown", 1033 | "metadata": {}, 1034 | "source": [ 1035 | "\n", 1036 | "**Congratulations!** You've made it through the datathon tutorial notebooks. While there is a lot more to learn beyond what's covered in this tutorial when it comes to the art and science of working with data, you have begun to build a solid foundation from which you can dive into the world of data science. As long as you remain curious and leverage the many resources available to you (documentation sites, Kaggle community, Stack Overflow, WiDS workshops, etc.), you are bound to rapidly develop your data science repertoire. 
Good luck, and have fun!\n", 1037 | "\n", 1038 | "---\n", 1039 | "\n", 1040 | "#### Content adapted from: \n", 1041 | "- Jupyter Notebook modules from the [UC Berkeley Data Science Modules Program](https://ds-modules.github.io/DS-Modules/) licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)\n", 1042 | " - [Data 8X Public Materials for 2022](https://github.com/ds-modules/materials-x22/) by Sean Morris\n", 1043 | "- [Composing Programs](https://www.composingprograms.com/) by John DeNero based on the textbook [Structure and Interpretation of Computer Programs](https://mitpress.mit.edu/9780262510875/structure-and-interpretation-of-computer-programs/) by Harold Abelson and Gerald Jay Sussman, licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) \n" 1044 | ] 1045 | } 1046 | ], 1047 | "metadata": { 1048 | "anaconda-cloud": {}, 1049 | "kernelspec": { 1050 | "display_name": "Python 3 (ipykernel)", 1051 | "language": "python", 1052 | "name": "python3" 1053 | }, 1054 | "language_info": { 1055 | "codemirror_mode": { 1056 | "name": "ipython", 1057 | "version": 3 1058 | }, 1059 | "file_extension": ".py", 1060 | "mimetype": "text/x-python", 1061 | "name": "python", 1062 | "nbconvert_exporter": "python", 1063 | "pygments_lexer": "ipython3", 1064 | "version": "3.11.5" 1065 | } 1066 | }, 1067 | "nbformat": 4, 1068 | "nbformat_minor": 2 1069 | } 1070 | --------------------------------------------------------------------------------