├── .gitignore
├── LICENSE
├── README.md
├── data
│   ├── cleaned_country_totals.csv
│   ├── countries.csv
│   └── country_total.csv
├── images
│   ├── df_diagram.svg
│   ├── groupby.svg
│   └── valcounts.svg
├── lessons
│   ├── 01_python_data_wrangling.ipynb
│   └── 02_python_data_wrangling.ipynb
└── solutions
    ├── 01_python_data_wrangling_solutions.ipynb
    └── 02_python_data_wrangling_solutions.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | unemployment_missing.csv
2 | .DS_Store
3 | .ipynb_checkpoints/
4 |
5 | # Byte-compiled / optimized / DLL files
6 | __pycache__/
7 | *.py[cod]
8 |
9 | # C extensions
10 | *.so
11 |
12 | # Distribution / packaging
13 | .Python
14 | env/
15 | build/
16 | develop-eggs/
17 | dist/
18 | downloads/
19 | eggs/
20 | .eggs/
21 | lib/
22 | lib64/
23 | parts/
24 | sdist/
25 | var/
26 | *.egg-info/
27 | .installed.cfg
28 | *.egg
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 |
50 | # Translations
51 | *.mo
52 | *.pot
53 |
54 | # Django stuff:
55 | *.log
56 |
57 | # Sphinx documentation
58 | docs/_build/
59 |
60 | # PyBuilder
61 | target/
62 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 |
2 | Creative Commons Attribution-NonCommercial 4.0 International Public License
3 |
4 | By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-NonCommercial 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.
5 |
6 | Section 1 – Definitions.
7 |
8 | Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image.
9 | Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License.
10 | Copyright and Similar Rights means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights.
11 | Effective Technological Measures means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements.
12 | Exceptions and Limitations means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material.
13 | Licensed Material means the artistic or literary work, database, or other material to which the Licensor applied this Public License.
14 | Licensed Rights means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license.
15 | Licensor means the individual(s) or entity(ies) granting rights under this Public License.
16 | NonCommercial means not primarily intended for or directed towards commercial advantage or monetary compensation. For purposes of this Public License, the exchange of the Licensed Material for other material subject to Copyright and Similar Rights by digital file-sharing or similar means is NonCommercial provided there is no payment of monetary compensation in connection with the exchange.
17 | Share means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them.
18 | Sui Generis Database Rights means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world.
19 | You means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning.
20 | Section 2 – Scope.
21 |
22 | License grant.
23 | Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to:
24 | reproduce and Share the Licensed Material, in whole or in part, for NonCommercial purposes only; and
25 | produce, reproduce, and Share Adapted Material for NonCommercial purposes only.
26 | Exceptions and Limitations. For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions.
27 | Term. The term of this Public License is specified in Section 6(a).
28 | Media and formats; technical modifications allowed. The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material.
29 | Downstream recipients.
30 | Offer from the Licensor – Licensed Material. Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License.
31 | No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material.
32 | No endorsement. Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i).
33 | Other rights.
34 |
35 | Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise.
36 | Patent and trademark rights are not licensed under this Public License.
37 | To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties, including when the Licensed Material is used other than for NonCommercial purposes.
38 | Section 3 – License Conditions.
39 |
40 | Your exercise of the Licensed Rights is expressly made subject to the following conditions.
41 |
42 | Attribution.
43 |
44 | If You Share the Licensed Material (including in modified form), You must:
45 |
46 | retain the following if it is supplied by the Licensor with the Licensed Material:
47 | identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated);
48 | a copyright notice;
49 | a notice that refers to this Public License;
50 | a notice that refers to the disclaimer of warranties;
51 | a URI or hyperlink to the Licensed Material to the extent reasonably practicable;
52 | indicate if You modified the Licensed Material and retain an indication of any previous modifications; and
53 | indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License.
54 | You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information.
55 | If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable.
56 | If You Share Adapted Material You produce, the Adapter's License You apply must not prevent recipients of the Adapted Material from complying with this Public License.
57 | Section 4 – Sui Generis Database Rights.
58 |
59 | Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material:
60 |
61 | for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database for NonCommercial purposes only;
62 | if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material; and
63 | You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database.
64 | For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights.
65 | Section 5 – Disclaimer of Warranties and Limitation of Liability.
66 |
67 | Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You.
68 | To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You.
69 | The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.
70 | Section 6 – Term and Termination.
71 |
72 | This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically.
73 | Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates:
74 |
75 | automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or
76 | upon express reinstatement by the Licensor.
77 | For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License.
78 | For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License.
79 | Sections 1, 5, 6, 7, and 8 survive termination of this Public License.
80 | Section 7 – Other Terms and Conditions.
81 |
82 | The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed.
83 | Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License.
84 | Section 8 – Interpretation.
85 |
86 | For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License.
87 | To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions.
88 | No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor.
89 | Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority.
90 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # D-Lab Python Data Wrangling Workshop
2 |
3 | [![DataHub](https://img.shields.io/badge/launch-datahub-blue)](https://datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fdlab-berkeley%2FPython-Data-Wrangling-Pilot&urlpath=tree%2FPython-Data-Wrangling-Pilot%2F&branch=main)
4 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/dlab-berkeley/Python-Data-Wrangling-Pilot/HEAD)
5 |
6 | This repository contains the materials for D-Lab's Python Data Wrangling Pilot workshop.
7 |
8 | ### Prerequisites
9 | Prior experience with [Python Fundamentals](https://github.com/dlab-berkeley/python-fundamentals) is assumed.
10 |
11 | Check D-Lab's [Learning Pathways](https://dlab-berkeley.github.io/dlab-workshops/python_path.html) to figure out which of our workshops to take!
12 |
13 | ## Workshop Goals
14 |
15 | In this workshop, we provide an introduction to data wrangling with Python. We will do so largely with the `pandas` package, which provides a rich set of tools to manipulate and interact with *data frames*, the most common data structure used when analyzing tabular data. We'll learn how to manipulate, index, merge, group, and plot data frames using `pandas` functions.
16 |
17 | Basic familiarity with Python *is* assumed. If you are not familiar with the material in [Python Fundamentals](https://github.com/dlab-berkeley/python-fundamentals), we recommend attending that workshop first.
18 |
19 | ## Installation Instructions
20 |
21 | Anaconda is package management software that allows you to run Python and Jupyter notebooks easily. Installing Anaconda is the easiest way to make sure you have all the necessary software to run the materials for this workshop. Complete the following steps:
22 |
23 | 1. [Download and install Anaconda (Python 3.8 distribution)](https://www.anaconda.com/products/individual). Click "Download" and then click 64-bit "Graphical Installer" for your current operating system.
24 |
25 | 2. Download the [Python-Data-Wrangling workshop materials](https://github.com/dlab-berkeley/Python-Data-Wrangling-Pilot):
26 |
27 | * Click the green "Code" button in the top right of the repository information.
28 | * Click "Download Zip".
29 | * Extract this file to a folder on your computer where you can easily access it (we recommend Desktop).
30 |
31 | 3. Optional: if you're familiar with `git`, you can instead clone this repository by opening a terminal and entering `git clone git@github.com:dlab-berkeley/Python-Data-Wrangling-Pilot.git`.
32 |
33 | ## Run the code
34 |
35 | Now that you have all the required software and materials, you need to run the code:
36 |
37 | 1. Open the Anaconda Navigator application. You should see the green snake logo appear on your screen. Note that this can take a few minutes to load up the first time.
38 |
39 | 2. Click the "Launch" button under "Jupyter Notebook" and navigate through your file system to the `Python-Data-Wrangling-Pilot` folder you downloaded above.
40 |
41 | 3. Open the `lessons` folder, and click `01_python_data_wrangling.ipynb` to begin.
42 |
43 | 4. Press Shift + Enter (or Ctrl + Enter) to run a cell.
44 |
45 | Note that all of the above steps can be run from the terminal, if you're familiar with how to interact with Anaconda in that fashion. However, using Anaconda Navigator is the easiest way to get started if this is your first time working with Anaconda.
46 |
47 | ## Is Python not working on your laptop?
48 |
49 | If you do not have Anaconda installed and the materials loaded on your laptop by the time the workshop starts, we *strongly* recommend using the UC Berkeley DataHub to run the materials for these lessons. You can access the DataHub by clicking this button:
50 |
51 | [![DataHub](https://img.shields.io/badge/launch-datahub-blue)](https://datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fdlab-berkeley%2FPython-Data-Wrangling-Pilot&urlpath=tree%2FPython-Data-Wrangling-Pilot%2F&branch=main)
52 |
53 | The DataHub downloads this repository, along with any necessary packages, and allows you to run the materials in a Jupyter notebook that is stored on UC Berkeley's servers. No installation is necessary from your end - you only need an internet browser and a CalNet ID to log in. By using the DataHub, you can save your work and come back to it at any time. When you want to return to your saved work, just go straight to [DataHub](https://datahub.berkeley.edu), sign in, and click on the `Python-Data-Wrangling-Pilot` folder.
54 |
55 | If you don't have a Berkeley CalNet ID, you can still run these lessons in the cloud by clicking this button:
56 |
57 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/dlab-berkeley/Python-Data-Wrangling-Pilot/HEAD)
58 |
59 | Once you have opened a Jupyter notebook within the Binder environment, run the following code within a cell in the notebook:
60 | ```
61 | ! pip install pandas matplotlib
62 | ```
63 | Note that in Binder you cannot save your work.
64 |
65 | # Additional Resources
66 |
67 | * [The official pandas documentation](http://pandas.pydata.org/pandas-docs/stable/)
68 | * [Visualization with pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)
69 |
70 | # About the UC Berkeley D-Lab
71 |
72 | D-Lab works with Berkeley faculty, research staff, and students to advance data-intensive social science and humanities research. Our goal at D-Lab is to provide practical training, staff support, resources, and space to enable you to use Python for your own research applications. Our services cater to all skill levels, and no programming, statistical, or computer science background is necessary. We offer these services in the form of workshops, one-to-one consulting, and working groups that cover a variety of research topics, digital tools, and programming languages.
73 |
74 | Visit the [D-Lab homepage](https://dlab.berkeley.edu/) to learn more about us. You can view our [calendar](https://dlab.berkeley.edu/events/calendar) for upcoming events, learn about how to utilize our [consulting](https://dlab.berkeley.edu/consulting) and [data](https://dlab.berkeley.edu/data) services, and check out upcoming [workshops](https://dlab.berkeley.edu/events/workshops).
75 |
76 | # Other D-Lab Python Workshops
77 |
78 | Here are other Python workshops offered by the D-Lab:
79 |
80 | ## Introductory Workshops
81 |
82 | * [Python Fundamentals](https://github.com/dlab-berkeley/Python-Fundamentals)
83 | * [Python Data Visualization](https://github.com/dlab-berkeley/Python-Data-Visualization)
84 | * [Python Geospatial Fundamentals](https://github.com/dlab-berkeley/Geospatial-Data-and-Mapping-in-Python)
85 |
86 | ## Advanced Workshops
87 |
88 | * [Python Web Scraping and APIs](https://github.com/dlab-berkeley/Python-Web-Scraping)
89 | * [Python Machine Learning](https://github.com/dlab-berkeley/Python-Machine-Learning)
90 | * [Python Text Analysis](https://github.com/dlab-berkeley/Python-Text-Analysis)
91 | * [Python Deep Learning](https://github.com/dlab-berkeley/Python-Deep-Learning)
92 |
--------------------------------------------------------------------------------
/data/countries.csv:
--------------------------------------------------------------------------------
1 | "country","google_country_code","country_group","name_en","name_fr","name_de","latitude","longitude"
2 | "at","AT","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
3 | "be","BE","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
4 | "bg","BG","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"
5 | "hr","HR","non-eu","Croatia","Croatie","Kroatien","44.74664297","15.34084438"
6 | "cy","CY","eu","Cyprus","Chypre","Zypern","35.129141","33.4286823"
7 | "cz","CZ","eu","Czech Republic","République tchèque","Tschechische Republik","49.803531","15.47499805"
8 | "dk","DK","eu","Denmark","Danemark","Dänemark","55.93968425","9.51668905"
9 | "ee","EE","eu","Estonia","Estonie","Estland","58.5924685","25.8069503"
10 | "fi","FI","eu","Finland","Finlande","Finnland","64.95015875","26.06756405"
11 | "fr","FR","eu","France","France","Frankreich","46.7109945","1.7185608"
12 | "de","DE","eu","Germany (including former GDR from 1991)","Allemagne (incluant l'ancienne RDA à partir de 1991)","Deutschland (einschließlich der ehemaligen DDR seit 1991)","51.16382538","10.4540478"
13 | "gr","GR","eu","Greece","Grèce","Griechenland","39.698467","21.57725572"
14 | "hu","HU","eu","Hungary","Hongrie","Ungarn","47.16116325","19.5042648"
15 | "ie","IE","eu","Ireland","Irlande","Irland","53.41526","-8.2391222"
16 | "it","IT","eu","Italy","Italie","Italien","42.504191","12.57378705"
17 | "lv","LV","eu","Latvia","Lettonie","Lettland","56.880117","24.60655505"
18 | "lt","LT","eu","Lithuania","Lituanie","Litauen","55.173687","23.9431678"
19 | "lu","LU","eu","Luxembourg","Luxembourg","Luxemburg","49.815319","6.13335155"
20 | "mt","MT","eu","Malta","Malte","Malta","35.902422","14.4474608"
21 | "nl","NL","eu","Netherlands","Pays-Bas","Niederlande","52.10811825","5.3301983"
22 | "no","NO","non-eu","Norway","Norvège","Norwegen","64.55645975","12.66576565"
23 | "pl","PL","eu","Poland","Pologne","Polen","51.91890725","19.1343338"
24 | "pt","PT","eu","Portugal","Portugal","Portugal","39.55806875","-7.84494095"
25 | "ro","RO","eu","Romania","Roumanie","Rumänien","45.94261125","24.99015155"
26 | "sk","SK","eu","Slovakia","Slovaquie","Slowakei","48.67264375","19.7000323"
27 | "si","SI","eu","Slovenia","Slovénie","Slowenien","46.14925925","14.98661705"
28 | "es","ES","eu","Spain","Espagne","Spanien","39.8950135","-2.9882957"
29 | "se","SE","eu","Sweden","Suède","Schweden","62.1984675","14.89630657"
30 | "tr","TR","non-eu","Turkey","Turquie","Türkei","38.95294205","35.43979471"
31 | "uk","GB","eu","United Kingdom","Royaume-Uni","Vereinigtes Königreich","54.315447","-2.23261195"
32 |
--------------------------------------------------------------------------------
/images/groupby.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/images/valcounts.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/lessons/01_python_data_wrangling.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Python Data Wrangling with `pandas`: Part 1\n",
8 | "* * * \n",
9 | "
\n",
10 | " \n",
11 | "### Learning Objectives \n",
12 | " \n",
13 | "* Gain familiarity with `pandas` and the core `DataFrame` object\n",
14 | "* Apply core data wrangling techniques in `pandas`\n",
15 | "* Understand the flexibility of the `pandas` library\n",
16 | "
\n",
17 | "\n",
18 | "### Icons Used in This Notebook\n",
19 | "🔔 **Question**: A quick question to help you understand what's going on. \n",
20 | "🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop! \n",
21 | "💡 **Tip**: How to do something a bit more efficiently or effectively. \n",
22 | "⚠️ **Warning:** Heads-up about tricky stuff or common mistakes. \n",
23 | "🎬 **Demo**: Showing off something more advanced – so you know what `pandas` can be used for! "
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "### Sections\n",
31 | "1. [The `DataFrame` Object](#dataframe)\n",
32 | "2. [Indexing Data](#indexing)\n",
33 | "3. [Boolean Indexing](#boolean)\n",
34 | "4. [🎬 Demo: Boolean Indexing with multiple conditions](#demo)"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {
40 | "tags": []
41 | },
42 | "source": [
43 | "# This Workshop\n",
44 | "In this workshop, we provide an introduction to **data wrangling with Python**. We will be using the `pandas` package, which provides a rich set of tools for manipulating data.\n",
45 | "\n",
46 | "We'll use worked examples and practice on real data to learn the core techniques of data wrangling -- how to index, manipulate, merge, group, and plot data -- in `pandas`. \n",
47 | "\n",
48 | "Let's get started.\n",
49 | "\n",
50 | "\n",
51 | "# The `DataFrame` object\n",
52 | "\n",
53 | "`pandas` is designed to make it easy to work with structured, tabular data. Many of the analyses you might typically perform involve using tabular data, i.e. .csv files, excel files, extracts from relational databases, etc. `pandas` represents this data as a DataFrame object -- we'll see what this looks like in a moment.\n",
54 | "## Importing and Viewing Data\n",
55 | "We are going to work with European unemployment data from Eurostat, which is[ hosted by Google](https://code.google.com/p/dspl/downloads/list). There are several `.csv` files related to this topic that we'll work with in this workshop, all of which are related to unemployment rates in European countries.\n",
56 | "\n",
57 | "Let's begin by importing `pandas` using the conventional `pd` abbreviation."
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": null,
63 | "metadata": {},
64 | "outputs": [],
65 | "source": [
66 | "# Imports pandas and assign it to the variable `pd`\n",
67 | "import pandas as pd\n",
68 | "\n",
69 | "# We often import NumPy (numerical python) with pandas\n",
70 | "# we will import that and assign it to the variable `np`\n",
71 | "import numpy as np\n",
72 | "\n",
73 | "# Load matplotlib for plotting later in the workshop\n",
74 | "import matplotlib.pyplot as plt\n",
75 | "%matplotlib inline"
76 | ]
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "`pandas` has a `read_csv()` function that allows us to easily import tabular data. The function returns a `DataFrame` object, which is the main object `pandas` uses to represent tabular data.\n",
83 | "\n",
84 | "Notice that we call `read_csv()` using the `pd` abbreviation from the import statement above:"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": null,
90 | "metadata": {},
91 | "outputs": [],
92 | "source": [
93 | "unemployment = pd.read_csv('../data/country_total.csv')"
94 | ]
95 | },
96 | {
97 | "cell_type": "markdown",
98 | "metadata": {},
99 | "source": [
100 | "Let's run `type()` on the `unemployment` object and see what it is..."
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": null,
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
109 | "type(unemployment)"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "Great! You've created a `pandas` `DataFrame`. We can look at our data by using the `.head()` method. By default, this shows the header (column names) and the first **five** rows. "
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "metadata": {},
123 | "outputs": [],
124 | "source": [
125 | "unemployment.head()"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "💡 **Tip**: If you'd like to see some other number of rows, you can pass an integer to `.head()` to return that many rows. For example `unemployment.head(6)` would return the first six rows. \n",
133 | "\n"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "To find the number of rows, you can use the `.shape` attribute, which returns a [tuple](https://www.w3schools.com/python/python_tuples.asp): `(number of rows, number of columns)`"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "metadata": {},
147 | "outputs": [],
148 | "source": [
149 | "unemployment.shape"
150 | ]
151 | },
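{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `.shape` is just a tuple, you can unpack it directly into two variables. A minimal sketch (the names `n_rows` and `n_cols` are only illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Unpack the (rows, columns) tuple into two variables\n",
"n_rows, n_cols = unemployment.shape\n",
"print(n_rows, n_cols)"
]
},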
152 | {
153 | "cell_type": "markdown",
154 | "metadata": {},
155 | "source": [
156 | "To find out exactly what all of your columns are, you can use the `.columns` attribute."
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": null,
162 | "metadata": {},
163 | "outputs": [],
164 | "source": [
165 | "unemployment.columns"
166 | ]
167 | },
168 | {
169 | "cell_type": "markdown",
170 | "metadata": {},
171 | "source": [
172 | "To find out what kinds of data we have, we use the `.dtypes` attribute, which tells us which columns contain numerical data (e.g. `float64` or `int64` types) and which ones contain text (e.g. `object` types)"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "metadata": {},
179 | "outputs": [],
180 | "source": [
181 | "unemployment.dtypes"
182 | ]
183 | },
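{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you know the `dtype`s, you can pull out just the columns of a particular kind. As a small sketch beyond the lesson itself, `.select_dtypes()` keeps only the columns whose dtypes you ask for:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Keep only the numeric columns (float64/int64); include='object' would keep the text columns\n",
"unemployment.select_dtypes(include='number').head()"
]
},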
184 | {
185 | "cell_type": "markdown",
186 | "metadata": {},
187 | "source": [
188 | "## Looking up functions\n",
189 | "A useful method that generates various summary statistics is `.describe()`. This is a powerful method that will return a lot of information, so before we run it, let's look up exactly what it does.\n",
190 | "\n",
191 | "💡 **Tip**: The [`pandas` documentation](http://pandas.pydata.org/pandas-docs/stable/) contains exhaustive information on every function, object, etc. in `pandas`. It can be a little difficult to navigate on its own, so it's typical to interact with the documentation primarily through Google searches. \n",
192 | "\n",
193 | "The following is a general worflow for learning about a function in `pandas`:\n",
194 | "1. Google the `pandas` function, e.g. \"pandas {insert function name}\"\n",
195 | "2. Find a result from pandas.pydata.org (the pandas documentation)\n",
196 | "3. Read the summary of what the function does (at the top of the page), examine its arguments and what it returns.\n",
197 | "\n",
198 | "🔔 **Question:** Before running the following code, try using the general workflow detailed above to find out what `.describe()` returns. "
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
207 | "unemployment.describe()"
208 | ]
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "⚠️ **Warning**: `.describe()` will behave differently depending on your data's types, or, `dtype`s. If your `DataFrame` includes both numeric and object (e.g., strings) `dtype`s, it will default to **summarizing only the numeric data** (as shown above). If `.describe()` is called on a `DataFrame` that only contains strings, it will return the count, number of unique values, and the most frequent value along with its count. "
215 | ]
216 | },
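{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see that behavior for yourself -- a quick sketch, assuming `unemployment` contains at least one string column, such as `country` -- you can point `.describe()` at the object columns explicitly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summarize only the object (string) columns: count, unique, top, freq\n",
"unemployment.describe(include='object')"
]
},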
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "## 🥊 Challenge 1\n",
222 | "**Setup** \n",
223 | "We previously imported unemployment data into `pandas` using the `read_csv` function and a relative file path. `read_csv` also allows us to import data using a URL as the file path. \n",
224 | "\n",
225 | "A .csv file with data on world countries and their abbreviations is located at the following URL:\n",
226 | "\n",
227 | "[https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv](https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv)\n",
228 | "\n",
229 | "We can load that data directly as follows:"
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": null,
235 | "metadata": {},
236 | "outputs": [],
237 | "source": [
238 | "countries_url = 'https://raw.githubusercontent.com/dlab-berkeley/Python-Data-Wrangling/main/data/countries.csv'\n",
239 | "countries = pd.read_csv(countries_url)"
240 | ]
241 | },
242 | {
243 | "cell_type": "code",
244 | "execution_count": null,
245 | "metadata": {},
246 | "outputs": [],
247 | "source": [
248 | "countries.head()"
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {
254 | "jp-MarkdownHeadingCollapsed": true,
255 | "tags": []
256 | },
257 | "source": [
258 | "**Challenge** \n",
259 | "Whenever we open a new DataFrame, it's important to get a basic understanding of its structure.\n",
260 | "\n",
261 | "Using the methods and attributes we just discussed, **answer the following questions** about `countries`:\n",
262 | "\n",
263 | "1. What columns does `countries` contain?\n",
264 | "2. How many rows and columns does it contain?\n",
265 | "3. What are the minimum and maximum values of the columns with numerical data?\n",
266 | "\n",
267 | "Click for hint\n",
268 | "Hint: consider using .columns, .shape, and .describe() here.\n",
269 | ""
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": null,
275 | "metadata": {},
276 | "outputs": [],
277 | "source": [
278 | "# YOUR CODE HERE\n"
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "execution_count": null,
284 | "metadata": {},
285 | "outputs": [],
286 | "source": [
287 | "# YOUR CODE HERE\n"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": null,
293 | "metadata": {},
294 | "outputs": [],
295 | "source": [
296 | "# YOUR CODE HERE\n"
297 | ]
298 | },
299 | {
300 | "cell_type": "markdown",
301 | "metadata": {},
302 | "source": [
303 | "\n",
304 | "# Indexing Data\n",
305 | "Wrangling data in a DataFrame often requires extracting specific rows and/or columns of interest. This is referred to as **Indexing**. We've actually already learned a simple way to index our data using `.head()`, which isolated the first five rows of our data. Now, we'll learn more flexible and powerful methods for indexing."
306 | ]
307 | },
308 | {
309 | "cell_type": "markdown",
310 | "metadata": {},
311 | "source": [
312 | "## Recall basic Python indexing\n",
313 | "To index (this is synonymous with other verbs like \"subset,\" \"slice,\" etc.) data in Python, we use bracket notation: `[]`. Run the following code to instantiate a list of numbers and observe what different indexes return:"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "metadata": {
320 | "tags": []
321 | },
322 | "outputs": [],
323 | "source": [
324 | "my_list = ['a', 'b', 'c', 'd', 'e', 'f']"
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "execution_count": null,
330 | "metadata": {
331 | "tags": []
332 | },
333 | "outputs": [],
334 | "source": [
335 | "my_list"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": null,
341 | "metadata": {
342 | "tags": []
343 | },
344 | "outputs": [],
345 | "source": [
346 | "my_list[:4]"
347 | ]
348 | },
349 | {
350 | "cell_type": "code",
351 | "execution_count": null,
352 | "metadata": {
353 | "tags": []
354 | },
355 | "outputs": [],
356 | "source": [
357 | "my_list[0]"
358 | ]
359 | },
360 | {
361 | "cell_type": "code",
362 | "execution_count": null,
363 | "metadata": {
364 | "tags": []
365 | },
366 | "outputs": [],
367 | "source": [
368 | "my_list[2:]"
369 | ]
370 | },
371 | {
372 | "cell_type": "markdown",
373 | "metadata": {},
374 | "source": [
375 | "\n",
376 | "Indexing works very similarly in `pandas` as it does in standard python, but with a few key differences. In `pandas`, indexing relies on referencing a DataFrame's rows and then its columns → `[rows, columns]`. Let's get a more visual sense of this -- in the `countries` DataFrame that we created earlier, the structure of the data is as follows: \n",
377 | "\n",
378 | " "
379 | ]
380 | },
381 | {
382 | "cell_type": "markdown",
383 | "metadata": {},
384 | "source": [
385 | "To index and get to specific data from this DataFrame, we select a row/column combination. \n",
386 | "For example, indexing row 3 and the column `google_country_code` would give us the value 'HR'. In code, that would look as follows: \n",
387 | "`countries.loc[3, 'google_country_code']` \n",
388 | "Try writing that in the cell below and running it."
389 | ]
390 | },
391 | {
392 | "cell_type": "code",
393 | "execution_count": null,
394 | "metadata": {},
395 | "outputs": [],
396 | "source": [
397 | "# YOUR CODE HERE\n",
398 | "countries.loc[3, 'google_country_code']"
399 | ]
400 | },
401 | {
402 | "cell_type": "markdown",
403 | "metadata": {},
404 | "source": [
405 | "## `.loc`\n",
406 | "Let's go deeper into what `.loc` does, as this will be the primary tool we use for indexing. \n",
407 | "\n",
408 | "`.loc` allows us to index data based on the labels of our DataFrame's index and its column names. Let's take a look at its behavior below:"
409 | ]
410 | },
411 | {
412 | "cell_type": "code",
413 | "execution_count": null,
414 | "metadata": {},
415 | "outputs": [],
416 | "source": [
417 | "countries.loc[:4, :]"
418 | ]
419 | },
420 | {
421 | "cell_type": "markdown",
422 | "metadata": {},
423 | "source": [
424 | "The code, `countries.loc[:4, :]` executes the following: \n",
425 | "- countries.loc[**:4**, :]→ Select rows up to index 4\n",
426 | "- countries.loc[:4, **:**]→\n",
427 | " Select all columns\n",
428 | "\n",
429 | "This format allows us to flexibly select ranges of rows and columns at the same time. Consider this more complex example:"
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": null,
435 | "metadata": {},
436 | "outputs": [],
437 | "source": [
438 | "countries.head()"
439 | ]
440 | },
441 | {
442 | "cell_type": "code",
443 | "execution_count": null,
444 | "metadata": {},
445 | "outputs": [],
446 | "source": [
447 | "countries.loc[2:4, 'name_en']"
448 | ]
449 | },
450 | {
451 | "cell_type": "markdown",
452 | "metadata": {},
453 | "source": [
454 | "This code executed the following: \n",
455 | "- countries.loc[**2:4**, 'name_en'] -> Select rows from index 2 up to index 4 \n",
456 | "- countries.loc[2:4, **'name_en'**] -> Select the `name_en` column\n",
457 | "\n",
458 | "💡 **Tip**: Note that the output of this code looks different from our previous output! Because we selected a single column, our code returned a `Series` object."
459 | ]
460 | },
461 | {
462 | "cell_type": "code",
463 | "execution_count": null,
464 | "metadata": {},
465 | "outputs": [],
466 | "source": [
467 | "type(countries.loc[2:4, 'name_en'])"
468 | ]
469 | },
470 | {
471 | "cell_type": "markdown",
472 | "metadata": {},
473 | "source": [
474 | "Let's look at one more example of `.loc`. \n",
475 | "🔔 **Question:** Before running the following code block, can you anticipate what it will output?"
476 | ]
477 | },
478 | {
479 | "cell_type": "code",
480 | "execution_count": null,
481 | "metadata": {},
482 | "outputs": [],
483 | "source": [
484 | "countries.columns"
485 | ]
486 | },
487 | {
488 | "cell_type": "code",
489 | "execution_count": null,
490 | "metadata": {},
491 | "outputs": [],
492 | "source": [
493 | "countries.loc[19:29, ['name_en', 'longitude']]"
494 | ]
495 | },
496 | {
497 | "cell_type": "markdown",
498 | "metadata": {},
499 | "source": [
500 | "## 🥊 Challenge 2: Indexing with `.loc`\n",
501 | "\n",
502 | "Let's get a little practice with the `.loc` operator. \n",
503 | "\n",
504 | "Select rows 10 through 20, then compute their average `latitude` \n",
505 | "\n",
506 | " Click for Hint\n",
507 | " This can be done using `.loc` and `.mean()`, all in one line of code: countries.loc[{your row selection}, {your column selection}].mean()\n",
508 | ""
509 | ]
510 | },
511 | {
512 | "cell_type": "code",
513 | "execution_count": null,
514 | "metadata": {},
515 | "outputs": [],
516 | "source": [
517 | "# YOUR CODE HERE\n"
518 | ]
519 | },
520 | {
521 | "cell_type": "markdown",
522 | "metadata": {},
523 | "source": [
524 | "## Positional Indexing\n",
525 | "`.loc` is a very powerful indexing system that can handle almost any indexing task you can imagine. However, as is typical in `pandas`, there is more than one way to get what you're looking for. \n",
526 | "\n",
527 | "When we are executing very simple indexing tasks, such as selecting a column, it is common to use the more succinct **positional indexing** system. Positional indexing allows us to omit `.loc`, but only allows us to select a row **OR** column index, whereas most of the indexing we just did using `.loc` involved both row **AND** column indices."
528 | ]
529 | },
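{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brief sketch of what that looks like in practice (nothing here beyond standard `pandas` bracket selection):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Selecting a single column with brackets returns a Series\n",
"countries['latitude'].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Selecting a list of columns returns a DataFrame\n",
"countries[['name_en', 'latitude']].head()"
]
},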
530 | {
531 | "cell_type": "code",
532 | "execution_count": null,
533 | "metadata": {},
534 | "outputs": [],
535 | "source": [
536 | "# This will work\n",
537 | "countries.loc[1:5, 'latitude']"
538 | ]
539 | },
540 | {
541 | "cell_type": "markdown",
542 | "metadata": {},
543 | "source": [
544 | "Try running the following code -- it will throw an error. You can \"comment out\" (put a # before the code) the first statement and \"un-comment\" (remove the # before the code) the second statement to see how `.loc` fixes the error. "
545 | ]
546 | },
547 | {
548 | "cell_type": "code",
549 | "execution_count": null,
550 | "metadata": {},
551 | "outputs": [],
552 | "source": [
553 | "# this won't work\n",
554 | "countries[0, 'latitude']\n",
555 | "\n",
556 | "# this will work\n",
557 | "# countries.loc[0, 'latitude']"
558 | ]
559 | },
560 | {
561 | "cell_type": "markdown",
562 | "metadata": {},
563 | "source": [
564 | "## `iloc`\n",
565 | "\n",
566 | "Another widely used alternative to `.loc` is `.iloc`. **We recommend sticking to `.loc` while learning `pandas`**."
567 | ]
568 | },
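{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a minimal sketch of `.iloc`. Note that, like standard Python slicing, `.iloc` slices exclude their endpoint (unlike `.loc`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First three rows and first two columns, selected by integer position (end-exclusive)\n",
"countries.iloc[0:3, 0:2]"
]
},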
569 | {
570 | "cell_type": "markdown",
571 | "metadata": {},
572 | "source": [
573 | "\n",
574 | "# Boolean Indexing\n",
575 | "Now that we've covered the basics of indexing, let's get into an extremely powerful extension -- \"**boolean indexing**.\" Boolean indexing refers to filtering data based on some logical test. The `pandas` implementation of boolean indexing can be a little jarring at first, so let's build up to it from scratch. First, recall how booleans and logical tests work in standard python:"
576 | ]
577 | },
578 | {
579 | "cell_type": "code",
580 | "execution_count": null,
581 | "metadata": {},
582 | "outputs": [],
583 | "source": [
584 | "\"D-Lab\" == \"D-Lab\""
585 | ]
586 | },
587 | {
588 | "cell_type": "code",
589 | "execution_count": null,
590 | "metadata": {},
591 | "outputs": [],
592 | "source": [
593 | "\"D-Lab\" == \"H-Lab\""
594 | ]
595 | },
596 | {
597 | "cell_type": "code",
598 | "execution_count": null,
599 | "metadata": {},
600 | "outputs": [],
601 | "source": [
602 | "7 > 7"
603 | ]
604 | },
605 | {
606 | "cell_type": "markdown",
607 | "metadata": {},
608 | "source": [
609 | "We will use that same style of logical test in `pandas` to execute boolean indexing. \n",
610 | "\n",
611 | "## Example: find countries outside the EU\n",
612 | "Notice in the `countries` dataframe pictured below that we have a column, `country_group`, that tells us whether or not a country is in the European Union (EU). We're going to do a boolean indexing example on these first five rows. \n",
613 | " "
614 | ]
615 | },
616 | {
617 | "cell_type": "markdown",
618 | "metadata": {},
619 | "source": [
620 | "*Note: Croatia has been part of the EU since 2013. Data isn't always correct!*"
621 | ]
622 | },
623 | {
624 | "cell_type": "code",
625 | "execution_count": null,
626 | "metadata": {},
627 | "outputs": [],
628 | "source": [
629 | "# Create a smaller test dataframe\n",
630 | "# to show how boolean indexing works\n",
631 | "test = countries.loc[20:25, :]"
632 | ]
633 | },
634 | {
635 | "cell_type": "code",
636 | "execution_count": null,
637 | "metadata": {},
638 | "outputs": [],
639 | "source": [
640 | "test"
641 | ]
642 | },
643 | {
644 | "cell_type": "markdown",
645 | "metadata": {},
646 | "source": [
647 | "Let's use that column to filter our data down to only countries outside of the European Union. The steps are as follows:\n",
648 | "1. Select the column we will use as a filter: `test['country_group']` or `test.loc[:, 'country_group']`"
649 | ]
650 | },
651 | {
652 | "cell_type": "code",
653 | "execution_count": null,
654 | "metadata": {},
655 | "outputs": [],
656 | "source": [
657 | "test['country_group']"
658 | ]
659 | },
660 | {
661 | "cell_type": "markdown",
662 | "metadata": {},
663 | "source": [
664 | "2. Determine which rows in that column are equal to \"non-eu\" -- which denotes that the country is outside the European Union: `test['country_group'] == 'non-eu'`. The output of this code is what's called a **boolean mask**."
665 | ]
666 | },
667 | {
668 | "cell_type": "code",
669 | "execution_count": null,
670 | "metadata": {},
671 | "outputs": [],
672 | "source": [
673 | "test['country_group'] == 'non-eu'"
674 | ]
675 | },
676 | {
677 | "cell_type": "markdown",
678 | "metadata": {},
679 | "source": [
680 | "3. Use the boolean mask to index only those rows that satisfied the test: `test[test['country_group'] == 'non-eu']`"
681 | ]
682 | },
683 | {
684 | "cell_type": "code",
685 | "execution_count": null,
686 | "metadata": {},
687 | "outputs": [],
688 | "source": [
689 | "test[test['country_group'] == 'non-eu']"
690 | ]
691 | },
692 | {
693 | "cell_type": "markdown",
694 | "metadata": {},
695 | "source": [
696 | "And that's boolean indexing! We used a test for equality (`countries['country_group'] == 'non-eu'`), but we can use a variety of different tests and conditions to index our data.\n",
697 | "\n",
698 | "For example, we might want to find those countries with a longitude greater than some threshold, such as 25 (note that we will go back to using the full `countries` DataFrame now):"
699 | ]
700 | },
701 | {
702 | "cell_type": "code",
703 | "execution_count": null,
704 | "metadata": {},
705 | "outputs": [],
706 | "source": [
707 | "countries[countries['longitude'] > 25]"
708 | ]
709 | },
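{
"cell_type": "markdown",
"metadata": {},
"source": [
"💡 **Tip**: For readability, you can assign the boolean mask to a variable first and then index with it. A small sketch (the name `east_mask` is only illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Build the mask once, then reuse it for indexing\n",
"east_mask = countries['longitude'] > 25\n",
"countries[east_mask]"
]
},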
710 | {
711 | "cell_type": "markdown",
712 | "metadata": {},
713 | "source": [
714 | "## 🥊 Challenge 3: Boolean Indexing\n",
715 | "\n",
716 | "Let's push our boolean indexing skills a little further with a challenge problem.\n",
717 | "1. Find the average longitude of countries in our data, assign it to the variable `average_long`\n",
718 | "2. Find countries that have \"above average\" longitude\n",
719 | "\n",
720 | " Click for Hint\n",
721 | " Compute the average longitude of the data: countries['longitude'].mean() and save that to a variable average_long. Then, you can use that variable to create a boolean mask for indexing: countries['longitude'] > average_long\n",
722 | ""
723 | ]
724 | },
725 | {
726 | "cell_type": "code",
727 | "execution_count": null,
728 | "metadata": {},
729 | "outputs": [],
730 | "source": [
731 | "# YOUR CODE HERE\n"
732 | ]
733 | },
734 | {
735 | "cell_type": "code",
736 | "execution_count": null,
737 | "metadata": {},
738 | "outputs": [],
739 | "source": [
740 | "# YOUR CODE HERE\n"
741 | ]
742 | },
743 | {
744 | "cell_type": "markdown",
745 | "metadata": {},
746 | "source": [
747 | "\n",
748 | "# 🎬 Demo: Boolean Indexing with multiple conditions\n",
749 | "We won't have a challenge on this topic, but it's useful to know that we can boolean index using as many logical tests as we want by wrapping each test in parenthesis (`()`)and by using the AND operator (`&`) or the OR operator (`|`)"
750 | ]
751 | },
752 | {
753 | "cell_type": "code",
754 | "execution_count": null,
755 | "metadata": {},
756 | "outputs": [],
757 | "source": [
758 | "# Select the countries with longitude greater than 25 but less than 30\n",
759 | "countries[(countries['longitude'] > 25) & (countries['longitude'] < 30)]"
760 | ]
761 | },
762 | {
763 | "cell_type": "code",
764 | "execution_count": null,
765 | "metadata": {},
766 | "outputs": [],
767 | "source": [
768 | "# Select the countries with longitude greater than 30 or less than 0\n",
769 | "countries[(countries['longitude'] > 30) | (countries['longitude'] < 0)]"
770 | ]
771 | }
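,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One more operator worth knowing, as a brief sketch beyond the examples above: `~` negates a boolean mask, selecting the rows where the condition is `False`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select the countries NOT in the 25-30 longitude band (negation with ~)\n",
"countries[~((countries['longitude'] > 25) & (countries['longitude'] < 30))]"
]
}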
772 | ],
773 | "metadata": {
774 | "kernelspec": {
775 | "display_name": "Python 3 (ipykernel)",
776 | "language": "python",
777 | "name": "python3"
778 | },
779 | "language_info": {
780 | "codemirror_mode": {
781 | "name": "ipython",
782 | "version": 3
783 | },
784 | "file_extension": ".py",
785 | "mimetype": "text/x-python",
786 | "name": "python",
787 | "nbconvert_exporter": "python",
788 | "pygments_lexer": "ipython3",
789 | "version": "3.11.7"
790 | }
791 | },
792 | "nbformat": 4,
793 | "nbformat_minor": 4
794 | }
795 |
--------------------------------------------------------------------------------
/lessons/02_python_data_wrangling.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Python Data Wrangling with `pandas`: Part 2\n",
8 | "\n",
9 | "* * * \n",
10 | "
\n",
11 | " \n",
12 | "### Learning Objectives \n",
13 | " \n",
14 | "* Gain familiarity with `pandas` and the core `DataFrame` object\n",
15 | "* Apply core data wrangling techniques in `pandas`\n",
16 | "* Understand the flexibility of the `pandas` library\n",
17 | "
\n",
18 | "\n",
19 | "### Icons Used in This Notebook\n",
20 | "🔔 **Question**: A quick question to help you understand what's going on. \n",
21 | "🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop! \n",
22 | "💡 **Tip**: How to do something a bit more efficiently or effectively. \n",
23 | "⚠️ **Warning:** Heads-up about tricky stuff or common mistakes. \n",
24 | "🎬 **Demo**: Showing off something more advanced – so you know what Pandas can be used for! "
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "### Sections\n",
32 | "4. [Missing Data](#missing)\n",
33 | "5. [Sorting Values](#sorting)\n",
34 | "6. [Merging](#merging)\n",
35 | "7. [Grouping](#grouping)\n",
36 | "8. [Visualization](#viz)\n",
37 | "\n",
38 | "Let's start back up by importing our libraries and loading up our data"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "import pandas as pd\n",
48 | "import numpy as np\n",
49 | "import matplotlib.pyplot as plt\n",
50 | "%matplotlib inline"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": null,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "# Open the unemployment dataset\n",
60 | "unemployment = pd.read_csv('../data/cleaned_country_totals.csv')\n",
61 | "# This is some formatting that's out of scope\n",
62 | "unemployment['date'] = pd.to_datetime(unemployment['date'])\n",
63 | "\n",
64 | "# Open the countries dataset\n",
65 | "countries_url = 'https://raw.githubusercontent.com/dlab-berkeley/Python-Data-Wrangling/main/data/countries.csv'\n",
66 | "countries = pd.read_csv(countries_url)"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "\n",
74 | "# Missing Values\n",
75 | "When working with a new data source, it's good to get an idea of how much information is missing. `pandas` provides various methods for exploring and dealing with \"missing-ness\", one of which is `.isna()`"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": null,
81 | "metadata": {},
82 | "outputs": [],
83 | "source": [
84 | "unemployment.isna()"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "The `.isna()` method returns a corresponding boolean value for each entry in the `unemployment` DataFrame. In Python, `True` is equivalent to 1 and `False` is equivalent to 0. Thus, when we add up the results by column (with `.sum()`), we get a count for the **total** number of missing values by column::"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": null,
97 | "metadata": {},
98 | "outputs": [],
99 | "source": [
100 | "unemployment.isna().sum()"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "This is a very helpful trick, and it shows us that the only column with missing values is `unemployment_rate`.\n",
108 | "\n",
109 | "There are a wide variety of approaches to dealing with missing data. One basic approach would be to drop any row with a missing unemployment rate record:"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": null,
115 | "metadata": {},
116 | "outputs": [],
117 | "source": [
118 | "unemployment.dropna(subset=['unemployment_rate'])"
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "Simply running the `.dropna()` method actually doesn't alter our data -- it makes a copy then drops the missing rows for that copy. In order to save our alteration, we'll need to re-define the `unemployment` DataFrame with our altered copy:"
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": null,
131 | "metadata": {},
132 | "outputs": [],
133 | "source": [
134 | "unemployment = unemployment.dropna(subset=['unemployment_rate'])"
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {},
140 | "source": [
141 | "**💡 Tip:** Note that the brackets `[]` are needed in the `dropna()` method because the subset parameter expects a list of column names. Even if you're specifying just one column, it still needs to be enclosed in brackets to indicate that it's a list. "
142 | ]
143 | },
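{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a small sketch beyond the lesson), we can re-count the missing values after the drop. Because the mean of booleans is the fraction that are `True`, `.isna().mean()` also gives the *proportion* missing per column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Re-count missing values; unemployment_rate should now show 0\n",
"unemployment.isna().sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Proportion of missing entries per column (all zero now that we've dropped them)\n",
"unemployment.isna().mean()"
]
},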
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {},
147 | "source": [
148 | "\n",
149 | "# Sorting Values\n",
150 | "\n",
151 | "We've been working with data about unemployment rates, so it would probably be useful to know what the highest unemployment rates are in this data. For this, we'll use the `sort_values()` method to sort the data. We chain `.head()` onto the end of this so that we only see the first five rows:"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": null,
157 | "metadata": {},
158 | "outputs": [],
159 | "source": [
160 | "unemployment.sort_values(by='date', ascending=False).head()"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {},
166 | "source": [
167 | "The above code creates a copy of the `DataFrame`, sorted in **descending** order (note `ascending=False`), and prints the first five rows."
168 | ]
169 | },
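170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
174 | "`sort_values()` also accepts a list of columns: rows are sorted by the first column, with ties broken by the next. A quick sketch using the `country` and `date` columns:"
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "execution_count": null,
180 | "metadata": {},
181 | "outputs": [],
182 | "source": [
183 | "# Sort by country code first, then chronologically within each country\n",
184 | "unemployment.sort_values(by=['country', 'date']).head()"
185 | ]
186 | },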
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
174 | "## 🥊 Challenge 5\n",
175 | "Let's use sorting to answer a practical question: \n",
176 | "\n",
177 | "Which country has the highest unemployment rate in our data? \n",
178 | "\n",
179 | "Click for hint\n",
180 | "1. Use .sort_values() to sort this data based on the unemployment rate using descending order \n",
181 | "2. Select the top row using .head() with an argument for number of rows \n",
182 | "\n",
183 | "\n"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": null,
189 | "metadata": {},
190 | "outputs": [],
191 | "source": [
192 | "# YOUR CODE HERE"
193 | ]
194 | },
195 | {
196 | "cell_type": "markdown",
197 | "metadata": {},
198 | "source": [
199 | "\n",
200 | "# Merging DataFrames\n",
201 | "\n",
202 | "Our `unemployment` has a lot of interesting information, but it is unfortunately missing any full country names. We have that information in the `countries` DataFrame, but we need to find a way of combining it with `unemployment`."
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": null,
208 | "metadata": {},
209 | "outputs": [],
210 | "source": [
211 | "countries.head()"
212 | ]
213 | },
214 | {
215 | "attachments": {},
216 | "cell_type": "markdown",
217 | "metadata": {},
218 | "source": [
219 | "Because the data we need is stored in two separate files, we'll want to merge the data somehow. Let's determine which column we can use to join this data by taking a look at `unemployment`"
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": null,
225 | "metadata": {},
226 | "outputs": [],
227 | "source": [
228 | "unemployment.head()"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {},
234 | "source": [
235 | "These two DataFrames seem to have a similar `country` column, with two character codes for each country. Let's try doing a merge of these two datasets based on the `country` column."
236 | ]
237 | },
238 | {
239 | "cell_type": "markdown",
240 | "metadata": {},
241 | "source": [
242 | "`pandas` includes an easy-to-use merge function that accepts two DataFrames and a column to merge them on:"
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": null,
248 | "metadata": {},
249 | "outputs": [],
250 | "source": [
251 | "unemployment_merged = pd.merge(unemployment, countries, on='country')"
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": null,
257 | "metadata": {},
258 | "outputs": [],
259 | "source": [
260 | "unemployment_merged.head()"
261 | ]
262 | },
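263 | {
264 | "cell_type": "markdown",
265 | "metadata": {},
266 | "source": [
267 | "By default, `pd.merge()` performs an *inner* join, keeping only country codes that appear in both DataFrames. As a sketch, passing `how='left'` would instead keep every row of `unemployment`, filling in `NaN` wherever a code has no match in `countries`:"
268 | ]
269 | },
270 | {
271 | "cell_type": "code",
272 | "execution_count": null,
273 | "metadata": {},
274 | "outputs": [],
275 | "source": [
276 | "# Sketch: a left join keeps all unemployment rows, matched or not\n",
277 | "pd.merge(unemployment, countries, on='country', how='left').head()"
278 | ]
279 | },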
263 | {
264 | "attachments": {},
265 | "cell_type": "markdown",
266 | "metadata": {},
267 | "source": [
268 | "\n",
269 | "# Grouping and Aggregating Data\n",
270 | "\n",
271 | "What if we'd like to know how many observations exist for each country? To do so, we need to group the countries, then count how many times each one occurs. In other words, we're going to **group** our data **by** a specific column, and calculate some quantity within each group. The \"group-by\" operation is a fundamental technique used with tabular data. \n",
272 | "\n",
273 | "\n",
274 | "(For those who have used spreadsheet software like Excel, you might recognize that we are essentially talking about making a \"Pivot-Table\")\n",
275 | "\n",
276 | "## Simple Grouping with `.value_counts()`\n",
277 | "For simple grouping operations, we can use the handy `.value_counts()` method. We typically run this on a single column, and it will return a table showing how many observations there are for each unique value in the column. The following graphic represents the basics of the operations.\n",
278 | ""
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "execution_count": null,
284 | "metadata": {},
285 | "outputs": [],
286 | "source": [
287 | "unemployment_merged['name_en'].value_counts()"
288 | ]
289 | },
290 | {
291 | "cell_type": "markdown",
292 | "metadata": {},
293 | "source": [
294 | "This tells us that we have a lot of observations of data for Italy, France, Sweden, etc. and very few observations for Turkey and Estonia."
295 | ]
296 | },
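297 | {
298 | "cell_type": "markdown",
299 | "metadata": {},
300 | "source": [
301 | "A variant worth knowing: passing `normalize=True` returns proportions instead of raw counts (a quick sketch):"
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": null,
307 | "metadata": {},
308 | "outputs": [],
309 | "source": [
310 | "# The share of all observations contributed by each country\n",
311 | "unemployment_merged['name_en'].value_counts(normalize=True)"
312 | ]
313 | },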
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {},
300 | "source": [
301 | "## 🥊 Challenge 6\n",
302 | "Try using `.value_counts()` on the DataFrame to find out how many observations are from EU versus non-EU records"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {},
309 | "outputs": [],
310 | "source": [
311 | "### YOUR CODE HERE"
312 | ]
313 | },
314 | {
315 | "attachments": {},
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "## Complex grouping with `.groupby()`\n",
320 | "What if we want to do something more complex, like find out **what was the average unemployment rate for EU and non-EU countries?**. `.value_counts()` groups data then counts it, but we need a method that can group data then average it.\n",
321 | "\n",
322 | "This sort of question is a typical use case for `.groupby()` -- which allows us to group data then apply any **aggregate** function we want -- count, average, min, max, median, etc."
323 | ]
324 | },
325 | {
326 | "cell_type": "markdown",
327 | "metadata": {},
328 | "source": [
329 | "In our example, we want to find out the average unemployment rate for EU and non-EU countries, so we will group our data based on `country_group`. Here is a graphical representation of our goal: \n",
330 | ""
331 | ]
332 | },
333 | {
334 | "cell_type": "markdown",
335 | "metadata": {},
336 | "source": [
337 | "We start with the method, `.groupby()`. This doesn't actually return data or output -- it just groups the data. "
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": null,
343 | "metadata": {},
344 | "outputs": [],
345 | "source": [
346 | "unemployment_merged.groupby('country_group')"
347 | ]
348 | },
349 | {
350 | "cell_type": "markdown",
351 | "metadata": {},
352 | "source": [
353 | "We now have to select a column of data and specify an aggregate function."
354 | ]
355 | },
356 | {
357 | "cell_type": "code",
358 | "execution_count": null,
359 | "metadata": {},
360 | "outputs": [],
361 | "source": [
362 | "unemployment_merged.groupby('country_group')['unemployment_rate'].mean()"
363 | ]
364 | },
365 | {
366 | "cell_type": "markdown",
367 | "metadata": {},
368 | "source": [
369 | "Dissecting the code, we told `pandas`:\n",
370 | "1. unemployment_merged.groupby('country_group')['unemployment_rate'].mean()\n",
371 | "\n",
372 | " Group all of our rows based on the unique values of the `country_group` column -- EU, non-EU\n",
373 | " \n",
374 | "2. unemployment_merged.groupby('country_group')['unemployment_rate'].mean()\n",
375 | "\n",
376 | " Select the `unemployment_rate` column\n",
377 | "\n",
378 | "2. unemployment_merged.groupby('country_group')['unemployment_rate'].mean()\n",
379 | "\n",
380 | " Compute the average of the selected column (`unemployment_rate`) for each group"
381 | ]
382 | },
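383 | {
384 | "cell_type": "markdown",
385 | "metadata": {},
386 | "source": [
387 | "Since any aggregate function works here, `.agg()` lets us compute several at once. A quick sketch:"
388 | ]
389 | },
390 | {
391 | "cell_type": "code",
392 | "execution_count": null,
393 | "metadata": {},
394 | "outputs": [],
395 | "source": [
396 | "# Mean, minimum, and maximum unemployment rate for EU and non-EU countries\n",
397 | "unemployment_merged.groupby('country_group')['unemployment_rate'].agg(['mean', 'min', 'max'])"
398 | ]
399 | },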
383 | {
384 | "cell_type": "markdown",
385 | "metadata": {},
386 | "source": [
387 | "We can confirm this behavior using boolean indexing as well. If we index to only those records from EU countries, select the `unemployment_rate` column, then compute the average, we should get 8.3, the same value computed with groupby:"
388 | ]
389 | },
390 | {
391 | "cell_type": "code",
392 | "execution_count": null,
393 | "metadata": {},
394 | "outputs": [],
395 | "source": [
396 | "boolean_index = unemployment_merged['country_group'] == 'eu'"
397 | ]
398 | },
399 | {
400 | "cell_type": "code",
401 | "execution_count": null,
402 | "metadata": {},
403 | "outputs": [],
404 | "source": [
405 | "boolean_index"
406 | ]
407 | },
408 | {
409 | "cell_type": "code",
410 | "execution_count": null,
411 | "metadata": {},
412 | "outputs": [],
413 | "source": [
414 | "unemployment_merged.loc[boolean_index, 'unemployment_rate'].mean()"
415 | ]
416 | },
417 | {
418 | "cell_type": "markdown",
419 | "metadata": {},
420 | "source": [
421 | "The strengths of `.groupby()` relative to using boolean indexing are that `groupby()` scales very well to scenarios with many groups, and it requires much less code."
422 | ]
423 | },
424 | {
425 | "attachments": {},
426 | "cell_type": "markdown",
427 | "metadata": {},
428 | "source": [
429 | "## 🥊 Challenge 7\n",
430 | "\n",
431 | "Use `.groupby()` to find the maximum unemployment rate for each country. Sort your results from largest to smallest.\n",
432 | "\n",
433 | "Use the example above for guidance. \n",
434 | "Click for hint\n",
435 | "1. First, use groupby() to group on \"name_en\". \n",
436 | "2. Then, select the \"unemployment_rate\" column, \n",
437 | "3. Aggregate by using .max() to get the max value. \n",
438 | "4. Chain on the method .sort_values(ascending=False).\n",
439 | ""
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": null,
445 | "metadata": {},
446 | "outputs": [],
447 | "source": [
448 | "# YOUR CODE HERE"
449 | ]
450 | },
451 | {
452 | "cell_type": "markdown",
453 | "metadata": {},
454 | "source": [
455 | "\n",
456 | "# Data Visualization\n",
457 | "In the last challenge, you created an interesting table showing the all-time maximum unemployment rate that each country has experienced. Let's visualize that table as a bar chart to make it easier to present.\n",
458 | "\n",
459 | "There are various ways to approach data visualization in Python -- we'll cover simple plotting in `pandas`, which draws on functionality from the `matplotlib` library.\n",
460 | "\n",
461 | "First, we'll define a variable, `grouped`, with the table you just made:"
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": null,
467 | "metadata": {},
468 | "outputs": [],
469 | "source": [
470 | "grouped = unemployment_merged.groupby('name_en')['unemployment_rate'].max().sort_values(ascending=False)\n",
471 | "grouped"
472 | ]
473 | },
474 | {
475 | "cell_type": "markdown",
476 | "metadata": {},
477 | "source": [
478 | "Now we can plot it. In `pandas`, visualization is as simple as calling the `.plot()` method, then supplying optional arguments (here I supplied `kind=\"barh\"` to make a horizontal bar chart rather than the default line-chart). The following is the maximum unemployment rate across countries in the data for each year:"
479 | ]
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": null,
484 | "metadata": {},
485 | "outputs": [],
486 | "source": [
487 | "grouped.plot(kind='barh')\n",
488 | "\n",
489 | "# We add plt.show() to properly render the chart\n",
490 | "plt.show()"
491 | ]
492 | },
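493 | {
494 | "cell_type": "markdown",
495 | "metadata": {},
496 | "source": [
497 | "`.plot()` accepts many optional styling arguments. As a sketch, the standard `figsize` and `title` options tidy up the same chart:"
498 | ]
499 | },
500 | {
501 | "cell_type": "code",
502 | "execution_count": null,
503 | "metadata": {},
504 | "outputs": [],
505 | "source": [
506 | "# A larger, titled version of the same chart\n",
507 | "grouped.plot(kind='barh', figsize=(8, 10), title='Maximum unemployment rate by country')\n",
508 | "plt.show()"
509 | ]
510 | },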
493 | {
494 | "cell_type": "markdown",
495 | "metadata": {},
496 | "source": [
497 | "Let's try another plot. Our full DataFrame is what's called a time-series -- we have repeated observations of various countries' unemployment rates over time. We typically plot time-series using line-plots, so let's make a line plot examining Spain and Portugal's unemployment rates. \n",
498 | "\n",
499 | "To make this sort of plot simpler, we'll start by making our date column into the DataFrame's index:"
500 | ]
501 | },
502 | {
503 | "cell_type": "code",
504 | "execution_count": null,
505 | "metadata": {},
506 | "outputs": [],
507 | "source": [
508 | "unemployment_merged = unemployment_merged.set_index('date')"
509 | ]
510 | },
511 | {
512 | "cell_type": "code",
513 | "execution_count": null,
514 | "metadata": {},
515 | "outputs": [],
516 | "source": [
517 | "unemployment_merged.head()"
518 | ]
519 | },
520 | {
521 | "cell_type": "markdown",
522 | "metadata": {},
523 | "source": [
524 | "We will also use boolean indexing to select on those observations that are for Spain. We'll save that as a new DataFrame, `spain`:"
525 | ]
526 | },
527 | {
528 | "cell_type": "code",
529 | "execution_count": null,
530 | "metadata": {},
531 | "outputs": [],
532 | "source": [
533 | "spain = unemployment_merged[unemployment_merged['name_en'] == 'Spain']\n",
534 | "spain.head()"
535 | ]
536 | },
537 | {
538 | "cell_type": "markdown",
539 | "metadata": {},
540 | "source": [
541 | "Now we can easily access Spain's unemployment rate, with the date for each observation included as the index:"
542 | ]
543 | },
544 | {
545 | "cell_type": "code",
546 | "execution_count": null,
547 | "metadata": {},
548 | "outputs": [],
549 | "source": [
550 | "spain['unemployment_rate']"
551 | ]
552 | },
553 | {
554 | "cell_type": "markdown",
555 | "metadata": {},
556 | "source": [
557 | "Let's repeat the same boolean indexing we used for Spain, but now for Portugal:"
558 | ]
559 | },
560 | {
561 | "cell_type": "code",
562 | "execution_count": null,
563 | "metadata": {},
564 | "outputs": [],
565 | "source": [
566 | "portugal = unemployment_merged[unemployment_merged['name_en'] == 'Portugal']"
567 | ]
568 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": null,
572 | "metadata": {},
573 | "outputs": [],
574 | "source": [
575 | "spain.head()"
576 | ]
577 | },
578 | {
579 | "cell_type": "code",
580 | "execution_count": null,
581 | "metadata": {},
582 | "outputs": [],
583 | "source": [
584 | "portugal.head()"
585 | ]
586 | },
587 | {
588 | "cell_type": "markdown",
589 | "metadata": {},
590 | "source": [
591 | "We'll take advantage of the `.plot()` function, simply calling `spain['unemployment_rate'].plot()`. We don't need to supply any argument to `.plot()` since we are using the default plot style -- a line-plot. We do add some other commands to add a y-axis label and render the plot."
592 | ]
593 | },
603 | {
604 | "cell_type": "code",
605 | "execution_count": null,
606 | "metadata": {},
607 | "outputs": [],
608 | "source": [
609 | "spain['unemployment_rate'].plot()\n",
610 | "\n",
611 | "# We add plt.show() to properly render the chart\n",
612 | "plt.show()"
613 | ]
614 | },
615 | {
616 | "cell_type": "markdown",
617 | "metadata": {},
618 | "source": [
619 | "Layering plots will involve simply calling multiple `.plot()` commands in the same Jupyter cell. We can add some basic styling as well, such as labels, a legend, and a title."
620 | ]
621 | },
622 | {
623 | "cell_type": "code",
624 | "execution_count": null,
625 | "metadata": {},
626 | "outputs": [],
627 | "source": [
628 | "# Plot commands\n",
629 | "spain['unemployment_rate'].plot()\n",
630 | "portugal['unemployment_rate'].plot()\n",
631 | "\n",
632 | "# Styling\n",
633 | "plt.legend([\"Spain\", \"Portugal\"])\n",
634 | "plt.ylabel(\"Count\")\n",
635 | "plt.title(\"Spain and Portugal Unemployment Rates\")\n",
636 | "plt.show()"
637 | ]
638 | },
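639 | {
640 | "cell_type": "markdown",
641 | "metadata": {},
642 | "source": [
643 | "To keep a chart as an image file, call `plt.savefig()` before `plt.show()`. A sketch -- the filename here is just an example:"
644 | ]
645 | },
646 | {
647 | "cell_type": "code",
648 | "execution_count": null,
649 | "metadata": {},
650 | "outputs": [],
651 | "source": [
652 | "spain['unemployment_rate'].plot()\n",
653 | "portugal['unemployment_rate'].plot()\n",
654 | "plt.legend([\"Spain\", \"Portugal\"])\n",
655 | "plt.ylabel(\"Unemployment rate\")\n",
656 | "plt.title(\"Spain and Portugal Unemployment Rates\")\n",
657 | "\n",
658 | "# Save the figure to disk (example filename) before rendering it\n",
659 | "plt.savefig('spain_portugal_unemployment.png')\n",
660 | "plt.show()"
661 | ]
662 | },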
639 | {
640 | "cell_type": "markdown",
641 | "metadata": {},
642 | "source": [
643 | "# 🎉 Well Done!\n",
644 | "\n",
645 | "This workshop series took us through the basics of data analysis using `pandas`. We indexed data using boolean logic, grouped data to perform conditional analyses, and we create basic but informative visualizations.\n",
646 | "\n",
647 | "## More Workshops!\n",
648 | "\n",
649 | "D-Lab teaches workshops that allow you to practice more with DataFrames and visualization.\n",
650 | "\n",
651 | "- To learn other fundamental Python topics, check out D-Lab's [Python Intermediate workshop](https://github.com/dlab-berkeley/Python-Intermediate-Pilot).\n",
652 | "- To learn more about data visualization, check out D-Lab's [Python Data Visualization workshop](https://github.com/dlab-berkeley/Python-Data-Visualization)."
653 | ]
654 | }
655 | ],
656 | "metadata": {
657 | "kernelspec": {
658 | "display_name": "Python 3 (ipykernel)",
659 | "language": "python",
660 | "name": "python3"
661 | },
662 | "language_info": {
663 | "codemirror_mode": {
664 | "name": "ipython",
665 | "version": 3
666 | },
667 | "file_extension": ".py",
668 | "mimetype": "text/x-python",
669 | "name": "python",
670 | "nbconvert_exporter": "python",
671 | "pygments_lexer": "ipython3",
672 | "version": "3.11.7"
673 | }
674 | },
675 | "nbformat": 4,
676 | "nbformat_minor": 4
677 | }
678 |
--------------------------------------------------------------------------------
/solutions/01_python_data_wrangling_solutions.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "# Imports pandas and assign it to the variable `pd`\n",
10 | "import pandas as pd\n",
11 | "\n",
12 | "# We often import NumPy (numerical python) with pandas\n",
13 | "# we will import that and assign it to the variable `np`\n",
14 | "import numpy as np\n",
15 | "\n",
16 | "# Load matplotlib for plotting later in the workshop\n",
17 | "import matplotlib.pyplot as plt\n",
18 | "%matplotlib inline"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "unemployment = pd.read_csv('../data/country_total.csv')"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "## 🥊 Challenge 1\n",
35 | "**Setup** \n",
36 | "We previously imported unemployment data into `pandas` using the `read_csv` function and a relative file path. `read_csv` also allows us to import data using a URL as the file path. \n",
37 | "\n",
38 | "A .csv file with data on world countries and their abbreviations is located at the following URL:\n",
39 | "\n",
40 | "[https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv](https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv)\n",
41 | "\n",
42 | "We can load that data directly as follows:"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 3,
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "countries_url = 'https://raw.githubusercontent.com/dlab-berkeley/Python-Data-Wrangling/main/data/countries.csv'\n",
52 | "countries = pd.read_csv(countries_url)"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {
58 | "jp-MarkdownHeadingCollapsed": true,
59 | "tags": []
60 | },
61 | "source": [
62 | "**Challenge** \n",
63 | "Whenever we open a new DataFrame, it's important to get a basic understanding of its structure.\n",
64 | "\n",
65 | "Using the methods and attributes we just discussed, **answer the following questions** about `countries`:\n",
66 | "\n",
67 | "1. What columns does `countries` contain?\n",
68 | "2. How many rows and columns does it contain?\n",
69 | "3. What are the minimum and maximum values of the columns with numerical data?\n",
70 | "\n",
71 | "Click for hint\n",
72 | "Hint: consider using .columns, .shape, and .describe() here.\n",
73 | ""
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 4,
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "data": {
83 | "text/plain": [
84 | "Index(['country', 'google_country_code', 'country_group', 'name_en', 'name_fr',\n",
85 | " 'name_de', 'latitude', 'longitude'],\n",
86 | " dtype='object')"
87 | ]
88 | },
89 | "execution_count": 4,
90 | "metadata": {},
91 | "output_type": "execute_result"
92 | }
93 | ],
94 | "source": [
95 | "# YOUR CODE HERE\n",
96 | "\n",
97 | "countries.columns"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 5,
103 | "metadata": {},
104 | "outputs": [
105 | {
106 | "data": {
107 | "text/plain": [
108 | "(30, 8)"
109 | ]
110 | },
111 | "execution_count": 5,
112 | "metadata": {},
113 | "output_type": "execute_result"
114 | }
115 | ],
116 | "source": [
117 | "# YOUR CODE HERE\n",
118 | "\n",
119 | "countries.shape"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 6,
125 | "metadata": {},
126 | "outputs": [
127 | {
128 | "data": {
129 | "text/html": [
130 | "
\n",
131 | "\n",
144 | "
\n",
145 | " \n",
146 | "
\n",
147 | "
\n",
148 | "
latitude
\n",
149 | "
longitude
\n",
150 | "
\n",
151 | " \n",
152 | " \n",
153 | "
\n",
154 | "
count
\n",
155 | "
30.000000
\n",
156 | "
30.000000
\n",
157 | "
\n",
158 | "
\n",
159 | "
mean
\n",
160 | "
49.092609
\n",
161 | "
14.324579
\n",
162 | "
\n",
163 | "
\n",
164 | "
std
\n",
165 | "
7.956624
\n",
166 | "
11.257010
\n",
167 | "
\n",
168 | "
\n",
169 | "
min
\n",
170 | "
35.129141
\n",
171 | "
-8.239122
\n",
172 | "
\n",
173 | "
\n",
174 | "
25%
\n",
175 | "
43.230916
\n",
176 | "
6.979186
\n",
177 | "
\n",
178 | "
\n",
179 | "
50%
\n",
180 | "
49.238087
\n",
181 | "
14.941462
\n",
182 | "
\n",
183 | "
\n",
184 | "
75%
\n",
185 | "
54.090400
\n",
186 | "
23.351690
\n",
187 | "
\n",
188 | "
\n",
189 | "
max
\n",
190 | "
64.950159
\n",
191 | "
35.439795
\n",
192 | "
\n",
193 | " \n",
194 | "
\n",
195 | "
"
196 | ],
197 | "text/plain": [
198 | " latitude longitude\n",
199 | "count 30.000000 30.000000\n",
200 | "mean 49.092609 14.324579\n",
201 | "std 7.956624 11.257010\n",
202 | "min 35.129141 -8.239122\n",
203 | "25% 43.230916 6.979186\n",
204 | "50% 49.238087 14.941462\n",
205 | "75% 54.090400 23.351690\n",
206 | "max 64.950159 35.439795"
207 | ]
208 | },
209 | "execution_count": 6,
210 | "metadata": {},
211 | "output_type": "execute_result"
212 | }
213 | ],
214 | "source": [
215 | "# YOUR CODE HERE\n",
216 | "\n",
217 | "countries.describe()"
218 | ]
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {},
223 | "source": [
224 | "## 🥊 Challenge 2: Indexing with `.loc`\n",
225 | "\n",
226 | "Let's get a little practice with the `.loc` operator. \n",
227 | "\n",
228 | "Select rows 10 through 20, then compute their average `latitude` \n",
229 | "\n",
230 | " Click for Hint\n",
231 | " This can be done using `.loc` and `.mean()`, all in one line of code: countries.loc[{your row selection}, {your column selection}].mean()\n",
232 | ""
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": 7,
238 | "metadata": {},
239 | "outputs": [
240 | {
241 | "data": {
242 | "text/plain": [
243 | "49.852639057272725"
244 | ]
245 | },
246 | "execution_count": 7,
247 | "metadata": {},
248 | "output_type": "execute_result"
249 | }
250 | ],
251 | "source": [
252 | "# YOUR CODE HERE\n",
253 | "\n",
254 | "countries.loc[10:20, 'latitude'].mean()"
255 | ]
256 | },
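257 | {
258 | "cell_type": "markdown",
259 | "metadata": {},
260 | "source": [
261 | "Note that, unlike standard Python slicing, `.loc` label slicing includes *both* endpoints, so `10:20` selects 11 rows. A quick check:"
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": null,
267 | "metadata": {},
268 | "outputs": [],
269 | "source": [
270 | "# .loc slicing is inclusive: rows 10 through 20 is 11 rows\n",
271 | "len(countries.loc[10:20])"
272 | ]
273 | },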
257 | {
258 | "cell_type": "markdown",
259 | "metadata": {},
260 | "source": [
261 | "## 🥊 Challenge 3: Boolean Indexing\n",
262 | "\n",
263 | "Let's push our boolean indexing skills a little further with a challenge problem.\n",
264 | "1. Find the average longitude of countries in our data, assign it to the variable `average_long`\n",
265 | "2. Find countries that have \"above average\" longitude\n",
266 | "\n",
267 | " Click for Hint\n",
268 | " Compute the average longitude of the data: countries['longitude'].mean() and save that to a variable average_long. Then, you can use that variable to create a boolean mask for indexing: countries['longitude'] > average_long\n",
269 | ""
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": 8,
275 | "metadata": {},
276 | "outputs": [],
277 | "source": [
278 | "# YOUR CODE HERE\n",
279 | "\n",
280 | "average_long = countries['longitude'].mean()"
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": 9,
286 | "metadata": {},
287 | "outputs": [
288 | {
289 | "data": {
290 | "text/html": [
291 | "
\n",
292 | "\n",
305 | "
\n",
306 | " \n",
307 | "
\n",
308 | "
\n",
309 | "
country
\n",
310 | "
google_country_code
\n",
311 | "
country_group
\n",
312 | "
name_en
\n",
313 | "
name_fr
\n",
314 | "
name_de
\n",
315 | "
latitude
\n",
316 | "
longitude
\n",
317 | "
\n",
318 | " \n",
319 | " \n",
320 | "
\n",
321 | "
2
\n",
322 | "
bg
\n",
323 | "
BG
\n",
324 | "
eu
\n",
325 | "
Bulgaria
\n",
326 | "
Bulgarie
\n",
327 | "
Bulgarien
\n",
328 | "
42.725674
\n",
329 | "
25.482322
\n",
330 | "
\n",
331 | "
\n",
332 | "
3
\n",
333 | "
hr
\n",
334 | "
HR
\n",
335 | "
non-eu
\n",
336 | "
Croatia
\n",
337 | "
Croatie
\n",
338 | "
Kroatien
\n",
339 | "
44.746643
\n",
340 | "
15.340844
\n",
341 | "
\n",
342 | "
\n",
343 | "
4
\n",
344 | "
cy
\n",
345 | "
CY
\n",
346 | "
eu
\n",
347 | "
Cyprus
\n",
348 | "
Chypre
\n",
349 | "
Zypern
\n",
350 | "
35.129141
\n",
351 | "
33.428682
\n",
352 | "
\n",
353 | "
\n",
354 | "
5
\n",
355 | "
cz
\n",
356 | "
CZ
\n",
357 | "
eu
\n",
358 | "
Czech Republic
\n",
359 | "
République tchèque
\n",
360 | "
Tschechische Republik
\n",
361 | "
49.803531
\n",
362 | "
15.474998
\n",
363 | "
\n",
364 | "
\n",
365 | "
7
\n",
366 | "
ee
\n",
367 | "
EE
\n",
368 | "
eu
\n",
369 | "
Estonia
\n",
370 | "
Estonie
\n",
371 | "
Estland
\n",
372 | "
58.592469
\n",
373 | "
25.806950
\n",
374 | "
\n",
375 | "
\n",
376 | "
8
\n",
377 | "
fi
\n",
378 | "
FI
\n",
379 | "
eu
\n",
380 | "
Finland
\n",
381 | "
Finlande
\n",
382 | "
Finnland
\n",
383 | "
64.950159
\n",
384 | "
26.067564
\n",
385 | "
\n",
386 | "
\n",
387 | "
11
\n",
388 | "
gr
\n",
389 | "
GR
\n",
390 | "
eu
\n",
391 | "
Greece
\n",
392 | "
Grèce
\n",
393 | "
Griechenland
\n",
394 | "
39.698467
\n",
395 | "
21.577256
\n",
396 | "
\n",
397 | "
\n",
398 | "
12
\n",
399 | "
hu
\n",
400 | "
HU
\n",
401 | "
eu
\n",
402 | "
Hungary
\n",
403 | "
Hongrie
\n",
404 | "
Ungarn
\n",
405 | "
47.161163
\n",
406 | "
19.504265
\n",
407 | "
\n",
408 | "
\n",
409 | "
15
\n",
410 | "
lv
\n",
411 | "
LV
\n",
412 | "
eu
\n",
413 | "
Latvia
\n",
414 | "
Lettonie
\n",
415 | "
Lettland
\n",
416 | "
56.880117
\n",
417 | "
24.606555
\n",
418 | "
\n",
419 | "
\n",
420 | "
16
\n",
421 | "
lt
\n",
422 | "
LT
\n",
423 | "
eu
\n",
424 | "
Lithuania
\n",
425 | "
Lituanie
\n",
426 | "
Litauen
\n",
427 | "
55.173687
\n",
428 | "
23.943168
\n",
429 | "
\n",
430 | "
\n",
431 | "
18
\n",
432 | "
mt
\n",
433 | "
MT
\n",
434 | "
eu
\n",
435 | "
Malta
\n",
436 | "
Malte
\n",
437 | "
Malta
\n",
438 | "
35.902422
\n",
439 | "
14.447461
\n",
440 | "
\n",
441 | "
\n",
442 | "
21
\n",
443 | "
pl
\n",
444 | "
PL
\n",
445 | "
eu
\n",
446 | "
Poland
\n",
447 | "
Pologne
\n",
448 | "
Polen
\n",
449 | "
51.918907
\n",
450 | "
19.134334
\n",
451 | "
\n",
452 | "
\n",
453 | "
23
\n",
454 | "
ro
\n",
455 | "
RO
\n",
456 | "
eu
\n",
457 | "
Romania
\n",
458 | "
Roumanie
\n",
459 | "
Rumänien
\n",
460 | "
45.942611
\n",
461 | "
24.990152
\n",
462 | "
\n",
463 | "
\n",
464 | "
24
\n",
465 | "
sk
\n",
466 | "
SK
\n",
467 | "
eu
\n",
468 | "
Slovakia
\n",
469 | "
Slovaquie
\n",
470 | "
Slowakei
\n",
471 | "
48.672644
\n",
472 | "
19.700032
\n",
473 | "
\n",
474 | "
\n",
475 | "
25
\n",
476 | "
si
\n",
477 | "
SI
\n",
478 | "
eu
\n",
479 | "
Slovenia
\n",
480 | "
Slovénie
\n",
481 | "
Slowenien
\n",
482 | "
46.149259
\n",
483 | "
14.986617
\n",
484 | "
\n",
485 | "
\n",
486 | "
27
\n",
487 | "
se
\n",
488 | "
SE
\n",
489 | "
eu
\n",
490 | "
Sweden
\n",
491 | "
Suède
\n",
492 | "
Schweden
\n",
493 | "
62.198467
\n",
494 | "
14.896307
\n",
495 | "
\n",
496 | "
\n",
497 | "
28
\n",
498 | "
tr
\n",
499 | "
TR
\n",
500 | "
non-eu
\n",
501 | "
Turkey
\n",
502 | "
Turquie
\n",
503 | "
Türkei
\n",
504 | "
38.952942
\n",
505 | "
35.439795
\n",
506 | "
\n",
507 | " \n",
508 | "
\n",
509 | "
"
510 | ],
511 | "text/plain": [
512 | " country google_country_code country_group name_en \\\n",
513 | "2 bg BG eu Bulgaria \n",
514 | "3 hr HR non-eu Croatia \n",
515 | "4 cy CY eu Cyprus \n",
516 | "5 cz CZ eu Czech Republic \n",
517 | "7 ee EE eu Estonia \n",
518 | "8 fi FI eu Finland \n",
519 | "11 gr GR eu Greece \n",
520 | "12 hu HU eu Hungary \n",
521 | "15 lv LV eu Latvia \n",
522 | "16 lt LT eu Lithuania \n",
523 | "18 mt MT eu Malta \n",
524 | "21 pl PL eu Poland \n",
525 | "23 ro RO eu Romania \n",
526 | "24 sk SK eu Slovakia \n",
527 | "25 si SI eu Slovenia \n",
528 | "27 se SE eu Sweden \n",
529 | "28 tr TR non-eu Turkey \n",
530 | "\n",
531 | " name_fr name_de latitude longitude \n",
532 | "2 Bulgarie Bulgarien 42.725674 25.482322 \n",
533 | "3 Croatie Kroatien 44.746643 15.340844 \n",
534 | "4 Chypre Zypern 35.129141 33.428682 \n",
535 | "5 République tchèque Tschechische Republik 49.803531 15.474998 \n",
536 | "7 Estonie Estland 58.592469 25.806950 \n",
537 | "8 Finlande Finnland 64.950159 26.067564 \n",
538 | "11 Grèce Griechenland 39.698467 21.577256 \n",
539 | "12 Hongrie Ungarn 47.161163 19.504265 \n",
540 | "15 Lettonie Lettland 56.880117 24.606555 \n",
541 | "16 Lituanie Litauen 55.173687 23.943168 \n",
542 | "18 Malte Malta 35.902422 14.447461 \n",
543 | "21 Pologne Polen 51.918907 19.134334 \n",
544 | "23 Roumanie Rumänien 45.942611 24.990152 \n",
545 | "24 Slovaquie Slowakei 48.672644 19.700032 \n",
546 | "25 Slovénie Slowenien 46.149259 14.986617 \n",
547 | "27 Suède Schweden 62.198467 14.896307 \n",
548 | "28 Turquie Türkei 38.952942 35.439795 "
549 | ]
550 | },
551 | "execution_count": 9,
552 | "metadata": {},
553 | "output_type": "execute_result"
554 | }
555 | ],
556 | "source": [
557 | "# YOUR CODE HERE\n",
558 | "\n",
559 | "countries[countries['longitude'] > average_long]"
560 | ]
561 | }
562 | ],
563 | "metadata": {
564 | "kernelspec": {
565 | "display_name": "Python 3 (ipykernel)",
566 | "language": "python",
567 | "name": "python3"
568 | },
569 | "language_info": {
570 | "codemirror_mode": {
571 | "name": "ipython",
572 | "version": 3
573 | },
574 | "file_extension": ".py",
575 | "mimetype": "text/x-python",
576 | "name": "python",
577 | "nbconvert_exporter": "python",
578 | "pygments_lexer": "ipython3",
579 | "version": "3.9.15"
580 | }
581 | },
582 | "nbformat": 4,
583 | "nbformat_minor": 4
584 | }
585 |
--------------------------------------------------------------------------------
/solutions/02_python_data_wrangling_solutions.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import pandas as pd\n",
10 | "import numpy as np\n",
11 | "import matplotlib.pyplot as plt\n",
12 | "%matplotlib inline"
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": 2,
18 | "metadata": {},
19 | "outputs": [],
20 | "source": [
21 | "# Open the unemployment dataset\n",
22 | "unemployment = pd.read_csv('../data/cleaned_country_totals.csv')\n",
23 | "# This is some formatting that's out of scope\n",
24 | "unemployment['date'] = pd.to_datetime(unemployment['date'])\n",
25 | "\n",
26 | "# Open the countries dataset\n",
27 | "countries_url = 'https://raw.githubusercontent.com/dlab-berkeley/Python-Data-Wrangling/main/data/countries.csv'\n",
28 | "countries = pd.read_csv(countries_url)"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": 3,
34 | "metadata": {},
35 | "outputs": [],
36 | "source": [
37 | "unemployment = unemployment.dropna(subset=\"unemployment_rate\")"
38 | ]
39 | },
40 | {
41 | "attachments": {},
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "## 🥊 Challenge 5\n",
46 | "Let's use sorting to answer a practical question: \n",
47 | "\n",
48 | "Which country has the highest unemployment rate in our data? \n",
49 | "\n",
50 | "Click for hint\n",
51 | "1. Use .sort_values() to sort this data based on the unemployment rate using descending order \n",
52 | "2. Select the top row using .head() with an argument for number of rows \n",
53 | "\n",
54 | "\n"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 4,
60 | "metadata": {},
61 | "outputs": [
62 | {
63 | "data": {
64 | "text/html": [
65 | "