├── .gitignore ├── attendees.csv ├── employee-earnings-report-2016.csv ├── pandas-data-cleaning-tricks.ipynb ├── pandas-data-cleaning-tricks.pdf ├── pandas-data-cleaning-tricks.py ├── readme.md └── unemployment.xlsx /.gitignore: -------------------------------------------------------------------------------- 1 | *.ipynb_checkpoints -------------------------------------------------------------------------------- /attendees.csv: -------------------------------------------------------------------------------- 1 | Occupation,Job title,Age group,Gender,State/Province,Education,Which data subject area are you most interested in working with? (Select up to three),What do you hope to get out of the workshop?,Which type of laptop will you bring?,College or University Name,Major or Concentration ,College Year,Which Digital Badge track best suits you?,Which session would you like to attend?,Choose your status: 2 | Data Analyst,Data Quality Analyst,30-39,Male,MA,Bachelor's Degree,Retail,other,PC,,,,Advanced Data Storytelling,June 5-9,"Nonprofit, Academic, Government" 3 | PhD Student,Student/Research Assistant,18-29,Male,MA,Bachelor's Degree,Sports,Master Advanced R,PC,Boston University,Biostatistics,PhD,Advanced Data Storytelling,June 5-9,Student 4 | Education,Data Analyst,18-29,Female,Kentucky,Master's Degree,Retail,other,PC,,,,Advanced Data Storytelling,June 5-9,"Nonprofit, Academic, Government" 5 | Manager,BAS Manager,30-39,Male,MA,Bachelor's Degree,Education,Pick up Beginning R And SQL,PC,Boston University,PEMBA,Graduate,Advanced Data Storytelling,June 5-9,Student 6 | Government Finance,Performance Analyst,30 - 39,Male,MA,Master's Degree,"Environment, Finance, Food and agriculture",Pick up Beginning R And SQL,MAC,,,,Advanced Data Storytelling,June 5-9,"Nonprofit, Academic, Government Early Bird" 7 | Engineer,Display Engineer,30-39, Female,MA,Bachelor's Degree,"Environment, Finance, Food and Agriculture","Explore the field of data storytelling, including career options, 
Improve my ability to write with numbers, Acquire data visualization skills, Effectively clean and standardize data, Pick up Beginning R And SQL, Master Advanced R",Advanced Data Storytelling,,,,Advanced Data Storytelling,June 5-9,Professional 8 | self-employed,CEO and founder,30-39, Female,MA,Master's Degree,"Criminal justice, Education, Environment","Improve my ability to write with numbers, Acquire data visualization skills, Analyze data better, Effectively clean and standardize data, Pick up Beginning R And SQL",PC,,,,Advanced Data Storytelling,June 5-9,Professional 9 | Evaluator,"Assistant Director, Stewardship & Donor Relations",30-39,Female,MA,Master's Degree,"Education, Environment, Health care","Analyze data better, Pick up Beginning R And SQL, other",Mac,,,,Advanced Data Storytelling,June 5-9,"Nonprofit, Academic, Government" 10 | Marketing Analytics,Sr. Analytics Consultant,30-39,Female,GA,Master's Degree,"Criminal justice, Finance, Retail","Acquire data visualization skills, Analyze data better, Discover the real story revealed in the data, Utilize spreadsheets for data analysis and visualization, Pick up Beginning R And SQL",PC,,,,Advanced Data Storytelling,June 5-9,Professional 11 | Student,Ph.D. 
Student,30-39,Male,LA,Doctoral or other advanced degree,"Campaign finance, Sports, Retail","Acquire data visualization skills, Analyze data better, Effectively clean and standardize data, Utilize spreadsheets for data analysis and visualization, Pick up Beginning R And SQL, Master Advanced R",Mac,Louisiana State University,Mass Communication,PhD,Advanced Data Storytelling,June 5-9,Student 12 | Student,College graduate,18-29,Male,MA,Bachelor's Degree,"Environment, Food and Agriculture, Health care, Retail","Explore the field of data storytelling, including career options, Improve my ability to write with numbers, Acquire data visualization skills, Analyze data better, Discover the real story revealed in the data, Utilize spreadsheets for data analysis and visualization, Pick up Beginning R And SQL",Mac,Mc Gill,Psychology,Graduate,Advanced Data Storytelling,June 5-9,Student 13 | Student,Student,18-29,Female,MA,Bachelor's Degree,"Environment, Finance, Retail","Explore the field of data storytelling, including career options, Improve my ability to write with numbers, Acquire data visualization skills, Analyze data better, Discover the real story revealed in the data, Effectively clean and standardize data, Utilize spreadsheets for data analysis and visualization, Pick up Beginning R And SQL, Master Advanced R",Mac,Boston University,Accounting and Finance,Senior,Advanced Data Storytelling,June 5-9,Student -------------------------------------------------------------------------------- /employee-earnings-report-2016.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/underthecurve/pandas-data-cleaning-tricks/cd96cd85ac5b941eb077bbf6bc30b51da0e3f29b/employee-earnings-report-2016.csv -------------------------------------------------------------------------------- /pandas-data-cleaning-tricks.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 
| "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tricks for cleaning your data in Python using pandas\n", 8 | "\n", 9 | "**By Christine Zhang ([ychristinezhang at gmail dot com](mailto:ychristinezhang@gmail.com / [@christinezhang](https://twitter.com/christinezhang) | [@christinezhang](https://twitter.com/christinezhang))**" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "GitHub repository for Data+Code: https://github.com/underthecurve/pandas-data-cleaning-tricks\n", 17 | "\n", 18 | "In 2017 I gave a talk called \"Tricks for cleaning your data in R\" which I presented at the [Data+Narrative workshop](http://www.bu.edu/com/data-narrative/) at Boston University. The repo with the code and data, https://github.com/underthecurve/r-data-cleaning-tricks, was pretty well-received, so I figured I'd try to do some of the same stuff in Python using `pandas`.\n", 19 | "\n", 20 | "**Disclaimer:** when it comes to data stuff, I'm much better with R, especially the `tidyverse` set of packages, than with Python, but in my last job I used Python's `pandas` library to do a lot of data processing since Python was the dominant language there.\n", 21 | "\n", 22 | "Anyway, here goes: \n", 23 | "\n", 24 | "Data cleaning is a cumbersome task, and it can be hard to navigate in programming languages like Python. \n", 25 | "\n", 26 | "The `pandas` library in Python is a powerful tool for data cleaning and analysis. By default, it leaves a trail of code that documents all the work you've done, which makes it extremely useful for creating reproducible workflows.\n", 27 | "\n", 28 | "In this workshop, I'll show you some examples of real-life \"messy\" datasets, the problems they present for analysis in Python's `pandas` library, and some of the solutions to these problems.\n", 29 | "\n", 30 | "Fittingly, I'll [start the numbering system at 0](http://python-history.blogspot.com/2013/10/why-python-uses-0-based-indexing.html)." 
31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "## 0. Importing the `pandas` library" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "Here I tell Python to import the `pandas` library as `pd` (a common alias for `pandas` — more on that in the next code chunk)." 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 1, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "import pandas as pd" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "## 1. Finding and replacing non-numeric characters like `,` and `$`" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "Let's check out the city of Boston's [Open Data portal](https://data.boston.gov/), where the local government puts up datasets that are free for the public to analyze.\n", 68 | "\n", 69 | "The [Employee Earnings Report](https://data.boston.gov/dataset/employee-earnings-report) is one of the more interesting ones, because it gives payroll data for every person on the municipal payroll. It's where the *Boston Globe* gets stories like these every year:\n", 70 | "\n", 71 | "- [\"64 City of Boston workers earn more than $250,000\"](https://www.bostonglobe.com/metro/2016/02/05/city-boston-workers-earn-more-than/MvW6RExJZimdrTlwdwUI7M/story.html) (February 6, 2016)\n", 72 | "\n", 73 | "- [\"Police detective tops Boston’s payroll with a total of over $403,000\"](https://www.bostonglobe.com/metro/2017/02/14/police-detective-tops-boston-payroll-with-total-over/6PaXwTAHZGEW5djgwCJuTI/story.html) (February 14, 2017)\n", 74 | "\n", 75 | "Let's take a look at the February 14 story from 2017. 
The story begins:\n", 76 | "\n", 77 | "> \"A veteran police detective took home more than $403,000 in earnings last year, topping the list of Boston’s highest-paid employees in 2016, newly released city payroll data show.\"\n", 78 | "\n", 79 | "**What if we wanted to check this number using the Employee Earnings Report?**" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "We can use the `pandas` function [`pandas.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to load the csv file into Python. We will call this DataFrame `salary`. Remember that I imported `pandas` \"as `pd`\" in the last code chunk. This saves me a bit of typing by allowing me to access `pandas` functions like `pandas.read_csv()` by typing `pd.read_csv()` instead. If I had typed `import pandas` in the code chunk under section `0` without `as pd`, the below code wouldn't work. I'd have to instead write `pandas.read_csv()` to access the function.\n", 87 | "\n", 88 | "The `pd` alias for `pandas` is so common that the library's [documentation](http://pandas.pydata.org/pandas-docs/stable/install.html#running-the-test-suite) even uses it sometimes.\n", 89 | "\n", 90 | "Let's try to use `pd.read_csv()`:" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 2, 96 | "metadata": {}, 97 | "outputs": [ 98 | { 99 | "ename": "UnicodeDecodeError", 100 | "evalue": "'utf-8' codec can't decode byte 0xe9 in position 22: invalid continuation byte", 101 | "output_type": "error", 102 | "traceback": [ 103 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 104 | "\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)", 105 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._convert_tokens\u001b[0;34m()\u001b[0m\n", 106 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in 
\u001b[0;36mpandas._libs.parsers.TextReader._convert_with_dtype\u001b[0;34m()\u001b[0m\n", 107 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._string_convert\u001b[0;34m()\u001b[0m\n", 108 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers._string_box_utf8\u001b[0;34m()\u001b[0m\n", 109 | "\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xe9 in position 22: invalid continuation byte", 110 | "\nDuring handling of the above exception, another exception occurred:\n", 111 | "\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)", 112 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0msalary\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'employee-earnings-report-2016.csv'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 113 | "\u001b[0;32m~/anaconda/envs/ipykernel_py3/lib/python3.6/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36mparser_f\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)\u001b[0m\n\u001b[1;32m 676\u001b[0m skip_blank_lines=skip_blank_lines)\n\u001b[1;32m 677\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 678\u001b[0;31m \u001b[0;32mreturn\u001b[0m 
\u001b[0m_read\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 679\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 680\u001b[0m \u001b[0mparser_f\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__name__\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 114 | "\u001b[0;32m~/anaconda/envs/ipykernel_py3/lib/python3.6/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m 444\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 445\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 446\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mparser\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnrows\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 447\u001b[0m \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 448\u001b[0m \u001b[0mparser\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 115 | "\u001b[0;32m~/anaconda/envs/ipykernel_py3/lib/python3.6/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36mread\u001b[0;34m(self, nrows)\u001b[0m\n\u001b[1;32m 1034\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'skipfooter not supported for iteration'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1035\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1036\u001b[0;31m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnrows\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1037\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1038\u001b[0m \u001b[0;31m# May alter columns / col_dict\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 116 | "\u001b[0;32m~/anaconda/envs/ipykernel_py3/lib/python3.6/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36mread\u001b[0;34m(self, nrows)\u001b[0m\n\u001b[1;32m 1846\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnrows\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1847\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1848\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_reader\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnrows\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1849\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mStopIteration\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1850\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_first_chunk\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 117 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader.read\u001b[0;34m()\u001b[0m\n", 118 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._read_low_memory\u001b[0;34m()\u001b[0m\n", 119 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in 
\u001b[0;36mpandas._libs.parsers.TextReader._read_rows\u001b[0;34m()\u001b[0m\n", 120 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._convert_column_data\u001b[0;34m()\u001b[0m\n", 121 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._convert_tokens\u001b[0;34m()\u001b[0m\n", 122 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._convert_with_dtype\u001b[0;34m()\u001b[0m\n", 123 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._string_convert\u001b[0;34m()\u001b[0m\n", 124 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers._string_box_utf8\u001b[0;34m()\u001b[0m\n", 125 | "\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xe9 in position 22: invalid continuation byte" 126 | ] 127 | } 128 | ], 129 | "source": [ 130 | "salary = pd.read_csv('employee-earnings-report-2016.csv')" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "That's a pretty long and nasty error. Usually when I run into something like this, I start from the bottom and work my way up — in this case, I typed `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 22: invalid continuation byte` into a search engine and came across [this discussion on the Stack Overflow forum](https://stackoverflow.com/questions/30462807/encoding-error-in-panda-read-csv). The last response suggested that adding `encoding ='latin1'` inside the function would fix the problem on Macs (which is the type of computer I have)." 
138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 3, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "salary = pd.read_csv('employee-earnings-report-2016.csv', encoding = 'latin-1')" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "Great! (I don't know much about encoding, but this is something I run into from time to time so I thought it would be helpful to show here.)\n", 154 | "\n", 155 | "We can use `head()` on the `salary` DataFrame to inspect its first five rows. (Note I use the `print()` function to display the output, but you don't need to do this in your own code if you'd prefer not to.)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 4, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | " NAME DEPARTMENT_NAME TITLE \\\n", 168 | "0 Abadi,Kidani A Assessing Department Property Officer \n", 169 | "1 Abasciano,Joseph Boston Police Department Police Officer \n", 170 | "2 Abban,Christopher John Boston Fire Department Fire Fighter \n", 171 | "3 Abbasi,Sophia Green Academy Manager (C) (non-ac) \n", 172 | "4 Abbate-Vaughn,Jorgelina BPS Ellis Elementary Teacher \n", 173 | "\n", 174 | " REGULAR RETRO OTHER OVERTIME INJURED DETAIL \\\n", 175 | "0 $46,291.98 NaN $300.00 NaN NaN NaN \n", 176 | "1 $6,933.66 NaN $850.00 $205.92 $74,331.86 NaN \n", 177 | "2 $103,442.22 NaN $550.00 $15,884.53 NaN $4,746.50 \n", 178 | "3 $18,249.83 NaN NaN NaN NaN NaN \n", 179 | "4 $84,410.28 NaN $1,250.00 NaN NaN NaN \n", 180 | "\n", 181 | " QUINN/EDUCATION INCENTIVE TOTAL EARNINGS POSTAL \n", 182 | "0 NaN $46,591.98 02118 \n", 183 | "1 $15,258.44 $97,579.88 02132 \n", 184 | "2 NaN $124,623.25 02132 \n", 185 | "3 NaN $18,249.83 02148 \n", 186 | "4 NaN $85,660.28 02481 \n" 187 | ] 188 | } 189 | ], 190 | "source": [ 191 | "print(salary.head())" 192 | ] 193 | }, 194 | { 195 | "cell_type": 
"markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "There are a lot of columns. Let's simplify by selecting the ones of interest: `NAME`, `DEPARTMENT_NAME`, and `TOTAL.EARNINGS`. There are [a few different ways](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c) of doing this with `pandas`. The simplest way, imo, is by using the indexing operator `[]`.\n", 199 | "\n", 200 | "For example, I could select a single column, `NAME`: (Note I also run the line `pd.options.display.max_rows = 20` in order to display a maximum of 20 rows so the output isn't too crowded.)" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 5, 206 | "metadata": { 207 | "scrolled": false 208 | }, 209 | "outputs": [ 210 | { 211 | "data": { 212 | "text/plain": [ 213 | "0 Abadi,Kidani A\n", 214 | "1 Abasciano,Joseph\n", 215 | "2 Abban,Christopher John\n", 216 | "3 Abbasi,Sophia\n", 217 | "4 Abbate-Vaughn,Jorgelina\n", 218 | "5 Abberton,James P\n", 219 | "6 Abbott,Erin Elizabeth\n", 220 | "7 Abbott,John R.\n", 221 | "8 Abbruzzese,Angela\n", 222 | "9 Abbruzzese,Donna\n", 223 | " ... 
\n", 224 | "22036 Zuares,David Jonathan\n", 225 | "22037 Zubrin,William W.\n", 226 | "22038 Zuccaro,John E.\n", 227 | "22039 Zucker,Alyse Paige\n", 228 | "22040 Zuckerman,Naomi Julia\n", 229 | "22041 Zukowski III,Charles\n", 230 | "22042 Zuluaga Castro,Juan Pablo\n", 231 | "22043 Zwarich,Maralene Zoann\n", 232 | "22044 Zweig,Susanna B\n", 233 | "22045 Zwerdling,Laura\n", 234 | "Name: NAME, Length: 22046, dtype: object" 235 | ] 236 | }, 237 | "execution_count": 5, 238 | "metadata": {}, 239 | "output_type": "execute_result" 240 | } 241 | ], 242 | "source": [ 243 | "pd.options.display.max_rows = 20\n", 244 | "\n", 245 | "salary['NAME']" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "This works for selecting one column at a time, but using `[]` returns a [Series](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series), not a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe). I can confirm this using the `type()` function:" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 6, 258 | "metadata": {}, 259 | "outputs": [ 260 | { 261 | "data": { 262 | "text/plain": [ 263 | "pandas.core.series.Series" 264 | ] 265 | }, 266 | "execution_count": 6, 267 | "metadata": {}, 268 | "output_type": "execute_result" 269 | } 270 | ], 271 | "source": [ 272 | "type(salary['NAME'])" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "If I want a DataFrame, I have to use double brackets:" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 7, 285 | "metadata": {}, 286 | "outputs": [ 287 | { 288 | "data": { 289 | "text/html": [ 290 | "
\n", 291 | "\n", 304 | "\n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | "
NAME
0Abadi,Kidani A
1Abasciano,Joseph
2Abban,Christopher John
3Abbasi,Sophia
4Abbate-Vaughn,Jorgelina
5Abberton,James P
6Abbott,Erin Elizabeth
7Abbott,John R.
8Abbruzzese,Angela
9Abbruzzese,Donna
......
22036Zuares,David Jonathan
22037Zubrin,William W.
22038Zuccaro,John E.
22039Zucker,Alyse Paige
22040Zuckerman,Naomi Julia
22041Zukowski III,Charles
22042Zuluaga Castro,Juan Pablo
22043Zwarich,Maralene Zoann
22044Zweig,Susanna B
22045Zwerdling,Laura
\n", 398 | "

22046 rows × 1 columns

\n", 399 | "
" 400 | ], 401 | "text/plain": [ 402 | " NAME\n", 403 | "0 Abadi,Kidani A\n", 404 | "1 Abasciano,Joseph\n", 405 | "2 Abban,Christopher John\n", 406 | "3 Abbasi,Sophia\n", 407 | "4 Abbate-Vaughn,Jorgelina\n", 408 | "5 Abberton,James P\n", 409 | "6 Abbott,Erin Elizabeth\n", 410 | "7 Abbott,John R.\n", 411 | "8 Abbruzzese,Angela\n", 412 | "9 Abbruzzese,Donna\n", 413 | "... ...\n", 414 | "22036 Zuares,David Jonathan\n", 415 | "22037 Zubrin,William W.\n", 416 | "22038 Zuccaro,John E.\n", 417 | "22039 Zucker,Alyse Paige\n", 418 | "22040 Zuckerman,Naomi Julia\n", 419 | "22041 Zukowski III,Charles\n", 420 | "22042 Zuluaga Castro,Juan Pablo\n", 421 | "22043 Zwarich,Maralene Zoann\n", 422 | "22044 Zweig,Susanna B\n", 423 | "22045 Zwerdling,Laura\n", 424 | "\n", 425 | "[22046 rows x 1 columns]" 426 | ] 427 | }, 428 | "execution_count": 7, 429 | "metadata": {}, 430 | "output_type": "execute_result" 431 | } 432 | ], 433 | "source": [ 434 | "salary[['NAME']]" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 8, 440 | "metadata": {}, 441 | "outputs": [ 442 | { 443 | "data": { 444 | "text/plain": [ 445 | "pandas.core.frame.DataFrame" 446 | ] 447 | }, 448 | "execution_count": 8, 449 | "metadata": {}, 450 | "output_type": "execute_result" 451 | } 452 | ], 453 | "source": [ 454 | "type(salary[['NAME']])" 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "metadata": {}, 460 | "source": [ 461 | "To select multiple columns, we can put those columns inside of the second pair of brackets. We will save this into a new DataFrame, `salary_selected`. We type `.copy()` after `salary[['NAME','DEPARTMENT_NAME', 'TOTAL EARNINGS']]` because we are making a copy of the DataFrame and assigning it to new DataFrame. Learn more about `copy()` [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html)." 
462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": 9, 467 | "metadata": {}, 468 | "outputs": [], 469 | "source": [ 470 | "salary_selected = salary[['NAME','DEPARTMENT_NAME', 'TOTAL EARNINGS']].copy()" 471 | ] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "metadata": {}, 476 | "source": [ 477 | "We can also change the column names to lowercase names for easier typing. First, let's take a look at the columns by displaying the `columns` attribute of the `salary_selected` DataFrame." 478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": 10, 483 | "metadata": {}, 484 | "outputs": [ 485 | { 486 | "data": { 487 | "text/plain": [ 488 | "Index(['NAME', 'DEPARTMENT_NAME', 'TOTAL EARNINGS'], dtype='object')" 489 | ] 490 | }, 491 | "execution_count": 10, 492 | "metadata": {}, 493 | "output_type": "execute_result" 494 | } 495 | ], 496 | "source": [ 497 | "salary_selected.columns" 498 | ] 499 | }, 500 | { 501 | "cell_type": "code", 502 | "execution_count": 11, 503 | "metadata": {}, 504 | "outputs": [ 505 | { 506 | "data": { 507 | "text/plain": [ 508 | "pandas.core.indexes.base.Index" 509 | ] 510 | }, 511 | "execution_count": 11, 512 | "metadata": {}, 513 | "output_type": "execute_result" 514 | } 515 | ], 516 | "source": [ 517 | "type(salary_selected.columns)" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": {}, 523 | "source": [ 524 | "Notice how this returns something called an \"Index.\" In `pandas`, DataFrames have both row indexes (in our case, the row number, starting from 0 and going to 22045) and column indexes. We can use the [`str.lower()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.lower.html) function to convert the strings (aka characters) in the index to lowercase." 
525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": 12, 530 | "metadata": {}, 531 | "outputs": [ 532 | { 533 | "data": { 534 | "text/plain": [ 535 | "Index(['name', 'department_name', 'total earnings'], dtype='object')" 536 | ] 537 | }, 538 | "execution_count": 12, 539 | "metadata": {}, 540 | "output_type": "execute_result" 541 | } 542 | ], 543 | "source": [ 544 | "salary_selected.columns = salary_selected.columns.str.lower()\n", 545 | "\n", 546 | "salary_selected.columns" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "metadata": {}, 552 | "source": [ 553 | "Another thing that would make our lives easier is if the `total earnings` column didn't have a space between `total` and `earnings`. We can use a \"string replace\" function, [`str.replace()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html), to replace the space with an underscore, remembering to assign the result back to the columns. The syntax is: `str.replace('thing you want to replace', 'what to replace it with')` " 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": 13, 559 | "metadata": {}, 560 | "outputs": [ 561 | { 562 | "data": { 563 | "text/plain": [ 564 | "Index(['name', 'department_name', 'total_earnings'], dtype='object')" 565 | ] 566 | }, 567 | "execution_count": 13, 568 | "metadata": {}, 569 | "output_type": "execute_result" 570 | } 571 | ], 572 | "source": [ 573 | "salary_selected.columns = salary_selected.columns.str.replace(' ', '_')\n", 574 | "\n", 575 | "salary_selected.columns" 576 | ] 577 | }, 578 | { 579 | "cell_type": "markdown", 580 | "metadata": {}, 581 | "source": [ 582 | "We could have used both the `str.lower()` and `str.replace()` functions in one line of code by putting them one after the other (aka \"chaining\"):" 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": 14, 588 | "metadata": {}, 589 | "outputs": [ 590 | { 591 | "data": { 592 | "text/plain": [ 593 | "Index(['name', 'department_name', 'total_earnings'], dtype='object')" 
594 | ] 595 | }, 596 | "execution_count": 14, 597 | "metadata": {}, 598 | "output_type": "execute_result" 599 | } 600 | ], 601 | "source": [ 602 | "salary_selected.columns = salary_selected.columns.str.lower().str.replace(' ', '_') \n", 603 | "\n", 604 | "salary_selected.columns" 605 | ] 606 | }, 607 | { 608 | "cell_type": "markdown", 609 | "metadata": {}, 610 | "source": [ 611 | "Let's use `head()` to visually inspect the first five rows of `salary_selected`:" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": 15, 617 | "metadata": {}, 618 | "outputs": [ 619 | { 620 | "name": "stdout", 621 | "output_type": "stream", 622 | "text": [ 623 | " name department_name total_earnings\n", 624 | "0 Abadi,Kidani A Assessing Department $46,591.98\n", 625 | "1 Abasciano,Joseph Boston Police Department $97,579.88\n", 626 | "2 Abban,Christopher John Boston Fire Department $124,623.25\n", 627 | "3 Abbasi,Sophia Green Academy $18,249.83\n", 628 | "4 Abbate-Vaughn,Jorgelina BPS Ellis Elementary $85,660.28\n" 629 | ] 630 | } 631 | ], 632 | "source": [ 633 | "print(salary_selected.head()) " 634 | ] 635 | }, 636 | { 637 | "cell_type": "markdown", 638 | "metadata": {}, 639 | "source": [ 640 | "Now let's try sorting the data by `total_earnings` using the [`sort_values()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) function in `pandas`:" 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": 16, 646 | "metadata": {}, 647 | "outputs": [], 648 | "source": [ 649 | "salary_sort = salary_selected.sort_values('total_earnings')" 650 | ] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "metadata": {}, 655 | "source": [ 656 | "We can use `head()` to visually inspect `salary_sort`:" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": 17, 662 | "metadata": {}, 663 | "outputs": [ 664 | { 665 | "name": "stdout", 666 | "output_type": "stream", 667 | "text": [ 668 | " name 
department_name total_earnings\n", 669 | "11146 Lally,Bernadette Boston City Council $1,000.00\n", 670 | "7104 Fowlkes,Lorraine E. Boston City Council $1,000.00\n", 671 | "15058 Nolan,Andrew Parks Department $1,000.00\n", 672 | "21349 White-Pilet,Yoni A BPS Substitute Teachers/Nurs $1,006.53\n", 673 | "5915 Dunn,Lori D BPS East Boston High $1,010.05\n" 674 | ] 675 | } 676 | ], 677 | "source": [ 678 | "print(salary_sort.head())" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "At first glance, it looks okay: the employees appear to be sorted by `total_earnings` from lowest to highest. If the sort really worked, we'd expect the last row of the `salary_sort` DataFrame to contain the employee with the highest earnings. Let's take a look at the last five rows using `tail()`." 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": 18, 691 | "metadata": {}, 692 | "outputs": [ 693 | { 694 | "name": "stdout", 695 | "output_type": "stream", 696 | "text": [ 697 | " name department_name total_earnings\n", 698 | "13303 McGrath,Caitlin BPS Substitute Teachers/Nurs $990.61\n", 699 | "1869 Bradshaw,John E. BPS Substitute Teachers/Nurs $990.62\n", 700 | "21380 Wiggins,Lucas A BPS Substitute Teachers/Nurs $990.63\n", 701 | "15036 Nixon,Chloe BPS Substitute Teachers/Nurs $990.64\n", 702 | "10478 Kassa,Selamawit BPS Substitute Teachers/Nurs $990.64\n" 703 | ] 704 | } 705 | ], 706 | "source": [ 707 | "print(salary_sort.tail())" 708 | ] 709 | }, 710 | { 711 | "cell_type": "markdown", 712 | "metadata": {}, 713 | "source": [ 714 | "**What went wrong?**\n", 715 | "\n", 716 | "The problem is that there are non-numeric characters, `,` and `$`, in the `total_earnings` column. We can see with [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html), which returns the data type of each column in the DataFrame, that `total_earnings` is recognized as an \"object\".
717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": 19, 722 | "metadata": {}, 723 | "outputs": [ 724 | { 725 | "data": { 726 | "text/plain": [ 727 | "name object\n", 728 | "department_name object\n", 729 | "total_earnings object\n", 730 | "dtype: object" 731 | ] 732 | }, 733 | "execution_count": 19, 734 | "metadata": {}, 735 | "output_type": "execute_result" 736 | } 737 | ], 738 | "source": [ 739 | "salary_selected.dtypes" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "[Here](http://pbpython.com/pandas_dtypes.html) is an overview of `pandas` data types. Basically, being labeled an \"object\" means that the column is not being recognized as containing numbers." 747 | ] 748 | }, 749 | { 750 | "cell_type": "markdown", 751 | "metadata": {}, 752 | "source": [ 753 | "We need to find the `,` and `$` in `total_earnings` and remove them. The `str.replace()` method, which we used above when renaming the columns, lets us do this.\n", 754 | "\n", 755 | "Let's start by removing the commas and writing the result back to the original column. 
(The format for calling a column from a DataFrame in `pandas` is `DataFrame['column_name']`)" 756 | ] 757 | }, 758 | { 759 | "cell_type": "code", 760 | "execution_count": 20, 761 | "metadata": {}, 762 | "outputs": [], 763 | "source": [ 764 | "salary_selected['total_earnings'] = salary_selected['total_earnings'].str.replace(',', '')" 765 | ] 766 | }, 767 | { 768 | "cell_type": "markdown", 769 | "metadata": {}, 770 | "source": [ 771 | "Using `head()` to visually inspect `salary_selected`, we see that the commas are gone:" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 21, 777 | "metadata": {}, 778 | "outputs": [ 779 | { 780 | "name": "stdout", 781 | "output_type": "stream", 782 | "text": [ 783 | " name department_name total_earnings\n", 784 | "0 Abadi,Kidani A Assessing Department $46591.98\n", 785 | "1 Abasciano,Joseph Boston Police Department $97579.88\n", 786 | "2 Abban,Christopher John Boston Fire Department $124623.25\n", 787 | "3 Abbasi,Sophia Green Academy $18249.83\n", 788 | "4 Abbate-Vaughn,Jorgelina BPS Ellis Elementary $85660.28\n" 789 | ] 790 | } 791 | ], 792 | "source": [ 793 | "print(salary_selected.head()) # this works - the commas are gone" 794 | ] 795 | }, 796 | { 797 | "cell_type": "markdown", 798 | "metadata": {}, 799 | "source": [ 800 | "Let's do the same thing, with the dollar sign `$`:" 801 | ] 802 | }, 803 | { 804 | "cell_type": "code", 805 | "execution_count": 22, 806 | "metadata": {}, 807 | "outputs": [], 808 | "source": [ 809 | "salary_selected['total_earnings'] = salary_selected['total_earnings'].str.replace('$', '')" 810 | ] 811 | }, 812 | { 813 | "cell_type": "markdown", 814 | "metadata": {}, 815 | "source": [ 816 | "Using `head()` to visually inspect `salary_selected`, we see that the dollar signs are gone:" 817 | ] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "execution_count": 23, 822 | "metadata": {}, 823 | "outputs": [ 824 | { 825 | "data": { 826 | "text/html": [ 827 | "
\n", 828 | "\n", 841 | "\n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | "
namedepartment_nametotal_earnings
0Abadi,Kidani AAssessing Department46591.98
1Abasciano,JosephBoston Police Department97579.88
2Abban,Christopher JohnBoston Fire Department124623.25
3Abbasi,SophiaGreen Academy18249.83
4Abbate-Vaughn,JorgelinaBPS Ellis Elementary85660.28
\n", 883 | "
" 884 | ], 885 | "text/plain": [ 886 | " name department_name total_earnings\n", 887 | "0 Abadi,Kidani A Assessing Department 46591.98\n", 888 | "1 Abasciano,Joseph Boston Police Department 97579.88\n", 889 | "2 Abban,Christopher John Boston Fire Department 124623.25\n", 890 | "3 Abbasi,Sophia Green Academy 18249.83\n", 891 | "4 Abbate-Vaughn,Jorgelina BPS Ellis Elementary 85660.28" 892 | ] 893 | }, 894 | "execution_count": 23, 895 | "metadata": {}, 896 | "output_type": "execute_result" 897 | } 898 | ], 899 | "source": [ 900 | "salary_selected.head()" 901 | ] 902 | }, 903 | { 904 | "cell_type": "markdown", 905 | "metadata": {}, 906 | "source": [ 907 | "**Now can we use `arrange()` to sort the data by `total_earnings`?**" 908 | ] 909 | }, 910 | { 911 | "cell_type": "code", 912 | "execution_count": 24, 913 | "metadata": {}, 914 | "outputs": [ 915 | { 916 | "data": { 917 | "text/html": [ 918 | "
\n", 919 | "\n", 932 | "\n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | "
namedepartment_nametotal_earnings
3315Charles,YvelineBPS Transportation10.07
9914Jean Baptiste,HuguesBPS Transportation10.12
16419Piper,Sarah ABPS Transportation10.47
11131Laguerre,Yolaine MBPS Transportation10.94
17641Rosario Severino,YomayraFood & Nutrition Svc100.00
\n", 974 | "
" 975 | ], 976 | "text/plain": [ 977 | " name department_name total_earnings\n", 978 | "3315 Charles,Yveline BPS Transportation 10.07\n", 979 | "9914 Jean Baptiste,Hugues BPS Transportation 10.12\n", 980 | "16419 Piper,Sarah A BPS Transportation 10.47\n", 981 | "11131 Laguerre,Yolaine M BPS Transportation 10.94\n", 982 | "17641 Rosario Severino,Yomayra Food & Nutrition Svc 100.00" 983 | ] 984 | }, 985 | "execution_count": 24, 986 | "metadata": {}, 987 | "output_type": "execute_result" 988 | } 989 | ], 990 | "source": [ 991 | "salary_sort = salary_selected.sort_values('total_earnings')\n", 992 | "\n", 993 | "salary_sort.head()" 994 | ] 995 | }, 996 | { 997 | "cell_type": "code", 998 | "execution_count": 25, 999 | "metadata": {}, 1000 | "outputs": [ 1001 | { 1002 | "data": { 1003 | "text/html": [ 1004 | "
\n", 1005 | "\n", 1018 | "\n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | "
namedepartment_nametotal_earnings
18134Santos,Maria CCurley K-899970.30
5999Dyson,Margaret O.Parks Department99972.07
13012McCarthy,Margaret MBPS Substitute Teachers/Nurs9998.47
1083Bartholet,Carolyn VBPS Mckay Elementary99989.18
1960Bresnahan,John M.Boston Police Department99997.38
\n", 1060 | "
" 1061 | ], 1062 | "text/plain": [ 1063 | " name department_name total_earnings\n", 1064 | "18134 Santos,Maria C Curley K-8 99970.30\n", 1065 | "5999 Dyson,Margaret O. Parks Department 99972.07\n", 1066 | "13012 McCarthy,Margaret M BPS Substitute Teachers/Nurs 9998.47\n", 1067 | "1083 Bartholet,Carolyn V BPS Mckay Elementary 99989.18\n", 1068 | "1960 Bresnahan,John M. Boston Police Department 99997.38" 1069 | ] 1070 | }, 1071 | "execution_count": 25, 1072 | "metadata": {}, 1073 | "output_type": "execute_result" 1074 | } 1075 | ], 1076 | "source": [ 1077 | "salary_sort.tail()" 1078 | ] 1079 | }, 1080 | { 1081 | "cell_type": "markdown", 1082 | "metadata": {}, 1083 | "source": [ 1084 | "Again, at first glance, the employees appear to be sorted by `total_earnings` from lowest to highest. But that would imply that John M. Bresnahan was the highest-paid employee, making 99,997.38 dollars in 2016, while the *Boston Globe* [story](https://www.bostonglobe.com/metro/2017/02/14/police-detective-tops-boston-payroll-with-total-over/6PaXwTAHZGEW5djgwCJuTI/story.html) said the highest-paid city employee made more than 403,000 dollars." 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "markdown", 1089 | "metadata": {}, 1090 | "source": [ 1091 | "**What's the problem?**\n", 1092 | "\n", 1093 | "Again, we can use `dtypes` to check on how the `total_earnings` variable is encoded." 
1094 | ] 1095 | }, 1096 | { 1097 | "cell_type": "code", 1098 | "execution_count": 26, 1099 | "metadata": {}, 1100 | "outputs": [ 1101 | { 1102 | "data": { 1103 | "text/plain": [ 1104 | "name object\n", 1105 | "department_name object\n", 1106 | "total_earnings object\n", 1107 | "dtype: object" 1108 | ] 1109 | }, 1110 | "execution_count": 26, 1111 | "metadata": {}, 1112 | "output_type": "execute_result" 1113 | } 1114 | ], 1115 | "source": [ 1116 | "salary_sort.dtypes" 1117 | ] 1118 | }, 1119 | { 1120 | "cell_type": "markdown", 1121 | "metadata": {}, 1122 | "source": [ 1123 | "It's still an \"object\" now (still not numeric), because we didn't tell `pandas` that it should be numeric. We can do this with `pd.to_numeric()`:" 1124 | ] 1125 | }, 1126 | { 1127 | "cell_type": "code", 1128 | "execution_count": 27, 1129 | "metadata": {}, 1130 | "outputs": [], 1131 | "source": [ 1132 | "salary_sort['total_earnings'] = pd.to_numeric(salary_sort['total_earnings'])" 1133 | ] 1134 | }, 1135 | { 1136 | "cell_type": "markdown", 1137 | "metadata": {}, 1138 | "source": [ 1139 | "Now let's run `dtypes` again:" 1140 | ] 1141 | }, 1142 | { 1143 | "cell_type": "code", 1144 | "execution_count": 28, 1145 | "metadata": {}, 1146 | "outputs": [ 1147 | { 1148 | "data": { 1149 | "text/plain": [ 1150 | "name object\n", 1151 | "department_name object\n", 1152 | "total_earnings float64\n", 1153 | "dtype: object" 1154 | ] 1155 | }, 1156 | "execution_count": 28, 1157 | "metadata": {}, 1158 | "output_type": "execute_result" 1159 | } 1160 | ], 1161 | "source": [ 1162 | "salary_sort.dtypes" 1163 | ] 1164 | }, 1165 | { 1166 | "cell_type": "markdown", 1167 | "metadata": {}, 1168 | "source": [ 1169 | "\"float64\" means [\"floating point numbers\"](http://pbpython.com/pandas_dtypes.html) — this is what we want." 1170 | ] 1171 | }, 1172 | { 1173 | "cell_type": "markdown", 1174 | "metadata": {}, 1175 | "source": [ 1176 | "Now let's sort using `sort_values()`. 
" 1177 | ] 1178 | }, 1179 | { 1180 | "cell_type": "code", 1181 | "execution_count": 29, 1182 | "metadata": {}, 1183 | "outputs": [ 1184 | { 1185 | "data": { 1186 | "text/html": [ 1187 | "
\n", 1188 | "\n", 1201 | "\n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | "
namedepartment_nametotal_earnings
9849Jameau,BernadetteBPS Transportation2.14
1986Bridgewaters,Sandra JBPS Transportation2.50
13853Milian,Sonia MariaBPS Transportation3.85
2346Burke II,Myrell NadineBPS Transportation4.38
7717Gillard Jr.,Trina FFood & Nutrition Svc5.00
\n", 1243 | "
" 1244 | ], 1245 | "text/plain": [ 1246 | " name department_name total_earnings\n", 1247 | "9849 Jameau,Bernadette BPS Transportation 2.14\n", 1248 | "1986 Bridgewaters,Sandra J BPS Transportation 2.50\n", 1249 | "13853 Milian,Sonia Maria BPS Transportation 3.85\n", 1250 | "2346 Burke II,Myrell Nadine BPS Transportation 4.38\n", 1251 | "7717 Gillard Jr.,Trina F Food & Nutrition Svc 5.00" 1252 | ] 1253 | }, 1254 | "execution_count": 29, 1255 | "metadata": {}, 1256 | "output_type": "execute_result" 1257 | } 1258 | ], 1259 | "source": [ 1260 | "salary_sort = salary_sort.sort_values('total_earnings')\n", 1261 | "\n", 1262 | "salary_sort.head() # ascending order by default" 1263 | ] 1264 | }, 1265 | { 1266 | "cell_type": "markdown", 1267 | "metadata": {}, 1268 | "source": [ 1269 | "One last thing: we have to specify `ascending = False` within `sort_values()` because the function by default sorts the data in ascending order." 1270 | ] 1271 | }, 1272 | { 1273 | "cell_type": "code", 1274 | "execution_count": 30, 1275 | "metadata": {}, 1276 | "outputs": [ 1277 | { 1278 | "data": { 1279 | "text/html": [ 1280 | "
\n", 1281 | "\n", 1294 | "\n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | "
namedepartment_nametotal_earnings
11489Lee,WaimanBoston Police Department403408.61
10327Josey,Windell C.Boston Police Department396348.50
15716Painten,Paul ABoston Police Department373959.35
2113Brown,GregoryBoston Police Department351825.50
9446Hosein,HaseebBoston Police Department346105.17
\n", 1336 | "
" 1337 | ], 1338 | "text/plain": [ 1339 | " name department_name total_earnings\n", 1340 | "11489 Lee,Waiman Boston Police Department 403408.61\n", 1341 | "10327 Josey,Windell C. Boston Police Department 396348.50\n", 1342 | "15716 Painten,Paul A Boston Police Department 373959.35\n", 1343 | "2113 Brown,Gregory Boston Police Department 351825.50\n", 1344 | "9446 Hosein,Haseeb Boston Police Department 346105.17" 1345 | ] 1346 | }, 1347 | "execution_count": 30, 1348 | "metadata": {}, 1349 | "output_type": "execute_result" 1350 | } 1351 | ], 1352 | "source": [ 1353 | "salary_sort = salary_sort.sort_values('total_earnings', ascending = False)\n", 1354 | "\n", 1355 | "salary_sort.head() # descending order" 1356 | ] 1357 | }, 1358 | { 1359 | "cell_type": "markdown", 1360 | "metadata": {}, 1361 | "source": [ 1362 | "We see that Waiman Lee from the Boston PD is the top earner with >403,408 per year, just as the *Boston Globe* [article](https://www.bostonglobe.com/metro/2017/02/14/police-detective-tops-boston-payroll-with-total-over/6PaXwTAHZGEW5djgwCJuTI/story.html) states." 1363 | ] 1364 | }, 1365 | { 1366 | "cell_type": "markdown", 1367 | "metadata": {}, 1368 | "source": [ 1369 | "A bonus thing: maybe it bothers you that the numbers next to each row are no longer in any numeric order. This is because these numbers are the row index of the DataFrame — basically the order that they were in prior to being sorted. In order to reset these numbers, we can use the [`reset_index()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html) function on the `salary_sort` DataFrame. We include `drop = True` as a parameter of the function to prevent the old index from being added as a column in the DataFrame." 1370 | ] 1371 | }, 1372 | { 1373 | "cell_type": "code", 1374 | "execution_count": 31, 1375 | "metadata": {}, 1376 | "outputs": [ 1377 | { 1378 | "data": { 1379 | "text/html": [ 1380 | "
\n", 1381 | "\n", 1394 | "\n", 1395 | " \n", 1396 | " \n", 1397 | " \n", 1398 | " \n", 1399 | " \n", 1400 | " \n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | " \n", 1422 | " \n", 1423 | " \n", 1424 | " \n", 1425 | " \n", 1426 | " \n", 1427 | " \n", 1428 | " \n", 1429 | " \n", 1430 | " \n", 1431 | " \n", 1432 | " \n", 1433 | " \n", 1434 | " \n", 1435 | "
namedepartment_nametotal_earnings
0Lee,WaimanBoston Police Department403408.61
1Josey,Windell C.Boston Police Department396348.50
2Painten,Paul ABoston Police Department373959.35
3Brown,GregoryBoston Police Department351825.50
4Hosein,HaseebBoston Police Department346105.17
\n", 1436 | "
" 1437 | ], 1438 | "text/plain": [ 1439 | " name department_name total_earnings\n", 1440 | "0 Lee,Waiman Boston Police Department 403408.61\n", 1441 | "1 Josey,Windell C. Boston Police Department 396348.50\n", 1442 | "2 Painten,Paul A Boston Police Department 373959.35\n", 1443 | "3 Brown,Gregory Boston Police Department 351825.50\n", 1444 | "4 Hosein,Haseeb Boston Police Department 346105.17" 1445 | ] 1446 | }, 1447 | "execution_count": 31, 1448 | "metadata": {}, 1449 | "output_type": "execute_result" 1450 | } 1451 | ], 1452 | "source": [ 1453 | "salary_sort = salary_sort.reset_index(drop = True)\n", 1454 | "\n", 1455 | "salary_sort.head() # index is reset" 1456 | ] 1457 | }, 1458 | { 1459 | "cell_type": "markdown", 1460 | "metadata": {}, 1461 | "source": [ 1462 | "The Boston Police Department has a lot of high earners. We can figure out the average earnings by department, which we'll call `salary_average`, by using the `groupby` and `mean()` functions in `pandas`." 1463 | ] 1464 | }, 1465 | { 1466 | "cell_type": "code", 1467 | "execution_count": 32, 1468 | "metadata": {}, 1469 | "outputs": [], 1470 | "source": [ 1471 | "salary_average = salary_sort.groupby('department_name').mean()" 1472 | ] 1473 | }, 1474 | { 1475 | "cell_type": "code", 1476 | "execution_count": 33, 1477 | "metadata": { 1478 | "scrolled": false 1479 | }, 1480 | "outputs": [ 1481 | { 1482 | "data": { 1483 | "text/html": [ 1484 | "
\n", 1485 | "\n", 1498 | "\n", 1499 | " \n", 1500 | " \n", 1501 | " \n", 1502 | " \n", 1503 | " \n", 1504 | " \n", 1505 | " \n", 1506 | " \n", 1507 | " \n", 1508 | " \n", 1509 | " \n", 1510 | " \n", 1511 | " \n", 1512 | " \n", 1513 | " \n", 1514 | " \n", 1515 | " \n", 1516 | " \n", 1517 | " \n", 1518 | " \n", 1519 | " \n", 1520 | " \n", 1521 | " \n", 1522 | " \n", 1523 | " \n", 1524 | " \n", 1525 | " \n", 1526 | " \n", 1527 | " \n", 1528 | " \n", 1529 | " \n", 1530 | " \n", 1531 | " \n", 1532 | " \n", 1533 | " \n", 1534 | " \n", 1535 | " \n", 1536 | " \n", 1537 | " \n", 1538 | " \n", 1539 | " \n", 1540 | " \n", 1541 | " \n", 1542 | " \n", 1543 | " \n", 1544 | " \n", 1545 | " \n", 1546 | " \n", 1547 | " \n", 1548 | " \n", 1549 | " \n", 1550 | " \n", 1551 | " \n", 1552 | " \n", 1553 | " \n", 1554 | " \n", 1555 | " \n", 1556 | " \n", 1557 | " \n", 1558 | " \n", 1559 | " \n", 1560 | " \n", 1561 | " \n", 1562 | " \n", 1563 | " \n", 1564 | " \n", 1565 | " \n", 1566 | " \n", 1567 | " \n", 1568 | " \n", 1569 | " \n", 1570 | " \n", 1571 | " \n", 1572 | " \n", 1573 | " \n", 1574 | " \n", 1575 | " \n", 1576 | " \n", 1577 | " \n", 1578 | " \n", 1579 | " \n", 1580 | " \n", 1581 | " \n", 1582 | " \n", 1583 | " \n", 1584 | " \n", 1585 | " \n", 1586 | " \n", 1587 | " \n", 1588 | " \n", 1589 | " \n", 1590 | " \n", 1591 | " \n", 1592 | " \n", 1593 | " \n", 1594 | " \n", 1595 | "
total_earnings
department_name
ASD Human Resources67236.150755
ASD Intergvernmtl Relations83787.581000
ASD Office Of Labor Relation58899.954615
ASD Office of Budget Mangmnt73946.044643
ASD Purchasing Division72893.203750
Accountability102073.280667
Achievement Gap60105.522500
Alighieri Montessori School55160.025556
Assessing Department70713.327111
Asst Superintendent-Network A132514.885000
......
Unified Student Svc65018.485000
Veterans' Services48411.606250
WREC: Urban Science Academy81170.398214
Warren/Prescott K-866389.351341
West Roxbury Academy70373.066494
West Zone ELC55868.384118
Women's Advancement63811.150000
Workers Compensation Service23797.119133
Young Achievers K-856534.020463
Youth Engagement & Employment33645.202308
\n", 1596 | "

228 rows × 1 columns

\n", 1597 | "
" 1598 | ], 1599 | "text/plain": [ 1600 | " total_earnings\n", 1601 | "department_name \n", 1602 | "ASD Human Resources 67236.150755\n", 1603 | "ASD Intergvernmtl Relations 83787.581000\n", 1604 | "ASD Office Of Labor Relation 58899.954615\n", 1605 | "ASD Office of Budget Mangmnt 73946.044643\n", 1606 | "ASD Purchasing Division 72893.203750\n", 1607 | "Accountability 102073.280667\n", 1608 | "Achievement Gap 60105.522500\n", 1609 | "Alighieri Montessori School 55160.025556\n", 1610 | "Assessing Department 70713.327111\n", 1611 | "Asst Superintendent-Network A 132514.885000\n", 1612 | "... ...\n", 1613 | "Unified Student Svc 65018.485000\n", 1614 | "Veterans' Services 48411.606250\n", 1615 | "WREC: Urban Science Academy 81170.398214\n", 1616 | "Warren/Prescott K-8 66389.351341\n", 1617 | "West Roxbury Academy 70373.066494\n", 1618 | "West Zone ELC 55868.384118\n", 1619 | "Women's Advancement 63811.150000\n", 1620 | "Workers Compensation Service 23797.119133\n", 1621 | "Young Achievers K-8 56534.020463\n", 1622 | "Youth Engagement & Employment 33645.202308\n", 1623 | "\n", 1624 | "[228 rows x 1 columns]" 1625 | ] 1626 | }, 1627 | "execution_count": 33, 1628 | "metadata": {}, 1629 | "output_type": "execute_result" 1630 | } 1631 | ], 1632 | "source": [ 1633 | "salary_average = salary_average\n", 1634 | "\n", 1635 | "salary_average" 1636 | ] 1637 | }, 1638 | { 1639 | "cell_type": "markdown", 1640 | "metadata": {}, 1641 | "source": [ 1642 | "Notice that `pandas` by default sets the `department_name` column as the row index of the `salary_average` DataFrame. I personally don't love this and would rather have a straight-up DataFrame with the row numbers as the index, so I usually run `reset_index()` to get rid of this indexing: " 1643 | ] 1644 | }, 1645 | { 1646 | "cell_type": "code", 1647 | "execution_count": 34, 1648 | "metadata": {}, 1649 | "outputs": [ 1650 | { 1651 | "data": { 1652 | "text/html": [ 1653 | "
\n", 1654 | "\n", 1667 | "\n", 1668 | " \n", 1669 | " \n", 1670 | " \n", 1671 | " \n", 1672 | " \n", 1673 | " \n", 1674 | " \n", 1675 | " \n", 1676 | " \n", 1677 | " \n", 1678 | " \n", 1679 | " \n", 1680 | " \n", 1681 | " \n", 1682 | " \n", 1683 | " \n", 1684 | " \n", 1685 | " \n", 1686 | " \n", 1687 | " \n", 1688 | " \n", 1689 | " \n", 1690 | " \n", 1691 | " \n", 1692 | " \n", 1693 | " \n", 1694 | " \n", 1695 | " \n", 1696 | " \n", 1697 | " \n", 1698 | " \n", 1699 | " \n", 1700 | " \n", 1701 | " \n", 1702 | " \n", 1703 | " \n", 1704 | " \n", 1705 | " \n", 1706 | " \n", 1707 | " \n", 1708 | " \n", 1709 | " \n", 1710 | " \n", 1711 | " \n", 1712 | " \n", 1713 | " \n", 1714 | " \n", 1715 | " \n", 1716 | " \n", 1717 | " \n", 1718 | " \n", 1719 | " \n", 1720 | " \n", 1721 | " \n", 1722 | " \n", 1723 | " \n", 1724 | " \n", 1725 | " \n", 1726 | " \n", 1727 | " \n", 1728 | " \n", 1729 | " \n", 1730 | " \n", 1731 | " \n", 1732 | " \n", 1733 | " \n", 1734 | " \n", 1735 | " \n", 1736 | " \n", 1737 | " \n", 1738 | " \n", 1739 | " \n", 1740 | " \n", 1741 | " \n", 1742 | " \n", 1743 | " \n", 1744 | " \n", 1745 | " \n", 1746 | " \n", 1747 | " \n", 1748 | " \n", 1749 | " \n", 1750 | " \n", 1751 | " \n", 1752 | " \n", 1753 | " \n", 1754 | " \n", 1755 | " \n", 1756 | " \n", 1757 | " \n", 1758 | " \n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | " \n", 1763 | " \n", 1764 | " \n", 1765 | " \n", 1766 | " \n", 1767 | " \n", 1768 | " \n", 1769 | " \n", 1770 | " \n", 1771 | " \n", 1772 | " \n", 1773 | " \n", 1774 | " \n", 1775 | " \n", 1776 | " \n", 1777 | " \n", 1778 | " \n", 1779 | " \n", 1780 | " \n", 1781 | " \n", 1782 | "
department_nametotal_earnings
0ASD Human Resources67236.150755
1ASD Intergvernmtl Relations83787.581000
2ASD Office Of Labor Relation58899.954615
3ASD Office of Budget Mangmnt73946.044643
4ASD Purchasing Division72893.203750
5Accountability102073.280667
6Achievement Gap60105.522500
7Alighieri Montessori School55160.025556
8Assessing Department70713.327111
9Asst Superintendent-Network A132514.885000
.........
218Unified Student Svc65018.485000
219Veterans' Services48411.606250
220WREC: Urban Science Academy81170.398214
221Warren/Prescott K-866389.351341
222West Roxbury Academy70373.066494
223West Zone ELC55868.384118
224Women's Advancement63811.150000
225Workers Compensation Service23797.119133
226Young Achievers K-856534.020463
227Youth Engagement & Employment33645.202308
\n", 1783 | "

228 rows × 2 columns

\n", 1784 | "
" 1785 | ], 1786 | "text/plain": [ 1787 | " department_name total_earnings\n", 1788 | "0 ASD Human Resources 67236.150755\n", 1789 | "1 ASD Intergvernmtl Relations 83787.581000\n", 1790 | "2 ASD Office Of Labor Relation 58899.954615\n", 1791 | "3 ASD Office of Budget Mangmnt 73946.044643\n", 1792 | "4 ASD Purchasing Division 72893.203750\n", 1793 | "5 Accountability 102073.280667\n", 1794 | "6 Achievement Gap 60105.522500\n", 1795 | "7 Alighieri Montessori School 55160.025556\n", 1796 | "8 Assessing Department 70713.327111\n", 1797 | "9 Asst Superintendent-Network A 132514.885000\n", 1798 | ".. ... ...\n", 1799 | "218 Unified Student Svc 65018.485000\n", 1800 | "219 Veterans' Services 48411.606250\n", 1801 | "220 WREC: Urban Science Academy 81170.398214\n", 1802 | "221 Warren/Prescott K-8 66389.351341\n", 1803 | "222 West Roxbury Academy 70373.066494\n", 1804 | "223 West Zone ELC 55868.384118\n", 1805 | "224 Women's Advancement 63811.150000\n", 1806 | "225 Workers Compensation Service 23797.119133\n", 1807 | "226 Young Achievers K-8 56534.020463\n", 1808 | "227 Youth Engagement & Employment 33645.202308\n", 1809 | "\n", 1810 | "[228 rows x 2 columns]" 1811 | ] 1812 | }, 1813 | "execution_count": 34, 1814 | "metadata": {}, 1815 | "output_type": "execute_result" 1816 | } 1817 | ], 1818 | "source": [ 1819 | "salary_average = salary_average.reset_index() # reset_index\n", 1820 | "\n", 1821 | "salary_average" 1822 | ] 1823 | }, 1824 | { 1825 | "cell_type": "markdown", 1826 | "metadata": {}, 1827 | "source": [ 1828 | "We should also rename the `total_earnings` column to `average_earnings` to avoid confusion. We can do this using [`rename()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html). The syntax for `rename()` is `DataFrame.rename(columns = {'current column name':'new column name'})`." 
1829 | ] 1830 | }, 1831 | { 1832 | "cell_type": "code", 1833 | "execution_count": 35, 1834 | "metadata": {}, 1835 | "outputs": [], 1836 | "source": [ 1837 | "salary_average = salary_average.rename(columns = {'total_earnings': 'dept_average'}) " 1838 | ] 1839 | }, 1840 | { 1841 | "cell_type": "code", 1842 | "execution_count": 36, 1843 | "metadata": {}, 1844 | "outputs": [ 1845 | { 1846 | "data": { 1847 | "text/html": [ 1848 | "
\n", 1849 | "\n", 1862 | "\n", 1863 | " \n", 1864 | " \n", 1865 | " \n", 1866 | " \n", 1867 | " \n", 1868 | " \n", 1869 | " \n", 1870 | " \n", 1871 | " \n", 1872 | " \n", 1873 | " \n", 1874 | " \n", 1875 | " \n", 1876 | " \n", 1877 | " \n", 1878 | " \n", 1879 | " \n", 1880 | " \n", 1881 | " \n", 1882 | " \n", 1883 | " \n", 1884 | " \n", 1885 | " \n", 1886 | " \n", 1887 | " \n", 1888 | " \n", 1889 | " \n", 1890 | " \n", 1891 | " \n", 1892 | " \n", 1893 | " \n", 1894 | " \n", 1895 | " \n", 1896 | " \n", 1897 | " \n", 1898 | " \n", 1899 | " \n", 1900 | " \n", 1901 | " \n", 1902 | " \n", 1903 | " \n", 1904 | " \n", 1905 | " \n", 1906 | " \n", 1907 | " \n", 1908 | " \n", 1909 | " \n", 1910 | " \n", 1911 | " \n", 1912 | " \n", 1913 | " \n", 1914 | " \n", 1915 | " \n", 1916 | " \n", 1917 | " \n", 1918 | " \n", 1919 | " \n", 1920 | " \n", 1921 | " \n", 1922 | " \n", 1923 | " \n", 1924 | " \n", 1925 | " \n", 1926 | " \n", 1927 | " \n", 1928 | " \n", 1929 | " \n", 1930 | " \n", 1931 | " \n", 1932 | " \n", 1933 | " \n", 1934 | " \n", 1935 | " \n", 1936 | " \n", 1937 | " \n", 1938 | " \n", 1939 | " \n", 1940 | " \n", 1941 | " \n", 1942 | " \n", 1943 | " \n", 1944 | " \n", 1945 | " \n", 1946 | " \n", 1947 | " \n", 1948 | " \n", 1949 | " \n", 1950 | " \n", 1951 | " \n", 1952 | " \n", 1953 | " \n", 1954 | " \n", 1955 | " \n", 1956 | " \n", 1957 | " \n", 1958 | " \n", 1959 | " \n", 1960 | " \n", 1961 | " \n", 1962 | " \n", 1963 | " \n", 1964 | " \n", 1965 | " \n", 1966 | " \n", 1967 | " \n", 1968 | " \n", 1969 | " \n", 1970 | " \n", 1971 | " \n", 1972 | " \n", 1973 | " \n", 1974 | " \n", 1975 | " \n", 1976 | " \n", 1977 | "
department_namedept_average
0ASD Human Resources67236.150755
1ASD Intergvernmtl Relations83787.581000
2ASD Office Of Labor Relation58899.954615
3ASD Office of Budget Mangmnt73946.044643
4ASD Purchasing Division72893.203750
5Accountability102073.280667
6Achievement Gap60105.522500
7Alighieri Montessori School55160.025556
8Assessing Department70713.327111
9Asst Superintendent-Network A132514.885000
.........
218Unified Student Svc65018.485000
219Veterans' Services48411.606250
220WREC: Urban Science Academy81170.398214
221Warren/Prescott K-866389.351341
222West Roxbury Academy70373.066494
223West Zone ELC55868.384118
224Women's Advancement63811.150000
225Workers Compensation Service23797.119133
226Young Achievers K-856534.020463
227Youth Engagement & Employment33645.202308
\n", 1978 | "

228 rows × 2 columns

\n", 1979 | "
" 1980 | ], 1981 | "text/plain": [ 1982 | " department_name dept_average\n", 1983 | "0 ASD Human Resources 67236.150755\n", 1984 | "1 ASD Intergvernmtl Relations 83787.581000\n", 1985 | "2 ASD Office Of Labor Relation 58899.954615\n", 1986 | "3 ASD Office of Budget Mangmnt 73946.044643\n", 1987 | "4 ASD Purchasing Division 72893.203750\n", 1988 | "5 Accountability 102073.280667\n", 1989 | "6 Achievement Gap 60105.522500\n", 1990 | "7 Alighieri Montessori School 55160.025556\n", 1991 | "8 Assessing Department 70713.327111\n", 1992 | "9 Asst Superintendent-Network A 132514.885000\n", 1993 | ".. ... ...\n", 1994 | "218 Unified Student Svc 65018.485000\n", 1995 | "219 Veterans' Services 48411.606250\n", 1996 | "220 WREC: Urban Science Academy 81170.398214\n", 1997 | "221 Warren/Prescott K-8 66389.351341\n", 1998 | "222 West Roxbury Academy 70373.066494\n", 1999 | "223 West Zone ELC 55868.384118\n", 2000 | "224 Women's Advancement 63811.150000\n", 2001 | "225 Workers Compensation Service 23797.119133\n", 2002 | "226 Young Achievers K-8 56534.020463\n", 2003 | "227 Youth Engagement & Employment 33645.202308\n", 2004 | "\n", 2005 | "[228 rows x 2 columns]" 2006 | ] 2007 | }, 2008 | "execution_count": 36, 2009 | "metadata": {}, 2010 | "output_type": "execute_result" 2011 | } 2012 | ], 2013 | "source": [ 2014 | "salary_average" 2015 | ] 2016 | }, 2017 | { 2018 | "cell_type": "markdown", 2019 | "metadata": {}, 2020 | "source": [ 2021 | "We can find the Boston Police Department. Find out more about selecting based on attributes [here](https://chrisalbon.com/python/data_wrangling/pandas_selecting_rows_on_conditions/)." 2022 | ] 2023 | }, 2024 | { 2025 | "cell_type": "code", 2026 | "execution_count": 37, 2027 | "metadata": {}, 2028 | "outputs": [ 2029 | { 2030 | "data": { 2031 | "text/html": [ 2032 | "
\n", 2033 | "\n", 2046 | "\n", 2047 | " \n", 2048 | " \n", 2049 | " \n", 2050 | " \n", 2051 | " \n", 2052 | " \n", 2053 | " \n", 2054 | " \n", 2055 | " \n", 2056 | " \n", 2057 | " \n", 2058 | " \n", 2059 | " \n", 2060 | " \n", 2061 | "
department_namedept_average
121Boston Police Department124787.164775
\n", 2062 | "
" 2063 | ], 2064 | "text/plain": [ 2065 | " department_name dept_average\n", 2066 | "121 Boston Police Department 124787.164775" 2067 | ] 2068 | }, 2069 | "execution_count": 37, 2070 | "metadata": {}, 2071 | "output_type": "execute_result" 2072 | } 2073 | ], 2074 | "source": [ 2075 | "salary_average[salary_average['department_name'] == 'Boston Police Department']" 2076 | ] 2077 | }, 2078 | { 2079 | "cell_type": "markdown", 2080 | "metadata": {}, 2081 | "source": [ 2082 | "Now is a good time to revisit \"chaining.\" Notice how we did three things in creating `salary_average`:\n", 2083 | "1. Grouped the `salary_sort` DataFrame by `department_name` and calculated the mean of the numeric columns (in our case, `total_earnings` using `group_by()` and `mean()`.\n", 2084 | "2. Used `reset_index()` on the resulting DataFrame so that `department_name` would no longer be the row index.\n", 2085 | "3. Renamed the `total_earnings` column to `dept_average` to avoid confusion using `rename()`.\n", 2086 | "\n", 2087 | "In fact, we can do these three things all at once, by chaining the functions together:" 2088 | ] 2089 | }, 2090 | { 2091 | "cell_type": "code", 2092 | "execution_count": 38, 2093 | "metadata": {}, 2094 | "outputs": [ 2095 | { 2096 | "data": { 2097 | "text/html": [ 2098 | "
\n", 2099 | "\n", 2112 | "\n", 2113 | " \n", 2114 | " \n", 2115 | " \n", 2116 | " \n", 2117 | " \n", 2118 | " \n", 2119 | " \n", 2120 | " \n", 2121 | " \n", 2122 | " \n", 2123 | " \n", 2124 | " \n", 2125 | " \n", 2126 | " \n", 2127 | " \n", 2128 | " \n", 2129 | " \n", 2130 | " \n", 2131 | " \n", 2132 | " \n", 2133 | " \n", 2134 | " \n", 2135 | " \n", 2136 | " \n", 2137 | " \n", 2138 | " \n", 2139 | " \n", 2140 | " \n", 2141 | " \n", 2142 | " \n", 2143 | " \n", 2144 | " \n", 2145 | " \n", 2146 | " \n", 2147 | " \n", 2148 | " \n", 2149 | " \n", 2150 | " \n", 2151 | " \n", 2152 | " \n", 2153 | " \n", 2154 | " \n", 2155 | " \n", 2156 | " \n", 2157 | " \n", 2158 | " \n", 2159 | " \n", 2160 | " \n", 2161 | " \n", 2162 | " \n", 2163 | " \n", 2164 | " \n", 2165 | " \n", 2166 | " \n", 2167 | " \n", 2168 | " \n", 2169 | " \n", 2170 | " \n", 2171 | " \n", 2172 | " \n", 2173 | " \n", 2174 | " \n", 2175 | " \n", 2176 | " \n", 2177 | " \n", 2178 | " \n", 2179 | " \n", 2180 | " \n", 2181 | " \n", 2182 | " \n", 2183 | " \n", 2184 | " \n", 2185 | " \n", 2186 | " \n", 2187 | " \n", 2188 | " \n", 2189 | " \n", 2190 | " \n", 2191 | " \n", 2192 | " \n", 2193 | " \n", 2194 | " \n", 2195 | " \n", 2196 | " \n", 2197 | " \n", 2198 | " \n", 2199 | " \n", 2200 | " \n", 2201 | " \n", 2202 | " \n", 2203 | " \n", 2204 | " \n", 2205 | " \n", 2206 | " \n", 2207 | " \n", 2208 | " \n", 2209 | " \n", 2210 | " \n", 2211 | " \n", 2212 | " \n", 2213 | " \n", 2214 | " \n", 2215 | " \n", 2216 | " \n", 2217 | " \n", 2218 | " \n", 2219 | " \n", 2220 | " \n", 2221 | " \n", 2222 | " \n", 2223 | " \n", 2224 | " \n", 2225 | " \n", 2226 | " \n", 2227 | "
department_namedept_average
0ASD Human Resources67236.150755
1ASD Intergvernmtl Relations83787.581000
2ASD Office Of Labor Relation58899.954615
3ASD Office of Budget Mangmnt73946.044643
4ASD Purchasing Division72893.203750
5Accountability102073.280667
6Achievement Gap60105.522500
7Alighieri Montessori School55160.025556
8Assessing Department70713.327111
9Asst Superintendent-Network A132514.885000
.........
218Unified Student Svc65018.485000
219Veterans' Services48411.606250
220WREC: Urban Science Academy81170.398214
221Warren/Prescott K-866389.351341
222West Roxbury Academy70373.066494
223West Zone ELC55868.384118
224Women's Advancement63811.150000
225Workers Compensation Service23797.119133
226Young Achievers K-856534.020463
227Youth Engagement & Employment33645.202308
\n", 2228 | "

228 rows × 2 columns

\n", 2229 | "
" 2230 | ], 2231 | "text/plain": [ 2232 | " department_name dept_average\n", 2233 | "0 ASD Human Resources 67236.150755\n", 2234 | "1 ASD Intergvernmtl Relations 83787.581000\n", 2235 | "2 ASD Office Of Labor Relation 58899.954615\n", 2236 | "3 ASD Office of Budget Mangmnt 73946.044643\n", 2237 | "4 ASD Purchasing Division 72893.203750\n", 2238 | "5 Accountability 102073.280667\n", 2239 | "6 Achievement Gap 60105.522500\n", 2240 | "7 Alighieri Montessori School 55160.025556\n", 2241 | "8 Assessing Department 70713.327111\n", 2242 | "9 Asst Superintendent-Network A 132514.885000\n", 2243 | ".. ... ...\n", 2244 | "218 Unified Student Svc 65018.485000\n", 2245 | "219 Veterans' Services 48411.606250\n", 2246 | "220 WREC: Urban Science Academy 81170.398214\n", 2247 | "221 Warren/Prescott K-8 66389.351341\n", 2248 | "222 West Roxbury Academy 70373.066494\n", 2249 | "223 West Zone ELC 55868.384118\n", 2250 | "224 Women's Advancement 63811.150000\n", 2251 | "225 Workers Compensation Service 23797.119133\n", 2252 | "226 Young Achievers K-8 56534.020463\n", 2253 | "227 Youth Engagement & Employment 33645.202308\n", 2254 | "\n", 2255 | "[228 rows x 2 columns]" 2256 | ] 2257 | }, 2258 | "execution_count": 38, 2259 | "metadata": {}, 2260 | "output_type": "execute_result" 2261 | } 2262 | ], 2263 | "source": [ 2264 | "salary_sort.groupby('department_name').mean().reset_index().rename(columns = {'total_earnings':'dept_average'})" 2265 | ] 2266 | }, 2267 | { 2268 | "cell_type": "markdown", 2269 | "metadata": {}, 2270 | "source": [ 2271 | "That's a pretty long line of code. To make it more readable, we can split it up into separate lines. I like to do this by putting the whole expression in parentheses and splitting it up right before each of the functions, which are delineated by the periods:" 2272 | ] 2273 | }, 2274 | { 2275 | "cell_type": "code", 2276 | "execution_count": 39, 2277 | "metadata": {}, 2278 | "outputs": [ 2279 | { 2280 | "data": { 2281 | "text/html": [ 2282 | "
\n", 2283 | "\n", 2296 | "\n", 2297 | " \n", 2298 | " \n", 2299 | " \n", 2300 | " \n", 2301 | " \n", 2302 | " \n", 2303 | " \n", 2304 | " \n", 2305 | " \n", 2306 | " \n", 2307 | " \n", 2308 | " \n", 2309 | " \n", 2310 | " \n", 2311 | " \n", 2312 | " \n", 2313 | " \n", 2314 | " \n", 2315 | " \n", 2316 | " \n", 2317 | " \n", 2318 | " \n", 2319 | " \n", 2320 | " \n", 2321 | " \n", 2322 | " \n", 2323 | " \n", 2324 | " \n", 2325 | " \n", 2326 | " \n", 2327 | " \n", 2328 | " \n", 2329 | " \n", 2330 | " \n", 2331 | " \n", 2332 | " \n", 2333 | " \n", 2334 | " \n", 2335 | " \n", 2336 | " \n", 2337 | " \n", 2338 | " \n", 2339 | " \n", 2340 | " \n", 2341 | " \n", 2342 | " \n", 2343 | " \n", 2344 | " \n", 2345 | " \n", 2346 | " \n", 2347 | " \n", 2348 | " \n", 2349 | " \n", 2350 | " \n", 2351 | " \n", 2352 | " \n", 2353 | " \n", 2354 | " \n", 2355 | " \n", 2356 | " \n", 2357 | " \n", 2358 | " \n", 2359 | " \n", 2360 | " \n", 2361 | " \n", 2362 | " \n", 2363 | " \n", 2364 | " \n", 2365 | " \n", 2366 | " \n", 2367 | " \n", 2368 | " \n", 2369 | " \n", 2370 | " \n", 2371 | " \n", 2372 | " \n", 2373 | " \n", 2374 | " \n", 2375 | " \n", 2376 | " \n", 2377 | " \n", 2378 | " \n", 2379 | " \n", 2380 | " \n", 2381 | " \n", 2382 | " \n", 2383 | " \n", 2384 | " \n", 2385 | " \n", 2386 | " \n", 2387 | " \n", 2388 | " \n", 2389 | " \n", 2390 | " \n", 2391 | " \n", 2392 | " \n", 2393 | " \n", 2394 | " \n", 2395 | " \n", 2396 | " \n", 2397 | " \n", 2398 | " \n", 2399 | " \n", 2400 | " \n", 2401 | " \n", 2402 | " \n", 2403 | " \n", 2404 | " \n", 2405 | " \n", 2406 | " \n", 2407 | " \n", 2408 | " \n", 2409 | " \n", 2410 | " \n", 2411 | "
department_namedept_average
0ASD Human Resources67236.150755
1ASD Intergvernmtl Relations83787.581000
2ASD Office Of Labor Relation58899.954615
3ASD Office of Budget Mangmnt73946.044643
4ASD Purchasing Division72893.203750
5Accountability102073.280667
6Achievement Gap60105.522500
7Alighieri Montessori School55160.025556
8Assessing Department70713.327111
9Asst Superintendent-Network A132514.885000
.........
218Unified Student Svc65018.485000
219Veterans' Services48411.606250
220WREC: Urban Science Academy81170.398214
221Warren/Prescott K-866389.351341
222West Roxbury Academy70373.066494
223West Zone ELC55868.384118
224Women's Advancement63811.150000
225Workers Compensation Service23797.119133
226Young Achievers K-856534.020463
227Youth Engagement & Employment33645.202308
\n", 2412 | "

228 rows × 2 columns

\n", 2413 | "
" 2414 | ], 2415 | "text/plain": [ 2416 | " department_name dept_average\n", 2417 | "0 ASD Human Resources 67236.150755\n", 2418 | "1 ASD Intergvernmtl Relations 83787.581000\n", 2419 | "2 ASD Office Of Labor Relation 58899.954615\n", 2420 | "3 ASD Office of Budget Mangmnt 73946.044643\n", 2421 | "4 ASD Purchasing Division 72893.203750\n", 2422 | "5 Accountability 102073.280667\n", 2423 | "6 Achievement Gap 60105.522500\n", 2424 | "7 Alighieri Montessori School 55160.025556\n", 2425 | "8 Assessing Department 70713.327111\n", 2426 | "9 Asst Superintendent-Network A 132514.885000\n", 2427 | ".. ... ...\n", 2428 | "218 Unified Student Svc 65018.485000\n", 2429 | "219 Veterans' Services 48411.606250\n", 2430 | "220 WREC: Urban Science Academy 81170.398214\n", 2431 | "221 Warren/Prescott K-8 66389.351341\n", 2432 | "222 West Roxbury Academy 70373.066494\n", 2433 | "223 West Zone ELC 55868.384118\n", 2434 | "224 Women's Advancement 63811.150000\n", 2435 | "225 Workers Compensation Service 23797.119133\n", 2436 | "226 Young Achievers K-8 56534.020463\n", 2437 | "227 Youth Engagement & Employment 33645.202308\n", 2438 | "\n", 2439 | "[228 rows x 2 columns]" 2440 | ] 2441 | }, 2442 | "execution_count": 39, 2443 | "metadata": {}, 2444 | "output_type": "execute_result" 2445 | } 2446 | ], 2447 | "source": [ 2448 | "(salary_sort.groupby('department_name')\n", 2449 | " .mean()\n", 2450 | " .reset_index()\n", 2451 | " .rename(columns = {'total_earnings':'dept_average'}))" 2452 | ] 2453 | }, 2454 | { 2455 | "cell_type": "markdown", 2456 | "metadata": {}, 2457 | "source": [ 2458 | "## 2. Merging datasets" 2459 | ] 2460 | }, 2461 | { 2462 | "cell_type": "markdown", 2463 | "metadata": {}, 2464 | "source": [ 2465 | "Now we have two main datasets, `salary_sort` (the salary for each person, sorted from high to low) and `salary_average` (the average salary for each department). 
What if I wanted to merge these two together, so I could see side-by-side each person's salary compared to the average for their department?\n", 2466 | "\n", 2467 | "We want to join by the `department_name` variable, since that is consistent across both datasets. Let's put the merged data into a new dataframe, `salary_merged`:" 2468 | ] 2469 | }, 2470 | { 2471 | "cell_type": "code", 2472 | "execution_count": 40, 2473 | "metadata": {}, 2474 | "outputs": [], 2475 | "source": [ 2476 | "salary_merged = pd.merge(salary_sort, salary_average, on = 'department_name')" 2477 | ] 2478 | }, 2479 | { 2480 | "cell_type": "markdown", 2481 | "metadata": {}, 2482 | "source": [ 2483 | "Now we can see the department average, `dept_average`, next to the individual's salary, `total_earnings`:" 2484 | ] 2485 | }, 2486 | { 2487 | "cell_type": "code", 2488 | "execution_count": 41, 2489 | "metadata": {}, 2490 | "outputs": [ 2491 | { 2492 | "data": { 2493 | "text/html": [ 2494 | "
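For reference, `pd.merge()` defaults to an inner join, which silently drops any row whose `department_name` appears in only one of the two DataFrames. A small sketch on toy data (not the workshop files) showing how `how='left'` keeps every row from the first DataFrame instead:

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'department_name': ['Police', 'Fire', 'Parks'],
    'total_earnings': [100.0, 60.0, 50.0],
})
averages = pd.DataFrame({
    'department_name': ['Police', 'Fire'],
    'dept_average': [90.0, 55.0],
})

# Inner join (the default): Parks has no average, so its row is dropped
inner = pd.merge(people, averages, on='department_name')

# Left join: Parks is kept, with NaN for the missing dept_average
left = pd.merge(people, averages, on='department_name', how='left')

print(len(inner), len(left))
```

With real data it is worth checking the row count after a merge, so missing matches don't go unnoticed.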
\n", 2495 | "\n", 2508 | "\n", 2509 | " \n", 2510 | " \n", 2511 | " \n", 2512 | " \n", 2513 | " \n", 2514 | " \n", 2515 | " \n", 2516 | " \n", 2517 | " \n", 2518 | " \n", 2519 | " \n", 2520 | " \n", 2521 | " \n", 2522 | " \n", 2523 | " \n", 2524 | " \n", 2525 | " \n", 2526 | " \n", 2527 | " \n", 2528 | " \n", 2529 | " \n", 2530 | " \n", 2531 | " \n", 2532 | " \n", 2533 | " \n", 2534 | " \n", 2535 | " \n", 2536 | " \n", 2537 | " \n", 2538 | " \n", 2539 | " \n", 2540 | " \n", 2541 | " \n", 2542 | " \n", 2543 | " \n", 2544 | " \n", 2545 | " \n", 2546 | " \n", 2547 | " \n", 2548 | " \n", 2549 | " \n", 2550 | " \n", 2551 | " \n", 2552 | " \n", 2553 | " \n", 2554 | " \n", 2555 | "
namedepartment_nametotal_earningsdept_average
0Lee,WaimanBoston Police Department403408.61124787.164775
1Josey,Windell C.Boston Police Department396348.50124787.164775
2Painten,Paul ABoston Police Department373959.35124787.164775
3Brown,GregoryBoston Police Department351825.50124787.164775
4Hosein,HaseebBoston Police Department346105.17124787.164775
\n", 2556 | "
" 2557 | ], 2558 | "text/plain": [ 2559 | " name department_name total_earnings dept_average\n", 2560 | "0 Lee,Waiman Boston Police Department 403408.61 124787.164775\n", 2561 | "1 Josey,Windell C. Boston Police Department 396348.50 124787.164775\n", 2562 | "2 Painten,Paul A Boston Police Department 373959.35 124787.164775\n", 2563 | "3 Brown,Gregory Boston Police Department 351825.50 124787.164775\n", 2564 | "4 Hosein,Haseeb Boston Police Department 346105.17 124787.164775" 2565 | ] 2566 | }, 2567 | "execution_count": 41, 2568 | "metadata": {}, 2569 | "output_type": "execute_result" 2570 | } 2571 | ], 2572 | "source": [ 2573 | "salary_merged.head()" 2574 | ] 2575 | }, 2576 | { 2577 | "cell_type": "markdown", 2578 | "metadata": {}, 2579 | "source": [ 2580 | "## 3. Reshaping data" 2581 | ] 2582 | }, 2583 | { 2584 | "cell_type": "markdown", 2585 | "metadata": {}, 2586 | "source": [ 2587 | "Here's a dataset on unemployment rates by country from 2012 to 2016, from the International Monetary Fund's World Economic Outlook database (available [here](https://www.imf.org/external/pubs/ft/weo/2017/01/weodata/index.aspx)).\n", 2588 | "\n", 2589 | "When you download the dataset, it comes in an Excel file. We can use the `pd.read_excel()` function from `pandas` to load the file into Python." 2590 | ] 2591 | }, 2592 | { 2593 | "cell_type": "code", 2594 | "execution_count": 42, 2595 | "metadata": {}, 2596 | "outputs": [ 2597 | { 2598 | "data": { 2599 | "text/html": [ 2600 | "
\n", 2601 | "\n", 2614 | "\n", 2615 | " \n", 2616 | " \n", 2617 | " \n", 2618 | " \n", 2619 | " \n", 2620 | " \n", 2621 | " \n", 2622 | " \n", 2623 | " \n", 2624 | " \n", 2625 | " \n", 2626 | " \n", 2627 | " \n", 2628 | " \n", 2629 | " \n", 2630 | " \n", 2631 | " \n", 2632 | " \n", 2633 | " \n", 2634 | " \n", 2635 | " \n", 2636 | " \n", 2637 | " \n", 2638 | " \n", 2639 | " \n", 2640 | " \n", 2641 | " \n", 2642 | " \n", 2643 | " \n", 2644 | " \n", 2645 | " \n", 2646 | " \n", 2647 | " \n", 2648 | " \n", 2649 | " \n", 2650 | " \n", 2651 | " \n", 2652 | " \n", 2653 | " \n", 2654 | " \n", 2655 | " \n", 2656 | " \n", 2657 | " \n", 2658 | " \n", 2659 | " \n", 2660 | " \n", 2661 | " \n", 2662 | " \n", 2663 | " \n", 2664 | " \n", 2665 | " \n", 2666 | " \n", 2667 | " \n", 2668 | " \n", 2669 | " \n", 2670 | " \n", 2671 | " \n", 2672 | " \n", 2673 | "
Country20122013201420152016
0Albania13.40016.00017.50017.10016.100
1Algeria11.0009.82910.60011.21410.498
2Argentina7.2007.0757.250NaN8.467
3Armenia17.30016.20017.60018.50018.790
4Australia5.2175.6506.0586.0585.733
\n", 2674 | "
" 2675 | ], 2676 | "text/plain": [ 2677 | " Country 2012 2013 2014 2015 2016\n", 2678 | "0 Albania 13.400 16.000 17.500 17.100 16.100\n", 2679 | "1 Algeria 11.000 9.829 10.600 11.214 10.498\n", 2680 | "2 Argentina 7.200 7.075 7.250 NaN 8.467\n", 2681 | "3 Armenia 17.300 16.200 17.600 18.500 18.790\n", 2682 | "4 Australia 5.217 5.650 6.058 6.058 5.733" 2683 | ] 2684 | }, 2685 | "execution_count": 42, 2686 | "metadata": {}, 2687 | "output_type": "execute_result" 2688 | } 2689 | ], 2690 | "source": [ 2691 | "unemployment = pd.read_excel('unemployment.xlsx')\n", 2692 | "unemployment.head()" 2693 | ] 2694 | }, 2695 | { 2696 | "cell_type": "markdown", 2697 | "metadata": {}, 2698 | "source": [ 2699 | "You'll notice if you open the `unemployment.xlsx` file in Excel that cells that do not have data (like Argentina in 2015) are labeled with \"n/a\". A nice feature of `pd.read_excel()` is that it recognizes these cells as NaN (\"not a number,\" or Python's way of encoding missing values), by default. If we wanted to, we could explicitly tell pandas that missing values were labeled \"n/a\" using `na_values = 'n/a'` within the `pd.read_excel()` function:" 2700 | ] 2701 | }, 2702 | { 2703 | "cell_type": "code", 2704 | "execution_count": 43, 2705 | "metadata": {}, 2706 | "outputs": [], 2707 | "source": [ 2708 | "unemployment = pd.read_excel('unemployment.xlsx', na_values = 'n/a')" 2709 | ] 2710 | }, 2711 | { 2712 | "cell_type": "markdown", 2713 | "metadata": {}, 2714 | "source": [ 2715 | "Right now, the data are in what's commonly referred to as \"wide\" format, meaning the variables (unemployment rate for each year) are spread across rows. This might be good for presentation, but it's not great for certain calculations or graphing. 
\"Wide\" format data also becomes confusing if other variables are added.\n", 2716 | "\n", 2717 | "We need to change the format from \"wide\" to \"long,\" meaning that the columns (`2012`, `2013`, `2014`, `2015`, `2016`) will be converted into a new variable, which we'll call `Year`, with repeated values for each country. And the unemployment rates will be put into a new variable, which we'll call `Rate_Unemployed`." 2718 | ] 2719 | }, 2720 | { 2721 | "cell_type": "markdown", 2722 | "metadata": {}, 2723 | "source": [ 2724 | "To do this, we'll use the [`pd.melt()`](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.melt.html) function in `pandas` to create a new DataFrame, `unemployment_long`." 2725 | ] 2726 | }, 2727 | { 2728 | "cell_type": "code", 2729 | "execution_count": 44, 2730 | "metadata": {}, 2731 | "outputs": [], 2732 | "source": [ 2733 | "unemployment_long = pd.melt(unemployment, # data to reshape\n", 2734 | " id_vars = 'Country', # identifier variable\n", 2735 | " var_name = 'Year', # column we want to create from the rows \n", 2736 | " value_name = 'Rate_Unemployed') # the values of interest" 2737 | ] 2738 | }, 2739 | { 2740 | "cell_type": "markdown", 2741 | "metadata": {}, 2742 | "source": [ 2743 | "Inspecting `unemployment_long` using `head()` shows that we have successfully created a long dataset." 2744 | ] 2745 | }, 2746 | { 2747 | "cell_type": "code", 2748 | "execution_count": 45, 2749 | "metadata": {}, 2750 | "outputs": [ 2751 | { 2752 | "data": { 2753 | "text/html": [ 2754 | "
\n", 2755 | "\n", 2768 | "\n", 2769 | " \n", 2770 | " \n", 2771 | " \n", 2772 | " \n", 2773 | " \n", 2774 | " \n", 2775 | " \n", 2776 | " \n", 2777 | " \n", 2778 | " \n", 2779 | " \n", 2780 | " \n", 2781 | " \n", 2782 | " \n", 2783 | " \n", 2784 | " \n", 2785 | " \n", 2786 | " \n", 2787 | " \n", 2788 | " \n", 2789 | " \n", 2790 | " \n", 2791 | " \n", 2792 | " \n", 2793 | " \n", 2794 | " \n", 2795 | " \n", 2796 | " \n", 2797 | " \n", 2798 | " \n", 2799 | " \n", 2800 | " \n", 2801 | " \n", 2802 | " \n", 2803 | " \n", 2804 | " \n", 2805 | " \n", 2806 | " \n", 2807 | " \n", 2808 | " \n", 2809 | "
CountryYearRate_Unemployed
0Albania201213.400
1Algeria201211.000
2Argentina20127.200
3Armenia201217.300
4Australia20125.217
\n", 2810 | "
" 2811 | ], 2812 | "text/plain": [ 2813 | " Country Year Rate_Unemployed\n", 2814 | "0 Albania 2012 13.400\n", 2815 | "1 Algeria 2012 11.000\n", 2816 | "2 Argentina 2012 7.200\n", 2817 | "3 Armenia 2012 17.300\n", 2818 | "4 Australia 2012 5.217" 2819 | ] 2820 | }, 2821 | "execution_count": 45, 2822 | "metadata": {}, 2823 | "output_type": "execute_result" 2824 | } 2825 | ], 2826 | "source": [ 2827 | "unemployment_long.head()" 2828 | ] 2829 | }, 2830 | { 2831 | "cell_type": "markdown", 2832 | "metadata": {}, 2833 | "source": [ 2834 | "## 4. Calculating year-over-year change in panel data" 2835 | ] 2836 | }, 2837 | { 2838 | "cell_type": "markdown", 2839 | "metadata": {}, 2840 | "source": [ 2841 | "Sort the data by `Country` and `Year` using the `sort_values()` function:" 2842 | ] 2843 | }, 2844 | { 2845 | "cell_type": "code", 2846 | "execution_count": 46, 2847 | "metadata": {}, 2848 | "outputs": [ 2849 | { 2850 | "data": { 2851 | "text/html": [ 2852 | "
\n", 2853 | "\n", 2866 | "\n", 2867 | " \n", 2868 | " \n", 2869 | " \n", 2870 | " \n", 2871 | " \n", 2872 | " \n", 2873 | " \n", 2874 | " \n", 2875 | " \n", 2876 | " \n", 2877 | " \n", 2878 | " \n", 2879 | " \n", 2880 | " \n", 2881 | " \n", 2882 | " \n", 2883 | " \n", 2884 | " \n", 2885 | " \n", 2886 | " \n", 2887 | " \n", 2888 | " \n", 2889 | " \n", 2890 | " \n", 2891 | " \n", 2892 | " \n", 2893 | " \n", 2894 | " \n", 2895 | " \n", 2896 | " \n", 2897 | " \n", 2898 | " \n", 2899 | " \n", 2900 | " \n", 2901 | " \n", 2902 | " \n", 2903 | " \n", 2904 | " \n", 2905 | " \n", 2906 | " \n", 2907 | "
CountryYearRate_Unemployed
0Albania201213.4
112Albania201316.0
224Albania201417.5
336Albania201517.1
448Albania201616.1
\n", 2908 | "
" 2909 | ], 2910 | "text/plain": [ 2911 | " Country Year Rate_Unemployed\n", 2912 | "0 Albania 2012 13.4\n", 2913 | "112 Albania 2013 16.0\n", 2914 | "224 Albania 2014 17.5\n", 2915 | "336 Albania 2015 17.1\n", 2916 | "448 Albania 2016 16.1" 2917 | ] 2918 | }, 2919 | "execution_count": 46, 2920 | "metadata": {}, 2921 | "output_type": "execute_result" 2922 | } 2923 | ], 2924 | "source": [ 2925 | "unemployment_long = unemployment_long.sort_values(['Country', 'Year'])\n", 2926 | "\n", 2927 | "unemployment_long.head()" 2928 | ] 2929 | }, 2930 | { 2931 | "cell_type": "markdown", 2932 | "metadata": { 2933 | "collapsed": true 2934 | }, 2935 | "source": [ 2936 | "Again, we can use `reset_index(drop = True)` to reset the row index so that the numbers next to the rows are in sequential order." 2937 | ] 2938 | }, 2939 | { 2940 | "cell_type": "code", 2941 | "execution_count": 47, 2942 | "metadata": {}, 2943 | "outputs": [ 2944 | { 2945 | "data": { 2946 | "text/html": [ 2947 | "
\n", 2948 | "\n", 2961 | "\n", 2962 | " \n", 2963 | " \n", 2964 | " \n", 2965 | " \n", 2966 | " \n", 2967 | " \n", 2968 | " \n", 2969 | " \n", 2970 | " \n", 2971 | " \n", 2972 | " \n", 2973 | " \n", 2974 | " \n", 2975 | " \n", 2976 | " \n", 2977 | " \n", 2978 | " \n", 2979 | " \n", 2980 | " \n", 2981 | " \n", 2982 | " \n", 2983 | " \n", 2984 | " \n", 2985 | " \n", 2986 | " \n", 2987 | " \n", 2988 | " \n", 2989 | " \n", 2990 | " \n", 2991 | " \n", 2992 | " \n", 2993 | " \n", 2994 | " \n", 2995 | " \n", 2996 | " \n", 2997 | " \n", 2998 | " \n", 2999 | " \n", 3000 | " \n", 3001 | " \n", 3002 | "
CountryYearRate_Unemployed
0Albania201213.4
1Albania201316.0
2Albania201417.5
3Albania201517.1
4Albania201616.1
\n", 3003 | "
" 3004 | ], 3005 | "text/plain": [ 3006 | " Country Year Rate_Unemployed\n", 3007 | "0 Albania 2012 13.4\n", 3008 | "1 Albania 2013 16.0\n", 3009 | "2 Albania 2014 17.5\n", 3010 | "3 Albania 2015 17.1\n", 3011 | "4 Albania 2016 16.1" 3012 | ] 3013 | }, 3014 | "execution_count": 47, 3015 | "metadata": {}, 3016 | "output_type": "execute_result" 3017 | } 3018 | ], 3019 | "source": [ 3020 | "unemployment_long = unemployment_long.reset_index(drop = True)\n", 3021 | "\n", 3022 | "unemployment_long.head()" 3023 | ] 3024 | }, 3025 | { 3026 | "cell_type": "markdown", 3027 | "metadata": {}, 3028 | "source": [ 3029 | "This type of data is known in time-series analysis as a panel; each country is observed every year from 2012 to 2016.\n", 3030 | "\n", 3031 | "For Albania, the percentage point change in unemployment rate from 2012 to 2013 would be 16 - 13.4 = 2.5 percentage points. What if I wanted the year-over-year change in unemployment rate for every country?\n", 3032 | "\n", 3033 | "We can use the [`diff()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.diff.html) function in `pandas` to do this. We can use `diff()` to calculate the difference between the `Rate_Unemployed` that year and the `Rate_Unemployed` for the year prior (the default for `lag()` is 1 period, which is good for us since we want the change from the previous year). We will save this difference into a new variable, `Change`." 
3034 | ] 3035 | }, 3036 | { 3037 | "cell_type": "code", 3038 | "execution_count": 48, 3039 | "metadata": {}, 3040 | "outputs": [], 3041 | "source": [ 3042 | "unemployment_long['Change'] = unemployment_long.Rate_Unemployed.diff()" 3043 | ] 3044 | }, 3045 | { 3046 | "cell_type": "markdown", 3047 | "metadata": {}, 3048 | "source": [ 3049 | "Let's inspect the first five rows again, using `head()`:" 3050 | ] 3051 | }, 3052 | { 3053 | "cell_type": "code", 3054 | "execution_count": 49, 3055 | "metadata": {}, 3056 | "outputs": [ 3057 | { 3058 | "data": { 3059 | "text/html": [ 3060 | "
\n", 3061 | "\n", 3074 | "\n", 3075 | " \n", 3076 | " \n", 3077 | " \n", 3078 | " \n", 3079 | " \n", 3080 | " \n", 3081 | " \n", 3082 | " \n", 3083 | " \n", 3084 | " \n", 3085 | " \n", 3086 | " \n", 3087 | " \n", 3088 | " \n", 3089 | " \n", 3090 | " \n", 3091 | " \n", 3092 | " \n", 3093 | " \n", 3094 | " \n", 3095 | " \n", 3096 | " \n", 3097 | " \n", 3098 | " \n", 3099 | " \n", 3100 | " \n", 3101 | " \n", 3102 | " \n", 3103 | " \n", 3104 | " \n", 3105 | " \n", 3106 | " \n", 3107 | " \n", 3108 | " \n", 3109 | " \n", 3110 | " \n", 3111 | " \n", 3112 | " \n", 3113 | " \n", 3114 | " \n", 3115 | " \n", 3116 | " \n", 3117 | " \n", 3118 | " \n", 3119 | " \n", 3120 | " \n", 3121 | "
CountryYearRate_UnemployedChange
0Albania201213.4NaN
1Albania201316.02.6
2Albania201417.51.5
3Albania201517.1-0.4
4Albania201616.1-1.0
\n", 3122 | "
" 3123 | ], 3124 | "text/plain": [ 3125 | " Country Year Rate_Unemployed Change\n", 3126 | "0 Albania 2012 13.4 NaN\n", 3127 | "1 Albania 2013 16.0 2.6\n", 3128 | "2 Albania 2014 17.5 1.5\n", 3129 | "3 Albania 2015 17.1 -0.4\n", 3130 | "4 Albania 2016 16.1 -1.0" 3131 | ] 3132 | }, 3133 | "execution_count": 49, 3134 | "metadata": {}, 3135 | "output_type": "execute_result" 3136 | } 3137 | ], 3138 | "source": [ 3139 | "unemployment_long.head()" 3140 | ] 3141 | }, 3142 | { 3143 | "cell_type": "markdown", 3144 | "metadata": {}, 3145 | "source": [ 3146 | "So far so good. It also makes sense that Albania's `Change` is `NaN` in 2012, since the dataset doesn't contain any unemployment figures before the year 2012.\n", 3147 | "\n", 3148 | "But a closer inspection of the data reveals a problem. What if we used `tail()` to look at the *last* 5 rows of the data?" 3149 | ] 3150 | }, 3151 | { 3152 | "cell_type": "code", 3153 | "execution_count": 50, 3154 | "metadata": {}, 3155 | "outputs": [ 3156 | { 3157 | "data": { 3158 | "text/html": [ 3159 | "
\n", 3160 | "\n", 3173 | "\n", 3174 | " \n", 3175 | " \n", 3176 | " \n", 3177 | " \n", 3178 | " \n", 3179 | " \n", 3180 | " \n", 3181 | " \n", 3182 | " \n", 3183 | " \n", 3184 | " \n", 3185 | " \n", 3186 | " \n", 3187 | " \n", 3188 | " \n", 3189 | " \n", 3190 | " \n", 3191 | " \n", 3192 | " \n", 3193 | " \n", 3194 | " \n", 3195 | " \n", 3196 | " \n", 3197 | " \n", 3198 | " \n", 3199 | " \n", 3200 | " \n", 3201 | " \n", 3202 | " \n", 3203 | " \n", 3204 | " \n", 3205 | " \n", 3206 | " \n", 3207 | " \n", 3208 | " \n", 3209 | " \n", 3210 | " \n", 3211 | " \n", 3212 | " \n", 3213 | " \n", 3214 | " \n", 3215 | " \n", 3216 | " \n", 3217 | " \n", 3218 | " \n", 3219 | " \n", 3220 | "
CountryYearRate_UnemployedChange
555Vietnam20122.74-18.493
556Vietnam20132.750.010
557Vietnam20142.05-0.700
558Vietnam20152.400.350
559Vietnam20162.400.000
\n", 3221 | "
" 3222 | ], 3223 | "text/plain": [ 3224 | " Country Year Rate_Unemployed Change\n", 3225 | "555 Vietnam 2012 2.74 -18.493\n", 3226 | "556 Vietnam 2013 2.75 0.010\n", 3227 | "557 Vietnam 2014 2.05 -0.700\n", 3228 | "558 Vietnam 2015 2.40 0.350\n", 3229 | "559 Vietnam 2016 2.40 0.000" 3230 | ] 3231 | }, 3232 | "execution_count": 50, 3233 | "metadata": {}, 3234 | "output_type": "execute_result" 3235 | } 3236 | ], 3237 | "source": [ 3238 | "unemployment_long.tail()" 3239 | ] 3240 | }, 3241 | { 3242 | "cell_type": "markdown", 3243 | "metadata": {}, 3244 | "source": [ 3245 | "**Why does Vietnam have a -18.493 percentage point change in 2012?**" 3246 | ] 3247 | }, 3248 | { 3249 | "cell_type": "markdown", 3250 | "metadata": {}, 3251 | "source": [ 3252 | "(Hint: use `tail()` to look at the last 6 rows of the data.)" 3253 | ] 3254 | }, 3255 | { 3256 | "cell_type": "code", 3257 | "execution_count": 51, 3258 | "metadata": {}, 3259 | "outputs": [ 3260 | { 3261 | "data": { 3262 | "text/html": [ 3263 | "
\n", 3264 | "\n", 3277 | "\n", 3278 | " \n", 3279 | " \n", 3280 | " \n", 3281 | " \n", 3282 | " \n", 3283 | " \n", 3284 | " \n", 3285 | " \n", 3286 | " \n", 3287 | " \n", 3288 | " \n", 3289 | " \n", 3290 | " \n", 3291 | " \n", 3292 | " \n", 3293 | " \n", 3294 | " \n", 3295 | " \n", 3296 | " \n", 3297 | " \n", 3298 | " \n", 3299 | " \n", 3300 | " \n", 3301 | " \n", 3302 | " \n", 3303 | " \n", 3304 | " \n", 3305 | " \n", 3306 | " \n", 3307 | " \n", 3308 | " \n", 3309 | " \n", 3310 | " \n", 3311 | " \n", 3312 | " \n", 3313 | " \n", 3314 | " \n", 3315 | " \n", 3316 | " \n", 3317 | " \n", 3318 | " \n", 3319 | " \n", 3320 | " \n", 3321 | " \n", 3322 | " \n", 3323 | " \n", 3324 | "
CountryYearRate_UnemployedChange
555Vietnam20122.74NaN
556Vietnam20132.750.01
557Vietnam20142.05-0.70
558Vietnam20152.400.35
559Vietnam20162.400.00
\n", 3325 | "
" 3326 | ], 3327 | "text/plain": [ 3328 | " Country Year Rate_Unemployed Change\n", 3329 | "555 Vietnam 2012 2.74 NaN\n", 3330 | "556 Vietnam 2013 2.75 0.01\n", 3331 | "557 Vietnam 2014 2.05 -0.70\n", 3332 | "558 Vietnam 2015 2.40 0.35\n", 3333 | "559 Vietnam 2016 2.40 0.00" 3334 | ] 3335 | }, 3336 | "execution_count": 51, 3337 | "metadata": {}, 3338 | "output_type": "execute_result" 3339 | } 3340 | ], 3341 | "source": [ 3342 | "unemployment_long['Change'] = (unemployment_long\n", 3343 | " .groupby('Country')\n", 3344 | " .Rate_Unemployed.diff())\n", 3345 | "\n", 3346 | "unemployment_long.tail()" 3347 | ] 3348 | }, 3349 | { 3350 | "cell_type": "markdown", 3351 | "metadata": {}, 3352 | "source": [ 3353 | "(Also notice how I put the entire expression in parentheses and put each function on a different line for readability.)" 3354 | ] 3355 | }, 3356 | { 3357 | "cell_type": "markdown", 3358 | "metadata": {}, 3359 | "source": [ 3360 | "## 5. Recoding numerical variables into categorical ones" 3361 | ] 3362 | }, 3363 | { 3364 | "cell_type": "markdown", 3365 | "metadata": {}, 3366 | "source": [ 3367 | "Here's a list of some attendees for the 2016 workshop, with names and contact info removed." 3368 | ] 3369 | }, 3370 | { 3371 | "cell_type": "code", 3372 | "execution_count": 52, 3373 | "metadata": {}, 3374 | "outputs": [ 3375 | { 3376 | "data": { 3377 | "text/html": [ 3378 | "
\n", 3379 | "\n", 3392 | "\n", 3393 | " \n", 3394 | " \n", 3395 | " \n", 3396 | " \n", 3397 | " \n", 3398 | " \n", 3399 | " \n", 3400 | " \n", 3401 | " \n", 3402 | " \n", 3403 | " \n", 3404 | " \n", 3405 | " \n", 3406 | " \n", 3407 | " \n", 3408 | " \n", 3409 | " \n", 3410 | " \n", 3411 | " \n", 3412 | " \n", 3413 | " \n", 3414 | " \n", 3415 | " \n", 3416 | " \n", 3417 | " \n", 3418 | " \n", 3419 | " \n", 3420 | " \n", 3421 | " \n", 3422 | " \n", 3423 | " \n", 3424 | " \n", 3425 | " \n", 3426 | " \n", 3427 | " \n", 3428 | " \n", 3429 | " \n", 3430 | " \n", 3431 | " \n", 3432 | " \n", 3433 | " \n", 3434 | " \n", 3435 | " \n", 3436 | " \n", 3437 | " \n", 3438 | " \n", 3439 | " \n", 3440 | " \n", 3441 | " \n", 3442 | " \n", 3443 | " \n", 3444 | " \n", 3445 | " \n", 3446 | " \n", 3447 | " \n", 3448 | " \n", 3449 | " \n", 3450 | " \n", 3451 | " \n", 3452 | " \n", 3453 | " \n", 3454 | " \n", 3455 | " \n", 3456 | " \n", 3457 | " \n", 3458 | " \n", 3459 | " \n", 3460 | " \n", 3461 | " \n", 3462 | " \n", 3463 | " \n", 3464 | " \n", 3465 | " \n", 3466 | " \n", 3467 | " \n", 3468 | " \n", 3469 | " \n", 3470 | " \n", 3471 | " \n", 3472 | " \n", 3473 | " \n", 3474 | " \n", 3475 | " \n", 3476 | " \n", 3477 | " \n", 3478 | " \n", 3479 | " \n", 3480 | " \n", 3481 | " \n", 3482 | " \n", 3483 | " \n", 3484 | " \n", 3485 | " \n", 3486 | " \n", 3487 | " \n", 3488 | " \n", 3489 | " \n", 3490 | " \n", 3491 | " \n", 3492 | " \n", 3493 | " \n", 3494 | " \n", 3495 | " \n", 3496 | " \n", 3497 | " \n", 3498 | " \n", 3499 | " \n", 3500 | " \n", 3501 | " \n", 3502 | " \n", 3503 | " \n", 3504 | " \n", 3505 | "
OccupationJob titleAge groupGenderState/ProvinceEducationWhich data subject area are you most interested in working with? (Select up to three)What do you hope to get out of the workshop?Which type of laptop will you bring?College or University NameMajor or ConcentrationCollege YearWhich Digital Badge track best suits you?Which session would you like to attend?Choose your status:
0Data AnalystData Quality Analyst30-39MaleMABachelor's DegreeRetailotherPCNaNNaNNaNAdvanced Data StorytellingJune 5-9Nonprofit, Academic, Government
1PhD StudentStudent/Research Assistant18-29MaleMABachelor's DegreeSportsMaster Advanced RPCBoston UniversityBiostatisticsPhDAdvanced Data StorytellingJune 5-9Student
2EducationData Analyst18-29FemaleKentuckyMaster's DegreeRetailotherPCNaNNaNNaNAdvanced Data StorytellingJune 5-9Nonprofit, Academic, Government
3ManagerBAS Manager30-39MaleMABachelor's DegreeEducationPick up Beginning R And SQLPCBoston UniversityPEMBAGraduateAdvanced Data StorytellingJune 5-9Student
4Government FinancePerformance Analyst30 - 39MaleMAMaster's DegreeEnvironment, Finance, Food and agriculturePick up Beginning R And SQLMACNaNNaNNaNAdvanced Data StorytellingJune 5-9Nonprofit, Academic, Government Early Bird
\n", 3506 | "
" 3507 | ], 3508 | "text/plain": [ 3509 | " Occupation Job title Age group Gender \\\n", 3510 | "0 Data Analyst Data Quality Analyst 30-39 Male \n", 3511 | "1 PhD Student Student/Research Assistant 18-29 Male \n", 3512 | "2 Education Data Analyst 18-29 Female \n", 3513 | "3 Manager BAS Manager 30-39 Male \n", 3514 | "4 Government Finance Performance Analyst 30 - 39 Male \n", 3515 | "\n", 3516 | " State/Province Education \\\n", 3517 | "0 MA Bachelor's Degree \n", 3518 | "1 MA Bachelor's Degree \n", 3519 | "2 Kentucky Master's Degree \n", 3520 | "3 MA Bachelor's Degree \n", 3521 | "4 MA Master's Degree \n", 3522 | "\n", 3523 | " Which data subject area are you most interested in working with? (Select up to three) \\\n", 3524 | "0 Retail \n", 3525 | "1 Sports \n", 3526 | "2 Retail \n", 3527 | "3 Education \n", 3528 | "4 Environment, Finance, Food and agriculture \n", 3529 | "\n", 3530 | " What do you hope to get out of the workshop? \\\n", 3531 | "0 other \n", 3532 | "1 Master Advanced R \n", 3533 | "2 other \n", 3534 | "3 Pick up Beginning R And SQL \n", 3535 | "4 Pick up Beginning R And SQL \n", 3536 | "\n", 3537 | " Which type of laptop will you bring? College or University Name \\\n", 3538 | "0 PC NaN \n", 3539 | "1 PC Boston University \n", 3540 | "2 PC NaN \n", 3541 | "3 PC Boston University \n", 3542 | "4 MAC NaN \n", 3543 | "\n", 3544 | " Major or Concentration College Year \\\n", 3545 | "0 NaN NaN \n", 3546 | "1 Biostatistics PhD \n", 3547 | "2 NaN NaN \n", 3548 | "3 PEMBA Graduate \n", 3549 | "4 NaN NaN \n", 3550 | "\n", 3551 | " Which Digital Badge track best suits you? \\\n", 3552 | "0 Advanced Data Storytelling \n", 3553 | "1 Advanced Data Storytelling \n", 3554 | "2 Advanced Data Storytelling \n", 3555 | "3 Advanced Data Storytelling \n", 3556 | "4 Advanced Data Storytelling \n", 3557 | "\n", 3558 | " Which session would you like to attend? 
\\\n", 3559 | "0 June 5-9 \n", 3560 | "1 June 5-9 \n", 3561 | "2 June 5-9 \n", 3562 | "3 June 5-9 \n", 3563 | "4 June 5-9 \n", 3564 | "\n", 3565 | " Choose your status: \n", 3566 | "0 Nonprofit, Academic, Government \n", 3567 | "1 Student \n", 3568 | "2 Nonprofit, Academic, Government \n", 3569 | "3 Student \n", 3570 | "4 Nonprofit, Academic, Government Early Bird " 3571 | ] 3572 | }, 3573 | "execution_count": 52, 3574 | "metadata": {}, 3575 | "output_type": "execute_result" 3576 | } 3577 | ], 3578 | "source": [ 3579 | "attendees = pd.read_csv('attendees.csv')\n", 3580 | "\n", 3581 | "attendees.head()" 3582 | ] 3583 | }, 3584 | { 3585 | "cell_type": "markdown", 3586 | "metadata": {}, 3587 | "source": [ 3588 | "**What if we wanted to quickly see the age distribution of attendees?**" 3589 | ] 3590 | }, 3591 | { 3592 | "cell_type": "code", 3593 | "execution_count": 53, 3594 | "metadata": {}, 3595 | "outputs": [ 3596 | { 3597 | "data": { 3598 | "text/plain": [ 3599 | "30-39 7\n", 3600 | "18-29 4\n", 3601 | "30 - 39 1\n", 3602 | "Name: Age group, dtype: int64" 3603 | ] 3604 | }, 3605 | "execution_count": 53, 3606 | "metadata": {}, 3607 | "output_type": "execute_result" 3608 | } 3609 | ], 3610 | "source": [ 3611 | "attendees['Age group'].value_counts()" 3612 | ] 3613 | }, 3614 | { 3615 | "cell_type": "markdown", 3616 | "metadata": {}, 3617 | "source": [ 3618 | "There's an inconsistency in the labeling of the `Age group` variable here. We can fix this using `np.where()` in the `numpy` library. First, let's import the `numpy` library. Like `pandas`, `numpy` has a commonly used alias — `np`." 
3619 | ] 3620 | }, 3621 | { 3622 | "cell_type": "code", 3623 | "execution_count": 54, 3624 | "metadata": {}, 3625 | "outputs": [], 3626 | "source": [ 3627 | "import numpy as np" 3628 | ] 3629 | }, 3630 | { 3631 | "cell_type": "code", 3632 | "execution_count": 55, 3633 | "metadata": {}, 3634 | "outputs": [], 3635 | "source": [ 3636 | "attendees['Age group'] = np.where(attendees['Age group'] == '30 - 39', # where attendees['Age group'] == '30 - 39'\n", 3637 | " '30-39', # replace attendees['Age group'] with '30-39'\n", 3638 | " attendees['Age group']) # otherwise, keep attendees['Age group'] values the same" 3639 | ] 3640 | }, 3641 | { 3642 | "cell_type": "markdown", 3643 | "metadata": {}, 3644 | "source": [ 3645 | "This might seem trivial for just one value, but it's useful for larger datasets." 3646 | ] 3647 | }, 3648 | { 3649 | "cell_type": "code", 3650 | "execution_count": 56, 3651 | "metadata": {}, 3652 | "outputs": [ 3653 | { 3654 | "data": { 3655 | "text/plain": [ 3656 | "30-39 8\n", 3657 | "18-29 4\n", 3658 | "Name: Age group, dtype: int64" 3659 | ] 3660 | }, 3661 | "execution_count": 56, 3662 | "metadata": {}, 3663 | "output_type": "execute_result" 3664 | } 3665 | ], 3666 | "source": [ 3667 | "attendees['Age group'].value_counts()" 3668 | ] 3669 | }, 3670 | { 3671 | "cell_type": "markdown", 3672 | "metadata": {}, 3673 | "source": [ 3674 | "Now let's take a look at the professional status of attendees, labeled in `Choose your status:`" 3675 | ] 3676 | }, 3677 | { 3678 | "cell_type": "code", 3679 | "execution_count": 57, 3680 | "metadata": {}, 3681 | "outputs": [ 3682 | { 3683 | "data": { 3684 | "text/plain": [ 3685 | "Student 5\n", 3686 | "Nonprofit, Academic, Government 3\n", 3687 | "Professional 3\n", 3688 | "Nonprofit, Academic, Government Early Bird 1\n", 3689 | "Name: Choose your status:, dtype: int64" 3690 | ] 3691 | }, 3692 | "execution_count": 57, 3693 | "metadata": {}, 3694 | "output_type": "execute_result" 3695 | } 3696 | ], 3697 | "source": [ 3698 
| "attendees['Choose your status:'].value_counts()" 3699 | ] 3700 | }, 3701 | { 3702 | "cell_type": "markdown", 3703 | "metadata": {}, 3704 | "source": [ 3705 | "\"Nonprofit, Academic, Government\" and \"Nonprofit, Academic, Government Early Bird\" seem to be the same. We can use `np.where()` (and the Python designation `|` for \"or\") to combine these two categories into one big category, \"Nonprofit/Gov\". Let's create a new variable, `status`, for our simplified categorization.\n", 3706 | "\n", 3707 | "Notice the extra sets of parentheses around the two conditions linked by the `|` symbol." 3708 | ] 3709 | }, 3710 | { 3711 | "cell_type": "code", 3712 | "execution_count": 58, 3713 | "metadata": {}, 3714 | "outputs": [], 3715 | "source": [ 3716 | "attendees['status'] = np.where((attendees['Choose your status:'] == 'Nonprofit, Academic, Government') |\n", 3717 | " (attendees['Choose your status:'] == 'Nonprofit, Academic, Government Early Bird'),\n", 3718 | " 'Nonprofit/Gov', \n", 3719 | " attendees['Choose your status:'])" 3720 | ] 3721 | }, 3722 | { 3723 | "cell_type": "code", 3724 | "execution_count": 59, 3725 | "metadata": {}, 3726 | "outputs": [ 3727 | { 3728 | "data": { 3729 | "text/plain": [ 3730 | "Student 5\n", 3731 | "Nonprofit/Gov 4\n", 3732 | "Professional 3\n", 3733 | "Name: status, dtype: int64" 3734 | ] 3735 | }, 3736 | "execution_count": 59, 3737 | "metadata": {}, 3738 | "output_type": "execute_result" 3739 | } 3740 | ], 3741 | "source": [ 3742 | "attendees['status'].value_counts()" 3743 | ] 3744 | }, 3745 | { 3746 | "cell_type": "markdown", 3747 | "metadata": {}, 3748 | "source": [ 3749 | "## What else?\n", 3750 | "\n", 3751 | "- How would you create a new variable in the `attendees` data (let's call it `status2`) that has just two categories, \"Student\" and \"Other\"?\n", 3752 | "\n", 3753 | "- How would you rename the variables in the `attendees` data to make them easier to work with?\n", 3754 | "\n", 3755 | "- What are some other issues with 
this dataset? How would you solve them using what we've learned?\n", 3756 | "\n", 3757 | "- What are some other \"messy\" data issues you've encountered?" 3758 | ] 3759 | } 3760 | ], 3761 | "metadata": { 3762 | "kernelspec": { 3763 | "display_name": "Python 3", 3764 | "language": "python", 3765 | "name": "python3" 3766 | }, 3767 | "language_info": { 3768 | "codemirror_mode": { 3769 | "name": "ipython", 3770 | "version": 3 3771 | }, 3772 | "file_extension": ".py", 3773 | "mimetype": "text/x-python", 3774 | "name": "python", 3775 | "nbconvert_exporter": "python", 3776 | "pygments_lexer": "ipython3", 3777 | "version": "3.6.2" 3778 | } 3779 | }, 3780 | "nbformat": 4, 3781 | "nbformat_minor": 2 3782 | } 3783 | -------------------------------------------------------------------------------- /pandas-data-cleaning-tricks.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/underthecurve/pandas-data-cleaning-tricks/cd96cd85ac5b941eb077bbf6bc30b51da0e3f29b/pandas-data-cleaning-tricks.pdf -------------------------------------------------------------------------------- /pandas-data-cleaning-tricks.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # # Tricks for cleaning your data in Python using pandas 5 | # 6 | # **By Christine Zhang ([ychristinezhang at gmail dot com](mailto:ychristinezhang@gmail.com) | [@christinezhang](https://twitter.com/christinezhang))** 7 | 8 | # GitHub repository for Data+Code: https://github.com/underthecurve/pandas-data-cleaning-tricks 9 | # 10 | # In 2017 I gave a talk called "Tricks for cleaning your data in R" which I presented at the [Data+Narrative workshop](http://www.bu.edu/com/data-narrative/) at Boston University.
The repo with the code and data, https://github.com/underthecurve/r-data-cleaning-tricks, was pretty well-received, so I figured I'd try to do some of the same stuff in Python using `pandas`. 11 | # 12 | # **Disclaimer:** when it comes to data stuff, I'm much better with R, especially the `tidyverse` set of packages, than with Python, but in my last job I used Python's `pandas` library to do a lot of data processing since Python was the dominant language there. 13 | # 14 | # Anyway, here goes: 15 | # 16 | # Data cleaning is a cumbersome task, and it can be hard to know where to start in a programming language like Python. 17 | # 18 | # The `pandas` library in Python is a powerful tool for data cleaning and analysis. Because every step you take is written as code, you leave a trail that documents all the work you've done, which makes it extremely useful for creating reproducible workflows. 19 | # 20 | # In this workshop, I'll show you some examples of real-life "messy" datasets, the problems they present for analysis in Python's `pandas` library, and some of the solutions to these problems. 21 | # 22 | # Fittingly, I'll [start the numbering system at 0](http://python-history.blogspot.com/2013/10/why-python-uses-0-based-indexing.html). 23 | 24 | # ## 0. Importing the `pandas` library 25 | 26 | # Here I tell Python to import the `pandas` library as `pd` (a common alias for `pandas` — more on that in the next code chunk). 27 | 28 | # In[1]: 29 | 30 | 31 | import pandas as pd 32 | 33 | 34 | # ## 1. Finding and replacing non-numeric characters like `,` and `$` 35 | 36 | # Let's check out the city of Boston's [Open Data portal](https://data.boston.gov/), where the local government puts up datasets that are free for the public to analyze. 37 | # 38 | # The [Employee Earnings Report](https://data.boston.gov/dataset/employee-earnings-report) is one of the more interesting ones, because it gives payroll data for every person on the municipal payroll.
It's where the *Boston Globe* gets stories like these every year: 39 | # 40 | # - ["64 City of Boston workers earn more than $250,000"](https://www.bostonglobe.com/metro/2016/02/05/city-boston-workers-earn-more-than/MvW6RExJZimdrTlwdwUI7M/story.html) (February 6, 2016) 41 | # 42 | # - ["Police detective tops Boston’s payroll with a total of over $403,000"](https://www.bostonglobe.com/metro/2017/02/14/police-detective-tops-boston-payroll-with-total-over/6PaXwTAHZGEW5djgwCJuTI/story.html) (February 14, 2017) 43 | # 44 | # Let's take a look at the February 14 story from 2017. The story begins: 45 | # 46 | # > "A veteran police detective took home more than $403,000 in earnings last year, topping the list of Boston’s highest-paid employees in 2016, newly released city payroll data show." 47 | # 48 | # **What if we wanted to check this number using the Employee Earnings Report?** 49 | 50 | # We can use the `pandas` function [`pandas.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to load the CSV file into Python. We will call this DataFrame `salary`. Remember that I imported `pandas` "as `pd`" in the last code chunk. This saves me a bit of typing by allowing me to access `pandas` functions like `pandas.read_csv()` by typing `pd.read_csv()` instead. If I had typed `import pandas` in the code chunk under section `0` without `as pd`, the below code wouldn't work. I'd have to instead write `pandas.read_csv()` to access the function. 51 | # 52 | # The `pd` alias for `pandas` is so common that the library's [documentation](http://pandas.pydata.org/pandas-docs/stable/install.html#running-the-test-suite) even uses it sometimes. 53 | # 54 | # Let's try to use `pd.read_csv()`: 55 | 56 | # In[2]: 57 | 58 | 59 | salary = pd.read_csv('employee-earnings-report-2016.csv') 60 | 61 | 62 | # That's a pretty long and nasty error.
Usually when I run into something like this, I start from the bottom and work my way up — in this case, I typed `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 22: invalid continuation byte` into a search engine and came across [this discussion on the Stack Overflow forum](https://stackoverflow.com/questions/30462807/encoding-error-in-panda-read-csv). The last response suggested that adding `encoding = 'latin-1'` inside the function would fix the problem on Macs (which is the type of computer I have). 63 | 64 | # In[3]: 65 | 66 | 67 | salary = pd.read_csv('employee-earnings-report-2016.csv', encoding = 'latin-1') 68 | 69 | 70 | # Great! (I don't know much about encoding, but this is something I run into from time to time so I thought it would be helpful to show here.) 71 | # 72 | # We can use `head()` on the `salary` DataFrame to inspect the first five rows of `salary`. (Note I use `print()` to display the output, but you don't need to do this in your own code if you'd prefer not to.) 73 | 74 | # In[4]: 75 | 76 | 77 | print(salary.head()) 78 | 79 | 80 | # There are a lot of columns. Let's simplify by selecting the ones of interest: `NAME`, `DEPARTMENT_NAME`, and `TOTAL EARNINGS`. There are [a few different ways](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c) of doing this with `pandas`. The simplest way, imo, is by using the indexing operator `[]`. 81 | # 82 | # For example, I could select a single column, `NAME`: (Note I also run the line `pd.options.display.max_rows = 20` in order to display a maximum of 20 rows so the output isn't too crowded.) 83 | 84 | # In[5]: 85 | 86 | 87 | pd.options.display.max_rows = 20 88 | 89 | salary['NAME'] 90 | 91 | 92 | # This works for selecting one column at a time, but using `[]` returns a [Series](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series), not a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).
I can confirm this using the `type()` function: 93 | 94 | # In[6]: 95 | 96 | 97 | type(salary['NAME']) 98 | 99 | 100 | # If I want a DataFrame, I have to use double brackets: 101 | 102 | # In[7]: 103 | 104 | 105 | salary[['NAME']] 106 | 107 | 108 | # In[8]: 109 | 110 | 111 | type(salary[['NAME']]) 112 | 113 | 114 | # To select multiple columns, we can put those columns inside of the second pair of brackets. We will save this into a new DataFrame, `salary_selected`. We type `.copy()` after `salary[['NAME','DEPARTMENT_NAME', 'TOTAL EARNINGS']]` because we are making a copy of the DataFrame and assigning it to a new DataFrame. Learn more about `copy()` [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html). 115 | 116 | # In[9]: 117 | 118 | 119 | salary_selected = salary[['NAME','DEPARTMENT_NAME', 'TOTAL EARNINGS']].copy() 120 | 121 | 122 | # We can also change the column names to lowercase for easier typing. First, let's take a look at the columns by displaying the `columns` attribute of the `salary_selected` DataFrame. 123 | 124 | # In[10]: 125 | 126 | 127 | salary_selected.columns 128 | 129 | 130 | # Notice how this returns something called an "Index." In `pandas`, DataFrames have both row indexes (in our case, the row number, starting from 0 and going to 22045) and column indexes. We can use the [`str.lower()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.lower.html) function to convert the strings (aka characters) in the index to lowercase. 136 | 137 | 138 | # In[12]: 139 | 140 | 141 | salary_selected.columns = salary_selected.columns.str.lower() 142 | 143 | salary_selected.columns 144 | 145 | 146 | # Another thing that will make our lives easier is if the `total earnings` column didn't have a space between `total` and `earnings`.
We can use a "string replace" function, [`str.replace()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html), to replace the space with an underscore. The syntax is: `str.replace('thing you want to replace', 'what to replace it with')`. Note that we assign the result back to `salary_selected.columns` so the change sticks: 147 | 148 | # In[13]: 149 | 150 | 151 | salary_selected.columns = salary_selected.columns.str.replace(' ', '_') 152 | 153 | salary_selected.columns 154 | 155 | 156 | # We could have used both the `str.lower()` and `str.replace()` functions in one line of code by putting them one after the other (aka "chaining"): 157 | 158 | # In[14]: 159 | 160 | 161 | salary_selected.columns = salary_selected.columns.str.lower().str.replace(' ', '_') 162 | 163 | salary_selected.columns 164 | 165 | 166 | # Let's use `head()` to visually inspect the first five rows of `salary_selected`: 167 | 168 | # In[15]: 169 | 170 | 171 | print(salary_selected.head()) 172 | 173 | 174 | # Now let's try sorting the data by `total_earnings` using the [`sort_values()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) function in `pandas`: 175 | 176 | # In[16]: 177 | 178 | 179 | salary_sort = salary_selected.sort_values('total_earnings') 180 | 181 | 182 | # We can use `head()` to visually inspect `salary_sort`: 183 | 184 | # In[17]: 185 | 186 | 187 | print(salary_sort.head()) 188 | 189 | 190 | # At first glance, it looks okay. The employees appear to be sorted by `total_earnings` from lowest to highest. If this were the case, we'd expect the last row of the `salary_sort` DataFrame to contain the employee with the highest salary. Let's take a look at the last five rows using `tail()`. 191 | 192 | # In[18]: 193 | 194 | 195 | print(salary_sort.tail()) 196 | 197 | 198 | # **What went wrong?** 199 | # 200 | # The problem is that there are non-numeric characters, `,` and `$`, in the `total_earnings` column.
We can see with [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html), which returns the data type of each column in the DataFrame, that `total_earnings` is recognized as an "object". 201 | 202 | # In[19]: 203 | 204 | 205 | salary_selected.dtypes 206 | 207 | 208 | # [Here](http://pbpython.com/pandas_dtypes.html) is an overview of `pandas` data types. Basically, being labeled an "object" means that the column is not being recognized as containing numbers. 209 | 210 | # We need to find the `,` and `$` in `total_earnings` and remove them. The `str.replace()` function, which we used above when renaming the columns, lets us do this. 211 | # 212 | # Let's start by removing the comma and write the result to the original column. (The format for calling a column from a DataFrame in `pandas` is `DataFrame['column_name']`) 213 | 214 | # In[20]: 215 | 216 | 217 | salary_selected['total_earnings'] = salary_selected['total_earnings'].str.replace(',', '') 218 | 219 | 220 | # Using `head()` to visually inspect `salary_selected`, we see that the commas are gone: 221 | 222 | # In[21]: 223 | 224 | 225 | print(salary_selected.head()) # this works - the commas are gone 226 | 227 | 228 | # Let's do the same thing with the dollar sign `$`. One wrinkle: `$` is a special character in regular expressions, so we pass `regex = False` (available in `pandas` 0.23 and later) to make sure it's replaced literally: 229 | 230 | # In[22]: 231 | 232 | 233 | salary_selected['total_earnings'] = salary_selected['total_earnings'].str.replace('$', '', regex = False) 234 | 235 | 236 | # Using `head()` to visually inspect `salary_selected`, we see that the dollar signs are gone: 237 | 238 | # In[23]: 239 | 240 | 241 | salary_selected.head() 242 | 243 | 244 | # **Now can we use `sort_values()` to sort the data by `total_earnings`?** 245 | 246 | # In[24]: 247 | 248 | 249 | salary_sort = salary_selected.sort_values('total_earnings') 250 | 251 | salary_sort.head() 252 | 253 | 254 | # In[25]: 255 | 256 | 257 | salary_sort.tail() 258 | 259 | 260 | # Again, at first glance, the employees appear to be sorted by `total_earnings` from lowest to highest.
But that would imply that John M. Bresnahan was the highest-paid employee, making 99,997.38 dollars in 2016, while the *Boston Globe* [story](https://www.bostonglobe.com/metro/2017/02/14/police-detective-tops-boston-payroll-with-total-over/6PaXwTAHZGEW5djgwCJuTI/story.html) said the highest-paid city employee made more than 403,000 dollars. 261 | 262 | # **What's the problem?** 263 | # 264 | # Again, we can use `dtypes` to check on how the `total_earnings` variable is encoded. 265 | 266 | # In[26]: 267 | 268 | 269 | salary_sort.dtypes 270 | 271 | 272 | # It's still an "object" (still not numeric), because we didn't tell `pandas` that it should be numeric. We can do this with `pd.to_numeric()`: 273 | 274 | # In[27]: 275 | 276 | 277 | salary_sort['total_earnings'] = pd.to_numeric(salary_sort['total_earnings']) 278 | 279 | 280 | # Now let's run `dtypes` again: 281 | 282 | # In[28]: 283 | 284 | 285 | salary_sort.dtypes 286 | 287 | 288 | # "float64" means ["floating point numbers"](http://pbpython.com/pandas_dtypes.html) — this is what we want. 289 | 290 | # Now let's sort using `sort_values()`. 291 | 292 | # In[29]: 293 | 294 | 295 | salary_sort = salary_sort.sort_values('total_earnings') 296 | 297 | salary_sort.head() # ascending order by default 298 | 299 | 300 | # One last thing: we have to specify `ascending = False` within `sort_values()` because the function by default sorts the data in ascending order. 301 | 302 | # In[30]: 303 | 304 | 305 | salary_sort = salary_sort.sort_values('total_earnings', ascending = False) 306 | 307 | salary_sort.head() # descending order 308 | 309 | 310 | # We see that Waiman Lee from the Boston PD is the top earner, with total earnings of more than $403,408 in 2016, just as the *Boston Globe* [article](https://www.bostonglobe.com/metro/2017/02/14/police-detective-tops-boston-payroll-with-total-over/6PaXwTAHZGEW5djgwCJuTI/story.html) states. 311 | 312 | # A bonus thing: maybe it bothers you that the numbers next to each row are no longer in any numeric order.
This is because these numbers are the row index of the DataFrame — basically the order that they were in prior to being sorted. In order to reset these numbers, we can use the [`reset_index()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html) function on the `salary_sort` DataFrame. We include `drop = True` as a parameter of the function to prevent the old index from being added as a column in the DataFrame. 313 | 314 | # In[31]: 315 | 316 | 317 | salary_sort = salary_sort.reset_index(drop = True) 318 | 319 | salary_sort.head() # index is reset 320 | 321 | 322 | # The Boston Police Department has a lot of high earners. We can figure out the average earnings by department, which we'll call `salary_average`, by using the `groupby` and `mean()` functions in `pandas`. 323 | 324 | # In[32]: 325 | 326 | 327 | salary_average = salary_sort.groupby('department_name').mean() # note: newer pandas versions may require mean(numeric_only = True) 328 | 329 | 330 | # In[33]: 331 | 332 | 333 | salary_average 334 | 335 | 336 | 337 | 338 | # Notice that `pandas` by default sets the `department_name` column as the row index of the `salary_average` DataFrame. I personally don't love this and would rather have a straight-up DataFrame with the row numbers as the index, so I usually run `reset_index()` to get rid of this indexing: 339 | 340 | # In[34]: 341 | 342 | 343 | salary_average = salary_average.reset_index() # reset_index 344 | 345 | salary_average 346 | 347 | 348 | # We should also rename the `total_earnings` column to `dept_average` to avoid confusion. We can do this using [`rename()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html). The syntax for `rename()` is `DataFrame.rename(columns = {'current column name':'new column name'})`.
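As a quick, self-contained illustration of that `rename()` syntax before we apply it to the real data (the two-column DataFrame here is invented for the example, not the Boston salary file):

```python
import pandas as pd

# a tiny invented DataFrame (not the Boston salary data)
toy = pd.DataFrame({'total_earnings': [100.0, 200.0],
                    'department_name': ['A', 'B']})

# rename() returns a *new* DataFrame, so assign the result back
# (or the original would be left unchanged)
toy = toy.rename(columns = {'total_earnings': 'dept_average'})

print(toy.columns.tolist())  # ['dept_average', 'department_name']
```

One thing to watch out for: keys that don't match an existing column are silently ignored by default, so a typo in the old name does nothing. Recent `pandas` versions accept `errors = 'raise'` to catch this.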
349 | 350 | # In[35]: 351 | 352 | 353 | salary_average = salary_average.rename(columns = {'total_earnings': 'dept_average'}) 354 | 355 | 356 | # In[36]: 357 | 358 | 359 | salary_average 360 | 361 | 362 | # We can find the row for the Boston Police Department by selecting on `department_name`. Find out more about selecting based on attributes [here](https://chrisalbon.com/python/data_wrangling/pandas_selecting_rows_on_conditions/). 363 | 364 | # In[37]: 365 | 366 | 367 | salary_average[salary_average['department_name'] == 'Boston Police Department'] 368 | 369 | 370 | # Now is a good time to revisit "chaining." Notice how we did three things in creating `salary_average`: 371 | # 1. Grouped the `salary_sort` DataFrame by `department_name` and calculated the mean of the numeric columns (in our case, `total_earnings`) using `groupby()` and `mean()`. 372 | # 2. Used `reset_index()` on the resulting DataFrame so that `department_name` would no longer be the row index. 373 | # 3. Renamed the `total_earnings` column to `dept_average` to avoid confusion using `rename()`. 374 | # 375 | # In fact, we can do these three things all at once, by chaining the functions together: 376 | 377 | # In[38]: 378 | 379 | 380 | salary_sort.groupby('department_name').mean().reset_index().rename(columns = {'total_earnings':'dept_average'}) 381 | 382 | 383 | # That's a pretty long line of code. To make it more readable, we can split it up into separate lines. I like to do this by putting the whole expression in parentheses and splitting it up right before each of the functions, which are delineated by the periods: 384 | 385 | # In[39]: 386 | 387 | 388 | (salary_sort.groupby('department_name') 389 | .mean() 390 | .reset_index() 391 | .rename(columns = {'total_earnings':'dept_average'})) 392 | 393 | 394 | # ## 2. Merging datasets 395 | 396 | # Now we have two main datasets, `salary_sort` (the salary for each person, sorted from high to low) and `salary_average` (the average salary for each department).
What if I wanted to merge these two together, so I could see side-by-side each person's salary compared to the average for their department? 397 | # 398 | # We want to join by the `department_name` variable, since that is consistent across both datasets. Let's put the merged data into a new DataFrame, `salary_merged`: 399 | 400 | # In[40]: 401 | 402 | 403 | salary_merged = pd.merge(salary_sort, salary_average, on = 'department_name') 404 | 405 | 406 | # Now we can see the department average, `dept_average`, next to the individual's salary, `total_earnings`: 407 | 408 | # In[41]: 409 | 410 | 411 | salary_merged.head() 412 | 413 | 414 | # ## 3. Reshaping data 415 | 416 | # Here's a dataset on unemployment rates by country from 2012 to 2016, from the International Monetary Fund's World Economic Outlook database (available [here](https://www.imf.org/external/pubs/ft/weo/2017/01/weodata/index.aspx)). 417 | # 418 | # When you download the dataset, it comes in an Excel file. We can use the `pd.read_excel()` function from `pandas` to load the file into Python. 419 | 420 | # In[42]: 421 | 422 | 423 | unemployment = pd.read_excel('unemployment.xlsx') 424 | unemployment.head() 425 | 426 | 427 | # You'll notice if you open the `unemployment.xlsx` file in Excel that cells that do not have data (like Argentina in 2015) are labeled with "n/a". A nice feature of `pd.read_excel()` is that it recognizes these cells as NaN ("not a number," or Python's way of encoding missing values), by default. If we wanted to, we could explicitly tell pandas that missing values were labeled "n/a" using `na_values = 'n/a'` within the `pd.read_excel()` function: 428 | 429 | # In[43]: 430 | 431 | 432 | unemployment = pd.read_excel('unemployment.xlsx', na_values = 'n/a') 433 | 434 | 435 | # Right now, the data are in what's commonly referred to as "wide" format, meaning the variables (unemployment rate for each year) are spread across columns.
This might be good for presentation, but it's not great for certain calculations or graphing. "Wide" format data also becomes confusing if other variables are added. 436 | # 437 | # We need to change the format from "wide" to "long," meaning that the columns (`2012`, `2013`, `2014`, `2015`, `2016`) will be converted into a new variable, which we'll call `Year`, with repeated values for each country. And the unemployment rates will be put into a new variable, which we'll call `Rate_Unemployed`. 438 | 439 | # To do this, we'll use the [`pd.melt()`](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.melt.html) function in `pandas` to create a new DataFrame, `unemployment_long`. 440 | 441 | # In[44]: 442 | 443 | 444 | unemployment_long = pd.melt(unemployment, # data to reshape 445 | id_vars = 'Country', # identifier variable 446 | var_name = 'Year', # name of the new column made from the old column headers 447 | value_name = 'Rate_Unemployed') # the values of interest 448 | 449 | 450 | # Inspecting `unemployment_long` using `head()` shows that we have successfully created a long dataset. 451 | 452 | # In[45]: 453 | 454 | 455 | unemployment_long.head() 456 | 457 | 458 | # ## 4. Calculating year-over-year change in panel data 459 | 460 | # Sort the data by `Country` and `Year` using the `sort_values()` function: 461 | 462 | # In[46]: 463 | 464 | 465 | unemployment_long = unemployment_long.sort_values(['Country', 'Year']) 466 | 467 | unemployment_long.head() 468 | 469 | 470 | # Again, we can use `reset_index(drop = True)` to reset the row index so that the numbers next to the rows are in sequential order. 471 | 472 | # In[47]: 473 | 474 | 475 | unemployment_long = unemployment_long.reset_index(drop = True) 476 | 477 | unemployment_long.head() 478 | 479 | 480 | # This type of data is known in time-series analysis as a panel; each country is observed every year from 2012 to 2016.
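(An aside before moving on: the wide-to-long reshape also works in reverse. `pivot()` is the inverse of `melt()` — a minimal sketch on a toy panel with invented country names and rates, using the same column names as `unemployment_long`:)

```python
import pandas as pd

# invented long-format panel mimicking unemployment_long's columns
long_df = pd.DataFrame({'Country': ['A', 'A', 'B', 'B'],
                        'Year': [2015, 2016, 2015, 2016],
                        'Rate_Unemployed': [5.0, 4.5, 7.2, 7.0]})

# pivot() spreads the 'Year' values back out into one column per year
wide_df = long_df.pivot(index = 'Country', columns = 'Year',
                        values = 'Rate_Unemployed')

print(wide_df.loc['A', 2016])  # 4.5
```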
481 | # 482 | # For Albania, the percentage point change in unemployment rate from 2012 to 2013 would be 16 - 13.4 = 2.6 percentage points. What if I wanted the year-over-year change in unemployment rate for every country? 483 | # 484 | # We can use the [`diff()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.diff.html) function in `pandas` to do this. We can use `diff()` to calculate the difference between the `Rate_Unemployed` that year and the `Rate_Unemployed` for the year prior (the default for `diff()` is 1 period, which is good for us since we want the change from the previous year). We will save this difference into a new variable, `Change`. 485 | 486 | # In[48]: 487 | 488 | 489 | unemployment_long['Change'] = unemployment_long.Rate_Unemployed.diff() 490 | 491 | 492 | # Let's inspect the first five rows again, using `head()`: 493 | 494 | # In[49]: 495 | 496 | 497 | unemployment_long.head() 498 | 499 | 500 | # So far so good. It also makes sense that Albania's `Change` is `NaN` in 2012, since the dataset doesn't contain any unemployment figures before the year 2012. 501 | # 502 | # But a closer inspection of the data reveals a problem. What if we used `tail()` to look at the *last* 5 rows of the data? 503 | 504 | # In[50]: 505 | 506 | 507 | unemployment_long.tail() 508 | 509 | 510 | # **Why does Vietnam have a -18.493 percentage point change in 2012?** 511 | 512 | # (Hint: use `tail()` to look at the last 6 rows of the data. Plain `diff()` subtracts the previous *row*, so Vietnam's 2012 rate was compared against the last row of the country just above it. Grouping by `Country` first keeps each difference within a single country.) 513 | 514 | # In[51]: 515 | 516 | 517 | unemployment_long['Change'] = (unemployment_long 518 | .groupby('Country') 519 | .Rate_Unemployed.diff()) 520 | 521 | unemployment_long.tail() 522 | 523 | 524 | # (Also notice how I put the entire expression in parentheses and put each function on a different line for readability.) 525 | 526 | # ## 5. Recoding numerical variables into categorical ones 527 | 528 | # Here's a list of some attendees for the 2016 workshop, with names and contact info removed.
529 | 530 | # In[52]: 531 | 532 | 533 | attendees = pd.read_csv('attendees.csv') 534 | 535 | attendees.head() 536 | 537 | 538 | # **What if we wanted to quickly see the age distribution of attendees?** 539 | 540 | # In[53]: 541 | 542 | 543 | attendees['Age group'].value_counts() 544 | 545 | 546 | # There's an inconsistency in the labeling of the `Age group` variable here: `'30 - 39'` (with spaces) and `'30-39'` are counted as separate categories. We can fix this using `np.where()` in the `numpy` library. First, let's import the `numpy` library. Like `pandas`, `numpy` has a commonly used alias, `np`. 547 | 548 | # In[54]: 549 | 550 | 551 | import numpy as np 552 | 553 | 554 | # In[55]: 555 | 556 | 557 | attendees['Age group'] = np.where(attendees['Age group'] == '30 - 39', # condition to check 558 | '30-39', # value to use where the condition is True 559 | attendees['Age group']) # otherwise, keep attendees['Age group'] values the same 560 | 561 | 562 | # This might seem trivial for just one value, but it's useful for larger datasets. 563 | 564 | # In[56]: 565 | 566 | 567 | attendees['Age group'].value_counts() 568 | 569 | 570 | # Now let's take a look at the professional status of attendees, stored in the `Choose your status:` column: 571 | 572 | # In[57]: 573 | 574 | 575 | attendees['Choose your status:'].value_counts() 576 | 577 | 578 | # "Nonprofit, Academic, Government" and "Nonprofit, Academic, Government Early Bird" seem to be the same. We can use `np.where()` (and the `|` operator, which works as an element-wise "or") to combine these two categories into one big category, "Nonprofit/Gov". Let's create a new variable, `status`, for our simplified categorization. 579 | # 580 | # Notice the extra sets of parentheses around the two conditions linked by the `|` symbol.
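# A quick aside on why `|` (and the parentheses) are needed: `|` performs an element-wise "or" on boolean Series, while Python's plain `or` fails on a Series because a Series has no single truth value. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'c'])

# Element-wise "or": True wherever either condition holds
mask = (s == 'a') | (s == 'c')
print(np.where(mask, 'keep', 'other'))  # ['keep' 'other' 'keep']

# Plain `or` raises an error instead, since a Series is ambiguous as a single boolean
try:
    (s == 'a') or (s == 'c')
except ValueError as e:
    print('ValueError:', e)
```

# The parentheses matter because `|` binds more tightly than `==`; without them, the comparison would be evaluated in the wrong order.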
581 | 582 | # In[58]: 583 | 584 | 585 | attendees['status'] = np.where((attendees['Choose your status:'] == 'Nonprofit, Academic, Government') | 586 | (attendees['Choose your status:'] == 'Nonprofit, Academic, Government Early Bird'), 587 | 'Nonprofit/Gov', 588 | attendees['Choose your status:']) 589 | 590 | 591 | # In[59]: 592 | 593 | 594 | attendees['status'].value_counts() 595 | 596 | 597 | # ## What else? 598 | # 599 | # - How would you create a new variable in the `attendees` data (let's call it `status2`) that has just two categories, "Student" and "Other"? 600 | # 601 | # - How would you rename the variables in the `attendees` data to make them easier to work with? 602 | # 603 | # - What are some other issues with this dataset? How would you solve them using what we've learned? 604 | # 605 | # - What are some other "messy" data issues you've encountered? 606 | -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # Tricks for cleaning your data in Python using pandas 2 | 3 | In 2017 I gave a talk called "Tricks for cleaning your data in R" at the [Data+Narrative workshop](http://www.bu.edu/com/data-narrative/) at Boston University. The [repo with the code and data](https://github.com/underthecurve/r-data-cleaning-tricks) I used for the talk was pretty well-received, so I figured I'd try to do some of the same stuff in Python using `pandas`. 4 | 5 | **Disclaimer:** when it comes to data stuff, I'm much better with R, especially the `tidyverse` set of packages, than with Python, but in my last job I used Python's `pandas` library to do a lot of data processing since Python was the dominant language there. Please feel free to let me know if there are better ways to do things!
6 | 7 | ## Links to install Python, pandas and Jupyter notebook 8 | 9 | * [Python](https://www.python.org/downloads/): website for Python 10 | * [pandas](https://pandas.pydata.org/): website for the pandas library 11 | * [Jupyter](http://jupyter.org/): website for Project Jupyter, whose interactive notebook this tutorial was written in 12 | 13 | ## Files included 14 | 15 | ### Annotated code and step-by-step instructions for the workshop 16 | * [pandas-data-cleaning-tricks.ipynb](https://github.com/underthecurve/pandas-data-cleaning-tricks/blob/master/pandas-data-cleaning-tricks.ipynb): Jupyter notebook file (for viewing on the web - Desktop only) 17 | * [pandas-data-cleaning-tricks.pdf](https://github.com/underthecurve/pandas-data-cleaning-tricks/blob/master/pandas-data-cleaning-tricks.pdf): PDF file (for printing out) 18 | 19 | ### Python code 20 | * [pandas-data-cleaning-tricks.py](https://github.com/underthecurve/pandas-data-cleaning-tricks/blob/master/pandas-data-cleaning-tricks.py): the straight-up Python code, with annotations commented out 21 | 22 | ### Underlying data needed to run the Python code 23 | * [employee-earnings-report-2016.csv](https://github.com/underthecurve/pandas-data-cleaning-tricks/blob/master/employee-earnings-report-2016.csv): data on earnings for Boston's municipal employees, from the city's [open data portal](https://data.boston.gov/dataset/employee-earnings-report) 24 | * [unemployment.xlsx](https://github.com/underthecurve/pandas-data-cleaning-tricks/blob/master/unemployment.xlsx): data on global unemployment rates from 2012 to 2016, from the [International Monetary Fund](https://www.imf.org/external/pubs/ft/weo/2017/01/weodata/index.aspx) 25 | * [attendees.csv](https://github.com/underthecurve/pandas-data-cleaning-tricks/blob/master/attendees.csv): data on some attendees of the 2017 Data+Narrative workshop, with names and identifying information removed 26 | 27 | ## How to follow this tutorial 28 | 29 | * You can clone or download
this repository by clicking on the green button above, "Clone or download" 30 | * Follow along by reading the `.ipynb` file online or printing the `.pdf` file out by clicking the Github links above 31 | 32 | ## Questions / Feedback? 33 | 34 | ychristinezhang at gmail dot com 35 | 36 | or on Twitter 37 | 38 | [@christinezhang](https://twitter.com/christinezhang) 39 | 40 | Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.