├── .gitignore ├── attendees.csv ├── employee-earnings-report-2016.csv ├── pandas-data-cleaning-tricks.ipynb ├── pandas-data-cleaning-tricks.pdf ├── pandas-data-cleaning-tricks.py ├── readme.md └── unemployment.xlsx /.gitignore: -------------------------------------------------------------------------------- 1 | *.ipynb_checkpoints -------------------------------------------------------------------------------- /attendees.csv: -------------------------------------------------------------------------------- 1 | Occupation,Job title,Age group,Gender,State/Province,Education,Which data subject area are you most interested in working with? (Select up to three),What do you hope to get out of the workshop?,Which type of laptop will you bring?,College or University Name,Major or Concentration ,College Year,Which Digital Badge track best suits you?,Which session would you like to attend?,Choose your status: 2 | Data Analyst,Data Quality Analyst,30-39,Male,MA,Bachelor's Degree,Retail,other,PC,,,,Advanced Data Storytelling,June 5-9,"Nonprofit, Academic, Government" 3 | PhD Student,Student/Research Assistant,18-29,Male,MA,Bachelor's Degree,Sports,Master Advanced R,PC,Boston University,Biostatistics,PhD,Advanced Data Storytelling,June 5-9,Student 4 | Education,Data Analyst,18-29,Female,Kentucky,Master's Degree,Retail,other,PC,,,,Advanced Data Storytelling,June 5-9,"Nonprofit, Academic, Government" 5 | Manager,BAS Manager,30-39,Male,MA,Bachelor's Degree,Education,Pick up Beginning R And SQL,PC,Boston University,PEMBA,Graduate,Advanced Data Storytelling,June 5-9,Student 6 | Government Finance,Performance Analyst,30 - 39,Male,MA,Master's Degree,"Environment, Finance, Food and agriculture",Pick up Beginning R And SQL,MAC,,,,Advanced Data Storytelling,June 5-9,"Nonprofit, Academic, Government Early Bird" 7 | Engineer,Display Engineer,30-39, Female,MA,Bachelor's Degree,"Environment, Finance, Food and Agriculture","Explore the field of data storytelling, including career options, 
Improve my ability to write with numbers, Acquire data visualization skills, Effectively clean and standardize data, Pick up Beginning R And SQL, Master Advanced R",Advanced Data Storytelling,,,,Advanced Data Storytelling,June 5-9,Professional 8 | self-employed,CEO and founder,30-39, Female,MA,Master's Degree,"Criminal justice, Education, Environment","Improve my ability to write with numbers, Acquire data visualization skills, Analyze data better, Effectively clean and standardize data, Pick up Beginning R And SQL",PC,,,,Advanced Data Storytelling,June 5-9,Professional 9 | Evaluator,"Assistant Director, Stewardship & Donor Relations",30-39,Female,MA,Master's Degree,"Education, Environment, Health care","Analyze data better, Pick up Beginning R And SQL, other",Mac,,,,Advanced Data Storytelling,June 5-9,"Nonprofit, Academic, Government" 10 | Marketing Analytics,Sr. Analytics Consultant,30-39,Female,GA,Master's Degree,"Criminal justice, Finance, Retail","Acquire data visualization skills, Analyze data better, Discover the real story revealed in the data, Utilize spreadsheets for data analysis and visualization, Pick up Beginning R And SQL",PC,,,,Advanced Data Storytelling,June 5-9,Professional 11 | Student,Ph.D. 
Student,30-39,Male,LA,Doctoral or other advanced degree,"Campaign finance, Sports, Retail","Acquire data visualization skills, Analyze data better, Effectively clean and standardize data, Utilize spreadsheets for data analysis and visualization, Pick up Beginning R And SQL, Master Advanced R",Mac,Louisiana State University,Mass Communication,PhD,Advanced Data Storytelling,June 5-9,Student 12 | Student,College graduate,18-29,Male,MA,Bachelor's Degree,"Environment, Food and Agriculture, Health care, Retail","Explore the field of data storytelling, including career options, Improve my ability to write with numbers, Acquire data visualization skills, Analyze data better, Discover the real story revealed in the data, Utilize spreadsheets for data analysis and visualization, Pick up Beginning R And SQL",Mac,Mc Gill,Psychology,Graduate,Advanced Data Storytelling,June 5-9,Student 13 | Student,Student,18-29,Female,MA,Bachelor's Degree,"Environment, Finance, Retail","Explore the field of data storytelling, including career options, Improve my ability to write with numbers, Acquire data visualization skills, Analyze data better, Discover the real story revealed in the data, Effectively clean and standardize data, Utilize spreadsheets for data analysis and visualization, Pick up Beginning R And SQL, Master Advanced R",Mac,Boston University,Accounting and Finance,Senior,Advanced Data Storytelling,June 5-9,Student -------------------------------------------------------------------------------- /employee-earnings-report-2016.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/underthecurve/pandas-data-cleaning-tricks/cd96cd85ac5b941eb077bbf6bc30b51da0e3f29b/employee-earnings-report-2016.csv -------------------------------------------------------------------------------- /pandas-data-cleaning-tricks.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 
| "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tricks for cleaning your data in Python using pandas\n", 8 | "\n", 9 | "**By Christine Zhang ([ychristinezhang at gmail dot com](mailto:ychristinezhang@gmail.com / [@christinezhang](https://twitter.com/christinezhang) | [@christinezhang](https://twitter.com/christinezhang))**" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "GitHub repository for Data+Code: https://github.com/underthecurve/pandas-data-cleaning-tricks\n", 17 | "\n", 18 | "In 2017 I gave a talk called \"Tricks for cleaning your data in R\" which I presented at the [Data+Narrative workshop](http://www.bu.edu/com/data-narrative/) at Boston University. The repo with the code and data, https://github.com/underthecurve/r-data-cleaning-tricks, was pretty well-received, so I figured I'd try to do some of the same stuff in Python using `pandas`.\n", 19 | "\n", 20 | "**Disclaimer:** when it comes to data stuff, I'm much better with R, especially the `tidyverse` set of packages, than with Python, but in my last job I used Python's `pandas` library to do a lot of data processing since Python was the dominant language there.\n", 21 | "\n", 22 | "Anyway, here goes: \n", 23 | "\n", 24 | "Data cleaning is a cumbersome task, and it can be hard to navigate in programming languages like Python. \n", 25 | "\n", 26 | "The `pandas` library in Python is a powerful tool for data cleaning and analysis. By default, it leaves a trail of code that documents all the work you've done, which makes it extremely useful for creating reproducible workflows.\n", 27 | "\n", 28 | "In this workshop, I'll show you some examples of real-life \"messy\" datasets, the problems they present for analysis in Python's `pandas` library, and some of the solutions to these problems.\n", 29 | "\n", 30 | "Fittingly, I'll [start the numbering system at 0](http://python-history.blogspot.com/2013/10/why-python-uses-0-based-indexing.html)." 
31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "## 0. Importing the `pandas` library" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "Here I tell Python to import the `pandas` library as `pd` (a common alias for `pandas` — more on that in the next code chunk)." 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 1, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "import pandas as pd" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "## 1. Finding and replacing non-numeric characters like `,` and `$`" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "Let's check out the city of Boston's [Open Data portal](https://data.boston.gov/), where the local government puts up datasets that are free for the public to analyze.\n", 68 | "\n", 69 | "The [Employee Earnings Report](https://data.boston.gov/dataset/employee-earnings-report) is one of the more interesting ones, because it gives payroll data for every person on the municipal payroll. It's where the *Boston Globe* gets stories like these every year:\n", 70 | "\n", 71 | "- [\"64 City of Boston workers earn more than $250,000\"](https://www.bostonglobe.com/metro/2016/02/05/city-boston-workers-earn-more-than/MvW6RExJZimdrTlwdwUI7M/story.html) (February 6, 2016)\n", 72 | "\n", 73 | "- [\"Police detective tops Boston’s payroll with a total of over $403,000\"](https://www.bostonglobe.com/metro/2017/02/14/police-detective-tops-boston-payroll-with-total-over/6PaXwTAHZGEW5djgwCJuTI/story.html) (February 14, 2017)\n", 74 | "\n", 75 | "Let's take a look at the February 14 story from 2017. 
The story begins:\n", 76 | "\n", 77 | "> \"A veteran police detective took home more than $403,000 in earnings last year, topping the list of Boston’s highest-paid employees in 2016, newly released city payroll data show.\"\n", 78 | "\n", 79 | "**What if we wanted to check this number using the Employee Earnings Report?**" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "We can use the `pandas` function [`pandas.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to load the csv file into Python. We will call this DataFrame `salary`. Remember that I imported `pandas` \"as `pd`\" in the last code chunk. This saves me a bit of typing by allowing me to access `pandas` functions like `pandas.read_csv()` by typing `pd.read_csv()` instead. If I had typed `import pandas` in the code chunk under section `0` without `as pd`, the below code wouldn't work. I'd have to instead write `pandas.read_csv()` to access the function.\n", 87 | "\n", 88 | "The `pd` alias for `pandas` is so common that the library's [documentation](http://pandas.pydata.org/pandas-docs/stable/install.html#running-the-test-suite) even uses it sometimes.\n", 89 | "\n", 90 | "Let's try to use `pd.read_csv()`:" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 2, 96 | "metadata": {}, 97 | "outputs": [ 98 | { 99 | "ename": "UnicodeDecodeError", 100 | "evalue": "'utf-8' codec can't decode byte 0xe9 in position 22: invalid continuation byte", 101 | "output_type": "error", 102 | "traceback": [ 103 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 104 | "\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)", 105 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._convert_tokens\u001b[0;34m()\u001b[0m\n", 106 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in 
\u001b[0;36mpandas._libs.parsers.TextReader._convert_with_dtype\u001b[0;34m()\u001b[0m\n", 107 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._string_convert\u001b[0;34m()\u001b[0m\n", 108 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers._string_box_utf8\u001b[0;34m()\u001b[0m\n", 109 | "\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xe9 in position 22: invalid continuation byte", 110 | "\nDuring handling of the above exception, another exception occurred:\n", 111 | "\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)", 112 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0msalary\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'employee-earnings-report-2016.csv'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 113 | "\u001b[0;32m~/anaconda/envs/ipykernel_py3/lib/python3.6/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36mparser_f\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)\u001b[0m\n\u001b[1;32m 676\u001b[0m skip_blank_lines=skip_blank_lines)\n\u001b[1;32m 677\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 678\u001b[0;31m \u001b[0;32mreturn\u001b[0m 
\u001b[0m_read\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 679\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 680\u001b[0m \u001b[0mparser_f\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__name__\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 114 | "\u001b[0;32m~/anaconda/envs/ipykernel_py3/lib/python3.6/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m 444\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 445\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 446\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mparser\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnrows\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 447\u001b[0m \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 448\u001b[0m \u001b[0mparser\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 115 | "\u001b[0;32m~/anaconda/envs/ipykernel_py3/lib/python3.6/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36mread\u001b[0;34m(self, nrows)\u001b[0m\n\u001b[1;32m 1034\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'skipfooter not supported for iteration'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1035\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1036\u001b[0;31m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnrows\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1037\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1038\u001b[0m \u001b[0;31m# May alter columns / col_dict\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 116 | "\u001b[0;32m~/anaconda/envs/ipykernel_py3/lib/python3.6/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36mread\u001b[0;34m(self, nrows)\u001b[0m\n\u001b[1;32m 1846\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnrows\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1847\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1848\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_reader\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnrows\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1849\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mStopIteration\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1850\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_first_chunk\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 117 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader.read\u001b[0;34m()\u001b[0m\n", 118 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._read_low_memory\u001b[0;34m()\u001b[0m\n", 119 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in 
\u001b[0;36mpandas._libs.parsers.TextReader._read_rows\u001b[0;34m()\u001b[0m\n", 120 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._convert_column_data\u001b[0;34m()\u001b[0m\n", 121 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._convert_tokens\u001b[0;34m()\u001b[0m\n", 122 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._convert_with_dtype\u001b[0;34m()\u001b[0m\n", 123 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._string_convert\u001b[0;34m()\u001b[0m\n", 124 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers._string_box_utf8\u001b[0;34m()\u001b[0m\n", 125 | "\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xe9 in position 22: invalid continuation byte" 126 | ] 127 | } 128 | ], 129 | "source": [ 130 | "salary = pd.read_csv('employee-earnings-report-2016.csv')" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "That's a pretty long and nasty error. Usually when I run into something like this, I start from the bottom and work my way up — in this case, I typed `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 22: invalid continuation byte` into a search engine and came across [this discussion on the Stack Overflow forum](https://stackoverflow.com/questions/30462807/encoding-error-in-panda-read-csv). The last response suggested that adding `encoding ='latin1'` inside the function would fix the problem on Macs (which is the type of computer I have)." 
138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 3, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "salary = pd.read_csv('employee-earnings-report-2016.csv', encoding = 'latin-1')" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "Great! (I don't know much about encoding, but this is something I run into from time to time so I thought it would be helpful to show here.)\n", 154 | "\n", 155 | "We can use `head()` on the `salary` DataFrame to inspect its first five rows. (Note I use the `print()` function to display the output, but you don't need to do this in your own code if you'd prefer not to.)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 4, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | " NAME DEPARTMENT_NAME TITLE \\\n", 168 | "0 Abadi,Kidani A Assessing Department Property Officer \n", 169 | "1 Abasciano,Joseph Boston Police Department Police Officer \n", 170 | "2 Abban,Christopher John Boston Fire Department Fire Fighter \n", 171 | "3 Abbasi,Sophia Green Academy Manager (C) (non-ac) \n", 172 | "4 Abbate-Vaughn,Jorgelina BPS Ellis Elementary Teacher \n", 173 | "\n", 174 | " REGULAR RETRO OTHER OVERTIME INJURED DETAIL \\\n", 175 | "0 $46,291.98 NaN $300.00 NaN NaN NaN \n", 176 | "1 $6,933.66 NaN $850.00 $205.92 $74,331.86 NaN \n", 177 | "2 $103,442.22 NaN $550.00 $15,884.53 NaN $4,746.50 \n", 178 | "3 $18,249.83 NaN NaN NaN NaN NaN \n", 179 | "4 $84,410.28 NaN $1,250.00 NaN NaN NaN \n", 180 | "\n", 181 | " QUINN/EDUCATION INCENTIVE TOTAL EARNINGS POSTAL \n", 182 | "0 NaN $46,591.98 02118 \n", 183 | "1 $15,258.44 $97,579.88 02132 \n", 184 | "2 NaN $124,623.25 02132 \n", 185 | "3 NaN $18,249.83 02148 \n", 186 | "4 NaN $85,660.28 02481 \n" 187 | ] 188 | } 189 | ], 190 | "source": [ 191 | "print(salary.head())" 192 | ] 193 | }, 194 | { 195 | "cell_type": 
"markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "There are a lot of columns. Let's simplify by selecting the ones of interest: `NAME`, `DEPARTMENT_NAME`, and `TOTAL.EARNINGS`. There are [a few different ways](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c) of doing this with `pandas`. The simplest way, imo, is by using the indexing operator `[]`.\n", 199 | "\n", 200 | "For example, I could select a single column, `NAME`: (Note I also run the line `pd.options.display.max_rows = 20` in order to display a maximum of 20 rows so the output isn't too crowded.)" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 5, 206 | "metadata": { 207 | "scrolled": false 208 | }, 209 | "outputs": [ 210 | { 211 | "data": { 212 | "text/plain": [ 213 | "0 Abadi,Kidani A\n", 214 | "1 Abasciano,Joseph\n", 215 | "2 Abban,Christopher John\n", 216 | "3 Abbasi,Sophia\n", 217 | "4 Abbate-Vaughn,Jorgelina\n", 218 | "5 Abberton,James P\n", 219 | "6 Abbott,Erin Elizabeth\n", 220 | "7 Abbott,John R.\n", 221 | "8 Abbruzzese,Angela\n", 222 | "9 Abbruzzese,Donna\n", 223 | " ... 
\n", 224 | "22036 Zuares,David Jonathan\n", 225 | "22037 Zubrin,William W.\n", 226 | "22038 Zuccaro,John E.\n", 227 | "22039 Zucker,Alyse Paige\n", 228 | "22040 Zuckerman,Naomi Julia\n", 229 | "22041 Zukowski III,Charles\n", 230 | "22042 Zuluaga Castro,Juan Pablo\n", 231 | "22043 Zwarich,Maralene Zoann\n", 232 | "22044 Zweig,Susanna B\n", 233 | "22045 Zwerdling,Laura\n", 234 | "Name: NAME, Length: 22046, dtype: object" 235 | ] 236 | }, 237 | "execution_count": 5, 238 | "metadata": {}, 239 | "output_type": "execute_result" 240 | } 241 | ], 242 | "source": [ 243 | "pd.options.display.max_rows = 20\n", 244 | "\n", 245 | "salary['NAME']" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "This works for selecting one column at a time, but using `[]` returns a [Series](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series), not a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe). I can confirm this using the `type()` function:" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 6, 258 | "metadata": {}, 259 | "outputs": [ 260 | { 261 | "data": { 262 | "text/plain": [ 263 | "pandas.core.series.Series" 264 | ] 265 | }, 266 | "execution_count": 6, 267 | "metadata": {}, 268 | "output_type": "execute_result" 269 | } 270 | ], 271 | "source": [ 272 | "type(salary['NAME'])" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "If I want a DataFrame, I have to use double brackets:" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 7, 285 | "metadata": {}, 286 | "outputs": [ 287 | { 288 | "data": { 289 | "text/html": [ 290 | "
\n", 291 | "\n", 304 | "\n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | "
NAME
0Abadi,Kidani A
1Abasciano,Joseph
2Abban,Christopher John
3Abbasi,Sophia
4Abbate-Vaughn,Jorgelina
5Abberton,James P
6Abbott,Erin Elizabeth
7Abbott,John R.
8Abbruzzese,Angela
9Abbruzzese,Donna
......
22036Zuares,David Jonathan
22037Zubrin,William W.
22038Zuccaro,John E.
22039Zucker,Alyse Paige
22040Zuckerman,Naomi Julia
22041Zukowski III,Charles
22042Zuluaga Castro,Juan Pablo
22043Zwarich,Maralene Zoann
22044Zweig,Susanna B
22045Zwerdling,Laura
\n", 398 | "

22046 rows × 1 columns

\n", 399 | "
" 400 | ], 401 | "text/plain": [ 402 | " NAME\n", 403 | "0 Abadi,Kidani A\n", 404 | "1 Abasciano,Joseph\n", 405 | "2 Abban,Christopher John\n", 406 | "3 Abbasi,Sophia\n", 407 | "4 Abbate-Vaughn,Jorgelina\n", 408 | "5 Abberton,James P\n", 409 | "6 Abbott,Erin Elizabeth\n", 410 | "7 Abbott,John R.\n", 411 | "8 Abbruzzese,Angela\n", 412 | "9 Abbruzzese,Donna\n", 413 | "... ...\n", 414 | "22036 Zuares,David Jonathan\n", 415 | "22037 Zubrin,William W.\n", 416 | "22038 Zuccaro,John E.\n", 417 | "22039 Zucker,Alyse Paige\n", 418 | "22040 Zuckerman,Naomi Julia\n", 419 | "22041 Zukowski III,Charles\n", 420 | "22042 Zuluaga Castro,Juan Pablo\n", 421 | "22043 Zwarich,Maralene Zoann\n", 422 | "22044 Zweig,Susanna B\n", 423 | "22045 Zwerdling,Laura\n", 424 | "\n", 425 | "[22046 rows x 1 columns]" 426 | ] 427 | }, 428 | "execution_count": 7, 429 | "metadata": {}, 430 | "output_type": "execute_result" 431 | } 432 | ], 433 | "source": [ 434 | "salary[['NAME']]" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 8, 440 | "metadata": {}, 441 | "outputs": [ 442 | { 443 | "data": { 444 | "text/plain": [ 445 | "pandas.core.frame.DataFrame" 446 | ] 447 | }, 448 | "execution_count": 8, 449 | "metadata": {}, 450 | "output_type": "execute_result" 451 | } 452 | ], 453 | "source": [ 454 | "type(salary[['NAME']])" 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "metadata": {}, 460 | "source": [ 461 | "To select multiple columns, we can put those columns inside of the second pair of brackets. We will save this into a new DataFrame, `salary_selected`. We type `.copy()` after `salary[['NAME','DEPARTMENT_NAME', 'TOTAL EARNINGS']]` because we are making a copy of the DataFrame and assigning it to new DataFrame. Learn more about `copy()` [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html)." 
462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": 9, 467 | "metadata": {}, 468 | "outputs": [], 469 | "source": [ 470 | "salary_selected = salary[['NAME','DEPARTMENT_NAME', 'TOTAL EARNINGS']].copy()" 471 | ] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "metadata": {}, 476 | "source": [ 477 | "We can also change the column names to lowercase names for easier typing. First, let's take a look at the columns by displaying the `columns` attribute of the `salary_selected` DataFrame." 478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": 10, 483 | "metadata": {}, 484 | "outputs": [ 485 | { 486 | "data": { 487 | "text/plain": [ 488 | "Index(['NAME', 'DEPARTMENT_NAME', 'TOTAL EARNINGS'], dtype='object')" 489 | ] 490 | }, 491 | "execution_count": 10, 492 | "metadata": {}, 493 | "output_type": "execute_result" 494 | } 495 | ], 496 | "source": [ 497 | "salary_selected.columns" 498 | ] 499 | }, 500 | { 501 | "cell_type": "code", 502 | "execution_count": 11, 503 | "metadata": {}, 504 | "outputs": [ 505 | { 506 | "data": { 507 | "text/plain": [ 508 | "pandas.core.indexes.base.Index" 509 | ] 510 | }, 511 | "execution_count": 11, 512 | "metadata": {}, 513 | "output_type": "execute_result" 514 | } 515 | ], 516 | "source": [ 517 | "type(salary_selected.columns)" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": {}, 523 | "source": [ 524 | "Notice how this returns something called an \"Index.\" In `pandas`, DataFrames have both row indexes (in our case, the row number, starting from 0 and going to 22045) and column indexes. We can use the [`str.lower()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.lower.html) function to convert the strings (aka characters) in the index to lowercase." 
525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": 12, 530 | "metadata": {}, 531 | "outputs": [ 532 | { 533 | "data": { 534 | "text/plain": [ 535 | "Index(['name', 'department_name', 'total earnings'], dtype='object')" 536 | ] 537 | }, 538 | "execution_count": 12, 539 | "metadata": {}, 540 | "output_type": "execute_result" 541 | } 542 | ], 543 | "source": [ 544 | "salary_selected.columns = salary_selected.columns.str.lower()\n", 545 | "\n", 546 | "salary_selected.columns" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "metadata": {}, 552 | "source": [ 553 | "Another thing that would make our lives easier is if the `total earnings` column didn't have a space between `total` and `earnings`. We can use a \"string replace\" function, [`str.replace()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html), to replace the space with an underscore, remembering to assign the result back to the columns. The syntax is: `str.replace('thing you want to replace', 'what to replace it with')` " 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": 13, 559 | "metadata": {}, 560 | "outputs": [ 561 | { 562 | "data": { 563 | "text/plain": [ 564 | "Index(['name', 'department_name', 'total_earnings'], dtype='object')" 565 | ] 566 | }, 567 | "execution_count": 13, 568 | "metadata": {}, 569 | "output_type": "execute_result" 570 | } 571 | ], 572 | "source": [ 573 | "salary_selected.columns = salary_selected.columns.str.replace(' ', '_')\n", 574 | "\n", 575 | "salary_selected.columns" 576 | ] 577 | }, 578 | { 579 | "cell_type": "markdown", 580 | "metadata": {}, 581 | "source": [ 582 | "We could have used both the `str.lower()` and `str.replace()` functions in one line of code by putting them one after the other (aka \"chaining\"):" 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": 14, 588 | "metadata": {}, 589 | "outputs": [ 590 | { 591 | "data": { 592 | "text/plain": [ 593 | "Index(['name', 'department_name', 'total_earnings'], dtype='object')" 
594 | ] 595 | }, 596 | "execution_count": 14, 597 | "metadata": {}, 598 | "output_type": "execute_result" 599 | } 600 | ], 601 | "source": [ 602 | "salary_selected.columns = salary_selected.columns.str.lower().str.replace(' ', '_') \n", 603 | "\n", 604 | "salary_selected.columns" 605 | ] 606 | }, 607 | { 608 | "cell_type": "markdown", 609 | "metadata": {}, 610 | "source": [ 611 | "Let's use `head()` to visually inspect the first five rows of `salary_selected`:" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": 15, 617 | "metadata": {}, 618 | "outputs": [ 619 | { 620 | "name": "stdout", 621 | "output_type": "stream", 622 | "text": [ 623 | " name department_name total_earnings\n", 624 | "0 Abadi,Kidani A Assessing Department $46,591.98\n", 625 | "1 Abasciano,Joseph Boston Police Department $97,579.88\n", 626 | "2 Abban,Christopher John Boston Fire Department $124,623.25\n", 627 | "3 Abbasi,Sophia Green Academy $18,249.83\n", 628 | "4 Abbate-Vaughn,Jorgelina BPS Ellis Elementary $85,660.28\n" 629 | ] 630 | } 631 | ], 632 | "source": [ 633 | "print(salary_selected.head()) " 634 | ] 635 | }, 636 | { 637 | "cell_type": "markdown", 638 | "metadata": {}, 639 | "source": [ 640 | "Now let's try sorting the data by `total_earnings` using the [`sort_values()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) function in `pandas`:" 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": 16, 646 | "metadata": {}, 647 | "outputs": [], 648 | "source": [ 649 | "salary_sort = salary_selected.sort_values('total_earnings')" 650 | ] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "metadata": {}, 655 | "source": [ 656 | "We can use `head()` to visually inspect `salary_sort`:" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": 17, 662 | "metadata": {}, 663 | "outputs": [ 664 | { 665 | "name": "stdout", 666 | "output_type": "stream", 667 | "text": [ 668 | " name 
department_name total_earnings\n", 669 | "11146 Lally,Bernadette Boston City Council $1,000.00\n", 670 | "7104 Fowlkes,Lorraine E. Boston City Council $1,000.00\n", 671 | "15058 Nolan,Andrew Parks Department $1,000.00\n", 672 | "21349 White-Pilet,Yoni A BPS Substitute Teachers/Nurs $1,006.53\n", 673 | "5915 Dunn,Lori D BPS East Boston High $1,010.05\n" 674 | ] 675 | } 676 | ], 677 | "source": [ 678 | "print(salary_sort.head())" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "At first glance, it looks okay: the employees appear to be sorted by `total_earnings` from lowest to highest. If the sort really worked, we'd expect the last row of the `salary_sort` DataFrame to contain the employee with the highest earnings. Let's take a look at the last five rows using `tail()`." 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": 18, 691 | "metadata": {}, 692 | "outputs": [ 693 | { 694 | "name": "stdout", 695 | "output_type": "stream", 696 | "text": [ 697 | " name department_name total_earnings\n", 698 | "13303 McGrath,Caitlin BPS Substitute Teachers/Nurs $990.61\n", 699 | "1869 Bradshaw,John E. BPS Substitute Teachers/Nurs $990.62\n", 700 | "21380 Wiggins,Lucas A BPS Substitute Teachers/Nurs $990.63\n", 701 | "15036 Nixon,Chloe BPS Substitute Teachers/Nurs $990.64\n", 702 | "10478 Kassa,Selamawit BPS Substitute Teachers/Nurs $990.64\n" 703 | ] 704 | } 705 | ], 706 | "source": [ 707 | "print(salary_sort.tail())" 708 | ] 709 | }, 710 | { 711 | "cell_type": "markdown", 712 | "metadata": {}, 713 | "source": [ 714 | "**What went wrong?**\n", 715 | "\n", 716 | "The problem is that there are non-numeric characters, `,` and `$`, in the `total_earnings` column. We can see with [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html), which returns the data type of each column in the DataFrame, that `total_earnings` is recognized as an \"object\".
717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": 19, 722 | "metadata": {}, 723 | "outputs": [ 724 | { 725 | "data": { 726 | "text/plain": [ 727 | "name object\n", 728 | "department_name object\n", 729 | "total_earnings object\n", 730 | "dtype: object" 731 | ] 732 | }, 733 | "execution_count": 19, 734 | "metadata": {}, 735 | "output_type": "execute_result" 736 | } 737 | ], 738 | "source": [ 739 | "salary_selected.dtypes" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "[Here](http://pbpython.com/pandas_dtypes.html) is an overview of `pandas` data types. Basically, being labeled an \"object\" means that the column is not being recognized as containing numbers." 747 | ] 748 | }, 749 | { 750 | "cell_type": "markdown", 751 | "metadata": {}, 752 | "source": [ 753 | "We need to find the `,` and `$` in `total_earnings` and remove them. The `str.replace()` method, which we used above when renaming the columns, lets us do this.\n", 754 | "\n", 755 | "Let's start by removing the commas and writing the result back to the original column. 
(The format for calling a column from a DataFrame in `pandas` is `DataFrame['column_name']`)" 756 | ] 757 | }, 758 | { 759 | "cell_type": "code", 760 | "execution_count": 20, 761 | "metadata": {}, 762 | "outputs": [], 763 | "source": [ 764 | "salary_selected['total_earnings'] = salary_selected['total_earnings'].str.replace(',', '')" 765 | ] 766 | }, 767 | { 768 | "cell_type": "markdown", 769 | "metadata": {}, 770 | "source": [ 771 | "Using `head()` to visually inspect `salary_selected`, we see that the commas are gone:" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 21, 777 | "metadata": {}, 778 | "outputs": [ 779 | { 780 | "name": "stdout", 781 | "output_type": "stream", 782 | "text": [ 783 | " name department_name total_earnings\n", 784 | "0 Abadi,Kidani A Assessing Department $46591.98\n", 785 | "1 Abasciano,Joseph Boston Police Department $97579.88\n", 786 | "2 Abban,Christopher John Boston Fire Department $124623.25\n", 787 | "3 Abbasi,Sophia Green Academy $18249.83\n", 788 | "4 Abbate-Vaughn,Jorgelina BPS Ellis Elementary $85660.28\n" 789 | ] 790 | } 791 | ], 792 | "source": [ 793 | "print(salary_selected.head()) # this works - the commas are gone" 794 | ] 795 | }, 796 | { 797 | "cell_type": "markdown", 798 | "metadata": {}, 799 | "source": [ 800 | "Let's do the same thing, with the dollar sign `$`:" 801 | ] 802 | }, 803 | { 804 | "cell_type": "code", 805 | "execution_count": 22, 806 | "metadata": {}, 807 | "outputs": [], 808 | "source": [ 809 | "salary_selected['total_earnings'] = salary_selected['total_earnings'].str.replace('$', '')" 810 | ] 811 | }, 812 | { 813 | "cell_type": "markdown", 814 | "metadata": {}, 815 | "source": [ 816 | "Using `head()` to visually inspect `salary_selected`, we see that the dollar signs are gone:" 817 | ] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "execution_count": 23, 822 | "metadata": {}, 823 | "outputs": [ 824 | { 825 | "data": { 826 | "text/html": [ 827 | "
\n", 828 | "\n", 841 | "\n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | "
namedepartment_nametotal_earnings
0Abadi,Kidani AAssessing Department46591.98
1Abasciano,JosephBoston Police Department97579.88
2Abban,Christopher JohnBoston Fire Department124623.25
3Abbasi,SophiaGreen Academy18249.83
4Abbate-Vaughn,JorgelinaBPS Ellis Elementary85660.28
\n", 883 | "
" 884 | ], 885 | "text/plain": [ 886 | " name department_name total_earnings\n", 887 | "0 Abadi,Kidani A Assessing Department 46591.98\n", 888 | "1 Abasciano,Joseph Boston Police Department 97579.88\n", 889 | "2 Abban,Christopher John Boston Fire Department 124623.25\n", 890 | "3 Abbasi,Sophia Green Academy 18249.83\n", 891 | "4 Abbate-Vaughn,Jorgelina BPS Ellis Elementary 85660.28" 892 | ] 893 | }, 894 | "execution_count": 23, 895 | "metadata": {}, 896 | "output_type": "execute_result" 897 | } 898 | ], 899 | "source": [ 900 | "salary_selected.head()" 901 | ] 902 | }, 903 | { 904 | "cell_type": "markdown", 905 | "metadata": {}, 906 | "source": [ 907 | "**Now can we use `arrange()` to sort the data by `total_earnings`?**" 908 | ] 909 | }, 910 | { 911 | "cell_type": "code", 912 | "execution_count": 24, 913 | "metadata": {}, 914 | "outputs": [ 915 | { 916 | "data": { 917 | "text/html": [ 918 | "
\n", 919 | "\n", 932 | "\n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | "
namedepartment_nametotal_earnings
3315Charles,YvelineBPS Transportation10.07
9914Jean Baptiste,HuguesBPS Transportation10.12
16419Piper,Sarah ABPS Transportation10.47
11131Laguerre,Yolaine MBPS Transportation10.94
17641Rosario Severino,YomayraFood & Nutrition Svc100.00
\n", 974 | "
" 975 | ], 976 | "text/plain": [ 977 | " name department_name total_earnings\n", 978 | "3315 Charles,Yveline BPS Transportation 10.07\n", 979 | "9914 Jean Baptiste,Hugues BPS Transportation 10.12\n", 980 | "16419 Piper,Sarah A BPS Transportation 10.47\n", 981 | "11131 Laguerre,Yolaine M BPS Transportation 10.94\n", 982 | "17641 Rosario Severino,Yomayra Food & Nutrition Svc 100.00" 983 | ] 984 | }, 985 | "execution_count": 24, 986 | "metadata": {}, 987 | "output_type": "execute_result" 988 | } 989 | ], 990 | "source": [ 991 | "salary_sort = salary_selected.sort_values('total_earnings')\n", 992 | "\n", 993 | "salary_sort.head()" 994 | ] 995 | }, 996 | { 997 | "cell_type": "code", 998 | "execution_count": 25, 999 | "metadata": {}, 1000 | "outputs": [ 1001 | { 1002 | "data": { 1003 | "text/html": [ 1004 | "
\n", 1005 | "\n", 1018 | "\n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | "
namedepartment_nametotal_earnings
18134Santos,Maria CCurley K-899970.30
5999Dyson,Margaret O.Parks Department99972.07
13012McCarthy,Margaret MBPS Substitute Teachers/Nurs9998.47
1083Bartholet,Carolyn VBPS Mckay Elementary99989.18
1960Bresnahan,John M.Boston Police Department99997.38
\n", 1060 | "
" 1061 | ], 1062 | "text/plain": [ 1063 | " name department_name total_earnings\n", 1064 | "18134 Santos,Maria C Curley K-8 99970.30\n", 1065 | "5999 Dyson,Margaret O. Parks Department 99972.07\n", 1066 | "13012 McCarthy,Margaret M BPS Substitute Teachers/Nurs 9998.47\n", 1067 | "1083 Bartholet,Carolyn V BPS Mckay Elementary 99989.18\n", 1068 | "1960 Bresnahan,John M. Boston Police Department 99997.38" 1069 | ] 1070 | }, 1071 | "execution_count": 25, 1072 | "metadata": {}, 1073 | "output_type": "execute_result" 1074 | } 1075 | ], 1076 | "source": [ 1077 | "salary_sort.tail()" 1078 | ] 1079 | }, 1080 | { 1081 | "cell_type": "markdown", 1082 | "metadata": {}, 1083 | "source": [ 1084 | "Again, at first glance, the employees appear to be sorted by `total_earnings` from lowest to highest. But that would imply that John M. Bresnahan was the highest-paid employee, making 99,997.38 dollars in 2016, while the *Boston Globe* [story](https://www.bostonglobe.com/metro/2017/02/14/police-detective-tops-boston-payroll-with-total-over/6PaXwTAHZGEW5djgwCJuTI/story.html) said the highest-paid city employee made more than 403,000 dollars." 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "markdown", 1089 | "metadata": {}, 1090 | "source": [ 1091 | "**What's the problem?**\n", 1092 | "\n", 1093 | "Again, we can use `dtypes` to check on how the `total_earnings` variable is encoded." 
1094 | ] 1095 | }, 1096 | { 1097 | "cell_type": "code", 1098 | "execution_count": 26, 1099 | "metadata": {}, 1100 | "outputs": [ 1101 | { 1102 | "data": { 1103 | "text/plain": [ 1104 | "name object\n", 1105 | "department_name object\n", 1106 | "total_earnings object\n", 1107 | "dtype: object" 1108 | ] 1109 | }, 1110 | "execution_count": 26, 1111 | "metadata": {}, 1112 | "output_type": "execute_result" 1113 | } 1114 | ], 1115 | "source": [ 1116 | "salary_sort.dtypes" 1117 | ] 1118 | }, 1119 | { 1120 | "cell_type": "markdown", 1121 | "metadata": {}, 1122 | "source": [ 1123 | "It's still an \"object\" now (still not numeric), because we didn't tell `pandas` that it should be numeric. We can do this with `pd.to_numeric()`:" 1124 | ] 1125 | }, 1126 | { 1127 | "cell_type": "code", 1128 | "execution_count": 27, 1129 | "metadata": {}, 1130 | "outputs": [], 1131 | "source": [ 1132 | "salary_sort['total_earnings'] = pd.to_numeric(salary_sort['total_earnings'])" 1133 | ] 1134 | }, 1135 | { 1136 | "cell_type": "markdown", 1137 | "metadata": {}, 1138 | "source": [ 1139 | "Now let's run `dtypes` again:" 1140 | ] 1141 | }, 1142 | { 1143 | "cell_type": "code", 1144 | "execution_count": 28, 1145 | "metadata": {}, 1146 | "outputs": [ 1147 | { 1148 | "data": { 1149 | "text/plain": [ 1150 | "name object\n", 1151 | "department_name object\n", 1152 | "total_earnings float64\n", 1153 | "dtype: object" 1154 | ] 1155 | }, 1156 | "execution_count": 28, 1157 | "metadata": {}, 1158 | "output_type": "execute_result" 1159 | } 1160 | ], 1161 | "source": [ 1162 | "salary_sort.dtypes" 1163 | ] 1164 | }, 1165 | { 1166 | "cell_type": "markdown", 1167 | "metadata": {}, 1168 | "source": [ 1169 | "\"float64\" means [\"floating point numbers\"](http://pbpython.com/pandas_dtypes.html) — this is what we want." 1170 | ] 1171 | }, 1172 | { 1173 | "cell_type": "markdown", 1174 | "metadata": {}, 1175 | "source": [ 1176 | "Now let's sort using `sort_values()`. 
" 1177 | ] 1178 | }, 1179 | { 1180 | "cell_type": "code", 1181 | "execution_count": 29, 1182 | "metadata": {}, 1183 | "outputs": [ 1184 | { 1185 | "data": { 1186 | "text/html": [ 1187 | "
\n", 1188 | "\n", 1201 | "\n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | "
namedepartment_nametotal_earnings
9849Jameau,BernadetteBPS Transportation2.14
1986Bridgewaters,Sandra JBPS Transportation2.50
13853Milian,Sonia MariaBPS Transportation3.85
2346Burke II,Myrell NadineBPS Transportation4.38
7717Gillard Jr.,Trina FFood & Nutrition Svc5.00
\n", 1243 | "
" 1244 | ], 1245 | "text/plain": [ 1246 | " name department_name total_earnings\n", 1247 | "9849 Jameau,Bernadette BPS Transportation 2.14\n", 1248 | "1986 Bridgewaters,Sandra J BPS Transportation 2.50\n", 1249 | "13853 Milian,Sonia Maria BPS Transportation 3.85\n", 1250 | "2346 Burke II,Myrell Nadine BPS Transportation 4.38\n", 1251 | "7717 Gillard Jr.,Trina F Food & Nutrition Svc 5.00" 1252 | ] 1253 | }, 1254 | "execution_count": 29, 1255 | "metadata": {}, 1256 | "output_type": "execute_result" 1257 | } 1258 | ], 1259 | "source": [ 1260 | "salary_sort = salary_sort.sort_values('total_earnings')\n", 1261 | "\n", 1262 | "salary_sort.head() # ascending order by default" 1263 | ] 1264 | }, 1265 | { 1266 | "cell_type": "markdown", 1267 | "metadata": {}, 1268 | "source": [ 1269 | "One last thing: we have to specify `ascending = False` within `sort_values()` because the function by default sorts the data in ascending order." 1270 | ] 1271 | }, 1272 | { 1273 | "cell_type": "code", 1274 | "execution_count": 30, 1275 | "metadata": {}, 1276 | "outputs": [ 1277 | { 1278 | "data": { 1279 | "text/html": [ 1280 | "
\n", 1281 | "\n", 1294 | "\n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | "
namedepartment_nametotal_earnings
11489Lee,WaimanBoston Police Department403408.61
10327Josey,Windell C.Boston Police Department396348.50
15716Painten,Paul ABoston Police Department373959.35
2113Brown,GregoryBoston Police Department351825.50
9446Hosein,HaseebBoston Police Department346105.17
\n", 1336 | "
" 1337 | ], 1338 | "text/plain": [ 1339 | " name department_name total_earnings\n", 1340 | "11489 Lee,Waiman Boston Police Department 403408.61\n", 1341 | "10327 Josey,Windell C. Boston Police Department 396348.50\n", 1342 | "15716 Painten,Paul A Boston Police Department 373959.35\n", 1343 | "2113 Brown,Gregory Boston Police Department 351825.50\n", 1344 | "9446 Hosein,Haseeb Boston Police Department 346105.17" 1345 | ] 1346 | }, 1347 | "execution_count": 30, 1348 | "metadata": {}, 1349 | "output_type": "execute_result" 1350 | } 1351 | ], 1352 | "source": [ 1353 | "salary_sort = salary_sort.sort_values('total_earnings', ascending = False)\n", 1354 | "\n", 1355 | "salary_sort.head() # descending order" 1356 | ] 1357 | }, 1358 | { 1359 | "cell_type": "markdown", 1360 | "metadata": {}, 1361 | "source": [ 1362 | "We see that Waiman Lee from the Boston PD is the top earner with >403,408 per year, just as the *Boston Globe* [article](https://www.bostonglobe.com/metro/2017/02/14/police-detective-tops-boston-payroll-with-total-over/6PaXwTAHZGEW5djgwCJuTI/story.html) states." 1363 | ] 1364 | }, 1365 | { 1366 | "cell_type": "markdown", 1367 | "metadata": {}, 1368 | "source": [ 1369 | "A bonus thing: maybe it bothers you that the numbers next to each row are no longer in any numeric order. This is because these numbers are the row index of the DataFrame — basically the order that they were in prior to being sorted. In order to reset these numbers, we can use the [`reset_index()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html) function on the `salary_sort` DataFrame. We include `drop = True` as a parameter of the function to prevent the old index from being added as a column in the DataFrame." 1370 | ] 1371 | }, 1372 | { 1373 | "cell_type": "code", 1374 | "execution_count": 31, 1375 | "metadata": {}, 1376 | "outputs": [ 1377 | { 1378 | "data": { 1379 | "text/html": [ 1380 | "
\n", 1381 | "\n", 1394 | "\n", 1395 | " \n", 1396 | " \n", 1397 | " \n", 1398 | " \n", 1399 | " \n", 1400 | " \n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | " \n", 1422 | " \n", 1423 | " \n", 1424 | " \n", 1425 | " \n", 1426 | " \n", 1427 | " \n", 1428 | " \n", 1429 | " \n", 1430 | " \n", 1431 | " \n", 1432 | " \n", 1433 | " \n", 1434 | " \n", 1435 | "
namedepartment_nametotal_earnings
0Lee,WaimanBoston Police Department403408.61
1Josey,Windell C.Boston Police Department396348.50
2Painten,Paul ABoston Police Department373959.35
3Brown,GregoryBoston Police Department351825.50
4Hosein,HaseebBoston Police Department346105.17
\n", 1436 | "
" 1437 | ], 1438 | "text/plain": [ 1439 | " name department_name total_earnings\n", 1440 | "0 Lee,Waiman Boston Police Department 403408.61\n", 1441 | "1 Josey,Windell C. Boston Police Department 396348.50\n", 1442 | "2 Painten,Paul A Boston Police Department 373959.35\n", 1443 | "3 Brown,Gregory Boston Police Department 351825.50\n", 1444 | "4 Hosein,Haseeb Boston Police Department 346105.17" 1445 | ] 1446 | }, 1447 | "execution_count": 31, 1448 | "metadata": {}, 1449 | "output_type": "execute_result" 1450 | } 1451 | ], 1452 | "source": [ 1453 | "salary_sort = salary_sort.reset_index(drop = True)\n", 1454 | "\n", 1455 | "salary_sort.head() # index is reset" 1456 | ] 1457 | }, 1458 | { 1459 | "cell_type": "markdown", 1460 | "metadata": {}, 1461 | "source": [ 1462 | "The Boston Police Department has a lot of high earners. We can figure out the average earnings by department, which we'll call `salary_average`, by using the `groupby` and `mean()` functions in `pandas`." 1463 | ] 1464 | }, 1465 | { 1466 | "cell_type": "code", 1467 | "execution_count": 32, 1468 | "metadata": {}, 1469 | "outputs": [], 1470 | "source": [ 1471 | "salary_average = salary_sort.groupby('department_name').mean()" 1472 | ] 1473 | }, 1474 | { 1475 | "cell_type": "code", 1476 | "execution_count": 33, 1477 | "metadata": { 1478 | "scrolled": false 1479 | }, 1480 | "outputs": [ 1481 | { 1482 | "data": { 1483 | "text/html": [ 1484 | "
\n", 1485 | "\n", 1498 | "\n", 1499 | " \n", 1500 | " \n", 1501 | " \n", 1502 | " \n", 1503 | " \n", 1504 | " \n", 1505 | " \n", 1506 | " \n", 1507 | " \n", 1508 | " \n", 1509 | " \n", 1510 | " \n", 1511 | " \n", 1512 | " \n", 1513 | " \n", 1514 | " \n", 1515 | " \n", 1516 | " \n", 1517 | " \n", 1518 | " \n", 1519 | " \n", 1520 | " \n", 1521 | " \n", 1522 | " \n", 1523 | " \n", 1524 | " \n", 1525 | " \n", 1526 | " \n", 1527 | " \n", 1528 | " \n", 1529 | " \n", 1530 | " \n", 1531 | " \n", 1532 | " \n", 1533 | " \n", 1534 | " \n", 1535 | " \n", 1536 | " \n", 1537 | " \n", 1538 | " \n", 1539 | " \n", 1540 | " \n", 1541 | " \n", 1542 | " \n", 1543 | " \n", 1544 | " \n", 1545 | " \n", 1546 | " \n", 1547 | " \n", 1548 | " \n", 1549 | " \n", 1550 | " \n", 1551 | " \n", 1552 | " \n", 1553 | " \n", 1554 | " \n", 1555 | " \n", 1556 | " \n", 1557 | " \n", 1558 | " \n", 1559 | " \n", 1560 | " \n", 1561 | " \n", 1562 | " \n", 1563 | " \n", 1564 | " \n", 1565 | " \n", 1566 | " \n", 1567 | " \n", 1568 | " \n", 1569 | " \n", 1570 | " \n", 1571 | " \n", 1572 | " \n", 1573 | " \n", 1574 | " \n", 1575 | " \n", 1576 | " \n", 1577 | " \n", 1578 | " \n", 1579 | " \n", 1580 | " \n", 1581 | " \n", 1582 | " \n", 1583 | " \n", 1584 | " \n", 1585 | " \n", 1586 | " \n", 1587 | " \n", 1588 | " \n", 1589 | " \n", 1590 | " \n", 1591 | " \n", 1592 | " \n", 1593 | " \n", 1594 | " \n", 1595 | "
total_earnings
department_name
ASD Human Resources67236.150755
ASD Intergvernmtl Relations83787.581000
ASD Office Of Labor Relation58899.954615
ASD Office of Budget Mangmnt73946.044643
ASD Purchasing Division72893.203750
Accountability102073.280667
Achievement Gap60105.522500
Alighieri Montessori School55160.025556
Assessing Department70713.327111
Asst Superintendent-Network A132514.885000
......
Unified Student Svc65018.485000
Veterans' Services48411.606250
WREC: Urban Science Academy81170.398214
Warren/Prescott K-866389.351341
West Roxbury Academy70373.066494
West Zone ELC55868.384118
Women's Advancement63811.150000
Workers Compensation Service23797.119133
Young Achievers K-856534.020463
Youth Engagement & Employment33645.202308
\n", 1596 | "

228 rows × 1 columns

\n", 1597 | "
" 1598 | ], 1599 | "text/plain": [ 1600 | " total_earnings\n", 1601 | "department_name \n", 1602 | "ASD Human Resources 67236.150755\n", 1603 | "ASD Intergvernmtl Relations 83787.581000\n", 1604 | "ASD Office Of Labor Relation 58899.954615\n", 1605 | "ASD Office of Budget Mangmnt 73946.044643\n", 1606 | "ASD Purchasing Division 72893.203750\n", 1607 | "Accountability 102073.280667\n", 1608 | "Achievement Gap 60105.522500\n", 1609 | "Alighieri Montessori School 55160.025556\n", 1610 | "Assessing Department 70713.327111\n", 1611 | "Asst Superintendent-Network A 132514.885000\n", 1612 | "... ...\n", 1613 | "Unified Student Svc 65018.485000\n", 1614 | "Veterans' Services 48411.606250\n", 1615 | "WREC: Urban Science Academy 81170.398214\n", 1616 | "Warren/Prescott K-8 66389.351341\n", 1617 | "West Roxbury Academy 70373.066494\n", 1618 | "West Zone ELC 55868.384118\n", 1619 | "Women's Advancement 63811.150000\n", 1620 | "Workers Compensation Service 23797.119133\n", 1621 | "Young Achievers K-8 56534.020463\n", 1622 | "Youth Engagement & Employment 33645.202308\n", 1623 | "\n", 1624 | "[228 rows x 1 columns]" 1625 | ] 1626 | }, 1627 | "execution_count": 33, 1628 | "metadata": {}, 1629 | "output_type": "execute_result" 1630 | } 1631 | ], 1632 | "source": [ 1633 | "salary_average = salary_average\n", 1634 | "\n", 1635 | "salary_average" 1636 | ] 1637 | }, 1638 | { 1639 | "cell_type": "markdown", 1640 | "metadata": {}, 1641 | "source": [ 1642 | "Notice that `pandas` by default sets the `department_name` column as the row index of the `salary_average` DataFrame. I personally don't love this and would rather have a straight-up DataFrame with the row numbers as the index, so I usually run `reset_index()` to get rid of this indexing: " 1643 | ] 1644 | }, 1645 | { 1646 | "cell_type": "code", 1647 | "execution_count": 34, 1648 | "metadata": {}, 1649 | "outputs": [ 1650 | { 1651 | "data": { 1652 | "text/html": [ 1653 | "
\n", 1654 | "\n", 1667 | "\n", 1668 | " \n", 1669 | " \n", 1670 | " \n", 1671 | " \n", 1672 | " \n", 1673 | " \n", 1674 | " \n", 1675 | " \n", 1676 | " \n", 1677 | " \n", 1678 | " \n", 1679 | " \n", 1680 | " \n", 1681 | " \n", 1682 | " \n", 1683 | " \n", 1684 | " \n", 1685 | " \n", 1686 | " \n", 1687 | " \n", 1688 | " \n", 1689 | " \n", 1690 | " \n", 1691 | " \n", 1692 | " \n", 1693 | " \n", 1694 | " \n", 1695 | " \n", 1696 | " \n", 1697 | " \n", 1698 | " \n", 1699 | " \n", 1700 | " \n", 1701 | " \n", 1702 | " \n", 1703 | " \n", 1704 | " \n", 1705 | " \n", 1706 | " \n", 1707 | " \n", 1708 | " \n", 1709 | " \n", 1710 | " \n", 1711 | " \n", 1712 | " \n", 1713 | " \n", 1714 | " \n", 1715 | " \n", 1716 | " \n", 1717 | " \n", 1718 | " \n", 1719 | " \n", 1720 | " \n", 1721 | " \n", 1722 | " \n", 1723 | " \n", 1724 | " \n", 1725 | " \n", 1726 | " \n", 1727 | " \n", 1728 | " \n", 1729 | " \n", 1730 | " \n", 1731 | " \n", 1732 | " \n", 1733 | " \n", 1734 | " \n", 1735 | " \n", 1736 | " \n", 1737 | " \n", 1738 | " \n", 1739 | " \n", 1740 | " \n", 1741 | " \n", 1742 | " \n", 1743 | " \n", 1744 | " \n", 1745 | " \n", 1746 | " \n", 1747 | " \n", 1748 | " \n", 1749 | " \n", 1750 | " \n", 1751 | " \n", 1752 | " \n", 1753 | " \n", 1754 | " \n", 1755 | " \n", 1756 | " \n", 1757 | " \n", 1758 | " \n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | " \n", 1763 | " \n", 1764 | " \n", 1765 | " \n", 1766 | " \n", 1767 | " \n", 1768 | " \n", 1769 | " \n", 1770 | " \n", 1771 | " \n", 1772 | " \n", 1773 | " \n", 1774 | " \n", 1775 | " \n", 1776 | " \n", 1777 | " \n", 1778 | " \n", 1779 | " \n", 1780 | " \n", 1781 | " \n", 1782 | "
department_nametotal_earnings
0ASD Human Resources67236.150755
1ASD Intergvernmtl Relations83787.581000
2ASD Office Of Labor Relation58899.954615
3ASD Office of Budget Mangmnt73946.044643
4ASD Purchasing Division72893.203750
5Accountability102073.280667
6Achievement Gap60105.522500
7Alighieri Montessori School55160.025556
8Assessing Department70713.327111
9Asst Superintendent-Network A132514.885000
.........
218Unified Student Svc65018.485000
219Veterans' Services48411.606250
220WREC: Urban Science Academy81170.398214
221Warren/Prescott K-866389.351341
222West Roxbury Academy70373.066494
223West Zone ELC55868.384118
224Women's Advancement63811.150000
225Workers Compensation Service23797.119133
226Young Achievers K-856534.020463
227Youth Engagement & Employment33645.202308
\n", 1783 | "

228 rows × 2 columns

\n", 1784 | "
" 1785 | ], 1786 | "text/plain": [ 1787 | " department_name total_earnings\n", 1788 | "0 ASD Human Resources 67236.150755\n", 1789 | "1 ASD Intergvernmtl Relations 83787.581000\n", 1790 | "2 ASD Office Of Labor Relation 58899.954615\n", 1791 | "3 ASD Office of Budget Mangmnt 73946.044643\n", 1792 | "4 ASD Purchasing Division 72893.203750\n", 1793 | "5 Accountability 102073.280667\n", 1794 | "6 Achievement Gap 60105.522500\n", 1795 | "7 Alighieri Montessori School 55160.025556\n", 1796 | "8 Assessing Department 70713.327111\n", 1797 | "9 Asst Superintendent-Network A 132514.885000\n", 1798 | ".. ... ...\n", 1799 | "218 Unified Student Svc 65018.485000\n", 1800 | "219 Veterans' Services 48411.606250\n", 1801 | "220 WREC: Urban Science Academy 81170.398214\n", 1802 | "221 Warren/Prescott K-8 66389.351341\n", 1803 | "222 West Roxbury Academy 70373.066494\n", 1804 | "223 West Zone ELC 55868.384118\n", 1805 | "224 Women's Advancement 63811.150000\n", 1806 | "225 Workers Compensation Service 23797.119133\n", 1807 | "226 Young Achievers K-8 56534.020463\n", 1808 | "227 Youth Engagement & Employment 33645.202308\n", 1809 | "\n", 1810 | "[228 rows x 2 columns]" 1811 | ] 1812 | }, 1813 | "execution_count": 34, 1814 | "metadata": {}, 1815 | "output_type": "execute_result" 1816 | } 1817 | ], 1818 | "source": [ 1819 | "salary_average = salary_average.reset_index() # reset_index\n", 1820 | "\n", 1821 | "salary_average" 1822 | ] 1823 | }, 1824 | { 1825 | "cell_type": "markdown", 1826 | "metadata": {}, 1827 | "source": [ 1828 | "We should also rename the `total_earnings` column to `average_earnings` to avoid confusion. We can do this using [`rename()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html). The syntax for `rename()` is `DataFrame.rename(columns = {'current column name':'new column name'})`." 
1829 | ] 1830 | }, 1831 | { 1832 | "cell_type": "code", 1833 | "execution_count": 35, 1834 | "metadata": {}, 1835 | "outputs": [], 1836 | "source": [ 1837 | "salary_average = salary_average.rename(columns = {'total_earnings': 'dept_average'}) " 1838 | ] 1839 | }, 1840 | { 1841 | "cell_type": "code", 1842 | "execution_count": 36, 1843 | "metadata": {}, 1844 | "outputs": [ 1845 | { 1846 | "data": { 1847 | "text/html": [ 1848 | "
\n", 1849 | "\n", 1862 | "\n", 1863 | " \n", 1864 | " \n", 1865 | " \n", 1866 | " \n", 1867 | " \n", 1868 | " \n", 1869 | " \n", 1870 | " \n", 1871 | " \n", 1872 | " \n", 1873 | " \n", 1874 | " \n", 1875 | " \n", 1876 | " \n", 1877 | " \n", 1878 | " \n", 1879 | " \n", 1880 | " \n", 1881 | " \n", 1882 | " \n", 1883 | " \n", 1884 | " \n", 1885 | " \n", 1886 | " \n", 1887 | " \n", 1888 | " \n", 1889 | " \n", 1890 | " \n", 1891 | " \n", 1892 | " \n", 1893 | " \n", 1894 | " \n", 1895 | " \n", 1896 | " \n", 1897 | " \n", 1898 | " \n", 1899 | " \n", 1900 | " \n", 1901 | " \n", 1902 | " \n", 1903 | " \n", 1904 | " \n", 1905 | " \n", 1906 | " \n", 1907 | " \n", 1908 | " \n", 1909 | " \n", 1910 | " \n", 1911 | " \n", 1912 | " \n", 1913 | " \n", 1914 | " \n", 1915 | " \n", 1916 | " \n", 1917 | " \n", 1918 | " \n", 1919 | " \n", 1920 | " \n", 1921 | " \n", 1922 | " \n", 1923 | " \n", 1924 | " \n", 1925 | " \n", 1926 | " \n", 1927 | " \n", 1928 | " \n", 1929 | " \n", 1930 | " \n", 1931 | " \n", 1932 | " \n", 1933 | " \n", 1934 | " \n", 1935 | " \n", 1936 | " \n", 1937 | " \n", 1938 | " \n", 1939 | " \n", 1940 | " \n", 1941 | " \n", 1942 | " \n", 1943 | " \n", 1944 | " \n", 1945 | " \n", 1946 | " \n", 1947 | " \n", 1948 | " \n", 1949 | " \n", 1950 | " \n", 1951 | " \n", 1952 | " \n", 1953 | " \n", 1954 | " \n", 1955 | " \n", 1956 | " \n", 1957 | " \n", 1958 | " \n", 1959 | " \n", 1960 | " \n", 1961 | " \n", 1962 | " \n", 1963 | " \n", 1964 | " \n", 1965 | " \n", 1966 | " \n", 1967 | " \n", 1968 | " \n", 1969 | " \n", 1970 | " \n", 1971 | " \n", 1972 | " \n", 1973 | " \n", 1974 | " \n", 1975 | " \n", 1976 | " \n", 1977 | "
department_namedept_average
0ASD Human Resources67236.150755
1ASD Intergvernmtl Relations83787.581000
2ASD Office Of Labor Relation58899.954615
3ASD Office of Budget Mangmnt73946.044643
4ASD Purchasing Division72893.203750
5Accountability102073.280667
6Achievement Gap60105.522500
7Alighieri Montessori School55160.025556
8Assessing Department70713.327111
9Asst Superintendent-Network A132514.885000
.........
218Unified Student Svc65018.485000
219Veterans' Services48411.606250
220WREC: Urban Science Academy81170.398214
221Warren/Prescott K-866389.351341
222West Roxbury Academy70373.066494
223West Zone ELC55868.384118
224Women's Advancement63811.150000
225Workers Compensation Service23797.119133
226Young Achievers K-856534.020463
227Youth Engagement & Employment33645.202308
\n", 1978 | "

228 rows × 2 columns

\n", 1979 | "
" 1980 | ], 1981 | "text/plain": [ 1982 | " department_name dept_average\n", 1983 | "0 ASD Human Resources 67236.150755\n", 1984 | "1 ASD Intergvernmtl Relations 83787.581000\n", 1985 | "2 ASD Office Of Labor Relation 58899.954615\n", 1986 | "3 ASD Office of Budget Mangmnt 73946.044643\n", 1987 | "4 ASD Purchasing Division 72893.203750\n", 1988 | "5 Accountability 102073.280667\n", 1989 | "6 Achievement Gap 60105.522500\n", 1990 | "7 Alighieri Montessori School 55160.025556\n", 1991 | "8 Assessing Department 70713.327111\n", 1992 | "9 Asst Superintendent-Network A 132514.885000\n", 1993 | ".. ... ...\n", 1994 | "218 Unified Student Svc 65018.485000\n", 1995 | "219 Veterans' Services 48411.606250\n", 1996 | "220 WREC: Urban Science Academy 81170.398214\n", 1997 | "221 Warren/Prescott K-8 66389.351341\n", 1998 | "222 West Roxbury Academy 70373.066494\n", 1999 | "223 West Zone ELC 55868.384118\n", 2000 | "224 Women's Advancement 63811.150000\n", 2001 | "225 Workers Compensation Service 23797.119133\n", 2002 | "226 Young Achievers K-8 56534.020463\n", 2003 | "227 Youth Engagement & Employment 33645.202308\n", 2004 | "\n", 2005 | "[228 rows x 2 columns]" 2006 | ] 2007 | }, 2008 | "execution_count": 36, 2009 | "metadata": {}, 2010 | "output_type": "execute_result" 2011 | } 2012 | ], 2013 | "source": [ 2014 | "salary_average" 2015 | ] 2016 | }, 2017 | { 2018 | "cell_type": "markdown", 2019 | "metadata": {}, 2020 | "source": [ 2021 | "We can find the Boston Police Department. Find out more about selecting based on attributes [here](https://chrisalbon.com/python/data_wrangling/pandas_selecting_rows_on_conditions/)." 2022 | ] 2023 | }, 2024 | { 2025 | "cell_type": "code", 2026 | "execution_count": 37, 2027 | "metadata": {}, 2028 | "outputs": [ 2029 | { 2030 | "data": { 2031 | "text/html": [ 2032 | "
\n", 2033 | "\n", 2046 | "\n", 2047 | " \n", 2048 | " \n", 2049 | " \n", 2050 | " \n", 2051 | " \n", 2052 | " \n", 2053 | " \n", 2054 | " \n", 2055 | " \n", 2056 | " \n", 2057 | " \n", 2058 | " \n", 2059 | " \n", 2060 | " \n", 2061 | "
department_namedept_average
121Boston Police Department124787.164775
\n", 2062 | "
" 2063 | ], 2064 | "text/plain": [ 2065 | " department_name dept_average\n", 2066 | "121 Boston Police Department 124787.164775" 2067 | ] 2068 | }, 2069 | "execution_count": 37, 2070 | "metadata": {}, 2071 | "output_type": "execute_result" 2072 | } 2073 | ], 2074 | "source": [ 2075 | "salary_average[salary_average['department_name'] == 'Boston Police Department']" 2076 | ] 2077 | }, 2078 | { 2079 | "cell_type": "markdown", 2080 | "metadata": {}, 2081 | "source": [ 2082 | "Now is a good time to revisit \"chaining.\" Notice how we did three things in creating `salary_average`:\n", 2083 | "1. Grouped the `salary_sort` DataFrame by `department_name` and calculated the mean of the numeric columns (in our case, `total_earnings` using `group_by()` and `mean()`.\n", 2084 | "2. Used `reset_index()` on the resulting DataFrame so that `department_name` would no longer be the row index.\n", 2085 | "3. Renamed the `total_earnings` column to `dept_average` to avoid confusion using `rename()`.\n", 2086 | "\n", 2087 | "In fact, we can do these three things all at once, by chaining the functions together:" 2088 | ] 2089 | }, 2090 | { 2091 | "cell_type": "code", 2092 | "execution_count": 38, 2093 | "metadata": {}, 2094 | "outputs": [ 2095 | { 2096 | "data": { 2097 | "text/html": [ 2098 | "
\n", 2099 | "\n", 2112 | "\n", 2113 | " \n", 2114 | " \n", 2115 | " \n", 2116 | " \n", 2117 | " \n", 2118 | " \n", 2119 | " \n", 2120 | " \n", 2121 | " \n", 2122 | " \n", 2123 | " \n", 2124 | " \n", 2125 | " \n", 2126 | " \n", 2127 | " \n", 2128 | " \n", 2129 | " \n", 2130 | " \n", 2131 | " \n", 2132 | " \n", 2133 | " \n", 2134 | " \n", 2135 | " \n", 2136 | " \n", 2137 | " \n", 2138 | " \n", 2139 | " \n", 2140 | " \n", 2141 | " \n", 2142 | " \n", 2143 | " \n", 2144 | " \n", 2145 | " \n", 2146 | " \n", 2147 | " \n", 2148 | " \n", 2149 | " \n", 2150 | " \n", 2151 | " \n", 2152 | " \n", 2153 | " \n", 2154 | " \n", 2155 | " \n", 2156 | " \n", 2157 | " \n", 2158 | " \n", 2159 | " \n", 2160 | " \n", 2161 | " \n", 2162 | " \n", 2163 | " \n", 2164 | " \n", 2165 | " \n", 2166 | " \n", 2167 | " \n", 2168 | " \n", 2169 | " \n", 2170 | " \n", 2171 | " \n", 2172 | " \n", 2173 | " \n", 2174 | " \n", 2175 | " \n", 2176 | " \n", 2177 | " \n", 2178 | " \n", 2179 | " \n", 2180 | " \n", 2181 | " \n", 2182 | " \n", 2183 | " \n", 2184 | " \n", 2185 | " \n", 2186 | " \n", 2187 | " \n", 2188 | " \n", 2189 | " \n", 2190 | " \n", 2191 | " \n", 2192 | " \n", 2193 | " \n", 2194 | " \n", 2195 | " \n", 2196 | " \n", 2197 | " \n", 2198 | " \n", 2199 | " \n", 2200 | " \n", 2201 | " \n", 2202 | " \n", 2203 | " \n", 2204 | " \n", 2205 | " \n", 2206 | " \n", 2207 | " \n", 2208 | " \n", 2209 | " \n", 2210 | " \n", 2211 | " \n", 2212 | " \n", 2213 | " \n", 2214 | " \n", 2215 | " \n", 2216 | " \n", 2217 | " \n", 2218 | " \n", 2219 | " \n", 2220 | " \n", 2221 | " \n", 2222 | " \n", 2223 | " \n", 2224 | " \n", 2225 | " \n", 2226 | " \n", 2227 | "
department_namedept_average
0ASD Human Resources67236.150755
1ASD Intergvernmtl Relations83787.581000
2ASD Office Of Labor Relation58899.954615
3ASD Office of Budget Mangmnt73946.044643
4ASD Purchasing Division72893.203750
5Accountability102073.280667
6Achievement Gap60105.522500
7Alighieri Montessori School55160.025556
8Assessing Department70713.327111
9Asst Superintendent-Network A132514.885000
.........
218Unified Student Svc65018.485000
219Veterans' Services48411.606250
220WREC: Urban Science Academy81170.398214
221Warren/Prescott K-866389.351341
222West Roxbury Academy70373.066494
223West Zone ELC55868.384118
224Women's Advancement63811.150000
225Workers Compensation Service23797.119133
226Young Achievers K-856534.020463
227Youth Engagement & Employment33645.202308
\n", 2228 | "

228 rows × 2 columns

\n", 2229 | "
" 2230 | ], 2231 | "text/plain": [ 2232 | " department_name dept_average\n", 2233 | "0 ASD Human Resources 67236.150755\n", 2234 | "1 ASD Intergvernmtl Relations 83787.581000\n", 2235 | "2 ASD Office Of Labor Relation 58899.954615\n", 2236 | "3 ASD Office of Budget Mangmnt 73946.044643\n", 2237 | "4 ASD Purchasing Division 72893.203750\n", 2238 | "5 Accountability 102073.280667\n", 2239 | "6 Achievement Gap 60105.522500\n", 2240 | "7 Alighieri Montessori School 55160.025556\n", 2241 | "8 Assessing Department 70713.327111\n", 2242 | "9 Asst Superintendent-Network A 132514.885000\n", 2243 | ".. ... ...\n", 2244 | "218 Unified Student Svc 65018.485000\n", 2245 | "219 Veterans' Services 48411.606250\n", 2246 | "220 WREC: Urban Science Academy 81170.398214\n", 2247 | "221 Warren/Prescott K-8 66389.351341\n", 2248 | "222 West Roxbury Academy 70373.066494\n", 2249 | "223 West Zone ELC 55868.384118\n", 2250 | "224 Women's Advancement 63811.150000\n", 2251 | "225 Workers Compensation Service 23797.119133\n", 2252 | "226 Young Achievers K-8 56534.020463\n", 2253 | "227 Youth Engagement & Employment 33645.202308\n", 2254 | "\n", 2255 | "[228 rows x 2 columns]" 2256 | ] 2257 | }, 2258 | "execution_count": 38, 2259 | "metadata": {}, 2260 | "output_type": "execute_result" 2261 | } 2262 | ], 2263 | "source": [ 2264 | "salary_sort.groupby('department_name').mean().reset_index().rename(columns = {'total_earnings':'dept_average'})" 2265 | ] 2266 | }, 2267 | { 2268 | "cell_type": "markdown", 2269 | "metadata": {}, 2270 | "source": [ 2271 | "That's a pretty long line of code. To make it more readable, we can split it up into separate lines. I like to do this by putting the whole expression in parentheses and splitting it up right before each of the functions, which are delineated by the periods:" 2272 | ] 2273 | }, 2274 | { 2275 | "cell_type": "code", 2276 | "execution_count": 39, 2277 | "metadata": {}, 2278 | "outputs": [ 2279 | { 2280 | "data": { 2281 | "text/html": [ 2282 | "
\n", 2283 | "\n", 2296 | "\n", 2297 | " \n", 2298 | " \n", 2299 | " \n", 2300 | " \n", 2301 | " \n", 2302 | " \n", 2303 | " \n", 2304 | " \n", 2305 | " \n", 2306 | " \n", 2307 | " \n", 2308 | " \n", 2309 | " \n", 2310 | " \n", 2311 | " \n", 2312 | " \n", 2313 | " \n", 2314 | " \n", 2315 | " \n", 2316 | " \n", 2317 | " \n", 2318 | " \n", 2319 | " \n", 2320 | " \n", 2321 | " \n", 2322 | " \n", 2323 | " \n", 2324 | " \n", 2325 | " \n", 2326 | " \n", 2327 | " \n", 2328 | " \n", 2329 | " \n", 2330 | " \n", 2331 | " \n", 2332 | " \n", 2333 | " \n", 2334 | " \n", 2335 | " \n", 2336 | " \n", 2337 | " \n", 2338 | " \n", 2339 | " \n", 2340 | " \n", 2341 | " \n", 2342 | " \n", 2343 | " \n", 2344 | " \n", 2345 | " \n", 2346 | " \n", 2347 | " \n", 2348 | " \n", 2349 | " \n", 2350 | " \n", 2351 | " \n", 2352 | " \n", 2353 | " \n", 2354 | " \n", 2355 | " \n", 2356 | " \n", 2357 | " \n", 2358 | " \n", 2359 | " \n", 2360 | " \n", 2361 | " \n", 2362 | " \n", 2363 | " \n", 2364 | " \n", 2365 | " \n", 2366 | " \n", 2367 | " \n", 2368 | " \n", 2369 | " \n", 2370 | " \n", 2371 | " \n", 2372 | " \n", 2373 | " \n", 2374 | " \n", 2375 | " \n", 2376 | " \n", 2377 | " \n", 2378 | " \n", 2379 | " \n", 2380 | " \n", 2381 | " \n", 2382 | " \n", 2383 | " \n", 2384 | " \n", 2385 | " \n", 2386 | " \n", 2387 | " \n", 2388 | " \n", 2389 | " \n", 2390 | " \n", 2391 | " \n", 2392 | " \n", 2393 | " \n", 2394 | " \n", 2395 | " \n", 2396 | " \n", 2397 | " \n", 2398 | " \n", 2399 | " \n", 2400 | " \n", 2401 | " \n", 2402 | " \n", 2403 | " \n", 2404 | " \n", 2405 | " \n", 2406 | " \n", 2407 | " \n", 2408 | " \n", 2409 | " \n", 2410 | " \n", 2411 | "
department_namedept_average
0ASD Human Resources67236.150755
1ASD Intergvernmtl Relations83787.581000
2ASD Office Of Labor Relation58899.954615
3ASD Office of Budget Mangmnt73946.044643
4ASD Purchasing Division72893.203750
5Accountability102073.280667
6Achievement Gap60105.522500
7Alighieri Montessori School55160.025556
8Assessing Department70713.327111
9Asst Superintendent-Network A132514.885000
.........
218Unified Student Svc65018.485000
219Veterans' Services48411.606250
220WREC: Urban Science Academy81170.398214
221Warren/Prescott K-866389.351341
222West Roxbury Academy70373.066494
223West Zone ELC55868.384118
224Women's Advancement63811.150000
225Workers Compensation Service23797.119133
226Young Achievers K-856534.020463
227Youth Engagement & Employment33645.202308
\n", 2412 | "

228 rows × 2 columns

\n", 2413 | "
" 2414 | ], 2415 | "text/plain": [ 2416 | " department_name dept_average\n", 2417 | "0 ASD Human Resources 67236.150755\n", 2418 | "1 ASD Intergvernmtl Relations 83787.581000\n", 2419 | "2 ASD Office Of Labor Relation 58899.954615\n", 2420 | "3 ASD Office of Budget Mangmnt 73946.044643\n", 2421 | "4 ASD Purchasing Division 72893.203750\n", 2422 | "5 Accountability 102073.280667\n", 2423 | "6 Achievement Gap 60105.522500\n", 2424 | "7 Alighieri Montessori School 55160.025556\n", 2425 | "8 Assessing Department 70713.327111\n", 2426 | "9 Asst Superintendent-Network A 132514.885000\n", 2427 | ".. ... ...\n", 2428 | "218 Unified Student Svc 65018.485000\n", 2429 | "219 Veterans' Services 48411.606250\n", 2430 | "220 WREC: Urban Science Academy 81170.398214\n", 2431 | "221 Warren/Prescott K-8 66389.351341\n", 2432 | "222 West Roxbury Academy 70373.066494\n", 2433 | "223 West Zone ELC 55868.384118\n", 2434 | "224 Women's Advancement 63811.150000\n", 2435 | "225 Workers Compensation Service 23797.119133\n", 2436 | "226 Young Achievers K-8 56534.020463\n", 2437 | "227 Youth Engagement & Employment 33645.202308\n", 2438 | "\n", 2439 | "[228 rows x 2 columns]" 2440 | ] 2441 | }, 2442 | "execution_count": 39, 2443 | "metadata": {}, 2444 | "output_type": "execute_result" 2445 | } 2446 | ], 2447 | "source": [ 2448 | "(salary_sort.groupby('department_name')\n", 2449 | " .mean()\n", 2450 | " .reset_index()\n", 2451 | " .rename(columns = {'total_earnings':'dept_average'}))" 2452 | ] 2453 | }, 2454 | { 2455 | "cell_type": "markdown", 2456 | "metadata": {}, 2457 | "source": [ 2458 | "## 2. Merging datasets" 2459 | ] 2460 | }, 2461 | { 2462 | "cell_type": "markdown", 2463 | "metadata": {}, 2464 | "source": [ 2465 | "Now we have two main datasets, `salary_sort` (the salary for each person, sorted from high to low) and `salary_average` (the average salary for each department). 
What if I wanted to merge these two together, so I could see side-by-side each person's salary compared to the average for their department?\n", 2466 | "\n", 2467 | "We want to join by the `department_name` variable, since that is consistent across both datasets. Let's put the merged data into a new dataframe, `salary_merged`:" 2468 | ] 2469 | }, 2470 | { 2471 | "cell_type": "code", 2472 | "execution_count": 40, 2473 | "metadata": {}, 2474 | "outputs": [], 2475 | "source": [ 2476 | "salary_merged = pd.merge(salary_sort, salary_average, on = 'department_name')" 2477 | ] 2478 | }, 2479 | { 2480 | "cell_type": "markdown", 2481 | "metadata": {}, 2482 | "source": [ 2483 | "Now we can see the department average, `dept_average`, next to the individual's salary, `total_earnings`:" 2484 | ] 2485 | }, 2486 | { 2487 | "cell_type": "code", 2488 | "execution_count": 41, 2489 | "metadata": {}, 2490 | "outputs": [ 2491 | { 2492 | "data": { 2493 | "text/html": [ 2494 | "
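For reference, `pd.merge()` defaults to an inner join, which silently drops any row whose `department_name` appears in only one of the two DataFrames. A small sketch on toy data (not the workshop files) showing how `how='left'` keeps every row from the first DataFrame instead:

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'department_name': ['Police', 'Fire', 'Parks'],
    'total_earnings': [100.0, 60.0, 50.0],
})
averages = pd.DataFrame({
    'department_name': ['Police', 'Fire'],
    'dept_average': [90.0, 55.0],
})

# Inner join (the default): Parks has no average, so its row is dropped
inner = pd.merge(people, averages, on='department_name')

# Left join: Parks is kept, with NaN for the missing dept_average
left = pd.merge(people, averages, on='department_name', how='left')

print(len(inner), len(left))
```

With real data it is worth checking the row count after a merge, so missing matches don't go unnoticed.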
\n", 2495 | "\n", 2508 | "\n", 2509 | " \n", 2510 | " \n", 2511 | " \n", 2512 | " \n", 2513 | " \n", 2514 | " \n", 2515 | " \n", 2516 | " \n", 2517 | " \n", 2518 | " \n", 2519 | " \n", 2520 | " \n", 2521 | " \n", 2522 | " \n", 2523 | " \n", 2524 | " \n", 2525 | " \n", 2526 | " \n", 2527 | " \n", 2528 | " \n", 2529 | " \n", 2530 | " \n", 2531 | " \n", 2532 | " \n", 2533 | " \n", 2534 | " \n", 2535 | " \n", 2536 | " \n", 2537 | " \n", 2538 | " \n", 2539 | " \n", 2540 | " \n", 2541 | " \n", 2542 | " \n", 2543 | " \n", 2544 | " \n", 2545 | " \n", 2546 | " \n", 2547 | " \n", 2548 | " \n", 2549 | " \n", 2550 | " \n", 2551 | " \n", 2552 | " \n", 2553 | " \n", 2554 | " \n", 2555 | "
namedepartment_nametotal_earningsdept_average
0Lee,WaimanBoston Police Department403408.61124787.164775
1Josey,Windell C.Boston Police Department396348.50124787.164775
2Painten,Paul ABoston Police Department373959.35124787.164775
3Brown,GregoryBoston Police Department351825.50124787.164775
4Hosein,HaseebBoston Police Department346105.17124787.164775
\n", 2556 | "
" 2557 | ], 2558 | "text/plain": [ 2559 | " name department_name total_earnings dept_average\n", 2560 | "0 Lee,Waiman Boston Police Department 403408.61 124787.164775\n", 2561 | "1 Josey,Windell C. Boston Police Department 396348.50 124787.164775\n", 2562 | "2 Painten,Paul A Boston Police Department 373959.35 124787.164775\n", 2563 | "3 Brown,Gregory Boston Police Department 351825.50 124787.164775\n", 2564 | "4 Hosein,Haseeb Boston Police Department 346105.17 124787.164775" 2565 | ] 2566 | }, 2567 | "execution_count": 41, 2568 | "metadata": {}, 2569 | "output_type": "execute_result" 2570 | } 2571 | ], 2572 | "source": [ 2573 | "salary_merged.head()" 2574 | ] 2575 | }, 2576 | { 2577 | "cell_type": "markdown", 2578 | "metadata": {}, 2579 | "source": [ 2580 | "## 3. Reshaping data" 2581 | ] 2582 | }, 2583 | { 2584 | "cell_type": "markdown", 2585 | "metadata": {}, 2586 | "source": [ 2587 | "Here's a dataset on unemployment rates by country from 2012 to 2016, from the International Monetary Fund's World Economic Outlook database (available [here](https://www.imf.org/external/pubs/ft/weo/2017/01/weodata/index.aspx)).\n", 2588 | "\n", 2589 | "When you download the dataset, it comes in an Excel file. We can use the `pd.read_excel()` function from `pandas` to load the file into Python." 2590 | ] 2591 | }, 2592 | { 2593 | "cell_type": "code", 2594 | "execution_count": 42, 2595 | "metadata": {}, 2596 | "outputs": [ 2597 | { 2598 | "data": { 2599 | "text/html": [ 2600 | "
\n", 2601 | "\n", 2614 | "\n", 2615 | " \n", 2616 | " \n", 2617 | " \n", 2618 | " \n", 2619 | " \n", 2620 | " \n", 2621 | " \n", 2622 | " \n", 2623 | " \n", 2624 | " \n", 2625 | " \n", 2626 | " \n", 2627 | " \n", 2628 | " \n", 2629 | " \n", 2630 | " \n", 2631 | " \n", 2632 | " \n", 2633 | " \n", 2634 | " \n", 2635 | " \n", 2636 | " \n", 2637 | " \n", 2638 | " \n", 2639 | " \n", 2640 | " \n", 2641 | " \n", 2642 | " \n", 2643 | " \n", 2644 | " \n", 2645 | " \n", 2646 | " \n", 2647 | " \n", 2648 | " \n", 2649 | " \n", 2650 | " \n", 2651 | " \n", 2652 | " \n", 2653 | " \n", 2654 | " \n", 2655 | " \n", 2656 | " \n", 2657 | " \n", 2658 | " \n", 2659 | " \n", 2660 | " \n", 2661 | " \n", 2662 | " \n", 2663 | " \n", 2664 | " \n", 2665 | " \n", 2666 | " \n", 2667 | " \n", 2668 | " \n", 2669 | " \n", 2670 | " \n", 2671 | " \n", 2672 | " \n", 2673 | "
Country20122013201420152016
0Albania13.40016.00017.50017.10016.100
1Algeria11.0009.82910.60011.21410.498
2Argentina7.2007.0757.250NaN8.467
3Armenia17.30016.20017.60018.50018.790
4Australia5.2175.6506.0586.0585.733
\n", 2674 | "
" 2675 | ], 2676 | "text/plain": [ 2677 | " Country 2012 2013 2014 2015 2016\n", 2678 | "0 Albania 13.400 16.000 17.500 17.100 16.100\n", 2679 | "1 Algeria 11.000 9.829 10.600 11.214 10.498\n", 2680 | "2 Argentina 7.200 7.075 7.250 NaN 8.467\n", 2681 | "3 Armenia 17.300 16.200 17.600 18.500 18.790\n", 2682 | "4 Australia 5.217 5.650 6.058 6.058 5.733" 2683 | ] 2684 | }, 2685 | "execution_count": 42, 2686 | "metadata": {}, 2687 | "output_type": "execute_result" 2688 | } 2689 | ], 2690 | "source": [ 2691 | "unemployment = pd.read_excel('unemployment.xlsx')\n", 2692 | "unemployment.head()" 2693 | ] 2694 | }, 2695 | { 2696 | "cell_type": "markdown", 2697 | "metadata": {}, 2698 | "source": [ 2699 | "You'll notice if you open the `unemployment.xlsx` file in Excel that cells that do not have data (like Argentina in 2015) are labeled with \"n/a\". A nice feature of `pd.read_excel()` is that it recognizes these cells as NaN (\"not a number,\" or Python's way of encoding missing values), by default. If we wanted to, we could explicitly tell pandas that missing values were labeled \"n/a\" using `na_values = 'n/a'` within the `pd.read_excel()` function:" 2700 | ] 2701 | }, 2702 | { 2703 | "cell_type": "code", 2704 | "execution_count": 43, 2705 | "metadata": {}, 2706 | "outputs": [], 2707 | "source": [ 2708 | "unemployment = pd.read_excel('unemployment.xlsx', na_values = 'n/a')" 2709 | ] 2710 | }, 2711 | { 2712 | "cell_type": "markdown", 2713 | "metadata": {}, 2714 | "source": [ 2715 | "Right now, the data are in what's commonly referred to as \"wide\" format, meaning the variables (unemployment rate for each year) are spread across rows. This might be good for presentation, but it's not great for certain calculations or graphing. 
\"Wide\" format data also becomes confusing if other variables are added.\n", 2716 | "\n", 2717 | "We need to change the format from \"wide\" to \"long,\" meaning that the columns (`2012`, `2013`, `2014`, `2015`, `2016`) will be converted into a new variable, which we'll call `Year`, with repeated values for each country. And the unemployment rates will be put into a new variable, which we'll call `Rate_Unemployed`." 2718 | ] 2719 | }, 2720 | { 2721 | "cell_type": "markdown", 2722 | "metadata": {}, 2723 | "source": [ 2724 | "To do this, we'll use the [`pd.melt()`](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.melt.html) function in `pandas` to create a new DataFrame, `unemployment_long`." 2725 | ] 2726 | }, 2727 | { 2728 | "cell_type": "code", 2729 | "execution_count": 44, 2730 | "metadata": {}, 2731 | "outputs": [], 2732 | "source": [ 2733 | "unemployment_long = pd.melt(unemployment, # data to reshape\n", 2734 | " id_vars = 'Country', # identifier variable\n", 2735 | " var_name = 'Year', # column we want to create from the rows \n", 2736 | " value_name = 'Rate_Unemployed') # the values of interest" 2737 | ] 2738 | }, 2739 | { 2740 | "cell_type": "markdown", 2741 | "metadata": {}, 2742 | "source": [ 2743 | "Inspecting `unemployment_long` using `head()` shows that we have successfully created a long dataset." 2744 | ] 2745 | }, 2746 | { 2747 | "cell_type": "code", 2748 | "execution_count": 45, 2749 | "metadata": {}, 2750 | "outputs": [ 2751 | { 2752 | "data": { 2753 | "text/html": [ 2754 | "
\n", 2755 | "\n", 2768 | "\n", 2769 | " \n", 2770 | " \n", 2771 | " \n", 2772 | " \n", 2773 | " \n", 2774 | " \n", 2775 | " \n", 2776 | " \n", 2777 | " \n", 2778 | " \n", 2779 | " \n", 2780 | " \n", 2781 | " \n", 2782 | " \n", 2783 | " \n", 2784 | " \n", 2785 | " \n", 2786 | " \n", 2787 | " \n", 2788 | " \n", 2789 | " \n", 2790 | " \n", 2791 | " \n", 2792 | " \n", 2793 | " \n", 2794 | " \n", 2795 | " \n", 2796 | " \n", 2797 | " \n", 2798 | " \n", 2799 | " \n", 2800 | " \n", 2801 | " \n", 2802 | " \n", 2803 | " \n", 2804 | " \n", 2805 | " \n", 2806 | " \n", 2807 | " \n", 2808 | " \n", 2809 | "
CountryYearRate_Unemployed
0Albania201213.400
1Algeria201211.000
2Argentina20127.200
3Armenia201217.300
4Australia20125.217
\n", 2810 | "
" 2811 | ], 2812 | "text/plain": [ 2813 | " Country Year Rate_Unemployed\n", 2814 | "0 Albania 2012 13.400\n", 2815 | "1 Algeria 2012 11.000\n", 2816 | "2 Argentina 2012 7.200\n", 2817 | "3 Armenia 2012 17.300\n", 2818 | "4 Australia 2012 5.217" 2819 | ] 2820 | }, 2821 | "execution_count": 45, 2822 | "metadata": {}, 2823 | "output_type": "execute_result" 2824 | } 2825 | ], 2826 | "source": [ 2827 | "unemployment_long.head()" 2828 | ] 2829 | }, 2830 | { 2831 | "cell_type": "markdown", 2832 | "metadata": {}, 2833 | "source": [ 2834 | "## 4. Calculating year-over-year change in panel data" 2835 | ] 2836 | }, 2837 | { 2838 | "cell_type": "markdown", 2839 | "metadata": {}, 2840 | "source": [ 2841 | "Sort the data by `Country` and `Year` using the `sort_values()` function:" 2842 | ] 2843 | }, 2844 | { 2845 | "cell_type": "code", 2846 | "execution_count": 46, 2847 | "metadata": {}, 2848 | "outputs": [ 2849 | { 2850 | "data": { 2851 | "text/html": [ 2852 | "
\n", 2853 | "\n", 2866 | "\n", 2867 | " \n", 2868 | " \n", 2869 | " \n", 2870 | " \n", 2871 | " \n", 2872 | " \n", 2873 | " \n", 2874 | " \n", 2875 | " \n", 2876 | " \n", 2877 | " \n", 2878 | " \n", 2879 | " \n", 2880 | " \n", 2881 | " \n", 2882 | " \n", 2883 | " \n", 2884 | " \n", 2885 | " \n", 2886 | " \n", 2887 | " \n", 2888 | " \n", 2889 | " \n", 2890 | " \n", 2891 | " \n", 2892 | " \n", 2893 | " \n", 2894 | " \n", 2895 | " \n", 2896 | " \n", 2897 | " \n", 2898 | " \n", 2899 | " \n", 2900 | " \n", 2901 | " \n", 2902 | " \n", 2903 | " \n", 2904 | " \n", 2905 | " \n", 2906 | " \n", 2907 | "
CountryYearRate_Unemployed
0Albania201213.4
112Albania201316.0
224Albania201417.5
336Albania201517.1
448Albania201616.1
\n", 2908 | "
" 2909 | ], 2910 | "text/plain": [ 2911 | " Country Year Rate_Unemployed\n", 2912 | "0 Albania 2012 13.4\n", 2913 | "112 Albania 2013 16.0\n", 2914 | "224 Albania 2014 17.5\n", 2915 | "336 Albania 2015 17.1\n", 2916 | "448 Albania 2016 16.1" 2917 | ] 2918 | }, 2919 | "execution_count": 46, 2920 | "metadata": {}, 2921 | "output_type": "execute_result" 2922 | } 2923 | ], 2924 | "source": [ 2925 | "unemployment_long = unemployment_long.sort_values(['Country', 'Year'])\n", 2926 | "\n", 2927 | "unemployment_long.head()" 2928 | ] 2929 | }, 2930 | { 2931 | "cell_type": "markdown", 2932 | "metadata": { 2933 | "collapsed": true 2934 | }, 2935 | "source": [ 2936 | "Again, we can use `reset_index(drop = True)` to reset the row index so that the numbers next to the rows are in sequential order." 2937 | ] 2938 | }, 2939 | { 2940 | "cell_type": "code", 2941 | "execution_count": 47, 2942 | "metadata": {}, 2943 | "outputs": [ 2944 | { 2945 | "data": { 2946 | "text/html": [ 2947 | "
\n", 2948 | "\n", 2961 | "\n", 2962 | " \n", 2963 | " \n", 2964 | " \n", 2965 | " \n", 2966 | " \n", 2967 | " \n", 2968 | " \n", 2969 | " \n", 2970 | " \n", 2971 | " \n", 2972 | " \n", 2973 | " \n", 2974 | " \n", 2975 | " \n", 2976 | " \n", 2977 | " \n", 2978 | " \n", 2979 | " \n", 2980 | " \n", 2981 | " \n", 2982 | " \n", 2983 | " \n", 2984 | " \n", 2985 | " \n", 2986 | " \n", 2987 | " \n", 2988 | " \n", 2989 | " \n", 2990 | " \n", 2991 | " \n", 2992 | " \n", 2993 | " \n", 2994 | " \n", 2995 | " \n", 2996 | " \n", 2997 | " \n", 2998 | " \n", 2999 | " \n", 3000 | " \n", 3001 | " \n", 3002 | "
CountryYearRate_Unemployed
0Albania201213.4
1Albania201316.0
2Albania201417.5
3Albania201517.1
4Albania201616.1
\n", 3003 | "
" 3004 | ], 3005 | "text/plain": [ 3006 | " Country Year Rate_Unemployed\n", 3007 | "0 Albania 2012 13.4\n", 3008 | "1 Albania 2013 16.0\n", 3009 | "2 Albania 2014 17.5\n", 3010 | "3 Albania 2015 17.1\n", 3011 | "4 Albania 2016 16.1" 3012 | ] 3013 | }, 3014 | "execution_count": 47, 3015 | "metadata": {}, 3016 | "output_type": "execute_result" 3017 | } 3018 | ], 3019 | "source": [ 3020 | "unemployment_long = unemployment_long.reset_index(drop = True)\n", 3021 | "\n", 3022 | "unemployment_long.head()" 3023 | ] 3024 | }, 3025 | { 3026 | "cell_type": "markdown", 3027 | "metadata": {}, 3028 | "source": [ 3029 | "This type of data is known in time-series analysis as a panel; each country is observed every year from 2012 to 2016.\n", 3030 | "\n", 3031 | "For Albania, the percentage point change in unemployment rate from 2012 to 2013 would be 16 - 13.4 = 2.5 percentage points. What if I wanted the year-over-year change in unemployment rate for every country?\n", 3032 | "\n", 3033 | "We can use the [`diff()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.diff.html) function in `pandas` to do this. We can use `diff()` to calculate the difference between the `Rate_Unemployed` that year and the `Rate_Unemployed` for the year prior (the default for `lag()` is 1 period, which is good for us since we want the change from the previous year). We will save this difference into a new variable, `Change`." 
3034 | ] 3035 | }, 3036 | { 3037 | "cell_type": "code", 3038 | "execution_count": 48, 3039 | "metadata": {}, 3040 | "outputs": [], 3041 | "source": [ 3042 | "unemployment_long['Change'] = unemployment_long.Rate_Unemployed.diff()" 3043 | ] 3044 | }, 3045 | { 3046 | "cell_type": "markdown", 3047 | "metadata": {}, 3048 | "source": [ 3049 | "Let's inspect the first five rows again, using `head()`:" 3050 | ] 3051 | }, 3052 | { 3053 | "cell_type": "code", 3054 | "execution_count": 49, 3055 | "metadata": {}, 3056 | "outputs": [ 3057 | { 3058 | "data": { 3059 | "text/html": [ 3060 | "
\n", 3061 | "\n", 3074 | "\n", 3075 | " \n", 3076 | " \n", 3077 | " \n", 3078 | " \n", 3079 | " \n", 3080 | " \n", 3081 | " \n", 3082 | " \n", 3083 | " \n", 3084 | " \n", 3085 | " \n", 3086 | " \n", 3087 | " \n", 3088 | " \n", 3089 | " \n", 3090 | " \n", 3091 | " \n", 3092 | " \n", 3093 | " \n", 3094 | " \n", 3095 | " \n", 3096 | " \n", 3097 | " \n", 3098 | " \n", 3099 | " \n", 3100 | " \n", 3101 | " \n", 3102 | " \n", 3103 | " \n", 3104 | " \n", 3105 | " \n", 3106 | " \n", 3107 | " \n", 3108 | " \n", 3109 | " \n", 3110 | " \n", 3111 | " \n", 3112 | " \n", 3113 | " \n", 3114 | " \n", 3115 | " \n", 3116 | " \n", 3117 | " \n", 3118 | " \n", 3119 | " \n", 3120 | " \n", 3121 | "
CountryYearRate_UnemployedChange
0Albania201213.4NaN
1Albania201316.02.6
2Albania201417.51.5
3Albania201517.1-0.4
4Albania201616.1-1.0
\n", 3122 | "
" 3123 | ], 3124 | "text/plain": [ 3125 | " Country Year Rate_Unemployed Change\n", 3126 | "0 Albania 2012 13.4 NaN\n", 3127 | "1 Albania 2013 16.0 2.6\n", 3128 | "2 Albania 2014 17.5 1.5\n", 3129 | "3 Albania 2015 17.1 -0.4\n", 3130 | "4 Albania 2016 16.1 -1.0" 3131 | ] 3132 | }, 3133 | "execution_count": 49, 3134 | "metadata": {}, 3135 | "output_type": "execute_result" 3136 | } 3137 | ], 3138 | "source": [ 3139 | "unemployment_long.head()" 3140 | ] 3141 | }, 3142 | { 3143 | "cell_type": "markdown", 3144 | "metadata": {}, 3145 | "source": [ 3146 | "So far so good. It also makes sense that Albania's `Change` is `NaN` in 2012, since the dataset doesn't contain any unemployment figures before the year 2012.\n", 3147 | "\n", 3148 | "But a closer inspection of the data reveals a problem. What if we used `tail()` to look at the *last* 5 rows of the data?" 3149 | ] 3150 | }, 3151 | { 3152 | "cell_type": "code", 3153 | "execution_count": 50, 3154 | "metadata": {}, 3155 | "outputs": [ 3156 | { 3157 | "data": { 3158 | "text/html": [ 3159 | "
\n", 3160 | "\n", 3173 | "\n", 3174 | " \n", 3175 | " \n", 3176 | " \n", 3177 | " \n", 3178 | " \n", 3179 | " \n", 3180 | " \n", 3181 | " \n", 3182 | " \n", 3183 | " \n", 3184 | " \n", 3185 | " \n", 3186 | " \n", 3187 | " \n", 3188 | " \n", 3189 | " \n", 3190 | " \n", 3191 | " \n", 3192 | " \n", 3193 | " \n", 3194 | " \n", 3195 | " \n", 3196 | " \n", 3197 | " \n", 3198 | " \n", 3199 | " \n", 3200 | " \n", 3201 | " \n", 3202 | " \n", 3203 | " \n", 3204 | " \n", 3205 | " \n", 3206 | " \n", 3207 | " \n", 3208 | " \n", 3209 | " \n", 3210 | " \n", 3211 | " \n", 3212 | " \n", 3213 | " \n", 3214 | " \n", 3215 | " \n", 3216 | " \n", 3217 | " \n", 3218 | " \n", 3219 | " \n", 3220 | "
CountryYearRate_UnemployedChange
555Vietnam20122.74-18.493
556Vietnam20132.750.010
557Vietnam20142.05-0.700
558Vietnam20152.400.350
559Vietnam20162.400.000
\n", 3221 | "
" 3222 | ], 3223 | "text/plain": [ 3224 | " Country Year Rate_Unemployed Change\n", 3225 | "555 Vietnam 2012 2.74 -18.493\n", 3226 | "556 Vietnam 2013 2.75 0.010\n", 3227 | "557 Vietnam 2014 2.05 -0.700\n", 3228 | "558 Vietnam 2015 2.40 0.350\n", 3229 | "559 Vietnam 2016 2.40 0.000" 3230 | ] 3231 | }, 3232 | "execution_count": 50, 3233 | "metadata": {}, 3234 | "output_type": "execute_result" 3235 | } 3236 | ], 3237 | "source": [ 3238 | "unemployment_long.tail()" 3239 | ] 3240 | }, 3241 | { 3242 | "cell_type": "markdown", 3243 | "metadata": {}, 3244 | "source": [ 3245 | "**Why does Vietnam have a -18.493 percentage point change in 2012?**" 3246 | ] 3247 | }, 3248 | { 3249 | "cell_type": "markdown", 3250 | "metadata": {}, 3251 | "source": [ 3252 | "(Hint: use `tail()` to look at the last 6 rows of the data.)" 3253 | ] 3254 | }, 3255 | { 3256 | "cell_type": "code", 3257 | "execution_count": 51, 3258 | "metadata": {}, 3259 | "outputs": [ 3260 | { 3261 | "data": { 3262 | "text/html": [ 3263 | "
\n", 3264 | "\n", 3277 | "\n", 3278 | " \n", 3279 | " \n", 3280 | " \n", 3281 | " \n", 3282 | " \n", 3283 | " \n", 3284 | " \n", 3285 | " \n", 3286 | " \n", 3287 | " \n", 3288 | " \n", 3289 | " \n", 3290 | " \n", 3291 | " \n", 3292 | " \n", 3293 | " \n", 3294 | " \n", 3295 | " \n", 3296 | " \n", 3297 | " \n", 3298 | " \n", 3299 | " \n", 3300 | " \n", 3301 | " \n", 3302 | " \n", 3303 | " \n", 3304 | " \n", 3305 | " \n", 3306 | " \n", 3307 | " \n", 3308 | " \n", 3309 | " \n", 3310 | " \n", 3311 | " \n", 3312 | " \n", 3313 | " \n", 3314 | " \n", 3315 | " \n", 3316 | " \n", 3317 | " \n", 3318 | " \n", 3319 | " \n", 3320 | " \n", 3321 | " \n", 3322 | " \n", 3323 | " \n", 3324 | "
CountryYearRate_UnemployedChange
555Vietnam20122.74NaN
556Vietnam20132.750.01
557Vietnam20142.05-0.70
558Vietnam20152.400.35
559Vietnam20162.400.00
\n", 3325 | "
" 3326 | ], 3327 | "text/plain": [ 3328 | " Country Year Rate_Unemployed Change\n", 3329 | "555 Vietnam 2012 2.74 NaN\n", 3330 | "556 Vietnam 2013 2.75 0.01\n", 3331 | "557 Vietnam 2014 2.05 -0.70\n", 3332 | "558 Vietnam 2015 2.40 0.35\n", 3333 | "559 Vietnam 2016 2.40 0.00" 3334 | ] 3335 | }, 3336 | "execution_count": 51, 3337 | "metadata": {}, 3338 | "output_type": "execute_result" 3339 | } 3340 | ], 3341 | "source": [ 3342 | "unemployment_long['Change'] = (unemployment_long\n", 3343 | " .groupby('Country')\n", 3344 | " .Rate_Unemployed.diff())\n", 3345 | "\n", 3346 | "unemployment_long.tail()" 3347 | ] 3348 | }, 3349 | { 3350 | "cell_type": "markdown", 3351 | "metadata": {}, 3352 | "source": [ 3353 | "(Also notice how I put the entire expression in parentheses and put each function on a different line for readability.)" 3354 | ] 3355 | }, 3356 | { 3357 | "cell_type": "markdown", 3358 | "metadata": {}, 3359 | "source": [ 3360 | "## 5. Recoding numerical variables into categorical ones" 3361 | ] 3362 | }, 3363 | { 3364 | "cell_type": "markdown", 3365 | "metadata": {}, 3366 | "source": [ 3367 | "Here's a list of some attendees for the 2016 workshop, with names and contact info removed." 3368 | ] 3369 | }, 3370 | { 3371 | "cell_type": "code", 3372 | "execution_count": 52, 3373 | "metadata": {}, 3374 | "outputs": [ 3375 | { 3376 | "data": { 3377 | "text/html": [ 3378 | "
\n", 3379 | "\n", 3392 | "\n", 3393 | " \n", 3394 | " \n", 3395 | " \n", 3396 | " \n", 3397 | " \n", 3398 | " \n", 3399 | " \n", 3400 | " \n", 3401 | " \n", 3402 | " \n", 3403 | " \n", 3404 | " \n", 3405 | " \n", 3406 | " \n", 3407 | " \n", 3408 | " \n", 3409 | " \n", 3410 | " \n", 3411 | " \n", 3412 | " \n", 3413 | " \n", 3414 | " \n", 3415 | " \n", 3416 | " \n", 3417 | " \n", 3418 | " \n", 3419 | " \n", 3420 | " \n", 3421 | " \n", 3422 | " \n", 3423 | " \n", 3424 | " \n", 3425 | " \n", 3426 | " \n", 3427 | " \n", 3428 | " \n", 3429 | " \n", 3430 | " \n", 3431 | " \n", 3432 | " \n", 3433 | " \n", 3434 | " \n", 3435 | " \n", 3436 | " \n", 3437 | " \n", 3438 | " \n", 3439 | " \n", 3440 | " \n", 3441 | " \n", 3442 | " \n", 3443 | " \n", 3444 | " \n", 3445 | " \n", 3446 | " \n", 3447 | " \n", 3448 | " \n", 3449 | " \n", 3450 | " \n", 3451 | " \n", 3452 | " \n", 3453 | " \n", 3454 | " \n", 3455 | " \n", 3456 | " \n", 3457 | " \n", 3458 | " \n", 3459 | " \n", 3460 | " \n", 3461 | " \n", 3462 | " \n", 3463 | " \n", 3464 | " \n", 3465 | " \n", 3466 | " \n", 3467 | " \n", 3468 | " \n", 3469 | " \n", 3470 | " \n", 3471 | " \n", 3472 | " \n", 3473 | " \n", 3474 | " \n", 3475 | " \n", 3476 | " \n", 3477 | " \n", 3478 | " \n", 3479 | " \n", 3480 | " \n", 3481 | " \n", 3482 | " \n", 3483 | " \n", 3484 | " \n", 3485 | " \n", 3486 | " \n", 3487 | " \n", 3488 | " \n", 3489 | " \n", 3490 | " \n", 3491 | " \n", 3492 | " \n", 3493 | " \n", 3494 | " \n", 3495 | " \n", 3496 | " \n", 3497 | " \n", 3498 | " \n", 3499 | " \n", 3500 | " \n", 3501 | " \n", 3502 | " \n", 3503 | " \n", 3504 | " \n", 3505 | "
OccupationJob titleAge groupGenderState/ProvinceEducationWhich data subject area are you most interested in working with? (Select up to three)What do you hope to get out of the workshop?Which type of laptop will you bring?College or University NameMajor or ConcentrationCollege YearWhich Digital Badge track best suits you?Which session would you like to attend?Choose your status:
0Data AnalystData Quality Analyst30-39MaleMABachelor's DegreeRetailotherPCNaNNaNNaNAdvanced Data StorytellingJune 5-9Nonprofit, Academic, Government
1PhD StudentStudent/Research Assistant18-29MaleMABachelor's DegreeSportsMaster Advanced RPCBoston UniversityBiostatisticsPhDAdvanced Data StorytellingJune 5-9Student
2EducationData Analyst18-29FemaleKentuckyMaster's DegreeRetailotherPCNaNNaNNaNAdvanced Data StorytellingJune 5-9Nonprofit, Academic, Government
3ManagerBAS Manager30-39MaleMABachelor's DegreeEducationPick up Beginning R And SQLPCBoston UniversityPEMBAGraduateAdvanced Data StorytellingJune 5-9Student
4Government FinancePerformance Analyst30 - 39MaleMAMaster's DegreeEnvironment, Finance, Food and agriculturePick up Beginning R And SQLMACNaNNaNNaNAdvanced Data StorytellingJune 5-9Nonprofit, Academic, Government Early Bird
\n", 3506 | "
" 3507 | ], 3508 | "text/plain": [ 3509 | " Occupation Job title Age group Gender \\\n", 3510 | "0 Data Analyst Data Quality Analyst 30-39 Male \n", 3511 | "1 PhD Student Student/Research Assistant 18-29 Male \n", 3512 | "2 Education Data Analyst 18-29 Female \n", 3513 | "3 Manager BAS Manager 30-39 Male \n", 3514 | "4 Government Finance Performance Analyst 30 - 39 Male \n", 3515 | "\n", 3516 | " State/Province Education \\\n", 3517 | "0 MA Bachelor's Degree \n", 3518 | "1 MA Bachelor's Degree \n", 3519 | "2 Kentucky Master's Degree \n", 3520 | "3 MA Bachelor's Degree \n", 3521 | "4 MA Master's Degree \n", 3522 | "\n", 3523 | " Which data subject area are you most interested in working with? (Select up to three) \\\n", 3524 | "0 Retail \n", 3525 | "1 Sports \n", 3526 | "2 Retail \n", 3527 | "3 Education \n", 3528 | "4 Environment, Finance, Food and agriculture \n", 3529 | "\n", 3530 | " What do you hope to get out of the workshop? \\\n", 3531 | "0 other \n", 3532 | "1 Master Advanced R \n", 3533 | "2 other \n", 3534 | "3 Pick up Beginning R And SQL \n", 3535 | "4 Pick up Beginning R And SQL \n", 3536 | "\n", 3537 | " Which type of laptop will you bring? College or University Name \\\n", 3538 | "0 PC NaN \n", 3539 | "1 PC Boston University \n", 3540 | "2 PC NaN \n", 3541 | "3 PC Boston University \n", 3542 | "4 MAC NaN \n", 3543 | "\n", 3544 | " Major or Concentration College Year \\\n", 3545 | "0 NaN NaN \n", 3546 | "1 Biostatistics PhD \n", 3547 | "2 NaN NaN \n", 3548 | "3 PEMBA Graduate \n", 3549 | "4 NaN NaN \n", 3550 | "\n", 3551 | " Which Digital Badge track best suits you? \\\n", 3552 | "0 Advanced Data Storytelling \n", 3553 | "1 Advanced Data Storytelling \n", 3554 | "2 Advanced Data Storytelling \n", 3555 | "3 Advanced Data Storytelling \n", 3556 | "4 Advanced Data Storytelling \n", 3557 | "\n", 3558 | " Which session would you like to attend? 
\\\n", 3559 | "0 June 5-9 \n", 3560 | "1 June 5-9 \n", 3561 | "2 June 5-9 \n", 3562 | "3 June 5-9 \n", 3563 | "4 June 5-9 \n", 3564 | "\n", 3565 | " Choose your status: \n", 3566 | "0 Nonprofit, Academic, Government \n", 3567 | "1 Student \n", 3568 | "2 Nonprofit, Academic, Government \n", 3569 | "3 Student \n", 3570 | "4 Nonprofit, Academic, Government Early Bird " 3571 | ] 3572 | }, 3573 | "execution_count": 52, 3574 | "metadata": {}, 3575 | "output_type": "execute_result" 3576 | } 3577 | ], 3578 | "source": [ 3579 | "attendees = pd.read_csv('attendees.csv')\n", 3580 | "\n", 3581 | "attendees.head()" 3582 | ] 3583 | }, 3584 | { 3585 | "cell_type": "markdown", 3586 | "metadata": {}, 3587 | "source": [ 3588 | "**What if we wanted to quickly see the age distribution of attendees?**" 3589 | ] 3590 | }, 3591 | { 3592 | "cell_type": "code", 3593 | "execution_count": 53, 3594 | "metadata": {}, 3595 | "outputs": [ 3596 | { 3597 | "data": { 3598 | "text/plain": [ 3599 | "30-39 7\n", 3600 | "18-29 4\n", 3601 | "30 - 39 1\n", 3602 | "Name: Age group, dtype: int64" 3603 | ] 3604 | }, 3605 | "execution_count": 53, 3606 | "metadata": {}, 3607 | "output_type": "execute_result" 3608 | } 3609 | ], 3610 | "source": [ 3611 | "attendees['Age group'].value_counts()" 3612 | ] 3613 | }, 3614 | { 3615 | "cell_type": "markdown", 3616 | "metadata": {}, 3617 | "source": [ 3618 | "There's an inconsistency in the labeling of the `Age group` variable here. We can fix this using `np.where()` in the `numpy` library. First, let's import the `numpy` library. Like `pandas`, `numpy` has a commonly used alias — `np`." 
3619 | ] 3620 | }, 3621 | { 3622 | "cell_type": "code", 3623 | "execution_count": 54, 3624 | "metadata": {}, 3625 | "outputs": [], 3626 | "source": [ 3627 | "import numpy as np" 3628 | ] 3629 | }, 3630 | { 3631 | "cell_type": "code", 3632 | "execution_count": 55, 3633 | "metadata": {}, 3634 | "outputs": [], 3635 | "source": [ 3636 | "attendees['Age group'] = np.where(attendees['Age group'] == '30 - 39', # where attendees['Age group'] == '30 - 39'\n", 3637 | " '30-39', # replace attendees['Age group'] with '30-39'\n", 3638 | " attendees['Age group']) # otherwise, keep attendees['Age group'] values the same" 3639 | ] 3640 | }, 3641 | { 3642 | "cell_type": "markdown", 3643 | "metadata": {}, 3644 | "source": [ 3645 | "This might seem trivial for just one value, but it's useful for larger datasets." 3646 | ] 3647 | }, 3648 | { 3649 | "cell_type": "code", 3650 | "execution_count": 56, 3651 | "metadata": {}, 3652 | "outputs": [ 3653 | { 3654 | "data": { 3655 | "text/plain": [ 3656 | "30-39 8\n", 3657 | "18-29 4\n", 3658 | "Name: Age group, dtype: int64" 3659 | ] 3660 | }, 3661 | "execution_count": 56, 3662 | "metadata": {}, 3663 | "output_type": "execute_result" 3664 | } 3665 | ], 3666 | "source": [ 3667 | "attendees['Age group'].value_counts()" 3668 | ] 3669 | }, 3670 | { 3671 | "cell_type": "markdown", 3672 | "metadata": {}, 3673 | "source": [ 3674 | "Now let's take a look at the professional status of attendees, labeled in `Choose your status:`" 3675 | ] 3676 | }, 3677 | { 3678 | "cell_type": "code", 3679 | "execution_count": 57, 3680 | "metadata": {}, 3681 | "outputs": [ 3682 | { 3683 | "data": { 3684 | "text/plain": [ 3685 | "Student 5\n", 3686 | "Nonprofit, Academic, Government 3\n", 3687 | "Professional 3\n", 3688 | "Nonprofit, Academic, Government Early Bird 1\n", 3689 | "Name: Choose your status:, dtype: int64" 3690 | ] 3691 | }, 3692 | "execution_count": 57, 3693 | "metadata": {}, 3694 | "output_type": "execute_result" 3695 | } 3696 | ], 3697 | "source": [ 3698 
| "attendees['Choose your status:'].value_counts()" 3699 | ] 3700 | }, 3701 | { 3702 | "cell_type": "markdown", 3703 | "metadata": {}, 3704 | "source": [ 3705 | "\"Nonprofit, Academic, Government\" and \"Nonprofit, Academic, Government Early Bird\" seem to be the same. We can use `np.where()` (and the Python designation `|` for \"or\") to combine these two categories into one big category, \"Nonprofit/Gov\". Let's create a new variable, `status`, for our simplified categorization.\n", 3706 | "\n", 3707 | "Notice the extra sets of parentheses around the two conditions linked by the `|` symbol." 3708 | ] 3709 | }, 3710 | { 3711 | "cell_type": "code", 3712 | "execution_count": 58, 3713 | "metadata": {}, 3714 | "outputs": [], 3715 | "source": [ 3716 | "attendees['status'] = np.where((attendees['Choose your status:'] == 'Nonprofit, Academic, Government') |\n", 3717 | " (attendees['Choose your status:'] == 'Nonprofit, Academic, Government Early Bird'),\n", 3718 | " 'Nonprofit/Gov', \n", 3719 | " attendees['Choose your status:'])" 3720 | ] 3721 | }, 3722 | { 3723 | "cell_type": "code", 3724 | "execution_count": 59, 3725 | "metadata": {}, 3726 | "outputs": [ 3727 | { 3728 | "data": { 3729 | "text/plain": [ 3730 | "Student 5\n", 3731 | "Nonprofit/Gov 4\n", 3732 | "Professional 3\n", 3733 | "Name: status, dtype: int64" 3734 | ] 3735 | }, 3736 | "execution_count": 59, 3737 | "metadata": {}, 3738 | "output_type": "execute_result" 3739 | } 3740 | ], 3741 | "source": [ 3742 | "attendees['status'].value_counts()" 3743 | ] 3744 | }, 3745 | { 3746 | "cell_type": "markdown", 3747 | "metadata": {}, 3748 | "source": [ 3749 | "## What else?\n", 3750 | "\n", 3751 | "- How would you create a new variable in the `attendees` data (let's call it `status2`) that has just two categories, \"Student\" and \"Other\"?\n", 3752 | "\n", 3753 | "- How would you rename the variables in the `attendees` data to make them easier to work with?\n", 3754 | "\n", 3755 | "- What are some other issues with 
this dataset? How would you solve them using what we've learned?\n", 3756 | "\n", 3757 | "- What are some other \"messy\" data issues you've encountered?" 3758 | ] 3759 | } 3760 | ], 3761 | "metadata": { 3762 | "kernelspec": { 3763 | "display_name": "Python 3", 3764 | "language": "python", 3765 | "name": "python3" 3766 | }, 3767 | "language_info": { 3768 | "codemirror_mode": { 3769 | "name": "ipython", 3770 | "version": 3 3771 | }, 3772 | "file_extension": ".py", 3773 | "mimetype": "text/x-python", 3774 | "name": "python", 3775 | "nbconvert_exporter": "python", 3776 | "pygments_lexer": "ipython3", 3777 | "version": "3.6.2" 3778 | } 3779 | }, 3780 | "nbformat": 4, 3781 | "nbformat_minor": 2 3782 | } 3783 | -------------------------------------------------------------------------------- /pandas-data-cleaning-tricks.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/underthecurve/pandas-data-cleaning-tricks/cd96cd85ac5b941eb077bbf6bc30b51da0e3f29b/pandas-data-cleaning-tricks.pdf -------------------------------------------------------------------------------- /pandas-data-cleaning-tricks.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # # Tricks for cleaning your data in Python using pandas 5 | # 6 | # **By Christine Zhang ([ychristinezhang at gmail dot com](mailto:ychristinezhang@gmail.com) | [@christinezhang](https://twitter.com/christinezhang))** 7 | 8 | # GitHub repository for Data+Code: https://github.com/underthecurve/pandas-data-cleaning-tricks 9 | # 10 | # In 2017 I gave a talk called "Tricks for cleaning your data in R" which I presented at the [Data+Narrative workshop](http://www.bu.edu/com/data-narrative/) at Boston University.
The repo with the code and data, https://github.com/underthecurve/r-data-cleaning-tricks, was pretty well-received, so I figured I'd try to do some of the same stuff in Python using `pandas`. 11 | # 12 | # **Disclaimer:** when it comes to data stuff, I'm much better with R, especially the `tidyverse` set of packages, than with Python, but in my last job I used Python's `pandas` library to do a lot of data processing since Python was the dominant language there. 13 | # 14 | # Anyway, here goes: 15 | # 16 | # Data cleaning is a cumbersome task, and it can be hard to know where to start in a programming language like Python. 17 | # 18 | # The `pandas` library in Python is a powerful tool for data cleaning and analysis. Because every step you take is written as code, you leave a trail that documents all the work you've done, which makes it extremely useful for creating reproducible workflows. 19 | # 20 | # In this workshop, I'll show you some examples of real-life "messy" datasets, the problems they present for analysis in Python's `pandas` library, and some of the solutions to these problems. 21 | # 22 | # Fittingly, I'll [start the numbering system at 0](http://python-history.blogspot.com/2013/10/why-python-uses-0-based-indexing.html). 23 | 24 | # ## 0. Importing the `pandas` library 25 | 26 | # Here I tell Python to import the `pandas` library as `pd` (a common alias for `pandas` — more on that in the next code chunk). 27 | 28 | # In[1]: 29 | 30 | 31 | import pandas as pd 32 | 33 | 34 | # ## 1. Finding and replacing non-numeric characters like `,` and `$` 35 | 36 | # Let's check out the city of Boston's [Open Data portal](https://data.boston.gov/), where the local government puts up datasets that are free for the public to analyze. 37 | # 38 | # The [Employee Earnings Report](https://data.boston.gov/dataset/employee-earnings-report) is one of the more interesting ones, because it gives payroll data for every person on the municipal payroll.
It's where the *Boston Globe* gets stories like these every year: 39 | # 40 | # - ["64 City of Boston workers earn more than $250,000"](https://www.bostonglobe.com/metro/2016/02/05/city-boston-workers-earn-more-than/MvW6RExJZimdrTlwdwUI7M/story.html) (February 6, 2016) 41 | # 42 | # - ["Police detective tops Boston’s payroll with a total of over $403,000"](https://www.bostonglobe.com/metro/2017/02/14/police-detective-tops-boston-payroll-with-total-over/6PaXwTAHZGEW5djgwCJuTI/story.html) (February 14, 2017) 43 | # 44 | # Let's take a look at the February 14 story from 2017. The story begins: 45 | # 46 | # > "A veteran police detective took home more than $403,000 in earnings last year, topping the list of Boston’s highest-paid employees in 2016, newly released city payroll data show." 47 | # 48 | # **What if we wanted to check this number using the Employee Earnings Report?** 49 | 50 | # We can use the `pandas` function [`pandas.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to load the CSV file into Python. We will call this DataFrame `salary`. Remember that I imported `pandas` "as `pd`" in the last code chunk. This saves me a bit of typing by allowing me to access `pandas` functions like `pandas.read_csv()` by typing `pd.read_csv()` instead. If I had typed `import pandas` in the code chunk under section `0` without `as pd`, the below code wouldn't work. I'd have to instead write `pandas.read_csv()` to access the function. 51 | # 52 | # The `pd` alias for `pandas` is so common that the library's [documentation](http://pandas.pydata.org/pandas-docs/stable/install.html#running-the-test-suite) even uses it sometimes. 53 | # 54 | # Let's try to use `pd.read_csv()`: 55 | 56 | # In[2]: 57 | 58 | 59 | salary = pd.read_csv('employee-earnings-report-2016.csv') 60 | 61 | 62 | # That's a pretty long and nasty error.
Usually when I run into something like this, I start from the bottom and work my way up — in this case, I typed `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 22: invalid continuation byte` into a search engine and came across [this discussion on the Stack Overflow forum](https://stackoverflow.com/questions/30462807/encoding-error-in-panda-read-csv). The last response suggested that adding `encoding = 'latin-1'` inside the function would fix the problem on Macs (which is the type of computer I have). 63 | 64 | # In[3]: 65 | 66 | 67 | salary = pd.read_csv('employee-earnings-report-2016.csv', encoding = 'latin-1') 68 | 69 | 70 | # Great! (I don't know much about encoding, but this is something I run into from time to time so I thought it would be helpful to show here.) 71 | # 72 | # We can use `head()` on the `salary` DataFrame to inspect the first five rows of `salary`. (Note I use `print()` to display the output, but you don't need to do this in your own code if you'd prefer not to.) 73 | 74 | # In[4]: 75 | 76 | 77 | print(salary.head()) 78 | 79 | 80 | # There are a lot of columns. Let's simplify by selecting the ones of interest: `NAME`, `DEPARTMENT_NAME`, and `TOTAL EARNINGS`. There are [a few different ways](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c) of doing this with `pandas`. The simplest way, imo, is by using the indexing operator `[]`. 81 | # 82 | # For example, I could select a single column, `NAME`: (Note I also run the line `pd.options.display.max_rows = 20` in order to display a maximum of 20 rows so the output isn't too crowded.) 83 | 84 | # In[5]: 85 | 86 | 87 | pd.options.display.max_rows = 20 88 | 89 | salary['NAME'] 90 | 91 | 92 | # This works for selecting one column at a time, but using `[]` returns a [Series](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series), not a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).
I can confirm this using the `type()` function: 93 | 94 | # In[6]: 95 | 96 | 97 | type(salary['NAME']) 98 | 99 | 100 | # If I want a DataFrame, I have to use double brackets: 101 | 102 | # In[7]: 103 | 104 | 105 | salary[['NAME']] 106 | 107 | 108 | # In[8]: 109 | 110 | 111 | type(salary[['NAME']]) 112 | 113 | 114 | # To select multiple columns, we can put those columns inside of the second pair of brackets. We will save this into a new DataFrame, `salary_selected`. We type `.copy()` after `salary[['NAME','DEPARTMENT_NAME', 'TOTAL EARNINGS']]` because we are making a copy of the DataFrame and assigning it to a new DataFrame. Learn more about `copy()` [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html). 115 | 116 | # In[9]: 117 | 118 | 119 | salary_selected = salary[['NAME','DEPARTMENT_NAME', 'TOTAL EARNINGS']].copy() 120 | 121 | 122 | # We can also change the column names to lowercase for easier typing. First, let's take a look at the columns by displaying the `columns` attribute of the `salary_selected` DataFrame. 123 | 124 | # In[10]: 125 | 126 | 127 | salary_selected.columns 128 | 129 | 130 | # Notice how this returns something called an "Index." In `pandas`, DataFrames have both row indexes (in our case, the row number, starting from 0 and going to 22045) and column indexes. We can use the [`str.lower()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.lower.html) function to convert the strings (aka characters) in the index to lowercase. 136 | 137 | 138 | # In[12]: 139 | 140 | 141 | salary_selected.columns = salary_selected.columns.str.lower() 142 | 143 | salary_selected.columns 144 | 145 | 146 | # Another thing that will make our lives easier is if the `total earnings` column didn't have a space between `total` and `earnings`.
We can use a "string replace" function, [`str.replace()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html), to replace the space with an underscore. The syntax is: `str.replace('thing you want to replace', 'what to replace it with')`. Note that we assign the result back to `salary_selected.columns` so the change sticks: 147 | 148 | # In[13]: 149 | 150 | 151 | salary_selected.columns = salary_selected.columns.str.replace(' ', '_') 152 | 153 | salary_selected.columns 154 | 155 | 156 | # We could have used both the `str.lower()` and `str.replace()` functions in one line of code by putting them one after the other (aka "chaining"): 157 | 158 | # In[14]: 159 | 160 | 161 | salary_selected.columns = salary_selected.columns.str.lower().str.replace(' ', '_') 162 | 163 | salary_selected.columns 164 | 165 | 166 | # Let's use `head()` to visually inspect the first five rows of `salary_selected`: 167 | 168 | # In[15]: 169 | 170 | 171 | print(salary_selected.head()) 172 | 173 | 174 | # Now let's try sorting the data by `total_earnings` using the [`sort_values()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) function in `pandas`: 175 | 176 | # In[16]: 177 | 178 | 179 | salary_sort = salary_selected.sort_values('total_earnings') 180 | 181 | 182 | # We can use `head()` to visually inspect `salary_sort`: 183 | 184 | # In[17]: 185 | 186 | 187 | print(salary_sort.head()) 188 | 189 | 190 | # At first glance, it looks okay. The employees appear to be sorted by `total_earnings` from lowest to highest. If this were the case, we'd expect the last row of the `salary_sort` DataFrame to contain the employee with the highest salary. Let's take a look at the last five rows using `tail()`. 191 | 192 | # In[18]: 193 | 194 | 195 | print(salary_sort.tail()) 196 | 197 | 198 | # **What went wrong?** 199 | # 200 | # The problem is that there are non-numeric characters, `,` and `$`, in the `total_earnings` column.
We can see with [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html), which returns the data type of each column in the DataFrame, that `total_earnings` is recognized as an "object". 201 | 202 | # In[19]: 203 | 204 | 205 | salary_selected.dtypes 206 | 207 | 208 | # [Here](http://pbpython.com/pandas_dtypes.html) is an overview of `pandas` data types. Basically, being labeled an "object" means that the column is not being recognized as containing numbers. 209 | 210 | # We need to find the `,` and `$` in `total_earnings` and remove them. The `str.replace()` function, which we used above when renaming the columns, lets us do this. 211 | # 212 | # Let's start by removing the comma and write the result to the original column. (The format for calling a column from a DataFrame in `pandas` is `DataFrame['column_name']`) 213 | 214 | # In[20]: 215 | 216 | 217 | salary_selected['total_earnings'] = salary_selected['total_earnings'].str.replace(',', '') 218 | 219 | 220 | # Using `head()` to visually inspect `salary_selected`, we see that the commas are gone: 221 | 222 | # In[21]: 223 | 224 | 225 | print(salary_selected.head()) # this works - the commas are gone 226 | 227 | 228 | # Let's do the same thing with the dollar sign `$`. One wrinkle: `$` is a special character in regular expressions, so we pass `regex = False` (available in `pandas` 0.23 and later) to make sure it's replaced literally: 229 | 230 | # In[22]: 231 | 232 | 233 | salary_selected['total_earnings'] = salary_selected['total_earnings'].str.replace('$', '', regex = False) 234 | 235 | 236 | # Using `head()` to visually inspect `salary_selected`, we see that the dollar signs are gone: 237 | 238 | # In[23]: 239 | 240 | 241 | salary_selected.head() 242 | 243 | 244 | # **Now can we use `sort_values()` to sort the data by `total_earnings`?** 245 | 246 | # In[24]: 247 | 248 | 249 | salary_sort = salary_selected.sort_values('total_earnings') 250 | 251 | salary_sort.head() 252 | 253 | 254 | # In[25]: 255 | 256 | 257 | salary_sort.tail() 258 | 259 | 260 | # Again, at first glance, the employees appear to be sorted by `total_earnings` from lowest to highest.
But that would imply that John M. Bresnahan was the highest-paid employee, making 99,997.38 dollars in 2016, while the *Boston Globe* [story](https://www.bostonglobe.com/metro/2017/02/14/police-detective-tops-boston-payroll-with-total-over/6PaXwTAHZGEW5djgwCJuTI/story.html) said the highest-paid city employee made more than 403,000 dollars. 261 | 262 | # **What's the problem?** 263 | # 264 | # Again, we can use `dtypes` to check on how the `total_earnings` variable is encoded. 265 | 266 | # In[26]: 267 | 268 | 269 | salary_sort.dtypes 270 | 271 | 272 | # It's still an "object" (still not numeric), because we didn't tell `pandas` that it should be numeric. We can do this with `pd.to_numeric()`: 273 | 274 | # In[27]: 275 | 276 | 277 | salary_sort['total_earnings'] = pd.to_numeric(salary_sort['total_earnings']) 278 | 279 | 280 | # Now let's run `dtypes` again: 281 | 282 | # In[28]: 283 | 284 | 285 | salary_sort.dtypes 286 | 287 | 288 | # "float64" means ["floating point numbers"](http://pbpython.com/pandas_dtypes.html) — this is what we want. 289 | 290 | # Now let's sort using `sort_values()`. 291 | 292 | # In[29]: 293 | 294 | 295 | salary_sort = salary_sort.sort_values('total_earnings') 296 | 297 | salary_sort.head() # ascending order by default 298 | 299 | 300 | # One last thing: we have to specify `ascending = False` within `sort_values()` because the function by default sorts the data in ascending order. 301 | 302 | # In[30]: 303 | 304 | 305 | salary_sort = salary_sort.sort_values('total_earnings', ascending = False) 306 | 307 | salary_sort.head() # descending order 308 | 309 | 310 | # We see that Waiman Lee from the Boston PD is the top earner, with total earnings of more than $403,408 in 2016, just as the *Boston Globe* [article](https://www.bostonglobe.com/metro/2017/02/14/police-detective-tops-boston-payroll-with-total-over/6PaXwTAHZGEW5djgwCJuTI/story.html) states. 311 | 312 | # A bonus thing: maybe it bothers you that the numbers next to each row are no longer in any numeric order.
This is because these numbers are the row index of the DataFrame — basically the order that they were in prior to being sorted. In order to reset these numbers, we can use the [`reset_index()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html) function on the `salary_sort` DataFrame. We include `drop = True` as a parameter of the function to prevent the old index from being added as a column in the DataFrame. 313 | 314 | # In[31]: 315 | 316 | 317 | salary_sort = salary_sort.reset_index(drop = True) 318 | 319 | salary_sort.head() # index is reset 320 | 321 | 322 | # The Boston Police Department has a lot of high earners. We can figure out the average earnings by department, which we'll call `salary_average`, by using the `groupby` and `mean()` functions in `pandas`. 323 | 324 | # In[32]: 325 | 326 | 327 | salary_average = salary_sort.groupby('department_name').mean() # note: newer pandas versions may require mean(numeric_only = True) 328 | 329 | 330 | # In[33]: 331 | 332 | 333 | salary_average 334 | 335 | 336 | 337 | 338 | # Notice that `pandas` by default sets the `department_name` column as the row index of the `salary_average` DataFrame. I personally don't love this and would rather have a straight-up DataFrame with the row numbers as the index, so I usually run `reset_index()` to get rid of this indexing: 339 | 340 | # In[34]: 341 | 342 | 343 | salary_average = salary_average.reset_index() # reset_index 344 | 345 | salary_average 346 | 347 | 348 | # We should also rename the `total_earnings` column to `dept_average` to avoid confusion. We can do this using [`rename()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html). The syntax for `rename()` is `DataFrame.rename(columns = {'current column name':'new column name'})`.
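As a quick, self-contained illustration of that `rename()` syntax before we apply it to the real data (the two-column DataFrame here is invented for the example, not the Boston salary file):

```python
import pandas as pd

# a tiny invented DataFrame (not the Boston salary data)
toy = pd.DataFrame({'total_earnings': [100.0, 200.0],
                    'department_name': ['A', 'B']})

# rename() returns a *new* DataFrame, so assign the result back
# (or the original would be left unchanged)
toy = toy.rename(columns = {'total_earnings': 'dept_average'})

print(toy.columns.tolist())  # ['dept_average', 'department_name']
```

One thing to watch out for: keys that don't match an existing column are silently ignored by default, so a typo in the old name does nothing. Recent `pandas` versions accept `errors = 'raise'` to catch this.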
349 | 350 | # In[35]: 351 | 352 | 353 | salary_average = salary_average.rename(columns = {'total_earnings': 'dept_average'}) 354 | 355 | 356 | # In[36]: 357 | 358 | 359 | salary_average 360 | 361 | 362 | # We can find the row for the Boston Police Department by selecting on `department_name`. Find out more about selecting based on attributes [here](https://chrisalbon.com/python/data_wrangling/pandas_selecting_rows_on_conditions/). 363 | 364 | # In[37]: 365 | 366 | 367 | salary_average[salary_average['department_name'] == 'Boston Police Department'] 368 | 369 | 370 | # Now is a good time to revisit "chaining." Notice how we did three things in creating `salary_average`: 371 | # 1. Grouped the `salary_sort` DataFrame by `department_name` and calculated the mean of the numeric columns (in our case, `total_earnings`) using `groupby()` and `mean()`. 372 | # 2. Used `reset_index()` on the resulting DataFrame so that `department_name` would no longer be the row index. 373 | # 3. Renamed the `total_earnings` column to `dept_average` to avoid confusion using `rename()`. 374 | # 375 | # In fact, we can do these three things all at once, by chaining the functions together: 376 | 377 | # In[38]: 378 | 379 | 380 | salary_sort.groupby('department_name').mean().reset_index().rename(columns = {'total_earnings':'dept_average'}) 381 | 382 | 383 | # That's a pretty long line of code. To make it more readable, we can split it up into separate lines. I like to do this by putting the whole expression in parentheses and splitting it up right before each of the functions, which are delineated by the periods: 384 | 385 | # In[39]: 386 | 387 | 388 | (salary_sort.groupby('department_name') 389 | .mean() 390 | .reset_index() 391 | .rename(columns = {'total_earnings':'dept_average'})) 392 | 393 | 394 | # ## 2. Merging datasets 395 | 396 | # Now we have two main datasets, `salary_sort` (the salary for each person, sorted from high to low) and `salary_average` (the average salary for each department).
What if I wanted to merge these two together, so I could see side-by-side each person's salary compared to the average for their department? 397 | # 398 | # We want to join by the `department_name` variable, since that is consistent across both datasets. Let's put the merged data into a new DataFrame, `salary_merged`: 399 | 400 | # In[40]: 401 | 402 | 403 | salary_merged = pd.merge(salary_sort, salary_average, on = 'department_name') 404 | 405 | 406 | # Now we can see the department average, `dept_average`, next to the individual's salary, `total_earnings`: 407 | 408 | # In[41]: 409 | 410 | 411 | salary_merged.head() 412 | 413 | 414 | # ## 3. Reshaping data 415 | 416 | # Here's a dataset on unemployment rates by country from 2012 to 2016, from the International Monetary Fund's World Economic Outlook database (available [here](https://www.imf.org/external/pubs/ft/weo/2017/01/weodata/index.aspx)). 417 | # 418 | # When you download the dataset, it comes in an Excel file. We can use the `pd.read_excel()` function from `pandas` to load the file into Python. 419 | 420 | # In[42]: 421 | 422 | 423 | unemployment = pd.read_excel('unemployment.xlsx') 424 | unemployment.head() 425 | 426 | 427 | # You'll notice if you open the `unemployment.xlsx` file in Excel that cells that do not have data (like Argentina in 2015) are labeled with "n/a". A nice feature of `pd.read_excel()` is that it recognizes these cells as NaN ("not a number," or Python's way of encoding missing values), by default. If we wanted to, we could explicitly tell pandas that missing values were labeled "n/a" using `na_values = 'n/a'` within the `pd.read_excel()` function: 428 | 429 | # In[43]: 430 | 431 | 432 | unemployment = pd.read_excel('unemployment.xlsx', na_values = 'n/a') 433 | 434 | 435 | # Right now, the data are in what's commonly referred to as "wide" format, meaning the variables (unemployment rate for each year) are spread across columns.
This might be good for presentation, but it's not great for certain calculations or graphing. "Wide" format data also becomes confusing if other variables are added. 436 | # 437 | # We need to change the format from "wide" to "long," meaning that the columns (`2012`, `2013`, `2014`, `2015`, `2016`) will be converted into a new variable, which we'll call `Year`, with repeated values for each country. And the unemployment rates will be put into a new variable, which we'll call `Rate_Unemployed`. 438 | 439 | # To do this, we'll use the [`pd.melt()`](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.melt.html) function in `pandas` to create a new DataFrame, `unemployment_long`. 440 | 441 | # In[44]: 442 | 443 | 444 | unemployment_long = pd.melt(unemployment, # data to reshape 445 | id_vars = 'Country', # identifier variable 446 | var_name = 'Year', # name of the new column made from the old column headers 447 | value_name = 'Rate_Unemployed') # the values of interest 448 | 449 | 450 | # Inspecting `unemployment_long` using `head()` shows that we have successfully created a long dataset. 451 | 452 | # In[45]: 453 | 454 | 455 | unemployment_long.head() 456 | 457 | 458 | # ## 4. Calculating year-over-year change in panel data 459 | 460 | # Sort the data by `Country` and `Year` using the `sort_values()` function: 461 | 462 | # In[46]: 463 | 464 | 465 | unemployment_long = unemployment_long.sort_values(['Country', 'Year']) 466 | 467 | unemployment_long.head() 468 | 469 | 470 | # Again, we can use `reset_index(drop = True)` to reset the row index so that the numbers next to the rows are in sequential order. 471 | 472 | # In[47]: 473 | 474 | 475 | unemployment_long = unemployment_long.reset_index(drop = True) 476 | 477 | unemployment_long.head() 478 | 479 | 480 | # This type of data is known in time-series analysis as a panel; each country is observed every year from 2012 to 2016.
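(An aside before moving on: the wide-to-long reshape also works in reverse. `pivot()` is the inverse of `melt()` — a minimal sketch on a toy panel with invented country names and rates, using the same column names as `unemployment_long`:)

```python
import pandas as pd

# invented long-format panel mimicking unemployment_long's columns
long_df = pd.DataFrame({'Country': ['A', 'A', 'B', 'B'],
                        'Year': [2015, 2016, 2015, 2016],
                        'Rate_Unemployed': [5.0, 4.5, 7.2, 7.0]})

# pivot() spreads the 'Year' values back out into one column per year
wide_df = long_df.pivot(index = 'Country', columns = 'Year',
                        values = 'Rate_Unemployed')

print(wide_df.loc['A', 2016])  # 4.5
```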
481 | # 482 | # For Albania, the percentage point change in unemployment rate from 2012 to 2013 would be 16 - 13.4 = 2.6 percentage points. What if I wanted the year-over-year change in unemployment rate for every country? 483 | # 484 | # We can use the [`diff()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.diff.html) function in `pandas` to do this. We can use `diff()` to calculate the difference between the `Rate_Unemployed` that year and the `Rate_Unemployed` for the year prior (the default for `diff()` is 1 period, which is good for us since we want the change from the previous year). We will save this difference into a new variable, `Change`. 485 | 486 | # In[48]: 487 | 488 | 489 | unemployment_long['Change'] = unemployment_long.Rate_Unemployed.diff() 490 | 491 | 492 | # Let's inspect the first five rows again, using `head()`: 493 | 494 | # In[49]: 495 | 496 | 497 | unemployment_long.head() 498 | 499 | 500 | # So far so good. It also makes sense that Albania's `Change` is `NaN` in 2012, since the dataset doesn't contain any unemployment figures before the year 2012. 501 | # 502 | # But a closer inspection of the data reveals a problem. What if we used `tail()` to look at the *last* 5 rows of the data? 503 | 504 | # In[50]: 505 | 506 | 507 | unemployment_long.tail() 508 | 509 | 510 | # **Why does Vietnam have a -18.493 percentage point change in 2012?** 511 | 512 | # (Hint: use `tail()` to look at the last 6 rows of the data. Plain `diff()` subtracts the previous *row*, so Vietnam's 2012 rate was compared against the last row of the country just above it. Grouping by `Country` first keeps each difference within a single country.) 513 | 514 | # In[51]: 515 | 516 | 517 | unemployment_long['Change'] = (unemployment_long 518 | .groupby('Country') 519 | .Rate_Unemployed.diff()) 520 | 521 | unemployment_long.tail() 522 | 523 | 524 | # (Also notice how I put the entire expression in parentheses and put each function on a different line for readability.) 525 | 526 | # ## 5. Recoding numerical variables into categorical ones 527 | 528 | # Here's a list of some attendees for the 2016 workshop, with names and contact info removed.
529 | 530 | # In[52]: 531 | 532 | 533 | attendees = pd.read_csv('attendees.csv') 534 | 535 | attendees.head() 536 | 537 | 538 | # **What if we wanted to quickly see the age distribution of attendees?** 539 | 540 | # In[53]: 541 | 542 | 543 | attendees['Age group'].value_counts() 544 | 545 | 546 | # There's an inconsistency in the labeling of the `Age group` variable here: `'30 - 39'` (with spaces) and `'30-39'` are counted as separate categories. We can fix this using `np.where()` in the `numpy` library. First, let's import the `numpy` library. Like `pandas`, `numpy` has a commonly used alias, `np`. 547 | 548 | # In[54]: 549 | 550 | 551 | import numpy as np 552 | 553 | 554 | # In[55]: 555 | 556 | 557 | attendees['Age group'] = np.where(attendees['Age group'] == '30 - 39', # condition to check 558 | '30-39', # value to use where the condition is True 559 | attendees['Age group']) # otherwise, keep attendees['Age group'] values the same 560 | 561 | 562 | # This might seem trivial for just one value, but it's useful for larger datasets. 563 | 564 | # In[56]: 565 | 566 | 567 | attendees['Age group'].value_counts() 568 | 569 | 570 | # Now let's take a look at the professional status of attendees, stored in the `Choose your status:` column: 571 | 572 | # In[57]: 573 | 574 | 575 | attendees['Choose your status:'].value_counts() 576 | 577 | 578 | # "Nonprofit, Academic, Government" and "Nonprofit, Academic, Government Early Bird" seem to be the same. We can use `np.where()` (and the `|` operator, which works as an element-wise "or") to combine these two categories into one big category, "Nonprofit/Gov". Let's create a new variable, `status`, for our simplified categorization. 579 | # 580 | # Notice the extra sets of parentheses around the two conditions linked by the `|` symbol.
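# A quick aside on why `|` (and the parentheses) are needed: `|` performs an element-wise "or" on boolean Series, while Python's plain `or` fails on a Series because a Series has no single truth value. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'c'])

# Element-wise "or": True wherever either condition holds
mask = (s == 'a') | (s == 'c')
print(np.where(mask, 'keep', 'other'))  # ['keep' 'other' 'keep']

# Plain `or` raises an error instead, since a Series is ambiguous as a single boolean
try:
    (s == 'a') or (s == 'c')
except ValueError as e:
    print('ValueError:', e)
```

# The parentheses matter because `|` binds more tightly than `==`; without them, the comparison would be evaluated in the wrong order.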
581 | 582 | # In[58]: 583 | 584 | 585 | attendees['status'] = np.where((attendees['Choose your status:'] == 'Nonprofit, Academic, Government') | 586 | (attendees['Choose your status:'] == 'Nonprofit, Academic, Government Early Bird'), 587 | 'Nonprofit/Gov', 588 | attendees['Choose your status:']) 589 | 590 | 591 | # In[59]: 592 | 593 | 594 | attendees['status'].value_counts() 595 | 596 | 597 | # ## What else? 598 | # 599 | # - How would you create a new variable in the `attendees` data (let's call it `status2`) that has just two categories, "Student" and "Other"? 600 | # 601 | # - How would you rename the variables in the `attendees` data to make them easier to work with? 602 | # 603 | # - What are some other issues with this dataset? How would you solve them using what we've learned? 604 | # 605 | # - What are some other "messy" data issues you've encountered? 606 | -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # Tricks for cleaning your data in Python using pandas 2 | 3 | In 2017 I gave a talk called "Tricks for cleaning your data in R" at the [Data+Narrative workshop](http://www.bu.edu/com/data-narrative/) at Boston University. The [repo with the code and data](https://github.com/underthecurve/r-data-cleaning-tricks) I used for the talk was pretty well-received, so I figured I'd try to do some of the same stuff in Python using `pandas`. 4 | 5 | **Disclaimer:** when it comes to data stuff, I'm much better with R, especially the `tidyverse` set of packages, than with Python, but in my last job I used Python's `pandas` library to do a lot of data processing since Python was the dominant language there. Please feel free to let me know if there are better ways to do things!
6 | 7 | ## Links to install Python, pandas and Jupyter notebook 8 | 9 | * [Python](https://www.python.org/downloads/): website for Python 10 | * [pandas](https://pandas.pydata.org/): website for the pandas library 11 | * [Jupyter](http://jupyter.org/): website for Project Jupyter, whose interactive notebook this tutorial was written in 12 | 13 | ## Files included 14 | 15 | ### Annotated code and step-by-step instructions for the workshop 16 | * [pandas-data-cleaning-tricks.ipynb](https://github.com/underthecurve/pandas-data-cleaning-tricks/blob/master/pandas-data-cleaning-tricks.ipynb): Jupyter notebook file (for viewing on the web - Desktop only) 17 | * [pandas-data-cleaning-tricks.pdf](https://github.com/underthecurve/pandas-data-cleaning-tricks/blob/master/pandas-data-cleaning-tricks.pdf): PDF file (for printing out) 18 | 19 | ### Python code 20 | * [pandas-data-cleaning-tricks.py](https://github.com/underthecurve/pandas-data-cleaning-tricks/blob/master/pandas-data-cleaning-tricks.py): the straight-up Python code, with annotations commented out 21 | 22 | ### Underlying data needed to run the Python code 23 | * [employee-earnings-report-2016.csv](https://github.com/underthecurve/pandas-data-cleaning-tricks/blob/master/employee-earnings-report-2016.csv): data on earnings for Boston's municipal employees, from the city's [open data portal](https://data.boston.gov/dataset/employee-earnings-report) 24 | * [unemployment.xlsx](https://github.com/underthecurve/pandas-data-cleaning-tricks/blob/master/unemployment.xlsx): data on global unemployment rates from 2012 to 2016, from the [International Monetary Fund](https://www.imf.org/external/pubs/ft/weo/2017/01/weodata/index.aspx) 25 | * [attendees.csv](https://github.com/underthecurve/pandas-data-cleaning-tricks/blob/master/attendees.csv): data on some attendees of the 2017 Data+Narrative workshop, with names and identifying information removed 26 | 27 | ## How to follow this tutorial 28 | 29 | * You can clone or download
this repository by clicking on the green button above, "Clone or download" 30 | * Follow along by reading the `.ipynb` file online or printing the `.pdf` file out by clicking the Github links above 31 | 32 | ## Questions / Feedback? 33 | 34 | ychristinezhang at gmail dot com 35 | 36 | or on Twitter 37 | 38 | [@christinezhang](https://twitter.com/christinezhang) 39 | 40 | Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.