├── requirements.txt
├── Dockerfile
├── images
│   ├── race_age_dist.png
│   ├── county_scatter_AZ.png
│   ├── county_scatter_CO.png
│   ├── county_scatter_CT.png
│   ├── county_scatter_NC.png
│   ├── county_scatter_TX.png
│   ├── county_scatter_WA.png
│   ├── county_scatter_ma.png
│   ├── county_scatter_vt.png
│   ├── county_scatter_wi.png
│   ├── search_scatters_AZ.png
│   ├── search_scatters_CO.png
│   ├── search_scatters_CT.png
│   ├── search_scatters_NC.png
│   ├── search_scatters_TX.png
│   ├── search_scatters_WA.png
│   ├── search_scatters_ma.png
│   ├── search_scatters_vt.png
│   ├── search_scatters_wi.png
│   ├── citations_and_arrests_by_race.png
│   ├── citations_and_arrests_by_race_and_violation.png
│   └── citations_and_arrests_by_gender_and_violation.png
├── download_data.sh
├── LICENSE
├── .gitignore
├── README.md
└── traffic_stop_analysis.py

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
numpy
pandas
matplotlib
jupyter
plotly

--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
FROM jupyter/datascience-notebook

WORKDIR /home/jovyan/work/

ADD . .

--------------------------------------------------------------------------------
/images/race_age_dist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/race_age_dist.png

--------------------------------------------------------------------------------
/images/county_scatter_AZ.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/county_scatter_AZ.png

--------------------------------------------------------------------------------
/images/county_scatter_CO.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/county_scatter_CO.png

--------------------------------------------------------------------------------
/images/county_scatter_CT.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/county_scatter_CT.png

--------------------------------------------------------------------------------
/images/county_scatter_NC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/county_scatter_NC.png

--------------------------------------------------------------------------------
/images/county_scatter_TX.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/county_scatter_TX.png

--------------------------------------------------------------------------------
/images/county_scatter_WA.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/county_scatter_WA.png

--------------------------------------------------------------------------------
/images/county_scatter_ma.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/county_scatter_ma.png
--------------------------------------------------------------------------------
/images/county_scatter_vt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/county_scatter_vt.png

--------------------------------------------------------------------------------
/images/county_scatter_wi.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/county_scatter_wi.png

--------------------------------------------------------------------------------
/images/search_scatters_AZ.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/search_scatters_AZ.png

--------------------------------------------------------------------------------
/images/search_scatters_CO.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/search_scatters_CO.png

--------------------------------------------------------------------------------
/images/search_scatters_CT.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/search_scatters_CT.png

--------------------------------------------------------------------------------
/images/search_scatters_NC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/search_scatters_NC.png

--------------------------------------------------------------------------------
/images/search_scatters_TX.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/search_scatters_TX.png

--------------------------------------------------------------------------------
/images/search_scatters_WA.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/search_scatters_WA.png

--------------------------------------------------------------------------------
/images/search_scatters_ma.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/search_scatters_ma.png

--------------------------------------------------------------------------------
/images/search_scatters_vt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/search_scatters_vt.png

--------------------------------------------------------------------------------
/images/search_scatters_wi.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/search_scatters_wi.png

--------------------------------------------------------------------------------
/images/citations_and_arrests_by_race.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/citations_and_arrests_by_race.png

--------------------------------------------------------------------------------
/images/citations_and_arrests_by_race_and_violation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/citations_and_arrests_by_race_and_violation.png

--------------------------------------------------------------------------------
/images/citations_and_arrests_by_gender_and_violation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/triestpa/Police-Analysis-Python/HEAD/images/citations_and_arrests_by_gender_and_violation.png

--------------------------------------------------------------------------------
/download_data.sh:
--------------------------------------------------------------------------------
#!/usr/bin/env bash
# Download the cleaned Stanford Open Policing datasets into ./data
set -e

mkdir -p data
cd data
wget https://stacks.stanford.edu/file/druid:py883nd2578/VT-clean.csv.gz
wget https://stacks.stanford.edu/file/druid:py883nd2578/MA-clean.csv.gz
wget https://stacks.stanford.edu/file/druid:py883nd2578/CT-clean.csv.gz
wget https://stacks.stanford.edu/file/druid:py883nd2578/WI-clean.csv.gz
wget https://stacks.stanford.edu/file/druid:py883nd2578/AZ-clean.csv.gz
wget https://stacks.stanford.edu/file/druid:py883nd2578/CO-clean.csv.gz
wget https://stacks.stanford.edu/file/druid:py883nd2578/NC-clean.csv.gz
wget https://stacks.stanford.edu/file/druid:py883nd2578/WA-clean.csv.gz
wget https://stacks.stanford.edu/file/druid:py883nd2578/TX-clean.csv.gz

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2017 Patrick Triest

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
.static_storage/
.media/
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

.DS_Store

data/

.vscode
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Data Science, Politics, and Police

The intersection of science, politics, personal opinion, and social policy can be rather complex. This junction of ideas and disciplines is often rife with controversies, strongly held viewpoints, and agendas that are [based more on belief than on empirical evidence](https://en.wikipedia.org/wiki/Global_warming_controversy). Data science is particularly important in this area since it provides a methodology for examining the world in a pragmatic, fact-first manner, and is capable of providing insight into some of the most important issues that we face today.

The recent high-profile police shootings of unarmed black men, such as [Michael Brown](https://en.wikipedia.org/wiki/Shooting_of_Michael_Brown) (2014), [Tamir Rice](https://en.wikipedia.org/wiki/Shooting_of_Tamir_Rice) (2014), [Alton Sterling](https://en.wikipedia.org/wiki/Shooting_of_Alton_Sterling) (2016), and [Philando Castile](https://en.wikipedia.org/wiki/Shooting_of_Philando_Castile) (2016), have triggered a divisive national dialog on the issue of racial bias in policing.

These shootings have spurred the growth of large social movements seeking to raise awareness of what is viewed as the systemic targeting of people-of-color by police forces across the country.
On the other side of the political spectrum, many hold the view that the unbalanced targeting of non-white citizens is a myth created by the media based on a handful of extreme cases, and that these highly-publicized stories are not representative of the national norm.

In June 2017, a team of researchers at Stanford University collected and released an open-source data set of 60 million state police patrol stops from 20 states across the US. In this tutorial, we will walk through how to analyze and visualize this data using Python.

![county scatters vt](https://cdn.patricktriest.com/blog/images/posts/policing-data/county_scatter_VT.png)

To preview the completed IPython notebook, visit the page [here](https://github.com/triestpa/Police-Analysis-Python/blob/master/traffic_stop_analysis.ipynb).

> This tutorial and analysis would not be possible without the work performed by [The Stanford Open Policing Project](https://openpolicing.stanford.edu/). Much of the analysis performed in this tutorial is based on the work that has already been performed by this team. [A short tutorial](https://openpolicing.stanford.edu/tutorials/) for working with the data using the R programming language is provided on the official project website.

To read more, visit - https://blog.patricktriest.com/police-data-python/

___

This IPython notebook is 100% open-source; feel free to use the code however you would like.

```
The MIT License (MIT)

Copyright (c) 2018 Patrick Triest

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```
--------------------------------------------------------------------------------
/traffic_stop_analysis.py:
--------------------------------------------------------------------------------

# coding: utf-8

# # Exploring United States Policing Data Using Python

# The intersection of science, politics, personal opinion, and social policy can be rather complex. This junction of ideas and disciplines is often rife with controversies, strongly held viewpoints, and agendas that are [based more on belief than on empirical evidence](https://en.wikipedia.org/wiki/Global_warming_controversy). Data science is particularly important in this area since it provides a methodology for examining the world in a pragmatic, fact-first manner, and is capable of providing insight into some of the most important issues that we face today.
#
# The recent high-profile police shootings of unarmed black men, such as [Michael Brown](https://en.wikipedia.org/wiki/Shooting_of_Michael_Brown) (2014), [Tamir Rice](https://en.wikipedia.org/wiki/Shooting_of_Tamir_Rice) (2014), [Alton Sterling](https://en.wikipedia.org/wiki/Shooting_of_Alton_Sterling) (2016), and [Philando Castile](https://en.wikipedia.org/wiki/Shooting_of_Philando_Castile) (2016), have triggered a divisive national dialog on the issue of racial bias in policing.
#
# These shootings have spurred the growth of large social movements seeking to raise awareness of what is viewed as the systemic targeting of people-of-color by police forces across the country. On the other side of the political spectrum, many hold the view that the unbalanced targeting of non-white citizens is a myth created by the media based on a handful of extreme cases, and that these highly-publicized stories are not representative of the national norm.
#
# In June 2017, a team of researchers at Stanford University collected and released an open-source data set of 60 million state police patrol stops from 20 states across the US. In this tutorial, we will walk through how to analyze and visualize this data using Python.
#
# ![county scatters vt](https://cdn.patricktriest.com/blog/images/posts/policing-data/county_scatter_VT.png)
#
# The source code and figures for this analysis can be found in the companion Github repository - https://github.com/triestpa/Police-Analysis-Python
#
# To preview the completed IPython notebook, visit the page [here](https://github.com/triestpa/Police-Analysis-Python/blob/master/traffic_stop_analysis.ipynb).
#
# > This tutorial and analysis would not be possible without the work performed by [The Stanford Open Policing Project](https://openpolicing.stanford.edu/). Much of the analysis performed in this tutorial is based on the work that has already been performed by this team. [A short tutorial](https://openpolicing.stanford.edu/tutorials/) for working with the data using the R programming language is provided on the official project website.
#

# In the United States there are more than 50,000 traffic stops on a typical day. The potential number of data points for each stop is huge, from the demographics (age, race, gender) of the driver, to the location, time of day, stop reason, stop outcome, car model, and much more. Unfortunately, not every state makes this data available, and those that do often have different standards for which information is reported. Different counties and districts within each state can also be inconsistent in how each traffic stop is recorded. The [research team at Stanford](https://openpolicing.stanford.edu/) has managed to gather traffic-stop data from twenty states, and has worked to regularize the reporting standards for 11 fields.
#
# - Stop Date
# - Stop Time
# - Stop Location
# - Driver Race
# - Driver Gender
# - Driver Age
# - Stop Reason
# - Search Conducted
# - Search Type
# - Contraband Found
# - Stop Outcome
#
# Most states do not have data available for every field, but there is enough overlap between the data sets to provide a solid foundation for some very interesting analysis.
#

# # Project Setup

# We'll need to install a few Python packages to perform our analysis.
#
# On the command line, run the following command to install the required libraries.
# ```bash
# pip install numpy pandas matplotlib ipython jupyter
# ```
#
# > If you're using Anaconda, you can replace the `pip` command here with `conda`. Also, depending on your installation, you might need to use `pip3` instead of `pip` in order to install the Python 3 versions of the packages.
#

# In the first cell of the notebook, import our dependencies.

# In[1]:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')


# In[2]:


figsize = (16,8)


# We're also setting a shared variable `figsize` that we'll reuse later on in our data visualization logic.

# # Dataset Exploration

# We'll start with analyzing the data set for Vermont. We're looking at Vermont first for a few reasons.
#
# 1. The Vermont dataset is small enough to be very manageable and quick to operate on, with only 283,285 traffic stops (compared to the Texas data set, for instance, which contains almost 24 million records).
# 1. There is not much missing data, as all eleven fields mentioned above are covered.
# 1. Vermont is 94% white, but is also in a part of the country known for being very liberal (disclaimer - I grew up in the Boston area, and I've spent quite a bit of time in Vermont). Many in this area consider this state to be very progressive and might like to believe that their state institutions are not as prone to systemic racism as the institutions in other parts of the country. It will be interesting to determine if the data validates this view.

# First, download the Vermont traffic stop data - https://stacks.stanford.edu/file/druid:py883nd2578/VT-clean.csv.gz

# ## Load Dataset

# Load the Vermont police stop data set into a [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).

# In[3]:


df_vt = pd.read_csv('./data/VT-clean.csv.gz', compression='gzip', low_memory=False)


# > This command assumes that you are storing the data set in the `data` directory of the project. If you are not, you can adjust the data file path accordingly.

# We can get a quick preview of the first five rows of the data set with the `head()` method.

# In[4]:


df_vt.head()


# We can also list the available fields by reading the `columns` property.

# In[5]:


df_vt.columns


# ## Clean Dataset

# Let's do a quick count of each column to determine how consistently populated the data is.

# In[6]:


df_vt.count()


# We can see that most columns have similar numbers of values besides `search_type`, which is not present for most of the rows, likely because most stops do not result in a search.
#
# For our analysis, it will be best to have the exact same number of values for each field. We'll go ahead now and make sure that every single cell has a value.

# In[7]:


# Fill missing search type values with a placeholder
df_vt['search_type'].fillna('N/A', inplace=True)

# Drop rows with missing values
df_vt.dropna(inplace=True)


# When we count the values again, we'll see that each column has the exact same number of entries.
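
# An equivalent spot-check (a quick sketch using only standard pandas calls) is
# to count the missing values directly - after the fill and drop steps above,
# every column should report zero missing entries, and the `count()` call below
# should show identical totals for every column.

# In[ ]:


df_vt.isnull().sum()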

# In[8]:


df_vt.count()


# ## Stops By County

# Let's get a list of all counties in the data set, along with how many traffic stops happened in each.

# In[9]:


df_vt['county_name'].value_counts()


# If you're familiar with Vermont's geography, you'll notice that the police stops seem to be more concentrated in counties in the southern half of the state. The southern half of the state is also where much of the cross-state traffic flows in transit to and from New Hampshire, Massachusetts, and New York. Since the traffic stop data is from the state troopers, this interstate highway traffic could potentially explain why we see more traffic stops in these counties.
#
# Here's a quick map generated with [Tableau](https://public.tableau.com/profile/patrick.triest#!/vizhome/VtPoliceStops/Sheet1) to visualize this regional distribution.
#
# ![Vermont County Map](https://cdn.patricktriest.com/blog/images/posts/policing-data/vermont_map.png)

# ## Violations

# We can also check out the distribution of traffic stop reasons.

# In[10]:


df_vt['violation_raw'].value_counts()


# In[11]:


df_vt['violation'].value_counts()


# Unsurprisingly, the top reason for a traffic stop is `Moving Violation` (speeding, reckless driving, etc.), followed by `Equipment` (faulty lights, illegal modifications, etc.).
#
# By using the `violation_raw` field as a reference, we can see that the `Other` category includes "Investigatory Stop" (the police have reason to suspect that the driver of the vehicle has committed a crime) and "Externally Generated Stop" (possibly as a result of a 911 call, or a referral from municipal police departments).
#
# `DUI` ("driving under the influence", i.e. drunk driving) is surprisingly the least prevalent, with only 711 total recorded stops for this reason over the five year period (2010-2015) that the dataset covers. This seems low, since [Vermont had 2,647 DUI arrests in 2015](http://www.statisticbrain.com/number-of-dui-arrests-per-state/), so I suspect that a large proportion of these arrests were made by municipal police departments, and/or began with a `Moving Violation` stop rather than a more specific `DUI` stop.

# ## Stop Outcomes

# We can also examine the traffic stop outcomes.

# In[12]:


df_vt['stop_outcome'].value_counts()


# A majority of stops result in a written warning - which goes on the record but carries no direct penalty. A bit over a third of the stops result in a citation (commonly known as a ticket), which comes with a direct fine and can carry other negative side effects, such as raising a driver's auto insurance premiums.
#
# The decision to give a warning or a citation is often at the discretion of the police officer, so this could be a good source for studying bias.

# ## Stops By Gender

# Let's break down the traffic stops by gender.

# In[13]:


df_vt['driver_gender'].value_counts()


# We can see that approximately 36% of the stops are of women drivers, and 64% are of men.

# ## Stops By Race

# Let's also examine the distribution by race, starting with a quick look at the proportions (see the sketch below) before the raw counts.
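
# A small sketch: passing `normalize=True` to `value_counts` reports each
# group's share of stops directly, which is easier to compare against census
# percentages than raw totals.

# In[ ]:


df_vt['driver_race'].value_counts(normalize=True)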

# In[14]:


df_vt['driver_race'].value_counts()


# Most traffic stops are of white drivers, which is to be expected since [Vermont is around 94% white](https://www.census.gov/quickfacts/VT) (making it the 2nd-least diverse state in the nation, [behind Maine](https://www.census.gov/quickfacts/ME)). Since white drivers make up approximately 94% of the traffic stops, there's no obvious bias here for pulling over non-white drivers vs white drivers. Using the same methodology, however, we can also see that while black drivers make up roughly 2% of all traffic stops, [only 1.3% of Vermont's population is black](https://www.census.gov/quickfacts/VT).
#
# Let's keep on analyzing the data to see what else we can learn.

# ## Stop Frequency by Race and Age

# It would be interesting to visualize how the frequency of police stops breaks down by both race and age.

# In[15]:


# Remove the sparse 'Other' race category so the density plot stays readable
df_vt = df_vt[df_vt['driver_race'] != 'Other']


# In[16]:


fig, ax = plt.subplots(figsize=(20,8))
ax.set_xlim(15, 70)
for race in df_vt['driver_race'].unique():
    s = df_vt[df_vt['driver_race'] == race]['driver_age']
    s.plot.kde(ax=ax, label=race)
ax.legend()

# fig.savefig('images/race_age_dist.png', bbox_inches='tight')


# We can see that young drivers in their late teens and early twenties are the most likely to be pulled over. Between ages 25 and 35, the stop rate of each demographic drops off quickly. As far as the racial comparison goes, the most interesting disparity is that for white drivers between the ages of 35 and 50 the pull-over rate stays mostly flat, whereas for other races it continues to drop steadily.

# # Analyze Violation and Outcome Data

# Now that we've got a feel for the dataset, we can start getting into some more advanced analysis.
#
# One interesting topic that we touched on earlier is the fact that the decision to penalize a driver with a warning or a citation is often at the discretion of the police officer. With this in mind, let's see if there are any discernible patterns in driver demographics and stop outcomes.

# ## Analysis Helper Function
#

# In order to assist in this analysis, we'll define a helper function to aggregate a few important statistics from our dataset.
#
# - `citations_per_warning` - The ratio of citations to warnings. A higher number signifies a greater likelihood of being ticketed instead of getting off with a warning.
# - `arrest_rate` - The percentage of stops that end in an arrest.
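
# One caveat before defining the helper: `citations_per_warning` divides by the
# warning count, so a group with zero warnings would raise a ZeroDivisionError.
# That never happens for the Vermont groupings used here, but a defensive
# variant (a sketch, not used below) could guard the division.

# In[ ]:


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None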

# In[17]:


def compute_outcome_stats(df):
    """Compute statistics regarding the relative quantities of arrests, warnings, and citations"""
    n_total = len(df)
    n_warnings = len(df[df['stop_outcome'] == 'Written Warning'])
    n_citations = len(df[df['stop_outcome'] == 'Citation'])
    n_arrests = len(df[df['stop_outcome'] == 'Arrest for Violation'])
    citations_per_warning = n_citations / n_warnings
    arrest_rate = n_arrests / n_total

    return(pd.Series(data = {
        'n_total': n_total,
        'n_warnings': n_warnings,
        'n_citations': n_citations,
        'n_arrests': n_arrests,
        'citations_per_warning': citations_per_warning,
        'arrest_rate': arrest_rate
    }))


# Let's test out this helper function by applying it to the entire dataframe.

# In[18]:


compute_outcome_stats(df_vt)


# In the above result, we can see that about `1.17%` of traffic stops result in an arrest, and there are on average `0.62` citations (tickets) issued per warning. This data passes the sanity check, but it's too coarse to provide many interesting insights. Let's dig deeper.

# ## Breakdown By Gender

# Using our helper function, along with the Pandas dataframe [groupby](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method, we can easily compare these stats for male and female drivers.

# In[19]:


df_vt.groupby('driver_gender').apply(compute_outcome_stats)


# This is a simple example of the common [split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/groupby.html) technique. We'll be building on this pattern for the remainder of the tutorial, so make sure that you understand how this comparison table is generated before continuing.
#
# We can see here that men are, on average, twice as likely to be arrested during a traffic stop, and are also slightly more likely to be given a citation than women. It is, of course, not clear from the data whether this is indicative of any bias by the police officers, or if it reflects that men are being pulled over for more serious offenses than women on average.
#

# ## Breakdown By Race

# Let's now compute the same comparison, grouping by race.
#

# In[20]:


df_vt.groupby('driver_race').apply(compute_outcome_stats)


# Ok, this is interesting. We can see that Asian drivers are arrested at the lowest rate, but receive tickets at the highest rate (roughly 1 ticket per warning). Black and Hispanic drivers are both arrested and ticketed at higher rates than white drivers.
#

# Let's visualize these results.
#

# In[21]:


race_agg = df_vt.groupby(['driver_race']).apply(compute_outcome_stats)
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=figsize)
race_agg['citations_per_warning'].plot.barh(ax=axes[0], figsize=figsize, title="Citation Rate By Race")
race_agg['arrest_rate'].plot.barh(ax=axes[1], figsize=figsize, title='Arrest Rate By Race')

# fig.savefig('images/citations_and_arrests_by_race.png', bbox_inches='tight')


# ## Group By Outcome and Violation

# We'll deepen our analysis by grouping each statistic by the violation that triggered the traffic stop.
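
# Grouping on two keys produces one row per (race, violation) pair, indexed by
# a pandas MultiIndex. Here is a tiny sketch, using a hypothetical mini-frame,
# of the shape to expect from the real aggregation below.

# In[ ]:


toy = pd.DataFrame({
    'race': ['White', 'White', 'Black'],
    'violation': ['Speeding', 'Equipment', 'Speeding'],
    'n_stops': [1, 1, 1]
})
toy.groupby(['race', 'violation']).sum()  # one row per (race, violation) pair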
#

# In[22]:


df_vt.groupby(['driver_race','violation']).apply(compute_outcome_stats)


# Ok, well this table looks interesting, but it's rather large and visually overwhelming. Let's trim down that dataset in order to retrieve a more focused subset of information.
#

# In[23]:


# Create a new column to represent whether the driver is white
df_vt['is_white'] = df_vt['driver_race'] == 'White'

# Remove violations with too few data points
df_vt_filtered = df_vt[~df_vt['violation'].isin(['Other (non-mapped)', 'DUI'])]


# We're generating a new column to represent whether or not the driver is white. We are also generating a filtered version of the dataframe that strips out the two violation types with the fewest data points.
#
# > We're not assigning the filtered dataframe to `df_vt`, since we'll want to keep using the complete unfiltered dataset in the next sections.
#
# Let's redo our race + violation aggregation now, using our filtered dataset.
#

# In[24]:


df_vt_filtered.groupby(['is_white','violation']).apply(compute_outcome_stats)


# Ok great, this is much easier to read.
#
# In the above table, we can see that non-white drivers are more likely to be arrested during a stop that was initiated due to an equipment or moving violation, but white drivers are more likely to be arrested for a traffic stop resulting from "Other" reasons. Non-white drivers are more likely than white drivers to be given tickets for each violation.
#

# ## Visualize Stop Outcome and Violation Results

# Let's generate a bar chart in order to visualize this data broken down by race.
#

# In[25]:


race_stats = df_vt_filtered.groupby(['violation', 'driver_race']).apply(compute_outcome_stats).unstack()
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=figsize)
race_stats.plot.bar(y='arrest_rate', ax=axes[0], title='Arrest Rate By Race and Violation')
race_stats.plot.bar(y='citations_per_warning', ax=axes[1], title='Citations Per Warning By Race and Violation')

#fig.savefig('images/citations_and_arrests_by_race_and_violation.png', bbox_inches='tight')


# We can see in these charts that Hispanic and Black drivers are generally arrested at a higher rate than white drivers (with the exception of the rather ambiguous "Other" category), and that Black drivers are more likely, across the board, to be issued a citation than white drivers. Asian drivers are arrested at very low rates, and their citation rates are highly variable.
#
# These results are compelling, and are suggestive of potential racial bias, but they are too inconsistent across violation types to provide any definitive answers. Let's dig deeper to see what else we can find.
#

# We can easily generate the same visualizations grouped by gender.

# In[26]:


gender_stats = df_vt_filtered.groupby(['violation','driver_gender']).apply(compute_outcome_stats).unstack()
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=figsize)
ax_gender_arrests = gender_stats.plot.bar(y='arrest_rate', ax=axes[0], title='Arrests By Gender and Violation', figsize=figsize)
ax_gender_citations = gender_stats.plot.bar(y='citations_per_warning', ax=axes[1], title='Citations By Gender and Violation', figsize=figsize)

#fig.savefig('images/citations_and_arrests_by_gender_and_violation.png', bbox_inches='tight')


# # Search Outcome Analysis

# Two of the more interesting fields available to us are `search_conducted` and `contraband_found`.
#
# In the analysis by the "Stanford Open Policing Project", these two fields are used to perform what is known as an "outcome test".
#
# On the [project website](https://openpolicing.stanford.edu/findings/), the "outcome test" is summarized clearly.
#
# > In the 1950s, the Nobel prize-winning economist Gary Becker proposed an elegant method to test for bias in search decisions: the outcome test.
# >
# > Becker proposed looking at search outcomes. If officers don’t discriminate, he argued, they should find contraband — like illegal drugs or weapons — on searched minorities at the same rate as on searched whites. If searches of minorities turn up contraband at lower rates than searches of whites, the outcome test suggests officers are applying a double standard, searching minorities on the basis of less evidence.
#
# [Findings, Stanford Open Policing Project](https://openpolicing.stanford.edu/findings/)
#
# The authors of the project also make the point that using only the "hit rate", the rate of searches where contraband is found, can be misleading. For this reason, we'll also need to use the "search rate" in our analysis - the rate at which a traffic stop results in a search.
#
# We'll now use the available data to perform our own outcome test, in order to determine whether minorities in Vermont are routinely searched on the basis of less evidence than white drivers.
#

# ## Search Rate and Hit Rate

#
# We'll define a new function to compute the search rate and hit rate for the traffic stops in our dataframe.
#
# - **Search Rate** - The rate at which a traffic stop results in a search. A search rate of `0.20` would signify that out of 100 traffic stops, 20 resulted in a search.
# - **Hit Rate** - The rate at which contraband is found in a search. A hit rate of `0.80` would signify that out of 100 searches, 80 searches resulted in contraband (drugs, unregistered weapons, etc.) being found.
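
# The function below leans on a small Python idiom: summing a boolean Series
# counts its True entries, since True evaluates to 1. A two-line sketch:

# In[ ]:


searched = pd.Series([True, False, True, False])
print(sum(searched))    # 2 - the number of True entries
print(searched.mean())  # 0.5 - the corresponding rate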
#

# In[27]:


def compute_search_stats(df):
    """Compute the search rate and hit rate"""
    search_conducted = df['search_conducted']
    contraband_found = df['contraband_found']
    n_stops = len(search_conducted)
    n_searches = sum(search_conducted)
    n_hits = sum(contraband_found)

    # Filter out groups with too few stops
    if n_stops < 50:
        search_rate = None
    else:
        search_rate = n_searches / n_stops

    # Filter out groups with too few searches
    if n_searches < 5:
        hit_rate = None
    else:
        hit_rate = n_hits / n_searches

    return(pd.Series(data = {
        'n_stops': n_stops,
        'n_searches': n_searches,
        'n_hits': n_hits,
        'search_rate': search_rate,
        'hit_rate': hit_rate
    }))


# ## Search Stats For Entire Dataset

# We can test our new function to determine the search rate and hit rate for the entire state.

# In[28]:


compute_search_stats(df_vt)


# Here we can see that each traffic stop had a 1.2% chance of resulting in a search, and each search had an 80% chance of yielding contraband.

# ## Search Stats By Driver Gender

# Using the Pandas `groupby` method, we can compute how the search stats differ by gender.

# In[29]:


df_vt.groupby('driver_gender').apply(compute_search_stats)


# We can see here that men are three times as likely to be searched as women, and that 80% of searches for both genders resulted in contraband being found. The data shows that men are searched and caught with contraband more often than women, but it is unclear whether there is any gender discrimination in deciding whom to search, since the hit rate is equal.
#

# ## Search Stats By Age

# We can split the dataset into age buckets and perform the same analysis.

# In[30]:


age_groups = pd.cut(df_vt["driver_age"], np.arange(15, 70, 5))
df_vt.groupby(age_groups).apply(compute_search_stats)


# We can see here that the search rate steadily declines as drivers get older, and that the hit rate also declines rapidly for older drivers.

# ## Search Stats By Race

# In[31]:


df_vt.groupby('driver_race').apply(compute_search_stats)


# Black and Hispanic drivers are searched at much higher rates than White drivers (5% and 4% of traffic stops respectively, versus 1% for white drivers), but the searches of these drivers only yield contraband 60-70% of the time, compared to 80% of the time for White drivers.
#
# Let's rephrase these results.
#
# *Black drivers are **five times as likely** to be searched as white drivers during a traffic stop, but are **13% less likely** to be caught with contraband in the event of a search.*
#
# *Hispanic drivers are **four times as likely** to be searched as white drivers during a traffic stop, but are **17% less likely** to be caught with contraband in the event of a search.*
#

# ## Search Stats By Race and Location

#
# Let's add in location as another factor.
# It's possible that some counties (such as those with larger towns, or with interstate highways where opioid trafficking is prevalent) have a much higher search rate and lower hit rate for both white and non-white drivers, but also have greater racial diversity, leading to distortion in the overall stats. By controlling for location, we can determine whether this is the case.
#

# We'll define three new helper functions to generate the visualizations.
#

# In[32]:


def generate_comparison_scatter(df, ax, state, race, field, color):
    """Generate scatter plot comparing field for white drivers with minority drivers"""
    race_location_agg = df.groupby(['county_fips','driver_race']).apply(compute_search_stats).reset_index().dropna()
    race_location_agg = race_location_agg.pivot(index='county_fips', columns='driver_race', values=field)
    ax = race_location_agg.plot.scatter(ax=ax, x='White', y=race, s=150, label=race, color=color)
    return ax


# In[33]:


def format_scatter_chart(ax, state, field):
    """Format and label the scatter chart"""
    ax.set_xlabel('{} - White'.format(field))
    ax.set_ylabel('{} - Non-White'.format(field))
    ax.set_title("{} By County - {}".format(field, state))
    lim = max(ax.get_xlim()[1], ax.get_ylim()[1])
    ax.set_xlim(0, lim)
    ax.set_ylim(0, lim)
    diag_line, = ax.plot(ax.get_xlim(), ax.get_ylim(), ls="--", c=".3")
    ax.legend()
    return ax


# In[34]:


def generate_comparison_scatters(df, state):
    """Generate scatter plots comparing search rates of white drivers with black, hispanic, and asian drivers"""
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=figsize)
    generate_comparison_scatter(df, axes[0], state, 'Black', 'search_rate', 'red')
    generate_comparison_scatter(df, axes[0], state, 'Hispanic', 'search_rate', 'orange')
    generate_comparison_scatter(df, axes[0], state, 'Asian', 'search_rate', 'green')
    format_scatter_chart(axes[0], state, 'Search Rate')

    generate_comparison_scatter(df, axes[1], state, 'Black', 'hit_rate', 'red')
    generate_comparison_scatter(df, axes[1], state, 'Hispanic', 'hit_rate', 'orange')
    generate_comparison_scatter(df, axes[1], state, 'Asian', 'hit_rate', 'green')
    format_scatter_chart(axes[1], state, 'Hit Rate')

    return fig


# We can now generate the scatter plots using the `generate_comparison_scatters` function.

# In[35]:


fig = generate_comparison_scatters(df_vt, 'VT')
#fig.savefig('images/search_scatters_VT.png', bbox_inches='tight')


# The plots above compare `search_rate` (left) and `hit_rate` (right) for minority drivers versus white drivers in each county. If all of the dots (each of which represents the stats for a single county and race) followed the diagonal center line, the implication would be that white drivers and non-white drivers are searched at the exact same rate with the exact same standard of evidence.
#
# Unfortunately, this is not the case. In the above charts, we can see that, for every county, the search rate is higher for Black and Hispanic drivers even though the hit rate is lower.
#

# Let's define one more visualization helper function, to show all of these results on a single scatter plot.
#

# In[36]:


def generate_county_search_stats_scatter(df, state):
    """Generate a scatter plot of search rate vs. hit rate by race and county"""
    race_location_agg = df.groupby(['county_fips','driver_race']).apply(compute_search_stats)

    # groupby sorts races alphabetically (Asian, Black, Hispanic, White);
    # colors.pop() assigns colors from the end of this list
    colors = ['orange','red', 'green','blue']
    fig, ax = plt.subplots(figsize=figsize)
    for c, frame in race_location_agg.groupby(level='driver_race'):
        ax.scatter(x=frame['hit_rate'], y=frame['search_rate'], s=150, label=c, color=colors.pop())
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.2), ncol=4, fancybox=True)
    ax.set_xlabel('Hit Rate')
    ax.set_ylabel('Search Rate')
    ax.set_title("Search Stats By County and Race - {}".format(state))
    return fig


# In[37]:


fig = generate_county_search_stats_scatter(df_vt, "VT")
#fig.savefig('images/county_scatter_VT.png', bbox_inches='tight')


# As the old saying goes - *a picture is worth a thousand words*. The above chart is one of those pictures - and the name of the picture is "Systemic Racism".
#
# The search rates and hit rates for white drivers in most counties are consistently clustered around 1% and 80% respectively. We can see, however, that nearly every county searches Black and Hispanic drivers at a higher rate, and that these searches uniformly have a lower hit rate than those of White drivers.
#
# This state-wide pattern of a higher search rate combined with a lower hit rate suggests that a lower standard of evidence is used when deciding to search Black and Hispanic drivers compared to when searching White drivers.
#
# > You might notice that Asian drivers are represented in only one county - this is due to the lack of data on searches of Asian drivers in the other counties.
#

# # Analyzing Other States

# Vermont is a great state to test out our analysis on, but the dataset size is relatively small. Let's now perform the same analysis on other states to determine whether this pattern persists across state lines.

# We've developed a solid reusable formula for reading and visualizing each state's dataset, so let's wrap the entire recipe in a new helper function.

# In[38]:


# Read only the four fields we need, with explicit dtypes, so that the larger
# state datasets load into memory quickly
fields = ['county_fips', 'driver_race', 'search_conducted', 'contraband_found']
types = {
    'contraband_found': bool,
    'county_fips': float,
    'driver_race': 'category',
    'search_conducted': bool
}

def analyze_state_data(state):
    df = pd.read_csv('./data/{}-clean.csv.gz'.format(state), compression='gzip', low_memory=True, dtype=types, usecols=fields)
    df.dropna(inplace=True)

    df = df[df['driver_race'] != 'Other']
    df['driver_race'].cat.remove_unused_categories(inplace=True)

    search_scatters = generate_comparison_scatters(df, state)
    #search_scatters.savefig('images/search_scatters_{}.png'.format(state), bbox_inches='tight')

    county_scatter = generate_county_search_stats_scatter(df, state)
    #county_scatter.savefig('images/county_scatter_{}.png'.format(state), bbox_inches='tight')

    return df.groupby('driver_race').apply(compute_search_stats)


# We're making a few optimizations here in order to make the analysis a bit more streamlined and computationally efficient.
# By only reading the four columns that we're interested in, and by specifying the datatypes ahead of time, we'll be able to read larger datasets into memory more quickly.
#

# ## Massachusetts

# First we'll generate the analysis for my home state, Massachusetts. This time we'll have more data to work with - roughly 3.4 million traffic stops.
#
# Download the dataset to your project's `/data` directory - https://stacks.stanford.edu/file/druid:py883nd2578/MA-clean.csv.gz
#

# In[39]:


analyze_state_data('MA')


# We can see here again that Black and Hispanic drivers are searched at significantly higher rates than white drivers. The differences in hit rates are not as extreme as in Vermont, but they are still noticeably lower for Black and Hispanic drivers than for White drivers. Asian drivers, interestingly, are the least likely to be searched and also the least likely to have contraband if they are searched.

# If we compare the stats for MA to VT, we'll also notice that police in MA seem to use a much lower standard of evidence when searching a vehicle, with their searches averaging around a 50% hit rate, compared to 80% in VT.
#
# The trend here is much less obvious than in Vermont, but it is still clear that traffic stops of Black and Hispanic drivers are more likely to result in a search, despite the fact that searches of White drivers are more likely to result in contraband being found.
#

# ## Wisconsin & Connecticut

#
# Wisconsin and Connecticut have been named as some of the [worst states in America for racial disparities](https://www.wpr.org/wisconsin-considered-one-worst-states-racial-disparities). Let's see how their police stats stack up.
#
# Again, you'll need to download the Wisconsin and Connecticut datasets to your project's `/data` directory.
#
# - Wisconsin: https://stacks.stanford.edu/file/druid:py883nd2578/WI-clean.csv.gz
# - Connecticut: https://stacks.stanford.edu/file/druid:py883nd2578/CT-clean.csv.gz
#

# We can call our `analyze_state_data` function for Wisconsin once the dataset has been downloaded.
#

# In[40]:


analyze_state_data('WI')


# The trends here are starting to look familiar. White drivers in Wisconsin are much less likely to be searched than non-white drivers (aside from Asians, who tend to be searched at around the same rates as whites). Searches of non-white drivers are, again, less likely to yield contraband than searches of white drivers.
#
# We can see here, yet again, that the standard of evidence for searching Black and Hispanic drivers is lower in virtually every county than for White drivers. In one outlying county, almost 25% (!) of traffic stops of Black drivers resulted in a search, even though only half of those searches yielded contraband.
#

# Let's do the same analysis for Connecticut.
#

# In[41]:


analyze_state_data('CT')


# Again, the pattern persists.
#

# We can generate each result rather quickly for each state (with available data), once we've downloaded each dataset; a batch-loop sketch follows, or you can run each state's cell individually.
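
# If several datasets are already sitting in `./data`, the per-state calls
# below can also be batched with a small loop. This is just a convenience
# sketch (it assumes the same `{STATE}-clean.csv.gz` file layout used
# throughout); a state whose file hasn't been downloaded yet is skipped rather
# than halting the run.

# In[ ]:


for state in ('AZ', 'CO', 'NC', 'WA'):
    try:
        print(analyze_state_data(state))
    except FileNotFoundError:
        print('Dataset not downloaded yet: {}'.format(state))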

# ## Arizona

# In[42]:


analyze_state_data('AZ')


# ## Colorado

# In[43]:


analyze_state_data('CO')


# ## North Carolina

# In[44]:


analyze_state_data('NC')


# ## Washington

# In[45]:


analyze_state_data('WA')


# ## Texas

# You might want to let this one run while you go fix yourself a cup of coffee or tea. At almost 24 million traffic stops, the Texas dataset takes a rather long time to process.
#

# In[46]:


analyze_state_data('TX')


# ## Even More Data Visualizations
#

# I highly recommend that you visit the [Stanford Open Policing Project results page](https://openpolicing.stanford.edu/findings/) for more visualizations of this data. There you can browse the search outcome results for all available states, and explore additional analysis that the researchers have performed, such as stop rate by race (using county population demographics data) as well as the effects of recreational marijuana legalization on search rates.
#

# # What next?

# Do these results imply that police officers are overtly racist? **No.**
#
# Do they show that Black and Hispanic drivers are searched much more frequently than white drivers, often with a lower standard of evidence? **Yes.**
#
# What we are observing here appears to be a pattern of systemic racism. The racial disparities revealed in this analysis are a reflection of an entrenched mistrust of certain minorities in the United States. The data and accompanying analysis are indicative of social trends that are certainly not limited to police officers, and *should not be used to disparage this profession as a whole*. Racial discrimination is present at all levels of society, from [retail stores](https://www.theguardian.com/us-news/2015/jun/22/zara-reports-culture-of-favoritism-based-on-race) to the [tech industry](https://www.wired.com/story/tech-leadership-race-problem/) to [academia](https://www.scientificamerican.com/article/sex-and-race-discrimination-in-academia-starts-even-before-grad-school/).
#
# We are able to empirically identify these trends only because state police departments (and the Open Policing team at Stanford) have made this data available to the public; no similar datasets exist for most other professions and industries. Releasing datasets about these issues is commendable (but sadly still somewhat uncommon, especially in the private sector), and will help to further identify where these disparities exist and to influence policies in order to provide a fair, effective way to counteract these biases.
#
# To see the full official analysis for all 20 available states, check out the official findings paper here - https://5harad.com/papers/traffic-stops.pdf.
#
# I hope that this tutorial has provided the tools you might need to take this analysis further. There's a *lot* more that you can do with the data than what we've covered here.
#
# - Analyze police stops for your home state and county (if the data is available). If the data is not available, submit a formal request to your local representatives and institutions that the data be made public.
# - Combine your analysis with US census data on the demographic, social, and economic stats about each county.
# - Create a web app to display the county trends on an interactive map.
# - Build a mobile app to warn drivers when they're entering an area that appears to be more distrusting of drivers of a certain race.
# - Open-source your own analysis, spread your findings, seek out peer review, and maybe even write an explanatory blog post.
#
# The source code and figures for this analysis can be found in the companion Github repository - https://github.com/triestpa/Police-Analysis-Python
#
# To view the completed IPython notebook, visit the page [here](https://github.com/triestpa/Police-Analysis-Python/blob/master/traffic_stop_analysis.ipynb).
#
# The code for this project is 100% open source ([MIT license](https://github.com/triestpa/Police-Analysis-Python/blob/master/LICENSE)), so feel free to use it however you see fit in your own projects.
--------------------------------------------------------------------------------