├── ap-regression-unemployment
└── notebooks
│ ├── Associated Press - Simple regression with census data (statsmodels with dataframes).ipynb
│ ├── Associated Press - Simple regression with census data (statsmodels with formulas).ipynb
│ ├── Associated Press and life expectancy.ipynb
│ ├── Homework - Associated Press and life expectancy (Completed).ipynb
│ ├── Simple regression with census data (statsmodels with dataframes).ipynb
│ └── Simple regression with census data (statsmodels with formulas).ipynb
├── apm-reports-jury-bias
└── notebooks
│ ├── In The Dark - Alternative formula methods.ipynb
│ ├── In The Dark - Combining datasets and cleaning the data.ipynb
│ ├── In The Dark - Feature selection with p values.ipynb
│ └── In The Dark - Jury selection regression walkthrough.ipynb
├── azcentral-text-reuse-model-legislation
└── notebooks
│ ├── 01-Downloading one million pieces of legislation from LegiScan.ipynb
│ ├── 02-Taking a million pieces of legislation from a CSV and inserting them into Postgres.ipynb
│ ├── 03-Download Word, PDF and HTML content and process it into text with Tika.ipynb
│ ├── 04-Import content into Solr for advanced text searching.ipynb
│ ├── 05-Checking for legislative text reuse using Python, Solr, and ngrams.ipynb
│ ├── 05-Checking for legislative text reuse using Python, Solr, and simple text search.ipynb
│ ├── 06-Search for model legislation in over one million bills using Postgres and Solr.ipynb
│ ├── Explaining n-grams in Natural Language Processing.ipynb
│ ├── Processing documents with Apache Tika (Greek).ipynb
│ ├── Processing documents with Apache Tika.ipynb
│ └── Using topic modeling to categorize legislation.ipynb
├── basic-ml-concepts
└── notebooks
│ ├── Classifier Evaluation.ipynb
│ ├── Counting words with Python's Counter.ipynb
│ ├── Counting words with scikit-learn's CountVectorizer.ipynb
│ ├── Document similarity using word embeddings.ipynb
│ ├── Evaluating Classifiers.ipynb
│ ├── Evaluating Logistic Regressions.ipynb
│ ├── Intro to Classification.ipynb
│ ├── Linear Regression Evaluation.ipynb
│ ├── Linear Regression Part Two.ipynb
│ ├── Linear Regression Quickstart.ipynb
│ ├── Linear Regression.ipynb
│ ├── Logistic Regression Part Two.ipynb
│ ├── Logistic Regression Quickstart.ipynb
│ ├── Logistic Regression.ipynb
│ ├── Scikit-learn and categorical features.ipynb
│ ├── What is regression.ipynb
│ └── Word embeddings.ipynb
├── bloomberg-tweet-topics
└── notebooks
│ ├── Assigning categories to text using keyword matching.ipynb
│ ├── Assigning categories to text using on keyword matching.ipynb
│ ├── Building streamgraphs from candidate tweets.ipynb
│ ├── Scrape tweets from presidential primary candidates.ipynb
│ └── Topic modeling for tweets.ipynb
├── boston-globe-tickets
└── notebooks
│ └── Boston Globe ticketing regression.ipynb
├── buzzfeed-spy-planes
└── notebooks
│ ├── Buzzfeed Surveillance Planes Random Forests.ipynb
│ ├── Drawing flight paths on maps with cartopy.ipynb
│ ├── Feature engineering - Buzzfeed spy planes.ipynb
│ ├── Homework - Buzzfeed Surveillance Planes Random Forests (Completed).ipynb
│ └── Homework - Buzzfeed Surveillance Planes Random Forests.ipynb
├── caixin-museum-word-count
└── notebooks
│ ├── Chinese museum dataset cleanup.ipynb
│ ├── Chinese museums per capita analysis.ipynb
│ └── Counting words in Chinese museum names.ipynb
├── car-crashes-weight-regression
└── notebooks
│ ├── 01 - Combine Excel files across multiple sheets and save as CSV files.ipynb
│ ├── 02 - Create make model weights csv.ipynb
│ ├── 03 - Find car data from VINs.ipynb
│ ├── 04 - Combine VINs and weights.ipynb
│ ├── 05 - Clean combine and filter data.ipynb
│ ├── Car Crashes - Feature selection and engineering (Completed).ipynb
│ └── Car Crashes - Feature selection and engineering.ipynb
├── classification
└── notebooks
│ ├── Comparing classifiers.ipynb
│ ├── Correcting for imbalanced datasets.ipynb
│ ├── Evaluating Classifiers.ipynb
│ ├── Intro to Classification.ipynb
│ ├── Scikit-learn and categorical features.ipynb
│ └── Using classification algorithms with text.ipynb
├── dmn-texas-school-cheating
└── notebooks
│ ├── Texas School Cheating - Finding outliers with regression residuals.ipynb
│ ├── Texas School Cheating - Finding outliers with standard deviation and regression.ipynb
│ └── Texas School Cheating - Graph Reproductions.ipynb
├── investigating-sentiment-analysis
└── notebooks
│ ├── Cleaning the Sentiment140 data.ipynb
│ ├── Comparing sentiment analysis tools.ipynb
│ ├── Designing your own sentiment analysis tool.ipynb
│ └── More data to train our sentiment analysis tool.ipynb
├── latimes-crime-classification
└── notebooks
│ ├── Actual analysis.ipynb
│ ├── Inspecting classifications.ipynb
│ ├── Trying out different classifiers.ipynb
│ └── Using a classifier to find misclassified crimes.ipynb
├── milwaukee-potholes
└── notebooks
│ ├── Homework - Milwaukee Journal Sentinel and potholes (Completed).ipynb
│ ├── Homework - Milwaukee Journal Sentinel and potholes (No merging).ipynb
│ ├── Milwaukee Journal Sentinel and potholes full walkthrough.ipynb
│ └── Milwaukee Journal Sentinel and potholes without merging.ipynb
├── nyt-takata-airbags
└── notebooks
│ ├── Airbag classifier search (Binary).ipynb
│ ├── Airbag classifier search (CountVectorizer).ipynb
│ ├── Airbag classifier search (Decision Tree).ipynb
│ ├── NYT Takata (Completed).ipynb
│ ├── NYT Takata (CountVectorizer) (Completed).ipynb
│ └── NYT Takata (Decision Tree) (Completed).ipynb
├── propublica-criminal-sentencing
└── notebooks
│ └── week-5-1-machine-bias-class.ipynb
├── regression
└── notebooks
│ ├── Evaluating Logistic Regressions.ipynb
│ ├── Linear Regression Evaluation.ipynb
│ ├── Linear Regression Part Two.ipynb
│ ├── Linear Regression Quickstart.ipynb
│ ├── Linear Regression.ipynb
│ ├── Logistic Regression Part Two.ipynb
│ ├── Logistic Regression Quickstart.ipynb
│ ├── Logistic Regression.ipynb
│ └── What is regression.ipynb
├── reuters-asylum
└── notebooks
│ ├── Cleaning the EOIR immigration court dataset.ipynb
│ └── Using regression to analyze asylum cases.ipynb
├── reveal-mortgages
└── notebooks
│ ├── Homework - Reveal Mortgage Analysis - Logistic Regression (Completed).ipynb
│ ├── Homework - Reveal Mortgage Analysis - Logistic Regression using statsmodels formulas.ipynb
│ ├── Homework - Reveal Mortgage Analysis - Logistic Regression.ipynb
│ ├── Reveal Mortgage Analysis - Cleaning and combining data.ipynb
│ ├── Reveal Mortgage Analysis - Logistic Regression using statsmodels formulas.ipynb
│ ├── Reveal Mortgage Analysis - Logistic Regression.ipynb
│ └── Reveal Mortgage Analysis - Wild formulas in statsmodels using Patsy (short version).ipynb
├── sentiment-analysis-is-bad
└── notebooks
│ ├── Cleaning the Sentiment140 data.ipynb
│ ├── Comparing sentiment analysis tools.ipynb
│ ├── Designing your own sentiment analysis tool.ipynb
│ └── More data to train our sentiment analysis tool.ipynb
├── tampa-bay-times-schools
└── notebooks
│ ├── Linear regression on Florida schools (Completed).ipynb
│ ├── Linear regression on Florida schools (No cleaning) (Completed).ipynb
│ ├── Linear regression on Florida schools (No cleaning).ipynb
│ └── Linear regression on Florida schools.ipynb
├── text-analysis
└── notebooks
│ ├── A simple explanation of TF-IDF.ipynb
│ ├── Choosing the right number of topics for a scikit-learn topic model.ipynb
│ ├── Comparing documents in different languages.ipynb
│ ├── Counting words with Python's Counter.ipynb
│ ├── Counting words with scikit-learn's CountVectorizer.ipynb
│ ├── Document similarity over different languages.ipynb
│ ├── Document similarity using word embeddings.ipynb
│ ├── Explaining n-grams in Natural Language Processing.ipynb
│ ├── How to make scikit-learn Natural Language Processing work with Japanese Chinese.ipynb
│ ├── Introduction to topic modeling.ipynb
│ ├── Named Entity Recognition.ipynb
│ ├── Processing documents with Apache Tika (Greek).ipynb
│ ├── Processing documents with Apache Tika.ipynb
│ ├── Splitting words in East Asian languages.ipynb
│ ├── Stemming and lemmatization.ipynb
│ ├── Topic modeling and clustering.ipynb
│ ├── Topic modeling with Gensim.ipynb
│ ├── Topic models with Gensim.ipynb
│ ├── Types of text analysis.ipynb
│ ├── Using TF-IDF with Chinese.ipynb
│ └── Word embeddings.ipynb
├── upshot-trump-emolex
└── notebooks
│ ├── NRC Emotional Lexicon.ipynb
│ └── Trump vs State of the Union addresses.ipynb
└── wapo-app-reviews
└── notebooks
├── 01-Scrape app store reviews.ipynb
├── 02-Predict reviews.ipynb
├── Predict reviews.ipynb
└── Scrape app store reviews.ipynb
/azcentral-text-reuse-model-legislation/notebooks/04-Import content into Solr for advanced text searching.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Importing content in Solr for advanced text searching\n",
8 | "\n",
9 | "We've imported over a million pieces of legislation into a postgres database, **but that just isn't good enough!** While our database system can do a lot, we have some **intense text searching** in our future, and postgres just isn't up to the task.\n",
10 | "\n",
11 | "Instead, we're going to use another Apache product - [Apache Solr](https://lucene.apache.org/solr/) - as a search tool that sits next to our postgres database and performs lightning-fast text searches."
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "
\n \n \n Read online\n \n \n \n Download notebook\n \n \n \n Interactive version\n \n
"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "### Prep work: Downloading necessary files\n",
26 | "Before we get started, we need to download all of the data we'll be using.\n",
27 | "* **solrconfig:** solr configuration - TK\n"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "metadata": {},
33 | "source": [
34 | "# Make data directory if it doesn't exist\n",
35 | "!mkdir -p data\n",
36 | "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/azcentral-text-reuse-model-legislation/data/solrconfig.zip -P data\n",
37 | "!unzip -n -d data data/solrconfig.zip"
38 | ],
39 | "outputs": [],
40 | "execution_count": null
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "## Why do we need Solr?\n",
47 | "\n",
48 | "Asking why we need to use Solr for this is an excellent question! With over a million documents, it's going to be very, very, very slow to do the kinds of fancy searches we want to do using Python or postgres. We're going to feed our documents to Solr in order to speed up searching.\n",
49 | "\n",
50 | "Solr isn't a database, though! We'll just use it to say, \"hey, do you recognize any legislation like this one?\" and it will give us some bill identifiers in return. We'll take those identifiers back to postgres to find the actual content of the bills.\n",
51 | "\n",
52 | "What magic can Solr do? For example, take the sentence **Put taxes on fishing**. Even though **PLACE A TAX ON FISH** might _seem_ very similar, even after we ignore punctuation \"on\" is the only thing technically shared between the two. Solr can do magic like automatically lowercasing, removing boring words like \"on,\" \"a,\" and \"and,\" and **stemming** words like \"fish\" and \"fishes\" and \"fishing\" so they all mean the same thing.\n",
53 | "\n",
54 | "This sort of pre-processing allows us to get more accurate results more quickly in the next step.\n",
55 | "\n",
56 | "## Create the legislation database\n",
57 | "\n",
58 | "First we'll need to start solr.\n",
59 | "\n",
60 | "Because indexing 6-grams is demanding from a hardware point of view, we're going to assign Solr **5GB of RAM**. It won't use all of the RAM the entire time, but if you don't grant it all five gigs it will mysteriously halt partway through the process.\n",
61 | "\n",
62 | "If you aren't using the ngrams technique you should be able to use the default `solr start` command (which assigns 512MB of RAM)."
63 | ]
64 | },
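{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just to make that normalization idea concrete, here's a rough sketch in plain Python. The stopword list and suffix-chopping below are made up purely for illustration - Solr's real analyzers are far more sophisticated - but they show how \"Put taxes on fishing\" and \"PLACE A TAX ON FISH\" end up sharing most of their tokens."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Toy sketch of Solr-style normalization: lowercase, drop stopwords, crude stemming\n",
"# (the stopword list and suffix rules are invented for illustration only)\n",
"stopwords = {'a', 'an', 'and', 'on', 'the'}\n",
"\n",
"def stem(word):\n",
"    # Chop common endings so fish/fishes/fishing all collapse to 'fish'\n",
"    for suffix in ('ing', 'es', 's'):\n",
"        if word.endswith(suffix) and len(word) > len(suffix) + 2:\n",
"            return word[:-len(suffix)]\n",
"    return word\n",
"\n",
"def normalize(text):\n",
"    words = [word.lower().strip('.,!?') for word in text.split()]\n",
"    return [stem(word) for word in words if word not in stopwords]\n",
"\n",
"normalize('Put taxes on fishing'), normalize('PLACE A TAX ON FISH')"
],
"outputs": [],
"execution_count": null
},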
65 | {
66 | "cell_type": "code",
67 | "execution_count": 5,
68 | "metadata": {},
69 | "outputs": [
70 | {
71 | "name": "stdout",
72 | "output_type": "stream",
73 | "text": [
74 | "Sending stop command to Solr running on port 8983 ... waiting up to 180 seconds to allow Jetty process 10755 to stop gracefully.\n",
75 | " [|] [/] [-] [\\] [|] [/] [-] [\\] [|] [/] [-] [\\] [|] [/] [-] [\\] [|] [/] [-] [\\] [|] [/] [-] [\\] [|] [/] [-] [\\] "
76 | ]
77 | }
78 | ],
79 | "source": [
80 | "# Stop solr if it's running\n",
81 | "!solr stop"
82 | ]
83 | },
84 | {
85 | "cell_type": "code",
86 | "execution_count": null,
87 | "metadata": {},
88 | "outputs": [],
89 | "source": [
90 | "# Start solr with 5 gigs of RAM"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 6,
96 | "metadata": {},
97 | "outputs": [
98 | {
99 | "name": "stdout",
100 | "output_type": "stream",
101 | "text": [
102 | "*** [WARN] *** Your Max Processes Limit is currently 2048. \n",
103 | " It should be set to 65000 to avoid operational disruption. \n",
104 | " If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh\n",
105 | "Waiting up to 180 seconds to see Solr running on port 8983 [|] [/] [-] [\\] [|] [/] [-] [\\] [|] [/] [-] [\\] [|] [/] [-] [\\] [|] [/] [-] \n",
106 | "Started Solr server on port 8983 (pid=11858). Happy searching!\n",
107 | "\n",
108 | " "
109 | ]
110 | }
111 | ],
112 | "source": [
113 | "!solr start -m 5g"
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "metadata": {},
119 | "source": [
120 | "Are we re-running this to recreate our database? If so, it will destroy the existing `legislation` database. Otherwise we'll just create a new database called `legislation`.\n",
121 | "\n",
122 | "We're also going to use the `solrconfig` folder as the default configuration. It's faster than trying to set up new columns and imports manually."
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 1,
128 | "metadata": {},
129 | "outputs": [
130 | {
131 | "name": "stdout",
132 | "output_type": "stream",
133 | "text": [
134 | "\r\n",
135 | "Deleting core 'legislation' using command:\r\n",
136 | "http://localhost:8983/solr/admin/cores?action=UNLOAD&core=legislation&deleteIndex=true&deleteDataDir=true&deleteInstanceDir=true\r\n",
137 | "\r\n"
138 | ]
139 | }
140 | ],
141 | "source": [
142 | "# Delete index if already exists\n",
143 | "!solr delete -c legislation"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": 2,
149 | "metadata": {},
150 | "outputs": [
151 | {
152 | "name": "stdout",
153 | "output_type": "stream",
154 | "text": [
155 | "\r\n",
156 | "Created new core 'legislation'\r\n"
157 | ]
158 | }
159 | ],
160 | "source": [
161 | "# Use the settings in data/solrconfig to initialize our setup\n",
162 | "!solr create -c legislation -d data/solrconfig"
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "## Connect to Solr\n",
170 | "\n",
171 | "Let's connect to our Solr database and See if it works. We're going to be using two different ways of talking to solr - the [pysolr](https://github.com/django-haystack/pysolr) library when it's convenient and just normal `requests` when we want to use a feature that pysolr doesn't support.\n",
172 | "\n",
173 | "In this case, we don't use the library at all. We just want to do a health check below that can't be done with the current version of pysolr!"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": 9,
179 | "metadata": {},
180 | "outputs": [],
181 | "source": [
182 | "import requests\n",
183 | "import pysolr\n",
184 | "\n",
185 | "# Connecting just so you see what it looks like\n",
186 | "solr_url = 'http://localhost:8983/solr/legislation'\n",
187 | "solr = pysolr.Solr(solr_url, always_commit=True)"
188 | ]
189 | },
190 | {
191 | "cell_type": "code",
192 | "execution_count": 10,
193 | "metadata": {},
194 | "outputs": [
195 | {
196 | "data": {
197 | "text/plain": [
198 | "{'responseHeader': {'zkConnected': None,\n",
199 | " 'status': 0,\n",
200 | " 'QTime': 148,\n",
201 | " 'params': {'q': '{!lucene}*:*',\n",
202 | " 'distrib': 'false',\n",
203 | " 'df': '_text_',\n",
204 | " 'rows': '10',\n",
205 | " 'echoParams': 'all'}},\n",
206 | " 'status': 'OK'}"
207 | ]
208 | },
209 | "execution_count": 10,
210 | "metadata": {},
211 | "output_type": "execute_result"
212 | }
213 | ],
214 | "source": [
215 | "# Health check\n",
216 | "response = requests.get('http://localhost:8983/solr/legislation/admin/ping')\n",
217 | "response.json()"
218 | ]
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {},
223 | "source": [
224 | "Now that solr is set up, we want to do a **data import** from postgres. You can start that by visiting the Solr web interface at http://localhost:8983/solr/#/legislation/dataimport/. Click **Execute** and we're good to go!\n",
225 | "\n",
226 | "You could use the API for this, but I find that you can read error messages more easily if you use the web interface (and it's fun to see how quickly things are filling up!)."
227 | ]
228 | },
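{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd rather kick the import off from Python instead of clicking around, the (untested) sketch below should do it, assuming the solrconfig we loaded wires up the standard DataImportHandler endpoint at `/dataimport`. Polling with `command=status` lets you watch the progress."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Optional sketch: trigger the data import through the DataImportHandler API\n",
"# instead of the web interface (assumes the standard /dataimport endpoint)\n",
"response = requests.get(f'{solr_url}/dataimport', params={'command': 'full-import'})\n",
"print(response.json())\n",
"\n",
"# ...and check on how it's going\n",
"response = requests.get(f'{solr_url}/dataimport', params={'command': 'status'})\n",
"print(response.json())"
],
"outputs": [],
"execution_count": null
},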
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {},
232 | "source": [
233 | "## Stopping solr\n",
234 | "\n",
235 | "Once you're done with your import, you can stop solr. When you're ready to do searching you can restart it with the `solr start` command, without having it tie up 5 gigs of memory."
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 1,
241 | "metadata": {},
242 | "outputs": [
243 | {
244 | "name": "stdout",
245 | "output_type": "stream",
246 | "text": [
247 | "Sending stop command to Solr running on port 8983 ... waiting up to 180 seconds to allow Jetty process 622 to stop gracefully.\n",
248 | " [|] [/] [-] [\\] [|] [/] [-] [\\] "
249 | ]
250 | }
251 | ],
252 | "source": [
253 | "!solr stop"
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": null,
259 | "metadata": {},
260 | "outputs": [],
261 | "source": []
262 | }
263 | ],
264 | "metadata": {
265 | "kernelspec": {
266 | "display_name": "Python 3",
267 | "language": "python",
268 | "name": "python3"
269 | },
270 | "language_info": {
271 | "codemirror_mode": {
272 | "name": "ipython",
273 | "version": 3
274 | },
275 | "file_extension": ".py",
276 | "mimetype": "text/x-python",
277 | "name": "python",
278 | "nbconvert_exporter": "python",
279 | "pygments_lexer": "ipython3",
280 | "version": "3.6.8"
281 | },
282 | "toc": {
283 | "base_numbering": 1,
284 | "nav_menu": {},
285 | "number_sections": true,
286 | "sideBar": true,
287 | "skip_h1_title": false,
288 | "title_cell": "Table of Contents",
289 | "title_sidebar": "Contents",
290 | "toc_cell": false,
291 | "toc_position": {},
292 | "toc_section_display": true,
293 | "toc_window_display": false
294 | }
295 | },
296 | "nbformat": 4,
297 | "nbformat_minor": 2
298 | }
--------------------------------------------------------------------------------
/basic-ml-concepts/notebooks/Linear Regression Quickstart.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Linear Regression Quickstart\n",
8 | "\n",
9 | "Already know what's what with linear regression, just need to know how to tackle it in Python? We're here for you! If not, continue on to the next section.\n",
10 | "\n",
11 | "We're going to **ignore the nuance of what we're doing** in this notebook, it's really just for people who need to see the process."
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "\n \n \n Read online\n \n \n \n Download notebook\n \n \n \n Interactive version\n \n
"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "## Pandas for our data\n",
26 | "\n",
27 | "As is typical, we'll be using [pandas dataframes](https://pandas.pydata.org/) for the data."
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 27,
33 | "metadata": {},
34 | "outputs": [
35 | {
36 | "data": {
37 | "text/html": [
38 | "\n",
39 | "\n",
52 | "
\n",
53 | " \n",
54 | " \n",
55 | " | \n",
56 | " sold | \n",
57 | " revenue | \n",
58 | "
\n",
59 | " \n",
60 | " \n",
61 | " \n",
62 | " 0 | \n",
63 | " 0 | \n",
64 | " 0 | \n",
65 | "
\n",
66 | " \n",
67 | " 1 | \n",
68 | " 4 | \n",
69 | " 8 | \n",
70 | "
\n",
71 | " \n",
72 | " 2 | \n",
73 | " 16 | \n",
74 | " 32 | \n",
75 | "
\n",
76 | " \n",
77 | "
\n",
78 | "
"
79 | ],
80 | "text/plain": [
81 | " sold revenue\n",
82 | "0 0 0\n",
83 | "1 4 8\n",
84 | "2 16 32"
85 | ]
86 | },
87 | "execution_count": 27,
88 | "metadata": {},
89 | "output_type": "execute_result"
90 | }
91 | ],
92 | "source": [
93 | "import pandas as pd\n",
94 | "\n",
95 | "df = pd.DataFrame([\n",
96 | " { 'sold': 0, 'revenue': 0 },\n",
97 | " { 'sold': 4, 'revenue': 8 },\n",
98 | " { 'sold': 16, 'revenue': 32 },\n",
99 | "])\n",
100 | "df"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "## Performing a regression\n",
108 | "\n",
109 | "The [statsmodels](statsmodels.org) package is your best friend when it comes to regression. In theory you can do it using other techniques or libraries, but statsmodels is just *so simple*.\n",
110 | "\n",
111 | "For the regression below, I'm using the formula method of describing the regression. If that makes you grumpy, check the [regression reference page](/reference/regression/) for more details."
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "execution_count": 28,
117 | "metadata": {},
118 | "outputs": [
119 | {
120 | "data": {
121 | "text/html": [
122 | "\n",
123 | "OLS Regression Results\n",
124 | "\n",
125 | " Dep. Variable: | revenue | R-squared: | 1.000 | \n",
126 | "
\n",
127 | "\n",
128 | " Model: | OLS | Adj. R-squared: | 1.000 | \n",
129 | "
\n",
130 | "\n",
131 | " Method: | Least Squares | F-statistic: | 9.502e+30 | \n",
132 | "
\n",
133 | "\n",
134 | " Date: | Sun, 08 Dec 2019 | Prob (F-statistic): | 2.07e-16 | \n",
135 | "
\n",
136 | "\n",
137 | " Time: | 10:14:18 | Log-Likelihood: | 94.907 | \n",
138 | "
\n",
139 | "\n",
140 | " No. Observations: | 3 | AIC: | -185.8 | \n",
141 | "
\n",
142 | "\n",
143 | " Df Residuals: | 1 | BIC: | -187.6 | \n",
144 | "
\n",
145 | "\n",
146 | " Df Model: | 1 | | | \n",
147 | "
\n",
148 | "\n",
149 | " Covariance Type: | nonrobust | | | \n",
150 | "
\n",
151 | "
\n",
152 | "\n",
153 | "\n",
154 | " | coef | std err | t | P>|t| | [0.025 | 0.975] | \n",
155 | "
\n",
156 | "\n",
157 | " Intercept | -2.665e-15 | 6.18e-15 | -0.431 | 0.741 | -8.12e-14 | 7.58e-14 | \n",
158 | "
\n",
159 | "\n",
160 | " sold | 2.0000 | 6.49e-16 | 3.08e+15 | 0.000 | 2.000 | 2.000 | \n",
161 | "
\n",
162 | "
\n",
163 | "\n",
164 | "\n",
165 | " Omnibus: | nan | Durbin-Watson: | 1.149 | \n",
166 | "
\n",
167 | "\n",
168 | " Prob(Omnibus): | nan | Jarque-Bera (JB): | 0.471 | \n",
169 | "
\n",
170 | "\n",
171 | " Skew: | -0.616 | Prob(JB): | 0.790 | \n",
172 | "
\n",
173 | "\n",
174 | " Kurtosis: | 1.500 | Cond. No. | 13.4 | \n",
175 | "
\n",
176 | "
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
177 | ],
178 | "text/plain": [
179 | "\n",
180 | "\"\"\"\n",
181 | " OLS Regression Results \n",
182 | "==============================================================================\n",
183 | "Dep. Variable: revenue R-squared: 1.000\n",
184 | "Model: OLS Adj. R-squared: 1.000\n",
185 | "Method: Least Squares F-statistic: 9.502e+30\n",
186 | "Date: Sun, 08 Dec 2019 Prob (F-statistic): 2.07e-16\n",
187 | "Time: 10:14:18 Log-Likelihood: 94.907\n",
188 | "No. Observations: 3 AIC: -185.8\n",
189 | "Df Residuals: 1 BIC: -187.6\n",
190 | "Df Model: 1 \n",
191 | "Covariance Type: nonrobust \n",
192 | "==============================================================================\n",
193 | " coef std err t P>|t| [0.025 0.975]\n",
194 | "------------------------------------------------------------------------------\n",
195 | "Intercept -2.665e-15 6.18e-15 -0.431 0.741 -8.12e-14 7.58e-14\n",
196 | "sold 2.0000 6.49e-16 3.08e+15 0.000 2.000 2.000\n",
197 | "==============================================================================\n",
198 | "Omnibus: nan Durbin-Watson: 1.149\n",
199 | "Prob(Omnibus): nan Jarque-Bera (JB): 0.471\n",
200 | "Skew: -0.616 Prob(JB): 0.790\n",
201 | "Kurtosis: 1.500 Cond. No. 13.4\n",
202 | "==============================================================================\n",
203 | "\n",
204 | "Warnings:\n",
205 | "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
206 | "\"\"\""
207 | ]
208 | },
209 | "execution_count": 28,
210 | "metadata": {},
211 | "output_type": "execute_result"
212 | }
213 | ],
214 | "source": [
215 | "import statsmodels.formula.api as smf\n",
216 | "\n",
217 | "model = smf.ols(\"revenue ~ sold\", data=df)\n",
218 | "results = model.fit()\n",
219 | "results.summary()"
220 | ]
221 | },
222 | {
223 | "cell_type": "markdown",
224 | "metadata": {},
225 | "source": [
226 | "For each unit sold, we get 2 revenue. That's about it."
227 | ]
228 | },
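{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd rather grab those numbers programmatically than read them off the summary table, the fitted results object holds them - `results.params` is standard statsmodels, shown here as a quick extra."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# The fitted coefficients (intercept and slope) as a pandas Series\n",
"results.params"
],
"outputs": [],
"execution_count": null
},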
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {},
232 | "source": [
233 | "## Multivariable regression\n",
234 | "\n",
235 | "Multivariable regression is easy-peasy. Let's add a couple more columns to our dataset, adding tips to the equation."
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 29,
241 | "metadata": {},
242 | "outputs": [
243 | {
244 | "data": {
245 | "text/html": [
246 | "\n",
247 | "\n",
260 | "
\n",
261 | " \n",
262 | " \n",
263 | " | \n",
264 | " sold | \n",
265 | " revenue | \n",
266 | " tips | \n",
267 | " charge_amount | \n",
268 | "
\n",
269 | " \n",
270 | " \n",
271 | " \n",
272 | " 0 | \n",
273 | " 0 | \n",
274 | " 0 | \n",
275 | " 0 | \n",
276 | " 0 | \n",
277 | "
\n",
278 | " \n",
279 | " 1 | \n",
280 | " 4 | \n",
281 | " 8 | \n",
282 | " 1 | \n",
283 | " 9 | \n",
284 | "
\n",
285 | " \n",
286 | " 2 | \n",
287 | " 16 | \n",
288 | " 32 | \n",
289 | " 2 | \n",
290 | " 34 | \n",
291 | "
\n",
292 | " \n",
293 | "
\n",
294 | "
"
295 | ],
296 | "text/plain": [
297 | " sold revenue tips charge_amount\n",
298 | "0 0 0 0 0\n",
299 | "1 4 8 1 9\n",
300 | "2 16 32 2 34"
301 | ]
302 | },
303 | "execution_count": 29,
304 | "metadata": {},
305 | "output_type": "execute_result"
306 | }
307 | ],
308 | "source": [
309 | "import pandas as pd\n",
310 | "\n",
311 | "df = pd.DataFrame([\n",
312 | " { 'sold': 0, 'revenue': 0, 'tips': 0, 'charge_amount': 0 },\n",
313 | " { 'sold': 4, 'revenue': 8, 'tips': 1, 'charge_amount': 9 },\n",
314 | " { 'sold': 16, 'revenue': 32, 'tips': 2, 'charge_amount': 34 },\n",
315 | "])\n",
316 | "df"
317 | ]
318 | },
319 | {
320 | "cell_type": "code",
321 | "execution_count": 30,
322 | "metadata": {},
323 | "outputs": [
324 | {
325 | "data": {
326 | "text/html": [
327 | "\n",
328 | "OLS Regression Results\n",
329 | "\n",
330 | " Dep. Variable: | charge_amount | R-squared: | 1.000 | \n",
331 | "
\n",
332 | "\n",
333 | " Model: | OLS | Adj. R-squared: | nan | \n",
334 | "
\n",
335 | "\n",
336 | " Method: | Least Squares | F-statistic: | 0.000 | \n",
337 | "
\n",
338 | "\n",
339 | " Date: | Sun, 08 Dec 2019 | Prob (F-statistic): | nan | \n",
340 | "
\n",
341 | "\n",
342 | " Time: | 10:14:20 | Log-Likelihood: | 89.745 | \n",
343 | "
\n",
344 | "\n",
345 | " No. Observations: | 3 | AIC: | -173.5 | \n",
346 | "
\n",
347 | "\n",
348 | " Df Residuals: | 0 | BIC: | -176.2 | \n",
349 | "
\n",
350 | "\n",
351 | " Df Model: | 2 | | | \n",
352 | "
\n",
353 | "\n",
354 | " Covariance Type: | nonrobust | | | \n",
355 | "
\n",
356 | "
\n",
357 | "\n",
358 | "\n",
359 | " | coef | std err | t | P>|t| | [0.025 | 0.975] | \n",
360 | "
\n",
361 | "\n",
362 | " Intercept | -1.685e-15 | inf | -0 | nan | nan | nan | \n",
363 | "
\n",
364 | "\n",
365 | " sold | 2.0000 | inf | 0 | nan | nan | nan | \n",
366 | "
\n",
367 | "\n",
368 | " tips | 1.0000 | inf | 0 | nan | nan | nan | \n",
369 | "
\n",
370 | "
\n",
371 | "\n",
372 | "\n",
373 | " Omnibus: | nan | Durbin-Watson: | 0.922 | \n",
374 | "
\n",
375 | "\n",
376 | " Prob(Omnibus): | nan | Jarque-Bera (JB): | 0.520 | \n",
377 | "
\n",
378 | "\n",
379 | " Skew: | -0.691 | Prob(JB): | 0.771 | \n",
380 | "
\n",
381 | "\n",
382 | " Kurtosis: | 1.500 | Cond. No. | 44.0 | \n",
383 | "
\n",
384 | "
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
385 | ],
386 | "text/plain": [
387 | "\n",
388 | "\"\"\"\n",
389 | " OLS Regression Results \n",
390 | "==============================================================================\n",
391 | "Dep. Variable: charge_amount R-squared: 1.000\n",
392 | "Model: OLS Adj. R-squared: nan\n",
393 | "Method: Least Squares F-statistic: 0.000\n",
394 | "Date: Sun, 08 Dec 2019 Prob (F-statistic): nan\n",
395 | "Time: 10:14:20 Log-Likelihood: 89.745\n",
396 | "No. Observations: 3 AIC: -173.5\n",
397 | "Df Residuals: 0 BIC: -176.2\n",
398 | "Df Model: 2 \n",
399 | "Covariance Type: nonrobust \n",
400 | "==============================================================================\n",
401 | " coef std err t P>|t| [0.025 0.975]\n",
402 | "------------------------------------------------------------------------------\n",
403 | "Intercept -1.685e-15 inf -0 nan nan nan\n",
404 | "sold 2.0000 inf 0 nan nan nan\n",
405 | "tips 1.0000 inf 0 nan nan nan\n",
406 | "==============================================================================\n",
407 | "Omnibus: nan Durbin-Watson: 0.922\n",
408 | "Prob(Omnibus): nan Jarque-Bera (JB): 0.520\n",
409 | "Skew: -0.691 Prob(JB): 0.771\n",
410 | "Kurtosis: 1.500 Cond. No. 44.0\n",
411 | "==============================================================================\n",
412 | "\n",
413 | "Warnings:\n",
414 | "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
415 | "\"\"\""
416 | ]
417 | },
418 | "execution_count": 30,
419 | "metadata": {},
420 | "output_type": "execute_result"
421 | }
422 | ],
423 | "source": [
424 | "import statsmodels.formula.api as smf\n",
425 | "\n",
426 | "model = smf.ols(\"charge_amount ~ sold + tips\", data=df)\n",
427 | "results = model.fit()\n",
428 | "results.summary()"
429 | ]
430 | },
431 | {
432 | "cell_type": "markdown",
433 | "metadata": {},
434 | "source": [
435 | "There you go!\n",
436 | "\n",
437 | "If you'd like more details, you can continue on in this section. If you'd just like the how-to-do-an-exact-thing explanations, check out the [regression reference page](/reference/regression/)."
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": null,
443 | "metadata": {},
444 | "outputs": [],
445 | "source": []
446 | }
447 | ],
448 | "metadata": {
449 | "kernelspec": {
450 | "display_name": "Python 3",
451 | "language": "python",
452 | "name": "python3"
453 | },
454 | "language_info": {
455 | "codemirror_mode": {
456 | "name": "ipython",
457 | "version": 3
458 | },
459 | "file_extension": ".py",
460 | "mimetype": "text/x-python",
461 | "name": "python",
462 | "nbconvert_exporter": "python",
463 | "pygments_lexer": "ipython3",
464 | "version": "3.6.8"
465 | },
466 | "toc": {
467 | "base_numbering": 1,
468 | "nav_menu": {},
469 | "number_sections": true,
470 | "sideBar": true,
471 | "skip_h1_title": false,
472 | "title_cell": "Table of Contents",
473 | "title_sidebar": "Contents",
474 | "toc_cell": false,
475 | "toc_position": {},
476 | "toc_section_display": true,
477 | "toc_window_display": false
478 | }
479 | },
480 | "nbformat": 4,
481 | "nbformat_minor": 2
482 | }
--------------------------------------------------------------------------------
/basic-ml-concepts/notebooks/What is regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Predicting values with regression\n",
8 | "\n",
9 | "Let's take a look at how we can use regression to find the relationships inside of our data."
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "\n \n \n Read online\n \n \n \n Download notebook\n \n \n \n Interactive version\n \n
"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "## What is regression?\n",
24 | "\n",
25 | "Regression is a way to describe how two things are related to each other. Specifically, how a change in one is related to a change in the other. You might notice it in sentences like:\n",
26 | "\n",
27 | "* \"An increase of 10 percentage points in the unemployment rate in a neighborhood translated to a loss of roughly a year and a half of life expectancy,\" from the [Associated Press](https://apnews.com/66ac44186b6249709501f07a7eab36da). As unemployment goes up, life expectancy goes down.\n",
28 | "* \"Reveal\u2019s analysis also showed that the greater the number of African Americans or Latinos in a neighborhood, the more likely a loan application would be denied there \u2013 even after accounting for income and other factors,\" from [Reveal](https://www.revealnews.org/article/for-people-of-color-banks-are-shutting-the-door-to-homeownership/). As the amount of African Americans or Latinos goes up, the likelihood of a loan being denied goes up.\n",
29 | "* \"In Boston, Asian and Latino residents were more likely to be ticketed than were out-of-towners of the same race, when cited for the same offense,\" from the [Boston Globe](http://archive.boston.com/globe/metro/packages/tickets/072003.shtml)\n",
30 | "\n",
31 | "## Types of regression\n",
32 | "\n",
33 | "While there are [a lot of kinds of regression out there](https://www.listendata.com/2018/03/regression-analysis.html), there are really only two journalists care about: linear regression and logistic regression.\n",
34 | "\n",
35 | "**Linear regression** is used to predict numbers. Life expectancy is a number, so the Associate Press story above uses linear regression. It's also very common when looking at school test scores.\n",
36 | "\n",
37 | "**Logistic regression** is used to predict categories such as yes/no or confirmed/denied. A loan being denied is a yes/no, so the Reveal story above uses logistic regression. Logistic regression is very common when looking at bias or discrimination.\n",
38 | "\n",
39 | "## When to use regression\n",
40 | "\n",
41 | "The biggest clue for regression is when you say you want to find the **relationship between** or **correlation between** two different things. Correlation is a [real stats thing](http://guessthecorrelation.com/) but you probably actually want the \"when X goes up, Y changes such-and-such amount\" kind of description, which comes from regression.\n",
42 | "\n",
43 | "You can also recognize regression from the phrases \"all other factors being equal\" or \"controlling for differences in.\" Notice how in the Reveal example above the African American/Latino population matters _even after accounting for income and other factors_.\n",
44 | "\n",
45 | "You'll use both linear and logistic regression for **two major things**:\n",
46 | "\n",
47 | "* **Understanding the impact of different factors:** In ProPublica's [criminal sentencing bias piece](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing), they showed that race played an outsize role in determining sentencing suggestions, controlling for other possibly-explanatory factors.\n",
48 | "* **Predicting to find outliers:** In [this piece](http://clipfile.org/?p=754) from the Dallas Morning News, they used regression analysis to predict a school's 4th grade standardized test scores based on their 3rd grade scores. Schools that did much much much better than expected were suspected of cheating.\n",
49 | "\n",
50 | "While regression will show up in other cases - automatically classifying documents, for example - we'll cover that separately.\n",
51 | "\n",
52 | "## How to stay careful\n",
53 | "\n",
54 | "If you're doing anything involving statistics or large datasets, **you need to check your results with an expert.** Running a regression can be prety straightforward, you're sure to find a gremlins hiding in the details.\n",
55 | "\n",
56 | "Almost every single person I interviewed who performed a regression analysis for a story leaned heavily on both other members of their team, as well as at least one outside expert. Some newsrooms even pitted multiple academics against each other over the results, having statisticians and subject-matter experts battle it out over the \"right\" way to do it!\n",
57 | "\n",
58 | "Even if you hired sixty statisticians and got back a hundred different contradictory answers, even if you can't be 100% certain it's all 100% perfect, even if you're convinced statistics is more art than science, at least you'll have an idea of possible issues wiht your approach and how they might affect your story."
59 | ]
60 | },
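{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the linear-vs-logistic distinction above concrete, here's a tiny sketch using made-up data (hours studied, an exam score, and whether the student passed). It isn't from any of the stories above - it's just the two statsmodels calls side by side: `ols` predicts a number, `logit` predicts a yes/no."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"# Made-up data: hours studied, exam score, and whether the student passed\n",
"df = pd.DataFrame({\n",
"    'hours':  [1, 2, 3, 4, 5, 6, 7, 8],\n",
"    'score':  [52, 55, 61, 60, 70, 74, 80, 85],\n",
"    'passed': [0, 0, 1, 0, 1, 1, 1, 1],\n",
"})\n",
"\n",
"# Linear regression: predicting a number (the score)\n",
"print(smf.ols('score ~ hours', data=df).fit().params)\n",
"\n",
"# Logistic regression: predicting a category (passed or not)\n",
"print(smf.logit('passed ~ hours', data=df).fit().params)"
],
"outputs": [],
"execution_count": null
},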
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": []
67 | }
68 | ],
69 | "metadata": {
70 | "kernelspec": {
71 | "display_name": "Python 3",
72 | "language": "python",
73 | "name": "python3"
74 | },
75 | "language_info": {
76 | "codemirror_mode": {
77 | "name": "ipython",
78 | "version": 3
79 | },
80 | "file_extension": ".py",
81 | "mimetype": "text/x-python",
82 | "name": "python",
83 | "nbconvert_exporter": "python",
84 | "pygments_lexer": "ipython3",
85 | "version": "3.6.8"
86 | },
87 | "toc": {
88 | "base_numbering": 1,
89 | "nav_menu": {},
90 | "number_sections": true,
91 | "sideBar": true,
92 | "skip_h1_title": false,
93 | "title_cell": "Table of Contents",
94 | "title_sidebar": "Contents",
95 | "toc_cell": false,
96 | "toc_position": {},
97 | "toc_section_display": true,
98 | "toc_window_display": false
99 | }
100 | },
101 | "nbformat": 4,
102 | "nbformat_minor": 2
103 | }
--------------------------------------------------------------------------------
/bloomberg-tweet-topics/notebooks/Scrape tweets from presidential primary candidates.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Scraping tweets from Democratic presidential primary candidates\n",
8 | "\n",
9 | "What's a person to do when the Twitter API only lets you go back so far? Scraping to the rescue! And luckily we can use a *library* to scrape instead of having to write something manually."
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "\n \n \n Read online\n \n \n \n Download notebook\n \n \n \n Interactive version\n \n
"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "## Introducing GetOldTweets3\n",
24 | "\n",
25 | "We'll be using the adorably-named [GetOldTweets3](https://github.com/Mottl/GetOldTweets3) library to acquire the Twitter history of the candidates in the Democratic presidential primary. We *could* use the [Twitter API](https://developer.twitter.com/en/docs), but unfortunately it doesn't let you go all the way back to the begining.\n",
26 | "\n",
27 | "GetOldTweets3, though, will help you get each and every tweet from 2019 by scraping each user's public timeline."
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": null,
33 | "metadata": {},
34 | "outputs": [],
35 | "source": [
36 | "#!pip install GetOldTweets3"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "## Scraping our tweets\n",
44 | "\n",
45 | "We're going to start with a list of usernames we're interested in, then loop through each one and use GetOldTweets3 to save the tweets into a CSV file named after the username."
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": 60,
51 | "metadata": {},
52 | "outputs": [],
53 | "source": [
54 | "usernames = [\n",
55 | " 'joebiden', 'corybooker','petebuttigieg','juliancastro','kamalaharris',\n",
56 | " 'amyklobuchar','betoorourke','berniesanders','ewarren','andrewyang',\n",
57 | " 'michaelbennet','governorbullock','billdeblasio','johndelaney',\n",
58 | " 'tulsigabbard','waynemessam','timryan','joesestak','tomsteyer',\n",
59 | " 'marwilliamson','sengillibrand','hickenlooper','jayinslee',\n",
60 | " 'sethmoulton','ericswalwell'\n",
61 | "]"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 61,
67 | "metadata": {},
68 | "outputs": [
69 | {
70 | "name": "stdout",
71 | "output_type": "stream",
72 | "text": [
73 | "Downloading for joebiden\n",
74 | "(859, 15)\n",
75 | "Downloading for corybooker\n",
76 | "(1317, 15)\n",
77 | "Downloading for petebuttigieg\n",
78 | "(866, 15)\n",
79 | "Downloading for juliancastro\n",
80 | "(1231, 15)\n",
81 | "Downloading for kamalaharris\n",
82 | "(2114, 15)\n",
83 | "Downloading for amyklobuchar\n",
84 | "(1405, 15)\n",
85 | "Downloading for betoorourke\n",
86 | "(1683, 15)\n",
87 | "Downloading for berniesanders\n",
88 | "(1881, 15)\n",
89 | "Downloading for ewarren\n",
90 | "(2571, 15)\n",
91 | "Downloading for andrewyang\n",
92 | "(4475, 15)\n",
93 | "Downloading for michaelbennet\n",
94 | "(906, 15)\n",
95 | "Downloading for governorbullock\n",
96 | "(1722, 15)\n",
97 | "Downloading for billdeblasio\n",
98 | "(500, 15)\n",
99 | "Downloading for johndelaney\n",
100 | "(1921, 15)\n",
101 | "Downloading for tulsigabbard\n",
102 | "(900, 15)\n",
103 | "Downloading for waynemessam\n",
104 | "(817, 15)\n",
105 | "Downloading for timryan\n",
106 | "(1486, 15)\n",
107 | "Downloading for joesestak\n",
108 | "(621, 15)\n",
109 | "Downloading for tomsteyer\n",
110 | "(1279, 15)\n",
111 | "Downloading for marwilliamson\n",
112 | "(2637, 15)\n",
113 | "Downloading for sengillibrand\n",
114 | "(1538, 15)\n",
115 | "Downloading for hickenlooper\n",
116 | "(973, 15)\n",
117 | "Downloading for jayinslee\n",
118 | "(2128, 15)\n",
119 | "Downloading for sethmoulton\n",
120 | "(1242, 15)\n",
121 | "Downloading for ericswalwell\n",
122 | "(1717, 15)\n"
123 | ]
124 | }
125 | ],
126 | "source": [
127 | "import GetOldTweets3 as got\n",
128 | "\n",
129 | "def download_tweets(username):\n",
130 | " print(f\"Downloading for {username}\")\n",
131 | " tweetCriteria = got.manager.TweetCriteria().setUsername(username)\\\n",
132 | " .setSince(\"2019-01-01\")\\\n",
133 | " .setUntil(\"2019-09-01\")\\\n",
134 | "\n",
135 | " tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n",
136 | " df = pd.DataFrame([tweet.__dict__ for tweet in tweets])\n",
137 | " print(df.shape)\n",
138 | " df.to_csv(f\"data/tweets-raw-{username}.csv\", index=False)\n",
139 | " \n",
140 | "for username in usernames:\n",
141 | " download_tweets(username)"
142 | ]
143 | },
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {},
147 | "source": [
148 | "## Combining our files\n",
149 | "\n",
150 | "We don't want to operate on these tweets in separate files, though - we'd rather have them all in one file! We'll finish up our data scraping by combining all of the tweets into one file.\n",
151 | "\n",
152 | "We'll start by using the [glob library](https://pymotw.com/3/glob/) to get a list of the filenames."
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": 2,
158 | "metadata": {},
159 | "outputs": [
160 | {
161 | "name": "stdout",
162 | "output_type": "stream",
163 | "text": [
164 | "['data/tweets-kamalaharris.csv', 'data/tweets-tomsteyer.csv', 'data/tweets-betoorourke.csv', 'data/tweets-amyklobuchar.csv', 'data/tweets-billdeblasio.csv', 'data/tweets-joebiden.csv', 'data/tweets-petebuttigieg.csv', 'data/tweets-sethmoulton.csv', 'data/tweets-joesestak.csv', 'data/tweets-juliancastro.csv', 'data/tweets-tulsigabbard.csv', 'data/tweets-waynemessam.csv', 'data/tweets-marwilliamson.csv', 'data/tweets-governorbullock.csv', 'data/tweets-jayinslee.csv', 'data/tweets-hickenlooper.csv', 'data/tweets-sengillibrand.csv', 'data/tweets-ericswalwell.csv', 'data/tweets-johndelaney.csv', 'data/tweets-corybooker.csv', 'data/tweets-michaelbennet.csv', 'data/tweets-timryan.csv', 'data/tweets-ewarren.csv', 'data/tweets-berniesanders.csv', 'data/tweets-andrewyang.csv']\n"
165 | ]
166 | }
167 | ],
168 | "source": [
169 | "import glob\n",
170 | "\n",
171 | "filenames = glob.glob(\"data/tweets-raw-*.csv\")\n",
172 | "print(filenames)"
173 | ]
174 | },
175 | {
176 | "cell_type": "markdown",
177 | "metadata": {},
178 | "source": [
179 | "We'll then use a list comprehension to turn each filename into a datframe, then `pd.concat` to combine them together."
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": 4,
185 | "metadata": {},
186 | "outputs": [
187 | {
188 | "data": {
189 | "text/plain": [
190 | "(38789, 15)"
191 | ]
192 | },
193 | "execution_count": 4,
194 | "metadata": {},
195 | "output_type": "execute_result"
196 | }
197 | ],
198 | "source": [
199 | "import pandas as pd\n",
200 | "\n",
201 | "dataframes = [pd.read_csv(filename) for filename in filenames]\n",
202 | "df = pd.concat(dataframes)\n",
203 | "df.shape"
204 | ]
205 | },
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {},
209 | "source": [
210 | "Let's pull a sample to make sure it looks like we think it should..."
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": 6,
216 | "metadata": {},
217 | "outputs": [
218 | {
219 | "data": {
220 | "text/html": [
221 | "\n",
222 | "\n",
235 | "
\n",
236 | " \n",
237 | " \n",
238 | " | \n",
239 | " username | \n",
240 | " to | \n",
241 | " text | \n",
242 | " retweets | \n",
243 | " favorites | \n",
244 | " replies | \n",
245 | " id | \n",
246 | " permalink | \n",
247 | " author_id | \n",
248 | " date | \n",
249 | " formatted_date | \n",
250 | " hashtags | \n",
251 | " mentions | \n",
252 | " geo | \n",
253 | " urls | \n",
254 | "
\n",
255 | " \n",
256 | " \n",
257 | " \n",
258 | " 247 | \n",
259 | " TimRyan | \n",
260 | " NaN | \n",
261 | " Hate, racism, white nationalism is terrorizing... | \n",
262 | " 88 | \n",
263 | " 401 | \n",
264 | " 198 | \n",
265 | " 1158019752931074049 | \n",
266 | " https://twitter.com/TimRyan/status/11580197529... | \n",
267 | " 466532637 | \n",
268 | " 2019-08-04 14:19:58+00:00 | \n",
269 | " Sun Aug 04 14:19:58 +0000 2019 | \n",
270 | " NaN | \n",
271 | " NaN | \n",
272 | " NaN | \n",
273 | " NaN | \n",
274 | "
\n",
275 | " \n",
276 | " 681 | \n",
277 | " amyklobuchar | \n",
278 | " washingtonpost | \n",
279 | " We need to see the full report in order to pro... | \n",
280 | " 518 | \n",
281 | " 1813 | \n",
282 | " 240 | \n",
283 | " 1126198087213563905 | \n",
284 | " https://twitter.com/amyklobuchar/status/112619... | \n",
285 | " 33537967 | \n",
286 | " 2019-05-08 18:52:02+00:00 | \n",
287 | " Wed May 08 18:52:02 +0000 2019 | \n",
288 | " NaN | \n",
289 | " NaN | \n",
290 | " NaN | \n",
291 | " https://twitter.com/washingtonpost/status/1126... | \n",
292 | "
\n",
293 | " \n",
294 | " 647 | \n",
295 | " GovernorBullock | \n",
296 | " NaN | \n",
297 | " McConnell has stood in the way of American pro... | \n",
298 | " 89 | \n",
299 | " 392 | \n",
300 | " 157 | \n",
301 | " 1149037399030272011 | \n",
302 | " https://twitter.com/GovernorBullock/status/114... | \n",
303 | " 111721601 | \n",
304 | " 2019-07-10 19:27:18+00:00 | \n",
305 | " Wed Jul 10 19:27:18 +0000 2019 | \n",
306 | " NaN | \n",
307 | " @AmyMcGrathKY | \n",
308 | " NaN | \n",
309 | " http://bit.ly/2SdvJP6 | \n",
310 | "
\n",
311 | " \n",
312 | " 543 | \n",
313 | " ericswalwell | \n",
314 | " NaN | \n",
315 | " $1 could be the difference between 4 more year... | \n",
316 | " 327 | \n",
317 | " 893 | \n",
318 | " 877 | \n",
319 | " 1133149644014391303 | \n",
320 | " https://twitter.com/ericswalwell/status/113314... | \n",
321 | " 377609596 | \n",
322 | " 2019-05-27 23:15:02+00:00 | \n",
323 | " Mon May 27 23:15:02 +0000 2019 | \n",
324 | " NaN | \n",
325 | " NaN | \n",
326 | " NaN | \n",
327 | " https://bit.ly/2EnCLuG | \n",
328 | "
\n",
329 | " \n",
330 | " 1423 | \n",
331 | " marwilliamson | \n",
332 | " maidenoftheair | \n",
333 | " Hear hear. | \n",
334 | " 0 | \n",
335 | " 5 | \n",
336 | " 0 | \n",
337 | " 1122503791553630210 | \n",
338 | " https://twitter.com/marwilliamson/status/11225... | \n",
339 | " 21522338 | \n",
340 | " 2019-04-28 14:12:13+00:00 | \n",
341 | " Sun Apr 28 14:12:13 +0000 2019 | \n",
342 | " NaN | \n",
343 | " NaN | \n",
344 | " NaN | \n",
345 | " NaN | \n",
346 | "
\n",
347 | " \n",
348 | "
\n",
349 | "
"
350 | ],
351 | "text/plain": [
352 | " username to \\\n",
353 | "247 TimRyan NaN \n",
354 | "681 amyklobuchar washingtonpost \n",
355 | "647 GovernorBullock NaN \n",
356 | "543 ericswalwell NaN \n",
357 | "1423 marwilliamson maidenoftheair \n",
358 | "\n",
359 | " text retweets favorites \\\n",
360 | "247 Hate, racism, white nationalism is terrorizing... 88 401 \n",
361 | "681 We need to see the full report in order to pro... 518 1813 \n",
362 | "647 McConnell has stood in the way of American pro... 89 392 \n",
363 | "543 $1 could be the difference between 4 more year... 327 893 \n",
364 | "1423 Hear hear. 0 5 \n",
365 | "\n",
366 | " replies id \\\n",
367 | "247 198 1158019752931074049 \n",
368 | "681 240 1126198087213563905 \n",
369 | "647 157 1149037399030272011 \n",
370 | "543 877 1133149644014391303 \n",
371 | "1423 0 1122503791553630210 \n",
372 | "\n",
373 | " permalink author_id \\\n",
374 | "247 https://twitter.com/TimRyan/status/11580197529... 466532637 \n",
375 | "681 https://twitter.com/amyklobuchar/status/112619... 33537967 \n",
376 | "647 https://twitter.com/GovernorBullock/status/114... 111721601 \n",
377 | "543 https://twitter.com/ericswalwell/status/113314... 377609596 \n",
378 | "1423 https://twitter.com/marwilliamson/status/11225... 21522338 \n",
379 | "\n",
380 | " date formatted_date hashtags \\\n",
381 | "247 2019-08-04 14:19:58+00:00 Sun Aug 04 14:19:58 +0000 2019 NaN \n",
382 | "681 2019-05-08 18:52:02+00:00 Wed May 08 18:52:02 +0000 2019 NaN \n",
383 | "647 2019-07-10 19:27:18+00:00 Wed Jul 10 19:27:18 +0000 2019 NaN \n",
384 | "543 2019-05-27 23:15:02+00:00 Mon May 27 23:15:02 +0000 2019 NaN \n",
385 | "1423 2019-04-28 14:12:13+00:00 Sun Apr 28 14:12:13 +0000 2019 NaN \n",
386 | "\n",
387 | " mentions geo urls \n",
388 | "247 NaN NaN NaN \n",
389 | "681 NaN NaN https://twitter.com/washingtonpost/status/1126... \n",
390 | "647 @AmyMcGrathKY NaN http://bit.ly/2SdvJP6 \n",
391 | "543 NaN NaN https://bit.ly/2EnCLuG \n",
392 | "1423 NaN NaN NaN "
393 | ]
394 | },
395 | "execution_count": 6,
396 | "metadata": {},
397 | "output_type": "execute_result"
398 | }
399 | ],
400 | "source": [
401 | "df.sample(5)"
402 | ]
403 | },
404 | {
405 | "cell_type": "markdown",
406 | "metadata": {},
407 | "source": [
408 | "Looking good! Let's **remove any missing the `text` column** (I don't know why, *but they exist*), and save it so we can analyze it in the next notebook."
409 | ]
410 | },
411 | {
412 | "cell_type": "code",
413 | "execution_count": 8,
414 | "metadata": {},
415 | "outputs": [],
416 | "source": [
417 | "df = df.dropna(subset=['text'])\n",
418 | "df.to_csv(\"data/tweets.csv\", index=False)"
419 | ]
420 | },
421 | {
422 | "cell_type": "markdown",
423 | "metadata": {},
424 | "source": [
425 | "## Review\n",
426 | "\n",
427 | "In this section we used the `GetOldTweets3` library to download large numbers of tweets that the API could not get us.\n",
428 | "\n",
429 | "## Discussion topics\n",
430 | "\n",
431 | "We're certainly breaking Twitter's Terms of Service by scraping these tweets. Should we not do it? What are the ethical and legal issues at play?\n",
432 | "\n",
433 | "Why are we scraping tweets as opposed to Facebook posts or campaign speeches?"
434 | ]
435 | },
436 | {
437 | "cell_type": "code",
438 | "execution_count": null,
439 | "metadata": {},
440 | "outputs": [],
441 | "source": []
442 | }
443 | ],
444 | "metadata": {
445 | "kernelspec": {
446 | "display_name": "Python 3",
447 | "language": "python",
448 | "name": "python3"
449 | },
450 | "language_info": {
451 | "codemirror_mode": {
452 | "name": "ipython",
453 | "version": 3
454 | },
455 | "file_extension": ".py",
456 | "mimetype": "text/x-python",
457 | "name": "python",
458 | "nbconvert_exporter": "python",
459 | "pygments_lexer": "ipython3",
460 | "version": "3.6.8"
461 | },
462 | "toc": {
463 | "base_numbering": 1,
464 | "nav_menu": {},
465 | "number_sections": true,
466 | "sideBar": true,
467 | "skip_h1_title": false,
468 | "title_cell": "Table of Contents",
469 | "title_sidebar": "Contents",
470 | "toc_cell": false,
471 | "toc_position": {},
472 | "toc_section_display": true,
473 | "toc_window_display": false
474 | }
475 | },
476 | "nbformat": 4,
477 | "nbformat_minor": 2
478 | }
--------------------------------------------------------------------------------
/car-crashes-weight-regression/notebooks/04 - Combine VINs and weights.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Combining car VIN data with vehicle weights\n",
8 | "\n",
9 | "A simple bit of data wrangling."
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "\n \n \n Read online\n \n \n \n Download notebook\n \n \n \n Interactive version\n \n
"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "### VIN data"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 3,
29 | "metadata": {},
30 | "outputs": [
31 | {
32 | "data": {
33 | "text/html": [
34 | "\n",
35 | "\n",
48 | "
\n",
49 | " \n",
50 | " \n",
51 | " | \n",
52 | " VIN | \n",
53 | " Make | \n",
54 | " Model | \n",
55 | " ModelYear | \n",
56 | "
\n",
57 | " \n",
58 | " \n",
59 | " \n",
60 | " 0 | \n",
61 | " 2FMDA5143TBB45576 | \n",
62 | " FORD | \n",
63 | " WINDSTAR | \n",
64 | " 1996 | \n",
65 | "
\n",
66 | " \n",
67 | " 1 | \n",
68 | " 2G1WC5E37E1120089 | \n",
69 | " CHEVROLET | \n",
70 | " IMPALA | \n",
71 | " 2014 | \n",
72 | "
\n",
73 | " \n",
74 | " 2 | \n",
75 | " 5J6RE4H55AL053951 | \n",
76 | " HONDA | \n",
77 | " CR-V | \n",
78 | " 2010 | \n",
79 | "
\n",
80 | " \n",
81 | " 3 | \n",
82 | " 1N4AA5AP0EC435185 | \n",
83 | " NISSAN | \n",
84 | " MAXIMA | \n",
85 | " 2014 | \n",
86 | "
\n",
87 | " \n",
88 | " 4 | \n",
89 | " JTHCK262075010440 | \n",
90 | " LEXUS | \n",
91 | " IS | \n",
92 | " 2007 | \n",
93 | "
\n",
94 | " \n",
95 | "
\n",
96 | "
"
97 | ],
98 | "text/plain": [
99 | " VIN Make Model ModelYear\n",
100 | "0 2FMDA5143TBB45576 FORD WINDSTAR 1996\n",
101 | "1 2G1WC5E37E1120089 CHEVROLET IMPALA 2014\n",
102 | "2 5J6RE4H55AL053951 HONDA CR-V 2010\n",
103 | "3 1N4AA5AP0EC435185 NISSAN MAXIMA 2014\n",
104 | "4 JTHCK262075010440 LEXUS IS 2007"
105 | ]
106 | },
107 | "execution_count": 3,
108 | "metadata": {},
109 | "output_type": "execute_result"
110 | }
111 | ],
112 | "source": [
113 | "vin_df = pd.read_csv(\"data/vin_data.csv\")\n",
114 | "vin_df.head()"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "### Weight data"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 4,
127 | "metadata": {},
128 | "outputs": [
129 | {
130 | "data": {
131 | "text/html": [
132 | "\n",
133 | "\n",
146 | "
\n",
147 | " \n",
148 | " \n",
149 | " | \n",
150 | " Make | \n",
151 | " Model | \n",
152 | " ModelYear | \n",
153 | " weight | \n",
154 | "
\n",
155 | " \n",
156 | " \n",
157 | " \n",
158 | " 0 | \n",
159 | " ACURA | \n",
160 | " CL | \n",
161 | " 1988 | \n",
162 | " 3009.0 | \n",
163 | "
\n",
164 | " \n",
165 | " 1 | \n",
166 | " ACURA | \n",
167 | " CL | \n",
168 | " 1989 | \n",
169 | " 3009.0 | \n",
170 | "
\n",
171 | " \n",
172 | " 2 | \n",
173 | " ACURA | \n",
174 | " CL | \n",
175 | " 1990 | \n",
176 | " 3009.0 | \n",
177 | "
\n",
178 | " \n",
179 | " 3 | \n",
180 | " ACURA | \n",
181 | " CL | \n",
182 | " 1991 | \n",
183 | " 3009.0 | \n",
184 | "
\n",
185 | " \n",
186 | " 4 | \n",
187 | " ACURA | \n",
188 | " CL | \n",
189 | " 1992 | \n",
190 | " 3009.0 | \n",
191 | "
\n",
192 | " \n",
193 | "
\n",
194 | "
"
195 | ],
196 | "text/plain": [
197 | " Make Model ModelYear weight\n",
198 | "0 ACURA CL 1988 3009.0\n",
199 | "1 ACURA CL 1989 3009.0\n",
200 | "2 ACURA CL 1990 3009.0\n",
201 | "3 ACURA CL 1991 3009.0\n",
202 | "4 ACURA CL 1992 3009.0"
203 | ]
204 | },
205 | "execution_count": 4,
206 | "metadata": {},
207 | "output_type": "execute_result"
208 | }
209 | ],
210 | "source": [
211 | "weights_df = pd.read_csv(\"data/weights.csv\")\n",
212 | "weights_df.head()"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": 5,
218 | "metadata": {},
219 | "outputs": [
220 | {
221 | "data": {
293 | "text/plain": [
294 | " VIN Make Model ModelYear weight\n",
295 | "0 2FMDA5143TBB45576 FORD WINDSTAR 1996 3733.0\n",
296 | "1 2G1WC5E37E1120089 CHEVROLET IMPALA 2014 3618.0\n",
297 | "2 5J6RE4H55AL053951 HONDA CR-V 2010 3389.0\n",
298 | "3 1N4AA5AP0EC435185 NISSAN MAXIMA 2014 3556.0\n",
299 | "4 JTHCK262075010440 LEXUS IS 2007 3527.0"
300 | ]
301 | },
302 | "execution_count": 5,
303 | "metadata": {},
304 | "output_type": "execute_result"
305 | }
306 | ],
307 | "source": [
308 | "vins_weights = vin_df.merge(weights_df, \n",
309 | " left_on=['Make', 'Model', 'ModelYear'], \n",
310 | " right_on=['Make', 'Model', 'ModelYear'], \n",
311 | " how='left')\n",
312 | "vins_weights.head(5)"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": 6,
318 | "metadata": {},
319 | "outputs": [
320 | {
321 | "data": {
322 | "text/plain": [
323 | "(500661, 5)"
324 | ]
325 | },
326 | "execution_count": 6,
327 | "metadata": {},
328 | "output_type": "execute_result"
329 | }
330 | ],
331 | "source": [
332 | "vins_weights = vins_weights.dropna(subset=['weight'])\n",
333 | "vins_weights.shape"
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": 7,
339 | "metadata": {},
340 | "outputs": [],
341 | "source": [
342 | "vins_weights.to_csv(\"data/vins_and_weights.csv\", index=False)"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": null,
348 | "metadata": {},
349 | "outputs": [],
350 | "source": []
351 | }
352 | ],
353 | "metadata": {
354 | "kernelspec": {
355 | "display_name": "Python 3",
356 | "language": "python",
357 | "name": "python3"
358 | },
359 | "language_info": {
360 | "codemirror_mode": {
361 | "name": "ipython",
362 | "version": 3
363 | },
364 | "file_extension": ".py",
365 | "mimetype": "text/x-python",
366 | "name": "python",
367 | "nbconvert_exporter": "python",
368 | "pygments_lexer": "ipython3",
369 | "version": "3.6.8"
370 | },
371 | "toc": {
372 | "base_numbering": 1,
373 | "nav_menu": {},
374 | "number_sections": true,
375 | "sideBar": true,
376 | "skip_h1_title": false,
377 | "title_cell": "Table of Contents",
378 | "title_sidebar": "Contents",
379 | "toc_cell": false,
380 | "toc_position": {},
381 | "toc_section_display": true,
382 | "toc_window_display": false
383 | }
384 | },
385 | "nbformat": 4,
386 | "nbformat_minor": 2
387 | }
--------------------------------------------------------------------------------
/classification/notebooks/Comparing classifiers.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": []
9 |   }
17 | ],
18 | "metadata": {
19 | "kernelspec": {
20 | "display_name": "Python 3",
21 | "language": "python",
22 | "name": "python3"
23 | },
24 | "language_info": {
25 | "codemirror_mode": {
26 | "name": "ipython",
27 | "version": 3
28 | },
29 | "file_extension": ".py",
30 | "mimetype": "text/x-python",
31 | "name": "python",
32 | "nbconvert_exporter": "python",
33 | "pygments_lexer": "ipython3",
34 | "version": "3.6.8"
35 | },
36 | "toc": {
37 | "base_numbering": 1,
38 | "nav_menu": {},
39 | "number_sections": true,
40 | "sideBar": true,
41 | "skip_h1_title": false,
42 | "title_cell": "Table of Contents",
43 | "title_sidebar": "Contents",
44 | "toc_cell": false,
45 | "toc_position": {},
46 | "toc_section_display": true,
47 | "toc_window_display": false
48 | }
49 | },
50 | "nbformat": 4,
51 | "nbformat_minor": 2
52 | }
--------------------------------------------------------------------------------
/investigating-sentiment-analysis/notebooks/Cleaning the Sentiment140 data.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Cleaning the Sentiment140 data\n",
8 | "\n",
9 | "\n",
10 | "The [Sentiment140](http://www.sentiment140.com/) dataset is a collection of 1.6 million tweets that have been tagged as either positive or negative.\n",
11 | "\n",
12 | "Before we clean it, a question: _how'd they get so many tagged tweets?_ If you poke around on their documentation, [the answer is hiding right here](http://help.sentiment140.com/for-students):\n",
13 | "\n",
14 | "> In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative.\n",
15 | "\n",
16 | "That's a good thing to discuss later, but for now let's just clean it up. In this notebook we'll be removing columns we don't want, and standardizing the sentiment column."
17 | ]
18 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "### Prep work: Downloading necessary files\n",
31 | "Before we get started, we need to download all of the data we'll be using.\n",
32 |     "* **training.1600000.processed.noemoticon.csv:** raw data from Sentiment140 - 1.6 million tweets tagged for sentiment, no column headers, nothing cleaned up\n"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "metadata": {},
38 | "source": [
39 | "# Make data directory if it doesn't exist\n",
40 | "!mkdir -p data\n",
41 | "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/investigating-sentiment-analysis/data/training.1600000.processed.noemoticon.csv.zip -P data\n",
42 | "!unzip -n -d data data/training.1600000.processed.noemoticon.csv.zip"
43 | ],
44 | "outputs": [],
45 | "execution_count": null
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "## Read the tweets in"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 2,
57 | "metadata": {},
58 | "outputs": [
59 | {
60 | "data": {
138 | "text/plain": [
139 | " polarity id date query \\\n",
140 | "0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY \n",
141 | "1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY \n",
142 | "2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY \n",
143 | "3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY \n",
144 | "4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY \n",
145 | "\n",
146 | " user text \n",
147 | "0 _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t... \n",
148 | "1 scotthamilton is upset that he can't update his Facebook by ... \n",
149 | "2 mattycus @Kenichan I dived many times for the ball. Man... \n",
150 | "3 ElleCTF my whole body feels itchy and like its on fire \n",
151 | "4 Karoli @nationwideclass no, it's not behaving at all.... "
152 | ]
153 | },
154 | "execution_count": 2,
155 | "metadata": {},
156 | "output_type": "execute_result"
157 | }
158 | ],
159 | "source": [
160 | "import pandas as pd\n",
161 | "\n",
162 | "df = pd.read_csv(\"data/training.1600000.processed.noemoticon.csv\",\n",
163 | " names=['polarity', 'id', 'date', 'query', 'user', 'text'],\n",
164 | " encoding='latin-1')\n",
165 | "df.head()"
166 | ]
167 | },
168 | {
169 | "cell_type": "markdown",
170 | "metadata": {},
171 | "source": [
172 | "## Update polarity\n",
173 | "\n",
174 |     "Right now the `polarity` column is `0` for negative, `4` for positive. Let's change that to `0` and `1` to make things a little more readable."
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "execution_count": 3,
180 | "metadata": {},
181 | "outputs": [
182 | {
183 | "data": {
184 | "text/plain": [
185 | "4 800000\n",
186 | "0 800000\n",
187 | "Name: polarity, dtype: int64"
188 | ]
189 | },
190 | "execution_count": 3,
191 | "metadata": {},
192 | "output_type": "execute_result"
193 | }
194 | ],
195 | "source": [
196 | "df.polarity.value_counts()"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": 4,
202 | "metadata": {},
203 | "outputs": [
204 | {
205 | "data": {
206 | "text/plain": [
207 | "1 800000\n",
208 | "0 800000\n",
209 | "Name: polarity, dtype: int64"
210 | ]
211 | },
212 | "execution_count": 4,
213 | "metadata": {},
214 | "output_type": "execute_result"
215 | }
216 | ],
217 | "source": [
218 | "df.polarity = df.polarity.replace({0: 0, 4: 1})\n",
219 | "df.polarity.value_counts()"
220 | ]
221 | },
222 | {
223 | "cell_type": "markdown",
224 | "metadata": {},
225 | "source": [
226 | "## Remove unneeded columns\n",
227 | "\n",
228 | "We don't need all those columns! Let's get rid of the ones that won't affect the sentiment."
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": 5,
234 | "metadata": {},
235 | "outputs": [
236 | {
237 | "data": {
291 | "text/plain": [
292 | " polarity text\n",
293 | "0 0 @switchfoot http://twitpic.com/2y1zl - Awww, t...\n",
294 | "1 0 is upset that he can't update his Facebook by ...\n",
295 | "2 0 @Kenichan I dived many times for the ball. Man...\n",
296 | "3 0 my whole body feels itchy and like its on fire \n",
297 | "4 0 @nationwideclass no, it's not behaving at all...."
298 | ]
299 | },
300 | "execution_count": 5,
301 | "metadata": {},
302 | "output_type": "execute_result"
303 | }
304 | ],
305 | "source": [
306 | "df = df.drop(columns=['id', 'date', 'query', 'user'])\n",
307 | "df.head()"
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "## Sample\n",
315 | "\n",
316 | "To make the filesize a little smaller and pandas a little happier, let's knock this down to 500,000 tweets."
317 | ]
318 | },
319 | {
320 | "cell_type": "code",
321 | "execution_count": 6,
322 | "metadata": {},
323 | "outputs": [
324 | {
325 | "data": {
326 | "text/plain": [
327 | "0 250275\n",
328 | "1 249725\n",
329 | "Name: polarity, dtype: int64"
330 | ]
331 | },
332 | "execution_count": 6,
333 | "metadata": {},
334 | "output_type": "execute_result"
335 | }
336 | ],
337 | "source": [
338 | "df = df.sample(n=500000)\n",
339 | "df.polarity.value_counts()"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": 7,
345 | "metadata": {},
346 | "outputs": [],
347 | "source": [
348 | "df.to_csv(\"data/sentiment140-subset.csv\", index=False)"
349 | ]
350 | },
351 | {
352 | "cell_type": "markdown",
353 | "metadata": {},
354 | "source": [
355 | "## Review\n",
356 | "\n",
357 | "In this section, we cleaned up the **Sentiment140** tweet dataset. Sentiment140 is a collection of 1.6 million tweets that are marked as either positive or negative sentiment."
358 | ]
359 | },
360 | {
361 | "cell_type": "code",
362 | "execution_count": null,
363 | "metadata": {},
364 | "outputs": [],
365 | "source": []
366 | }
367 | ],
368 | "metadata": {
369 | "kernelspec": {
370 | "display_name": "Python 3",
371 | "language": "python",
372 | "name": "python3"
373 | },
374 | "language_info": {
375 | "codemirror_mode": {
376 | "name": "ipython",
377 | "version": 3
378 | },
379 | "file_extension": ".py",
380 | "mimetype": "text/x-python",
381 | "name": "python",
382 | "nbconvert_exporter": "python",
383 | "pygments_lexer": "ipython3",
384 | "version": "3.6.8"
385 | },
386 | "toc": {
387 | "base_numbering": 1,
388 | "nav_menu": {},
389 | "number_sections": true,
390 | "sideBar": true,
391 | "skip_h1_title": false,
392 | "title_cell": "Table of Contents",
393 | "title_sidebar": "Contents",
394 | "toc_cell": false,
395 | "toc_position": {},
396 | "toc_section_display": true,
397 | "toc_window_display": false
398 | }
399 | },
400 | "nbformat": 4,
401 | "nbformat_minor": 2
402 | }
--------------------------------------------------------------------------------
/milwaukee-potholes/notebooks/Homework - Milwaukee Journal Sentinel and potholes (No merging).ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# The Milwaukee Journal Sentinel and pothole fill times (No merging)\n",
8 | "\n",
9 | "**Story:** [Race gap found in pothole patching](https://web.archive.org/web/20081223094123/http://www.jsonline.com/news/milwaukee/32580034.html)\n",
10 | "\n",
11 | "**Author:** Keegan Kyle, Grant Smith and Ben Poston, Milwaukee Journal Sentinel\n",
12 | "\n",
13 | "**Topics:** Census Data, Geocoding, QGIS Spatial Joins, Linear Regression\n",
14 | "\n",
15 | "**Datasets**\n",
16 | "\n",
17 | "* **potholes-cleaned-merged.csv:** a series of merged datasets (minus the income dataset). The datasets include:\n",
18 | " - **2007-2010 POTHOLES.xls**: Pothole data, July 2007-July 2010 from the Milwaukee [DPW](https://city.milwaukee.gov/dpw)\n",
19 | " - **2010-2013 POTHOLES.xls**: Pothole data, July 2010-July 2013 from the Milwaukee [DPW](https://city.milwaukee.gov/dpw)\n",
20 | " - **2013-2017 POTHOLES.xls**: Pothole data, July 2013-July 2017 from the Milwaukee [DPW](https://city.milwaukee.gov/dpw)\n",
21 | " - **tl_2013_55_tract.zip:** 2013 census tract boundaries from the [US Census Bureau](https://www.census.gov/cgi-bin/geo/shapefiles/index.php)\n",
22 | " - **addresses_geocoded.csv:** a large selection of addresses in Milwaukee, geocoded by [Geocod.io](https://geocod.io)\n",
23 | " - **R12216099_SL140.csv:** ACS 2013 5-year, tract level, from [Social Explorer](https://www.socialexplorer.com)\n",
24 |     "      - Table A04001, Hispanic or Latino by Race (`R12216099.txt` is the data dictionary)\n",
25 | "* **R12216226_SL140.csv** ACS 2013 5-year, tract level, from [Social Explorer](https://www.socialexplorer.com)\n",
26 | " - Table A14006, 2013 Median Household income\n",
27 | " - Data dictionary [is here](https://www.socialexplorer.com/data/ACS2013_5yr/metadata/?ds=SE&table=A14006)\n",
28 | "\n",
29 | "# What's the story?\n",
30 | "\n",
31 | "We're trying to figure out if the **time it took Milwaukee to fill pot holes** is related to the racial makeup of a census tract."
32 | ]
33 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "### Prep work: Downloading necessary files\n",
46 | "Before we get started, we need to download all of the data we'll be using.\n",
47 | "* **potholes-cleaned-merged.csv:** merged and cleaned pothole data - combined and completed\n",
48 | "* **R12216226_SL140.csv:** median income census data - American Community Survey table A14006\n",
49 | "* **R12216099_SL140.csv:** racial makeup census data - American Community Survey table A04001\n"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "metadata": {},
55 | "source": [
56 | "# Make data directory if it doesn't exist\n",
57 | "!mkdir -p data\n",
58 | "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/milwaukee-potholes/data/potholes-cleaned-merged.csv -P data\n",
59 | "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/milwaukee-potholes/data/R12216226_SL140.csv -P data\n",
60 | "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/milwaukee-potholes/data/R12216099_SL140.csv -P data"
61 | ],
62 | "outputs": [],
63 | "execution_count": null
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "# Do your imports\n",
70 | "\n",
71 | "You'll also want to set pandas to display **up to 200 columns at a time**."
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": null,
77 | "metadata": {},
78 | "outputs": [],
79 | "source": []
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
85 | "# Read in your data\n",
86 | "\n",
87 |     "We're just reading in `potholes-cleaned-merged.csv` for now. It's a whole lot of other files, somewhat cleaned and all merged together.\n",
88 | "\n",
89 | "* **Tip:** Both `GEOID` and `Geo_FIPS` are census tract identifiers. You'll want to read them in as strings so they don't lose leading zeroes"
90 | ]
91 | },
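  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*If you want a hint: below is a minimal sketch of one way to read the file, assuming it sits in `data/` as downloaded above (the `potholes` variable name is just a placeholder).*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Keep the tract identifiers as strings so they don't lose leading zeroes\n",
    "potholes = pd.read_csv(\"data/potholes-cleaned-merged.csv\",\n",
    "                       dtype={'GEOID': str, 'Geo_FIPS': str})\n",
    "potholes.head()"
   ]
  },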
92 | {
93 | "cell_type": "code",
94 | "execution_count": null,
95 | "metadata": {},
96 | "outputs": [],
97 | "source": []
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 |     "## What are the maximum and minimum `EnterDt` and `ResolvDt`?\n",
104 | "\n",
105 | "Use this to confirm that your date range is what you expected. If it isn't, take a look at what might have happened with your dataset.\n",
106 | "\n",
107 | "* **Tip:** Missing data might be a headache"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": []
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": []
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "## Calculate how long it took to fill potholes in 2013\n",
129 | "\n",
130 | "Save it into a new column.\n",
131 | "\n",
132 | "* **Tip:** It's possible to subtract two dates"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": null,
138 | "metadata": {},
139 | "outputs": [],
140 | "source": []
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "### Hrm, well, I think we need that difference to be an integer\n",
147 | "\n",
148 | "If your new column isn't an integer, create _another_ column that is.\n",
149 | "\n",
150 | "* **Tip:** Just like you might use `.str.strip()` on a string column, if your column is a datetime you can use `.dt.components` to get the days, hours, minutes, seconds, etc of the column."
151 | ]
152 | },
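  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*A sketch of the general pattern, reusing the placeholder `potholes` dataframe from above; `fill_time` and `fill_days` are placeholder column names.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parse the dates, subtract them, then pull out the number of whole days\n",
    "potholes['EnterDt'] = pd.to_datetime(potholes.EnterDt)\n",
    "potholes['ResolvDt'] = pd.to_datetime(potholes.ResolvDt)\n",
    "potholes['fill_time'] = potholes.ResolvDt - potholes.EnterDt\n",
    "\n",
    "# .dt.days (or .dt.components.days) gives the day count; it may come back\n",
    "# as a float if any dates are missing\n",
    "potholes['fill_days'] = potholes.fill_time.dt.days\n",
    "potholes[['EnterDt', 'ResolvDt', 'fill_days']].head()"
   ]
  },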
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {},
157 | "outputs": [],
158 | "source": []
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "## Cleaning up your census data\n",
165 | "\n",
166 | "The `SE_` columns are all from the census data, you can find out what they mean by reading the data dictionary `R12216099.txt`.\n",
167 | "\n",
168 | "Add new columns to create:\n",
169 | "\n",
170 | "* `pct_white` The percent of the population that is White\n",
171 | "* `pct_black` The percent of the population that is Black\n",
172 | "* `pct_hispanic` The percent of the population that is Hispanic\n",
173 | "* `pct_minority` The percent of the population that is a minority (non-White)\n",
174 | "\n",
175 | "The column names don't match exactly, but you can figure it out."
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": null,
181 | "metadata": {},
182 | "outputs": [],
183 | "source": []
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {},
188 | "source": [
189 |     "Feel free to drop the original census columns if you're not interested in them any more."
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "# Linear regression\n",
197 | "\n",
198 | "Using the `statsmodels` package, run a linear regression to find the coefficient relating percent minority and pothole fill times.\n",
199 | "\n",
200 | "* **Tip:** Be sure to remove missing data with `.dropna()` first. How many rows get removed?\n",
201 | "* **Tip:** Don't forget to use `sm.add_constant`. Why do we use it?"
202 | ]
203 | },
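  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*A minimal sketch of the statsmodels pattern, assuming you've already built a `fill_days` column and a `pct_minority` column (both names are placeholders for whatever you called them).*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import statsmodels.api as sm\n",
    "\n",
    "# Remove rows with missing values, then add the constant (intercept) term\n",
    "subset = potholes.dropna(subset=['fill_days', 'pct_minority'])\n",
    "\n",
    "X = sm.add_constant(subset[['pct_minority']])\n",
    "y = subset.fill_days\n",
    "\n",
    "results = sm.OLS(y, X).fit()\n",
    "results.summary()"
   ]
  },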
204 | {
205 | "cell_type": "code",
206 | "execution_count": null,
207 | "metadata": {},
208 | "outputs": [],
209 | "source": []
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": null,
214 | "metadata": {},
215 | "outputs": [],
216 | "source": []
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": null,
221 | "metadata": {},
222 | "outputs": [],
223 | "source": []
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "metadata": {},
228 | "source": [
229 | "Translate that into the form **\"every X percentage point change in the minority population translates to a Y change in pot hole fill times\"**"
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": null,
235 | "metadata": {},
236 | "outputs": [],
237 | "source": []
238 | },
239 | {
240 | "cell_type": "markdown",
241 | "metadata": {},
242 | "source": [
243 | "Do you feel comfortable that someone can understand that? Can you reword it to make it more easily understandable?"
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": null,
249 | "metadata": {},
250 | "outputs": [],
251 | "source": []
252 | },
253 | {
254 | "cell_type": "markdown",
255 | "metadata": {},
256 | "source": [
257 | "# Other methods of explanation\n",
258 | "\n",
259 |     "While the regression is technically correct, it just doesn't sound very nice. What other options do we have?\n",
260 | "\n",
261 | "## What's the average wait to fill a pothole between majority-white and majority-minority census tracts?\n",
262 | "\n",
263 | "You'll need to create a new column to specify whether the census tract is majority White or not."
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": null,
269 | "metadata": {},
270 | "outputs": [],
271 | "source": []
272 | },
273 | {
274 | "cell_type": "markdown",
275 | "metadata": {},
276 | "source": [
277 | "## How does the average wait time to fill a pothole change as more minorities live in an area?\n",
278 | "\n",
279 | "* **Tip:** Use `.cut` to split the percent minority (or white) into a few different bins."
280 | ]
281 | },
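  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*One way to do the binning, sketched with placeholder bin edges and the placeholder column names from above.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Bin tracts by percent minority, then compare the average fill time per bin\n",
    "bins = [0, 20, 40, 60, 80, 100]\n",
    "potholes['minority_bin'] = pd.cut(potholes.pct_minority, bins=bins)\n",
    "potholes.groupby('minority_bin').fill_days.mean()"
   ]
  },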
282 | {
283 | "cell_type": "code",
284 | "execution_count": null,
285 | "metadata": {},
286 | "outputs": [],
287 | "source": []
288 | },
289 | {
290 | "cell_type": "code",
291 | "execution_count": null,
292 | "metadata": {},
293 | "outputs": [],
294 | "source": []
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": null,
299 | "metadata": {},
300 | "outputs": [],
301 | "source": []
302 | },
303 | {
304 | "cell_type": "markdown",
305 | "metadata": {},
306 | "source": [
307 | "# Analyzing Income\n",
308 | "\n",
309 | "`R12216226_SL140.csv` contains income data for each census tract in Wisconsin. Add it into your analysis.\n",
310 | "\n",
311 | "If you run a multivariate regression also including income, how does this change things?\n",
312 | "\n",
313 | "* **Tip:** Be sure to read in `Geo_FIPS` as a string so leading zeroes don't get removed\n",
314 | "* **Tip:** You can use [this data dictionary](https://www.socialexplorer.com/data/ACS2013_5yr/metadata/?ds=SE&table=A14006) to understand what column you're interested in."
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": null,
320 | "metadata": {},
321 | "outputs": [],
322 | "source": []
323 | },
324 | {
325 | "cell_type": "markdown",
326 | "metadata": {},
327 | "source": [
328 | "### Filter out every column except the one you'll be joining on and the median income"
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": null,
334 | "metadata": {},
335 | "outputs": [],
336 | "source": []
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {},
341 | "source": [
342 | "## Merge with your existing dataset on census tract"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": null,
348 | "metadata": {},
349 | "outputs": [],
350 | "source": []
351 | },
352 | {
353 | "cell_type": "markdown",
354 | "metadata": {},
355 | "source": [
356 | "### Run another regression, this time including both percent minority and income"
357 | ]
358 | },
359 | {
360 | "cell_type": "code",
361 | "execution_count": null,
362 | "metadata": {},
363 | "outputs": [],
364 | "source": []
365 | },
366 | {
367 | "cell_type": "markdown",
368 | "metadata": {},
369 | "source": [
370 | "### The income coefficient is very unfriendly!\n",
371 | "\n",
372 | "Try to explain what it means in normal words. Or... don't, and just skip to the next question."
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": null,
378 | "metadata": {},
379 | "outputs": [],
380 | "source": []
381 | },
382 | {
383 | "cell_type": "markdown",
384 | "metadata": {},
385 | "source": [
386 | "### Create a new column that stands for income in $10,000 increments, and try the regression again"
387 | ]
388 | },
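  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*A sketch, assuming a merged dataframe (here called `merged`) with a median-income column; `SE_A14006_001` is only a guess at the Social Explorer column name, so check the data dictionary for the real one.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rescale income so the coefficient reads as 'per $10,000 of income'\n",
    "merged['income_10k'] = merged['SE_A14006_001'] / 10000\n",
    "\n",
    "subset = merged.dropna(subset=['fill_days', 'pct_minority', 'income_10k'])\n",
    "X = sm.add_constant(subset[['pct_minority', 'income_10k']])\n",
    "\n",
    "results = sm.OLS(subset.fill_days, X).fit()\n",
    "results.summary()"
   ]
  },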
389 | {
390 | "cell_type": "code",
391 | "execution_count": null,
392 | "metadata": {},
393 | "outputs": [],
394 | "source": []
395 | },
396 | {
397 | "cell_type": "code",
398 | "execution_count": null,
399 | "metadata": {},
400 | "outputs": [],
401 | "source": []
402 | },
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "### Explain that in normal human-being words\n",
408 | "\n",
409 | "Controlling for minority population, for an X change in income, there is a Y change in how long it takes to get potholes filled."
410 | ]
411 | },
412 | {
413 | "cell_type": "code",
414 | "execution_count": null,
415 | "metadata": {},
416 | "outputs": [],
417 | "source": []
418 | },
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {},
422 | "source": [
423 | "...does that make sense? \n",
424 | "\n",
425 | "### Bin income levels and graph it"
426 | ]
427 | },
428 | {
429 | "cell_type": "code",
430 | "execution_count": null,
431 | "metadata": {},
432 | "outputs": [],
433 | "source": []
434 | },
435 | {
436 | "cell_type": "markdown",
437 | "metadata": {},
438 | "source": [
439 | "## This seems unexpected, maybe?\n",
440 | "\n",
441 |     "Not that we were _hoping_ to find a racial disparity, but this isn't the result we expected. That means there's either a story or we're forgetting something obvious. What might be causing this trend? How could we investigate it?"
442 | ]
443 | },
444 | {
445 | "cell_type": "code",
446 | "execution_count": null,
447 | "metadata": {},
448 | "outputs": [],
449 | "source": []
450 | }
451 | ],
452 | "metadata": {
453 | "kernelspec": {
454 | "display_name": "Python 3",
455 | "language": "python",
456 | "name": "python3"
457 | },
458 | "language_info": {
459 | "codemirror_mode": {
460 | "name": "ipython",
461 | "version": 3
462 | },
463 | "file_extension": ".py",
464 | "mimetype": "text/x-python",
465 | "name": "python",
466 | "nbconvert_exporter": "python",
467 | "pygments_lexer": "ipython3",
468 | "version": "3.6.8"
469 | },
470 | "toc": {
471 | "base_numbering": 1,
472 | "nav_menu": {},
473 | "number_sections": true,
474 | "sideBar": true,
475 | "skip_h1_title": false,
476 | "title_cell": "Table of Contents",
477 | "title_sidebar": "Contents",
478 | "toc_cell": false,
479 | "toc_position": {},
480 | "toc_section_display": true,
481 | "toc_window_display": false
482 | }
483 | },
484 | "nbformat": 4,
485 | "nbformat_minor": 2
486 | }
--------------------------------------------------------------------------------
/milwaukee-potholes/notebooks/Milwaukee Journal Sentinel and potholes without merging.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# The Milwaukee Journal Sentinel and pothole fill times (No merging)\n",
8 | "\n",
9 | "**Story:** [Race gap found in pothole patching](https://web.archive.org/web/20081223094123/http://www.jsonline.com/news/milwaukee/32580034.html)\n",
10 | "\n",
11 | "**Author:** Keegan Kyle, Grant Smith and Ben Poston, Milwaukee Journal Sentinel\n",
12 | "\n",
13 | "**Topics:** Census Data, Geocoding, QGIS Spatial Joins, Linear Regression\n",
14 | "\n",
15 | "**Datasets**\n",
16 | "\n",
17 | "* **potholes-cleaned-merged.csv:** a series of merged datasets (minus the income dataset). The datasets include:\n",
18 | " - **2007-2010 POTHOLES.xls**: Pothole data, July 2007-July 2010 from the Milwaukee [DPW](https://city.milwaukee.gov/dpw)\n",
19 | " - **2010-2013 POTHOLES.xls**: Pothole data, July 2010-July 2013 from the Milwaukee [DPW](https://city.milwaukee.gov/dpw)\n",
20 | " - **2013-2017 POTHOLES.xls**: Pothole data, July 2013-July 2017 from the Milwaukee [DPW](https://city.milwaukee.gov/dpw)\n",
21 | " - **tl_2013_55_tract.zip:** 2013 census tract boundaries from the [US Census Bureau](https://www.census.gov/cgi-bin/geo/shapefiles/index.php)\n",
22 | " - **addresses_geocoded.csv:** a large selection of addresses in Milwaukee, geocoded by [Geocod.io](https://geocod.io)\n",
23 | " - **R12216099_SL140.csv:** ACS 2013 5-year, tract level, from [Social Explorer](https://www.socialexplorer.com)\n",
24 |     "      - Table A04001, Hispanic or Latino by Race (`R12216099.txt` is the data dictionary)\n",
25 | "* **R12216226_SL140.csv** ACS 2013 5-year, tract level, from [Social Explorer](https://www.socialexplorer.com)\n",
26 | " - Table A14006, 2013 Median Household income\n",
27 | " - Data dictionary [is here](https://www.socialexplorer.com/data/ACS2013_5yr/metadata/?ds=SE&table=A14006)\n",
28 | "\n",
29 | "# What's the story?\n",
30 | "\n",
31 | "We're trying to figure out if the **time it took Milwaukee to fill pot holes** is related to the racial makeup of a census tract."
32 | ]
33 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "### Prep work: Downloading necessary files\n",
46 | "Before we get started, we need to download all of the data we'll be using.\n",
47 | "* **potholes-cleaned-merged.csv:** merged and cleaned pothole data - combined and completed\n",
48 | "* **R12216226_SL140.csv:** median income census data - American Community Survey table A14006\n",
49 | "* **R12216099_SL140.csv:** racial makeup census data - American Community Survey table A04001\n"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "metadata": {},
55 | "source": [
56 | "# Make data directory if it doesn't exist\n",
57 | "!mkdir -p data\n",
58 | "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/milwaukee-potholes/data/potholes-cleaned-merged.csv -P data\n",
59 | "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/milwaukee-potholes/data/R12216226_SL140.csv -P data\n",
60 | "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/milwaukee-potholes/data/R12216099_SL140.csv -P data"
61 | ],
62 | "outputs": [],
63 | "execution_count": null
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "# Do your imports\n",
70 | "\n",
71 | "You'll also want to set pandas to display **up to 200 columns at a time**."
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": null,
77 | "metadata": {},
78 | "outputs": [],
79 | "source": []
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
85 | "# Read in your data\n",
86 | "\n",
87 |     "We're just reading in `potholes-cleaned-merged.csv` for now. It's a whole lot of other files, somewhat cleaned and all merged together.\n",
88 | "\n",
89 | "* **Tip:** Both `GEOID` and `Geo_FIPS` are census tract identifiers. You'll want to read them in as strings so they don't lose leading zeroes"
90 | ]
91 | },
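  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*If you want a hint: below is a minimal sketch of one way to read the file, assuming it sits in `data/` as downloaded above (the `potholes` variable name is just a placeholder).*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Keep the tract identifiers as strings so they don't lose leading zeroes\n",
    "potholes = pd.read_csv(\"data/potholes-cleaned-merged.csv\",\n",
    "                       dtype={'GEOID': str, 'Geo_FIPS': str})\n",
    "potholes.head()"
   ]
  },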
92 | {
93 | "cell_type": "code",
94 | "execution_count": null,
95 | "metadata": {},
96 | "outputs": [],
97 | "source": []
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 |     "## What are the maximum and minimum `EnterDt` and `ResolvDt`?\n",
104 | "\n",
105 | "Use this to confirm that your date range is what you expected. If it isn't, take a look at what might have happened with your dataset.\n",
106 | "\n",
107 | "* **Tip:** Missing data might be a headache"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": []
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": []
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "## Calculate how long it took to fill potholes in 2013\n",
129 | "\n",
130 | "Save it into a new column.\n",
131 | "\n",
132 | "* **Tip:** It's possible to subtract two dates"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": null,
138 | "metadata": {},
139 | "outputs": [],
140 | "source": []
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "### Hrm, well, I think we need that difference to be an integer\n",
147 | "\n",
148 | "If your new column isn't an integer, create _another_ column that is.\n",
149 | "\n",
150 | "* **Tip:** Just like you might use `.str.strip()` on a string column, if your column is a datetime you can use `.dt.components` to get the days, hours, minutes, seconds, etc of the column."
151 | ]
152 | },
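  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*A sketch of the general pattern, reusing the placeholder `potholes` dataframe from above; `fill_time` and `fill_days` are placeholder column names.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parse the dates, subtract them, then pull out the number of whole days\n",
    "potholes['EnterDt'] = pd.to_datetime(potholes.EnterDt)\n",
    "potholes['ResolvDt'] = pd.to_datetime(potholes.ResolvDt)\n",
    "potholes['fill_time'] = potholes.ResolvDt - potholes.EnterDt\n",
    "\n",
    "# .dt.days (or .dt.components.days) gives the day count; it may come back\n",
    "# as a float if any dates are missing\n",
    "potholes['fill_days'] = potholes.fill_time.dt.days\n",
    "potholes[['EnterDt', 'ResolvDt', 'fill_days']].head()"
   ]
  },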
153 | {
154 | "cell_type": "code",
155 | "execution_count": null,
156 | "metadata": {},
157 | "outputs": [],
158 | "source": []
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "## Cleaning up your census data\n",
165 | "\n",
166 | "The `SE_` columns are all from the census data, you can find out what they mean by reading the data dictionary `R12216099.txt`.\n",
167 | "\n",
168 | "Add new columns to create:\n",
169 | "\n",
170 | "* `pct_white` The percent of the population that is White\n",
171 | "* `pct_black` The percent of the population that is Black\n",
172 | "* `pct_hispanic` The percent of the population that is Hispanic\n",
173 | "* `pct_minority` The percent of the population that is a minority (non-White)\n",
174 | "\n",
175 | "The column names don't match exactly, but you can figure it out."
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": null,
181 | "metadata": {},
182 | "outputs": [],
183 | "source": []
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {},
188 | "source": [
189 |     "Feel free to drop the original census columns if you're not interested in them any more."
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "# Linear regression\n",
197 | "\n",
198 | "Using the `statsmodels` package, run a linear regression to find the coefficient relating percent minority and pothole fill times.\n",
199 | "\n",
200 | "* **Tip:** Be sure to remove missing data with `.dropna()` first. How many rows get removed?\n",
201 | "* **Tip:** Don't forget to use `sm.add_constant`. Why do we use it?"
202 | ]
203 | },
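  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*A minimal sketch of the statsmodels pattern, assuming you've already built a `fill_days` column and a `pct_minority` column (both names are placeholders for whatever you called them).*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import statsmodels.api as sm\n",
    "\n",
    "# Remove rows with missing values, then add the constant (intercept) term\n",
    "subset = potholes.dropna(subset=['fill_days', 'pct_minority'])\n",
    "\n",
    "X = sm.add_constant(subset[['pct_minority']])\n",
    "y = subset.fill_days\n",
    "\n",
    "results = sm.OLS(y, X).fit()\n",
    "results.summary()"
   ]
  },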
204 | {
205 | "cell_type": "code",
206 | "execution_count": null,
207 | "metadata": {},
208 | "outputs": [],
209 | "source": []
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": null,
214 | "metadata": {},
215 | "outputs": [],
216 | "source": []
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": null,
221 | "metadata": {},
222 | "outputs": [],
223 | "source": []
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "metadata": {},
228 | "source": [
229 | "Translate that into the form **\"every X percentage point change in the minority population translates to a Y change in pot hole fill times\"**"
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": null,
235 | "metadata": {},
236 | "outputs": [],
237 | "source": []
238 | },
239 | {
240 | "cell_type": "markdown",
241 | "metadata": {},
242 | "source": [
243 | "Do you feel comfortable that someone can understand that? Can you reword it to make it more easily understandable?"
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": null,
249 | "metadata": {},
250 | "outputs": [],
251 | "source": []
252 | },
253 | {
254 | "cell_type": "markdown",
255 | "metadata": {},
256 | "source": [
257 | "# Other methods of explanation\n",
258 | "\n",
259 |     "While the regression is technically correct, it just doesn't sound very nice. What other options do we have?\n",
260 | "\n",
261 | "## What's the average wait to fill a pothole between majority-white and majority-minority census tracts?\n",
262 | "\n",
263 | "You'll need to create a new column to specify whether the census tract is majority White or not."
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": null,
269 | "metadata": {},
270 | "outputs": [],
271 | "source": []
272 | },
273 | {
274 | "cell_type": "markdown",
275 | "metadata": {},
276 | "source": [
277 | "## How does the average wait time to fill a pothole change as more minorities live in an area?\n",
278 | "\n",
279 | "* **Tip:** Use `.cut` to split the percent minority (or white) into a few different bins."
280 | ]
281 | },
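  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*One way to do the binning, sketched with placeholder bin edges and the placeholder column names from above.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Bin tracts by percent minority, then compare the average fill time per bin\n",
    "bins = [0, 20, 40, 60, 80, 100]\n",
    "potholes['minority_bin'] = pd.cut(potholes.pct_minority, bins=bins)\n",
    "potholes.groupby('minority_bin').fill_days.mean()"
   ]
  },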
282 | {
283 | "cell_type": "code",
284 | "execution_count": null,
285 | "metadata": {},
286 | "outputs": [],
287 | "source": []
288 | },
289 | {
290 | "cell_type": "code",
291 | "execution_count": null,
292 | "metadata": {},
293 | "outputs": [],
294 | "source": []
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": null,
299 | "metadata": {},
300 | "outputs": [],
301 | "source": []
302 | },
303 | {
304 | "cell_type": "markdown",
305 | "metadata": {},
306 | "source": [
307 | "# Analyzing Income\n",
308 | "\n",
309 | "`R12216226_SL140.csv` contains income data for each census tract in Wisconsin. Add it into your analysis.\n",
310 | "\n",
311 | "If you run a multivariate regression also including income, how does this change things?\n",
312 | "\n",
313 | "* **Tip:** Be sure to read in `Geo_FIPS` as a string so leading zeroes don't get removed\n",
314 | "* **Tip:** You can use [this data dictionary](https://www.socialexplorer.com/data/ACS2013_5yr/metadata/?ds=SE&table=A14006) to understand what column you're interested in."
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": null,
320 | "metadata": {},
321 | "outputs": [],
322 | "source": []
323 | },
324 | {
325 | "cell_type": "markdown",
326 | "metadata": {},
327 | "source": [
328 | "### Filter out every column except the one you'll be joining on and the median income"
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": null,
334 | "metadata": {},
335 | "outputs": [],
336 | "source": []
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {},
341 | "source": [
342 | "## Merge with your existing dataset on census tract"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": null,
348 | "metadata": {},
349 | "outputs": [],
350 | "source": []
351 | },
352 | {
353 | "cell_type": "markdown",
354 | "metadata": {},
355 | "source": [
356 | "### Run another regression, this time including both percent minority and income"
357 | ]
358 | },
359 | {
360 | "cell_type": "code",
361 | "execution_count": null,
362 | "metadata": {},
363 | "outputs": [],
364 | "source": []
365 | },
366 | {
367 | "cell_type": "markdown",
368 | "metadata": {},
369 | "source": [
370 | "### The income coefficient is very unfriendly!\n",
371 | "\n",
372 | "Try to explain what it means in normal words. Or... don't, and just skip to the next question."
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": null,
378 | "metadata": {},
379 | "outputs": [],
380 | "source": []
381 | },
382 | {
383 | "cell_type": "markdown",
384 | "metadata": {},
385 | "source": [
386 | "### Create a new column that stands for income in $10,000 increments, and try the regression again"
387 | ]
388 | },
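  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*A sketch, assuming a merged dataframe (here called `merged`) with a median-income column; `SE_A14006_001` is only a guess at the Social Explorer column name, so check the data dictionary for the real one.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rescale income so the coefficient reads as 'per $10,000 of income'\n",
    "merged['income_10k'] = merged['SE_A14006_001'] / 10000\n",
    "\n",
    "subset = merged.dropna(subset=['fill_days', 'pct_minority', 'income_10k'])\n",
    "X = sm.add_constant(subset[['pct_minority', 'income_10k']])\n",
    "\n",
    "results = sm.OLS(subset.fill_days, X).fit()\n",
    "results.summary()"
   ]
  },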
389 | {
390 | "cell_type": "code",
391 | "execution_count": null,
392 | "metadata": {},
393 | "outputs": [],
394 | "source": []
395 | },
396 | {
397 | "cell_type": "code",
398 | "execution_count": null,
399 | "metadata": {},
400 | "outputs": [],
401 | "source": []
402 | },
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "### Explain that in normal human-being words\n",
408 | "\n",
409 | "Controlling for minority population, for an X change in income, there is a Y change in how long it takes to get potholes filled."
410 | ]
411 | },
412 | {
413 | "cell_type": "code",
414 | "execution_count": null,
415 | "metadata": {},
416 | "outputs": [],
417 | "source": []
418 | },
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {},
422 | "source": [
423 | "...does that make sense? \n",
424 | "\n",
425 | "### Bin income levels and graph it"
426 | ]
427 | },
428 | {
429 | "cell_type": "code",
430 | "execution_count": null,
431 | "metadata": {},
432 | "outputs": [],
433 | "source": []
434 | },
435 | {
436 | "cell_type": "markdown",
437 | "metadata": {},
438 | "source": [
439 | "## This seems unexpected, maybe?\n",
440 | "\n",
441 |     "Not that we were _hoping_ to find a racial disparity, but this isn't the result we expected. That means there's either a story or we're forgetting something obvious. What might be causing this trend? How could we investigate it?"
442 | ]
443 | },
444 | {
445 | "cell_type": "code",
446 | "execution_count": null,
447 | "metadata": {},
448 | "outputs": [],
449 | "source": []
450 | }
451 | ],
452 | "metadata": {
453 | "kernelspec": {
454 | "display_name": "Python 3",
455 | "language": "python",
456 | "name": "python3"
457 | },
458 | "language_info": {
459 | "codemirror_mode": {
460 | "name": "ipython",
461 | "version": 3
462 | },
463 | "file_extension": ".py",
464 | "mimetype": "text/x-python",
465 | "name": "python",
466 | "nbconvert_exporter": "python",
467 | "pygments_lexer": "ipython3",
468 | "version": "3.6.8"
469 | },
470 | "toc": {
471 | "base_numbering": 1,
472 | "nav_menu": {},
473 | "number_sections": true,
474 | "sideBar": true,
475 | "skip_h1_title": false,
476 | "title_cell": "Table of Contents",
477 | "title_sidebar": "Contents",
478 | "toc_cell": false,
479 | "toc_position": {},
480 | "toc_section_display": true,
481 | "toc_window_display": false
482 | }
483 | },
484 | "nbformat": 4,
485 | "nbformat_minor": 2
486 | }
--------------------------------------------------------------------------------
/regression/notebooks/Linear Regression Quickstart.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Linear Regression Quickstart\n",
8 | "\n",
9 | "Already know what's what with linear regression, just need to know how to tackle it in Python? We're here for you! If not, continue on to the next section.\n",
10 | "\n",
11 |     "We're going to **ignore the nuance of what we're doing** in this notebook; it's really just for people who need to see the process."
12 | ]
13 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "## Pandas for our data\n",
26 | "\n",
27 | "As is typical, we'll be using [pandas dataframes](https://pandas.pydata.org/) for the data."
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 27,
33 | "metadata": {},
34 | "outputs": [
35 | {
36 | "data": {
80 | "text/plain": [
81 | " sold revenue\n",
82 | "0 0 0\n",
83 | "1 4 8\n",
84 | "2 16 32"
85 | ]
86 | },
87 | "execution_count": 27,
88 | "metadata": {},
89 | "output_type": "execute_result"
90 | }
91 | ],
92 | "source": [
93 | "import pandas as pd\n",
94 | "\n",
95 | "df = pd.DataFrame([\n",
96 | " { 'sold': 0, 'revenue': 0 },\n",
97 | " { 'sold': 4, 'revenue': 8 },\n",
98 | " { 'sold': 16, 'revenue': 32 },\n",
99 | "])\n",
100 | "df"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "## Performing a regression\n",
108 | "\n",
109 | "The [statsmodels](https://www.statsmodels.org) package is your best friend when it comes to regression. In theory you can do it using other techniques or libraries, but statsmodels is just *so simple*.\n",
110 | "\n",
111 | "For the regression below, I'm using the formula method of describing the regression. If that makes you grumpy, check the [regression reference page](/reference/regression/) for more details."
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "execution_count": 28,
117 | "metadata": {},
118 | "outputs": [
119 | {
120 | "data": {
178 | "text/plain": [
179 | "\n",
180 | "\"\"\"\n",
181 | " OLS Regression Results \n",
182 | "==============================================================================\n",
183 | "Dep. Variable: revenue R-squared: 1.000\n",
184 | "Model: OLS Adj. R-squared: 1.000\n",
185 | "Method: Least Squares F-statistic: 9.502e+30\n",
186 | "Date: Sun, 08 Dec 2019 Prob (F-statistic): 2.07e-16\n",
187 | "Time: 10:14:18 Log-Likelihood: 94.907\n",
188 | "No. Observations: 3 AIC: -185.8\n",
189 | "Df Residuals: 1 BIC: -187.6\n",
190 | "Df Model: 1 \n",
191 | "Covariance Type: nonrobust \n",
192 | "==============================================================================\n",
193 | " coef std err t P>|t| [0.025 0.975]\n",
194 | "------------------------------------------------------------------------------\n",
195 | "Intercept -2.665e-15 6.18e-15 -0.431 0.741 -8.12e-14 7.58e-14\n",
196 | "sold 2.0000 6.49e-16 3.08e+15 0.000 2.000 2.000\n",
197 | "==============================================================================\n",
198 | "Omnibus: nan Durbin-Watson: 1.149\n",
199 | "Prob(Omnibus): nan Jarque-Bera (JB): 0.471\n",
200 | "Skew: -0.616 Prob(JB): 0.790\n",
201 | "Kurtosis: 1.500 Cond. No. 13.4\n",
202 | "==============================================================================\n",
203 | "\n",
204 | "Warnings:\n",
205 | "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
206 | "\"\"\""
207 | ]
208 | },
209 | "execution_count": 28,
210 | "metadata": {},
211 | "output_type": "execute_result"
212 | }
213 | ],
214 | "source": [
215 | "import statsmodels.formula.api as smf\n",
216 | "\n",
217 | "model = smf.ols(\"revenue ~ sold\", data=df)\n",
218 | "results = model.fit()\n",
219 | "results.summary()"
220 | ]
221 | },
222 | {
223 | "cell_type": "markdown",
224 | "metadata": {},
225 | "source": [
226 | "For each unit sold, we get 2 revenue. That's about it."
227 | ]
228 | },
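  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you'd rather grab that number in code than read it off the summary table, the fitted coefficients live on `results.params` (a quick hedged aside, using the `results` from the cell above):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# results.params is a pandas Series of coefficients, indexed by term name;\n",
    "# pull out the slope on `sold` -- about 2 revenue per unit sold\n",
    "results.params['sold']"
   ]
  },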
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {},
232 | "source": [
233 | "## Multivariable regression\n",
234 | "\n",
235 | "Multivariable regression is easy-peasy. Let's add a couple more columns to our dataset, adding tips to the equation."
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 29,
241 | "metadata": {},
242 | "outputs": [
243 | {
244 | "data": {
296 | "text/plain": [
297 | " sold revenue tips charge_amount\n",
298 | "0 0 0 0 0\n",
299 | "1 4 8 1 9\n",
300 | "2 16 32 2 34"
301 | ]
302 | },
303 | "execution_count": 29,
304 | "metadata": {},
305 | "output_type": "execute_result"
306 | }
307 | ],
308 | "source": [
309 | "import pandas as pd\n",
310 | "\n",
311 | "df = pd.DataFrame([\n",
312 | " { 'sold': 0, 'revenue': 0, 'tips': 0, 'charge_amount': 0 },\n",
313 | " { 'sold': 4, 'revenue': 8, 'tips': 1, 'charge_amount': 9 },\n",
314 | " { 'sold': 16, 'revenue': 32, 'tips': 2, 'charge_amount': 34 },\n",
315 | "])\n",
316 | "df"
317 | ]
318 | },
319 | {
320 | "cell_type": "code",
321 | "execution_count": 30,
322 | "metadata": {},
323 | "outputs": [
324 | {
325 | "data": {
386 | "text/plain": [
387 | "\n",
388 | "\"\"\"\n",
389 | " OLS Regression Results \n",
390 | "==============================================================================\n",
391 | "Dep. Variable: charge_amount R-squared: 1.000\n",
392 | "Model: OLS Adj. R-squared: nan\n",
393 | "Method: Least Squares F-statistic: 0.000\n",
394 | "Date: Sun, 08 Dec 2019 Prob (F-statistic): nan\n",
395 | "Time: 10:14:20 Log-Likelihood: 89.745\n",
396 | "No. Observations: 3 AIC: -173.5\n",
397 | "Df Residuals: 0 BIC: -176.2\n",
398 | "Df Model: 2 \n",
399 | "Covariance Type: nonrobust \n",
400 | "==============================================================================\n",
401 | " coef std err t P>|t| [0.025 0.975]\n",
402 | "------------------------------------------------------------------------------\n",
403 | "Intercept -1.685e-15 inf -0 nan nan nan\n",
404 | "sold 2.0000 inf 0 nan nan nan\n",
405 | "tips 1.0000 inf 0 nan nan nan\n",
406 | "==============================================================================\n",
407 | "Omnibus: nan Durbin-Watson: 0.922\n",
408 | "Prob(Omnibus): nan Jarque-Bera (JB): 0.520\n",
409 | "Skew: -0.691 Prob(JB): 0.771\n",
410 | "Kurtosis: 1.500 Cond. No. 44.0\n",
411 | "==============================================================================\n",
412 | "\n",
413 | "Warnings:\n",
414 | "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
415 | "\"\"\""
416 | ]
417 | },
418 | "execution_count": 30,
419 | "metadata": {},
420 | "output_type": "execute_result"
421 | }
422 | ],
423 | "source": [
424 | "import statsmodels.formula.api as smf\n",
425 | "\n",
426 | "model = smf.ols(\"charge_amount ~ sold + tips\", data=df)\n",
427 | "results = model.fit()\n",
428 | "results.summary()"
429 | ]
430 | },
431 | {
432 | "cell_type": "markdown",
433 | "metadata": {},
434 | "source": [
435 | "There you go!\n",
436 | "\n",
437 | "If you'd like more details, you can continue on in this section. If you'd just like the how-to-do-an-exact-thing explanations, check out the [regression reference page](/reference/regression/)."
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": null,
443 | "metadata": {},
444 | "outputs": [],
445 | "source": []
446 | }
447 | ],
448 | "metadata": {
449 | "kernelspec": {
450 | "display_name": "Python 3",
451 | "language": "python",
452 | "name": "python3"
453 | },
454 | "language_info": {
455 | "codemirror_mode": {
456 | "name": "ipython",
457 | "version": 3
458 | },
459 | "file_extension": ".py",
460 | "mimetype": "text/x-python",
461 | "name": "python",
462 | "nbconvert_exporter": "python",
463 | "pygments_lexer": "ipython3",
464 | "version": "3.6.8"
465 | },
466 | "toc": {
467 | "base_numbering": 1,
468 | "nav_menu": {},
469 | "number_sections": true,
470 | "sideBar": true,
471 | "skip_h1_title": false,
472 | "title_cell": "Table of Contents",
473 | "title_sidebar": "Contents",
474 | "toc_cell": false,
475 | "toc_position": {},
476 | "toc_section_display": true,
477 | "toc_window_display": false
478 | }
479 | },
480 | "nbformat": 4,
481 | "nbformat_minor": 2
482 | }
--------------------------------------------------------------------------------
/regression/notebooks/What is regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Regression: What's the point?\n",
8 | "\n",
9 | "Let's take a look at how we can use regression to find **relationships within our dataset**."
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "\n \n \n Read online\n \n \n \n Download notebook\n \n \n \n Interactive version\n \n
"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "## What is regression?\n",
24 | "\n",
25 | "Regression is a way to describe how **two (or more) things are related to each other**. You might notice it in sentences like:\n",
26 | "\n",
27 | "* \"An increase of 10 percentage points in the unemployment rate in a neighborhood translated to a loss of roughly a year and a half of life expectancy,\" from the [Associated Press](https://apnews.com/66ac44186b6249709501f07a7eab36da). As unemployment goes up, life expectancy goes down.\n",
28 | "* \"Reveal\u2019s analysis also showed that the greater the number of African Americans or Latinos in a neighborhood, the more likely a loan application would be denied there \u2013 even after accounting for income and other factors,\" from [Reveal](https://www.revealnews.org/article/for-people-of-color-banks-are-shutting-the-door-to-homeownership/). As the amount of African Americans or Latinos goes up, the likelihood of a loan being denied goes up.\n",
29 | "* \"In Boston, Asian and Latino residents were more likely to be ticketed than were out-of-towners of the same race, when cited for the same offense,\" from the [Boston Globe](http://archive.boston.com/globe/metro/packages/tickets/072003.shtml)\n",
30 | "\n",
31 | "Notice that we've defined it as how a change in one variable is _related to_ a change in another, not that a change in one _causes_ a change in another. Regression can tell you the relationship, but not the \"why.\"\n",
32 | "\n",
33 | "## Types of regression\n",
34 | "\n",
35 | "While there are [many kinds of regression out there](https://www.listendata.com/2018/03/regression-analysis.html), the two major ones journalists care about are **linear regression** and **logistic regression**.\n",
36 | "\n",
37 | "**Linear regression** is used to predict **numbers**. Life expectancy is a number, so the Associate Press story above uses linear regression. We'll also see linear regression often used with standardized test scores in education.\n",
38 | "\n",
39 | "**Logistic regression** is used to predict **categories** such as yes/no or accepted/rejected. A loan being denied is a yes/no, so the Reveal story above uses logistic regression. Logistic regression is also very common when looking at bias or discrimination.\n",
40 | "\n",
41 | "## When to use regression\n",
42 | "\n",
43 | "Before we get used to it, we might not realize there are times that regression would be useful. The biggest clue that we'll want to use regression is when we're looking at the \"relationship between\" or \"correlation between\" two (or more) different things.\n",
44 | "\n",
45 | "> Note that correlation is a actually [real stats thing](http://guessthecorrelation.com/) that's separate from regression. But it seems like when most people talk about correlation they're looking for a \"when X goes up, Y changes such-and-such amount\" kind of description, which comes from performing a regression.\n",
46 | "\n",
47 | "We can also recognize regression from the phrases \"all other factors being equal\" or \"controlling for differences in.\" Notice how in the Reveal example above the African American/Latino population matters \"even after accounting for income and other factors.\"\n",
48 | "\n",
49 | "We'll use both linear and logistic regression for **two major things**:\n",
50 | "\n",
51 | "* **Understanding the impact of different factors:** In ProPublica's [criminal sentencing bias piece](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing), they showed that race played an outsize role in determining sentencing suggestions, controlling for other possibly-explanatory factors. In the Associated Press piece, they show the relationship between unemployment and life expectancy.\n",
52 | "* **Finding unexpected outliers:** In [this piece](http://clipfile.org/?p=754) from the Dallas Morning News, they used regression analysis to predict a school's 4th grade standardized test scores based on their 3rd grade scores. Schools that did much much much better than expected were suspected of cheating.\n",
53 | "\n",
54 | "While regression will show up in other situations - automatically classifying documents, for example! - we'll cover that separately.\n",
55 | "\n",
56 | "## How to stay careful\n",
57 | "\n",
58 | "If you're doing anything involving statistics or large datasets, **you need to check your results with an expert.** While running a regression can be pretty straightforward, there's always the possibility of problems hiding in the details. For example, we might have forgotten some useful variable, or maybe a few variables are fighting with each other and ruining our results (which sounds more fun than \"multicollinearity\").\n",
59 | "\n",
60 | "Almost every single person I interviewed who performed a regression analysis for a story leaned heavily on other members of their team, as well as at least one outside expert. Some newsrooms even pitted multiple academics against each other over the results, having statisticians and subject-matter experts battle it out over the \"right\" way to do it!\n",
61 | "\n",
62 | "Even if we hire sixty statisticians and get back a hundred different contradictory answers, even if we can't be 100% certain it's all 100% perfect, even if we're eventually convinced statistics is more art than science, at least we'll have _an idea of possible issues_ with our approach and how they might affect your story.\n",
63 | "\n",
64 | "## Review\n",
65 | "\n",
66 | "In this section we introduced **regression**, which is the relationship between different one or more input variables and an output variable. For example, unemployment and education (inputs) on life expectancy (output). There are two kinds of regression, **linear regression** which is used to predict numbers, and **logistic regression** which is used to predict categories (typically yes/no answers).\n",
67 | "\n",
68 | "You can use the result of a regression to make predictions (how well should this school have scored in math, given it scored XXX in reading?), or simply explain how two things are related (\"all other factors being equal...\").\n",
69 | "\n",
70 | "While performing a regression can be pretty easy, you'll always want to double-check with someone who has more statistics and/or domain-specific knowledge than you. Since people really trust numbers you'll want to make sure you're doing everything right!"
71 | ]
72 | },
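{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "## A tiny code sketch of both kinds\n",
  "\n",
  "To make the difference concrete, here's a minimal sketch on a handful of made-up rows (the numbers below are invented for illustration, not taken from the stories above). Linear regression uses `smf.ols` because the output is a number, while logistic regression uses `smf.logit` because the output is a 0/1 category."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "import pandas as pd\n",
  "import statsmodels.formula.api as smf\n",
  "\n",
  "# Made-up toy data, just to show the two function calls\n",
  "df = pd.DataFrame([\n",
  "    { 'unemployment': 2, 'life_expectancy': 80, 'loan_denied': 0 },\n",
  "    { 'unemployment': 5, 'life_expectancy': 78, 'loan_denied': 0 },\n",
  "    { 'unemployment': 6, 'life_expectancy': 77, 'loan_denied': 1 },\n",
  "    { 'unemployment': 9, 'life_expectancy': 76, 'loan_denied': 0 },\n",
  "    { 'unemployment': 11, 'life_expectancy': 74, 'loan_denied': 1 },\n",
  "    { 'unemployment': 14, 'life_expectancy': 72, 'loan_denied': 1 }\n",
  "])\n",
  "\n",
  "# Linear regression: the output (life expectancy) is a number\n",
  "linear = smf.ols('life_expectancy ~ unemployment', data=df).fit()\n",
  "print(linear.params)\n",
  "\n",
  "# Logistic regression: the output (denied or not) is a 0/1 category\n",
  "logistic = smf.logit('loan_denied ~ unemployment', data=df).fit()\n",
  "print(logistic.params)"
 ]
},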
73 | {
74 | "cell_type": "code",
75 | "execution_count": null,
76 | "metadata": {},
77 | "outputs": [],
78 | "source": []
79 | }
80 | ],
81 | "metadata": {
82 | "kernelspec": {
83 | "display_name": "Python 3",
84 | "language": "python",
85 | "name": "python3"
86 | },
87 | "language_info": {
88 | "codemirror_mode": {
89 | "name": "ipython",
90 | "version": 3
91 | },
92 | "file_extension": ".py",
93 | "mimetype": "text/x-python",
94 | "name": "python",
95 | "nbconvert_exporter": "python",
96 | "pygments_lexer": "ipython3",
97 | "version": "3.6.8"
98 | },
99 | "toc": {
100 | "base_numbering": 1,
101 | "nav_menu": {},
102 | "number_sections": true,
103 | "sideBar": true,
104 | "skip_h1_title": false,
105 | "title_cell": "Table of Contents",
106 | "title_sidebar": "Contents",
107 | "toc_cell": false,
108 | "toc_position": {},
109 | "toc_section_display": true,
110 | "toc_window_display": false
111 | }
112 | },
113 | "nbformat": 4,
114 | "nbformat_minor": 2
115 | }
--------------------------------------------------------------------------------
/sentiment-analysis-is-bad/notebooks/Cleaning the Sentiment140 data.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Cleaning the Sentiment140 data\n",
8 | "\n",
9 | "\n",
10 | "The [Sentiment140](http://www.sentiment140.com/) dataset is a collection of 1.6 million tweets that have been tagged as either positive or negative.\n",
11 | "\n",
12 | "Before we clean it, a question: _how'd they get so many tagged tweets?_ If you poke around on their documentation, [the answer is hiding right here](http://help.sentiment140.com/for-students):\n",
13 | "\n",
14 | "> In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative.\n",
15 | "\n",
16 | "That's a good thing to discuss later, but for now let's just clean it up. In this notebook we'll be removing columns we don't want, and standardizing the sentiment column."
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "\n \n \n Read online\n \n \n \n Download notebook\n \n \n \n Interactive version\n \n
"
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "### Prep work: Downloading necessary files\n",
31 | "Before we get started, we need to download all of the data we'll be using.\n",
32 | "* **training.1600000.processed.noemoticon.csv:** raw data from Sentiment140 - 1.4 million tweets tagged for sentiment, no column headers, nothing cleaned up\n"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "metadata": {},
38 | "source": [
39 | "# Make data directory if it doesn't exist\n",
40 | "!mkdir -p data\n",
41 | "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/sentiment-analysis-is-bad/data/training.1600000.processed.noemoticon.csv.zip -P data\n",
42 | "!unzip -n -d data data/training.1600000.processed.noemoticon.csv.zip"
43 | ],
44 | "outputs": [],
45 | "execution_count": null
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "## Read the tweets in"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 2,
57 | "metadata": {},
58 | "outputs": [
59 | {
60 | "data": {
61 | "text/html": [
62 |       "[HTML rendering of the first five rows of the dataframe; the same preview appears as plain text below]"
137 | ],
138 | "text/plain": [
139 | " polarity id date query \\\n",
140 | "0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY \n",
141 | "1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY \n",
142 | "2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY \n",
143 | "3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY \n",
144 | "4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY \n",
145 | "\n",
146 | " user text \n",
147 | "0 _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t... \n",
148 | "1 scotthamilton is upset that he can't update his Facebook by ... \n",
149 | "2 mattycus @Kenichan I dived many times for the ball. Man... \n",
150 | "3 ElleCTF my whole body feels itchy and like its on fire \n",
151 | "4 Karoli @nationwideclass no, it's not behaving at all.... "
152 | ]
153 | },
154 | "execution_count": 2,
155 | "metadata": {},
156 | "output_type": "execute_result"
157 | }
158 | ],
159 | "source": [
160 | "import pandas as pd\n",
161 | "\n",
162 | "df = pd.read_csv(\"data/training.1600000.processed.noemoticon.csv\",\n",
163 | " names=['polarity', 'id', 'date', 'query', 'user', 'text'],\n",
164 | " encoding='latin-1')\n",
165 | "df.head()"
166 | ]
167 | },
168 | {
169 | "cell_type": "markdown",
170 | "metadata": {},
171 | "source": [
172 | "## Update polarity\n",
173 | "\n",
174 | "Right now the `polarity` column is `0` for negative, `4` for positive. Let's change that to `0` and `1` to make things a little more reasonably readable."
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "execution_count": 3,
180 | "metadata": {},
181 | "outputs": [
182 | {
183 | "data": {
184 | "text/plain": [
185 | "4 800000\n",
186 | "0 800000\n",
187 | "Name: polarity, dtype: int64"
188 | ]
189 | },
190 | "execution_count": 3,
191 | "metadata": {},
192 | "output_type": "execute_result"
193 | }
194 | ],
195 | "source": [
196 | "df.polarity.value_counts()"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": 4,
202 | "metadata": {},
203 | "outputs": [
204 | {
205 | "data": {
206 | "text/plain": [
207 | "1 800000\n",
208 | "0 800000\n",
209 | "Name: polarity, dtype: int64"
210 | ]
211 | },
212 | "execution_count": 4,
213 | "metadata": {},
214 | "output_type": "execute_result"
215 | }
216 | ],
217 | "source": [
218 | "df.polarity = df.polarity.replace({0: 0, 4: 1})\n",
219 | "df.polarity.value_counts()"
220 | ]
221 | },
222 | {
223 | "cell_type": "markdown",
224 | "metadata": {},
225 | "source": [
226 | "## Remove unneeded columns\n",
227 | "\n",
228 | "We don't need all those columns! Let's get rid of the ones that won't affect the sentiment."
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": 5,
234 | "metadata": {},
235 | "outputs": [
236 | {
237 | "data": {
238 | "text/html": [
239 |       "[HTML rendering of the first five rows after dropping columns; the same preview appears as plain text below]"
290 | ],
291 | "text/plain": [
292 | " polarity text\n",
293 | "0 0 @switchfoot http://twitpic.com/2y1zl - Awww, t...\n",
294 | "1 0 is upset that he can't update his Facebook by ...\n",
295 | "2 0 @Kenichan I dived many times for the ball. Man...\n",
296 | "3 0 my whole body feels itchy and like its on fire \n",
297 | "4 0 @nationwideclass no, it's not behaving at all...."
298 | ]
299 | },
300 | "execution_count": 5,
301 | "metadata": {},
302 | "output_type": "execute_result"
303 | }
304 | ],
305 | "source": [
306 | "df = df.drop(columns=['id', 'date', 'query', 'user'])\n",
307 | "df.head()"
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "## Sample\n",
315 | "\n",
316 | "To make the filesize a little smaller and pandas a little happier, let's knock this down to 500,000 tweets."
317 | ]
318 | },
319 | {
320 | "cell_type": "code",
321 | "execution_count": 6,
322 | "metadata": {},
323 | "outputs": [
324 | {
325 | "data": {
326 | "text/plain": [
327 | "0 250275\n",
328 | "1 249725\n",
329 | "Name: polarity, dtype: int64"
330 | ]
331 | },
332 | "execution_count": 6,
333 | "metadata": {},
334 | "output_type": "execute_result"
335 | }
336 | ],
337 | "source": [
338 | "df = df.sample(n=500000)\n",
339 | "df.polarity.value_counts()"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": 7,
345 | "metadata": {},
346 | "outputs": [],
347 | "source": [
348 | "df.to_csv(\"data/sentiment140-subset.csv\", index=False)"
349 | ]
350 | },
351 | {
352 | "cell_type": "markdown",
353 | "metadata": {},
354 | "source": [
355 | "## Review\n",
356 | "\n",
357 | "In this section, we cleaned up the **Sentiment140** tweet dataset. Sentiment140 is a collection of 1.6 million tweets that are marked as either positive or negative sentiment."
358 | ]
359 | },
360 | {
361 | "cell_type": "code",
362 | "execution_count": null,
363 | "metadata": {},
364 | "outputs": [],
365 | "source": []
366 | }
367 | ],
368 | "metadata": {
369 | "kernelspec": {
370 | "display_name": "Python 3",
371 | "language": "python",
372 | "name": "python3"
373 | },
374 | "language_info": {
375 | "codemirror_mode": {
376 | "name": "ipython",
377 | "version": 3
378 | },
379 | "file_extension": ".py",
380 | "mimetype": "text/x-python",
381 | "name": "python",
382 | "nbconvert_exporter": "python",
383 | "pygments_lexer": "ipython3",
384 | "version": "3.6.8"
385 | },
386 | "toc": {
387 | "base_numbering": 1,
388 | "nav_menu": {},
389 | "number_sections": true,
390 | "sideBar": true,
391 | "skip_h1_title": false,
392 | "title_cell": "Table of Contents",
393 | "title_sidebar": "Contents",
394 | "toc_cell": false,
395 | "toc_position": {},
396 | "toc_section_display": true,
397 | "toc_window_display": false
398 | }
399 | },
400 | "nbformat": 4,
401 | "nbformat_minor": 2
402 | }
--------------------------------------------------------------------------------
/text-analysis/notebooks/Stemming and lemmatization.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Stemming and lemmatization\n",
8 | "\n",
9 | "The English language loves putting endings on things: potato and potatoes are the same thing, as are swim/swimming/swims. Many other languages, like German or Spanish, like to do the same thing. In this section we'll take a look at what you can do to standardize or normalize the different forms of these words to join them all together."
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "\n \n \n Read online\n \n \n \n Download notebook\n \n \n \n Interactive version\n \n
"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "## Our sentences\n",
24 | "\n",
25 | "Let's start off with a few small pieces of text."
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "TODO"
33 | ]
34 | },
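{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "Here's a minimal sketch of the idea using NLTK (one option among several - spaCy can do this too). Stemming chops endings off with rules, while lemmatization looks words up in a dictionary, so it needs to know the part of speech. The word list below is just the examples from the intro."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "import nltk\n",
  "from nltk.stem import PorterStemmer, WordNetLemmatizer\n",
  "\n",
  "# The lemmatizer needs the WordNet data downloaded once\n",
  "nltk.download('wordnet')\n",
  "\n",
  "words = ['potato', 'potatoes', 'swim', 'swims', 'swimming']\n",
  "\n",
  "# Stemming: rule-based chopping, the result isn't always a real word\n",
  "stemmer = PorterStemmer()\n",
  "print([stemmer.stem(word) for word in words])\n",
  "\n",
  "# Lemmatization: dictionary lookup, guided by part of speech\n",
  "lemmatizer = WordNetLemmatizer()\n",
  "print([lemmatizer.lemmatize(word, pos='n') for word in words])\n",
  "print([lemmatizer.lemmatize(word, pos='v') for word in words])"
 ]
},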
35 | {
36 | "cell_type": "code",
37 | "execution_count": null,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": []
41 | }
42 | ],
43 | "metadata": {
44 | "kernelspec": {
45 | "display_name": "Python 3",
46 | "language": "python",
47 | "name": "python3"
48 | },
49 | "language_info": {
50 | "codemirror_mode": {
51 | "name": "ipython",
52 | "version": 3
53 | },
54 | "file_extension": ".py",
55 | "mimetype": "text/x-python",
56 | "name": "python",
57 | "nbconvert_exporter": "python",
58 | "pygments_lexer": "ipython3",
59 | "version": "3.6.8"
60 | },
61 | "toc": {
62 | "base_numbering": 1,
63 | "nav_menu": {},
64 | "number_sections": true,
65 | "sideBar": true,
66 | "skip_h1_title": false,
67 | "title_cell": "Table of Contents",
68 | "title_sidebar": "Contents",
69 | "toc_cell": false,
70 | "toc_position": {},
71 | "toc_section_display": true,
72 | "toc_window_display": false
73 | }
74 | },
75 | "nbformat": 4,
76 | "nbformat_minor": 2
77 | }
--------------------------------------------------------------------------------
/text-analysis/notebooks/Topic modeling and clustering.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Topic modeling and clustering\n",
8 | "\n",
9 | "Topic models and clustering are both techniques for automatically learning about documents. How do they compare?"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "\n \n \n Read online\n \n \n \n Download notebook\n \n \n \n Interactive version\n \n
"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "### Prep work: Downloading necessary files\n",
24 | "Before we get started, we need to download all of the data we'll be using.\n",
25 | "* **recipes.csv:** recipes - a list of recipes (but only with ingredient names)\n",
26 | "* **state-of-the-union.csv:** State of the Union addresses - each presidential address from 1970 to 2012\n"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "metadata": {},
32 | "source": [
33 | "# Make data directory if it doesn't exist\n",
34 | "!mkdir -p data\n",
35 | "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/text-analysis/data/recipes.csv -P data\n",
36 | "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/text-analysis/data/state-of-the-union.csv -P data"
37 | ],
38 | "outputs": [],
39 | "execution_count": null
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "TODO"
46 | ]
47 | },
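{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "Before digging into the real datasets, here's a minimal sketch on a few invented ingredient lists (made up for illustration, not the downloaded recipes file). The big difference is in the shape of the output: topic modeling gives every document a score for every topic, while clustering drops each document into exactly one group."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "from sklearn.feature_extraction.text import TfidfVectorizer\n",
  "from sklearn.decomposition import NMF\n",
  "from sklearn.cluster import KMeans\n",
  "\n",
  "# Made-up 'recipes', just ingredient names\n",
  "docs = [\n",
  "    'flour sugar butter eggs vanilla',\n",
  "    'flour yeast water salt olive oil',\n",
  "    'tomato basil garlic olive oil pasta',\n",
  "    'sugar cocoa butter eggs flour'\n",
  "]\n",
  "\n",
  "vectorizer = TfidfVectorizer()\n",
  "matrix = vectorizer.fit_transform(docs)\n",
  "\n",
  "# Topic modeling (NMF): each document gets a score for every topic\n",
  "topic_scores = NMF(n_components=2).fit_transform(matrix)\n",
  "print(topic_scores.round(2))\n",
  "\n",
  "# Clustering (KMeans): each document lands in exactly one cluster\n",
  "clusters = KMeans(n_clusters=2).fit_predict(matrix)\n",
  "print(clusters)"
 ]
},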
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "metadata": {},
52 | "outputs": [],
53 | "source": []
54 | }
55 | ],
56 | "metadata": {
57 | "kernelspec": {
58 | "display_name": "Python 3",
59 | "language": "python",
60 | "name": "python3"
61 | },
62 | "language_info": {
63 | "codemirror_mode": {
64 | "name": "ipython",
65 | "version": 3
66 | },
67 | "file_extension": ".py",
68 | "mimetype": "text/x-python",
69 | "name": "python",
70 | "nbconvert_exporter": "python",
71 | "pygments_lexer": "ipython3",
72 | "version": "3.6.8"
73 | },
74 | "toc": {
75 | "base_numbering": 1,
76 | "nav_menu": {},
77 | "number_sections": true,
78 | "sideBar": true,
79 | "skip_h1_title": false,
80 | "title_cell": "Table of Contents",
81 | "title_sidebar": "Contents",
82 | "toc_cell": false,
83 | "toc_position": {},
84 | "toc_section_display": true,
85 | "toc_window_display": false
86 | }
87 | },
88 | "nbformat": 4,
89 | "nbformat_minor": 2
90 | }
--------------------------------------------------------------------------------
/text-analysis/notebooks/Topic modeling with Gensim.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Topic modeling with Gensim\n",
8 | "\n",
9 | "Gensim is a popular library for topic modeling. Here we'll see how it stacks up to scikit-learn."
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "\n \n \n Read online\n \n \n \n Download notebook\n \n \n \n Interactive version\n \n
"
17 | ]
18 | },
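{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As a bare-bones starting point, here's a minimal sketch of Gensim's LDA on a few made-up ingredient lists (invented for illustration). The main practical difference from scikit-learn is that Gensim wants documents as lists of words, and builds its own Dictionary and bag-of-words corpus instead of using a vectorizer."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "from gensim import corpora, models\n",
  "\n",
  "# Made-up, pre-tokenized documents\n",
  "docs = [\n",
  "    ['flour', 'sugar', 'butter', 'eggs', 'vanilla'],\n",
  "    ['flour', 'yeast', 'water', 'salt', 'olive', 'oil'],\n",
  "    ['tomato', 'basil', 'garlic', 'olive', 'oil', 'pasta'],\n",
  "    ['sugar', 'cocoa', 'butter', 'eggs', 'flour']\n",
  "]\n",
  "\n",
  "# Map each word to an id, then convert each document to (word id, count) pairs\n",
  "dictionary = corpora.Dictionary(docs)\n",
  "corpus = [dictionary.doc2bow(doc) for doc in docs]\n",
  "\n",
  "lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)\n",
  "print(lda.print_topics())"
 ]
},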
19 | {
20 | "cell_type": "code",
21 | "execution_count": null,
22 | "metadata": {},
23 | "outputs": [],
24 | "source": []
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": null,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": []
32 | }
33 | ],
34 | "metadata": {
35 | "kernelspec": {
36 | "display_name": "Python 3",
37 | "language": "python",
38 | "name": "python3"
39 | },
40 | "language_info": {
41 | "codemirror_mode": {
42 | "name": "ipython",
43 | "version": 3
44 | },
45 | "file_extension": ".py",
46 | "mimetype": "text/x-python",
47 | "name": "python",
48 | "nbconvert_exporter": "python",
49 | "pygments_lexer": "ipython3",
50 | "version": "3.6.8"
51 | },
52 | "toc": {
53 | "base_numbering": 1,
54 | "nav_menu": {},
55 | "number_sections": true,
56 | "sideBar": true,
57 | "skip_h1_title": false,
58 | "title_cell": "Table of Contents",
59 | "title_sidebar": "Contents",
60 | "toc_cell": false,
61 | "toc_position": {},
62 | "toc_section_display": true,
63 | "toc_window_display": false
64 | }
65 | },
66 | "nbformat": 4,
67 | "nbformat_minor": 2
68 | }
--------------------------------------------------------------------------------
/text-analysis/notebooks/Types of text analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Breaking down a few different kinds of text analysis\n",
8 | "\n",
9 | "Natural language processing (NLP) is a wide wide field that encompasses _everything_ involving language. We'll mostly be sticking with analyzing documents, but even then there are a hundred and one different things we can do. Let's break a few of them down.\n",
10 | "\n",
11 | "_(and yes, everything from books to tweets count as \"documents\")_"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "\n \n \n Read online\n \n \n \n Download notebook\n \n \n \n Interactive version\n \n
"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "## Word counting\n",
26 | "\n",
27 | "Sometimes you just want to [count some words](/text-analysis/counting-words-with-pythons-counter/). We outline both a simple technique as well as [a more advanced version](/text-analysis/counting-words-with-scikit-learns-countvectorizer/), too.\n",
28 | "\n",
29 | "## Topic modeling and clustering\n",
30 | "\n",
31 | "If you have no clue what a set of documents might be about, both **topic modeling** and **clustering** are approaches to getting [a glance at what's inside](/text-analysis/topic-modeling-and-clustering). Topic modeling tries to find a set of topics that show up in the documents, while clustering organizes the documents into separate, discrete categories.\n",
32 | "\n",
33 | "## Entity extraction\n",
34 | "\n",
35 | "Sometimes you aren't looking for concepts, you're looking for **actual people or things**. Who is mentioned in that document dump? What companies are listed in a judge's conflict of interest filings? This is **[entity extraction](/text-analysis/named-entity-recognition/).**\n",
36 | "\n",
37 | "## Classification\n",
38 | "\n",
39 | "When you have a large set of documents, you can often organize them into two (or more) categories: **ones you're interested in and ones you aren't.**\n",
40 | "\n",
41 | "You might be trying to find comments mentioning bullying, or discplinary orders about sexual abuse, or complaints mentioning airbags that malfunctioned in a specific way. **Classification** can help out in these situations, by having you train the computer what interesting and uninteresting documents look like. You read a portion and then let the computer explore the rest!\n",
42 | "\n",
43 | "We cover classification under [a different section](/classification/intro-to-classification/), so you'll want to review how to [count words](/text-analysis/counting-words-with-scikit-learns-countvectorizer/) first.\n",
44 | "\n",
45 | "## Sentiment analysis\n",
46 | "\n",
47 | "Positive or negative? Happy or sad? [Sentiment analysis](/investigating-sentiment-analysis/comparing-sentiment-analysis-tools/) is the idea that you can extract emotional meaning based on what people have written. Often used for news stories or tweets, it's generally a subset of classification.\n",
48 | "\n",
49 | "## Document similarity\n",
50 | "\n",
51 | "Comparing two or more documents can be approached a few different ways. Are you looking for word-for-word plagiarism, or just similarity in concepts? The former you can do with [simple counts](/text-analysis/explaining-n-grams-in-natural-language-processing/), while the latter takes [a leap into word embeddings](/text-analysis/document-similarity-using-word-embeddings/)."
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": null,
57 | "metadata": {},
58 | "outputs": [],
59 | "source": []
60 | }
61 | ],
62 | "metadata": {
63 | "kernelspec": {
64 | "display_name": "Python 3",
65 | "language": "python",
66 | "name": "python3"
67 | },
68 | "language_info": {
69 | "codemirror_mode": {
70 | "name": "ipython",
71 | "version": 3
72 | },
73 | "file_extension": ".py",
74 | "mimetype": "text/x-python",
75 | "name": "python",
76 | "nbconvert_exporter": "python",
77 | "pygments_lexer": "ipython3",
78 | "version": "3.6.8"
79 | },
80 | "toc": {
81 | "base_numbering": 1,
82 | "nav_menu": {},
83 | "number_sections": true,
84 | "sideBar": true,
85 | "skip_h1_title": false,
86 | "title_cell": "Table of Contents",
87 | "title_sidebar": "Contents",
88 | "toc_cell": false,
89 | "toc_position": {},
90 | "toc_section_display": true,
91 | "toc_window_display": false
92 | }
93 | },
94 | "nbformat": 4,
95 | "nbformat_minor": 2
96 | }
--------------------------------------------------------------------------------