├── README.md ├── LICENSE └── DocHate.ipynb /README.md: -------------------------------------------------------------------------------- 1 | ## Multilabel Classification with DocHate tips 2 | When journalists ask their audience for help, success creates a whole new problem: what do you do with thousands of tips? 3 | 4 | Or what do you do with thousands of textual descriptions of … anything … potholes, disciplinary actions at prisons, aircraft safety incidents? There are too many to really read. 5 | 6 | And any time you feel "there are too many to really read," that's when you should consider getting help from machine learning. 7 | 8 | Here's how we did that. There's an iPython notebook in this repo; we also have [a non-technical blogpost](https://qz.ai/a-crash-course-for-journalists-in-classifying-text-with-machine-learning/) you can read. 9 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2020 Quartz 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | 5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 6 | 7 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 8 | 9 | 10 | -------------------------------------------------------------------------------- /DocHate.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Multilabel Classification with DocHate tips\n", 8 | "\n", 9 | "\n", 10 | "When journalists ask their audience for help, success creates a whole new problem: what do you do with thousands of tips? \n", 11 | "\n", 12 | "Or what do you do with thousands of textual descriptions of … anything … potholes, disciplinary actions at prisons, aircraft safety incidents? There are too many to really read.\n", 13 | "\n", 14 | "And any time you feel \"there are too many to really read,\" that's when you should consider getting help from machine learning.\n", 15 | "\n", 16 | "The folks at ProPublica’s Documenting Hate project had this problem, with around 6,000 tips about hate crimes and bias incidents contributed by readers. To report a hate incident, someone only has to provide a written description of what happened. If they choose, they can also fill out checkboxes for why the victim was targeted -- e.g. because of their race, religion or immigrant status.\n", 17 | "\n", 18 | "Only some people include that “targeted because” checkbox, but that data is important for analysis and for getting tips to the right reporter. Could we train a computer to guess at what kind of target was involved based on the written description alone?\n", 19 | "\n", 20 | "**This notebook is technical and gets into the nitty-gritty of how to do text classification in this context. If you'd like a less-technical overview, read [our blogpost](https://qz.ai/a-crash-course-for-journalists-in-classifying-text-with-machine-learning/).** Check out the interactive example of the Naive Bayes model here: [demo](https://s3.amazonaws.com/qz-aistudio-public/dochate.html).\n", 21 | "\n", 22 | "We used Python and the scikit-learn library. (And tested some other algorithms using Keras.) But all of this is doable in R or other programming/stats languages. \n", 23 | "\n", 24 | "Here's what the final results look like, for predicting whether a tip is related to race and/or ethnicity using a variety of algorithms:\n", 25 | "\n", 26 | "````\n", 27 | " AU PR Curve\n", 28 | " Keras CNN 92\n", 29 | " Naive Bayes 90\n", 30 | " Spacy 88\n", 31 | " Google AutomML 87\n", 32 | " Keras NN 84\n", 33 | " Keras LSTM -\n", 34 | "```` \n", 35 | "\n", 36 | "It goes without saying, but, **be aware that there are slurs, swear words, and other offensive language in the code and output here!**" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Step 1: Figuring out what question we wanted to answer.\n", 44 | "\n", 45 | "ProPublica receives tips about hate crimes via a [web form](http://documentinghate.com). The `targeted_because` checkboxes are optional. To fiddle with text classification approaches that work well for this kind of data (short-ish, political topics, etc.), we're going to try to \"fill in the blanks\" when the value of the `targeted_because` field is empty. \n" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 1, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "# let's get started!\n", 55 | "from os import environ\n", 56 | "import numpy as np\n", 57 | "import tensorflow as tf\n", 58 | "import random\n", 59 | "\n", 60 | "# several of the algorithms we test here make use of randomness. and we split the data into train/test groups randomly.\n", 61 | "# in order to make sure that every time we run this notebook, we get the same results (rather than a\n", 62 | "# \"good\" split making one algorithm choice seem better), we set an arbitrary number (1234) as the seed for all the \n", 63 | "# random number generators.\n", 64 | "RANDOM_SEED = 1234\n", 65 | "np.random.seed(RANDOM_SEED)\n", 66 | "random.seed(RANDOM_SEED)\n", 67 | "tf.set_random_seed(RANDOM_SEED)\n", 68 | "environ['PYTHONHASHSEED'] = '0'\n", 69 | "\n", 70 | "from tensorflow import keras\n", 71 | "import pandas as pd\n", 72 | "import spacy\n", 73 | "import csv\n", 74 | "from sklearn.model_selection import train_test_split\n", 75 | "from sklearn.metrics import confusion_matrix, average_precision_score, precision_recall_curve, classification_report\n", 76 | "nlp = spacy.load('en_core_web_lg')\n", 77 | "from sklearn.preprocessing import MultiLabelBinarizer\n", 78 | "from imblearn.over_sampling import SMOTE" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 4, 84 | "metadata": {}, 85 | "outputs": [ 86 | { 87 | "name": "stdout", 88 | "output_type": "stream", 89 | "text": [ 90 | "Tips count: 5943\n", 91 | "Columns: ['admin_url', 'links', 'source', 'city', 'state', 'incident_date', 'where_occurred', 'type', 'targeted_because', 'gender', 'religion', 'race_ethnicity', 'reported_to_police', 'police_dept', 'description', 'knowledge', 'status']\n" 92 | ] 93 | } 94 | ], 95 | "source": [ 96 | "tips_raw = pd.read_csv(\"data/dochate/CleanReport-2019-02-13.csv\") # the actual file is confidential. see \"step 2\"\n", 97 | "print(\"Tips count: {}\".format(tips_raw.shape[0]))\n", 98 | "print(\"Columns: {}\".format(tips_raw.columns.tolist()))" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "## Step 2: Getting our data\n", 106 | "\n", 107 | "We have our \"train_test\" data and our \"real\" data all mixed in one spreadsheet (along with out-of-scope data, like trolls, inapplicable data like those in Spanish and those with a blank `description`.). Our actual goal is to predict the column values where it's absent, using just the description field. \n", 108 | "\n", 109 | "I can't show you the data itself, but the descriptions are just text. The `targeted_because` column is comma-separated, so it might say `race,ethnicity` or `religion,sexual-orientation,race`.\n", 110 | "\n", 111 | "We remove the trolls, the not-applicable tips, those without a description and those that are in Spanish.\n", 112 | "\n", 113 | "We split the remaining data into two groups. First `real_data` the remaining tips where there were no targeting reasons selected. Those are the ones where we want the computer to find the right answer. Second, `train_test_data` is tips that do have a targeting reason provided by the tipster." 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 6, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "name": "stdout", 123 | "output_type": "stream", 124 | "text": [ 125 | "Tips that need classification: (568, 19)\n", 126 | "tips that have a classification already: (3710, 19)\n" 127 | ] 128 | } 129 | ], 130 | "source": [ 131 | "column_of_interest = \"targeted_because\"\n", 132 | "\n", 133 | "\n", 134 | "# remove all the tips that were marked by hand as trolls or not-applicable or with a blank description.\n", 135 | "tips = tips_raw[(tips_raw[\"status\"] != 'troll') & (tips_raw[\"status\"] != 'not-applicable') & tips_raw[\"description\"].notnull()]\n", 136 | "\n", 137 | "\n", 138 | "# remove duplicates (of which there are some!)\n", 139 | "tips_raw = tips_raw.drop_duplicates(subset=['description', column_of_interest], keep=False)\n", 140 | "\n", 141 | "\n", 142 | "# this is a hacky way of detecting if a tip is in English or in Spanish. \n", 143 | "# stopwords are standard lists of grammatical function words (\"a\", \"the\", \"of\"). \n", 144 | "# If tip has more than 2 Spanish stopwords for every 3 English ones, we exclude it.\n", 145 | "# It's not perfect but it works okay.\n", 146 | "def is_english(sentence):\n", 147 | " from nltk import word_tokenize\n", 148 | " from nltk.corpus import stopwords\n", 149 | " tokens_set = set(word_tokenize(sentence))\n", 150 | " return len(set(stopwords.words('english')) & tokens_set) * 1.5 > len(set(stopwords.words('spanish')) & tokens_set )\n", 151 | "tips['english'] = tips['description'].apply(lambda x: is_english(x))\n", 152 | "tips = tips[tips['english'] != False]\n", 153 | "tips = tips.reset_index()\n", 154 | "\n", 155 | "# split the data into the data for training/testing our models -- and the \"real life\" data we hope to use our model to help with.\n", 156 | "train_test_data = tips[ tips[column_of_interest].notnull() ].copy() # if targeted_because isn't blank\n", 157 | "real_data = tips[~tips.isin(train_test_data)].dropna(how='all').copy() # if it is.\n", 158 | "\n", 159 | "\n", 160 | "print(\"Tips that need classification: {}\".format(real_data.shape))\n", 161 | "print(\"tips that have a classification already: {}\".format(train_test_data.shape))\n" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "## Step 3. Cleaning the data to remove the things that might confuse a computer.\n", 169 | "\n", 170 | "Data cleaning is one of the most important parts of real-world natural language processing, but it's underdiscussed for at least two reasons: it's completely unsexy and it's often different for every project. Data cleaning means removing stuff that might distract a computer and combining similar but not quite identical features so that they appear identical to the computer. A good way to think about data cleaning is to ask yourself what sorts of things in your data would be what _you_ would use to categorize the data.\n", 171 | "\n", 172 | "An easy example is that we lowercase everything (in other words, making the not-quite-identical words \"Then\" and \"then\" identical by transforming the first to \"then\"), since we're working with text typed by internet users. And, we will remove punctuation and \"non-word characters\" because they're not likely to tell us much about what attribute a hate crime was targeted by. (Ask yourself, will commas tell us anything about hate crimes? Of course not.) These are very typical and built into the vectorizers... (so we don't have to do it). We may also want to remove common English words like \"a\" and \"the\" -- typically called \"stopwords\" -- this is sometimes automatic, but not always, so it's worth checking.\n", 173 | "\n", 174 | "Other examples depend on your precise dataset. There's no recipe. You have to ask yourself what words will be a distraction to the model. Here's a harder example: Consider a database of press releases from US Congress that you're trying to categorize by topic (taxation, military, education, etc.). The model should pick out the phrases used frequently by members of Congress who talk about each topic a lot. Sometimes that's good (\"deduction\", for instance)... but sometimes that's bad, like the name of former Rep. Paul Ryan's press secretary. Those words aren't actually useful for determining if something is about taxation... especially if Ryan's replacement hires his staffer.\n", 175 | "\n", 176 | "Data cleaning is the process of cogitating about the data and figuring out a way to remove the unhelpful stuff, but not the helpful stuff. This is task-specific; if we had a dataset of press releases about Wisconsin that contained Ryan's press releases but also ones from the Milwaukee Brewers that we were trying to classify into the politics or sports category, the presence of Ryan-related words like the name of his press secretary _would_ be useful. \n", 177 | "\n", 178 | "Additionally, since the Documenting Hate tip data was submitted by users, it’s possible that they made mistakes -- like marking a clearly religion-related hate crime as related to, say, gender. Or failing to select ‘immigrant’ as a category for an incident that involved the intersection of race and immigrant status. Once I had an initial model trained, I dug -- quite unscientifically -- through the data that the classifier gets wrong to see if there are some where the user-contributed answer (that we treat as ground-truth) might be wrong. We might want to change some of the answers ourselves or even merge categories. You don’t want the computer to “learn” someone else’s mistakes.\n", 179 | "\n", 180 | "We can see what words are being fed into the Naive Bayes model with `vectorizer.vocabulary_.keys()`. Let's do that and take a look. They mostly look good, right? " 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 7, 186 | "metadata": { 187 | "scrolled": true 188 | }, 189 | "outputs": [ 190 | { 191 | "name": "stdout", 192 | "output_type": "stream", 193 | "text": [ 194 | "cranky, 28th, decatur, designated, ezpass, reverting, tender, demonstrators, tellin, fundamental, closely, giampa, inferiors, replies, demarcate, regenerate, aprove, marxist, shoe, hallway, arabs, mart, shortest, purported, curbs, abqjew, palm, homosexuality, engange, waste, trigger, xbox, carbondale, cbp, breathe, superstition, martial, horribly, beware, netanyahu, breeds, 42nd, smythe, chuckle, attended, predators, corporation, potted, strangely, duffle, catty, harrasing, sacred, bt, leaf, revs, gecko, 1st, akbar, clarksville, muscle, advocating, meningioma, bucks, slack, americorps, fyre, hanging, offhand, efforts, agitators, paranormal, investgations, disected, humptulips, ranged, justin, bisexual, congress, marijuana, hoodie, philando, denver, marco, upholstery, for, unavailable, kenosha, 99485266, imperial, holt, parkersburg, kleeve, middletown, instagram, arbitrarily, swaztica, outpopulate, seminole, killing, considerado, spokesman, nuefeild, chugiak, escape, founded, clinical, mustard, affluent, tab, caucassian, stolen, virtually, dangerous, funneled, santa, chokes, tenants, buffalo, benches, lynching, readership, transferred, sa, siguiendo\n" 195 | ] 196 | } 197 | ], 198 | "source": [ 199 | "from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer, TfidfVectorizer\n", 200 | "simple_vectorizer = TfidfVectorizer(lowercase=True) # 1,1 works well?\n", 201 | "simple_vectorizer.fit(tips[\"description\"])\n", 202 | "words = list(simple_vectorizer.vocabulary_.keys())\n", 203 | "random.shuffle(words)\n", 204 | "print(', '.join(words[:125]))" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "But what if there are meaningful words that are absent from here, because they're removed by the cleaning process?\n", 212 | "\n", 213 | "🤔🤔🤔\n" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 8, 219 | "metadata": { 220 | "scrolled": true 221 | }, 222 | "outputs": [ 223 | { 224 | "name": "stdout", 225 | "output_type": "stream", 226 | "text": [ 227 | "is email 'visible' to the computer? True\n", 228 | "is e-mail 'visible' to the computer? False\n", 229 | "is t-shirt 'visible' to the computer? False\n", 230 | "is f**king 'visible' to the computer? False\n" 231 | ] 232 | } 233 | ], 234 | "source": [ 235 | "words_to_check = [\"email\", \"e-mail\", \"t-shirt\", \"f**king\"]\n", 236 | "for word in words_to_check:\n", 237 | " print(\"is {} 'visible' to the computer? {}\".format(word, word in simple_vectorizer.vocabulary_.keys()))" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "I think you see where we're going here...\n", 245 | "\n", 246 | "### Data Cleaning That Preserves Censored Slurs\n", 247 | "\n", 248 | "What are some words that are really informative about hate crimes, but are frequently not spelled out? Slurs. Remember when we removed \"non-word characters\"? We might want to backtrack and keep some of them, e.g. when a slur is replaced with comics-style grawlixes (\"F@#$!\"), stars, dashes or transformed into, e.g., \"the f-word\" or \"k**e\".\n", 249 | "\n", 250 | "Default text-cleaning rules will split words at hyphens, transforming the quite-informative \"f-word\" first into \"f word\", then it will remove one-letter words, so we're just left with \"word\"... which tells us basically nothing. Defaults are usually a good choice, but here's an example where they're not.\n", 251 | "\n", 252 | "So lets find some examples in the dataset so we can try to make sure they're included." 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 18, 258 | "metadata": { 259 | "scrolled": true 260 | }, 261 | "outputs": [ 262 | { 263 | "data": { 264 | "text/plain": [ 265 | "['B_t', 'C*NT', 'C-word', 'E=MC', 'F*****g']" 266 | ] 267 | }, 268 | "execution_count": 18, 269 | "metadata": {}, 270 | "output_type": "execute_result" 271 | } 272 | ], 273 | "source": [ 274 | "import re\n", 275 | "# find words that have one alphabetic character, then one or more non-alpha chars, then more alpha chars.\n", 276 | "# (so this matches 'e-mail', 't-shirt', 'f***ing' but not 'anti-semitic')\n", 277 | "bad_words = sorted(set([item for sublist in [res for res in [re.findall(r\"(?i)(?<= )[a-z\\u00C0-\\u017F“][^a-z0-9“\\u00C0-\\u017F'’\\.\\s]+[a-z\\u00C0-\\u017F“]+\", tip) for tip in list(tips['description'].values)] if res] for item in sublist if not re.match(r'^(?i)[A-Za-z][&-][A-Za-z]$', item)]))\n", 278 | "list(bad_words)[5:10]" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "Yeah okay. How're we gonna deal with that...\n", 286 | "\n", 287 | "We can verify (with the `inspect` method) that the word \"word\" makes an input tip more likely to be related to race and less likely to be related to other topics (though it's far more of a drag on, say, the `religion` class than on `sexual-orientation`.)\n", 288 | "\n", 289 | "I noodled around with this for a while... The solution didn't occur to me immediately and I tried a variety of things and changed my goals when it became clear I hadn't fully solved the problem. At the start, I just wanted words like `f-word`, `f****r`, etc. to be preserved in the data given to the classifier... by the end, I decided that I wanted as many different variants of censored words to be \"collapsed\" into the same token -- and into a token that was mostly understandable by a human (not gibberish). I also wanted to make sure that the censored words didn't get turned into an instance of an unrelated \"normal\" word.\n", 290 | "\n", 291 | "What I came up with only acts on words that are a single alphabetic character followed by one or more non-alphabetic characters followed by one or more alphabetic characters. If the non-alphabetic character string is just one hyphen, it gets turned into `dash`, so if we see `t-shirt` we turn it into `tdashshirt`. Otherwise, we replace it with the first letter, `XXX` and the last letter of the word -- so that `f*cking` and `f***ing` end up collapsed to the same thing, `fXXXg`. \n", 292 | "\n", 293 | "Inevitably, I did a fair amount of futzing around here. For a while, my regex didn't realize characters with diacritics were letters, so it started censoring the Spanish word \"pública\". I also was initially matching words like \"A&M\", which had to be excluded.\n", 294 | "\n", 295 | "The effect of this turned out not to be that great though (about half a percentage point improvement in AUC). " 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 19, 301 | "metadata": { 302 | "scrolled": true 303 | }, 304 | "outputs": [], 305 | "source": [ 306 | "import re\n", 307 | "def collapse_censored_word(word):\n", 308 | " if re.match(r\"(?i)[a-z\\u00C0-\\u017F“]-[a-z\\u00C0-\\u017F“]+\", word): # if there's just one hyphen, e.g. t-shirt, f-ing...\n", 309 | " word = word.replace(\"-\", \"dash\")\n", 310 | " else:\n", 311 | " word = word[0] + \"XXX\" + word[-1]\n", 312 | "# word = re.sub(r\"(?i)([^a-z“\\u00C0-\\u017F0-9'’\\.\\s\\-]+|-{2,})\", \"XXX\", word)\n", 313 | " return word\n", 314 | "\n", 315 | "censorable_word_regex = \"[a-z\\u00C0-\\u017F“][^a-z0-9“\\u00C0-\\u017F'’\\.\\s]+[a-z\\u00C0-\\u017F“]+\"\n", 316 | "def clean(text):\n", 317 | " potential_censored_words = re.findall(r\"(?i)(?<=[ \\(\\\"\\'])\" + censorable_word_regex, text) + re.findall(\"(?i)^\" + censorable_word_regex, text)\n", 318 | " for word in potential_censored_words:\n", 319 | " text = text.replace(word, collapse_censored_word(word))\n", 320 | " return text.replace(\"“\", '').replace(\"’s\", \" 's\").replace(\"'s\", \" 's\")" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "See how it works? Rather than just retaining \"word\", we retain something meaningful." 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 20, 333 | "metadata": {}, 334 | "outputs": [ 335 | { 336 | "data": { 337 | "text/plain": [ 338 | "'Qdashword is bad'" 339 | ] 340 | }, 341 | "execution_count": 20, 342 | "metadata": {}, 343 | "output_type": "execute_result" 344 | } 345 | ], 346 | "source": [ 347 | "clean(\"Q-word is bad\")" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 21, 353 | "metadata": {}, 354 | "outputs": [ 355 | { 356 | "name": "stdout", 357 | "output_type": "stream", 358 | "text": [ 359 | "Total censored words: 120\n", 360 | "Total censored words after cleaning: 73\n" 361 | ] 362 | } 363 | ], 364 | "source": [ 365 | "# checking how many bad words are collapsed with this method\n", 366 | "print(\"Total censored words: {}\".format(len(bad_words)))\n", 367 | "unified_bad_words = {}\n", 368 | "for word, clean_word in [(word, collapse_censored_word(word)) for word in bad_words]:\n", 369 | " if clean_word not in unified_bad_words:\n", 370 | " unified_bad_words[clean_word] = []\n", 371 | " unified_bad_words[clean_word].append(word)\n", 372 | "print(\"Total censored words after cleaning: {}\".format(len(set( unified_bad_words.keys()))))" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "Once we've come up with a way to clean the text that we like, we do it.\n", 380 | "\n", 381 | "`lemmatize` relies on a library called Spacy. It removes verb endings from words -- on the theory that we learn more by treating \"punch\" and \"punching\" and \"punched\" as the same word, especially when we have a small dataset. It adds a few percentage points of AUPR for several of the classes (but not race_ethnicity) with NB. For CNN it improves or does nothing (and costs one percentage point in a few places; immigrant does worse with the default dropout1=0.02 but dropout1=0.01 fixes the problem)\n" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 22, 387 | "metadata": {}, 388 | "outputs": [], 389 | "source": [ 390 | "def lemmatize(doc):\n", 391 | " return ' '.join([token.lemma_ for token in nlp(doc)])\n", 392 | "\n", 393 | "train_test_data[\"description\"] = train_test_data[\"description\"].apply(clean)\n", 394 | "train_test_data[\"description\"] = train_test_data[\"description\"].apply(lemmatize)" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "metadata": {}, 400 | "source": [ 401 | "### Preparing to predict targeted_because\n", 402 | "\n", 403 | "Right now, the `targeted_because` column is exactly as it was in our source data (except we removed the blanks rows, the trolls, etc.) -- that is, a string with commas. The \"typical\" format for machine-learning projects like this one is to have one column for each possible targeting reason (race, etc.) and then a `1` in that column for each description if it has that class and a `0` if it doesn't. \n", 404 | "\n", 405 | "You can do that however you like, but we're using the MultiLabelBinarizer class from scikit-learn.\n", 406 | "\n", 407 | "I'm also adding the `race_ethnicity` column that's `1` (i.e. true) if the hate incident is classified as either `race` or `ethnicity`-related by the tipster. That's because I guess that some tipsters are going to mix them up, which'd confuse the computer. \n", 408 | "\n", 409 | "I wonder if merging `race` and `ethnicity` categories might be a good idea -- only because people may use the terms interchangably on the form in a way that the computer can't learn the nuanced distinction between them.\n" 410 | ] 411 | }, 412 | { 413 | "cell_type": "code", 414 | "execution_count": 27, 415 | "metadata": {}, 416 | "outputs": [], 417 | "source": [ 418 | "# split the comma-separated targeted_because column into an actual list.\n", 419 | "train_test_data[column_of_interest] = train_test_data[column_of_interest].apply(lambda x: x.split(\",\") if type(x) != list else x)\n", 420 | "\n", 421 | "# since we're doing a multi-label classification problem -- aka a single incident can involve targeting someone for \n", 422 | "# one or more of the possible labels (e.g. race AND religion AND immigrant status) -- we need to do some data preprocessing.\n", 423 | "# 'disability', 'ethnicity', 'gender', 'immigrant', 'race', 'religion', 'sexual-orientation'\n", 424 | "# we're actually going to be doing 7 classifiers, one to see if a description matches each label or not.\n", 425 | "lb = MultiLabelBinarizer()\n", 426 | "labels_df = pd.DataFrame(lb.fit_transform(train_test_data[column_of_interest]), columns=list(lb.classes_), index=train_test_data.index)\n", 427 | "train_test_data_one_hot = pd.concat([train_test_data[[\"description\", column_of_interest]], labels_df], axis=1)\n", 428 | "# print(train_test_data_one_hot[[idx for idx in train_test_data_one_hot.columns if idx != 'description']])\n", 429 | "train_test_data_one_hot[\"race_ethnicity\"] = train_test_data_one_hot.apply(lambda x: 1.0 if x[\"race\"] or x[\"ethnicity\"] else 0.0, axis=1)\n", 430 | "train_test_data_one_hot[[\"description\", \"race_ethnicity\"]].to_csv(\"data/dochate/dochate_for_automl.csv\", header=False, index=False)\n", 431 | "unique_classes = list(set([item for sublist in train_test_data[column_of_interest].values for item in sublist])) + [\"race_ethnicity\"]" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "So here's our data looks like now.\n", 439 | "\n", 440 | "````\n", 441 | " description race gender ...\n", 442 | "0 I was the victim of a hate crime. 0 1\n", 443 | "1 I also was a hate crime victim. 1 0\n", 444 | "````" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "Before we get started with actual machine learning, this is how many hate incidents of each class we have. It's generally harder to predict classes that have fewer examples. (The computer, which is quite dumb, never learns what words from the 129 disability-related reports indicate it's a disability-related report as opposed to a word that happens to be included in the report, like a city name.) That's why we don't do a great job with guessing which tips have to do with disability or gender." 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": 28, 457 | "metadata": {}, 458 | "outputs": [ 459 | { 460 | "name": "stdout", 461 | "output_type": "stream", 462 | "text": [ 463 | " disability: 127 / 3710 | 3%\n", 464 | " ethnicity: 1236 / 3710 | 33%\n", 465 | " gender: 374 / 3710 | 10%\n", 466 | " immigrant: 745 / 3710 | 20%\n", 467 | " race: 1838 / 3710 | 50%\n", 468 | " religion: 948 / 3710 | 26%\n", 469 | "sexual-orientation: 655 / 3710 | 18%\n", 470 | " race_ethnicity: 2428 / 3710\n" 471 | ] 472 | } 473 | ], 474 | "source": [ 475 | "# are any of these columns so rare as to be useless to try to predict?\n", 476 | "from itertools import groupby\n", 477 | "all_values = [item for sublist in train_test_data[\"targeted_because\"].values for item in sublist]\n", 478 | "total = len(train_test_data)\n", 479 | "for cnt, label in [(len(list(g)), k) for k, g in groupby((sorted(all_values)))]:\n", 480 | " print(\"{}: {} / {} | {}%\".format(label.rjust(18), str(cnt).rjust(len(str(total))), total, round(cnt / float(total) * 100) ))\n", 481 | "\n", 482 | "print(\"{}: {} / {}\".format(\"race_ethnicity\".rjust(18), str(len(train_test_data[train_test_data_one_hot[\"race_ethnicity\"] == 1.0])).rjust(len(str(total))), total)) " 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "## Step 4. Choosing an algorithm\n", 490 | "\n", 491 | "Naive Bayes is a simple machine-learning technique (i.e. there's no calculus) but it works well. It's what we're trying first.\n", 492 | "\n", 493 | "Later in this notebook, I'll be trying several algorithms for two reasons (a) as a learning exercise and (b) because machine learning is often such that one algorithm will mysteriously work better than others for a given task, just for idiosyncratic reasons, so it can be worthwhile to try several. Be aware that a lot of them require the data to be in different formats -- that’s step 4 -- so it requires a little extra work.\n", 494 | "\n", 495 | "I’d lean towards picking simpler algorithms over more complex ones… especially if you have relatively little data.\n", 496 | "\n", 497 | "### Here are the algorithms I tried.\n", 498 | "\n", 499 | " - Naive Bayes\n", 500 | " - a ‘vanilla’ neural net\n", 501 | " - a convolutional neural network\n", 502 | " - an LSTM neural network\n", 503 | " - Google’s NLP AutoML\n", 504 | " - Spacy’s text classification\n", 505 | "\n", 506 | "Spoiler alert: the convolutional neural net works a tiny bit better than naive Bayes, but only a touch. And it's more complicated.\n", 507 | "\n", 508 | "For each algorithm, we will do the next two steps:\n", 509 | "\n", 510 | "5. Formatting the data in the way that your chosen algorithm requires it. \n", 511 | "6. Feeding most of your data to your algorithm and perhaps waiting a few minutes." 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "## Naive Bayes\n", 519 | "\n", 520 | "This is a pretty basic classification algorithm, but it worked well in my experimentation.\n", 521 | "\n", 522 | "We're actually doing seven classifiers, one for each of those options, predicting if a given description matches `race` or not, another predicting if it matches `sexual-orientation` or not, etc." 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": 30, 528 | "metadata": {}, 529 | "outputs": [], 530 | "source": [ 531 | "from sklearn.naive_bayes import MultinomialNB, ComplementNB\n", 532 | "from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer, TfidfVectorizer\n", 533 | "from sklearn.metrics import classification_report\n", 534 | "from sklearn.model_selection import KFold\n", 535 | "from os.path import join\n", 536 | "from os import makedirs\n", 537 | "import pickle" 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "metadata": {}, 543 | "source": [ 544 | "### Step 5: Formatting the data in the way that our chosen algorithm requires it. \n", 545 | "\n", 546 | "At this point, our tips are in English. But computers can’t read! So we’re going to have to modify the data to a particular “format” for Naive Bayes.\n", 547 | "\n", 548 | "That algorithm requires words to be represented by numbers -- a process called vectorizing.\n", 549 | "\n", 550 | "I used the scikit-learn package’s TfidfVectorizer to do this, after experimenting with the HashingVectorizer and CountVectorizer. (The performance was about the same.) TfidfVectorizer transforms each tip into a list of numbers: reflecting the TF-IDF score for each token (aka word) in that tip (calculated against the entire corpus of all tips). Implicitly, each position into the list refers to an individual word -- and most of the entries in the list are 0, for words that exist in our dataset, but not in this particular tip. So a \"vectorized\" tip might look like this:\n", 551 | "\n", 552 | "```\n", 553 | "[0.1, 0, 0, 0, 0, 0.2, 0.11, 0, 0, 0]\n", 554 | "```\n", 555 | "\n", 556 | "The vectorizers also have the option to generate \"n-grams\" -- aka pairing together 2 or 3 word chunks and treating them as tokens too. For instance, the word \"my\" and the word \"country\" might not be informative about the tip of hate incident alone, but when they occur together, \"my country\" is probably a strong sign of an immigration-related incident. This tactic is often successful, but it gave worse results here.\n", 557 | "\n", 558 | "We also split our data into two groups: training data and test data. Won’t the model do better with more training data? Yes, but we keep some portion to the side, so we can evaluate how the model did, with data it wasn’t trained on (but that we know the right answers for)." 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": 31, 564 | "metadata": { 565 | "scrolled": true 566 | }, 567 | "outputs": [ 568 | { 569 | "name": "stdout", 570 | "output_type": "stream", 571 | "text": [ 572 | "what (part of) a vectorized tip looks like: \n", 573 | "[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]\n" 574 | ] 575 | } 576 | ], 577 | "source": [ 578 | "def equalize_classes(predictor, response):\n", 579 | " return SMOTE(random_state=RANDOM_SEED).fit_sample(predictor, response)\n", 580 | "\n", 581 | "\n", 582 | "train_df, test_df = train_test_split(train_test_data_one_hot, \n", 583 | " test_size=0.2, \n", 584 | " shuffle=True,\n", 585 | " random_state=RANDOM_SEED)\n", 586 | "\n", 587 | "train_features_nb = train_df[\"description\"]\n", 588 | "test_features_nb = test_df[\"description\"]\n", 589 | "\n", 590 | "vectorizer = TfidfVectorizer(ngram_range=(1,1), # 1,1 works well?\n", 591 | " # max_features=5000, # works best with max_features set to None.\n", 592 | " lowercase=True) # automatically lowercase each word.\n", 593 | "vectorizer.fit(train_features_nb)\n", 594 | "\n", 595 | "train_features_nb_vec = vectorizer.transform(train_features_nb)\n", 596 | "test_features_nb_vec = vectorizer.transform(test_features_nb)\n", 597 | "\n", 598 | "print(\"what (part of) a vectorized tip looks like: \")\n", 599 | "print(train_features_nb_vec[0].toarray()[0].tolist()[200:250])\n", 600 | "\n" 601 | ] 602 | }, 603 | { 604 | "cell_type": "markdown", 605 | "metadata": {}, 606 | "source": [ 607 | "### Step 6: Feeding most of your data to your algorithm and perhaps waiting a few minutes.\n", 608 | "\n", 609 | "Finally! Let's train our model. We'll actually train seven models, one for each of our classes.\n", 610 | "\n", 611 | "This is the part where the computer is learning. And it’s pretty simple, from your perspective. It's this line: `naivebayes_classifier.fit(train_features_nb_vec, train_labels_nb)`. \n", 612 | "\n", 613 | "For each of our classes, we have differing amounts of tips. For instance, we have 749 tips of incidents that involve immigrant status, out of 3732 total tagged tips, so 20%. This is a problem. Imagine if we had a very dumb model that predicted that nothing was immigrant-status related; it'd get 80% accuracy! So we have to \"equalize\" the imbalanced classes. We're doing that with SMOTE oversampling (but there are other options).\n", 614 | "\n" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 64, 620 | "metadata": {}, 621 | "outputs": [ 622 | { 623 | "name": "stdout", 624 | "output_type": "stream", 625 | "text": [ 626 | "Training models for ...\n", 627 | " - disability\n", 628 | " - ethnicity\n", 629 | " - gender\n", 630 | " - immigrant\n", 631 | " - race\n", 632 | " - religion\n", 633 | " - sexual-orientation\n", 634 | " - race_ethnicity\n" 635 | ] 636 | } 637 | ], 638 | "source": [ 639 | "classifiers = {}\n", 640 | "print(\"Training models for ...\")\n", 641 | "for class_of_interest in [col for col in train_test_data_one_hot.columns if col != \"description\" and col != 'targeted_because']:\n", 642 | " print(\" - \" + class_of_interest)\n", 643 | " naivebayes_classifier = MultinomialNB()\n", 644 | " train_labels_nb = train_df[class_of_interest]\n", 645 | " test_labels_nb = test_df[class_of_interest]\n", 646 | "\n", 647 | " train_features_equalized_nb_vec, train_labels_equalized_nb = equalize_classes(train_features_nb_vec, train_labels_nb)\n", 648 | "\n", 649 | " naivebayes_classifier.fit(train_features_equalized_nb_vec, train_labels_equalized_nb) # <-- TRAINING\n", 650 | "\n", 651 | " classifiers[class_of_interest] = naivebayes_classifier\n" 652 | ] 653 | }, 654 | { 655 | "cell_type": "markdown", 656 | "metadata": {}, 657 | "source": [ 658 | "### Step 6. Looking at the results and deciding if it’s good enough or not -- and if it isn’t, repeating steps 2-6 as necessary.\n", 659 | "\n", 660 | "So we're going to see how we did at the end of the next cell. But how do we know how well our model did?\n", 661 | "\n", 662 | "`area under precision-recall curve` is a \"metric\" that's good for measuring classifiers with imbalanced classes -- a dataset is imbalanced when it isn't just 50% of one class and 50% of another, but instead has, say, 26% religion-related hate incidents and thus 74% non-religion-related. It's plotting precision (avoiding false positives) against recall (avoiding false negatives). The area under that curve is a proportion; the higher the better. If the area under the precision recall curve is significantly higher than the proportion of classses in our test data, then our model has learned to make a distinction between the two classes, however imperfectly.\n", 663 | "\n", 664 | "In that example, you’d be comparing the area under the precision-recall curve to the proportion of your testing data that has the religion class -- if your model has more than 26% area under the precision-recall curve, it’s working. If it’s got a lot more than 26%, it’s working pretty well.\n", 665 | "\n", 666 | "We also show the confusion matrix, which plots the model's guesses against the right answers. Bigger numbers in the top-left and bottom-right are better; the top-right is false negatives and top-left is false positives." 667 | ] 668 | }, 669 | { 670 | "cell_type": "code", 671 | "execution_count": 65, 672 | "metadata": {}, 673 | "outputs": [ 674 | { 675 | "name": "stdout", 676 | "output_type": "stream", 677 | "text": [ 678 | "\n", 679 | "disability\n", 680 | "area under PR curve: 0.68\n", 681 | "If the AUPR score (0.6807504776362596) is more than a little bigger than the baseline (0.6563342318059299), which it *is*, then our model is working!\n", 682 | "\n", 683 | "\n", 684 | "\n", 685 | "ethnicity\n", 686 | "area under PR curve: 0.8\n", 687 | "If the AUPR score (0.8002249439051338) is more than a little bigger than the baseline (0.6563342318059299), which it *is*, then our model is working!\n", 688 | "\n", 689 | "\n", 690 | "\n", 691 | "gender\n", 692 | "area under PR curve: 0.63\n", 693 | "If the AUPR score (0.6319593009558203) is more than a little bigger than the baseline (0.6563342318059299), which it *is*, then our model is working!\n", 694 | "\n", 695 | "\n", 696 | "\n", 697 | "immigrant\n", 698 | "area under PR curve: 0.77\n", 699 | "If the AUPR score (0.7732047968744618) is more than a little bigger than the baseline (0.6563342318059299), which it *is*, then our model is working!\n", 700 | "\n", 701 | "\n", 702 | "\n", 703 | "race\n", 704 | "area under PR curve: 0.88\n", 705 | "If the AUPR score (0.8807390613874166) is more than a little bigger than the baseline (0.6563342318059299), which it *is*, then our model is working!\n", 706 | "\n", 707 | "\n", 708 | "\n", 709 | "religion\n", 710 | "area under PR curve: 0.55\n", 711 | "If the AUPR score (0.545680249173754) is more than a little bigger than the baseline (0.6563342318059299), which it *is*, then our model is working!\n", 712 | "\n", 713 | "\n", 714 | "\n", 715 | "sexual-orientation\n", 716 | "area under PR curve: 0.56\n", 717 | "If the AUPR score (0.5569133421769314) is more than a little bigger than the baseline (0.6563342318059299), which it *is*, then our model is working!\n", 718 | "\n", 719 | "\n", 720 | "\n", 721 | "race_ethnicity\n", 722 | "area under PR curve: 0.9\n", 723 | "If the AUPR score (0.9040488375018177) is more than a little bigger than the baseline (0.6563342318059299), which it *is*, then our model is working!\n", 724 | "\n", 725 | "\n" 726 | ] 727 | } 728 | ], 729 | "source": [ 730 | "for class_of_interest in [col for col in train_test_data_one_hot.columns if col != \"description\" and col != 'targeted_because']:\n", 731 | " naivebayes_classifier = classifiers[class_of_interest]\n", 732 | " predicted_probabilities_nb = naivebayes_classifier.predict_proba(test_features_nb_vec)[:,1]\n", 733 | " predicted_labels_nb = [(1.0 if proba > 0.5 else 0.0) for proba in predicted_probabilities_nb]\n", 734 | "# print(confusion_matrix(test_labels_nb, predicted_labels_nb, labels=[1., 0.]))\n", 735 | " print()\n", 736 | " \n", 737 | " pr_baseline = float(len([a for a in test_labels_nb if a]))/len(test_labels_nb)\n", 738 | " pr_score = average_precision_score(test_labels_nb, predicted_probabilities_nb)\n", 739 | " print(class_of_interest)\n", 740 | " print(\"area under PR curve: \", round(pr_score, 2))\n", 741 | " print(\"If the AUPR score ({}) is more than a little bigger than the baseline ({}), which it *{}*, then our model is working!\".format(pr_score, pr_baseline, \"is\" if pr_score - (pr_baseline * 1.1) else \"isn't\" ))\n", 742 | " print()\n", 743 | " print() " 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "You can see a chart of the precision-recall curve below. If we had a different goal, we might rather have false positive than false negatives (or vice versa); the values of precision for each possible recall goal are what is plotted here.\n" 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": 39, 756 | "metadata": {}, 757 | "outputs": [ 758 | { 759 | "data": { 760 | "image/png": "\n", 761 | "text/plain": [ 762 | "
" 763 | ] 764 | }, 765 | "metadata": { 766 | "needs_background": "light" 767 | }, 768 | "output_type": "display_data" 769 | } 770 | ], 771 | "source": [ 772 | "pr_chart(test_labels_nb, predicted_probabilities_nb)" 773 | ] 774 | }, 775 | { 776 | "cell_type": "markdown", 777 | "metadata": {}, 778 | "source": [ 779 | "## Keras Neural Nets\n", 780 | "via https://www.tensorflow.org/tutorials/keras/basic_text_classification\n", 781 | "\n", 782 | "Neural nets are very trendy and for good reason: they're very powerful and can \"learn\" patterns that are too complicated for Naive Bayes.\n", 783 | "\n", 784 | "Neural nets are a category, not an individual model algorithm. We're going to try two different ones:\n", 785 | "\n", 786 | " - a basic neural net\n", 787 | " - a convolutional neural net\n", 788 | " \n", 789 | "Both networks learn \"embeddings\" for each word in our tips. These are akin to word2vec-style vectors, but where those vectors are trained on a large general-purpose dataset, ours are trained just for this purpose (and trained on a lot less data). I tried using word2vec vectors, but it didn't work as well. The vectors are equivalent to the output of the TfidfVectorizer that we used for Naive Bayes.\n", 790 | "\n", 791 | "The convolutional neural net does better than the basic one, likely because it takes into account each word's context.\n", 792 | "\n", 793 | "(I also tried an LSTM, but I couldn't get it to work! I suspect because I don't have enough data.)" 794 | ] 795 | }, 796 | { 797 | "cell_type": "markdown", 798 | "metadata": {}, 799 | "source": [ 800 | "Here's some shared settings for both kinds of models." 801 | ] 802 | }, 803 | { 804 | "cell_type": "code", 805 | "execution_count": 45, 806 | "metadata": {}, 807 | "outputs": [], 808 | "source": [ 809 | "from __future__ import absolute_import, division, print_function\n", 810 | "WORDS_TO_KEEP = 10000 # should really be 10000\n", 811 | "tokenizer = keras.preprocessing.text.Tokenizer(num_words=WORDS_TO_KEEP)\n", 812 | "VALIDATION_SET_SIZE = 1000\n", 813 | "SHOULD_EQUALIZE = True\n", 814 | "VOCAB_SIZE = WORDS_TO_KEEP + 3\n", 815 | "MAX_SEQUENCE_LENGTH = 256" 816 | ] 817 | }, 818 | { 819 | "cell_type": "markdown", 820 | "metadata": {}, 821 | "source": [ 822 | "### Step 5: Formatting the data in the way that our chosen algorithm requires it. \n", 823 | "\n", 824 | "Unlike Naive Bayes, the neural nets take the words as a list of numbers, where each number corresponds directly to a token (aka word). Each list has to be the same length, so we have a special character for padding that gets added at the end of shorter tips. So a tip encoded for Keras might look like. `[46, 3449, 9, 172, 15, 6, 1054, 0, 0, 0 ... 0, 0]`" 825 | ] 826 | }, 827 | { 828 | "cell_type": "code", 829 | "execution_count": 46, 830 | "metadata": {}, 831 | "outputs": [ 832 | { 833 | "name": "stdout", 834 | "output_type": "stream", 835 | "text": [ 836 | "We have 16135 total words.\n" 837 | ] 838 | } 839 | ], 840 | "source": [ 841 | "# via https://www.tensorflow.org/tutorials/keras/basic_text_classification\n", 842 | "train_df, test_df = train_test_split(train_test_data_one_hot, test_size=0.2, shuffle=True, random_state=RANDOM_SEED)\n", 843 | "\n", 844 | "tokenizer.fit_on_texts(train_test_data_one_hot[\"description\"])\n", 845 | "print(\"We have {} total words.\".format(max(tokenizer.word_index.values())))\n", 846 | "\n", 847 | "word_index = {k:(v+3) for k,v in tokenizer.word_index.items()} \n", 848 | "word_index[\"\"] = 0\n", 849 | "word_index[\"\"] = 1\n", 850 | "word_index[\"\"] = 2 # unknown\n", 851 | "word_index[\"\"] = 3\n", 852 | "\n", 853 | "reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])\n", 854 | "\n", 855 | "def decode_text(text):\n", 856 | " return ' '.join([reverse_word_index.get(i, '?') for i in text])\n", 857 | "def encode_texts(texts):\n", 858 | " return [[word_index['']] + [idx + 3 for idx in list(idxs)] for idxs in tokenizer.texts_to_sequences(texts)[:]]\n", 859 | "def encode_text(text):\n", 860 | " return encode_texts([text])[0]" 861 | ] 862 | }, 863 | { 864 | "cell_type": "markdown", 865 | "metadata": {}, 866 | "source": [ 867 | "These are some helper methods we'll use for all the kinds of neural nets. The `train_keras_model` method combines steps 4-6: preparing the input data, training the model, and printing out evaluation stats." 868 | ] 869 | }, 870 | { 871 | "cell_type": "code", 872 | "execution_count": 50, 873 | "metadata": { 874 | "scrolled": true 875 | }, 876 | "outputs": [], 877 | "source": [ 878 | "def equalize_classes_keras(predictor, response):\n", 879 | " return SMOTE(random_state=RANDOM_SEED).fit_sample(predictor, response)\n", 880 | "\n", 881 | "def train_keras_model(model_fn, train_df, test_df, epochs=40, should_equalize=SHOULD_EQUALIZE, classes_of_interest=None):\n", 882 | " # several of the algorithms we test here make use of randomness. and we split the data into train/test groups randomly.\n", 883 | " # in order to make sure that every time we run this notebook, we get the same results (rather than a\n", 884 | " # \"good\" split making one algorithm choice seem better), we set an arbitrary number (1234) as the seed for all the \n", 885 | " # random number generators.\n", 886 | " keras.backend.clear_session()\n", 887 | " np.random.seed(RANDOM_SEED)\n", 888 | " random.seed(RANDOM_SEED)\n", 889 | " tf.set_random_seed(RANDOM_SEED)\n", 890 | " \n", 891 | " histories = {}\n", 892 | "\n", 893 | " # preparing the input data.\n", 894 | " train_data = np.array(encode_texts(train_df[\"description\"]))\n", 895 | " test_data = np.array(encode_texts(test_df[\"description\"]))\n", 896 | " train_data = keras.preprocessing.sequence.pad_sequences(train_data,\n", 897 | " value=word_index[\"\"],\n", 898 | " padding='post',\n", 899 | " maxlen=256)\n", 900 | " test_data = keras.preprocessing.sequence.pad_sequences(test_data,\n", 901 | " value=word_index[\"\"],\n", 902 | " padding='post',\n", 903 | " maxlen=256)\n", 904 | " for class_of_interest in classes_of_interest:\n", 905 | " train_labels = train_df[class_of_interest]\n", 906 | " test_labels = test_df[class_of_interest]\n", 907 | "\n", 908 | " if should_equalize:\n", 909 | " equalized_train_data, equalized_train_labels = equalize_classes_keras(train_data, train_labels)\n", 910 | " else:\n", 911 | " equalized_train_data = train_data.copy()\n", 912 | " equalized_train_labels = train_labels.copy()\n", 913 | "\n", 914 | " x_val = equalized_train_data[:VALIDATION_SET_SIZE]\n", 915 | " partial_x_train = equalized_train_data[VALIDATION_SET_SIZE:]\n", 916 | "\n", 917 | " y_val = equalized_train_labels[:VALIDATION_SET_SIZE]\n", 918 | " partial_y_train = equalized_train_labels[VALIDATION_SET_SIZE:]\n", 919 | " model = model_fn()\n", 920 | " print(class_of_interest)\n", 921 | " history = model.fit(partial_x_train,\n", 922 | " partial_y_train,\n", 923 | " epochs=epochs,\n", 924 | " batch_size=256,\n", 925 | " validation_data=(x_val, y_val),\n", 926 | " verbose=0\n", 927 | " )\n", 928 | " results = model.evaluate(test_data, test_labels)\n", 929 | " histories[class_of_interest] = history\n", 930 | " predicted_probabilities = model.predict(test_data)\n", 931 | " predicted_labels = [1.0 if proba > 0.5 else 0.0 for proba in predicted_probabilities]\n", 932 | " # print(confusion_matrix(test_labels, predicted_labels, labels=[1., 0.]))\n", 933 | " pr_score = average_precision_score(test_labels, predicted_probabilities)\n", 934 | " pr_baseline = float(len([a for a in test_labels if a]))/len(test_labels)\n", 935 | " print(\"Area under PR curve: \", round(pr_score, 2))\n", 936 | " print(\"If the AUPR score ({}) is more than a little bigger than the baseline ({}), which it *{}*, then our model is working!\".format(round(pr_score, 2), round(pr_baseline, 2), \"is\" if pr_score - (pr_baseline * 1.1) else \"isn't\" ))\n", 937 | "\n", 938 | " print()\n", 939 | " print()\n", 940 | " return [histories, test_labels, predicted_probabilities]" 941 | ] 942 | }, 943 | { 944 | "cell_type": "code", 945 | "execution_count": 51, 946 | "metadata": {}, 947 | "outputs": [], 948 | "source": [ 949 | "def embedding_layer(word2vec=False):\n", 950 | " if word2vec: \n", 951 | " EMBEDDING_DIM = 200\n", 952 | " from gensim.models import Word2Vec\n", 953 | " w2v = Word2Vec.load(\"my_word2vec_model.bin\")\n", 954 | " embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))\n", 955 | " for word, i in word_index.items():\n", 956 | " embedding_vector = w2v[word.lower()] if word in w2v else None\n", 957 | " if embedding_vector is not None:\n", 958 | " # words not found in embedding index will be all-zeros.\n", 959 | " embedding_matrix[i] = embedding_vector\n", 960 | "\n", 961 | " return keras.layers.Embedding(len(word_index) + 1,\n", 962 | " EMBEDDING_DIM,\n", 963 | " weights=[embedding_matrix],\n", 964 | " input_length=MAX_SEQUENCE_LENGTH,\n", 965 | " trainable=False) # \n", 966 | " else: \n", 967 | " return keras.layers.Embedding(VOCAB_SIZE, 16)\n" 968 | ] 969 | }, 970 | { 971 | "cell_type": "markdown", 972 | "metadata": {}, 973 | "source": [ 974 | "### Basic Keras NN\n", 975 | "\n", 976 | "This is a very basic neural net. " 977 | ] 978 | }, 979 | { 980 | "cell_type": "code", 981 | "execution_count": 54, 982 | "metadata": {}, 983 | "outputs": [], 984 | "source": [ 985 | "def basic_model():\n", 986 | " model = keras.Sequential()\n", 987 | " model.add(embedding_layer())\n", 988 | " model.add(keras.layers.GlobalAveragePooling1D())\n", 989 | " model.add(keras.layers.Dense(16, activation=tf.nn.relu))\n", 990 | " model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))\n", 991 | " model.compile(optimizer='adam',\n", 992 | " loss='binary_crossentropy',\n", 993 | " metrics=['accuracy'])\n", 994 | " # model.summary()\n", 995 | " return model" 996 | ] 997 | }, 998 | { 999 | "cell_type": "markdown", 1000 | "metadata": {}, 1001 | "source": [ 1002 | "As you can see, this model doesn't do as well as Naive Bayes. It also takes a lot longer to train, hence the progress bars." 1003 | ] 1004 | }, 1005 | { 1006 | "cell_type": "code", 1007 | "execution_count": 55, 1008 | "metadata": {}, 1009 | "outputs": [ 1010 | { 1011 | "name": "stdout", 1012 | "output_type": "stream", 1013 | "text": [ 1014 | "742/742 [==============================] - 0s 32us/sample - loss: 0.7459 - acc: 0.2615\n", 1015 | "religion\n", 1016 | "Area under PR curve: 0.34\n", 1017 | "PR Baseline : 0.24123989218328842\n", 1018 | "If the AUPR score (0.34) is more than a little bigger than the baseline (0.24), which it *is*, then our model is working!\n", 1019 | "\n", 1020 | "\n", 1021 | "742/742 [==============================] - 0s 43us/sample - loss: 0.7171 - acc: 0.3518\n", 1022 | "ethnicity\n", 1023 | "Area under PR curve: 0.43\n", 1024 | "PR Baseline : 0.35175202156334234\n", 1025 | "If the AUPR score (0.43) is more than a little bigger than the baseline (0.35), which it *is*, then our model is working!\n", 1026 | "\n", 1027 | "\n", 1028 | "742/742 [==============================] - 0s 35us/sample - loss: 0.8048 - acc: 0.1685\n", 1029 | "sexual-orientation\n", 1030 | "Area under PR curve: 0.17\n", 1031 | "PR Baseline : 0.16711590296495957\n", 1032 | "If the AUPR score (0.17) is more than a little bigger than the baseline (0.17), which it *is*, then our model is working!\n", 1033 | "\n", 1034 | "\n", 1035 | "742/742 [==============================] - 0s 84us/sample - loss: 0.8311 - acc: 0.1105\n", 1036 | "gender\n", 1037 | "Area under PR curve: 0.1\n", 1038 | "PR Baseline : 0.1105121293800539\n", 1039 | "If the AUPR score (0.1) is more than a little bigger than the baseline (0.11), which it *is*, then our model is working!\n", 1040 | "\n", 1041 | "\n", 1042 | "742/742 [==============================] - 0s 53us/sample - loss: 0.8521 - acc: 0.0283\n", 1043 | "disability\n", 1044 | "Area under PR curve: 0.03\n", 1045 | "PR Baseline : 0.02830188679245283\n", 1046 | "If the AUPR score (0.03) is more than a little bigger than the baseline (0.03), which it *is*, then our model is working!\n", 1047 | "\n", 1048 | "\n", 1049 | "742/742 [==============================] - 0s 51us/sample - loss: 0.7844 - acc: 0.2143\n", 1050 | "immigrant\n", 1051 | "Area under PR curve: 0.29\n", 1052 | "PR Baseline : 0.21428571428571427\n", 1053 | "If the AUPR score (0.29) is more than a little bigger than the baseline (0.21), which it *is*, then our model is working!\n", 1054 | "\n", 1055 | "\n", 1056 | "742/742 [==============================] - 0s 42us/sample - loss: 0.6918 - acc: 0.5418\n", 1057 | "race\n", 1058 | "Area under PR curve: 0.56\n", 1059 | "PR Baseline : 0.48517520215633425\n", 1060 | "If the AUPR score (0.56) is more than a little bigger than the baseline (0.49), which it *is*, then our model is working!\n", 1061 | "\n", 1062 | "\n", 1063 | "742/742 [==============================] - 0s 51us/sample - loss: 0.7289 - acc: 0.4111\n", 1064 | "race_ethnicity\n", 1065 | "Area under PR curve: 0.7\n", 1066 | "PR Baseline : 0.6563342318059299\n", 1067 | "If the AUPR score (0.7) is more than a little bigger than the baseline (0.66), which it *is*, then our model is working!\n", 1068 | "\n", 1069 | "\n" 1070 | ] 1071 | } 1072 | ], 1073 | "source": [ 1074 | "histories, test_labels, predicted_probas = train_keras_model(basic_model, train_df, test_df, epochs=5, classes_of_interest=list(set(all_values)) + ['race_ethnicity'])" 1075 | ] 1076 | }, 1077 | { 1078 | "cell_type": "markdown", 1079 | "metadata": {}, 1080 | "source": [ 1081 | "## Keras CNN:\n", 1082 | "\n" 1083 | ] 1084 | }, 1085 | { 1086 | "cell_type": "markdown", 1087 | "metadata": {}, 1088 | "source": [ 1089 | "### Step 6. Looking at the results and deciding if it’s good enough or not -- and if it isn’t, repeating steps 2-6 as necessary.\n", 1090 | "\n", 1091 | "Convolutional neural nets were revolutionary in how they improved neural nets' performance with text by examining several words at once. Other people undoubtedly would do a better job explaining how that works!\n", 1092 | "\n", 1093 | "While the performance of CNNs can be great, they have a LOT of settings that you can fiddle with to change how well they do. Learning rates, embedding dimensions, the number of epochs, etc. And they even depend on starting off with a certain amount of randomness, so running the same model twice on the same data (if you're not careful) can get you different results! \n", 1094 | "\n", 1095 | "Other tutorials will do you a better job explaining how to do that too, though below I discuss using a \"grid search\" to try to help.\n", 1096 | "\n", 1097 | "### repeatability\n", 1098 | "\n", 1099 | "However, I went down the rabbithole of making my runs repeatable: this is really hard becuase there are several sources of randomness and Jupyter notebooks all happen within the same Python session. In the process, I learned something important about _keras_ sessions -- a model is built within a session, so when you train a model twice within the same session, you're really just (it seems?) training the same model at double the epochs. So to make stuff repeatable, we need to re-set the random seeds AND start a new session AND re-declare the model.\n", 1100 | "```\n", 1101 | " keras.backend.clear_session()\n", 1102 | " np.random.seed(RANDOM_SEED)\n", 1103 | " random.seed(RANDOM_SEED)\n", 1104 | " tf.set_random_seed(RANDOM_SEED)\n", 1105 | "```" 1106 | ] 1107 | }, 1108 | { 1109 | "cell_type": "code", 1110 | "execution_count": 76, 1111 | "metadata": {}, 1112 | "outputs": [], 1113 | "source": [ 1114 | "def cnn_model(learning_rate=0.001, dropout_embedding=0.0, dropout1=0.2, dropout2=0.2, dropout3=0.2, embedding_dim=32, num_filters=32, kernel_size=5 ):\n", 1115 | " keras.backend.clear_session()\n", 1116 | "\n", 1117 | " adam = keras.optimizers.Adam(lr=learning_rate) # default lr = 0.001\n", 1118 | "\n", 1119 | " model = keras.Sequential()\n", 1120 | " model.add(keras.layers.Embedding(VOCAB_SIZE, embedding_dim, input_length=MAX_SEQUENCE_LENGTH))\n", 1121 | " # dropout here doesn't help.\n", 1122 | " # model.add(keras.layers.Dropout(dropout_embedding))\n", 1123 | " model.add(keras.layers.Conv1D(num_filters, kernel_size, activation='relu'))\n", 1124 | " model.add(keras.layers.Dropout(dropout1)) # 0.4 works better than 0.5 here (but not dramatically); 0.3 does better than 0.4 maybe?, 0.2 actually does better than both (and better than NB, at 86/92)\n", 1125 | " model.add(keras.layers.GlobalMaxPooling1D())\n", 1126 | " model.add(keras.layers.Dropout(dropout2))\n", 1127 | " model.add(keras.layers.Dense(10, activation='relu'))\n", 1128 | " model.add(keras.layers.Dropout(dropout3))\n", 1129 | " model.add(keras.layers.Dense(1, activation='sigmoid'))\n", 1130 | " model.compile(optimizer=adam,\n", 1131 | " loss='binary_crossentropy',\n", 1132 | " metrics=['accuracy'])\n", 1133 | " # model.summary()\n", 1134 | " return model" 1135 | ] 1136 | }, 1137 | { 1138 | "cell_type": "code", 1139 | "execution_count": 77, 1140 | "metadata": {}, 1141 | "outputs": [ 1142 | { 1143 | "name": "stdout", 1144 | "output_type": "stream", 1145 | "text": [ 1146 | "742/742 [==============================] - 0s 165us/sample - loss: 0.3474 - acc: 0.8747\n", 1147 | "religion\n", 1148 | "Area under PR curve: 0.77\n", 1149 | "PR Baseline : 0.24123989218328842\n", 1150 | "If the AUPR score (0.77) is more than a little bigger than the baseline (0.24), which it *is*, then our model is working!\n", 1151 | "\n", 1152 | "\n", 1153 | "742/742 [==============================] - 0s 170us/sample - loss: 0.5497 - acc: 0.7480\n", 1154 | "ethnicity\n", 1155 | "Area under PR curve: 0.66\n", 1156 | "PR Baseline : 0.35175202156334234\n", 1157 | "If the AUPR score (0.66) is more than a little bigger than the baseline (0.35), which it *is*, then our model is working!\n", 1158 | "\n", 1159 | "\n", 1160 | "742/742 [==============================] - 0s 164us/sample - loss: 0.2411 - acc: 0.9272\n", 1161 | "sexual-orientation\n", 1162 | "Area under PR curve: 0.79\n", 1163 | "PR Baseline : 0.16711590296495957\n", 1164 | "If the AUPR score (0.79) is more than a little bigger than the baseline (0.17), which it *is*, then our model is working!\n", 1165 | "\n", 1166 | "\n", 1167 | "742/742 [==============================] - 0s 203us/sample - loss: 0.3429 - acc: 0.8895\n", 1168 | "gender\n", 1169 | "Area under PR curve: 0.2\n", 1170 | "PR Baseline : 0.1105121293800539\n", 1171 | "If the AUPR score (0.2) is more than a little bigger than the baseline (0.11), which it *is*, then our model is working!\n", 1172 | "\n", 1173 | "\n", 1174 | "742/742 [==============================] - 0s 115us/sample - loss: 0.1347 - acc: 0.9717\n", 1175 | "disability\n", 1176 | "Area under PR curve: 0.11\n", 1177 | "PR Baseline : 0.02830188679245283\n", 1178 | "If the AUPR score (0.11) is more than a little bigger than the baseline (0.03), which it *is*, then our model is working!\n", 1179 | "\n", 1180 | "\n", 1181 | "742/742 [==============================] - 0s 94us/sample - loss: 0.4530 - acc: 0.7857\n", 1182 | "immigrant\n", 1183 | "Area under PR curve: 0.48\n", 1184 | "PR Baseline : 0.21428571428571427\n", 1185 | "If the AUPR score (0.48) is more than a little bigger than the baseline (0.21), which it *is*, then our model is working!\n", 1186 | "\n", 1187 | "\n", 1188 | "742/742 [==============================] - 0s 158us/sample - loss: 0.5393 - acc: 0.7385\n", 1189 | "race\n", 1190 | "Area under PR curve: 0.8\n", 1191 | "PR Baseline : 0.48517520215633425\n", 1192 | "If the AUPR score (0.8) is more than a little bigger than the baseline (0.49), which it *is*, then our model is working!\n", 1193 | "\n", 1194 | "\n", 1195 | "742/742 [==============================] - 0s 116us/sample - loss: 0.4752 - acc: 0.7682\n", 1196 | "race_ethnicity\n", 1197 | "Area under PR curve: 0.9\n", 1198 | "PR Baseline : 0.6563342318059299\n", 1199 | "If the AUPR score (0.9) is more than a little bigger than the baseline (0.66), which it *is*, then our model is working!\n", 1200 | "\n", 1201 | "\n" 1202 | ] 1203 | } 1204 | ], 1205 | "source": [ 1206 | "histories_cnn, test_labels_cnn, predicted_probas_cnn = train_keras_model(cnn_model, train_df, test_df, epochs=20, should_equalize=False, classes_of_interest=(list(set(all_values))) + [\"race_ethnicity\"]) # " 1207 | ] 1208 | }, 1209 | { 1210 | "cell_type": "raw", 1211 | "metadata": {}, 1212 | "source": [ 1213 | "\n", 1214 | "\n", 1215 | "### dropout (and epochs)\n", 1216 | "\n", 1217 | "The goal of a model is to learn how to _generalize_. We're going to use a chart of training loss versus validation loss to help us understand if the model is generalizing, i.e. if it's doing the right thing. If it's not, then we can know which settings to fiddle with.\n", 1218 | "\n", 1219 | "Suppose many of the religion-related tips in our training data mention grocery stores. That's not, logically, true of ALL religion-related hate incidents; it's just a coincidence that that's how our training data got chosen. We want the model to correctly learn the real signs of what makes a hate incident related to religion, but not get confused by the coincidental co-occurrence of grocery stores with religion-related tips. If the model has generalized, it's learned what _really_ distinguishes religion-related tips. But if it focuses wrongfully on grocery stores, then it is \"overfit\" -- that is, focused on patterns that are true in the training data, but not the patterns that are true in real life.\n", 1220 | "\n", 1221 | "The way neural nets handle this problem is by having a \"validation set\" -- chosen like the training data, but not used for training. Unless we're really unlucky, the religion-related tips in the validation set won't mention grocery stores... but they will mention the true signs of religion-related hate incidents (mosques, synagogues, etc.). \n", 1222 | "\n", 1223 | "In essence, if performance on the validation set is close to performance on the training set, then we're doing good. If it's much worse, then the model is either undertrained or overfit.\n", 1224 | "\n", 1225 | "The solution to undertraining is to train more. We do that by increasing the number of epochs. We know if the model is undertrained is if the \"validation loss\" is steadily decreasing on the chart below.\n", 1226 | "\n", 1227 | "The solution to overfitting is to make the model more \"forgetful\" -- we want the model to learn multiple ways of determining what's a religion-related hate crime. Not just the names of houses of worship (and grocery stores, the spurious indicator) but other words that provide some signal. To do that, we force the model to ignore some things with a \"dropout layer\". Increasing the amount of dropout in the model's dropout layers lets it forget more and is one possible solution to overfitting.\n", 1228 | "\n", 1229 | "You can read more about this at these links:\n", 1230 | "\n", 1231 | " - https://stats.stackexchange.com/questions/187335/validation-error-less-than-training-error/187404#187404\n", 1232 | " - https://forums.fast.ai/t/determining-when-you-are-overfitting-underfitting-or-just-right/7732/6" 1233 | ] 1234 | }, 1235 | { 1236 | "cell_type": "code", 1237 | "execution_count": 78, 1238 | "metadata": {}, 1239 | "outputs": [ 1240 | { 1241 | "data": { 1242 | "image/png": "\n", 1243 | "text/plain": [ 1244 | "
" 1245 | ] 1246 | }, 1247 | "metadata": { 1248 | "needs_background": "light" 1249 | }, 1250 | "output_type": "display_data" 1251 | }, 1252 | { 1253 | "data": { 1254 | "image/png": "\n", 1255 | "text/plain": [ 1256 | "
" 1257 | ] 1258 | }, 1259 | "metadata": { 1260 | "needs_background": "light" 1261 | }, 1262 | "output_type": "display_data" 1263 | } 1264 | ], 1265 | "source": [ 1266 | "training_and_validation_loss(histories_cnn['race_ethnicity'])\n", 1267 | "training_and_validation_accuracy(histories_cnn['race_ethnicity'])" 1268 | ] 1269 | }, 1270 | { 1271 | "cell_type": "markdown", 1272 | "metadata": {}, 1273 | "source": [ 1274 | "As you can see in the charts above, validation loss has begun to flatten out. We don't need more epochs. Since the training loss is very low, but validation loss is higher, that means we may be overfit." 1275 | ] 1276 | }, 1277 | { 1278 | "cell_type": "code", 1279 | "execution_count": 80, 1280 | "metadata": {}, 1281 | "outputs": [ 1282 | { 1283 | "data": { 1284 | "image/png": "\n", 1285 | "text/plain": [ 1286 | "
" 1287 | ] 1288 | }, 1289 | "metadata": { 1290 | "needs_background": "light" 1291 | }, 1292 | "output_type": "display_data" 1293 | } 1294 | ], 1295 | "source": [ 1296 | "pr_chart(test_labels_cnn, predicted_probas_cnn)" 1297 | ] 1298 | }, 1299 | { 1300 | "cell_type": "markdown", 1301 | "metadata": {}, 1302 | "source": [ 1303 | "## Keras LSTM:\n", 1304 | "\n", 1305 | "LSTMs (short for Long Short-Term Memory) are another kind of neural network that works well on text. But for some reason, I couldn't get it to do much at all for this dataset. I got an area under PR curve of 0.66, with a baseline of 0.657...\n", 1306 | "\n", 1307 | "In other words, the model was just guessing. \n", 1308 | "\n", 1309 | "I don't know why. If you do, let me know?\n" 1310 | ] 1311 | }, 1312 | { 1313 | "cell_type": "code", 1314 | "execution_count": 82, 1315 | "metadata": {}, 1316 | "outputs": [], 1317 | "source": [ 1318 | "def lstm_model(learning_rate=0.001):\n", 1319 | " keras.backend.clear_session()\n", 1320 | "\n", 1321 | " adam = keras.optimizers.Adam(lr=learning_rate) # default lr = 0.001\n", 1322 | "\n", 1323 | " model = keras.Sequential()\n", 1324 | " model.add(\n", 1325 | " keras.layers.Embedding(VOCAB_SIZE, # vocab size\n", 1326 | " 32, # output size; embedding dimensions\n", 1327 | " input_length=MAX_SEQUENCE_LENGTH\n", 1328 | " ))\n", 1329 | " model.add(keras.layers.Dropout(0.2))\n", 1330 | " model.add(keras.layers.LSTM(100))\n", 1331 | " model.add(keras.layers.Dropout(0.2))\n", 1332 | "\n", 1333 | " model.add(keras.layers.Dense(1, activation='sigmoid')) # with just one dense layer, race_ethnicity ROC 0.69\n", 1334 | " model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])\n", 1335 | " model.summary()\n", 1336 | " return model" 1337 | ] 1338 | }, 1339 | { 1340 | "cell_type": "code", 1341 | "execution_count": 83, 1342 | "metadata": {}, 1343 | "outputs": [ 1344 | { 1345 | "name": "stdout", 1346 | "output_type": "stream", 1347 | "text": [ 1348 | "_________________________________________________________________\n", 1349 | "Layer (type) Output Shape Param # \n", 1350 | "=================================================================\n", 1351 | "embedding (Embedding) (None, 256, 32) 320096 \n", 1352 | "_________________________________________________________________\n", 1353 | "dropout (Dropout) (None, 256, 32) 0 \n", 1354 | "_________________________________________________________________\n", 1355 | "lstm (LSTM) (None, 100) 53200 \n", 1356 | "_________________________________________________________________\n", 1357 | "dropout_1 (Dropout) (None, 100) 0 \n", 1358 | "_________________________________________________________________\n", 1359 | "dense (Dense) (None, 1) 101 \n", 1360 | "=================================================================\n", 1361 | "Total params: 373,397\n", 1362 | "Trainable params: 373,397\n", 1363 | "Non-trainable params: 0\n", 1364 | "_________________________________________________________________\n", 1365 | "742/742 [==============================] - 2s 2ms/sample - loss: 0.7263 - acc: 0.4084\n", 1366 | "race_ethnicity\n", 1367 | "Area under PR curve: 0.69\n", 1368 | "PR Baseline : 0.6563342318059299\n", 1369 | "If the AUPR score (0.69) is more than a little bigger than the baseline (0.66), which it *is*, then our model is working!\n", 1370 | "\n", 1371 | "\n" 1372 | ] 1373 | } 1374 | ], 1375 | "source": [ 1376 | "histories, test_labels_lstm, predicted_probas_lstm = train_keras_model(lstm_model, train_df, test_df, epochs=5, should_equalize=True, classes_of_interest=[\"race_ethnicity\"])" 1377 | ] 1378 | }, 1379 | { 1380 | "cell_type": "code", 1381 | "execution_count": 84, 1382 | "metadata": {}, 1383 | "outputs": [ 1384 | { 1385 | "data": { 1386 | "image/png": "\n", 1387 | "text/plain": [ 1388 | "
" 1389 | ] 1390 | }, 1391 | "metadata": { 1392 | "needs_background": "light" 1393 | }, 1394 | "output_type": "display_data" 1395 | }, 1396 | { 1397 | "data": { 1398 | "image/png": "\n", 1399 | "text/plain": [ 1400 | "
" 1401 | ] 1402 | }, 1403 | "metadata": { 1404 | "needs_background": "light" 1405 | }, 1406 | "output_type": "display_data" 1407 | } 1408 | ], 1409 | "source": [ 1410 | "training_and_validation_loss(histories['race_ethnicity'])\n", 1411 | "training_and_validation_accuracy(histories['race_ethnicity'])" 1412 | ] 1413 | }, 1414 | { 1415 | "cell_type": "markdown", 1416 | "metadata": {}, 1417 | "source": [ 1418 | "## Spacy Text Categorization\n", 1419 | "https://spacy.io/usage/examples#textcat\n", 1420 | "It's a CNN, but with smart defaults for a lot of the options." 1421 | ] 1422 | }, 1423 | { 1424 | "cell_type": "code", 1425 | "execution_count": 85, 1426 | "metadata": {}, 1427 | "outputs": [ 1428 | { 1429 | "name": "stdout", 1430 | "output_type": "stream", 1431 | "text": [ 1432 | "Using 2968 examples (2374 training, 594 evaluation)\n", 1433 | "Training the model...\n", 1434 | "LOSS \t P \t R \t F \n", 1435 | "1 . 128.089\t0.726\t0.908\t0.807\n", 1436 | "2 . 64.358\t0.795\t0.847\t0.820\n", 1437 | "3 . 39.252\t0.789\t0.858\t0.822\n", 1438 | "4 . 27.853\t0.802\t0.853\t0.827\n", 1439 | "5 . 19.732\t0.801\t0.868\t0.833\n", 1440 | "6 . 15.693\t0.801\t0.858\t0.828\n", 1441 | "7 . 12.891\t0.795\t0.868\t0.830\n", 1442 | "8 . 10.547\t0.800\t0.866\t0.832\n", 1443 | "9 . 8.722\t0.804\t0.866\t0.834\n", 1444 | "10. 8.085\t0.813\t0.847\t0.830\n", 1445 | "11. 7.108\t0.815\t0.868\t0.841\n", 1446 | "12. 7.256\t0.821\t0.845\t0.833\n", 1447 | "13. 6.614\t0.820\t0.850\t0.835\n", 1448 | "14. 6.221\t0.815\t0.837\t0.826\n", 1449 | "15. 5.845\t0.812\t0.839\t0.825\n", 1450 | "16. 7.384\t0.807\t0.845\t0.825\n", 1451 | "17. 6.187\t0.806\t0.853\t0.829\n", 1452 | "18. 7.135\t0.797\t0.858\t0.826\n", 1453 | "19. 5.674\t0.802\t0.863\t0.831\n", 1454 | "20. 5.625\t0.811\t0.847\t0.829\n", 1455 | "Saved model to data/spacy\n" 1456 | ] 1457 | } 1458 | ], 1459 | "source": [ 1460 | "from spacy.util import minibatch, compounding\n", 1461 | "from pathlib import Path\n", 1462 | "\n", 1463 | "nlp_textcat = spacy.load('en_core_web_lg')\n", 1464 | "\n", 1465 | "SHOULD_EQUALIZE_SPACY = True\n", 1466 | "VALIDATION_SET_PERCENTAGE = 0.2\n", 1467 | "output_dir = 'data/spacy/'\n", 1468 | "n_iter = 20 # 5 might be plenty?\n", 1469 | "COLUMN_OF_INTEREST_SPACY = \"race_ethnicity\"\n", 1470 | "\n", 1471 | "train_df, test_df = train_test_split(train_test_data_one_hot, test_size=0.2, shuffle=True)\n", 1472 | "\n", 1473 | "def load_data(limit=0, split=0.8, column=COLUMN_OF_INTEREST_SPACY):\n", 1474 | " \"\"\"Load data from the IMDB dataset.\"\"\"\n", 1475 | " # Partition off part of the train data for evaluation\n", 1476 | " train_data = train_df[[\"description\", column]]\n", 1477 | " train_data = train_data.sample(frac=1).reset_index(drop=True)\n", 1478 | " train_data = train_data[-limit:]\n", 1479 | " texts, labels = zip(*train_data.values)\n", 1480 | " cats = [{'POSITIVE-{}'.format(column): bool(y)} for y in labels]\n", 1481 | " split = int(len(train_data) * split)\n", 1482 | " return (texts[:split], cats[:split]), (texts[split:], cats[split:])\n", 1483 | "\n", 1484 | "def evaluate(tokenizer, textcat, texts, cats):\n", 1485 | " docs = (tokenizer(text) for text in texts)\n", 1486 | " tp = 0.0 # True positives\n", 1487 | " fp = 1e-8 # False positives\n", 1488 | " fn = 1e-8 # False negatives\n", 1489 | " tn = 0.0 # True negatives\n", 1490 | " for i, doc in enumerate(textcat.pipe(docs)):\n", 1491 | " gold = cats[i]\n", 1492 | " for label, score in doc.cats.items():\n", 1493 | " if label not in gold:\n", 1494 | " continue\n", 1495 | " if score >= 0.5 and gold[label] >= 0.5:\n", 1496 | " tp += 1.\n", 1497 | " elif score >= 0.5 and gold[label] < 0.5:\n", 1498 | " fp += 1.\n", 1499 | " elif score < 0.5 and gold[label] < 0.5:\n", 1500 | " tn += 1\n", 1501 | " elif score < 0.5 and gold[label] >= 0.5:\n", 1502 | " fn += 1\n", 1503 | " precision = tp / (tp + fp)\n", 1504 | " recall = tp / (tp + fn)\n", 1505 | " f_score = 2 * (precision * recall) / (precision + recall)\n", 1506 | " return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}\n", 1507 | "\n", 1508 | "\n", 1509 | "\n", 1510 | "if output_dir is not None:\n", 1511 | " output_dir = Path(output_dir)\n", 1512 | " if not output_dir.exists():\n", 1513 | " output_dir.mkdir()\n", 1514 | "\n", 1515 | "# add the text classifier to the pipeline if it doesn't exist\n", 1516 | "# nlp.create_pipe works for built-ins that are registered with spaCy\n", 1517 | "if 'textcat' not in nlp_textcat.pipe_names:\n", 1518 | " textcat = nlp_textcat.create_pipe('textcat')\n", 1519 | " nlp_textcat.add_pipe(textcat, last=True)\n", 1520 | "# otherwise, get it, so we can add labels to it\n", 1521 | "else:\n", 1522 | " textcat = nlp_textcat.get_pipe('textcat')\n", 1523 | "\n", 1524 | "# add label to text classifier\n", 1525 | "label_name = 'POSITIVE-{}'.format(COLUMN_OF_INTEREST_SPACY)\n", 1526 | "textcat.add_label(label_name)\n", 1527 | "\n", 1528 | "(train_texts, train_cats), (dev_texts, dev_cats) = load_data(split=1.0-VALIDATION_SET_PERCENTAGE)\n", 1529 | "print(\"Using {} examples ({} training, {} evaluation)\"\n", 1530 | " .format(len(train_texts) + len(dev_texts), len(train_texts), len(dev_texts)))\n", 1531 | "train_data = list(zip(train_texts,\n", 1532 | " [{'cats': cats} for cats in train_cats]))\n", 1533 | "\n", 1534 | "# get names of other pipes to disable them during training\n", 1535 | "other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']\n", 1536 | "with nlp_textcat.disable_pipes(*other_pipes): # only train textcat\n", 1537 | " optimizer = nlp_textcat.begin_training()\n", 1538 | " print(\"Training the model...\")\n", 1539 | " print('{:^5}\\t{:^5}\\t{:^5}\\t{:^5}'.format('LOSS', 'P', 'R', 'F'))\n", 1540 | " for i in range(n_iter):\n", 1541 | " losses = {}\n", 1542 | " # batch up the examples using spaCy's minibatch\n", 1543 | " batches = minibatch(train_data, size=compounding(4., 32., 1.001))\n", 1544 | " for batch in batches:\n", 1545 | " texts, annotations = zip(*batch)\n", 1546 | " nlp_textcat.update(texts, annotations, sgd=optimizer, drop=0.2,\n", 1547 | " losses=losses)\n", 1548 | " with textcat.model.use_params(optimizer.averages):\n", 1549 | " # evaluate on the dev data split off in load_data()\n", 1550 | " scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)\n", 1551 | " print('{4: <2}. {0:.3f}\\t{1:.3f}\\t{2:.3f}\\t{3:.3f}' # print a simple table\n", 1552 | " .format(losses['textcat'], scores['textcat_p'],\n", 1553 | " scores['textcat_r'], scores['textcat_f'], i+1))\n", 1554 | "\n", 1555 | "\n", 1556 | "if output_dir is not None:\n", 1557 | " with nlp_textcat.use_params(optimizer.averages):\n", 1558 | " nlp_textcat.to_disk(output_dir)\n", 1559 | " print(\"Saved model to\", output_dir)" 1560 | ] 1561 | }, 1562 | { 1563 | "cell_type": "code", 1564 | "execution_count": 116, 1565 | "metadata": {}, 1566 | "outputs": [ 1567 | { 1568 | "data": { 1569 | "application/vnd.jupyter.widget-view+json": { 1570 | "model_id": "1e117bcf60e24c12bc8310369969c663", 1571 | "version_major": 2, 1572 | "version_minor": 0 1573 | }, 1574 | "text/plain": [ 1575 | "HBox(children=(IntProgress(value=0, max=747), HTML(value='')))" 1576 | ] 1577 | }, 1578 | "metadata": {}, 1579 | "output_type": "display_data" 1580 | }, 1581 | { 1582 | "name": "stdout", 1583 | "output_type": "stream", 1584 | "text": [ 1585 | "\n", 1586 | "spaCy Area under ROC curve: 0.8214486024989873\n", 1587 | "spaCy Area under PR curve: 0.8822858437821977\n" 1588 | ] 1589 | } 1590 | ], 1591 | "source": [ 1592 | "from tqdm import tqdm_notebook as tqdm\n", 1593 | "predicted_probabilities = []\n", 1594 | "for row in tqdm(test_df[[\"description\", COLUMN_OF_INTEREST_SPACY]].values):\n", 1595 | " doc = nlp_textcat(row[0])\n", 1596 | " predicted_probabilities.append(doc.cats[\"POSITIVE-{}\".format(COLUMN_OF_INTEREST_SPACY)])\n", 1597 | " \n", 1598 | "print(\"spaCy Area under PR curve: \", average_precision_score(test_df[COLUMN_OF_INTEREST_SPACY], predicted_probabilities))\n" 1599 | ] 1600 | }, 1601 | { 1602 | "cell_type": "markdown", 1603 | "metadata": {}, 1604 | "source": [ 1605 | "## Bonus #1: What confuses the classifier?\n", 1606 | "\n", 1607 | "Let's loop through the test set and see for which documents the Naive Bayes classifier gives the wrong answer. Then we'll use our human judgment to see if the computer is really giving the 'wrong' answer or if the data is just coded wrong.\n", 1608 | "\n", 1609 | "In some cases the person who submitted the tip didn't classify it in the way we'd want to. In that case, the computer's not really wrong.\n", 1610 | "\n", 1611 | "In some cases, the computer is not giving the answer we want it to. Sometimes, that's \"forgivable\" and we can't expect it to (e.g. misspellings or descriptions that use unfamiliar wordings); other times, the model is _really_ doing the wrong thing. It's that last case that we're trying to eliminate overall.\n", 1612 | "\n", 1613 | "But the purpose of this exercise is to get a sense of what the model is missing and what tips are mis-classified by humans. \n", 1614 | "\n", 1615 | "What I've noticed:\n", 1616 | "\n", 1617 | "- the model doesn't seem to know the phrase \"N word\"; even though that should be a clear sign of a race-related tip. (Which I fix above).\n", 1618 | "- the model classifies many Judaism-related tips as non-race/ethnicity-related (when teh submitter classified it that way). In my opinion, that may be okay if the model categorizes those as religion-related. (But like, it's a tough category in the first place. If humans can't agree on the right answer, we can't expect the computer to settle it for us!)\n", 1619 | "- forgivable error \"get out of the country\" coded as race-related when person reporting is from England. (i.e. computer is picking up on a real signal that just happens not to apply here.)" 1620 | ] 1621 | }, 1622 | { 1623 | "cell_type": "code", 1624 | "execution_count": null, 1625 | "metadata": { 1626 | "collapsed": true 1627 | }, 1628 | "outputs": [], 1629 | "source": [ 1630 | "test_labels_nb = test_df_nb[class_of_interest]\n", 1631 | "test_features_nb = test_df_nb[\"description\"]\n", 1632 | "test_features_nb_vec = vectorizer.transform(test_features_nb)\n", 1633 | "\n", 1634 | "predicted_probabilities = nbclassifier.predict_proba(test_features_nb_vec)[:,1]\n", 1635 | "for (row, predicted_proba) in zip(test_df_nb[[\"description\", class_of_interest]].values, predicted_probabilities):\n", 1636 | " if row[1] == (1.0 if predicted_proba > 0.5 else 0.0):\n", 1637 | " continue\n", 1638 | " print(\"Classifier sez {}; gold-standard coded as {}\".format(predicted_proba, row[1]))\n", 1639 | " print(\"Text: {}\".format(row[0]))\n", 1640 | " print(\"\\n---------------------------------------------\\n\")" 1641 | ] 1642 | }, 1643 | { 1644 | "cell_type": "markdown", 1645 | "metadata": {}, 1646 | "source": [ 1647 | "`I can't actually show you the output.`" 1648 | ] 1649 | }, 1650 | { 1651 | "cell_type": "markdown", 1652 | "metadata": {}, 1653 | "source": [ 1654 | "## Bonus #2. Interactive Example\n", 1655 | "\n", 1656 | "We can figure out what words in a description contribute most to the score (again using Naive Bayes).\n", 1657 | "\n", 1658 | "You can also see this code in an [interactive form here](https://s3.amazonaws.com/qz-aistudio-public/dochate.html).\n", 1659 | "\n", 1660 | "For a tip, we remove each n-gram and run the modified tip through the classifier. The n-grams that, when removed, cause the largest change in the model's guess are the ones that have the biggest effect. This is kind of a hack: some models might take into account additional context than just trigrams or do so in different ways. But this gives us a sense of what the model is basing its decision on. \n", 1661 | "\n", 1662 | "You'll find some times when the model makes the right call... and other times where it has mistakenly learned that an phrase that ought to be irrelevant has a big effect. (Can you think of why that might happen?)\n" 1663 | ] 1664 | }, 1665 | { 1666 | "cell_type": "code", 1667 | "execution_count": 67, 1668 | "metadata": {}, 1669 | "outputs": [], 1670 | "source": [ 1671 | "import re\n", 1672 | "\n", 1673 | "def classify_all(text, cutoff=0.5):\n", 1674 | " classifications = []\n", 1675 | " vectorized_text = vectorizer.transform([text])\n", 1676 | " for class_of_interest, classifier in classifiers.items():\n", 1677 | " proba = classifier.predict_proba(vectorized_text)[:,1][0]\n", 1678 | " if proba > 0.50:\n", 1679 | " classifications.append(class_of_interest)\n", 1680 | " return classifications\n", 1681 | " \n", 1682 | "\n", 1683 | "def permute_text(text):\n", 1684 | " words = ' '.join(re.sub(r'[^A-Za-z0-9]', ' ', text).split()).split()\n", 1685 | " bigrams = list(zip(words[:-1], words[1:]))\n", 1686 | " trigrams = list(zip(bigrams[:-2], words[2:]))\n", 1687 | " return [(word, ' '.join(words[:i] + words[i+1:]) ) for i, word in enumerate(words)] + [(' '.join(bigram), ' '.join(words[:i] + words[i+2:]) ) for i, bigram in enumerate(bigrams)] + [( ' '.join(trigram[0] + (trigram[1],)), ' '.join(words[:i] + words[i+3:]) ) for i, trigram in enumerate(trigrams)]\n", 1688 | "\n", 1689 | "def but_why_with_func(text, classify):\n", 1690 | " baseline = classify(text)\n", 1691 | " permuted_texts = permute_text(text)\n", 1692 | " diffs = [(deleted_word, baseline - classify(permuted_text)) for (deleted_word, permuted_text) in permuted_texts]\n", 1693 | " biggest_diffs = sorted(diffs, key=lambda word_diff: -abs(word_diff[1]))[:4]\n", 1694 | " return baseline, biggest_diffs \n", 1695 | " \n", 1696 | "def but_why(text, class_of_interest=\"race_ethnicity\"):\n", 1697 | " baseline, biggest_diffs = but_why_with_func(text, \n", 1698 | " lambda x: classifiers[class_of_interest].predict_proba(vectorizer.transform([x]))[:,1][0])\n", 1699 | " return baseline, biggest_diffs\n", 1700 | "\n", 1701 | "def inspect(text, class_of_interest=None):\n", 1702 | " text = clean(text)\n", 1703 | " print(\"Text:\")\n", 1704 | " print()\n", 1705 | " print(\" \" + text)\n", 1706 | " print()\n", 1707 | "\n", 1708 | " print(\"Predicted targeted-because: \")\n", 1709 | " for target in classify_all(text):\n", 1710 | " print(' * ' + target)\n", 1711 | " if class_of_interest: \n", 1712 | " print()\n", 1713 | " print(\"Why that {} prediction?\".format(class_of_interest))\n", 1714 | " baseline, biggest_diffs = but_why(text, class_of_interest)\n", 1715 | " print(\"predicted probability: {0:.2f}%\".format(baseline * 100))\n", 1716 | " print(\"top difference-makers:\")\n", 1717 | " for (deleted_word, diff) in biggest_diffs:\n", 1718 | " print(\" - {0}, {1:.2f}%\".format(deleted_word, diff * 100))\n", 1719 | "\n" 1720 | ] 1721 | }, 1722 | { 1723 | "cell_type": "code", 1724 | "execution_count": 69, 1725 | "metadata": {}, 1726 | "outputs": [ 1727 | { 1728 | "name": "stdout", 1729 | "output_type": "stream", 1730 | "text": [ 1731 | "Text:\n", 1732 | "\n", 1733 | " someone called me the ndashword at the grocery store\n", 1734 | "\n", 1735 | "Predicted targeted-because: \n", 1736 | " * race\n", 1737 | " * race_ethnicity\n", 1738 | "\n", 1739 | "Why that race_ethnicity prediction?\n", 1740 | "predicted probability: 69.56%\n", 1741 | "top difference-makers:\n", 1742 | " - ndashword at, 18.51%\n", 1743 | " - the ndashword at, 18.47%\n", 1744 | " - ndashword at the, 18.47%\n", 1745 | " - ndashword, 18.08%\n" 1746 | ] 1747 | } 1748 | ], 1749 | "source": [ 1750 | "text = \"someone called me the n-word at the grocery store\"\n", 1751 | "inspect(text, 'race_ethnicity')" 1752 | ] 1753 | }, 1754 | { 1755 | "cell_type": "markdown", 1756 | "metadata": {}, 1757 | "source": [ 1758 | "## Bonus 3: Grid Search for targeted_because\n", 1759 | "\n", 1760 | "My CNN managed to slightly outperform Naive Bayes for predicting `targeted_because`. Let's see if we can fiddle with the knobs (technically \"hyperparameters\" -- settings like the size of the embedding layer, the learning rate or the dropout amount -- and get better results.\n", 1761 | "\n", 1762 | "There's no science to this. We're just fiddling with knobs. Grid Search is a technique for fiddling with those knobs systematically...\n", 1763 | "\n", 1764 | "This took a lot of fiddling (Tensorflow doesn't parallelize well, due to memory reservations that aren't easily freed. That's what a lot of random-seed setting code above does.)\n", 1765 | "\n", 1766 | "I did this on a p3.2xlarge instance, which costs \\\\$3/hr and testing 11250 combinations 3x each, took 20hr 52min -- costing \\\\$60ish. (If you do everything right the first time, which I didn't.)\n", 1767 | "\n", 1768 | "Results look like this: Best: 0.792443 using {'batch_size': 256, 'dropout1': 0.1, 'dropout2': 0.0, 'dropout3': 0.0, 'dropout_embedding': 0.0, 'embedding_dim': 32, 'epochs': 20, 'kernel_size': 5, 'learning_rate': 0.001, 'num_filters': 256, 'verbose': 0} plus results for every combination.\n", 1769 | "\n", 1770 | "I used this tutorial: https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/\n", 1771 | "\n", 1772 | "The best result was 0.898847 using `{'batch_size': 256, 'dropout1': 0.0, 'dropout2': 0.1, 'dropout3': 0.2, 'dropout_embedding': 0.0, 'embedding_dim': 32, 'epochs': 30, 'kernel_size': 5, 'learning_rate': 0.0005, 'num_filters': 256, 'verbose': 0}`." 1773 | ] 1774 | }, 1775 | { 1776 | "cell_type": "code", 1777 | "execution_count": null, 1778 | "metadata": {}, 1779 | "outputs": [], 1780 | "source": [ 1781 | "from tensorflow.keras.wrappers.scikit_learn import KerasClassifier\n", 1782 | "from sklearn.model_selection import GridSearchCV\n", 1783 | "\n", 1784 | "keras.backend.clear_session()\n", 1785 | "np.random.seed(RANDOM_SEED)\n", 1786 | "random.seed(RANDOM_SEED)\n", 1787 | "tf.set_random_seed(RANDOM_SEED)\n", 1788 | "\n", 1789 | "\n", 1790 | "should_equalize = False\n", 1791 | "\n", 1792 | "histories = {}\n", 1793 | "\n", 1794 | "train_data = np.array(encode_texts(train_df[\"description\"]))\n", 1795 | "test_data = np.array(encode_texts(test_df[\"description\"]))\n", 1796 | "train_data = keras.preprocessing.sequence.pad_sequences(train_data,\n", 1797 | " value=word_index[\"\"],\n", 1798 | " padding='post',\n", 1799 | " maxlen=256)\n", 1800 | "test_data = keras.preprocessing.sequence.pad_sequences(test_data,\n", 1801 | " value=word_index[\"\"],\n", 1802 | " padding='post',\n", 1803 | " maxlen=256)\n", 1804 | "for class_of_interest in [\"race_ethnicity\"]:\n", 1805 | " train_labels = train_df[class_of_interest]\n", 1806 | " test_labels = test_df[class_of_interest]\n", 1807 | "\n", 1808 | " if should_equalize:\n", 1809 | " equalized_train_data, equalized_train_labels = equalize_classes_keras(train_data, train_labels)\n", 1810 | " else:\n", 1811 | " equalized_train_data = train_data.copy()\n", 1812 | " equalized_train_labels = train_labels.copy()\n", 1813 | "\n", 1814 | " x_val = equalized_train_data[:VALIDATION_SET_SIZE]\n", 1815 | " partial_x_train = equalized_train_data[VALIDATION_SET_SIZE:]\n", 1816 | "\n", 1817 | " y_val = equalized_train_labels[:VALIDATION_SET_SIZE]\n", 1818 | " partial_y_train = equalized_train_labels[VALIDATION_SET_SIZE:]\n", 1819 | " \n", 1820 | " # different bits\n", 1821 | " parameters_cnn = {\n", 1822 | " \"learning_rate\": (0.001,),\n", 1823 | " \"dropout_embedding\": (0.0, 0.1, 0.2, 0.3, 0.4),\n", 1824 | " \"dropout1\": (0.0, 0.1, 0.2, 0.3, 0.4),\n", 1825 | " \"dropout2\": (0.0, 0.1, 0.2, 0.3, 0.4),\n", 1826 | " \"dropout3\": (0.0, 0.1, 0.2, 0.3, 0.4),\n", 1827 | " \"embedding_dim\": (16, 32), # default 16\n", 1828 | " \"num_filters\": (32,64,256), # default 128\n", 1829 | " \"kernel_size\": (3,5,7), # default 5\n", 1830 | "\n", 1831 | " # these aren't actually options we're messing around with parameters\n", 1832 | " \"epochs\": (20,),\n", 1833 | " \"batch_size\": (256,),\n", 1834 | " \"validation_data\": [(x_val, y_val)],\n", 1835 | " \"verbose\": (0,),\n", 1836 | " }\n", 1837 | "\n", 1838 | " \n", 1839 | " model = KerasClassifier(build_fn=cnn_model) \n", 1840 | " gridsearcher = GridSearchCV(model, parameters_cnn, scoring='average_precision', verbose=0, n_jobs=1)\n", 1841 | " grid_result = gridsearcher.fit(partial_x_train, partial_y_train)\n", 1842 | " \n", 1843 | " # summarize results\n", 1844 | " print(\"Best: %f using %s\" % (grid_result.best_score_, {k:v for k,v in grid_result.best_params_.items() if k != 'validation_data'}))\n", 1845 | " means = grid_result.cv_results_['mean_test_score']\n", 1846 | " stds = grid_result.cv_results_['std_test_score']\n", 1847 | " params = grid_result.cv_results_['params']\n", 1848 | " for mean, stdev, param in zip(means, stds, params):\n", 1849 | " print(\"%f (%f) with: %r\" % (mean, stdev, param))\n" 1850 | ] 1851 | }, 1852 | { 1853 | "cell_type": "markdown", 1854 | "metadata": {}, 1855 | "source": [ 1856 | "### some helpful methods for charting...\n", 1857 | "\n", 1858 | "That are used above, but defined here just so that they're out of the way!" 1859 | ] 1860 | }, 1861 | { 1862 | "cell_type": "code", 1863 | "execution_count": 38, 1864 | "metadata": {}, 1865 | "outputs": [], 1866 | "source": [ 1867 | "from sklearn.utils.fixes import signature\n", 1868 | "import matplotlib.pyplot as plt\n", 1869 | "\n", 1870 | "def pr_chart(labels, predicted_probas):\n", 1871 | " precision, recall, _ = precision_recall_curve(labels, predicted_probas)\n", 1872 | "\n", 1873 | " # In matplotlib < 1.5, plt.fill_between does not have a 'step' argument\n", 1874 | " step_kwargs = ({'step': 'post'}\n", 1875 | " if 'step' in signature(plt.fill_between).parameters\n", 1876 | " else {})\n", 1877 | " plt.step(recall, precision, color='b', alpha=0.2,\n", 1878 | " where='post')\n", 1879 | " plt.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)\n", 1880 | " plt.xlabel('Recall')\n", 1881 | " plt.ylabel('Precision')\n", 1882 | " plt.ylim([0.0, 1.05])\n", 1883 | " plt.xlim([0.0, 1.0])\n", 1884 | " plt.title('2-class Precision-Recall curve')" 1885 | ] 1886 | }, 1887 | { 1888 | "cell_type": "code", 1889 | "execution_count": 74, 1890 | "metadata": {}, 1891 | "outputs": [], 1892 | "source": [ 1893 | "def training_and_validation_accuracy(history):\n", 1894 | " acc = history.history['acc']\n", 1895 | " val_acc = history.history['val_acc']\n", 1896 | " loss = history.history['loss']\n", 1897 | " val_loss = history.history['val_loss']\n", 1898 | "\n", 1899 | " epochs = range(1, len(acc) + 1)\n", 1900 | " \n", 1901 | " plt.plot(epochs, acc, 'bo', label='Training acc')\n", 1902 | " plt.plot(epochs, val_acc, 'b', label='Validation acc')\n", 1903 | " plt.title('Training and validation accuracy')\n", 1904 | " plt.xlabel('Epochs')\n", 1905 | " plt.ylabel('Accuracy')\n", 1906 | " plt.legend()\n", 1907 | "\n", 1908 | " plt.show()\n" 1909 | ] 1910 | }, 1911 | { 1912 | "cell_type": "code", 1913 | "execution_count": 75, 1914 | "metadata": {}, 1915 | "outputs": [], 1916 | "source": [ 1917 | "import matplotlib.pyplot as plt\n", 1918 | "\n", 1919 | "def training_and_validation_loss(history): \n", 1920 | " acc = history.history['acc']\n", 1921 | " val_acc = history.history['val_acc']\n", 1922 | " loss = history.history['loss']\n", 1923 | " val_loss = history.history['val_loss']\n", 1924 | "\n", 1925 | " epochs = range(1, len(acc) + 1)\n", 1926 | "\n", 1927 | " # \"bo\" is for \"blue dot\"\n", 1928 | " plt.plot(epochs, loss, 'bo', label='Training loss')\n", 1929 | " # b is for \"solid blue line\"\n", 1930 | " plt.plot(epochs, val_loss, 'b', label='Validation loss')\n", 1931 | " plt.title('Training and validation loss')\n", 1932 | " plt.xlabel('Epochs')\n", 1933 | " plt.ylabel('Loss')\n", 1934 | " plt.legend()\n", 1935 | "\n", 1936 | " plt.show()" 1937 | ] 1938 | }, 1939 | { 1940 | "cell_type": "code", 1941 | "execution_count": null, 1942 | "metadata": {}, 1943 | "outputs": [], 1944 | "source": [] 1945 | } 1946 | ], 1947 | "metadata": { 1948 | "kernelspec": { 1949 | "display_name": "Python 2", 1950 | "language": "python", 1951 | "name": "python2" 1952 | }, 1953 | "language_info": { 1954 | "codemirror_mode": { 1955 | "name": "ipython", 1956 | "version": 3 1957 | }, 1958 | "file_extension": ".py", 1959 | "mimetype": "text/x-python", 1960 | "name": "python", 1961 | "nbconvert_exporter": "python", 1962 | "pygments_lexer": "ipython3", 1963 | "version": "3.7.2" 1964 | } 1965 | }, 1966 | "nbformat": 4, 1967 | "nbformat_minor": 2 1968 | } 1969 | --------------------------------------------------------------------------------