├── README.md
└── data_cleaning
    ├── CBOW.ipynb
    ├── LDA.ipynb
    ├── assign_SOC.ipynb
    ├── auxiliary files
    │   ├── OCRcorrect_enchant.py
    │   ├── OCRcorrect_hyphen.py
    │   ├── PWL.txt
    │   ├── TitleBase.txt
    │   ├── __pycache__
    │   │   ├── ExtractLDAresult.cpython-36.pyc
    │   │   ├── OCRcorrect_enchant.cpython-36.pyc
    │   │   ├── OCRcorrect_hyphen.cpython-36.pyc
    │   │   ├── compute_spelling.cpython-36.pyc
    │   │   ├── detect_ending.cpython-36.pyc
    │   │   ├── edit_distance.cpython-36.pyc
    │   │   ├── extract_LDA_result.cpython-36.pyc
    │   │   ├── extract_information.cpython-36.pyc
    │   │   ├── title_detection.cpython-36.pyc
    │   │   └── title_substitute.cpython-36.pyc
    │   ├── apst_mapping.xlsx
    │   ├── compute_spelling.py
    │   ├── detect_ending.py
    │   ├── edit_distance.py
    │   ├── example_ONET_api.png
    │   ├── extract_LDA_result.py
    │   ├── extract_information.py
    │   ├── phrase_substitutes.csv
    │   ├── state_name.txt
    │   ├── title2soc.txt
    │   ├── title_detection.py
    │   ├── title_substitute.py
    │   └── word_substitutes.csv
    ├── initial_cleaning.ipynb
    └── structured_data.ipynb
/README.md: -------------------------------------------------------------------------------- 1 | # newspaper_project 2 | This repository contains supplementary materials to "The Evolution of Work in the United States" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum - American Economic Journal: Applied Economics (2020) https://www.aeaweb.org/articles?id=10.1257/app.20190070 3 | 4 | - Project Data Page: https://occupationdata.github.io 5 | -------------------------------------------------------------------------------- /data_cleaning/CBOW.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# The Continuous Bag of Words Model\n", 8 | "\n", 9 | "Online supplementary material to \"The Evolution of Work in the United States\" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum.\n", 10 | "\n", 11 | "* [Project data library](https://occupationdata.github.io) \n", 12 | "\n", 13 | "* [GitHub repository](https://github.com/phaiptt125/newspaper_project)\n", 14 | "\n", 15 | "***" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "This IPython notebook demonstrates how we map occupational characteristics to words or phrases from newspaper text using the Continuous Bag of Words (CBOW) model. \n", 23 | "\n", 24 | "* See [here](http://ssc.wisc.edu/~eatalay/apst/apst_mapping.pdf) for more examples.\n", 25 | "* See the project data library for full results." 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | " Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. 
\n", 33 | "***" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Import necessary modules" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 1, 46 | "metadata": { 47 | "collapsed": true 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "import os\n", 52 | "import re\n", 53 | "import sys\n", 54 | "import platform\n", 55 | "import collections\n", 56 | "import shutil\n", 57 | "\n", 58 | "import pandas\n", 59 | "import math\n", 60 | "import multiprocessing\n", 61 | "import os.path\n", 62 | "import numpy as np\n", 63 | "from gensim import corpora, models\n", 64 | "from gensim.models import Word2Vec, keyedvectors \n", 65 | "from gensim.models.word2vec import LineSentence\n", 66 | "from sklearn.metrics.pairwise import cosine_similarity" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "In our implementation, we construct our model by taking as our text corpora all of the text from job ads which appeared in our cleaned newspaper data, plus the raw text from job ads which were posted on-line in two months: January 2012 and January 2016." 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "## Prepare newspaper text data" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "For newspaper text data, we:\n", 88 | "\n", 89 | "1. Retrieve document metadata, remove markup from the newspaper text, and to perform an initial spell-check of the text (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/initial_cleaning.ipynb)). \n", 90 | "2. Exclude non-job ad pages (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/LDA.ipynb)).\n", 91 | "3. Transform unstructured newspaper text into spreadsheet data (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/structured_data.ipynb)).\n", 92 | "4. Delete all non alphabetic characters, e.g., numbers and punctuations.\n", 93 | "5. Convert all characters to lowercase. \n", 94 | "\n", 95 | "The example below demonstrates how to perform step 4 and 5 in a very short snippet of Display Ad page 226, from the January 14, 1979 Boston Globe. 
" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 2, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "name": "stdout", 105 | "output_type": "stream", 106 | "text": [ 107 | "--- newspaper text ---\n", 108 | "manage its Primary Care Programs including 24-hour Emergency Room Primary Care program\n", 109 | "\n", 110 | "--- transformed text ---\n", 111 | "manage its primary care programs including hour emergency room primary care program\n" 112 | ] 113 | } 114 | ], 115 | "source": [ 116 | "text = \"manage its Primary Care Programs including 24-hour Emergency Room Primary Care program\"\n", 117 | "\n", 118 | "print('--- newspaper text ---')\n", 119 | "print(text)\n", 120 | "print('')\n", 121 | "print('--- transformed text ---')\n", 122 | "print(re.sub('[^a-z ]','',text.lower()))" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "## Prepare online job posting text data" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "Economic Modeling Specialists International (EMSI) provided us with online postings data in a processed format and relatively clean form: see [here](https://github.com/phaiptt125/online_job_posting/blob/master/data_cleaning/initial_cleaning.ipynb).\n", 137 | "\n", 138 | "For the purpose of this project, we use online postings data to:\n", 139 | "1. Enrich the sample of text usuage when constructing the Continuous Bag of Words model\n", 140 | "2. Retrieve a mapping between job titles and ONET-SOC codes. " 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "## Construct CBOW model" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 3, 153 | "metadata": { 154 | "collapsed": true 155 | }, 156 | "outputs": [], 157 | "source": [ 158 | "# filename of the combined ads ~ 15 GB \n", 159 | "text_data_filename = 'ad_combined.txt'\n", 160 | "\n", 161 | "# construct CBOW model\n", 162 | "dim_model = 300\n", 163 | "model = Word2Vec(LineSentence(open(text_data_filename)), \n", 164 | " size=dim_model, \n", 165 | " window=5, \n", 166 | " min_count=5, \n", 167 | " workers=multiprocessing.cpu_count())\n", 168 | "\n", 169 | "model.init_sims(replace=True)\n", 170 | "\n", 171 | "# define output filename for CBOW model\n", 172 | "cbow_filename = 'cbow.model'\n", 173 | "\n", 174 | "# save model into file\n", 175 | "model.save(cbow_filename)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "## Compute similar words" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 4, 188 | "metadata": { 189 | "collapsed": true 190 | }, 191 | "outputs": [], 192 | "source": [ 193 | "# load model\n", 194 | "model = Word2Vec.load(cbow_filename)\n", 195 | "word_all = model.wv # set of all words in the model" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 5, 201 | "metadata": { 202 | "collapsed": true 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "def find_similar_words(phrase,model,dim_model):\n", 207 | " # This function compute similar words given a word or phrase.\n", 208 | " # If the input is just one word, this function is the same as gensim built-in function: model.most_similar\n", 209 | " \n", 210 | " # phrase : input for word or phrases to look for. 
For a phrase with multiple words, add \"_\" in between.\n", 211 | " # model : constructed CBOW model\n", 212 | " # dim_model : dimension of the model, i.e., length of the vector for each word \n", 213 | " \n", 214 | " tokens = [w for w in re.split('_',phrase) if w in word_all] \n", 215 | " # split input to tokens, ignoring words that are not in the model \n", 216 | " \n", 217 | " vector_by_word = np.zeros((len(tokens),dim_model)) # initialize a matrix \n", 218 | " \n", 219 | " for i in range(0,len(tokens)):\n", 220 | " word = tokens[i] # loop for each word\n", 221 | " vector_this_word = model[word] # get a vector representation\n", 222 | " vector_by_word[i,:] = vector_this_word # record the vector\n", 223 | " \n", 224 | " vector_this_phrase = sum(vector_by_word) \n", 225 | " # sum over words to get a vector representation of the whole phrase\n", 226 | " \n", 227 | " most_similar_words = model.similar_by_vector(vector_this_phrase, topn=100, restrict_vocab=None)\n", 228 | " # find 100 most similar words\n", 229 | " \n", 230 | " most_similar_words = [w for w in most_similar_words if not w[0] == phrase]\n", 231 | " # take out the output word that is identical to the input word\n", 232 | " \n", 233 | " return most_similar_words" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "The cosine similarity score of any pair of words/phrases is defined as the cosine of the angle between the two vectors representing that pair of words/phrases. A higher cosine similarity score means the two words/phrases tend to appear in similar contexts.\n", 241 | "\n", 242 | "The function *find_similar_words* above returns a set of similar words, ordered by cosine similarity score, together with their corresponding scores. For example, the ten most similar words to \"creative\" are: " 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 6, 248 | "metadata": {}, 249 | "outputs": [ 250 | { 251 | "data": { 252 | "text/plain": [ 253 | "[('imaginative', 0.6997416615486145),\n", 254 | " ('versatile', 0.6824457049369812),\n", 255 | " ('creature', 0.591433584690094),\n", 256 | " ('innovative', 0.5758161544799805),\n", 257 | " ('resourceful', 0.5575118660926819),\n", 258 | " ('creallve', 0.5550633668899536),\n", 259 | " ('restive', 0.5526227951049805),\n", 260 | " ('dynamic', 0.5416233539581299),\n", 261 | " ('clever', 0.5349052548408508),\n", 262 | " ('pragmatic', 0.5299020409584045)]" 263 | ] 264 | }, 265 | "execution_count": 6, 266 | "metadata": {}, 267 | "output_type": "execute_result" 268 | } 269 | ], 270 | "source": [ 271 | "most_similar_words = find_similar_words('creative',model,dim_model)\n", 272 | "most_similar_words[:10]" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "Likewise, the ten most similar words to \"bookkeeping\" are:" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 7, 285 | "metadata": {}, 286 | "outputs": [ 287 | { 288 | "data": { 289 | "text/plain": [ 290 | "[('bkkp', 0.6903467178344727),\n", 291 | " ('beekeeping', 0.6871334314346313),\n", 292 | " ('stenography', 0.672173023223877),\n", 293 | " ('bkkpng', 0.6181079745292664),\n", 294 | " ('bkkpg', 0.6175851821899414),\n", 295 | " ('bookkpg', 0.5925684571266174),\n", 296 | " ('dkkpg', 0.5809350609779358),\n", 297 | " ('bkkping', 0.5768048167228699),\n", 298 | " ('clerical', 0.5741672515869141),\n", 299 | " ('payroll', 0.5619226098060608)]" 300 | ] 301 | }, 302 | "execution_count": 7, 303 | "metadata": {}, 304 | "output_type": "execute_result" 305 | } 306 | ], 307 | "source": [ 308 | "most_similar_words = find_similar_words('bookkeeping',model,dim_model)\n", 309 | "most_similar_words[:10]" 310 | ] 311 | },
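{ "cell_type": "markdown", "metadata": {}, "source": [ "The scores above are cosine similarities. As an added, minimal NumPy sketch (not part of the original analysis), the formula behind them is shown below, using two made-up vectors `u` and `v`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def cosine(u, v):\n", "    # cos(u, v) = (u . v) / (||u|| ||v||), ranging from -1 to 1\n", "    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))\n", "\n", "u = np.array([1.0, 2.0, 0.5])  # hypothetical word vector\n", "v = np.array([0.8, 1.9, 0.7])  # hypothetical word vector\n", "print(cosine(u, v))" ] },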
"output_type": "execute_result" 305 | } 306 | ], 307 | "source": [ 308 | "most_similar_words = find_similar_words('bookkeeping',model,dim_model)\n", 309 | "most_similar_words[:10]" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "The strength of the Continuous Bag of Words (CBOW) model is twofold. First, the model provides context-based synonyms which allows us to keep track of relevant words even if their usage may differ over time. We provide one example in the main paper: " 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "*For instance, even though “creative” and “innovative” largely refer to the same occupational skill, it is possible that their relative usage among potential employers may differ within the sample period. This is indeed the case: Use of the word “innovative” has increased more quickly than “creative” over the sample period. To the extent that our ad hoc classification included only one of these two words, we would be mis-characterizing trends in the ONET skill of “Thinking Creatively.” The advantage of the continuous bag of words model is that it will identify that “creative” and “innovative” mean the same thing because they appear in similar contexts within job ads. Hence, even if employers start using “innovative” as opposed to “creative” part way through our sample, we will be able to consistently measure trends in “Thinking Creatively” throughout the entire period.*" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "The second advantage of the CBOW model is to identify common abbrevations and transcription errors. The word \"bookkeeping\", for instance, was offen mistranscribed into \"beekeeping\" due to the imperfection of the Optical Character Recognition (OCR) algorithm. Moreover, our CBOW model also reveals common abbrevations that employers offen used such as \"bkkp\" and \"bkkpng\"." 331 | ] 332 | } 333 | ], 334 | "metadata": { 335 | "kernelspec": { 336 | "display_name": "Python 3", 337 | "language": "python", 338 | "name": "python3" 339 | }, 340 | "language_info": { 341 | "codemirror_mode": { 342 | "name": "ipython", 343 | "version": 3 344 | }, 345 | "file_extension": ".py", 346 | "mimetype": "text/x-python", 347 | "name": "python", 348 | "nbconvert_exporter": "python", 349 | "pygments_lexer": "ipython3", 350 | "version": "3.6.1" 351 | } 352 | }, 353 | "nbformat": 4, 354 | "nbformat_minor": 2 355 | } 356 | -------------------------------------------------------------------------------- /data_cleaning/assign_SOC.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Mappings between Job Titles and SOC Codes\n", 8 | "\n", 9 | "Online supplementary material to \"The Evolution of Work in the United States\" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum.\n", 10 | "\n", 11 | "* [Project data library](https://occupationdata.github.io) \n", 12 | "\n", 13 | "* [GitHub repository](https://github.com/phaiptt125/newspaper_project)\n", 14 | "\n", 15 | "***" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "This IPython notebook demonstrates how we map between job titles and SOC from newspaper text. \n", 23 | "\n", 24 | "* We use the continuous bag of words (CBOW) model previously constructed. 
/data_cleaning/assign_SOC.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Mappings between Job Titles and SOC Codes\n", 8 | "\n", 9 | "Online supplementary material to \"The Evolution of Work in the United States\" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum.\n", 10 | "\n", 11 | "* [Project data library](https://occupationdata.github.io) \n", 12 | "\n", 13 | "* [GitHub repository](https://github.com/phaiptt125/newspaper_project)\n", 14 | "\n", 15 | "***" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "This IPython notebook demonstrates how we map job titles from newspaper text to SOC codes. \n", 23 | "\n", 24 | "* We use the continuous bag of words (CBOW) model previously constructed. See [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/CBOW.ipynb) for more detail. \n", 25 | "* See [here](http://ssc.wisc.edu/~eatalay/apst/apst_mapping.pdf) for more explanation.\n", 26 | "* See the project data library for full results." 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | " Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. \n", 34 | "***" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "## List of auxiliary files (see project data library or GitHub repository)\n", 42 | "\n", 43 | "* *\"title_substitute.py\"* : This Python code edits job titles.\n", 44 | "* *\"word_substitutes.csv\"* : List of job-title word substitutions.\n", 45 | "* *\"phrase_substitutes.csv\"* : List of job-title phrase substitutions.\n", 46 | "\n", 47 | "Note: We look for the most common job titles and list manually coded substitutions in *\"word_substitutes.csv\"* and *\"phrase_substitutes.csv\"*. " 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 1, 53 | "metadata": { 54 | "collapsed": true 55 | }, 56 | "outputs": [], 57 | "source": [ 58 | "import os\n", "import io # needed for io.open below\n", 59 | "import re\n", 60 | "import sys\n", 61 | "import platform\n", 62 | "import collections\n", 63 | "import shutil\n", 64 | "\n", 65 | "import pandas as pd\n", 66 | "import math\n", 67 | "import multiprocessing\n", 68 | "import os.path\n", 69 | "import numpy as np\n", 70 | "from gensim import corpora, models\n", 71 | "from gensim.models import Word2Vec, keyedvectors \n", 72 | "from gensim.models.word2vec import LineSentence\n", 73 | "from sklearn.metrics.pairwise import cosine_similarity\n", 74 | "\n", 75 | "sys.path.append('./auxiliary files')\n", 76 | "\n", 77 | "from title_substitute import *" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "## Edit job titles" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "We first lightly edit job titles to reduce the number of unique titles: We convert all titles to lowercase and remove all non-alphanumeric characters; combine titles which are very similar to one another (e.g., replacing \"hostesses\" with \"host\"); replace plural nouns with their singular form (e.g., replacing \"nurses\" with \"nurse\", \"foremen\" with \"foreman\"); and remove abbreviations (e.g., replacing \"asst\" with \"assistant\", and \"customer service rep\" with \"customer service representative\"). A simplified sketch of this logic appears in the cell below. " 92 | ] 93 | },
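{ "cell_type": "markdown", "metadata": {}, "source": [ "The next cell is an added, simplified sketch of the kind of logic *title_substitute.py* implements; the crude plural rules and the tiny substitution dictionary are illustrative stand-ins for the full lists in *\"word_substitutes.csv\"* and *\"phrase_substitutes.csv\"*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def substitute_titles_sketch(title, word_subs, phrase_subs):\n", "    title = re.sub('[^a-z0-9 ]', '', title.lower())  # lowercase, drop non-alphanumerics\n", "    for old, new in phrase_subs:  # manually coded phrase substitutions first\n", "        title = title.replace(old, new)\n", "    words = []\n", "    for w in title.split():\n", "        if w.endswith('ies'):  # crude plural -> singular rules\n", "            w = w[:-3] + 'y'\n", "        elif w.endswith('s') and not w.endswith('ss'):\n", "            w = w[:-1]\n", "        words.append(word_subs.get(w, w))  # manually coded word substitutions\n", "    return ' '.join(words)\n", "\n", "print(substitute_titles_sketch('Customer Service Reps', {'rep': 'representative'}, []))" ] },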
" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 2, 97 | "metadata": { 98 | "collapsed": true 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "# import files for editing titles\n", 103 | "word_substitutes = io.open('word_substitutes.csv','r',encoding='utf-8',errors='ignore').read()\n", 104 | "word_substitutes = ''.join([w for w in word_substitutes if ord(w) < 127])\n", 105 | "word_substitutes = [w for w in re.split('\\n',word_substitutes) if not w=='']\n", 106 | " \n", 107 | "phrase_substitutes = io.open('phrase_substitutes.csv','r',encoding='utf-8',errors='ignore').read()\n", 108 | "phrase_substitutes = ''.join([w for w in phrase_substitutes if ord(w) < 127])\n", 109 | "phrase_substitutes = [w for w in re.split('\\n',phrase_substitutes) if not w=='']" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 3, 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "name": "stdout", 119 | "output_type": "stream", 120 | "text": [ 121 | "original title = registered nurses\n", 122 | "edited title = registered nurse\n", 123 | "---\n", 124 | "original title = rn\n", 125 | "edited title = registered nurse\n", 126 | "---\n", 127 | "original title = hostesses\n", 128 | "edited title = host\n", 129 | "---\n", 130 | "original title = foremen\n", 131 | "edited title = foreman\n", 132 | "---\n", 133 | "original title = customer service rep\n", 134 | "edited title = customer service representative\n", 135 | "---\n" 136 | ] 137 | } 138 | ], 139 | "source": [ 140 | "# some illustrations (see \"title_substitute.py\")\n", 141 | "\n", 142 | "list_job_titles = ['registered nurses',\n", 143 | " 'rn', \n", 144 | " 'hostesses',\n", 145 | " 'foremen', \n", 146 | " 'customer service rep']\n", 147 | "\n", 148 | "for title in list_job_titles: \n", 149 | " title_clean = substitute_titles(title,word_substitutes,phrase_substitutes)\n", 150 | " print('original title = ' + title)\n", 151 | " print('edited title = ' + title_clean)\n", 152 | " print('---')" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "## Some technical issues\n", 160 | "\n", 161 | "* The procedure of replacing plural nouns with their singular form works in general:" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 4, 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "data": { 171 | "text/plain": [ 172 | "'galaxy'" 173 | ] 174 | }, 175 | "execution_count": 4, 176 | "metadata": {}, 177 | "output_type": "execute_result" 178 | } 179 | ], 180 | "source": [ 181 | "substitute_titles('galaxies',word_substitutes,phrase_substitutes)\n", 182 | "# Note: We do not supply the mapping from 'galaxies' to 'galaxy'." 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "* The procedure of replacing abbreviations, on the other hand, requires user-provided information, i.e., we list down the most common substitutions. While we cannot possibly identify all abbreviations, we will use the continuous bag of word (CBOW) model later. Common abbreviations would have similar meanings as their original words. " 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "## ONET reported job titles " 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "The ONET publishes, for each SOC code, a list of reported job titles in \"Sample of Reported Titles\" and \"Alternate Titles\" sections. 
The ONET data dictionary (see [here](https://www.onetcenter.org/dl_files/database/db_22_1_dictionary.pdf)) describes these files as follows:\n", 204 | "\n", 205 | "*\"This file [Sample of Reported Titles] contains job titles frequently reported by incumbents and occupational experts on data collection surveys.\"* (page 52)\n", 206 | "\n", 207 | "*\"This file [Alternate Titles] contains alternate, or 'lay', occupational titles for the ONET-SOC classification system. The file was developed to improve keyword searches in several Department of Labor internet applications (i.e., Career InfoNet, ONET OnLine, and ONET Code Connector). The file contains\n", 208 | "occupational titles from existing occupational classification systems, as well as from other diverse sources.\"* (page 50)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "## A mapping between ONET reported job titles and SOC codes" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "The ONET provides, for each job title in \"Sample of Reported Titles\" and \"Alternate Titles\", a corresponding SOC code. We then record these mappings directly. \n", 223 | "\n", 224 | "Some job titles, unfortunately, do not have a unique mapping to an SOC code. For example, \"Office Administrator\" is reported to be \"43-9061.00\", \"43-6011.00\" and \"43-6014.00\". For these titles, we rely on the ONET website search algorithm. First, we enter \"Office Administrator\" into the search query box, \"Occupation Quick Search.\" See [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/auxiliary%20files/example_ONET_api.png) for a screenshot of this procedure. \n", 225 | "\n", 226 | "Then, we map \"Office Administrator\" to \"43-9061.00\", which is the closest match that the ONET website provides. Next, we apply the same title-editing procedure described above for newspaper job titles. We record these mappings in \"title2SOC.txt\", as shown below. " 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 21, 232 | "metadata": {}, 233 | "outputs": [ 234 | { 235 | "name": "stdout", 236 | "output_type": "stream", 237 | "text": [ 238 | "Total mappings = 45207\n" 239 | ] 240 | }, 241 | { 242 | "data": { 243 | "text/html": [
\n", 245 | "\n", 258 | "\n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | "
titleoriginal_titlesoc
0operation directorOperations Director11102100
1us commissionerU.S. Commissioner11101100
2sale and marketing directorSales and Marketing Director11202200
3market analysis directorMarket Analysis Director11202100
4director of sale and marketingDirector of Sales and Marketing41101200
\n", 300 | "
" 301 | ], 302 | "text/plain": [ 303 | " title original_title soc\n", 304 | "0 operation director Operations Director 11102100\n", 305 | "1 us commissioner U.S. Commissioner 11101100\n", 306 | "2 sale and marketing director Sales and Marketing Director 11202200\n", 307 | "3 market analysis director Market Analysis Director 11202100\n", 308 | "4 director of sale and marketing Director of Sales and Marketing 41101200" 309 | ] 310 | }, 311 | "execution_count": 21, 312 | "metadata": {}, 313 | "output_type": "execute_result" 314 | } 315 | ], 316 | "source": [ 317 | "title2SOC_filename = 'title2SOC.txt'\n", 318 | "names = ['title','original_title','soc']\n", 319 | "\n", 320 | "# title: The edited title, to be matched with newspaper titles.\n", 321 | "# original_title: The original titles from ONET website. \n", 322 | "# soc: Occupation code.\n", 323 | " \n", 324 | "# import into pandas dataframe\n", 325 | "title2SOC = pd.read_csv(title2SOC_filename, sep = '\\t', names = names)\n", 326 | "\n", 327 | "# print number of total mappings\n", 328 | "print('Total mappings = ' + str(len(title2SOC)))\n", 329 | " \n", 330 | "# print some examples\n", 331 | "title2SOC.head()" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "The subsequent sections of this IPython notebook explain how we use these mappings from ONET, in combination with the previously constructed continuous bag of words (CBOW) model, to assign an SOC code to each of the newspaper job title." 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "## Map ONET job titles to newspaper job titles (direct match)" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "We assign the ONET job title, where a corresponding SOC code is available, to each of the newspaper job title. First, for each newspaper job title, we check if there is any direct string match. Suppose we have \"sale and marketing director\" in the newspaper:" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": 6, 358 | "metadata": {}, 359 | "outputs": [ 360 | { 361 | "data": { 362 | "text/plain": [ 363 | "True" 364 | ] 365 | }, 366 | "execution_count": 6, 367 | "metadata": {}, 368 | "output_type": "execute_result" 369 | } 370 | ], 371 | "source": [ 372 | "\"sale and marketing director\" in title2SOC['title'].values" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 7, 378 | "metadata": {}, 379 | "outputs": [ 380 | { 381 | "data": { 382 | "text/html": [ 383 | "
\n", 384 | "\n", 397 | "\n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | "
titleoriginal_titlesoc
2sale and marketing directorSales and Marketing Director11202200
\n", 415 | "
" 416 | ], 417 | "text/plain": [ 418 | " title original_title soc\n", 419 | "2 sale and marketing director Sales and Marketing Director 11202200" 420 | ] 421 | }, 422 | "execution_count": 7, 423 | "metadata": {}, 424 | "output_type": "execute_result" 425 | } 426 | ], 427 | "source": [ 428 | "title2SOC[title2SOC['title'] == \"sale and marketing director\"]" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "* Since, we have \"sale and marketing director\" in our list of ONET titles, we can proceed and assign the SOC of \"11-2022.00\". " 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "metadata": {}, 441 | "source": [ 442 | "## Map ONET job titles to newspaper job titles (CBOW-based)\n", 443 | "\n", 444 | "For those newspaper job titles where there is no exact match to our list of ONET job titles, we reply on our previously constructed CBOW model to assign the \"closet\" ONET job title to each of the newspaper job title. \n", 445 | "\n", 446 | "In the actual implementation, we set our dimension of the CBOW model to be 300, as explained [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/CBOW.ipynb). For illustrative purposes, however, this IPython notebook provides examples using the CBOW model with the dimension of 5. The embedded code below illustrates how we construct this CBOW model:" 447 | ] 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "metadata": {}, 452 | "source": [ 453 | "***\n", 454 | " model = Word2Vec(LineSentence(open('ad_combined.txt')), \n", 455 | " size = 5, \n", 456 | " window = 5, \n", 457 | " min_count = 5, \n", 458 | " workers = multiprocessing.cpu_count())\n", 459 | "\n", 460 | " model.save('cbow_small.model')\n", 461 | "***" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": 8, 467 | "metadata": { 468 | "collapsed": true 469 | }, 470 | "outputs": [], 471 | "source": [ 472 | "model = Word2Vec.load('cbow_small.model')\n", 473 | "# 'cbow_small.model' has dimension of 5.\n", 474 | "# In the actual implementation, we use our previously constructed 'cbow.model', which has dimension of 300. " 475 | ] 476 | }, 477 | { 478 | "cell_type": "markdown", 479 | "metadata": {}, 480 | "source": [ 481 | "Our CBOW model provides a vector representation of each word in the corpus. 
For example:" 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": 9, 487 | "metadata": {}, 488 | "outputs": [ 489 | { 490 | "data": { 491 | "text/plain": [ 492 | "array([-0.23945422, -0.33969662, -0.25194243, 0.86623007, 0.11592443], dtype=float32)" 493 | ] 494 | }, 495 | "execution_count": 9, 496 | "metadata": {}, 497 | "output_type": "execute_result" 498 | } 499 | ], 500 | "source": [ 501 | "model['customer']" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 10, 507 | "metadata": {}, 508 | "outputs": [ 509 | { 510 | "data": { 511 | "text/plain": [ 512 | "array([ 0.03195868, -0.56184751, 0.24374393, 0.58998656, 0.52517688], dtype=float32)" 513 | ] 514 | }, 515 | "execution_count": 10, 516 | "metadata": {}, 517 | "output_type": "execute_result" 518 | } 519 | ], 520 | "source": [ 521 | "model['relation']" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": 11, 527 | "metadata": {}, 528 | "outputs": [ 529 | { 530 | "data": { 531 | "text/plain": [ 532 | "array([-0.52168244, -0.50416076, 0.10234968, 0.33064061, 0.59487033], dtype=float32)" 533 | ] 534 | }, 535 | "execution_count": 11, 536 | "metadata": {}, 537 | "output_type": "execute_result" 538 | } 539 | ], 540 | "source": [ 541 | "model['specialist']" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "We compute the vector representation of \"customer relation specialist\" as the sum of the vector representations of \"customer\", \"relation\" and \"specialist\"." 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 12, 554 | "metadata": {}, 555 | "outputs": [ 556 | { 557 | "data": { 558 | "text/plain": [ 559 | "array([-0.72917795, -1.40570486, 0.09415118, 1.78685713, 1.23597169], dtype=float32)" 560 | ] 561 | }, 562 | "execution_count": 12, 563 | "metadata": {}, 564 | "output_type": "execute_result" 565 | } 566 | ], 567 | "source": [ 568 | "vector_title = model['customer'] + model['relation'] + model['specialist']\n", 569 | "vector_title" 570 | ] 571 | }, 572 | { 573 | "cell_type": "markdown", 574 | "metadata": {}, 575 | "source": [ 576 | "As such, we can compute a vector representation of:\n", 577 | "\n", 578 | "1. All job titles from our newspaper data.\n", 579 | "2. All job titles from our list of ONET titles." 580 | ] 581 | }, 582 | { 583 | "cell_type": "markdown", 584 | "metadata": {}, 585 | "source": [ 586 | "Suppose we have \"customer relation specialist\" as a newspaper job title. We first check whether there is a direct match to our list of ONET titles: " 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": 13, 592 | "metadata": {}, 593 | "outputs": [ 594 | { 595 | "data": { 596 | "text/plain": [ 597 | "False" 598 | ] 599 | }, 600 | "execution_count": 13, 601 | "metadata": {}, 602 | "output_type": "execute_result" 603 | } 604 | ], 605 | "source": [ 606 | "\"customer relation specialist\" in title2SOC['title'].values" 607 | ] 608 | }, 609 | { 610 | "cell_type": "markdown", 611 | "metadata": {}, 612 | "source": [ 613 | "Since there is no direct match, we compute a vector representation of this title and measure how similar it is to each of the ONET job titles, using cosine similarity. Cosine similarity ranges from -1 to 1; the closer the score is to 1, the more similar the two vectors are. 
The results below demonstrate cosine similarity scores against some ONET job titles:" 614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "execution_count": 14, 619 | "metadata": {}, 620 | "outputs": [ 621 | { 622 | "name": "stdout", 623 | "output_type": "stream", 624 | "text": [ 625 | "Computing cosine similarity of \"customer relation specialist\" to: \n", 626 | "----------------\n", 627 | "\"executive secretary\" = [[ 0.6176427]]\n", 628 | "\"mechanical engineer\" = [[ 0.80217057]]\n", 629 | "\"customer service assistant\" = [[ 0.96143997]]\n", 630 | "\"client relation specialist\" = [[ 0.99550998]]\n" 631 | ] 632 | } 633 | ], 634 | "source": [ 635 | "vector_newspaper = model['customer'] + model['relation'] + model['specialist']\n", 636 | "\n", 637 | "print('Computing cosine similarity of \"customer relation specialist\" to: ')\n", 638 | "print('----------------')\n", 639 | "\n", 640 | "# compute similarity to \"executive secretary\" \n", 641 | "vector_to_match = model['executive'] + model['secretary']\n", 642 | "cosine = cosine_similarity(vector_to_match.reshape(1,-1), vector_newspaper.reshape(1,-1))\n", 643 | "print( '\"executive secretary\" = ' + str(cosine))\n", 644 | "\n", 645 | "# compute similarity to \"mechanical engineer\" \n", 646 | "vector_to_match = model['mechanical'] + model['engineer']\n", 647 | "cosine = cosine_similarity(vector_to_match.reshape(1,-1), vector_newspaper.reshape(1,-1))\n", 648 | "print( '\"mechanical engineer\" = ' + str(cosine))\n", 649 | "\n", 650 | "# compute similarity to \"customer service assistant\" \n", 651 | "vector_to_match = model['customer'] + model['service'] + model['assistant']\n", 652 | "cosine = cosine_similarity(vector_to_match.reshape(1,-1), vector_newspaper.reshape(1,-1))\n", 653 | "print( '\"customer service assistant\" = ' + str(cosine))\n", 654 | "\n", 655 | "# compute similarity to \"client relation specialist\" \n", 656 | "vector_to_match = model['client'] + model['relation'] + model['specialist']\n", 657 | "cosine = cosine_similarity(vector_to_match.reshape(1,-1), vector_newspaper.reshape(1,-1))\n", 658 | "print( '\"client relation specialist\" = ' + str(cosine))" 659 | ] 660 | }, 661 | { 662 | "cell_type": "markdown", 663 | "metadata": {}, 664 | "source": [ 665 | "***\n", 666 | "Therefore, using the CBOW model, we conclude that \"customer relation specialist\" has a closer meaning to \"client relation specialist\" than to \"executive secretary\", \"mechanical engineer\" and \"customer service assistant.\" \n", 667 | "\n", 668 | "Even though we do not have \"customer relation specialist\" in our list of ONET job titles, our CBOW model suggests that this job title is extremely similar to \"client relation specialist\". There are two reasons why this should be the case. First, there are two identical words \"relation\" and \"specialist\" in both job titles. Second, our CBOW model suggests that \"client\" and \"customer\" are similar to each other:" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": 15, 674 | "metadata": {}, 675 | "outputs": [ 676 | { 677 | "data": { 678 | "text/plain": [ 679 | "array([[ 0.96610314]], dtype=float32)" 680 | ] 681 | }, 682 | "execution_count": 15, 683 | "metadata": {}, 684 | "output_type": "execute_result" 685 | } 686 | ], 687 | "source": [ 688 | "cosine_similarity(model['client'].reshape(1,-1), model['customer'].reshape(1,-1))" 689 | ] 690 | },
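{ "cell_type": "markdown", "metadata": {}, "source": [ "Putting the pieces together, the next cell is an added sketch (not the production code) of the whole assignment rule: a direct string match first, otherwise the nearest ONET title by cosine similarity. The helper names are hypothetical, and a full run loops over all 45,207 ONET titles with the 300-dimensional model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def phrase_vector(title, model):\n", "    # sum word vectors, ignoring words missing from the model\n", "    words = [w for w in title.split() if w in model.wv]\n", "    return sum(model[w] for w in words) if words else None\n", "\n", "def assign_soc(newspaper_title, title2SOC, model):\n", "    if newspaper_title in title2SOC['title'].values:  # direct string match\n", "        return title2SOC.loc[title2SOC['title'] == newspaper_title, 'soc'].iloc[0]\n", "    v = phrase_vector(newspaper_title, model)\n", "    if v is None:\n", "        return None\n", "    best_score, best_soc = -1.0, None\n", "    for _, row in title2SOC.iterrows():  # nearest ONET title by cosine similarity\n", "        u = phrase_vector(row['title'], model)\n", "        if u is None:\n", "            continue\n", "        score = cosine_similarity(u.reshape(1,-1), v.reshape(1,-1))[0,0]\n", "        if score > best_score:\n", "            best_score, best_soc = score, row['soc']\n", "    return best_soc\n", "\n", "print(assign_soc('customer relation specialist', title2SOC, model))" ] },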
691 | { 692 | "cell_type": "markdown", 693 | "metadata": {}, 694 | "source": [ 695 | "In the actual implementation, we compute cosine similarity scores against all 45,207 ONET job titles, which cannot be performed in this IPython notebook. \n", 696 | "\n", 697 | "Nevertheless, it turns out that \"client relation specialist\" is indeed the closest ONET job title to \"customer relation specialist.\" We then assign the SOC code of \"customer relation specialist\" to be the same as that of \"client relation specialist.\" " 698 | ] 699 | }, 700 | { 701 | "cell_type": "code", 702 | "execution_count": 16, 703 | "metadata": {}, 704 | "outputs": [ 705 | { 706 | "data": { 707 | "text/html": [
\n", 709 | "\n", 722 | "\n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | "
titleoriginal_titlesoc
14392client relation specialistClient Relations Specialist43405100
\n", 740 | "
" 741 | ], 742 | "text/plain": [ 743 | " title original_title soc\n", 744 | "14392 client relation specialist Client Relations Specialist 43405100" 745 | ] 746 | }, 747 | "execution_count": 16, 748 | "metadata": {}, 749 | "output_type": "execute_result" 750 | } 751 | ], 752 | "source": [ 753 | "title2SOC[title2SOC['title'] == \"client relation specialist\"]" 754 | ] 755 | }, 756 | { 757 | "cell_type": "markdown", 758 | "metadata": {}, 759 | "source": [ 760 | "## Some technical issues" 761 | ] 762 | }, 763 | { 764 | "cell_type": "markdown", 765 | "metadata": {}, 766 | "source": [ 767 | "* We ignore job title words that are not in our CBOW model. \n", 768 | "* Unlike the LDA model, we do not stem words. As a result, the model considers different forms of a word as different words, e.g., \"manage\" and \"management\". However, our CBOW model generally assign similar vector representation, for example: " 769 | ] 770 | }, 771 | { 772 | "cell_type": "code", 773 | "execution_count": 17, 774 | "metadata": {}, 775 | "outputs": [ 776 | { 777 | "data": { 778 | "text/plain": [ 779 | "array([[ 0.92724895]], dtype=float32)" 780 | ] 781 | }, 782 | "execution_count": 17, 783 | "metadata": {}, 784 | "output_type": "execute_result" 785 | } 786 | ], 787 | "source": [ 788 | "cosine_similarity(model['manage'].reshape(1,-1), model['management'].reshape(1,-1))" 789 | ] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "metadata": {}, 794 | "source": [ 795 | "* Our CBOW model is invariant to the order of job title words, e.g., we consider \"executive secretary\" and \"secretary executive\" as the same title. " 796 | ] 797 | }, 798 | { 799 | "cell_type": "code", 800 | "execution_count": 18, 801 | "metadata": {}, 802 | "outputs": [ 803 | { 804 | "data": { 805 | "text/plain": [ 806 | "array([-0.5665881 , -0.73142403, 0.72307652, -0.10102642, 1.02186275], dtype=float32)" 807 | ] 808 | }, 809 | "execution_count": 18, 810 | "metadata": {}, 811 | "output_type": "execute_result" 812 | } 813 | ], 814 | "source": [ 815 | "model['executive'] + model['secretary']" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": 19, 821 | "metadata": {}, 822 | "outputs": [ 823 | { 824 | "data": { 825 | "text/plain": [ 826 | "array([-0.5665881 , -0.73142403, 0.72307652, -0.10102642, 1.02186275], dtype=float32)" 827 | ] 828 | }, 829 | "execution_count": 19, 830 | "metadata": {}, 831 | "output_type": "execute_result" 832 | } 833 | ], 834 | "source": [ 835 | "model['secretary'] + model['executive']" 836 | ] 837 | }, 838 | { 839 | "cell_type": "markdown", 840 | "metadata": {}, 841 | "source": [ 842 | "* Common abbreviations would have similar meanings as their original words. 
For instance, \"rn\" is a common abbreviation for \"registered nurse\"; as a result, our CBOW model assigns them very similar vector representations: " 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "execution_count": 20, 848 | "metadata": {}, 849 | "outputs": [ 850 | { 851 | "data": { 852 | "text/plain": [ 853 | "array([[ 0.98632824]], dtype=float32)" 854 | ] 855 | }, 856 | "execution_count": 20, 857 | "metadata": {}, 858 | "output_type": "execute_result" 859 | } 860 | ], 861 | "source": [ 862 | "vector_title = model['registered'] + model['nurse']\n", 863 | "cosine_similarity(model['rn'].reshape(1,-1), vector_title.reshape(1,-1))" 864 | ] 865 | }, 866 | { 867 | "cell_type": "markdown", 868 | "metadata": {}, 869 | "source": [ 870 | "* There are rare circumstances where our CBOW model suggests more than one \"closest\" ONET title for a newspaper job title, i.e., the cosine similarity scores are exactly equal. This can happen because some distinct ONET job titles, each mapping to a different SOC code, receive the exact same vector representation from our CBOW model. For example, ONET registers \"wage and salary administrator\" to be \"11-3111.00\" (Compensation and Benefits Managers) and \"salary and wage administrator\" to be \"13-1141.00\" (Compensation, Benefits, and Job Analysis Specialists). However, because our model is invariant to word order, it assigns the exact same vector representation to \"wage and salary administrator\" and \"salary and wage administrator.\" In these circumstances, we rely on Bureau of Labor Statistics employment data, see [here](https://www.bls.gov/oes/current/oes_nat.htm), and choose the SOC code with the higher employment." 871 | ] 872 | }, 873 | { 874 | "cell_type": "markdown", 875 | "metadata": {}, 876 | "source": [ 877 | "## Additional amendments\n", 878 | "\n", 879 | "Finally, we make the following additional amendments (see [here](https://ssc.wisc.edu/~eatalay/apst/apst_mapping.pdf) for more detail):\n", 880 | "\n", 881 | "1. We assign an SOC code of 999999 (“missing”) if certain words or phrases appear — “associate,” “career builder,” “liberal employee benefit,” “many employee benefit,” or “personnel” — anywhere in the job title, or for certain exact titles: “boys,” “boys boys,” “men boys girls,” “men boys girls women,” “men boys men,” “people,” “professional,” or “trainee.” These words and phrases appear commonly in our newspaper ads and do not refer to the SOC code which our CBOW model indicates. “Associate” commonly appears as part of the name of the firm placing the ad. “Personnel” commonly refers to the personnel department that the applicant should contact.\n", 882 | "\n", 883 | "2. We also replace the SOC code for the job title “Assistant” from 399021 (the SOC code for “Personal Care Aides”) to 436014 (the SOC code for “Secretaries and Administrative Assistants”). “Assistant” is the fifth most common job title, and judging by the text within the job ads it refers to a secretarial occupation rather than one for a personal care worker. While we are hesitant to modify our job-title-to-SOC mapping in an ad hoc fashion for any job title, mis-specifying this mapping for such a common title would have a noticeably deleterious impact on our dataset.\n", 884 | "\n", 885 | "3. In a final step, we amend the output of the CBOW model for a few ambiguously defined job titles. 
These final amendments have no discernible impact on aggregate trends in task content, on the role of within-occupation shifts in accounting for aggregate task changes, or on the role of shifts in the demand for tasks in accounting for increased earnings inequality. First, for job titles which include “server” and which do not also include a food-service-related word — banquet, bartender, cashier, cocktail, cook, dining, food, or restaurant — we substitute an SOC code beginning with 3530 with the SOC code for computer systems analysts (151121). Second, for job titles which contain the word “programmer” and do not include the words “cnc” or “machine,” we substitute SOC codes beginning with 5140 or 5141 with the SOC code for computer programmers (151131). Finally, for job titles which contain the word “assembler” and do not contain a word referring to manufacturing assembly work — words containing the strings electronic, electric, machin, mechanical, metal, and wire — we substitute SOC codes beginning with 5120 with the SOC code of computer programmers (151131). The amendments, which alter the SOC codes for approximately 0.2 percent of ads in our data set, are necessary for ongoing work in which we explore the role of new technologies in the labor market. Certain words refer both to a job title unrelated to new technologies as well as to new technologies. By linking the aforementioned job titles to SOCs that have no exposure to new technologies, we would be vastly overstating the rates at which food service staff or manufacturing production workers adopt new ICT software. On the other hand, since these ads represent a small portion of the ads referring to computer programmer occupations, lumping the ambiguous job titles with the computer programmer SOC codes will only have a minor effect on the assessed technology adoption rates for computer programmers. A sketch of these three rules appears after this notebook." 886 | ] 887 | } 888 | ], 889 | "metadata": { 890 | "kernelspec": { 891 | "display_name": "Python 3", 892 | "language": "python", 893 | "name": "python3" 894 | }, 895 | "language_info": { 896 | "codemirror_mode": { 897 | "name": "ipython", 898 | "version": 3 899 | }, 900 | "file_extension": ".py", 901 | "mimetype": "text/x-python", 902 | "name": "python", 903 | "nbconvert_exporter": "python", 904 | "pygments_lexer": "ipython3", 905 | "version": "3.6.1" 906 | } 907 | }, 908 | "nbformat": 4, 909 | "nbformat_minor": 2 910 | } 911 | --------------------------------------------------------------------------------
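To make the three substitution rules in item 3 concrete, here is an added sketch in Python. The function name and the SOC codes in the demo calls are illustrative; the trigger words, code prefixes, and target codes are taken from the paragraph above.

```python
FOOD_WORDS = ('banquet', 'bartender', 'cashier', 'cocktail',
              'cook', 'dining', 'food', 'restaurant')
ASSEMBLY_STRINGS = ('electronic', 'electric', 'machin',
                    'mechanical', 'metal', 'wire')

def amend_soc(title, soc):
    words = title.split()
    # rule 1: "server" without a food-service word -> computer systems analysts
    if 'server' in words and not any(w in title for w in FOOD_WORDS):
        if soc.startswith('3530'):
            return '151121'
    # rule 2: "programmer" without "cnc"/"machine" -> computer programmers
    if 'programmer' in words and 'cnc' not in words and 'machine' not in words:
        if soc.startswith('5140') or soc.startswith('5141'):
            return '151131'
    # rule 3: "assembler" without a manufacturing-assembly string -> computer programmers
    if 'assembler' in words and not any(s in title for s in ASSEMBLY_STRINGS):
        if soc.startswith('5120'):
            return '151131'
    return soc

print(amend_soc('banquet server', '353041'))         # unchanged: food-service word present
print(amend_soc('sql server programmer', '514012'))  # -> 151131
```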
/data_cleaning/auxiliary files/OCRcorrect_enchant.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import sys 4 | import nltk 5 | import enchant, difflib 6 | import operator 7 | from enchant import DictWithPWL 8 | 9 | #...............................................# 10 | # This python function performs word-by-word spelling correction 11 | #...............................................# 12 | 13 | def EnchantErrorCorrection(InputByLine,mydictfile): 14 | 15 | # "InputByLine" is a string containing one line of text. 16 | # "mydictfile" is a filename (e.g., "myPWL.txt") for a personal word list 17 | # The function returns " ' '.join(OutputList) " as a string 18 | 19 | d = enchant.DictWithPWL('en_US', mydictfile) # define spell-checker 20 | 21 | # http://pythonhosted.org/pyenchant/tutorial.html 22 | # http://stackoverflow.com/questions/22898355/pyenchant-spellchecking-block-of-text-with-a-personal-word-list 23 | 24 | InputList = [w for w in re.split(' ',InputByLine) if not w==''] 25 | OutputList = list() 26 | 27 | for Word in InputList: 28 | if len(Word)>=3: # only check words with length greater than or equal to 3 29 | if d.check(Word): #d.check() is TRUE if the word is correctly spelled 30 | OutputList.append(Word) #append the old word back 31 | else: #d.check() is FALSE if the word is misspelled 32 | correct = d.suggest(Word) #get a suggestion 33 | count=0 34 | if correct: #if the suggestion list is not empty 35 | dictTemp,maxTemp = {},0 ##ea 36 | for b in correct: ## ea 37 | count=count+1 38 | if count<8: # only consider the first few suggestions 39 | tmp = max(0,difflib.SequenceMatcher(None, Word.lower(), b.lower()).ratio()-(1e-3)*count) ##ea 40 | dictTemp[tmp] = b ##ea 41 | if tmp > maxTemp: ##ea 42 | maxTemp = tmp ##ea 43 | if maxTemp>=0.8: # accept the best suggestion only if it is close enough 44 | OutputList.append(dictTemp[maxTemp]) ##ea 45 | else: 46 | OutputList.append(Word) 47 | else: #if the suggestion list is empty, just append the old word back 48 | OutputList.append(Word) 49 | else: # if the word is less than 3 characters, just append the same word back to output 50 | OutputList.append(Word) 51 | 52 | return ' '.join(OutputList) 53 | 54 | #...............................................# 55 | --------------------------------------------------------------------------------
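A hedged usage sketch for the function above (it assumes pyenchant and its en_US dictionary are installed, and that the personal word list PWL.txt from this folder is present; the sample line is made up):

```python
from OCRcorrect_enchant import EnchantErrorCorrection

line = "manage its Primory Care Porgrams"  # made-up OCR misspellings
print(EnchantErrorCorrection(line, 'PWL.txt'))
# misspelled words with a close enough suggestion (ratio >= 0.8) are replaced
```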
/data_cleaning/auxiliary files/OCRcorrect_hyphen.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import sys 4 | import nltk 5 | import enchant, difflib 6 | import operator 7 | from enchant import DictWithPWL 8 | from edit_distance import * 9 | 10 | #...............................................# 11 | # This python function performs spelling correction 12 | # on words with hyphens 13 | #...............................................# 14 | 15 | def CorrectHyphenated(InputByLine,mydictfile): 16 | 17 | # "InputByLine" is a string containing one line of text. 18 | # "mydictfile" is a filename (e.g., "myPWL.txt") for a personal word list 19 | # The function returns a string as output 20 | 21 | d = enchant.DictWithPWL('en_US', mydictfile) # define spell-checker 22 | # http://pythonhosted.org/pyenchant/tutorial.html 23 | # http://stackoverflow.com/questions/22898355/pyenchant-spellchecking-block-of-text-with-a-personal-word-list 24 | 25 | text = InputByLine 26 | 27 | HyphenWords = re.findall(r'\b[a-zA-Z]+-\s?[a-zA-Z]+\b', InputByLine) 28 | # "HyphenWords" is a list of potential hyphenated-word corrections 29 | 30 | for word in HyphenWords: 31 | WordForCheck = re.sub('[- ]','',word) 32 | # Newspapers tend to break to a new line in the middle of a word. 33 | # Therefore, most corrections just remove "-" and " " 34 | CorrectionFlag = 0 #indicator for correction 35 | if d.check(word): # if the word (with hyphen) is already correct 36 | pass #do nothing 37 | elif d.check(WordForCheck): # elif the word without "-" and " " is correct 38 | Correction = WordForCheck 39 | CorrectionFlag = 1 40 | elif d.suggest(WordForCheck): #get a suggestion 41 | ListSuggest = [w for w in d.suggest(WordForCheck) if not ' ' in w] 42 | if len(ListSuggest) > 0: 43 | DistanceSuggest = [EditDistance(w,WordForCheck) for w in ListSuggest] 44 | min_index, min_value = min(enumerate(DistanceSuggest), key=operator.itemgetter(1)) 45 | if min_value <= 3: #if the edit distance does not exceed 3 46 | Correction = ListSuggest[min_index] 47 | CorrectionFlag = 1 48 | 49 | if CorrectionFlag == 1: 50 | text = re.sub(word,Correction,text) 51 | 52 | return text 53 | 54 | #...............................................# 55 | -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/PWL.txt: -------------------------------------------------------------------------------- 1 | Abilene 2 | Akron 3 | Alameda 4 | Albany 5 | Albuquerque 6 | Alexandria 7 | Alhambra 8 | Allentown 9 | Allis 10 | Alto 11 | Amarillo 12 | Ames 13 | Anaheim 14 | Anchorage 15 | Anderson 16 | Angeles 17 | Angelo 18 | Antioch 19 | Antonio 20 | Appleton 21 | Arcadia 22 | Arlington 23 | Arthur 24 | Arvada 25 | Asheville 26 | Athens 27 | Atlanta 28 | Augusta 29 | Aurora 30 | Austin 31 | Bakersfield 32 | Baldwin 33 | Baltimore 34 | Barbara 35 | Baton 36 | Bayonne 37 | Baytown 38 | Beaumont 39 | Beaverton 40 | Bedford 41 | Bellevue 42 | Bellflower 43 | Bellingham 44 | Bend 45 | Berkeley 46 | Bernardino 47 | Berwyn 48 | Bethlehem 49 | Billings 50 | Biloxi 51 | Birmingham 52 | Bismarck 53 | Bloomington 54 | Boca 55 | Boise 56 | Bolingbrook 57 | Bossier 58 | Boston 59 | Boulder 60 | Bowie 61 | Boynton 62 | Bridgeport 63 | Bristol 64 | Britain 65 | Brockton 66 | Brooklyn 67 | Brownsville 68 | Bryan 69 | Buena 70 | Buenaventura 71 | Buffalo 72 | Burbank 73 | Burnsville 74 | Cajon 75 | Camarillo 76 | Cambridge 77 | Camden 78 | Canton 79 | Cape 80 | Carlsbad 81 | Carrollton 82 | Carson 83 | Cary 84 | Cedar 85 | Cerritos 86 | Champaign 87 | Chandler 88 | Charles 89 | Charleston 90 | Charlotte 91 | Chattanooga 92 | Chesapeake 93 | Cheyenne 94 | Chicago 95 | Chico 96 | Chicopee 97 | Chino 98 | Christi 99 | Chula 100 | Cicero 101 | Cincinnati 102 | Citrus 103 | Clair 104 | Claire 105 | Clara 106 | Clarita 107 | Clarksville 108 | Clearwater 109 | Cleveland 110 | Clifton 111 | Clovis 112 | College 113 | Collins 114 | Colorado 115 | Columbia 116 | Columbus 117 | Compton 118 | Concord 119 | Coon 120 | Coral 121 | Corona 122 | Corpus 123 | Costa 124 | Council 125 | Covina 126 | Cranston 127 | Crosse 128 | Cruces 129 | Cruz 130 | Cucamonga 131 | Cupertino 132 | Dallas 133 | Daly 134 | Danbury 135 | Davenport 136 | Davidson 137 | Davie 138 | Davis 139 | Dayton 140 | Daytona 141 | Dearborn 142 | Decatur 143 | Deerfield 144 | Delray 145 | Deltona 146 | Denton 147 | Denver 148 | Detroit 149 | Diego 150 | Dothan 151 | Downey 152 | Dubuque 153 | Duluth 154 | Durham 155 | Eagan 156 | Eau 157 | Eden 158 | Edmond 159 | Elgin 160 | Elizabeth 161 | Elkhart 162 | Elyria 163 | Encinitas 164 | Erie 165 | Escondido 166 | Euclid 167 | Eugene 168 | Evanston 169 | Evansville 170 | Everett 171 | Fairfield 172 | Falls 173 | Fargo 174 | Farmington 175 | Fayette 176 | Fayetteville 177 | Federal 178 | 
Flagstaff 179 | Flint 180 | Florissant 181 | Folsom 182 | Fontana 183 | Francisco 184 | Frederick 185 | Fremont 186 | Fresno 187 | Fullerton 188 | Gainesville 189 | Gaithersburg 190 | Galveston 191 | Garden 192 | Gardena 193 | Garland 194 | Gary 195 | Gastonia 196 | Gilbert 197 | Glendale 198 | Greeley 199 | Greensboro 200 | Greenville 201 | Gresham 202 | Gulfport 203 | Habra 204 | Hamilton 205 | Hammond 206 | Hampton 207 | Harlingen 208 | Hartford 209 | Haute 210 | Haven 211 | Haverhill 212 | Hawthorne 213 | Hayward 214 | Hemet 215 | Hempstead 216 | Henderson 217 | Hesperia 218 | Hialeah 219 | Hillsboro 220 | Hollywood 221 | Hoover 222 | Houston 223 | Huntington 224 | Huntsville 225 | Idaho 226 | Independence 227 | Indianapolis 228 | Inglewood 229 | Iowa 230 | Irvine 231 | Irving 232 | Jackson 233 | Jacksonville 234 | Janesville 235 | Jersey 236 | Johnson 237 | Joliet 238 | Jonesboro 239 | Jordan 240 | Jose 241 | Joseph 242 | Kalamazoo 243 | Kansas 244 | Kenner 245 | Kennewick 246 | Kenosha 247 | Kent 248 | Kettering 249 | Killeen 250 | Knoxville 251 | Lafayette 252 | Laguna 253 | Lakeland 254 | Lakewood 255 | Lancaster 256 | Lansing 257 | Laredo 258 | Largo 259 | Lauderdale 260 | Lauderhill 261 | Lawrence 262 | Lawton 263 | Layton 264 | Leandro 265 | Lee 266 | Lewisville 267 | Lexington 268 | Lincoln 269 | Linda 270 | Little 271 | Livermore 272 | Livonia 273 | Lodi 274 | Longmont 275 | Longview 276 | Lorain 277 | Louis 278 | Louisville 279 | Loveland 280 | Lowell 281 | Lubbock 282 | Lucie 283 | Lynchburg 284 | Lynn 285 | Lynwood 286 | Macon 287 | Madison 288 | Malden 289 | Manchester 290 | Maple 291 | Marcos 292 | Margate 293 | Maria 294 | Marietta 295 | Mateo 296 | McAllen 297 | McKinney 298 | Medford 299 | Melbourne 300 | Memphis 301 | Mentor 302 | Merced 303 | Meriden 304 | Mesa 305 | Mesquite 306 | Miami 307 | Middletown 308 | Midland 309 | Midwest 310 | Milford 311 | Milpitas 312 | Milwaukee 313 | Minneapolis 314 | Minnetonka 315 | Miramar 316 | Missoula 317 | Missouri 318 | Mobile 319 | Modesto 320 | Moines 321 | Monica 322 | Monroe 323 | Monte 324 | Montebello 325 | Monterey 326 | Montgomery 327 | Moreno 328 | Muncie 329 | Murfreesboro 330 | Nampa 331 | Napa 332 | Naperville 333 | Nashua 334 | Nashville 335 | National 336 | Newark 337 | Newport 338 | Newton 339 | Niagara 340 | Niguel 341 | Norfolk 342 | Norman 343 | Norwalk 344 | Oakland 345 | Oceanside 346 | Odessa 347 | Ogden 348 | Oklahoma 349 | Olathe 350 | Omaha 351 | Ontario 352 | Orange 353 | Orem 354 | Orland 355 | Orlando 356 | Orleans 357 | Oshkosh 358 | Overland 359 | Owensboro 360 | Oxnard 361 | Palatine 362 | Palmdale 363 | Palo 364 | Paramount 365 | Parma 366 | Pasadena 367 | Paso 368 | Passaic 369 | Paterson 370 | Paul 371 | Pawtucket 372 | Pembroke 373 | Pensacola 374 | Peoria 375 | Petaluma 376 | Peters 377 | Petersburg 378 | Philadelphia 379 | Phoenix 380 | Pico 381 | Pittsburg 382 | Pittsburgh 383 | Plaines 384 | Plano 385 | Plantation 386 | Pleasanton 387 | Plymouth 388 | Pocatello 389 | Pomona 390 | Pompano 391 | Pontiac 392 | Portland 393 | Portsmouth 394 | Prairie 395 | Providence 396 | Provo 397 | Pueblo 398 | Quincy 399 | Racine 400 | Rafael 401 | Raleigh 402 | Rancho 403 | Rapid 404 | Rapids 405 | Raton 406 | Reading 407 | Redding 408 | Redlands 409 | Redondo 410 | Redwood 411 | Reno 412 | Renton 413 | Rialto 414 | Richardson 415 | Richlands 416 | Richmond 417 | Rio 418 | Rivera 419 | Riverside 420 | Roanoke 421 | Rochelle 422 | Rochester 423 | Rockford 424 | Rocky 425 | Rosa 426 | Rosemead 427 | 
Roseville 428 | Roswell 429 | Rouge 430 | Sacramento 431 | Saginaw 432 | Salem 433 | Salinas 434 | Sandy 435 | Santee 436 | Sarasota 437 | Savannah 438 | Schaumburg 439 | Schenectady 440 | Scottsdale 441 | Scranton 442 | Seattle 443 | Sheboygan 444 | Shoreline 445 | Shreveport 446 | Simi 447 | Sioux 448 | Skokie 449 | Smith 450 | Somerville 451 | Southfield 452 | Sparks 453 | Spokane 454 | Springfield 455 | Stamford 456 | Sterling 457 | Stockton 458 | Suffolk 459 | Sugar 460 | Sunnyvale 461 | Sunrise 462 | Syracuse 463 | Tacoma 464 | Tallahassee 465 | Tamarac 466 | Tampa 467 | Taunton 468 | Taylor 469 | Taylorsville 470 | Temecula 471 | Tempe 472 | Temple 473 | Terre 474 | Thornton 475 | Toledo 476 | Topeka 477 | Torrance 478 | Tracy 479 | Trenton 480 | Troy 481 | Tucson 482 | Tulsa 483 | Turlock 484 | Tuscaloosa 485 | Tustin 486 | Tyler 487 | Upland 488 | Utica 489 | Vacaville 490 | Vallejo 491 | Vancouver 492 | Vegas 493 | Vernon 494 | Victoria 495 | Victorville 496 | Viejo 497 | Vineland 498 | Virginia 499 | Visalia 500 | Vista 501 | Waco 502 | Waltham 503 | Warren 504 | Warwick 505 | Washington 506 | Waterbury 507 | Waterloo 508 | Waukegan 509 | Waukesha 510 | Wayne 511 | Westland 512 | Westminster 513 | Wheaton 514 | Whittier 515 | Wichita 516 | Wilmington 517 | Winston 518 | Worcester 519 | Worth 520 | Wyoming 521 | Yakima 522 | Yonkers 523 | Yorba 524 | York 525 | Youngstown 526 | Yuma 527 | allen 528 | america 529 | american 530 | api 531 | apl 532 | aquacultural 533 | assistive 534 | autocad 535 | autocad 536 | autodesk 537 | bal 538 | bandsaws 539 | barcode 540 | benchtop 541 | biofuels 542 | bioinformatics 543 | blockmasons 544 | bsee 545 | burets 546 | businessobjects 547 | cae 548 | cam 549 | cannulas 550 | catheterization 551 | cdl 552 | chromatographs 553 | cics 554 | cobol 555 | comal 556 | cplus 557 | cplusplus 558 | crimpers 559 | curettes 560 | cyclers 561 | dataloggers 562 | db2 563 | dbms 564 | deburring 565 | defibrillators 566 | doppler 567 | dos 568 | dragline 569 | dynamometers 570 | echography 571 | electrocautery 572 | electrosurgical 573 | endotracheal 574 | english 575 | enteral 576 | epidiascopes 577 | extruders 578 | facebook 579 | flowmeters 580 | fluorimeters 581 | fortran 582 | freeware 583 | fundraising 584 | gauge 585 | gauges 586 | geospatial 587 | glucometers 588 | groundskeeping 589 | handheld 590 | handtrucks 591 | healthcare 592 | html 593 | html5 594 | hvac 595 | hypertext 596 | idms 597 | imagers 598 | ims 599 | inkjet 600 | internet 601 | j2ee 602 | javascript 603 | jcl 604 | krl 605 | laminators 606 | lan 607 | laryngoscopes 608 | lis 609 | locators 610 | logisticians 611 | longnose 612 | loupes 613 | manlift 614 | measurers 615 | mgmt 616 | microcentrifuges 617 | microcontrollers 618 | microplate 619 | microsoft 620 | mis 621 | ms-excel 622 | ms-power 623 | ms-word 624 | multilimb 625 | multiline 626 | multimeters 627 | mvs 628 | nebulizer 629 | needlenose 630 | netare 631 | nonfarm 632 | nonrestaurant 633 | novell 634 | offbearers 635 | onetcenter 636 | online 637 | ophthalmoscopes 638 | oracle 639 | otoscopes 640 | oximeter 641 | oximeters 642 | pascal 643 | patternmakers 644 | pdp 645 | photonics 646 | photovoltaic 647 | pipelayers 648 | pl/m 649 | pl/sql 650 | powerbuilder 651 | powerpoint 652 | psychrometers 653 | quickbooks 654 | radarbased 655 | recordkeeping 656 | reddit 657 | reflectometers 658 | sap 659 | sas 660 | screwguns 661 | scribers 662 | sharepoint 663 | sonographers 664 | spectrofluorimeters 665 | specula 666 | 
sphygmomanometers 667 | spirometers 668 | sql 669 | stimulators 670 | stumbleupon 671 | sybase 672 | syllabi 673 | tcp-ip 674 | tcpip 675 | tinners 676 | transcutaneous 677 | trephines 678 | tso 679 | ultracentrifuges 680 | univac 681 | unix 682 | vax 683 | viscosimeters 684 | visio 685 | visualbasic 686 | vms 687 | vsam 688 | vtam 689 | wattmeters 690 | webcams 691 | weighers 692 | whiteboards 693 | widemouth 694 | wordperfect 695 | workflow 696 | x-ray 697 | xray -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/TitleBase.txt: -------------------------------------------------------------------------------- 1 | abstracter 2 | abstracters 3 | abstractor 4 | abstractors 5 | accounting 6 | accountings 7 | accountant 8 | accountants 9 | actor 10 | actors 11 | actress 12 | actresses 13 | actuarial 14 | actuarials 15 | actuaries 16 | actuary 17 | acupuncturist 18 | acupuncturists 19 | adjudicator 20 | adjudicators 21 | adjuster 22 | adjusters 23 | administrator 24 | administrators 25 | advisor 26 | advisors 27 | advocate 28 | advocates 29 | aesthetician 30 | aestheticians 31 | agent 32 | agents 33 | agronomist 34 | agronomists 35 | aid 36 | aide 37 | aides 38 | aids 39 | allergist 40 | allergists 41 | ambassador 42 | ambassadors 43 | analyst 44 | analysts 45 | analyzer 46 | analyzers 47 | anchor 48 | anchors 49 | ancillaries 50 | ancillary 51 | anesthesiologist 52 | anesthesiologists 53 | anesthetist 54 | anesthetists 55 | animator 56 | animators 57 | announcer 58 | announcers 59 | anodizer 60 | anodizers 61 | anthropologist 62 | anthropologists 63 | applicator 64 | applicators 65 | appraiser 66 | appraisers 67 | apprentice 68 | apprentices 69 | aquarist 70 | aquarists 71 | arbiter 72 | arbiters 73 | arbitrator 74 | arbitrators 75 | arborist 76 | arborists 77 | archaeologist 78 | archaeologists 79 | archeologist 80 | archeologists 81 | architect 82 | architects 83 | archivist 84 | archivists 85 | arranger 86 | arrangers 87 | artisan 88 | artisans 89 | artist 90 | artists 91 | assembler 92 | assemblers 93 | assessor 94 | assessors 95 | assistant 96 | assistants 97 | associate 98 | associates 99 | asst 100 | assts 101 | astronomer 102 | astronomers 103 | astrophysicist 104 | astrophysicists 105 | athlete 106 | athletes 107 | attendant 108 | attendants 109 | attorney 110 | attorneys 111 | audience 112 | audiences 113 | audiologist 114 | audiologists 115 | audioprosthologist 116 | audioprosthologists 117 | auditor 118 | auditors 119 | author 120 | authorizer 121 | authorizers 122 | authors 123 | bacteriologist 124 | bacteriologists 125 | bagger 126 | baggers 127 | bailiff 128 | bailiffs 129 | baker 130 | bakers 131 | baler 132 | balers 133 | ballerina 134 | ballerinas 135 | bander 136 | banders 137 | banker 138 | bankers 139 | barber 140 | barbers 141 | barista 142 | baristas 143 | bartacker 144 | bartackers 145 | bartender 146 | bartenders 147 | batchmaker 148 | batchmakers 149 | bellhop 150 | bellhops 151 | bellman 152 | bellmen 153 | bender 154 | benders 155 | biller 156 | billers 157 | binder 158 | binders 159 | biochemist 160 | biochemists 161 | bioinformaticist 162 | bioinformaticists 163 | biologist 164 | biologists 165 | biometrist 166 | biometrists 167 | biophysicist 168 | biophysicists 169 | biostatistician 170 | biostatisticians 171 | biotechnician 172 | biotechnicians 173 | blacksmith 174 | blacksmiths 175 | blaster 176 | blasters 177 | blender 178 | blenders 179 | blower 180 | blowers 181 | boilermaker 182 | boilermakers 183 
| bolter 184 | bolters 185 | bookkeeper 186 | bookkeepers 187 | boss 188 | bosses 189 | bosun 190 | bosuns 191 | boy 192 | boys 193 | brakeman 194 | brakemen 195 | brazer 196 | brazers 197 | breaker 198 | breakers 199 | breeder 200 | breeders 201 | brewer 202 | brewers 203 | bricker 204 | brickers 205 | bricklayer 206 | bricklayers 207 | broker 208 | brokers 209 | buffer 210 | buffers 211 | builder 212 | builders 213 | buncher 214 | bunchers 215 | bundler 216 | bundlers 217 | businessman 218 | businessmen 219 | buster 220 | busters 221 | butcher 222 | butchers 223 | buyer 224 | buyers 225 | cabinetmaker 226 | cabinetmakers 227 | calibrator 228 | calibrators 229 | caller 230 | callers 231 | captain 232 | captains 233 | caretaker 234 | caretakers 235 | carman 236 | carmen 237 | carpenter 238 | carpenters 239 | carrier 240 | carriers 241 | cartographer 242 | cartographers 243 | carver 244 | carvers 245 | cashier 246 | cashiers 247 | caster 248 | casters 249 | caterer 250 | caterers 251 | cellist 252 | cellists 253 | ceramist 254 | ceramists 255 | champion 256 | champions 257 | changer 258 | changers 259 | chaplain 260 | chaplains 261 | chauffeur 262 | chauffeurs 263 | checker 264 | checkers 265 | chef 266 | chefs 267 | chemist 268 | chemists 269 | chief 270 | chiefs 271 | chiropractor 272 | chiropractors 273 | choreographer 274 | choreographers 275 | claim 276 | claims 277 | clarinetist 278 | clarinetists 279 | cleaner 280 | cleaners 281 | clergies 282 | clergy 283 | clerk 284 | clerks 285 | climber 286 | climbers 287 | clinician 288 | clinicians 289 | closer 290 | closers 291 | clothier 292 | clothiers 293 | coach 294 | coaches 295 | coater 296 | coaters 297 | coder 298 | coders 299 | collector 300 | collectors 301 | comedian 302 | comedians 303 | commander 304 | commanders 305 | commissioner 306 | commissioners 307 | competitor 308 | competitors 309 | compiler 310 | compilers 311 | composer 312 | composers 313 | compounder 314 | compounders 315 | comptroller 316 | comptrollers 317 | concierge 318 | concierges 319 | conciliator 320 | conciliators 321 | conductor 322 | conductors 323 | confessor 324 | confessors 325 | connector 326 | connectors 327 | conservationist 328 | conservationists 329 | conservator 330 | conservators 331 | constructor 332 | constructors 333 | consultant 334 | consultants 335 | contractor 336 | contractors 337 | controller 338 | controllers 339 | conveyor 340 | conveyors 341 | cook 342 | cooks 343 | coordinator 344 | coordinators 345 | copilot 346 | copilots 347 | cordwainer 348 | cordwainers 349 | coremaker 350 | coremakers 351 | coroner 352 | coroners 353 | correspondent 354 | correspondents 355 | cosmetologist 356 | cosmetologists 357 | costumer 358 | costumers 359 | counsel 360 | counselor 361 | counselors 362 | counsels 363 | counter 364 | counters 365 | courier 366 | couriers 367 | coutierier 368 | coutieriers 369 | couturiere 370 | couturieres 371 | coverer 372 | coverers 373 | crabber 374 | crabbers 375 | crafter 376 | crafters 377 | craftsman 378 | craftsmen 379 | criminalist 380 | criminalists 381 | cryptographer 382 | cryptographers 383 | curator 384 | curators 385 | custodian 386 | custodians 387 | cutter 388 | cutters 389 | cytogenetic 390 | cytogeneticist 391 | cytogeneticists 392 | cytogenetics 393 | cytopathologist 394 | cytopathologists 395 | cytotechnologist 396 | cytotechnologists 397 | dancer 398 | dancers 399 | dealer 400 | dealers 401 | dean 402 | deans 403 | deboner 404 | deboners 405 | deburrer 406 | deburrers 407 | decaler 408 | decalers 409 | 
decorator 410 | decorators 411 | deliverer 412 | deliverers 413 | demonstrator 414 | demonstrators 415 | dentist 416 | dentists 417 | deputies 418 | deputy 419 | dermatologist 420 | dermatologists 421 | dermatopathologist 422 | dermatopathologists 423 | designer 424 | designers 425 | detail 426 | detailer 427 | detailers 428 | details 429 | detective 430 | detectives 431 | developer 432 | developers 433 | dietician 434 | dieticians 435 | dietitian 436 | dietitians 437 | digger 438 | diggers 439 | director 440 | directors 441 | dishwashe 442 | dishwashes 443 | dispatcher 444 | dispatchers 445 | dispenser 446 | dispensers 447 | displayer 448 | displayers 449 | distributor 450 | distributors 451 | diver 452 | divers 453 | docent 454 | docents 455 | doctor 456 | doctors 457 | doorman 458 | doormen 459 | dosimetrist 460 | dosimetrists 461 | drafter 462 | drafters 463 | draftsman 464 | draftsmen 465 | draper 466 | drapers 467 | dredger 468 | dredgers 469 | dresser 470 | dressers 471 | dressmaker 472 | dressmakers 473 | driller 474 | drillers 475 | driver 476 | drivers 477 | dyer 478 | dyers 479 | ecologist 480 | ecologists 481 | economist 482 | economists 483 | editor 484 | editors 485 | educator 486 | educators 487 | electrician 488 | electricians 489 | embalmer 490 | embalmers 491 | emcee 492 | emcees 493 | employee 494 | employees 495 | endocrinologist 496 | endocrinologists 497 | engineer 498 | engineers 499 | engraver 500 | engravers 501 | entertainer 502 | entertainers 503 | epidemiologist 504 | epidemiologists 505 | erector 506 | erectors 507 | ergonomist 508 | ergonomists 509 | escort 510 | escorts 511 | esthetician 512 | estheticians 513 | estimator 514 | estimators 515 | etcher 516 | etchers 517 | evaluator 518 | evaluators 519 | examiner 520 | examiners 521 | executive 522 | executives 523 | expediter 524 | expediters 525 | expeditor 526 | expeditors 527 | expert 528 | experts 529 | extender 530 | extenders 531 | exterminator 532 | exterminators 533 | fabricator 534 | fabricators 535 | facetor 536 | facetors 537 | facialist 538 | facialists 539 | facilitator 540 | facilitators 541 | faculties 542 | faculty 543 | faller 544 | fallers 545 | farmer 546 | farmers 547 | farmworker 548 | farmworkers 549 | feeder 550 | feeders 551 | feller 552 | fellers 553 | fellow 554 | fellows 555 | fiberglasser 556 | fiberglassers 557 | fieldman 558 | fieldmen 559 | fighter 560 | fighters 561 | filer 562 | filers 563 | filler 564 | fillers 565 | finisher 566 | finishers 567 | firefighter 568 | firefighters 569 | fireman 570 | firemen 571 | firers 572 | firerss 573 | fisher 574 | fishers 575 | fitter 576 | fitters 577 | fixer 578 | fixers 579 | floorpeople 580 | floorperson 581 | florist 582 | florists 583 | follower 584 | followers 585 | foreman 586 | foremen 587 | forester 588 | foresters 589 | forwarder 590 | forwarders 591 | framer 592 | framers 593 | fundraiser 594 | fundraisers 595 | gaffer 596 | gaffers 597 | gardener 598 | gardeners 599 | gastroenterologist 600 | gastroenterologists 601 | gatherer 602 | gatherers 603 | gauger 604 | gaugers 605 | gemologist 606 | gemologists 607 | generalist 608 | generalists 609 | geneticist 610 | geneticists 611 | geodesist 612 | geodesists 613 | geographer 614 | geographers 615 | geologist 616 | geologists 617 | geophysicist 618 | geophysicists 619 | geoscientist 620 | geoscientists 621 | giver 622 | givers 623 | glazer 624 | glazers 625 | glazier 626 | glaziers 627 | goldsmith 628 | goldsmiths 629 | grader 630 | graders 631 | greeter 632 | greeters 633 | 
grinder 634 | grinders 635 | groomer 636 | groomers 637 | groundskeeper 638 | groundskeepers 639 | grower 640 | growers 641 | guard 642 | guards 643 | guide 644 | guides 645 | guru 646 | gurus 647 | gynecologist 648 | gynecologists 649 | hairdresser 650 | hairdressers 651 | hairstylist 652 | hairstylists 653 | hand 654 | handler 655 | handlers 656 | hands 657 | hanger 658 | hangers 659 | harvester 660 | harvesters 661 | hauler 662 | haulers 663 | head 664 | heads 665 | helper 666 | helpers 667 | hiker 668 | hikers 669 | histologist 670 | histologists 671 | historian 672 | historians 673 | histotechnologist 674 | histotechnologists 675 | holder 676 | holders 677 | horologist 678 | horologists 679 | horticulturist 680 | horticulturists 681 | hospitalist 682 | hospitalists 683 | host 684 | hostess 685 | hostesses 686 | hostler 687 | hostlers 688 | hosts 689 | housekeeper 690 | housekeepers 691 | hunter 692 | hunters 693 | hydrogeologist 694 | hydrogeologists 695 | hydrologist 696 | hydrologists 697 | hygienist 698 | hygienists 699 | illustrator 700 | illustrators 701 | imager 702 | imagers 703 | immunologist 704 | immunologists 705 | informaticist 706 | informaticists 707 | innkeeper 708 | innkeepers 709 | inseamer 710 | inseamers 711 | inspector 712 | inspectors 713 | installer 714 | installers 715 | instructor 716 | instructors 717 | insulator 718 | insulators 719 | internist 720 | internists 721 | interpreter 722 | interpreters 723 | interviewer 724 | interviewers 725 | investigator 726 | investigators 727 | irrigator 728 | irrigators 729 | jailer 730 | jailers 731 | jailerss 732 | jailor 733 | jailors 734 | janitor 735 | janitors 736 | jeweler 737 | jewelers 738 | jockey 739 | jockeys 740 | judge 741 | judges 742 | keeper 743 | keepers 744 | kettleman 745 | kettlemans 746 | keyer 747 | keyers 748 | knitter 749 | knitters 750 | laborer 751 | laborers 752 | lacer 753 | lacers 754 | laminator 755 | laminators 756 | lapidarist 757 | lapidarists 758 | laster 759 | lasters 760 | lawyer 761 | lawyers 762 | layer 763 | layers 764 | lead 765 | leader 766 | leaders 767 | leads 768 | lecturer 769 | lecturers 770 | liaison 771 | liaisons 772 | librarian 773 | librarians 774 | librettist 775 | librettists 776 | licensee 777 | licensees 778 | lieutenant 779 | lieutenants 780 | lifeguard 781 | lifeguards 782 | lineman 783 | linemen 784 | liner 785 | liners 786 | loader 787 | loaders 788 | lobsterman 789 | lobstermen 790 | locker 791 | lockers 792 | locksmith 793 | locksmiths 794 | logger 795 | loggers 796 | logistician 797 | logisticians 798 | lookout 799 | lookouts 800 | lubricator 801 | lubricators 802 | luthier 803 | luthiers 804 | lyricist 805 | lyricists 806 | machinist 807 | machinists 808 | magistrate 809 | magistrates 810 | maid 811 | maids 812 | maintainer 813 | maintainers 814 | maker 815 | makers 816 | mammographer 817 | mammographers 818 | manager 819 | managers 820 | manicurist 821 | manicurists 822 | marker 823 | markers 824 | marketer 825 | marketers 826 | marshal 827 | marshals 828 | mason 829 | masons 830 | massager 831 | massagers 832 | masseuse 833 | masseuses 834 | master 835 | masters 836 | mate 837 | mates 838 | mathematician 839 | mathematicians 840 | measurer 841 | measurers 842 | mechanic 843 | mechanics 844 | mediator 845 | mediators 846 | melter 847 | melters 848 | member 849 | members 850 | mender 851 | menders 852 | menderss 853 | merchandiser 854 | merchandisers 855 | merchant 856 | merchants 857 | messenger 858 | messengers 859 | metallurgist 860 | metallurgists 861 | 
meteorologist 862 | meteorologists 863 | methodologist 864 | methodologists 865 | microbiologist 866 | microbiologists 867 | midwife 868 | midwives 869 | midwivess 870 | miller 871 | millers 872 | millwright 873 | millwrights 874 | miner 875 | miners 876 | minister 877 | ministers 878 | mixer 879 | mixers 880 | mixologist 881 | mixologists 882 | model 883 | modeler 884 | modelers 885 | models 886 | molder 887 | molders 888 | monitor 889 | monitors 890 | mortician 891 | morticians 892 | motorist 893 | motorists 894 | mounter 895 | mounters 896 | mover 897 | movers 898 | musician 899 | musicians 900 | nannies 901 | nanniess 902 | nanny 903 | narrator 904 | narrators 905 | naturalist 906 | naturalists 907 | neurologist 908 | neurologists 909 | neuropsychologist 910 | neuropsychologists 911 | neuroradiologist 912 | neuroradiologists 913 | novelist 914 | novelists 915 | nurse 916 | nurses 917 | nutritionist 918 | nutritionists 919 | oboist 920 | oboists 921 | obstetric 922 | obstetrician 923 | obstetricians 924 | obstetrics 925 | offbearer 926 | offbearers 927 | officer 928 | officers 929 | official 930 | officials 931 | oiler 932 | oilers 933 | oncologist 934 | oncologists 935 | operator 936 | operators 937 | ophthalmologist 938 | ophthalmologists 939 | optician 940 | opticians 941 | optometrist 942 | optometrists 943 | originator 944 | originators 945 | orthodontist 946 | orthodontists 947 | orthoptist 948 | orthoptists 949 | orthotist 950 | orthotists 951 | overhauler 952 | overhaulers 953 | owner 954 | owners 955 | packager 956 | packagers 957 | packer 958 | packers 959 | painter 960 | painters 961 | paperhanger 962 | paperhangers 963 | paralegal 964 | paralegals 965 | paramedic 966 | paramedics 967 | parker 968 | parkers 969 | partner 970 | partners 971 | passenger 972 | passengers 973 | pastor 974 | pastors 975 | pathologist 976 | pathologists 977 | patrol 978 | patrols 979 | patternmaker 980 | patternmakers 981 | paver 982 | pavers 983 | pediatrician 984 | pediatricians 985 | pedicurist 986 | pedicurists 987 | pedorthist 988 | pedorthists 989 | people 990 | percussionist 991 | percussionists 992 | performer 993 | performers 994 | personnel 995 | personnels 996 | pewterer 997 | pewterers 998 | pharmacist 999 | pharmacists 1000 | pharmacologist 1001 | pharmacologists 1002 | philosopher 1003 | philosophers 1004 | phlebotomist 1005 | phlebotomists 1006 | photogrammetrist 1007 | photogrammetrists 1008 | photographer 1009 | photographers 1010 | physiatrist 1011 | physiatrists 1012 | physician 1013 | physicians 1014 | physicist 1015 | physicists 1016 | physiologist 1017 | physiologists 1018 | picker 1019 | pickers 1020 | pilot 1021 | pilots 1022 | pipefitter 1023 | pipefitters 1024 | pipelayer 1025 | pipelayers 1026 | pitcher 1027 | pitchers 1028 | planer 1029 | planers 1030 | planner 1031 | planners 1032 | planter 1033 | planters 1034 | plasterer 1035 | plasterers 1036 | plater 1037 | platers 1038 | player 1039 | players 1040 | plumber 1041 | plumbers 1042 | podiatrist 1043 | podiatrists 1044 | poet 1045 | poets 1046 | police 1047 | polices 1048 | polisher 1049 | polishers 1050 | politician 1051 | politicians 1052 | porter 1053 | porters 1054 | poster 1055 | posters 1056 | postmaster 1057 | postmasters 1058 | potter 1059 | potters 1060 | pourer 1061 | pourers 1062 | powderman 1063 | powdermen 1064 | practitioner 1065 | practitioners 1066 | preceptor 1067 | preceptors 1068 | preparator 1069 | preparators 1070 | preparer 1071 | preparers 1072 | president 1073 | presidents 1074 | presser 1075 | 
pressers 1076 | pressman 1077 | pressmen 1078 | priest 1079 | priests 1080 | principal 1081 | principals 1082 | printer 1083 | printers 1084 | processor 1085 | processors 1086 | producer 1087 | producers 1088 | professional 1089 | professionals 1090 | professor 1091 | professors 1092 | programer 1093 | programers 1094 | programmer 1095 | programmers 1096 | projectionist 1097 | projectionists 1098 | promoter 1099 | promoters 1100 | proofer 1101 | proofers 1102 | proofreader 1103 | proofreaders 1104 | prosthetist 1105 | prosthetists 1106 | prosthodontist 1107 | prosthodontists 1108 | provider 1109 | providers 1110 | provost 1111 | provosts 1112 | psychiatrist 1113 | psychiatrists 1114 | psychologist 1115 | psychologists 1116 | psychometrist 1117 | psychometrists 1118 | psychotherapist 1119 | psychotherapists 1120 | publisher 1121 | publishers 1122 | puller 1123 | pullers 1124 | pulmonologist 1125 | pulmonologists 1126 | pumper 1127 | pumpers 1128 | purchaser 1129 | purchasers 1130 | purser 1131 | pursers 1132 | rabbi 1133 | rabbis 1134 | radiographer 1135 | radiographers 1136 | radiologist 1137 | radiologists 1138 | raker 1139 | rakers 1140 | rancher 1141 | ranchers 1142 | ranger 1143 | rangers 1144 | rater 1145 | raters 1146 | reader 1147 | readers 1148 | realtor 1149 | realtors 1150 | recapper 1151 | recappers 1152 | receiver 1153 | receivers 1154 | receptionist 1155 | receptionists 1156 | reconditioner 1157 | reconditioners 1158 | recorder 1159 | recorders 1160 | recruiter 1161 | recruiters 1162 | rector 1163 | rectors 1164 | referee 1165 | referees 1166 | refinisher 1167 | refinishers 1168 | registrar 1169 | registrars 1170 | rep 1171 | representative 1172 | representatives 1173 | reps 1174 | reservationist 1175 | reservationists 1176 | resident 1177 | residents 1178 | responder 1179 | responders 1180 | restorer 1181 | restorers 1182 | reviewer 1183 | reviewers 1184 | rigger 1185 | riggers 1186 | riveter 1187 | riveters 1188 | rn 1189 | rns 1190 | roaster 1191 | roasters 1192 | rodbuster 1193 | rodbusters 1194 | roller 1195 | rollers 1196 | roofer 1197 | roofers 1198 | roustabout 1199 | roustabouts 1200 | rover 1201 | rovers 1202 | runner 1203 | runners 1204 | sacker 1205 | sackers 1206 | safecracker 1207 | safecrackers 1208 | sailor 1209 | sailors 1210 | sale rep 1211 | sale reps 1212 | sales 1213 | sales rep 1214 | sales reps 1215 | salesman 1216 | salesmen 1217 | salesmens 1218 | salespeople 1219 | salespeoples 1220 | salesperson 1221 | salespersons 1222 | salespersonss 1223 | saless 1224 | sampler 1225 | samplers 1226 | sander 1227 | sanders 1228 | sanitarian 1229 | sanitarians 1230 | sanitizer 1231 | sanitizers 1232 | sawer 1233 | sawers 1234 | sawyer 1235 | sawyers 1236 | scaler 1237 | scalers 1238 | scheduler 1239 | schedulers 1240 | scientist 1241 | scientists 1242 | scorer 1243 | scorers 1244 | scout 1245 | scouts 1246 | screener 1247 | screeners 1248 | sculptor 1249 | sculptors 1250 | seaman 1251 | seamen 1252 | seamstress 1253 | seamstresses 1254 | searcher 1255 | searchers 1256 | secretaries 1257 | secretariess 1258 | secretary 1259 | senior 1260 | seniors 1261 | sergeant 1262 | sergeants 1263 | server 1264 | servers 1265 | serviceman 1266 | servicemen 1267 | servicer 1268 | servicers 1269 | setter 1270 | setters 1271 | sewer 1272 | sewers 1273 | shampooer 1274 | shampooers 1275 | sheeter 1276 | sheeters 1277 | sheriff 1278 | sheriffs 1279 | shifter 1280 | shifters 1281 | shipper 1282 | shippers 1283 | silversmith 1284 | silversmiths 1285 | silviculturist 1286 | 
silviculturists 1287 | singer 1288 | singers 1289 | skycap 1290 | skycaps 1291 | slaughterer 1292 | slaughterers 1293 | slicer 1294 | slicers 1295 | slitter 1296 | slitters 1297 | smith 1298 | smiths 1299 | sociologist 1300 | sociologists 1301 | solder 1302 | solders 1303 | soloist 1304 | soloists 1305 | solver 1306 | solvers 1307 | sonographer 1308 | sonographers 1309 | sorter 1310 | sorters 1311 | specialist 1312 | specialists 1313 | speechwriter 1314 | speechwriters 1315 | spinner 1316 | spinners 1317 | splicer 1318 | splicers 1319 | splitter 1320 | splitters 1321 | sprayer 1322 | sprayers 1323 | staff 1324 | staffs 1325 | stapler 1326 | staplers 1327 | starter 1328 | starters 1329 | statistician 1330 | statisticians 1331 | steamfitter 1332 | steamfitters 1333 | stenographer 1334 | stenographers 1335 | steward 1336 | stewards 1337 | stillman 1338 | stillmen 1339 | stitcher 1340 | stitchers 1341 | stocker 1342 | stockers 1343 | stonemason 1344 | stonemasons 1345 | strategist 1346 | strategists 1347 | stripper 1348 | strippers 1349 | student 1350 | students 1351 | stylist 1352 | stylists 1353 | superintendant 1354 | superintendants 1355 | superintendent 1356 | superintendents 1357 | supervisor 1358 | supervisors 1359 | surgeon 1360 | surgeons 1361 | surveyor 1362 | surveyors 1363 | swamper 1364 | swampers 1365 | switcher 1366 | switchers 1367 | switchman 1368 | switchmen 1369 | tailor 1370 | tailors 1371 | taker 1372 | takers 1373 | tankerman 1374 | tankermen 1375 | taper 1376 | tapers 1377 | teacher 1378 | teachers 1379 | tech 1380 | teches 1381 | technician 1382 | technicians 1383 | technologist 1384 | technologists 1385 | telecommunicator 1386 | telecommunicators 1387 | telemarketer 1388 | telemarketers 1389 | teller 1390 | tellers 1391 | tender 1392 | tenders 1393 | tenor 1394 | tenors 1395 | tester 1396 | testers 1397 | therapist 1398 | therapists 1399 | ticketer 1400 | ticketers 1401 | tipper 1402 | tippers 1403 | toolmaker 1404 | toolmakers 1405 | topper 1406 | toppers 1407 | trackman 1408 | trackmen 1409 | trader 1410 | traders 1411 | trailer 1412 | trailers 1413 | trainee 1414 | trainees 1415 | trainer 1416 | trainers 1417 | transcriber 1418 | transcribers 1419 | transcriptionist 1420 | transcriptionists 1421 | translator 1422 | translators 1423 | trapper 1424 | trappers 1425 | treasurer 1426 | treasurers 1427 | treater 1428 | treaters 1429 | trimmer 1430 | trimmers 1431 | trooper 1432 | troopers 1433 | troubleshooter 1434 | troubleshooters 1435 | trucker 1436 | truckers 1437 | tuner 1438 | tuners 1439 | tutor 1440 | tutors 1441 | typesetter 1442 | typesetters 1443 | typist 1444 | typists 1445 | umpire 1446 | umpires 1447 | undertaker 1448 | undertakers 1449 | underwriter 1450 | underwriters 1451 | upholsterer 1452 | upholsterers 1453 | urologist 1454 | urologists 1455 | usher 1456 | ushers 1457 | vaccinator 1458 | vaccinators 1459 | vendor 1460 | vendors 1461 | vet 1462 | veterinarian 1463 | veterinarians 1464 | vets 1465 | videographer 1466 | videographers 1467 | violinist 1468 | violinists 1469 | violist 1470 | violists 1471 | vocalist 1472 | vocalists 1473 | volunteer 1474 | volunteers 1475 | waiter 1476 | waiters 1477 | waitress 1478 | waitresses 1479 | waitressess 1480 | walker 1481 | walkers 1482 | warden 1483 | wardens 1484 | wardenss 1485 | washer 1486 | washers 1487 | watchman 1488 | watchmen 1489 | waxer 1490 | waxers 1491 | weaver 1492 | weavers 1493 | webmaster 1494 | webmasters 1495 | weigher 1496 | weighers 1497 | welder 1498 | welders 1499 | winder 1500 | winders 
1501 | wiper 1502 | wipers 1503 | wireman 1504 | wiremen 1505 | wirer 1506 | wirers 1507 | worker 1508 | workers 1509 | wrapper 1510 | wrappers 1511 | writer 1512 | writers 1513 | yardmaster 1514 | yardmasters 1515 | zoologist 1516 | zoologists 1517 | electronic 1518 | processing 1519 | account 1520 | accounts 1521 | electronics 1522 | saleswomen 1523 | saleswoman 1524 | salesman 1525 | salesmen 1526 | clerical 1527 | clericals 1528 | medical -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/__pycache__/ExtractLDAresult.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/ExtractLDAresult.cpython-36.pyc -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/__pycache__/OCRcorrect_enchant.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/OCRcorrect_enchant.cpython-36.pyc -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/__pycache__/OCRcorrect_hyphen.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/OCRcorrect_hyphen.cpython-36.pyc -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/__pycache__/compute_spelling.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/compute_spelling.cpython-36.pyc -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/__pycache__/detect_ending.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/detect_ending.cpython-36.pyc -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/__pycache__/edit_distance.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/edit_distance.cpython-36.pyc -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/__pycache__/extract_LDA_result.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/extract_LDA_result.cpython-36.pyc -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/__pycache__/extract_information.cpython-36.pyc: 
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/extract_information.cpython-36.pyc
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/__pycache__/title_detection.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/title_detection.cpython-36.pyc
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/__pycache__/title_substitute.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/title_substitute.cpython-36.pyc
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/apst_mapping.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/apst_mapping.xlsx
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/compute_spelling.py:
--------------------------------------------------------------------------------
1 | import re
2 | import enchant
3 |
4 |
5 | #...............................................#
6 |
7 | def ComputeSpellingError(rawtext,mydict):
8 |
9 |     d = enchant.DictWithPWL("en_US", mydict)
10 |     tokens = [w for w in re.split(' ',rawtext.lower()) if not w == '']
11 |     tokens = [re.sub(r'[^a-z]','',w) for w in tokens]
12 |     tokens = [w for w in tokens if not w=='']
13 |
14 |     CountInDict = 0
15 |     CountNotInDict = 0
16 |     CountTotal = len(tokens)
17 |     if CountTotal > 0:
18 |         for word in tokens:
19 |             if len(word)==1:
20 |                 CountNotInDict += 1
21 |             elif d.check(word):
22 |                 CountInDict += 1
23 |             else:
24 |                 CountNotInDict += 1
25 |         Ratio = str(round(CountInDict/CountTotal,2))
26 |     else:
27 |         Ratio = str(0)
28 |
29 |     TotalWord = str(CountTotal)
30 |     Output = [TotalWord,Ratio]
31 |     return Output
32 | #...............................................#
33 |
34 | def RecordCorrectSpelling(rawtext):
35 |
36 |     d = enchant.Dict("en_US")
37 |     tokens = [w for w in re.split(' ',rawtext.lower()) if not w == '']
38 |     tokens = [re.sub(r'[^a-z]','',w) for w in tokens]
39 |     tokens = [w for w in tokens if len(w) >= 3]
40 |     tokens = [w for w in tokens if not w=='']
41 |
42 |     TotalWord = len(tokens)
43 |
44 |     if TotalWord > 0:
45 |         correct_tokens = [w for w in tokens if d.check(w)]
46 |         correct_tokens = [w for w in correct_tokens if not w=='']
47 |         output_text = ' '.join(correct_tokens)
48 |         WordCount = str(len(correct_tokens))
49 |     else:
50 |         output_text = ''
51 |         WordCount = str(0)
52 |
53 |     Output = [WordCount,output_text]
54 |     return Output
55 |
56 | #...............................................#
57 |
58 |
59 |
60 |
--------------------------------------------------------------------------------
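A minimal usage sketch of the two functions above, assuming pyenchant is installed, the working directory is data_cleaning (so the relative path to PWL.txt resolves), and "auxiliary files" has been added to sys.path; the ad line is hypothetical:

import sys
sys.path.append('./auxiliary files')  # make compute_spelling importable
from compute_spelling import ComputeSpellingError, RecordCorrectSpelling

# share of tokens that pass the dictionary check, with PWL.txt as the personal word list
total, ratio = ComputeSpellingError('wanted an exprienced secretary', './auxiliary files/PWL.txt')
# total = '4', ratio = '0.75' -- 'exprienced' fails the check

# keep only correctly spelled words of three or more letters
count, text = RecordCorrectSpelling('wanted an exprienced secretary')
# count = '2', text = 'wanted secretary' -- 'an' is too short, 'exprienced' is misspelled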
/data_cleaning/auxiliary files/detect_ending.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 |
4 |
5 |
6 | file_state_name = open('./auxiliary files/state_name.txt').read()
7 | state_name = [w for w in re.split('\n',file_state_name) if not w=='']
8 |
9 | StateFullname = [re.split(',',w)[0] for w in state_name]
10 | StateAbbrevation = [re.split(',',w)[1] for w in state_name]
11 |
12 | # Define the set of patterns we will use to split ads
13 |
14 | ZipCodeFullPattern = re.compile('|'.join(['\\b' + w.lower() + '.{0,3}\d{5}\\b' for w in StateFullname]),re.IGNORECASE)
15 | ZipCodeAbbPattern = re.compile('|'.join(['\\b'+w[0]+'\W?['+w[1]+w[1].lower()+'].{0,3}\d{5}\\b' for w in StateAbbrevation]))
16 |
17 | ZipCodeExtraPattern = ['tribune.?[0-9BtlifoOS]{5}', #tribune + 5 digits (the class allows common OCR misreads of digits)
18 |                        'tribune.{,5}6\d{4}',
19 |                        'chicago\s.{,6}\d{5}?'] #chicago + space + something + five digits
20 |
21 | ZipCodeExtraPattern = re.compile( '|'.join(ZipCodeExtraPattern),re.IGNORECASE ) #this one ignores case
22 |
23 | ZipCodeExtraPattern2 = ['I.?[Ll].?[Ll].?\s\d{5}', # detect ILL as Illinois
24 |                         'I[Ll]{1,2}.?\s[0-9BtlifoOS]{5}', # detect IL
25 |                         'IL.?6[0oO]{1,2}[0-9BtlifoOS]{2,3}', # detect IL
26 |                         'I.?I.?\s\d{5}', # detect II as Illinois
27 |                         'It.?\s\d{5}', # detect It + 5 numbers as Illinois
28 |                         '[Ii]n\s\d{5}\s', #In + space + five digits (as zip code)
29 |                         'MCB\s\d{3}', #MCB + space + 3 digits
30 |                         'BOX\sM[A-Z ]{2,3}\s[0-9BtlfoO]{3}', #BOX + space + M + two more characters + 3 digits
31 |                         'D.?C.?\s\d{5}'] # detect DC + 5 digits
32 |
33 | ZipCodeExtraPattern2 = re.compile( '|'.join(ZipCodeExtraPattern2) ) #Note: No "re.IGNORECASE"
34 |
35 | SteetNamePattern = ['\d{2,5}[\s\w]+\save',
36 |                     '\d{2,5}[\s\w]+\sblvd',
37 |                     '\d{2,5}[\s\w]+\sstreet',
38 |                     '\d{2,5}[\s\w]+\shgwy',
39 |                     '\d{2,5}[\s\w]+\sroad',
40 |                     '\d{2,5}\s\w*\sdrive',
41 |                     '\d{2,5}\s\w*\sst.?\sboston',
42 |                     '\d{2,5}\s\w*\sst.?\slawrence',
43 |                     '\d{2,5}\s\w*\s\w*\sst\scambridge',
44 |                     '^\d{2,5}\s\w*\sst.?\s',
45 |                     '^\d{2,5}\s\w*\s\w*\sst\W',
46 |                     '\sfloor\sboston$',
47 |                     'glo[6b]e.{,3}office'] # globe office ("b" is often misread as "6")
48 |
49 | SteetNamePattern = re.compile( '|'.join(SteetNamePattern),re.IGNORECASE )
50 |
51 | EndingPhrasePattern = ['equal opportunit(?:y|ies)', #EOE
52 |                        'affirmative.?employer\s?', #affirmative[anything]employer
53 |                        'i[nv].?confidence.?\s?', #in confidence
54 |                        'send.{,10}resume\s?',
55 |                        'apply.{,20}office',
56 |                        'submit.{,10}resume\s?',
57 |                        'please\sapply',
58 |                        'for\sfurther\sinformation.{,20}contact',
59 |                        '\d{2,4}\sext.?\s\d{2,4}', #Phone number: numbers + ext + numbers
60 |                        '\d{3}.\d{3}-\d{4}\s?'] #Phone number: 3 digits + anything + 3 digits + hyphen + 4 digits
61 |
62 | EndingPhrasePattern = re.compile('|'.join(EndingPhrasePattern),re.IGNORECASE)
63 |
64 | ListFirmIndicator = ['co','company','inc','corporation','corp','llc','incorporated']
65 | ListFirmNoTitleIndicator = ['associates','associate']
66 |
67 | #...............................................#
68 |
69 | def AssignFlag(InputString):
70 |     # this function detects addresses / ending phrases
71 |     AddressFound = False
72 |     EndingPhraseFound = False
73 |
74 |     if re.findall(ZipCodeFullPattern,InputString):
75 |         AddressFound = True
76 |     if re.findall(ZipCodeAbbPattern,InputString):
77 |         AddressFound = True
78 |     if re.findall(ZipCodeExtraPattern,InputString):
79 |         AddressFound = True
80 |     if re.findall(ZipCodeExtraPattern2,InputString):
81 |         AddressFound = True
82 |     if re.findall(SteetNamePattern,InputString):
83 |         AddressFound = True
84 |     if re.findall(EndingPhrasePattern,InputString):
85 |         EndingPhraseFound = True
86 |
87 |     return AddressFound, EndingPhraseFound
--------------------------------------------------------------------------------
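A short check of AssignFlag (a sketch, assuming the module was imported with data_cleaning as the working directory, since state_name.txt is read with a relative path at import time, and with "auxiliary files" on sys.path). Either flag marks a line where a job ad likely ends; the ad lines below are hypothetical:

from detect_ending import AssignFlag

AssignFlag('send resume to 123 main street boston')
# (True, True): a street address and the closing phrase "send ... resume" are both present

AssignFlag('seeking an experienced sales manager')
# (False, False): an ordinary ad line trips neither detector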
/data_cleaning/auxiliary files/edit_distance.py:
--------------------------------------------------------------------------------
1 | # Computing Edit Distance (all edits have unit cost) #
2 | # The code is adapted from http://www.nltk.org/_modules/nltk/metrics/distance.html
3 | #...............................................................
4 |
5 | #Creating a matrix to store output
6 | def InitializingMatrix(len1, len2):
7 |     lev = []
8 |     for i in range(len1):
9 |         lev.append([0] * len2) # initialize 2D array to zero
10 |     for i in range(len1):
11 |         lev[i][0] = i # column 0: 0,1,2,3,4,...
12 |     for j in range(len2):
13 |         lev[0][j] = j # row 0: 0,1,2,3,4,...
14 |     return lev
15 |
16 | #Say, lev = InitializingMatrix(5, 3) will give a matrix of
17 | #
18 | #[4, 0, 0]
19 | #[3, 0, 0]
20 | #[2, 0, 0]
21 | #[1, 0, 0]
22 | #[0, 1, 2]
23 | #
24 | # The next function prints this matrix out
25 | #...............................................................
26 |
27 | #Printing matrix:
28 | def PrintMatrix(mat):
29 |     NumRow = len(mat)
30 |     for ind in range(NumRow):
31 |         print(mat[NumRow-ind-1][:])
32 |
33 | #...............................................................
34 |
35 | def ComputeMinStep(lev, i, j, s1, s2):
36 |     c1 = s1[i - 1]
37 |     c2 = s2[j - 1]
38 |
39 |     # skipping a character in s1
40 |     a = lev[i - 1][j] + 1
41 |     # skipping a character in s2
42 |     b = lev[i][j - 1] + 1
43 |     # substitution
44 |     c = lev[i - 1][j - 1] + (c1 != c2)
45 |
46 |     # minimize distance in a step
47 |     lev[i][j] = min(a, b, c)
48 |
49 | #...............................................................
50 |
51 | def EditDistance(s1, s2):
52 |
53 |     len1 = len(s1)
54 |     len2 = len(s2)
55 |     lev = InitializingMatrix(len1+1, len2+1)
56 |
57 |     for i in range(len1):
58 |         for j in range(len2):
59 |             ComputeMinStep(lev, i + 1, j + 1, s1, s2)
60 |
61 |     Distance = lev[len1][len2]
62 |     return Distance
63 |
64 | #...............................................................
65 |
66 |
67 |
68 |
69 |
70 |
--------------------------------------------------------------------------------
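A few spot checks of EditDistance (a sketch, assuming "auxiliary files" is on sys.path). OCRcorrect_hyphen.py, shown earlier, accepts an enchant suggestion only when this distance from the misread word is at most 3:

from edit_distance import EditDistance

EditDistance('kitten', 'sitting')    # 3: substitute k->s, substitute e->i, insert g
EditDistance('mana-ger', 'manager')  # 1: delete the stray OCR hyphen
EditDistance('mangaer', 'manager')   # 2: a transposition costs two unit edits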
/data_cleaning/auxiliary files/example_ONET_api.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/example_ONET_api.png
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/extract_LDA_result.py:
--------------------------------------------------------------------------------
1 | import re
2 |
3 | '''
4 | The command 'lda.show_topics' gives a pretty complicated output format.
5 | For example, if given 2 words and 2 topics, it will show:
6 |
7 | [(0, [('price', 0.014396994044837077),
8 |       ('new', 0.0122260497589219)])
9 | ,
10 | (1, [('opportun', 0.020830242773974533),
11 |      ('experi', 0.019701193739871937)])]
12 |
13 | The first element belongs to the first topic:
14 |
15 | TopicKeyword[0] =
16 | (0, [('price', 0.014396994044837077), ('new', 0.0122260497589219)])
17 |
18 | TopicKeyword[0][0] = 0
19 | TopicKeyword[0][1] = [('price', 0.014396994044837077), ('new', 0.0122260497589219)]
20 |
21 | so the way to extract is to loop over TopicKeyword[Ind][1], where Ind is the topic number
22 | '''
23 |
24 | def GetWordScore(TopicKeyword):
25 |     WordScoreList = list() # list of words and their scores
26 |     for Ind in range(0,len(TopicKeyword)): #loop by topics
27 |         WordsThisTopic = TopicKeyword[Ind][1]
28 |         for WordScore in WordsThisTopic: #loop by words
29 |             Word = WordScore[0]
30 |             Score = "{0:.3f}".format(WordScore[1]) #round to 3 decimal places
31 |             #"{0:.2f}".format(13.949999999999999) = '13.95'
32 |             WordScoreList.append(str(Ind) + '\t' + Word + '\t' + str(Score))
33 |     return WordScoreList
34 |
35 | def GetWordList(WordScoreList,TopicNum):
36 |     ListWordByTopic = ['']*TopicNum
37 |     for item in WordScoreList:
38 |         Split = re.split('\t',item)
39 |         ListWordByTopic[int(Split[0])] = ListWordByTopic[int(Split[0])] + '\t' + Split[1]
40 |     return [[y for y in re.split('\t',w) if not y==''] for w in ListWordByTopic if not w=='']
41 |
42 | #...............................................#
43 |
44 | '''
45 | "docTopic" contains per-document topic scores; its length equals the number of documents.
46 | NOTE: a topic score below a certain threshold is set to zero and not reported.
47 | For example:
48 |
49 | docTopic[0] = [(0, 0.1334268305392638), (2, 0.8638742905886998)]
50 |
51 | means the first document has 0.13 for topic 0, 0 for topic 1 and 0.86 for topic 2
52 | '''
53 |
54 | def GetDocumentScore(docTopic,TopicNum):
55 |     OutputTable = list()
56 |     for Ind in range(0,len(docTopic)):
57 |         ScoreThisDoc = docTopic[Ind]
58 |         RecordScore = ['0']*TopicNum
59 |         for item in ScoreThisDoc:
60 |             RecordScore[item[0]] = "{0:.3f}".format(item[1])
61 |         OutputTable.append( '\t'.join(RecordScore) )
62 |     assert( len(docTopic) == len(OutputTable) )
63 |     return OutputTable
64 |
65 | #...............................................#
66 |
--------------------------------------------------------------------------------
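A minimal end-to-end sketch of how these helpers consume a fitted gensim model, using a hypothetical two-document toy corpus (actual keywords and scores vary from run to run):

from gensim import corpora, models
from extract_LDA_result import GetWordScore, GetWordList, GetDocumentScore

texts = [['sales', 'manager', 'experience'], ['price', 'new', 'offer']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

TopicNum = 2
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=TopicNum)

TopicKeyword = lda.show_topics(num_topics=TopicNum, num_words=2, formatted=False)
WordScoreList = GetWordScore(TopicKeyword)             # e.g. ['0\tsales\t0.167', ...]
ListWordByTopic = GetWordList(WordScoreList, TopicNum) # keywords grouped by topic

docTopic = [lda[doc] for doc in corpus]                # per-document (topic, score) pairs
ScoreTable = GetDocumentScore(docTopic, TopicNum)      # one tab-separated row per document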
/data_cleaning/auxiliary files/extract_information.py:
--------------------------------------------------------------------------------
1 | import re
2 |
3 | def RemoveCharacters(text):
4 |     # This function removes some non-grammatical characters
5 |     # and adds extra spaces around punctuation in order to facilitate
6 |     # spelling error correction.
7 |     output = text
8 |     output = output.replace('"','')
9 |     output = output.replace('.', ' . ')
10 |     output = output.replace(',', ' , ')
11 |     output = output.replace('?', ' ? ')
12 |     output = output.replace('(', ' ( ')
13 |     output = output.replace(')', ' ) ')
14 |     output = output.replace('$', ' $ ')
15 |     output = output.replace(';',' ; ')
16 |     output = output.replace('!',' ! ')
17 |     output = output.replace('}','')
18 |     output = output.replace('{','')
19 |     output = output.replace('/',' ')
20 |     output = output.replace('_',' ')
21 |     output = output.replace('*','')
22 |     return output
23 |
24 | def CleanXML(text):
25 |     # This function removes markups
26 |
27 |     output = text #initialize output
28 |
29 |     # '&lt;/p&gt;' and '&lt;p&gt;' are line-breaks
30 |     NewlinePattern = re.compile( re.escape('&lt;/p&gt;')
31 |                                  + '|'
32 |                                  + re.escape('&lt;p&gt;') )
33 |
34 |     output = re.sub(NewlinePattern,'\n',output)
35 |
36 |     # replace all other markups
37 |
38 |     XMLmarkups = ['name=&quot;ValidationSchema&quot;',
39 |                   'content=&quot;',
40 |                   '&quot;/&gt;',
41 |                   '&lt;meta']
42 |
43 |     for pattern in XMLmarkups:
44 |         output = re.sub(re.escape(pattern),'',output, flags=re.IGNORECASE)
45 |
46 |     html_header = re.compile(re.escape('&lt;')
47 |                              + '/?html/?'
48 |                              + re.escape('&gt;'))
49 |
50 |     output = re.sub(html_header,'',output)
51 |
52 |     body_header = re.compile(re.escape('&lt;')
53 |                              + '/?body/?'
54 |                              + re.escape('&gt;'))
55 |
56 |     output = re.sub(body_header,'',output)
57 |
58 |     title_header = re.compile(re.escape('&lt;')
59 |                               + '/?title/?'
60 |                               + re.escape('&gt;'))
61 |
62 |     output = re.sub(title_header,'',output)
63 |
64 |     head_header = re.compile(re.escape('&lt;')
65 |                              + '/?head/?'
66 |                              + re.escape('&gt;'))
67 |
68 |     output = re.sub(head_header,'',output)
69 |
70 |     HTTPpattern = re.compile( re.escape('http://') + '\S*'
71 |                               + re.escape('.xsd') )
72 |
73 |     output = re.sub(HTTPpattern,'',output)
74 |     output = re.sub(re.escape('&quot;'),'"',output)
75 |     output = re.sub(re.escape('&apos;'),"'",output)
76 |     output = re.sub(re.escape('&amp;'),"&",output)
77 |     output = re.sub(re.escape('&'),'',output)
78 |     output = re.sub(re.escape('<'),'',output)
79 |     output = re.sub(re.escape('>'),'',output)
80 |     output = RemoveCharacters(output)
81 |
82 |     return ' '.join([w for w in re.split(' ',output) if not w==''])
83 |
84 | def ExtractElement(text,field):
85 |     # This function takes an input string (text) and looks for markups.
86 |     # The input "field" is the specific element that the code looks for.
87 |     # For example, the page title can be located in the text as:
88 |     # <recordtitle>Display Ad 33 -- No Title</recordtitle>
89 |     # Here, the "field" variable is "recordtitle".
90 |     # (Note: all searches are case-insensitive.)
91 |
92 |     beginMarkup = '<' + field + '>' #example: <recordtitle>
93 |     endMarkup = '</' + field + '>' #example: </recordtitle>
94 |
95 |     textNoLineBreak = re.sub(r'\n|\r\n','',text) #delete the line breaks
96 |
97 |     # Windows and Linux use different line breaks ('\n' vs '\r\n')
98 |
99 |     ElementPattern = re.compile( re.escape(beginMarkup) + '.*' + re.escape(endMarkup), re.IGNORECASE )
100 |     ElementMarkup = re.compile( re.escape(beginMarkup) + '|' + re.escape(endMarkup), re.IGNORECASE)
101 |
102 |     DetectElement = re.findall(ElementPattern,textNoLineBreak)
103 |
104 |     #strip markup
105 |     Content = str(re.sub(ElementMarkup,'',str(DetectElement[0])))
106 |
107 |     #reset space
108 |     Content = ' '.join([w for w in re.split(' ',Content) if not w==''])
109 |
110 |     return Content
111 |
112 | def AssignPageIdentifier(text, journal):
113 |     # This function assigns a page identifier.
114 | # For example, 'WSJ_classifiedad_19780912_45'. 115 | # 'WSJ' is the journal name, to be specified by the user. 116 | # 'classifiedad' means the page is Classified Ad. 117 | # '19780912' is the publication date. 118 | # '45' is the page number. 119 | 120 | recordtitle = ExtractElement(text,'recordtitle') 121 | 122 | # All classified ad pages have 'recordtitle' of 'Classified Ad [number] -- No Title'. 123 | # (likewise for display ad pages) 124 | 125 | Match = re.findall('Ad \d+ -- No Title',recordtitle,re.IGNORECASE) 126 | 127 | if Match: # this page is either display ad or classified ad 128 | 129 | if re.findall('Display Ad',recordtitle,re.IGNORECASE): 130 | ad_type = 'displayad' 131 | elif re.findall('Classified Ad',recordtitle,re.IGNORECASE): 132 | ad_type = 'classifiedad' 133 | 134 | ad_number = re.findall('\d+',recordtitle)[0] # get the page number 135 | 136 | numericpubdate = ExtractElement(text,'numericpubdate') 137 | pub_date = re.findall('\d{8}',numericpubdate)[0] # get the publication date 138 | 139 | output = '_'.join([journal,ad_type,pub_date,ad_number]) # create page identifider 140 | else: 141 | output = None 142 | 143 | return output 144 | 145 | #...............................................# 146 | 147 | 148 | 149 | -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/phrase_substitutes.csv: -------------------------------------------------------------------------------- 1 | accountant auditor,accountingauditor,accountantauditor,,,,,,,,,,,,,,,, 2 | accounting clerk,accounting clerks,accounting clk,accounting clks,acct clerks,,,,,,,,,,,,,, 3 | accounting manager,accounts mgr,account manager,manager of accounting,,,,,,,,,,,,,,, 4 | administrative assistant,adm assistant,admin assistant,assistant administrator,,,,,,,,,,,,,,, 5 | assistant bookkeeper,ass t bookkeeper,,,,,,,,,,,,,,,,, 6 | assistant controller,assistant to controller,,,,,,,,,,,,,,,,, 7 | assistant credit manager,assistant credit mgr,,,,,,,,,,,,,,,,, 8 | assistant director of nursing,assistant dir nsg,,,,,,,,,,,,,,,,, 9 | assistant manager,manager assistant,,,,,,,,,,,,,,,,, 10 | assistant tax manager,assistant tax mgr,,,,,,,,,,,,,,,,, 11 | auto mechanic,automobile mechanic,automotive mechanic,,,,,,,,,,,,,,,, 12 | auto sale,automobile sales,automobile salesman,automobile salesperson,,,,,,,,,,,,,,, 13 | builder developer,builderdevelopers,,,,,,,,,,,,,,,,, 14 | chief accountant,chief acct,,,,,,,,,,,,,,,,, 15 | cost accountant,cost accounting,,,,,,,,,,,,,,,,, 16 | database administrator,data base administrator,,,,,,,,,,,,,,,,, 17 | data processing manager,manager data processing,,,,,,,,,,,,,,,,, 18 | design checker,designer checker,,,,,,,,,,,,,,,,, 19 | design draftsman,designer draftsman,design drafter,designer drafter,,,,,,,,,,,,,,, 20 | design engineer,engineer designer,designer engineer,,,,,,,,,,,,,,,, 21 | digital technician,digital tech,,,,,,,,,,,,,,,,, 22 | director of nursing,director of nurse,,,,,,,,,,,,,,,,, 23 | electronic technician,electronic tech,,,,,,,,,,,,,,,,, 24 | employment manager,manager of employment,manager employment,,,,,,,,,,,,,,,, 25 | employee relation manager,manager employee relation,,,,,,,,,,,,,,,,, 26 | engineering manager,management engineer,manager of engineering,,,,,,,,,,,,,,,, 27 | engineering technician,engineer technician,,,,,,,,,,,,,,,,, 28 | executive assistant,exec assistant,,,,,,,,,,,,,,,,, 29 | executive sale,sale executive,,,,,,,,,,,,,,,,, 30 | executive secretary,exec secretary,executive secy,executive 
secretarial,executive secty,,,,,,,,,,,,,, 31 | field sales manager,field sales manager you are,,,,,,,,,,,,,,,,, 32 | financial analyst,fin analyst,,,,,,,,,,,,,,,,, 33 | food technologist,food tech,,,,,,,,,,,,,,,,, 34 | foreman,foremen,,,,,,,,,,,,,,,,, 35 | general accountant,general accounting,,,,,,,,,,,,,,,,, 36 | host,hostesses,host hostess,hostess,hostess host,,,,,,,,,,,,,, 37 | industrial sale,sale industrial,,,,,,,,,,,,,,,,, 38 | international scout,int l scout,int scout,intl scout,,,,,,,,,,,,,,, 39 | keypunch operator,key punch operator,,,,,,,,,,,,,,,,, 40 | lab assistant,laboratory assistant,,,,,,,,,,,,,,,,, 41 | lab technician,lab tech,laboratory technician,,,,,,,,,,,,,,,, 42 | licensed electrician,lic electrician,,,,,,,,,,,,,,,,, 43 | licensed plumber,licensed plumbers,lic plumber,,,,,,,,,,,,,,,, 44 | industrial relation manager,manager industrial relation,,,,,,,,,,,,,,,,, 45 | instrument engineer,instrumentation engineer,,,,,,,,,,,,,,,,, 46 | management trainee,mgmt trainee,,,,,,,,,,,,,,,,, 47 | manager advertising,manageradvertising,,,,,,,,,,,,,,,,, 48 | manager equipment,managerequipment,,,,,,,,,,,,,,,,, 49 | manager material,managermaterials,,,,,,,,,,,,,,,,, 50 | manager plant,managerplant,,,,,,,,,,,,,,,,, 51 | manager telecommunication,managertelecommunications,,,,,,,,,,,,,,,,, 52 | manager warehousing,managerwarehousing,,,,,,,,,,,,,,,,, 53 | manufacturing engineering manager,manager manufacturing engineering,,,,,,,,,,,,,,,,, 54 | marketing analyst,market analyst,,,,,,,,,,,,,,,,, 55 | marketing director,director of marketing,,,,,,,,,,,,,,,,, 56 | marketing research analyst,market research analyst,,,,,,,,,,,,,,,,, 57 | marketing sale,marketingsales,,,,,,,,,,,,,,,,, 58 | mechanical engineer,engineer mechanical,,,,,,,,,,,,,,,,, 59 | medical technician,med tech,,,,,,,,,,,,,,,,, 60 | nurse aide,nurse s aide,,,,,,,,,,,,,,,,, 61 | nurse recruiter,nurse recruitment,,,,,,,,,,,,,,,,, 62 | nurse,nurse nursenurse,,,,,,,,,,,,,,,,, 63 | nursing assistant,nurse assistant,,,,,,,,,,,,,,,,, 64 | personnel consultant,personnel consuitants,personnel consuliants,personnel consutants,personnel cosultants,,,,,,,,,,,,,, 65 | personnel director,personnel dlrectro,director of personnel,,,,,,,,,,,,,,,, 66 | personnel manager,personnel mgr,manager of personnel,,,,,,,,,,,,,,,, 67 | personnel secretary,personnel secty,personnel secy,personnel sec,secretary personnel,,,,,,,,,,,,,, 68 | pipefitter,pipe fitter,,,,,,,,,,,,,,,,, 69 | professional help,help professional,,,,,,,,,,,,,,,,, 70 | professional employment manager,manager professional employment,,,,,,,,,,,,,,,,, 71 | professional recruiter,professional recruitment,,,,,,,,,,,,,,,,, 72 | programmer analyst cobol,programmer analystcobol,,,,,,,,,,,,,,,,, 73 | programmer analyst,programmer programmer analyst,program mere analyst,prog analyst,programmeranalyst,programmer anal yst,analyst programmer,,,,,,,,,,,, 74 | programmer,programmer programmer,,,,,,,,,,,,,,,,, 75 | programmer cobol,cobol programmer,,,,,,,,,,,,,,,,, 76 | public accountant,public accounting,,,,,,,,,,,,,,,,, 77 | punch press operator,punch pres operator,,,,,,,,,,,,,,,,, 78 | real time programmer,realtime programmer,,,,,,,,,,,,,,,,, 79 | receptionist typist,receptionisttypist,typist receptionist,typist recept,,,,,,,,,,,,,,, 80 | registered nurse,reg nurse,rn lpn,rn lpns,rn s and lpn,rn and lpns,rn or lpn,rn s lpn,rn slpn,rnlpn,rnlpns,registered nurse lpns,nurse rn,nurse registered,registered nurse staff,staff registered nurse,registered nurse s lpn,registered nurse lpn,registered nurse and 
lpn 81 | registered pharmacist,reg pharmacist,,,,,,,,,,,,,,,,, 82 | resident manager,resident mgr,,,,,,,,,,,,,,,,, 83 | sale career,salescareers,,,,,,,,,,,,,,,,, 84 | sale engineer,sales engr,sale engr,,,,,,,,,,,,,,,, 85 | sale manager,area sale manager,national sale manager,regional sale manager,,,,,,,,,,,,,,, 86 | sale marketing,sales mktg,salesmktg,marketing sale,,,,,,,,,,,,,,, 87 | sale management trainee,sales mgmt trainee,,,,,,,,,,,,,,,,, 88 | sale manager,sales management,sales manage,ales manager,sales mgr,,,,,,,,,,,,,, 89 | sale part,salesparts,,,,,,,,,,,,,,,,, 90 | sale position,and sales positions,,,,,,,,,,,,,,,,, 91 | sale professional,sale pro,professional sale,,,,,,,,,,,,,,,, 92 | sale secretary,sales secy,,,,,,,,,,,,,,,,, 93 | sale service part,salesserviceparts,,,,,,,,,,,,,,,,, 94 | sale service rental,salesservicerentals,,,,,,,,,,,,,,,,, 95 | sale service,salesservice,,,,,,,,,,,,,,,,, 96 | sale,saless,,,,,,,,,,,,,,,,, 97 | salesperson,sales person,salesman,salesmen,salesman and,salesman too,salespeople,sales ladies,sale people,,,,,,,,,, 98 | secretary assistant,secy assistant,,,,,,,,,,,,,,,,, 99 | secretary bookkeeper,secretarybookkeeper,bookkeeper secretary,,,,,,,,,,,,,,,, 100 | secretary receptionist,secretaryreceptionist,secy receptionist,receptionist secretary,receptionistsecretary,,,,,,,,,,,,,, 101 | secretary typist,secretarytypist,,,,,,,,,,,,,,,,, 102 | secretary,secretary for,,,,,,,,,,,,,,,,, 103 | senior accountant,senior acct,,,,,,,,,,,,,,,,, 104 | senior staff,enior staff,,,,,,,,,,,,,,,,, 105 | senior technical writer,senior tech writer,,,,,,,,,,,,,,,,, 106 | shipper receiver,shipperreceiver,,,,,,,,,,,,,,,,, 107 | staff accountant,staff acct,staff accts,,,,,,,,,,,,,,,, 108 | statistical typist,stat typist,,,,,,,,,,,,,,,,, 109 | stock room clerk,stockroom clerk,,,,,,,,,,,,,,,,, 110 | supervisor tax,supervisortax,,,,,,,,,,,,,,,,, 111 | system analyst programmer,programmer system analyst,system programmer analyst,programmer analyst system,,,,,,,,,,,,,,, 112 | system engineer,engineer system,,,,,,,,,,,,,,,,, 113 | technical recruiter,technical recruiter a new,,,,,,,,,,,,,,,,, 114 | technical typist,tech typist,,,,,,,,,,,,,,,,, 115 | technical writer,tech writer,,,,,,,,,,,,,,,,, 116 | test technician,test tech,,,,,,,,,,,,,,,,, 117 | tool engineer,tooling engineer,,,,,,,,,,,,,,,,, 118 | tool and die maker,tool die maker,,,,,,,,,,,,,,,,, 119 | typist clerk,clerktypist,clk typist,lerk typist,clerk typist,typist clerk typist,typistclerk,,,,,,,,,,,, 120 | vice president finance,vice presidentfinance,,,,,,,,,,,,,,,,, 121 | vice president human resource,vicepresident human resources,,,,,,,,,,,,,,,,, 122 | vice president sale,vice presidentales,,,,,,,,,,,,,,,,, 123 | vice president,vicepresident,,,,,,,,,,,,,,,,, 124 | waiter,waitresseswaiters,,,,,,,,,,,,,,,,, 125 | xray,x ray,x-ray,x- ray,x -ray,,,,,,,,,,,,,, 126 | -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/state_name.txt: -------------------------------------------------------------------------------- 1 | Alabama,AL, 2 | Alaska,AK, 3 | Arizona,AZ, 4 | Arkansas,AR, 5 | California,CA, 6 | Colorado,CO, 7 | Connecticut,CT, 8 | Delaware,DE, 9 | Florida,FL, 10 | Georgia,GA, 11 | Hawaii,HI, 12 | Idaho,ID, 13 | Illinois,IL,IIL 14 | Indiana,IN, 15 | Iowa,IA, 16 | Kansas,KS, 17 | Kentucky,KY, 18 | Louisiana,LA, 19 | Maine,ME, 20 | Maryland,MD, 21 | Massachusetts,MA, 22 | Michigan,MI, 23 | Minnesota,MN, 24 | Mississippi,MS, 25 | Missouri,MO, 26 | Montana,MT, 27 | 
Nebraska,NE, 28 | Nevada,NV, 29 | New Hampshire,NH, 30 | New Jersey,NJ, 31 | New Mexico,NM, 32 | New York,NY, 33 | North Carolina,NC, 34 | North Dakota,ND, 35 | Ohio,OH, 36 | Oklahoma,OK, 37 | Oregon,OR, 38 | Pennsylvania,PA, 39 | Rhode Island,RI, 40 | South Carolina,SC, 41 | South Dakota,SD, 42 | Tennessee,TN, 43 | Texas,TX, 44 | Utah,UT, 45 | Vermont,VT, 46 | Virginia,VA, 47 | Washington,WA, 48 | West Virginia,WV, 49 | Wisconsin,WI, 50 | Wyoming,WY, 51 | -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/title_detection.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import sys 4 | 5 | def DetermineUppercase(string): 6 | # This function determines whether a line is uppercase 7 | # There are cases where a line contains some non-uppercase characters as well 8 | # example: ENGINEERS MICRowAVE...which should be considered as uppercase 9 | # This helps detect job titles 10 | StringUppercase = re.sub('[^A-Z]','',string) # take out all non-uppercase characters 11 | if string.isupper(): # perfect uppercase 12 | Output = True 13 | elif len(string) > 4 and len(StringUppercase)/len(string) >= 0.8: 14 | # this line allows some "imperfect" uppercase lines 15 | # (the string is long enough and contains at least 80% uppercase characters) 16 | Output = True 17 | else: 18 | Output = False 19 | return Output 20 | 21 | #...............................................# 22 | 23 | def IndexAll(word,tokens): # IndexAll('b',['a','b','c','b','c','c']) = [1, 3] 24 | return [i for i,v in enumerate(tokens) if v == word] 25 | 26 | #...............................................# 27 | 28 | def NextWordIsNotNumber(word,tokens): 29 | Output = True 30 | for location in IndexAll(word,tokens): 31 | if location == len(tokens) - 1: # if the word is the last word -- skip 32 | pass 33 | elif re.findall('\d', tokens[location + 1] ): 34 | Output = False 35 | return Output 36 | 37 | #...............................................# 38 | 39 | def UppercaseNewline(ListByLine,LineBreak): 40 | # This function adds an extra line break when an uppercase word or phrase is found 41 | # The purpose of this function is to break out the uppercase phrases within a line that contains 42 | # both upper and lower case words 43 | OutputResetLine = list() 44 | for line in ListByLine: 45 | if line.isupper(): #ignore if the whole line is already uppercase 46 | OutputResetLine.append(line) #just write it down exactly the same 47 | elif len(re.findall(r'[a-z]',line)) >= 5: #the line must contain some lowercase characters 48 | ResetThisLine = list() 49 | tokens = [w for w in re.split(' ',line) if not w==''] 50 | for word in tokens: 51 | WordNoHyphen = re.sub('-','',word) 52 | if WordNoHyphen.isupper() and len(WordNoHyphen) >= 2 and NextWordIsNotNumber(word,tokens): 53 | # if the word is uppercase, is long enough and is NOT followed by a set of numbers 54 | # (because a set of uppercase letters followed by numbers could be a zip code!) 55 | ResetThisLine.append(LineBreak + word + LineBreak) 56 | else: 57 | ResetThisLine.append(word) 58 | OutputResetLine.append(' '.join(ResetThisLine)) 59 | else: 60 | OutputResetLine.append(line) #just write it down exactly the same 61 | 62 | # At this point, some elements in the "OutputResetLine" would contain more than one line.
63 | # We want to convert this list such that one element is one line 64 | # This can be done by (1) joining everything with 'LineBreak' and (2) splitting again 65 | OutputResetLine = [w for w in re.split(LineBreak,LineBreak.join(OutputResetLine)) if not w==''] #reset lines 66 | return OutputResetLine 67 | 68 | #...............................................# 69 | 70 | def CombineUppercase(ListByLine): 71 | 72 | # This function combines short consecutive uppercase lines together to facilitate job title detection 73 | # For example: "SALE\nMANAGER\nWanted" >>> "SALE MANAGER\nWanted" 74 | # See the DetermineUppercase(string) function above for a broader definition of "uppercase". 75 | 76 | ListByLineNotEmpty = [w for w in ListByLine if re.findall(r'[a-zA-Z0-9]',w)] 77 | # take out lines where no a-z, A-Z or 0-9 is found (empty lines) 78 | 79 | OutputResetLine = [''] #initialize output 80 | CurrentLine = 0 # current line number 81 | PreviousShortUpper = False # indicator that the previous line is short uppercase 82 | 83 | for line in ListByLineNotEmpty: 84 | LineNoSpace = re.sub('[^a-zA-Z]','',line) #this only serves the purpose of detecting an uppercase line 85 | if DetermineUppercase(LineNoSpace) and PreviousShortUpper == True: # if this line AND the previous one are uppercase 86 | tokens = [w for w in re.split(' ',line) if not w==''] 87 | if len(tokens) <= 3: #the line must be short enough 88 | #add this line to the previous one 89 | # NOTE: "CurrentLine" does not get +1 90 | OutputResetLine[CurrentLine] = OutputResetLine[CurrentLine] + ' ' + re.sub('[^A-Z0-9- ]','',line.upper()) 91 | PreviousShortUpper = True 92 | else: #even if the line is uppercase -- ignore it and write it down as normal if it is too long 93 | PreviousShortUpper = False 94 | OutputResetLine.append('') # prepare a new empty line 95 | CurrentLine += 1 # moving on to the next line 96 | OutputResetLine[CurrentLine] = line 97 | PreviousShortUpper = False 98 | elif DetermineUppercase(LineNoSpace) and PreviousShortUpper == False: 99 | # if the line is uppercase BUT the previous one is not => start a new line AND change "PreviousShortUpper" to "True" 100 | OutputResetLine.append('') # prepare a new empty line 101 | CurrentLine += 1 # moving on to the next line 102 | OutputResetLine[CurrentLine] = re.sub('[^A-Z0-9- ]','',line.upper()) 103 | PreviousShortUpper = True # change status 104 | else: # if the line is not uppercase => just write it down as it normally should be 105 | OutputResetLine.append('') # prepare a new empty line 106 | CurrentLine += 1 # moving on to the next line 107 | OutputResetLine[CurrentLine] = line 108 | PreviousShortUpper = False 109 | OutputResetLine = [w for w in OutputResetLine if not w==''] # delete empty lines 110 | return OutputResetLine 111 | 112 | #...............................................# 113 | 114 | def CheckNoTXTLost(list1, list2, AllFlag): 115 | #this function checks that "list1" and "list2" contain exactly the same string of characters 116 | combine_list1 = re.sub( AllFlag,'',''.join(list1).lower() ) #take out all flags (title, firm names, etc...)
117 | combine_list2 = re.sub( AllFlag,'',''.join(list2).lower() ) 118 | if re.sub( '\W|\s','',combine_list1) == re.sub( '\W|\s','',combine_list2): #test 119 | output = True 120 | else: 121 | output = False 122 | return output 123 | 124 | #...............................................# -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/title_substitute.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import sys 4 | import platform 5 | import shutil 6 | import enchant, difflib 7 | import io 8 | 9 | d = enchant.DictWithPWL("en_US", 'myPWL.txt') 10 | 11 | #...............................................# 12 | # This python module cleans job titles 13 | # (1.) substitute word-by-word: includes plural => singular, abbreviations... 14 | # (2.) substitute phrases 15 | # (3.) general plural to singular transformation 16 | #...............................................# 17 | 18 | def WordSubstitute(InputString, word_substitutes): 19 | # This function makes word-by-word substitutions (See: word_substitutes.csv) 20 | # For each row, everything in the second through last columns will be substituted with the first column 21 | # For example, one row reads "assistant | assistants | asst | asst. | assts" 22 | # If any of "assistants", "asst." or "assts" is found, it will be substituted with simply "assistant" 23 | 24 | InputTokens = [w for w in re.split('\s|-', InputString.lower()) if not w==''] 25 | 26 | ListBase = [re.split(',', w)[0] for w in word_substitutes] # list of everything in the first column 27 | 28 | RegexList = ['|'.join(['\\b'+y+'\\b' for y in re.split(',', w)[1:] if not y=='']) for w in word_substitutes] 29 | # regular expressions of everything in the second through last columns 30 | 31 | OutputTokens = InputTokens[:] #copy the input tokens to initialize the output 32 | 33 | for tokenInd in range(0,len(OutputTokens)): 34 | token = OutputTokens[tokenInd] # (1) For each word... 35 | for regexInd in range(0,len(RegexList)): 36 | regex = RegexList[regexInd] # (2) ...for each set of regular expressions... 37 | baseForm = ListBase[regexInd] 38 | if re.findall(re.compile(regex),token): # (3) ...if the word matches the set of regular expressions... 39 | OutputTokens[tokenInd] = baseForm # (4) ...the word becomes the baseForm = the value of the first column. 40 | return ' '.join(OutputTokens) 41 | 42 | #...............................................# 43 | 44 | def PhraseSubstitute(InputString, phrase_substitutes): 45 | # This function makes phrase substitutions (See: phrase_substitutes.csv) 46 | # The format is similar to word_substitutes.csv 47 | # Example: 'assistant tax mgr' will be substituted with 'assistant tax manager' 48 | 49 | ListBase = [re.split(',',w)[0] for w in phrase_substitutes] 50 | RegexList = ['|'.join(['\\b'+y+'\\b' for y in re.split(',',w)[1:] if not y=='']) for w in phrase_substitutes] 51 | 52 | OutputString = InputString.lower() 53 | 54 | # Unlike the WordSubstitute(.) function, this one looks at the whole InputString and makes substitutions.
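# For example, a row reading "assistant tax manager,assistant tax mgr," yields the regex
# r'\bassistant tax mgr\b', so "senior assistant tax mgr" becomes "senior assistant tax manager".
# (Illustrative row for exposition only; see phrase_substitutes.csv for the actual entries.)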
55 | 56 | for regexInd in range(0,len(RegexList)): 57 | regex = RegexList[regexInd] 58 | baseForm = ListBase[regexInd] 59 | if re.findall(re.compile(regex),OutputString): # match against OutputString so earlier substitutions are kept 60 | OutputString = re.sub(re.compile(regex),baseForm,OutputString) 61 | return OutputString 62 | 63 | #...............................................# 64 | 65 | def SingularSubstitute(InputString): 66 | # This function performs a general plural-to-singular transformation 67 | # Note that several frequently appearing words are already listed manually in "word_substitutes.csv" 68 | 69 | InputTokens = [w for w in re.split(' ', InputString.lower()) if not w==''] 70 | OutputTokens = InputTokens[:] #initialize output to be exactly as input 71 | 72 | for tokenInd in range(0,len(OutputTokens)): 73 | 74 | token = OutputTokens[tokenInd] 75 | corrected_token = '' 76 | 77 | if d.check(token): # To be conservative, only look at words for which d.check(.) is true 78 | if re.findall('\w+ies$',token): 79 | # if the word ends with 'ies', change 'ies' to 'y' 80 | corrected_token = re.sub('ies$','y',token) 81 | elif re.findall('\w+ches$|\w+ses$|\w+xes|\w+oes$',token): 82 | # if the word ends with 'ches', 'ses', 'xes', 'oes', drop the 'es' 83 | corrected_token = re.sub('es$','',token) 84 | elif re.findall('\w+s$',token): 85 | # if the word ends with 's' BUT NOT 'ss' (this is to prevent changing words like 'business') 86 | if not re.findall('\w+ss$',token): 87 | corrected_token = re.sub('s$','',token) # drop the 's' 88 | 89 | if len(corrected_token) >= 3 and d.check(corrected_token): 90 | #finally, make a substitution only if the word is at least 3 characters long... 91 | # AND the corrected word is actually in the dictionary! 92 | OutputTokens[tokenInd] = corrected_token 93 | 94 | return ' '.join(OutputTokens) 95 | 96 | #...............................................# 97 | 98 | def substitute_titles(InputString,word_substitutes,phrase_substitutes): 99 | # This is the main function 100 | 101 | # (1.) Initial cleaning: 102 | CleanedString = re.sub('[^A-Za-z- ]','',InputString) 103 | CleanedString = re.sub('-',' ',CleanedString.lower()) 104 | CleanedString = ' '.join([w for w in re.split(' ', CleanedString) if not w=='']) 105 | 106 | # (2.) Three types of substitutions: 107 | 108 | if len(CleanedString) >= 1: 109 | CleanedString = PhraseSubstitute(CleanedString, phrase_substitutes) 110 | CleanedString = WordSubstitute(CleanedString, word_substitutes) 111 | CleanedString = SingularSubstitute(CleanedString) 112 | CleanedString = PhraseSubstitute(CleanedString, phrase_substitutes) 113 | 114 | # (3.) Get rid of duplicated words: 115 | # This step reduces the dimensionality of the title. 116 | # For example, "sale sale engineer sale " would be reduced to simply "sale engineer" 117 | 118 | ListTokens = [w for w in re.split(' ',CleanedString) if not w==''] 119 | FinalTokens = list() 120 | 121 | for token in ListTokens: # for each word... 122 | if not token in FinalTokens: # ...if that word has NOT appeared before... 123 | FinalTokens.append(token) # ...append that word to the final result.
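# (Note: a set() would also remove duplicates, but looping over a list keeps the first
# occurrence of each word, so the original word order of the title is preserved.)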
124 | 125 | return ' '.join(FinalTokens) 126 | 127 | #...............................................# 128 | -------------------------------------------------------------------------------- /data_cleaning/auxiliary files/word_substitutes.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/word_substitutes.csv -------------------------------------------------------------------------------- /data_cleaning/initial_cleaning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# The Initial Text Cleaning\n", 8 | "Online supplementary material to \"The Evolution of Work in the United States\" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum. \n", 9 | "\n", 10 | "* [Project data library](https://occupationdata.github.io) \n", 11 | "\n", 12 | "* [GitHub repository](https://github.com/phaiptt125/newspaper_project)\n", 13 | "\n", 14 | "***" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "This IPython notebook demonstrates initial processing of the raw text, provided by ProQuest. The main components of this step are to retrieve document metadata, to remove markup from the newspaper text, and to perform an initial spell-check of the text." 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | " Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. \n", 29 | "***" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "## List of auxiliary files (see project data library or GitHub repository)" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "* *extract_information.py* : This python code removes markup and extracts relevant information.\n", 44 | "* *edit_distance.py* : This python code computes string edit distance, used in the spelling correction procedure.\n", 45 | "* *OCRcorrect_enchant.py* : This python code performs basic word-by-word spelling error correction.\n", 46 | "* *PWL.txt* : This file contains words such as software and state names that are not contained in the dictionary provided by python's enchant module.\n", 47 | "***" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "## Import python modules" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 1, 60 | "metadata": { 61 | "collapsed": true 62 | }, 63 | "outputs": [], 64 | "source": [ 65 | "import os\n", 66 | "import re\n", 67 | "import sys\n", 68 | "import enchant #spelling correction module\n", 69 | "\n", 70 | "sys.path.append('./auxiliary files')\n", 71 | "\n", 72 | "from extract_information import *\n", 73 | "from edit_distance import *\n", 74 | "from OCRcorrect_enchant import *" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "## Import raw text file\n", 82 | "\n", 83 | "ProQuest has provided us with text files which have been transcribed from scanned images of newspaper pages. The file 'ad_sample.txt', as shown below, is one of these text files. ProQuest only provided us with the information that this file belongs to a page of the Wall Street Journal.
" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 2, 89 | "metadata": {}, 90 | "outputs": [ 91 | { 92 | "name": "stdout", 93 | "output_type": "stream", 94 | "text": [ 95 | " TDM_Record_v1.0.xsd 4a667155d557ab68c878224bc3de0979 Classified Ad 45 -- No Title Sep 12, 1978 19780912 classified_ad Classified Advertisement Advertisement Copyright Dow Jones & Company Inc Sep 12, 1978 English 506733 45441 Wall Street Journal (1923 - Current file) &lt;html&gt; &lt;head&gt; &lt;meta name=&quot;ValidationSchema&quot; content=&quot;http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd&quot;/&gt; &lt;title/&gt; &lt;/head&gt; &lt;body&gt; &lt;p&gt; Singer has long been one of the world s gr &apos; pacesetters in volume manufacturing of intricate, &lt;/p&gt; &lt;p&gt; precision machines that achieve extreme reliability and durability. Our sewing machines are in use around the globe in every kind of climate. As pioneers in electronic sewing machines, we have again set new standards. &lt;/p&gt; &lt;p&gt; ELECTROMECHANICAL ENGINEERS, &lt;/p&gt; &lt;p&gt; Minimum of 6 eara experience in developing of electromechanical consumer or atm&amp;gt;lar products. BSME or BSEE degree required, &lt;/p&gt; &lt;p&gt; advanced degree preferred. &lt;/p&gt; &lt;p&gt; ELECTRONIC ENGINEERS MECHANICAL ENGINEERS &lt;/p&gt; &lt;p&gt; A least 2+ years is needed in one of 2+ years of practical in mechanisms the following areas: and machine design analysis. Working know! edga of computers as a design tool would be &lt;/p&gt; &lt;p&gt; 1) Analog and digital industrial electron helpful. Experience in sophisticated , with microprocessor and CAD knowl chanical products. Background should include edge desirable; mechanism or gear or machine design 2) Analog sad digital circuitry, logic de and analysis. Knowledge of computers as , PC bond design, ISI and minicom neering ardes helpful &lt;/p&gt; &lt;p&gt; puter ; &lt;/p&gt; &lt;p&gt; S) Application of mini and micro-computers including , and hardware de- &lt;/p&gt; &lt;p&gt; bugging of analog and digital circuitry. &lt;/p&gt; &lt;p&gt; DESIGNERS, JUNIOR SPECIALIST AND SENIOR &lt;/p&gt; &lt;p&gt; Ezperience in fractional and AC 1-8 Years experience in precision high toler _ and DC motors and motor control system as ante design of mechanical devices and/or circuit well as other electromechanical devices. layout. Intricate detailing experience mandato- &lt;/p&gt; &lt;p&gt; ry. Singer offers attractive salaries, benefits and professional working conditions, and very favorable career . These positions are located at our Elizabeth, New Jersey facility and at our R&amp;amp;D Laboratory in Fairfield, New Jersey. &lt;/p&gt; &lt;p&gt; Please send resume stating position of interest in confidence to: &lt;/p&gt; &lt;p&gt; Hosie Scott, Employment Manager &lt;/p&gt; &lt;p&gt; or call (201) 527-6166 or 67 &lt;/p&gt; &lt;p&gt; SINGER &lt;/p&gt; &lt;p&gt; DIVERSIFIED WORL. 321 First Street &lt;/p&gt; &lt;p&gt; Elizabeth, New Jersey 07207 An Equal Opportunity Employer M/F &lt;/p&gt; &lt;/body&gt; &lt;/html&gt; \n" 96 | ] 97 | } 98 | ], 99 | "source": [ 100 | "# input files\n", 101 | "input_file = 'ad_sample.txt'\n", 102 | "\n", 103 | "# bring in raw ads \n", 104 | "raw_ad = open(input_file).read()\n", 105 | "print(raw_ad)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "***\n", 113 | "Relevant information we have to extract is:\n", 114 | "\n", 115 | "1. publication date - \"19780912\" (September 12, 1978)\n", 116 | "2. 
page title - \"Classified Ad 45\" (classified ad, page 45)\n", 117 | "3. content - all text in the \"fulltext\" field\n", 118 | "\n", 119 | "Fortunately, job ads appear only in either \"Display Ad\" or \"Classified Ad\" pages. As such, we only need to include pages that are either \"Display Ad\" or \"Classified Ad\" in this step." 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "However, not all pages in \"Display Ad\" or \"Classified Ad\" are job ads. The next step, as demonstated in the next IPython notebook [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/LDA.ipynb), is to use Latent Dirichlet Allocation (LDA) procedure to idenfity which pages are job ads." 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "## Assign unique page identifier\n", 134 | "* Assign a unique identifier for each newpaper page that is either Display Ad or Classified Ad." 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 3, 140 | "metadata": {}, 141 | "outputs": [ 142 | { 143 | "name": "stdout", 144 | "output_type": "stream", 145 | "text": [ 146 | "WSJ_classifiedad_19780912_45\n" 147 | ] 148 | } 149 | ], 150 | "source": [ 151 | "page_identifier = AssignPageIdentifier(raw_ad, 'WSJ') # see extract_information.py\n", 152 | "print(page_identifier)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "The value \"WSJ_classifiedad_19780912_45\" refers to the 45th page of classified ads in the September 12, 1978 edition of the Wall Street Journal.\n", 160 | "\n", 161 | "## Extract posting and remove markup" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 4, 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "name": "stdout", 171 | "output_type": "stream", 172 | "text": [ 173 | "\n", 174 | " Singer has long been one of the world s gr ' pacesetters in volume manufacturing of intricate , \n", 175 | " \n", 176 | " precision machines that achieve extreme reliability and durability . Our sewing machines are in use around the globe in every kind of climate . As pioneers in electronic sewing machines , we have again set new standards . \n", 177 | " \n", 178 | " ELECTROMECHANICAL ENGINEERS , \n", 179 | " \n", 180 | " Minimum of 6 eara experience in developing of electromechanical consumer or atmlar products . BSME or BSEE degree required , \n", 181 | " \n", 182 | " advanced degree preferred . \n", 183 | " \n", 184 | " ELECTRONIC ENGINEERS MECHANICAL ENGINEERS \n", 185 | " \n", 186 | " A least 2+ years is needed in one of 2+ years of practical in mechanisms the following areas: and machine design analysis . Working know ! edga of computers as a design tool would be \n", 187 | " \n", 188 | " 1 ) Analog and digital industrial electron helpful . Experience in sophisticated , with microprocessor and CAD knowl chanical products . Background should include edge desirable ; mechanism or gear or machine design 2 ) Analog sad digital circuitry , logic de and analysis . Knowledge of computers as , PC bond design , ISI and minicom neering ardes helpful \n", 189 | " \n", 190 | " puter ; \n", 191 | " \n", 192 | " S ) Application of mini and micro-computers including , and hardware de- \n", 193 | " \n", 194 | " bugging of analog and digital circuitry . 
\n", 195 | " \n", 196 | " DESIGNERS , JUNIOR SPECIALIST AND SENIOR \n", 197 | " \n", 198 | " Ezperience in fractional and AC 1-8 Years experience in precision high toler and DC motors and motor control system as ante design of mechanical devices and or circuit well as other electromechanical devices . layout . Intricate detailing experience mandato- \n", 199 | " \n", 200 | " ry . Singer offers attractive salaries , benefits and professional working conditions , and very favorable career . These positions are located at our Elizabeth , New Jersey facility and at our R ; D Laboratory in Fairfield , New Jersey . \n", 201 | " \n", 202 | " Please send resume stating position of interest in confidence to: \n", 203 | " \n", 204 | " Hosie Scott , Employment Manager \n", 205 | " \n", 206 | " or call ( 201 ) 527-6166 or 67 \n", 207 | " \n", 208 | " SINGER \n", 209 | " \n", 210 | " DIVERSIFIED WORL . 321 First Street \n", 211 | " \n", 212 | " Elizabeth , New Jersey 07207 An Equal Opportunity Employer M F \n", 213 | "\n" 214 | ] 215 | } 216 | ], 217 | "source": [ 218 | "# extract field \n", 219 | "fulltext = ExtractElement(raw_ad,'fulltext') # see extract_information.py\n", 220 | "# remove xml markups\n", 221 | "posting = CleanXML(fulltext) # see extract_information.py\n", 222 | "print(posting)" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "## Perform basic spelling error correction, remove extra spaces and empty lines " 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 5, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "name": "stdout", 239 | "output_type": "stream", 240 | "text": [ 241 | "Singer has long been one of the world s gr ' pacesetters in volume manufacturing of intricate ,\n", 242 | "precision machines that achieve extreme reliability and durability . Our sewing machines are in use around the globe in every kind of climate . As pioneers in electronic sewing machines , we have again set new standards .\n", 243 | "ELECTROMECHANICAL ENGINEERS ,\n", 244 | "Minimum of 6 Meara experience in developing of electromechanical consumer or atmlar products . BSME or B SEE degree required ,\n", 245 | "advanced degree preferred .\n", 246 | "ELECTRONIC ENGINEERS MECHANICAL ENGINEERS\n", 247 | "A least 2+ years is needed in one of 2+ years of practical in mechanisms the following areas and machine design analysis . Working know ! Edgar of computers as a design tool would be\n", 248 | "1 ) Analog and digital industrial electron helpful . Experience in sophisticated , with microprocessor and CAD kn owl mechanical products . Background should include edge desirable ; mechanism or gear or machine design 2 ) Analog sad digital circuitry , logic de and analysis . Knowledge of computers as , PC bond design , IS I and mini com sneering ares helpful\n", 249 | "pouter ;\n", 250 | "S ) Application of mini and microcomputers including , and hardware de-\n", 251 | "bugging of analog and digital circuitry .\n", 252 | "DESIGNERS , JUNIOR SPECIALIST AND SENIOR\n", 253 | "Experience in fractional and AC 1-8 Years experience in precision high tooler and DC motors and motor control system as ante design of mechanical devices and or circuit well as other electromechanical devices . layout . Intricate detailing experience mandatory\n", 254 | "ry . Singer offers attractive salaries , benefits and professional working conditions , and very favorable career . 
These positions are located at our Elizabeth , New Jersey facility and at our R ; D Laboratory in Fairfield , New Jersey .\n", 255 | "Please send resume stating position of interest in confidence to:\n", 256 | "Hosier Scott , Employment Manager\n", 257 | "or call ( 201 ) 527-6166 or 67\n", 258 | "SINGER\n", 259 | "DIVERSIFIED WHORL . 321 First Street\n", 260 | "Elizabeth , New Jersey 07207 An Equal Opportunity Employer M F\n" 261 | ] 262 | } 263 | ], 264 | "source": [ 265 | "posting_by_line = [w for w in re.split('\\n',posting) if len(w)>0] \n", 266 | "clean_posting_by_line = list()\n", 267 | " \n", 268 | "for line in posting_by_line:\n", 269 | " clean_line = line\n", 270 | " # spelling error correction\n", 271 | " clean_line = EnchantErrorCorrection(clean_line, 'PWL.txt')\n", 272 | " # remove extra white spaces\n", 273 | " clean_line = ' '.join([w for w in re.split(' ',clean_line) if not w=='']) \n", 274 | " clean_posting_by_line.append(clean_line)\n", 275 | "\n", 276 | "# remove empty lines\n", 277 | "clean_posting_by_line = [w for w in clean_posting_by_line if not w=='']\n", 278 | "\n", 279 | "# print final output of this step\n", 280 | "print('\\n'.join(clean_posting_by_line))" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "The final output of this step is the variable \"clean_posting_by_line\". The next step, as demonstrated in the next IPython notebook [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/LDA.ipynb), is to use the Latent Dirichlet Allocation (LDA) procedure to identify which pages are job ads. " 288 | ] 289 | } 290 | ], 291 | "metadata": { 292 | "kernelspec": { 293 | "display_name": "Python 3", 294 | "language": "python", 295 | "name": "python3" 296 | }, 297 | "language_info": { 298 | "codemirror_mode": { 299 | "name": "ipython", 300 | "version": 3 301 | }, 302 | "file_extension": ".py", 303 | "mimetype": "text/x-python", 304 | "name": "python", 305 | "nbconvert_exporter": "python", 306 | "pygments_lexer": "ipython3", 307 | "version": "3.6.1" 308 | } 309 | }, 310 | "nbformat": 4, 311 | "nbformat_minor": 1 312 | } 313 | -------------------------------------------------------------------------------- /data_cleaning/structured_data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Transforming Unstructured Text into Structured Data \n", 8 | "Online supplementary material to \"The Evolution of Work in the United States\" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum. \n", 9 | "\n", 10 | "* [Project data library](https://occupationdata.github.io) \n", 11 | "\n", 12 | "* [GitHub repository](https://github.com/phaiptt125/newspaper_project)\n", 13 | "\n", 14 | "***" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "This IPython notebook demonstrates how we finally transform unstructured newspaper text into structured data (a spreadsheet). In the previous steps, we:\n", 22 | "\n", 23 | "* Retrieve document metadata, remove markup from the newspaper text, and perform an initial spell-check of the text (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/initial_cleaning.ipynb)).
\n", 24 | "* Exclude non-job ad pages (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/LDA.ipynb)).\n", 25 | "\n", 26 | "The main components of this step are to identify the job title, discern the boundaries between job ads, and transform relevant information into structured data. \n", 27 | "\n" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | " Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. \n", 35 | "***" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "### List of auxiliary files (see project data library or GitHub repository)\n", 43 | "* *title_detection.py* : This python code detects job titles. \n", 44 | "* *detect_ending.py* : This python code detects ending patterns of ads.\n", 45 | "* *TitleBase.txt* : A list of job title words " 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "***\n", 53 | "## Import necessary modules" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 1, 59 | "metadata": { 60 | "collapsed": true 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "import os\n", 65 | "import re\n", 66 | "import sys\n", 67 | "import pandas as pd\n", 68 | "\n", 69 | "import nltk\n", 70 | "from nltk.corpus import stopwords\n", 71 | "from nltk.tokenize import word_tokenize\n", 72 | "from nltk.stem.snowball import SnowballStemmer\n", 73 | " \n", 74 | "stop_words = set(stopwords.words('english'))\n", 75 | "stemmer = SnowballStemmer(\"english\")\n", 76 | "\n", 77 | "sys.path.append('./auxiliary files')\n", 78 | "\n", 79 | "from title_detection import *\n", 80 | "from detect_ending import *" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": { 86 | "collapsed": true 87 | }, 88 | "source": [ 89 | "## Import job ad pages\n", 90 | "\n", 91 | "We present an example describing how our procedure identifies job ads' boundaries and their job titles on a snippet of Display Ad page 226, from the January 14, 1979 Boston Globe (page identifer: \"Globe_displayad_19790114_226\"). " 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "* The text file has already been cleaned by retrieving document metadata, removing markup from the newspaper text, and correcting spelling errors of the text (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/initial_cleaning.ipynb) for detail). \n", 99 | "* We have already classified this page to be related to job ads (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/LDA.ipynb) for detail)." 
100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 2, 105 | "metadata": {}, 106 | "outputs": [ 107 | { 108 | "name": "stdout", 109 | "output_type": "stream", 110 | "text": [ 111 | "MEDICAL HELP\n", 112 | "NUCLEAR\n", 113 | "RADIOLOGIC TECH\n", 114 | "full time day po ition is available for registred or registry technician in our Nuclear Medicine department This position does require taking call\n", 115 | "CHEST\n", 116 | "PHYSICAL THERAPIST\n", 117 | "If you are or registry eligible\n", 118 | "Physical Trhrapist interested in Chest\n", 119 | "Therapy consider the New England Baptist Hospital Responsibilities will include providng chest therapy for Medical Surgical patients family teaching interdisciplinary inservice programs and more\n", 120 | "For more information please contact our Personnel department 738-5800 , Ext 255 . An Equal Opportunity Employer\n", 121 | "41 Pa HII Boston\n", 122 | "MANAGER OF\n", 123 | "PRIMARY CARE PROGRAMS\n", 124 | "Children's Hospital Medical Center\n", 125 | "seeks dynamic creative individual to manage its Primary Care Programs including 24-hour Emergency Room Primary Care program the Massachusetts Poison information Center and\n", 126 | "Dental services This position requires 3-5 years experience with background in planning budgeting and managing\n", 127 | "health programs Masters degree preferred but additional experience may be substituted We offer salary commensurate\n", 128 | "with experience and fine fringe benefits package\n", 129 | "please forward resumes to Helena Wallace personnel office\n", 130 | "MEDICAL\n", 131 | "300 Lonjwood Avenue\n", 132 | "MA 0211\n", 133 | "REGISTERED\n", 134 | "REGISTRY ELIGIBLE OR\n", 135 | "immi ate available in our modern well- and fu ly accredited 173-bed general hospital Cheshire Hospilal is 80 miles from Boston and near skiing water sports hunting and fishing\n", 136 | "Apphcants must be registered registry eligible or NERT For further information please contact the Personrel department\n", 137 | "Cheshire Hospital\n", 138 | "580 Court Street Keene NH 03431\n" 139 | ] 140 | } 141 | ], 142 | "source": [ 143 | "text = open('Snippet_Globe_displayad_19790114_226.txt').read()\n", 144 | "page_identifier = 'Globe_displayad_19790114_226'\n", 145 | "print(text) # posting text" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": { 151 | "collapsed": true 152 | }, 153 | "source": [ 154 | "## Reset line breaks\n", 155 | "First, we combine short, consecutive uppercase lines so that we can detect, for instance, \"MANAGER OF PRIMARY CARE PROGRAMS\" when we have the two lines \"MANAGER OF\" and \"PRIMARY CARE PROGRAMS\".
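\n", "\n", "A minimal sketch, on a toy input, of what CombineUppercase (defined in title_detection.py above) does:\n", "\n", "    CombineUppercase(['MANAGER OF', 'PRIMARY CARE PROGRAMS', 'Wanted'])\n", "    # returns ['MANAGER OF PRIMARY CARE PROGRAMS', 'Wanted']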
" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 3, 161 | "metadata": { 162 | "collapsed": true 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "# remove emypty lines\n", 167 | "text_by_line = [w for w in re.split('\\n',text) if not w=='']\n", 168 | "\n", 169 | "# reset lines (see title_detection.py)\n", 170 | "text_reset_line = CombineUppercase(text_by_line)\n", 171 | "text_reset_line = UppercaseNewline(text_reset_line,'\\n') #assign new line when an uppercase word is found\n", 172 | "text_reset_line = CombineUppercase(text_reset_line) #re-combine uppercase words together\n", 173 | "\n", 174 | "# remove extra white spaces\n", 175 | "text_reset_line = [' '.join([y for y in re.split(' ',w) if not y=='']) for w in text_reset_line]\n", 176 | "# remove empty lines\n", 177 | "text_reset_line = [w for w in text_reset_line if not w=='']" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 4, 183 | "metadata": {}, 184 | "outputs": [ 185 | { 186 | "data": { 187 | "text/plain": [ 188 | "['MEDICAL HELP NUCLEAR RADIOLOGIC TECH',\n", 189 | " 'full time day po ition is available for registred or registry technician in our Nuclear Medicine department This position does require taking call',\n", 190 | " 'CHEST PHYSICAL THERAPIST',\n", 191 | " 'If you are or registry eligible',\n", 192 | " 'Physical Trhrapist interested in Chest',\n", 193 | " 'Therapy consider the New England Baptist Hospital Responsibilities will include providng chest therapy for Medical Surgical patients family teaching interdisciplinary inservice programs and more',\n", 194 | " 'For more information please contact our Personnel department 738-5800 , Ext 255 . An Equal Opportunity Employer',\n", 195 | " '41 Pa',\n", 196 | " 'HII',\n", 197 | " 'Boston',\n", 198 | " 'MANAGER OF PRIMARY CARE PROGRAMS',\n", 199 | " \"Children's Hospital Medical Center\",\n", 200 | " 'seeks dynamic creative individual to manage its Primary Care Programs including 24-hour Emergency Room Primary Care program the Massachusetts Poison information Center and',\n", 201 | " 'Dental services This position requires 3-5 years experience with background in planning budgeting and managing',\n", 202 | " 'health programs Masters degree preferred but additional experience may be substituted We offer salary commensurate',\n", 203 | " 'with experience and fine fringe benefits package',\n", 204 | " 'please forward resumes to Helena Wallace personnel office',\n", 205 | " 'MEDICAL',\n", 206 | " '300 Lonjwood Avenue',\n", 207 | " 'MA 0211 REGISTERED REGISTRY ELIGIBLE OR',\n", 208 | " 'immi ate available in our modern well- and fu ly accredited 173-bed general hospital Cheshire Hospilal is 80 miles from Boston and near skiing water sports hunting and fishing',\n", 209 | " 'Apphcants must be registered registry eligible or',\n", 210 | " 'NERT',\n", 211 | " 'For further information please contact the Personrel department',\n", 212 | " 'Cheshire Hospital',\n", 213 | " '580 Court Street Keene NH 03431']" 214 | ] 215 | }, 216 | "execution_count": 4, 217 | "metadata": {}, 218 | "output_type": "execute_result" 219 | } 220 | ], 221 | "source": [ 222 | "# print results\n", 223 | "text_reset_line" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "## Detect job titles\n", 231 | "Next, we detect job titles by matching to a list of job title personal nouns. 
For instance, with the word \"THERAPIST\" in our list, we are able to detect \"CHEST PHYSICAL THERAPIST\" as a job title without having to specify all types of possible therapists. " 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 5, 237 | "metadata": {}, 238 | "outputs": [ 239 | { 240 | "name": "stdout", 241 | "output_type": "stream", 242 | "text": [ 243 | "--- Examples of job title personal nouns ---\n", 244 | "['abstracter', 'abstracters', 'abstractor', 'abstractors', 'accounting', 'accountings', 'accountant', 'accountants', 'actor', 'actors', 'actress', 'actresses', 'actuarial', 'actuarials', 'actuaries']\n" 245 | ] 246 | } 247 | ], 248 | "source": [ 249 | "# define the indicator flag used when a job title is detected\n", 250 | "title_found = '---titlefound---'\n", 251 | "\n", 252 | "# list of job title personal nouns\n", 253 | "TitleBaseFile = open('./auxiliary files/TitleBase.txt').read()\n", 254 | "TitleBaseList = [w for w in re.split('\\n',TitleBaseFile) if not w=='']\n", 255 | "print('--- Examples of job title personal nouns ---')\n", 256 | "print(TitleBaseList[:15]) " 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 6, 262 | "metadata": { 263 | "collapsed": true 264 | }, 265 | "outputs": [], 266 | "source": [ 267 | "text_detect_title = ['']*len(text_reset_line)\n", 268 | "PreviousLineIsUppercaseTitle = False\n", 269 | "\n", 270 | "# assign a flag of '---titlefound---' to lines where we detect a job title\n", 271 | "\n", 272 | "for i in range(0,len(text_reset_line)):\n", 273 | " line = text_reset_line[i]\n", 274 | " line_no_hyphen = re.sub('-',' ',line.lower())\n", 275 | " tokens = word_tokenize(line_no_hyphen)\n", 276 | " \n", 277 | " Match = list(set(tokens).intersection(TitleBaseList)) # see if the line has words in TitleBaseList \n", 278 | " \n", 279 | " if Match and DetermineUppercase(line): # uppercase job title\n", 280 | " text_detect_title[i] = ' '.join([w for w in re.split(' ',line) if not w=='']) + title_found\n", 281 | " # adding a flag that a title is found\n", 282 | " # ' '.join([w for w in split(' ',line) if not w=='']) is to remove extra spaces from 'line'\n", 283 | " PreviousLineIsUppercaseTitle = True\n", 284 | " elif Match and len(tokens) <= 2:\n", 285 | " # This line allows non-uppercase job titles\n", 286 | " # It has to be short enough => less than or equal to 2 words.\n", 287 | " # In addition, the previous line must NOT be an uppercase job title.
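\n", " # (otherwise a short fragment sitting directly beneath an uppercase title could be flagged as a second title)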
\n", 288 | " if PreviousLineIsUppercaseTitle == False:\n", 289 | " text_detect_title[i] = ' '.join([w for w in re.split(' ',line) if not w=='']) + title_found\n", 290 | " PreviousLineIsUppercaseTitle = False\n", 291 | " else:\n", 292 | " text_detect_title[i] = ' '.join([w for w in re.split(' ',line) if not w==''])\n", 293 | " PreviousLineIsUppercaseTitle = False\n", 294 | " else:\n", 295 | " text_detect_title[i] = ' '.join([w for w in re.split(' ',line) if not w==''])\n", 296 | " PreviousLineIsUppercaseTitle = False" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "For this snippet of text, we are able to detect the following job titles:" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 7, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "data": { 313 | "text/plain": [ 314 | "['MEDICAL HELP NUCLEAR RADIOLOGIC TECH---titlefound---',\n", 315 | " 'CHEST PHYSICAL THERAPIST---titlefound---',\n", 316 | " 'MANAGER OF PRIMARY CARE PROGRAMS---titlefound---',\n", 317 | " 'MEDICAL---titlefound---']" 318 | ] 319 | }, 320 | "execution_count": 7, 321 | "metadata": {}, 322 | "output_type": "execute_result" 323 | } 324 | ], 325 | "source": [ 326 | "[w for w in text_detect_title if re.findall(title_found,w)]" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "## Detect addresses and ending phrases \n", 334 | "In this step, we detect addresses such as street names, zip codes, and phrases which tend to appear at the end of ads. Such phrases include \"An Equal Opportunity Employer\" and \"send resume.\" If we do, we assign a string \"---endingfound---\" to the end of the line. " 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 8, 340 | "metadata": { 341 | "collapsed": true 342 | }, 343 | "outputs": [], 344 | "source": [ 345 | "ending_found = '---endingfound---'\n", 346 | "text_assign_flag = list()\n", 347 | "\n", 348 | "# see \"detect_ending.py\"\n", 349 | "\n", 350 | "for line in text_detect_title:\n", 351 | " AddressFound , EndingPhraseFound = AssignFlag(line)\n", 352 | " if AddressFound == True or EndingPhraseFound == True:\n", 353 | " text_assign_flag.append(line + ending_found)\n", 354 | " else:\n", 355 | " text_assign_flag.append(line)" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "For this snippet of text, we are able to detect the following addresses and phrases:" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 9, 368 | "metadata": {}, 369 | "outputs": [ 370 | { 371 | "data": { 372 | "text/plain": [ 373 | "['For more information please contact our Personnel department 738-5800 , Ext 255 . 
An Equal Opportunity Employer---endingfound---',\n", 374 | " '300 Lonjwood Avenue---endingfound---',\n", 375 | " '580 Court Street Keene NH 03431---endingfound---']" 376 | ] 377 | }, 378 | "execution_count": 9, 379 | "metadata": {}, 380 | "output_type": "execute_result" 381 | } 382 | ], 383 | "source": [ 384 | "[w for w in text_assign_flag if re.findall(ending_found,w)]" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "After detecting job titles, addresses and ending phrases, we end up with the following text: " 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": 10, 397 | "metadata": {}, 398 | "outputs": [ 399 | { 400 | "data": { 401 | "text/plain": [ 402 | "['MEDICAL HELP NUCLEAR RADIOLOGIC TECH---titlefound---',\n", 403 | " 'full time day po ition is available for registred or registry technician in our Nuclear Medicine department This position does require taking call',\n", 404 | " 'CHEST PHYSICAL THERAPIST---titlefound---',\n", 405 | " 'If you are or registry eligible',\n", 406 | " 'Physical Trhrapist interested in Chest',\n", 407 | " 'Therapy consider the New England Baptist Hospital Responsibilities will include providng chest therapy for Medical Surgical patients family teaching interdisciplinary inservice programs and more',\n", 408 | " 'For more information please contact our Personnel department 738-5800 , Ext 255 . An Equal Opportunity Employer---endingfound---',\n", 409 | " '41 Pa',\n", 410 | " 'HII',\n", 411 | " 'Boston',\n", 412 | " 'MANAGER OF PRIMARY CARE PROGRAMS---titlefound---',\n", 413 | " \"Children's Hospital Medical Center\",\n", 414 | " 'seeks dynamic creative individual to manage its Primary Care Programs including 24-hour Emergency Room Primary Care program the Massachusetts Poison information Center and',\n", 415 | " 'Dental services This position requires 3-5 years experience with background in planning budgeting and managing',\n", 416 | " 'health programs Masters degree preferred but additional experience may be substituted We offer salary commensurate',\n", 417 | " 'with experience and fine fringe benefits package',\n", 418 | " 'please forward resumes to Helena Wallace personnel office',\n", 419 | " 'MEDICAL---titlefound---',\n", 420 | " '300 Lonjwood Avenue---endingfound---',\n", 421 | " 'MA 0211 REGISTERED REGISTRY ELIGIBLE OR',\n", 422 | " 'immi ate available in our modern well- and fu ly accredited 173-bed general hospital Cheshire Hospilal is 80 miles from Boston and near skiing water sports hunting and fishing',\n", 423 | " 'Apphcants must be registered registry eligible or',\n", 424 | " 'NERT',\n", 425 | " 'For further information please contact the Personrel department',\n", 426 | " 'Cheshire Hospital',\n", 427 | " '580 Court Street Keene NH 03431---endingfound---']" 428 | ] 429 | }, 430 | "execution_count": 10, 431 | "metadata": {}, 432 | "output_type": "execute_result" 433 | } 434 | ], 435 | "source": [ 436 | "text_assign_flag" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "## Assign boundaries\n", 444 | "Next, we assign boundaries by scanning from the beginning line:\n", 445 | "1. If we see a flag '---titlefound---', then we assign a split indicator **before** that line.\n", 446 | "2. If we see a flag '---endingfound---', then we assign a split indicator **after** that line." 
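, "\n", "A toy sketch of these two rules on a hypothetical three-line input:\n", "\n", "    ['NURSE---titlefound---', 'Apply in person', '12 Main Street---endingfound---']\n", "    # '---splithere---' goes before the first line (rule 1) and after the last line (rule 2)"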
447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": 11, 452 | "metadata": { 453 | "collapsed": true 454 | }, 455 | "outputs": [], 456 | "source": [ 457 | "split_indicator = '---splithere---'\n", 458 | "split_by_title = list() \n", 459 | "split_posting = list()\n", 460 | "\n", 461 | "# -----split if title is found-----\n", 462 | "\n", 463 | "for line in text_assign_flag:\n", 464 | " if re.findall(title_found,line):\n", 465 | " #add a split indicator BEFORE the line with the title \n", 466 | " split_by_title.append(split_indicator + '\\n' + line)\n", 467 | " else:\n", 468 | " split_by_title.append(line) # if not found, just append the line back in \n", 469 | " \n", 470 | "split_by_title = [w for w in re.split('\\n','\\n'.join(split_by_title)) if not w=='']" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": 12, 476 | "metadata": { 477 | "collapsed": true 478 | }, 479 | "outputs": [], 480 | "source": [ 481 | "# -----split if any ending phrase and/or address is found-----\n", 482 | "\n", 483 | "for line in split_by_title:\n", 484 | " line_remove_ending_found = re.sub(ending_found,'',line) #remove the ending flag\n", 485 | " if re.findall(ending_found,line):\n", 486 | " #add a split indicator AFTER the line where the pattern is found\n", 487 | " split_posting.append( line_remove_ending_found + '\\n' + split_indicator)\n", 488 | " else:\n", 489 | " split_posting.append( line_remove_ending_found ) # if not found, just append the line back in \n", 490 | "\n", 491 | "# after assigning the split indicators, we can use a python command to split the ads. \n", 492 | "split_posting = [w for w in re.split(split_indicator,'\\n'.join(split_posting)) if not w=='']" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "After assigning boundaries, we end up with the following text:" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": 13, 505 | "metadata": {}, 506 | "outputs": [ 507 | { 508 | "name": "stdout", 509 | "output_type": "stream", 510 | "text": [ 511 | "MEDICAL HELP NUCLEAR RADIOLOGIC TECH---titlefound---full time day po ition is available for registred or registry technician in our Nuclear Medicine department This position does require taking call\n", 512 | "---splithere---\n", 513 | "CHEST PHYSICAL THERAPIST---titlefound---If you are or registry eligiblePhysical Trhrapist interested in ChestTherapy consider the New England Baptist Hospital Responsibilities will include providng chest therapy for Medical Surgical patients family teaching interdisciplinary inservice programs and moreFor more information please contact our Personnel department 738-5800 , Ext 255 .
An Equal Opportunity Employer\n", 514 | "---splithere---\n", 515 | "41 PaHIIBoston\n", 516 | "---splithere---\n", 517 | "MANAGER OF PRIMARY CARE PROGRAMS---titlefound---Children's Hospital Medical Centerseeks dynamic creative individual to manage its Primary Care Programs including 24-hour Emergency Room Primary Care program the Massachusetts Poison information Center andDental services This position requires 3-5 years experience with background in planning budgeting and managinghealth programs Masters degree preferred but additional experience may be substituted We offer salary commensuratewith experience and fine fringe benefits packageplease forward resumes to Helena Wallace personnel office\n", 518 | "---splithere---\n", 519 | "MEDICAL---titlefound---300 Lonjwood Avenue\n", 520 | "---splithere---\n", 521 | "MA 0211 REGISTERED REGISTRY ELIGIBLE ORimmi ate available in our modern well- and fu ly accredited 173-bed general hospital Cheshire Hospilal is 80 miles from Boston and near skiing water sports hunting and fishingApphcants must be registered registry eligible orNERTFor further information please contact the Personrel departmentCheshire Hospital580 Court Street Keene NH 03431\n", 522 | "---splithere---\n" 523 | ] 524 | } 525 | ], 526 | "source": [ 527 | "for ad in split_posting:\n", 528 | " print(re.sub('\\n','',ad)) #print out each ad, ignoring the line break indicators. \n", 529 | " print('---splithere---')" 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": {}, 535 | "source": [ 536 | "## Construct a spreadsheet dataset\n", 537 | "Finally, we construct a spreadsheet with the following variables:\n", 538 | "1. *page_identifier* : We recover this information in the previous step. For this illustration, we take text from Display Ad page 226, from the January 14, 1979 Boston Globe (Globe_displayad_19790114_226)\n", 539 | "2. *ad_num* : Ad number within a page\n", 540 | "3. *job_title* : Job title of that particular ad (equals empty string if the ad has no title).\n", 541 | "4. 
*ad_content* : Posting content" 542 | ] 543 | }, 544 | { 545 | "cell_type": "code", 546 | "execution_count": 14, 547 | "metadata": { 548 | "collapsed": true 549 | }, 550 | "outputs": [], 551 | "source": [ 552 | "all_flag = re.compile('|'.join([title_found,ending_found]))\n", 553 | "\n", 554 | "num_ad = 0 #initialize ad number within displayad\n", 555 | "\n", 556 | "final_output = list()\n", 557 | "\n", 558 | "for ad in split_posting:\n", 559 | " \n", 560 | " ad_split_line = [w for w in re.split('\\n',ad) if not w=='']\n", 561 | " \n", 562 | " # --------- record title ----------\n", 563 | "\n", 564 | " title_this_ad = [w for w in ad_split_line if re.findall(title_found,w)] \n", 565 | " #see if any line is a title\n", 566 | " \n", 567 | " if len(title_this_ad) == 1: #if we do have a title\n", 568 | " title_clean = re.sub(all_flag,'',title_this_ad[0].lower()) \n", 569 | " #take out the flags and revert to lowercase\n", 570 | "\n", 571 | " title_clean = ' '.join([y for y in re.split(' ',title_clean) if not y==''])\n", 572 | " else:\n", 573 | " title_clean = ''\n", 574 | "\n", 575 | " # --------- record content ----------\n", 576 | " \n", 577 | " ad_content = [w for w in ad_split_line if not re.findall(title_found,w)] # take out lines with title\n", 578 | " ad_content = ' '.join([w for w in ad_content if not w==''])\n", 579 | " #delete empty lines + combine all the line together (within an ad)\n", 580 | " \n", 581 | " ad_content = re.sub(all_flag,'',ad_content) \n", 582 | " #take out all the flags\n", 583 | "\n", 584 | " # --------- record output ----------\n", 585 | "\n", 586 | " num_ad += 1\n", 587 | " output = [str(page_identifier),str(num_ad),str(title_clean),str(ad_content)] \n", 588 | " final_output.append( '|'.join(output) )\n", 589 | "\n", 590 | "# final output \n", 591 | "final_output_file = open('structured_data.txt','w')\n", 592 | "final_output_file.write('\\n'.join(final_output))\n", 593 | "final_output_file.close()" 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": 15, 599 | "metadata": {}, 600 | "outputs": [ 601 | { 602 | "name": "stdout", 603 | "output_type": "stream", 604 | "text": [ 605 | "Globe_displayad_19790114_226|1|medical help nuclear radiologic tech|full time day po ition is available for registred or registry technician in our Nuclear Medicine department This position does require taking call\n", 606 | "Globe_displayad_19790114_226|2|chest physical therapist|If you are or registry eligible Physical Trhrapist interested in Chest Therapy consider the New England Baptist Hospital Responsibilities will include providng chest therapy for Medical Surgical patients family teaching interdisciplinary inservice programs and more For more information please contact our Personnel department 738-5800 , Ext 255 . 
An Equal Opportunity Employer\n", 607 | "Globe_displayad_19790114_226|3||41 Pa HII Boston\n", 608 | "Globe_displayad_19790114_226|4|manager of primary care programs|Children's Hospital Medical Center seeks dynamic creative individual to manage its Primary Care Programs including 24-hour Emergency Room Primary Care program the Massachusetts Poison information Center and Dental services This position requires 3-5 years experience with background in planning budgeting and managing health programs Masters degree preferred but additional experience may be substituted We offer salary commensurate with experience and fine fringe benefits package please forward resumes to Helena Wallace personnel office\n", 609 | "Globe_displayad_19790114_226|5|medical|300 Lonjwood Avenue\n", 610 | "Globe_displayad_19790114_226|6||MA 0211 REGISTERED REGISTRY ELIGIBLE OR immi ate available in our modern well- and fu ly accredited 173-bed general hospital Cheshire Hospilal is 80 miles from Boston and near skiing water sports hunting and fishing Apphcants must be registered registry eligible or NERT For further information please contact the Personrel department Cheshire Hospital 580 Court Street Keene NH 03431\n" 611 | ] 612 | } 613 | ], 614 | "source": [ 615 | "# print out final output\n", 616 | "structured_posting = open('structured_data.txt').read()\n", 617 | "structured_posting = re.split('\\n',structured_posting)\n", 618 | "for ad in structured_posting:\n", 619 | " print(ad)" 620 | ] 621 | } 622 | ], 623 | "metadata": { 624 | "kernelspec": { 625 | "display_name": "Python 3", 626 | "language": "python", 627 | "name": "python3" 628 | }, 629 | "language_info": { 630 | "codemirror_mode": { 631 | "name": "ipython", 632 | "version": 3 633 | }, 634 | "file_extension": ".py", 635 | "mimetype": "text/x-python", 636 | "name": "python", 637 | "nbconvert_exporter": "python", 638 | "pygments_lexer": "ipython3", 639 | "version": "3.6.1" 640 | } 641 | }, 642 | "nbformat": 4, 643 | "nbformat_minor": 1 644 | } 645 | --------------------------------------------------------------------------------