├── README.md
└── data_cleaning
    ├── CBOW.ipynb
    ├── LDA.ipynb
    ├── assign_SOC.ipynb
    ├── auxiliary files
    │   ├── OCRcorrect_enchant.py
    │   ├── OCRcorrect_hyphen.py
    │   ├── PWL.txt
    │   ├── TitleBase.txt
    │   ├── __pycache__
    │   │   ├── ExtractLDAresult.cpython-36.pyc
    │   │   ├── OCRcorrect_enchant.cpython-36.pyc
    │   │   ├── OCRcorrect_hyphen.cpython-36.pyc
    │   │   ├── compute_spelling.cpython-36.pyc
    │   │   ├── detect_ending.cpython-36.pyc
    │   │   ├── edit_distance.cpython-36.pyc
    │   │   ├── extract_LDA_result.cpython-36.pyc
    │   │   ├── extract_information.cpython-36.pyc
    │   │   ├── title_detection.cpython-36.pyc
    │   │   └── title_substitute.cpython-36.pyc
    │   ├── apst_mapping.xlsx
    │   ├── compute_spelling.py
    │   ├── detect_ending.py
    │   ├── edit_distance.py
    │   ├── example_ONET_api.png
    │   ├── extract_LDA_result.py
    │   ├── extract_information.py
    │   ├── phrase_substitutes.csv
    │   ├── state_name.txt
    │   ├── title2soc.txt
    │   ├── title_detection.py
    │   ├── title_substitute.py
    │   └── word_substitutes.csv
    ├── initial_cleaning.ipynb
    └── structured_data.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # newspaper_project
2 | This repository contains supplementary materials for "The Evolution of Work in the United States" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo, and Daniel Tannenbaum, *American Economic Journal: Applied Economics* (2020): https://www.aeaweb.org/articles?id=10.1257/app.20190070
3 |
4 | - Project Data Page: https://occupationdata.github.io
5 |
--------------------------------------------------------------------------------
/data_cleaning/CBOW.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# The Continuous Bag of Words Model\n",
8 | "\n",
9 | "Online supplementary material to \"The Evolution of Work in the United States\" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum.\n",
10 | "\n",
11 | "* [Project data library](https://occupationdata.github.io) \n",
12 | "\n",
13 | "* [GitHub repository](https://github.com/phaiptt125/newspaper_project)\n",
14 | "\n",
15 | "***"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "This IPython notebook demonstrates how we map occupational characteristics to words or phrases from newspaper text using the Continuous Bag of Words (CBOW) model. \n",
23 | "\n",
24 | "* See [here](http://ssc.wisc.edu/~eatalay/apst/apst_mapping.pdf) for more examples.\n",
25 | "* See project data library for full results."
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | " Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. \n",
33 | "***"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "## Import necessary modules"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 1,
46 | "metadata": {
47 | "collapsed": true
48 | },
49 | "outputs": [],
50 | "source": [
51 | "import os\n",
52 | "import re\n",
53 | "import sys\n",
54 | "import platform\n",
55 | "import collections\n",
56 | "import shutil\n",
57 | "\n",
58 | "import pandas\n",
59 | "import math\n",
60 | "import multiprocessing\n",
61 | "import os.path\n",
62 | "import numpy as np\n",
63 | "from gensim import corpora, models\n",
64 | "from gensim.models import Word2Vec, keyedvectors \n",
65 | "from gensim.models.word2vec import LineSentence\n",
66 | "from sklearn.metrics.pairwise import cosine_similarity"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "In our implementation, we construct our model by taking as our text corpus all of the text from job ads which appeared in our cleaned newspaper data, plus the raw text from job ads which were posted online in two months: January 2012 and January 2016."
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "## Prepare newspaper text data"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "For newspaper text data, we:\n",
88 | "\n",
89 | "1. Retrieve document metadata, remove markup from the newspaper text, and perform an initial spell-check of the text (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/initial_cleaning.ipynb)). \n",
90 | "2. Exclude non-job ad pages (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/LDA.ipynb)).\n",
91 | "3. Transform unstructured newspaper text into spreadsheet data (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/structured_data.ipynb)).\n",
92 | "4. Delete all non-alphabetic characters, e.g., numbers and punctuation.\n",
93 | "5. Convert all characters to lowercase. \n",
94 | "\n",
95 | "The example below demonstrates how to perform steps 4 and 5 on a very short snippet of Display Ad page 226, from the January 14, 1979 Boston Globe. "
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 2,
101 | "metadata": {},
102 | "outputs": [
103 | {
104 | "name": "stdout",
105 | "output_type": "stream",
106 | "text": [
107 | "--- newspaper text ---\n",
108 | "manage its Primary Care Programs including 24-hour Emergency Room Primary Care program\n",
109 | "\n",
110 | "--- transformed text ---\n",
111 | "manage its primary care programs including hour emergency room primary care program\n"
112 | ]
113 | }
114 | ],
115 | "source": [
116 | "text = \"manage its Primary Care Programs including 24-hour Emergency Room Primary Care program\"\n",
117 | "\n",
118 | "print('--- newspaper text ---')\n",
119 | "print(text)\n",
120 | "print('')\n",
121 | "print('--- transformed text ---')\n",
122 | "print(re.sub('[^a-z ]','',text.lower()))"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "## Prepare online job posting text data"
130 | ]
131 | },
132 | {
133 | "cell_type": "markdown",
134 | "metadata": {},
135 | "source": [
136 | "Economic Modeling Specialists International (EMSI) provided us with online job postings data in a processed and relatively clean form: see [here](https://github.com/phaiptt125/online_job_posting/blob/master/data_cleaning/initial_cleaning.ipynb).\n",
137 | "\n",
138 | "For the purpose of this project, we use online postings data to:\n",
139 | "1. Enrich the sample of text usage when constructing the Continuous Bag of Words model.\n",
140 | "2. Retrieve a mapping between job titles and ONET-SOC codes. "
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "## Construct CBOW model"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 3,
153 | "metadata": {
154 | "collapsed": true
155 | },
156 | "outputs": [],
157 | "source": [
158 | "# filename of the combined ads ~ 15 GB \n",
159 | "text_data_filename = 'ad_combined.txt'\n",
160 | "\n",
161 | "# construct CBOW model\n",
162 | "dim_model = 300\n",
163 | "model = Word2Vec(LineSentence(open(text_data_filename)), \n",
164 | " size=dim_model, \n",
165 | " window=5, \n",
166 | " min_count=5, \n",
167 | " workers=multiprocessing.cpu_count())\n",
168 | "\n",
169 | "model.init_sims(replace=True)\n",
170 | "\n",
171 | "# define output filename for CBOW model\n",
172 | "cbow_filename = 'cbow.model'\n",
173 | "\n",
174 | "# save model into file\n",
175 | "model.save(cbow_filename)"
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {},
181 | "source": [
182 | "## Compute similar words"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": 4,
188 | "metadata": {
189 | "collapsed": true
190 | },
191 | "outputs": [],
192 | "source": [
193 | "# load model\n",
194 | "model = Word2Vec.load(cbow_filename)\n",
195 | "word_all = model.wv # set of all words in the model"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 5,
201 | "metadata": {
202 | "collapsed": true
203 | },
204 | "outputs": [],
205 | "source": [
206 | "def find_similar_words(phrase,model,dim_model):\n",
207 | " # This function computes similar words given a word or phrase.\n",
208 | " # If the input is a single word, this function is equivalent to the gensim built-in model.most_similar.\n",
209 | " \n",
210 | " # phrase : input for word or phrases to look for. For a phrase with multiple words, add \"_\" in between.\n",
211 | " # model : constructed CBOW model\n",
212 | " # dim_model : dimension of the model, i.e., length of a vector of each word \n",
213 | " \n",
214 | " tokens = [w for w in re.split('_',phrase) if w in word_all] \n",
215 | " # split input to tokens, ignoring words that are not in the model \n",
216 | " \n",
217 | " vector_by_word = np.zeros((len(tokens),dim_model)) # initialize a matrix \n",
218 | " \n",
219 | " for i in range(0,len(tokens)):\n",
220 | " word = tokens[i] # loop for each word\n",
221 | " vector_this_word = model[word] # get a vector representation\n",
222 | " vector_by_word[i,:] = vector_this_word # record the vector\n",
223 | " \n",
224 | " vector_this_phrase = sum(vector_by_word) \n",
225 | " # sum over words to get a vector representation of the whole phrase\n",
226 | " \n",
227 | " most_similar_words = model.similar_by_vector(vector_this_phrase, topn=100, restrict_vocab=None)\n",
228 | " # find 100 most similar words\n",
229 | " \n",
230 | " most_similar_words = [w for w in most_similar_words if not w[0] == phrase]\n",
231 | " # take out the output word that is identical to the input word\n",
232 | " \n",
233 | " return most_similar_words"
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {},
239 | "source": [
240 | "The cosine similarity score of a pair of words/phrases is defined as the cosine of the angle between the two vectors representing them. A higher cosine similarity score means that the two words/phrases tend to appear in similar contexts.\n",
241 | "\n",
242 | "The function *find_similar_words* above returns a list of similar words, ordered by cosine similarity score, together with their corresponding scores. For example, the ten most similar words to \"creative\" are: "
243 | ]
244 | },
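The cosine similarity described above can be sketched directly with NumPy. This is an illustrative computation on toy vectors, not the authors' code (gensim's `similar_by_vector` performs the equivalent search internally); the vector values below are made up for the example.

```python
import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two word vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# two toy 5-dimensional "word vectors" (hypothetical values)
creative = np.array([0.2, -0.1, 0.4, 0.3, -0.2])
innovative = np.array([0.25, -0.05, 0.35, 0.2, -0.15])

score = cosine_similarity(creative, innovative)
# score is close to 1 because the two vectors point in similar directions
```

Ranking every vocabulary word by this score against a query vector is exactly what produces the lists of similar words shown below.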
245 | {
246 | "cell_type": "code",
247 | "execution_count": 6,
248 | "metadata": {},
249 | "outputs": [
250 | {
251 | "data": {
252 | "text/plain": [
253 | "[('imaginative', 0.6997416615486145),\n",
254 | " ('versatile', 0.6824457049369812),\n",
255 | " ('creature', 0.591433584690094),\n",
256 | " ('innovative', 0.5758161544799805),\n",
257 | " ('resourceful', 0.5575118660926819),\n",
258 | " ('creallve', 0.5550633668899536),\n",
259 | " ('restive', 0.5526227951049805),\n",
260 | " ('dynamic', 0.5416233539581299),\n",
261 | " ('clever', 0.5349052548408508),\n",
262 | " ('pragmatic', 0.5299020409584045)]"
263 | ]
264 | },
265 | "execution_count": 6,
266 | "metadata": {},
267 | "output_type": "execute_result"
268 | }
269 | ],
270 | "source": [
271 | "most_similar_words = find_similar_words('creative',model,dim_model)\n",
272 | "most_similar_words[:10]"
273 | ]
274 | },
275 | {
276 | "cell_type": "markdown",
277 | "metadata": {},
278 | "source": [
279 | "Likewise, the ten most similar words to \"bookkeeping\" are:"
280 | ]
281 | },
282 | {
283 | "cell_type": "code",
284 | "execution_count": 7,
285 | "metadata": {},
286 | "outputs": [
287 | {
288 | "data": {
289 | "text/plain": [
290 | "[('bkkp', 0.6903467178344727),\n",
291 | " ('beekeeping', 0.6871334314346313),\n",
292 | " ('stenography', 0.672173023223877),\n",
293 | " ('bkkpng', 0.6181079745292664),\n",
294 | " ('bkkpg', 0.6175851821899414),\n",
295 | " ('bookkpg', 0.5925684571266174),\n",
296 | " ('dkkpg', 0.5809350609779358),\n",
297 | " ('bkkping', 0.5768048167228699),\n",
298 | " ('clerical', 0.5741672515869141),\n",
299 | " ('payroll', 0.5619226098060608)]"
300 | ]
301 | },
302 | "execution_count": 7,
303 | "metadata": {},
304 | "output_type": "execute_result"
305 | }
306 | ],
307 | "source": [
308 | "most_similar_words = find_similar_words('bookkeeping',model,dim_model)\n",
309 | "most_similar_words[:10]"
310 | ]
311 | },
312 | {
313 | "cell_type": "markdown",
314 | "metadata": {},
315 | "source": [
316 | "The strength of the Continuous Bag of Words (CBOW) model is twofold. First, the model provides context-based synonyms, which allow us to keep track of relevant words even if their usage differs over time. We provide one example in the main paper: "
317 | ]
318 | },
319 | {
320 | "cell_type": "markdown",
321 | "metadata": {},
322 | "source": [
323 | "*For instance, even though “creative” and “innovative” largely refer to the same occupational skill, it is possible that their relative usage among potential employers may differ within the sample period. This is indeed the case: Use of the word “innovative” has increased more quickly than “creative” over the sample period. To the extent that our ad hoc classification included only one of these two words, we would be mis-characterizing trends in the ONET skill of “Thinking Creatively.” The advantage of the continuous bag of words model is that it will identify that “creative” and “innovative” mean the same thing because they appear in similar contexts within job ads. Hence, even if employers start using “innovative” as opposed to “creative” part way through our sample, we will be able to consistently measure trends in “Thinking Creatively” throughout the entire period.*"
324 | ]
325 | },
326 | {
327 | "cell_type": "markdown",
328 | "metadata": {},
329 | "source": [
330 | "The second advantage of the CBOW model is that it identifies common abbreviations and transcription errors. The word \"bookkeeping\", for instance, was often mistranscribed as \"beekeeping\" due to imperfections in the Optical Character Recognition (OCR) algorithm. Moreover, our CBOW model also reveals common abbreviations that employers often used, such as \"bkkp\" and \"bkkpng\"."
331 | ]
332 | }
333 | ],
334 | "metadata": {
335 | "kernelspec": {
336 | "display_name": "Python 3",
337 | "language": "python",
338 | "name": "python3"
339 | },
340 | "language_info": {
341 | "codemirror_mode": {
342 | "name": "ipython",
343 | "version": 3
344 | },
345 | "file_extension": ".py",
346 | "mimetype": "text/x-python",
347 | "name": "python",
348 | "nbconvert_exporter": "python",
349 | "pygments_lexer": "ipython3",
350 | "version": "3.6.1"
351 | }
352 | },
353 | "nbformat": 4,
354 | "nbformat_minor": 2
355 | }
356 |
--------------------------------------------------------------------------------
/data_cleaning/assign_SOC.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Mappings between Job Titles and SOC Codes\n",
8 | "\n",
9 | "Online supplementary material to \"The Evolution of Work in the United States\" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum.\n",
10 | "\n",
11 | "* [Project data library](https://occupationdata.github.io) \n",
12 | "\n",
13 | "* [GitHub repository](https://github.com/phaiptt125/newspaper_project)\n",
14 | "\n",
15 | "***"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "This IPython notebook demonstrates how we map job titles from newspaper text to SOC codes. \n",
23 | "\n",
24 | "* We use the continuous bag of words (CBOW) model previously constructed. See [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/CBOW.ipynb) for more detail. \n",
25 | "* See [here](http://ssc.wisc.edu/~eatalay/apst/apst_mapping.pdf) for more explanations.\n",
26 | "* See project data library for full results."
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | " Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. \n",
34 | "***"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "## List of auxiliary files (see project data library or GitHub repository)\n",
42 | "\n",
43 | "* *\"title_substitute.py\"* : This Python code edits job titles.\n",
44 | "* *\"word_substitutes.csv\"* : List of word-level job title substitutions.\n",
45 | "* *\"phrase_substitutes.csv\"* : List of phrase-level job title substitutions.\n",
46 | "\n",
47 | "Note: We look for the most common job titles and list manually-coded substitutions in *\"word_substitutes.csv\"* and *\"phrase_substitutes.csv\"*. "
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 1,
53 | "metadata": {
54 | "collapsed": true
55 | },
56 | "outputs": [],
57 | "source": [
58 | "import os, io\n",
59 | "import re\n",
60 | "import sys\n",
61 | "import platform\n",
62 | "import collections\n",
63 | "import shutil\n",
64 | "\n",
65 | "import pandas as pd\n",
66 | "import math\n",
67 | "import multiprocessing\n",
68 | "import os.path\n",
69 | "import numpy as np\n",
70 | "from gensim import corpora, models\n",
71 | "from gensim.models import Word2Vec, keyedvectors \n",
72 | "from gensim.models.word2vec import LineSentence\n",
73 | "from sklearn.metrics.pairwise import cosine_similarity\n",
74 | "\n",
75 | "sys.path.append('./auxiliary files')\n",
76 | "\n",
77 | "from title_substitute import *"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "## Edit job titles"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "We first lightly edit job titles to reduce the number of unique titles: We convert all titles to lowercase and remove all non-alphanumeric characters; combine titles which are very similar to one another (e.g., replacing \"hostesses\" with \"host\"); replace plural nouns with their singular form (e.g., replacing \"nurses\" with \"nurse\", \"foremen\" with \"foreman\"); and remove abbreviations (e.g., replacing \"asst\" with \"assistant\", and \"customer service rep\" with \"customer service representative\"). "
92 | ]
93 | },
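The light editing described above can be sketched as a small normalization function. This is a simplified illustration, not the repository's `title_substitute.py`: the substitution dictionaries below are hypothetical stand-ins for entries in `word_substitutes.csv` and `phrase_substitutes.csv`.

```python
import re

# hypothetical excerpts of the manually coded substitution lists
WORD_SUBSTITUTES = {'nurses': 'nurse', 'foremen': 'foreman',
                    'hostesses': 'host', 'asst': 'assistant',
                    'rep': 'representative'}
PHRASE_SUBSTITUTES = {'rn': 'registered nurse'}

def normalize_title(title):
    # lowercase and drop non-alphanumeric characters
    title = re.sub('[^a-z0-9 ]', '', title.lower()).strip()
    # whole-title (phrase) substitutions take priority
    if title in PHRASE_SUBSTITUTES:
        return PHRASE_SUBSTITUTES[title]
    # then word-by-word substitutions
    return ' '.join(WORD_SUBSTITUTES.get(w, w) for w in title.split())

normalize_title('Customer Service Rep.')  # 'customer service representative'
normalize_title('RN')                     # 'registered nurse'
```

Applying phrase substitutions before word substitutions matters: "rn" must expand to "registered nurse" as a whole title rather than be treated as an ordinary word.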
94 | {
95 | "cell_type": "code",
96 | "execution_count": 2,
97 | "metadata": {
98 | "collapsed": true
99 | },
100 | "outputs": [],
101 | "source": [
102 | "# import files for editing titles\n",
103 | "word_substitutes = io.open('word_substitutes.csv','r',encoding='utf-8',errors='ignore').read()\n",
104 | "word_substitutes = ''.join([w for w in word_substitutes if ord(w) < 127])\n",
105 | "word_substitutes = [w for w in re.split('\\n',word_substitutes) if not w=='']\n",
106 | " \n",
107 | "phrase_substitutes = io.open('phrase_substitutes.csv','r',encoding='utf-8',errors='ignore').read()\n",
108 | "phrase_substitutes = ''.join([w for w in phrase_substitutes if ord(w) < 127])\n",
109 | "phrase_substitutes = [w for w in re.split('\\n',phrase_substitutes) if not w=='']"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 3,
115 | "metadata": {},
116 | "outputs": [
117 | {
118 | "name": "stdout",
119 | "output_type": "stream",
120 | "text": [
121 | "original title = registered nurses\n",
122 | "edited title = registered nurse\n",
123 | "---\n",
124 | "original title = rn\n",
125 | "edited title = registered nurse\n",
126 | "---\n",
127 | "original title = hostesses\n",
128 | "edited title = host\n",
129 | "---\n",
130 | "original title = foremen\n",
131 | "edited title = foreman\n",
132 | "---\n",
133 | "original title = customer service rep\n",
134 | "edited title = customer service representative\n",
135 | "---\n"
136 | ]
137 | }
138 | ],
139 | "source": [
140 | "# some illustrations (see \"title_substitute.py\")\n",
141 | "\n",
142 | "list_job_titles = ['registered nurses',\n",
143 | " 'rn', \n",
144 | " 'hostesses',\n",
145 | " 'foremen', \n",
146 | " 'customer service rep']\n",
147 | "\n",
148 | "for title in list_job_titles: \n",
149 | " title_clean = substitute_titles(title,word_substitutes,phrase_substitutes)\n",
150 | " print('original title = ' + title)\n",
151 | " print('edited title = ' + title_clean)\n",
152 | " print('---')"
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "## Some technical issues\n",
160 | "\n",
161 | "* The procedure of replacing plural nouns with their singular form works in general:"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 4,
167 | "metadata": {},
168 | "outputs": [
169 | {
170 | "data": {
171 | "text/plain": [
172 | "'galaxy'"
173 | ]
174 | },
175 | "execution_count": 4,
176 | "metadata": {},
177 | "output_type": "execute_result"
178 | }
179 | ],
180 | "source": [
181 | "substitute_titles('galaxies',word_substitutes,phrase_substitutes)\n",
182 | "# Note: We do not supply the mapping from 'galaxies' to 'galaxy'."
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {},
188 | "source": [
189 | "* The procedure of replacing abbreviations, on the other hand, requires user-provided information; that is, we list the most common substitutions. While we cannot possibly identify all abbreviations, we later use the continuous bag of words (CBOW) model: common abbreviations have vector representations similar to those of their unabbreviated forms. "
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "## ONET reported job titles "
197 | ]
198 | },
199 | {
200 | "cell_type": "markdown",
201 | "metadata": {},
202 | "source": [
203 | "The ONET publishes, for each SOC code, a list of reported job titles in its \"Sample of Reported Titles\" and \"Alternate Titles\" sections. The ONET data dictionary (see [here](https://www.onetcenter.org/dl_files/database/db_22_1_dictionary.pdf)) describes these files as follows:\n",
204 | "\n",
205 | "*\"This file [Sample of Reported Titles] contains job titles frequently reported by incumbents and occupational experts on data collection surveys.\"* (page 52)\n",
206 | "\n",
207 | "*\"This file [Alternate Titles] contains alternate, or 'lay', occupational titles for the ONET-SOC classification system. The file was developed to improve keyword searches in several Department of Labor internet applications (i.e., Career InfoNet, ONET OnLine, and ONET Code Connector). The file contains\n",
208 | "occupational titles from existing occupational classification systems, as well as from other diverse sources.\"* (page 50)"
209 | ]
210 | },
211 | {
212 | "cell_type": "markdown",
213 | "metadata": {},
214 | "source": [
215 | "## A mapping between ONET reported job titles and SOC codes"
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "metadata": {},
221 | "source": [
222 | "The ONET provides, for each job title in \"Sample of Reported Titles\" and \"Alternate Titles\", a corresponding SOC code. We then record these mappings directly. \n",
223 | "\n",
224 | "Some job titles, unfortunately, do not map uniquely to a single SOC code. For example, \"Office Administrator\" is listed under \"43-9061.00\", \"43-6011.00\", and \"43-6014.00\". For these titles, we rely on the ONET website search algorithm. First, we enter \"Office Administrator\" into the search query box, \"Occupation Quick Search.\" See [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/auxiliary%20files/example_ONET_api.png) for a screenshot of this procedure. \n",
225 | "\n",
226 | "Then, we map \"Office Administrator\" to \"43-9061.00\", which is the closest match that the ONET website provides. Next, we apply the same title-editing procedure as for the newspaper job titles described above. We record these mappings to \"title2SOC.txt\" as shown below. "
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": 21,
232 | "metadata": {},
233 | "outputs": [
234 | {
235 | "name": "stdout",
236 | "output_type": "stream",
237 | "text": [
238 | "Total mappings = 45207\n"
239 | ]
240 | },
241 | {
242 | "data": {
243 | "text/html": [
244 | "
\n",
245 | "\n",
258 | "
\n",
259 | " \n",
260 | " \n",
261 | " | \n",
262 | " title | \n",
263 | " original_title | \n",
264 | " soc | \n",
265 | "
\n",
266 | " \n",
267 | " \n",
268 | " \n",
269 | " 0 | \n",
270 | " operation director | \n",
271 | " Operations Director | \n",
272 | " 11102100 | \n",
273 | "
\n",
274 | " \n",
275 | " 1 | \n",
276 | " us commissioner | \n",
277 | " U.S. Commissioner | \n",
278 | " 11101100 | \n",
279 | "
\n",
280 | " \n",
281 | " 2 | \n",
282 | " sale and marketing director | \n",
283 | " Sales and Marketing Director | \n",
284 | " 11202200 | \n",
285 | "
\n",
286 | " \n",
287 | " 3 | \n",
288 | " market analysis director | \n",
289 | " Market Analysis Director | \n",
290 | " 11202100 | \n",
291 | "
\n",
292 | " \n",
293 | " 4 | \n",
294 | " director of sale and marketing | \n",
295 | " Director of Sales and Marketing | \n",
296 | " 41101200 | \n",
297 | "
\n",
298 | " \n",
299 | "
\n",
300 | "
"
301 | ],
302 | "text/plain": [
303 | " title original_title soc\n",
304 | "0 operation director Operations Director 11102100\n",
305 | "1 us commissioner U.S. Commissioner 11101100\n",
306 | "2 sale and marketing director Sales and Marketing Director 11202200\n",
307 | "3 market analysis director Market Analysis Director 11202100\n",
308 | "4 director of sale and marketing Director of Sales and Marketing 41101200"
309 | ]
310 | },
311 | "execution_count": 21,
312 | "metadata": {},
313 | "output_type": "execute_result"
314 | }
315 | ],
316 | "source": [
317 | "title2SOC_filename = 'title2SOC.txt'\n",
318 | "names = ['title','original_title','soc']\n",
319 | "\n",
320 | "# title: The edited title, to be matched with newspaper titles.\n",
321 | "# original_title: The original titles from ONET website. \n",
322 | "# soc: Occupation code.\n",
323 | " \n",
324 | "# import into pandas dataframe\n",
325 | "title2SOC = pd.read_csv(title2SOC_filename, sep = '\\t', names = names)\n",
326 | "\n",
327 | "# print number of total mappings\n",
328 | "print('Total mappings = ' + str(len(title2SOC)))\n",
329 | " \n",
330 | "# print some examples\n",
331 | "title2SOC.head()"
332 | ]
333 | },
334 | {
335 | "cell_type": "markdown",
336 | "metadata": {},
337 | "source": [
338 | "The subsequent sections of this IPython notebook explain how we use these mappings from ONET, in combination with the previously constructed continuous bag of words (CBOW) model, to assign an SOC code to each newspaper job title."
339 | ]
340 | },
341 | {
342 | "cell_type": "markdown",
343 | "metadata": {},
344 | "source": [
345 | "## Map ONET job titles to newspaper job titles (direct match)"
346 | ]
347 | },
348 | {
349 | "cell_type": "markdown",
350 | "metadata": {},
351 | "source": [
352 | "We assign an ONET job title, for which a corresponding SOC code is available, to each newspaper job title. First, for each newspaper job title, we check whether there is a direct string match. Suppose we have \"sale and marketing director\" in the newspaper:"
353 | ]
354 | },
355 | {
356 | "cell_type": "code",
357 | "execution_count": 6,
358 | "metadata": {},
359 | "outputs": [
360 | {
361 | "data": {
362 | "text/plain": [
363 | "True"
364 | ]
365 | },
366 | "execution_count": 6,
367 | "metadata": {},
368 | "output_type": "execute_result"
369 | }
370 | ],
371 | "source": [
372 | "\"sale and marketing director\" in title2SOC['title'].values"
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": 7,
378 | "metadata": {},
379 | "outputs": [
380 | {
381 | "data": {
382 | "text/html": [
383 | "\n",
384 | "\n",
397 | "
\n",
398 | " \n",
399 | " \n",
400 | " | \n",
401 | " title | \n",
402 | " original_title | \n",
403 | " soc | \n",
404 | "
\n",
405 | " \n",
406 | " \n",
407 | " \n",
408 | " 2 | \n",
409 | " sale and marketing director | \n",
410 | " Sales and Marketing Director | \n",
411 | " 11202200 | \n",
412 | "
\n",
413 | " \n",
414 | "
\n",
415 | "
"
416 | ],
417 | "text/plain": [
418 | " title original_title soc\n",
419 | "2 sale and marketing director Sales and Marketing Director 11202200"
420 | ]
421 | },
422 | "execution_count": 7,
423 | "metadata": {},
424 | "output_type": "execute_result"
425 | }
426 | ],
427 | "source": [
428 | "title2SOC[title2SOC['title'] == \"sale and marketing director\"]"
429 | ]
430 | },
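A direct-match lookup like the one above can also be sketched as a plain dictionary built from the title-to-SOC table. This is an illustrative sketch, not the authors' code; the miniature DataFrame below uses made-up rows in the format of "title2SOC.txt".

```python
import pandas as pd

# hypothetical miniature of the title2SOC table
title2SOC = pd.DataFrame({
    'title': ['sale and marketing director', 'registered nurse'],
    'original_title': ['Sales and Marketing Director', 'Registered Nurse'],
    'soc': [11202200, 29114100],
})

# build a title -> SOC dictionary for fast exact matching
title_to_soc = dict(zip(title2SOC['title'], title2SOC['soc']))

title_to_soc.get('sale and marketing director')   # 11202200
title_to_soc.get('customer relation specialist')  # None (no direct match)
```

Titles for which the lookup returns `None` are exactly the ones passed on to the CBOW-based matching described next.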
431 | {
432 | "cell_type": "markdown",
433 | "metadata": {},
434 | "source": [
435 | "* Since we have \"sale and marketing director\" in our list of ONET titles, we can proceed and assign the SOC code \"11-2022.00\". "
436 | ]
437 | },
438 | {
439 | "cell_type": "markdown",
440 | "metadata": {},
441 | "source": [
442 | "## Map ONET job titles to newspaper job titles (CBOW-based)\n",
443 | "\n",
444 | "For those newspaper job titles with no exact match in our list of ONET job titles, we rely on our previously constructed CBOW model to assign the closest ONET job title to each newspaper job title. \n",
445 | "\n",
446 | "In the actual implementation, we set the dimension of the CBOW model to 300, as explained [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/CBOW.ipynb). For illustrative purposes, however, this IPython notebook provides examples using a CBOW model with dimension 5. The embedded code below illustrates how we construct this smaller CBOW model:"
447 | ]
448 | },
449 | {
450 | "cell_type": "markdown",
451 | "metadata": {},
452 | "source": [
453 | "***\n",
454 | " model = Word2Vec(LineSentence(open('ad_combined.txt')), \n",
455 | " size = 5, \n",
456 | " window = 5, \n",
457 | " min_count = 5, \n",
458 | " workers = multiprocessing.cpu_count())\n",
459 | "\n",
460 | " model.save('cbow_small.model')\n",
461 | "***"
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": 8,
467 | "metadata": {
468 | "collapsed": true
469 | },
470 | "outputs": [],
471 | "source": [
472 | "model = Word2Vec.load('cbow_small.model')\n",
473 | "# 'cbow_small.model' has dimension of 5.\n",
474 | "# In the actual implementation, we use our previously constructed 'cbow.model', which has dimension of 300. "
475 | ]
476 | },
477 | {
478 | "cell_type": "markdown",
479 | "metadata": {},
480 | "source": [
481 | "Our CBOW model provides a vector representation of each word in the corpus. For example:"
482 | ]
483 | },
484 | {
485 | "cell_type": "code",
486 | "execution_count": 9,
487 | "metadata": {},
488 | "outputs": [
489 | {
490 | "data": {
491 | "text/plain": [
492 | "array([-0.23945422, -0.33969662, -0.25194243, 0.86623007, 0.11592443], dtype=float32)"
493 | ]
494 | },
495 | "execution_count": 9,
496 | "metadata": {},
497 | "output_type": "execute_result"
498 | }
499 | ],
500 | "source": [
501 | "model['customer']"
502 | ]
503 | },
504 | {
505 | "cell_type": "code",
506 | "execution_count": 10,
507 | "metadata": {},
508 | "outputs": [
509 | {
510 | "data": {
511 | "text/plain": [
512 | "array([ 0.03195868, -0.56184751, 0.24374393, 0.58998656, 0.52517688], dtype=float32)"
513 | ]
514 | },
515 | "execution_count": 10,
516 | "metadata": {},
517 | "output_type": "execute_result"
518 | }
519 | ],
520 | "source": [
521 | "model['relation']"
522 | ]
523 | },
524 | {
525 | "cell_type": "code",
526 | "execution_count": 11,
527 | "metadata": {},
528 | "outputs": [
529 | {
530 | "data": {
531 | "text/plain": [
532 | "array([-0.52168244, -0.50416076, 0.10234968, 0.33064061, 0.59487033], dtype=float32)"
533 | ]
534 | },
535 | "execution_count": 11,
536 | "metadata": {},
537 | "output_type": "execute_result"
538 | }
539 | ],
540 | "source": [
541 | "model['specialist']"
542 | ]
543 | },
544 | {
545 | "cell_type": "markdown",
546 | "metadata": {},
547 | "source": [
548 | "We compute a vector representation of \"customer relation specialist\" as the sum of the vector representations of \"customer\", \"relation\", and \"specialist\"."
549 | ]
550 | },
551 | {
552 | "cell_type": "code",
553 | "execution_count": 12,
554 | "metadata": {},
555 | "outputs": [
556 | {
557 | "data": {
558 | "text/plain": [
559 | "array([-0.72917795, -1.40570486, 0.09415118, 1.78685713, 1.23597169], dtype=float32)"
560 | ]
561 | },
562 | "execution_count": 12,
563 | "metadata": {},
564 | "output_type": "execute_result"
565 | }
566 | ],
567 | "source": [
568 | "vector_title = model['customer'] + model['relation'] + model['specialist']\n",
569 | "vector_title"
570 | ]
571 | },
572 | {
573 | "cell_type": "markdown",
574 | "metadata": {},
575 | "source": [
    "As such, we can compute a vector representation of:\n",
577 | "\n",
578 | "1. All job titles from our newspaper data.\n",
579 | "2. All job titles from our list of ONET titles."
580 | ]
581 | },
582 | {
583 | "cell_type": "markdown",
584 | "metadata": {},
585 | "source": [
    "Suppose \"customer relation specialist\" appears as a newspaper job title. We first check whether it matches our list of ONET titles directly: "
587 | ]
588 | },
589 | {
590 | "cell_type": "code",
591 | "execution_count": 13,
592 | "metadata": {},
593 | "outputs": [
594 | {
595 | "data": {
596 | "text/plain": [
597 | "False"
598 | ]
599 | },
600 | "execution_count": 13,
601 | "metadata": {},
602 | "output_type": "execute_result"
603 | }
604 | ],
605 | "source": [
606 | "\"customer relation specialist\" in title2SOC['title'].values"
607 | ]
608 | },
609 | {
610 | "cell_type": "markdown",
611 | "metadata": {},
612 | "source": [
    "Since there is no direct match, we assign this title a vector representation and compute how similar it is to each of the ONET job titles. We use cosine similarity to measure how similar two vectors are: scores range from -1 to 1, and the closer the score is to 1, the more similar the two vectors. The results below show cosine similarity scores against several ONET job titles:"
614 | ]
615 | },
616 | {
617 | "cell_type": "code",
618 | "execution_count": 14,
619 | "metadata": {},
620 | "outputs": [
621 | {
622 | "name": "stdout",
623 | "output_type": "stream",
624 | "text": [
625 | "Computing cosine similarity of \"customer relation specialist\" to: \n",
626 | "----------------\n",
627 | "\"executive secretary\" = [[ 0.6176427]]\n",
628 | "\"mechanical engineer\" = [[ 0.80217057]]\n",
629 | "\"customer service assistant\" = [[ 0.96143997]]\n",
630 | "\"client relation specialist\" = [[ 0.99550998]]\n"
631 | ]
632 | }
633 | ],
634 | "source": [
635 | "vector_newspaper = model['customer'] + model['relation'] + model['specialist']\n",
636 | "\n",
637 | "print('Computing cosine similarity of \"customer relation specialist\" to: ')\n",
638 | "print('----------------')\n",
639 | "\n",
640 | "# compute similarity to \"executive secretary\" \n",
641 | "vector_to_match = model['executive'] + model['secretary']\n",
642 | "cosine = cosine_similarity(vector_to_match.reshape(1,-1), vector_newspaper.reshape(1,-1))\n",
643 | "print( '\"executive secretary\" = ' + str(cosine))\n",
644 | "\n",
645 | "# compute similarity to \"mechanical engineer\" \n",
646 | "vector_to_match = model['mechanical'] + model['engineer']\n",
647 | "cosine = cosine_similarity(vector_to_match.reshape(1,-1), vector_newspaper.reshape(1,-1))\n",
648 | "print( '\"mechanical engineer\" = ' + str(cosine))\n",
649 | "\n",
650 | "# compute similarity to \"customer service assistant\" \n",
651 | "vector_to_match = model['customer'] + model['service'] + model['assistant']\n",
652 | "cosine = cosine_similarity(vector_to_match.reshape(1,-1), vector_newspaper.reshape(1,-1))\n",
653 | "print( '\"customer service assistant\" = ' + str(cosine))\n",
654 | "\n",
655 | "# compute similarity to \"client relation specialist\" \n",
656 | "vector_to_match = model['client'] + model['relation'] + model['specialist']\n",
657 | "cosine = cosine_similarity(vector_to_match.reshape(1,-1), vector_newspaper.reshape(1,-1))\n",
658 | "print( '\"client relation specialist\" = ' + str(cosine))"
659 | ]
660 | },
661 | {
662 | "cell_type": "markdown",
663 | "metadata": {},
664 | "source": [
665 | "***\n",
    "Therefore, using the CBOW model, we conclude that \"customer relation specialist\" is closer in meaning to \"client relation specialist\" than to \"executive secretary\", \"mechanical engineer\", or \"customer service assistant.\" \n",
667 | "\n",
    "Even though \"customer relation specialist\" does not appear in our list of ONET job titles, our CBOW model suggests that it is extremely similar to \"client relation specialist\". There are two reasons why this is the case. First, the two titles share the words \"relation\" and \"specialist\". Second, our CBOW model suggests that \"client\" and \"customer\" are similar to each other:"
669 | ]
670 | },
671 | {
672 | "cell_type": "code",
673 | "execution_count": 15,
674 | "metadata": {},
675 | "outputs": [
676 | {
677 | "data": {
678 | "text/plain": [
679 | "array([[ 0.96610314]], dtype=float32)"
680 | ]
681 | },
682 | "execution_count": 15,
683 | "metadata": {},
684 | "output_type": "execute_result"
685 | }
686 | ],
687 | "source": [
688 | "cosine_similarity(model['client'].reshape(1,-1), model['customer'].reshape(1,-1))"
689 | ]
690 | },
691 | {
692 | "cell_type": "markdown",
693 | "metadata": {},
694 | "source": [
    "In the actual implementation, we compute cosine similarity scores against all 45,207 ONET job titles, a computation too large to perform in this IPython notebook. \n",
696 | "\n",
    "Nevertheless, it turns out that \"client relation specialist\" is indeed the closest ONET job title to \"customer relation specialist.\" We then assign \"customer relation specialist\" the same SOC code as \"client relation specialist.\" "
698 | ]
699 | },
700 | {
701 | "cell_type": "code",
702 | "execution_count": 16,
703 | "metadata": {},
704 | "outputs": [
705 | {
706 | "data": {
707 | "text/html": [
708 | "<div>\n",
709 | "\n",
722 | "<table border=\"1\" class=\"dataframe\">\n",
723 | "  <thead>\n",
724 | "    <tr style=\"text-align: right;\">\n",
725 | "      <th></th>\n",
726 | "      <th>title</th>\n",
727 | "      <th>original_title</th>\n",
728 | "      <th>soc</th>\n",
729 | "    </tr>\n",
730 | "  </thead>\n",
731 | "  <tbody>\n",
732 | "    <tr>\n",
733 | "      <th>14392</th>\n",
734 | "      <td>client relation specialist</td>\n",
735 | "      <td>Client Relations Specialist</td>\n",
736 | "      <td>43405100</td>\n",
737 | "    </tr>\n",
738 | "  </tbody>\n",
739 | "</table>\n",
740 | "</div>"
741 | ],
742 | "text/plain": [
743 | " title original_title soc\n",
744 | "14392 client relation specialist Client Relations Specialist 43405100"
745 | ]
746 | },
747 | "execution_count": 16,
748 | "metadata": {},
749 | "output_type": "execute_result"
750 | }
751 | ],
752 | "source": [
753 | "title2SOC[title2SOC['title'] == \"client relation specialist\"]"
754 | ]
755 | },
756 | {
757 | "cell_type": "markdown",
758 | "metadata": {},
759 | "source": [
760 | "## Some technical issues"
761 | ]
762 | },
763 | {
764 | "cell_type": "markdown",
765 | "metadata": {},
766 | "source": [
767 | "* We ignore job title words that are not in our CBOW model. \n",
    "* Unlike the LDA model, we do not stem words. As a result, the model treats different forms of a word, e.g., \"manage\" and \"management\", as distinct words. However, our CBOW model generally assigns them similar vector representations, for example: "
769 | ]
770 | },
771 | {
772 | "cell_type": "code",
773 | "execution_count": 17,
774 | "metadata": {},
775 | "outputs": [
776 | {
777 | "data": {
778 | "text/plain": [
779 | "array([[ 0.92724895]], dtype=float32)"
780 | ]
781 | },
782 | "execution_count": 17,
783 | "metadata": {},
784 | "output_type": "execute_result"
785 | }
786 | ],
787 | "source": [
788 | "cosine_similarity(model['manage'].reshape(1,-1), model['management'].reshape(1,-1))"
789 | ]
790 | },
791 | {
792 | "cell_type": "markdown",
793 | "metadata": {},
794 | "source": [
795 | "* Our CBOW model is invariant to the order of job title words, e.g., we consider \"executive secretary\" and \"secretary executive\" as the same title. "
796 | ]
797 | },
798 | {
799 | "cell_type": "code",
800 | "execution_count": 18,
801 | "metadata": {},
802 | "outputs": [
803 | {
804 | "data": {
805 | "text/plain": [
806 | "array([-0.5665881 , -0.73142403, 0.72307652, -0.10102642, 1.02186275], dtype=float32)"
807 | ]
808 | },
809 | "execution_count": 18,
810 | "metadata": {},
811 | "output_type": "execute_result"
812 | }
813 | ],
814 | "source": [
815 | "model['executive'] + model['secretary']"
816 | ]
817 | },
818 | {
819 | "cell_type": "code",
820 | "execution_count": 19,
821 | "metadata": {},
822 | "outputs": [
823 | {
824 | "data": {
825 | "text/plain": [
826 | "array([-0.5665881 , -0.73142403, 0.72307652, -0.10102642, 1.02186275], dtype=float32)"
827 | ]
828 | },
829 | "execution_count": 19,
830 | "metadata": {},
831 | "output_type": "execute_result"
832 | }
833 | ],
834 | "source": [
835 | "model['secretary'] + model['executive']"
836 | ]
837 | },
838 | {
839 | "cell_type": "markdown",
840 | "metadata": {},
841 | "source": [
    "* Common abbreviations have meanings similar to the words they abbreviate. For instance, \"rn\" is a common abbreviation for \"registered nurse\"; as a result, our CBOW model assigns them very similar vector representations: "
843 | ]
844 | },
845 | {
846 | "cell_type": "code",
847 | "execution_count": 20,
848 | "metadata": {},
849 | "outputs": [
850 | {
851 | "data": {
852 | "text/plain": [
853 | "array([[ 0.98632824]], dtype=float32)"
854 | ]
855 | },
856 | "execution_count": 20,
857 | "metadata": {},
858 | "output_type": "execute_result"
859 | }
860 | ],
861 | "source": [
862 | "vector_title = model['registered'] + model['nurse']\n",
863 | "cosine_similarity(model['rn'].reshape(1,-1), vector_title.reshape(1,-1))"
864 | ]
865 | },
866 | {
867 | "cell_type": "markdown",
868 | "metadata": {},
869 | "source": [
    "* There are rare circumstances in which our CBOW model suggests more than one \"closest\" ONET title for a newspaper job title, i.e., the cosine similarity scores are exactly equal. This can happen because some distinct ONET job titles, each mapping to a different SOC code, receive exactly the same vector representation from our CBOW model. For example, ONET registers \"wage and salary administrator\" as \"11-3111.00\" (Compensation and Benefits Managers) and \"salary and wage administrator\" as \"13-1141.00\" (Compensation, Benefits, and Job Analysis Specialists), yet our CBOW model, being invariant to word order, assigns the exact same vector representation to both titles. In these circumstances, we rely on Bureau of Labor Statistics employment data, see [here](https://www.bls.gov/oes/current/oes_nat.htm), and choose the SOC code with higher employment."
871 | ]
872 | },
873 | {
874 | "cell_type": "markdown",
875 | "metadata": {},
876 | "source": [
877 | "## Additional amendments\n",
878 | "\n",
    "Finally, we make the following additional amendments (see [here](https://ssc.wisc.edu/~eatalay/apst/apst_mapping.pdf) for more detail):\n",
880 | "\n",
    "1. We assign an SOC code of 999999 (“missing”) if certain words or phrases appear — “associate,” “career builder,” “liberal employee benefit,” “many employee benefit,” or “personnel” — anywhere in the job title, or for certain exact titles: “boys,” “boys boys,” “men boys girls,” “men boys girls women,” “men boys men,” “people,” “professional,” or “trainee.” These words and phrases appear commonly in our newspaper ads and do not refer to the SOC code which our CBOW model indicates. “Associate” commonly appears as part of the name of the firm placing the ad. “Personnel” commonly refers to the personnel department that the applicant should contact.\n",
882 | "\n",
    "2. We also replace the SOC code for the job title “Assistant” from 399021 (the SOC code for “Personal Care Aides”) to 436014 (the SOC code for “Secretaries and Administrative Assistants”). “Assistant” is the fifth most common job title and, judging by the text within the job ads, refers to a secretarial occupation rather than one for a personal care worker. While we are hesitant to modify our job-title-to-SOC mapping in an ad hoc fashion for any job title, mis-specifying this mapping for such a common title would have a noticeably deleterious impact on our dataset.\n",
884 | "\n",
    "3. In a final step, we amend the output of the CBOW model for a few ambiguously defined job titles. These final amendments have no discernible impact on aggregate trends in task content, on the role of within-occupation shifts in accounting for aggregate task changes, or on the role of shifts in the demand for tasks in accounting for increased earnings inequality. First, for job titles which include “server” and which do not also include a food-service-related word — banquet, bartender, cashier, cocktail, cook, dining, food, or restaurant — we substitute an SOC code beginning with 3530 with the SOC code for computer systems analysts (151121). Second, for job titles which contain the word “programmer” and do not include the words “cnc” or “machine,” we substitute SOC codes beginning with 5140 or 5141 with the SOC code for computer programmers (151131). Finally, for job titles which contain the word “assembler” and do not contain a word referring to manufacturing assembly work — words containing the strings electronic, electric, machin, mechanical, metal, and wire — we substitute SOC codes beginning with 5120 with the SOC code for computer programmers (151131). The amendments, which alter the SOC codes for approximately 0.2 percent of ads in our data set, are necessary for ongoing work in which we explore the role of new technologies in the labor market. Certain words refer both to job titles unrelated to new technologies and to the new technologies themselves. By linking the aforementioned job titles to SOCs that have no exposure to new technologies, we would be vastly overstating the rates at which food service staff or manufacturing production workers adopt new ICT software. On the other hand, since these 8 ads represent a small portion of the ads referring to computer programmer occupations, lumping the ambiguous job titles with the computer programmer SOC codes will only have a minor effect on the assessed technology adoption rates for computer programmers."
886 | ]
887 | }
888 | ],
889 | "metadata": {
890 | "kernelspec": {
891 | "display_name": "Python 3",
892 | "language": "python",
893 | "name": "python3"
894 | },
895 | "language_info": {
896 | "codemirror_mode": {
897 | "name": "ipython",
898 | "version": 3
899 | },
900 | "file_extension": ".py",
901 | "mimetype": "text/x-python",
902 | "name": "python",
903 | "nbconvert_exporter": "python",
904 | "pygments_lexer": "ipython3",
905 | "version": "3.6.1"
906 | }
907 | },
908 | "nbformat": 4,
909 | "nbformat_minor": 2
910 | }
911 |
--------------------------------------------------------------------------------
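The matching procedure described in CBOW.ipynb above — sum the CBOW vectors of a title's words, then rank candidate ONET titles by cosine similarity — can be sketched end-to-end. The five-dimensional vectors below are illustrative stand-ins for the trained embedding model, not values from the actual data:

```python
import math

# Toy five-dimensional word vectors standing in for the trained CBOW model
# (model['word'] in the notebook). The real vectors are learned from the
# newspaper corpus; these values are illustrative only.
model = {
    "customer":   [-0.24, -0.34, -0.25, 0.87, 0.12],
    "client":     [-0.20, -0.36, -0.22, 0.84, 0.15],
    "relation":   [ 0.03, -0.56,  0.24, 0.59, 0.53],
    "specialist": [-0.52, -0.50,  0.10, 0.33, 0.59],
    "executive":  [-0.30, -0.20,  0.62, -0.43, 0.43],
    "secretary":  [-0.27, -0.53,  0.10,  0.33, 0.59],
}

def title_vector(title):
    # Sum the per-word vectors. Words absent from the model are ignored,
    # and the sum is invariant to word order, as noted in the notebook.
    vec = [0.0] * 5
    for word in title.split():
        if word in model:
            vec = [a + b for a, b in zip(vec, model[word])]
    return vec

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Rank candidate ONET titles against the unmatched newspaper title.
newspaper = title_vector("customer relation specialist")
candidates = ["executive secretary", "client relation specialist"]
best = max(candidates, key=lambda t: cosine_similarity(title_vector(t), newspaper))
print(best)  # -> "client relation specialist"
```

In the actual pipeline the lookup is a trained word-embedding model, the candidate list contains all 45,207 ONET titles, and exact ties are broken using BLS employment data as described in the notebook.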
/data_cleaning/auxiliary files/OCRcorrect_enchant.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import sys
4 | import nltk
5 | import enchant, difflib
6 | import operator
7 | from enchant import DictWithPWL
8 |
9 | #...............................................#
10 | # This python function performs word-by-word spelling correction
11 | #...............................................#
12 |
13 | def EnchantErrorCorrection(InputByLine,mydictfile):
14 |
15 | # "InputByLine" is a string of the text by line.
16 | # "mydictfile" is a filename (e.g., "myPWL.txt") for personal word list
17 | # The function returns " ' '.join(OutputList) " as a string
18 |
19 | d = enchant.DictWithPWL('en_US', mydictfile) # define spell-checker
20 |
21 | # http://pythonhosted.org/pyenchant/tutorial.html
22 | # http://stackoverflow.com/questions/22898355/pyenchant-spellchecking-block-of-text-with-a-personal-word-list
23 |
24 | InputList = [w for w in re.split(' ',InputByLine) if not w=='']
25 | OutputList = list()
26 |
27 | for Word in InputList:
28 | if len(Word)>=3: # only check words with length greater than or equal to 3
29 | if d.check(Word): #d.check() is TRUE if the word is correctly spelled
30 | OutputList.append(Word) #append the old word back
32 | else: #d.check() is FALSE if the word is incorrectly spelled
32 | correct = d.suggest(Word) #get a suggestion
33 | count=0
34 | if correct: #if a suggestion is not empty
35 | dictTemp,maxTemp = {},0 ##ea
36 | for b in correct: ## ea
37 | count=count+1
38 | if count<8:
39 | tmp = max(0,difflib.SequenceMatcher(None, Word.lower(), b.lower()).ratio()-(1e-3)*count); ##ea
40 | dictTemp[tmp] = b ##ea
41 | if tmp > maxTemp: ##ea
42 | maxTemp = tmp ##ea
43 | if maxTemp>=0.8:
44 | OutputList.append(dictTemp[maxTemp]) ##ea
45 | else:
46 | OutputList.append(Word)
47 | else: #if a suggestion is empty, just append the old word back
48 | OutputList.append(Word)
49 | else: # if the word is less than 3 characters, just append the same word back to output
50 | OutputList.append(Word)
51 |
52 | return ' '.join(OutputList)
53 |
54 | #...............................................#
55 |
--------------------------------------------------------------------------------
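The selection rule in `EnchantErrorCorrection` above — score each of the first few suggestions by `difflib.SequenceMatcher` similarity, subtract a small penalty proportional to the suggestion's rank, and accept the best candidate only if its score reaches 0.8 — can be exercised in isolation. The suggestion lists below are hand-supplied stand-ins for enchant's `d.suggest()`, so the sketch runs without pyenchant installed:

```python
import difflib

def best_suggestion(word, suggestions, threshold=0.8, max_candidates=7):
    """Mirror the scoring loop in EnchantErrorCorrection: similarity ratio
    minus a 1e-3 penalty per rank position; keep the best-scoring candidate,
    falling back to the original word when no candidate clears the threshold."""
    best_word, best_score = word, 0.0
    for rank, cand in enumerate(suggestions[:max_candidates], start=1):
        ratio = difflib.SequenceMatcher(None, word.lower(), cand.lower()).ratio()
        score = max(0.0, ratio - 1e-3 * rank)
        if score > best_score:
            best_word, best_score = cand, score
    return best_word if best_score >= threshold else word

# Hand-supplied suggestions standing in for d.suggest("secretray"):
print(best_suggestion("secretray", ["secretary", "secrete", "stray"]))  # -> secretary
print(best_suggestion("xqzv", ["quiz", "size"]))  # -> xqzv (no close match)
```

The rank penalty makes the rule prefer earlier suggestions when two candidates are equally similar, and the 0.8 floor keeps badly-OCRed tokens from being replaced by unrelated dictionary words.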
/data_cleaning/auxiliary files/OCRcorrect_hyphen.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import sys
4 | import nltk
5 | import enchant, difflib
6 | import operator
7 | from enchant import DictWithPWL
8 | from edit_distance import *
9 |
10 | #...............................................#
11 | # This python function performs spelling correction
12 | # on words with hyphen
13 | #...............................................#
14 |
15 | def CorrectHyphenated(InputByLine,mydictfile):
16 |
17 | # "InputByLine" is a string of the text by line.
18 | # "mydictfile" is a filename (e.g., "myPWL.txt") for personal word list
19 | # The function returns a string as output
20 |
21 | d = enchant.DictWithPWL('en_US', mydictfile) # define spell-checker
22 | # http://pythonhosted.org/pyenchant/tutorial.html
23 | # http://stackoverflow.com/questions/22898355/pyenchant-spellchecking-block-of-text-with-a-personal-word-list
24 |
25 | text = InputByLine
26 |
27 | HyphenWords = re.findall(r'\b[a-zA-Z]+-\s?[a-zA-Z]+\b', InputByLine)
28 | # "HyphenWords" is a list of potential hyphen word corrections
29 |
30 | for word in HyphenWords:
31 | WordForCheck = re.sub('[- ]','',word)
32 | # Newspapers tend to cut to a new line in the middle of a word.
33 | # Therefore, most corrections are just removing "-" and " "
34 | CorrectionFlag = 0 #indicator for correction
35 | if d.check(word): # if the word (with hyphen) is already correct
36 | pass #do nothing
37 | elif d.check(WordForCheck): # elif the word without "-" and " " is correct
38 | Correction = WordForCheck
39 | CorrectionFlag = 1
40 | elif d.suggest(WordForCheck): #get a suggestion
41 | ListSuggest = [w for w in d.suggest(WordForCheck) if not ' ' in w]
42 | if len(ListSuggest) > 0:
43 | DistanceSuggest = [EditDistance(w,WordForCheck) for w in ListSuggest]
44 | min_index, min_value = min(enumerate(DistanceSuggest), key=operator.itemgetter(1))
45 | if min_value <= 3: #if the edit distance does not exceed 3
46 | Correction = ListSuggest[min_index]
47 | CorrectionFlag = 1
48 |
49 | if CorrectionFlag == 1:
50 | text = re.sub(word,Correction,text)
51 |
52 | return text
53 |
54 | #...............................................#
55 |
--------------------------------------------------------------------------------
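The core move in `CorrectHyphenated` above — find `word- word` fragments produced by newspaper line breaks and rejoin them when the concatenation is a valid word — can be sketched with a toy word set in place of the enchant dictionary, so the example runs without pyenchant:

```python
import re

# Toy stand-in for enchant's d.check(); the real code consults en_US plus
# a personal word list (PWL.txt).
VALID = {"management", "experience", "full-time"}

def rejoin_hyphenated(text):
    """Rejoin words split across line breaks, e.g. 'manage- ment' -> 'management'."""
    for frag in re.findall(r'\b[a-zA-Z]+-\s?[a-zA-Z]+\b', text):
        if frag in VALID:
            continue  # a legitimate hyphenated word: leave it alone
        joined = re.sub('[- ]', '', frag)
        if joined in VALID:
            text = text.replace(frag, joined)
    return text

print(rejoin_hyphenated("prior manage- ment experi- ence required"))
# -> "prior management experience required"
```

The original script additionally falls back to `d.suggest()` and keeps a suggestion only when its edit distance from the joined fragment is at most 3; this sketch shows only the common case, where deleting the hyphen and stray space already yields a dictionary word.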
/data_cleaning/auxiliary files/PWL.txt:
--------------------------------------------------------------------------------
1 | Abilene
2 | Akron
3 | Alameda
4 | Albany
5 | Albuquerque
6 | Alexandria
7 | Alhambra
8 | Allentown
9 | Allis
10 | Alto
11 | Amarillo
12 | Ames
13 | Anaheim
14 | Anchorage
15 | Anderson
16 | Angeles
17 | Angelo
18 | Antioch
19 | Antonio
20 | Appleton
21 | Arcadia
22 | Arlington
23 | Arthur
24 | Arvada
25 | Asheville
26 | Athens
27 | Atlanta
28 | Augusta
29 | Aurora
30 | Austin
31 | Bakersfield
32 | Baldwin
33 | Baltimore
34 | Barbara
35 | Baton
36 | Bayonne
37 | Baytown
38 | Beaumont
39 | Beaverton
40 | Bedford
41 | Bellevue
42 | Bellflower
43 | Bellingham
44 | Bend
45 | Berkeley
46 | Bernardino
47 | Berwyn
48 | Bethlehem
49 | Billings
50 | Biloxi
51 | Birmingham
52 | Bismarck
53 | Bloomington
54 | Boca
55 | Boise
56 | Bolingbrook
57 | Bossier
58 | Boston
59 | Boulder
60 | Bowie
61 | Boynton
62 | Bridgeport
63 | Bristol
64 | Britain
65 | Brockton
66 | Brooklyn
67 | Brownsville
68 | Bryan
69 | Buena
70 | Buenaventura
71 | Buffalo
72 | Burbank
73 | Burnsville
74 | Cajon
75 | Camarillo
76 | Cambridge
77 | Camden
78 | Canton
79 | Cape
80 | Carlsbad
81 | Carrollton
82 | Carson
83 | Cary
84 | Cedar
85 | Cerritos
86 | Champaign
87 | Chandler
88 | Charles
89 | Charleston
90 | Charlotte
91 | Chattanooga
92 | Chesapeake
93 | Cheyenne
94 | Chicago
95 | Chico
96 | Chicopee
97 | Chino
98 | Christi
99 | Chula
100 | Cicero
101 | Cincinnati
102 | Citrus
103 | Clair
104 | Claire
105 | Clara
106 | Clarita
107 | Clarksville
108 | Clearwater
109 | Cleveland
110 | Clifton
111 | Clovis
112 | College
113 | Collins
114 | Colorado
115 | Columbia
116 | Columbus
117 | Compton
118 | Concord
119 | Coon
120 | Coral
121 | Corona
122 | Corpus
123 | Costa
124 | Council
125 | Covina
126 | Cranston
127 | Crosse
128 | Cruces
129 | Cruz
130 | Cucamonga
131 | Cupertino
132 | Dallas
133 | Daly
134 | Danbury
135 | Davenport
136 | Davidson
137 | Davie
138 | Davis
139 | Dayton
140 | Daytona
141 | Dearborn
142 | Decatur
143 | Deerfield
144 | Delray
145 | Deltona
146 | Denton
147 | Denver
148 | Detroit
149 | Diego
150 | Dothan
151 | Downey
152 | Dubuque
153 | Duluth
154 | Durham
155 | Eagan
156 | Eau
157 | Eden
158 | Edmond
159 | Elgin
160 | Elizabeth
161 | Elkhart
162 | Elyria
163 | Encinitas
164 | Erie
165 | Escondido
166 | Euclid
167 | Eugene
168 | Evanston
169 | Evansville
170 | Everett
171 | Fairfield
172 | Falls
173 | Fargo
174 | Farmington
175 | Fayette
176 | Fayetteville
177 | Federal
178 | Flagstaff
179 | Flint
180 | Florissant
181 | Folsom
182 | Fontana
183 | Francisco
184 | Frederick
185 | Fremont
186 | Fresno
187 | Fullerton
188 | Gainesville
189 | Gaithersburg
190 | Galveston
191 | Garden
192 | Gardena
193 | Garland
194 | Gary
195 | Gastonia
196 | Gilbert
197 | Glendale
198 | Greeley
199 | Greensboro
200 | Greenville
201 | Gresham
202 | Gulfport
203 | Habra
204 | Hamilton
205 | Hammond
206 | Hampton
207 | Harlingen
208 | Hartford
209 | Haute
210 | Haven
211 | Haverhill
212 | Hawthorne
213 | Hayward
214 | Hemet
215 | Hempstead
216 | Henderson
217 | Hesperia
218 | Hialeah
219 | Hillsboro
220 | Hollywood
221 | Hoover
222 | Houston
223 | Huntington
224 | Huntsville
225 | Idaho
226 | Independence
227 | Indianapolis
228 | Inglewood
229 | Iowa
230 | Irvine
231 | Irving
232 | Jackson
233 | Jacksonville
234 | Janesville
235 | Jersey
236 | Johnson
237 | Joliet
238 | Jonesboro
239 | Jordan
240 | Jose
241 | Joseph
242 | Kalamazoo
243 | Kansas
244 | Kenner
245 | Kennewick
246 | Kenosha
247 | Kent
248 | Kettering
249 | Killeen
250 | Knoxville
251 | Lafayette
252 | Laguna
253 | Lakeland
254 | Lakewood
255 | Lancaster
256 | Lansing
257 | Laredo
258 | Largo
259 | Lauderdale
260 | Lauderhill
261 | Lawrence
262 | Lawton
263 | Layton
264 | Leandro
265 | Lee
266 | Lewisville
267 | Lexington
268 | Lincoln
269 | Linda
270 | Little
271 | Livermore
272 | Livonia
273 | Lodi
274 | Longmont
275 | Longview
276 | Lorain
277 | Louis
278 | Louisville
279 | Loveland
280 | Lowell
281 | Lubbock
282 | Lucie
283 | Lynchburg
284 | Lynn
285 | Lynwood
286 | Macon
287 | Madison
288 | Malden
289 | Manchester
290 | Maple
291 | Marcos
292 | Margate
293 | Maria
294 | Marietta
295 | Mateo
296 | McAllen
297 | McKinney
298 | Medford
299 | Melbourne
300 | Memphis
301 | Mentor
302 | Merced
303 | Meriden
304 | Mesa
305 | Mesquite
306 | Miami
307 | Middletown
308 | Midland
309 | Midwest
310 | Milford
311 | Milpitas
312 | Milwaukee
313 | Minneapolis
314 | Minnetonka
315 | Miramar
316 | Missoula
317 | Missouri
318 | Mobile
319 | Modesto
320 | Moines
321 | Monica
322 | Monroe
323 | Monte
324 | Montebello
325 | Monterey
326 | Montgomery
327 | Moreno
328 | Muncie
329 | Murfreesboro
330 | Nampa
331 | Napa
332 | Naperville
333 | Nashua
334 | Nashville
335 | National
336 | Newark
337 | Newport
338 | Newton
339 | Niagara
340 | Niguel
341 | Norfolk
342 | Norman
343 | Norwalk
344 | Oakland
345 | Oceanside
346 | Odessa
347 | Ogden
348 | Oklahoma
349 | Olathe
350 | Omaha
351 | Ontario
352 | Orange
353 | Orem
354 | Orland
355 | Orlando
356 | Orleans
357 | Oshkosh
358 | Overland
359 | Owensboro
360 | Oxnard
361 | Palatine
362 | Palmdale
363 | Palo
364 | Paramount
365 | Parma
366 | Pasadena
367 | Paso
368 | Passaic
369 | Paterson
370 | Paul
371 | Pawtucket
372 | Pembroke
373 | Pensacola
374 | Peoria
375 | Petaluma
376 | Peters
377 | Petersburg
378 | Philadelphia
379 | Phoenix
380 | Pico
381 | Pittsburg
382 | Pittsburgh
383 | Plaines
384 | Plano
385 | Plantation
386 | Pleasanton
387 | Plymouth
388 | Pocatello
389 | Pomona
390 | Pompano
391 | Pontiac
392 | Portland
393 | Portsmouth
394 | Prairie
395 | Providence
396 | Provo
397 | Pueblo
398 | Quincy
399 | Racine
400 | Rafael
401 | Raleigh
402 | Rancho
403 | Rapid
404 | Rapids
405 | Raton
406 | Reading
407 | Redding
408 | Redlands
409 | Redondo
410 | Redwood
411 | Reno
412 | Renton
413 | Rialto
414 | Richardson
415 | Richlands
416 | Richmond
417 | Rio
418 | Rivera
419 | Riverside
420 | Roanoke
421 | Rochelle
422 | Rochester
423 | Rockford
424 | Rocky
425 | Rosa
426 | Rosemead
427 | Roseville
428 | Roswell
429 | Rouge
430 | Sacramento
431 | Saginaw
432 | Salem
433 | Salinas
434 | Sandy
435 | Santee
436 | Sarasota
437 | Savannah
438 | Schaumburg
439 | Schenectady
440 | Scottsdale
441 | Scranton
442 | Seattle
443 | Sheboygan
444 | Shoreline
445 | Shreveport
446 | Simi
447 | Sioux
448 | Skokie
449 | Smith
450 | Somerville
451 | Southfield
452 | Sparks
453 | Spokane
454 | Springfield
455 | Stamford
456 | Sterling
457 | Stockton
458 | Suffolk
459 | Sugar
460 | Sunnyvale
461 | Sunrise
462 | Syracuse
463 | Tacoma
464 | Tallahassee
465 | Tamarac
466 | Tampa
467 | Taunton
468 | Taylor
469 | Taylorsville
470 | Temecula
471 | Tempe
472 | Temple
473 | Terre
474 | Thornton
475 | Toledo
476 | Topeka
477 | Torrance
478 | Tracy
479 | Trenton
480 | Troy
481 | Tucson
482 | Tulsa
483 | Turlock
484 | Tuscaloosa
485 | Tustin
486 | Tyler
487 | Upland
488 | Utica
489 | Vacaville
490 | Vallejo
491 | Vancouver
492 | Vegas
493 | Vernon
494 | Victoria
495 | Victorville
496 | Viejo
497 | Vineland
498 | Virginia
499 | Visalia
500 | Vista
501 | Waco
502 | Waltham
503 | Warren
504 | Warwick
505 | Washington
506 | Waterbury
507 | Waterloo
508 | Waukegan
509 | Waukesha
510 | Wayne
511 | Westland
512 | Westminster
513 | Wheaton
514 | Whittier
515 | Wichita
516 | Wilmington
517 | Winston
518 | Worcester
519 | Worth
520 | Wyoming
521 | Yakima
522 | Yonkers
523 | Yorba
524 | York
525 | Youngstown
526 | Yuma
527 | allen
528 | america
529 | american
530 | api
531 | apl
532 | aquacultural
533 | assistive
534 | autocad
535 | autocad
536 | autodesk
537 | bal
538 | bandsaws
539 | barcode
540 | benchtop
541 | biofuels
542 | bioinformatics
543 | blockmasons
544 | bsee
545 | burets
546 | businessobjects
547 | cae
548 | cam
549 | cannulas
550 | catheterization
551 | cdl
552 | chromatographs
553 | cics
554 | cobol
555 | comal
556 | cplus
557 | cplusplus
558 | crimpers
559 | curettes
560 | cyclers
561 | dataloggers
562 | db2
563 | dbms
564 | deburring
565 | defibrillators
566 | doppler
567 | dos
568 | dragline
569 | dynamometers
570 | echography
571 | electrocautery
572 | electrosurgical
573 | endotracheal
574 | english
575 | enteral
576 | epidiascopes
577 | extruders
578 | facebook
579 | flowmeters
580 | fluorimeters
581 | fortran
582 | freeware
583 | fundraising
584 | gauge
585 | gauges
586 | geospatial
587 | glucometers
588 | groundskeeping
589 | handheld
590 | handtrucks
591 | healthcare
592 | html
593 | html5
594 | hvac
595 | hypertext
596 | idms
597 | imagers
598 | ims
599 | inkjet
600 | internet
601 | j2ee
602 | javascript
603 | jcl
604 | krl
605 | laminators
606 | lan
607 | laryngoscopes
608 | lis
609 | locators
610 | logisticians
611 | longnose
612 | loupes
613 | manlift
614 | measurers
615 | mgmt
616 | microcentrifuges
617 | microcontrollers
618 | microplate
619 | microsoft
620 | mis
621 | ms-excel
622 | ms-power
623 | ms-word
624 | multilimb
625 | multiline
626 | multimeters
627 | mvs
628 | nebulizer
629 | needlenose
630 | netare
631 | nonfarm
632 | nonrestaurant
633 | novell
634 | offbearers
635 | onetcenter
636 | online
637 | ophthalmoscopes
638 | oracle
639 | otoscopes
640 | oximeter
641 | oximeters
642 | pascal
643 | patternmakers
644 | pdp
645 | photonics
646 | photovoltaic
647 | pipelayers
648 | pl/m
649 | pl/sql
650 | powerbuilder
651 | powerpoint
652 | psychrometers
653 | quickbooks
654 | radarbased
655 | recordkeeping
656 | reddit
657 | reflectometers
658 | sap
659 | sas
660 | screwguns
661 | scribers
662 | sharepoint
663 | sonographers
664 | spectrofluorimeters
665 | specula
666 | sphygmomanometers
667 | spirometers
668 | sql
669 | stimulators
670 | stumbleupon
671 | sybase
672 | syllabi
673 | tcp-ip
674 | tcpip
675 | tinners
676 | transcutaneous
677 | trephines
678 | tso
679 | ultracentrifuges
680 | univac
681 | unix
682 | vax
683 | viscosimeters
684 | visio
685 | visualbasic
686 | vms
687 | vsam
688 | vtam
689 | wattmeters
690 | webcams
691 | weighers
692 | whiteboards
693 | widemouth
694 | wordperfect
695 | workflow
696 | x-ray
697 | xray
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/TitleBase.txt:
--------------------------------------------------------------------------------
1 | abstracter
2 | abstracters
3 | abstractor
4 | abstractors
5 | accounting
6 | accountings
7 | accountant
8 | accountants
9 | actor
10 | actors
11 | actress
12 | actresses
13 | actuarial
14 | actuarials
15 | actuaries
16 | actuary
17 | acupuncturist
18 | acupuncturists
19 | adjudicator
20 | adjudicators
21 | adjuster
22 | adjusters
23 | administrator
24 | administrators
25 | advisor
26 | advisors
27 | advocate
28 | advocates
29 | aesthetician
30 | aestheticians
31 | agent
32 | agents
33 | agronomist
34 | agronomists
35 | aid
36 | aide
37 | aides
38 | aids
39 | allergist
40 | allergists
41 | ambassador
42 | ambassadors
43 | analyst
44 | analysts
45 | analyzer
46 | analyzers
47 | anchor
48 | anchors
49 | ancillaries
50 | ancillary
51 | anesthesiologist
52 | anesthesiologists
53 | anesthetist
54 | anesthetists
55 | animator
56 | animators
57 | announcer
58 | announcers
59 | anodizer
60 | anodizers
61 | anthropologist
62 | anthropologists
63 | applicator
64 | applicators
65 | appraiser
66 | appraisers
67 | apprentice
68 | apprentices
69 | aquarist
70 | aquarists
71 | arbiter
72 | arbiters
73 | arbitrator
74 | arbitrators
75 | arborist
76 | arborists
77 | archaeologist
78 | archaeologists
79 | archeologist
80 | archeologists
81 | architect
82 | architects
83 | archivist
84 | archivists
85 | arranger
86 | arrangers
87 | artisan
88 | artisans
89 | artist
90 | artists
91 | assembler
92 | assemblers
93 | assessor
94 | assessors
95 | assistant
96 | assistants
97 | associate
98 | associates
99 | asst
100 | assts
101 | astronomer
102 | astronomers
103 | astrophysicist
104 | astrophysicists
105 | athlete
106 | athletes
107 | attendant
108 | attendants
109 | attorney
110 | attorneys
111 | audience
112 | audiences
113 | audiologist
114 | audiologists
115 | audioprosthologist
116 | audioprosthologists
117 | auditor
118 | auditors
119 | author
120 | authorizer
121 | authorizers
122 | authors
123 | bacteriologist
124 | bacteriologists
125 | bagger
126 | baggers
127 | bailiff
128 | bailiffs
129 | baker
130 | bakers
131 | baler
132 | balers
133 | ballerina
134 | ballerinas
135 | bander
136 | banders
137 | banker
138 | bankers
139 | barber
140 | barbers
141 | barista
142 | baristas
143 | bartacker
144 | bartackers
145 | bartender
146 | bartenders
147 | batchmaker
148 | batchmakers
149 | bellhop
150 | bellhops
151 | bellman
152 | bellmen
153 | bender
154 | benders
155 | biller
156 | billers
157 | binder
158 | binders
159 | biochemist
160 | biochemists
161 | bioinformaticist
162 | bioinformaticists
163 | biologist
164 | biologists
165 | biometrist
166 | biometrists
167 | biophysicist
168 | biophysicists
169 | biostatistician
170 | biostatisticians
171 | biotechnician
172 | biotechnicians
173 | blacksmith
174 | blacksmiths
175 | blaster
176 | blasters
177 | blender
178 | blenders
179 | blower
180 | blowers
181 | boilermaker
182 | boilermakers
183 | bolter
184 | bolters
185 | bookkeeper
186 | bookkeepers
187 | boss
188 | bosses
189 | bosun
190 | bosuns
191 | boy
192 | boys
193 | brakeman
194 | brakemen
195 | brazer
196 | brazers
197 | breaker
198 | breakers
199 | breeder
200 | breeders
201 | brewer
202 | brewers
203 | bricker
204 | brickers
205 | bricklayer
206 | bricklayers
207 | broker
208 | brokers
209 | buffer
210 | buffers
211 | builder
212 | builders
213 | buncher
214 | bunchers
215 | bundler
216 | bundlers
217 | businessman
218 | businessmen
219 | buster
220 | busters
221 | butcher
222 | butchers
223 | buyer
224 | buyers
225 | cabinetmaker
226 | cabinetmakers
227 | calibrator
228 | calibrators
229 | caller
230 | callers
231 | captain
232 | captains
233 | caretaker
234 | caretakers
235 | carman
236 | carmen
237 | carpenter
238 | carpenters
239 | carrier
240 | carriers
241 | cartographer
242 | cartographers
243 | carver
244 | carvers
245 | cashier
246 | cashiers
247 | caster
248 | casters
249 | caterer
250 | caterers
251 | cellist
252 | cellists
253 | ceramist
254 | ceramists
255 | champion
256 | champions
257 | changer
258 | changers
259 | chaplain
260 | chaplains
261 | chauffeur
262 | chauffeurs
263 | checker
264 | checkers
265 | chef
266 | chefs
267 | chemist
268 | chemists
269 | chief
270 | chiefs
271 | chiropractor
272 | chiropractors
273 | choreographer
274 | choreographers
275 | claim
276 | claims
277 | clarinetist
278 | clarinetists
279 | cleaner
280 | cleaners
281 | clergies
282 | clergy
283 | clerk
284 | clerks
285 | climber
286 | climbers
287 | clinician
288 | clinicians
289 | closer
290 | closers
291 | clothier
292 | clothiers
293 | coach
294 | coaches
295 | coater
296 | coaters
297 | coder
298 | coders
299 | collector
300 | collectors
301 | comedian
302 | comedians
303 | commander
304 | commanders
305 | commissioner
306 | commissioners
307 | competitor
308 | competitors
309 | compiler
310 | compilers
311 | composer
312 | composers
313 | compounder
314 | compounders
315 | comptroller
316 | comptrollers
317 | concierge
318 | concierges
319 | conciliator
320 | conciliators
321 | conductor
322 | conductors
323 | confessor
324 | confessors
325 | connector
326 | connectors
327 | conservationist
328 | conservationists
329 | conservator
330 | conservators
331 | constructor
332 | constructors
333 | consultant
334 | consultants
335 | contractor
336 | contractors
337 | controller
338 | controllers
339 | conveyor
340 | conveyors
341 | cook
342 | cooks
343 | coordinator
344 | coordinators
345 | copilot
346 | copilots
347 | cordwainer
348 | cordwainers
349 | coremaker
350 | coremakers
351 | coroner
352 | coroners
353 | correspondent
354 | correspondents
355 | cosmetologist
356 | cosmetologists
357 | costumer
358 | costumers
359 | counsel
360 | counselor
361 | counselors
362 | counsels
363 | counter
364 | counters
365 | courier
366 | couriers
367 | coutierier
368 | coutieriers
369 | couturiere
370 | couturieres
371 | coverer
372 | coverers
373 | crabber
374 | crabbers
375 | crafter
376 | crafters
377 | craftsman
378 | craftsmen
379 | criminalist
380 | criminalists
381 | cryptographer
382 | cryptographers
383 | curator
384 | curators
385 | custodian
386 | custodians
387 | cutter
388 | cutters
389 | cytogenetic
390 | cytogeneticist
391 | cytogeneticists
392 | cytogenetics
393 | cytopathologist
394 | cytopathologists
395 | cytotechnologist
396 | cytotechnologists
397 | dancer
398 | dancers
399 | dealer
400 | dealers
401 | dean
402 | deans
403 | deboner
404 | deboners
405 | deburrer
406 | deburrers
407 | decaler
408 | decalers
409 | decorator
410 | decorators
411 | deliverer
412 | deliverers
413 | demonstrator
414 | demonstrators
415 | dentist
416 | dentists
417 | deputies
418 | deputy
419 | dermatologist
420 | dermatologists
421 | dermatopathologist
422 | dermatopathologists
423 | designer
424 | designers
425 | detail
426 | detailer
427 | detailers
428 | details
429 | detective
430 | detectives
431 | developer
432 | developers
433 | dietician
434 | dieticians
435 | dietitian
436 | dietitians
437 | digger
438 | diggers
439 | director
440 | directors
441 | dishwashe
442 | dishwashes
443 | dispatcher
444 | dispatchers
445 | dispenser
446 | dispensers
447 | displayer
448 | displayers
449 | distributor
450 | distributors
451 | diver
452 | divers
453 | docent
454 | docents
455 | doctor
456 | doctors
457 | doorman
458 | doormen
459 | dosimetrist
460 | dosimetrists
461 | drafter
462 | drafters
463 | draftsman
464 | draftsmen
465 | draper
466 | drapers
467 | dredger
468 | dredgers
469 | dresser
470 | dressers
471 | dressmaker
472 | dressmakers
473 | driller
474 | drillers
475 | driver
476 | drivers
477 | dyer
478 | dyers
479 | ecologist
480 | ecologists
481 | economist
482 | economists
483 | editor
484 | editors
485 | educator
486 | educators
487 | electrician
488 | electricians
489 | embalmer
490 | embalmers
491 | emcee
492 | emcees
493 | employee
494 | employees
495 | endocrinologist
496 | endocrinologists
497 | engineer
498 | engineers
499 | engraver
500 | engravers
501 | entertainer
502 | entertainers
503 | epidemiologist
504 | epidemiologists
505 | erector
506 | erectors
507 | ergonomist
508 | ergonomists
509 | escort
510 | escorts
511 | esthetician
512 | estheticians
513 | estimator
514 | estimators
515 | etcher
516 | etchers
517 | evaluator
518 | evaluators
519 | examiner
520 | examiners
521 | executive
522 | executives
523 | expediter
524 | expediters
525 | expeditor
526 | expeditors
527 | expert
528 | experts
529 | extender
530 | extenders
531 | exterminator
532 | exterminators
533 | fabricator
534 | fabricators
535 | facetor
536 | facetors
537 | facialist
538 | facialists
539 | facilitator
540 | facilitators
541 | faculties
542 | faculty
543 | faller
544 | fallers
545 | farmer
546 | farmers
547 | farmworker
548 | farmworkers
549 | feeder
550 | feeders
551 | feller
552 | fellers
553 | fellow
554 | fellows
555 | fiberglasser
556 | fiberglassers
557 | fieldman
558 | fieldmen
559 | fighter
560 | fighters
561 | filer
562 | filers
563 | filler
564 | fillers
565 | finisher
566 | finishers
567 | firefighter
568 | firefighters
569 | fireman
570 | firemen
571 | firers
572 | firerss
573 | fisher
574 | fishers
575 | fitter
576 | fitters
577 | fixer
578 | fixers
579 | floorpeople
580 | floorperson
581 | florist
582 | florists
583 | follower
584 | followers
585 | foreman
586 | foremen
587 | forester
588 | foresters
589 | forwarder
590 | forwarders
591 | framer
592 | framers
593 | fundraiser
594 | fundraisers
595 | gaffer
596 | gaffers
597 | gardener
598 | gardeners
599 | gastroenterologist
600 | gastroenterologists
601 | gatherer
602 | gatherers
603 | gauger
604 | gaugers
605 | gemologist
606 | gemologists
607 | generalist
608 | generalists
609 | geneticist
610 | geneticists
611 | geodesist
612 | geodesists
613 | geographer
614 | geographers
615 | geologist
616 | geologists
617 | geophysicist
618 | geophysicists
619 | geoscientist
620 | geoscientists
621 | giver
622 | givers
623 | glazer
624 | glazers
625 | glazier
626 | glaziers
627 | goldsmith
628 | goldsmiths
629 | grader
630 | graders
631 | greeter
632 | greeters
633 | grinder
634 | grinders
635 | groomer
636 | groomers
637 | groundskeeper
638 | groundskeepers
639 | grower
640 | growers
641 | guard
642 | guards
643 | guide
644 | guides
645 | guru
646 | gurus
647 | gynecologist
648 | gynecologists
649 | hairdresser
650 | hairdressers
651 | hairstylist
652 | hairstylists
653 | hand
654 | handler
655 | handlers
656 | hands
657 | hanger
658 | hangers
659 | harvester
660 | harvesters
661 | hauler
662 | haulers
663 | head
664 | heads
665 | helper
666 | helpers
667 | hiker
668 | hikers
669 | histologist
670 | histologists
671 | historian
672 | historians
673 | histotechnologist
674 | histotechnologists
675 | holder
676 | holders
677 | horologist
678 | horologists
679 | horticulturist
680 | horticulturists
681 | hospitalist
682 | hospitalists
683 | host
684 | hostess
685 | hostesses
686 | hostler
687 | hostlers
688 | hosts
689 | housekeeper
690 | housekeepers
691 | hunter
692 | hunters
693 | hydrogeologist
694 | hydrogeologists
695 | hydrologist
696 | hydrologists
697 | hygienist
698 | hygienists
699 | illustrator
700 | illustrators
701 | imager
702 | imagers
703 | immunologist
704 | immunologists
705 | informaticist
706 | informaticists
707 | innkeeper
708 | innkeepers
709 | inseamer
710 | inseamers
711 | inspector
712 | inspectors
713 | installer
714 | installers
715 | instructor
716 | instructors
717 | insulator
718 | insulators
719 | internist
720 | internists
721 | interpreter
722 | interpreters
723 | interviewer
724 | interviewers
725 | investigator
726 | investigators
727 | irrigator
728 | irrigators
729 | jailer
730 | jailers
731 | jailerss
732 | jailor
733 | jailors
734 | janitor
735 | janitors
736 | jeweler
737 | jewelers
738 | jockey
739 | jockeys
740 | judge
741 | judges
742 | keeper
743 | keepers
744 | kettleman
745 | kettlemans
746 | keyer
747 | keyers
748 | knitter
749 | knitters
750 | laborer
751 | laborers
752 | lacer
753 | lacers
754 | laminator
755 | laminators
756 | lapidarist
757 | lapidarists
758 | laster
759 | lasters
760 | lawyer
761 | lawyers
762 | layer
763 | layers
764 | lead
765 | leader
766 | leaders
767 | leads
768 | lecturer
769 | lecturers
770 | liaison
771 | liaisons
772 | librarian
773 | librarians
774 | librettist
775 | librettists
776 | licensee
777 | licensees
778 | lieutenant
779 | lieutenants
780 | lifeguard
781 | lifeguards
782 | lineman
783 | linemen
784 | liner
785 | liners
786 | loader
787 | loaders
788 | lobsterman
789 | lobstermen
790 | locker
791 | lockers
792 | locksmith
793 | locksmiths
794 | logger
795 | loggers
796 | logistician
797 | logisticians
798 | lookout
799 | lookouts
800 | lubricator
801 | lubricators
802 | luthier
803 | luthiers
804 | lyricist
805 | lyricists
806 | machinist
807 | machinists
808 | magistrate
809 | magistrates
810 | maid
811 | maids
812 | maintainer
813 | maintainers
814 | maker
815 | makers
816 | mammographer
817 | mammographers
818 | manager
819 | managers
820 | manicurist
821 | manicurists
822 | marker
823 | markers
824 | marketer
825 | marketers
826 | marshal
827 | marshals
828 | mason
829 | masons
830 | massager
831 | massagers
832 | masseuse
833 | masseuses
834 | master
835 | masters
836 | mate
837 | mates
838 | mathematician
839 | mathematicians
840 | measurer
841 | measurers
842 | mechanic
843 | mechanics
844 | mediator
845 | mediators
846 | melter
847 | melters
848 | member
849 | members
850 | mender
851 | menders
852 | menderss
853 | merchandiser
854 | merchandisers
855 | merchant
856 | merchants
857 | messenger
858 | messengers
859 | metallurgist
860 | metallurgists
861 | meteorologist
862 | meteorologists
863 | methodologist
864 | methodologists
865 | microbiologist
866 | microbiologists
867 | midwife
868 | midwives
869 | midwivess
870 | miller
871 | millers
872 | millwright
873 | millwrights
874 | miner
875 | miners
876 | minister
877 | ministers
878 | mixer
879 | mixers
880 | mixologist
881 | mixologists
882 | model
883 | modeler
884 | modelers
885 | models
886 | molder
887 | molders
888 | monitor
889 | monitors
890 | mortician
891 | morticians
892 | motorist
893 | motorists
894 | mounter
895 | mounters
896 | mover
897 | movers
898 | musician
899 | musicians
900 | nannies
901 | nanniess
902 | nanny
903 | narrator
904 | narrators
905 | naturalist
906 | naturalists
907 | neurologist
908 | neurologists
909 | neuropsychologist
910 | neuropsychologists
911 | neuroradiologist
912 | neuroradiologists
913 | novelist
914 | novelists
915 | nurse
916 | nurses
917 | nutritionist
918 | nutritionists
919 | oboist
920 | oboists
921 | obstetric
922 | obstetrician
923 | obstetricians
924 | obstetrics
925 | offbearer
926 | offbearers
927 | officer
928 | officers
929 | official
930 | officials
931 | oiler
932 | oilers
933 | oncologist
934 | oncologists
935 | operator
936 | operators
937 | ophthalmologist
938 | ophthalmologists
939 | optician
940 | opticians
941 | optometrist
942 | optometrists
943 | originator
944 | originators
945 | orthodontist
946 | orthodontists
947 | orthoptist
948 | orthoptists
949 | orthotist
950 | orthotists
951 | overhauler
952 | overhaulers
953 | owner
954 | owners
955 | packager
956 | packagers
957 | packer
958 | packers
959 | painter
960 | painters
961 | paperhanger
962 | paperhangers
963 | paralegal
964 | paralegals
965 | paramedic
966 | paramedics
967 | parker
968 | parkers
969 | partner
970 | partners
971 | passenger
972 | passengers
973 | pastor
974 | pastors
975 | pathologist
976 | pathologists
977 | patrol
978 | patrols
979 | patternmaker
980 | patternmakers
981 | paver
982 | pavers
983 | pediatrician
984 | pediatricians
985 | pedicurist
986 | pedicurists
987 | pedorthist
988 | pedorthists
989 | people
990 | percussionist
991 | percussionists
992 | performer
993 | performers
994 | personnel
995 | personnels
996 | pewterer
997 | pewterers
998 | pharmacist
999 | pharmacists
1000 | pharmacologist
1001 | pharmacologists
1002 | philosopher
1003 | philosophers
1004 | phlebotomist
1005 | phlebotomists
1006 | photogrammetrist
1007 | photogrammetrists
1008 | photographer
1009 | photographers
1010 | physiatrist
1011 | physiatrists
1012 | physician
1013 | physicians
1014 | physicist
1015 | physicists
1016 | physiologist
1017 | physiologists
1018 | picker
1019 | pickers
1020 | pilot
1021 | pilots
1022 | pipefitter
1023 | pipefitters
1024 | pipelayer
1025 | pipelayers
1026 | pitcher
1027 | pitchers
1028 | planer
1029 | planers
1030 | planner
1031 | planners
1032 | planter
1033 | planters
1034 | plasterer
1035 | plasterers
1036 | plater
1037 | platers
1038 | player
1039 | players
1040 | plumber
1041 | plumbers
1042 | podiatrist
1043 | podiatrists
1044 | poet
1045 | poets
1046 | police
1047 | polices
1048 | polisher
1049 | polishers
1050 | politician
1051 | politicians
1052 | porter
1053 | porters
1054 | poster
1055 | posters
1056 | postmaster
1057 | postmasters
1058 | potter
1059 | potters
1060 | pourer
1061 | pourers
1062 | powderman
1063 | powdermen
1064 | practitioner
1065 | practitioners
1066 | preceptor
1067 | preceptors
1068 | preparator
1069 | preparators
1070 | preparer
1071 | preparers
1072 | president
1073 | presidents
1074 | presser
1075 | pressers
1076 | pressman
1077 | pressmen
1078 | priest
1079 | priests
1080 | principal
1081 | principals
1082 | printer
1083 | printers
1084 | processor
1085 | processors
1086 | producer
1087 | producers
1088 | professional
1089 | professionals
1090 | professor
1091 | professors
1092 | programer
1093 | programers
1094 | programmer
1095 | programmers
1096 | projectionist
1097 | projectionists
1098 | promoter
1099 | promoters
1100 | proofer
1101 | proofers
1102 | proofreader
1103 | proofreaders
1104 | prosthetist
1105 | prosthetists
1106 | prosthodontist
1107 | prosthodontists
1108 | provider
1109 | providers
1110 | provost
1111 | provosts
1112 | psychiatrist
1113 | psychiatrists
1114 | psychologist
1115 | psychologists
1116 | psychometrist
1117 | psychometrists
1118 | psychotherapist
1119 | psychotherapists
1120 | publisher
1121 | publishers
1122 | puller
1123 | pullers
1124 | pulmonologist
1125 | pulmonologists
1126 | pumper
1127 | pumpers
1128 | purchaser
1129 | purchasers
1130 | purser
1131 | pursers
1132 | rabbi
1133 | rabbis
1134 | radiographer
1135 | radiographers
1136 | radiologist
1137 | radiologists
1138 | raker
1139 | rakers
1140 | rancher
1141 | ranchers
1142 | ranger
1143 | rangers
1144 | rater
1145 | raters
1146 | reader
1147 | readers
1148 | realtor
1149 | realtors
1150 | recapper
1151 | recappers
1152 | receiver
1153 | receivers
1154 | receptionist
1155 | receptionists
1156 | reconditioner
1157 | reconditioners
1158 | recorder
1159 | recorders
1160 | recruiter
1161 | recruiters
1162 | rector
1163 | rectors
1164 | referee
1165 | referees
1166 | refinisher
1167 | refinishers
1168 | registrar
1169 | registrars
1170 | rep
1171 | representative
1172 | representatives
1173 | reps
1174 | reservationist
1175 | reservationists
1176 | resident
1177 | residents
1178 | responder
1179 | responders
1180 | restorer
1181 | restorers
1182 | reviewer
1183 | reviewers
1184 | rigger
1185 | riggers
1186 | riveter
1187 | riveters
1188 | rn
1189 | rns
1190 | roaster
1191 | roasters
1192 | rodbuster
1193 | rodbusters
1194 | roller
1195 | rollers
1196 | roofer
1197 | roofers
1198 | roustabout
1199 | roustabouts
1200 | rover
1201 | rovers
1202 | runner
1203 | runners
1204 | sacker
1205 | sackers
1206 | safecracker
1207 | safecrackers
1208 | sailor
1209 | sailors
1210 | sale rep
1211 | sale reps
1212 | sales
1213 | sales rep
1214 | sales reps
1215 | salesman
1216 | salesmen
1217 | salesmens
1218 | salespeople
1219 | salespeoples
1220 | salesperson
1221 | salespersons
1222 | salespersonss
1223 | saless
1224 | sampler
1225 | samplers
1226 | sander
1227 | sanders
1228 | sanitarian
1229 | sanitarians
1230 | sanitizer
1231 | sanitizers
1232 | sawer
1233 | sawers
1234 | sawyer
1235 | sawyers
1236 | scaler
1237 | scalers
1238 | scheduler
1239 | schedulers
1240 | scientist
1241 | scientists
1242 | scorer
1243 | scorers
1244 | scout
1245 | scouts
1246 | screener
1247 | screeners
1248 | sculptor
1249 | sculptors
1250 | seaman
1251 | seamen
1252 | seamstress
1253 | seamstresses
1254 | searcher
1255 | searchers
1256 | secretaries
1257 | secretariess
1258 | secretary
1259 | senior
1260 | seniors
1261 | sergeant
1262 | sergeants
1263 | server
1264 | servers
1265 | serviceman
1266 | servicemen
1267 | servicer
1268 | servicers
1269 | setter
1270 | setters
1271 | sewer
1272 | sewers
1273 | shampooer
1274 | shampooers
1275 | sheeter
1276 | sheeters
1277 | sheriff
1278 | sheriffs
1279 | shifter
1280 | shifters
1281 | shipper
1282 | shippers
1283 | silversmith
1284 | silversmiths
1285 | silviculturist
1286 | silviculturists
1287 | singer
1288 | singers
1289 | skycap
1290 | skycaps
1291 | slaughterer
1292 | slaughterers
1293 | slicer
1294 | slicers
1295 | slitter
1296 | slitters
1297 | smith
1298 | smiths
1299 | sociologist
1300 | sociologists
1301 | solder
1302 | solders
1303 | soloist
1304 | soloists
1305 | solver
1306 | solvers
1307 | sonographer
1308 | sonographers
1309 | sorter
1310 | sorters
1311 | specialist
1312 | specialists
1313 | speechwriter
1314 | speechwriters
1315 | spinner
1316 | spinners
1317 | splicer
1318 | splicers
1319 | splitter
1320 | splitters
1321 | sprayer
1322 | sprayers
1323 | staff
1324 | staffs
1325 | stapler
1326 | staplers
1327 | starter
1328 | starters
1329 | statistician
1330 | statisticians
1331 | steamfitter
1332 | steamfitters
1333 | stenographer
1334 | stenographers
1335 | steward
1336 | stewards
1337 | stillman
1338 | stillmen
1339 | stitcher
1340 | stitchers
1341 | stocker
1342 | stockers
1343 | stonemason
1344 | stonemasons
1345 | strategist
1346 | strategists
1347 | stripper
1348 | strippers
1349 | student
1350 | students
1351 | stylist
1352 | stylists
1353 | superintendant
1354 | superintendants
1355 | superintendent
1356 | superintendents
1357 | supervisor
1358 | supervisors
1359 | surgeon
1360 | surgeons
1361 | surveyor
1362 | surveyors
1363 | swamper
1364 | swampers
1365 | switcher
1366 | switchers
1367 | switchman
1368 | switchmen
1369 | tailor
1370 | tailors
1371 | taker
1372 | takers
1373 | tankerman
1374 | tankermen
1375 | taper
1376 | tapers
1377 | teacher
1378 | teachers
1379 | tech
1380 | teches
1381 | technician
1382 | technicians
1383 | technologist
1384 | technologists
1385 | telecommunicator
1386 | telecommunicators
1387 | telemarketer
1388 | telemarketers
1389 | teller
1390 | tellers
1391 | tender
1392 | tenders
1393 | tenor
1394 | tenors
1395 | tester
1396 | testers
1397 | therapist
1398 | therapists
1399 | ticketer
1400 | ticketers
1401 | tipper
1402 | tippers
1403 | toolmaker
1404 | toolmakers
1405 | topper
1406 | toppers
1407 | trackman
1408 | trackmen
1409 | trader
1410 | traders
1411 | trailer
1412 | trailers
1413 | trainee
1414 | trainees
1415 | trainer
1416 | trainers
1417 | transcriber
1418 | transcribers
1419 | transcriptionist
1420 | transcriptionists
1421 | translator
1422 | translators
1423 | trapper
1424 | trappers
1425 | treasurer
1426 | treasurers
1427 | treater
1428 | treaters
1429 | trimmer
1430 | trimmers
1431 | trooper
1432 | troopers
1433 | troubleshooter
1434 | troubleshooters
1435 | trucker
1436 | truckers
1437 | tuner
1438 | tuners
1439 | tutor
1440 | tutors
1441 | typesetter
1442 | typesetters
1443 | typist
1444 | typists
1445 | umpire
1446 | umpires
1447 | undertaker
1448 | undertakers
1449 | underwriter
1450 | underwriters
1451 | upholsterer
1452 | upholsterers
1453 | urologist
1454 | urologists
1455 | usher
1456 | ushers
1457 | vaccinator
1458 | vaccinators
1459 | vendor
1460 | vendors
1461 | vet
1462 | veterinarian
1463 | veterinarians
1464 | vets
1465 | videographer
1466 | videographers
1467 | violinist
1468 | violinists
1469 | violist
1470 | violists
1471 | vocalist
1472 | vocalists
1473 | volunteer
1474 | volunteers
1475 | waiter
1476 | waiters
1477 | waitress
1478 | waitresses
1479 | waitressess
1480 | walker
1481 | walkers
1482 | warden
1483 | wardens
1484 | wardenss
1485 | washer
1486 | washers
1487 | watchman
1488 | watchmen
1489 | waxer
1490 | waxers
1491 | weaver
1492 | weavers
1493 | webmaster
1494 | webmasters
1495 | weigher
1496 | weighers
1497 | welder
1498 | welders
1499 | winder
1500 | winders
1501 | wiper
1502 | wipers
1503 | wireman
1504 | wiremen
1505 | wirer
1506 | wirers
1507 | worker
1508 | workers
1509 | wrapper
1510 | wrappers
1511 | writer
1512 | writers
1513 | yardmaster
1514 | yardmasters
1515 | zoologist
1516 | zoologists
1517 | electronic
1518 | processing
1519 | account
1520 | accounts
1521 | electronics
1522 | saleswomen
1523 | saleswoman
1524 | salesman
1525 | salesmen
1526 | clerical
1527 | clericals
1528 | medical
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/__pycache__/ExtractLDAresult.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/ExtractLDAresult.cpython-36.pyc
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/__pycache__/OCRcorrect_enchant.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/OCRcorrect_enchant.cpython-36.pyc
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/__pycache__/OCRcorrect_hyphen.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/OCRcorrect_hyphen.cpython-36.pyc
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/__pycache__/compute_spelling.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/compute_spelling.cpython-36.pyc
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/__pycache__/detect_ending.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/detect_ending.cpython-36.pyc
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/__pycache__/edit_distance.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/edit_distance.cpython-36.pyc
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/__pycache__/extract_LDA_result.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/extract_LDA_result.cpython-36.pyc
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/__pycache__/extract_information.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/extract_information.cpython-36.pyc
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/__pycache__/title_detection.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/title_detection.cpython-36.pyc
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/__pycache__/title_substitute.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/__pycache__/title_substitute.cpython-36.pyc
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/apst_mapping.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/apst_mapping.xlsx
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/compute_spelling.py:
--------------------------------------------------------------------------------
1 | import re
2 | import enchant, difflib
3 | from enchant import DictWithPWL
4 |
5 | #...............................................#
6 |
  7 | def ComputeSpellingError(rawtext,mydict):
  8 |     # Returns [token count, share of tokens found in en_US plus the personal word list mydict].
9 | d = enchant.DictWithPWL("en_US", mydict)
10 | tokens = [w for w in re.split(' ',rawtext.lower()) if not w == '']
11 | tokens = [re.sub(r'[^a-z]','',w) for w in tokens]
12 | tokens = [w for w in tokens if not w=='']
13 |
14 | CountInDict = 0
15 | CountNotInDict = 0
16 | CountTotal = len(tokens)
17 | if CountTotal > 0:
18 | for word in tokens:
19 | if len(word)==1:
20 | CountNotInDict += 1
21 | elif d.check(word):
22 | CountInDict += 1
23 | else:
24 | CountNotInDict += 1
25 | Ratio = str(round(CountInDict/CountTotal,2))
26 | else:
27 | Ratio = str(0)
28 |
29 | TotalWord = str(CountTotal)
30 | Output = [TotalWord,Ratio]
31 | return Output
32 | #...............................................#
33 |
 34 | def RecordCorrectSpelling(rawtext):
 35 |     # Returns [count of correctly spelled tokens (length >= 3), those tokens joined as text].
36 | d = enchant.Dict("en_US")
37 | tokens = [w for w in re.split(' ',rawtext.lower()) if not w == '']
38 | tokens = [re.sub(r'[^a-z]','',w) for w in tokens]
39 | tokens = [w for w in tokens if len(w) >= 3]
40 | tokens = [w for w in tokens if not w=='']
41 |
42 | TotalWord = len(tokens)
43 |
44 | if TotalWord > 0:
45 | correct_tokens = [w for w in tokens if d.check(w)]
46 | correct_tokens = [w for w in correct_tokens if not w=='']
47 | output_text = ' '.join(correct_tokens)
48 | WordCount = str(len(correct_tokens))
49 | else:
50 | output_text = ''
51 | WordCount = str(0)
52 |
53 | Output = [WordCount,output_text]
54 | return Output
55 |
56 | #...............................................#
57 |
58 |
59 |
60 |
--------------------------------------------------------------------------------
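For reference, the spelling-ratio logic in `ComputeSpellingError` can be sketched without pyenchant by standing in a plain Python set for the en_US dictionary; the set and sample text below are illustrative only:

```python
import re

def spelling_ratio(rawtext, dictionary):
    # Mirror ComputeSpellingError: lowercase, keep alphabetic characters only,
    # drop empty tokens, and treat single-letter tokens as misspelled.
    tokens = [re.sub(r'[^a-z]', '', w) for w in rawtext.lower().split(' ')]
    tokens = [w for w in tokens if w]
    if not tokens:
        return ['0', '0']
    in_dict = sum(1 for w in tokens if len(w) > 1 and w in dictionary)
    return [str(len(tokens)), str(round(in_dict / len(tokens), 2))]

words = {'seeking', 'experienced', 'secretary'}
print(spelling_ratio('Seeking exp3rienced secretary', words))  # → ['3', '0.67']
```

The ratio is the fraction of tokens the dictionary recognizes; the pipeline presumably uses it to gauge OCR quality of each ad.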
/data_cleaning/auxiliary files/detect_ending.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import nltk
4 | from nltk.tokenize import word_tokenize
5 |
6 | file_state_name = open('./auxiliary files/state_name.txt').read()
7 | state_name = [w for w in re.split('\n',file_state_name) if not w=='']
8 |
9 | StateFullname = [re.split(',',w)[0] for w in state_name]
10 | StateAbbrevation = [re.split(',',w)[1] for w in state_name]
11 |
 12 | # Define a set of patterns we will use to split
13 |
14 | ZipCodeFullPattern = re.compile('|'.join(['\\b' + w.lower() + '.{0,3}\d{5}\\b' for w in StateFullname]),re.IGNORECASE)
15 | ZipCodeAbbPattern = re.compile('|'.join(['\\b'+w[0]+'\W?['+w[1]+'|'+w[1].lower()+'].{0,3}\d{5}\\b' for w in StateAbbrevation]))
16 |
 17 | ZipCodeExtraPattern = ['tribune.?[0-9BtlifoOS]{5}', # tribune + 5 digits (with common OCR confusions)
18 | 'tribune.{,5}6\d{4}',
19 | 'chicago\s.{,6}\d{5}?'] #chicago + space + something + five numbers
20 |
21 | ZipCodeExtraPattern = re.compile( '|'.join(ZipCodeExtraPattern),re.IGNORECASE ) #this one ignores case
22 |
 23 | ZipCodeExtraPattern2 = ['I.?[Ll].?[Ll].?\s\d{5}', # detect ILL as Illinois
24 | 'I[Ll]{1,2}.?\s[0-9BtlifoOS]{5}', # detect IL
25 | 'IL.?6[0oO]{1,2}[0-9BtlifoOS]{2,3}', # detect IL
26 | 'I.?I.?\s\d{5}', # detect II as Illinois
27 | 'It.?\s\d{5}', # detect It + 5 numbers as Illinois
 28 |                        '[Ii]n\s\d{5}\s', # In + space + five digits (zip code)
29 | 'MCB\s\d{3}', #MCB + space + 3 digits
30 | 'BOX\sM[A-Z ]{2,3}\s[0-9BtlfoO]{3}', #BOX + space + M + two more character + 3 digits
31 | 'D.?C.?\s\d{5}'] # 'D.?C.?\s\d{5}' = DC
32 |
33 | ZipCodeExtraPattern2 = re.compile( '|'.join(ZipCodeExtraPattern2) ) #Note: No "re.IGNORECASE"
34 |
35 | SteetNamePattern = ['\d{2,5}[\s\w]+\save',
36 | '\d{2,5}[\s\w]+\sblvd',
37 | '\d{2,5}[\s\w]+\sstreet',
38 | '\d{2,5}[\s\w]+\shgwy',
39 | '\d{2,5}[\s\w]+\sroad',
40 | '\d{2,5}\s\w*\sdrive',
41 | '\d{2,5}\s\w*\sst.?\sboston',
42 | '\d{2,5}\s\w*\sst.?\slawrence',
43 | '\d{2,5}\s\w*\s\w*\sst\scambridge',
44 | '^\d{2,5}\s\w*\sst.?\s',
45 | '^\d{2,5}\s\w*\s\w*\sst\W',
46 | '\sfloor\sboston$',
47 | 'glo[6b]e.{,3}office'] # globe office
48 |
49 | SteetNamePattern = re.compile( '|'.join(SteetNamePattern),re.IGNORECASE )
50 |
 51 | EndingPhrasePattern = ['equal opportunit(?:y|ies)', # EOE (equal opportunity employer)
 52 |                        'affirmative.?employer\s?', # affirmative [anything] employer
 53 |                        'i[nv].?confidence.?\s?', # in confidence
54 | 'send.{,10}resume\s?',
55 | 'apply.{,20}office',
56 | 'submit.{,10}resume\s?',
57 | 'please\sapply',
58 | 'for\sfurther\sinformation\.{,20}contact',
59 | '\d{2,4}\sext.?\s\d{2,4}', #Phone number: numbers + ext + numbers
 60 |                        '\d{3}.\d{3}-\d{4}\s?'] # Phone number: 3 digits + any char + 3 digits + hyphen + 4 digits
61 |
62 | EndingPhrasePattern = re.compile('|'.join(EndingPhrasePattern),re.IGNORECASE)
63 |
64 | ListFirmIndicator = ['co','company','inc','corporation','inc','corp','llc',"incorporated"]
65 | ListFirmNoTitleIndicator = ['associates','associate']
66 |
67 | #...............................................#
68 |
69 | def AssignFlag(InputString):
 70 |     # this function detects address / ending-phrase patterns
71 | AddressFound = False
72 | EndingPhraseFound = False
73 |
74 | if re.findall(ZipCodeFullPattern,InputString):
75 | AddressFound = True
76 | if re.findall(ZipCodeAbbPattern,InputString):
77 | AddressFound = True
78 | if re.findall(ZipCodeExtraPattern,InputString):
79 | AddressFound = True
80 | if re.findall(ZipCodeExtraPattern2,InputString):
81 | AddressFound = True
82 | if re.findall(SteetNamePattern,InputString):
83 | AddressFound = True
84 | if re.findall(EndingPhrasePattern,InputString):
85 | EndingPhraseFound = True
86 |
87 | return AddressFound , EndingPhraseFound
--------------------------------------------------------------------------------
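The `AssignFlag` routine above simply combines several compiled alternations into two boolean flags. A minimal self-contained sketch of the same idea, using simplified stand-in patterns (the two regexes below are illustrations only, not the module's full pattern set):

```python
import re

# Simplified stand-ins for the address and ending-phrase patterns above.
ZipPattern = re.compile(r'[A-Z]{2}\s\d{5}')              # e.g. "MA 02139"
EndingPattern = re.compile(r'send.{,10}resume', re.IGNORECASE)

def assign_flag(text):
    # Returns (AddressFound, EndingPhraseFound), mirroring AssignFlag above.
    address_found = bool(ZipPattern.search(text))
    ending_found = bool(EndingPattern.search(text))
    return address_found, ending_found

print(assign_flag('Apply at 77 Main St Cambridge MA 02139'))  # (True, False)
print(assign_flag('Please send your resume to the office'))   # (False, True)
```

In the module itself, a line that trips either flag marks the likely end of one job posting and the start of the next.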
/data_cleaning/auxiliary files/edit_distance.py:
--------------------------------------------------------------------------------
1 | # Computing Weighted Edit Distance #
2 | # The code is adapted from http://www.nltk.org/_modules/nltk/metrics/distance.html
3 | #...............................................................
4 |
5 | #Creating a matrix to store output
6 | def InitializingMatrix(len1, len2):
7 | lev = []
8 | for i in range(len1):
9 | lev.append([0] * len2) # initialize 2D array to zero
10 | for i in range(len1):
11 | lev[i][0] = i # column 0: 0,1,2,3,4,...
12 | for j in range(len2):
13 | lev[0][j] = j # row 0: 0,1,2,3,4,...
14 | return lev
15 |
16 | #Say, lev = InitializingMatrix(5, 3) gives a matrix that, when displayed
17 | #with PrintMatrix(.) below (last row printed first), looks like:
18 | #
19 | #[4, 0, 0]
20 | #[3, 0, 0]
21 | #[2, 0, 0]
22 | #[1, 0, 0]
23 | #[0, 1, 2]
24 | #
25 | #...............................................................
26 |
27 | #Printing matrix:
28 | def PrintMatrix(mat):
29 | NumRow = len(mat)
30 | for ind in range(NumRow):
31 | print(mat[NumRow-ind-1][:])
32 |
33 | #...............................................................
34 |
35 | def ComputeMinStep(lev, i, j, s1, s2):
36 | c1 = s1[i - 1]
37 | c2 = s2[j - 1]
38 |
39 | # skipping a character in s1
40 | a = lev[i - 1][j] + 1
41 | # skipping a character in s2
42 | b = lev[i][j - 1] + 1
43 | # substitution
44 | c = lev[i - 1][j - 1] + (c1 != c2)
45 |
46 | # minimize distance in a step
47 | lev[i][j] = min(a, b, c)
48 |
49 | #...............................................................
50 |
51 | def EditDistance(s1, s2):
52 |
53 | len1 = len(s1)
54 | len2 = len(s2)
55 | lev = InitializingMatrix(len1+1, len2+1)
56 |
57 | for i in range(len1):
58 | for j in range(len2):
59 | ComputeMinStep(lev, i + 1, j + 1, s1, s2)
60 |
61 | Distance = lev[len1][len2]
62 | return Distance
63 |
64 | #...............................................................
65 |
66 |
67 |
68 |
69 |
70 |
--------------------------------------------------------------------------------
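`EditDistance` above is the classic Levenshtein dynamic program split across helper functions. A compact, behaviorally equivalent version (a sketch for sanity-checking, not part of the module) fits in one function:

```python
# Compact Levenshtein distance, equivalent to EditDistance above:
# lev[i][j] holds the distance between s1[:i] and s2[:j].
def edit_distance(s1, s2):
    lev = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        lev[i][0] = i                                   # deleting i chars of s1
    for j in range(len(s2) + 1):
        lev[0][j] = j                                   # inserting j chars of s2
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            lev[i][j] = min(
                lev[i - 1][j] + 1,                      # skip a character in s1
                lev[i][j - 1] + 1,                      # skip a character in s2
                lev[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]))  # substitution
    return lev[len(s1)][len(s2)]

print(edit_distance('kitten', 'sitting'))  # 3
print(edit_distance('manager', 'mgr'))     # 4
```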
/data_cleaning/auxiliary files/example_ONET_api.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/example_ONET_api.png
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/extract_LDA_result.py:
--------------------------------------------------------------------------------
1 | import re
2 |
3 | '''
4 | The command 'lda.show_topics' returns a fairly complicated output format.
5 | For example, given 2 words and 2 topics, it shows:
6 |
7 | [(0, [('price', 0.014396994044837077),
8 | ('new', 0.0122260497589219)])
9 | ,
10 | (1, [('opportun', 0.020830242773974533),
11 | ('experi', 0.019701193739871937)])]
12 |
13 | The first element belongs to the first topic
14 |
15 | TopicKeyword[0] =
16 | (0, [('price', 0.014396994044837077), ('new', 0.0122260497589219)])
17 |
18 | TopicKeyword[0][0] = 0
19 | TopicKeyword[0][1] = [('price', 0.014396994044837077), ('new', 0.0122260497589219)]
20 |
21 | so the way to extract the words is to loop over TopicKeyword[Ind][1], where Ind is the topic number
22 | '''
23 |
24 | def GetWordScore(TopicKeyword):
25 | WordScoreList = list() # list of word and its score
26 | for Ind in range(0,len(TopicKeyword)): #loop by topics
27 | WordsThisTopic = TopicKeyword[Ind][1]
28 | for WordScore in WordsThisTopic: #loop by words
29 | Word = WordScore[0]
30 | Score = "{0:.3f}".format(WordScore[1]) # round to 3 decimal places
31 | #"{0:.2f}".format(13.949999999999999) = '13.95'
32 | WordScoreList.append(str(Ind) + '\t' + Word + '\t' + str(Score))
33 | return WordScoreList
34 |
35 | def GetWordList(WordScoreList,TopicNum):
36 | ListWordByTopic = ['']*TopicNum
37 | for item in WordScoreList:
38 | Split = re.split('\t',item)
39 | ListWordByTopic[int(Split[0])] = ListWordByTopic[int(Split[0])] + '\t' + Split[1]
40 | return [[y for y in re.split('\t',w) if not y==''] for w in ListWordByTopic if not w=='']
41 |
42 | #...............................................#
43 |
44 | '''
45 | Each item of "docTopic" contains one document's scores by topic; the list's length equals the number of documents.
46 | NOTE: topic scores below a certain threshold are set to zero and not reported.
47 | For example:
48 |
49 | docTopic[0] = [(0, 0.1334268305392638), (2, 0.8638742905886998)]
50 |
51 | means the first document has 0.13 for topic 0, 0 for topic 1 and 0.86 for topic 2
52 | '''
53 |
54 | def GetDocumentScore(docTopic,TopicNum):
55 | OutputTable = list()
56 | for Ind in range(0,len(docTopic)):
57 | ScoreThisDoc = docTopic[Ind]
58 | RecordScore = ['0']*TopicNum
59 | for item in ScoreThisDoc:
60 | RecordScore[item[0]] = "{0:.3f}".format(item[1])
61 | OutputTable.append( '\t'.join(RecordScore) )
62 | assert( len(docTopic) == len(OutputTable) )
63 | return OutputTable
64 |
65 | #...............................................#
66 |
--------------------------------------------------------------------------------
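To illustrate the densification step, here is a self-contained sketch mirroring `GetDocumentScore` above (lowercased name to avoid clashing with the module): it widens the sparse per-document topic list into a dense tab-separated row, leaving suppressed topics at '0'.

```python
# Sketch of GetDocumentScore above: one dense tab-separated row per document.
def get_document_score(doc_topic, topic_num):
    output = []
    for scores in doc_topic:
        row = ['0'] * topic_num                 # topics below threshold stay '0'
        for topic_id, score in scores:
            row[topic_id] = '{0:.3f}'.format(score)
        output.append('\t'.join(row))
    return output

doc_topic = [[(0, 0.1334268305392638), (2, 0.8638742905886998)]]
print(get_document_score(doc_topic, 3))  # ['0.133\t0\t0.864']
```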
/data_cleaning/auxiliary files/extract_information.py:
--------------------------------------------------------------------------------
1 | import re
2 |
3 | def RemoveCharacters(text):
4 | # This function removes some non-grammatical characters
5 | # and adds extra spaces around punctuation in order to facilitate
6 | # spelling error correction.
7 | output = text
8 | output = output.replace('"','')
9 | output = output.replace('.', ' . ')
10 | output = output.replace(',', ' , ')
11 | output = output.replace('?', ' ? ')
12 | output = output.replace('(', ' ( ')
13 | output = output.replace(')', ' ) ')
14 | output = output.replace('$', ' $ ')
15 | output = output.replace(';',' ; ')
16 | output = output.replace('!',' ! ')
17 | output = output.replace('}','')
18 | output = output.replace('{','')
19 | output = output.replace('/',' ')
20 | output = output.replace('_',' ')
21 | output = output.replace('*','')
22 | return output
23 |
24 | def CleanXML(text):
25 | # This function removes markups
26 |
27 | output = text #initialize output
28 |
29 | # '</p>' and '<p>' are line-breaks
30 | NewlinePattern = re.compile( re.escape('</p>')
31 | + '|'
32 | + re.escape('<p>') )
33 |
34 | output = re.sub(NewlinePattern,'\n',output)
35 |
36 | # replace all other markups
37 |
38 | XMLmarkups = ['name="ValidationSchema"',
39 | 'content="',
40 | '"/>',
41 | '<meta']
42 |
43 | for pattern in XMLmarkups:
44 | output = re.sub(re.escape(pattern),'',output , re.IGNORECASE)
45 |
46 | html_header = re.compile(re.escape('<')
47 | + '/?html/?'
48 | + re.escape('>'))
49 |
50 | output = re.sub(html_header,'',output)
51 |
52 | body_header = re.compile(re.escape('<')
53 | + '/?body/?'
54 | + re.escape('>'))
55 |
56 | output = re.sub(body_header,'',output)
57 |
58 | title_header = re.compile(re.escape('<')
59 | + '/?title/?'
60 | + re.escape('>'))
61 |
62 | output = re.sub(title_header,'',output)
63 |
64 | head_header = re.compile(re.escape('<')
65 | + '/?head/?'
66 | + re.escape('>'))
67 |
68 | output = re.sub(head_header,'',output)
69 |
70 | HTTPpattern = re.compile( re.escape('http://') + '\S*'
71 | + re.escape('.xsd') )
72 |
73 | output = re.sub(HTTPpattern,'',output)
74 | output = re.sub(re.escape('&quot;'),'"',output)
75 | output = re.sub(re.escape('&apos;'),"'",output)
76 | output = re.sub(re.escape('&amp;'),'&',output)
77 | output = re.sub(re.escape('&'),'',output)
78 | output = re.sub(re.escape('<'),'',output)
79 | output = re.sub(re.escape('>'),'',output)
80 | output = RemoveCharacters(output)
81 |
82 | return ' '.join([w for w in re.split(' ',output) if not w==''])
83 |
84 | def ExtractElement(text,field):
85 | # This function takes input string (text) and looks for markups.
86 | # input "field" is a specific element that the code looks for.
87 | # For example, the page title can be located in the text as:
88 | # <recordtitle>Display Ad 33 -- No Title</recordtitle>
89 | # Here, the "field" variable is "recordtitle".
90 | # (Note: all searches are case-insensitive.)
91 |
92 | beginMarkup = '<' + field + '>' # example: <recordtitle>
93 | endMarkup = '</' + field + '>' # example: </recordtitle>
94 |
95 | textNoLineBreak = re.sub(r'\n|\r\n','',text) #delete the line break
96 |
97 | # Windows and Linux use different line break ('\n' vs '\r\n')
98 |
99 | ElementPattern = re.compile( re.escape(beginMarkup) + '.*' + re.escape(endMarkup), re.IGNORECASE )
100 | ElementMarkup = re.compile( re.escape(beginMarkup) + '|' + re.escape(endMarkup), re.IGNORECASE)
101 |
102 | DetectElement = re.findall(ElementPattern,textNoLineBreak)
103 |
104 | #strip markup
105 | Content = str(re.sub(ElementMarkup,'',str(DetectElement[0])))
106 |
107 | #reset space
108 | Content = ' '.join([w for w in re.split(' ',Content) if not w==''])
109 |
110 | return Content
111 |
112 | def AssignPageIdentifier(text, journal):
113 | # This function assigns page identifier.
114 | # For example, 'WSJ_classifiedad_19780912_45'.
115 | # 'WSJ' is the journal name, to be specified by the user.
116 | # 'classifiedad' means the page is Classified Ad.
117 | # '19780912' is the publication date.
118 | # '45' is the page number.
119 |
120 | recordtitle = ExtractElement(text,'recordtitle')
121 |
122 | # All classified ad pages have 'recordtitle' of 'Classified Ad [number] -- No Title'.
123 | # (likewise for display ad pages)
124 |
125 | Match = re.findall('Ad \d+ -- No Title',recordtitle,re.IGNORECASE)
126 |
127 | if Match: # this page is either display ad or classified ad
128 |
129 | if re.findall('Display Ad',recordtitle,re.IGNORECASE):
130 | ad_type = 'displayad'
131 | elif re.findall('Classified Ad',recordtitle,re.IGNORECASE):
132 | ad_type = 'classifiedad'
133 |
134 | ad_number = re.findall('\d+',recordtitle)[0] # get the page number
135 |
136 | numericpubdate = ExtractElement(text,'numericpubdate')
137 | pub_date = re.findall('\d{8}',numericpubdate)[0] # get the publication date
138 |
139 | output = '_'.join([journal,ad_type,pub_date,ad_number]) # create page identifier
140 | else:
141 | output = None
142 |
143 | return output
144 |
145 | #...............................................#
146 |
147 |
148 |
149 |
--------------------------------------------------------------------------------
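The markup-extraction step in `ExtractElement` above can be sketched with a single non-greedy capture group instead of the module's two-pass find-then-strip (a simplified illustration; the toy record string below is made up for the example):

```python
import re

# Toy version of ExtractElement above: pull the content between
# <field> ... </field> markers from a ProQuest-style record.
def extract_element(text, field):
    begin, end = '<' + field + '>', '</' + field + '>'
    pattern = re.compile(re.escape(begin) + '(.*?)' + re.escape(end),
                         re.IGNORECASE | re.DOTALL)
    match = pattern.search(text)
    # normalize internal whitespace, as the module does
    return ' '.join(match.group(1).split()) if match else None

record = ('<recordtitle>Classified Ad 12 -- No Title</recordtitle>'
          '<numericpubdate>19780912</numericpubdate>')
print(extract_element(record, 'recordtitle'))     # Classified Ad 12 -- No Title
print(extract_element(record, 'numericpubdate'))  # 19780912
```

Unlike the module, this sketch returns None instead of raising when the field is absent; the module assumes the field is always present.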
/data_cleaning/auxiliary files/phrase_substitutes.csv:
--------------------------------------------------------------------------------
1 | accountant auditor,accountingauditor,accountantauditor,,,,,,,,,,,,,,,,
2 | accounting clerk,accounting clerks,accounting clk,accounting clks,acct clerks,,,,,,,,,,,,,,
3 | accounting manager,accounts mgr,account manager,manager of accounting,,,,,,,,,,,,,,,
4 | administrative assistant,adm assistant,admin assistant,assistant administrator,,,,,,,,,,,,,,,
5 | assistant bookkeeper,ass t bookkeeper,,,,,,,,,,,,,,,,,
6 | assistant controller,assistant to controller,,,,,,,,,,,,,,,,,
7 | assistant credit manager,assistant credit mgr,,,,,,,,,,,,,,,,,
8 | assistant director of nursing,assistant dir nsg,,,,,,,,,,,,,,,,,
9 | assistant manager,manager assistant,,,,,,,,,,,,,,,,,
10 | assistant tax manager,assistant tax mgr,,,,,,,,,,,,,,,,,
11 | auto mechanic,automobile mechanic,automotive mechanic,,,,,,,,,,,,,,,,
12 | auto sale,automobile sales,automobile salesman,automobile salesperson,,,,,,,,,,,,,,,
13 | builder developer,builderdevelopers,,,,,,,,,,,,,,,,,
14 | chief accountant,chief acct,,,,,,,,,,,,,,,,,
15 | cost accountant,cost accounting,,,,,,,,,,,,,,,,,
16 | database administrator,data base administrator,,,,,,,,,,,,,,,,,
17 | data processing manager,manager data processing,,,,,,,,,,,,,,,,,
18 | design checker,designer checker,,,,,,,,,,,,,,,,,
19 | design draftsman,designer draftsman,design drafter,designer drafter,,,,,,,,,,,,,,,
20 | design engineer,engineer designer,designer engineer,,,,,,,,,,,,,,,,
21 | digital technician,digital tech,,,,,,,,,,,,,,,,,
22 | director of nursing,director of nurse,,,,,,,,,,,,,,,,,
23 | electronic technician,electronic tech,,,,,,,,,,,,,,,,,
24 | employment manager,manager of employment,manager employment,,,,,,,,,,,,,,,,
25 | employee relation manager,manager employee relation,,,,,,,,,,,,,,,,,
26 | engineering manager,management engineer,manager of engineering,,,,,,,,,,,,,,,,
27 | engineering technician,engineer technician,,,,,,,,,,,,,,,,,
28 | executive assistant,exec assistant,,,,,,,,,,,,,,,,,
29 | executive sale,sale executive,,,,,,,,,,,,,,,,,
30 | executive secretary,exec secretary,executive secy,executive secretarial,executive secty,,,,,,,,,,,,,,
31 | field sales manager,field sales manager you are,,,,,,,,,,,,,,,,,
32 | financial analyst,fin analyst,,,,,,,,,,,,,,,,,
33 | food technologist,food tech,,,,,,,,,,,,,,,,,
34 | foreman,foremen,,,,,,,,,,,,,,,,,
35 | general accountant,general accounting,,,,,,,,,,,,,,,,,
36 | host,hostesses,host hostess,hostess,hostess host,,,,,,,,,,,,,,
37 | industrial sale,sale industrial,,,,,,,,,,,,,,,,,
38 | international scout,int l scout,int scout,intl scout,,,,,,,,,,,,,,,
39 | keypunch operator,key punch operator,,,,,,,,,,,,,,,,,
40 | lab assistant,laboratory assistant,,,,,,,,,,,,,,,,,
41 | lab technician,lab tech,laboratory technician,,,,,,,,,,,,,,,,
42 | licensed electrician,lic electrician,,,,,,,,,,,,,,,,,
43 | licensed plumber,licensed plumbers,lic plumber,,,,,,,,,,,,,,,,
44 | industrial relation manager,manager industrial relation,,,,,,,,,,,,,,,,,
45 | instrument engineer,instrumentation engineer,,,,,,,,,,,,,,,,,
46 | management trainee,mgmt trainee,,,,,,,,,,,,,,,,,
47 | manager advertising,manageradvertising,,,,,,,,,,,,,,,,,
48 | manager equipment,managerequipment,,,,,,,,,,,,,,,,,
49 | manager material,managermaterials,,,,,,,,,,,,,,,,,
50 | manager plant,managerplant,,,,,,,,,,,,,,,,,
51 | manager telecommunication,managertelecommunications,,,,,,,,,,,,,,,,,
52 | manager warehousing,managerwarehousing,,,,,,,,,,,,,,,,,
53 | manufacturing engineering manager,manager manufacturing engineering,,,,,,,,,,,,,,,,,
54 | marketing analyst,market analyst,,,,,,,,,,,,,,,,,
55 | marketing director,director of marketing,,,,,,,,,,,,,,,,,
56 | marketing research analyst,market research analyst,,,,,,,,,,,,,,,,,
57 | marketing sale,marketingsales,,,,,,,,,,,,,,,,,
58 | mechanical engineer,engineer mechanical,,,,,,,,,,,,,,,,,
59 | medical technician,med tech,,,,,,,,,,,,,,,,,
60 | nurse aide,nurse s aide,,,,,,,,,,,,,,,,,
61 | nurse recruiter,nurse recruitment,,,,,,,,,,,,,,,,,
62 | nurse,nurse nursenurse,,,,,,,,,,,,,,,,,
63 | nursing assistant,nurse assistant,,,,,,,,,,,,,,,,,
64 | personnel consultant,personnel consuitants,personnel consuliants,personnel consutants,personnel cosultants,,,,,,,,,,,,,,
65 | personnel director,personnel dlrectro,director of personnel,,,,,,,,,,,,,,,,
66 | personnel manager,personnel mgr,manager of personnel,,,,,,,,,,,,,,,,
67 | personnel secretary,personnel secty,personnel secy,personnel sec,secretary personnel,,,,,,,,,,,,,,
68 | pipefitter,pipe fitter,,,,,,,,,,,,,,,,,
69 | professional help,help professional,,,,,,,,,,,,,,,,,
70 | professional employment manager,manager professional employment,,,,,,,,,,,,,,,,,
71 | professional recruiter,professional recruitment,,,,,,,,,,,,,,,,,
72 | programmer analyst cobol,programmer analystcobol,,,,,,,,,,,,,,,,,
73 | programmer analyst,programmer programmer analyst,program mere analyst,prog analyst,programmeranalyst,programmer anal yst,analyst programmer,,,,,,,,,,,,
74 | programmer,programmer programmer,,,,,,,,,,,,,,,,,
75 | programmer cobol,cobol programmer,,,,,,,,,,,,,,,,,
76 | public accountant,public accounting,,,,,,,,,,,,,,,,,
77 | punch press operator,punch pres operator,,,,,,,,,,,,,,,,,
78 | real time programmer,realtime programmer,,,,,,,,,,,,,,,,,
79 | receptionist typist,receptionisttypist,typist receptionist,typist recept,,,,,,,,,,,,,,,
80 | registered nurse,reg nurse,rn lpn,rn lpns,rn s and lpn,rn and lpns,rn or lpn,rn s lpn,rn slpn,rnlpn,rnlpns,registered nurse lpns,nurse rn,nurse registered,registered nurse staff,staff registered nurse,registered nurse s lpn,registered nurse lpn,registered nurse and lpn
81 | registered pharmacist,reg pharmacist,,,,,,,,,,,,,,,,,
82 | resident manager,resident mgr,,,,,,,,,,,,,,,,,
83 | sale career,salescareers,,,,,,,,,,,,,,,,,
84 | sale engineer,sales engr,sale engr,,,,,,,,,,,,,,,,
85 | sale manager,area sale manager,national sale manager,regional sale manager,,,,,,,,,,,,,,,
86 | sale marketing,sales mktg,salesmktg,marketing sale,,,,,,,,,,,,,,,
87 | sale management trainee,sales mgmt trainee,,,,,,,,,,,,,,,,,
88 | sale manager,sales management,sales manage,ales manager,sales mgr,,,,,,,,,,,,,,
89 | sale part,salesparts,,,,,,,,,,,,,,,,,
90 | sale position,and sales positions,,,,,,,,,,,,,,,,,
91 | sale professional,sale pro,professional sale,,,,,,,,,,,,,,,,
92 | sale secretary,sales secy,,,,,,,,,,,,,,,,,
93 | sale service part,salesserviceparts,,,,,,,,,,,,,,,,,
94 | sale service rental,salesservicerentals,,,,,,,,,,,,,,,,,
95 | sale service,salesservice,,,,,,,,,,,,,,,,,
96 | sale,saless,,,,,,,,,,,,,,,,,
97 | salesperson,sales person,salesman,salesmen,salesman and,salesman too,salespeople,sales ladies,sale people,,,,,,,,,,
98 | secretary assistant,secy assistant,,,,,,,,,,,,,,,,,
99 | secretary bookkeeper,secretarybookkeeper,bookkeeper secretary,,,,,,,,,,,,,,,,
100 | secretary receptionist,secretaryreceptionist,secy receptionist,receptionist secretary,receptionistsecretary,,,,,,,,,,,,,,
101 | secretary typist,secretarytypist,,,,,,,,,,,,,,,,,
102 | secretary,secretary for,,,,,,,,,,,,,,,,,
103 | senior accountant,senior acct,,,,,,,,,,,,,,,,,
104 | senior staff,enior staff,,,,,,,,,,,,,,,,,
105 | senior technical writer,senior tech writer,,,,,,,,,,,,,,,,,
106 | shipper receiver,shipperreceiver,,,,,,,,,,,,,,,,,
107 | staff accountant,staff acct,staff accts,,,,,,,,,,,,,,,,
108 | statistical typist,stat typist,,,,,,,,,,,,,,,,,
109 | stock room clerk,stockroom clerk,,,,,,,,,,,,,,,,,
110 | supervisor tax,supervisortax,,,,,,,,,,,,,,,,,
111 | system analyst programmer,programmer system analyst,system programmer analyst,programmer analyst system,,,,,,,,,,,,,,,
112 | system engineer,engineer system,,,,,,,,,,,,,,,,,
113 | technical recruiter,technical recruiter a new,,,,,,,,,,,,,,,,,
114 | technical typist,tech typist,,,,,,,,,,,,,,,,,
115 | technical writer,tech writer,,,,,,,,,,,,,,,,,
116 | test technician,test tech,,,,,,,,,,,,,,,,,
117 | tool engineer,tooling engineer,,,,,,,,,,,,,,,,,
118 | tool and die maker,tool die maker,,,,,,,,,,,,,,,,,
119 | typist clerk,clerktypist,clk typist,lerk typist,clerk typist,typist clerk typist,typistclerk,,,,,,,,,,,,
120 | vice president finance,vice presidentfinance,,,,,,,,,,,,,,,,,
121 | vice president human resource,vicepresident human resources,,,,,,,,,,,,,,,,,
122 | vice president sale,vice presidentales,,,,,,,,,,,,,,,,,
123 | vice president,vicepresident,,,,,,,,,,,,,,,,,
124 | waiter,waitresseswaiters,,,,,,,,,,,,,,,,,
125 | xray,x ray,x-ray,x- ray,x -ray,,,,,,,,,,,,,,
126 |
--------------------------------------------------------------------------------
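Each row of phrase_substitutes.csv lists a canonical phrase in the first field followed by its variants, padded with empty fields. A minimal sketch of how such a row is turned into a substitution regex (the same row-to-regex idea used in title_substitute.py; the sample title is made up):

```python
import re

# One row of phrase_substitutes.csv: first field is the canonical phrase,
# the remaining (possibly empty) fields are variants to replace.
row = 'auto mechanic,automobile mechanic,automotive mechanic,,,,'

fields = row.split(',')
base = fields[0]
variants = [v for v in fields[1:] if v]           # drop the empty padding
regex = re.compile('|'.join(r'\b' + re.escape(v) + r'\b' for v in variants))

title = 'experienced automotive mechanic wanted'
print(regex.sub(base, title))  # experienced auto mechanic wanted
```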
/data_cleaning/auxiliary files/state_name.txt:
--------------------------------------------------------------------------------
1 | Alabama,AL,
2 | Alaska,AK,
3 | Arizona,AZ,
4 | Arkansas,AR,
5 | California,CA,
6 | Colorado,CO,
7 | Connecticut,CT,
8 | Delaware,DE,
9 | Florida,FL,
10 | Georgia,GA,
11 | Hawaii,HI,
12 | Idaho,ID,
13 | Illinois,IL,IIL
14 | Indiana,IN,
15 | Iowa,IA,
16 | Kansas,KS,
17 | Kentucky,KY,
18 | Louisiana,LA,
19 | Maine,ME,
20 | Maryland,MD,
21 | Massachusetts,MA,
22 | Michigan,MI,
23 | Minnesota,MN,
24 | Mississippi,MS,
25 | Missouri,MO,
26 | Montana,MT,
27 | Nebraska,NE,
28 | Nevada,NV,
29 | New Hampshire,NH,
30 | New Jersey,NJ,
31 | New Mexico,NM,
32 | New York,NY,
33 | North Carolina,NC,
34 | North Dakota,ND,
35 | Ohio,OH,
36 | Oklahoma,OK,
37 | Oregon,OR,
38 | Pennsylvania,PA,
39 | Rhode Island,RI,
40 | South Carolina,SC,
41 | South Dakota,SD,
42 | Tennessee,TN,
43 | Texas,TX,
44 | Utah,UT,
45 | Vermont,VT,
46 | Virginia,VA,
47 | Washington,WA,
48 | West Virginia,WV,
49 | Wisconsin,WI,
50 | Wyoming,WY,
51 |
--------------------------------------------------------------------------------
/data_cleaning/auxiliary files/title_detection.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import sys
4 |
5 | def DetermineUppercase(string):
6 | # This function determines whether a line is uppercase.
7 | # Some lines contain a few non-uppercase characters as well,
8 | # e.g. ENGINEERS MICRowAVE ... which should still be considered uppercase.
9 | # This helps detect job titles.
10 | StringUppercase = re.sub('[^A-Z]','',string) # take out all non uppercase
11 | if string.isupper(): # perfect uppercase
12 | Output = True
13 | elif len(string) > 4 and len(StringUppercase)/len(string) >= 0.8:
14 | # this line allows some "imperfect" uppercase lines
15 | # (length of string is long enough and contains 80% of uppercase characters)
16 | Output = True
17 | else:
18 | Output = False
19 | return Output
20 |
21 | #...............................................#
22 |
23 | def IndexAll(word,tokens): # IndexAll('b',['a','b','c','b','c','c']) = [1, 3]
24 | return [i for i,v in enumerate(tokens) if v == word]
25 |
26 | #...............................................#
27 |
28 | def NextWordIsNotNumber(word,tokens):
29 | Output = True
30 | for location in IndexAll(word,tokens):
31 | if location == len(tokens) - 1: # if the word is the last word -- skip
32 | pass
33 | elif re.findall('\d', tokens[location + 1] ):
34 | Output = False
35 | return Output
36 |
37 | #...............................................#
38 |
39 | def UppercaseNewline(ListByLine,LineBreak):
40 | # This function adds an extra line break when an uppercase word or phrase is found.
41 | # Its purpose is to break out the uppercase phrases within a line that contains
42 | # both upper- and lowercase words.
43 | OutputResetLine = list()
44 | for line in ListByLine:
45 | if line.isupper(): #ignore if the whole line is already uppercase
46 | OutputResetLine.append(line) #just write down exactly the same
47 | elif len(re.findall(r'[a-z]',line)) >= 5: # the line must contain some lowercase characters
48 | ResetThisLine = list()
49 | tokens = [w for w in re.split(' ',line) if not w=='']
50 | for word in tokens:
51 | WordNoHyphen = re.sub('-','',word)
52 | if WordNoHyphen.isupper() and len(WordNoHyphen) >= 2 and NextWordIsNotNumber(word,tokens):
53 | # if the word is uppercase, long enough, and NOT followed by a number
54 | # (an uppercase word followed by a set of numbers could be a zip code!)
55 | ResetThisLine.append(LineBreak + word + LineBreak)
56 | else:
57 | ResetThisLine.append(word)
58 | OutputResetLine.append(' '.join(ResetThisLine))
59 | else:
60 | OutputResetLine.append(line) #just write down exactly the same
61 |
62 | # At this point, some elements in the "OutputResetLine" would contain more than one line.
63 | # We want to convert this list such that one element is one line
64 | # This can be done by (1) join everything with 'LineBreak' and (2) split again
65 | OutputResetLine = [w for w in re.split(LineBreak,LineBreak.join(OutputResetLine)) if not w==''] #reset lines
66 | return OutputResetLine
67 |
68 | #...............................................#
69 |
70 | def CombineUppercase(ListByLine):
71 |
72 | # This function combines short consecutive uppercase lines together to facilitate job title detection
73 | # For example: "SALE\nMANAGER\nWanted" >>> "SALE MANAGER\nWanted"
74 | # See the DetermineUppercase(string) function above for the relaxed definition of "uppercase".
75 |
76 | ListByLineNotEmpty = [w for w in ListByLine if re.findall(r'[a-zA-Z0-9]',w)]
77 | # take out lines where no a-z, A-Z or 0-9 is found (empty lines)
78 |
79 | OutputResetLine = [''] # initialize output
80 | CurrentLine = 0 # current number of line
81 | PreviousShortUpper = False # indicator that the previous line is short uppercase
82 |
83 | for line in ListByLineNotEmpty:
84 | LineNoSpace = re.sub('[^a-zA-Z]','',line) #this only serves the purpose of detecting uppercase line
85 | if DetermineUppercase(LineNoSpace) and PreviousShortUpper == True: # if this line AND the previous one are uppercase
86 | tokens = [w for w in re.split(' ',line) if not w=='']
87 | if len(tokens) <= 3: #the line must be short enough
88 | #add this line to the previous one
89 | # NOTE: "CurrentLine" does not get +1
90 | OutputResetLine[CurrentLine] = OutputResetLine[CurrentLine] + ' ' + re.sub('[^A-Z0-9- ]','',line.upper())
91 | PreviousShortUpper = True
92 | else: # even if the line is uppercase, ignore and write it down as normal if it is too long
93 | PreviousShortUpper = False
94 | OutputResetLine.append('') # prepare a new empty line
95 | CurrentLine += 1 # moving on to the next line
96 | OutputResetLine[CurrentLine] = line
97 | PreviousShortUpper = False
98 | elif DetermineUppercase(LineNoSpace) and PreviousShortUpper == False:
99 | # if the line is uppercase BUT the previous one is not => start a new line AND change "PreviousShortUpper" to "True"
100 | OutputResetLine.append('') # prepare a new empty line
101 | CurrentLine += 1 # moving on to the next line
102 | OutputResetLine[CurrentLine] = re.sub('[^A-Z0-9- ]','',line.upper())
103 | PreviousShortUpper = True # change status
104 | else: # if the line is not uppercase => just write it down as normally should
105 | OutputResetLine.append('') # prepare a new empty line
106 | CurrentLine += 1 # moving on to the next line
107 | OutputResetLine[CurrentLine] = line
108 | PreviousShortUpper = False
109 | OutputResetLine = [w for w in OutputResetLine if not w==''] # delete empty lines
110 | return OutputResetLine
111 |
112 | #...............................................#
113 |
114 | def CheckNoTXTLost(list1, list2, AllFlag):
115 | # this function checks that "list1" and "list2" contain exactly the same string of characters
116 | combine_list1 = re.sub( AllFlag,'',''.join(list1).lower() ) #take out all flags (title, firm names, etc...)
117 | combine_list2 = re.sub( AllFlag,'',''.join(list2).lower() )
118 | if re.sub( '\W|\s','',combine_list1) == re.sub( '\W|\s','',combine_list2): #test
119 | output = True
120 | else:
121 | output = False
122 | return output
123 |
124 | #...............................................#
--------------------------------------------------------------------------------
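The relaxed uppercase test in `DetermineUppercase` above is the key heuristic of this module: a line counts as uppercase if it is all-caps, or if it is long enough and at least 80% capital letters, which tolerates OCR slips. A self-contained sketch of that test (same logic, standalone name):

```python
import re

def determine_uppercase(s):
    # All-caps, or long enough with >= 80% capital letters
    # (tolerates OCR slips such as "MICRowAVE").
    caps = re.sub('[^A-Z]', '', s)               # keep only capital letters
    if s.isupper():
        return True
    return len(s) > 4 and len(caps) / len(s) >= 0.8

print(determine_uppercase('SALES MANAGER'))       # True  (perfect uppercase)
print(determine_uppercase('ENGINEERSMICRowAVE'))  # True  (16/18 capitals)
print(determine_uppercase('Sales Manager'))       # False
```

In the module the input is first stripped to letters only (see `LineNoSpace` in `CombineUppercase`), so the ratio is computed over alphabetic characters.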
/data_cleaning/auxiliary files/title_substitute.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import sys
4 | import platform
5 | import shutil
6 | import enchant, difflib
7 | import io
8 |
9 | d = enchant.DictWithPWL("en_US", 'myPWL.txt')
10 |
11 | #...............................................#
12 | # This python module cleans titles
13 | # (1.) substitute word-by-word: includes plural => singular, abbreviations, ...
14 | # (2.) substitute phrases
15 | # (3.) general plural to singular transformation
16 | #...............................................#
17 |
18 | def WordSubstitute(InputString, word_substitutes):
19 | # This function makes word-by-word substitutions (See: word_substitutes.csv)
20 | # For each row, everything from the second column onward will be substituted with the first column.
21 | # For example, one row reads "assistant | assistants | asst | asst. | assts".
22 | # If any of "assistants", "asst." or "assts" is found, it is substituted with simply "assistant".
23 |
24 | InputTokens = [w for w in re.split('\s|-', InputString.lower()) if not w=='']
25 |
26 | ListBase = [re.split(',', w)[0] for w in word_substitutes] # list of everything in the first column
27 |
28 | RegexList = ['|'.join(['\\b'+y+'\\b' for y in re.split(',', w)[1:] if not y=='']) for w in word_substitutes]
29 | # regular expressions of everything from the second column onward
30 |
31 | OutputTokens = InputTokens[:] #copying the output from input
32 |
33 | for tokenInd in range(0,len(OutputTokens)):
34 | token = OutputTokens[tokenInd] # (1) For each word...
35 | for regexInd in range(0,len(RegexList)):
36 | regex = RegexList[regexInd] # (2) ...for each set of regular expressions...
37 | baseForm = ListBase[regexInd]
38 | if re.findall(re.compile(regex),token): # (3) ...if the word matches the set of regular expressions...
39 | OutputTokens[tokenInd] = baseForm # (4) ...the word becomes that baseForm = value of the first column.
40 | return ' '.join(OutputTokens)
41 |
42 | #...............................................#
43 |
44 | def PhraseSubstitute(InputString, phrase_substitutes):
45 | # This function makes phrases substitutions (See: phrase_substitutes.csv)
46 | # The format is similar to word_substitutes.csv
47 | # Example: 'assistant tax mgr' will be substituted with 'assistant tax manager'
48 |
49 | ListBase = [re.split(',',w)[0] for w in phrase_substitutes]
50 | RegexList = ['|'.join(['\\b'+y+'\\b' for y in re.split(',',w)[1:] if not y=='']) for w in phrase_substitutes]
51 |
52 | OutputString = InputString.lower()
53 |
54 | # Unlike the WordSubstitute(.) function, this one looks at the whole InputString and makes substitutions.
55 |
56 | for regexInd in range(0,len(RegexList)):
57 | regex = RegexList[regexInd]
58 | baseForm = ListBase[regexInd]
59 | if re.findall(re.compile(regex),OutputString): # match against the running output...
60 | OutputString = re.sub(re.compile(regex),baseForm,OutputString) # ...so substitutions from multiple rows accumulate
61 | return OutputString
62 |
63 | #...............................................#
64 |
65 | def SingularSubstitute(InputString):
66 | # This function performs general plural to singular transformation
67 | # Note that several frequently appearing words are already handled manually in "word_substitutes.csv"
68 |
69 | InputTokens = [w for w in re.split(' ', InputString.lower()) if not w=='']
70 | OutputTokens = InputTokens[:] #initialize output to be exactly as input
71 |
72 | for tokenInd in range(0,len(OutputTokens)):
73 |
74 | token = OutputTokens[tokenInd]
75 | corrected_token = ''
76 |
77 | if d.check(token): # To be conservative, only look at words that d.check(.) is true
78 | if re.findall('\w+ies$',token):
79 | # if the word ends with 'ies', changes 'ies' to 'y'
80 | corrected_token = re.sub('ies$','y',token)
81 | elif re.findall('\w+ches$|\w+ses$|\w+xes$|\w+oes$',token):
82 | # if the word ends with 'ches', 'ses', 'xes' or 'oes', drop the 'es'
83 | corrected_token = re.sub('es$','',token)
84 | elif re.findall('\w+s$',token):
85 | # if the word ends with 's' BUT NOT 'ss' (this is to prevent changing words like 'business')
86 | if not re.findall('\w+ss$',token):
87 | corrected_token = re.sub('s$','',token) # drop the 's'
88 |
89 | if len(corrected_token) >= 3 and d.check(corrected_token):
90 | # finally, make a substitution only if the corrected word is at least 3 characters long...
91 | # AND is itself an actual dictionary word
92 | OutputTokens[tokenInd] = corrected_token
93 |
94 | return ' '.join(OutputTokens)
95 |
96 | #...............................................#
97 |
98 | def substitute_titles(InputString,word_substitutes,phrase_substitutes):
99 | # This is the main function
100 |
101 | # (1.) Initial cleaning:
102 | CleanedString = re.sub('[^A-Za-z- ]','',InputString)
103 | CleanedString = re.sub('-',' ',CleanedString.lower())
104 | CleanedString = ' '.join([w for w in re.split(' ', CleanedString) if not w==''])
105 |
106 | # (2.) Three types of substitutions:
107 |
108 | if len(CleanedString) >= 1:
109 | CleanedString = PhraseSubstitute(CleanedString, phrase_substitutes)
110 | CleanedString = WordSubstitute(CleanedString, word_substitutes)
111 | CleanedString = SingularSubstitute(CleanedString)
112 | CleanedString = PhraseSubstitute(CleanedString, phrase_substitutes)
113 |
114 | # (3.) Remove duplicate words:
115 | # This step reduces the dimensionality of the title.
116 | # For example, "sale sale engineer sale" is reduced to simply "sale engineer".
117 |
118 | ListTokens = [w for w in re.split(' ',CleanedString) if not w=='']
119 | FinalTokens = list()
120 |
121 | for token in ListTokens: # for each word...
122 | if not token in FinalTokens: # ...if that word has NOT appeared before...
123 | FinalTokens.append(token) # ...append that word to the final result.
124 |
125 | return ' '.join(FinalTokens)
126 |
127 | #...............................................#
128 |
--------------------------------------------------------------------------------
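The plural-to-singular rules in `SingularSubstitute` above can be sketched without the enchant dependency. In the illustration below, a tiny hand-built vocabulary stands in for `d.check` (an assumption for demonstration only; the real code consults a full dictionary):

```python
import re

# Tiny stand-in vocabulary for enchant's d.check(); an assumption for
# illustration only -- the real code checks against a full dictionary.
VOCAB = {'companies', 'company', 'boxes', 'box', 'business',
         'engineers', 'engineer', 'sales', 'sale'}

def singularize(token, check=lambda w: w in VOCAB):
    """Apply the same ies->y / -es / -s rules as SingularSubstitute."""
    if not check(token):  # conservative: only touch recognized words
        return token
    corrected = ''
    if re.search(r'\w+ies$', token):
        corrected = re.sub(r'ies$', 'y', token)       # companies -> company
    elif re.search(r'\w+(ches|ses|xes|oes)$', token):
        corrected = re.sub(r'es$', '', token)         # boxes -> box
    elif re.search(r'\w+s$', token) and not re.search(r'\w+ss$', token):
        corrected = re.sub(r's$', '', token)          # engineers -> engineer
    # substitute only if the result is >= 3 chars and a valid word itself
    if len(corrected) >= 3 and check(corrected):
        return corrected
    return token

print(singularize('companies'))  # company
print(singularize('business'))   # business ('ss' ending left alone)
```

Note how the 'ss' guard keeps words like "business" intact, while "engineers" is shortened only because "engineer" is itself in the vocabulary.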
/data_cleaning/auxiliary files/word_substitutes.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/phaiptt125/newspaper_project/4e31ced4d930258ff7d659012fff15f3f6f626a4/data_cleaning/auxiliary files/word_substitutes.csv
--------------------------------------------------------------------------------
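Applying word-level substitutions from a file like the one above amounts to a token-by-token dictionary lookup. The sketch below assumes a simple two-column (original, replacement) layout and inlines a few hypothetical rows rather than reading the actual file:

```python
import csv
import io

# Hypothetical rows in the two-column (original, replacement) layout this
# sketch assumes; the actual word_substitutes.csv contents may differ.
sample_csv = "secy,secretary\nasst,assistant\nmgr,manager\n"
subs = dict(csv.reader(io.StringIO(sample_csv)))

def word_substitute(title, subs):
    # replace each token that has an entry; leave everything else alone
    return ' '.join(subs.get(tok, tok) for tok in title.split())

print(word_substitute('asst mgr trainee', subs))  # assistant manager trainee
```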
/data_cleaning/initial_cleaning.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# The Initial Text Cleaning\n",
8 | "Online supplementary material to \"The Evolution of Work in the United States\" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum. \n",
9 | "\n",
10 | "* [Project data library](https://occupationdata.github.io) \n",
11 | "\n",
12 | "* [GitHub repository](https://github.com/phaiptt125/newspaper_project)\n",
13 | "\n",
14 | "***"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "This IPython notebook demonstrates initial processing of the raw text, provided by ProQuest. The main components of this step are to retrieve document metadata, to remove markup from the newspaper text, and to perform an initial spell-check of the text."
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | " Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. \n",
29 | "***"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "## List of auxiliary files (see project data library or GitHub repository)"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "* *extract_information.py* : This python code removes markup and extracts relevant information.\n",
44 | "* *edit_distance.py* : This python code computes string edit distance, used in the spelling correction procedure.\n",
45 | "* *OCRcorrect_enchant.py* : This python code performs basic word-by-word spelling error correction.\n",
46 | "* *PWL.txt* : This file contains words, such as software names and state names, that are not in the dictionary provided by python's enchant module.\n",
47 | "***"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "## Import python modules"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 1,
60 | "metadata": {
61 | "collapsed": true
62 | },
63 | "outputs": [],
64 | "source": [
65 | "import os\n",
66 | "import re\n",
67 | "import sys\n",
68 | "import enchant #spelling correction module\n",
69 | "\n",
70 | "sys.path.append('./auxiliary files')\n",
71 | "\n",
72 | "from extract_information import *\n",
73 | "from edit_distance import *\n",
74 | "from OCRcorrect_enchant import *"
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "## Import raw text file\n",
82 | "\n",
83 | "ProQuest has provided us with text files which have been transcribed from scanned images of newspaper pages. The file 'ad_sample.txt', as shown below, is one of these text files. ProQuest only provided us with the information that this file belongs to a page of the Wall Street Journal."
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": 2,
89 | "metadata": {},
90 | "outputs": [
91 | {
92 | "name": "stdout",
93 | "output_type": "stream",
94 | "text": [
95 | " TDM_Record_v1.0.xsd 4a667155d557ab68c878224bc3de0979 Classified Ad 45 -- No Title Sep 12, 1978 19780912 classified_ad Classified Advertisement Advertisement Copyright Dow Jones & Company Inc Sep 12, 1978 English 506733 45441 Wall Street Journal (1923 - Current file) <html> <head> <meta name="ValidationSchema" content="http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd"/> <title/> </head> <body> <p> Singer has long been one of the world s gr ' pacesetters in volume manufacturing of intricate, </p> <p> precision machines that achieve extreme reliability and durability. Our sewing machines are in use around the globe in every kind of climate. As pioneers in electronic sewing machines, we have again set new standards. </p> <p> ELECTROMECHANICAL ENGINEERS, </p> <p> Minimum of 6 eara experience in developing of electromechanical consumer or atm&gt;lar products. BSME or BSEE degree required, </p> <p> advanced degree preferred. </p> <p> ELECTRONIC ENGINEERS MECHANICAL ENGINEERS </p> <p> A least 2+ years is needed in one of 2+ years of practical in mechanisms the following areas: and machine design analysis. Working know! edga of computers as a design tool would be </p> <p> 1) Analog and digital industrial electron helpful. Experience in sophisticated , with microprocessor and CAD knowl chanical products. Background should include edge desirable; mechanism or gear or machine design 2) Analog sad digital circuitry, logic de and analysis. Knowledge of computers as , PC bond design, ISI and minicom neering ardes helpful </p> <p> puter ; </p> <p> S) Application of mini and micro-computers including , and hardware de- </p> <p> bugging of analog and digital circuitry. </p> <p> DESIGNERS, JUNIOR SPECIALIST AND SENIOR </p> <p> Ezperience in fractional and AC 1-8 Years experience in precision high toler _ and DC motors and motor control system as ante design of mechanical devices and/or circuit well as other electromechanical devices. layout. 
Intricate detailing experience mandato- </p> <p> ry. Singer offers attractive salaries, benefits and professional working conditions, and very favorable career . These positions are located at our Elizabeth, New Jersey facility and at our R&amp;D Laboratory in Fairfield, New Jersey. </p> <p> Please send resume stating position of interest in confidence to: </p> <p> Hosie Scott, Employment Manager </p> <p> or call (201) 527-6166 or 67 </p> <p> SINGER </p> <p> DIVERSIFIED WORL. 321 First Street </p> <p> Elizabeth, New Jersey 07207 An Equal Opportunity Employer M/F </p> </body> </html> \n"
96 | ]
97 | }
98 | ],
99 | "source": [
100 | "# input files\n",
101 | "input_file = 'ad_sample.txt'\n",
102 | "\n",
103 | "# bring in raw ads \n",
104 | "raw_ad = open(input_file).read()\n",
105 | "print(raw_ad)"
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "***\n",
113 | "The relevant information we need to extract is:\n",
114 | "\n",
115 | "1. publication date - \"19780912\" (September 12, 1978)\n",
116 | "2. page title - \"Classified Ad 45\" (classified ad, page 45)\n",
117 | "3. content - all text in the \"fulltext\" field\n",
118 | "\n",
119 | "Fortunately, job ads appear only in either \"Display Ad\" or \"Classified Ad\" pages. As such, we only need to include pages that are either \"Display Ad\" or \"Classified Ad\" in this step."
120 | ]
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "metadata": {},
125 | "source": [
126 | "However, not all \"Display Ad\" or \"Classified Ad\" pages contain job ads. The next step, as demonstrated in the next IPython notebook [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/LDA.ipynb), is to use a Latent Dirichlet Allocation (LDA) procedure to identify which pages are job ads."
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "## Assign unique page identifier\n",
134 | "* Assign a unique identifier for each newspaper page that is either Display Ad or Classified Ad."
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 3,
140 | "metadata": {},
141 | "outputs": [
142 | {
143 | "name": "stdout",
144 | "output_type": "stream",
145 | "text": [
146 | "WSJ_classifiedad_19780912_45\n"
147 | ]
148 | }
149 | ],
150 | "source": [
151 | "page_identifier = AssignPageIdentifier(raw_ad, 'WSJ') # see extract_information.py\n",
152 | "print(page_identifier)"
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "The value \"WSJ_classifiedad_19780912_45\" refers to the 45th page of classified ads in the September 12, 1978 edition of the Wall Street Journal.\n",
160 | "\n",
161 | "## Extract posting and remove markup"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 4,
167 | "metadata": {},
168 | "outputs": [
169 | {
170 | "name": "stdout",
171 | "output_type": "stream",
172 | "text": [
173 | "\n",
174 | " Singer has long been one of the world s gr ' pacesetters in volume manufacturing of intricate , \n",
175 | " \n",
176 | " precision machines that achieve extreme reliability and durability . Our sewing machines are in use around the globe in every kind of climate . As pioneers in electronic sewing machines , we have again set new standards . \n",
177 | " \n",
178 | " ELECTROMECHANICAL ENGINEERS , \n",
179 | " \n",
180 | " Minimum of 6 eara experience in developing of electromechanical consumer or atmlar products . BSME or BSEE degree required , \n",
181 | " \n",
182 | " advanced degree preferred . \n",
183 | " \n",
184 | " ELECTRONIC ENGINEERS MECHANICAL ENGINEERS \n",
185 | " \n",
186 | " A least 2+ years is needed in one of 2+ years of practical in mechanisms the following areas: and machine design analysis . Working know ! edga of computers as a design tool would be \n",
187 | " \n",
188 | " 1 ) Analog and digital industrial electron helpful . Experience in sophisticated , with microprocessor and CAD knowl chanical products . Background should include edge desirable ; mechanism or gear or machine design 2 ) Analog sad digital circuitry , logic de and analysis . Knowledge of computers as , PC bond design , ISI and minicom neering ardes helpful \n",
189 | " \n",
190 | " puter ; \n",
191 | " \n",
192 | " S ) Application of mini and micro-computers including , and hardware de- \n",
193 | " \n",
194 | " bugging of analog and digital circuitry . \n",
195 | " \n",
196 | " DESIGNERS , JUNIOR SPECIALIST AND SENIOR \n",
197 | " \n",
198 | " Ezperience in fractional and AC 1-8 Years experience in precision high toler and DC motors and motor control system as ante design of mechanical devices and or circuit well as other electromechanical devices . layout . Intricate detailing experience mandato- \n",
199 | " \n",
200 | " ry . Singer offers attractive salaries , benefits and professional working conditions , and very favorable career . These positions are located at our Elizabeth , New Jersey facility and at our R ; D Laboratory in Fairfield , New Jersey . \n",
201 | " \n",
202 | " Please send resume stating position of interest in confidence to: \n",
203 | " \n",
204 | " Hosie Scott , Employment Manager \n",
205 | " \n",
206 | " or call ( 201 ) 527-6166 or 67 \n",
207 | " \n",
208 | " SINGER \n",
209 | " \n",
210 | " DIVERSIFIED WORL . 321 First Street \n",
211 | " \n",
212 | " Elizabeth , New Jersey 07207 An Equal Opportunity Employer M F \n",
213 | "\n"
214 | ]
215 | }
216 | ],
217 | "source": [
218 | "# extract field \n",
219 | "fulltext = ExtractElement(raw_ad,'fulltext') # see extract_information.py\n",
220 | "# remove xml markups\n",
221 | "posting = CleanXML(fulltext) # see extract_information.py\n",
222 | "print(posting)"
223 | ]
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "metadata": {},
228 | "source": [
229 | "## Perform basic spelling error correction, remove extra spaces and empty lines "
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": 5,
235 | "metadata": {},
236 | "outputs": [
237 | {
238 | "name": "stdout",
239 | "output_type": "stream",
240 | "text": [
241 | "Singer has long been one of the world s gr ' pacesetters in volume manufacturing of intricate ,\n",
242 | "precision machines that achieve extreme reliability and durability . Our sewing machines are in use around the globe in every kind of climate . As pioneers in electronic sewing machines , we have again set new standards .\n",
243 | "ELECTROMECHANICAL ENGINEERS ,\n",
244 | "Minimum of 6 Meara experience in developing of electromechanical consumer or atmlar products . BSME or B SEE degree required ,\n",
245 | "advanced degree preferred .\n",
246 | "ELECTRONIC ENGINEERS MECHANICAL ENGINEERS\n",
247 | "A least 2+ years is needed in one of 2+ years of practical in mechanisms the following areas and machine design analysis . Working know ! Edgar of computers as a design tool would be\n",
248 | "1 ) Analog and digital industrial electron helpful . Experience in sophisticated , with microprocessor and CAD kn owl mechanical products . Background should include edge desirable ; mechanism or gear or machine design 2 ) Analog sad digital circuitry , logic de and analysis . Knowledge of computers as , PC bond design , IS I and mini com sneering ares helpful\n",
249 | "pouter ;\n",
250 | "S ) Application of mini and microcomputers including , and hardware de-\n",
251 | "bugging of analog and digital circuitry .\n",
252 | "DESIGNERS , JUNIOR SPECIALIST AND SENIOR\n",
253 | "Experience in fractional and AC 1-8 Years experience in precision high tooler and DC motors and motor control system as ante design of mechanical devices and or circuit well as other electromechanical devices . layout . Intricate detailing experience mandatory\n",
254 | "ry . Singer offers attractive salaries , benefits and professional working conditions , and very favorable career . These positions are located at our Elizabeth , New Jersey facility and at our R ; D Laboratory in Fairfield , New Jersey .\n",
255 | "Please send resume stating position of interest in confidence to:\n",
256 | "Hosier Scott , Employment Manager\n",
257 | "or call ( 201 ) 527-6166 or 67\n",
258 | "SINGER\n",
259 | "DIVERSIFIED WHORL . 321 First Street\n",
260 | "Elizabeth , New Jersey 07207 An Equal Opportunity Employer M F\n"
261 | ]
262 | }
263 | ],
264 | "source": [
265 | "posting_by_line = [w for w in re.split('\\n',posting) if len(w)>0] \n",
266 | "clean_posting_by_line = list()\n",
267 | " \n",
268 | "for line in posting_by_line:\n",
269 | " clean_line = line\n",
270 | " # spelling error correction\n",
271 | " clean_line = EnchantErrorCorrection(clean_line, 'PWL.txt')\n",
272 | " # remove extra white spaces\n",
273 | " clean_line = ' '.join([w for w in re.split(' ',clean_line) if not w=='']) \n",
274 | " clean_posting_by_line.append(clean_line)\n",
275 | "\n",
276 | "# remove empty lines\n",
277 | "clean_posting_by_line = [w for w in clean_posting_by_line if not w=='']\n",
278 | "\n",
279 | "# print final output of this step\n",
280 | "print('\\n'.join(clean_posting_by_line))"
281 | ]
282 | },
283 | {
284 | "cell_type": "markdown",
285 | "metadata": {},
286 | "source": [
287 | "The final output of this step is the variable \"clean_posting_by_line\". The next step, as demonstrated in the next IPython notebook [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/LDA.ipynb), is to use a Latent Dirichlet Allocation (LDA) procedure to identify which pages are job ads."
288 | ]
289 | }
290 | ],
291 | "metadata": {
292 | "kernelspec": {
293 | "display_name": "Python 3",
294 | "language": "python",
295 | "name": "python3"
296 | },
297 | "language_info": {
298 | "codemirror_mode": {
299 | "name": "ipython",
300 | "version": 3
301 | },
302 | "file_extension": ".py",
303 | "mimetype": "text/x-python",
304 | "name": "python",
305 | "nbconvert_exporter": "python",
306 | "pygments_lexer": "ipython3",
307 | "version": "3.6.1"
308 | }
309 | },
310 | "nbformat": 4,
311 | "nbformat_minor": 1
312 | }
313 |
--------------------------------------------------------------------------------
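As the notebook above shows, `AssignPageIdentifier` turns page metadata into an identifier such as "WSJ_classifiedad_19780912_45". A minimal sketch of that construction follows; the helper below is an illustration of the naming scheme, not the project's actual implementation:

```python
import re

def build_page_identifier(paper, page_type, date, page):
    # 'Classified Ad' -> 'classifiedad': lowercase and keep letters only
    tag = re.sub(r'[^a-z]', '', page_type.lower())
    # join paper tag, page type, numeric date, and page number
    return '{}_{}_{}_{}'.format(paper, tag, date, page)

print(build_page_identifier('WSJ', 'Classified Ad', '19780912', '45'))
# WSJ_classifiedad_19780912_45
```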
/data_cleaning/structured_data.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Transforming Unstructured Text into Structured Data \n",
8 | "Online supplementary material to \"The Evolution of Work in the United States\" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum. \n",
9 | "\n",
10 | "* [Project data library](https://occupationdata.github.io) \n",
11 | "\n",
12 | "* [GitHub repository](https://github.com/phaiptt125/newspaper_project)\n",
13 | "\n",
14 | "***"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "This IPython notebook demonstrates how we finally transform unstructured newspaper text into structured data (spreadsheet). In the previous steps, we:\n",
22 | "\n",
23 | "* Retrieve document metadata, remove markup from the newspaper text, and perform an initial spell-check of the text (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/initial_cleaning.ipynb)). \n",
24 | "* Exclude non-job ad pages (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/LDA.ipynb)).\n",
25 | "\n",
26 | "The main components of this step are to identify the job title, discern the boundaries between job ads, and transform relevant information into structured data. \n",
27 | "\n"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | " Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. \n",
35 | "***"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "### List of auxiliary files (see project data library or GitHub repository)\n",
43 | "* *title_detection.py* : This python code detects job titles. \n",
44 | "* *detect_ending.py* : This python code detects ending patterns of ads.\n",
45 | "* *TitleBase.txt* : A list of job title words "
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "***\n",
53 | "## Import necessary modules"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 1,
59 | "metadata": {
60 | "collapsed": true
61 | },
62 | "outputs": [],
63 | "source": [
64 | "import os\n",
65 | "import re\n",
66 | "import sys\n",
67 | "import pandas as pd\n",
68 | "\n",
69 | "import nltk\n",
70 | "from nltk.corpus import stopwords\n",
71 | "from nltk.tokenize import word_tokenize\n",
72 | "from nltk.stem.snowball import SnowballStemmer\n",
73 | " \n",
74 | "stop_words = set(stopwords.words('english'))\n",
75 | "stemmer = SnowballStemmer(\"english\")\n",
76 | "\n",
77 | "sys.path.append('./auxiliary files')\n",
78 | "\n",
79 | "from title_detection import *\n",
80 | "from detect_ending import *"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {
86 | "collapsed": true
87 | },
88 | "source": [
89 | "## Import job ad pages\n",
90 | "\n",
91 | "We present an example describing how our procedure identifies job ads' boundaries and their job titles on a snippet of Display Ad page 226, from the January 14, 1979 Boston Globe (page identifier: \"Globe_displayad_19790114_226\"). "
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "* The text file has already been cleaned by retrieving document metadata, removing markup from the newspaper text, and correcting spelling errors of the text (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/initial_cleaning.ipynb) for detail). \n",
99 | "* We have already classified this page to be related to job ads (see [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/LDA.ipynb) for detail)."
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 2,
105 | "metadata": {},
106 | "outputs": [
107 | {
108 | "name": "stdout",
109 | "output_type": "stream",
110 | "text": [
111 | "MEDICAL HELP\n",
112 | "NUCLEAR\n",
113 | "RADIOLOGIC TECH\n",
114 | "full time day po ition is available for registred or registry technician in our Nuclear Medicine department This position does require taking call\n",
115 | "CHEST\n",
116 | "PHYSICAL THERAPIST\n",
117 | "If you are or registry eligible\n",
118 | "Physical Trhrapist interested in Chest\n",
119 | "Therapy consider the New England Baptist Hospital Responsibilities will include providng chest therapy for Medical Surgical patients family teaching interdisciplinary inservice programs and more\n",
120 | "For more information please contact our Personnel department 738-5800 , Ext 255 . An Equal Opportunity Employer\n",
121 | "41 Pa HII Boston\n",
122 | "MANAGER OF\n",
123 | "PRIMARY CARE PROGRAMS\n",
124 | "Children's Hospital Medical Center\n",
125 | "seeks dynamic creative individual to manage its Primary Care Programs including 24-hour Emergency Room Primary Care program the Massachusetts Poison information Center and\n",
126 | "Dental services This position requires 3-5 years experience with background in planning budgeting and managing\n",
127 | "health programs Masters degree preferred but additional experience may be substituted We offer salary commensurate\n",
128 | "with experience and fine fringe benefits package\n",
129 | "please forward resumes to Helena Wallace personnel office\n",
130 | "MEDICAL\n",
131 | "300 Lonjwood Avenue\n",
132 | "MA 0211\n",
133 | "REGISTERED\n",
134 | "REGISTRY ELIGIBLE OR\n",
135 | "immi ate available in our modern well- and fu ly accredited 173-bed general hospital Cheshire Hospilal is 80 miles from Boston and near skiing water sports hunting and fishing\n",
136 | "Apphcants must be registered registry eligible or NERT For further information please contact the Personrel department\n",
137 | "Cheshire Hospital\n",
138 | "580 Court Street Keene NH 03431\n"
139 | ]
140 | }
141 | ],
142 | "source": [
143 | "text = open('Snippet_Globe_displayad_19790114_226.txt').read()\n",
144 | "page_identifier = 'Globe_displayad_19790114_226'\n",
145 | "print(text) # posting text"
146 | ]
147 | },
148 | {
149 | "cell_type": "markdown",
150 | "metadata": {
151 | "collapsed": true
152 | },
153 | "source": [
154 | "## Reset line breaks\n",
155 | "First, we combine short, consecutive uppercase lines so that we can detect, for instance, \"MANAGER OF PRIMARY CARE PROGRAMS\" when it is split across the two lines \"MANAGER OF\" and \"PRIMARY CARE PROGRAMS\"."
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 3,
161 | "metadata": {
162 | "collapsed": true
163 | },
164 | "outputs": [],
165 | "source": [
166 | "# remove empty lines\n",
167 | "text_by_line = [w for w in re.split('\\n',text) if not w=='']\n",
168 | "\n",
169 | "# reset lines (see title_detection.py)\n",
170 | "text_reset_line = CombineUppercase(text_by_line)\n",
171 | "text_reset_line = UppercaseNewline(text_reset_line,'\\n') #assign new line when an uppercase word is found\n",
172 | "text_reset_line = CombineUppercase(text_reset_line) #re-combine uppercase words together\n",
173 | "\n",
174 | "# remove extra white spaces\n",
175 | "text_reset_line = [' '.join([y for y in re.split(' ',w) if not y=='']) for w in text_reset_line]\n",
176 | "# remove empty lines\n",
177 | "text_reset_line = [w for w in text_reset_line if not w=='']"
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": 4,
183 | "metadata": {},
184 | "outputs": [
185 | {
186 | "data": {
187 | "text/plain": [
188 | "['MEDICAL HELP NUCLEAR RADIOLOGIC TECH',\n",
189 | " 'full time day po ition is available for registred or registry technician in our Nuclear Medicine department This position does require taking call',\n",
190 | " 'CHEST PHYSICAL THERAPIST',\n",
191 | " 'If you are or registry eligible',\n",
192 | " 'Physical Trhrapist interested in Chest',\n",
193 | " 'Therapy consider the New England Baptist Hospital Responsibilities will include providng chest therapy for Medical Surgical patients family teaching interdisciplinary inservice programs and more',\n",
194 | " 'For more information please contact our Personnel department 738-5800 , Ext 255 . An Equal Opportunity Employer',\n",
195 | " '41 Pa',\n",
196 | " 'HII',\n",
197 | " 'Boston',\n",
198 | " 'MANAGER OF PRIMARY CARE PROGRAMS',\n",
199 | " \"Children's Hospital Medical Center\",\n",
200 | " 'seeks dynamic creative individual to manage its Primary Care Programs including 24-hour Emergency Room Primary Care program the Massachusetts Poison information Center and',\n",
201 | " 'Dental services This position requires 3-5 years experience with background in planning budgeting and managing',\n",
202 | " 'health programs Masters degree preferred but additional experience may be substituted We offer salary commensurate',\n",
203 | " 'with experience and fine fringe benefits package',\n",
204 | " 'please forward resumes to Helena Wallace personnel office',\n",
205 | " 'MEDICAL',\n",
206 | " '300 Lonjwood Avenue',\n",
207 | " 'MA 0211 REGISTERED REGISTRY ELIGIBLE OR',\n",
208 | " 'immi ate available in our modern well- and fu ly accredited 173-bed general hospital Cheshire Hospilal is 80 miles from Boston and near skiing water sports hunting and fishing',\n",
209 | " 'Apphcants must be registered registry eligible or',\n",
210 | " 'NERT',\n",
211 | " 'For further information please contact the Personrel department',\n",
212 | " 'Cheshire Hospital',\n",
213 | " '580 Court Street Keene NH 03431']"
214 | ]
215 | },
216 | "execution_count": 4,
217 | "metadata": {},
218 | "output_type": "execute_result"
219 | }
220 | ],
221 | "source": [
222 | "# print results\n",
223 | "text_reset_line"
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {},
229 | "source": [
230 | "## Detect job titles\n",
231 | "Next, we detect job titles by matching each line against a list of job title personal nouns. For instance, with the word \"THERAPIST\" in our list, we are able to detect \"CHEST PHYSICAL THERAPIST\" as a job title without having to enumerate all possible types of therapist. "
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 5,
237 | "metadata": {},
238 | "outputs": [
239 | {
240 | "name": "stdout",
241 | "output_type": "stream",
242 | "text": [
243 | "--- Examples of job title personal nouns ---\n",
244 | "['abstracter', 'abstracters', 'abstractor', 'abstractors', 'accounting', 'accountings', 'accountant', 'accountants', 'actor', 'actors', 'actress', 'actresses', 'actuarial', 'actuarials', 'actuaries']\n"
245 | ]
246 | }
247 | ],
248 | "source": [
249 | "# define indicators if job title detected\n",
250 | "title_found = '---titlefound---'\n",
251 | "\n",
252 | "# list of job title personal nouns\n",
253 | "TitleBaseFile = open('./auxiliary files/TitleBase.txt').read()\n",
254 | "TitleBaseList = [w for w in re.split('\\n',TitleBaseFile) if not w=='']\n",
255 | "print('--- Examples of job title personal nouns ---')\n",
256 | "print(TitleBaseList[:15]) "
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": 6,
262 | "metadata": {
263 | "collapsed": true
264 | },
265 | "outputs": [],
266 | "source": [
267 | "text_detect_title = ['']*len(text_reset_line)\n",
268 | "PreviousLineIsUppercaseTitle = False\n",
269 | "\n",
270 | "# assign a flag of '---titlefound---' to lines where we detect a job title\n",
271 | "\n",
272 | "for i in range(0,len(text_reset_line)):\n",
273 | " line = text_reset_line[i]\n",
274 | " line_no_hyphen = re.sub('-',' ',line.lower())\n",
275 | " tokens = word_tokenize(line_no_hyphen)\n",
276 | " \n",
277 | " Match = list(set(tokens).intersection(TitleBaseList)) # see if the line has words in TitleBaseList \n",
278 | " \n",
279 | " if Match and DetermineUppercase(line): # uppercase job title\n",
280 | " text_detect_title[i] = ' '.join([w for w in re.split(' ',line) if not w=='']) + title_found\n",
281 | " # adding a flag that a title is found\n",
282 | " # ' '.join([w for w in re.split(' ',line) if not w=='']) is to remove extra spaces from 'line'\n",
283 | " PreviousLineIsUppercaseTitle = True\n",
284 | " elif Match and len(tokens) <= 2:\n",
285 | " # This line allows non-uppercase job titles\n",
286 | " # It has to be short enough => less than or equal to 2 words.\n",
288 | " # In addition, the previous line must NOT be an uppercase job title.\n",
288 | " if PreviousLineIsUppercaseTitle == False:\n",
289 | " text_detect_title[i] = ' '.join([w for w in re.split(' ',line) if not w=='']) + title_found\n",
290 | " PreviousLineIsUppercaseTitle = False\n",
291 | " else:\n",
292 | " text_detect_title[i] = ' '.join([w for w in re.split(' ',line) if not w==''])\n",
293 | " PreviousLineIsUppercaseTitle = False\n",
294 | " else:\n",
295 | " text_detect_title[i] = ' '.join([w for w in re.split(' ',line) if not w==''])\n",
296 | " PreviousLineIsUppercaseTitle = False"
297 | ]
298 | },
299 | {
300 | "cell_type": "markdown",
301 | "metadata": {},
302 | "source": [
303 | "For this snippet of text, we are able to detect the following job titles:"
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": 7,
309 | "metadata": {},
310 | "outputs": [
311 | {
312 | "data": {
313 | "text/plain": [
314 | "['MEDICAL HELP NUCLEAR RADIOLOGIC TECH---titlefound---',\n",
315 | " 'CHEST PHYSICAL THERAPIST---titlefound---',\n",
316 | " 'MANAGER OF PRIMARY CARE PROGRAMS---titlefound---',\n",
317 | " 'MEDICAL---titlefound---']"
318 | ]
319 | },
320 | "execution_count": 7,
321 | "metadata": {},
322 | "output_type": "execute_result"
323 | }
324 | ],
325 | "source": [
326 | "[w for w in text_detect_title if re.findall(title_found,w)]"
327 | ]
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "## Detect addresses and ending phrases \n",
334 | "In this step, we detect addresses, such as street names and zip codes, and phrases which tend to appear at the end of ads, such as \"An Equal Opportunity Employer\" and \"send resume.\" When we detect one, we append the string \"---endingfound---\" to the end of the line."
335 | ]
336 | },
337 | {
338 | "cell_type": "code",
339 | "execution_count": 8,
340 | "metadata": {
341 | "collapsed": true
342 | },
343 | "outputs": [],
344 | "source": [
345 | "ending_found = '---endingfound---'\n",
346 | "text_assign_flag = list()\n",
347 | "\n",
348 | "# see \"detect_ending.py\"\n",
349 | "\n",
350 | "for line in text_detect_title:\n",
351 | " AddressFound , EndingPhraseFound = AssignFlag(line)\n",
352 | " if AddressFound == True or EndingPhraseFound == True:\n",
353 | " text_assign_flag.append(line + ending_found)\n",
354 | " else:\n",
355 | " text_assign_flag.append(line)"
356 | ]
357 | },
358 | {
359 | "cell_type": "markdown",
360 | "metadata": {},
361 | "source": [
362 | "For this snippet of text, we are able to detect the following addresses and phrases:"
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": 9,
368 | "metadata": {},
369 | "outputs": [
370 | {
371 | "data": {
372 | "text/plain": [
373 | "['For more information please contact our Personnel department 738-5800 , Ext 255 . An Equal Opportunity Employer---endingfound---',\n",
374 | " '300 Lonjwood Avenue---endingfound---',\n",
375 | " '580 Court Street Keene NH 03431---endingfound---']"
376 | ]
377 | },
378 | "execution_count": 9,
379 | "metadata": {},
380 | "output_type": "execute_result"
381 | }
382 | ],
383 | "source": [
384 | "[w for w in text_assign_flag if re.findall(ending_found,w)]"
385 | ]
386 | },
387 | {
388 | "cell_type": "markdown",
389 | "metadata": {},
390 | "source": [
391 | "After detecting job titles, addresses and ending phrases, we end up with the following text: "
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": 10,
397 | "metadata": {},
398 | "outputs": [
399 | {
400 | "data": {
401 | "text/plain": [
402 | "['MEDICAL HELP NUCLEAR RADIOLOGIC TECH---titlefound---',\n",
403 | " 'full time day po ition is available for registred or registry technician in our Nuclear Medicine department This position does require taking call',\n",
404 | " 'CHEST PHYSICAL THERAPIST---titlefound---',\n",
405 | " 'If you are or registry eligible',\n",
406 | " 'Physical Trhrapist interested in Chest',\n",
407 | " 'Therapy consider the New England Baptist Hospital Responsibilities will include providng chest therapy for Medical Surgical patients family teaching interdisciplinary inservice programs and more',\n",
408 | " 'For more information please contact our Personnel department 738-5800 , Ext 255 . An Equal Opportunity Employer---endingfound---',\n",
409 | " '41 Pa',\n",
410 | " 'HII',\n",
411 | " 'Boston',\n",
412 | " 'MANAGER OF PRIMARY CARE PROGRAMS---titlefound---',\n",
413 | " \"Children's Hospital Medical Center\",\n",
414 | " 'seeks dynamic creative individual to manage its Primary Care Programs including 24-hour Emergency Room Primary Care program the Massachusetts Poison information Center and',\n",
415 | " 'Dental services This position requires 3-5 years experience with background in planning budgeting and managing',\n",
416 | " 'health programs Masters degree preferred but additional experience may be substituted We offer salary commensurate',\n",
417 | " 'with experience and fine fringe benefits package',\n",
418 | " 'please forward resumes to Helena Wallace personnel office',\n",
419 | " 'MEDICAL---titlefound---',\n",
420 | " '300 Lonjwood Avenue---endingfound---',\n",
421 | " 'MA 0211 REGISTERED REGISTRY ELIGIBLE OR',\n",
422 | " 'immi ate available in our modern well- and fu ly accredited 173-bed general hospital Cheshire Hospilal is 80 miles from Boston and near skiing water sports hunting and fishing',\n",
423 | " 'Apphcants must be registered registry eligible or',\n",
424 | " 'NERT',\n",
425 | " 'For further information please contact the Personrel department',\n",
426 | " 'Cheshire Hospital',\n",
427 | " '580 Court Street Keene NH 03431---endingfound---']"
428 | ]
429 | },
430 | "execution_count": 10,
431 | "metadata": {},
432 | "output_type": "execute_result"
433 | }
434 | ],
435 | "source": [
436 | "text_assign_flag"
437 | ]
438 | },
439 | {
440 | "cell_type": "markdown",
441 | "metadata": {},
442 | "source": [
443 | "## Assign boundaries\n",
444 | "Next, we assign boundaries by scanning from the beginning line:\n",
445 | "1. If we see a flag '---titlefound---', then we assign a split indicator **before** that line.\n",
446 | "2. If we see a flag '---endingfound---', then we assign a split indicator **after** that line."
447 | ]
448 | },
449 | {
450 | "cell_type": "code",
451 | "execution_count": 11,
452 | "metadata": {
453 | "collapsed": true
454 | },
455 | "outputs": [],
456 | "source": [
457 | "split_indicator = '---splithere---'\n",
458 | "split_by_title = list() \n",
459 | "split_posting = list()\n",
460 | "\n",
461 | "# -----split if title is found-----\n",
462 | "\n",
463 | "for line in text_assign_flag:\n",
464 | " if re.findall(title_found,line):\n",
465 | " #add a split indicator BEFORE the line with title \n",
466 | " split_by_title.append(split_indicator + '\\n' + line)\n",
467 | " else:\n",
468 | " split_by_title.append(line) # if not found, just append the line back in \n",
469 | " \n",
470 | "split_by_title = [w for w in re.split('\\n','\\n'.join(split_by_title)) if not w=='']"
471 | ]
472 | },
473 | {
474 | "cell_type": "code",
475 | "execution_count": 12,
476 | "metadata": {
477 | "collapsed": true
478 | },
479 | "outputs": [],
480 | "source": [
481 | "# -----split if any ending phrase and/or address is found-----\n",
482 | "\n",
483 | "for line in split_by_title:\n",
484 | " line_remove_ending_found = re.sub(ending_found,'',line) #remove the ending flag\n",
485 | " if re.findall(ending_found,line):\n",
486 | " #add a split indicator AFTER the line where the pattern is found\n",
487 | " split_posting.append( line_remove_ending_found + '\\n' + split_indicator)\n",
488 | " else:\n",
489 | " split_posting.append( line_remove_ending_found ) # if not found, just append the line back in \n",
490 | "\n",
491 | "# after assigning the split indicators, we can use python command to split the ads. \n",
492 | "split_posting = [w for w in re.split(split_indicator,'\\n'.join(split_posting)) if not w=='']"
493 | ]
494 | },
495 | {
496 | "cell_type": "markdown",
497 | "metadata": {},
498 | "source": [
499 | "After assigning boundaires, we end up with the following text:"
500 | ]
501 | },
502 | {
503 | "cell_type": "code",
504 | "execution_count": 13,
505 | "metadata": {},
506 | "outputs": [
507 | {
508 | "name": "stdout",
509 | "output_type": "stream",
510 | "text": [
511 | "MEDICAL HELP NUCLEAR RADIOLOGIC TECH---titlefound---full time day po ition is available for registred or registry technician in our Nuclear Medicine department This position does require taking call\n",
512 | "---splithere---\n",
513 | "CHEST PHYSICAL THERAPIST---titlefound---If you are or registry eligiblePhysical Trhrapist interested in ChestTherapy consider the New England Baptist Hospital Responsibilities will include providng chest therapy for Medical Surgical patients family teaching interdisciplinary inservice programs and moreFor more information please contact our Personnel department 738-5800 , Ext 255 . An Equal Opportunity Employer\n",
514 | "---splithere---\n",
515 | "41 PaHIIBoston\n",
516 | "---splithere---\n",
517 | "MANAGER OF PRIMARY CARE PROGRAMS---titlefound---Children's Hospital Medical Centerseeks dynamic creative individual to manage its Primary Care Programs including 24-hour Emergency Room Primary Care program the Massachusetts Poison information Center andDental services This position requires 3-5 years experience with background in planning budgeting and managinghealth programs Masters degree preferred but additional experience may be substituted We offer salary commensuratewith experience and fine fringe benefits packageplease forward resumes to Helena Wallace personnel office\n",
518 | "---splithere---\n",
519 | "MEDICAL---titlefound---300 Lonjwood Avenue\n",
520 | "---splithere---\n",
521 | "MA 0211 REGISTERED REGISTRY ELIGIBLE ORimmi ate available in our modern well- and fu ly accredited 173-bed general hospital Cheshire Hospilal is 80 miles from Boston and near skiing water sports hunting and fishingApphcants must be registered registry eligible orNERTFor further information please contact the Personrel departmentCheshire Hospital580 Court Street Keene NH 03431\n",
522 | "---splithere---\n"
523 | ]
524 | }
525 | ],
526 | "source": [
527 | "for ad in split_posting:\n",
528 | " print(re.sub('\\n','',ad)) #print out each ad, ignoring the line break indicators. \n",
529 | " print('---splithere---')"
530 | ]
531 | },
532 | {
533 | "cell_type": "markdown",
534 | "metadata": {},
535 | "source": [
536 | "## Construct a spreadsheet dataset\n",
537 | "Finally, we construct a spreadsheet with the following variables:\n",
538 | "1. *page_identifier* : We recover this information in the previous step. For this illustration, we take text from Display Ad page 226, from the January 14, 1979 Boston Globe (Globe_displayad_19790114_226)\n",
539 | "2. *ad_num* : Ad number within a page\n",
540 | "3. *job_title* : Job title of that particular ad (equals empty string if the ad has no title).\n",
541 | "4. *ad_content* : Posting content"
542 | ]
543 | },
544 | {
545 | "cell_type": "code",
546 | "execution_count": 14,
547 | "metadata": {
548 | "collapsed": true
549 | },
550 | "outputs": [],
551 | "source": [
552 | "all_flag = re.compile('|'.join([title_found,ending_found]))\n",
553 | "\n",
554 | "num_ad = 0 #initialize ad number within displayad\n",
555 | "\n",
556 | "final_output = list()\n",
557 | "\n",
558 | "for ad in split_posting:\n",
559 | " \n",
560 | " ad_split_line = [w for w in re.split('\\n',ad) if not w=='']\n",
561 | " \n",
562 | " # --------- record title ----------\n",
563 | "\n",
564 | " title_this_ad = [w for w in ad_split_line if re.findall(title_found,w)] \n",
565 | " #see if any line is a title\n",
566 | " \n",
567 | " if len(title_this_ad) == 1: #if we do have a title\n",
568 | " title_clean = re.sub(all_flag,'',title_this_ad[0].lower()) \n",
569 | " #take out the flags and revert to lowercase\n",
570 | "\n",
571 | " title_clean = ' '.join([y for y in re.split(' ',title_clean) if not y==''])\n",
572 | " else:\n",
573 | " title_clean = ''\n",
574 | "\n",
575 | " # --------- record content ----------\n",
576 | " \n",
577 | " ad_content = [w for w in ad_split_line if not re.findall(title_found,w)] # take out lines with title\n",
578 | " ad_content = ' '.join([w for w in ad_content if not w==''])\n",
579 | " #delete empty lines + combine all the line together (within an ad)\n",
580 | " \n",
581 | " ad_content = re.sub(all_flag,'',ad_content) \n",
582 | " #take out all the flags\n",
583 | "\n",
584 | " # --------- record output ----------\n",
585 | "\n",
586 | " num_ad += 1\n",
587 | " output = [str(page_identifier),str(num_ad),str(title_clean),str(ad_content)] \n",
588 | " final_output.append( '|'.join(output) )\n",
589 | "\n",
590 | "# final output \n",
591 | "final_output_file = open('structured_data.txt','w')\n",
592 | "final_output_file.write('\\n'.join(final_output))\n",
593 | "final_output_file.close()"
594 | ]
595 | },
596 | {
597 | "cell_type": "code",
598 | "execution_count": 15,
599 | "metadata": {},
600 | "outputs": [
601 | {
602 | "name": "stdout",
603 | "output_type": "stream",
604 | "text": [
605 | "Globe_displayad_19790114_226|1|medical help nuclear radiologic tech|full time day po ition is available for registred or registry technician in our Nuclear Medicine department This position does require taking call\n",
606 | "Globe_displayad_19790114_226|2|chest physical therapist|If you are or registry eligible Physical Trhrapist interested in Chest Therapy consider the New England Baptist Hospital Responsibilities will include providng chest therapy for Medical Surgical patients family teaching interdisciplinary inservice programs and more For more information please contact our Personnel department 738-5800 , Ext 255 . An Equal Opportunity Employer\n",
607 | "Globe_displayad_19790114_226|3||41 Pa HII Boston\n",
608 | "Globe_displayad_19790114_226|4|manager of primary care programs|Children's Hospital Medical Center seeks dynamic creative individual to manage its Primary Care Programs including 24-hour Emergency Room Primary Care program the Massachusetts Poison information Center and Dental services This position requires 3-5 years experience with background in planning budgeting and managing health programs Masters degree preferred but additional experience may be substituted We offer salary commensurate with experience and fine fringe benefits package please forward resumes to Helena Wallace personnel office\n",
609 | "Globe_displayad_19790114_226|5|medical|300 Lonjwood Avenue\n",
610 | "Globe_displayad_19790114_226|6||MA 0211 REGISTERED REGISTRY ELIGIBLE OR immi ate available in our modern well- and fu ly accredited 173-bed general hospital Cheshire Hospilal is 80 miles from Boston and near skiing water sports hunting and fishing Apphcants must be registered registry eligible or NERT For further information please contact the Personrel department Cheshire Hospital 580 Court Street Keene NH 03431\n"
611 | ]
612 | }
613 | ],
614 | "source": [
615 | "# print out final output\n",
616 | "structured_posting = open('structured_data.txt').read()\n",
617 | "structured_posting = re.split('\\n',structured_posting)\n",
618 | "for ad in structured_posting:\n",
619 | " print(ad)"
620 | ]
621 | }
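,
{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The resulting file can later be parsed by splitting each row on the pipe delimiter. A minimal sketch, assuming the four-column layout above (the `maxsplit` argument of 3 guards against any stray pipes inside the ad content):\n",
  "\n",
  "```python\n",
  "rows = []\n",
  "for line in structured_posting:\n",
  "    page_id, ad_num, job_title, ad_content = line.split('|', 3)\n",
  "    rows.append({'page_identifier': page_id, 'ad_num': int(ad_num),\n",
  "                 'job_title': job_title, 'ad_content': ad_content})\n",
  "```"
 ]
}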
622 | ],
623 | "metadata": {
624 | "kernelspec": {
625 | "display_name": "Python 3",
626 | "language": "python",
627 | "name": "python3"
628 | },
629 | "language_info": {
630 | "codemirror_mode": {
631 | "name": "ipython",
632 | "version": 3
633 | },
634 | "file_extension": ".py",
635 | "mimetype": "text/x-python",
636 | "name": "python",
637 | "nbconvert_exporter": "python",
638 | "pygments_lexer": "ipython3",
639 | "version": "3.6.1"
640 | }
641 | },
642 | "nbformat": 4,
643 | "nbformat_minor": 1
644 | }
645 |
--------------------------------------------------------------------------------