├── whitelist.pkl
├── README.md
└── philter.py

/whitelist.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/beaunorgeot/philter_ucsf/HEAD/whitelist.pkl
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # README
2 | 
3 | ![Alt text](https://github.com/beaunorgeot/images_for_presentations/blob/master/logo_v4.tif?raw=true "annotation interface")
4 | 
5 | ### What is PHIlter?
6 | The package and associated scripts provide an end-to-end pipeline for removing Protected Health Information from clinical notes (or other sensitive text documents) in a completely secure environment (a machine with no external connections or exposed ports). We use a combination of regular expressions, Part of Speech (POS) and Named Entity Recognition (NER) tagging, and filtering through a whitelist to achieve nearly perfect recall and generate clean, readable notes. Everything is written in straight Python and the package will process any text file, regardless of structure. You can install with pip (see below) and run with a single command-line argument. Processing can be parallelized across as many cores or machines as you have available.
7 | 
8 | - Please note: we don't make any claims that running this software on your data will instantly produce HIPAA compliance.
9 | 
10 | 
11 | 
12 | **philter:** Pull in each note from a directory (nested directories are supported), keep clean words, and replace PHI words with a safe filtered token: **\*\*PHI\*\***, then write the 'phi-reduced' output to a new file with the original file name appended by '\_phi\_reduced'. Philter also generates additional output files containing metadata from the run:
13 | - number of files processed
14 | - number of instances of PHI that were filtered
15 | - list of filtered words
16 | - etc.
17 | 
18 | 
19 | 
20 | 
21 | # Installation
22 | 
23 | **Install philter**
24 | 
25 | ```pip3 install philter```
26 | 
27 | ### Dependencies
28 | spacy package en: a pretrained model of the English language.
29 | You can learn more about the model at the [spacy model documentation](https://spacy.io/docs/usage/models) page. Language models are required for optimum performance.
30 | 
31 | **Download the model:**
32 | 
33 | ```python3 -m spacy download en```
34 | 
35 | Note for Windows users: this command must be run with admin privileges because it creates a *shortcut link* that lets you load a model by name.
36 | 
37 | 
38 | 
39 | # Run
40 | 
41 | **philter**
42 | ``` philter -i ./raw_notes_dir -r -o dir_to_write_to -p 32```
43 | 
44 | Arguments (a worked example follows the list):
45 | 
46 | - ("-i", "--input") = Path to the directory or file that contains the PHI notes; the default is ./input_test/.
47 | - ("-r", "--recursive") = Whether to read files in the input folder recursively; the default is False.
48 | - ("-o", "--output") = Path to the directory in which to save the PHI-reduced notes; the default is ./output_test/.
49 | - ("-w", "--whitelist") = Path to the whitelist; the default is phireducer/whitelist.pkl.
50 | - ("-n", "--name") = The key word of the output file name; the default is *_phi_reduced.txt.
51 | - ("-p", "--process") = The number of processes to run simultaneously; the default is 1.
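
**Example**

A minimal sketch of a single run. The directory, file name, and note text below are invented for illustration, and the exact spacing, tokenization, and replacement tokens in the output depend on the spacy/NLTK models and the whitelist version in use:

```philter -i ./raw_notes_dir -o ./phi_reduced_dir -p 4```

If `./raw_notes_dir/note_001.txt` contained

```Mr. John Smith was seen on 03/14/2016. His phone number is 415-555-0199.```

then `./phi_reduced_dir/note_001_phi_reduced.txt` would contain something like

```Mr. **PHI** **PHI** was seen on **PHIDate**. His phone number is **PHI**.```

and a summary line for the note (file name, number of filtered words, and the filtered words themselves) is appended to `filter_summary.txt` in the output directory.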
52 | 
53 | 
54 | # How it works
55 | 
56 | **philter**
57 | 
58 | ![Alt text](https://github.com/beaunorgeot/images_for_presentations/blob/master/flow_v2.tif?raw=true "phi-reduction process")
59 | 
60 | 
61 | 
62 | 
63 | ### Why did we build it?
64 | Clinical notes capture rich information on the interactions between physicians, nurses, patients, and more. While this data holds the promise of uncovering valuable insights, it is also challenging to work with for numerous reasons. Extracting various forms of knowledge from natural language is difficult on its own. However, attempts to even begin to mine this data on a large scale are severely hampered by the nature of the raw data: it is deeply personal. In order to allow more researchers to have access to this potentially transformative data, individual patient identifiers need to be removed in a way that preserves the content, context, and integrity of the raw note.
65 | 
66 | De-identification of clinical notes is certainly not a new topic; there are even machine learning competitions held to compare methods. Unfortunately, these did not provide us with a viable approach to de-identify our own notes. First, the code for the methods used in the competitions is often not available.
67 | Second, the notes used in public competitions don't reflect our notes very closely, and therefore even methods that are publicly available did not perform nearly as well on our data as they did on the competition data (as noted by Ferrandez, 2012, BMC Medical Research Methodology, which compared public methods on VA data). Additionally, our patients' privacy is paramount to us, which meant we were unwilling to expose our data by using any method that required access to a URL or an external API call. Finally, our goal was to de-identify all 40 MILLION of our notes. There are multiple published approaches that are simply impractical from a run-time perspective at this scale.
68 | 
69 | ## Why a whitelist (aren't blacklists smaller and easier)?
70 | 
71 | Blacklists are certainly the norm, but they have some pretty large inherent problems. For starters, they present an unbounded problem: there is a nearly infinite number of words that could be PHI and that you'd therefore want to filter. For us, the difference between blacklists and whitelists comes down to the *types* of errors that you're willing to make. Since blacklists are made of PHI words and/or patterns, when a mistake is made, PHI is allowed through (a recall error). Whitelists, on the other hand, are made of non-PHI words, which means that when a mistake is made, a non-PHI word gets filtered (a precision error). We care more about recall for our own uses, and we think that high recall is also important to others who will use this software, so a whitelist was the sensible approach. A sketch of this decision rule is shown below.
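The core of that decision is simple. Below is a simplified sketch, not the full pipeline: the real `filter_task()` in philter.py also applies the regex, context, and NER steps described in its docstring, locates the whitelist via the package resources rather than a relative path, and assumes the NLTK tokenizer/tagger data is already downloaded. The sketch only shows how nouns are gated by the whitelist loaded from `whitelist.pkl`:

```python
import pickle
from nltk import word_tokenize, pos_tag

# Load the pickled whitelist of known non-PHI words (lowercased);
# membership in it is the only thing this sketch checks.
with open("whitelist.pkl", "rb") as fin:
    whitelist = pickle.load(fin)

def whitelist_filter(sentence):
    """Replace every noun that is not in the whitelist with **PHI**."""
    out = []
    for word, tag in pos_tag(word_tokenize(sentence)):
        is_noun = tag in ("NN", "NNP", "NNS", "NNPS")
        if is_noun and word.lower() not in whitelist:
            out.append("**PHI**")   # worst case: a rare-but-safe noun is over-filtered (precision error)
        else:
            out.append(word)        # non-nouns and whitelisted nouns pass through untouched
    return " ".join(out)

print(whitelist_filter("Patient Kowalczyk presented with uncontrolled hypertension."))
# -> Patient **PHI** presented with uncontrolled hypertension.
#    (assuming 'kowalczyk' is not in the whitelist but the other nouns are)
```

Because the fallback for an unknown noun is to filter it, names the system has never seen still tend to be caught (recall), at the cost of occasionally masking safe but rare words (precision).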
72 | 
73 | ### Results (current, unpublished)
74 | 
75 | ![Alt text](https://github.com/beaunorgeot/images_for_presentations/blob/master/performance_1.png?raw=true "info_extraction_csv example")
76 | 
77 | # Recommendations
78 | - Search through the filtered words for institution-specific words to improve precision
79 | - Have a policy in place for reporting PHI leakage
--------------------------------------------------------------------------------
/philter.py:
--------------------------------------------------------------------------------
1 | # De-id script
2 | # import modules
3 | from __future__ import print_function
4 | import os
5 | import sys
6 | import pickle
7 | import glob
8 | import string
9 | import re
10 | import time
11 | import argparse
12 | 
13 | # import multiprocess
14 | import multiprocessing
15 | from multiprocessing import Pool
16 | 
17 | # import NLP packages
18 | import nltk
19 | from nltk import sent_tokenize
20 | from nltk import word_tokenize
21 | from nltk.tree import Tree
22 | from nltk import pos_tag_sents
23 | from nltk import pos_tag
24 | from nltk import ne_chunk
25 | import spacy
26 | from pkg_resources import resource_filename
27 | from nltk.tag.perceptron import AveragedPerceptron
28 | from nltk.tag import SennaTagger
29 | from nltk.tag import HunposTagger
30 | """
31 | Replace PHI words with a safe filtered word: '**PHI**'
32 | 
33 | Does:
34 | 1. Run a regex to search for salutations (must be done prior to splitting into sentences b/c of the '.' present in most salutations).
35 | 2. Split the document into sentences.
36 | 3. Run regex patterns to identify PHI considering only 1 word at a time: emails, phone numbers, DOB, SSN, postal codes, or any word containing 5 or more consecutive digits or
37 |    8 or more characters that begins and ends with digits.
38 | 
39 | 4. Split sentences into words.
40 | 
41 | 5. Run regex patterns to identify PHI using the context around each word. For example, ages over 90 are found by checking the surrounding words for 'age', 'years old', etc.,
42 |    and addresses (streets, rooms, states, etc.) are found via street-type keywords.
43 | 
44 | 6. Use nltk to label POS.
45 | 
46 | 7. Identify names: we run 2 separate methods to check if a word is a name based on its context at the chunk/phrase level. To do this:
47 |    First: spacy nlp() is run at the sentence level and outputs NER labels at the chunk/phrase level.
48 |    Second: for chunks/phrases that spacy thinks are 'person', get a second opinion by running nltk ne_chunk, which uses nltk POS tags
49 |    to assign an NER label to the chunk/phrase.
50 |    * If both spacy and nltk provide a 'person' NER label for a chunk/phrase: check the words in the chunk 1-by-1 with nltk to determine if
51 |      the word's POS tag is a proper noun.
52 |      - Sometimes the label 'person' may be applied to more than 1 word, and occasionally 1 of those words is just a normal noun, not a name.
53 |      - If the word is a proper noun, flag the word and add it to name_set.
54 |    * If spacy labels the word as a person but nltk labels it as another category of NER, run spacy on the all-UPPERCASE version of each word
55 |      in the chunk 1-by-1 to see if spacy still believes that the uppercase word is an NER of any category.
56 |      - If it is, add the word to name_set;
57 |      - If spacy thinks the uppercase version of the word no longer has an NER label, then treat the word as any other noun and send it to be filtered through the whitelist.
58 | 
59 | 8. If the word is a noun, send it on to be checked against the whitelist. If the word is not a noun,
60 |    consider it safe and pass it on to output. For nouns, if the word is in the whitelist,
61 |    check if the word is in name_set; if so -> filter.
62 |    If the word is not in name_set,
63 |    use spacy to check if the word is a name based on the single word's meaning and format.
64 |    Spacy does a per-word lookup and assigns the most frequent use of that word as a flag
65 |    (e.g. 'HUNT': organization, 'Hunt': name, 'hunt': verb).
66 |    If the flag is a name -> filter.
67 |    If the flag is not a name, pass the word through as safe.
68 |    If the word is not in the whitelist -> filter.
69 | 9. Search for middle initials by checking whether a single uppercase letter sits between two PHI tokens; if so, treat the letter as a middle initial and filter it, e.g. Ane H Berry.
70 | 
71 | NOTE: All of the above numbered steps happen in filter_task(). Other functions either support filter_task() or simply involve
72 | dealing with I/O and multiprocessing.
73 | 
74 | """
75 | 
76 | 
77 | nlp = spacy.load('en')  # load the spacy English model
78 | # pretrain = SennaTagger('senna')
79 | 
80 | # configure the regex patterns
81 | # we're going to want to remove all special characters
82 | pattern_word = re.compile(r"[^\w+]")
83 | 
84 | # Find numbers like SSN/PHONE/FAX
85 | # 3 patterns: 1. 6 or more digits will be filtered  2. digit followed by - followed by digit  3. ignore case of characters
86 | pattern_number = re.compile(r"""\b(
87 | (\d[\(\)\-\']?\s?){6}([\(\)\-\']?\d)+  # SSN/PHONE/FAX XXX-XX-XXXX, XXX-XXX-XXXX, XXX-XXXXXXXX, etc.
88 | |(\d[\(\)\-.\']?){7}([\(\)\-.\']?\d)+  # test
89 | )\b""", re.X)
90 | 
91 | pattern_4digits = re.compile(r"""\b(
92 | \d{5}[A-Z0-9]*
93 | )\b""", re.X)
94 | 
95 | pattern_devid = re.compile(r"""\b(
96 | [A-Z0-9\-/]{6}[A-Z0-9\-/]*
97 | )\b""", re.X)
98 | # postal code
99 | # 5 digits, or 5 digits followed by a dash and 4 digits
100 | pattern_postal = re.compile(r"""\b(
101 | \d{5}(-\d{4})?  # postal code XXXXX, XXXXX-XXXX
102 | )\b""", re.X)
103 | 
104 | # match DOB
105 | pattern_dob = re.compile(r"""\b(
106 | .*?(?=\b(\d{1,2}[-./\s]\d{1,2}[-./\s]\d{2}  # X/X/XX
107 | |\d{1,2}[-./\s]\d{1,2}[-./\s]\d{4}  # XX/XX/XXXX
108 | |\d{2}[-./\s]\d{1,2}[-./\s]\d{1,2}  # xx/xx/xx
109 | |\d{4}[-./\s]\d{1,2}[-./\s]\d{1,2}  # xxxx/xx/xx
110 | )\b)
111 | )\b""", re.X | re.I)
112 | 
113 | # match emails
114 | pattern_email = re.compile(r"""\b(
115 | [a-zA-Z0-9_.+-@\"]+@[a-zA-Z0-9-\:\]\[]+[a-zA-Z0-9-.]*
116 | )\b""", re.X | re.I)
117 | 
118 | # match date, similar to DOB but does not include any words
119 | month_name = "Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?"
120 | pattern_date = re.compile(r"""\b( 121 | \d{4}[\-/](0?[1-9]|1[0-2]|"""+month_name+r""")\-\d{4}[\-/](0?[1-9]|1[0-2]|"""+month_name+r""") # YYYY/MM-YYYY/MM 122 | |(0?[1-9]|1[0-2]|"""+month_name+r""")[\-/]\d{4}\-(0?[1-9]|1[0-2]|"""+month_name+r""")[\-/]\d{4} # MM/YYYY-MM/YYYY 123 | |(0?[1-9]|1[0-2]|"""+month_name+r""")/\d{2}\-(0?[1-9]|1[0-2]|"""+month_name+r""")/\d{2} # MM/YY-MM/YY 124 | |(0?[1-9]|1[0-2]|"""+month_name+r""")/\d{2}\-(0?[1-9]|1[0-2]|"""+month_name+r""")/\d{4} # MM/YYYY-MM/YYYY 125 | |(0?[1-9]|1[0-2]|"""+month_name+r""")/([1-2][0-9]|3[0-1]|0?[1-9])\-(0?[1-9]|1[0-2]|"""+month_name+r""")/([1-2][0-9]|3[0-1]|0?[1-9]) #MM/DD-MM/DD 126 | |([1-2][0-9]|3[0-1]|0?[1-9])/(0?[1-9]|1[0-2]|"""+month_name+r""")\-([1-2][0-9]|3[0-1]|0?[1-9])/(0?[1-9]|1[0-2]|"""+month_name+r""") #DD/MM-DD/MM 127 | |(0?[1-9]|1[0-2]|"""+month_name+r""")[\-/\s]([1-2][0-9]|3[0-1]|0?[1-9])[\-/\s]\d{2} # MM/DD/YY 128 | |(0?[1-9]|1[0-2]|"""+month_name+r""")[\-/\s]([1-2][0-9]|3[0-1]|0?[1-9])[\-/\s]\d{4} # MM/DD/YYYY 129 | |([1-2][0-9]|3[0-1]|0?[1-9])[\-/\s](0?[1-9]|1[0-2]|"""+month_name+r""")[\-/\s]\d{2} # DD/MM/YY 130 | |([1-2][0-9]|3[0-1]|0?[1-9])[\-/\s](0?[1-9]|1[0-2]|"""+month_name+r""")[\-/\s]\d{4} # DD/MM/YYYY 131 | |\d{2}[\-./\s](0?[1-9]|1[0-2]|"""+month_name+r""")[\-\./\s]([1-2][0-9]|3[0-1]|0?[1-9]) # YY/MM/DD 132 | |\d{4}[\-./\s](0?[1-9]|1[0-2]|"""+month_name+r""")[\-\./\s]([1-2][0-9]|3[0-1]|0?[1-9]) # YYYY/MM/DD 133 | |\d{4}[\-/](0?[1-9]|1[0-2]|"""+month_name+r""") # YYYY/MM 134 | |(0?[1-9]|1[0-2]|"""+month_name+r""")[\-/]\d{4} # MM/YYYY 135 | |(0?[1-9]|1[0-2]|"""+month_name+r""")/\d{2} # MM/YY 136 | |(0?[1-9]|1[0-2]|"""+month_name+r""")/\d{2} # MM/YYYY 137 | |(0?[1-9]|1[0-2]|"""+month_name+r""")/([1-2][0-9]|3[0-1]|0?[1-9]) #MM/DD 138 | |([1-2][0-9]|3[0-1]|0?[1-9])/(0?[1-9]|1[0-2]|"""+month_name+r""") #DD/MM 139 | )\b""", re.X | re.I) 140 | pattern_mname = re.compile(r'\b(' + month_name + r')\b') 141 | 142 | # match names, A'Bsfs, Absssfs, A-Bsfsfs 143 | pattern_name = re.compile(r"""^[A-Z]\'?[-a-zA-Z]+$""") 144 | 145 | # match age 146 | pattern_age = re.compile(r"""\b( 147 | age|year[s-]?\s?old|y.o[.]? 148 | )\b""", re.X | re.I) 149 | 150 | # match salutation 151 | pattern_salutation = re.compile(r""" 152 | (Dr\.|Mr\.|Mrs\.|Ms\.|Miss|Sir|Madam)\s 153 | (([A-Z]\'?[A-Z]?[\-a-z]+(\s[A-Z]\'?[A-Z]?[\-a-z]+)*) 154 | )""", re.X) 155 | 156 | # match middle initial 157 | # if single char or Jr is surround by 2 phi words, filter. 158 | pattern_middle = re.compile(r"""\*\*PHI\*\*,? (([A-CE-LN-Z][Rr]?|[DM])\.?) | (([A-CE-LN-Z][Rr]?|[DM])\.?),? \*\*PHI\*\*""") 159 | 160 | 161 | # match url 162 | pattern_url = re.compile(r'\b((http[s]?://)?([a-zA-Z0-9$-_@.&+:!\*\(\),])*[\.\/]([a-zA-Z0-9$-_@.&+:\!\*\(\),])*)\b', re.I) 163 | 164 | # check if the folder exists 165 | def is_valid_file(parser, arg): 166 | if not os.path.exists(arg): 167 | parser.error("The folder %s does not exist. Please input a new folder or create one." 
% arg) 168 | else: 169 | return arg 170 | 171 | # check if word is in name_set, if not, check the word by single word level 172 | def namecheck(word_output, name_set, screened_words, safe): 173 | # check if the word is in the name list 174 | if word_output.title() in name_set: 175 | # with open("name.txt", 'a') as fout: 176 | # fout.write(word_output + '\n') 177 | # print('Name:', word_output) 178 | screened_words.append(word_output) 179 | word_output = "**PHI**" 180 | safe = False 181 | 182 | else: 183 | # check spacy, and add the word to the name list if it is a name 184 | # check the word's title version and its uppercase version 185 | word_title = nlp(word_output.title()) 186 | # search Title or UPPER version of word in the english dictionary: nlp() 187 | # nlp() returns the most likely NER tag (word.ents) for the word 188 | # If word_title has NER = person AND word_upper has ANY NER tag, filter 189 | word_upper = nlp(word_output.upper()) 190 | if (word_title.ents != () and word_title.ents[0].label_ == 'PERSON' and 191 | word_upper.ents != () and word_upper.ents[0].label_ is not None): 192 | # with open("name.txt", 'a') as fout: 193 | # fout.write(word_output + '\n') 194 | # print('Name:', word_output) 195 | screened_words.append(word_output) 196 | name_set.add(word_output.title()) 197 | word_output = "**PHI**" 198 | safe = False 199 | 200 | return word_output, name_set, screened_words, safe 201 | 202 | 203 | def filter_task(f, whitelist_dict, foutpath, key_name): 204 | 205 | # pretrain = HunposTagger('hunpos.model', 'hunpos-1.0-linux/hunpos-tag') 206 | pretrain = SennaTagger('senna') 207 | 208 | """ 209 | Uses: namecheck() to check if word that has been tagged as name by either nltk or spacy. namecheck() first searches 210 | nameset which is generated by checking words at the sentence level and tagging names. If word is not in nameset, 211 | namecheck() uses spacy.nlp() to check if word is likely to be a name at the word level. 
212 | 213 | """ 214 | with open(f, encoding='utf-8', errors='ignore') as fin: 215 | # define intial variables 216 | head, tail = os.path.split(f) 217 | #f_name = re.findall(r'[\w\d]+', tail)[0] # get the file number 218 | print(tail) 219 | start_time_single = time.time() 220 | total_records = 1 221 | phi_containing_records = 0 222 | safe = True 223 | screened_words = [] 224 | name_set = set() 225 | phi_reduced = '' 226 | ''' 227 | address_indictor = ['street', 'avenue', 'road', 'boulevard', 228 | 'drive', 'trail', 'way', 'lane', 'ave', 229 | 'blvd', 'st', 'rd', 'trl', 'wy', 'ln', 230 | 'court', 'ct', 'place', 'plc', 'terrace', 'ter'] 231 | ''' 232 | address_indictor = ['street', 'avenue', 'road', 'boulevard', 233 | 'drive', 'trail', 'way', 'lane', 'ave', 234 | 'blvd', 'st', 'rd', 'trl', 'wy', 'ln', 235 | 'court', 'ct', 'place', 'plc', 'terrace', 'ter', 236 | 'highway', 'freeway', 'autoroute', 'autobahn', 'expressway', 237 | 'autostrasse', 'autostrada', 'byway', 'auto-estrada', 'motorway', 238 | 'avenue', 'boulevard', 'road', 'street', 'alley', 'bay', 'drive', 239 | 'gardens', 'gate', 'grove', 'heights', 'highlands', 'lane', 'mews', 240 | 'pathway', 'terrace', 'trail', 'vale', 'view', 'walk', 'way', 'close', 241 | 'court', 'place', 'cove', 'circle', 'crescent', 'square', 'loop', 'hill', 242 | 'causeway', 'canyon', 'parkway', 'esplanade', 'approach', 'parade', 'park', 243 | 'plaza', 'promenade', 'quay', 'bypass'] 244 | 245 | 246 | note = fin.read() 247 | note = re.sub(r'=', ' = ', note) 248 | # Begin Step 1: saluation check 249 | re_list = pattern_salutation.findall(note) 250 | for i in re_list: 251 | name_set = name_set | set(i[1].split(' ')) 252 | 253 | # note_length = len(word_tokenize(note)) 254 | # Begin step 2: split document into sentences 255 | note = sent_tokenize(note) 256 | 257 | for sent in note: # Begin Step 3: Pattern checking 258 | # postal code check 259 | # print(sent) 260 | if pattern_postal.findall(sent) != []: 261 | safe = False 262 | for item in pattern_postal.findall(sent): 263 | screened_words.append(item[0]) 264 | sent = str(pattern_postal.sub('**PHIPostal**', sent)) 265 | 266 | if pattern_devid.findall(sent) != []: 267 | safe = False 268 | for item in pattern_devid.findall(sent): 269 | if (re.search(r'\d', item) is not None and 270 | re.search(r'[A-Z]',item) is not None): 271 | screened_words.append(item) 272 | sent = sent.replace(item, '**PHI**') 273 | 274 | # number check 275 | if pattern_number.findall(sent) != []: 276 | safe = False 277 | for item in pattern_number.findall(sent): 278 | # print(item) 279 | #if pattern_date.match(item[0]) is None: 280 | sent = sent.replace(item[0], '**PHI**') 281 | screened_words.append(item[0]) 282 | #print(item[0]) 283 | #sent = str(pattern_number.sub('**PHI**', sent)) 284 | ''' 285 | if pattern_date.findall(sent) != []: 286 | safe = False 287 | for item in pattern_date.findall(sent): 288 | if '-' in item[0]: 289 | if (len(set(re.findall(r'[^\w\-]',item[0]))) <= 1): 290 | screened_words.append(item[0]) 291 | #print(item[0]) 292 | sent = sent.replace(item[0], '**PHIDate**') 293 | else: 294 | if len(set(re.findall(r'[^\w]',item[0]))) == 1: 295 | screened_words.append(item[0]) 296 | #print(item[0]) 297 | sent = sent.replace(item[0], '**PHIDate**') 298 | ''' 299 | data_list = [] 300 | if pattern_date.findall(sent) != []: 301 | safe = False 302 | for item in pattern_date.findall(sent): 303 | if '-' in item[0]: 304 | if (len(set(re.findall(r'[^\w\-]',item[0]))) <= 1): 305 | #screened_words.append(item[0]) 306 | #print(item[0]) 307 | 
data_list.append(item[0]) 308 | #sent = sent.replace(item[0], '**PHIDate**') 309 | else: 310 | if len(set(re.findall(r'[^\w]',item[0]))) == 1: 311 | #screened_words.append(item[0]) 312 | #print(item[0]) 313 | data_list.append(item[0]) 314 | #sent = sent.replace(item[0], '**PHIDate**') 315 | data_list.sort(key=len, reverse=True) 316 | for item in data_list: 317 | sent = sent.replace(item, '**PHIDate**') 318 | 319 | #sent = str(pattern_date.sub('**PHI**', sent)) 320 | #print(sent) 321 | if pattern_4digits.findall(sent) != []: 322 | safe = False 323 | for item in pattern_4digits.findall(sent): 324 | screened_words.append(item) 325 | sent = str(pattern_4digits.sub('**PHI**', sent)) 326 | # email check 327 | if pattern_email.findall(sent) != []: 328 | safe = False 329 | for item in pattern_email.findall(sent): 330 | screened_words.append(item) 331 | sent = str(pattern_email.sub('**PHI**', sent)) 332 | # url check 333 | if pattern_url.findall(sent) != []: 334 | safe = False 335 | for item in pattern_url.findall(sent): 336 | #print(item[0]) 337 | if (re.search(r'[a-z]', item[0]) is not None and 338 | '.' in item[0] and 339 | re.search(r'[A-Z]', item[0]) is None and 340 | len(item[0])>10): 341 | print(item[0]) 342 | screened_words.append(item[0]) 343 | sent = sent.replace(item[0], '**PHI**') 344 | #print(item[0]) 345 | #sent = str(pattern_url.sub('**PHI**', sent)) 346 | # dob check 347 | ''' 348 | re_list = pattern_dob.findall(sent) 349 | i = 0 350 | while True: 351 | if i >= len(re_list): 352 | break 353 | else: 354 | text = ' '.join(re_list[i][0].split(' ')[-6:]) 355 | if re.findall(r'\b(birth|dob)\b', text, re.I) != []: 356 | safe = False 357 | sent = sent.replace(re_list[i][1], '**PHI**') 358 | screened_words.append(re_list[i][1]) 359 | i += 2 360 | ''' 361 | 362 | # Begin Step 4 363 | # substitute spaces for special characters 364 | sent = re.sub(r'[\/\-\:\~\_]', ' ', sent) 365 | # label all words for NER using the sentence level context. 
366 | spcy_sent_output = nlp(sent) 367 | # split sentences into words 368 | sent = [word_tokenize(sent)] 369 | #print(sent) 370 | # Begin Step 5: context level pattern matching with regex 371 | for position in range(0, len(sent[0])): 372 | word = sent[0][position] 373 | # age check 374 | if word.isdigit() and int(word) > 90: 375 | if position <= 2: # check the words before age 376 | word_previous = ' '.join(sent[0][:position]) 377 | else: 378 | word_previous = ' '.join(sent[0][position - 2:position]) 379 | if position >= len(sent[0]) - 2: # check the words after age 380 | word_after = ' '.join(sent[0][position+1:]) 381 | else: 382 | word_after = ' '.join(sent[0][position+1:position +3]) 383 | 384 | age_string = str(word_previous) + str(word_after) 385 | if pattern_age.findall(age_string) != []: 386 | screened_words.append(sent[0][position]) 387 | sent[0][position] = '**PHI**' 388 | safe = False 389 | 390 | # address check 391 | elif (position >= 1 and position < len(sent[0])-1 and 392 | (word.lower() in address_indictor or 393 | (word.lower() == 'dr' and sent[0][position+1] != '.')) and 394 | (word.istitle() or word.isupper())): 395 | 396 | if sent[0][position - 1].istitle() or sent[0][position-1].isupper(): 397 | screened_words.append(sent[0][position - 1]) 398 | sent[0][position - 1] = '**PHI**' 399 | i = position - 1 400 | # find the closet number, should be the number of street 401 | while True: 402 | if re.findall(r'^[\d-]+$', sent[0][i]) != []: 403 | begin_position = i 404 | break 405 | elif i == 0 or position - i > 5: 406 | begin_position = position 407 | break 408 | else: 409 | i -= 1 410 | i = position + 1 411 | # block the info of city, state, apt number, etc. 412 | while True: 413 | if '**PHIPostal**' in sent[0][i]: 414 | end_position = i 415 | break 416 | elif i == len(sent[0]) - 1: 417 | end_position = position 418 | break 419 | else: 420 | i += 1 421 | if end_position <= position: 422 | end_position = position 423 | 424 | for i in range(begin_position, end_position): 425 | #if sent[0][i] != '**PHIPostal**': 426 | screened_words.append(sent[0][i]) 427 | sent[0][i] = '**PHI**' 428 | safe = False 429 | 430 | # Begin Step 6: NLTK POS tagging 431 | sent_tag = nltk.pos_tag_sents(sent) 432 | #try: 433 | # senna cannot handle long sentence. 434 | #sent_tag = [[]] 435 | #length_100 = len(sent[0])//100 436 | #for j in range(0, length_100+1): 437 | #[sent_tag[0].append(j) for j in pretrain.tag(sent[0][100*j:100*(j+1)])] 438 | # hunpos needs to change the type from bytes to string 439 | #print(sent_tag[0]) 440 | #sent_tag = [pretrain.tag(sent[0])] 441 | #for j in range(len(sent_tag[0])): 442 | #sent_tag[0][j] = list(sent_tag[0][j]) 443 | #sent_tag[0][j][1] = sent_tag[0][j][1].decode('utf-8') 444 | #except: 445 | #print('POS error:', tail, sent[0]) 446 | #sent_tag = nltk.pos_tag_sents(sent) 447 | # Begin Step 7: Use both NLTK and Spacy to check if the word is a name based on sentence level NER label for the word. 
448 | for ent in spcy_sent_output.ents: # spcy_sent_output contains a dict with each word in the sentence and its NLP labels 449 | #spcy_sent_ouput.ents is a list of dictionaries containing chunks of words (phrases) that spacy believes are Named Entities 450 | # Each ent has 2 properties: text which is the raw word, and label_ which is the NER category for the word 451 | if ent.label_ == 'PERSON': 452 | #print(ent.text) 453 | # if word is person, recheck that spacy still thinks word is person at the word level 454 | spcy_chunk_output = nlp(ent.text) 455 | if spcy_chunk_output.ents != () and spcy_chunk_output.ents[0].label_ == 'PERSON': 456 | # Now check to see what labels NLTK provides for the word 457 | name_tag = word_tokenize(ent.text) 458 | # senna & hunpos 459 | #name_tag = pretrain.tag(name_tag) 460 | # hunpos needs to change the type from bytes to string 461 | #for j in range(len(name_tag)): 462 | #name_tag[j] = list(name_tag[j]) 463 | #name_tag[j][1] = name_tag[j][1].decode('utf-8') 464 | #chunked = ne_chunk(name_tag) 465 | # default 466 | name_tag = pos_tag_sents([name_tag]) 467 | chunked = ne_chunk(name_tag[0]) 468 | for i in chunked: 469 | if type(i) == Tree: # if ne_chunck thinks chunk is NER, creates a tree structure were leaves are the words in the chunk (and their POS labels) and the trunk is the single NER label for the chunk 470 | if i.label() == 'PERSON': 471 | for token, pos in i.leaves(): 472 | if pos == 'NNP': 473 | name_set.add(token) 474 | 475 | else: 476 | for token, pos in i.leaves(): 477 | spcy_upper_output = nlp(token.upper()) 478 | if spcy_upper_output.ents != (): 479 | name_set.add(token) 480 | 481 | # BEGIN STEP 8: whitelist check 482 | # sent_tag is the nltk POS tagging for each word at the sentence level. 483 | for i in range(len(sent_tag[0])): 484 | # word contains the i-th word and it's POS tag 485 | word = sent_tag[0][i] 486 | # print(word) 487 | # word_output is just the raw word itself 488 | word_output = word[0] 489 | 490 | if word_output not in string.punctuation: 491 | word_check = str(pattern_word.sub('', word_output)) 492 | #if word_check.title() in ['Dr', 'Mr', 'Mrs', 'Ms']: 493 | #print(word_check) 494 | # remove the speical chars 495 | try: 496 | # word[1] is the pos tag of the word 497 | 498 | if (((word[1] == 'NN' or word[1] == 'NNP') or 499 | ((word[1] == 'NNS' or word[1] == 'NNPS') and word_check.istitle()))): 500 | if word_check.lower() not in whitelist_dict: 501 | screened_words.append(word_output) 502 | word_output = "**PHI**" 503 | safe = False 504 | else: 505 | # For words that are in whitelist, check to make sure that we have not identified them as names 506 | if ((word_output.istitle() or word_output.isupper()) and 507 | pattern_name.findall(word_output) != [] and 508 | re.search(r'\b([A-Z])\b', word_check) is None): 509 | word_output, name_set, screened_words, safe = namecheck(word_output, name_set, screened_words, safe) 510 | 511 | # check day/year according to the month name 512 | elif word[1] == 'CD': 513 | if i > 2: 514 | context_before = sent_tag[0][i-3:i] 515 | else: 516 | context_before = sent_tag[0][0:i] 517 | if i <= len(sent_tag[0]) - 4: 518 | context_after = sent_tag[0][i+1:i+4] 519 | else: 520 | context_after = sent_tag[0][i+1:] 521 | #print(word_output, context_before+context_after) 522 | for j in (context_before + context_after): 523 | if pattern_mname.search(j[0]) is not None: 524 | screened_words.append(word_output) 525 | #print(word_output) 526 | word_output = "**PHI**" 527 | safe = False 528 | break 529 | else: 530 
| word_output, name_set, screened_words, safe = namecheck(word_output, name_set, screened_words, safe) 531 | 532 | 533 | except: 534 | print(word_output, sys.exc_info()) 535 | if word_output.lower()[0] == '\'s': 536 | if phi_reduced[-7:] != '**PHI**': 537 | phi_reduced = phi_reduced + word_output 538 | #print(word_output) 539 | else: 540 | phi_reduced = phi_reduced + ' ' + word_output 541 | # Format output for later use by eval.py 542 | else: 543 | if (i > 0 and sent_tag[0][i-1][0][-1] in string.punctuation and 544 | sent_tag[0][i-1][0][-1] != '*'): 545 | phi_reduced = phi_reduced + word_output 546 | elif word_output == '.' and sent_tag[0][i-1][0] in ['Dr', 'Mr', 'Mrs', 'Ms']: 547 | phi_reduced = phi_reduced + word_output 548 | else: 549 | phi_reduced = phi_reduced + ' ' + word_output 550 | #print(phi_reduced) 551 | 552 | # Begin Step 8: check middle initial and month name 553 | if pattern_mname.findall(phi_reduced) != []: 554 | for item in pattern_mname.findall(phi_reduced): 555 | screened_words.append(item[0]) 556 | phi_reduced = pattern_mname.sub('**PHI**', phi_reduced) 557 | 558 | if pattern_middle.findall(phi_reduced) != []: 559 | for item in pattern_middle.findall(phi_reduced): 560 | # print(item[0]) 561 | screened_words.append(item[0]) 562 | phi_reduced = pattern_middle.sub('**PHI** **PHI** ', phi_reduced) 563 | # print(phi_reduced) 564 | 565 | if not safe: 566 | phi_containing_records = 1 567 | 568 | # save phi_reduced file 569 | filename = '.'.join(tail.split('.')[:-1])+"_" + key_name + ".txt" 570 | filepath = os.path.join(foutpath, filename) 571 | with open(filepath, "w") as phi_reduced_note: 572 | phi_reduced_note.write(phi_reduced) 573 | 574 | # save filtered words 575 | #screened_words = list(filter(lambda a: a!= '**PHI**', screened_words)) 576 | filepath = os.path.join(foutpath,'filter_summary.txt') 577 | #print(filepath) 578 | screened_words = list(filter(lambda a: '**PHI' not in a, screened_words)) 579 | #screened_words = list(filter(lambda a: a != '**PHI**', screened_words)) 580 | #print(screened_words) 581 | with open(filepath, 'a') as fout: 582 | fout.write('.'.join(tail.split('.')[:-1])+' ' + str(len(screened_words)) + 583 | ' ' + ' '.join(screened_words)+'\n') 584 | # fout.write(' '.join(screened_words)) 585 | 586 | print(total_records, f, "--- %s seconds ---" % (time.time() - start_time_single)) 587 | # hunpos needs to close session 588 | #pretrain.close() 589 | return total_records, phi_containing_records 590 | 591 | 592 | def main(): 593 | # get input/output/filename 594 | ap = argparse.ArgumentParser() 595 | ap.add_argument("-i", "--input", default="input_test/", 596 | help="Path to the directory or the file that contains the PHI note, the default is ./input_test/.", 597 | type=lambda x: is_valid_file(ap, x)) 598 | ap.add_argument("-r", "--recursive", action = 'store_true', default = False, 599 | help="whether to read files in the input folder recursively.") 600 | ap.add_argument("-o", "--output", default="output_test/", 601 | help="Path to the directory to save the PHI-reduced notes in, the default is ./output_test/.", 602 | type=lambda x: is_valid_file(ap, x)) 603 | ap.add_argument("-w", "--whitelist", 604 | #default=os.path.join(os.path.dirname(__file__), 'whitelist.pkl'), 605 | default=resource_filename(__name__, 'whitelist.pkl'), 606 | help="Path to the whitelist, the default is phireducer/whitelist.pkl") 607 | ap.add_argument("-n", "--name", default="phi_reduced", 608 | help="The key word of the output file name, the default is *_phi_reduced.txt.") 609 | 
ap.add_argument("-p", "--process", default=1, type=int, 610 | help="The number of processes to run simultaneously, the default is 1.") 611 | args = ap.parse_args() 612 | 613 | finpath = args.input 614 | foutpath = args.output 615 | key_name = args.name 616 | whitelist_file = args.whitelist 617 | process_number = args.process 618 | if_dir = os.path.isdir(finpath) 619 | 620 | start_time_all = time.time() 621 | if if_dir: 622 | print('input folder:', finpath) 623 | print('recursive?:', args.recursive) 624 | else: 625 | print('input file:', finpath) 626 | head, tail = os.path.split(finpath) 627 | # f_name = re.findall(r'[\w\d]+', tail)[0] 628 | print('output folder:', foutpath) 629 | print('Using whitelist:', whitelist_file) 630 | try: 631 | with open(whitelist_file, "rb") as fin: 632 | whitelist = pickle.load(fin) 633 | print('length of whitelist: {}'.format(len(whitelist))) 634 | if if_dir: 635 | print('phi_reduced file\'s name would be:', "*_"+key_name+".txt") 636 | else: 637 | print('phi_reduced file\'s name would be:', '.'.join(tail.split('.')[:-1])+"_"+key_name+".txt") 638 | print('run in {} process(es)'.format(process_number)) 639 | except FileNotFoundError: 640 | print("No whitelist is found. The script will stop.") 641 | os._exit(0) 642 | 643 | filepath = os.path.join(foutpath,'filter_summary.txt') 644 | with open(filepath, 'w') as fout: 645 | fout.write("") 646 | # start multiprocess 647 | pool = Pool(processes=process_number) 648 | 649 | results_list = [] 650 | filter_time = time.time() 651 | 652 | # apply_async() allows a worker to begin a new task before other works have completed their current task 653 | if os.path.isdir(finpath): 654 | if args.recursive: 655 | results = [pool.apply_async(filter_task, (f,)+(whitelist, foutpath, key_name)) for f in glob.glob (finpath+"/**/*.txt", recursive=True)] 656 | else: 657 | results = [pool.apply_async(filter_task, (f,)+(whitelist, foutpath, key_name)) for f in glob.glob (finpath+"/*.txt")] 658 | else: 659 | results = [pool.apply_async(filter_task, (f,)+(whitelist, foutpath, key_name)) for f in glob.glob( finpath)] 660 | try: 661 | results_list = [r.get() for r in results] 662 | total_records, phi_containing_records = zip(*results_list) 663 | total_records = sum(total_records) 664 | phi_containing_records = sum(phi_containing_records) 665 | 666 | print("total records:", total_records, "--- %s seconds ---" % (time.time() - start_time_all)) 667 | print('filter_time', "--- %s seconds ---" % (time.time() - filter_time)) 668 | print('total records processed: {}'.format(total_records)) 669 | print('num records with phi: {}'.format(phi_containing_records)) 670 | except ValueError: 671 | print("No txt file in the input folder.") 672 | pass 673 | 674 | pool.close() 675 | pool.join() 676 | 677 | 678 | # close multiprocess 679 | 680 | 681 | if __name__ == "__main__": 682 | multiprocessing.freeze_support() # must run for windows 683 | main() 684 | --------------------------------------------------------------------------------