├── whitelist.pkl
├── README.md
└── philter.py

/whitelist.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/beaunorgeot/philter_ucsf/HEAD/whitelist.pkl
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # README
2 | 
3 | ![Alt text](https://github.com/beaunorgeot/images_for_presentations/blob/master/logo_v4.tif?raw=true "annotation interface")
4 | 
5 | ### What is PHIlter?
6 | The package and associated scripts provide an end-to-end pipeline for removing Protected Health Information from clinical notes (or other sensitive text documents) in a completely secure environment (a machine with no external connections or exposed ports). We use a combination of regular expressions, Part of Speech (POS) and Named Entity Recognition (NER) tagging, and filtering through a whitelist to achieve nearly perfect recall and generate clean, readable notes. Everything is written in straight Python and the package will process any text file, regardless of structure. You can install with pip (see below) and run with a single command-line argument. Processing can be parallelized across as many cores or machines as you have available.
7 | 
8 | - Please note: we don't make any claims that running this software on your data will instantly produce HIPAA compliance.
9 | 
10 | 
11 | 
12 | **philter:** Pull in each note from a directory (nested directories are supported), keep clean words, and replace PHI words with a safe filtered token: **\*\*PHI\*\***, then write the 'phi-reduced' output to a new file with the original file name appended by '\_phi\_reduced'. Philter also generates additional output files containing metadata from the run:
13 | - number of files processed
14 | - number of instances of PHI that were filtered
15 | - list of filtered words
16 | - etc.
17 | 
18 | 
19 | 
20 | 
21 | # Installation
22 | 
23 | **Install philter**
24 | 
25 | ```pip3 install philter```
26 | 
27 | ### Dependencies
28 | spacy package en: a pretrained model of the English language.
29 | You can learn more about the model at the [spacy model documentation](https://spacy.io/docs/usage/models) page. Language models are required for optimum performance.
30 | 
31 | **Download the model:**
32 | 
33 | ```python3 -m spacy download en```
34 | 
35 | Note for Windows users: this command must be run with admin privileges because it creates a *shortcut link* that lets you load a model by name.
36 | 
37 | 
38 | 
39 | # Run
40 | 
41 | **philter**
42 | ``` philter -i ./raw_notes_dir -r -o dir_to_write_to -p 32```
43 | 
44 | Arguments (a worked example follows the list):
45 | 
46 | - ("-i", "--input") = Path to the directory or file that contains the PHI notes; the default is ./input_test/.
47 | - ("-r", "--recursive") = Whether to read files in the input folder recursively; the default is False.
48 | - ("-o", "--output") = Path to the directory in which to save the PHI-reduced notes; the default is ./output_test/.
49 | - ("-w", "--whitelist") = Path to the whitelist; the default is phireducer/whitelist.pkl.
50 | - ("-n", "--name") = The key word of the output file name; the default is *_phi_reduced.txt.
51 | - ("-p", "--process") = The number of processes to run simultaneously; the default is 1.
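
**Example**

A minimal sketch of a single run. The directory, file name, and note text below are invented for illustration, and the exact spacing, tokenization, and replacement tokens in the output depend on the spacy/NLTK models and the whitelist version in use:

```philter -i ./raw_notes_dir -o ./phi_reduced_dir -p 4```

If `./raw_notes_dir/note_001.txt` contained

```Mr. John Smith was seen on 03/14/2016. His phone number is 415-555-0199.```

then `./phi_reduced_dir/note_001_phi_reduced.txt` would contain something like

```Mr. **PHI** **PHI** was seen on **PHIDate**. His phone number is **PHI**.```

and a summary line for the note (file name, number of filtered words, and the filtered words themselves) is appended to `filter_summary.txt` in the output directory.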
52 | 
53 | 
54 | # How it works
55 | 
56 | **philter**
57 | 
58 | ![Alt text](https://github.com/beaunorgeot/images_for_presentations/blob/master/flow_v2.tif?raw=true "phi-reduction process")
59 | 
60 | 
61 | 
62 | 
63 | ### Why did we build it?
64 | Clinical notes capture rich information on the interactions between physicians, nurses, patients, and more. While this data holds the promise of uncovering valuable insights, it is also challenging to work with for numerous reasons. Extracting various forms of knowledge from natural language is difficult on its own. However, attempts to even begin to mine this data on a large scale are severely hampered by the nature of the raw data: it is deeply personal. In order to allow more researchers to have access to this potentially transformative data, individual patient identifiers need to be removed in a way that preserves the content, context, and integrity of the raw note.
65 | 
66 | De-identification of clinical notes is certainly not a new topic; there are even machine learning competitions held to compare methods. Unfortunately, these did not provide us with a viable approach to de-identify our own notes. First, the code for the methods used in the competitions is often not available.
67 | Second, the notes used in public competitions don't reflect our notes very closely, and therefore even methods that are publicly available did not perform nearly as well on our data as they did on the competition data (as noted by Ferrandez, 2012, BMC Medical Research Methodology, which compared public methods on VA data). Additionally, our patients' privacy is paramount to us, which meant we were unwilling to expose our data by using any method that required access to a URL or an external API call. Finally, our goal was to de-identify all 40 MILLION of our notes. There are multiple published approaches that are simply impractical from a run-time perspective at this scale.
68 | 
69 | ## Why a whitelist (aren't blacklists smaller and easier)?
70 | 
71 | Blacklists are certainly the norm, but they have some pretty large inherent problems. For starters, they present an unbounded problem: there is a nearly infinite number of words that could be PHI and that you'd therefore want to filter. For us, the difference between blacklists and whitelists comes down to the *types* of errors that you're willing to make. Since blacklists are made of PHI words and/or patterns, when a mistake is made, PHI is allowed through (a recall error). Whitelists, on the other hand, are made of non-PHI words, which means that when a mistake is made, a non-PHI word gets filtered (a precision error). We care more about recall for our own uses, and we think that high recall is also important to others who will use this software, so a whitelist was the sensible approach. A sketch of this decision rule is shown below.
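The core of that decision is simple. Below is a simplified sketch, not the full pipeline: the real `filter_task()` in philter.py also applies the regex, context, and NER steps described in its docstring, locates the whitelist via the package resources rather than a relative path, and assumes the NLTK tokenizer/tagger data is already downloaded. The sketch only shows how nouns are gated by the whitelist loaded from `whitelist.pkl`:

```python
import pickle
from nltk import word_tokenize, pos_tag

# Load the pickled whitelist of known non-PHI words (lowercased);
# membership in it is the only thing this sketch checks.
with open("whitelist.pkl", "rb") as fin:
    whitelist = pickle.load(fin)

def whitelist_filter(sentence):
    """Replace every noun that is not in the whitelist with **PHI**."""
    out = []
    for word, tag in pos_tag(word_tokenize(sentence)):
        is_noun = tag in ("NN", "NNP", "NNS", "NNPS")
        if is_noun and word.lower() not in whitelist:
            out.append("**PHI**")   # worst case: a rare-but-safe noun is over-filtered (precision error)
        else:
            out.append(word)        # non-nouns and whitelisted nouns pass through untouched
    return " ".join(out)

print(whitelist_filter("Patient Kowalczyk presented with uncontrolled hypertension."))
# -> Patient **PHI** presented with uncontrolled hypertension.
#    (assuming 'kowalczyk' is not in the whitelist but the other nouns are)
```

Because the fallback for an unknown noun is to filter it, names the system has never seen still tend to be caught (recall), at the cost of occasionally masking safe but rare words (precision).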
72 | 
73 | ### Results (current, unpublished)
74 | 
75 | ![Alt text](https://github.com/beaunorgeot/images_for_presentations/blob/master/performance_1.png?raw=true "info_extraction_csv example")
76 | 
77 | # Recommendations
78 | - Search through the filtered words for institution-specific words to improve precision
79 | - Have a policy in place for reporting PHI leakage
--------------------------------------------------------------------------------
/philter.py:
--------------------------------------------------------------------------------
1 | # De-id script
2 | # import modules
3 | from __future__ import print_function
4 | import os
5 | import sys
6 | import pickle
7 | import glob
8 | import string
9 | import re
10 | import time
11 | import argparse
12 | 
13 | # import multiprocess
14 | import multiprocessing
15 | from multiprocessing import Pool
16 | 
17 | # import NLP packages
18 | import nltk
19 | from nltk import sent_tokenize
20 | from nltk import word_tokenize
21 | from nltk.tree import Tree
22 | from nltk import pos_tag_sents
23 | from nltk import pos_tag
24 | from nltk import ne_chunk
25 | import spacy
26 | from pkg_resources import resource_filename
27 | from nltk.tag.perceptron import AveragedPerceptron
28 | from nltk.tag import SennaTagger
29 | from nltk.tag import HunposTagger
30 | """
31 | Replace PHI words with a safe filtered word: '**PHI**'
32 | 
33 | Does:
34 | 1. Run a regex to search for salutations (must be done prior to splitting into sentences b/c of the '.' present in most salutations).
35 | 2. Split the document into sentences.
36 | 3. Run regex patterns to identify PHI considering only 1 word at a time: emails, phone numbers, DOB, SSN, postal codes, or any word containing 5 or more consecutive digits or
37 |    8 or more characters that begins and ends with digits.
38 | 
39 | 4. Split sentences into words.
40 | 
41 | 5. Run regex patterns to identify PHI using the context around each word. For example, ages over 90 are found by checking the surrounding words for 'age', 'years old', etc.,
42 |    and addresses (streets, rooms, states, etc.) are found via street-type keywords.
43 | 
44 | 6. Use nltk to label POS.
45 | 
46 | 7. Identify names: we run 2 separate methods to check if a word is a name based on its context at the chunk/phrase level. To do this:
47 |    First: spacy nlp() is run at the sentence level and outputs NER labels at the chunk/phrase level.
48 |    Second: for chunks/phrases that spacy thinks are 'person', get a second opinion by running nltk ne_chunk, which uses nltk POS tags
49 |    to assign an NER label to the chunk/phrase.
50 |    * If both spacy and nltk provide a 'person' NER label for a chunk/phrase: check the words in the chunk 1-by-1 with nltk to determine if
51 |      the word's POS tag is a proper noun.
52 |      - Sometimes the label 'person' may be applied to more than 1 word, and occasionally 1 of those words is just a normal noun, not a name.
53 |      - If the word is a proper noun, flag the word and add it to name_set.
54 |    * If spacy labels the word as a person but nltk labels it as another category of NER, run spacy on the all-UPPERCASE version of each word
55 |      in the chunk 1-by-1 to see if spacy still believes that the uppercase word is an NER of any category.
56 |      - If it is, add the word to name_set;
57 |      - If spacy thinks the uppercase version of the word no longer has an NER label, then treat the word as any other noun and send it to be filtered through the whitelist.
58 | 
59 | 8. If the word is a noun, send it on to be checked against the whitelist. If the word is not a noun,
60 |    consider it safe and pass it on to output. For nouns, if the word is in the whitelist,
61 |    check if the word is in name_set; if so -> filter.
62 |    If the word is not in name_set,
63 |    use spacy to check if the word is a name based on the single word's meaning and format.
64 |    Spacy does a per-word lookup and assigns the most frequent use of that word as a flag
65 |    (e.g. 'HUNT': organization, 'Hunt': name, 'hunt': verb).
66 |    If the flag is a name -> filter.
67 |    If the flag is not a name, pass the word through as safe.
68 |    If the word is not in the whitelist -> filter.
69 | 9. Search for middle initials by checking whether a single uppercase letter sits between two PHI tokens; if so, treat the letter as a middle initial and filter it, e.g. Ane H Berry.
70 | 
71 | NOTE: All of the above numbered steps happen in filter_task(). Other functions either support filter_task() or simply involve
72 | dealing with I/O and multiprocessing.
73 | 
74 | """
75 | 
76 | 
77 | nlp = spacy.load('en')  # load the spacy English model
78 | # pretrain = SennaTagger('senna')
79 | 
80 | # configure the regex patterns
81 | # we're going to want to remove all special characters
82 | pattern_word = re.compile(r"[^\w+]")
83 | 
84 | # Find numbers like SSN/PHONE/FAX
85 | # 3 patterns: 1. 6 or more digits will be filtered  2. digit followed by - followed by digit  3. ignore case of characters
86 | pattern_number = re.compile(r"""\b(
87 | (\d[\(\)\-\']?\s?){6}([\(\)\-\']?\d)+  # SSN/PHONE/FAX XXX-XX-XXXX, XXX-XXX-XXXX, XXX-XXXXXXXX, etc.
88 | |(\d[\(\)\-.\']?){7}([\(\)\-.\']?\d)+  # test
89 | )\b""", re.X)
90 | 
91 | pattern_4digits = re.compile(r"""\b(
92 | \d{5}[A-Z0-9]*
93 | )\b""", re.X)
94 | 
95 | pattern_devid = re.compile(r"""\b(
96 | [A-Z0-9\-/]{6}[A-Z0-9\-/]*
97 | )\b""", re.X)
98 | # postal code
99 | # 5 digits, or 5 digits followed by a dash and 4 digits
100 | pattern_postal = re.compile(r"""\b(
101 | \d{5}(-\d{4})?  # postal code XXXXX, XXXXX-XXXX
102 | )\b""", re.X)
103 | 
104 | # match DOB
105 | pattern_dob = re.compile(r"""\b(
106 | .*?(?=\b(\d{1,2}[-./\s]\d{1,2}[-./\s]\d{2}  # X/X/XX
107 | |\d{1,2}[-./\s]\d{1,2}[-./\s]\d{4}  # XX/XX/XXXX
108 | |\d{2}[-./\s]\d{1,2}[-./\s]\d{1,2}  # xx/xx/xx
109 | |\d{4}[-./\s]\d{1,2}[-./\s]\d{1,2}  # xxxx/xx/xx
110 | )\b)
111 | )\b""", re.X | re.I)
112 | 
113 | # match emails
114 | pattern_email = re.compile(r"""\b(
115 | [a-zA-Z0-9_.+-@\"]+@[a-zA-Z0-9-\:\]\[]+[a-zA-Z0-9-.]*
116 | )\b""", re.X | re.I)
117 | 
118 | # match date, similar to DOB but does not include any words
119 | month_name = "Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?"
120 | pattern_date = re.compile(r"""\b( 121 | \d{4}[\-/](0?[1-9]|1[0-2]|"""+month_name+r""")\-\d{4}[\-/](0?[1-9]|1[0-2]|"""+month_name+r""") # YYYY/MM-YYYY/MM 122 | |(0?[1-9]|1[0-2]|"""+month_name+r""")[\-/]\d{4}\-(0?[1-9]|1[0-2]|"""+month_name+r""")[\-/]\d{4} # MM/YYYY-MM/YYYY 123 | |(0?[1-9]|1[0-2]|"""+month_name+r""")/\d{2}\-(0?[1-9]|1[0-2]|"""+month_name+r""")/\d{2} # MM/YY-MM/YY 124 | |(0?[1-9]|1[0-2]|"""+month_name+r""")/\d{2}\-(0?[1-9]|1[0-2]|"""+month_name+r""")/\d{4} # MM/YYYY-MM/YYYY 125 | |(0?[1-9]|1[0-2]|"""+month_name+r""")/([1-2][0-9]|3[0-1]|0?[1-9])\-(0?[1-9]|1[0-2]|"""+month_name+r""")/([1-2][0-9]|3[0-1]|0?[1-9]) #MM/DD-MM/DD 126 | |([1-2][0-9]|3[0-1]|0?[1-9])/(0?[1-9]|1[0-2]|"""+month_name+r""")\-([1-2][0-9]|3[0-1]|0?[1-9])/(0?[1-9]|1[0-2]|"""+month_name+r""") #DD/MM-DD/MM 127 | |(0?[1-9]|1[0-2]|"""+month_name+r""")[\-/\s]([1-2][0-9]|3[0-1]|0?[1-9])[\-/\s]\d{2} # MM/DD/YY 128 | |(0?[1-9]|1[0-2]|"""+month_name+r""")[\-/\s]([1-2][0-9]|3[0-1]|0?[1-9])[\-/\s]\d{4} # MM/DD/YYYY 129 | |([1-2][0-9]|3[0-1]|0?[1-9])[\-/\s](0?[1-9]|1[0-2]|"""+month_name+r""")[\-/\s]\d{2} # DD/MM/YY 130 | |([1-2][0-9]|3[0-1]|0?[1-9])[\-/\s](0?[1-9]|1[0-2]|"""+month_name+r""")[\-/\s]\d{4} # DD/MM/YYYY 131 | |\d{2}[\-./\s](0?[1-9]|1[0-2]|"""+month_name+r""")[\-\./\s]([1-2][0-9]|3[0-1]|0?[1-9]) # YY/MM/DD 132 | |\d{4}[\-./\s](0?[1-9]|1[0-2]|"""+month_name+r""")[\-\./\s]([1-2][0-9]|3[0-1]|0?[1-9]) # YYYY/MM/DD 133 | |\d{4}[\-/](0?[1-9]|1[0-2]|"""+month_name+r""") # YYYY/MM 134 | |(0?[1-9]|1[0-2]|"""+month_name+r""")[\-/]\d{4} # MM/YYYY 135 | |(0?[1-9]|1[0-2]|"""+month_name+r""")/\d{2} # MM/YY 136 | |(0?[1-9]|1[0-2]|"""+month_name+r""")/\d{2} # MM/YYYY 137 | |(0?[1-9]|1[0-2]|"""+month_name+r""")/([1-2][0-9]|3[0-1]|0?[1-9]) #MM/DD 138 | |([1-2][0-9]|3[0-1]|0?[1-9])/(0?[1-9]|1[0-2]|"""+month_name+r""") #DD/MM 139 | )\b""", re.X | re.I) 140 | pattern_mname = re.compile(r'\b(' + month_name + r')\b') 141 | 142 | # match names, A'Bsfs, Absssfs, A-Bsfsfs 143 | pattern_name = re.compile(r"""^[A-Z]\'?[-a-zA-Z]+$""") 144 | 145 | # match age 146 | pattern_age = re.compile(r"""\b( 147 | age|year[s-]?\s?old|y.o[.]? 148 | )\b""", re.X | re.I) 149 | 150 | # match salutation 151 | pattern_salutation = re.compile(r""" 152 | (Dr\.|Mr\.|Mrs\.|Ms\.|Miss|Sir|Madam)\s 153 | (([A-Z]\'?[A-Z]?[\-a-z]+(\s[A-Z]\'?[A-Z]?[\-a-z]+)*) 154 | )""", re.X) 155 | 156 | # match middle initial 157 | # if single char or Jr is surround by 2 phi words, filter. 158 | pattern_middle = re.compile(r"""\*\*PHI\*\*,? (([A-CE-LN-Z][Rr]?|[DM])\.?) | (([A-CE-LN-Z][Rr]?|[DM])\.?),? \*\*PHI\*\*""") 159 | 160 | 161 | # match url 162 | pattern_url = re.compile(r'\b((http[s]?://)?([a-zA-Z0-9$-_@.&+:!\*\(\),])*[\.\/]([a-zA-Z0-9$-_@.&+:\!\*\(\),])*)\b', re.I) 163 | 164 | # check if the folder exists 165 | def is_valid_file(parser, arg): 166 | if not os.path.exists(arg): 167 | parser.error("The folder %s does not exist. Please input a new folder or create one." 
% arg) 168 | else: 169 | return arg 170 | 171 | # check if word is in name_set, if not, check the word by single word level 172 | def namecheck(word_output, name_set, screened_words, safe): 173 | # check if the word is in the name list 174 | if word_output.title() in name_set: 175 | # with open("name.txt", 'a') as fout: 176 | # fout.write(word_output + '\n') 177 | # print('Name:', word_output) 178 | screened_words.append(word_output) 179 | word_output = "**PHI**" 180 | safe = False 181 | 182 | else: 183 | # check spacy, and add the word to the name list if it is a name 184 | # check the word's title version and its uppercase version 185 | word_title = nlp(word_output.title()) 186 | # search Title or UPPER version of word in the english dictionary: nlp() 187 | # nlp() returns the most likely NER tag (word.ents) for the word 188 | # If word_title has NER = person AND word_upper has ANY NER tag, filter 189 | word_upper = nlp(word_output.upper()) 190 | if (word_title.ents != () and word_title.ents[0].label_ == 'PERSON' and 191 | word_upper.ents != () and word_upper.ents[0].label_ is not None): 192 | # with open("name.txt", 'a') as fout: 193 | # fout.write(word_output + '\n') 194 | # print('Name:', word_output) 195 | screened_words.append(word_output) 196 | name_set.add(word_output.title()) 197 | word_output = "**PHI**" 198 | safe = False 199 | 200 | return word_output, name_set, screened_words, safe 201 | 202 | 203 | def filter_task(f, whitelist_dict, foutpath, key_name): 204 | 205 | # pretrain = HunposTagger('hunpos.model', 'hunpos-1.0-linux/hunpos-tag') 206 | pretrain = SennaTagger('senna') 207 | 208 | """ 209 | Uses: namecheck() to check if word that has been tagged as name by either nltk or spacy. namecheck() first searches 210 | nameset which is generated by checking words at the sentence level and tagging names. If word is not in nameset, 211 | namecheck() uses spacy.nlp() to check if word is likely to be a name at the word level. 
212 | 213 | """ 214 | with open(f, encoding='utf-8', errors='ignore') as fin: 215 | # define intial variables 216 | head, tail = os.path.split(f) 217 | #f_name = re.findall(r'[\w\d]+', tail)[0] # get the file number 218 | print(tail) 219 | start_time_single = time.time() 220 | total_records = 1 221 | phi_containing_records = 0 222 | safe = True 223 | screened_words = [] 224 | name_set = set() 225 | phi_reduced = '' 226 | ''' 227 | address_indictor = ['street', 'avenue', 'road', 'boulevard', 228 | 'drive', 'trail', 'way', 'lane', 'ave', 229 | 'blvd', 'st', 'rd', 'trl', 'wy', 'ln', 230 | 'court', 'ct', 'place', 'plc', 'terrace', 'ter'] 231 | ''' 232 | address_indictor = ['street', 'avenue', 'road', 'boulevard', 233 | 'drive', 'trail', 'way', 'lane', 'ave', 234 | 'blvd', 'st', 'rd', 'trl', 'wy', 'ln', 235 | 'court', 'ct', 'place', 'plc', 'terrace', 'ter', 236 | 'highway', 'freeway', 'autoroute', 'autobahn', 'expressway', 237 | 'autostrasse', 'autostrada', 'byway', 'auto-estrada', 'motorway', 238 | 'avenue', 'boulevard', 'road', 'street', 'alley', 'bay', 'drive', 239 | 'gardens', 'gate', 'grove', 'heights', 'highlands', 'lane', 'mews', 240 | 'pathway', 'terrace', 'trail', 'vale', 'view', 'walk', 'way', 'close', 241 | 'court', 'place', 'cove', 'circle', 'crescent', 'square', 'loop', 'hill', 242 | 'causeway', 'canyon', 'parkway', 'esplanade', 'approach', 'parade', 'park', 243 | 'plaza', 'promenade', 'quay', 'bypass'] 244 | 245 | 246 | note = fin.read() 247 | note = re.sub(r'=', ' = ', note) 248 | # Begin Step 1: saluation check 249 | re_list = pattern_salutation.findall(note) 250 | for i in re_list: 251 | name_set = name_set | set(i[1].split(' ')) 252 | 253 | # note_length = len(word_tokenize(note)) 254 | # Begin step 2: split document into sentences 255 | note = sent_tokenize(note) 256 | 257 | for sent in note: # Begin Step 3: Pattern checking 258 | # postal code check 259 | # print(sent) 260 | if pattern_postal.findall(sent) != []: 261 | safe = False 262 | for item in pattern_postal.findall(sent): 263 | screened_words.append(item[0]) 264 | sent = str(pattern_postal.sub('**PHIPostal**', sent)) 265 | 266 | if pattern_devid.findall(sent) != []: 267 | safe = False 268 | for item in pattern_devid.findall(sent): 269 | if (re.search(r'\d', item) is not None and 270 | re.search(r'[A-Z]',item) is not None): 271 | screened_words.append(item) 272 | sent = sent.replace(item, '**PHI**') 273 | 274 | # number check 275 | if pattern_number.findall(sent) != []: 276 | safe = False 277 | for item in pattern_number.findall(sent): 278 | # print(item) 279 | #if pattern_date.match(item[0]) is None: 280 | sent = sent.replace(item[0], '**PHI**') 281 | screened_words.append(item[0]) 282 | #print(item[0]) 283 | #sent = str(pattern_number.sub('**PHI**', sent)) 284 | ''' 285 | if pattern_date.findall(sent) != []: 286 | safe = False 287 | for item in pattern_date.findall(sent): 288 | if '-' in item[0]: 289 | if (len(set(re.findall(r'[^\w\-]',item[0]))) <= 1): 290 | screened_words.append(item[0]) 291 | #print(item[0]) 292 | sent = sent.replace(item[0], '**PHIDate**') 293 | else: 294 | if len(set(re.findall(r'[^\w]',item[0]))) == 1: 295 | screened_words.append(item[0]) 296 | #print(item[0]) 297 | sent = sent.replace(item[0], '**PHIDate**') 298 | ''' 299 | data_list = [] 300 | if pattern_date.findall(sent) != []: 301 | safe = False 302 | for item in pattern_date.findall(sent): 303 | if '-' in item[0]: 304 | if (len(set(re.findall(r'[^\w\-]',item[0]))) <= 1): 305 | #screened_words.append(item[0]) 306 | #print(item[0]) 307 | 
data_list.append(item[0]) 308 | #sent = sent.replace(item[0], '**PHIDate**') 309 | else: 310 | if len(set(re.findall(r'[^\w]',item[0]))) == 1: 311 | #screened_words.append(item[0]) 312 | #print(item[0]) 313 | data_list.append(item[0]) 314 | #sent = sent.replace(item[0], '**PHIDate**') 315 | data_list.sort(key=len, reverse=True) 316 | for item in data_list: 317 | sent = sent.replace(item, '**PHIDate**') 318 | 319 | #sent = str(pattern_date.sub('**PHI**', sent)) 320 | #print(sent) 321 | if pattern_4digits.findall(sent) != []: 322 | safe = False 323 | for item in pattern_4digits.findall(sent): 324 | screened_words.append(item) 325 | sent = str(pattern_4digits.sub('**PHI**', sent)) 326 | # email check 327 | if pattern_email.findall(sent) != []: 328 | safe = False 329 | for item in pattern_email.findall(sent): 330 | screened_words.append(item) 331 | sent = str(pattern_email.sub('**PHI**', sent)) 332 | # url check 333 | if pattern_url.findall(sent) != []: 334 | safe = False 335 | for item in pattern_url.findall(sent): 336 | #print(item[0]) 337 | if (re.search(r'[a-z]', item[0]) is not None and 338 | '.' in item[0] and 339 | re.search(r'[A-Z]', item[0]) is None and 340 | len(item[0])>10): 341 | print(item[0]) 342 | screened_words.append(item[0]) 343 | sent = sent.replace(item[0], '**PHI**') 344 | #print(item[0]) 345 | #sent = str(pattern_url.sub('**PHI**', sent)) 346 | # dob check 347 | ''' 348 | re_list = pattern_dob.findall(sent) 349 | i = 0 350 | while True: 351 | if i >= len(re_list): 352 | break 353 | else: 354 | text = ' '.join(re_list[i][0].split(' ')[-6:]) 355 | if re.findall(r'\b(birth|dob)\b', text, re.I) != []: 356 | safe = False 357 | sent = sent.replace(re_list[i][1], '**PHI**') 358 | screened_words.append(re_list[i][1]) 359 | i += 2 360 | ''' 361 | 362 | # Begin Step 4 363 | # substitute spaces for special characters 364 | sent = re.sub(r'[\/\-\:\~\_]', ' ', sent) 365 | # label all words for NER using the sentence level context. 
366 | spcy_sent_output = nlp(sent) 367 | # split sentences into words 368 | sent = [word_tokenize(sent)] 369 | #print(sent) 370 | # Begin Step 5: context level pattern matching with regex 371 | for position in range(0, len(sent[0])): 372 | word = sent[0][position] 373 | # age check 374 | if word.isdigit() and int(word) > 90: 375 | if position <= 2: # check the words before age 376 | word_previous = ' '.join(sent[0][:position]) 377 | else: 378 | word_previous = ' '.join(sent[0][position - 2:position]) 379 | if position >= len(sent[0]) - 2: # check the words after age 380 | word_after = ' '.join(sent[0][position+1:]) 381 | else: 382 | word_after = ' '.join(sent[0][position+1:position +3]) 383 | 384 | age_string = str(word_previous) + str(word_after) 385 | if pattern_age.findall(age_string) != []: 386 | screened_words.append(sent[0][position]) 387 | sent[0][position] = '**PHI**' 388 | safe = False 389 | 390 | # address check 391 | elif (position >= 1 and position < len(sent[0])-1 and 392 | (word.lower() in address_indictor or 393 | (word.lower() == 'dr' and sent[0][position+1] != '.')) and 394 | (word.istitle() or word.isupper())): 395 | 396 | if sent[0][position - 1].istitle() or sent[0][position-1].isupper(): 397 | screened_words.append(sent[0][position - 1]) 398 | sent[0][position - 1] = '**PHI**' 399 | i = position - 1 400 | # find the closet number, should be the number of street 401 | while True: 402 | if re.findall(r'^[\d-]+$', sent[0][i]) != []: 403 | begin_position = i 404 | break 405 | elif i == 0 or position - i > 5: 406 | begin_position = position 407 | break 408 | else: 409 | i -= 1 410 | i = position + 1 411 | # block the info of city, state, apt number, etc. 412 | while True: 413 | if '**PHIPostal**' in sent[0][i]: 414 | end_position = i 415 | break 416 | elif i == len(sent[0]) - 1: 417 | end_position = position 418 | break 419 | else: 420 | i += 1 421 | if end_position <= position: 422 | end_position = position 423 | 424 | for i in range(begin_position, end_position): 425 | #if sent[0][i] != '**PHIPostal**': 426 | screened_words.append(sent[0][i]) 427 | sent[0][i] = '**PHI**' 428 | safe = False 429 | 430 | # Begin Step 6: NLTK POS tagging 431 | sent_tag = nltk.pos_tag_sents(sent) 432 | #try: 433 | # senna cannot handle long sentence. 434 | #sent_tag = [[]] 435 | #length_100 = len(sent[0])//100 436 | #for j in range(0, length_100+1): 437 | #[sent_tag[0].append(j) for j in pretrain.tag(sent[0][100*j:100*(j+1)])] 438 | # hunpos needs to change the type from bytes to string 439 | #print(sent_tag[0]) 440 | #sent_tag = [pretrain.tag(sent[0])] 441 | #for j in range(len(sent_tag[0])): 442 | #sent_tag[0][j] = list(sent_tag[0][j]) 443 | #sent_tag[0][j][1] = sent_tag[0][j][1].decode('utf-8') 444 | #except: 445 | #print('POS error:', tail, sent[0]) 446 | #sent_tag = nltk.pos_tag_sents(sent) 447 | # Begin Step 7: Use both NLTK and Spacy to check if the word is a name based on sentence level NER label for the word. 
448 | for ent in spcy_sent_output.ents: # spcy_sent_output contains a dict with each word in the sentence and its NLP labels 449 | #spcy_sent_ouput.ents is a list of dictionaries containing chunks of words (phrases) that spacy believes are Named Entities 450 | # Each ent has 2 properties: text which is the raw word, and label_ which is the NER category for the word 451 | if ent.label_ == 'PERSON': 452 | #print(ent.text) 453 | # if word is person, recheck that spacy still thinks word is person at the word level 454 | spcy_chunk_output = nlp(ent.text) 455 | if spcy_chunk_output.ents != () and spcy_chunk_output.ents[0].label_ == 'PERSON': 456 | # Now check to see what labels NLTK provides for the word 457 | name_tag = word_tokenize(ent.text) 458 | # senna & hunpos 459 | #name_tag = pretrain.tag(name_tag) 460 | # hunpos needs to change the type from bytes to string 461 | #for j in range(len(name_tag)): 462 | #name_tag[j] = list(name_tag[j]) 463 | #name_tag[j][1] = name_tag[j][1].decode('utf-8') 464 | #chunked = ne_chunk(name_tag) 465 | # default 466 | name_tag = pos_tag_sents([name_tag]) 467 | chunked = ne_chunk(name_tag[0]) 468 | for i in chunked: 469 | if type(i) == Tree: # if ne_chunck thinks chunk is NER, creates a tree structure were leaves are the words in the chunk (and their POS labels) and the trunk is the single NER label for the chunk 470 | if i.label() == 'PERSON': 471 | for token, pos in i.leaves(): 472 | if pos == 'NNP': 473 | name_set.add(token) 474 | 475 | else: 476 | for token, pos in i.leaves(): 477 | spcy_upper_output = nlp(token.upper()) 478 | if spcy_upper_output.ents != (): 479 | name_set.add(token) 480 | 481 | # BEGIN STEP 8: whitelist check 482 | # sent_tag is the nltk POS tagging for each word at the sentence level. 483 | for i in range(len(sent_tag[0])): 484 | # word contains the i-th word and it's POS tag 485 | word = sent_tag[0][i] 486 | # print(word) 487 | # word_output is just the raw word itself 488 | word_output = word[0] 489 | 490 | if word_output not in string.punctuation: 491 | word_check = str(pattern_word.sub('', word_output)) 492 | #if word_check.title() in ['Dr', 'Mr', 'Mrs', 'Ms']: 493 | #print(word_check) 494 | # remove the speical chars 495 | try: 496 | # word[1] is the pos tag of the word 497 | 498 | if (((word[1] == 'NN' or word[1] == 'NNP') or 499 | ((word[1] == 'NNS' or word[1] == 'NNPS') and word_check.istitle()))): 500 | if word_check.lower() not in whitelist_dict: 501 | screened_words.append(word_output) 502 | word_output = "**PHI**" 503 | safe = False 504 | else: 505 | # For words that are in whitelist, check to make sure that we have not identified them as names 506 | if ((word_output.istitle() or word_output.isupper()) and 507 | pattern_name.findall(word_output) != [] and 508 | re.search(r'\b([A-Z])\b', word_check) is None): 509 | word_output, name_set, screened_words, safe = namecheck(word_output, name_set, screened_words, safe) 510 | 511 | # check day/year according to the month name 512 | elif word[1] == 'CD': 513 | if i > 2: 514 | context_before = sent_tag[0][i-3:i] 515 | else: 516 | context_before = sent_tag[0][0:i] 517 | if i <= len(sent_tag[0]) - 4: 518 | context_after = sent_tag[0][i+1:i+4] 519 | else: 520 | context_after = sent_tag[0][i+1:] 521 | #print(word_output, context_before+context_after) 522 | for j in (context_before + context_after): 523 | if pattern_mname.search(j[0]) is not None: 524 | screened_words.append(word_output) 525 | #print(word_output) 526 | word_output = "**PHI**" 527 | safe = False 528 | break 529 | else: 530 
| word_output, name_set, screened_words, safe = namecheck(word_output, name_set, screened_words, safe) 531 | 532 | 533 | except: 534 | print(word_output, sys.exc_info()) 535 | if word_output.lower()[0] == '\'s': 536 | if phi_reduced[-7:] != '**PHI**': 537 | phi_reduced = phi_reduced + word_output 538 | #print(word_output) 539 | else: 540 | phi_reduced = phi_reduced + ' ' + word_output 541 | # Format output for later use by eval.py 542 | else: 543 | if (i > 0 and sent_tag[0][i-1][0][-1] in string.punctuation and 544 | sent_tag[0][i-1][0][-1] != '*'): 545 | phi_reduced = phi_reduced + word_output 546 | elif word_output == '.' and sent_tag[0][i-1][0] in ['Dr', 'Mr', 'Mrs', 'Ms']: 547 | phi_reduced = phi_reduced + word_output 548 | else: 549 | phi_reduced = phi_reduced + ' ' + word_output 550 | #print(phi_reduced) 551 | 552 | # Begin Step 8: check middle initial and month name 553 | if pattern_mname.findall(phi_reduced) != []: 554 | for item in pattern_mname.findall(phi_reduced): 555 | screened_words.append(item[0]) 556 | phi_reduced = pattern_mname.sub('**PHI**', phi_reduced) 557 | 558 | if pattern_middle.findall(phi_reduced) != []: 559 | for item in pattern_middle.findall(phi_reduced): 560 | # print(item[0]) 561 | screened_words.append(item[0]) 562 | phi_reduced = pattern_middle.sub('**PHI** **PHI** ', phi_reduced) 563 | # print(phi_reduced) 564 | 565 | if not safe: 566 | phi_containing_records = 1 567 | 568 | # save phi_reduced file 569 | filename = '.'.join(tail.split('.')[:-1])+"_" + key_name + ".txt" 570 | filepath = os.path.join(foutpath, filename) 571 | with open(filepath, "w") as phi_reduced_note: 572 | phi_reduced_note.write(phi_reduced) 573 | 574 | # save filtered words 575 | #screened_words = list(filter(lambda a: a!= '**PHI**', screened_words)) 576 | filepath = os.path.join(foutpath,'filter_summary.txt') 577 | #print(filepath) 578 | screened_words = list(filter(lambda a: '**PHI' not in a, screened_words)) 579 | #screened_words = list(filter(lambda a: a != '**PHI**', screened_words)) 580 | #print(screened_words) 581 | with open(filepath, 'a') as fout: 582 | fout.write('.'.join(tail.split('.')[:-1])+' ' + str(len(screened_words)) + 583 | ' ' + ' '.join(screened_words)+'\n') 584 | # fout.write(' '.join(screened_words)) 585 | 586 | print(total_records, f, "--- %s seconds ---" % (time.time() - start_time_single)) 587 | # hunpos needs to close session 588 | #pretrain.close() 589 | return total_records, phi_containing_records 590 | 591 | 592 | def main(): 593 | # get input/output/filename 594 | ap = argparse.ArgumentParser() 595 | ap.add_argument("-i", "--input", default="input_test/", 596 | help="Path to the directory or the file that contains the PHI note, the default is ./input_test/.", 597 | type=lambda x: is_valid_file(ap, x)) 598 | ap.add_argument("-r", "--recursive", action = 'store_true', default = False, 599 | help="whether to read files in the input folder recursively.") 600 | ap.add_argument("-o", "--output", default="output_test/", 601 | help="Path to the directory to save the PHI-reduced notes in, the default is ./output_test/.", 602 | type=lambda x: is_valid_file(ap, x)) 603 | ap.add_argument("-w", "--whitelist", 604 | #default=os.path.join(os.path.dirname(__file__), 'whitelist.pkl'), 605 | default=resource_filename(__name__, 'whitelist.pkl'), 606 | help="Path to the whitelist, the default is phireducer/whitelist.pkl") 607 | ap.add_argument("-n", "--name", default="phi_reduced", 608 | help="The key word of the output file name, the default is *_phi_reduced.txt.") 609 | 
ap.add_argument("-p", "--process", default=1, type=int, 610 | help="The number of processes to run simultaneously, the default is 1.") 611 | args = ap.parse_args() 612 | 613 | finpath = args.input 614 | foutpath = args.output 615 | key_name = args.name 616 | whitelist_file = args.whitelist 617 | process_number = args.process 618 | if_dir = os.path.isdir(finpath) 619 | 620 | start_time_all = time.time() 621 | if if_dir: 622 | print('input folder:', finpath) 623 | print('recursive?:', args.recursive) 624 | else: 625 | print('input file:', finpath) 626 | head, tail = os.path.split(finpath) 627 | # f_name = re.findall(r'[\w\d]+', tail)[0] 628 | print('output folder:', foutpath) 629 | print('Using whitelist:', whitelist_file) 630 | try: 631 | with open(whitelist_file, "rb") as fin: 632 | whitelist = pickle.load(fin) 633 | print('length of whitelist: {}'.format(len(whitelist))) 634 | if if_dir: 635 | print('phi_reduced file\'s name would be:', "*_"+key_name+".txt") 636 | else: 637 | print('phi_reduced file\'s name would be:', '.'.join(tail.split('.')[:-1])+"_"+key_name+".txt") 638 | print('run in {} process(es)'.format(process_number)) 639 | except FileNotFoundError: 640 | print("No whitelist is found. The script will stop.") 641 | os._exit(0) 642 | 643 | filepath = os.path.join(foutpath,'filter_summary.txt') 644 | with open(filepath, 'w') as fout: 645 | fout.write("") 646 | # start multiprocess 647 | pool = Pool(processes=process_number) 648 | 649 | results_list = [] 650 | filter_time = time.time() 651 | 652 | # apply_async() allows a worker to begin a new task before other works have completed their current task 653 | if os.path.isdir(finpath): 654 | if args.recursive: 655 | results = [pool.apply_async(filter_task, (f,)+(whitelist, foutpath, key_name)) for f in glob.glob (finpath+"/**/*.txt", recursive=True)] 656 | else: 657 | results = [pool.apply_async(filter_task, (f,)+(whitelist, foutpath, key_name)) for f in glob.glob (finpath+"/*.txt")] 658 | else: 659 | results = [pool.apply_async(filter_task, (f,)+(whitelist, foutpath, key_name)) for f in glob.glob( finpath)] 660 | try: 661 | results_list = [r.get() for r in results] 662 | total_records, phi_containing_records = zip(*results_list) 663 | total_records = sum(total_records) 664 | phi_containing_records = sum(phi_containing_records) 665 | 666 | print("total records:", total_records, "--- %s seconds ---" % (time.time() - start_time_all)) 667 | print('filter_time', "--- %s seconds ---" % (time.time() - filter_time)) 668 | print('total records processed: {}'.format(total_records)) 669 | print('num records with phi: {}'.format(phi_containing_records)) 670 | except ValueError: 671 | print("No txt file in the input folder.") 672 | pass 673 | 674 | pool.close() 675 | pool.join() 676 | 677 | 678 | # close multiprocess 679 | 680 | 681 | if __name__ == "__main__": 682 | multiprocessing.freeze_support() # must run for windows 683 | main() 684 | --------------------------------------------------------------------------------