├── 25_epochs_0.5_similarity.npy ├── 20_epochs_0.6_similarity_weights.npy ├── 25_epochs_0.6_similarity_seems_better.npy ├── LICENSE ├── Unrepresented Words.csv ├── README.md ├── Dictionary.csv ├── Symptom Counts.csv └── Symptoms_Similarity.py /25_epochs_0.5_similarity.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sekharvth/symptom-disease/HEAD/25_epochs_0.5_similarity.npy -------------------------------------------------------------------------------- /20_epochs_0.6_similarity_weights.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sekharvth/symptom-disease/HEAD/20_epochs_0.6_similarity_weights.npy -------------------------------------------------------------------------------- /25_epochs_0.6_similarity_seems_better.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sekharvth/symptom-disease/HEAD/25_epochs_0.6_similarity_seems_better.npy -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 sekharvth 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Unrepresented Words.csv: -------------------------------------------------------------------------------- 1 | Words,No. of Occurences 2 | polydypsia,4 3 | orthopnea,14 4 | weepiness,7 5 | hypokinesia,12 6 | pleuritic,10 7 | rhonchus,7 8 | haemoptysis,5 9 | apyrexial,19 10 | dysuria,4 11 | hemodynamically,9 12 | ecchymosis,5 13 | orthostasis,5 14 | transaminitis,7 15 | micturition,4 16 | asterixis,4 17 | prostatism,8 18 | formication,3 19 | cushingoid,5 20 | emphysematous,3 21 | hypesthesia,1 22 | cardiomegaly,3 23 | hemianopsia,2 24 | cicatrisation,2 25 | hypometabolism,1 26 | oliguria,5 27 | gravida,5 28 | photopsia,1 29 | macule,1 30 | "hypothermia,",4 31 | atypia,2 32 | stridor,2 33 | aphagia,2 34 | fremitus,4 35 | stahlis,3 36 | bradykinesia,1 37 | hematochezia,1 38 | egophony,4 39 | temperature-associated,1 40 | paraparesis,3 41 | dysesthesia,1 42 | polymyalgia,3 43 | heberdens,1 44 | retropulsion,1 45 | hypersomnolence,1 46 | urinoma,1 47 | hypoalbuminemia,2 48 | pustule,2 49 | pansystolic,1 50 | titubation,1 51 | dysdiadochokinesia,1 52 | monocytosis,1 53 | tenesmus,3 54 | fecaluria,1 55 | pneumatouria,1 56 | hydropneumothorax,2 57 | uncoordination,2 58 | fatigability,3 59 | intermenstrual,1 60 | primigravida,1 61 | proteinemia,1 62 | phonophobia,1 63 | pulsus,2 64 | breath-holding,1 65 | charleyhorse,1 66 | hypertonicity,1 67 | prodrome,1 68 | hypoproteinemia,1 69 | 
large-for-dates,1 70 | exanthema,15 71 | deglutition,15 72 | thrombocytopaenia,9 73 | oralcandidiasis,15 74 | decubitus,3 75 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Many birds, one stone 2 | Finds symptoms similar to a given symptom, using a symptom-disease data set. 3 | 4 | This model predicts symptoms that are closely related to a given symptom. It can be used in cases (read: apps) where the user enters a symptom and a list of similar symptoms pops up, from which the user selects the ones they are suffering from. These can then be fed into a model that predicts the disease the person is suffering from and redirects them to the associated specialist. The latter part isn't included here. 5 | 6 | The data set contains a table of diseases and the associated symptoms. The model architecture is as follows: 7 | 1) After preprocessing, convert the data from the existing disease-symptom format into a symptom-disease format. 8 | 2) Make symptoms the target words and the associated diseases the context words, and use these as the (target word, context word) pairs for skipgram generation. 9 | 3) After assigning labels of 1 or 0 to the pairs, feed them into the Keras model, which generates new word vectors on top of the existing GloVe vectors. 10 | 4) Loop through the set of all symptoms in the data set, compute the cosine similarity between the embedding of the given symptom and that of the current symptom in the loop, and print the symptoms with a high similarity score. 11 | 12 | The '.npy' files are the new word vectors for the symptoms, trained for different numbers of epochs. The similarity_score value is a hyperparameter that needs to be tuned together with the number of epochs. 
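Steps 2 and 3 above can be illustrated with a minimal sketch. It assumes a plain dictionary from each symptom to its diseases; the function name `make_pairs`, the `seed` parameter, and the toy data are hypothetical and not part of the repository. Each true (symptom, disease) pair gets label 1 and is matched with one randomly chosen unrelated disease with label 0.

```python
import random

def make_pairs(symptom_to_diseases, seed=0):
    """Return shuffled (symptom, disease, label) triples, one negative per positive."""
    rng = random.Random(seed)  # hypothetical seed, only to make the sketch repeatable
    all_diseases = sorted({d for ds in symptom_to_diseases.values() for d in ds})
    pairs = []
    for symptom, diseases in symptom_to_diseases.items():
        # diseases never listed with this symptom are the candidate non-context words
        negatives = [d for d in all_diseases if d not in diseases]
        for disease in diseases:
            pairs.append((symptom, disease, 1))
            if negatives:
                pairs.append((symptom, rng.choice(negatives), 0))
    # break up the regular 1, 0, 1, 0 label pattern before training
    rng.shuffle(pairs)
    return pairs

# toy data for illustration only, not taken from the data set
toy = {
    "fever": ["influenza", "pneumonia"],
    "cough": ["pneumonia", "bronchitis"],
    "tremor": ["parkinson disease"],
}
pairs = make_pairs(toy)
```

Drawing negatives only from diseases that never co-occur with the symptom keeps the 0 labels from contradicting the 1 labels; the script in this repository achieves the same thing with a set operation.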
13 | 14 | The CSV files are: 15 | 1) Dictionary, which lists all the symptoms and diseases and their corresponding indexes in the Keras model. 16 | 2) Symptom Counts, which shows the number of occurrences of each symptom in the data set. 17 | 3) Unrepresented Words, which shows the number of occurrences of words that are not represented in the GloVe vectors. 18 | 19 | The code is heavily documented, and all the details regarding the implementation of the model architecture can be found in it. 20 | -------------------------------------------------------------------------------- /Dictionary.csv: -------------------------------------------------------------------------------- 1 | Key,Values 2 | headache,0 3 | hemodynamically stable,1 4 | irritable mood,2 5 | consciousness clear,3 6 | fatigue,4 7 | unresponsiveness,5 8 | agitation,6 9 | pressure chest,7 10 | rhonchus,8 11 | mass of body structure,9 12 | vomiting,10 13 | hypokinesia,11 14 | intoxication,12 15 | angina pectoris,13 16 | numbness,14 17 | hallucinations auditory,15 18 | dyspnea,16 19 | pain,17 20 | hyponatremia,18 21 | myalgia,19 22 | hyperkalemia,20 23 | decreased body weight,21 24 | pruritus,22 25 | worry,23 26 | ascites,24 27 | chill,25 28 | shortness of breath,26 29 | pain abdominal,27 30 | difficulty,28 31 | constipation,29 32 | feeling hopeless,30 33 | rale,31 34 | mental status changes,32 35 | fever,33 36 | guaiac positive,34 37 | abscess bacterial,35 38 | pleuritic pain,36 39 | nausea,37 40 | prostatism,38 41 | hematuria,39 42 | cough,40 43 | diarrhea,41 44 | breath sounds decreased,42 45 | dyspnea on exertion,43 46 | chest tightness,44 47 | sore to touch,45 48 | lesion,46 49 | patient non compliance,47 50 | redness,48 51 | hemiplegia,49 52 | sweating increased,50 53 | mood depressed,51 54 | apyrexial,52 55 | orthopnea,53 56 | blackout,54 57 | thicken,55 58 | syncope,56 59 | nightmare,57 60 | asthenia,58 61 | haemorrhage,59 62 | tumor cell invasion,60 63 | pain chest,61 64 | suicidal,62 65 | night
sweat,63 66 | facial paresis,64 67 | tremor,65 68 | hallucinations visual,66 69 | seizure,67 70 | non-productive cough,68 71 | hypotension,69 72 | yellow sputum,70 73 | anorexia,71 74 | unconscious state,72 75 | erythema,73 76 | distress respiratory,74 77 | verbal auditory hallucinations,75 78 | unsteady gait,76 79 | chest discomfort,77 80 | lethargy,78 81 | bradycardia,79 82 | swelling,80 83 | transaminitis,81 84 | wheezing,82 85 | fall,83 86 | distended abdomen,84 87 | sinus rhythm,85 88 | dysarthria,86 89 | sleeplessness,87 90 | decreased translucency,88 91 | feeling suicidal,89 92 | productive cough,90 93 | palpitation,91 94 | homelessness,92 95 | dizziness,93 96 | tachypnea,94 97 | sleepy,95 98 | weepiness,96 99 | jugular venous distention,97 100 | muscle twitch,98 101 | abdominal tenderness,99 102 | endocarditis,100 103 | tachycardia sinus,101 104 | suicide attempt,102 105 | stenosis aortic valve,103 106 | upper respiratory infection,104 107 | pneumonia aspiration,105 108 | pneumothorax,106 109 | cirrhosis,107 110 | encephalopathy,108 111 | parkinson disease,109 112 | bacteremia,110 113 | dementia,111 114 | depressive disorder,112 115 | incontinence,113 116 | transient ischemic attack,114 117 | chronic alcoholic intoxication,115 118 | mitral valve insufficiency,116 119 | lymphatic diseases,117 120 | tricuspid valve insufficiency,118 121 | failure kidney,119 122 | hyperglycemia,120 123 | paroxysmal dyspnea,121 124 | carcinoma,122 125 | glaucoma,123 126 | thrombocytopaenia,124 127 | deglutition disorder,125 128 | insufficiency renal,126 129 | gastritis,127 130 | hepatitis c,128 131 | kidney failure acute,129 132 | diverticulitis,130 133 | schizophrenia,131 134 | pneumonia,132 135 | ulcer peptic,133 136 | bronchitis,134 137 | primary carcinoma of the liver cells,135 138 | biliary calculus,136 139 | hepatitis,137 140 | carcinoma colon,138 141 | infection urinary tract,139 142 | gout,140 143 | oralcandidiasis,141 144 | gastroenteritis,142 145 | diabetes,143 146 | 
chronic obstructive airway disease,144 147 | respiratory failure,145 148 | pancreatitis,146 149 | neutropenia,147 150 | emphysema pulmonary,148 151 | colitis,149 152 | alzheimers disease,150 153 | neuropathy,151 154 | dehydration,152 155 | influenza,153 156 | peripheral vascular disease,154 157 | chronic kidney failure,155 158 | neoplasm,156 159 | obesity,157 160 | lymphoma,158 161 | failure heart,159 162 | psychotic disorder,160 163 | overload fluid,161 164 | exanthema,162 165 | gastroesophageal reflux disease,163 166 | osteomyelitis,164 167 | myocardial infarction,165 168 | coronary heart disease,166 169 | anemia,167 170 | ischemia,168 171 | anxiety state,169 172 | sepsis ,170 173 | benign prostatic hypertrophy,171 174 | bipolar disorder,172 175 | dependence,173 176 | cellulitis,174 177 | affect labile,175 178 | hernia,176 179 | infection,177 180 | migraine disorders,178 181 | ileus,179 182 | hypothyroidism,180 183 | osteoporosis,181 184 | neoplasm metastasis,182 185 | adenocarcinoma,183 186 | pneumocystis carinii pneumonia,184 187 | cholecystitis,185 188 | delirium,186 189 | hyperlipidemia,187 190 | hemorrhoids,188 191 | hypercholesterolemia,189 192 | hypoglycemia,190 193 | edema pulmonary,191 194 | pyelonephritis,192 195 | malignant neoplasms,193 196 | hernia hiatal,194 197 | primary malignant neoplasm,195 198 | adhesion,196 199 | hyperbilirubinemia,197 200 | melanoma,198 201 | cardiomyopathy,199 202 | arthritis,200 203 | personality disorder,201 204 | manic disorder,202 205 | deep vein thrombosis,203 206 | hemiparesis,204 207 | thrombus,205 208 | embolism pulmonary,206 209 | ketoacidosis diabetic,207 210 | sickle cell anemia,208 211 | carcinoma breast,209 212 | carcinoma of lung,210 213 | asthma,211 214 | epilepsy,212 215 | delusion,213 216 | hypertension pulmonary,214 217 | degenerative polyarthritis,215 218 | hepatitis b,216 219 | hiv infections,217 220 | paranoia,218 221 | carcinoma prostate,219 222 | spasm bronchial,220 223 | pancytopenia,221 224 | failure 
heart congestive,222 225 | fibroid tumor,223 226 | hypertensive disease,224 227 | pericardial effusion body substance,225 228 | confusion,226 229 | kidney disease,227 230 | decubitus ulcer,228 231 | tonic-clonic seizures,229 232 | diverticulosis,230 233 | accident cerebrovascular,231 234 | obesity morbid,232 235 | aphasia,233 236 | -------------------------------------------------------------------------------- /Symptom Counts.csv: -------------------------------------------------------------------------------- 1 | Symptom,Count 2 | shortness of breath,46 3 | pain,41 4 | fever,34 5 | pain abdominal,27 6 | diarrhea,24 7 | vomiting,24 8 | asthenia,22 9 | cough,22 10 | dyspnea,22 11 | nausea,22 12 | unresponsiveness,22 13 | chill,20 14 | pain chest,20 15 | apyrexial,19 16 | decreased body weight,19 17 | agitation,18 18 | rale,18 19 | lesion,17 20 | mass of body structure,17 21 | hypotension,16 22 | sore to touch,16 23 | hallucinations auditory,14 24 | night sweat,14 25 | orthopnea,14 26 | syncope,14 27 | thicken,14 28 | haemorrhage,13 29 | swelling,13 30 | tremor,13 31 | distress respiratory,12 32 | feeling suicidal,12 33 | hypokinesia,12 34 | patient non compliance,12 35 | suicidal,12 36 | feeling hopeless,11 37 | irritable mood,11 38 | sleepy,11 39 | sweating increased,11 40 | tachypnea,11 41 | wheezing,11 42 | worry,11 43 | ascites,10 44 | blackout,10 45 | difficulty,10 46 | dyspnea on exertion,10 47 | headache,10 48 | hemiplegia,10 49 | hyponatremia,10 50 | non-productive cough,10 51 | pleuritic pain,10 52 | pruritus,10 53 | seizure,10 54 | sleeplessness,10 55 | angina pectoris,9 56 | constipation,9 57 | facial paresis,9 58 | fall,9 59 | fatigue,9 60 | hallucinations visual,9 61 | hemodynamically stable,9 62 | hyperkalemia,9 63 | mental status changes,9 64 | palpitation,9 65 | productive cough,9 66 | anorexia,8 67 | bradycardia,8 68 | chest tightness,8 69 | dizziness,8 70 | guaiac positive,8 71 | homelessness,8 72 | prostatism,8 73 | tumor cell invasion,8 74 | 
abdominal tenderness,7 75 | abscess bacterial,7 76 | chest discomfort,7 77 | consciousness clear,7 78 | decreased translucency,7 79 | distended abdomen,7 80 | erythema,7 81 | jugular venous distention,7 82 | lethargy,7 83 | mood depressed,7 84 | myalgia,7 85 | redness,7 86 | rhonchus,7 87 | transaminitis,7 88 | unconscious state,7 89 | unsteady gait,7 90 | weepiness,7 91 | breath sounds decreased,6 92 | dysarthria,6 93 | hematuria,6 94 | intoxication,6 95 | muscle twitch,6 96 | nightmare,6 97 | numbness,6 98 | pressure chest,6 99 | sinus rhythm,6 100 | verbal auditory hallucinations,6 101 | yellow sputum,6 102 | catatonia,5 103 | clonus,5 104 | cushingoid habitus,5 105 | ecchymosis,5 106 | energy increased,5 107 | extreme exhaustion,5 108 | food intolerance,5 109 | general discomfort,5 110 | gurgle,5 111 | haemoptysis,5 112 | labored breathing,5 113 | lightheadedness,5 114 | metastatic lesion,5 115 | motor retardation,5 116 | oliguria,5 117 | orthostasis,5 118 | scar tissue,5 119 | snuffle,5 120 | speech slurred,5 121 | splenomegaly,5 122 | throat sore,5 123 | vertigo,5 124 | asterixis,4 125 | aura,4 126 | bedridden,4 127 | burning sensation,4 128 | cachexia,4 129 | drowsiness,4 130 | dysuria,4 131 | egophony,4 132 | enuresis,4 133 | fremitus,4 134 | gravida 0,4 135 | green sputum,4 136 | has religious belief,4 137 | heartburn,4 138 | "hypothermia, natural",4 139 | hypoxemia,4 140 | left atrial hypertrophy,4 141 | lung nodule,4 142 | nausea and vomiting,4 143 | neck stiffness,4 144 | nervousness,4 145 | numbness of hand,4 146 | painful swallowing,4 147 | paresthesia,4 148 | polydypsia,4 149 | polyuria,4 150 | sensory discomfort,4 151 | spontaneous rupture of membranes,4 152 | stiffness,4 153 | symptom aggravating factors,4 154 | urge incontinence,4 155 | urgency of micturition,4 156 | weight gain,4 157 | withdraw,4 158 | achalasia,3 159 | arthralgia,3 160 | ataxia,3 161 | breech presentation,3 162 | cardiomegaly,3 163 | cardiovascular event,3 164 | colic 
abdominal,3 165 | cyanosis,3 166 | debilitation,3 167 | decompensation,3 168 | difficulty passing urine,3 169 | emphysematous change,3 170 | fatigability,3 171 | flatulence,3 172 | formication,3 173 | frail,3 174 | hematocrit decreased,3 175 | heme positive,3 176 | hepatosplenomegaly,3 177 | hoarseness,3 178 | hypercapnia,3 179 | indifferent mood,3 180 | loose associations,3 181 | malaise,3 182 | moan,3 183 | monoclonal,3 184 | neologism,3 185 | out of breath,3 186 | paraparesis,3 187 | paresis,3 188 | photophobia,3 189 | polymyalgia,3 190 | presence of q wave,3 191 | projectile vomiting,3 192 | qt interval prolonged,3 193 | room spinning,3 194 | satiety early,3 195 | scratch marks,3 196 | snore,3 197 | spasm,3 198 | stahlis line,3 199 | stool color yellow,3 200 | stupor,3 201 | systolic murmur,3 202 | t wave inverted,3 203 | tenesmus,3 204 | terrify,3 205 | tired,3 206 | unhappy,3 207 | urinary hesitation,3 208 | verbally abusive behavior,3 209 | wheelchair bound,3 210 | abnormally hard consistency,2 211 | absences finding,2 212 | ache,2 213 | ambidexterity,2 214 | aphagia,2 215 | asymptomatic,2 216 | atypia,2 217 | awakening early,2 218 | behavior hyperactive,2 219 | bleeding of vagina,2 220 | cicatrisation,2 221 | clammy skin,2 222 | cystic lesion,2 223 | decreased stool caliber,2 224 | disturbed family,2 225 | dullness,2 226 | estrogen use,2 227 | extrapyramidal sign,2 228 | feeling strange,2 229 | flushing,2 230 | general unsteadiness,2 231 | giddy mood,2 232 | groggy,2 233 | hacking cough,2 234 | heavy legs,2 235 | hemianopsia homonymous,2 236 | hepatomegaly,2 237 | hirsutism,2 238 | hot flush,2 239 | hunger,2 240 | hydropneumothorax,2 241 | hyperacusis,2 242 | hypersomnia,2 243 | hypoalbuminemia,2 244 | hypocalcemia result,2 245 | hypokalemia,2 246 | hypotonic,2 247 | impaired cognition,2 248 | lip smacking,2 249 | mass in breast,2 250 | mediastinal shift,2 251 | moody,2 252 | myoclonus,2 253 | nonsmoker,2 254 | pain back,2 255 | pallor,2 256 | para 1,2 257 
| paralyse,2 258 | poor dentition,2 259 | posturing,2 260 | pulsus paradoxus,2 261 | pustule,2 262 | renal angle tenderness,2 263 | scleral icterus,2 264 | side pain,2 265 | slowing of urinary stream,2 266 | sputum purulent,2 267 | st segment depression,2 268 | stridor,2 269 | stuffy nose,2 270 | superimposition,2 271 | systolic ejection murmur,2 272 | unable to concentrate,2 273 | uncoordination,2 274 | vision blurred,2 275 | abdomen acute,1 276 | abdominal bloating,1 277 | abnormal sensation,1 278 | abortion,1 279 | adverse effect,1 280 | air fluid level,1 281 | alcohol binge episode,1 282 | alcoholic withdrawal symptoms,1 283 | anosmia,1 284 | barking cough,1 285 | behavior showing increased motor activity,1 286 | blanch,1 287 | bowel sounds decreased,1 288 | bradykinesia,1 289 | breakthrough pain,1 290 | breath-holding spell,1 291 | bruit,1 292 | catching breath,1 293 | charleyhorse,1 294 | choke,1 295 | claudication,1 296 | clumsiness,1 297 | coordination abnormal,1 298 | disequilibrium,1 299 | dizzy spells,1 300 | drool,1 301 | dysdiadochokinesia,1 302 | dysesthesia,1 303 | dyspareunia,1 304 | elation,1 305 | excruciating pain,1 306 | exhaustion,1 307 | fear of falling,1 308 | fecaluria,1 309 | feces in rectum,1 310 | feels hot feverish,1 311 | flare,1 312 | floppy,1 313 | focal seizures,1 314 | frothy sputum,1 315 | gag,1 316 | gasping for breath,1 317 | gravida 10,1 318 | heavy feeling,1 319 | heberdens node,1 320 | hematochezia,1 321 | history of - blackout,1 322 | hoard,1 323 | homicidal thoughts,1 324 | hyperemesis,1 325 | hyperhidrosis disorder,1 326 | hypersomnolence,1 327 | hypertonicity,1 328 | hyperventilation,1 329 | hypesthesia,1 330 | hypometabolism,1 331 | hypoproteinemia,1 332 | immobile,1 333 | inappropriate affect,1 334 | incoherent,1 335 | intermenstrual heavy bleeding,1 336 | large-for-dates fetus,1 337 | low back pain,1 338 | macerated skin,1 339 | macule,1 340 | milky,1 341 | monocytosis,1 342 | murphys sign,1 343 | mydriasis,1 344 | 
nasal discharge present,1 345 | nasal flaring,1 346 | no known drug allergies,1 347 | no status change,1 348 | noisy respiration,1 349 | overweight,1 350 | pain foot,1 351 | pain in lower limb,1 352 | pain neck,1 353 | panic,1 354 | pansystolic murmur,1 355 | para 2,1 356 | passed stones,1 357 | pericardial friction rub,1 358 | phonophobia,1 359 | photopsia,1 360 | pin-point pupils,1 361 | pneumatouria,1 362 | poor feeding,1 363 | posterior rhinorrhea,1 364 | previous pregnancies 2,1 365 | primigravida,1 366 | prodrome,1 367 | prostate tender,1 368 | proteinemia,1 369 | pulse absent,1 370 | r wave feature,1 371 | rambling speech,1 372 | rapid shallow breathing,1 373 | red blotches,1 374 | regurgitates after swallowing,1 375 | rest pain,1 376 | retch,1 377 | retropulsion,1 378 | rhd positive,1 379 | rigor - temperature-associated observation,1 380 | rolling of eyes,1 381 | sciatica,1 382 | sedentary,1 383 | shooting pain,1 384 | sneeze,1 385 | sniffle,1 386 | soft tissue swelling,1 387 | st segment elevation,1 388 | stinging sensation,1 389 | throbbing sensation quality,1 390 | tinnitus,1 391 | titubation,1 392 | todd paralysis,1 393 | tonic seizures,1 394 | transsexual,1 395 | tremor resting,1 396 | underweight,1 397 | unwell,1 398 | urinoma,1 399 | welt,1 400 | -------------------------------------------------------------------------------- /Symptoms_Similarity.py: -------------------------------------------------------------------------------- 1 | 2 | # Data obtained from 3 | 4 | # http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html 5 | 6 | 7 | import pandas as pd 8 | import numpy as np 9 | import re 10 | 11 | # read in the 50-dimensional GloVe vectors 12 | def read_glove_vecs(file): 13 | with open(file, 'r', encoding = 'utf-8') as f: 14 | words = set() 15 | word_to_vec_map = {} 16 | 17 | for line in f: 18 | line = line.strip().split() 19 | word = line[0] 20 | words.add(word) 21 | word_to_vec_map[word] = np.array(line[1:], dtype=np.float64) 22 | 23 | return
words, word_to_vec_map 24 | 25 | words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt') # replace file path with your location for 50-d embeddings 26 | 27 | # for use later on; finds the cosine similarity b/w 2 vectors 28 | def cosine_similarity(x, y): 29 | 30 | # Compute the dot product between x and y 31 | dot = np.dot(x,y) 32 | # Compute the L2 norm of x 33 | norm_x = np.sqrt(np.sum(x**2)) 34 | # Compute the L2 norm of y 35 | norm_y = np.sqrt(np.sum(y**2)) 36 | # Compute and return the cosine similarity 37 | return dot/(norm_x * norm_y) 38 | 39 | 40 | # read in the data from the file 41 | df = pd.read_excel('Disease_Symptoms.xlsx').drop('Count of Disease Occurrence', axis = 1).ffill() 42 | 43 | # some basic preprocessing to get the data into required formats 44 | df.Symptom = df.Symptom.map(lambda x: re.sub(r'^.*_', '', x)) 45 | df.Disease = df.Disease.map(lambda x: re.sub(r'^.*_', '', x)) 46 | 47 | df.Symptom = df.Symptom.map(lambda x: x.lower()) 48 | df.Disease = df.Disease.map(lambda x: x.lower()) 49 | 50 | # makes words like 'pain/swelling' into 'pain swelling' 51 | df.Symptom = df.Symptom.map(lambda x: re.sub(r'(.*)/(.*)', r'\1 \2', x)) 52 | df.Disease = df.Disease.map(lambda x: re.sub(r'(.*)/(.*)', r'\1 \2', x)) 53 | 54 | # gets rid of parenthesised words 55 | df.Symptom = df.Symptom.map(lambda x: re.sub(r'(.*)\(.*\)(.*)', r'\1\2', x)) 56 | df.Disease = df.Disease.map(lambda x: re.sub(r'(.*)\(.*\)(.*)', r'\1\2', x)) 57 | 58 | # gets rid of apostrophes and tokens of the sort '\xa0' 59 | df.Symptom = df.Symptom.map(lambda x: re.sub(r"'", '', x)) 60 | df.Disease = df.Disease.map(lambda x: re.sub(r"'", '', x)) 61 | df.Disease = df.Disease.map(lambda x: re.sub(r'\xa0', ' ', x)) 62 | 63 | # there may be words in the data set that don't have a representation in the 50-d GloVe vectors. 
64 | # Now, new embeddings for such words can be generated, but they would require huge amounts of data 65 | # mapping them to their context words (diseases in this case), and would have to be trained for at least 10000 iterations in order to 66 | # generalise well. And upon inspection, 67 | counts = {} 68 | def remove(x): 69 | for i in x.split(): 70 | if i not in word_to_vec_map: 71 | counts[i] = counts.get(i, 0) + 1 72 | df.Symptom.map(lambda x: remove(x)) 73 | df.Disease.map(lambda x: remove(x)) 74 | 75 | # make the above counts into a dataframe 76 | unrepresented_words = pd.DataFrame() 77 | unrepresented_words['Words'] = list(counts.keys()) 78 | unrepresented_words['No. of Occurences'] = list(counts.values()) 79 | unrepresented_words.to_csv('Unrepresented Words.csv', index = False) 80 | 81 | # reorganises the dataframe by grouping the data by symptoms instead of by diseases 82 | frame = pd.DataFrame(df.groupby(['Symptom', 'Disease']).size()).drop(0, axis = 1) 83 | # the first entry contains only the disease and no symptom, so it is dropped 84 | frame = frame.iloc[1:] 85 | 86 | # set the index of the dataframe as 'Symptom' 87 | frame = frame.reset_index().set_index('Symptom') 88 | 89 | # get the counts of each symptom, ie, how many times it occurs in the data set 90 | counts = {} 91 | for i in frame.index: 92 | counts[i] = counts.get(i, 0) + 1 93 | 94 | # sort the symptoms by their counts in descending order and save it into a dataframe 95 | import operator 96 | sym, ct = zip(*sorted(counts.items(), key = operator.itemgetter(1), reverse = True)) 97 | sym_count = pd.DataFrame() 98 | sym_count['Symptom'] = sym 99 | sym_count['Count'] = ct 100 | sym_count.to_csv('Symptom Counts.csv', index = False) 101 | 102 | # drop the symptoms that have fewer than 6 entries in the data set (each label is dropped once; dropping it again per duplicate index entry raises a KeyError) 103 | frame = frame.drop(index = [s for s, c in counts.items() if c < 6]) 104 | 105 | # extract all the diseases present in the data set and make them into a list, for use later on 106 | lst = [] 107 | frame.Disease.map(lambda x: 
lst.append(x)) 108 | 109 | # For us to train our own word embeddings on top of the existing GloVe representation, we are going to use the skipgram model. 110 | # Each symptom has a disease associated with it, and we use this as the (target word, context word) pair for skipgram generation. 111 | # The 'skipgrams' function in Keras samples an equal no. of context and non-context words for a given word from the distribution, 112 | # and we are going to do the same here. 113 | # First we'll make a list that stores the pair and its corresponding label of 1, if the disease is indeed associated with 114 | # the symptom, and 0 otherwise. 115 | couples_and_labels = [] 116 | 117 | import random 118 | # run through the symptoms 119 | for i in frame.index.unique(): 120 | # make a temporary list of the diseases associated with the symptom (actual context words) 121 | a = list(frame.Disease.loc[i].values) 122 | # loop through the context words 123 | for j in a: 124 | # randomly select a disease that isn't associated with the symptom, to set as a non-context word with label 0, 125 | # by using the XOR operator, which finds the elements that are not common to the 2 sets 126 | non_context = random.choice(list(set(lst) ^ set(a))) 127 | # add labels of 1 and 0 to context and non-context words respectively 128 | couples_and_labels.append((i, j, 1)) 129 | couples_and_labels.append((i, non_context, 0)) 130 | 131 | # the entries in the couples_and_labels list now follow the pattern of 1, 0, 1, 0 for the labels. We shuffle it up. 
132 | b = random.sample(couples_and_labels, len(couples_and_labels)) 133 | # Extract the symptoms, the diseases and the corresponding labels 134 | symptom, disease, label = zip(*b) 135 | 136 | # Transform them into Series to get unique entries in each ('set()' not used as it generates a different order each time 137 | # and the index(number) associated with a word changes each time the program is run) 138 | s1 = pd.Series(list(symptom)) 139 | s2 = pd.Series(list(disease)) 140 | dic = {} 141 | 142 | # Map each word in the symptoms and diseases to a corresponding number that can be fed into Keras 143 | for i,j in enumerate(pd.concat([s1, s2]).unique()): 144 | dic[j] = i 145 | # Now all the symptoms and diseases are represented by a number in the arrays 'symptoms' and 'diseases' 146 | symptoms = np.array(s1.map(dic), dtype = 'int32') 147 | diseases = np.array(s2.map(dic), dtype = 'int32') 148 | 149 | # Make the labels too into an array 150 | labels = np.array(label, dtype = 'int32') 151 | 152 | lst = [] 153 | 154 | # size of the vocabulary ,ie, no. 
of unique words in corpus 155 | vocab_size = len(dic) 156 | 157 | # dimension of word embeddings 158 | vector_dim = 50 159 | 160 | # create an array of zeros of shape (vocab_size, vector_dim) to store the new embedding matrix (word vector representations) 161 | embedding_matrix = np.zeros((vocab_size, vector_dim)) 162 | 163 | # loop through the dictionary of words and corresponding indexes 164 | for word, index in dic.items(): 165 | lst = [] # start a fresh list for every symptom/disease, so vectors of earlier entries don't leak into later ones 166 | for i in word.split(): 167 | lst.append(word_to_vec_map.get(i, np.zeros(vector_dim))) # add the GloVe vector of each constituent word; words missing from GloVe contribute a zero vector 168 | # make an array out of the list 169 | arr = np.array(lst) 170 | # sum the embeddings of all words in the phrase, to get an embedding of the entire phrase 171 | # if none of the words in the phrase has a GloVe vector, the sum is a zero array of shape (50,), 172 | # which is left as it is, just as a precaution 173 | arrsum = arr.sum(axis = 0) 174 | # normalize the values (a zero norm is replaced by 1 to avoid dividing by zero) 175 | arrsum = arrsum/(np.sqrt((arrsum**2).sum()) or 1.0) 176 | # add the embedding to the corresponding word index 177 | embedding_matrix[index,:] = arrsum 178 | 179 | #TRAIN NEW WORD EMBEDDINGS ON CORPUS 180 | 181 | #import necessary keras modules 182 | from keras.preprocessing import sequence 183 | from keras.layers import Input, Embedding, Dot, Reshape, Dense 184 | from keras.models import Model 185 | 186 | # START BUILDING THE KERAS MODEL FOR TRAINING 187 | input_target = Input((1,)) 188 | input_context = Input((1,)) 189 | 190 | # make a Keras embedding layer of shape (vocab_size, vector_dim) and set 'trainable' argument to 'True' 191 | embedding = Embedding(input_dim = vocab_size, output_dim = vector_dim, input_length = 1, name='embedding', trainable = True) 192 | 193 | # load pre-trained weights(embeddings) from 'embedding_matrix' into the Keras embedding layer 194 | embedding.build((None,)) 195 | embedding.set_weights([embedding_matrix]) 196 
| 197 | # run the context and target words through the embedding layer 198 | context = embedding(input_context) 199 | context = Reshape((vector_dim, 1))(context) 200 | 201 | target = embedding(input_target) 202 | target = Reshape((vector_dim, 1))(target) 203 | 204 | # compute the dot product of the context and target words, to find the similarity (dot product is usually a measure of similarity) 205 | dot = Dot(axes = 1)([context, target]) 206 | dot = Reshape((1,))(dot) 207 | # pass it through a 'sigmoid' activation neuron; this is then compared with the value in 'label' generated from the skipgram 208 | out = Dense(1, activation = 'sigmoid')(dot) 209 | 210 | # create model instance; the input order (target, context) matches the (symptoms, diseases) order passed to 'fit' below 211 | model = Model(inputs = [input_target, input_context], outputs = out) 212 | model.compile(loss = 'binary_crossentropy', optimizer = 'adam') 213 | 214 | # fit the model, default batch_size of 32 215 | # running for 25 epochs seems to generate good enough results, although running for more iterations may improve performance further 216 | model.fit(x = [symptoms, diseases], y = labels, epochs = 25) 217 | 218 | # get the new weights (embeddings) after running through keras 219 | new_vecs = model.get_layer('embedding').get_weights()[0] 220 | 221 | # Each time the model is run, it generates a different loss at the end, and consequently, different word embeddings after each run. 
222 | # It is common to save the trained weights once they are seen to be performing well. 223 | # I have saved the weights after 25 epochs and an end loss of 0.232, the screenshot of which I've attached; they can be loaded like this: 224 | 225 | # replace the filename here to try out weights obtained after different numbers of epochs 226 | x = '25_epochs_0.6_similarity_seems_better.npy' 227 | 228 | # load the weights 229 | new_vecs = np.load(x) 230 | 231 | # find the value to which cosine similarity is compared, from the file name 232 | similarity_score = float(re.findall(r'\d+\.\d+', x)[0]) 233 | 234 | # NOTE : the 'similarity_score' (like 0.6 in this case) is a hyperparameter that needs to be selected manually and tuned, to obtain 235 | # best performance 236 | 237 | d = pd.read_csv('Dictionary.csv') 238 | dic = {} 239 | for i in d.index: 240 | dic[d.Key.loc[i]] = d.Values.loc[i] 241 | 242 | # enter the symptom 243 | symp = input('Enter symptom for which similar symptoms are to be found: ') 244 | print('\nThe similar symptoms are: ') 245 | 246 | # loop through the symptoms in the data set and find the symptoms with cosine similarity greater than 'similarity_score' 247 | for i in set(symptom): 248 | if cosine_similarity(new_vecs[dic[i]], new_vecs[dic[symp]]) > similarity_score: 249 | # remove the same symptom from the list of outputs 250 | if i != symp: 251 | print(i) 252 | --------------------------------------------------------------------------------
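The lookup loop at the end of Symptoms_Similarity.py can also be written as a reusable function. The following is a minimal sketch rather than the repository's own code: the function name `similar_symptoms`, its parameters, and the toy vectors are hypothetical, and it assumes an embedding matrix plus a name-to-row dictionary of the kind stored in the '.npy' files and Dictionary.csv.

```python
import numpy as np

def similar_symptoms(query, vectors, index, threshold=0.6):
    """Return (name, score) pairs whose cosine similarity to `query` exceeds `threshold`, best first."""
    if query not in index:
        raise KeyError(f"unknown symptom: {query!r}")
    # normalise every row once, so that a dot product equals the cosine similarity
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    scores = unit @ unit[index[query]]
    hits = [(name, float(scores[row])) for name, row in index.items()
            if name != query and scores[row] > threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# toy 2-dimensional vectors for illustration only, not the trained embeddings
toy_index = {"cough": 0, "productive cough": 1, "tremor": 2}
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
result = similar_symptoms("cough", toy_vectors, toy_index)
```

Because Dictionary.csv indexes diseases as well as symptoms, a real call would probably want to pass only the symptom entries in `index`, so that diseases do not show up among the results.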