├── CustomNERwithSpacy.ipynb
├── README.md
├── devel.tsv
├── test.tsv
├── train.tsv
└── train_dev.tsv


/README.md:
--------------------------------------------------------------------------------
Let us look at how to create a custom Named Entity Recognition (NER) model with spaCy.

Here I will create a clinical named entity recognition model that recognizes disease names in clinical text.

For this I have taken annotated clinical text from the following GitHub repo: https://github.com/dmis-lab/biobert

They provide annotated clinical text here: Named Entity Recognition: (17.3 MB), 8 datasets on biomedical named entity recognition (https://drive.google.com/open?id=1OletxmPYNkz2ltOr9pyT0b0iBtUWxslh)

Once you download and unzip the files you get 8 datasets, each containing the following files: train.tsv, test.tsv, train_dev.tsv and devel.tsv. In these tsv files each word is annotated using the BIO format.


A few lines from train.tsv in the BC5CDR-disease dataset look like:

Selegiline O

induced O

postural B

hypotension I

in O

Parkinson B

' I

s I

disease I

: O

a O

longitudinal O

study O

on O

the O

effects O

of O

drug O

withdrawal O

. O

Each line has the format: word \t label \n

For instance: postural B hypotension I

Here B -> beginning of an entity, I -> inside an entity, and O -> outside any entity.

Let us build a custom named entity (disease) recognition model with spaCy.

The CustomNERwithSpacy Python notebook (CustomNERwithSpacy.ipynb) has the code for training such a model.



This notebook was inspired by: https://aihub.cloud.google.com/p/products%2F2290fc65-0041-4c87-a898-0289f59aa8ba

Prerequisites

spaCy (https://spacy.io/)

matplotlib

Python 3.5 or above



--------------------------------------------------------------------------------
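
To make the `word \t label` BIO format described in the README concrete, here is a minimal sketch of how such a file could be turned into character-offset entity spans of the form `(start, end, label)`, the shape spaCy NER training annotations use. It is not necessarily what the notebook does. The function names `read_bio_sentences` and `bio_to_spans`, the `DISEASE` label, the assumption that a blank line separates sentences, and the choice to join tokens with single spaces are all illustrative assumptions.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]          # (word, BIO tag)
Span = Tuple[int, int, str]      # (start_char, end_char, label)


def read_bio_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Group `word<TAB>tag` lines into sentences.

    Assumption: a blank line marks the end of a sentence.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        word, tag = line.rsplit("\t", 1)
        current.append((word, tag.strip()))
    if current:
        sentences.append(current)
    return sentences


def bio_to_spans(tokens: List[Token], label: str = "DISEASE") -> Tuple[str, List[Span]]:
    """Join tokens with single spaces and collect one span per B/I run.

    `label` is a hypothetical entity name; the tsv files only carry B/I/O.
    """
    words: List[str] = []
    spans: List[Span] = []
    offset = 0                   # character position of the current word
    start = end = None           # bounds of the entity currently being built
    for word, tag in tokens:
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:            # a new B closes the previous entity
                spans.append((start, end, label))
            start, end = offset, offset + len(word)
        elif tag == "I":
            end = offset + len(word)         # extend the open entity
        elif start is not None:              # an O tag closes the open entity
            spans.append((start, end, label))
            start = end = None
        words.append(word)
        offset += len(word) + 1              # +1 for the joining space
    if start is not None:
        spans.append((start, end, label))
    return " ".join(words), spans


if __name__ == "__main__":
    sample = [
        "Selegiline\tO\n", "induced\tO\n", "postural\tB\n", "hypotension\tI\n",
        "in\tO\n", "Parkinson\tB\n", "'\tI\n", "s\tI\n", "disease\tI\n", ":\tO\n",
    ]
    for sentence in read_bio_sentences(sample):
        text, spans = bio_to_spans(sentence)
        for s, e, lab in spans:
            print(lab, repr(text[s:e]))
```

Joining tokens with spaces keeps the offsets easy to compute, but it means the text is the tokenized form (`Parkinson ' s disease`) rather than the original sentence; a real pipeline may want to handle that differently.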