├── .dockerignore
├── LICENSE
├── README.md
├── compute_stats.py
├── dataset
│   ├── README.md
│   └── original
│       ├── celiac
│       │   └── README.md
│       ├── cervix
│       │   └── README.md
│       ├── colon
│       │   └── README.md
│       └── lung
│           └── README.md
├── docker-compose.yml
├── docker-sket_server-config
│   ├── docker_sket_cpu
│   │   └── Dockerfile
│   ├── docker_sket_gpu
│   │   └── Dockerfile
│   └── requirements.txt
├── evaluate_sket.py
├── examples
│   ├── test.xlsx
│   ├── test_multiple_reports.json
│   └── test_single_report.json
├── ground_truth
│   └── README.md
├── manage.py
├── outputs
│   └── README.md
├── requirements.txt
├── run_med_sket.py
├── run_sket.py
├── sket
│   ├── __init__.py
│   ├── negex
│   │   ├── __init__.py
│   │   ├── negation.py
│   │   ├── termsets.py
│   │   └── test.py
│   ├── nerd
│   │   ├── __init__.py
│   │   ├── nerd.py
│   │   ├── normalizer.py
│   │   └── rules
│   │       ├── cin_mappings.txt
│   │       ├── dysplasia_mappings.txt
│   │       └── rules.txt
│   ├── ont_proc
│   │   ├── __init__.py
│   │   ├── ontology
│   │   │   └── examode.owl
│   │   ├── ontology_processing.py
│   │   └── rules
│   │       └── hierarchy_relations.txt
│   ├── rdf_proc
│   │   ├── __init__.py
│   │   └── rdf_processing.py
│   ├── rep_proc
│   │   ├── __init__.py
│   │   ├── report_processing.py
│   │   └── rules
│   │       └── report_fields.txt
│   ├── sket.py
│   └── utils
│       ├── __init__.py
│       └── utils.py
└── sket_server
    ├── sket_rest_app
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-38.pyc
    │   │   ├── __init__.cpython-39.pyc
    │   │   ├── admin.cpython-38.pyc
    │   │   ├── admin.cpython-39.pyc
    │   │   ├── apps.cpython-38.pyc
    │   │   ├── apps.cpython-39.pyc
    │   │   ├── models.cpython-38.pyc
    │   │   ├── models.cpython-39.pyc
    │   │   ├── urls.cpython-38.pyc
    │   │   ├── urls.cpython-39.pyc
    │   │   ├── views.cpython-38.pyc
    │   │   └── views.cpython-39.pyc
    │   ├── admin.py
    │   ├── apps.py
    │   ├── migrations
    │   │   ├── __init__.py
    │   │   └── __pycache__
    │   │       ├── __init__.cpython-38.pyc
    │   │       └── __init__.cpython-39.pyc
    │   ├── models.py
    │   ├── tests.py
    │   ├── urls.py
    │   └── views.py
    └── sket_rest_config
        ├── __init__.py
        ├── __pycache__
        │   ├── __init__.cpython-38.pyc
        │   ├── __init__.cpython-39.pyc
        │   ├── settings.cpython-38.pyc
        │   ├── settings.cpython-39.pyc
        │   ├── urls.cpython-38.pyc
        │   ├── urls.cpython-39.pyc
        │   ├── wsgi.cpython-38.pyc
        │   └── wsgi.cpython-39.pyc
        ├── asgi.py
        ├── config.json
        ├── settings.py
        ├── urls.py
        └── wsgi.py
/.dockerignore:
--------------------------------------------------------------------------------
1 | # ignore outputs and dataset (i.e., volume) when building image
2 | outputs
3 | dataset
4 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 ExaNLP
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # SKET
2 | This repository contains the source code for the Semantic Knowledge Extractor Tool (SKET). SKET is an unsupervised hybrid knowledge extraction system that combines a rule-based expert system with pre-trained machine learning models to extract cancer-related information from pathology reports.
3 |
4 | ## Installation
5 |
6 | CAVEAT: the package has been tested with Python 3.7 and 3.8 on Unix-based and Win64 systems. There are no guarantees that it works with different configurations.
7 |
8 | Clone this repository
9 |
10 | ```bash
11 | git clone https://github.com/ExaNLP/sket.git
12 | ```
13 |
14 | Install all the requirements:
15 |
16 | ```bash
17 | pip install -r requirements.txt
18 | ```
19 |
20 | Then install any ```core``` model from scispacy v0.3.0 (default is ```en_core_sci_sm```):
21 |
22 | ```bash
23 | pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_sm-0.3.0.tar.gz
24 | ```
25 |
26 | The required scispacy models are available at: https://github.com/allenai/scispacy/tree/v0.3.0
27 |
28 | ## Datasets
29 |
30 | Users can go into the ```dataset``` folder and place their datasets within the corresponding use case folders (under ```original```). Use cases are: Celiac Disease (celiac), Cervix Uterine Cancer (cervix), Colon Cancer (colon), and Lung Cancer (lung).
31 |
32 | Datasets can be provided in two formats:
33 |
34 | ### XLS Format
35 |
36 | Users can provide ```.xls``` or ```.xlsx``` files with the first row consisting of column headers (i.e., fields) and the rest of data inputs.
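As a sketch of the expected layout, the first row carries the field names and each subsequent row one report. The snippet below builds such a table with pandas (the field names are illustrative, not required by SKET):

```python
import pandas as pd

# First row of the spreadsheet = column headers (fields);
# every following row = one report. Field names are illustrative.
df = pd.DataFrame([
    {"id": "r1", "diagnosis": "adenocarcinoma con displasia lieve", "materials": "biopsia polipo."},
    {"id": "r2", "diagnosis": "polipo iperplastico", "materials": "biopsia retto."},
])
# Uncomment to write the dataset file (requires openpyxl):
# df.to_excel("dataset/original/colon/reports.xlsx", index=False)
print(list(df.columns))  # ['id', 'diagnosis', 'materials']
```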
37 |
38 | ### JSON Format
39 |
40 | Users can provide ```.json``` files structured in two ways:
41 |
42 | As a dict containing a ```reports``` field consisting of multiple key-value reports;
43 |
44 | ```bash
45 | {'reports': [{k: v, ...}, ...]}
46 | ```
47 |
48 | As a dict containing a single key-value report.
49 |
50 | ```bash
51 | {k: v, ...}
52 | ```
53 |
54 | SKET concatenates data from all the fields before translation. Users can alter this behavior by filling ```./sket/rep_proc/rules/report_fields.txt``` with target fields, one per line. Users can also provide a custom file to SKET, as long as it contains one field per line (more on this below).
55 |
56 | Users can provide special headers that are treated differently from regular text by SKET. These fields are:
57 | - ```id```: when specified, the ```id``` field is used to identify the corresponding report. Otherwise, ```uuid``` is used.
58 | - ```gender```: when specified, the ```gender``` field is used to provide patient information within RDF graphs. Otherwise, ```gender``` is set to None.
59 | - ```age```: when specified, the ```age``` field is used to provide patient information within RDF graphs. Otherwise, ```age``` is set to None.
60 |
61 | ## Dataset Statistics
62 |
63 | Users can compute dataset statistics to understand the distribution of concepts extracted by SKET for each use case. For instance, if a user wants to compute statistics for Colon Cancer, they can run
64 |
65 | ```bash
66 | python compute_stats.py --outputs ./outputs/concepts/refined/colon/*.json --use_case colon
67 | ```
68 |
69 | ## Pretrain
70 |
71 | SKET can be deployed with different pretrained models, i.e., fastText and BERT. In our experiments, we employed the [BioWordVec](https://github.com/ncbi-nlp/BioSentVec) fastText model and the [Bio + Clinical BERT model](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT).
72 | BioWordVec can be downloaded from https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioWordVec_PubMed_MIMICIII_d200.bin
73 | The Bio + Clinical BERT model can be automatically downloaded at run time by setting the ```biobert``` SKET parameter to ```emilyalsentzer/Bio_ClinicalBERT```.
74 |
75 | Users can pass different pretrained models depending on their preferences.
76 |
77 |
78 | ## Usage
79 |
80 | Users can deploy SKET using ```run_med_sket.py```. We release within ```./examples``` three sample datasets that can be used as toy examples to play with SKET. SKET can be deployed with different configurations and using different combinations of matching models.
81 |
82 | Furthermore, SKET exposes a ```threshold``` parameter that controls the harshness of the entity linking component. The higher the ```threshold```, the more precise the model -- at the expense of recall -- and vice versa. Users can tune this parameter to obtain the desired trade-off between precision and recall. Note that ```threshold``` must always be lower than or equal to the number of considered matching models; otherwise, the entity linking component does not return any concept.
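The interplay between ```threshold``` and the number of matching models can be pictured as score pooling. The sketch below is illustrative only, not SKET's actual implementation: each enabled model contributes a similarity score, the scores are summed, and a candidate concept is linked only if the sum reaches the threshold.

```python
# Illustrative sketch, NOT SKET's actual code: each enabled matching
# model scores a candidate concept in [0, 1]; the candidate is linked
# only when the pooled score reaches the threshold.
def link_concept(model_scores, thr):
    """model_scores: one similarity score per enabled matching model."""
    return sum(model_scores) >= thr

# Two models enabled (e.g. biow2v + str_match): thr must be <= 2,
# otherwise no candidate can ever be linked.
print(link_concept([0.9, 0.95], thr=1.8))  # True  (high agreement across models)
print(link_concept([0.6, 0.7], thr=1.8))   # False (pooled score 1.3 < 1.8)
print(link_concept([1.0, 1.0], thr=2.5))   # False: thr exceeds the number of models
```

This makes concrete why ```threshold``` must not exceed the number of enabled models: even perfect agreement cannot pass a larger threshold.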
83 |
84 | The available matching models, in form of SKET parameters, are:
85 | - ```biow2v```: the ScispaCy pretrained word embeddings. Set this parameter to ```True``` to use them.
86 | - ```biofast```: the fastText model. Set this parameter to ```/path/to/fastText/file``` to use fastText.
87 | - ```biobert```: the BERT model. Set this parameter to ```bert-name``` to use BERT (see https://huggingface.co/transformers/pretrained_models.html for model IDs).
88 | - ```str_match```: the Gestalt Pattern Matching (GPM) model. Set this parameter to ```True``` to use GPM.
89 |
90 | When using BERT, users can also set the ```gpu``` parameter to the corresponding GPU number to speed up SKET execution.
91 |
92 | For instance, a user can run the following script to obtain concepts, labels, and RDF graphs on the ```test.xlsx``` sample dataset:
93 |
94 | ```bash
95 | python run_med_sket.py
96 | --src_lang it
97 | --use_case colon
98 | --spacy_model en_core_sci_sm
99 | --w2v_model
100 | --string_model
101 | --thr 2.0
102 | --store
103 | --dataset ./examples/test.xlsx
104 | ```
105 |
106 | or, if a user also wants to use BERT with GPU support, they can run the following script:
107 |
108 | ```bash
109 | python run_med_sket.py
110 | --src_lang it
111 | --use_case colon
112 | --spacy_model en_core_sci_sm
113 | --w2v_model
114 | --string_model
115 | --bert_model emilyalsentzer/Bio_ClinicalBERT
116 | --gpu 0
117 | --thr 2.5
118 | --store
119 | --dataset ./examples/test.xlsx
120 | ```
121 |
122 | In both cases, we set the ```src_lang``` to ```it``` as the source language of reports is Italian. Therefore, SKET needs to translate reports from Italian to English before performing information extraction.
123 |
124 | ## Docker
125 |
126 | SKET can also be deployed as a Docker container -- thus avoiding the need to install its dependencies directly on the host machine. Two Docker images can be built: ```sket_cpu``` and ```sket_gpu```.
127 | For ```sket_gpu```, NVIDIA drivers have to be already installed within the host machine. Users can refer to NVIDIA [user-guide](https://docs.nvidia.com/deeplearning/frameworks/user-guide/#nvcontainers) for more information.
128 |
129 | Instructions on how to build and run the sket images are reported below. If you already have [docker](https://docs.docker.com/engine/reference/commandline/docker/) installed on your machine, you can skip the first step.
130 |
131 | 1) Install Docker. In this regard, check out the correct [installation procedure](https://docs.docker.com/get-docker/) for your platform.
132 |
133 | 2) Install docker-compose. In this regard, check the correct [installation procedure](https://docs.docker.com/compose/install/) for your platform.
134 |
135 | 3) Check that the Docker daemon (i.e., ```dockerd```) is up and running.
136 |
137 | 4) Download or clone the [sket](https://github.com/ExaNLP/sket) repository.
138 |
139 | 5) In ```sket_server/sket_rest_config```, the ```config.json``` file allows you to configure the sket instance. Edit this file to set the following parameters: ```w2v_model```, ```fasttext_model```, ```bert_model```, ```string_model```, ```gpu```, and ```thr```, where ```thr``` stands for *similarity threshold* and its default value is 0.9.
140 |
141 | 6) Depending on the Docker image of interest, follow one of the two procedures below:
142 | 6a) SKET CPU-only: from the [sket](https://github.com/ExaNLP/sket/) root folder, type ```docker-compose run --service-ports sket_cpu```
143 | 6b) SKET GPU-enabled: from the [sket](https://github.com/ExaNLP/sket/) root folder, type ```docker-compose run --service-ports sket_gpu```
144 |
145 | 7) When the image is ready, the sket server runs at http://0.0.0.0:8000 if you run ```sket_cpu```, and at http://0.0.0.0:8001 if you run ```sket_gpu```.
146 |
147 | 8) The annotation of medical reports can be performed with two types of POST request:
148 | 8a) If you want to store the annotations in the ```outputs``` directory, the URL to make the request to is ```http://0.0.0.0:8000/annotate/<use_case>/<language>```, where ```<use_case>``` and ```<language>``` are the use case and the language (identified using its [ISO 639-1 Code](https://www.loc.gov/standards/iso639-2/php/code_list.php)) of your reports, respectively.
149 | Request example:
150 | ```bash
151 | curl -H "Content-Type: multipart/form-data" -F "data=@path/to/examples/test.xlsx" http://0.0.0.0:8000/annotate/colon/it
152 | ```
153 |
154 | where ```path/to/examples``` is the path to the examples folder. With this type of request, labels and concepts are stored in ```.json``` files, while graphs are stored in ```.json```, ```.n3```, ```.ttl```, and ```.trig``` files.
155 | If you want to store exclusively one graph format, append it after the desired language: ```/trig``` to store graphs in ```.trig``` format, ```/turtle``` to store graphs in ```.ttl``` format, and ```/n3``` to store graphs in ```.n3``` format.
156 | Request example:
157 | ```bash
158 | curl -H "Content-Type: multipart/form-data" -F "data=@path/to/examples/test.xlsx" http://0.0.0.0:8000/annotate/colon/it/turtle
159 | ```
160 |
161 | where ```path/to/examples``` is the path to the examples folder.
162 | 8b) If you want to use the labels, the concepts, or the graphs returned by sket without saving them, the URL to make the request to is ```http://0.0.0.0:8000/annotate/<use_case>/<language>/<output>```, where ```<use_case>``` and ```<language>``` are the use case and the language (identified using its [ISO 639-1 Code](https://www.loc.gov/standards/iso639-2/php/code_list.php)) of your reports, respectively, and ```<output>``` is one of ```labels```, ```concepts```, or ```graphs```.
163 | Request example:
164 | ```bash
165 | curl -H "Content-Type: multipart/form-data" -F "data=@path/to/examples/test.xlsx" http://0.0.0.0:8000/annotate/colon/it/labels
166 | ```
167 | where ```path/to/examples``` is the path to the examples folder.
168 | If you want your request to return a graph, the request must also include the graph format. Hence, the URL becomes ```http://0.0.0.0:8000/annotate/<use_case>/<language>/graphs/<format>```, where ```<format>``` is one of ```turtle```, ```n3```, or ```trig```.
169 | ```bash
170 | curl -H "Content-Type: multipart/form-data" -F "data=@path/to/examples/test.xlsx" http://0.0.0.0:8000/annotate/colon/it/graphs/turtle
171 | ```
172 | where ```path/to/examples``` is the path to the examples folder.
173 |
174 | 9) If you want to embed your medical reports in the request, change the content type and set ```-H "Content-Type: application/json"```. Then, instead of ```-F "data=@..."```, put ```-d '{"reports": [{}, ..., {}]}'``` if you have multiple reports, or ```-d '{"k": "v", ...}'``` if you have a single report.
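The same JSON request can also be issued from Python's standard library instead of curl. The snippet below is a sketch: it only builds the request (the actual send, which assumes a running ```sket_cpu``` server on port 8000, is left commented out).

```python
import json
import urllib.request

# Single-report payload, mirroring examples/test_single_report.json.
payload = json.dumps({
    "diagnosis": "adenocarcinoma con displasia lieve, focalmente severa.",
    "materials": "biopsia polipo.",
    "id": "test",
}).encode("utf-8")

# POST to the labels endpoint for the colon use case, Italian reports.
req = urllib.request.Request(
    "http://0.0.0.0:8000/annotate/colon/it/labels",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With a running sket_cpu container:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```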
175 |
176 | 10) If you want to build the images again, from the project folder type ```docker-compose down --rmi local```. Note that this command removes all the images created (both CPU and GPU); to remove only one of them, see the [docker image documentation](https://docs.docker.com/engine/reference/commandline/image/). Finally, repeat steps 5-8.
177 |
178 | Regarding SKET GPU-enabled, the corresponding Dockerfile (located at ```docker-sket_server-config/docker_sket_gpu/Dockerfile```) is based on the ```nvidia/cuda:11.0-devel``` image. Users are encouraged to change the NVIDIA/CUDA image within the Dockerfile depending on the NVIDIA drivers installed in their host machine. NVIDIA images can be found [here](https://hub.docker.com/r/nvidia/cuda/tags?page=1&ordering=last_updated).
179 |
180 | ## Cite
181 |
182 | If you use or extend our work, please cite the following:
183 |
184 | ```
185 | @article{jpi_sket-2022,
186 | title = "Empowering Digital Pathology Applications through Explainable Knowledge Extraction Tools",
187 | author = "S. Marchesin and F. Giachelle and N. Marini and M. Atzori and S. Boytcheva and G. Buttafuoco and F. Ciompi and G. M. Di Nunzio and F. Fraggetta and O. Irrera and H. Müller and T. Primov and S. Vatrano and G. Silvello",
188 | journal = "Journal of Pathology Informatics",
189 | year = "2022",
190 | url = "https://www.sciencedirect.com/science/article/pii/S2153353922007337",
191 | doi = "https://doi.org/10.1016/j.jpi.2022.100139",
192 | pages = "100139"
193 | }
194 | ```
195 |
196 |
197 | ```
198 | @article{npj_dig_med-2022,
199 | title = "Unleashing the potential of digital pathology data by training computer-aided diagnosis models without human annotations",
200 | author = "N. Marini and S. Marchesin and S. Otálora and M. Wodzinski and A. Caputo and M. van Rijthoven and W. Aswolinskiy and J. M. Bokhorst and D. Podareanu and E. Petters and S. Boytcheva and G. Buttafuoco and S. Vatrano and F. Fraggetta and J. der Laak and M. Agosti and F. Ciompi and G. Silvello and H. Müller and M. Atzori",
201 | journal = "npj Digital Medicine",
202 | year = "2022",
203 | url = "http://dx.doi.org/10.1038/s41746-022-00635-4",
204 | doi = "10.1038/s41746-022-00635-4",
205 | volume = "5",
206 | number = "1",
207 | pages = "1--18"
208 | }
209 | ```
210 |
--------------------------------------------------------------------------------
/compute_stats.py:
--------------------------------------------------------------------------------
1 | import json
2 | import glob
3 | import argparse
4 | import numpy as np
5 |
6 | parser = argparse.ArgumentParser()
7 | parser.add_argument('--outputs', default='./outputs/concepts/refined/aoec/colon/*.json', type=str, help='SKET results file.')
8 | parser.add_argument('--use_case', default='colon', choices=['colon', 'cervix', 'lung', 'celiac'], help='Considered use-case.')
9 | args = parser.parse_args()
10 |
11 |
12 | def main():
13 | # read SKET results
14 | if '*.json' == args.outputs.split('/')[-1]: # read files
15 | # read file paths
16 | rsfps = glob.glob(args.outputs)
17 | # set dict
18 | rs = {}
19 | for rsfp in rsfps:
20 | with open(rsfp, 'r') as rsf:
21 | rs.update(json.load(rsf))
22 | else: # read file
23 | with open(args.outputs, 'r') as rsf:
24 | rs = json.load(rsf)
25 |
26 | stats = []
27 | # loop over reports and store size
28 | for rid, rdata in rs.items():
29 | stats.append(sum([len(sem_data) for sem_cat, sem_data in rdata.items()]))
30 | # convert into numpy
31 | stats = np.array(stats)
32 | print('size: {}'.format(np.size(stats)))
33 | print('max: {}'.format(np.max(stats)))
34 | print('min: {}'.format(np.min(stats)))
35 | print('mean: {}'.format(np.mean(stats)))
36 | print('std: {}'.format(np.std(stats)))
37 | print('tot: {}'.format(np.sum(stats)))
38 |
39 |
40 | if __name__ == "__main__":
41 | main()
42 |
--------------------------------------------------------------------------------
/dataset/README.md:
--------------------------------------------------------------------------------
1 | # Datasets
2 |
3 | Please use this folder to store the datasets to process with SKET.
4 |
--------------------------------------------------------------------------------
/dataset/original/celiac/README.md:
--------------------------------------------------------------------------------
1 | # Celiac datasets
2 |
3 | Put here datasets containing Celiac Disease pathology reports.
4 |
--------------------------------------------------------------------------------
/dataset/original/cervix/README.md:
--------------------------------------------------------------------------------
1 | # Cervix datasets
2 |
3 | Put here datasets containing Cervix Uterine Cancer pathology reports.
4 |
--------------------------------------------------------------------------------
/dataset/original/colon/README.md:
--------------------------------------------------------------------------------
1 | # Colon datasets
2 |
3 | Put here datasets containing Colon Cancer pathology reports.
4 |
--------------------------------------------------------------------------------
/dataset/original/lung/README.md:
--------------------------------------------------------------------------------
1 | # Lung datasets
2 |
3 | Put here datasets containing Lung Cancer pathology reports.
4 |
--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
1 | version: "2.3"
2 |
3 | services:
4 | sket_cpu:
5 |
6 | build:
7 | context: .
8 | dockerfile: ./docker-sket_server-config/docker_sket_cpu/Dockerfile
9 |
10 | volumes:
11 | - .:/code
12 | ports:
13 | - "8000:8000"
14 | command: bash -c 'python manage.py runserver 0.0.0.0:8000'
15 |
16 |
17 | sket_gpu:
18 | runtime: nvidia
19 | environment:
20 | - NVIDIA_VISIBLE_DEVICES=all
21 |
22 | build:
23 | context: .
24 | dockerfile: ./docker-sket_server-config/docker_sket_gpu/Dockerfile
25 |
26 | volumes:
27 | - .:/code
28 | ports:
29 | - "8001:8001"
30 | command: bash -c 'python3 manage.py runserver 0.0.0.0:8001'
--------------------------------------------------------------------------------
/docker-sket_server-config/docker_sket_cpu/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.8-buster
2 | ENV PYTHONUNBUFFERED=1
3 | WORKDIR /code
4 | COPY ./docker-sket_server-config/requirements.txt /code/
5 | RUN pip install --no-cache-dir -r requirements.txt
6 | RUN pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_sm-0.3.0.tar.gz
7 | COPY . /code/
8 |
9 |
--------------------------------------------------------------------------------
/docker-sket_server-config/docker_sket_gpu/Dockerfile:
--------------------------------------------------------------------------------
1 | # set nvidia version
2 | FROM nvidia/cuda:11.0-devel
3 |
4 | #set up environment
5 | RUN apt-get update && apt-get install --no-install-recommends --no-install-suggests -y curl
6 | RUN apt-get install -y unzip
7 | RUN apt-get -y install python3.8
8 | RUN apt-get -y install python3-pip
9 |
10 | # set work directory
11 | WORKDIR /code
12 | # copy requirements in work directory
13 | COPY ./docker-sket_server-config/requirements.txt /code/
14 | # install requirements and scispacy model
15 | RUN pip install --no-cache-dir -r requirements.txt
16 | RUN pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_sm-0.3.0.tar.gz
17 |
18 | # copy code and config files within work directory
19 | COPY . /code/
20 | # run sket
21 |
22 |
--------------------------------------------------------------------------------
/docker-sket_server-config/requirements.txt:
--------------------------------------------------------------------------------
1 | Django>=3.0,<4.0
2 | Owlready2==0.26
3 | negspacy==0.1.9
4 | pandas
5 | torch==1.7.1
6 | numpy
7 | tqdm==4.55.0
8 | rdflib==5.0.0
9 | spacy==2.3.5
10 | textdistance==4.2.0
11 | transformers==4.2.2
12 | roman==3.3
13 | fasttext==0.9.2
14 | pytest==6.2.4
15 | scikit_learn==0.24.2
16 | sentencepiece
17 | openpyxl
18 | djangorestframework
19 | pyparsing==2.4.7
20 |
--------------------------------------------------------------------------------
/evaluate_sket.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import json
3 | import glob
4 | import os
5 | import argparse
6 |
7 | from sklearn.metrics import hamming_loss, accuracy_score, classification_report
8 |
9 |
10 | parser = argparse.ArgumentParser()
11 | parser.add_argument('--gt', default='./ground_truth/celiac/aoec/celiac_labels_allDS.json', type=str, help='Ground truth file.')
12 | parser.add_argument('--outputs', default='./outputs/labels/aoec/celiac/*.json', type=str, help='SKET results file.')
13 | parser.add_argument('--use_case', default='celiac', choices=['colon', 'cervix', 'lung', 'celiac'], help='Considered use-case.')
14 | parser.add_argument('--hospital', default='aoec', choices=['aoec', 'radboud'], help='Considered hospital.')
15 | parser.add_argument('--debug', default=False, action='store_true', help='Whether to use evaluation for debugging purposes.')
16 | args = parser.parse_args()
17 |
18 | label2class = {
19 | 'cervix': {
20 | 'Normal glands': 'glands_norm',
21 | 'Normal squamous': 'squamous_norm',
22 | 'Cancer - squamous cell carcinoma in situ': 'cancer_scc_insitu',
23 | 'Low grade dysplasia': 'lgd',
24 | 'Cancer - squamous cell carcinoma invasive': 'cancer_scc_inv',
25 | 'High grade dysplasia': 'hgd',
26 | 'Koilocytes': 'koilocytes',
27 | 'Cancer - adenocarcinoma invasive': 'cancer_adeno_inv',
28 | 'Cancer - adenocarcinoma in situ': 'cancer_adeno_insitu',
29 | 'HPV infection present': 'hpv'
30 | },
31 | 'colon': {
32 | 'Hyperplastic polyp': 'hyperplastic',
33 | 'Cancer': 'cancer',
34 | 'Adenomatous polyp - high grade dysplasia': 'hgd',
35 | 'Adenomatous polyp - low grade dysplasia': 'lgd',
36 | 'Non-informative': 'ni'
37 | },
38 | 'lung': {
39 | 'No cancer': 'no_cancer',
40 | 'Cancer - non-small cell cancer, adenocarcinoma': 'cancer_nscc_adeno',
41 | 'Cancer - non-small cell cancer, squamous cell carcinoma': 'cancer_nscc_squamous',
42 | 'Cancer - small cell cancer': 'cancer_scc',
43 | 'Cancer - non-small cell cancer, large cell carcinoma': 'cancer_nscc_large'
44 | },
45 | 'celiac': {
46 | 'Normal': 'normal',
47 | 'Celiac disease': 'celiac_disease',
48 | 'Non-specific duodenitis': 'duodenitis',
49 | }
50 | }
51 |
52 |
53 | def main():
54 | # create path for debugging
55 | debug_path = './logs/debug/' + args.hospital + '/' + args.use_case + '/'
56 | os.makedirs(os.path.dirname(debug_path), exist_ok=True)
57 |
58 | # read ground-truth
59 | with open(args.gt, 'r') as gtf:
60 | ground_truth = json.load(gtf)
61 |
62 | gt = {}
63 | # prepare ground-truth for evaluation
64 | if args.hospital == 'aoec' or args.use_case == 'celiac':
65 | ground_truth = ground_truth['groundtruths']
66 | else:
67 | ground_truth = ground_truth['ground_truth']
68 | for data in ground_truth:
69 | if args.hospital == 'aoec' or args.use_case == 'celiac':
70 | rid = data['id_report']
71 | else:
72 | rid = data['report_id_not_hashed']
73 |
74 | if len(rid.split('_')) == 3 and args.hospital == 'aoec': # contains codeint info not present within new processed reports
75 | rid = rid.split('_')
76 | rid = rid[0] + '_' + rid[2]
77 |
78 | gt[rid] = {label2class[args.use_case][label]: 0 for label in label2class[args.use_case].keys()}
79 | for datum in data['labels']:
80 | label = label2class[args.use_case][datum['label']]
81 | if label in gt[rid]:
82 | gt[rid][label] = 1
83 | # gt name
84 | gt_name = args.gt.split('/')[-1].split('.')[0]
85 |
86 | # read SKET results
87 | if '*.json' == args.outputs.split('/')[-1]: # read files
88 | # read file paths
89 | rsfps = glob.glob(args.outputs)
90 | # set dict
91 | rs = {}
92 | for rsfp in rsfps:
93 | with open(rsfp, 'r') as rsf:
94 | rs.update(json.load(rsf))
95 | else: # read file
96 | with open(args.outputs, 'r') as rsf:
97 | rs = json.load(rsf)
98 |
99 | sket = {}
100 | # prepare SKET results for evaluation
101 | for rid, rdata in rs.items():
102 | if args.use_case == 'colon' and args.hospital == 'aoec' and '2ndDS' in args.gt:
103 | rid = rid.split('_')[0]
104 | if args.hospital == 'radboud':
105 | sket[rid] = rdata['labels']
106 | else:
107 | sket[rid] = rdata
108 |
109 | # fix class order to avoid inconsistencies
110 | rids = list(sket.keys())
111 | classes = list(sket[rids[0]].keys())
112 | if args.use_case == 'celiac':
113 | classes.remove('inconclusive')
114 |
115 | # obtain ground-truth and SKET scores
116 | gt_scores = []
117 | sket_scores = []
118 |
119 | if args.debug: # open output for debugging
120 | debugf = open(debug_path + gt_name + '.txt', 'w+')
121 |
122 | for rid in gt.keys():
123 | gt_rscores = []
124 | sket_rscores = []
125 | if rid not in sket:
126 | print('skipped gt record: {}'.format(rid))
127 | continue
128 | if args.debug:
129 | first = True
130 | for c in classes:
131 | #if c != 'inconclusive':
132 | gt_rscores.append(gt[rid][c])
133 | sket_rscores.append(sket[rid][c])
134 | if args.debug: # perform debugging
135 | if gt[rid][c] != sket[rid][c]: # store info for debugging purposes
136 |                     if first: # first occurrence
137 | debugf.write('\nReport ID: {}\n'.format(rid))
138 | first = False
139 | debugf.write(c + ': gt = {}, sket = {}\n'.format(gt[rid][c], sket[rid][c]))
140 | gt_scores.append(gt_rscores)
141 | sket_scores.append(sket_rscores)
142 |
143 | if args.debug: # close output for debugging
144 | debugf.close()
145 |
146 | # convert to numpy
147 | gt_scores = np.array(gt_scores)
148 | sket_scores = np.array(sket_scores)
149 |
150 | # compute evaluation measures
151 | print('Compute evaluation measures')
152 |
153 | # exact match accuracy & hamming loss
154 | print("Accuracy (exact match): {}".format(accuracy_score(gt_scores, sket_scores)))
155 | print("Hamming loss: {}\n".format(hamming_loss(gt_scores, sket_scores)))
156 |
157 | # compute classification report
158 | print("Classification report:")
159 | print(classification_report(y_true=gt_scores, y_pred=sket_scores, target_names=classes))
160 |
161 |
162 | if __name__ == "__main__":
163 | main()
164 |
--------------------------------------------------------------------------------
/examples/test.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ExaNLP/sket/d9a3fcc42d5f3671dcb2ac6597ea663b9b259433/examples/test.xlsx
--------------------------------------------------------------------------------
/examples/test_multiple_reports.json:
--------------------------------------------------------------------------------
1 | {"reports": [{"diagnosis": "adenocarcinoma con displasia lieve, focalmente severa. Risultati ottenuti con biopsia al colon.", "materials": "biopsia polipo."}, {"diagnosis": "polipo iperplastico con displasia focalmente severa", "materials": "biopsia retto."}]}
--------------------------------------------------------------------------------
/examples/test_single_report.json:
--------------------------------------------------------------------------------
1 | {"diagnosis": "adenocarcinoma con displasia lieve, focalmente severa. Risultati ottenuti con biopsia al colon.", "materials": "biopsia polipo.", "age": 5, "gender": "M", "id": "test"}
--------------------------------------------------------------------------------
/ground_truth/README.md:
--------------------------------------------------------------------------------
1 | # Ground truth
2 |
3 | Please put here the ground truth used to evaluate SKET.
4 |
--------------------------------------------------------------------------------
/manage.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | """Django's command-line utility for administrative tasks."""
3 | import os
4 | import sys
5 |
6 |
7 | def main():
8 | """Run administrative tasks."""
9 | os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'sket_server.sket_rest_config.settings')
10 | try:
11 | from django.core.management import execute_from_command_line
12 | except ImportError as exc:
13 | raise ImportError(
14 | "Couldn't import Django. Are you sure it's installed and "
15 | "available on your PYTHONPATH environment variable? Did you "
16 | "forget to activate a virtual environment?"
17 | ) from exc
18 | execute_from_command_line(sys.argv)
19 |
20 |
21 | if __name__ == '__main__':
22 | main()
23 |
--------------------------------------------------------------------------------
/outputs/README.md:
--------------------------------------------------------------------------------
1 | # Outputs
2 |
3 | This directory contains the different outputs generated by SKET, namely:
4 | 1. Concepts
5 | 2. Labels
6 | 3. Graphs
7 |
8 | Each directory is generated at run-time during the first execution of SKET.
9 | Each output is contained within the corresponding directory.
10 |
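11 | As a sketch, after a first run the layout may look like the following (the subdirectory names are assumptions based on the list above, not guaranteed by SKET):
12 |
13 | ```
14 | outputs/
15 | ├── concepts/   # extracted concepts
16 | ├── labels/     # predicted labels
17 | └── graphs/     # serialized RDF graphs (n3, trig, turtle)
18 | ```
19 |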
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | Owlready2==0.26
2 | negspacy==0.1.9
3 | pandas==1.0.4
4 | torch==1.7.1
5 | numpy==1.19.5
6 | tqdm==4.55.0
7 | rdflib==5.0.0
8 | spacy==2.3.5
9 | textdistance==4.2.0
10 | transformers==4.2.2
11 | roman==3.3
12 | fasttext==0.9.2
13 | pytest==6.2.4
14 | scikit_learn==0.24.2
15 | sentencepiece==0.1.91
16 | openpyxl==3.0.7
17 | xlrd==2.0.1
18 | pyparsing==2.4.7
19 |
--------------------------------------------------------------------------------
/run_med_sket.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import warnings
3 |
4 | from sket.sket import SKET
5 |
6 | warnings.filterwarnings("ignore", message=r"\[W008\]", category=UserWarning)
7 |
8 | parser = argparse.ArgumentParser()
9 | parser.add_argument('--src_lang', default='en', type=str, help='Considered source language.')
10 | parser.add_argument('--use_case', default='celiac', choices=['colon', 'cervix', 'lung', 'celiac'], help='Considered use-case.')
11 | parser.add_argument('--spacy_model', default='en_core_sci_sm', type=str, help='Considered NLP spacy model.')
12 | parser.add_argument('--w2v_model', default=True, action='store_true', help='Considered word2vec model.')
13 | parser.add_argument('--fasttext_model', default=None, type=str, help='File path for FastText model.')
14 | parser.add_argument('--bert_model', default=None, type=str, help='Considered BERT model.')
15 | parser.add_argument('--string_model', default=True, action='store_true', help='Considered string matching model.')
16 | parser.add_argument('--gpu', default=None, type=int, help='Considered GPU device. If not specified (default to None), use CPU instead.')
17 | parser.add_argument('--thr', default=1.8, type=float, help='Similarity threshold.')
18 | parser.add_argument('--store', default=True, action='store_true', help='Whether to store concepts, labels, and graphs.')
19 | parser.add_argument('--rdf_format', default='all', choices=['n3', 'trig', 'turtle', 'all'], help='RDF format used for graph serialization. If "all" is specified, serialize with all three formats.')
20 | parser.add_argument('--raw', default=False, action='store_true', help='Whether to stop after concept extraction (raw) instead of running the full pipeline.')
21 | parser.add_argument('--debug', default=False, action='store_true', help='Whether to use flags for debugging.')
22 | parser.add_argument('--preprocess', default=True, action='store_true', help='Whether to preprocess input data or not.')
23 | parser.add_argument('--dataset', default=None, type=str, help='Dataset file path.')
24 | args = parser.parse_args()
25 |
26 |
27 | def main():
28 | # set SKET
29 | sket = SKET(args.use_case, args.src_lang, args.spacy_model, args.w2v_model, args.fasttext_model, args.bert_model, args.string_model, args.gpu)
30 |
31 | if args.dataset: # use dataset from file path
32 | dataset = args.dataset
33 | else: # use sample "stream" dataset
34 | dataset = {
35 |             'text': 'polyp 40 cm: tubular adenoma with moderate dysplasia.',
36 | 'gender': 'F',
37 | 'age': 56,
38 | 'id': 'test_colon'
39 | }
40 |
41 | # use SKET pipeline to extract concepts, labels, and graphs from dataset
42 | sket.med_pipeline(dataset, args.preprocess, args.src_lang, args.use_case, args.thr, args.store, args.rdf_format, args.raw, args.debug)
43 |
44 | if args.raw:
45 | print('processed data up to concepts.')
46 | else:
47 | print('full pipeline.')
48 |
49 |
50 | if __name__ == "__main__":
51 | main()
52 |
--------------------------------------------------------------------------------
/run_sket.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import warnings
3 |
4 | from sket.sket import SKET
5 |
6 | warnings.filterwarnings("ignore", message=r"\[W008\]", category=UserWarning)
7 |
8 | parser = argparse.ArgumentParser()
9 | parser.add_argument('--dataset', default='./dataset/original/celiac/aoec/ExaMode_0thDS_AOEC_Celiac.xlsx', type=str, help='Dataset file.')
10 | parser.add_argument('--sheet', default='Sheet 1', type=str, help='Considered dataset sheet.')
11 | parser.add_argument('--header', default=0, type=int, help='Header row within dataset.')
12 | parser.add_argument('--ver', default=2, type=str, help='Considered versioning for operations.')
13 | parser.add_argument('--use_case', default='celiac', choices=['colon', 'cervix', 'lung', 'celiac'], help='Considered use-case.')
14 | parser.add_argument('--hospital', default='aoec', choices=['aoec', 'radboud'], help='Considered hospital.')
15 | parser.add_argument('--spacy_model', default='en_core_sci_sm', type=str, help='Considered NLP spacy model.')
16 | parser.add_argument('--w2v_model', default=True, action='store_true', help='Considered word2vec model.')
17 | parser.add_argument('--fasttext_model', default=None, type=str, help='File path for FastText model.')
18 | parser.add_argument('--bert_model', default=None, type=str, help='Considered BERT model.')
19 | parser.add_argument('--string_model', default=True, action='store_true', help='Considered string matching model.')
20 | parser.add_argument('--gpu', default=None, type=int, help='Considered GPU device. If not specified (default to None), use CPU instead.')
21 | parser.add_argument('--thr', default=1.8, type=float, help='Similarity threshold.')
22 | parser.add_argument('--raw', default=False, action='store_true', help='Whether to return concepts within semantic areas (deployment) or mentions+concepts (debugging)')
23 | parser.add_argument('--debug', default=False, action='store_true', help='Whether to use flags for debugging.')
24 | args = parser.parse_args()
25 |
26 |
27 | def main():
28 | # set source language based on hospital
29 | if args.hospital == 'aoec':
30 | src_lang = 'it'
31 | elif args.hospital == 'radboud':
32 | src_lang = 'nl'
33 |     else:  # unreachable given argparse choices, kept as a safeguard
34 |         print('Input hospital does not belong to the available ones.\nPlease consider either "aoec" or "radboud" as hospital.')
35 |         raise ValueError('unsupported hospital')
36 | # set SKET
37 | sket = SKET(args.use_case, src_lang, args.spacy_model, args.w2v_model, args.fasttext_model, args.bert_model, args.string_model, args.gpu)
38 |
39 | # use SKET pipeline to extract concepts, labels, and graphs from args.dataset
40 | sket.exa_pipeline(args.dataset, args.sheet, args.header, args.ver, args.use_case, args.hospital, args.thr, args.raw, args.debug)
41 |
42 |
43 | if __name__ == "__main__":
44 | main()
45 |
--------------------------------------------------------------------------------
/sket/__init__.py:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/sket/negex/__init__.py:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/sket/negex/negation.py:
--------------------------------------------------------------------------------
1 | from spacy.tokens import Token, Doc, Span
2 | from spacy.matcher import PhraseMatcher
3 | import logging
4 |
5 | from negspacy.termsets import LANGUAGES
6 |
7 |
8 | class Negex:
9 | """
10 | A spaCy pipeline component which identifies negated tokens in text.
11 |
12 |     Based on: NegEx - A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries
13 | Chapman, Bridewell, Hanbury, Cooper, Buchanan
14 |
15 | Parameters
16 | ----------
17 | nlp: object
18 | spaCy language object
19 | ent_types: list
20 | list of entity types to negate
21 | language: str
22 | language code, if using default termsets (e.g. "en" for english)
23 | extension_name: str
24 | defaults to "negex"; whether entity is negated is then available as ent._.negex
25 | pseudo_negations: list
26 | list of phrases that cancel out a negation, if empty, defaults are used
27 | preceding_negations: list
28 | negations that appear before an entity, if empty, defaults are used
29 | following_negations: list
30 | negations that appear after an entity, if empty, defaults are used
31 | termination: list
32 | phrases that "terminate" a sentence for processing purposes such as "but". If empty, defaults are used
33 |
34 | """
35 |
36 | def __init__(
37 | self,
38 | nlp,
39 | language="en_clinical",
40 | ent_types=list(),
41 | extension_name="negex",
42 | pseudo_negations=list(),
43 | preceding_negations=list(),
44 | following_negations=list(),
45 | termination=list(),
46 | chunk_prefix=list(),
47 | ):
48 |         if language not in LANGUAGES:
49 | raise KeyError(
50 | f"{language} not found in languages termset. "
51 | "Ensure this is a supported language or specify "
52 | "your own termsets when initializing Negex."
53 | )
54 | termsets = LANGUAGES[language]
55 | if not Span.has_extension(extension_name):
56 | Span.set_extension(extension_name, default=False, force=True)
57 |
58 | if not pseudo_negations:
59 |             if "pseudo_negations" not in termsets:
60 | raise KeyError("pseudo_negations not specified for this language.")
61 | self.pseudo_negations = termsets["pseudo_negations"]
62 | else:
63 | self.pseudo_negations = pseudo_negations
64 |
65 | if not preceding_negations:
66 |             if "preceding_negations" not in termsets:
67 | raise KeyError("preceding_negations not specified for this language.")
68 | self.preceding_negations = termsets["preceding_negations"]
69 | else:
70 | self.preceding_negations = preceding_negations
71 |
72 | if not following_negations:
73 |             if "following_negations" not in termsets:
74 | raise KeyError("following_negations not specified for this language.")
75 | self.following_negations = termsets["following_negations"]
76 | else:
77 | self.following_negations = following_negations
78 |
79 | if not termination:
80 |             if "termination" not in termsets:
81 | raise KeyError("termination not specified for this language.")
82 | self.termination = termsets["termination"]
83 | else:
84 | self.termination = termination
85 |
86 | self.nlp = nlp
87 | self.ent_types = ent_types
88 | self.extension_name = extension_name
89 | self.build_patterns()
90 | self.chunk_prefix = list(nlp.tokenizer.pipe(chunk_prefix))
91 |
92 | def build_patterns(self):
93 | # efficiently build spaCy matcher patterns
94 | self.matcher = PhraseMatcher(self.nlp.vocab, attr="LOWER")
95 |
96 | self.pseudo_patterns = list(self.nlp.tokenizer.pipe(self.pseudo_negations))
97 | self.matcher.add("pseudo", None, *self.pseudo_patterns)
98 |
99 | self.preceding_patterns = list(
100 | self.nlp.tokenizer.pipe(self.preceding_negations)
101 | )
102 | self.matcher.add("Preceding", None, *self.preceding_patterns)
103 |
104 | self.following_patterns = list(
105 | self.nlp.tokenizer.pipe(self.following_negations)
106 | )
107 | self.matcher.add("Following", None, *self.following_patterns)
108 |
109 | self.termination_patterns = list(self.nlp.tokenizer.pipe(self.termination))
110 | self.matcher.add("Termination", None, *self.termination_patterns)
111 |
112 | def remove_patterns(
113 | self,
114 | pseudo_negations=None,
115 | preceding_negations=None,
116 | following_negations=None,
117 | termination=None,
118 | ):
119 | if pseudo_negations:
120 | if isinstance(pseudo_negations, list):
121 | for p in pseudo_negations:
122 | self.pseudo_negations.remove(p)
123 | else:
124 | self.pseudo_negations.remove(pseudo_negations)
125 | if preceding_negations:
126 | if isinstance(preceding_negations, list):
127 | for p in preceding_negations:
128 | self.preceding_negations.remove(p)
129 | else:
130 | self.preceding_negations.remove(preceding_negations)
131 | if following_negations:
132 | if isinstance(following_negations, list):
133 | for p in following_negations:
134 | self.following_negations.remove(p)
135 | else:
136 |                 self.following_negations.remove(following_negations)
137 | if termination:
138 | if isinstance(termination, list):
139 | for p in termination:
140 | self.termination.remove(p)
141 | else:
142 | self.termination.remove(termination)
143 | self.build_patterns()
144 |
145 | def add_patterns(
146 | self,
147 | pseudo_negations=None,
148 | preceding_negations=None,
149 | following_negations=None,
150 | termination=None,
151 | ):
152 | if pseudo_negations:
153 | if not isinstance(pseudo_negations, list):
154 | raise ValueError("A list of phrases expected when adding patterns")
155 | self.pseudo_negations.extend(pseudo_negations)
156 | if preceding_negations:
157 | if not isinstance(preceding_negations, list):
158 | raise ValueError("A list of phrases expected when adding patterns")
159 | self.preceding_negations.extend(preceding_negations)
160 | if following_negations:
161 | if not isinstance(following_negations, list):
162 | raise ValueError("A list of phrases expected when adding patterns")
163 | self.following_negations.extend(following_negations)
164 | if termination:
165 | if not isinstance(termination, list):
166 | raise ValueError("A list of phrases expected when adding patterns")
167 | self.termination.extend(termination)
168 | self.build_patterns()
169 |
170 | def get_patterns(self):
171 | """
172 | returns phrase patterns used for various negation dictionaries
173 |
174 | Returns
175 | -------
176 | patterns: dict
177 | pattern_type: [patterns]
178 |
179 | """
180 | patterns = {
181 | "pseudo_patterns": self.pseudo_patterns,
182 | "preceding_patterns": self.preceding_patterns,
183 | "following_patterns": self.following_patterns,
184 | "termination_patterns": self.termination_patterns,
185 | }
186 | for pattern in patterns:
187 | logging.info(pattern)
188 | return patterns
189 |
190 | def process_negations(self, doc):
191 | """
192 | Find negations in doc and clean candidate negations to remove pseudo negations
193 |
194 | Parameters
195 | ----------
196 | doc: object
197 | spaCy Doc object
198 |
199 | Returns
200 | -------
201 | preceding: list
202 | list of tuples for preceding negations
203 | following: list
204 | list of tuples for following negations
205 | terminating: list
206 | list of tuples of terminating phrases
207 |
208 | """
209 | ###
210 | # does not work properly in spacy 2.1.8. Will incorporate after 2.2.
211 | # Relying on user to use NER in meantime
212 | # see https://github.com/jenojp/negspacy/issues/7
213 | ###
214 | # if not doc.is_nered:
215 | # raise ValueError(
216 | # "Negations are evaluated for Named Entities found in text. "
217 |         #     "Your SpaCy pipeline does not include Named Entity resolution. "
218 | # "Please ensure it is enabled or choose a different language model that includes it."
219 | # )
220 | preceding = list()
221 | following = list()
222 | terminating = list()
223 |
224 | matches = self.matcher(doc)
225 | pseudo = [
226 | (match_id, start, end)
227 | for match_id, start, end in matches
228 | if self.nlp.vocab.strings[match_id] == "pseudo"
229 | ]
230 |
231 | for match_id, start, end in matches:
232 | if self.nlp.vocab.strings[match_id] == "pseudo":
233 | continue
234 | pseudo_flag = False
235 | for p in pseudo:
236 | if start >= p[1] and start <= p[2]:
237 | pseudo_flag = True
238 | continue
239 | if not pseudo_flag:
240 | if self.nlp.vocab.strings[match_id] == "Preceding":
241 | preceding.append((match_id, start, end))
242 | elif self.nlp.vocab.strings[match_id] == "Following":
243 | following.append((match_id, start, end))
244 | elif self.nlp.vocab.strings[match_id] == "Termination":
245 | terminating.append((match_id, start, end))
246 | else:
247 |                     logging.warning(
248 | f"phrase {doc[start:end].text} not in one of the expected matcher types."
249 | )
250 | return preceding, following, terminating
251 |
252 | def termination_boundaries(self, doc, terminating):
253 | """
254 | Create sub sentences based on terminations found in text.
255 |
256 | Parameters
257 | ----------
258 | doc: object
259 | spaCy Doc object
260 | terminating: list
261 | list of tuples with (match_id, start, end)
262 |
263 | returns
264 | -------
265 | boundaries: list
266 | list of tuples with (start, end) of spans
267 |
268 | """
269 | sent_starts = [sent.start for sent in doc.sents]
270 | terminating_starts = [t[1] for t in terminating]
271 | starts = sent_starts + terminating_starts + [len(doc)]
272 | starts.sort()
273 | boundaries = list()
274 | index = 0
275 | for i, start in enumerate(starts):
276 | if not i == 0:
277 | boundaries.append((index, start))
278 | index = start
279 | return boundaries
280 |
281 | def negex(self, doc):
282 | """
283 | Negates entities of interest
284 |
285 | Parameters
286 | ----------
287 | doc: object
288 | spaCy Doc object
289 |
290 | """
291 | preceding, following, terminating = self.process_negations(doc)
292 | boundaries = self.termination_boundaries(doc, terminating)
293 | for b in boundaries:
294 | sub_preceding = [i for i in preceding if b[0] <= i[1] < b[1]]
295 | sub_following = [i for i in following if b[0] <= i[1] < b[1]]
296 |
297 | for e in doc[b[0] : b[1]].ents:
298 | if self.ent_types:
299 | if e.label_ not in self.ent_types:
300 | continue
301 | if any(pre < e.start for pre in [i[1] for i in sub_preceding]):
302 | e._.set(self.extension_name, True)
303 | continue
304 | if any(fol > e.end for fol in [i[2] for i in sub_following]):
305 | e._.set(self.extension_name, True)
306 | continue
307 | if self.chunk_prefix:
308 | if any(
309 | c.text.lower() == doc[e.start:e.start+len(c)].text.lower()
310 | for c in self.chunk_prefix
311 | ):
312 | e._.set(self.extension_name, True)
313 | return doc
314 |
315 | def __call__(self, doc):
316 | return self.negex(doc)
317 |
--------------------------------------------------------------------------------
/sket/negex/termsets.py:
--------------------------------------------------------------------------------
1 | """
2 | Default termsets for various languages
3 | """
4 |
5 | LANGUAGES = dict()
6 |
7 | # english termset dictionary
8 | en = dict()
9 | pseudo = [
10 | "no further",
11 | "not able to be",
12 | "not certain if",
13 | "not certain whether",
14 | "not necessarily",
15 | "without any further",
16 | "without difficulty",
17 | "without further",
18 | "might not",
19 | "not only",
20 | "no increase",
21 | "no significant change",
22 | "no change",
23 | "no definite change",
24 | "not extend",
25 | "not cause",
26 | "not certain if",
27 | "not certain whether",
28 | ]
29 | en["pseudo_negations"] = pseudo
30 |
31 | preceding = [
32 | "absence of",
33 | "declined",
34 | "denied",
35 | "denies",
36 | "denying",
37 | "no sign of",
38 | "no signs of",
39 | "not",
40 | "not demonstrate",
41 | "symptoms atypical",
42 | "doubt",
43 | "negative for",
44 | "no",
45 | "versus",
46 | "without",
47 | "doesn't",
48 | "doesnt",
49 | "don't",
50 | "dont",
51 | "didn't",
52 | "didnt",
53 | "wasn't",
54 | "wasnt",
55 | "weren't",
56 | "werent",
57 | "isn't",
58 | "isnt",
59 | "aren't",
60 | "arent",
61 | "cannot",
62 | "can't",
63 | "cant",
64 | "couldn't",
65 | "couldnt",
66 | "never",
67 | ]
68 | en["preceding_negations"] = preceding
69 |
70 | following = [
71 | "declined",
72 | "unlikely",
73 | "was not",
74 | "were not",
75 | "wasn't",
76 | "wasnt",
77 | "weren't",
78 | "werent",
79 | ]
80 | en["following_negations"] = following
81 |
82 | termination = [
83 | "although",
84 | "apart from",
85 | "as there are",
86 | "aside from",
87 | "but",
88 | "except",
89 | "however",
90 | "involving",
91 | "nevertheless",
92 | "still",
93 | "though",
94 | "which",
95 | "yet",
96 | "still",
97 | ]
98 | en["termination"] = termination
99 |
100 | LANGUAGES["en"] = en
101 |
102 | # en_clinical builds upon en
103 | en_clinical = dict()
104 | pseudo_clinical = pseudo + [
105 | "gram negative",
106 | "not rule out",
107 | "not ruled out",
108 | "not been ruled out",
109 | "not drain",
110 | "no suspicious change",
111 | "no interval change",
112 | "no significant interval change",
113 | ]
114 | en_clinical["pseudo_negations"] = pseudo_clinical
115 |
116 | preceding_clinical = preceding + [
117 | "patient was not",
118 | "without indication of",
119 | "without sign of",
120 | "without signs of",
121 | "without any reactions or signs of",
122 | "no complaints of",
123 | "no evidence of",
124 | "no cause of",
125 | "evaluate for",
126 | "fails to reveal",
127 | "free of",
128 | "never developed",
129 | "never had",
130 | "did not exhibit",
131 | "rules out",
132 | "rule out",
133 | "rule him out",
134 | "rule her out",
135 | "rule patient out",
136 | "rule the patient out",
137 | "ruled out",
138 |     "ruled him out", "ruled her out",
139 | "ruled patient out",
140 | "ruled the patient out",
141 | "r/o",
142 | "ro",
143 | ]
144 | en_clinical["preceding_negations"] = preceding_clinical
145 |
146 | following_clinical = following + ["was ruled out", "were ruled out", "free"]
147 | en_clinical["following_negations"] = following_clinical
148 |
149 | termination_clinical = termination + [
150 | "cause for",
151 | "cause of",
152 | "causes for",
153 | "causes of",
154 | "etiology for",
155 | "etiology of",
156 | "origin for",
157 | "origin of",
158 | "origins for",
159 | "origins of",
160 | "other possibilities of",
161 | "reason for",
162 | "reason of",
163 | "reasons for",
164 | "reasons of",
165 | "secondary to",
166 | "source for",
167 | "source of",
168 | "sources for",
169 | "sources of",
170 | "trigger event for",
171 | ]
172 | en_clinical["termination"] = termination_clinical
173 | LANGUAGES["en_clinical"] = en_clinical
174 |
175 | en_clinical_sensitive = dict()
176 |
177 | preceding_clinical_sensitive = preceding_clinical + [
178 | "concern for",
179 | "supposed",
180 | "which causes",
181 | "leads to",
182 | "h/o",
183 | "history of",
184 | "instead of",
185 | "if you experience",
186 | "if you get",
187 | "teaching the patient",
188 | "taught the patient",
189 | "teach the patient",
190 | "educated the patient",
191 | "educate the patient",
192 | "educating the patient",
193 | "monitored for",
194 | "monitor for",
195 | "test for",
196 | "tested for",
197 | ]
198 | en_clinical_sensitive["pseudo_negations"] = pseudo_clinical
199 | en_clinical_sensitive["preceding_negations"] = preceding_clinical_sensitive
200 | en_clinical_sensitive["following_negations"] = following_clinical
201 | en_clinical_sensitive["termination"] = termination_clinical
202 |
203 | LANGUAGES["en_clinical_sensitive"] = en_clinical_sensitive
204 |
--------------------------------------------------------------------------------
/sket/negex/test.py:
--------------------------------------------------------------------------------
1 | import pytest
2 | import spacy
3 | from negation import Negex
4 | from spacy.pipeline import EntityRuler
5 |
6 |
7 | def build_docs():
8 | docs = list()
9 | docs.append(
10 | (
11 | "Patient denies Apple Computers but has Steve Jobs. He likes USA.",
12 | [("Apple Computers", True), ("Steve Jobs", False), ("USA", False)],
13 | )
14 | )
15 | docs.append(
16 | (
17 | "No history of USA, Germany, Italy, Canada, or Brazil",
18 | [
19 | ("USA", True),
20 | ("Germany", True),
21 | ("Italy", True),
22 | ("Canada", True),
23 | ("Brazil", True),
24 | ],
25 | )
26 | )
27 |
28 | docs.append(("That might not be Barack Obama.", [("Barack Obama", False)]))
29 |
30 | return docs
31 |
32 |
33 | def build_med_docs():
34 | docs = list()
35 | docs.append(
36 | (
37 | "Patient denies cardiovascular disease but has headaches. No history of smoking. Alcoholism unlikely. Smoking not ruled out.",
38 | [
39 | ("Patient denies", False),
40 | ("cardiovascular disease", True),
41 | ("headaches", False),
42 | ("No history", True),
43 | ("smoking", True),
44 | ("Alcoholism", True),
45 | ("Smoking", False),
46 | ],
47 | )
48 | )
49 | docs.append(
50 | (
51 | "No history of headaches, prbc, smoking, acid reflux, or GERD.",
52 | [
53 | ("No history", True),
54 | ("headaches", True),
55 | ("prbc", True),
56 | ("smoking", True),
57 | ("acid reflux", True),
58 | ("GERD", True),
59 | ],
60 | )
61 | )
62 |
63 | docs.append(
64 | (
65 | "Alcoholism was not the cause of liver disease.",
66 | [("Alcoholism", True), ("cause", False), ("liver disease", False)],
67 | )
68 | )
69 |
70 | docs.append(
71 | (
72 | "There was no headache for this patient.",
73 | [("no headache", True), ("patient", True)],
74 | )
75 | )
76 | return docs
77 |
78 |
79 | def test():
80 | nlp = spacy.load("en_core_web_sm")
81 | negex = Negex(nlp)
82 | nlp.add_pipe(negex, last=True)
83 | docs = build_docs()
84 | for d in docs:
85 | doc = nlp(d[0])
86 | for i, e in enumerate(doc.ents):
87 | print(e.text, e._.negex)
88 | assert (e.text, e._.negex) == d[1][i]
89 |
90 |
91 | def test_en():
92 | nlp = spacy.load("en_core_web_sm")
93 | negex = Negex(nlp, language="en")
94 | nlp.add_pipe(negex, last=True)
95 | docs = build_docs()
96 | for d in docs:
97 | doc = nlp(d[0])
98 | for i, e in enumerate(doc.ents):
99 | print(e.text, e._.negex)
100 | assert (e.text, e._.negex) == d[1][i]
101 |
102 |
103 | def test_umls():
104 | nlp = spacy.load("en_core_sci_sm")
105 | negex = Negex(
106 | nlp, language="en_clinical", ent_types=["ENTITY"], chunk_prefix=["no"]
107 | )
108 | nlp.add_pipe(negex, last=True)
109 | docs = build_med_docs()
110 | for d in docs:
111 | doc = nlp(d[0])
112 | for i, e in enumerate(doc.ents):
113 | print(e.text, e._.negex)
114 | assert (e.text, e._.negex) == d[1][i]
115 |
116 |
117 | def test_umls2():
118 | nlp = spacy.load("en_core_sci_sm")
119 | negex = Negex(
120 | nlp, language="en_clinical_sensitive", ent_types=["ENTITY"], chunk_prefix=["no"]
121 | )
122 | nlp.add_pipe(negex, last=True)
123 | docs = build_med_docs()
124 | for d in docs:
125 | doc = nlp(d[0])
126 | for i, e in enumerate(doc.ents):
127 | print(e.text, e._.negex)
128 | assert (e.text, e._.negex) == d[1][i]
129 |
130 |
131 | # blocked by spacy 2.1.8 issue. Adding back after spacy 2.2.
132 | # def test_no_ner():
133 | # nlp = spacy.load("en_core_web_sm", disable=["ner"])
134 | # negex = Negex(nlp)
135 | # nlp.add_pipe(negex, last=True)
136 | # with pytest.raises(ValueError):
137 | # doc = nlp("this doc has not been NERed")
138 |
139 |
140 | def test_own_terminology():
141 | nlp = spacy.load("en_core_web_sm")
142 | negex = Negex(nlp, termination=["whatever"])
143 | nlp.add_pipe(negex, last=True)
144 | doc = nlp("He does not like Steve Jobs whatever he says about Barack Obama.")
145 |     assert not doc.ents[1]._.negex
146 |
147 |
148 | def test_get_patterns():
149 | nlp = spacy.load("en_core_web_sm")
150 | negex = Negex(nlp)
151 | patterns = negex.get_patterns()
152 |     assert isinstance(patterns, dict)
153 | assert len(patterns) == 4
154 |
155 |
156 | def test_issue7():
157 | nlp = spacy.load("en_core_web_sm")
158 | negex = Negex(nlp)
159 | nlp.add_pipe(negex, last=True)
160 |     ruler = EntityRuler(nlp, patterns=[{"label": "SOFTWARE", "pattern": "spacy"}])
161 |     nlp.add_pipe(ruler, before="ner")
162 |     doc = nlp("fgfgdghgdh")
163 |
164 |
165 | def test_add_remove_patterns():
166 | nlp = spacy.load("en_core_web_sm")
167 | negex = Negex(nlp)
168 | patterns = negex.get_patterns()
169 | negex.add_patterns(
170 | pseudo_negations=["my favorite pattern"],
171 | termination=["these are", "great patterns"],
172 | preceding_negations=["wow a negation"],
173 | following_negations=["extra negation"],
174 | )
175 | patterns_after = negex.get_patterns()
176 | print(patterns_after)
177 | print(len(patterns_after["pseudo_patterns"]))
178 | assert len(patterns_after["pseudo_patterns"]) - 1 == len(
179 | patterns["pseudo_patterns"]
180 | )
181 | assert len(patterns_after["termination_patterns"]) - 2 == len(
182 | patterns["termination_patterns"]
183 | )
184 | assert len(patterns_after["preceding_patterns"]) - 1 == len(
185 | patterns["preceding_patterns"]
186 | )
187 | assert len(patterns_after["following_patterns"]) - 1 == len(
188 | patterns["following_patterns"]
189 | )
190 |
191 | negex.remove_patterns(
192 | termination=["these are", "great patterns"],
193 | pseudo_negations=["my favorite pattern"],
194 | preceding_negations="denied",
195 | following_negations=["unlikely"],
196 | )
197 | negex.remove_patterns(termination="but")
198 | negex.remove_patterns(
199 | preceding_negations="wow a negation", following_negations=["extra negation"]
200 | )
201 | patterns_after = negex.get_patterns()
202 | assert (
203 | len(patterns_after["termination_patterns"])
204 | == len(patterns["termination_patterns"]) - 1
205 | )
206 | assert (
207 | len(patterns_after["following_patterns"])
208 | == len(patterns["following_patterns"]) - 1
209 | )
210 | assert (
211 | len(patterns_after["preceding_patterns"])
212 | == len(patterns["preceding_patterns"]) - 1
213 | )
214 | assert len(patterns_after["pseudo_patterns"]) == len(patterns["pseudo_patterns"])
215 |
216 |
217 | if __name__ == "__main__":
218 | test()
219 | test_umls()
220 |     test_umls2()
221 | test_own_terminology()
222 | test_get_patterns()
223 | test_issue7()
224 | test_add_remove_patterns()
225 |
--------------------------------------------------------------------------------
/sket/nerd/__init__.py:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/sket/nerd/normalizer.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 |
4 | class StandardizationNormalizer(object):
5 | # apply standard deviation normalization
6 | def __init__(self, scores):
7 | self.mean = np.mean(scores)
8 | self.std = np.std(scores)
9 |
10 | def __call__(self, scores):
11 | if self.std > 0:
12 | return (scores - self.mean) / self.std
13 | else:
14 | return np.zeros(scores.size)
15 |
16 |
17 | class MinMaxNormalizer(object):
18 | # apply minmax normalization
19 | def __init__(self, scores):
20 | self.min = np.min(scores)
21 | self.max = np.max(scores)
22 |
23 | def __call__(self, scores):
24 | if (self.max - self.min) > 0:
25 | return (scores - self.min) / (self.max - self.min)
26 | else:
27 | return np.zeros(scores.size)
28 |
29 |
30 | class IdentityNormalizer(object):
31 |     # apply identity normalization
32 | def __init__(self):
33 | pass
34 |
35 | def __call__(self, scores):
36 | return scores
37 |
--------------------------------------------------------------------------------
/sket/nerd/rules/cin_mappings.txt:
--------------------------------------------------------------------------------
1 | cin1 low grade cervical squamous intraepithelial neoplasia
2 | cin2 cervical squamous intraepithelial neoplasia 2
3 | cin3 squamous carcinoma in situ
4 | cin23 cervical intraepithelial neoplasia grade 2/3
5 | lsil low grade cervical squamous intraepithelial neoplasia
6 | hsil cervical intraepithelial neoplasia grade 2/3
--------------------------------------------------------------------------------
/sket/nerd/rules/dysplasia_mappings.txt:
--------------------------------------------------------------------------------
1 | mild mild colon dysplasia colon
2 | moderate moderate colon dysplasia colon
3 | severe severe colon dysplasia colon
4 | low-grade mild colon dysplasia colon
5 | low grade mild colon dysplasia colon
6 | low-degree mild colon dysplasia colon
7 | low degree mild colon dysplasia colon
8 | low mild colon dysplasia colon
9 | high-grade severe colon dysplasia colon
10 | high grade severe colon dysplasia colon
11 | high-degree severe colon dysplasia colon
12 | high degree severe colon dysplasia colon
13 | high severe colon dysplasia colon
14 | strong severe colon dysplasia colon
15 | mild-to-moderate mild colon dysplasia,moderate colon dysplasia colon
16 | mild to moderate mild colon dysplasia,moderate colon dysplasia colon
17 | mild and moderate mild colon dysplasia,moderate colon dysplasia colon
18 | moderate-to-severe moderate colon dysplasia,severe colon dysplasia colon
19 | moderate to severe moderate colon dysplasia,severe colon dysplasia colon
20 | moderate and severe moderate colon dysplasia,severe colon dysplasia colon
21 | focally severe severe colon dysplasia colon
22 | severe focally severe colon dysplasia colon
23 | severe focal severe colon dysplasia colon
24 | moderate-severe moderate colon dysplasia,severe colon dysplasia colon
25 | mild-moderate mild colon dysplasia,moderate colon dysplasia colon
26 | mild-severe mild colon dysplasia,severe colon dysplasia colon
27 | mild to severe mild colon dysplasia,severe colon dysplasia colon
28 | mild and severe mild colon dysplasia,severe colon dysplasia colon
29 | mild-to-severe mild colon dysplasia,severe colon dysplasia colon
30 | mild low grade cervical squamous intraepithelial neoplasia cervix
31 | moderate cervical squamous intraepithelial neoplasia 2 cervix
32 | severe squamous carcinoma in situ cervix
33 | low-grade low grade cervical squamous intraepithelial neoplasia cervix
34 | low grade low grade cervical squamous intraepithelial neoplasia cervix
35 | low low grade cervical squamous intraepithelial neoplasia cervix
36 | high-grade cervical intraepithelial neoplasia grade 2/3 cervix
37 | high grade cervical intraepithelial neoplasia grade 2/3 cervix
38 | high cervical intraepithelial neoplasia grade 2/3 cervix
39 | strong squamous carcinoma in situ cervix
40 | mild-to-moderate low grade cervical squamous intraepithelial neoplasia,cervical squamous intraepithelial neoplasia 2 cervix
41 | mild to moderate low grade cervical squamous intraepithelial neoplasia,cervical squamous intraepithelial neoplasia 2 cervix
42 | mild and moderate low grade cervical squamous intraepithelial neoplasia,cervical squamous intraepithelial neoplasia 2 cervix
43 | moderate-to-severe cervical squamous intraepithelial neoplasia 2,squamous carcinoma in situ cervix
44 | moderate to severe cervical squamous intraepithelial neoplasia 2,squamous carcinoma in situ cervix
45 | moderate and severe cervical squamous intraepithelial neoplasia 2,squamous carcinoma in situ cervix
46 | focally severe squamous carcinoma in situ cervix
47 | severe focally squamous carcinoma in situ cervix
48 | severe focal squamous carcinoma in situ cervix
49 | moderate-severe cervical squamous intraepithelial neoplasia 2,squamous carcinoma in situ cervix
50 | mild-moderate low grade cervical squamous intraepithelial neoplasia,cervical squamous intraepithelial neoplasia 2 cervix
51 | mild-severe low grade cervical squamous intraepithelial neoplasia,squamous carcinoma in situ cervix
52 | mild to severe low grade cervical squamous intraepithelial neoplasia,squamous carcinoma in situ cervix
53 | mild and severe low grade cervical squamous intraepithelial neoplasia,squamous carcinoma in situ cervix
54 | mild-to-severe low grade cervical squamous intraepithelial neoplasia,squamous carcinoma in situ cervix
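The mapping rows above appear to carry three columns: a trigger term, one or more comma-separated target labels, and a use case. A minimal loader sketch, assuming the columns are tab-separated (the helper name `load_mappings` is illustrative, not part of the SKET API):

```python
from collections import defaultdict

def load_mappings(lines):
    """Build {use_case: {trigger: [labels]}} from tab-separated mapping rows."""
    mappings = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # assumed layout: trigger \t comma-separated labels \t use case
        trigger, labels, use_case = line.split('\t')
        mappings[use_case][trigger] = labels.split(',')
    return mappings

rows = [
    "mild-to-moderate\tmild colon dysplasia,moderate colon dysplasia\tcolon",
    "severe\tsquamous carcinoma in situ\tcervix",
]
maps = load_mappings(rows)
```

Composite triggers such as `mild-to-moderate` expand to multiple labels, which is why the label column is split on commas.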
--------------------------------------------------------------------------------
/sket/nerd/rules/rules.txt:
--------------------------------------------------------------------------------
1 | dysplasia mild,moderate,severe,low-grade,low-degree,low grade,low degree,low,high-grade,high-degree,high grade,high degree,high,strong,mild-to-moderate,mild to moderate,mild and moderate,moderate-to-severe,moderate to severe,moderate and severe,focally severe,severe focally,severe focal,moderate-severe,mild-moderate,mild-severe,mild to severe,mild and severe,mild-to-severe BOTH LOOSE colon,cervix
2 | hyperplastic polyp,polyp-adenomatous,adenomatous polyp,polyp adenomatous type,adenomatous polyp-type,polyp-type,polyp-focal adenomatous,polyp focal adenomatous,polyp-inflammatory BOTH EXACT colon
3 | hyperplastic polyp adenomatous,adenomatous type,focal adenomatous,inflammatory BOTH EXACT colon
4 | transverse colon POST EXACT colon
5 | descending colon POST EXACT colon
6 | rectal mucous membrane POST EXACT colon
7 | ascending colon POST EXACT colon
8 | sigmoid colon POST EXACT colon
9 | right colon POST EXACT colon
10 | left colon POST EXACT colon
11 | rectum nos POST EXACT colon
12 | colon nos POST EXACT colon
13 | uterus nos POST EXACT cervix
14 | carcinoma in situ POST EXACT cervix
15 | squamous cell carcinoma in situ POST EXACT cervix
16 | squamous carcinoma in situ POST EXACT cervix
17 | cervical adenocarcinoma in situ POST EXACT cervix
18 | uterine cervix carcinoma in situ POST EXACT cervix
19 | leep cervical BOTH EXACT cervix
20 | epithelium exocervical,endocervical BOTH EXACT cervix
21 | squamous intraepithelial lesion low-grade,low grade,low,high-grade,high grade,high BOTH LOOSE cervix
22 | neuroendocrine large-cell,large cell,large,non-small cell,non small cell,small-cell,small cell,small PRE EXACT lung
23 | cell non-small,non small,small,large,clear PRE EXACT lung
24 | duodenal bulb biopsy,biopsy POST EXACT celiac
25 | biopsy 2nd duodenum,ii duodenal,according to duodenum,according to duodenal, duodenum ii BOTH EXACT celiac
26 | duodenal mucosa POST LOOSE celiac
27 | intraepithelial lymphocytes,lymphocytic quota (iel,lymphocytic quota (iel:,lymphocyte infiltrate (iel,lymphocyte infiltrate (iel POST EXACT celiac
28 | hyperplasia of the brunner glands,brunner gland,of glandular crypts,of the glands of brunner BOTH EXACT celiac
29 | celiac disease type POST LOOSE celiac
30 | gluten hypersensitivity type POST LOOSE celiac
31 | phlogosis chronic,chronic active,chronic acute,active,acute,with marked activity,with activity BOTH EXACT celiac
32 | inflammatory chronic,acute BOTH EXACT celiac
33 | chronic gastritis,phlogosis,inflammation POST LOOSE celiac
34 | active gastritis,phlogosis,inflammation POST LOOSE celiac
35 | acute gastritis,phlogosis,inflammation POST LOOSE celiac
36 | chronic duodenitis POST EXACT celiac
37 | brunner glands,glands of BOTH EXACT celiac
38 | normal morphology,within the limits of,within the limit of,devoid of,appearance BOTH EXACT celiac
39 | antral type mucous membranes of PRE EXACT celiac
40 | atrophy glandular,of the crypts,crypt,villi,of the villi,villial BOTH EXACT celiac
41 | atrophic villi,crypt PRE LOOSE celiac
42 | flattened villuses,villi BOTH LOOSE celiac
43 | flattening of the villi,of the villus POST LOOSE celiac
44 | lymphocyte (iel(iel:,infiltrate (iel,infiltrate (iel: POST EXACT celiac
45 | infiltration lymphocytic,(iel,(iel: BOTH EXACT celiac
46 | villi free of PRE EXACT celiac
47 | villi atrophy BOTH EXACT celiac
48 | villi height,length BOTH LOOSE celiac
49 | height of the villi POST EXACT celiac
50 | height villi,villus PRE LOOSE celiac
51 | mitosis number of,proportion of,proportion of cryptic,number of cryptic,share of,share of cryptic PRE EXACT celiac
52 | duodenitis chronic,moderate,mild,active,erosive,mild-activity,acute,ulcerative,chronic active ulcerative,moderate chronic,chronic mild,chronic active,chronic moderate,mild chronic,chronic active and erosive,chronic severe,chronic erosive,chronic active erosive,erosive chronic,acute erosiva BOTH EXACT celiac
53 | intraepithelial lymphocyte POST EXACT celiac
54 | celiac no indications of,no more signs of,no evidence of PRE EXACT celiac
55 | abnormalities without,no PRE EXACT celiac
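The rule rows above appear to carry five columns: a trigger term, comma-separated candidate modifiers, a position flag (`PRE`, `POST`, or `BOTH`), a matching mode (`EXACT` or `LOOSE`), and comma-separated use cases. A parsing sketch under that assumption (`parse_rule` is a hypothetical helper, not SKET code):

```python
def parse_rule(line):
    """Split a tab-separated rule row into its five assumed fields."""
    trigger, candidates, position, mode, use_cases = line.strip().split('\t')
    return {
        'trigger': trigger,
        'candidates': candidates.split(','),
        'position': position,      # PRE, POST, or BOTH
        'mode': mode,              # EXACT or LOOSE
        'use_cases': use_cases.split(','),
    }

rule = parse_rule("dysplasia\tmild,moderate,severe\tBOTH\tLOOSE\tcolon,cervix")
```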
--------------------------------------------------------------------------------
/sket/ont_proc/__init__.py:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/sket/ont_proc/ontology_processing.py:
--------------------------------------------------------------------------------
1 | import owlready2
2 | import itertools
3 | import pandas as pd
4 | import rdflib
5 |
6 | from collections import defaultdict
7 | from copy import deepcopy
8 | from rdflib import URIRef
9 | from rdflib.namespace import RDFS
10 | from owlready2 import IRIS
11 |
12 | from ..utils import utils
13 |
14 |
15 | class OntoProc(object):
16 |
17 | def __init__(self, ontology_path=None, hierarchies_path=None):
18 | """
19 | Load ontology and set use-case variable
20 |
21 | Params:
22 | ontology_path (str): ontology.owl file path
23 | hierarchies_path (str): hierarchy relations file path
24 |
25 | Returns: None
26 | """
27 |
28 | self.ontology = rdflib.Graph()
29 | if ontology_path: # custom ontology path
30 | #self.ontology = owlready2.get_ontology(ontology_path).load()
31 | self.ontology.parse(ontology_path)
32 | else: # default ontology path
33 | self.ontology.parse('./sket/ont_proc/ontology/examode.owl')
34 | if hierarchies_path: # custom hierarchy relations path
35 | self.hrels = utils.read_hierarchies(hierarchies_path)
36 | else: # default hierarchy relations path
37 | self.hrels = utils.read_hierarchies('./sket/ont_proc/rules/hierarchy_relations.txt')
38 | self.disease = {'colon': '0002032', 'lung': '0008903', 'cervix': '0002974', 'celiac': '0005130'}
39 |
40 | def restrict2use_case(self, use_case, limit=1000):
41 | """
42 | Restrict ontology to the considered use-case and return DataFrame containing concepts from restricted ontology
43 |
44 | Params:
45 | use_case (str): use case considered (colon, lung, cervix, celiac)
46 | limit (int): max number of returned elements
47 |
48 | Returns: a pandas DataFrame containing concepts information
49 | """
50 |
51 | disease = self.disease[use_case]
52 | sparql = "PREFIX exa: <https://w3id.org/examode/ontology/> " \
53 | "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " \
54 | "PREFIX mondo: <http://purl.obolibrary.org/obo/MONDO_> " \
55 | "PREFIX dcterms: <http://purl.org/dc/terms/> "\
56 | "select ?iri ?iri_label ?iri_SNOMED_code ?iri_UMLS_code ?semantic_area ?semantic_area_label where { " \
57 | "?iri rdfs:label ?iri_label ; exa:associatedDisease mondo:" + disease + ". " \
58 | "filter (langMatches( lang(?iri_label), 'en')). " \
59 | "OPTIONAL {?iri exa:hasSNOMEDCode ?iri_SNOMED_code .} " \
60 | "OPTIONAL {?iri dcterms:conformsTo ?iri_UMLS_code .} " \
61 | "OPTIONAL {?iri exa:hasSemanticArea ?semantic_area . " \
62 | "?semantic_area rdfs:label ?semantic_area_label . " \
63 | "filter (langMatches( lang(?semantic_area_label), 'en')).} " \
64 | "} " \
65 | "limit " + str(limit)
66 | # issue sparql query
67 | resultSet = self.ontology.query(query_object=sparql)
68 | # convert query output to DataFrame
69 | ontology_dict = defaultdict(list)
70 | for row in resultSet:
71 | # store entity as IRI
72 | ontology_dict['iri'].append(str(row.iri))
73 | # store additional information associated w/ entity
74 | ontology_dict['label'].append(str(row.iri_label))
75 | ontology_dict['SNOMED'].append(str(row.iri_SNOMED_code) if row.iri_SNOMED_code is not None else None)
76 | ontology_dict['UMLS'].append(str(row.iri_UMLS_code) if row.iri_UMLS_code is not None else None)
77 | ontology_dict['semantic_area'].append(str(row.semantic_area))
78 | ontology_dict['semantic_area_label'].append(str(row.semantic_area_label))
79 | if use_case == 'celiac':
80 | # Add negative result
81 | # store entity as IRI
82 | ontology_dict['iri'].append('https://w3id.org/examode/ontology/NegativeResult')
83 | # store additional information associated w/ entity
84 | ontology_dict['label'].append('Negative Result')
85 | ontology_dict['SNOMED'].append('M-010100')
86 | ontology_dict['UMLS'].append(None)
87 | ontology_dict['semantic_area'].append('http://purl.obolibrary.org/obo/NCIT_C15220')
88 | ontology_dict['semantic_area_label'].append('Diagnosis')
89 | # Add inconclusive result
90 | # store entity as IRI
91 | ontology_dict['iri'].append('https://w3id.org/examode/ontology/InconclusiveOutcome')
92 | # store additional information associated w/ entity
93 | ontology_dict['label'].append('Inconclusive Outcome')
94 | ontology_dict['SNOMED'].append(None)
95 | ontology_dict['UMLS'].append(None)
96 | ontology_dict['semantic_area'].append('http://purl.obolibrary.org/obo/NCIT_C15220')
97 | ontology_dict['semantic_area_label'].append('Diagnosis')
98 | return pd.DataFrame(ontology_dict)
99 |
100 | @staticmethod
101 | def lookup_semantic_areas(semantic_areas, use_case_ontology):
102 | """
103 | Look up ontology concepts associated with target semantic areas
104 |
105 | Params:
106 | semantic_areas (list(str)/str): target semantic areas
107 | use_case_ontology (pandas DataFrame): reference ontology restricted to the use case considered
108 |
109 | Returns: a list of rows matching the target semantic areas
110 | """
111 |
112 | if type(semantic_areas) == list: # search for list of semantic areas
113 | rows = use_case_ontology.loc[use_case_ontology['semantic_area_label'].isin(semantic_areas)][['iri', 'label', 'semantic_area_label']]
114 | else: # search for single semantic area
115 | rows = use_case_ontology.loc[use_case_ontology['semantic_area_label'] == semantic_areas][['iri', 'label', 'semantic_area_label']]
116 | if rows.empty: # no match found within ontology
117 | return []
118 | else: # match found
119 | return rows.values.tolist()
120 |
121 | def get_ancestors(self, concepts, include_self=False):
122 | """
123 | Returns the list of ancestor concepts given the target concept and hierarchical relations
124 |
125 | Params:
126 | concepts (list(str)): list of concepts from ontology
127 | include_self (bool): whether to include current concept in the list of ancestors
128 |
129 | Returns: the list of ancestors for target concept
130 | """
131 |
132 | assert type(concepts) == list
133 |
134 | # get latest concept within concepts
135 | concept = concepts[-1]
136 |
137 | # Query to return ancestors (both for classes and individuals)
138 | txtQuery = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " \
139 | "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> " \
140 | "select ?ancestor where { " \
141 | "<" + concept + "> (rdfs:subClassOf|skos:broaderTransitive)+ ?ancestor. " \
142 | "}"
143 |
144 | # issue sparql query
145 | resultSet = self.ontology.query(query_object=txtQuery)
146 | ancestors = []
147 | for r in resultSet:
148 | ancestors.append(str(r.ancestor))
149 | # if include_self include concept
150 | if include_self:
151 | ancestors.append(concept)
152 |
153 | return ancestors
154 |
155 | def check_individual_type(self, individual, classURI):
156 | """
157 | Checks if an individual belongs to a specified class.
158 |
159 | Params:
160 | individual (str): URI of the individual.
161 | classURI (str): URI of the class.
162 |
163 | Returns: boolean value asserting whether individual belongs to classURI.
164 | """
165 |
166 | # Query to return classes of individual
167 | txtQuery = "select ?type { <" + individual + "> a ?type.}"
168 |
169 | # issue sparql query
170 | resultSet = self.ontology.query(query_object=txtQuery)
171 | classes = []
172 | for r in resultSet:
173 | classes.append(str(r.type))
174 | # check whether classURI is among the individual's classes
175 | if classURI in classes:
176 | return True
177 | else:
178 | return False
179 |
180 | def get_higher_concept(self, iri1, iris, include_self=False):
181 | """
182 | Return the ontology concept that is more general (hierarchically higher)
183 |
184 | Params:
185 | iri1 (str): the first IRI considered
186 | iris (list(str)): the second IRI considered, optionally followed by a class URI
187 | include_self (bool): whether to include current concept in the list of ancestors
188 |
189 | Returns: the hierarchically higher concept's iri
190 | """
191 | # get ancestors for both concepts
192 | ancestors1 = self.get_ancestors([iri1], include_self)
193 | ancestors2 = self.get_ancestors([iris[0]], include_self)
194 | if iris[0] in ancestors1: # concept1 is a descendant of concept2
195 | return iris[0]
196 | elif iri1 in ancestors2: # concept1 is an ancestor of concept2
197 | return iri1
198 | else: # concept1 and concept2 are not hierarchically related, check if there is another concept
199 | if len(iris) == 2:
200 | poType = self.check_individual_type(iri1, iris[1])
201 | if poType: # concept1 is an individual of type concept3
202 | return iris[1]
203 | return None
204 |
206 | def merge_nlp_and_struct(self, nlp_concepts, struct_concepts):
207 | """
208 | Merge the information extracted from 'nlp' and 'struct' sections
209 |
210 | Params:
211 | nlp_concepts (dict): the dictionary of linked concepts from 'nlp' section
212 | struct_concepts (dict): the dictionary of linked concepts from 'struct' section
213 |
214 | Returns: a dict containing the linked concepts w/o distinction between 'nlp' and 'struct' concepts
215 | """
216 |
217 | cconcepts = dict()
218 | # merge linked concepts from 'nlp' and 'struct' sections
219 | for sem_area in nlp_concepts.keys():
220 | if nlp_concepts[sem_area] and struct_concepts[sem_area]: # semantic area is not empty for both 'nlp' and 'struct' sections
221 | # get all the possible combinations of 'nlp' and 'struct' concepts
222 | combinations = list(itertools.product(nlp_concepts[sem_area], struct_concepts[sem_area]))
223 | # return IRIs to be removed (hierarchically higher)
224 | IRIs = {self.get_higher_concept(combination[0][0], combination[1][0]) for combination in combinations} - {None}
225 | # remove under-specified concepts and store remaining concepts
226 | cconcepts[sem_area] = deepcopy(nlp_concepts[sem_area])
227 | cconcepts[sem_area].extend([concept for concept in struct_concepts[sem_area] if concept[0] not in [concept[0] for concept in nlp_concepts[sem_area]]])
228 | # remove IRIs from cconcepts
229 | cconcepts[sem_area] = [concept for concept in cconcepts[sem_area] if concept[0] not in IRIs]
230 | elif nlp_concepts[sem_area]: # semantic area is not empty only for the 'nlp' section
231 | cconcepts[sem_area] = deepcopy(nlp_concepts[sem_area])
232 | elif struct_concepts[sem_area]: # semantic area is not empty only for 'struct' section
233 | cconcepts[sem_area] = deepcopy(struct_concepts[sem_area])
234 | else: # semantic area is empty for both sections
235 | cconcepts[sem_area] = list()
236 | # return combined concepts
237 | return cconcepts
238 |
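The merge step above drops whichever concept of a pair is hierarchically higher (i.e., more general), keeping the more specific one. The core decision in `get_higher_concept` can be illustrated with plain ancestor sets; this is an illustrative sketch, not the SKET API, and the concept names are hypothetical:

```python
def higher_concept(c1, c2, ancestors):
    """Return the hierarchically higher of c1/c2, or None if unrelated.

    ancestors maps each concept to the set of its ancestor concepts.
    """
    if c2 in ancestors.get(c1, set()):   # c1 descends from c2, so c2 is higher
        return c2
    if c1 in ancestors.get(c2, set()):   # c2 descends from c1, so c1 is higher
        return c1
    return None                          # not hierarchically related

ancestors = {
    'colon_adenocarcinoma': {'adenocarcinoma', 'carcinoma', 'neoplasm'},
    'carcinoma': {'neoplasm'},
}
flagged = higher_concept('colon_adenocarcinoma', 'carcinoma', ancestors)
```

The flagged (higher) concept is the one `merge_nlp_and_struct` removes, so the merged output keeps the most specific concept available from either section.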
--------------------------------------------------------------------------------
/sket/ont_proc/rules/hierarchy_relations.txt:
--------------------------------------------------------------------------------
1 | hasBroaderTransitive
--------------------------------------------------------------------------------
/sket/rdf_proc/__init__.py:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/sket/rep_proc/__init__.py:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/sket/rep_proc/report_processing.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import math
3 | import string
4 | import re
5 | import json
6 | import uuid
7 | import copy
8 | import roman
9 |
10 | from tqdm import tqdm
11 | from copy import deepcopy
12 | from collections import defaultdict
13 | from transformers import MarianMTModel, MarianTokenizer
14 | from datetime import datetime
15 |
16 | from ..utils import utils
17 |
18 |
19 | class ReportProc(object):
20 |
21 | def __init__(self, src_lang, use_case, fields_path=None):
22 | """
23 | Set translator and build regular expression to split text based on bullets
24 |
25 | Params:
26 | src_lang (str): considered source language
27 | use_case (str): considered use case
28 | fields_path (str): report fields file path
29 |
30 | Returns: None
31 | """
32 |
33 | self.use_case = use_case
34 |
35 | if fields_path: # read report fields file
36 | self.fields = utils.read_report_fields(fields_path)
37 | else: # no report fields file provided
38 | self.fields = utils.read_report_fields('./sket/rep_proc/rules/report_fields.txt')
39 |
40 | if src_lang != 'en': # set NMT model
41 | self.nmt_name = 'Helsinki-NLP/opus-mt-' + src_lang + '-en'
42 | self.tokenizer = MarianTokenizer.from_pretrained(self.nmt_name)
43 | self.nmt = MarianMTModel.from_pretrained(self.nmt_name)
44 | else: # no NMT model required
45 | self.nmt_name = None
46 | self.tokenizer = None
47 | self.nmt = None
48 |
49 | # build regex for bullet patterns
50 | self.en_roman_regex = re.compile('((?<=(^i-ii(\s|:|\.)))|(?<=(^i-iii(\s|:|\.)))|(?<=(^ii-iii(\s|:|\.)))|(?<=(^i-iv(\s|:|\.)))|(?<=(^ii-iv(\s|:|\.)))|(?<=(^iii-iv(\s|:|\.)))|(?<=(^i and ii(\s|:|\.)))|(?<=(^i and iii(\s|:|\.)))|(?<=(^ii and iii(\s|:|\.)))|(?<=(^i and iv(\s|:|\.)))|(?<=(^ii and iv(\s|:|\.)))|(?<=(^iii and iv(\s|:|\.)))|(?<=(^i(\s|:|\.)))|(?<=(^ii(\s|:|\.)))|(?<=(^iii(\s|:|\.)))|(?<=(^iv(\s|:|\.)))|(?<=(\si-ii(\s|:|\.)))|(?<=(\si-iii(\s|:|\.)))|(?<=(\sii-iii(\s|:|\.)))|(?<=(\si-iv(\s|:|\.)))|(?<=(\sii-iv(\s|:|\.)))|(?<=(\siii-iv(\s|:|\.)))|(?<=(\si and ii(\s|:|\.)))|(?<=(\si and iii(\s|:|\.)))|(?<=(\sii and iii(\s|:|\.)))|(?<=(\si and iv(\s|:|\.)))|(?<=(\sii and iv(\s|:|\.)))|(?<=(\siii and iv(\s|:|\.)))|(?<=(\si(\s|:|\.)))|(?<=(\sii(\s|:|\.)))|(?<=(\siii(\s|:|\.)))|(?<=(\siv(\s|:|\.))))(.*?)((?=(\si+(\s|:|\.|-)))|(?=(\siv(\s|:|\.|-)))|(?=($)))')
51 | self.nl_roman_regex = re.compile('((?<=(^i-ii(\s|:|\.)))|(?<=(^i-iii(\s|:|\.)))|(?<=(^ii-iii(\s|:|\.)))|(?<=(^i-iv(\s|:|\.)))|(?<=(^ii-iv(\s|:|\.)))|(?<=(^iii-iv(\s|:|\.)))|(?<=(^i en ii(\s|:|\.)))|(?<=(^i en iii(\s|:|\.)))|(?<=(^ii en iii(\s|:|\.)))|(?<=(^i en iv(\s|:|\.)))|(?<=(^ii en iv(\s|:|\.)))|(?<=(^iii en iv(\s|:|\.)))|(?<=(^i(\s|:|\.)))|(?<=(^ii(\s|:|\.)))|(?<=(^iii(\s|:|\.)))|(?<=(^iv(\s|:|\.)))|(?<=(\si-ii(\s|:|\.)))|(?<=(\si-iii(\s|:|\.)))|(?<=(\sii-iii(\s|:|\.)))|(?<=(\si-iv(\s|:|\.)))|(?<=(\sii-iv(\s|:|\.)))|(?<=(\siii-iv(\s|:|\.)))|(?<=(\si en ii(\s|:|\.)))|(?<=(\si en iii(\s|:|\.)))|(?<=(\sii en iii(\s|:|\.)))|(?<=(\si en iv(\s|:|\.)))|(?<=(\sii en iv(\s|:|\.)))|(?<=(\siii en iv(\s|:|\.)))|(?<=(\si(\s|:|\.)))|(?<=(\sii(\s|:|\.)))|(?<=(\siii(\s|:|\.)))|(?<=(\siv(\s|:|\.))))(.*?)((?=(\si+(\s|:|\.|-)))|(?=(\siv(\s|:|\.|-)))|(?=($)))')
52 | self.bullet_regex = re.compile("^[-(]?\s*[\d,]+\s*[:)-]?")
53 | self.ranges_regex = re.compile("^\(?\s*(\d\s*-\s*\d|\d\s*\.\s*\d)\s*\)?")
54 |
55 | # COMMON FUNCTIONS
56 |
57 | def is_empty(self, var):
58 | """
59 | Check whether a var is empty (i.e., None or NaN)
60 |
61 | Params:
62 | var (any): considered variable
63 |
64 | Returns: bool
65 | """
66 |
67 | if type(var) == float:
68 | return math.isnan(var)
69 | else:
70 | return var is None
71 |
72 | def update_usecase(self, use_case):
73 | """
74 | Update use case
75 |
76 | Params:
77 | use_case (str): considered use case
78 |
79 | Returns: None
80 | """
81 |
82 | self.use_case = use_case
83 |
84 | def update_nmt(self, src_lang):
85 | """
86 | Update NMT model changing source language
87 |
88 | Params:
89 | src_lang (str): considered source language
90 |
91 | Returns: None
92 | """
93 |
94 | if src_lang != 'en': # update NMT model
95 | self.nmt_name = 'Helsinki-NLP/opus-mt-' + src_lang + '-en'
96 | self.tokenizer = MarianTokenizer.from_pretrained(self.nmt_name)
97 | self.nmt = MarianMTModel.from_pretrained(self.nmt_name)
98 | else: # no NMT model required
99 | self.nmt_name = None
100 | self.tokenizer = None
101 | self.nmt = None
102 |
103 | def update_report_fields(self, fields_path):
104 | """
105 | Update report fields changing current ones
106 |
107 | Params:
108 | fields_path (str): report fields file
109 |
110 | Returns: None
111 | """
112 |
113 | self.fields = utils.read_report_fields(fields_path)
114 |
115 | def load_dataset(self, reports_path, sheet, header):
116 | """
117 | Load reports dataset
118 |
119 | Params:
120 | reports_path (str): reports.xlsx fpath
121 | sheet (str): name of the excel sheet to use
122 | header (int): row index used as header
123 |
124 | Returns: the loaded dataset
125 | """
126 |
127 | if reports_path.split('.')[-1] == 'xlsx': # requires openpyxl engine
128 | dataset = pd.read_excel(io=reports_path, sheet_name=sheet, header=header, engine='openpyxl')
129 | else:
130 | dataset = pd.read_excel(io=reports_path, sheet_name=sheet, header=header)
131 | # remove rows w/ na
132 | dataset.dropna(axis=0, how='all', inplace=True)
133 | # dataset.dropna(axis=0, how='all', subset=dataset.columns[1:], inplace=True)
134 |
135 | return dataset
136 |
137 | def translate_text(self, text):
138 | """
139 | Translate text from the source language to English -- text is lower-cased before and after translation
140 |
141 | Params:
142 | text (str): target text
143 |
144 | Returns: translated text
145 | """
146 |
147 | if type(text) == str:
148 | trans_text = self.nmt.generate(**self.tokenizer(text.lower(), return_tensors="pt", padding=True))[0]
149 | trans_text = self.tokenizer.decode(trans_text, skip_special_tokens=True)
150 | else:
151 | trans_text = ''
152 | return trans_text.lower()
153 |
154 | # AOEC SPECIFIC FUNCTIONS
155 |
156 | def aoec_process_data(self, dataset):
157 | """
158 | Read AOEC reports and extract the required fields
159 |
160 | Params:
161 | dataset (pandas DataFrame): target dataset
162 |
163 | Returns: a dict containing the required reports fields
164 | """
165 |
166 | reports = dict()
167 | print('acquire data')
168 | # acquire data and translate text
169 | for report in tqdm(dataset.itertuples()):
170 | reports[str(report._1).strip()] = {
171 | 'diagnosis_nlp': report.Diagnosi,
172 | 'materials': report.Materiali,
173 | 'procedure': report.Procedura if type(report.Procedura) == str else '',
174 | 'topography': report.Topografia if type(report.Topografia) == str else '',
175 | 'diagnosis_struct': report._5 if type(report._5) == str else '',
176 | 'age': int(report.Età) if not math.isnan(report.Età) else None,
177 | 'gender': report.Sesso if type(report.Sesso) == str else ''
178 | }
179 | return reports
180 |
181 | def aoec_split_diagnoses(self, diagnoses, int_id, debug=False):
182 | """
183 | Split the section 'diagnoses' within AOEC reports relying on bullets (e.g., '1', '2', etc.)
184 |
185 | Params:
186 | diagnoses (str): the 'diagnoses' section of AOEC reports
187 | int_id (int): the internal id specifying the current diagnosis
188 | debug (bool): whether to keep flags for debugging
189 |
190 | Returns: the part of the 'diagnoses' section related to the current internalid
191 | """
192 |
193 | current_iids = []
194 | dgnss = {}
195 | # split diagnosis on new lines
196 | dlines = diagnoses.split('\n')
197 | # loop over lines
198 | for line in dlines:
199 | line = line.strip()
200 | if line: # line contains text
201 | # look for range first
202 | rtext = self.ranges_regex.findall(line)
203 | if rtext: # range found
204 | bullets = re.findall('\d+', rtext[0])
205 | bullets = list(map(int, bullets))
206 | bullets = range(bullets[0], bullets[1]+1)
207 | current_iids = deepcopy(bullets)
208 | else: # ranges not found
209 | # look for bullets
210 | btext = self.bullet_regex.findall(line)
211 | if btext: # bullets found
212 | bullets = re.findall('\d+', btext[0])
213 | bullets = list(map(int, bullets))
214 | current_iids = deepcopy(bullets)
215 | # associate current line to the corresponding ids
216 | for iid in current_iids:
217 | if iid in dgnss: # iid assigned before
218 | dgnss[iid] += ' ' + line
219 | else: # new iid
220 | dgnss[iid] = line
221 | if int_id in dgnss: # return the corresponding diagnosis
222 | return dgnss[int_id]
223 | elif not current_iids: # no bullet found -- return the whole diagnoses field (w/o \n to avoid problems w/ FastText)
224 | return diagnoses.replace('\n', ' ')
225 | else: # return the whole diagnoses field (w/o \n to avoid problems w/ FastText) -- something went wrong
226 | if debug:
227 | print('\n\nSomething went wrong -- return the whole diagnoses field but print data:')
228 | print('Internal ID: {}'.format(int_id))
229 | print('Raw Field: {}'.format(diagnoses))
230 | print('Processed Field: {}\n\n'.format(dgnss))
231 | return diagnoses.replace('\n', ' ')
232 |
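The splitting above can be exercised standalone. The sketch below reuses the same `bullet_regex` and `ranges_regex` patterns from `__init__`, in a simplified function (`split_diagnoses` is a condensed stand-in for `aoec_split_diagnoses`, not the actual method):

```python
import re
from collections import defaultdict

# same patterns as ReportProc.__init__
bullet_regex = re.compile(r"^[-(]?\s*[\d,]+\s*[:)-]?")
ranges_regex = re.compile(r"^\(?\s*(\d\s*-\s*\d|\d\s*\.\s*\d)\s*\)?")

def split_diagnoses(diagnoses):
    """Map each bullet number to the diagnosis lines it governs."""
    current, out = [], defaultdict(str)
    for line in diagnoses.split('\n'):
        line = line.strip()
        if not line:
            continue
        rtext = ranges_regex.findall(line)
        if rtext:  # range bullet, e.g. '2-3)'
            lo, hi = map(int, re.findall(r'\d+', rtext[0]))
            current = list(range(lo, hi + 1))
        else:      # plain bullet, e.g. '1)'
            btext = bullet_regex.findall(line)
            if btext:
                current = list(map(int, re.findall(r'\d+', btext[0])))
        for iid in current:
            out[iid] = (out[iid] + ' ' + line).strip()
    return dict(out)

parts = split_diagnoses("1) tubular adenoma\n2-3) hyperplastic polyp")
```

A range bullet assigns the same line to every internal id it covers, which mirrors how `aoec_split_diagnoses` resolves `int_id` lookups.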
233 | def aoec_process_data_v2(self, dataset, debug=False):
234 | """
235 | Read AOEC reports and extract the required fields (v2 used for batches from 2nd onwards)
236 |
237 | Params:
238 | dataset (pandas DataFrame): target dataset
239 | debug (bool): whether to keep flags for debugging
240 |
241 | Returns: a dict containing the required report fields
242 | """
243 |
244 | reports = dict()
245 | print('acquire data and split it based on diagnoses')
246 | # acquire data and split it based on diagnoses
247 | for report in tqdm(dataset.itertuples()):
248 | if 'IDINTERNO' in dataset.columns:
249 | rid = str(report.FILENAME).strip() + '_' + str(report.IDINTERNO).strip()
250 | if type(report.TESTODIAGNOSI) == str:
251 | reports[rid] = {
252 | 'diagnosis_nlp': self.aoec_split_diagnoses(report.TESTODIAGNOSI, report.IDINTERNO, debug=debug)
253 | if type(report.IDINTERNO) == str else report.TESTODIAGNOSI,
254 | 'materials': report.MATERIALE,
255 | 'procedure': report.SNOMEDPROCEDURA if type(report.SNOMEDPROCEDURA) == str else '',
256 | 'topography': report.SNOMEDTOPOGRAFIA if type(report.SNOMEDTOPOGRAFIA) == str else '',
257 | 'diagnosis_struct': report.SNOMEDDIAGNOSI if type(report.SNOMEDDIAGNOSI) == str else '',
258 | 'birth_date': report.NATOIL if report.NATOIL else '',
259 | 'visit_date': report.DATAORAFINEVALIDAZIONE if report.DATAORAFINEVALIDAZIONE else '',
260 | 'gender': report.SESSO if type(report.SESSO) == str else '',
261 | 'image': report.FILENAME,
262 | 'internalid': report.IDINTERNO
263 | }
264 | else:
265 | # process_data_v3, no IDINTERNO and MATERIALI
266 | rid = str(int(report.FILENAME)).strip()
267 | if type(report.TESTODIAGNOSI) == str:
268 | reports[rid] = {
269 | 'diagnosis_nlp': report.TESTODIAGNOSI,
270 | 'procedure': report.SNOMEDPROCEDURA if type(report.SNOMEDPROCEDURA) == str else '',
271 | 'topography': report.SNOMEDTOPOGRAFIA if type(report.SNOMEDTOPOGRAFIA) == str else '',
272 | 'diagnosis_struct': report.SNOMEDDIAGNOSI if type(report.SNOMEDDIAGNOSI) == str else '',
273 | 'birth_date': report.NATOIL.to_pydatetime().strftime("%Y%m%d")+"000000" if report.NATOIL else '',
274 | 'visit_date': report.DATAORAFINEVALIDAZIONE.to_pydatetime().strftime("%Y%m%d")+"000000" if report.DATAORAFINEVALIDAZIONE else '',
275 | 'gender': report.SESSO if type(report.SESSO) == str else '',
276 | 'image': int(report.FILENAME)
277 | }
278 |
279 | return reports
280 |
281 | @staticmethod
282 | def date_formatter(raw_date):
283 | """
284 | Returns date in the correct format.
285 |
286 | Params:
287 | raw_date (str): date to format ('%d/%m/%Y')
288 | Returns: string with the correctly formatted date ('%Y%m%d' + '000000').
289 | """
290 | date = datetime.strptime(raw_date, '%d/%m/%Y')
291 | return date.strftime("%Y%m%d")+"000000"
292 |
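`date_formatter` normalizes a `dd/mm/YYYY` date into the 14-character timestamp form used elsewhere in the pipeline. A quick standalone check of the same logic:

```python
from datetime import datetime

def date_formatter(raw_date):
    """Convert 'dd/mm/YYYY' into the 'YYYYmmdd000000' timestamp form."""
    date = datetime.strptime(raw_date, '%d/%m/%Y')
    return date.strftime('%Y%m%d') + '000000'

stamp = date_formatter('07/03/2021')
```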
293 | def aoec_translate_reports(self, reports):
294 | """
295 | Translate processed reports
296 |
297 | Params:
298 | reports (dict): processed reports
299 |
300 | Returns: translated reports
301 | """
302 |
303 | trans_reports = copy.deepcopy(reports)
304 | print('translate text')
305 | # translate text
306 | for rid, report in tqdm(trans_reports.items()):
307 | trans_reports[rid]['diagnosis_nlp'] = self.translate_text(report['diagnosis_nlp'])
308 | if 'materials' in report:
309 | trans_reports[rid]['materials'] = self.translate_text(report['materials']) if report['materials'] != '' else ''
310 | return trans_reports
311 |
312 | # RADBOUD SPECIFIC FUNCTIONS
313 |
314 | def radboud_split_conclusions(self, conclusions):
315 | """
316 | Split the section 'conclusions' within reports relying on bullets (e.g., 'i', 'ii', etc.)
317 |
318 | Params:
319 | conclusions (str): the 'conclusions' section of radboud reports
320 |
321 | Returns: a dict containing the 'conclusions' section divided as a bullet list
322 | """
323 |
324 | sections = defaultdict(str)
325 | # use regex to identify bullet-divided sections within 'conclusions'
326 | for groups in self.nl_roman_regex.findall(conclusions):
327 | # identify the target bullet for the given section
328 | bullet = [group for group in groups[:65] if group and any(char.isalpha() or char.isdigit() for char in group)][0].strip()
329 | if 'en' in bullet: # composite bullet
330 | bullets = bullet.split(' en ')
331 | elif '-' in bullet: # composite bullet
332 | bullets = bullet.split('-')
333 | else: # single bullet
334 | bullets = [bullet]
335 | # loop over bullets and concatenate corresponding sections
336 | for bullet in bullets:
337 | if groups[65] != 'en': # the section is not a conjunction between two bullets (e.g., 'i and ii')
338 | sections[bullet.translate(str.maketrans('', '', string.punctuation)).upper()] += ' ' + groups[65] # store them using uppercased roman numbers as keys - required to make Python 'roman' library working
339 | if bool(sections): # 'sections' contains split sections
340 | return sections
341 | else: # 'sections' is empty - assign the whole 'conclusions' to 'sections'
342 | sections['whole'] = conclusions
343 | return sections
344 |
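The splitting behavior of `radboud_split_conclusions` can be illustrated with a minimal, self-contained sketch. The regex below is a simplified stand-in for `nl_roman_regex` (whose actual pattern and group layout are not shown here), but the key-building step — stripping punctuation and uppercasing the roman numeral so the `roman` library can parse it — mirrors the code above:

```python
import re
import string
from collections import defaultdict

# Simplified stand-in for the Dutch roman-numeral bullet pattern used above.
SIMPLE_BULLET_RE = re.compile(r'\b(i{1,3}|iv|v)\s*[\.\)]\s*([^.]*\.)', re.IGNORECASE)

def split_conclusions_sketch(conclusions):
    """Split a 'conclusions' string on roman-numeral bullets; fall back to 'whole'."""
    sections = defaultdict(str)
    for bullet, text in SIMPLE_BULLET_RE.findall(conclusions):
        # uppercased roman numerals are required by the 'roman' library downstream
        key = bullet.translate(str.maketrans('', '', string.punctuation)).upper()
        sections[key] += ' ' + text.strip()
    if sections:
        return dict(sections)
    return {'whole': conclusions}

# split_conclusions_sketch('i. normal mucosa. ii. chronic inflammation.')
# -> {'I': ' normal mucosa.', 'II': ' chronic inflammation.'}
```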
345 | def radboud_process_data(self, dataset, debug=False):
346 | """
347 | Read Radboud reports and extract the required fields
348 |
349 | Params:
350 | dataset (pandas DataFrame): target dataset
351 | debug (bool): whether to keep flags for debugging
352 |
353 |
354 | Returns: a dict containing the required report fields
355 | """
356 |
357 | proc_reports = dict()
358 | skipped_reports = []
359 | unsplitted_reports = 0
360 | misplitted_reports = 0
361 | report_conc_keys = {report.Studynumber: report.Conclusion for report in dataset.itertuples()}
362 | for report in tqdm(dataset.itertuples()):
363 | rid = str(report.Studynumber).strip()
364 | if type(report.Conclusion) == str: # split conclusions and associate to each block the corresponding conclusion
365 | # deepcopy rdata to avoid removing elements from input reports
366 | raw_conclusions = report.Conclusion
367 | # split conclusions into sections
368 | conclusions = self.radboud_split_conclusions(utils.nl_sanitize_record(raw_conclusions.lower(), self.use_case))
369 | pid = '_'.join(rid.split('_')[:-1]) # remove block and slide ids from report id - keep patient id
370 | related_ids = [rel_id for rel_id in report_conc_keys.keys() if pid in rel_id] # get all the ids related to the current patient
371 | # get block ids from related_ids
372 | block_ids = []
373 | for rel_id in related_ids:
374 | if 'B' not in rel_id: # skip report as it does not contain block ID
375 | skipped_reports.append(rel_id)
376 | continue
377 | if 'v' not in rel_id.lower() and '-' not in rel_id: # report does not contain special characters
378 | block_part = rel_id.split('_')[-1]
379 | if len(block_part) < 4: # slide ID not available
380 | block_ids.append(rel_id)
381 | else: # slide ID available
382 | block_ids.append(rel_id[:-2])
383 | elif 'v' in rel_id.lower(): # report contains slide ID first variant (i.e., _V0*)
384 | block_part = rel_id.split('_')[-2]
385 | if len(block_part) < 4: # slide ID not available
386 | block_ids.append('_'.join(rel_id.split('_')[:-1]))
387 | else: # slide ID available
388 | block_ids.append('_'.join(rel_id.split('_')[:-1])[:-2])
389 | elif '-' in rel_id: # report contains slide ID second variant (i.e., -*)
390 | block_part = rel_id.split('_')[-1].split('-')[0]
391 | if len(block_part) < 4: # slide ID not available
392 | block_ids.append(rel_id.split('-')[0])
393 | else: # slide ID available
394 | block_ids.append(rel_id.split('-')[0][:-2])
395 | else:
396 |                         print('something went wrong with current report: {}'.format(rel_id))
398 |
399 | if not block_ids: # Block IDs not found -- skip it
400 | continue
401 |
402 | if 'whole' in conclusions: # unable to split conclusions - either single conclusion or not appropriately specified
403 | if len(block_ids) > 1: # conclusions splits not appropriately specified or wrong
404 | unsplitted_reports += 1
405 | for bid in block_ids:
406 | # create dict to store block diagnosis and slide ids
407 | proc_reports[bid] = dict()
408 | # store conclusion - i.e., the final diagnosis
409 | proc_reports[bid]['diagnosis'] = conclusions['whole']
410 | # store slide ids associated to the current block diagnosis
411 | slide_ids = []
412 | for sid in report_conc_keys.keys():
413 | if bid in sid: # Block ID found within report ID
414 | if 'v' not in sid.lower() and '-' not in sid: # report does not contain special characters
415 | block_part = sid.split('_')[-1]
416 | if len(block_part) < 4: # slide ID not available
417 | continue
418 | else: # slide ID available
419 | slide_ids.append(sid[-2:])
420 | elif 'v' in sid.lower(): # report contains slide ID first variant (i.e., _V0*)
421 | block_part = sid.split('_')[-2]
422 | if len(block_part) < 4: # slide ID not available
423 | slide_ids.append(sid.split('_')[-1])
424 | else: # slide ID available
425 | slide_ids.append(sid.split('_')[-2][-2:] + '_' + sid.split('_')[-1])
426 | elif '-' in sid: # report contains slide ID second variant (i.e., -*)
427 | block_part = sid.split('_')[-1].split('-')[0]
428 | if len(block_part) < 4: # slide ID not available
429 | slide_ids.append(sid.split('-')[1])
430 | else: # slide ID available
431 | slide_ids.append(sid.split('-')[0][-2:] + '-' + sid.split('-')[1])
432 | proc_reports[bid]['slide_ids'] = slide_ids
433 | else:
434 | block_ix2id = {int(block_id[-1]): block_id for block_id in block_ids}
435 | if len(conclusions) < len(block_ids): # fewer conclusions have been identified than the actual number of blocks - store and fix later
436 | misplitted_reports += 1
437 | # get conclusions IDs
438 | cix2id = {roman.fromRoman(cid): cid for cid in conclusions.keys()}
439 | # loop over Block IDs and associate the given conclusions to the corresponding blocks when available
440 | for bix, bid in block_ix2id.items():
441 | # create dict to store block diagnosis and slide ids
442 | proc_reports[bid] = dict()
443 | if bix in cix2id: # conclusion associated with the corresponding block
444 | # store conclusion - i.e., the final diagnosis
445 | proc_reports[bid]['diagnosis'] = conclusions[cix2id[bix]]
446 | # store slide ids associated to the current block diagnosis
447 | slide_ids = []
448 | for sid in report_conc_keys.keys():
449 | if bid in sid: # Block ID found within report ID
450 | if 'v' not in sid.lower() and '-' not in sid: # report does not contain special characters
451 | block_part = sid.split('_')[-1]
452 | if len(block_part) < 4: # slide ID not available
453 | continue
454 | else: # slide ID available
455 | slide_ids.append(sid[-2:])
456 | elif 'v' in sid.lower(): # report contains slide ID first variant (i.e., _V0*)
457 | block_part = sid.split('_')[-2]
458 | if len(block_part) < 4: # slide ID not available
459 | slide_ids.append(sid.split('_')[-1])
460 | else: # slide ID available
461 | slide_ids.append(sid.split('_')[-2][-2:] + '_' + sid.split('_')[-1])
462 | elif '-' in sid: # report contains slide ID second variant (i.e., -*)
463 | block_part = sid.split('_')[-1].split('-')[0]
464 | if len(block_part) < 4: # slide ID not available
465 | slide_ids.append(sid.split('-')[1])
466 | else: # slide ID available
467 | slide_ids.append(sid.split('-')[0][-2:] + '-' + sid.split('-')[1])
468 | proc_reports[bid]['slide_ids'] = slide_ids
469 | else: # unable to associate diagnosis with the corresponding block -- associate the entire conclusion
470 | # store slide ids associated to the current block diagnosis
471 | slide_ids = []
472 | # get patient ID to store conclusions field
473 | pid = '_'.join(bid.split('_')[:3])
474 | wconc = [report_conc_keys[sid] for sid in report_conc_keys.keys() if pid in sid and type(report_conc_keys[sid]) == str]
475 | # store the whole 'conclusions' field
476 | proc_reports[bid]['diagnosis'] = wconc[0]
477 | for sid in report_conc_keys.keys():
478 | if bid in sid: # Block ID found within report ID
479 | if 'v' not in sid.lower() and '-' not in sid: # report does not contain special characters
480 | block_part = sid.split('_')[-1]
481 | if len(block_part) < 4: # slide ID not available
482 | continue
483 | else: # slide ID available
484 | slide_ids.append(sid[-2:])
485 | elif 'v' in sid.lower(): # report contains slide ID first variant (i.e., _V0*)
486 | block_part = sid.split('_')[-2]
487 | if len(block_part) < 4: # slide ID not available
488 | slide_ids.append(sid.split('_')[-1])
489 | else: # slide ID available
490 | slide_ids.append(sid.split('_')[-2][-2:] + '_' + sid.split('_')[-1])
491 | elif '-' in sid: # report contains slide ID second variant (i.e., -*)
492 | block_part = sid.split('_')[-1].split('-')[0]
493 | if len(block_part) < 4: # slide ID not available
494 | slide_ids.append(sid.split('-')[1])
495 | else: # slide ID available
496 | slide_ids.append(sid.split('-')[0][-2:] + '-' + sid.split('-')[1])
497 | proc_reports[bid]['slide_ids'] = slide_ids
498 | else: # associate the given conclusions to the corresponding blocks
499 | # loop over conclusions and fill proc_reports
500 | for cid, cdata in conclusions.items():
501 | block_ix = roman.fromRoman(cid) # convert conclusion id (roman number) into corresponding arabic number (i.e., block index)
502 |                     if block_ix in block_ix2id:  # block with block_ix present within dataset
503 | # create dict to store block diagnosis and slide ids
504 | proc_reports[block_ix2id[block_ix]] = dict()
505 | # store conclusion - i.e., the final diagnosis
506 | proc_reports[block_ix2id[block_ix]]['diagnosis'] = cdata
507 | # store slide ids associated to the current block diagnosis
508 | slide_ids = []
509 | for sid in report_conc_keys.keys():
510 | if block_ix2id[block_ix] in sid: # Block ID found within report ID
511 | if 'v' not in sid.lower() and '-' not in sid: # report does not contain special characters
512 | block_part = sid.split('_')[-1]
513 | if len(block_part) < 4: # slide ID not available
514 | continue
515 | else: # slide ID available
516 | slide_ids.append(sid[-2:])
517 | elif 'v' in sid.lower(): # report contains slide ID first variant (i.e., _V0*)
518 | block_part = sid.split('_')[-2]
519 | if len(block_part) < 4: # slide ID not available
520 | slide_ids.append(sid.split('_')[-1])
521 | else: # slide ID available
522 | slide_ids.append(sid.split('_')[-2][-2:] + '_' + sid.split('_')[-1])
523 | elif '-' in sid: # report contains slide ID second variant (i.e., -*)
524 | block_part = sid.split('_')[-1].split('-')[0]
525 | if len(block_part) < 4: # slide ID not available
526 | slide_ids.append(sid.split('-')[1])
527 | else: # slide ID available
528 | slide_ids.append(sid.split('-')[0][-2:] + '-' + sid.split('-')[1])
529 | proc_reports[block_ix2id[block_ix]]['slide_ids'] = slide_ids
530 | if debug:
531 |             print('number of mis-split reports: {}'.format(misplitted_reports))
532 |             print('number of unsplit reports: {}'.format(unsplitted_reports))
533 | print('skipped reports:')
534 | print(skipped_reports)
535 | return proc_reports
536 |
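The slide-ID extraction loop above is repeated verbatim in three branches of `radboud_process_data`; it could be factored into a single helper. A sketch of such a helper (the function name is ours, not part of the codebase) that mirrors the three report-ID variants handled above:

```python
def get_slide_ids_sketch(bid, report_conc_keys):
    """Collect slide IDs for a block ID, mirroring the three ID variants above."""
    slide_ids = []
    for sid in report_conc_keys:
        if bid not in sid:  # block ID not found within report ID
            continue
        if 'v' not in sid.lower() and '-' not in sid:  # no special characters
            if len(sid.split('_')[-1]) >= 4:  # slide ID available
                slide_ids.append(sid[-2:])
        elif 'v' in sid.lower():  # slide ID first variant (i.e., _V0*)
            if len(sid.split('_')[-2]) < 4:
                slide_ids.append(sid.split('_')[-1])
            else:
                slide_ids.append(sid.split('_')[-2][-2:] + '_' + sid.split('_')[-1])
        elif '-' in sid:  # slide ID second variant (i.e., -*)
            if len(sid.split('_')[-1].split('-')[0]) < 4:
                slide_ids.append(sid.split('-')[1])
            else:
                slide_ids.append(sid.split('-')[0][-2:] + '-' + sid.split('-')[1])
    return slide_ids
```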
537 | def radboud_process_data_v2(self, dataset):
538 | """
539 | Read Radboud reports and extract the required fields (v2 used for anonymized datasets)
540 |
541 | Params:
542 | dataset (pandas DataFrame): target dataset
543 |
544 | Returns: a dict containing the required report fields
545 | """
546 |
547 | proc_reports = dict()
548 | for report in tqdm(dataset.itertuples()):
549 | if 'Microscopy' in report._fields: # first batch of Radboud reports
550 | rid = str(report.Studynumber).strip()
551 | else: # subsequent anonymized batches of Radboud reports
552 | rid = str(report._3).strip() + '_A' # '_A' stands for anonymized report
553 | if report.Conclusion: # split conclusions and associate to each block the corresponding conclusion
554 | # split conclusions into sections
555 | conclusions = self.radboud_split_conclusions(utils.nl_sanitize_record(report.Conclusion.lower(), self.use_case))
556 |
557 | if 'whole' in conclusions: # unable to split conclusions - either single conclusion or not appropriately specified
558 | # create block id
559 | bid = rid + '_1'
560 | # create dict to store block diagnosis
561 | proc_reports[bid] = dict()
562 | # store conclusion - i.e., the final diagnosis
563 | proc_reports[bid]['diagnosis'] = conclusions['whole']
564 |
565 | else:
566 | # get conclusions IDs
567 | cid2ix = {cid: roman.fromRoman(cid) for cid in conclusions.keys()}
568 | for cid, cix in cid2ix.items():
569 | # create block id
570 | bid = rid + '_' + str(cix)
571 | # create dict to store block diagnosis
572 | proc_reports[bid] = dict()
573 | # store conclusion - i.e., the final diagnosis
574 | proc_reports[bid]['diagnosis'] = conclusions[cid]
575 | return proc_reports
576 |
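`roman.fromRoman` converts the uppercased bullet keys ('I', 'II', …) into block indices. For the small numerals that occur in these reports, the conversion can be sketched without the third-party `roman` package (this is an illustrative re-implementation, not the library's code):

```python
def from_roman_sketch(numeral):
    """Convert an uppercase roman numeral to an integer (subtractive notation)."""
    values = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + ' '):
        if nxt != ' ' and values[ch] < values[nxt]:
            total -= values[ch]  # subtractive pair, e.g. 'IV'
        else:
            total += values[ch]
    return total

# from_roman_sketch('II') -> 2, from_roman_sketch('IV') -> 4
```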
577 | def radboud_process_celiac_data(self, dataset):
578 | """
579 | Read Radboud reports and extract the required fields (used for celiac datasets)
580 |
581 | Params:
582 | dataset (pandas DataFrame): target dataset
583 |
584 | Returns: a dict containing the required report fields
585 | """
586 |
587 | proc_reports = dict()
588 | for report in tqdm(dataset.itertuples()):
589 | rid = str(report.Studynumber).strip()
590 | if type(report._7) == str:
591 | if any(report._6.strip() == k for k in ['alle', 'aIle', 'aI', 'al', 'a', '?']): # single conclusion
592 |
593 | # create dict to store block diagnosis
594 | proc_reports[rid] = dict()
595 | # store conclusion - i.e., the final diagnosis
596 | proc_reports[rid]['diagnosis'] = utils.nl_sanitize_record(report._7.lower(), self.use_case)
597 | # Add other fields
598 | proc_reports[rid]['tissue'] = report._8 if type(report._8) == str else ''
599 | proc_reports[rid]['procedure'] = report._9 if type(report._9) == str else ''
600 | proc_reports[rid]['short'] = []
601 | for short in [report.short1, report.short2, report.short3]:
602 | if type(short) == str:
603 | proc_reports[rid]['short'].append(short)
604 | proc_reports[rid]['slide_ids'] = [rid.lstrip(report.block).split('_')[1].lstrip(report.block.split('_')[3])]
605 | else: # split conclusions and associate to each block the corresponding conclusion
606 | # split conclusions into sections
607 | conclusions = self.radboud_split_conclusions(utils.nl_sanitize_record(report._7.lower(), self.use_case))
608 |
609 | if 'whole' in conclusions: # unable to split conclusions - either single conclusion or not appropriately specified
610 | # create dict to store block diagnosis
611 | proc_reports[rid] = dict()
612 | # store conclusion - i.e., the final diagnosis
613 | proc_reports[rid]['diagnosis'] = conclusions['whole']
614 | # Add other fields
615 | proc_reports[rid]['tissue'] = report._8 if type(report._8) == str else ''
616 | proc_reports[rid]['procedure'] = report._9 if type(report._9) == str else ''
617 | proc_reports[rid]['short'] = []
618 | for short in [report.short1, report.short2, report.short3]:
619 | if type(short) == str:
620 | proc_reports[rid]['short'].append(short)
621 | proc_reports[rid]['slide_ids'] = [rid.lstrip(report.block).split('_')[1].lstrip(
622 | report.block.split('_')[3])]
623 |
624 | else:
625 | # get conclusions IDs
626 | cid2ix = {cid: roman.fromRoman(cid) for cid in conclusions.keys()}
627 | for cid, cix in cid2ix.items():
628 | numbers = [n.strip() for n in report._6.split('&')]
629 | if cid in numbers:
630 | # create block id
631 | bid = rid + '_' + str(cix)
632 | # create dict to store block diagnosis
633 | proc_reports[bid] = dict()
634 | # store conclusion - i.e., the final diagnosis
635 | proc_reports[bid]['diagnosis'] = conclusions[cid]
636 | # Add other fields
637 | proc_reports[bid]['tissue'] = report._8 if type(report._8) == str else ''
638 | proc_reports[bid]['procedure'] = report._9 if type(report._9) == str else ''
639 | proc_reports[bid]['short'] = []
640 | for short in [report.short1, report.short2, report.short3]:
641 | if type(short) == str:
642 | proc_reports[bid]['short'].append(short)
643 | proc_reports[bid]['slide_ids'] = [rid.lstrip(report.block).split('_')[1].lstrip(
644 | report.block.split('_')[3])]
645 |
646 | return proc_reports
647 |
648 | def radboud_translate_celiac_reports(self, reports):
649 | """
650 | Translate processed reports for celiac use-case
651 |
652 | Params:
653 | reports (dict): processed reports
654 |
655 | Returns: translated reports
656 | """
657 |
658 | trans_reports = copy.deepcopy(reports)
659 | print('translate text')
660 | # translate text
661 | for rid, report in tqdm(trans_reports.items()):
662 | trans_reports[rid]['diagnosis'] = self.translate_text(report['diagnosis'])
663 | if report['tissue'] != '':
664 | trans_reports[rid]['tissue'] = self.translate_text(report['tissue'])
665 | if report['procedure'] != '':
666 | trans_reports[rid]['procedure'] = self.translate_text(report['procedure'])
667 | # List of translated shorts
668 | tmp = []
669 | for short in report['short']:
670 | tmp.append(self.translate_text(short))
671 | trans_reports[rid]['short'] = tmp
672 | return trans_reports
673 |
674 | def radboud_translate_reports(self, reports):
675 | """
676 | Translate processed reports
677 |
678 | Params:
679 | reports (dict): processed reports
680 |
681 | Returns: translated reports
682 | """
683 |
684 | trans_reports = copy.deepcopy(reports)
685 | print('translate text')
686 | # translate text
687 | for rid, report in tqdm(trans_reports.items()):
688 | trans_reports[rid]['diagnosis'] = self.translate_text(report['diagnosis'])
689 | return trans_reports
690 |
691 | # GENERAL-PURPOSE FUNCTIONS
692 |
693 | def read_xls_reports(self, dataset):
694 | """
695 | Read reports from xls file
696 |
697 | Params:
698 |             dataset (str): target dataset file path
699 |
700 | Returns: a list containing dataset report(s)
701 | """
702 |
703 | if dataset.split('.')[-1] == 'xlsx': # read input file as xlsx object
704 | ds = pd.read_excel(io=dataset, header=0, engine='openpyxl')
705 | else: # read input file as xls object
706 | ds = pd.read_excel(io=dataset, header=0)
707 |
708 | reports = []
709 | for report in tqdm(ds.itertuples(index=False)): # convert raw dataset into list containing report(s)
710 | reports.append({field: report[ix] for ix, field in enumerate(report._fields)})
711 | # return report(s)
712 | return reports
713 |
714 | def read_csv_reports(self, dataset):
715 | """
716 | Read reports from csv file
717 |
718 | Params:
719 |             dataset (str): target dataset file path
720 |
721 | Returns: a list containing dataset report(s)
722 | """
723 | # read input file as csv object
724 | ds = pd.read_csv(filepath_or_buffer=dataset, sep=' ', header=0)
725 |
726 | reports = []
727 | for report in tqdm(ds.itertuples(index=False)): # convert raw dataset into list containing report(s)
728 | reports.append({field: report[ix] for ix, field in enumerate(report._fields)})
729 | # return report(s)
730 | return reports
731 |
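Both readers above build one dict per row by pairing `itertuples` field names with positional values. The pattern can be demonstrated with a namedtuple standing in for a pandas row (note that `itertuples` mangles column names that are not valid identifiers, e.g. into `_3`, which is why `_fields` is used rather than the raw column labels):

```python
from collections import namedtuple

# namedtuple stand-in for a row yielded by DataFrame.itertuples(index=False)
Report = namedtuple('Report', ['diagnosis', 'materials'])

def rows_to_dicts(rows):
    """Build one dict per report row, as the itertuples loops above do."""
    return [{field: row[ix] for ix, field in enumerate(row._fields)} for row in rows]

# rows_to_dicts([Report('adenoma', 'polyp biopsy')])
# -> [{'diagnosis': 'adenoma', 'materials': 'polyp biopsy'}]
```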
732 | def read_json_reports(self, dataset):
733 | """
734 | Read reports from JSON file
735 |
736 | Params:
737 |             dataset (str): target dataset file path
738 |
739 | Returns: a list containing dataset report(s)
740 | """
741 |
742 | with open(dataset, 'r') as dsf:
743 | ds = json.load(dsf)
744 |
745 | if 'reports' in ds: # dataset consists of several reports
746 | reports = ds['reports']
747 | else: # dataset consists of single report
748 | reports = [ds]
749 | # return report(s)
750 | return reports
751 |
752 | def read_stream_reports(self, dataset):
753 | """
754 | Read reports from stream input
755 |
756 | Params:
757 | dataset (dict): target dataset
758 |
759 | Returns: a list containing dataset report(s)
760 | """
761 |
762 | if 'reports' in dataset: # dataset consists of several reports
763 | reports = dataset['reports']
764 | else: # dataset consists of single report
765 | reports = [dataset]
766 | # return report(s)
767 | return reports
768 |
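`read_json_reports` and `read_stream_reports` accept the same two payload shapes: a single report object, or a batch wrapped under a `reports` key. The normalization step is small enough to show in isolation:

```python
def normalize_payload(dataset):
    """Return a list of reports from either accepted payload shape."""
    if 'reports' in dataset:  # dataset consists of several reports
        return dataset['reports']
    return [dataset]  # dataset consists of a single report

# normalize_payload({'text': 'single'}) -> [{'text': 'single'}]
# normalize_payload({'reports': [{'text': 'a'}, {'text': 'b'}]}) -> [{'text': 'a'}, {'text': 'b'}]
```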
769 | def process_data(self, dataset, debug=False):
770 | """
771 | Read reports and extract the required fields
772 |
773 | Params:
774 | dataset (dict): target dataset
775 | debug (bool): whether to keep flags for debugging
776 |
777 | Returns: a dict containing the required report fields
778 | """
779 |
780 | if type(dataset) == str: # dataset passed as input file
781 | if dataset.split('.')[-1] == 'json': # read input file as JSON object
782 | reports = self.read_json_reports(dataset)
783 | elif dataset.split('.')[-1] == 'xlsx' or dataset.split('.')[-1] == 'xls': # read input file as xlsx or xls object
784 | reports = self.read_xls_reports(dataset)
786 |             elif dataset.split('.')[-1] == 'csv':  # read input file as csv object
786 | reports = self.read_csv_reports(dataset)
787 | else: # raise exception
788 | print('Format required for input: JSON, xls, xlsx or csv.')
789 | raise Exception
790 | else: # dataset passed as stream dict
791 | reports = self.read_stream_reports(dataset)
792 |
793 | proc_reports = {}
794 | # process reports and concat fields
795 | for report in reports:
796 | if 'id' in report:
797 | rid = report.pop('id') # use provided id
798 | else:
799 | rid = str(uuid.uuid4()) # generate uuid
800 |
801 | if 'age' in report: # get age from report
802 | if self.is_empty(report['age']):
803 | age = None
804 | else:
805 | age = report.pop('age')
806 | else: # set age to None
807 | age = None
808 |
809 | if 'gender' in report: # get gender from report
810 | if self.is_empty(report['gender']):
811 | gender = None
812 | else:
813 | gender = report.pop('gender')
814 | else: # set gender to None
815 | gender = None
816 |
817 | if self.fields: # report fields specified -- restrict to self.fields
818 | fields = [field for field in report.keys() if field in self.fields]
819 | else: # report fields not specified -- keep report fields
820 | fields = [field for field in report.keys()]
821 | report_fields = [report[field] if report[field].endswith('.') else report[field] + '.' for field in fields]
822 | text = ' '.join(report_fields)
823 |
824 | # prepare processed report
825 | proc_reports[rid] = {'text': text, 'age': age, 'gender': gender}
826 | return proc_reports
827 |
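The field-concatenation step of `process_data` (restrict to `self.fields` when set, append a trailing period where missing, join into one text) can be sketched on its own. Note this mirrors the code above, including its implicit assumption that every field value is a string:

```python
def build_text_sketch(report, fields=None):
    """Concatenate report fields into one text, adding a period where missing."""
    keep = [f for f in report if not fields or f in fields]
    parts = [report[f] if report[f].endswith('.') else report[f] + '.' for f in keep]
    return ' '.join(parts)

# build_text_sketch({'diagnosis': 'adenoma', 'materials': 'colon biopsy.'})
# -> 'adenoma. colon biopsy.'
```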
828 | def translate_reports(self, reports):
829 | """
830 | Translate reports
831 |
832 | Params:
833 | reports (dict): reports
834 |
835 | Returns: translated reports
836 | """
837 |
838 | trans_reports = copy.deepcopy(reports)
839 | print('translate text')
840 | # translate text
841 | for rid, report in tqdm(trans_reports.items()):
842 | trans_reports[rid]['text'] = self.translate_text(report['text'])
843 | return trans_reports
844 |
--------------------------------------------------------------------------------
/sket/rep_proc/rules/report_fields.txt:
--------------------------------------------------------------------------------
1 | text
--------------------------------------------------------------------------------
/sket/sket.py:
--------------------------------------------------------------------------------
1 | import os
2 | import uuid
3 | import json
4 |
5 | from .rep_proc.report_processing import ReportProc
6 | from .ont_proc.ontology_processing import OntoProc
7 | from .nerd.nerd import NERD
8 | from .rdf_proc.rdf_processing import RDFProc
9 |
10 | from .utils import utils
11 |
12 |
13 | class SKET(object):
14 |
15 | def __init__(
16 | self,
17 | use_case, src_lang,
18 | biospacy="en_core_sci_sm", biow2v=True, biofast=None, biobert=None, str_match=False, gpu=None, rules=None, dysplasia_mappings=None, cin_mappings=None,
19 | ontology_path=None, hierarchies_path=None,
20 | fields_path=None
21 | ):
22 | """
23 | Load SKET components
24 |
25 | Params:
26 | SKET:
27 | use_case (str): considered use case
28 | src_lang (str): considered language
29 | NERD:
30 | biospacy (str): full spaCy pipeline for biomedical data
31 | biow2v (bool): whether to use biospacy to perform semantic matching or not
32 | biofast (str): biomedical fasttext model
33 | biobert (str): biomedical bert model
34 | str_match (bool): string matching
35 | gpu (int): use gpu when using BERT
36 | rules (str): hand-crafted rules file path
37 | dysplasia_mappings (str): dysplasia mappings file path
38 | cin_mappings (str): cin mappings file path
39 | OntoProc:
40 | ontology_path (str): ontology.owl file path
41 | hierarchies_path (str): hierarchy relations file path
42 | ReportProc:
43 | fields_path (str): report fields file path
44 |
45 | Returns: None
46 | """
47 |
48 | # load Named Entity Recognition and Disambiguation (NERD)
49 | self.nerd = NERD(biospacy, biow2v, str_match, biofast, biobert, rules, dysplasia_mappings, cin_mappings, gpu)
50 | # load Ontology Processing (OntoProc)
51 | self.onto_proc = OntoProc(ontology_path, hierarchies_path)
52 | # load Report Processing (ReportProc)
53 | self.rep_proc = ReportProc(src_lang, use_case, fields_path)
54 | # load RDF Processing (RDFProc)
55 | self.rdf_proc = RDFProc()
56 |
57 | # define set of ad hoc labeling operations @smarchesin TODO: add 'custom' to lung too if required
58 | self.ad_hoc_exa_labeling = {
59 | 'aoec': {
60 | 'colon': {
61 | 'original': utils.aoec_colon_concepts2labels,
62 | 'custom': utils.aoec_colon_labels2binary},
63 | 'cervix': {
64 | 'original': utils.aoec_cervix_concepts2labels,
65 | 'custom': utils.aoec_cervix_labels2aggregates},
66 | 'lung': {
67 | 'original': utils.aoec_lung_concepts2labels},
68 | 'celiac': {
69 | 'original': utils.aoec_celiac_concepts2labels}
70 | },
71 | 'radboud': {
72 | 'colon': {
73 | 'original': utils.radboud_colon_concepts2labels,
74 | 'custom': utils.radboud_colon_labels2binary},
75 | 'cervix': {
76 | 'original': utils.radboud_cervix_concepts2labels,
77 | 'custom': utils.radboud_cervix_labels2aggregates},
78 | 'celiac': {
79 | 'original': utils.radboud_celiac_concepts2labels}
80 | }
81 | }
82 |
83 | self.ad_hoc_med_labeling = {
84 | 'colon': {
85 | 'original': utils.colon_concepts2labels,
86 | 'custom': utils.colon_labels2binary
87 | },
88 | 'cervix': {
89 | 'original': utils.cervix_concepts2labels,
90 | 'custom': utils.cervix_labels2aggregates
91 | },
92 | 'lung': {
93 | 'original': utils.lung_concepts2labels
94 | },
95 | 'celiac': {
96 | 'original': utils.celiac_concepts2labels
97 | }
98 | }
99 |
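The nested dicts above act as dispatch tables: labeling is selected by hospital, use case, and mode ('original' vs 'custom') and then applied as a plain function call. A minimal sketch of the lookup pattern, with a stub labeler in place of the real `utils.*` functions:

```python
def dispatch_labeling_sketch(table, use_case, mode, concepts):
    """Look up and apply a labeling function from a nested dispatch table."""
    labelers = table.get(use_case, {})
    if mode not in labelers:
        raise KeyError('no "{}" labeler for use case "{}"'.format(mode, use_case))
    return labelers[mode](concepts)

# example table with a stub labeler (real entries are utils.* functions)
table = {'colon': {'original': lambda concepts: ['label:' + c for c in concepts]}}
# dispatch_labeling_sketch(table, 'colon', 'original', ['adenoma']) -> ['label:adenoma']
```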
100 | # set use case
101 | self.use_case = use_case
102 | # restrict hand-crafted rules and mappings based on use case
103 | self.nerd.restrict2use_case(use_case)
104 | # restrict onto concepts to the given use case
105 | self.onto = self.onto_proc.restrict2use_case(use_case)
106 | # restrict concept preferred terms (i.e., labels) given the use case
107 | self.onto_terms = self.nerd.process_ontology_concepts([term.lower() for term in self.onto['label'].tolist()])
108 |
109 | def update_nerd(
110 | self,
111 | biospacy="en_core_sci_lg", biofast=None, biobert=None, str_match=False, rules=None, dysplasia_mappings=None, cin_mappings=None, gpu=None):
112 | """
113 | Update NERD model w/ input parameters
114 |
115 | Params:
116 | biospacy (str): full spaCy pipeline for biomedical data
117 | biofast (str): biomedical fasttext model
118 | biobert (str): biomedical bert model
119 | str_match (bool): string matching
120 | rules (str): hand-crafted rules file path
121 | dysplasia_mappings (str): dysplasia mappings file path
122 | cin_mappings (str): cin mappings file path
123 | gpu (int): use gpu when using BERT
124 |
125 | Returns: None
126 | """
127 |
128 |         # update nerd model -- argument order follows the NERD(...) call in __init__, with biow2v left at its default (True)
129 |         self.nerd = NERD(biospacy, True, str_match, biofast, biobert, rules, dysplasia_mappings, cin_mappings, gpu)
130 | # restrict hand-crafted rules and mappings based on current use case
131 | self.nerd.restrict2use_case(self.use_case)
132 |
133 | def update_usecase(self, use_case):
134 | """
135 | Update use case and dependent functions
136 |
137 | Params:
138 | use_case (str): considered use case
139 |
140 | Returns: None
141 | """
142 |
143 | if use_case not in ['colon', 'cervix', 'lung', 'celiac']: # raise exception
144 |             print('currently supported use cases are: "colon", "cervix", "lung" and "celiac"')
145 | raise Exception
146 | # set use case
147 | self.use_case = use_case
148 | # update report processing
149 | self.rep_proc.update_usecase(self.use_case)
150 | # restrict hand-crafted rules and mappings based on use case
151 | self.nerd.restrict2use_case(use_case)
152 | # restrict onto concepts to the given use case
153 | self.onto = self.onto_proc.restrict2use_case(use_case)
154 | # restrict concept preferred terms (i.e., labels) given the use case
155 | self.onto_terms = self.nerd.process_ontology_concepts([term.lower() for term in self.onto['label'].tolist()])
156 |
157 | def update_nmt(self, src_lang):
158 | """
159 | Update NMT model changing source language
160 |
161 | Params:
162 | src_lang (str): considered source language
163 |
164 | Returns: None
165 | """
166 |
167 | # update NMT model
168 | self.rep_proc.update_nmt(src_lang)
169 |
170 | def update_report_fields(self, fields):
171 | """
172 | Update report fields changing current ones
173 |
174 | Params:
175 | fields (list): report fields
176 |
177 | Returns: None
178 | """
179 |
180 | # update report fields
181 | self.rep_proc.fields = fields
182 |
183 | @staticmethod
184 | def store_reports(reports, r_path):
185 | """
186 | Store reports
187 |
188 | Params:
189 | reports (dict): reports
190 | r_path (str): reports file path
191 |
192 | Returns: None
193 | """
194 |
195 | with open(r_path, 'w') as out:
196 | json.dump(reports, out, indent=4)
197 |
198 | @staticmethod
199 | def load_reports(r_fpath):
200 | """
201 | Load reports
202 |
203 | Params:
204 | r_fpath (str): reports file path
205 |
206 | Returns: reports
207 | """
208 |
209 | with open(r_fpath, 'r') as rfp:
210 | reports = json.load(rfp)
211 | return reports
212 |
213 | @staticmethod
214 | def store_concepts(concepts, c_fpath):
215 | """
216 | Store extracted concepts as JSON dict
217 |
218 | Params:
219 | concepts (dict): dict containing concepts extracted from reports
220 | c_fpath (str): concepts file path
221 |
222 | Returns: None
223 | """
224 |
225 | utils.store_concepts(concepts, c_fpath)
226 |
227 | @staticmethod
228 | def store_labels(labels, l_fpath):
229 | """
230 | Store mapped labels as JSON dict
231 |
232 | Params:
233 | labels (dict): dict containing labels mapped from extracted concepts
234 | l_fpath (str): labels file path
235 |
236 | Returns: None
237 | """
238 |
239 | utils.store_labels(labels, l_fpath)
240 |
241 | def store_rdf_graphs(self, graphs, g_fpath, rdf_format='turtle'):
242 | """
243 | Store RDF graphs w/ RDF serialization format
244 |
245 | Params:
246 | graphs (list): list containing (s,p,o) triples representing ExaMode report(s)
247 | g_fpath (str): graphs file path
248 | rdf_format (str): RDF format used to serialize graphs
249 |
250 | Returns: serialized report graph when g_fpath == 'stream' or boolean when g_fpath != 'stream'
251 | """
252 |
253 | if rdf_format not in ['turtle', 'n3', 'trig']: # raise exception
254 | print('provide correct format: "turtle", "n3", or "trig".')
255 | raise Exception
256 |
257 | if g_fpath != 'stream': # check that file type and rdf format coincide
258 | ftype = g_fpath.split('.')[-1]
259 | ftype = 'turtle' if ftype == 'ttl' else ftype
260 | assert ftype == rdf_format
261 |
262 | return self.rdf_proc.serialize_report_graphs(graphs, output=g_fpath, rdf_format=rdf_format)
263 |
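The format/extension consistency check in `store_rdf_graphs` is worth isolating, since the `.ttl` extension maps to the `'turtle'` format name rather than matching it literally. A standalone sketch of that check (the helper name is ours):

```python
def check_graph_path(g_fpath, rdf_format):
    """Mirror the file-type/format consistency check above ('.ttl' maps to 'turtle')."""
    if rdf_format not in ['turtle', 'n3', 'trig']:
        raise ValueError('provide correct format: "turtle", "n3", or "trig".')
    if g_fpath != 'stream':  # check that file type and rdf format coincide
        ftype = g_fpath.split('.')[-1]
        ftype = 'turtle' if ftype == 'ttl' else ftype
        assert ftype == rdf_format
    return True

# check_graph_path('graphs/report.ttl', 'turtle') -> True
# check_graph_path('stream', 'n3') -> True  (no extension to check)
```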
264 | @staticmethod
265 | def store_json_graphs(graphs, g_fpath):
266 | """
267 | Store RDF graphs w/ JSON serialization format
268 |
269 | Params:
270 | graphs (dict): dict containing (s,p,o) triples representing ExaMode report(s)
271 | g_fpath (str): graphs file path
272 |
273 | Returns: None
274 | """
275 |
276 | os.makedirs(os.path.dirname(g_fpath), exist_ok=True)
277 |
278 | with open(g_fpath, 'w') as out:
279 | json.dump(graphs, out, indent=4)
280 |
281 | # EXAMODE RELATED FUNCTIONS
282 |
283 | def prepare_exa_dataset(self, ds_fpath, sheet, header, hospital, ver, ds_name=None, debug=False):
284 | """
285 | Prepare ExaMode batch data to perform NERD
286 |
287 | Params:
288 | ds_fpath (str): examode dataset file path
289 | sheet (str): name of the excel sheet to use
290 | header (int): row index used as header
291 | hospital (str): considered hospital
292 | ver (int): data format version
294 | ds_name (str): dataset name
295 | debug (bool): whether to keep flags for debugging
296 |
297 | Returns: translated, split, and prepared dataset
298 | """
299 |
300 | # get dataset name from file path if not provided
301 | if not ds_name:
302 | ds_name = ds_fpath.split('/')[-1].split('.')[0] # ./dataset/raw/aoec/####.csv
303 | # set output directories
304 | proc_out = './dataset/processed/' + hospital + '/' + self.use_case + '/'
305 | trans_out = './dataset/translated/' + hospital + '/' + self.use_case + '/'
306 |
307 | if os.path.isfile(trans_out + ds_name + '.json'): # translated reports file already exists
308 |             print('translated reports file already exists -- remove it before running "exa_pipeline" to reprocess it')
309 | trans_reports = self.load_reports(trans_out + ds_name + '.json')
310 | return trans_reports
311 | elif os.path.isfile(proc_out + ds_name + '.json'): # processed reports file already exists
312 |             print('processed reports file already exists -- remove it before running "exa_pipeline" to reprocess it')
313 | proc_reports = self.load_reports(proc_out + ds_name + '.json')
314 | if hospital == 'aoec':
315 | # translate reports
316 | trans_reports = self.rep_proc.aoec_translate_reports(proc_reports)
317 | elif hospital == 'radboud':
318 | if self.use_case == 'celiac':
319 | # translate celiac reports
320 | trans_reports = self.rep_proc.radboud_translate_celiac_reports(proc_reports)
321 | else:
322 | # translate reports
323 | trans_reports = self.rep_proc.radboud_translate_reports(proc_reports)
324 | else: # raise exception
325 | print('provide correct hospital info: "aoec" or "radboud"')
326 | raise Exception
327 |
328 | if not os.path.exists(trans_out): # dir not exists -- make it
329 | os.makedirs(trans_out)
330 | # store translated reports
331 | self.store_reports(trans_reports, trans_out + ds_name + '.json')
332 |
333 | return trans_reports
334 | else: # neither processed nor translated reports files exist
335 | # load dataset
336 | dataset = self.rep_proc.load_dataset(ds_fpath, sheet, header)
337 |
338 | if hospital == 'aoec':
339 | if ver == 1: # process data using method v1
340 | proc_reports = self.rep_proc.aoec_process_data(dataset)
341 | else: # process data using method v2
342 | proc_reports = self.rep_proc.aoec_process_data_v2(dataset, debug=debug)
343 |
344 | # translate reports
345 | trans_reports = self.rep_proc.aoec_translate_reports(proc_reports)
346 | elif hospital == 'radboud':
347 | if self.use_case == 'celiac':
348 | proc_reports = self.rep_proc.radboud_process_celiac_data(dataset)
349 | elif ver == 1: # process data using method v1
350 | proc_reports = self.rep_proc.radboud_process_data(dataset, debug=debug)
351 | else: # process data using method v2
352 | proc_reports = self.rep_proc.radboud_process_data_v2(dataset)
353 | if self.use_case == 'celiac':
354 | # translate reports
355 | trans_reports = self.rep_proc.radboud_translate_celiac_reports(proc_reports)
356 | else:
357 | # translate reports
358 | trans_reports = self.rep_proc.radboud_translate_reports(proc_reports)
359 | else: # raise exception
360 | print('provide correct hospital info: "aoec" or "radboud"')
361 | raise Exception
362 |
363 | if not os.path.exists(proc_out): # dir not exists -- make it
364 | os.makedirs(proc_out)
365 | # store processed reports
366 | self.store_reports(proc_reports, proc_out + ds_name + '.json')
367 | if not os.path.exists(trans_out): # dir not exists -- make it
368 | os.makedirs(trans_out)
369 | # store translated reports
370 | self.store_reports(trans_reports, trans_out + ds_name + '.json')
371 |
372 | return trans_reports
373 |
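The dataset name and hospital are derived purely from the file path convention (`./dataset/raw/<hospital>/<name>.<ext>`) via string splits. A minimal, self-contained sketch of that convention (the helper name and path below are illustrative, not part of the codebase):

```python
def split_dataset_path(ds_fpath):
    # mirrors the convention used by prepare_exa_dataset/exa_pipeline:
    # ./dataset/raw/<hospital>/<name>.<ext>
    parts = ds_fpath.split('/')
    ds_name = parts[-1].split('.')[0]  # file name without extension
    hospital = parts[-2]               # parent directory
    return ds_name, hospital

name, hospital = split_dataset_path('./dataset/raw/aoec/batch_2020.csv')
```

Note that `split('.')[0]` truncates at the first dot, so dataset file names containing dots would be cut short; `os.path.splitext`/`os.path.basename` would be the more robust alternative.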
374 | def exa_entity_linking(self, reports, hospital, sim_thr=0.7, raw=False, debug=False):
375 | """
376 | Perform entity linking based on ExaMode reports structure and data
377 |
378 | Params:
379 | reports (dict): dict containing reports -- can be either one or many
380 | hospital (str): considered hospital
381 | sim_thr (float): keep candidates with sim score greater than or equal to sim_thr
382 | raw (bool): whether to return concepts within semantic areas or mentions+concepts
383 | debug (bool): whether to keep flags for debugging
384 |
385 | Returns: a dict containing concepts from input reports
386 | """
387 |
388 | # perform entity linking
389 | if hospital == 'aoec': # AOEC data
390 | concepts = self.nerd.aoec_entity_linking(reports, self.onto_proc, self.onto, self.onto_terms, self.use_case, sim_thr, raw, debug=debug)
391 | elif hospital == 'radboud': # Radboud data
392 | concepts = self.nerd.radboud_entity_linking(reports, self.onto, self.onto_terms, self.use_case, sim_thr, raw, debug=debug)
393 | else: # raise exception
394 | print('provide correct hospital info: "aoec" or "radboud"')
395 | raise Exception
396 | return concepts
397 |
398 | def exa_labeling(self, concepts, hospital):
399 | """
400 | Map extracted concepts to pre-defined labels
401 |
402 | Params:
403 | concepts (dict): dict containing concepts extracted from report(s)
404 | hospital (str): considered hospital
405 |
406 | Returns: a dict containing labels from input report(s)
407 | """
408 |
409 | if hospital not in ['aoec', 'radboud']:
410 | print('provide correct hospital info: "aoec" or "radboud"')
411 | raise Exception
412 | labels = self.ad_hoc_exa_labeling[hospital][self.use_case]['original'](concepts)
413 | return labels
414 |
415 | def create_exa_graphs(self, reports, concepts, hospital, struct=False, debug=False):
416 | """
417 | Create report graphs in RDF format
418 |
419 | Params:
420 | reports (dict): dict containing reports -- can be either one or many
421 | concepts (dict): dict containing concepts extracted from report(s)
422 | hospital (str): considered hospital
423 | struct (bool): whether to return graphs structured as dict
424 | debug (bool): whether to keep flags for debugging
425 |
426 |         Returns: a list of RDF report graphs, plus a list of dict-structured graphs when struct == True
427 | """
428 |
429 | if hospital == 'aoec': # AOEC data
430 | create_graph = self.rdf_proc.aoec_create_graph
431 | elif hospital == 'radboud': # Radboud data
432 | create_graph = self.rdf_proc.radboud_create_graph
433 | else: # raise exception
434 | print('provide correct hospital info: "aoec" or "radboud"')
435 | raise Exception
436 |
437 | rdf_graphs = []
438 | struct_graphs = []
439 | # convert report data into (s,p,o) triples
440 | for rid in reports.keys():
441 | rdf_graph, struct_graph = create_graph(rid, reports[rid], concepts[rid], self.onto_proc, self.use_case, debug=debug)
442 | rdf_graphs.append(rdf_graph)
443 | struct_graphs.append(struct_graph)
444 | if struct: # return both rdf and dict graphs
445 | return rdf_graphs, struct_graphs
446 | else:
447 | return rdf_graphs
448 |
449 | def exa_pipeline(self, ds_fpath, sheet, header, ver, use_case=None, hosp=None, sim_thr=0.7, raw=False, debug=False):
450 | """
451 | Perform the complete SKET pipeline over ExaMode data:
452 | - (i) Load dataset
453 | - (ii) Process dataset
454 | - (iii) Translate dataset
455 | - (iv) Perform entity linking and store concepts
456 | - (v) Perform labeling and store labels
457 | - (vi) Create RDF graphs and store graphs
458 |
459 | Params:
460 | ds_fpath (str): dataset file path
461 | sheet (str): name of the excel sheet to use
462 | header (int): row index used as header
463 | ver (int): data format version
464 | use_case (str): considered use case
465 | hosp (str): considered hospital
466 | sim_thr (float): keep candidates with sim score greater than or equal to sim_thr
467 | raw (bool): whether to return concepts within semantic areas or mentions+concepts
468 |             debug (bool): whether to keep flags for debugging
469 |
470 |         Returns: concepts when raw == True, otherwise None
471 | """
472 |
473 | if use_case: # update to input use case
474 | self.update_usecase(use_case)
475 |
476 | # get dataset name
477 | ds_name = ds_fpath.split('/')[-1].split('.')[0] # ./dataset/raw/aoec/####.csv
478 |
479 | if hosp: # update to input hospital
480 | if hosp not in ['aoec', 'radboud']:
481 | print('provide correct hospital info: "aoec" or "radboud"')
482 | raise Exception
483 | else:
484 | hospital = hosp
485 | else:
486 | # get hospital name
487 | hospital = ds_fpath.split('/')[-2] # ./dataset/raw/ --> aoec <-- /####.csv
488 |
489 | # set output directories
490 | if raw: # return mentions+concepts (used for EXATAG)
491 | concepts_out = './outputs/concepts/raw/' + hospital + '/' + self.use_case + '/'
492 | else: # perform complete pipeline (used for SKET/CERT/EXANET)
493 | concepts_out = './outputs/concepts/refined/' + hospital + '/' + self.use_case + '/'
494 | labels_out = './outputs/labels/' + hospital + '/' + self.use_case + '/'
495 | rdf_graphs_out = './outputs/graphs/rdf/' + hospital + '/' + self.use_case + '/'
496 | struct_graphs_out = './outputs/graphs/json/' + hospital + '/' + self.use_case + '/'
497 |
498 | # prepare dataset
499 | reports = self.prepare_exa_dataset(ds_fpath, sheet, header, hospital, ver, ds_name, debug=debug)
500 |
501 | # perform entity linking
502 | concepts = self.exa_entity_linking(reports, hospital, sim_thr, raw, debug=debug)
503 | # store concepts
504 | self.store_concepts(concepts, concepts_out + 'concepts_' + ds_name + '.json')
505 | if raw: # return mentions+concepts
506 | return concepts
507 |
508 | # perform labeling
509 | labels = self.exa_labeling(concepts, hospital)
510 | # store labels
511 | self.store_labels(labels, labels_out + 'labels_' + ds_name + '.json')
512 | # create RDF graphs
513 | rdf_graphs, struct_graphs = self.create_exa_graphs(reports, concepts, hospital, struct=True, debug=debug)
514 | # store RDF graphs
515 | self.store_rdf_graphs(rdf_graphs, rdf_graphs_out + 'graphs_' + ds_name + '.n3', 'n3')
516 | self.store_rdf_graphs(rdf_graphs, rdf_graphs_out + 'graphs_' + ds_name + '.trig', 'trig')
517 | self.store_rdf_graphs(rdf_graphs, rdf_graphs_out + 'graphs_' + ds_name + '.ttl', 'turtle')
518 | # store JSON graphs
519 | self.store_json_graphs(struct_graphs, struct_graphs_out + 'graphs_' + ds_name + '.json')
520 |
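The output layout used above follows a fixed scheme under `./outputs/`, with the concepts path switching on `raw`. A small sketch of that path logic (hospital and use-case values are illustrative):

```python
def concepts_dir(hospital, use_case, raw):
    # raw mentions+concepts go under concepts/raw, refined ones under concepts/refined
    kind = 'raw' if raw else 'refined'
    return './outputs/concepts/' + kind + '/' + hospital + '/' + use_case + '/'

path = concepts_dir('aoec', 'colon', raw=False)
```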
521 | # GENERAL-PURPOSE FUNCTIONS
522 |
523 | def prepare_med_dataset(self, ds, ds_name, src_lang=None, store=False, debug=False):
524 | """
525 | Prepare dataset to perform NERD
526 |
527 | Params:
528 | ds (dict): dataset
529 | ds_name (str): dataset name
530 | src_lang (str): considered language
531 | store (bool): whether to store concepts, labels, and RDF graphs
532 | debug (bool): whether to keep flags for debugging
533 |
534 | Returns: translated, split, and prepared dataset
535 | """
536 |
537 | # set output directories
538 | proc_out = './dataset/processed/' + self.use_case + '/'
539 | trans_out = './dataset/translated/' + self.use_case + '/'
540 |
541 | # process reports
542 | proc_reports = self.rep_proc.process_data(ds, debug=debug)
543 | if store: # store processed reports
544 | os.makedirs(proc_out, exist_ok=True)
545 | self.store_reports(proc_reports, proc_out + ds_name + '.json')
546 |
547 | if src_lang != 'en': # translate reports
548 | trans_reports = self.rep_proc.translate_reports(proc_reports)
549 | else: # keep processed reports
550 | trans_reports = proc_reports
551 | if store: # store translated reports
552 | os.makedirs(trans_out, exist_ok=True)
553 | self.store_reports(trans_reports, trans_out + ds_name + '.json')
554 |
555 | return trans_reports
556 |
557 | def med_entity_linking(self, reports, sim_thr=0.7, raw=False, debug=False):
558 | """
559 | Perform entity linking on input reports
560 |
561 | Params:
562 | reports (dict): dict containing reports -- can be either one or many
563 | sim_thr (float): keep candidates with sim score greater than or equal to sim_thr
564 | raw (bool): whether to return concepts within semantic areas or mentions+concepts
565 | debug (bool): whether to keep flags for debugging
566 |
567 | Returns: a dict containing concepts from input reports
568 | """
569 |
570 | # perform entity linking
571 | concepts = self.nerd.entity_linking(reports, self.onto, self.onto_terms, self.use_case, sim_thr, raw, debug=debug)
572 |
573 | return concepts
574 |
575 | def med_labeling(self, concepts):
576 | """
577 | Map extracted concepts to pre-defined labels
578 |
579 | Params:
580 | concepts (dict): dict containing concepts extracted from report(s)
581 |
582 | Returns: a dict containing labels from input report(s)
583 | """
584 |
585 | labels = self.ad_hoc_med_labeling[self.use_case]['original'](concepts)
586 | return labels
587 |
588 | def create_med_graphs(self, reports, concepts, struct=False, debug=False):
589 | """
590 | Create report graphs in RDF format
591 |
592 | Params:
593 | reports (dict): dict containing reports -- can be either one or many
594 | concepts (dict): dict containing concepts extracted from report(s)
595 | struct (bool): whether to return graphs structured as dict
596 | debug (bool): whether to keep flags for debugging
597 |
598 |         Returns: a list of RDF report graphs, plus a list of dict-structured graphs when struct == True
599 | """
600 |
601 | rdf_graphs = []
602 | struct_graphs = []
603 | # convert report data into (s,p,o) triples
604 | for rid in reports.keys():
605 | rdf_graph, struct_graph = self.rdf_proc.create_graph(rid, reports[rid], concepts[rid], self.onto_proc, self.use_case, debug=debug)
606 | rdf_graphs.append(rdf_graph)
607 | struct_graphs.append(struct_graph)
608 | if struct: # return both rdf and dict graphs
609 | return rdf_graphs, struct_graphs
610 | else:
611 | return rdf_graphs
612 |
613 | def med_pipeline(self, ds, preprocess, src_lang=None, use_case=None, sim_thr=0.7, store=False, rdf_format='all', raw=False, debug=False):
614 | """
615 | Perform the complete SKET pipeline over generic data:
616 | - (i) Process dataset
617 | - (ii) Translate dataset
618 | - (iii) Perform entity linking (and store concepts)
619 | - (iv) Perform labeling (and store labels)
620 | - (v) Create RDF graphs (and store graphs)
621 | - (vi) Return concepts, labels, and RDF graphs
622 |
623 | When raw == True: perform steps i-iii and return mentions+concepts
624 | When store == True: store concepts, labels, and RDF graphs
625 |
626 | Params:
627 | ds (dict): dataset
628 |             preprocess (bool): whether to preprocess data
629 | src_lang (str): considered language
630 | use_case (str): considered use case
632 | sim_thr (float): keep candidates with sim score greater than or equal to sim_thr
633 | store (bool): whether to store concepts, labels, and RDF graphs
634 | rdf_format (str): RDF format used to serialize graphs
635 | raw (bool): whether to return concepts within semantic areas or mentions+concepts
636 | debug (bool): whether to keep flags for debugging
637 |
638 |         Returns: concepts when raw == True; None when store == True; (concepts, labels, rdf_graphs) otherwise
639 | """
640 |
641 | if use_case: # update to input use case
642 | self.update_usecase(use_case)
643 |
644 | if src_lang: # update to input source language
645 | self.update_nmt(src_lang)
646 |
647 | # set output directories
648 | if raw: # return mentions+concepts (used for EXATAG)
649 | concepts_out = './outputs/concepts/raw/' + self.use_case + '/'
650 | else: # perform complete pipeline (used for SKET/CERT/EXANET)
651 | concepts_out = './outputs/concepts/refined/' + self.use_case + '/'
652 | labels_out = './outputs/labels/' + self.use_case + '/'
653 | rdf_graphs_out = './outputs/graphs/rdf/' + self.use_case + '/'
654 | struct_graphs_out = './outputs/graphs/json/' + self.use_case + '/'
655 |
656 | if preprocess:
657 |             if isinstance(ds, str):
658 | # set dataset name as input file name
659 | ds_name = ds.split('/')[-1].split('.')[0]
660 | else:
661 | # set dataset name with random identifier
662 | ds_name = str(uuid.uuid4())
663 | # prepare dataset
664 | reports = self.prepare_med_dataset(ds, ds_name, src_lang, store, debug=debug)
665 | else:
666 |             if isinstance(ds, str) and ds.split('.')[-1] == 'json':
667 | # set dataset name as input file name
668 | ds_name = ds.split('/')[-1].split('.')[0]
669 | reports = self.load_reports(ds)
670 | else: # raise exception
671 |             print('when preprocess is False, input must be a path to a JSON reports file')
672 | raise Exception
673 |
674 | # perform entity linking
675 | concepts = self.med_entity_linking(reports, sim_thr, raw, debug=debug)
676 | if store: # store concepts
677 | self.store_concepts(concepts, concepts_out + 'concepts_' + ds_name + '.json')
678 | if raw: # return mentions+concepts
679 | return concepts
680 |
681 | # perform labeling
682 | labels = self.med_labeling(concepts)
683 | if store: # store labels
684 | self.store_labels(labels, labels_out + 'labels_' + ds_name + '.json')
685 | # create RDF graphs
686 | rdf_graphs, struct_graphs = self.create_med_graphs(reports, concepts, struct=True, debug=debug)
687 | if store: # store graphs
688 | # RDF graphs
689 | if rdf_format in ['all', 'n3']:
690 | self.store_rdf_graphs(rdf_graphs, rdf_graphs_out + 'graphs_' + ds_name + '.n3', 'n3')
691 | if rdf_format in ['all', 'trig']:
692 | self.store_rdf_graphs(rdf_graphs, rdf_graphs_out + 'graphs_' + ds_name + '.trig', 'trig')
693 | if rdf_format in ['all', 'turtle']:
694 | self.store_rdf_graphs(rdf_graphs, rdf_graphs_out + 'graphs_' + ds_name + '.ttl', 'turtle')
695 | # JSON graphs
696 | self.store_json_graphs(struct_graphs, struct_graphs_out + 'graphs_' + ds_name + '.json')
697 | else: # return serialized graphs as stream
698 | if rdf_format == 'all':
699 | print('"all" is not supported for standard (stream) output.\nSupported RDF serialization formats for stream output are: "n3", "trig", and "turtle".')
700 | raise Exception
701 | else:
702 | rdf_graphs = self.store_rdf_graphs(rdf_graphs, 'stream', rdf_format)
703 | return concepts, labels, rdf_graphs
704 |
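The `rdf_format` dispatch above boils down to a format-to-extension map, where `'all'` expands to every supported serialization when storing (and is rejected for stream output). A condensed sketch of that dispatch; the `ValueError` for unknown formats is an addition for illustration, not behavior of `med_pipeline` itself:

```python
RDF_EXT = {'n3': '.n3', 'trig': '.trig', 'turtle': '.ttl'}

def formats_to_store(rdf_format):
    # 'all' means every supported serialization; otherwise a single one
    if rdf_format == 'all':
        return list(RDF_EXT)
    if rdf_format not in RDF_EXT:
        raise ValueError('unsupported RDF format: ' + rdf_format)
    return [rdf_format]

exts = [RDF_EXT[f] for f in formats_to_store('all')]
```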
--------------------------------------------------------------------------------
/sket/utils/__init__.py:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/sket/utils/utils.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 |
4 |
5 | def assign_gpu(tknz_out, gpu):
6 | """
7 | Assign tokenizer tensors to GPU(s)
8 |
9 | Params:
10 | tknz_out (dict): dict containing tokenizer tensors within CPU
11 | gpu (int): gpu device
12 |
13 | Returns: the dict containing tokenizer tensors within GPU(s)
14 | """
15 |
16 | if type(gpu) == int:
17 | device = 'cuda:' + str(gpu)
18 | else:
19 | device = 'cpu'
20 | tokens_tensor = tknz_out['input_ids'].to(device)
21 | token_type_ids = tknz_out['token_type_ids'].to(device)
22 | attention_mask = tknz_out['attention_mask'].to(device)
23 | # assign GPU(s) tokenizer tensors to output dict
24 | output = {
25 | 'input_ids': tokens_tensor,
26 | 'token_type_ids': token_type_ids,
27 | 'attention_mask': attention_mask
28 | }
29 | return output
30 |
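`assign_gpu` only moves tensors when `gpu` is an `int`; any other value falls back to CPU. The device-selection rule in isolation, without the torch dependency (`resolve_device` is a hypothetical helper name):

```python
def resolve_device(gpu):
    # mirrors assign_gpu: an int index selects a CUDA device, anything else means CPU
    # (type(...) is int deliberately excludes bool, matching the original check)
    if type(gpu) is int:
        return 'cuda:' + str(gpu)
    return 'cpu'
```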
31 |
32 | def en_sanitize_record(record, use_case): # @smarchesin TODO: define sanitize use-case functions that read replacements from file
33 | """
34 | Sanitize record to avoid translation errors
35 |
36 | Params:
37 | record (str): target record
38 |
39 | Returns: the sanitized record
40 | """
41 |
42 | if record:
43 | if use_case == 'colon':
44 | record = record.replace('octopus', 'polyp')
45 | record = record.replace('hairy', 'villous')
46 | record = record.replace('villous adenoma-tubule', 'tubulo-villous adenoma')
47 | record = record.replace('villous adenomas-tubule', 'tubulo-villous adenoma')
48 | record = record.replace('villous adenomas tubule', 'tubulo-villous adenoma')
49 | record = record.replace('tubule adenoma-villous', 'tubulo-villous adenoma')
50 | record = record.replace('tubular adenoma-villous', 'tubulo-villous adenoma')
51 | record = record.replace('villous adenoma tubule-', 'tubulo-villous adenoma ')
52 | record = record.replace('villous adenoma tubule', 'tubulo-villous adenoma')
53 | record = record.replace('tubulovilloso adenoma', 'tubulo-villous adenoma')
54 | record = record.replace('blind', 'caecum')
55 | record = record.replace('cecal', 'caecum')
56 | record = record.replace('rectal', 'rectum')
57 | record = record.replace('sigma', 'sigmoid')
58 | record = record.replace('hyperplasia', 'hyperplastic') # MarianMT translates 'iperplastico' as 'hyperplasia' instead of 'hyperplastic'
59 | record = record.replace('proximal colon', 'right colon')
60 | if use_case == 'cervix':
61 | record = record.replace('octopus', 'polyp')
62 | record = record.replace('his cassock', 'lamina propria')
63 | record = record.replace('tunica propria', 'lamina propria')
64 | record = record.replace('l-sil', 'lsil')
65 | record = record.replace('h-sil', 'hsil')
66 | record = record.replace('cin ii / iii', 'cin23')
67 | record = record.replace('cin iii', 'cin3')
68 | record = record.replace('cin ii', 'cin2')
69 | record = record.replace('cin i', 'cin1')
70 | record = record.replace('cin-iii', 'cin3')
71 | record = record.replace('cin-ii', 'cin2')
72 | record = record.replace('cin-i', 'cin1')
73 | record = record.replace('cin1-2', 'cin1 cin2')
74 | record = record.replace('cin2-3', 'cin2 cin3')
75 | record = record.replace('cin-1', 'cin1')
76 | record = record.replace('cin-2', 'cin2')
77 | record = record.replace('cin-3', 'cin3')
78 | record = record.replace('cin 2 / 3', 'cin23')
79 | record = record.replace('cin 2/3', 'cin23')
80 | record = record.replace('cin 1-2', 'cin1 cin2')
81 | record = record.replace('cin 2-3', 'cin2 cin3')
82 | record = record.replace('cin 1', 'cin1')
83 | record = record.replace('cin 2', 'cin2')
84 | record = record.replace('cin 3', 'cin3')
85 | record = record.replace('ii-iii cin', 'cin2 cin3')
86 | record = record.replace('i-ii cin', 'cin1 cin2')
87 | record = record.replace('iii cin', 'cin3')
88 | record = record.replace('ii cin', 'cin2')
89 | record = record.replace('i cin', 'cin1')
90 | record = record.replace('port biopsy', 'portio biopsy')
91 | if use_case == 'celiac':
92 | record = record.replace('villas', 'villi')
93 | record = record.replace('duodonitis', 'duodenitis')
94 | record = record.replace('duodoneitis', 'duodenitis')
95 | record = record.replace('duodonia', 'duodenitis')
96 | record = record.replace('duedenitis', 'duodenitis')
97 | record = record.replace('mucosae', 'mucosa')
98 | record = record.replace('mucous', 'mucosa')
99 | record = record.replace('oedema', 'edema')
100 | record = record.replace('leucocyte', 'leukocyte')
101 | record = record.replace('granulocytes', 'granulocyte')
102 | record = record.replace('eosinophiles', 'eosinophil')
103 | record = record.replace('neutrophiles', 'neutrophil')
104 | record = record.replace('leukocytes', 'leukocyte')
105 | record = record.replace('lymphocytes', 'lymphocyte')
106 | record = record.replace('lymphocytosis', 'lymphocyte')
107 | record = record.replace('enterocytes', 'enterocyte')
108 | record = record.replace('vills', 'villi')
109 | record = record.replace('villous', 'villi')
110 | record = record.replace('villuse', 'villi')
111 | record = record.replace('villus', 'villi')
112 | record = record.replace('cryptes', 'crypts')
113 | record = record.replace('hyperaemia', 'hyperemia')
114 | record = record.replace('antro', 'antrum')
115 | record = record.replace('biopt', 'biopsy')
116 | record = record.replace('biopsys', 'biopsy')
117 | record = record.replace('geen afwijking', 'no abnormalities')
118 | record = record.replace('no deviation', 'no abnormalities')
119 | record = record.replace('no abnormality', 'no abnormalities')
120 | record = record.replace('bioptes', 'biopsy')
121 | record = record.replace('biopsie', 'biopsy')
122 | record = record.replace('duedenum', 'duodenum')
123 | record = record.replace('duodenium', 'duodenum')
124 | record = record.replace('biopsies', 'biopsy')
125 | record = record.replace('coeliac', 'celiac')
126 | record = record.replace('coeliakie', 'celiac disease')
127 | record = record.replace('ontsteking', 'inflammation')
128 | record = record.replace('anthrum', 'antrum')
129 | record = record.replace('corpusbiopts', 'corpus biopsy')
130 | record = record.replace('flokatrophy', 'villi atrophy')
131 | record = record.replace('flocatrophy', 'villi atrophy')
132 | record = record.replace('flake', 'villi')
133 | record = record.replace('bulbus duodeni', 'duodenal bulb')
134 | record = record.replace('eosinophilia', 'eosinophil')
135 | record = record.replace('theduodenum', 'the duodenum')
136 |
137 | return record
138 |
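The long chains of `str.replace` calls above are order-sensitive: more specific patterns (e.g. `'cin 2/3'`) must run before their substrings (e.g. `'cin 2'`), or the specific pattern can never match. A reduced demonstration using a few of the cervix patterns:

```python
def apply_replacements(text, pairs):
    # apply (pattern, replacement) pairs in order, as the sanitizers do
    for old, new in pairs:
        text = text.replace(old, new)
    return text

ordered = [('cin 2/3', 'cin23'), ('cin 2', 'cin2'), ('cin 3', 'cin3')]
good = apply_replacements('cin 2/3 lesion', ordered)       # 'cin23 lesion'
bad = apply_replacements('cin 2/3 lesion', ordered[::-1])  # 'cin2/3 lesion'
```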
139 |
140 | def nl_sanitize_record(record, use_case):
141 | """
142 | Sanitize record to avoid translation errors
143 | Params:
144 | record (str): target record
145 |         use_case (str): considered use case
146 |     Returns: the sanitized record
147 |     """
148 | if record:
149 | if use_case == 'cervix':
150 | record = record.replace('cin ii - iii', 'cin2 cin3')
151 | record = record.replace('cin ii-iii', 'cin2 cin3')
152 | record = record.replace('cin ii en iii', 'cin2 cin3')
153 | record = record.replace('cin i - iii', 'cin1 cin3')
154 | record = record.replace('cin i-iii', 'cin1 cin3')
155 | record = record.replace('cin i en iii', 'cin1 cin3')
156 | record = record.replace('cin i - ii', 'cin1 cin2')
157 | record = record.replace('cin i-ii', 'cin1 cin2')
158 | record = record.replace('cin i en ii', 'cin1 cin2')
159 | record = record.replace('cin ii / iii', 'cin23')
160 | record = record.replace('cin iii', 'cin3')
161 | record = record.replace('cin ii', 'cin2')
162 | record = record.replace('cin i', 'cin1')
163 | record = record.replace('cin-iii', 'cin3')
164 | record = record.replace('cin-ii', 'cin2')
165 | record = record.replace('cin-i', 'cin1')
166 | record = record.replace('ii-iii cin', 'cin2 cin3')
167 | record = record.replace('i-ii cin', 'cin1 cin2')
168 | record = record.replace('iii cin', 'cin3')
169 | record = record.replace('ii cin', 'cin2')
170 | record = record.replace('i cin', 'cin1')
171 | record = record.replace('kin ii - iii', 'kin2 kin3')
172 | record = record.replace('kin ii-iii', 'kin2 kin3')
173 | record = record.replace('kin ii en iii', 'kin2 kin3')
174 | record = record.replace('kin i - iii', 'kin1 kin3')
175 | record = record.replace('kin i-iii', 'kin1 kin3')
176 | record = record.replace('kin i en iii', 'kin1 kin3')
177 | record = record.replace('kin i - ii', 'kin1 kin2')
178 | record = record.replace('kin i-ii', 'kin1 kin2')
179 | record = record.replace('kin i en ii', 'kin1 kin2')
180 | record = record.replace('kin ii / iii', 'kin2 kin3')
181 | record = record.replace('kin iii', 'kin3')
182 | record = record.replace('kin ii', 'kin2')
183 | record = record.replace('kin i', 'kin1')
184 | record = record.replace('kin-iii', 'kin3')
185 | record = record.replace('kin-ii', 'kin2')
186 | record = record.replace('kin-i', 'kin1')
187 | record = record.replace('ii-iii kin', 'kin2 kin3')
188 | record = record.replace('i-ii kin', 'kin1 kin2')
189 | record = record.replace('iii kin', 'kin3')
190 | record = record.replace('ii kin', 'kin2')
191 | record = record.replace('i kin', 'kin1')
192 | return record
193 |
194 |
195 | def sanitize_code(code):
196 | """
197 | Sanitize code removing unnecessary characters
198 |
199 | Params:
200 | code (str): target code
201 |
202 | Returns: the sanitized code
203 | """
204 |
205 | if code:
206 | code = code.replace('-', '')
207 | code = code.ljust(7, '0')
208 | return code
209 |
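`sanitize_code` drops dashes and right-pads the result with zeros to seven characters. A self-contained restatement with a hypothetical code value:

```python
def sanitize_code(code):
    # same logic as above: drop dashes, right-pad with zeros to 7 chars
    if code:
        code = code.replace('-', '')
        code = code.ljust(7, '0')
    return code  # falsy input (None, '') is returned unchanged

out = sanitize_code('06-10')  # '0610' padded to '0610000'
```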
210 |
211 | def sanitize_codes(codes):
212 | """
213 |     Sanitize codes by splitting and removing unnecessary characters
214 |
215 | Params:
216 | codes (list): target codes
217 |
218 | Returns: the sanitized codes
219 | """
220 |
221 | codes = codes.split(';')
222 | codes = [sanitize_code(code) for code in codes]
223 | return codes
224 |
225 |
226 | def read_rules(rules):
227 | """
228 | Read rules stored within file
229 |
230 | Params:
231 | rules (str): path to rules file
232 |
233 |     Returns: a dict mapping, for each use-case, trigger -> (candidates, position, mode)
234 | """
235 |
236 | with open(rules, 'r') as file:
237 | lines = file.readlines()
238 |
239 | rules = {'colon': {}, 'cervix': {}, 'celiac': {}, 'lung': {}}
240 | for line in lines:
241 | trigger, candidates, position, mode, use_cases = line.strip().split('\t')
242 | use_cases = use_cases.split(',')
243 | for use_case in use_cases:
244 | rules[use_case][trigger] = (candidates.split(','), position, mode)
245 | return rules
246 |
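Each line of the rules file is tab-separated: trigger, comma-separated candidates, position, mode, and the comma-separated use cases the rule applies to. Parsing one line exactly as `read_rules` does (the field values below are invented for illustration, not taken from `rules.txt`):

```python
line = 'polyp\tadenoma,hyperplastic polyp\tPRE\tEXACT\tcolon,cervix\n'

rules = {'colon': {}, 'cervix': {}, 'celiac': {}, 'lung': {}}
trigger, candidates, position, mode, use_cases = line.strip().split('\t')
for use_case in use_cases.split(','):
    rules[use_case][trigger] = (candidates.split(','), position, mode)
```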
247 |
248 | def read_dysplasia_mappings(mappings):
249 | """
250 | Read dysplasia mappings stored within file
251 |
252 | Params:
253 | mappings (str): path to dysplasia mappings file
254 |
255 |     Returns: a dict of {trigger: [grades]} representing mappings for each use-case
256 | """
257 |
258 | with open(mappings, 'r') as file:
259 | lines = file.readlines()
260 |
261 | mappings = {'colon': {}, 'cervix': {}, 'celiac': {}, 'lung': {}}
262 | for line in lines:
263 | trigger, grade, use_cases = line.strip().split('\t')
264 | use_cases = use_cases.split(',')
265 | for use_case in use_cases:
266 | mappings[use_case][trigger] = grade.split(',')
267 | return mappings
268 |
269 |
270 | def read_cin_mappings(mappings):
271 | """
272 | Read cin mappings stored within file
273 |
274 | Params:
275 | mappings (str): path to cin mappings file
276 |
277 |     Returns: a dict of {trigger: grade} representing mappings for cervical intraepithelial neoplasia
278 | """
279 |
280 | with open(mappings, 'r') as file:
281 | lines = file.readlines()
282 |
283 | mappings = {}
284 | for line in lines:
285 | trigger, grade = line.strip().split('\t')
286 | mappings[trigger] = grade
287 | return mappings
288 |
289 |
290 | def read_hierarchies(hrels):
291 | """
292 | Read hierarchy relations stored within file
293 |
294 | Params:
295 | hrels (str): hierarchy relations file path
296 |
297 | Returns: the list of hierarchical relations
298 | """
299 |
300 | with open(hrels, 'r') as f:
301 | rels = f.readlines()
302 | return [rel.strip() for rel in rels]
303 |
304 |
305 | def read_report_fields(rfields):
306 | """
307 | Read considered report fields stored within file
308 |
309 | Params:
310 | rfields (str): report fields file path
311 |
312 | Returns: the list of report fields
313 | """
314 |
315 | with open(rfields, 'r') as f:
316 | fields = f.read().splitlines()
317 | return [field.strip() for field in fields if field]
318 |
319 |
320 | def store_concepts(concepts, out_path, indent=4, sort_keys=False):
321 | """
322 | Store report concepts
323 |
324 | Params:
325 | concepts (dict): report concepts
326 |         out_path (str): output file path
327 | indent (int): indentation level
328 | sort_keys (bool): sort keys
329 |
330 | Returns: True
331 | """
332 |
333 | os.makedirs(os.path.dirname(out_path), exist_ok=True)
334 |
335 | with open(out_path, 'w') as out:
336 | json.dump(concepts, out, indent=indent, sort_keys=sort_keys)
337 | return True
338 |
339 |
340 | def load_concepts(concept_fpath):
341 | """
342 | Load stored concepts
343 |
344 | Params:
345 | concept_fpath (str): file-path to stored concepts
346 |
347 | Returns: the dict containing the report (stored) concepts
348 | """
349 |
350 | with open(concept_fpath, 'r') as f:
351 | concepts = json.load(f)
352 | return concepts
353 |
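`store_concepts` and `load_concepts` are a plain JSON round trip. A self-contained check with a temporary directory and hypothetical concept data (note that JSON turns tuples into lists, which is why lists are used here):

```python
import json
import os
import tempfile

concepts = {'report_1': {'Diagnosis': [['iri:123', 'colon adenocarcinoma']]}}

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'concepts', 'concepts_demo.json')
    os.makedirs(os.path.dirname(out_path), exist_ok=True)  # as store_concepts does
    with open(out_path, 'w') as out:
        json.dump(concepts, out, indent=4)
    with open(out_path, 'r') as f:
        loaded = json.load(f)
```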
354 |
355 | def store_labels(labels, out_path, indent=4, sort_keys=False):
356 | """
357 | Store report labels
358 |
359 | Params:
360 | labels (dict): report labels
361 |         out_path (str): output file path
362 | indent (int): indentation level
363 | sort_keys (bool): sort keys
364 |
365 | Returns: True
366 | """
367 |
368 | os.makedirs(os.path.dirname(out_path), exist_ok=True)
369 |
370 | with open(out_path, 'w') as out:
371 | json.dump(labels, out, indent=indent, sort_keys=sort_keys)
372 | return True
373 |
374 |
375 | def load_labels(label_fpath):
376 | """
377 | Load stored labels
378 |
379 | Params:
380 | label_fpath (str): file-path to stored labels
381 |
382 | Returns: the dict containing the report (stored) labels
383 | """
384 |
385 | with open(label_fpath, 'r') as f:
386 | labels = json.load(f)
387 | return labels
388 |
389 |
390 | # AOEC RELATED FUNCTIONS
391 |
392 | def aoec_colon_concepts2labels(report_concepts):
393 | """
394 | Convert the concepts extracted from colon reports to the set of pre-defined labels used for classification
395 |
396 | Params:
397 | report_concepts (dict(list)): the dict containing for each colon report the extracted concepts
398 |
399 | Returns: a dict containing for each colon report the set of pre-defined labels where 0 = absence and 1 = presence
400 | """
401 |
402 | report_labels = dict()
403 | # loop over reports
404 | for rid, rconcepts in report_concepts.items():
405 | # assign pre-defined set of labels to current report
406 | report_labels[rid] = {'cancer': 0, 'hgd': 0, 'lgd': 0, 'hyperplastic': 0, 'ni': 0}
407 | # textify diagnosis section
408 | diagnosis = ' '.join([concept[1].lower() for concept in rconcepts['Diagnosis']])
409 | # update pre-defined labels w/ 1 in case of label presence
410 | if 'colon adenocarcinoma' in diagnosis: # update cancer
411 | report_labels[rid]['cancer'] = 1
412 | if 'dysplasia' in diagnosis: # diagnosis contains dysplasia
413 | if 'mild' in diagnosis: # update lgd
414 | report_labels[rid]['lgd'] = 1
415 | if 'moderate' in diagnosis: # update lgd
416 | report_labels[rid]['lgd'] = 1
417 | if 'severe' in diagnosis: # update hgd
418 | report_labels[rid]['hgd'] = 1
419 | if 'hyperplastic polyp' in diagnosis: # update hyperplastic
420 | report_labels[rid]['hyperplastic'] = 1
421 | if sum(report_labels[rid].values()) == 0: # update ni
422 | report_labels[rid]['ni'] = 1
423 | return report_labels
424 |
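A toy walkthrough of the keyword mapping above, applied to one fabricated report; the concept tuples and their IRIs are illustrative, not real ontology entries:

```python
# One fake colon report whose Diagnosis concepts mention both
# adenocarcinoma and mild dysplasia (tuples are illustrative).
rconcepts = {"Diagnosis": [("iri:0", "Colon adenocarcinoma"), ("iri:1", "Mild dysplasia")]}

labels = {"cancer": 0, "hgd": 0, "lgd": 0, "hyperplastic": 0, "ni": 0}
# textify the diagnosis section, then run the same substring checks
diagnosis = " ".join(c[1].lower() for c in rconcepts["Diagnosis"])
if "colon adenocarcinoma" in diagnosis:
    labels["cancer"] = 1
if "dysplasia" in diagnosis:
    if "mild" in diagnosis or "moderate" in diagnosis:
        labels["lgd"] = 1
    if "severe" in diagnosis:
        labels["hgd"] = 1
if "hyperplastic polyp" in diagnosis:
    labels["hyperplastic"] = 1
if sum(labels.values()) == 0:
    labels["ni"] = 1

print(labels)  # {'cancer': 1, 'hgd': 0, 'lgd': 1, 'hyperplastic': 0, 'ni': 0}
```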
425 |
426 | def aoec_colon_labels2binary(report_labels):
427 | """
428 | Convert the pre-defined labels extracted from colon reports to binary labels used for classification
429 |
430 | Params:
431 | report_labels (dict(list)): the dict containing for each colon report the pre-defined labels
432 |
433 | Returns: a dict containing for each colon report the set of binary labels where 0 = absence and 1 = presence
434 | """
435 |
436 | binary_labels = dict()
437 | # loop over reports
438 | for rid, rlabels in report_labels.items():
439 | # assign binary labels to current report
440 | binary_labels[rid] = {'cancer_or_dysplasia': 0, 'other': 0}
441 | # update binary labels w/ 1 in case of label presence
442 | if rlabels['cancer'] == 1 or rlabels['lgd'] == 1 or rlabels['hgd'] == 1: # update cancer_or_dysplasia label
443 | binary_labels[rid]['cancer_or_dysplasia'] = 1
444 | else: # update other label
445 | binary_labels[rid]['other'] = 1
446 | return binary_labels
447 |
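The binary collapse above folds cancer/lgd/hgd into one class and everything else into the other. A sketch on two fabricated reports (ids and values are illustrative):

```python
# r1 has high-grade dysplasia, r2 only a hyperplastic polyp;
# label dicts follow the colon schema above.
report_labels = {
    "r1": {"cancer": 0, "hgd": 1, "lgd": 0, "hyperplastic": 0, "ni": 0},
    "r2": {"cancer": 0, "hgd": 0, "lgd": 0, "hyperplastic": 1, "ni": 0},
}

binary_labels = {}
for rid, rlabels in report_labels.items():
    binary_labels[rid] = {"cancer_or_dysplasia": 0, "other": 0}
    if rlabels["cancer"] == 1 or rlabels["lgd"] == 1 or rlabels["hgd"] == 1:
        binary_labels[rid]["cancer_or_dysplasia"] = 1
    else:
        binary_labels[rid]["other"] = 1

print(binary_labels["r1"])  # {'cancer_or_dysplasia': 1, 'other': 0}
print(binary_labels["r2"])  # {'cancer_or_dysplasia': 0, 'other': 1}
```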
448 |
449 | def aoec_cervix_concepts2labels(report_concepts):
450 | """
451 | Convert the concepts extracted from cervix reports to the set of pre-defined labels used for classification
452 |
453 | Params:
454 | report_concepts (dict(list)): the dict containing for each cervix report the extracted concepts
455 |
456 | Returns: a dict containing for each cervix report the set of pre-defined labels where 0 = absence and 1 = presence
457 | """
458 |
459 | report_labels = dict()
460 | # loop over reports
461 | for rid, rconcepts in report_concepts.items():
462 | # assign pre-defined set of labels to current report
463 | report_labels[rid] = {
464 | 'cancer_scc_inv': 0, 'cancer_scc_insitu': 0, 'cancer_adeno_inv': 0, 'cancer_adeno_insitu': 0,
465 | 'lgd': 0, 'hgd': 0,
466 | 'hpv': 0, 'koilocytes': 0,
467 | 'glands_norm': 0, 'squamous_norm': 0
468 | }
469 | # make diagnosis section a set
470 | diagnosis = set([concept[1].lower() for concept in rconcepts['Diagnosis']])
471 | # update pre-defined labels w/ 1 in case of label presence
472 | for d in diagnosis:
473 | if 'cervical squamous cell carcinoma' == d:
474 | report_labels[rid]['cancer_scc_inv'] = 1
475 | if 'squamous carcinoma in situ' == d or 'squamous intraepithelial neoplasia' == d:
476 | report_labels[rid]['cancer_scc_insitu'] = 1
477 | if 'cervical adenocarcinoma' in d:
478 | if 'cervical adenocarcinoma in situ' == d:
479 | report_labels[rid]['cancer_adeno_insitu'] = 1
480 | else:
481 | report_labels[rid]['cancer_adeno_inv'] = 1
482 | if 'low grade cervical squamous intraepithelial neoplasia' == d:
483 | report_labels[rid]['lgd'] = 1
484 | if 'squamous carcinoma in situ' == d or \
485 | 'squamous intraepithelial neoplasia' == d or \
486 | 'cervical squamous intraepithelial neoplasia 2' == d or \
487 | 'cervical intraepithelial neoplasia grade 2/3' == d:
488 | report_labels[rid]['hgd'] = 1
489 | if 'human papilloma virus infection' == d:
490 | report_labels[rid]['hpv'] = 1
491 | if 'koilocytotic squamous cell' == d:
492 | report_labels[rid]['koilocytes'] = 1
493 | # update when no label has been set to 1
494 | if sum(report_labels[rid].values()) == 0:
495 | report_labels[rid]['glands_norm'] = 1
496 | report_labels[rid]['squamous_norm'] = 1
497 | return report_labels
498 |
499 |
500 | def aoec_cervix_labels2aggregates(report_labels):
501 | """
502 | Convert the pre-defined labels extracted from cervix reports to coarse- and fine-grained aggregated labels
503 | Params:
504 | report_labels (dict(list)): the dict containing for each cervix report the pre-defined labels
505 | Returns: two dicts containing for each cervix report the set of aggregated labels where 0 = absence and 1 = presence
506 | """
507 |
508 | coarse_labels = dict()
509 | fine_labels = dict()
510 | # loop over reports
511 | for rid, rlabels in report_labels.items():
512 | # assign aggregated labels to current report
513 | coarse_labels[rid] = {'cancer': 0, 'dysplasia': 0, 'normal': 0}
514 | fine_labels[rid] = {'cancer_adeno': 0, 'cancer_scc': 0, 'dysplasia': 0, 'glands_norm': 0, 'squamous_norm': 0}
515 | # update aggregated labels w/ 1 in case of label presence
516 | if rlabels['cancer_adeno_inv'] == 1 or rlabels['cancer_adeno_insitu'] == 1:
517 | coarse_labels[rid]['cancer'] = 1
518 | fine_labels[rid]['cancer_adeno'] = 1
519 | if rlabels['cancer_scc_inv'] == 1 or rlabels['cancer_scc_insitu'] == 1:
520 | coarse_labels[rid]['cancer'] = 1
521 | fine_labels[rid]['cancer_scc'] = 1
522 | if rlabels['lgd'] == 1 or rlabels['hgd'] == 1:
523 | coarse_labels[rid]['dysplasia'] = 1
524 | fine_labels[rid]['dysplasia'] = 1
525 | if rlabels['glands_norm'] == 1:
526 | coarse_labels[rid]['normal'] = 1
527 | fine_labels[rid]['glands_norm'] = 1
528 | if rlabels['squamous_norm'] == 1:
529 | coarse_labels[rid]['normal'] = 1
530 | fine_labels[rid]['squamous_norm'] = 1
531 | return coarse_labels, fine_labels
532 |
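The aggregation above maps the ten cervix labels onto a 3-class coarse scheme and a 5-class fine scheme. A sketch for a single fabricated report with only HGD set:

```python
# One fake cervix report: only 'hgd' is 1 (values are illustrative).
rlabels = {
    "cancer_scc_inv": 0, "cancer_scc_insitu": 0,
    "cancer_adeno_inv": 0, "cancer_adeno_insitu": 0,
    "lgd": 0, "hgd": 1, "hpv": 0, "koilocytes": 0,
    "glands_norm": 0, "squamous_norm": 0,
}

coarse = {"cancer": 0, "dysplasia": 0, "normal": 0}
fine = {"cancer_adeno": 0, "cancer_scc": 0, "dysplasia": 0,
        "glands_norm": 0, "squamous_norm": 0}
# same conditions as the aggregation logic above
if rlabels["cancer_adeno_inv"] == 1 or rlabels["cancer_adeno_insitu"] == 1:
    coarse["cancer"] = fine["cancer_adeno"] = 1
if rlabels["cancer_scc_inv"] == 1 or rlabels["cancer_scc_insitu"] == 1:
    coarse["cancer"] = fine["cancer_scc"] = 1
if rlabels["lgd"] == 1 or rlabels["hgd"] == 1:
    coarse["dysplasia"] = fine["dysplasia"] = 1
if rlabels["glands_norm"] == 1:
    coarse["normal"] = fine["glands_norm"] = 1
if rlabels["squamous_norm"] == 1:
    coarse["normal"] = fine["squamous_norm"] = 1

print(coarse)  # {'cancer': 0, 'dysplasia': 1, 'normal': 0}
```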
533 |
534 | def aoec_lung_concepts2labels(report_concepts):
535 | """
536 | Convert the concepts extracted from lung reports to the set of pre-defined labels used for classification
537 |
538 | Params:
539 | report_concepts (dict(list)): the dict containing for each lung report the extracted concepts
540 |
541 | Returns: a dict containing for each lung report the set of pre-defined labels where 0 = absence and 1 = presence
542 | """
543 |
544 | report_labels = dict()
545 | # loop over reports
546 | for rid, rconcepts in report_concepts.items():
547 | # assign pre-defined set of labels to current report
548 | report_labels[rid] = {
549 | 'cancer_scc': 0, 'cancer_nscc_adeno': 0, 'cancer_nscc_squamous': 0, 'cancer_nscc_large': 0, 'no_cancer': 0}
550 | # make diagnosis section a set
551 | diagnosis = set([concept[1].lower() for concept in rconcepts['Diagnosis']])
552 | # update pre-defined labels w/ 1 in case of label presence
553 | for d in diagnosis:
554 | if 'small cell lung carcinoma' == d:
555 | report_labels[rid]['cancer_scc'] = 1
556 | if 'lung adenocarcinoma' == d or 'clear cell adenocarcinoma' == d or 'metastatic neoplasm' == d:
557 | report_labels[rid]['cancer_nscc_adeno'] = 1
558 | if 'non-small cell squamous lung carcinoma' == d:
559 | report_labels[rid]['cancer_nscc_squamous'] = 1
560 | if 'lung large cell carcinoma' == d:
561 | report_labels[rid]['cancer_nscc_large'] = 1
562 | # update when no label has been set to 1
563 | if sum(report_labels[rid].values()) == 0:
564 | report_labels[rid]['no_cancer'] = 1
565 | return report_labels
566 |
567 |
568 | def aoec_celiac_concepts2labels(report_concepts):
569 | """
570 | Convert the concepts extracted from celiac reports to the set of pre-defined labels used for classification
571 |
572 | Params:
573 | report_concepts (dict(list)): the dict containing for each celiac report the extracted concepts
574 |
575 | Returns: a dict containing for each celiac report the set of pre-defined labels where 0 = absence and 1 = presence
576 | """
577 |
578 | report_labels = dict()
579 | # loop over reports
580 | for rid, rconcepts in report_concepts.items():
581 | # assign pre-defined set of labels to current report
582 | report_labels[rid] = {
583 | 'celiac_disease': 0, 'duodenitis': 0, 'inconclusive': 0, 'normal': 0}
584 | # make diagnosis section a set
585 | diagnosis = set([concept[1].lower() for concept in rconcepts['Diagnosis']])
586 | # update pre-defined labels w/ 1 in case of label presence
587 | for d in diagnosis:
588 | if 'positive to celiac disease' == d:
589 | report_labels[rid]['celiac_disease'] = 1
590 | if 'duodenitis' == d:
591 | report_labels[rid]['duodenitis'] = 1
592 | if 'inconclusive outcome' == d:
593 | report_labels[rid]['inconclusive'] = 1
594 | if 'negative result' == d:
595 | report_labels[rid]['normal'] = 1
596 | # update when no label has been set to 1
597 | if sum(report_labels[rid].values()) == 0:
598 | report_labels[rid]['normal'] = 1
599 | return report_labels
600 |
601 |
602 | # RADBOUD RELATED FUNCTIONS
603 |
604 | def radboud_colon_concepts2labels(report_concepts):
605 | """
606 | Convert the concepts extracted from reports to the set of pre-defined labels used for classification
607 |
608 | Params:
609 | report_concepts (dict(list)): the dict containing for each report the extracted concepts
610 |
611 | Returns: a dict containing for each report the set of pre-defined labels where 0 = absence and 1 = presence
612 | """
613 |
614 | report_labels = dict()
615 | # loop over reports
616 | for rid, rconcepts in report_concepts.items():
617 | report_labels[rid] = dict()
618 | # assign pre-defined set of labels to current report
619 | report_labels[rid]['labels'] = {'cancer': 0, 'hgd': 0, 'lgd': 0, 'hyperplastic': 0, 'ni': 0}
620 | # textify diagnosis section
621 | diagnosis = ' '.join([concept[1].lower() for concept in rconcepts['Diagnosis']])
622 | # update pre-defined labels w/ 1 in case of label presence
623 | if 'colon adenocarcinoma' in diagnosis: # update cancer
624 | report_labels[rid]['labels']['cancer'] = 1
625 | if 'dysplasia' in diagnosis: # diagnosis contains dysplasia
626 | if 'mild' in diagnosis: # update lgd
627 | report_labels[rid]['labels']['lgd'] = 1
628 | if 'moderate' in diagnosis: # update lgd
629 | report_labels[rid]['labels']['lgd'] = 1
630 | if 'severe' in diagnosis: # update hgd
631 | report_labels[rid]['labels']['hgd'] = 1
632 | if 'hyperplastic polyp' in diagnosis: # update hyperplastic
633 | report_labels[rid]['labels']['hyperplastic'] = 1
634 | if sum(report_labels[rid]['labels'].values()) == 0: # update ni
635 | report_labels[rid]['labels']['ni'] = 1
636 | if 'slide_ids' in rconcepts:
637 | report_labels[rid]['slide_ids'] = rconcepts['slide_ids']
638 | return report_labels
639 |
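Unlike the AOEC converters, the Radboud converters nest the label dict under a `'labels'` key and carry slide identifiers through when present. A shape sketch (the slide ids here are hypothetical):

```python
# Output shape of the Radboud converters above: labels nested under
# 'labels', optional 'slide_ids' copied from the input concepts.
report_labels = {
    "r1": {
        "labels": {"cancer": 1, "hgd": 0, "lgd": 0, "hyperplastic": 0, "ni": 0},
        "slide_ids": ["r1_slide_1", "r1_slide_2"],  # hypothetical ids
    }
}

print(sorted(report_labels["r1"]))  # ['labels', 'slide_ids']
```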
640 |
641 | def radboud_colon_labels2binary(report_labels):
642 | """
643 | Convert the pre-defined labels extracted from reports to binary labels used for classification
644 |
645 | Params:
646 | report_labels (dict(list)): the dict containing for each report the pre-defined labels
647 |
648 | Returns: a dict containing for each report the set of binary labels where 0 = absence and 1 = presence
649 | """
650 |
651 | binary_labels = dict()
652 | # loop over reports
653 | for rid, rlabels in report_labels.items():
654 | binary_labels[rid] = dict()
655 | # assign binary labels to current report
656 | binary_labels[rid]['labels'] = {'cancer_or_dysplasia': 0, 'other': 0}
657 | # update binary labels w/ 1 in case of label presence
658 | if rlabels['labels']['cancer'] == 1 or rlabels['labels']['lgd'] == 1 or rlabels['labels']['hgd'] == 1: # update cancer_or_dysplasia label
659 | binary_labels[rid]['labels']['cancer_or_dysplasia'] = 1
660 | else: # update other label
661 | binary_labels[rid]['labels']['other'] = 1
662 | if 'slide_ids' in rlabels:
663 | binary_labels[rid]['slide_ids'] = rlabels['slide_ids']
664 | return binary_labels
665 |
666 |
667 | def radboud_cervix_concepts2labels(report_concepts):
668 | """
669 | Convert the concepts extracted from cervix reports to the set of pre-defined labels used for classification
670 |
671 | Params:
672 | report_concepts (dict(list)): the dict containing for each cervix report the extracted concepts
673 |
674 | Returns: a dict containing for each cervix report the set of pre-defined labels where 0 = absence and 1 = presence
675 | """
676 |
677 | report_labels = dict()
678 | # loop over reports
679 | for rid, rconcepts in report_concepts.items():
680 | report_labels[rid] = dict()
681 | # assign pre-defined set of labels to current report
682 | report_labels[rid]['labels'] = {
683 | 'cancer_scc_inv': 0, 'cancer_scc_insitu': 0, 'cancer_adeno_inv': 0, 'cancer_adeno_insitu': 0,
684 | 'lgd': 0, 'hgd': 0,
685 | 'hpv': 0, 'koilocytes': 0,
686 | 'glands_norm': 0, 'squamous_norm': 0
687 | }
688 | # make diagnosis section a set
689 | diagnosis = set([concept[1].lower() for concept in rconcepts['Diagnosis']])
690 | # update pre-defined labels w/ 1 in case of label presence
691 | for d in diagnosis:
692 | if 'cervical squamous cell carcinoma' == d:
693 | report_labels[rid]['labels']['cancer_scc_inv'] = 1
694 | if 'squamous carcinoma in situ' == d or 'squamous intraepithelial neoplasia' == d:
695 | report_labels[rid]['labels']['cancer_scc_insitu'] = 1
696 | if 'cervical adenocarcinoma' in d:
697 | if 'cervical adenocarcinoma in situ' == d:
698 | report_labels[rid]['labels']['cancer_adeno_insitu'] = 1
699 | else:
700 | report_labels[rid]['labels']['cancer_adeno_inv'] = 1
701 | if 'low grade cervical squamous intraepithelial neoplasia' == d:
702 | report_labels[rid]['labels']['lgd'] = 1
703 | if 'squamous carcinoma in situ' == d or \
704 | 'squamous intraepithelial neoplasia' == d or \
705 | 'cervical squamous intraepithelial neoplasia 2' == d or \
706 | 'cervical intraepithelial neoplasia grade 2/3' == d:
707 | report_labels[rid]['labels']['hgd'] = 1
708 | if 'human papilloma virus infection' == d:
709 | report_labels[rid]['labels']['hpv'] = 1
710 | if 'koilocytotic squamous cell' == d:
711 | report_labels[rid]['labels']['koilocytes'] = 1
712 | # update when no label has been set to 1
713 | if sum(report_labels[rid]['labels'].values()) == 0:
714 | report_labels[rid]['labels']['glands_norm'] = 1
715 | report_labels[rid]['labels']['squamous_norm'] = 1
716 |
717 | if 'slide_ids' in rconcepts:
718 | report_labels[rid]['slide_ids'] = rconcepts['slide_ids']
719 | return report_labels
720 |
721 |
722 | def radboud_cervix_labels2aggregates(report_labels):
723 | """
724 | Convert the pre-defined labels extracted from cervix reports to coarse- and fine-grained aggregated labels
725 |
726 | Params:
727 | report_labels (dict(list)): the dict containing for each cervix report the pre-defined labels
728 | Returns: two dicts containing for each cervix report the set of aggregated labels where 0 = absence and 1 = presence
729 | """
730 |
731 | coarse_labels = dict()
732 | fine_labels = dict()
733 | # loop over reports
734 | for rid, rlabels in report_labels.items():
735 | coarse_labels[rid] = dict()
736 | fine_labels[rid] = dict()
737 | # assign aggregated labels to current report
738 | coarse_labels[rid]['labels'] = {'cancer': 0, 'dysplasia': 0, 'normal': 0}
739 | fine_labels[rid]['labels'] = {'cancer_adeno': 0, 'cancer_scc': 0, 'dysplasia': 0, 'glands_norm': 0, 'squamous_norm': 0}
740 | # update aggregated labels w/ 1 in case of label presence
741 | if rlabels['labels']['cancer_adeno_inv'] == 1 or rlabels['labels']['cancer_adeno_insitu'] == 1:
742 | coarse_labels[rid]['labels']['cancer'] = 1
743 | fine_labels[rid]['labels']['cancer_adeno'] = 1
744 | if rlabels['labels']['cancer_scc_inv'] == 1 or rlabels['labels']['cancer_scc_insitu'] == 1:
745 | coarse_labels[rid]['labels']['cancer'] = 1
746 | fine_labels[rid]['labels']['cancer_scc'] = 1
747 | if rlabels['labels']['lgd'] == 1 or rlabels['labels']['hgd'] == 1:
748 | coarse_labels[rid]['labels']['dysplasia'] = 1
749 | fine_labels[rid]['labels']['dysplasia'] = 1
750 | if rlabels['labels']['glands_norm'] == 1:
751 | coarse_labels[rid]['labels']['normal'] = 1
752 | fine_labels[rid]['labels']['glands_norm'] = 1
753 | if rlabels['labels']['squamous_norm'] == 1:
754 | coarse_labels[rid]['labels']['normal'] = 1
755 | fine_labels[rid]['labels']['squamous_norm'] = 1
756 | if 'slide_ids' in rlabels:
757 | coarse_labels[rid]['slide_ids'] = rlabels['slide_ids']
758 | fine_labels[rid]['slide_ids'] = rlabels['slide_ids']
759 | return coarse_labels, fine_labels
760 |
761 | def radboud_celiac_concepts2labels(report_concepts):
762 | """
763 | Convert the concepts extracted from celiac reports to the set of pre-defined labels used for classification
764 |
765 | Params:
766 | report_concepts (dict(list)): the dict containing for each celiac report the extracted concepts
767 |
768 | Returns: a dict containing for each celiac report the set of pre-defined labels where 0 = absence and 1 = presence
769 | """
770 |
771 | report_labels = dict()
772 | # loop over reports
773 | for rid, rconcepts in report_concepts.items():
774 | report_labels[rid] = dict()
775 | # assign pre-defined set of labels to current report
776 | report_labels[rid]['labels'] = {
777 | 'celiac_disease': 0, 'duodenitis': 0, 'inconclusive': 0, 'normal': 0}
778 | # make diagnosis section a set
779 | diagnosis = set([concept[1].lower() for concept in rconcepts['Diagnosis']])
780 | # update pre-defined labels w/ 1 in case of label presence
781 | for d in diagnosis:
782 | if 'positive to celiac disease' == d:
783 | report_labels[rid]['labels']['celiac_disease'] = 1
784 | if 'duodenitis' == d:
785 | report_labels[rid]['labels']['duodenitis'] = 1
786 | if 'inconclusive outcome' == d:
787 | report_labels[rid]['labels']['inconclusive'] = 1
788 | if 'negative result' == d:
789 | report_labels[rid]['labels']['normal'] = 1
790 | # update when no label has been set to 1
791 | if sum(report_labels[rid]['labels'].values()) == 0:
792 | report_labels[rid]['labels']['normal'] = 1
793 | if 'slide_ids' in rconcepts:
794 | report_labels[rid]['slide_ids'] = rconcepts['slide_ids']
795 | return report_labels
796 |
797 |
798 | # GENERAL-PURPOSE FUNCTIONS
799 |
800 | def colon_concepts2labels(report_concepts):
801 | """
802 | Convert the concepts extracted from colon reports to the set of pre-defined labels used for classification
803 |
804 | Params:
805 | report_concepts (dict(list)): the dict containing for each colon report the extracted concepts
806 |
807 | Returns: a dict containing for each colon report the set of pre-defined labels where 0 = absence and 1 = presence
808 | """
809 |
810 | report_labels = dict()
811 | # loop over reports
812 | for rid, rconcepts in report_concepts.items():
814 | # assign pre-defined set of labels to current report
815 | report_labels[rid] = {'cancer': 0, 'hgd': 0, 'lgd': 0, 'hyperplastic': 0, 'ni': 0}
816 | # textify diagnosis section
817 | diagnosis = ' '.join([concept[1].lower() for concept in rconcepts['Diagnosis']])
818 | # update pre-defined labels w/ 1 in case of label presence
819 | if 'colon adenocarcinoma' in diagnosis: # update cancer
820 | report_labels[rid]['cancer'] = 1
821 | if 'dysplasia' in diagnosis: # diagnosis contains dysplasia
822 | if 'mild' in diagnosis: # update lgd
823 | report_labels[rid]['lgd'] = 1
824 | if 'moderate' in diagnosis: # update lgd
825 | report_labels[rid]['lgd'] = 1
826 | if 'severe' in diagnosis: # update hgd
827 | report_labels[rid]['hgd'] = 1
828 | if 'hyperplastic polyp' in diagnosis: # update hyperplastic
829 | report_labels[rid]['hyperplastic'] = 1
830 | if sum(report_labels[rid].values()) == 0: # update ni
831 | report_labels[rid]['ni'] = 1
832 | return report_labels
833 |
834 |
835 | def colon_labels2binary(report_labels):
836 | """
837 | Convert the pre-defined labels extracted from colon reports to binary labels used for classification
838 |
839 | Params:
840 | report_labels (dict(list)): the dict containing for each colon report the pre-defined labels
841 |
842 | Returns: a dict containing for each colon report the set of binary labels where 0 = absence and 1 = presence
843 | """
844 |
845 | binary_labels = dict()
846 | # loop over reports
847 | for rid, rlabels in report_labels.items():
848 | # assign binary labels to current report
849 | binary_labels[rid] = {'cancer_or_dysplasia': 0, 'other': 0}
850 | # update binary labels w/ 1 in case of label presence
851 | if rlabels['cancer'] == 1 or rlabels['lgd'] == 1 or rlabels['hgd'] == 1: # update cancer_or_dysplasia label
852 | binary_labels[rid]['cancer_or_dysplasia'] = 1
853 | else: # update other label
854 | binary_labels[rid]['other'] = 1
855 | return binary_labels
856 |
857 |
858 | def cervix_concepts2labels(report_concepts):
859 | """
860 | Convert the concepts extracted from cervix reports to the set of pre-defined labels used for classification
861 |
862 | Params:
863 | report_concepts (dict(list)): the dict containing for each cervix report the extracted concepts
864 |
865 | Returns: a dict containing for each cervix report the set of pre-defined labels where 0 = absence and 1 = presence
866 | """
867 |
868 | report_labels = dict()
869 | # loop over reports
870 | for rid, rconcepts in report_concepts.items():
871 | # assign pre-defined set of labels to current report
872 | report_labels[rid] = {
873 | 'cancer_scc_inv': 0, 'cancer_scc_insitu': 0, 'cancer_adeno_inv': 0, 'cancer_adeno_insitu': 0,
874 | 'lgd': 0, 'hgd': 0,
875 | 'hpv': 0, 'koilocytes': 0,
876 | 'glands_norm': 0, 'squamous_norm': 0
877 | }
878 | # make diagnosis section a set
879 | diagnosis = set([concept[1].lower() for concept in rconcepts['Diagnosis']])
880 | # update pre-defined labels w/ 1 in case of label presence
881 | for d in diagnosis:
882 | if 'cervical squamous cell carcinoma' == d:
883 | report_labels[rid]['cancer_scc_inv'] = 1
884 | if 'squamous carcinoma in situ' == d or 'squamous intraepithelial neoplasia' == d:
885 | report_labels[rid]['cancer_scc_insitu'] = 1
886 | if 'cervical adenocarcinoma' in d:
887 | if 'cervical adenocarcinoma in situ' == d:
888 | report_labels[rid]['cancer_adeno_insitu'] = 1
889 | else:
890 | report_labels[rid]['cancer_adeno_inv'] = 1
891 | if 'low grade cervical squamous intraepithelial neoplasia' == d:
892 | report_labels[rid]['lgd'] = 1
893 | if 'squamous carcinoma in situ' == d or \
894 | 'squamous intraepithelial neoplasia' == d or \
895 | 'cervical squamous intraepithelial neoplasia 2' == d or \
896 | 'cervical intraepithelial neoplasia grade 2/3' == d:
897 | report_labels[rid]['hgd'] = 1
898 | if 'human papilloma virus infection' == d:
899 | report_labels[rid]['hpv'] = 1
900 | if 'koilocytotic squamous cell' == d:
901 | report_labels[rid]['koilocytes'] = 1
902 | # update when no label has been set to 1
903 | if sum(report_labels[rid].values()) == 0:
904 | report_labels[rid]['glands_norm'] = 1
905 | report_labels[rid]['squamous_norm'] = 1
906 | return report_labels
907 |
908 |
909 | def cervix_labels2aggregates(report_labels):
910 | """
911 | Convert the pre-defined labels extracted from cervix reports to coarse- and fine-grained aggregated labels
912 | Params:
913 | report_labels (dict(list)): the dict containing for each cervix report the pre-defined labels
914 | Returns: two dicts containing for each cervix report the set of aggregated labels where 0 = absence and 1 = presence
915 | """
916 |
917 | coarse_labels = dict()
918 | fine_labels = dict()
919 | # loop over reports
920 | for rid, rlabels in report_labels.items():
921 | # assign aggregated labels to current report
922 | coarse_labels[rid] = {'cancer': 0, 'dysplasia': 0, 'normal': 0}
923 | fine_labels[rid] = {'cancer_adeno': 0, 'cancer_scc': 0, 'dysplasia': 0, 'glands_norm': 0, 'squamous_norm': 0}
924 | # update aggregated labels w/ 1 in case of label presence
925 | if rlabels['cancer_adeno_inv'] == 1 or rlabels['cancer_adeno_insitu'] == 1:
926 | coarse_labels[rid]['cancer'] = 1
927 | fine_labels[rid]['cancer_adeno'] = 1
928 | if rlabels['cancer_scc_inv'] == 1 or rlabels['cancer_scc_insitu'] == 1:
929 | coarse_labels[rid]['cancer'] = 1
930 | fine_labels[rid]['cancer_scc'] = 1
931 | if rlabels['lgd'] == 1 or rlabels['hgd'] == 1:
932 | coarse_labels[rid]['dysplasia'] = 1
933 | fine_labels[rid]['dysplasia'] = 1
934 | if rlabels['glands_norm'] == 1:
935 | coarse_labels[rid]['normal'] = 1
936 | fine_labels[rid]['glands_norm'] = 1
937 | if rlabels['squamous_norm'] == 1:
938 | coarse_labels[rid]['normal'] = 1
939 | fine_labels[rid]['squamous_norm'] = 1
940 | return coarse_labels, fine_labels
941 |
942 |
943 | def lung_concepts2labels(report_concepts):
944 | """
945 | Convert the concepts extracted from lung reports to the set of pre-defined labels used for classification
946 |
947 | Params:
948 | report_concepts (dict(list)): the dict containing for each lung report the extracted concepts
949 |
950 | Returns: a dict containing for each lung report the set of pre-defined labels where 0 = absence and 1 = presence
951 | """
952 |
953 | report_labels = dict()
954 | # loop over reports
955 | for rid, rconcepts in report_concepts.items():
956 | # assign pre-defined set of labels to current report
957 | report_labels[rid] = {
958 | 'cancer_scc': 0, 'cancer_nscc_adeno': 0, 'cancer_nscc_squamous': 0, 'cancer_nscc_large': 0, 'no_cancer': 0}
959 | # make diagnosis section a set
960 | diagnosis = set([concept[1].lower() for concept in rconcepts['Diagnosis']])
961 | # update pre-defined labels w/ 1 in case of label presence
962 | for d in diagnosis:
963 | if 'small cell lung carcinoma' == d:
964 | report_labels[rid]['cancer_scc'] = 1
965 | if 'lung adenocarcinoma' == d or 'clear cell adenocarcinoma' == d or 'metastatic neoplasm' == d:
966 | report_labels[rid]['cancer_nscc_adeno'] = 1
967 | if 'non-small cell squamous lung carcinoma' == d:
968 | report_labels[rid]['cancer_nscc_squamous'] = 1
969 | if 'lung large cell carcinoma' == d:
970 | report_labels[rid]['cancer_nscc_large'] = 1
971 | # update when no label has been set to 1
972 | if sum(report_labels[rid].values()) == 0:
973 | report_labels[rid]['no_cancer'] = 1
974 | return report_labels
975 |
976 |
977 | def celiac_concepts2labels(report_concepts):
978 | """
979 | Convert the concepts extracted from celiac reports to the set of pre-defined labels used for classification
980 |
981 | Params:
982 | report_concepts (dict(list)): the dict containing for each celiac report the extracted concepts
983 |
984 | Returns: a dict containing for each celiac report the set of pre-defined labels where 0 = absence and 1 = presence
985 | """
986 |
987 | report_labels = dict()
988 | # loop over reports
989 | for rid, rconcepts in report_concepts.items():
990 | # assign pre-defined set of labels to current report
991 | report_labels[rid] = {
992 | 'celiac_disease': 0, 'duodenitis': 0, 'inconclusive': 0, 'normal': 0}
993 | # make diagnosis section a set
994 | diagnosis = set([concept[1].lower() for concept in rconcepts['Diagnosis']])
995 | # update pre-defined labels w/ 1 in case of label presence
996 | for d in diagnosis:
997 | if 'positive to celiac disease' == d:
998 | report_labels[rid]['celiac_disease'] = 1
999 | if 'duodenitis' == d:
1000 | report_labels[rid]['duodenitis'] = 1
1001 | if 'inconclusive outcome' == d:
1002 | report_labels[rid]['inconclusive'] = 1
1003 | if 'negative result' == d:
1004 | report_labels[rid]['normal'] = 1
1005 | # update when no label has been set to 1
1006 | if sum(report_labels[rid].values()) == 0:
1007 | report_labels[rid]['normal'] = 1
1008 | return report_labels
1009 |
--------------------------------------------------------------------------------
/sket_server/sket_rest_app/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ExaNLP/sket/d9a3fcc42d5f3671dcb2ac6597ea663b9b259433/sket_server/sket_rest_app/__init__.py
--------------------------------------------------------------------------------
/sket_server/sket_rest_app/admin.py:
--------------------------------------------------------------------------------
1 | from django.contrib import admin
2 |
3 | # Register your models here.
4 |
--------------------------------------------------------------------------------
/sket_server/sket_rest_app/apps.py:
--------------------------------------------------------------------------------
1 | from django.apps import AppConfig
2 |
3 |
4 | class SketRestAppConfig(AppConfig):
5 | default_auto_field = 'django.db.models.BigAutoField'
6 | name = 'sket_server.sket_rest_app'
7 |
--------------------------------------------------------------------------------
/sket_server/sket_rest_app/migrations/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ExaNLP/sket/d9a3fcc42d5f3671dcb2ac6597ea663b9b259433/sket_server/sket_rest_app/migrations/__init__.py
--------------------------------------------------------------------------------
/sket_server/sket_rest_app/migrations/__pycache__/__init__.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ExaNLP/sket/d9a3fcc42d5f3671dcb2ac6597ea663b9b259433/sket_server/sket_rest_app/migrations/__pycache__/__init__.cpython-38.pyc
--------------------------------------------------------------------------------
/sket_server/sket_rest_app/migrations/__pycache__/__init__.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ExaNLP/sket/d9a3fcc42d5f3671dcb2ac6597ea663b9b259433/sket_server/sket_rest_app/migrations/__pycache__/__init__.cpython-39.pyc
--------------------------------------------------------------------------------
/sket_server/sket_rest_app/models.py:
--------------------------------------------------------------------------------
1 | from django.db import models
2 |
3 | # Create your models here.
4 |
--------------------------------------------------------------------------------
/sket_server/sket_rest_app/tests.py:
--------------------------------------------------------------------------------
1 | from django.test import TestCase
2 |
3 | # Create your tests here.
4 |
--------------------------------------------------------------------------------
/sket_server/sket_rest_app/urls.py:
--------------------------------------------------------------------------------
1 | from django.urls import path
2 |
3 | from . import views
4 |
5 | app_name = 'sket_server.sket_rest_app'
6 |
7 | urlpatterns = [
8 |     # Every route resolves to the same view; segments missing from the URL arrive as None.
9 |     path('', views.annotate, name='annotate'),
10 |     path('annotate/<use_case>/<language>/<obj>/<rdf_format>', views.annotate, name='annotate'),
11 |     path('annotate/<use_case>/<language>/<obj>', views.annotate, name='annotate'),
12 |     path('annotate/<use_case>/<language>', views.annotate, name='annotate'),
13 | ]
14 |
--------------------------------------------------------------------------------
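The routes above take two to four positional segments (the angle-bracket path converters appear stripped in the listing, but the `annotate()` view signature implies `use_case`, `language`, `obj`, and `rdf_format`). As a sketch of how a client might build request URLs — assuming those patterns and a server on `localhost:8000` (both hypothetical here) — a small helper:

```python
def annotate_url(use_case, language, obj=None, rdf_format=None,
                 base='http://localhost:8000'):
    """Build an annotate-endpoint URL; trailing segments are optional."""
    parts = ['annotate', use_case, language]
    if obj is not None:
        parts.append(obj)
    if rdf_format is not None:
        parts.append(rdf_format)
    return base + '/' + '/'.join(parts)

print(annotate_url('colon', 'en'))                      # http://localhost:8000/annotate/colon/en
print(annotate_url('colon', 'en', 'graphs', 'turtle'))  # .../annotate/colon/en/graphs/turtle
```

A report payload would then be POSTed to such a URL, either as JSON or as an uploaded file, as handled by `views.annotate` below.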
/sket_server/sket_rest_app/views.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | import shutil
4 |
5 | from django.core.files.storage import FileSystemStorage
6 | from rest_framework import status
7 | from rest_framework.decorators import api_view
8 | from rest_framework.response import Response
9 |
10 | from sket_server.sket_rest_config import sket_pipe
11 |
12 |
13 | @api_view(['GET', 'POST'])
14 | def annotate(request, use_case=None, language=None, obj=None, rdf_format=None):
15 |     json_resp_single = {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}
16 |
17 |     if request.method == 'GET':
18 |         return Response(json_resp_single)
19 |
20 |     # POST: read the similarity threshold from the shared config file.
21 |     workpath = os.path.dirname(os.path.abspath(__file__))
22 |     config_path = os.path.join(workpath, '../sket_rest_config/config.json')
23 |     with open(config_path, 'r') as f:
24 |         thr = json.load(f)['thr']
25 |
26 |     concepts, labels, rdf_graphs = {}, {}, {}
27 |
28 |     # Validate the URL parameters.
29 |     if use_case is None or language is None:
30 |         return Response({'ERROR': 'Your request is invalid!'}, status=status.HTTP_400_BAD_REQUEST)
31 |     if obj not in ['concepts', 'graphs', 'labels', 'n3', 'turtle', 'trig', 'all', None]:
32 |         return Response({'ERROR': 'Your request is invalid!'}, status=status.HTTP_400_BAD_REQUEST)
33 |     if obj == 'graphs' and rdf_format not in ['n3', 'turtle', 'trig']:
34 |         return Response({'ERROR': 'Your request is invalid: the allowed rdf_formats are: turtle, n3, trig'},
35 |                         status=status.HTTP_400_BAD_REQUEST)
36 |
37 |     # 'concepts', 'graphs' and 'labels' are returned in the response; any other value stores the outputs.
38 |     store = obj not in ['concepts', 'graphs', 'labels']
39 |     if obj in ['concepts', 'labels']:
40 |         rdf_format = 'turtle'
41 |     if store and obj is None:
42 |         rdf_format = 'all'
43 |     if store and obj in ['n3', 'turtle', 'trig', 'all']:
44 |         rdf_format = obj
45 |
46 |     files = [item[1] for item in request.FILES.items()]
47 |
48 |     if not files:
49 |         # JSON body: pass the parsed payload straight to the pipeline.
50 |         if isinstance(request.data, dict):
51 |             concepts, labels, rdf_graphs = sket_pipe.med_pipeline(
52 |                 request.data, language, use_case, thr, store, rdf_format, False, False)
53 |     else:
54 |         # Uploaded files: save each one under ./tmp, process it, then clean up.
55 |         for file in files:
56 |             fs = FileSystemStorage(os.path.join(workpath, './tmp'))
57 |             file_up = fs.save(file.name, file)
58 |             uploaded_file_path = os.path.join(workpath, './tmp/' + file_up)
59 |             try:
60 |                 concepts, labels, rdf_graphs = sket_pipe.med_pipeline(
61 |                     uploaded_file_path, language, use_case, thr, store, rdf_format, False, False)
62 |             except Exception as e:
63 |                 return Response({'error': 'an error occurred: ' + str(e) + '.'})
64 |             finally:
65 |                 # Remove the uploaded file (and any leftovers) from ./tmp.
66 |                 for root, dirs, filenames in os.walk(os.path.join(workpath, './tmp')):
67 |                     for name in filenames:
68 |                         os.unlink(os.path.join(root, name))
69 |                     for d in dirs:
70 |                         shutil.rmtree(os.path.join(root, d))
71 |
72 |     if store:
73 |         return Response({'response': 'request handled with success.'}, status=status.HTTP_201_CREATED)
74 |     if obj == 'graphs':
75 |         return Response(rdf_graphs, status=status.HTTP_201_CREATED)
76 |     if obj == 'labels':
77 |         return Response(labels, status=status.HTTP_201_CREATED)
78 |     return Response(concepts, status=status.HTTP_201_CREATED)
79 |
--------------------------------------------------------------------------------
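The `store`/`rdf_format` branching inside `annotate` decides whether pipeline outputs are written to disk or returned in the response, and which RDF serialization is used. Pulled out into a standalone pure function (a hypothetical helper mirroring the view's logic, not part of the codebase), the behaviour is easy to exercise:

```python
def resolve_output(obj, rdf_format):
    """Mirror annotate()'s branching: decide whether outputs are stored
    and which RDF serialization applies. Returns (store, rdf_format)."""
    store = obj not in ('concepts', 'graphs', 'labels')
    if obj in ('concepts', 'labels'):
        rdf_format = 'turtle'
    if store and obj is None:
        rdf_format = 'all'
    if store and obj in ('n3', 'turtle', 'trig', 'all'):
        rdf_format = obj
    return store, rdf_format

print(resolve_output(None, None))        # (True, 'all')
print(resolve_output('concepts', None))  # (False, 'turtle')
print(resolve_output('trig', None))      # (True, 'trig')
```

Note that when `obj` names a serialization (`n3`, `turtle`, `trig`, `all`), outputs are stored in that format; when `obj` is `graphs`, the caller-supplied `rdf_format` is used and returned directly.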
/sket_server/sket_rest_config/__init__.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | import sys
4 | import time
5 |
6 | workpath = os.path.dirname(os.path.abspath(__file__))  # directory this file lives in
7 | sys.path.insert(0, os.path.join(workpath, '../../'))  # make the top-level `sket` package importable
8 |
9 | from sket.sket import SKET
10 |
11 | print('start sket initialization')
12 |
13 | with open(os.path.join(workpath, './config.json'), 'r') as f:
14 |     data = json.load(f)
15 |
16 | st = time.time()
17 | sket_pipe = SKET('colon', 'en', 'en_core_sci_sm', data['w2v_model'], data['fasttext_model'],
18 |                  data['bert_model'], data['string_model'], data['gpu'])
19 | end = time.time()
20 | print('sket initialization completed in:', end - st, 'seconds')
21 |
--------------------------------------------------------------------------------
/sket_server/sket_rest_config/__pycache__/__init__.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ExaNLP/sket/d9a3fcc42d5f3671dcb2ac6597ea663b9b259433/sket_server/sket_rest_config/__pycache__/__init__.cpython-38.pyc
--------------------------------------------------------------------------------
/sket_server/sket_rest_config/__pycache__/__init__.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ExaNLP/sket/d9a3fcc42d5f3671dcb2ac6597ea663b9b259433/sket_server/sket_rest_config/__pycache__/__init__.cpython-39.pyc
--------------------------------------------------------------------------------
/sket_server/sket_rest_config/__pycache__/settings.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ExaNLP/sket/d9a3fcc42d5f3671dcb2ac6597ea663b9b259433/sket_server/sket_rest_config/__pycache__/settings.cpython-38.pyc
--------------------------------------------------------------------------------
/sket_server/sket_rest_config/__pycache__/settings.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ExaNLP/sket/d9a3fcc42d5f3671dcb2ac6597ea663b9b259433/sket_server/sket_rest_config/__pycache__/settings.cpython-39.pyc
--------------------------------------------------------------------------------
/sket_server/sket_rest_config/__pycache__/urls.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ExaNLP/sket/d9a3fcc42d5f3671dcb2ac6597ea663b9b259433/sket_server/sket_rest_config/__pycache__/urls.cpython-38.pyc
--------------------------------------------------------------------------------
/sket_server/sket_rest_config/__pycache__/urls.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ExaNLP/sket/d9a3fcc42d5f3671dcb2ac6597ea663b9b259433/sket_server/sket_rest_config/__pycache__/urls.cpython-39.pyc
--------------------------------------------------------------------------------
/sket_server/sket_rest_config/__pycache__/wsgi.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ExaNLP/sket/d9a3fcc42d5f3671dcb2ac6597ea663b9b259433/sket_server/sket_rest_config/__pycache__/wsgi.cpython-38.pyc
--------------------------------------------------------------------------------
/sket_server/sket_rest_config/__pycache__/wsgi.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ExaNLP/sket/d9a3fcc42d5f3671dcb2ac6597ea663b9b259433/sket_server/sket_rest_config/__pycache__/wsgi.cpython-39.pyc
--------------------------------------------------------------------------------
/sket_server/sket_rest_config/asgi.py:
--------------------------------------------------------------------------------
1 | """
2 | ASGI config for sket_rest project.
3 |
4 | It exposes the ASGI callable as a module-level variable named ``application``.
5 |
6 | For more information on this file, see
7 | https://docs.djangoproject.com/en/3.2/howto/deployment/asgi/
8 | """
9 |
10 | import os
11 |
12 | from django.core.asgi import get_asgi_application
13 |
14 | os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'sket_server.sket_rest_config.settings')
15 |
16 | application = get_asgi_application()
17 |
--------------------------------------------------------------------------------
/sket_server/sket_rest_config/config.json:
--------------------------------------------------------------------------------
1 | {
2 | "w2v_model":true,
3 | "fasttext_model": null,
4 | "bert_model":null,
5 | "string_model": false,
6 | "gpu":null,
7 | "thr":0.9
8 | }
9 |
--------------------------------------------------------------------------------
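The values in `config.json` are passed straight into the `SKET` constructor (which model backends to enable and which GPU to use), while `thr` is read per request as the similarity threshold for each pipeline call. A minimal sketch of parsing such a payload with fallbacks for missing keys (`load_config` is a hypothetical helper; the server simply calls `json.load` on the file):

```python
import json

# Defaults matching the shipped config.json.
DEFAULTS = {'w2v_model': True, 'fasttext_model': None, 'bert_model': None,
            'string_model': False, 'gpu': None, 'thr': 0.9}

def load_config(text):
    """Parse a config.json payload, falling back to defaults for absent keys."""
    cfg = dict(DEFAULTS)
    cfg.update(json.loads(text))
    return cfg

cfg = load_config('{"thr": 0.7, "gpu": 0}')
print(cfg['thr'], cfg['gpu'], cfg['w2v_model'])  # 0.7 0 True
```

Since the server reads the file both at startup (`sket_rest_config/__init__.py`) and on every POST (`views.annotate`), threshold changes take effect without restarting, but model selection does not.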
/sket_server/sket_rest_config/settings.py:
--------------------------------------------------------------------------------
1 | """
2 | Django settings for sket_rest project.
3 |
4 | Generated by 'django-admin startproject' using Django 3.2.7.
5 |
6 | For more information on this file, see
7 | https://docs.djangoproject.com/en/3.2/topics/settings/
8 |
9 | For the full list of settings and their values, see
10 | https://docs.djangoproject.com/en/3.2/ref/settings/
11 | """
12 |
13 | from pathlib import Path
14 | import os
15 | # Build paths inside the project like this: BASE_DIR / 'subdir'.
16 | BASE_DIR = Path(__file__).resolve().parent.parent
17 |
18 | # Quick-start development settings - unsuitable for production
19 | # See https://docs.djangoproject.com/en/3.2/howto/deployment/checklist/
20 |
21 | # SECURITY WARNING: keep the secret key used in production secret!
22 | SECRET_KEY = 'django-insecure-co+vta_0^gtvr&-m7@254lf)3zw!i!yel^=bd9b7yjf_@&h6-t'
23 |
24 | # SECURITY WARNING: don't run with debug turned on in production!
25 | DEBUG = True
26 |
27 | ALLOWED_HOSTS = ['*']
28 |
29 |
30 | # Application definition
31 |
32 | INSTALLED_APPS = [
33 | 'django.contrib.admin',
34 | 'django.contrib.auth',
35 | 'django.contrib.contenttypes',
36 | 'django.contrib.sessions',
37 | 'django.contrib.messages',
38 | 'django.contrib.staticfiles',
39 | 'sket_server.sket_rest_app.apps.SketRestAppConfig',
40 | 'rest_framework',
41 | ]
42 |
43 | MIDDLEWARE = [
44 | 'django.middleware.security.SecurityMiddleware',
45 | 'django.contrib.sessions.middleware.SessionMiddleware',
46 | 'django.middleware.common.CommonMiddleware',
47 | 'django.middleware.csrf.CsrfViewMiddleware',
48 | 'django.contrib.auth.middleware.AuthenticationMiddleware',
49 | 'django.contrib.messages.middleware.MessageMiddleware',
50 | 'django.middleware.clickjacking.XFrameOptionsMiddleware',
51 | ]
52 |
53 | ROOT_URLCONF = 'sket_server.sket_rest_config.urls'
54 |
55 | TEMPLATES = [
56 | {
57 | 'BACKEND': 'django.template.backends.django.DjangoTemplates',
58 |         'DIRS': [BASE_DIR / 'templates'],
59 |
60 | 'APP_DIRS': True,
61 | 'OPTIONS': {
62 | 'context_processors': [
63 | 'django.template.context_processors.debug',
64 | 'django.template.context_processors.request',
65 | 'django.contrib.auth.context_processors.auth',
66 | 'django.contrib.messages.context_processors.messages',
67 | ],
68 | },
69 | },
70 | ]
71 |
72 | WSGI_APPLICATION = 'sket_server.sket_rest_config.wsgi.application'
73 | REST_FRAMEWORK = {
74 | 'DEFAULT_AUTHENTICATION_CLASSES': [],
75 | 'DEFAULT_PERMISSION_CLASSES': [],
76 | }
77 |
78 | # Database
79 | # https://docs.djangoproject.com/en/3.2/ref/settings/#databases
80 |
81 |
82 |
83 | # Password validation
84 | # https://docs.djangoproject.com/en/3.2/ref/settings/#auth-password-validators
85 |
86 | AUTH_PASSWORD_VALIDATORS = [
87 | {
88 | 'NAME': 'django.contrib.auth.password_validation.UserAttributeSimilarityValidator',
89 | },
90 | {
91 | 'NAME': 'django.contrib.auth.password_validation.MinimumLengthValidator',
92 | },
93 | {
94 | 'NAME': 'django.contrib.auth.password_validation.CommonPasswordValidator',
95 | },
96 | {
97 | 'NAME': 'django.contrib.auth.password_validation.NumericPasswordValidator',
98 | },
99 | ]
100 |
101 |
102 | # Internationalization
103 | # https://docs.djangoproject.com/en/3.2/topics/i18n/
104 |
105 | LANGUAGE_CODE = 'en-us'
106 |
107 | TIME_ZONE = 'UTC'
108 |
109 | USE_I18N = True
110 |
111 | USE_L10N = True
112 |
113 | USE_TZ = True
114 |
115 |
116 | # Static files (CSS, JavaScript, Images)
117 | # https://docs.djangoproject.com/en/3.2/howto/static-files/
118 |
119 | STATIC_URL = '/static/'
120 |
121 | # Default primary key field type
122 | # https://docs.djangoproject.com/en/3.2/ref/settings/#default-auto-field
123 |
124 | DEFAULT_AUTO_FIELD = 'django.db.models.BigAutoField'
125 |
--------------------------------------------------------------------------------
/sket_server/sket_rest_config/urls.py:
--------------------------------------------------------------------------------
1 | """sket_rest URL Configuration
2 |
3 | The `urlpatterns` list routes URLs to views. For more information please see:
4 | https://docs.djangoproject.com/en/3.2/topics/http/urls/
5 | Examples:
6 | Function views
7 | 1. Add an import: from my_app import views
8 | 2. Add a URL to urlpatterns: path('', views.home, name='home')
9 | Class-based views
10 | 1. Add an import: from other_app.views import Home
11 | 2. Add a URL to urlpatterns: path('', Home.as_view(), name='home')
12 | Including another URLconf
13 | 1. Import the include() function: from django.urls import include, path
14 | 2. Add a URL to urlpatterns: path('blog/', include('blog.urls'))
15 | """
16 | from django.urls import include, path
17 |
18 | urlpatterns = [
19 |     path('', include('sket_server.sket_rest_app.urls')),
20 |     path('api-auth/', include('rest_framework.urls', namespace='rest_framework')),
21 | ]
22 |
--------------------------------------------------------------------------------
/sket_server/sket_rest_config/wsgi.py:
--------------------------------------------------------------------------------
1 | """
2 | WSGI config for sket_rest project.
3 |
4 | It exposes the WSGI callable as a module-level variable named ``application``.
5 |
6 | For more information on this file, see
7 | https://docs.djangoproject.com/en/3.2/howto/deployment/wsgi/
8 | """
9 |
10 | import os
11 |
12 | from django.core.wsgi import get_wsgi_application
13 |
14 | os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'sket_server.sket_rest_config.settings')
15 |
16 | application = get_wsgi_application()
17 |
--------------------------------------------------------------------------------