├── .gitattributes
├── .gitignore
├── README.md
├── cooccur-topics.ipynb
├── cooccurrence.py
├── data
│   ├── DO-slim-to-mesh.tsv
│   ├── disease-disease-cooccurrence.tsv
│   ├── disease-pmids-topic.tsv.gz
│   ├── disease-pmids.tsv.gz
│   ├── disease-symptom-cooccurrence.tsv
│   ├── disease-uberon-cooccurrence.tsv
│   ├── mesh-nxo-node-link.json.gz
│   ├── mesh-term-topics-noexp.jsonl.gz
│   ├── symptom-pmids.tsv.gz
│   └── uberon-pmids.tsv.gz
├── diseases.ipynb
├── download-topics.ipynb
├── environment.yml
├── eutility.py
├── symptoms.ipynb
└── tissues.ipynb
/.gitattributes:
--------------------------------------------------------------------------------
1 | *.xz filter=lfs diff=lfs merge=lfs -text
2 | *.gz filter=lfs diff=lfs merge=lfs -text
3 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Python
2 | __pycache__/
3 | *.egg-info/
4 | pip-wheel-metadata/
5 | .ipynb_checkpoints
6 | .cache
7 | .pytest_cache/
8 | build/
9 | dist/
10 |
11 | # System specific files
12 |
13 | ## Linux
14 | *~
15 | .Trash-*
16 |
17 | ## macOS
18 | .DS_Store
19 | ._*
20 | .Trashes
21 |
22 | ## Windows
23 | Thumbs.db
24 | [Dd]esktop.ini
25 |
26 | ## Text Editors
27 | .vscode
28 | .idea/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Computing term cooccurrence in MEDLINE
2 |
3 | This repository quantifies term cooccurrence in MEDLINE.
4 | It's designed for computing the cooccurrence of all pairs of terms between two MeSH termsets.
5 | The repository computes MEDLINE cooccurrences for the Rephetio hetnet.
6 | See the corresponding [Thinklab discussion](https://doi.org/10.15363/thinklab.d67 "Mining knowledge from MEDLINE articles and their indexed MeSH terms") for more information.
7 |
8 | ## Modules
9 |
10 | + [`eutility.py`](eutility.py) defines an `esearch_query` function for retrieving PubMed IDs matching a user-defined query.
11 | + [`cooccurrence.py`](cooccurrence.py) computes the cooccurrences between two termsets,
12 | whose associated PubMed IDs have been retrieved.
13 |
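As a rough illustration of the metrics that `cooccurrence.py` reports for a pair of terms, here is a minimal standard-library sketch.
The function name `toy_cooccurrence_metrics` and the PubMed IDs are hypothetical; the repository's own `cooccurrence_metrics` relies on `scipy.stats.fisher_exact` with `alternative='greater'`, whereas this sketch computes the same one-sided p-value directly from the hypergeometric distribution.

```python
from math import comb
from typing import Dict, Set


def toy_cooccurrence_metrics(source: Set[str], target: Set[str], total: int) -> Dict[str, float]:
    """Hypothetical sketch: cooccurrence metrics for two sets of PubMed IDs."""
    a = len(source & target)  # articles annotated with both terms
    n_source, n_target = len(source), len(target)
    # Expected overlap if the two terms were assigned to articles independently
    expected = n_source * n_target / total
    enrichment = a / expected
    # One-sided Fisher exact test: probability of an overlap of at least `a`
    upper = min(n_source, n_target)
    p_fisher = sum(
        comb(n_source, k) * comb(total - n_source, n_target - k)
        for k in range(a, upper + 1)
    ) / comb(total, n_target)
    return {
        "cooccurrence": a,
        "expected": expected,
        "enrichment": enrichment,
        "p_fisher": p_fisher,
    }


# Made-up PubMed IDs in a made-up corpus of 20 articles
metrics = toy_cooccurrence_metrics({"1", "2", "3", "4"}, {"3", "4", "5"}, total=20)
print(metrics)
```

In this toy corpus the two terms share 2 articles where 0.6 would be expected by chance, so the enrichment is above 1.
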
14 | ## Notebooks
15 |
16 | The following notebooks were used to compute relationships for Hetionet v1.0 by [Project Rephetio](https://git.dhimmel.com/rephetio-manuscript/):
17 |
18 | + [`diseases.ipynb`](diseases.ipynb) computes disease-disease cooccurrence
19 | + [`symptoms.ipynb`](symptoms.ipynb) computes symptom-disease cooccurrence
20 | + [`tissues.ipynb`](tissues.ipynb) computes anatomy-disease cooccurrence.
21 | This notebook depends on `data/disease-pmids.tsv.gz`,
22 | a dataset created by `symptoms.ipynb`.
23 |
24 | The following notebooks are for a more general analysis to support custom user queries:
25 |
26 | - [`download-topics.ipynb`](download-topics.ipynb) downloads the PubMed IDs for all MeSH descriptors and supplementary disease concepts and saves this to [`data/mesh-term-topics-noexp.jsonl.gz`](data/mesh-term-topics-noexp.jsonl.gz).
27 | - [`cooccur-topics.ipynb`](cooccur-topics.ipynb) reads `mesh-term-topics-noexp.jsonl.gz` to compute cooccurrence between a user-selected term and all other MeSH terms.
28 |
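The topics file is gzipped JSON Lines, so it can also be read without the `jsonlines` package.
The sketch below assumes each line is a JSON object with `mesh_id` and `pubmed_ids` fields, as used in `cooccur-topics.ipynb`; the helper name `read_topics`, the demo file, and the PubMed IDs are hypothetical.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Dict, Set


def read_topics(path: Path) -> Dict[str, Set[str]]:
    """Map each MeSH ID to its set of PubMed IDs, skipping terms without articles."""
    mesh_id_to_pmids: Dict[str, Set[str]] = {}
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            record = json.loads(line)
            if record["pubmed_ids"]:
                mesh_id_to_pmids[record["mesh_id"]] = set(record["pubmed_ids"])
    return mesh_id_to_pmids


# Write a tiny made-up file in the assumed format, then read it back
records = [
    {"mesh_id": "D005357", "pubmed_ids": ["101", "102", "103"]},
    {"mesh_id": "D009103", "pubmed_ids": ["103", "104"]},
    {"mesh_id": "D000000", "pubmed_ids": []},  # placeholder term without articles
]
with tempfile.TemporaryDirectory() as directory:
    demo_path = Path(directory) / "demo-topics.jsonl.gz"
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    topics = read_topics(demo_path)

# The corpus size used for the metrics is the number of distinct PubMed IDs
all_pmids = set().union(*topics.values())
print(len(topics), len(all_pmids))
```

For the real file, pass `Path("data/mesh-term-topics-noexp.jsonl.gz")` to the helper instead of the demo path.
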
29 | ## Environment
30 |
31 | ```shell
32 | # create environment
33 | conda env create --file=environment.yml
34 |
35 | # update environment
36 | conda env update --file=environment.yml
37 |
38 | # activate environment
39 | conda activate medline
40 |
41 | # run jupyter lab for notebook development
42 | jupyter lab
43 | ```
44 |
45 | ## History
46 |
47 | On 2021-04-09, ownership of this repository on GitHub was changed from `dhimmel/medline` to `hetio/medline`.
48 | The `hetio` organization has GitHub LFS quota,
49 | providing a more convenient way to store large compressed files.
50 |
51 | At the time of the transfer, the default (and only) branch was `gh-pages`.
52 | The `gh-pages` branch was renamed to `pre-lfs-archive`.
53 | A new default branch `main` was created, whose history has been migrated to use Git LFS.
54 | For the version of this repository used by Project Rephetio to create Hetionet v1.0,
55 | refer to the [v1.0 release](https://github.com/hetio/medline/releases/tag/v1.0).
56 |
57 | ## Comparison to MRCOC
58 |
59 | MEDLINE [produces co-occurrence files](https://ii.nlm.nih.gov/MRCOC.shtml) under the codename MRCOC.
60 | More information is available in the 2016 report [Building an Updated MEDLINE Co-Occurrences (MRCOC) File](https://ii.nlm.nih.gov/MRCOC/MRCOC_Doc_2016.pdf).
61 | These files might be a viable alternative to the analyses in this repository for certain applications.
62 | However, they don't appear to contain topics for supplemental concept records
63 | (for example MeSH term [`C000591739`](https://id.nlm.nih.gov/mesh/2020/C000591739.html)).
64 | Feel free to open an issue with additional insights on or comparisons to MRCOC.
65 |
66 | ## License
67 |
68 | This repository is released under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/ "CC0 1.0 Universal: Public Domain Dedication").
69 |
--------------------------------------------------------------------------------
/cooccur-topics.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "6e78410f-f668-4e79-8983-0ac747c58d6d",
6 | "metadata": {},
7 | "source": [
8 | "# Cooccurrence of a user-selected term against all MeSH terms with citations"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "id": "fe5814b0-9d56-482f-8990-309f9c9b2db2",
15 | "metadata": {},
16 | "outputs": [
17 | {
18 | "name": "stdout",
19 | "output_type": "stream",
20 | "text": [
21 | "7744a88 (HEAD -> main, origin/main) Query pubmed with quoted MeSH terms and [nm]\n"
22 | ]
23 | }
24 | ],
25 | "source": [
26 | "! git log -1 --oneline"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": 2,
32 | "id": "135775b5-8fae-4a1d-81d0-96fc9989dd13",
33 | "metadata": {},
34 | "outputs": [],
35 | "source": [
36 | "import datetime\n",
37 | "import gzip\n",
38 | "import pathlib\n",
39 | "from typing import List, Set\n",
40 | "\n",
41 | "import scipy.stats\n",
42 | "import tqdm\n",
43 | "import jsonlines\n",
44 | "import tqdm\n",
45 | "import pandas as pd\n",
46 | "from nxontology import NXOntology\n",
47 | "\n",
48 | "from cooccurrence import cooccurrence_metrics"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": 3,
54 | "id": "bb3b6ec1-7e21-46b6-a8f5-c1c9fa065a3c",
55 | "metadata": {},
56 | "outputs": [
57 | {
58 | "data": {
59 | "text/plain": [
60 | "300093"
61 | ]
62 | },
63 | "execution_count": 3,
64 | "metadata": {},
65 | "output_type": "execute_result"
66 | }
67 | ],
68 | "source": [
69 | "# read the MeSH ontology\n",
70 | "nxo = NXOntology.read_node_link_json(\"data/mesh-nxo-node-link.json.gz\")\n",
71 | "nxo.freeze()\n",
72 | "nxo.n_nodes"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 4,
78 | "id": "6ee9d0ef-9212-4f2e-8d67-ce6edead3546",
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "data": {
83 | "text/plain": [
84 | "35533"
85 | ]
86 | },
87 | "execution_count": 4,
88 | "metadata": {},
89 | "output_type": "execute_result"
90 | }
91 | ],
92 | "source": [
93 | "# Read the jsonlines file\n",
94 | "path = pathlib.Path('data/mesh-term-topics-noexp.jsonl.gz')\n",
95 | "with jsonlines.Reader(gzip.open(path, \"rt\")) as reader:\n",
96 | " lines = list(reader)\n",
97 | "for line in lines:\n",
98 | " line[\"pubmed_ids\"] = set(line[\"pubmed_ids\"])\n",
99 | "len(lines)"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 5,
105 | "id": "5e9a2205-2528-4c15-8309-39c3f5a37cfd",
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
109 | "# filter out topics without pubmed_ids since cooccurrence cannot be computed\n",
110 | "mesh_id_to_line = {line[\"mesh_id\"]: line for line in lines if line[\"pubmed_ids\"]}"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": 6,
116 | "id": "03403778-2573-4f0b-9106-1f2608c26c46",
117 | "metadata": {},
118 | "outputs": [
119 | {
120 | "data": {
121 | "text/plain": [
122 | "27698253"
123 | ]
124 | },
125 | "execution_count": 6,
126 | "metadata": {},
127 | "output_type": "execute_result"
128 | }
129 | ],
130 | "source": [
131 | "all_pmids: Set[str] = set()\n",
132 | "for line in lines:\n",
133 | " all_pmids |= set(line[\"pubmed_ids\"])\n",
134 | "len(all_pmids)"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 7,
140 | "id": "d76747d0-6640-4b9b-9d1c-305433fd29ee",
141 | "metadata": {},
142 | "outputs": [],
143 | "source": [
144 | "def explode_pubmid_ids(nxo: NXOntology, mesh_id_to_line: dict, topic: str):\n",
145 | " exploded_pubmed_ids = set()\n",
146 | " for descendant in nxo.node_info(topic).descendants:\n",
147 | " if descendant not in mesh_id_to_line:\n",
148 | " continue\n",
149 | " exploded_pubmed_ids |= set(mesh_id_to_line[descendant][\"pubmed_ids\"])\n",
150 | " return exploded_pubmed_ids\n",
151 | "\n",
152 | "def cooccurrence_result(source_mesh_id: str, target_mesh_id: str, nxo: NXOntology, mesh_id_to_line: dict, total_pmids: int) -> dict:\n",
153 | " source_pmids = explode_pubmid_ids(nxo, mesh_id_to_line, source_mesh_id)\n",
154 | " target_pmids = explode_pubmid_ids(nxo, mesh_id_to_line, target_mesh_id)\n",
155 | " result = {\n",
156 | " \"source_mesh_id\": source_mesh_id,\n",
157 | " \"target_mesh_id\": target_mesh_id,\n",
158 | " \"source_mesh_label\": nxo.node_info(source_mesh_id).label,\n",
159 | " \"target_mesh_label\": nxo.node_info(target_mesh_id).label,\n",
160 | " }\n",
161 | " result.update(cooccurrence_metrics(source_pmids, target_pmids, total_pmids=total_pmids))\n",
162 | " return result"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 8,
168 | "id": "bc9ce436-c33c-4027-8005-01c0da8e3972",
169 | "metadata": {},
170 | "outputs": [
171 | {
172 | "data": {
173 | "text/plain": [
174 | "{'source_mesh_id': 'D005357',\n",
175 | " 'target_mesh_id': 'D009103',\n",
176 | " 'source_mesh_label': 'Fibrous Dysplasia of Bone',\n",
177 | " 'target_mesh_label': 'Multiple Sclerosis',\n",
178 | " 'cooccurrence': 0,\n",
179 | " 'expected': 10.945734736410992,\n",
180 | " 'enrichment': 0.0,\n",
181 | " 'odds_ratio': 0.0,\n",
182 | " 'p_fisher': 1.0,\n",
183 | " 'n_source': 4985,\n",
184 | " 'n_target': 60818}"
185 | ]
186 | },
187 | "execution_count": 8,
188 | "metadata": {},
189 | "output_type": "execute_result"
190 | }
191 | ],
192 | "source": [
193 | "source_mesh_id = \"D005357\" # Fibrous Dysplasia of Bone\n",
194 | "target_mesh_id = \"D009103\"\n",
195 | "cooccurrence_result(source_mesh_id, target_mesh_id, nxo, mesh_id_to_line, total_pmids=len(all_pmids))"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 9,
201 | "id": "77341fd8-e927-41a0-bbe1-7692a69d4560",
202 | "metadata": {},
203 | "outputs": [
204 | {
205 | "name": "stderr",
206 | "output_type": "stream",
207 | "text": [
208 | "100%|██████████| 32568/32568 [05:11<00:00, 104.58it/s] \n"
209 | ]
210 | }
211 | ],
212 | "source": [
213 | "source_mesh_id = \"D005357\" # Fibrous Dysplasia of Bone\n",
214 | "\n",
215 | "rows = list()\n",
216 | "for target_mesh_id in tqdm.tqdm(mesh_id_to_line):\n",
217 | " # for development\n",
218 | "# if len(rows) > 1000:\n",
219 | "# break\n",
220 | " row = cooccurrence_result(source_mesh_id, target_mesh_id, nxo, mesh_id_to_line, total_pmids=len(all_pmids))\n",
221 | " rows.append(row)"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": 10,
227 | "id": "fc6360dd-9138-44ba-aca6-ee144aa3095f",
228 | "metadata": {},
229 | "outputs": [
230 | {
231 | "data": {
479 | "text/plain": [
480 | " source_mesh_id target_mesh_id source_mesh_label \\\n",
481 | "8235 D005357 D002636 Fibrous Dysplasia of Bone \n",
482 | "10805 D005357 D005357 Fibrous Dysplasia of Bone \n",
483 | "10806 D005357 D005358 Fibrous Dysplasia of Bone \n",
484 | "10807 D005357 D005359 Fibrous Dysplasia of Bone \n",
485 | "15098 D005357 D010002 Fibrous Dysplasia of Bone \n",
486 | "15105 D005357 D010009 Fibrous Dysplasia of Bone \n",
487 | "15096 D005357 D010000 Fibrous Dysplasia of Bone \n",
488 | "23170 D005357 D019205 Fibrous Dysplasia of Bone \n",
489 | "7503 D005357 D001848 Fibrous Dysplasia of Bone \n",
490 | "16640 D005357 D011629 Fibrous Dysplasia of Bone \n",
491 | "7504 D005357 D001849 Fibrous Dysplasia of Bone \n",
492 | "15112 D005357 D010016 Fibrous Dysplasia of Bone \n",
493 | "27191 D005357 D044385 Fibrous Dysplasia of Bone \n",
494 | "13657 D005357 D008439 Fibrous Dysplasia of Bone \n",
495 | "7500 D005357 D001845 Fibrous Dysplasia of Bone \n",
496 | "\n",
497 | " target_mesh_label cooccurrence expected \\\n",
498 | "8235 Cherubism 432 0.077749 \n",
499 | "10805 Fibrous Dysplasia of Bone 4985 0.897177 \n",
500 | "10806 Fibrous Dysplasia, Monostotic 455 0.081889 \n",
501 | "10807 Fibrous Dysplasia, Polyostotic 1446 0.260244 \n",
502 | "15098 Osteitis Fibrosa Cystica 570 0.291920 \n",
503 | "15105 Osteochondrodysplasias 4985 5.510482 \n",
504 | "15096 Osteitis 508 0.711802 \n",
505 | "23170 GTP-Binding Protein alpha Subunits, Gs 218 0.385507 \n",
506 | "7503 Bone Diseases, Developmental 4985 14.657724 \n",
507 | "16640 Puberty, Precocious 190 0.841564 \n",
508 | "7504 Bone Diseases, Endocrine 660 3.252513 \n",
509 | "15112 Osteoma 200 1.073912 \n",
510 | "27191 GTP-Binding Protein alpha Subunits 222 1.412806 \n",
511 | "13657 Maxillary Diseases 220 1.591701 \n",
512 | "7500 Bone Cysts 282 2.398530 \n",
513 | "\n",
514 | " enrichment odds_ratio p_fisher n_source n_target \n",
515 | "8235 5556.319559 inf 0.0 4985 432 \n",
516 | "10805 5556.319559 inf 0.0 4985 4985 \n",
517 | "10806 5556.319559 inf 0.0 4985 455 \n",
518 | "10807 5556.319559 inf 0.0 4985 1446 \n",
519 | "15098 1952.590720 3398.490955 0.0 4985 1622 \n",
520 | "15105 904.639526 inf 0.0 4985 30618 \n",
521 | "15096 713.681501 911.497502 0.0 4985 3955 \n",
522 | "23170 565.489105 658.188528 0.0 4985 2142 \n",
523 | "7503 340.093722 inf 0.0 4985 81443 \n",
524 | "16640 225.770042 244.573598 0.0 4985 4676 \n",
525 | "7504 202.920037 242.554998 0.0 4985 18072 \n",
526 | "15112 186.234944 200.669728 0.0 4985 5967 \n",
527 | "27191 157.134133 169.167245 0.0 4985 7850 \n",
528 | "13657 138.216904 148.214254 0.0 4985 8844 \n",
529 | "7500 117.572005 127.232960 0.0 4985 13327 "
530 | ]
531 | },
532 | "execution_count": 10,
533 | "metadata": {},
534 | "output_type": "execute_result"
535 | }
536 | ],
537 | "source": [
538 | "cooccur_df = pd.DataFrame(rows)\n",
539 | "cooccur_df = cooccur_df.sort_values(by=[\"p_fisher\", \"enrichment\"], ascending=[True, False])\n",
540 | "cooccur_df.head(15)"
541 | ]
542 | },
543 | {
544 | "cell_type": "code",
545 | "execution_count": 11,
546 | "id": "8e0ad8d5-6414-45ec-aad7-1ca73c0c2d3a",
547 | "metadata": {},
548 | "outputs": [],
549 | "source": [
550 | "# cooccur_df.head(1000).to_excel(\"data/medline-cooccurrence.xlsx\", index=False, freeze_panes=(0, 1))"
551 | ]
552 | }
553 | ],
554 | "metadata": {
555 | "kernelspec": {
556 | "display_name": "Python 3",
557 | "language": "python",
558 | "name": "python3"
559 | },
560 | "language_info": {
561 | "codemirror_mode": {
562 | "name": "ipython",
563 | "version": 3
564 | },
565 | "file_extension": ".py",
566 | "mimetype": "text/x-python",
567 | "name": "python",
568 | "nbconvert_exporter": "python",
569 | "pygments_lexer": "ipython3",
570 | "version": "3.9.2"
571 | }
572 | },
573 | "nbformat": 4,
574 | "nbformat_minor": 5
575 | }
576 |
--------------------------------------------------------------------------------
/cooccurrence.py:
--------------------------------------------------------------------------------
1 | import itertools
2 | from typing import Any, Dict, List, Set
3 |
4 | import scipy.stats
5 | import pandas
6 |
7 |
8 | def read_pmids_tsv(path, key, min_articles = 1):
9 | term_to_pmids = dict()
10 | pmids_df = pandas.read_table(path, compression='gzip')
11 | pmids_df = pmids_df[pmids_df.n_articles >= min_articles]
12 | for i, row in pmids_df.iterrows():
13 | term = row[key]
14 | pmids = row.pubmed_ids.split('|')
15 | term_to_pmids[term] = set(pmids)
16 | pmids_df.drop('pubmed_ids', axis=1, inplace=True)
17 | return pmids_df, term_to_pmids
18 |
19 | def score_pmid_cooccurrence(term0_to_pmids, term1_to_pmids, term0_name='term_0', term1_name='term_1', verbose=True):
20 | """
21 | Find pubmed cooccurrence between topics of two classes.
22 |
23 | term0_to_pmids -- a dictionary that returns the pubmed_ids for each term of class 0
24 | term1_to_pmids -- a dictionary that returns the pubmed_ids for each term of class 1
25 | """
26 | all_pmids0 = set.union(*term0_to_pmids.values())
27 | all_pmids1 = set.union(*term1_to_pmids.values())
28 | pmids_in_both = all_pmids0 & all_pmids1
29 | total_pmids = len(pmids_in_both)
30 | if verbose:
31 | print('Total articles containing a {}: {}'.format(term0_name, len(all_pmids0)))
32 | print('Total articles containing a {}: {}'.format(term1_name, len(all_pmids1)))
33 | print('Total articles containing both a {} and {}: {}'.format(term0_name, term1_name, total_pmids))
34 |
35 | term0_to_pmids = term0_to_pmids.copy()
36 | term1_to_pmids = term1_to_pmids.copy()
37 | for d in term0_to_pmids, term1_to_pmids:
38 | for key, value in list(d.items()):
39 | d[key] = value & pmids_in_both
40 | if not d[key]:
41 | del d[key]
42 |
43 | if verbose:
44 | print('\nAfter removing terms without any cooccurrences:')
45 | print('+ {} {}s remain'.format(len(term0_to_pmids), term0_name))
46 | print('+ {} {}s remain'.format(len(term1_to_pmids), term1_name))
47 |
48 | rows = list()
49 | for term0, term1 in itertools.product(term0_to_pmids, term1_to_pmids):
50 | pmids0 = term0_to_pmids[term0]
51 | pmids1 = term1_to_pmids[term1]
52 | row = {
53 | term0_name: term0,
54 | term1_name: term1,
55 | **cooccurrence_metrics(pmids0, pmids1, total_pmids=total_pmids)
56 | }
57 | rows.append(row)
58 | df = pandas.DataFrame(rows)
59 |
60 | if verbose:
61 | print('\nCooccurrence scores calculated for {} {} -- {} pairs'.format(len(df), term0_name, term1_name))
62 | return df
63 |
64 |
65 | def cooccurrence_metrics(source_pmids: Set[str], target_pmids: Set[str], total_pmids: int) -> Dict[str, Any]:
66 | """
67 | Compute metrics of cooccurrence between two sets of pubmed ids.
68 | Requires providing the total number of pubmed ids in the corpus.
69 | """
70 | a = len(source_pmids & target_pmids)
71 | b = len(source_pmids) - a
72 | c = len(target_pmids) - a
73 | d = total_pmids - (a + b + c)
74 | contingency_table = [[a, b], [c, d]]
75 | # discussion on this formula in https://github.com/hetio/medline/issues/1
76 | expected = len(source_pmids) * len(target_pmids) / total_pmids
77 | enrichment = a / expected
78 | odds_ratio, p_fisher = scipy.stats.fisher_exact(contingency_table, alternative='greater')
79 | return {
80 | "cooccurrence": a,
81 | "expected": expected,
82 | "enrichment": enrichment,
83 | "odds_ratio": odds_ratio,
84 | "p_fisher": p_fisher,
85 | "n_source": len(source_pmids),
86 | "n_target": len(target_pmids),
87 | }
88 |
--------------------------------------------------------------------------------
/data/DO-slim-to-mesh.tsv:
--------------------------------------------------------------------------------
1 | doid_code doid_name mesh_id mesh_name
2 | DOID:2531 hematologic cancer D019337 Hematologic Neoplasms
3 | DOID:1319 brain cancer D001932 Brain Neoplasms
4 | DOID:1324 lung cancer D008175 Lung Neoplasms
5 | DOID:263 kidney cancer D007680 Kidney Neoplasms
6 | DOID:1793 pancreatic cancer D010190 Pancreatic Neoplasms
7 | DOID:4159 skin cancer D012878 Skin Neoplasms
8 | DOID:184 bone cancer D001859 Bone Neoplasms
9 | DOID:0060119 pharynx cancer D010610 Pharyngeal Neoplasms
10 | DOID:2394 ovarian cancer D010051 Ovarian Neoplasms
11 | DOID:1612 breast cancer D001943 Breast Neoplasms
12 | DOID:3070 malignant glioma D005910 Glioma
13 | DOID:363 uterine cancer D014594 Uterine Neoplasms
14 | DOID:3953 adrenal gland cancer D000310 Adrenal Gland Neoplasms
15 | DOID:5041 esophageal cancer D004938 Esophageal Neoplasms
16 | DOID:8850 salivary gland cancer D012468 Salivary Gland Neoplasms
17 | DOID:10283 prostate cancer D011471 Prostatic Neoplasms
18 | DOID:10534 stomach cancer D013274 Stomach Neoplasms
19 | DOID:11054 urinary bladder cancer D001749 Urinary Bladder Neoplasms
20 | DOID:1192 peripheral nervous system neoplasm D010524 Peripheral Nervous System Neoplasms
21 | DOID:1781 thyroid cancer D013964 Thyroid Neoplasms
22 | DOID:3571 liver cancer D008113 Liver Neoplasms
23 | DOID:4362 cervical cancer D002583 Uterine Cervical Neoplasms
24 | DOID:119 vaginal cancer D014625 Vaginal Neoplasms
25 | DOID:11934 head and neck cancer D006258 Head and Neck Neoplasms
26 | DOID:1993 rectum cancer D012004 Rectal Neoplasms
27 | DOID:2174 ocular cancer D005134 Eye Neoplasms
28 | DOID:219 colon cancer D003110 Colonic Neoplasms
29 | DOID:2596 larynx cancer D007822 Laryngeal Neoplasms
30 | DOID:2994 germ cell cancer D009373 Neoplasms, Germ Cell and Embryonal
31 | DOID:3277 thymus cancer D013953 Thymus Neoplasms
32 | DOID:4045 muscle cancer D009217 Myosarcoma
33 | DOID:10021 duodenum cancer D004379 Duodenal Neoplasms
34 | DOID:10153 ileum cancer D007078 Ileal Neoplasms
35 | DOID:1115 sarcoma D012509 Sarcoma
36 | DOID:11239 appendix cancer D001063 Appendiceal Neoplasms
37 | DOID:11615 penile cancer D010412 Penile Neoplasms
38 | DOID:11819 ureter cancer D014516 Ureteral Neoplasms
39 | DOID:11920 tracheal cancer D014134 Tracheal Neoplasms
40 | DOID:1245 vulva cancer D014846 Vulvar Neoplasms
41 | DOID:13499 jejunal cancer D007580 Jejunal Neoplasms
42 | DOID:1725 peritoneum cancer D010534 Peritoneal Neoplasms
43 | DOID:175 vascular cancer D019043 Vascular Neoplasms
44 | DOID:1790 malignant mesothelioma D008654 Mesothelioma
45 | DOID:1909 melanoma D008545 Melanoma
46 | DOID:1964 fallopian tube cancer D005185 Fallopian Tube Neoplasms
47 | DOID:2998 testicular cancer D013736 Testicular Neoplasms
48 | DOID:3121 gallbladder cancer D005706 Gallbladder Neoplasms
49 | DOID:3565 meningioma D008577 Meningeal Neoplasms
50 | DOID:4606 bile duct cancer D001650 Bile Duct Neoplasms
51 | DOID:5559 mediastinal cancer D008479 Mediastinal Neoplasms
52 | DOID:5612 spinal cancer D013120 Spinal Cord Neoplasms
53 | DOID:5875 retroperitoneal cancer D012186 Retroperitoneal Neoplasms
54 | DOID:8778 Crohn's disease D003424 Crohn Disease
55 | DOID:2377 multiple sclerosis D009103 Multiple Sclerosis
56 | DOID:9352 type 2 diabetes mellitus D003924 Diabetes Mellitus, Type 2
57 | DOID:8577 ulcerative colitis D003093 Colitis, Ulcerative
58 | DOID:9744 type 1 diabetes mellitus D003922 Diabetes Mellitus, Type 1
59 | DOID:7148 rheumatoid arthritis D001172 Arthritis, Rheumatoid
60 | DOID:3393 coronary artery disease D003324 Coronary Artery Disease
61 | DOID:3393 coronary artery disease D003327 Coronary Disease
62 | DOID:3393 coronary artery disease D017202 Myocardial Ischemia
63 | DOID:9970 obesity D009765 Obesity
64 | DOID:10608 celiac disease D002446 Celiac Disease
65 | DOID:9074 systemic lupus erythematosus D008180 Lupus Erythematosus, Systemic
66 | DOID:9835 refractive error D012030 Refractive Errors
67 | DOID:12236 primary biliary cirrhosis D008105 Liver Cirrhosis, Biliary
68 | DOID:12306 vitiligo D014820 Vitiligo
69 | DOID:10871 age related macular degeneration D008268 Macular Degeneration
70 | DOID:14221 metabolic syndrome X D024821 Metabolic Syndrome X
71 | DOID:2841 asthma D001249 Asthma
72 | DOID:8893 psoriasis D011565 Psoriasis
73 | DOID:5419 schizophrenia D012559 Schizophrenia
74 | DOID:6364 migraine D008881 Migraine Disorders
75 | DOID:10652 Alzheimer's disease D000544 Alzheimer Disease
76 | DOID:12361 Graves' disease D006111 Graves Disease
77 | DOID:14330 Parkinson's disease D010300 Parkinson Disease
78 | DOID:3310 atopic dermatitis D003876 Dermatitis, Atopic
79 | DOID:3312 bipolar disorder D001714 Bipolar Disorder
80 | DOID:7147 ankylosing spondylitis D013167 Spondylitis, Ankylosing
81 | DOID:11612 polycystic ovary syndrome D011085 Polycystic Ovary Syndrome
82 | DOID:10763 hypertension D006973 Hypertension
83 | DOID:418 systemic scleroderma D012595 Scleroderma, Systemic
84 | DOID:13241 Behcet's disease D001528 Behcet Syndrome
85 | DOID:5408 Paget's disease of bone D010001 Osteitis Deformans
86 | DOID:1024 leprosy D007918 Leprosy
87 | DOID:10941 intracranial aneurysm D002532 Intracranial Aneurysm
88 | DOID:1686 glaucoma D005901 Glaucoma
89 | DOID:332 amyotrophic lateral sclerosis D000690 Amyotrophic Lateral Sclerosis
90 | DOID:0050425 restless legs syndrome D012148 Restless Legs Syndrome
91 | DOID:13378 Kawasaki disease D009080 Mucocutaneous Lymph Node Syndrome
92 | DOID:1936 atherosclerosis D050197 Atherosclerosis
93 | DOID:986 alopecia areata D000506 Alopecia Areata
94 | DOID:11476 osteoporosis D010024 Osteoporosis
95 | DOID:1459 hypothyroidism D007037 Hypothyroidism
96 | DOID:2986 IgA glomerulonephritis D005922 Glomerulonephritis, IGA
97 | DOID:0050741 alcohol dependence D000437 Alcoholism
98 | DOID:11949 Creutzfeldt-Jakob disease D007562 Creutzfeldt-Jakob Syndrome
99 | DOID:14227 azoospermia D053713 Azoospermia
100 | DOID:1826 epilepsy syndrome D004827 Epilepsy
101 | DOID:2043 hepatitis B D006509 Hepatitis B
102 | DOID:3083 chronic obstructive pulmonary disease D029424 Pulmonary Disease, Chronic Obstructive
103 | DOID:7693 abdominal aortic aneurysm D017544 Aortic Aneurysm, Abdominal
104 | DOID:784 chronic kidney failure D007676 Kidney Failure, Chronic
105 | DOID:8398 osteoarthritis D010003 Osteoarthritis
106 | DOID:9008 psoriatic arthritis D015535 Arthritis, Psoriatic
107 | DOID:0050742 nicotine dependence D014029 Tobacco Use Disorder
108 | DOID:10976 membranous glomerulonephritis D015433 Glomerulonephritis, Membranous
109 | DOID:11714 gestational diabetes D016640 Diabetes, Gestational
110 | DOID:12365 malaria D008288 Malaria
111 | DOID:12849 autistic disorder D001321 Autistic Disorder
112 | DOID:12930 dilated cardiomyopathy D002311 Cardiomyopathy, Dilated
113 | DOID:13189 gout D015210 Arthritis, Gouty
114 | DOID:13223 uterine fibroid D007889 Leiomyoma
115 | DOID:14268 sclerosing cholangitis D015209 Cholangitis, Sclerosing
116 | DOID:8986 narcolepsy D009290 Narcolepsy
117 | DOID:90 degenerative disc disease D055959 Intervertebral Disc Degeneration
118 | DOID:9296 cleft lip D002971 Cleft Lip
119 | DOID:0050156 idiopathic pulmonary fibrosis D054990 Idiopathic Pulmonary Fibrosis
120 | DOID:1094 attention deficit hyperactivity disorder D001289 Attention Deficit Disorder with Hyperactivity
121 | DOID:11119 Gilles de la Tourette syndrome D005879 Tourette Syndrome
122 | DOID:14004 thoracic aortic aneurysm D017545 Aortic Aneurysm, Thoracic
123 | DOID:1595 endogenous depression D003866 Depressive Disorder
124 | DOID:4481 allergic rhinitis D065631 Rhinitis, Allergic
125 | DOID:4989 pancreatitis D010195 Pancreatitis
126 | DOID:585 nephrolithiasis D053040 Nephrolithiasis
127 | DOID:824 periodontitis D010518 Periodontitis
128 | DOID:9206 Barrett's esophagus D001471 Barrett Esophagus
129 | DOID:11555 Fuchs' endothelial dystrophy D005642 Fuchs' Endothelial Dystrophy
130 | DOID:12185 otosclerosis D010040 Otosclerosis
131 | DOID:12995 conduct disorder D019955 Conduct Disorder
132 | DOID:1312 focal segmental glomerulosclerosis D005923 Glomerulosclerosis, Focal Segmental
133 | DOID:216 dental caries D003731 Dental Caries
134 | DOID:2355 anemia D000740 Anemia
135 | DOID:594 panic disorder D016584 Panic Disorder
136 | DOID:635 acquired immunodeficiency syndrome D000163 Acquired Immunodeficiency Syndrome
137 |
--------------------------------------------------------------------------------
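The mapping rows above pair each Disease Ontology term with one MeSH descriptor, and `diseases.ipynb` turns the MeSH name into a PubMed search string. The sketch below illustrates that step on two rows copied from the mapping. It assumes the file is tab-delimited and that the columns are `doid_code`, `doid_name`, `mesh_id`, `mesh_name` (names taken from the notebook, since the header row is not shown here). `build_term_query` is a hypothetical helper name, not a function from this repository.

```python
import csv
import io

# Two rows copied from the mapping; a tab delimiter is assumed.
SAMPLE = (
    "DOID:1686\tglaucoma\tD005901\tGlaucoma\n"
    "DOID:13378\tKawasaki disease\tD009080\tMucocutaneous Lymph Node Syndrome\n"
)
# Assumed column names, borrowed from the diseases.ipynb dataframe.
FIELDS = ["doid_code", "doid_name", "mesh_id", "mesh_name"]


def build_term_query(mesh_name: str) -> str:
    """Lowercase the MeSH name and tag it with the MeSH Terms search field."""
    return "{}[MeSH Terms]".format(mesh_name.lower())


reader = csv.DictReader(io.StringIO(SAMPLE), fieldnames=FIELDS, delimiter="\t")
queries = {row["doid_code"]: build_term_query(row["mesh_name"]) for row in reader}

for doid_code, query in queries.items():
    print(doid_code, query)
```

Note that the query uses the MeSH name rather than the Disease Ontology name, so `Kawasaki disease` is searched as `mucocutaneous lymph node syndrome[MeSH Terms]`.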
/data/disease-pmids-topic.tsv.gz:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:acb1d6322425aa7e57d6f8a2cc185d9a837b3cd45d26f534f021a866b76cf98e
3 | size 16822856
4 |
--------------------------------------------------------------------------------
/data/disease-pmids.tsv.gz:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:29408bee2b7c81da5774b5ab31896ac91845380d8071d105902bff16c87f2697
3 | size 13432263
4 |
--------------------------------------------------------------------------------
/data/mesh-nxo-node-link.json.gz:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:c0221b2d057af5b1f018224eebf42ce0ec31907d0cfb276f03120eaa592d9c16
3 | size 9092183
4 |
--------------------------------------------------------------------------------
/data/mesh-term-topics-noexp.jsonl.gz:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:c0b28c6d4b5e5b6384156b450c0145a658c5cbbb10084de14b37fe22cb7d865d
3 | size 849544061
4 |
--------------------------------------------------------------------------------
/data/symptom-pmids.tsv.gz:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:a87a99ef60bf5553e49702202e20b3e70bb82b148156d1725be6c8954ef9b4a1
3 | size 7398164
4 |
--------------------------------------------------------------------------------
/data/uberon-pmids.tsv.gz:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:03247bfa21e5f078226a795bede0bec0fe91e798c95131dfa304a0f8bdaa0cbd
3 | size 21074726
4 |
--------------------------------------------------------------------------------
/diseases.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Compute disease-disease cooccurrence for Hetionet"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true,
15 | "jupyter": {
16 | "outputs_hidden": true
17 | }
18 | },
19 | "outputs": [],
20 | "source": [
21 | "import io\n",
22 | "import gzip\n",
23 | "\n",
24 | "import pandas\n",
25 | "import requests\n",
26 | "import networkx\n",
27 | "\n",
28 | "import eutility\n",
29 | "import cooccurrence"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 2,
35 | "metadata": {
36 | "collapsed": false,
37 | "jupyter": {
38 | "outputs_hidden": false
39 | }
40 | },
41 | "outputs": [
42 | {
43 | "data": {
96 | "text/plain": [
97 | " doid_code doid_name mesh_id mesh_name\n",
98 | "0 DOID:2531 hematologic cancer D019337 Hematologic Neoplasms\n",
99 | "1 DOID:1319 brain cancer D001932 Brain Neoplasms\n",
100 | "2 DOID:1324 lung cancer D008175 Lung Neoplasms\n",
101 | "3 DOID:263 kidney cancer D007680 Kidney Neoplasms\n",
102 | "4 DOID:1793 pancreatic cancer D010190 Pancreatic Neoplasms"
103 | ]
104 | },
105 | "execution_count": 2,
106 | "metadata": {},
107 | "output_type": "execute_result"
108 | }
109 | ],
110 | "source": [
111 | "# Read mappings for DO Slim terms\n",
112 | "url = 'https://raw.githubusercontent.com/dhimmel/disease-ontology/72614ade9f1cc5a5317b8f6836e1e464b31d5587/data/xrefs-slim.tsv'\n",
113 | "disease_df = pandas.read_table(url)\n",
114 | "disease_df = disease_df.query('resource == \"MSH\"').drop(columns='resource')\n",
115 | "disease_df = disease_df.rename(columns={'resource_id': 'mesh_id'})\n",
116 | "\n",
117 | "# Map MeSH IDs to MeSH names\n",
118 | "url = 'https://raw.githubusercontent.com/dhimmel/mesh/e561301360e6de2140dedeaa7c7e17ce4714eb7f/data/terms.tsv'\n",
119 | "mesh_df = pandas.read_table(url)\n",
120 | "disease_df = disease_df.merge(mesh_df)\n",
121 | "\n",
122 | "# Manually remove problematic xrefs\n",
123 | "# https://github.com/obophenotype/human-disease-ontology/issues/45\n",
124 | "disease_df = disease_df.query(\"mesh_id != 'D003327' and mesh_id != 'D017202'\")\n",
125 | "disease_df.head()"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "## Query PubMed"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 3,
138 | "metadata": {
139 | "collapsed": false,
140 | "jupyter": {
141 | "outputs_hidden": false
142 | }
143 | },
144 | "outputs": [
145 | {
146 | "name": "stdout",
147 | "output_type": "stream",
148 | "text": [
149 | "10320 articles for Hematologic Neoplasms\n",
150 | "122727 articles for Brain Neoplasms\n",
151 | "180844 articles for Lung Neoplasms\n",
152 | "60494 articles for Kidney Neoplasms\n",
153 | "57863 articles for Pancreatic Neoplasms\n",
154 | "100038 articles for Skin Neoplasms\n",
155 | "104535 articles for Bone Neoplasms\n",
156 | "27302 articles for Pharyngeal Neoplasms\n",
157 | "65991 articles for Ovarian Neoplasms\n",
158 | "226835 articles for Breast Neoplasms\n",
159 | "63189 articles for Glioma\n",
160 | "107447 articles for Uterine Neoplasms\n",
161 | "24447 articles for Adrenal Gland Neoplasms\n",
162 | "40010 articles for Esophageal Neoplasms\n",
163 | "14552 articles for Salivary Gland Neoplasms\n",
164 | "97203 articles for Prostatic Neoplasms\n",
165 | "77286 articles for Stomach Neoplasms\n",
166 | "45208 articles for Urinary Bladder Neoplasms\n",
167 | "18495 articles for Peripheral Nervous System Neoplasms\n",
168 | "40519 articles for Thyroid Neoplasms\n",
169 | "130963 articles for Liver Neoplasms\n",
170 | "60840 articles for Uterine Cervical Neoplasms\n",
171 | "4780 articles for Vaginal Neoplasms\n",
172 | "249626 articles for Head and Neck Neoplasms\n",
173 | "38987 articles for Rectal Neoplasms\n",
174 | "34076 articles for Eye Neoplasms\n",
175 | "68917 articles for Colonic Neoplasms\n",
176 | "24448 articles for Laryngeal Neoplasms\n",
177 | "283101 articles for Neoplasms, Germ Cell and Embryonal\n",
178 | "9735 articles for Thymus Neoplasms\n",
179 | "11737 articles for Myosarcoma\n",
180 | "5565 articles for Duodenal Neoplasms\n",
181 | "2617 articles for Ileal Neoplasms\n",
182 | "117808 articles for Sarcoma\n",
183 | "2355 articles for Appendiceal Neoplasms\n",
184 | "4612 articles for Penile Neoplasms\n",
185 | "4139 articles for Ureteral Neoplasms\n",
186 | "3249 articles for Tracheal Neoplasms\n",
187 | "7161 articles for Vulvar Neoplasms\n",
188 | "1940 articles for Jejunal Neoplasms\n",
189 | "12425 articles for Peritoneal Neoplasms\n",
190 | "2738 articles for Vascular Neoplasms\n",
191 | "11841 articles for Mesothelioma\n",
192 | "76390 articles for Melanoma\n",
193 | "2380 articles for Fallopian Tube Neoplasms\n",
194 | "22613 articles for Testicular Neoplasms\n",
195 | "7358 articles for Gallbladder Neoplasms\n",
196 | "20327 articles for Meningeal Neoplasms\n",
197 | "14052 articles for Bile Duct Neoplasms\n",
198 | "12274 articles for Mediastinal Neoplasms\n",
199 | "9418 articles for Spinal Cord Neoplasms\n",
200 | "7944 articles for Retroperitoneal Neoplasms\n",
201 | "31533 articles for Crohn Disease\n",
202 | "46287 articles for Multiple Sclerosis\n",
203 | "91140 articles for Diabetes Mellitus, Type 2\n",
204 | "28289 articles for Colitis, Ulcerative\n",
205 | "62862 articles for Diabetes Mellitus, Type 1\n",
206 | "95295 articles for Arthritis, Rheumatoid\n",
207 | "40786 articles for Coronary Artery Disease\n",
208 | "148894 articles for Obesity\n",
209 | "16725 articles for Celiac Disease\n",
210 | "49965 articles for Lupus Erythematosus, Systemic\n",
211 | "26855 articles for Refractive Errors\n",
212 | "7065 articles for Liver Cirrhosis, Biliary\n",
213 | "4078 articles for Vitiligo\n",
214 | "16971 articles for Macular Degeneration\n",
215 | "21070 articles for Metabolic Syndrome X\n",
216 | "108236 articles for Asthma\n",
217 | "30896 articles for Psoriasis\n",
218 | "87056 articles for Schizophrenia\n",
219 | "22222 articles for Migraine Disorders\n",
220 | "69752 articles for Alzheimer Disease\n",
221 | "14577 articles for Graves Disease\n",
222 | "49349 articles for Parkinson Disease\n",
223 | "14898 articles for Dermatitis, Atopic\n",
224 | "32534 articles for Bipolar Disorder\n",
225 | "12161 articles for Spondylitis, Ankylosing\n",
226 | "10757 articles for Polycystic Ovary Syndrome\n",
227 | "214731 articles for Hypertension\n",
228 | "17041 articles for Scleroderma, Systemic\n",
229 | "7686 articles for Behcet Syndrome\n",
230 | "4802 articles for Osteitis Deformans\n",
231 | "20395 articles for Leprosy\n",
232 | "22378 articles for Intracranial Aneurysm\n",
233 | "43355 articles for Glaucoma\n",
234 | "13589 articles for Amyotrophic Lateral Sclerosis\n",
235 | "2744 articles for Restless Legs Syndrome\n",
236 | "4773 articles for Mucocutaneous Lymph Node Syndrome\n",
237 | "24584 articles for Atherosclerosis\n",
238 | "2509 articles for Alopecia Areata\n",
239 | "45971 articles for Osteoporosis\n",
240 | "28909 articles for Hypothyroidism\n",
241 | "4960 articles for Glomerulonephritis, IGA\n",
242 | "67451 articles for Alcoholism\n",
243 | "5771 articles for Creutzfeldt-Jakob Syndrome\n",
244 | "1206 articles for Azoospermia\n",
245 | "132583 articles for Epilepsy\n",
246 | "47571 articles for Hepatitis B\n",
247 | "38605 articles for Pulmonary Disease, Chronic Obstructive\n",
248 | "14411 articles for Aortic Aneurysm, Abdominal\n",
249 | "79638 articles for Kidney Failure, Chronic\n",
250 | "45631 articles for Osteoarthritis\n",
251 | "3974 articles for Arthritis, Psoriatic\n",
252 | "8353 articles for Tobacco Use Disorder\n",
253 | "2448 articles for Glomerulonephritis, Membranous\n",
254 | "7669 articles for Diabetes, Gestational\n",
255 | "52704 articles for Malaria\n",
256 | "16500 articles for Autistic Disorder\n",
257 | "13355 articles for Cardiomyopathy, Dilated\n",
258 | "920 articles for Arthritis, Gouty\n",
259 | "17621 articles for Leiomyoma\n",
260 | "3033 articles for Cholangitis, Sclerosing\n",
261 | "3065 articles for Narcolepsy\n",
262 | "1884 articles for Intervertebral Disc Degeneration\n",
263 | "12123 articles for Cleft Lip\n",
264 | "1442 articles for Idiopathic Pulmonary Fibrosis\n",
265 | "21145 articles for Attention Deficit Disorder with Hyperactivity\n",
266 | "3636 articles for Tourette Syndrome\n",
267 | "8889 articles for Aortic Aneurysm, Thoracic\n",
268 | "83521 articles for Depressive Disorder\n",
269 | "17875 articles for Rhinitis, Allergic\n",
270 | "44312 articles for Pancreatitis\n",
271 | "16146 articles for Nephrolithiasis\n",
272 | "24223 articles for Periodontitis\n",
273 | "6418 articles for Barrett Esophagus\n",
274 | "782 articles for Fuchs' Endothelial Dystrophy\n",
275 | "4768 articles for Otosclerosis\n",
276 | "2277 articles for Conduct Disorder\n",
277 | "4440 articles for Glomerulosclerosis, Focal Segmental\n",
278 | "37451 articles for Dental Caries\n",
279 | "138233 articles for Anemia\n",
280 | "6096 articles for Panic Disorder\n",
281 | "72916 articles for Acquired Immunodeficiency Syndrome\n"
282 | ]
283 | }
284 | ],
285 | "source": [
286 | "rows_out = list()\n",
287 | "\n",
288 | "for i, row in disease_df.iterrows():\n",
289 | " term_query = '{disease}[MeSH Terms]'.format(disease = row.mesh_name.lower())\n",
290 | " payload = {'db': 'pubmed', 'term': term_query}\n",
291 | " pmids = eutility.esearch_query(payload, retmax = 10000)\n",
292 | " row['term_query'] = term_query\n",
293 | " row['n_articles'] = len(pmids)\n",
294 | " row['pubmed_ids'] = '|'.join(pmids)\n",
295 | " rows_out.append(row)\n",
296 | " print('{} articles for {}'.format(len(pmids), row.mesh_name))\n",
297 | "\n",
298 | "disease_pmids_df = pandas.DataFrame(rows_out)"
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "execution_count": 4,
304 | "metadata": {
305 | "collapsed": true,
306 | "jupyter": {
307 | "outputs_hidden": true
308 | }
309 | },
310 | "outputs": [],
311 | "source": [
312 | "with gzip.open('data/disease-pmids-topic.tsv.gz', 'wt') as write_file:\n",
313 | " disease_pmids_df.to_csv(write_file, sep='\\t', index=False)"
314 | ]
315 | },
316 | {
317 | "cell_type": "markdown",
318 | "metadata": {},
319 | "source": [
320 | "## Analyze data"
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": 5,
326 | "metadata": {
327 | "collapsed": true,
328 | "jupyter": {
329 | "outputs_hidden": true
330 | }
331 | },
332 | "outputs": [],
333 | "source": [
334 | "disease_df, disease_to_pmids = cooccurrence.read_pmids_tsv('data/disease-pmids-topic.tsv.gz', key='doid_code')"
335 | ]
336 | },
337 | {
338 | "cell_type": "code",
339 | "execution_count": 6,
340 | "metadata": {
341 | "collapsed": false,
342 | "jupyter": {
343 | "outputs_hidden": false
344 | }
345 | },
346 | "outputs": [
347 | {
348 | "name": "stdout",
349 | "output_type": "stream",
350 | "text": [
351 | "Total articles containing a doid_code_0: 4161769\n",
352 | "Total articles containing a doid_code_1: 4161769\n",
353 | "Total articles containing both a doid_code_0 and doid_code_1: 4161769\n",
354 | "\n",
355 | "After removing terms without any cooccurences:\n",
356 | "+ 133 doid_code_0s remain\n",
357 | "+ 133 doid_code_1s remain\n",
358 | "\n",
359 | "Cooccurrence scores calculated for 17689 doid_code_0 -- doid_code_1 pairs\n"
360 | ]
361 | }
362 | ],
363 | "source": [
364 | "cooc_df = cooccurrence.score_pmid_cooccurrence(disease_to_pmids, disease_to_pmids, 'doid_code_0', 'doid_code_1')"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": 7,
370 | "metadata": {
371 | "collapsed": false,
372 | "jupyter": {
373 | "outputs_hidden": false
374 | }
375 | },
376 | "outputs": [
377 | {
378 | "data": {
443 | "text/plain": [
444 | " doid_code doid_name mesh_id mesh_name \\\n",
445 | "0 DOID:2531 hematologic cancer D019337 Hematologic Neoplasms \n",
446 | "1 DOID:1319 brain cancer D001932 Brain Neoplasms \n",
447 | "2 DOID:1324 lung cancer D008175 Lung Neoplasms \n",
448 | "3 DOID:263 kidney cancer D007680 Kidney Neoplasms \n",
449 | "4 DOID:1793 pancreatic cancer D010190 Pancreatic Neoplasms \n",
450 | "\n",
451 | " term_query n_articles \n",
452 | "0 hematologic neoplasms[MeSH Terms] 10320 \n",
453 | "1 brain neoplasms[MeSH Terms] 122727 \n",
454 | "2 lung neoplasms[MeSH Terms] 180844 \n",
455 | "3 kidney neoplasms[MeSH Terms] 60494 \n",
456 | "4 pancreatic neoplasms[MeSH Terms] 57863 "
457 | ]
458 | },
459 | "execution_count": 7,
460 | "metadata": {},
461 | "output_type": "execute_result"
462 | }
463 | ],
464 | "source": [
465 | "disease_df.head()"
466 | ]
467 | },
468 | {
469 | "cell_type": "code",
470 | "execution_count": 8,
471 | "metadata": {
472 | "collapsed": false,
473 | "jupyter": {
474 | "outputs_hidden": false
475 | }
476 | },
477 | "outputs": [
478 | {
479 | "data": {
550 | "text/plain": [
551 | " doid_code_0 doid_code_1 cooccurrence expected enrichment odds_ratio \\\n",
552 | "0 DOID:11615 DOID:11615 4612 5.110938 902.378361 inf \n",
553 | "1 DOID:11615 DOID:8577 1 31.349378 0.031899 0.031654 \n",
554 | "2 DOID:11615 DOID:5612 2 10.436864 0.191628 0.191106 \n",
555 | "3 DOID:11615 DOID:14330 0 54.687703 0.000000 0.000000 \n",
556 | "4 DOID:11615 DOID:0050425 0 3.040853 0.000000 0.000000 \n",
557 | "\n",
558 | " p_fisher \n",
559 | "0 0.000000 \n",
560 | "1 1.000000 \n",
561 | "2 0.999669 \n",
562 | "3 1.000000 \n",
563 | "4 1.000000 "
564 | ]
565 | },
566 | "execution_count": 8,
567 | "metadata": {},
568 | "output_type": "execute_result"
569 | }
570 | ],
571 | "source": [
572 | "cooc_df.head()"
573 | ]
574 | },
575 | {
576 | "cell_type": "code",
577 | "execution_count": 9,
578 | "metadata": {
579 | "collapsed": false,
580 | "jupyter": {
581 | "outputs_hidden": false
582 | }
583 | },
584 | "outputs": [],
585 | "source": [
586 | "cooc_df = cooc_df[cooc_df['doid_code_0'] != cooc_df['doid_code_1']]\n",
587 | "doid_name_df = disease_df[['doid_code', 'doid_name']].drop_duplicates()\n",
588 | "cooc_df = doid_name_df.rename(columns={'doid_code': 'doid_code_1', 'doid_name': 'doid_name_1'}).merge(cooc_df)\n",
589 | "cooc_df = doid_name_df.rename(columns={'doid_code': 'doid_code_0', 'doid_name': 'doid_name_0'}).merge(cooc_df)\n",
590 | "cooc_df = cooc_df.sort_values(by=['doid_name_0', 'p_fisher'])"
591 | ]
592 | },
593 | {
594 | "cell_type": "code",
595 | "execution_count": 10,
596 | "metadata": {
597 | "collapsed": false,
598 | "jupyter": {
599 | "outputs_hidden": false
600 | }
601 | },
602 | "outputs": [
603 | {
604 | "data": {
687 | "text/plain": [
688 | " doid_code_0 doid_name_0 doid_code_1 \\\n",
689 | "9444 DOID:10652 Alzheimer's disease DOID:14330 \n",
690 | "9465 DOID:10652 Alzheimer's disease DOID:11949 \n",
691 | "9456 DOID:10652 Alzheimer's disease DOID:332 \n",
692 | "9496 DOID:10652 Alzheimer's disease DOID:11555 \n",
693 | "9490 DOID:10652 Alzheimer's disease DOID:1595 \n",
694 | "\n",
695 | " doid_name_1 cooccurrence expected enrichment \\\n",
696 | "9444 Parkinson's disease 2760 827.098152 3.336968 \n",
697 | "9465 Creutzfeldt-Jakob disease 332 96.723002 3.432482 \n",
698 | "9456 amyotrophic lateral sclerosis 451 227.754094 1.980206 \n",
699 | "9496 Fuchs' endothelial dystrophy 1 13.106461 0.076298 \n",
700 | "9490 endogenous depression 1221 1399.827043 0.872251 \n",
701 | "\n",
702 | " odds_ratio p_fisher \n",
703 | "9444 3.577398 0.000000e+00 \n",
704 | "9465 3.593306 3.377672e-80 \n",
705 | "9456 2.020452 5.524978e-40 \n",
706 | "9496 0.075102 9.999982e-01 \n",
707 | "9490 0.868045 9.999997e-01 "
708 | ]
709 | },
710 | "execution_count": 10,
711 | "metadata": {},
712 | "output_type": "execute_result"
713 | }
714 | ],
715 | "source": [
716 | "cooc_df.head()"
717 | ]
718 | },
719 | {
720 | "cell_type": "code",
721 | "execution_count": 11,
722 | "metadata": {
723 | "collapsed": false,
724 | "jupyter": {
725 | "outputs_hidden": false
726 | }
727 | },
728 | "outputs": [
729 | {
730 | "data": {
731 | "text/plain": [
732 | "17556"
733 | ]
734 | },
735 | "execution_count": 11,
736 | "metadata": {},
737 | "output_type": "execute_result"
738 | }
739 | ],
740 | "source": [
741 | "len(cooc_df)"
742 | ]
743 | },
744 | {
745 | "cell_type": "code",
746 | "execution_count": 12,
747 | "metadata": {
748 | "collapsed": false,
749 | "jupyter": {
750 | "outputs_hidden": false
751 | }
752 | },
753 | "outputs": [
754 | {
755 | "data": {
756 | "text/plain": [
757 | "1086"
758 | ]
759 | },
760 | "execution_count": 12,
761 | "metadata": {},
762 | "output_type": "execute_result"
763 | }
764 | ],
765 | "source": [
766 | "len(cooc_df[cooc_df.p_fisher <= 0.005])"
767 | ]
768 | },
769 | {
770 | "cell_type": "code",
771 | "execution_count": 13,
772 | "metadata": {
773 | "collapsed": true,
774 | "jupyter": {
775 | "outputs_hidden": true
776 | }
777 | },
778 | "outputs": [],
779 | "source": [
780 | "cooc_df.to_csv('data/disease-disease-cooccurrence.tsv', index=False, sep='\\t')"
781 | ]
782 | }
783 | ],
784 | "metadata": {
785 | "kernelspec": {
786 | "display_name": "Python 3",
787 | "language": "python",
788 | "name": "python3"
789 | },
790 | "language_info": {
791 | "codemirror_mode": {
792 | "name": "ipython",
793 | "version": 3
794 | },
795 | "file_extension": ".py",
796 | "mimetype": "text/x-python",
797 | "name": "python",
798 | "nbconvert_exporter": "python",
799 | "pygments_lexer": "ipython3",
800 | "version": "3.9.2"
801 | }
802 | },
803 | "nbformat": 4,
804 | "nbformat_minor": 4
805 | }
806 |
--------------------------------------------------------------------------------
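The `expected`, `enrichment`, and `odds_ratio` columns in the `diseases.ipynb` output above can be reproduced from article counts alone. The sketch below is an inference from the printed numbers, not a copy of `cooccurrence.py`: it assumes the expected count is the product of the two term counts divided by the total number of articles, and that the odds ratio comes from the usual 2x2 contingency table. `cooccurrence_stats` is a hypothetical name. The inputs are the Alzheimer's disease and Parkinson's disease counts (69752 and 49349 articles, 2760 shared) and the total of 4161769 articles reported by the notebook.

```python
def cooccurrence_stats(n_both, n_0, n_1, n_total):
    """Return (expected, enrichment, odds_ratio) for one pair of terms.

    n_both: articles annotated with both terms
    n_0, n_1: articles annotated with each term
    n_total: articles in the universe being considered
    """
    # Count expected if the two terms were assigned independently.
    expected = n_0 * n_1 / n_total
    enrichment = n_both / expected

    # Remaining cells of the 2x2 contingency table.
    only_0 = n_0 - n_both
    only_1 = n_1 - n_both
    neither = n_total - n_0 - n_1 + n_both

    denominator = only_0 * only_1
    odds_ratio = float("inf") if denominator == 0 else n_both * neither / denominator
    return expected, enrichment, odds_ratio


# Alzheimer's disease vs. Parkinson's disease, counts from the notebook output.
expected, enrichment, odds_ratio = cooccurrence_stats(2760, 69752, 49349, 4161769)
print(round(expected, 3), round(enrichment, 3), round(odds_ratio, 3))

# A term paired with itself (penile cancer, 4612 articles) has no discordant
# articles, so the odds ratio is infinite, as in the first row of cooc_df.
self_expected, self_enrichment, self_odds_ratio = cooccurrence_stats(4612, 4612, 4612, 4161769)
```

The `p_fisher` column is not reproduced here. Its values near 1 for depleted pairs are consistent with a one-sided Fisher's exact test for enrichment, but that is also an assumption.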
/download-topics.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "59a31d15-b048-4518-a56a-394d456d57a6",
6 | "metadata": {},
7 | "source": [
8 | "# Download MEDLINE topics for all MeSH Topical Descriptors and SCR Diseases"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "id": "0fd925e3-aba3-434b-92b1-b268f6a7799b",
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "import datetime\n",
19 | "import gzip\n",
20 | "import pathlib\n",
21 | "\n",
22 | "import tenacity\n",
23 | "import jsonlines\n",
24 | "import tqdm\n",
25 | "import pandas as pd\n",
26 | "from pubmedpy.eutilities import esearch_query\n",
27 | "from nxontology import NXOntology"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 2,
33 | "id": "ea711065-ce1f-4aac-b6c6-abd0a504d80c",
34 | "metadata": {},
35 | "outputs": [],
36 | "source": [
37 | "@tenacity.retry(wait=tenacity.wait_exponential(min=2, max=2**10))\n",
38 | "def query_topic(mesh_term: str, scr: bool = False) -> dict:\n",
39 | " \"\"\"\n",
40 | " mesh_term: the name/label of a MeSH term.\n",
41 | " scr: whether the MeSH term is a Supplementary Concept Record (SCR) rather than a descriptor.\n",
42 | " See https://github.com/hetio/medline/issues/4.\n",
43 | " \"\"\"\n",
44 | " result = {}\n",
45 | " # https://pubmed.ncbi.nlm.nih.gov/help/#pubmed-format\n",
46 | " term_query = f'\"{mesh_term}\" [{\"Supplementary Concept\" if scr else \"MeSH Terms\"}:noexp]'\n",
47 | " result[\"pubmed_search\"] = term_query\n",
48 | " payload = {'db': 'pubmed', 'term': term_query}\n",
49 | " result[\"timestamp\"] = datetime.datetime.utcnow().isoformat(timespec=\"seconds\")\n",
50 | " result[\"pubmed_ids\"] = sorted(esearch_query(payload, retmax = 5000, tqdm=None))\n",
51 | " return result"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 3,
57 | "id": "df08612e-df1c-411b-9e5f-42cba0192574",
58 | "metadata": {
59 | "tags": []
60 | },
61 | "outputs": [
62 | {
63 | "data": {
64 | "text/plain": [
65 | "{'pubmed_search': '\"Tabatznik syndrome\" [Supplementary Concept:noexp]',\n",
66 | " 'timestamp': '2021-04-12T20:05:16',\n",
67 | " 'pubmed_ids': []}"
68 | ]
69 | },
70 | "execution_count": 3,
71 | "metadata": {},
72 | "output_type": "execute_result"
73 | }
74 | ],
75 | "source": [
76 | "# example query\n",
77 | "query_topic(\"Tabatznik syndrome\", scr=True)"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "id": "f50387db-d61f-417d-9316-664f2c42c510",
83 | "metadata": {},
84 | "source": [
85 | "## Load MeSH Ontology"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": 4,
91 | "id": "967432ee-1e49-4236-a076-ff922bf3a071",
92 | "metadata": {},
93 | "outputs": [
94 | {
95 | "data": {
96 | "text/plain": [
97 | "300093"
98 | ]
99 | },
100 | "execution_count": 4,
101 | "metadata": {},
102 | "output_type": "execute_result"
103 | }
104 | ],
105 | "source": [
106 | "# read the MeSH ontology\n",
107 | "nxo = NXOntology.read_node_link_json(\"data/mesh-nxo-node-link.json.gz\")\n",
108 | "nxo.n_nodes"
109 | ]
110 | },
111 | {
112 | "cell_type": "code",
113 | "execution_count": 5,
114 | "id": "ad017896-60a7-4dfc-ae78-8e89c339c68d",
115 | "metadata": {},
116 | "outputs": [
117 | {
118 | "data": {
166 | "text/plain": [
167 | " mesh_id mesh_class mesh_uri mesh_label \\\n",
168 | "0 D005260 CheckTag http://id.nlm.nih.gov/mesh/2020/D005260 Female \n",
169 | "1 D008297 CheckTag http://id.nlm.nih.gov/mesh/2020/D008297 Male \n",
170 | "\n",
171 | " tree_numbers \n",
172 | "0 NaN \n",
173 | "1 NaN "
174 | ]
175 | },
176 | "execution_count": 5,
177 | "metadata": {},
178 | "output_type": "execute_result"
179 | }
180 | ],
181 | "source": [
182 | "nodes_data = [data for node, data in nxo.graph.nodes(data=True)]\n",
183 | "nodes_data.sort(key=lambda x: (x[\"mesh_class\"], x[\"mesh_id\"]))\n",
184 | "term_df = pd.DataFrame(nodes_data)\n",
185 | "term_df.head(2)"
186 | ]
187 | },
188 | {
189 | "cell_type": "code",
190 | "execution_count": 6,
191 | "id": "755c6040-83e4-4752-aebb-ff0de812eef9",
192 | "metadata": {},
193 | "outputs": [
194 | {
195 | "data": {
196 | "text/plain": [
197 | "SCR_Chemical 243740\n",
198 | "TopicalDescriptor 29054\n",
199 | "SCR_Organism 19019\n",
200 | "SCR_Disease 6479\n",
201 | "SCR_Protocol 1215\n",
202 | "GeographicalDescriptor 397\n",
203 | "PublicationType 187\n",
204 | "CheckTag 2\n",
205 | "Name: mesh_class, dtype: int64"
206 | ]
207 | },
208 | "execution_count": 6,
209 | "metadata": {},
210 | "output_type": "execute_result"
211 | }
212 | ],
213 | "source": [
214 | "term_df.mesh_class.value_counts()"
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": 7,
220 | "id": "d2026b7b-32e8-437a-8a13-0902cc7f13f9",
221 | "metadata": {},
222 | "outputs": [
223 | {
224 | "data": {
225 | "text/plain": [
226 | "35533"
227 | ]
228 | },
229 | "execution_count": 7,
230 | "metadata": {},
231 | "output_type": "execute_result"
232 | }
233 | ],
234 | "source": [
235 | "# filter to classes of interest\n",
236 | "keep_classes = {\"TopicalDescriptor\", \"SCR_Disease\"}\n",
237 | "nodes_data = [info for info in nodes_data if info[\"mesh_class\"] in keep_classes]\n",
238 | "mesh_ids = [x[\"mesh_id\"] for x in nodes_data]\n",
239 | "len(nodes_data)"
240 | ]
241 | },
242 | {
243 | "cell_type": "code",
244 | "execution_count": 8,
245 | "id": "bf487760-6978-4509-bdea-9bbd719d6d08",
246 | "metadata": {},
247 | "outputs": [
248 | {
249 | "data": {
250 | "text/plain": [
251 | "{'mesh_id': 'C000591739',\n",
252 | " 'mesh_class': 'SCR_Disease',\n",
253 | " 'mesh_uri': 'http://id.nlm.nih.gov/mesh/2020/C000591739',\n",
254 | " 'mesh_label': 'Familial gynecomastia, due to increased aromatase activity'}"
255 | ]
256 | },
257 | "execution_count": 8,
258 | "metadata": {},
259 | "output_type": "execute_result"
260 | }
261 | ],
262 | "source": [
263 | "nodes_data[0]"
264 | ]
265 | },
266 | {
267 | "cell_type": "markdown",
268 | "id": "c5c5f986-b705-4ffa-a362-7690458f3659",
269 | "metadata": {},
270 | "source": [
271 | "## Perform queries"
272 | ]
273 | },
274 | {
275 | "cell_type": "code",
276 | "execution_count": 9,
277 | "id": "d5eb4727-d219-4635-88f7-659be5e66746",
278 | "metadata": {},
279 | "outputs": [
280 | {
281 | "name": "stdout",
282 | "output_type": "stream",
283 | "text": [
284 | "35,533 total mesh_ids: 0 already queried, 35,533 new\n"
285 | ]
286 | }
287 | ],
288 | "source": [
289 | "# read already queried MeSH terms\n",
290 | "path = pathlib.Path('data/mesh-term-topics-noexp.jsonl.gz')\n",
291 | "lines = jsonlines.Reader(gzip.open(path, \"rt\")) if path.exists() else []\n",
292 | "existing = {row['mesh_id'] for row in lines}\n",
293 | "new = sorted(set(mesh_ids) - existing)\n",
294 | "print(f\"{len(mesh_ids):,} total mesh_ids: {len(existing):,} already queried, {len(new):,} new\")"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": 10,
300 | "id": "f7cf722e-5239-49ce-9c1d-8090c497ac11",
301 | "metadata": {},
302 | "outputs": [
303 | {
304 | "name": "stderr",
305 | "output_type": "stream",
306 | "text": [
307 | "100%|██████████| 35533/35533 [15:30:42<00:00, 1.57s/it] \n"
308 | ]
309 | }
310 | ],
311 | "source": [
312 | "# query new MeSH terms and append to JSON Lines file\n",
313 | "write_file = gzip.GzipFile(filename=path, mode=\"ab\", mtime=0)\n",
314 | "with write_file:\n",
315 | "    with jsonlines.Writer(write_file) as writer:\n",
316 | "        for mesh_id in tqdm.tqdm(new):\n",
317 | "            result = nxo.graph.nodes[mesh_id].copy()\n",
318 | "            result.update(query_topic(result[\"mesh_label\"], result[\"mesh_class\"] != \"TopicalDescriptor\"))\n",
319 | "            writer.write(result)"
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": 11,
325 | "id": "14ce8898-f1be-4e17-b78b-128bfe08676a",
326 | "metadata": {},
327 | "outputs": [
328 | {
329 | "data": {
330 | "text/plain": [
331 | "35533"
332 | ]
333 | },
334 | "execution_count": 11,
335 | "metadata": {},
336 | "output_type": "execute_result"
337 | }
338 | ],
339 | "source": [
340 | "# Read the JSON Lines file\n",
341 | "with jsonlines.Reader(gzip.open(path, \"rt\")) as reader:\n",
342 | "    lines = list(reader)\n",
343 | "len(lines)"
344 | ]
345 | },
346 | {
347 | "cell_type": "code",
348 | "execution_count": 12,
349 | "id": "02a69105-a4b0-48cb-a834-465e93e0156c",
350 | "metadata": {},
351 | "outputs": [
352 | {
353 | "data": {
354 | "text/plain": [
355 | "['mesh_id',\n",
356 | " 'mesh_class',\n",
357 | " 'mesh_uri',\n",
358 | " 'mesh_label',\n",
359 | " 'pubmed_search',\n",
360 | " 'timestamp',\n",
361 | " 'pubmed_ids']"
362 | ]
363 | },
364 | "execution_count": 12,
365 | "metadata": {},
366 | "output_type": "execute_result"
367 | }
368 | ],
369 | "source": [
370 | "# Show keys for a single line\n",
371 | "list(lines[0])"
372 | ]
373 | }
374 | ],
375 | "metadata": {
376 | "kernelspec": {
377 | "display_name": "Python 3",
378 | "language": "python",
379 | "name": "python3"
380 | },
381 | "language_info": {
382 | "codemirror_mode": {
383 | "name": "ipython",
384 | "version": 3
385 | },
386 | "file_extension": ".py",
387 | "mimetype": "text/x-python",
388 | "name": "python",
389 | "nbconvert_exporter": "python",
390 | "pygments_lexer": "ipython3",
391 | "version": "3.9.2"
392 | }
393 | },
394 | "nbformat": 4,
395 | "nbformat_minor": 5
396 | }
397 |
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
1 | name: medline
2 |
3 | channels:
4 | - conda-forge
5 |
6 | dependencies:
7 | - ipywidgets=7.6.3
8 | - jsonlines=2.0.0
9 | - jupyterlab=3.0.13
10 | - lxml=4.6.3
11 | - networkx=2.5.1
12 | - numpy=1.20.2
13 | - pandas=1.2.3
14 | - python=3.9.2
15 | - requests=2.25.1
16 | - scipy=1.6.2
17 | - tenacity=7.0.0
18 | - tqdm=4.60.0
19 | - pip
20 | - pip:
21 |   - git+https://github.com/dhimmel/pubmedpy.git@9d716768f5ab798ec448154588e4fd99afd7584a
22 |   - nxontology==0.1.4
23 |
24 |
--------------------------------------------------------------------------------
/eutility.py:
--------------------------------------------------------------------------------
1 | import time
2 |
3 | import xml.etree.ElementTree as ET
4 |
5 | import requests
6 |
7 | def esearch_query(payload, retmax=100, sleep=2):
8 |     """
9 |     Query the esearch E-utility.
10 |     NOTE: use `pubmedpy.eutilities.esearch_query` instead.
11 |     This function might be deleted in the future.
12 |     """
13 |     url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
14 |     payload['retmax'] = retmax  # page size; note that this mutates the caller's dict
15 |     payload['retstart'] = 0
16 |     ids = list()
17 |     count = 1  # placeholder so the loop runs once; replaced by the reported Count
18 |     while payload['retstart'] < count:
19 |         response = requests.get(url, params=payload)
20 |         xml = ET.fromstring(response.content)
21 |         count = int(xml.findtext('Count'))
22 |         ids += [xml_id.text for xml_id in xml.findall('IdList/Id')]
23 |         payload['retstart'] += retmax
24 |         time.sleep(sleep)  # pause between requests to limit load on NCBI
25 |     return ids
26 |
--------------------------------------------------------------------------------
/symptoms.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Compute symptom-disease cooccurrence for Hetionet"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": false,
15 | "jupyter": {
16 | "outputs_hidden": false
17 | }
18 | },
19 | "outputs": [],
20 | "source": [
21 | "import io\n",
22 | "import gzip\n",
23 | "\n",
24 | "import pandas\n",
25 | "import requests\n",
26 | "import networkx\n",
27 | "\n",
28 | "import eutility\n",
29 | "import cooccurrence"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 2,
35 | "metadata": {
36 | "collapsed": false,
37 | "jupyter": {
38 | "outputs_hidden": false
39 | }
40 | },
41 | "outputs": [
42 | {
43 | "data": {
44 | "text/html": [
45 | "<div>\n",
46 | "<table border=\"1\" class=\"dataframe\">\n",
47 | "  <thead>\n",
48 | "    <tr style=\"text-align: right;\">\n",
49 | "      <th></th>\n",
50 | "      <th>doid_code</th>\n",
51 | "      <th>doid_name</th>\n",
52 | "      <th>mesh_id</th>\n",
53 | "      <th>mesh_name</th>\n",
54 | "    </tr>\n",
55 | "  </thead>\n",
56 | "  <tbody>\n",
57 | "    <tr>\n",
58 | "      <th>0</th>\n",
59 | "      <td>DOID:2531</td>\n",
60 | "      <td>hematologic cancer</td>\n",
61 | "      <td>D019337</td>\n",
62 | "      <td>Hematologic Neoplasms</td>\n",
63 | "    </tr>\n",
64 | "    <tr>\n",
65 | "      <th>1</th>\n",
66 | "      <td>DOID:1319</td>\n",
67 | "      <td>brain cancer</td>\n",
68 | "      <td>D001932</td>\n",
69 | "      <td>Brain Neoplasms</td>\n",
70 | "    </tr>\n",
71 | "    <tr>\n",
72 | "      <th>2</th>\n",
73 | "      <td>DOID:1324</td>\n",
74 | "      <td>lung cancer</td>\n",
75 | "      <td>D008175</td>\n",
76 | "      <td>Lung Neoplasms</td>\n",
77 | "    </tr>\n",
78 | "    <tr>\n",
79 | "      <th>3</th>\n",
80 | "      <td>DOID:263</td>\n",
81 | "      <td>kidney cancer</td>\n",
82 | "      <td>D007680</td>\n",
83 | "      <td>Kidney Neoplasms</td>\n",
84 | "    </tr>\n",
85 | "    <tr>\n",
86 | "      <th>4</th>\n",
87 | "      <td>DOID:1793</td>\n",
88 | "      <td>pancreatic cancer</td>\n",
89 | "      <td>D010190</td>\n",
90 | "      <td>Pancreatic Neoplasms</td>\n",
91 | "    </tr>\n",
92 | "  </tbody>\n",
93 | "</table>\n",
94 | "</div>"
95 | ],
96 | "text/plain": [
97 | " doid_code doid_name mesh_id mesh_name\n",
98 | "0 DOID:2531 hematologic cancer D019337 Hematologic Neoplasms\n",
99 | "1 DOID:1319 brain cancer D001932 Brain Neoplasms\n",
100 | "2 DOID:1324 lung cancer D008175 Lung Neoplasms\n",
101 | "3 DOID:263 kidney cancer D007680 Kidney Neoplasms\n",
102 | "4 DOID:1793 pancreatic cancer D010190 Pancreatic Neoplasms"
103 | ]
104 | },
105 | "execution_count": 2,
106 | "metadata": {},
107 | "output_type": "execute_result"
108 | }
109 | ],
110 | "source": [
111 | "# Read mappings for DO Slim terms\n",
112 | "url = 'https://raw.githubusercontent.com/dhimmel/disease-ontology/72614ade9f1cc5a5317b8f6836e1e464b31d5587/data/xrefs-slim.tsv'\n",
113 | "disease_df = pandas.read_table(url)\n",
114 | "disease_df = disease_df.query('resource == \"MSH\"').drop(columns='resource')\n",
115 | "disease_df = disease_df.rename(columns={'resource_id': 'mesh_id'})\n",
116 | "\n",
117 | "# Read MeSH terms to MeSH names\n",
118 | "url = 'https://raw.githubusercontent.com/dhimmel/mesh/e561301360e6de2140dedeaa7c7e17ce4714eb7f/data/terms.tsv'\n",
119 | "mesh_df = pandas.read_table(url)\n",
120 | "disease_df = disease_df.merge(mesh_df)\n",
121 | "\n",
122 | "# Manually remove problematic xrefs\n",
123 | "# https://github.com/obophenotype/human-disease-ontology/issues/45\n",
124 | "disease_df = disease_df.query(\"mesh_id != 'D003327' and mesh_id != 'D017202'\")\n",
125 | "disease_df.head()"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "# Diseases"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 3,
138 | "metadata": {
139 | "collapsed": false,
140 | "jupyter": {
141 | "outputs_hidden": false
142 | },
143 | "scrolled": true
144 | },
145 | "outputs": [
146 | {
147 | "name": "stdout",
148 | "output_type": "stream",
149 | "text": [
150 | "7382 articles for Hematologic Neoplasms\n",
151 | "99586 articles for Brain Neoplasms\n",
152 | "139299 articles for Lung Neoplasms\n",
153 | "49515 articles for Kidney Neoplasms\n",
154 | "46298 articles for Pancreatic Neoplasms\n",
155 | "85654 articles for Skin Neoplasms\n",
156 | "83658 articles for Bone Neoplasms\n",
157 | "22125 articles for Pharyngeal Neoplasms\n",
158 | "53989 articles for Ovarian Neoplasms\n",
159 | "188908 articles for Breast Neoplasms\n",
160 | "49539 articles for Glioma\n",
161 | "89055 articles for Uterine Neoplasms\n",
162 | "18514 articles for Adrenal Gland Neoplasms\n",
163 | "33421 articles for Esophageal Neoplasms\n",
164 | "12158 articles for Salivary Gland Neoplasms\n",
165 | "83257 articles for Prostatic Neoplasms\n",
166 | "64512 articles for Stomach Neoplasms\n",
167 | "37569 articles for Urinary Bladder Neoplasms\n",
168 | "14765 articles for Peripheral Nervous System Neoplasms\n",
169 | "33286 articles for Thyroid Neoplasms\n",
170 | "97650 articles for Liver Neoplasms\n",
171 | "50220 articles for Uterine Cervical Neoplasms\n",
172 | "3507 articles for Vaginal Neoplasms\n",
173 | "210249 articles for Head and Neck Neoplasms\n",
174 | "32809 articles for Rectal Neoplasms\n",
175 | "28761 articles for Eye Neoplasms\n",
176 | "50799 articles for Colonic Neoplasms\n",
177 | "19451 articles for Laryngeal Neoplasms\n",
178 | "225331 articles for Neoplasms, Germ Cell and Embryonal\n",
179 | "7330 articles for Thymus Neoplasms\n",
180 | "8568 articles for Myosarcoma\n",
181 | "4375 articles for Duodenal Neoplasms\n",
182 | "2161 articles for Ileal Neoplasms\n",
183 | "90305 articles for Sarcoma\n",
184 | "2002 articles for Appendiceal Neoplasms\n",
185 | "3898 articles for Penile Neoplasms\n",
186 | "3458 articles for Ureteral Neoplasms\n",
187 | "2526 articles for Tracheal Neoplasms\n",
188 | "5866 articles for Vulvar Neoplasms\n",
189 | "1649 articles for Jejunal Neoplasms\n",
190 | "9852 articles for Peritoneal Neoplasms\n",
191 | "2469 articles for Vascular Neoplasms\n",
192 | "9785 articles for Mesothelioma\n",
193 | "60090 articles for Melanoma\n",
194 | "2040 articles for Fallopian Tube Neoplasms\n",
195 | "18463 articles for Testicular Neoplasms\n",
196 | "5919 articles for Gallbladder Neoplasms\n",
197 | "15236 articles for Meningeal Neoplasms\n",
198 | "11129 articles for Bile Duct Neoplasms\n",
199 | "9591 articles for Mediastinal Neoplasms\n",
200 | "7736 articles for Spinal Cord Neoplasms\n",
201 | "6254 articles for Retroperitoneal Neoplasms\n",
202 | "24975 articles for Crohn Disease\n",
203 | "39550 articles for Multiple Sclerosis\n",
204 | "72794 articles for Diabetes Mellitus, Type 2\n",
205 | "22055 articles for Colitis, Ulcerative\n",
206 | "50883 articles for Diabetes Mellitus, Type 1\n",
207 | "76622 articles for Arthritis, Rheumatoid\n",
208 | "33214 articles for Coronary Artery Disease\n",
209 | "105366 articles for Obesity\n",
210 | "13742 articles for Celiac Disease\n",
211 | "39741 articles for Lupus Erythematosus, Systemic\n",
212 | "19765 articles for Refractive Errors\n",
213 | "5304 articles for Liver Cirrhosis, Biliary\n",
214 | "3492 articles for Vitiligo\n",
215 | "13479 articles for Macular Degeneration\n",
216 | "16426 articles for Metabolic Syndrome X\n",
217 | "88631 articles for Asthma\n",
218 | "25313 articles for Psoriasis\n",
219 | "69283 articles for Schizophrenia\n",
220 | "18093 articles for Migraine Disorders\n",
221 | "55360 articles for Alzheimer Disease\n",
222 | "10965 articles for Graves Disease\n",
223 | "40397 articles for Parkinson Disease\n",
224 | "11782 articles for Dermatitis, Atopic\n",
225 | "24524 articles for Bipolar Disorder\n",
226 | "9401 articles for Spondylitis, Ankylosing\n",
227 | "8888 articles for Polycystic Ovary Syndrome\n",
228 | "155557 articles for Hypertension\n",
229 | "14044 articles for Scleroderma, Systemic\n",
230 | "6764 articles for Behcet Syndrome\n",
231 | "3814 articles for Osteitis Deformans\n",
232 | "18561 articles for Leprosy\n",
233 | "18785 articles for Intracranial Aneurysm\n",
234 | "35366 articles for Glaucoma\n",
235 | "11500 articles for Amyotrophic Lateral Sclerosis\n",
236 | "2296 articles for Restless Legs Syndrome\n",
237 | "4319 articles for Mucocutaneous Lymph Node Syndrome\n",
238 | "18009 articles for Atherosclerosis\n",
239 | "2125 articles for Alopecia Areata\n",
240 | "32547 articles for Osteoporosis\n",
241 | "20300 articles for Hypothyroidism\n",
242 | "4202 articles for Glomerulonephritis, IGA\n",
243 | "49443 articles for Alcoholism\n",
244 | "4464 articles for Creutzfeldt-Jakob Syndrome\n",
245 | "864 articles for Azoospermia\n",
246 | "102949 articles for Epilepsy\n",
247 | "36716 articles for Hepatitis B\n",
248 | "30665 articles for Pulmonary Disease, Chronic Obstructive\n",
249 | "12886 articles for Aortic Aneurysm, Abdominal\n",
250 | "54875 articles for Kidney Failure, Chronic\n",
251 | "33398 articles for Osteoarthritis\n",
252 | "2999 articles for Arthritis, Psoriatic\n",
253 | "6354 articles for Tobacco Use Disorder\n",
254 | "1918 articles for Glomerulonephritis, Membranous\n",
255 | "6056 articles for Diabetes, Gestational\n",
256 | "43489 articles for Malaria\n",
257 | "13959 articles for Autistic Disorder\n",
258 | "10108 articles for Cardiomyopathy, Dilated\n",
259 | "724 articles for Arthritis, Gouty\n",
260 | "14343 articles for Leiomyoma\n",
261 | "2309 articles for Cholangitis, Sclerosing\n",
262 | "2374 articles for Narcolepsy\n",
263 | "1561 articles for Intervertebral Disc Degeneration\n",
264 | "9599 articles for Cleft Lip\n",
265 | "1277 articles for Idiopathic Pulmonary Fibrosis\n",
266 | "16912 articles for Attention Deficit Disorder with Hyperactivity\n",
267 | "3143 articles for Tourette Syndrome\n",
268 | "7893 articles for Aortic Aneurysm, Thoracic\n",
269 | "63783 articles for Depressive Disorder\n",
270 | "13894 articles for Rhinitis, Allergic\n",
271 | "35263 articles for Pancreatitis\n",
272 | "12217 articles for Nephrolithiasis\n",
273 | "16409 articles for Periodontitis\n",
274 | "5256 articles for Barrett Esophagus\n",
275 | "550 articles for Fuchs' Endothelial Dystrophy\n",
276 | "3870 articles for Otosclerosis\n",
277 | "1486 articles for Conduct Disorder\n",
278 | "2979 articles for Glomerulosclerosis, Focal Segmental\n",
279 | "25969 articles for Dental Caries\n",
280 | "105132 articles for Anemia\n",
281 | "4634 articles for Panic Disorder\n",
282 | "58396 articles for Acquired Immunodeficiency Syndrome\n"
283 | ]
284 | }
285 | ],
286 | "source": [
287 | "rows_out = list()\n",
288 | "\n",
289 | "for i, row in disease_df.iterrows():\n",
290 | "    term_query = '{disease}[MeSH Major Topic]'.format(disease=row.mesh_name.lower())\n",
291 | "    payload = {'db': 'pubmed', 'term': term_query}\n",
292 | "    pmids = eutility.esearch_query(payload, retmax=10000)\n",
293 | "    row['term_query'] = term_query\n",
294 | "    row['n_articles'] = len(pmids)\n",
295 | "    row['pubmed_ids'] = '|'.join(pmids)\n",
296 | "    rows_out.append(row)\n",
297 | "    print('{} articles for {}'.format(len(pmids), row.mesh_name))\n",
298 | "\n",
299 | "disease_pmids_df = pandas.DataFrame(rows_out)"
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "execution_count": 4,
305 | "metadata": {
306 | "collapsed": false,
307 | "jupyter": {
308 | "outputs_hidden": false
309 | }
310 | },
311 | "outputs": [],
312 | "source": [
313 | "# Open in text mode so pandas writes str directly and the text buffer is flushed on exit\n",
314 | "with gzip.open('data/disease-pmids.tsv.gz', 'wt') as write_file:\n",
315 | "    disease_pmids_df.to_csv(write_file, sep='\\t', index=False)"
316 | ]
317 | },
318 | {
319 | "cell_type": "markdown",
320 | "metadata": {},
321 | "source": [
322 | "# Symptoms"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": 5,
328 | "metadata": {
329 | "collapsed": false,
330 | "jupyter": {
331 | "outputs_hidden": false
332 | }
333 | },
334 | "outputs": [
335 | {
336 | "data": {
337 | "text/html": [
338 | "<div>\n",
339 | "<table border=\"1\" class=\"dataframe\">\n",
340 | "  <thead>\n",
341 | "    <tr style=\"text-align: right;\">\n",
342 | "      <th></th>\n",
343 | "      <th>mesh_id</th>\n",
344 | "      <th>mesh_name</th>\n",
345 | "      <th>in_hsdn</th>\n",
346 | "    </tr>\n",
347 | "  </thead>\n",
348 | "  <tbody>\n",
349 | "    <tr>\n",
350 | "      <th>0</th>\n",
351 | "      <td>D000006</td>\n",
352 | "      <td>Abdomen, Acute</td>\n",
353 | "      <td>1</td>\n",
354 | "    </tr>\n",
355 | "    <tr>\n",
356 | "      <th>1</th>\n",
357 | "      <td>D000270</td>\n",
358 | "      <td>Adie Syndrome</td>\n",
359 | "      <td>0</td>\n",
360 | "    </tr>\n",
361 | "    <tr>\n",
362 | "      <th>2</th>\n",
363 | "      <td>D000326</td>\n",
364 | "      <td>Adrenoleukodystrophy</td>\n",
365 | "      <td>0</td>\n",
366 | "    </tr>\n",
367 | "    <tr>\n",
368 | "      <th>3</th>\n",
369 | "      <td>D000334</td>\n",
370 | "      <td>Aerophagy</td>\n",
371 | "      <td>1</td>\n",
372 | "    </tr>\n",
373 | "    <tr>\n",
374 | "      <th>4</th>\n",
375 | "      <td>D000370</td>\n",
376 | "      <td>Ageusia</td>\n",
377 | "      <td>1</td>\n",
378 | "    </tr>\n",
379 | "  </tbody>\n",
380 | "</table>\n",
381 | "</div>"
382 | ],
383 | "text/plain": [
384 | " mesh_id mesh_name in_hsdn\n",
385 | "0 D000006 Abdomen, Acute 1\n",
386 | "1 D000270 Adie Syndrome 0\n",
387 | "2 D000326 Adrenoleukodystrophy 0\n",
388 | "3 D000334 Aerophagy 1\n",
389 | "4 D000370 Ageusia 1"
390 | ]
391 | },
392 | "execution_count": 5,
393 | "metadata": {},
394 | "output_type": "execute_result"
395 | }
396 | ],
397 | "source": [
398 | "# Read MeSH Symptoms\n",
399 | "url = 'https://raw.githubusercontent.com/dhimmel/mesh/e561301360e6de2140dedeaa7c7e17ce4714eb7f/data/symptoms.tsv'\n",
400 | "symptom_df = pandas.read_table(url)\n",
401 | "symptom_df.head()"
402 | ]
403 | },
404 | {
405 | "cell_type": "code",
406 | "execution_count": 6,
407 | "metadata": {
408 | "collapsed": false,
409 | "jupyter": {
410 | "outputs_hidden": false
411 | }
412 | },
413 | "outputs": [
414 | {
415 | "name": "stdout",
416 | "output_type": "stream",
417 | "text": [
418 | "8496 articles for Abdomen, Acute\n",
419 | "313 articles for Adie Syndrome\n",
420 | "1508 articles for Adrenoleukodystrophy\n",
421 | "261 articles for Aerophagy\n",
422 | "222 articles for Ageusia\n",
423 | "2049 articles for Agnosia\n",
424 | "849 articles for Agraphia\n",
425 | "12310 articles for Albuminuria\n",
426 | "1118 articles for Alcohol Amnestic Disorder\n",
427 | "846 articles for Alkalosis, Respiratory\n",
428 | "5803 articles for Amblyopia\n",
429 | "6330 articles for Amnesia\n",
430 | "824 articles for Amnesia, Retrograde\n",
431 | "30785 articles for Angina Pectoris\n",
432 | "1864 articles for Angina Pectoris, Variant\n",
433 | "8277 articles for Angina, Unstable\n",
434 | "933 articles for Anomia\n",
435 | "4130 articles for Anorexia\n",
436 | "3055 articles for Olfaction Disorders\n",
437 | "53502 articles for Anoxia\n",
438 | "8561 articles for Aphasia\n",
439 | "1439 articles for Aphasia, Broca\n",
440 | "825 articles for Aphasia, Wernicke\n",
441 | "270 articles for Aphonia\n",
442 | "6310 articles for Apnea\n",
443 | "2355 articles for Apraxias\n",
444 | "1544 articles for Articulation Disorders\n",
445 | "1439 articles for Asthenia\n",
446 | "6601 articles for Ataxia\n",
447 | "2930 articles for Ataxia Telangiectasia\n",
448 | "1284 articles for Athetosis\n",
449 | "1041 articles for Auditory Perceptual Disorders\n",
450 | "15025 articles for Back Pain\n",
451 | "33305 articles for Birth Weight\n",
452 | "6468 articles for Urinary Bladder, Neurogenic\n",
453 | "16978 articles for Blindness\n",
454 | "0 articles for Body Temperature Changes\n",
455 | "164793 articles for Body Weight\n",
456 | "5 articles for Body Weight Changes\n",
457 | "7382 articles for Brain Death\n",
458 | "4920 articles for Bulimia\n",
459 | "4023 articles for Cachexia\n",
460 | "5278 articles for Cardiac Output, Low\n",
461 | "2347 articles for Catalepsy\n",
462 | "767 articles for Cataplexy\n",
463 | "1977 articles for Catatonia\n",
464 | "564 articles for Causalgia\n",
465 | "3625 articles for Cerebellar Ataxia\n",
466 | "981 articles for Cerebrospinal Fluid Otorrhea\n",
467 | "2619 articles for Cerebrospinal Fluid Rhinorrhea\n",
468 | "9786 articles for Chest Pain\n",
469 | "625 articles for Cheyne-Stokes Respiration\n",
470 | "3746 articles for Chorea\n",
471 | "361 articles for Choroid Hemorrhage\n",
472 | "3597 articles for Colic\n",
473 | "3593 articles for Color Vision Defects\n",
474 | "10956 articles for Coma\n",
475 | "1687 articles for Communication Disorders\n",
476 | "3771 articles for Confusion\n",
477 | "2178 articles for Consciousness Disorders\n",
478 | "11088 articles for Constipation\n",
479 | "12559 articles for Cough\n",
480 | "609 articles for Cri-du-Chat Syndrome\n",
481 | "4077 articles for Cyanosis\n",
482 | "627 articles for De Lange Syndrome\n",
483 | "23208 articles for Deafness\n",
484 | "1750 articles for Hearing Loss, Sudden\n",
485 | "4708 articles for Decerebrate State\n",
486 | "6161 articles for Delirium\n",
487 | "39716 articles for Diarrhea\n",
488 | "6500 articles for Diarrhea, Infantile\n",
489 | "4370 articles for Diplopia\n",
490 | "11880 articles for Dizziness\n",
491 | "21187 articles for Down Syndrome\n",
492 | "1906 articles for Dysarthria\n",
493 | "288 articles for Dysgeusia\n",
494 | "6293 articles for Dyskinesia, Drug-Induced\n",
495 | "6411 articles for Dyslexia\n",
496 | "825 articles for Dyslexia, Acquired\n",
497 | "3213 articles for Dysmenorrhea\n",
498 | "7353 articles for Dyspepsia\n",
499 | "15978 articles for Dyspnea\n",
500 | "322 articles for Dyspnea, Paroxysmal\n",
501 | "6503 articles for Dystonia\n",
502 | "614 articles for Earache\n",
503 | "1075 articles for Ecchymosis\n",
504 | "211 articles for Echolalia\n",
505 | "33279 articles for Edema\n",
506 | "555 articles for Edema, Cardiac\n",
507 | "607 articles for Emaciation\n",
508 | "599 articles for Encopresis\n",
509 | "310 articles for Eructation\n",
510 | "817 articles for Eye Hemorrhage\n",
511 | "3146 articles for Eye Manifestations\n",
512 | "5163 articles for Facial Pain\n",
513 | "10449 articles for Facial Paralysis\n",
514 | "1785 articles for Failure to Thrive\n",
515 | "598 articles for Fasciculation\n",
516 | "20620 articles for Fatigue\n",
517 | "1172 articles for Mental Fatigue\n",
518 | "534 articles for Feminization\n",
519 | "2763 articles for Fetal Hypoxia\n",
520 | "2988 articles for Fetal Distress\n",
521 | "1878 articles for Fetal Macrosomia\n",
522 | "31658 articles for Fever\n",
523 | "3742 articles for Fever of Unknown Origin\n",
524 | "1233 articles for Flatulence\n",
525 | "1084 articles for Flushing\n",
526 | "4151 articles for Fragile X Syndrome\n",
527 | "416 articles for Gagging\n",
528 | "188 articles for Gerstmann Syndrome\n",
529 | "2562 articles for Gingival Hemorrhage\n",
530 | "251 articles for Glossalgia\n",
531 | "1128 articles for Halitosis\n",
532 | "9275 articles for Hallucinations\n",
533 | "22956 articles for Headache\n",
534 | "13453 articles for Hearing Disorders\n",
535 | "1745 articles for Hearing Loss, Bilateral\n",
536 | "481 articles for Hearing Loss, Central\n",
537 | "3022 articles for Hearing Loss, Conductive\n",
538 | "158 articles for Hearing Loss, Functional\n",
539 | "868 articles for Hearing Loss, High-Frequency\n",
540 | "6101 articles for Hearing Loss, Noise-Induced\n",
541 | "13197 articles for Hearing Loss, Sensorineural\n",
542 | "3035 articles for Heart Murmurs\n",
543 | "1707 articles for Heartburn\n",
544 | "2030 articles for Hematemesis\n",
545 | "2493 articles for Hemianopsia\n",
546 | "10439 articles for Hemiplegia\n",
547 | "1147 articles for Hemoglobinuria\n",
548 | "5102 articles for Hemoptysis\n",
549 | "1904 articles for Oral Hemorrhage\n",
550 | "939 articles for Hiccup\n",
551 | "3403 articles for Hirsutism\n",
552 | "1731 articles for Hoarseness\n",
553 | "1713 articles for Horner Syndrome\n",
554 | "9641 articles for Huntington Disease\n",
555 | "8030 articles for Hyperalgesia\n",
556 | "7386 articles for Hypercapnia\n",
557 | "1254 articles for Hyperemesis Gravidarum\n",
558 | "831 articles for Hyperesthesia\n",
559 | "3377 articles for Hypergammaglobulinemia\n",
560 | "3698 articles for Hyperkinesis\n",
561 | "2475 articles for Hyperphagia\n",
562 | "2742 articles for Disorders of Excessive Somnolence\n",
563 | "5160 articles for Hyperventilation\n",
564 | "2373 articles for Hypesthesia\n",
565 | "1260 articles for Hyphema\n",
566 | "4944 articles for Hypotension, Orthostatic\n",
567 | "8729 articles for Hypothermia\n",
568 | "1680 articles for Hypoventilation\n",
569 | "4413 articles for Illusions\n",
570 | "9308 articles for Sleep Initiation and Maintenance Disorders\n",
571 | "319 articles for Insulin Coma\n",
572 | "7081 articles for Intermittent Claudication\n",
573 | "9452 articles for Jaundice\n",
574 | "596 articles for Kearns-Sayre Syndrome\n",
575 | "916 articles for Menkes Kinky Hair Syndrome\n",
576 | "5025 articles for Language Development Disorders\n",
577 | "5692 articles for Language Disorders\n",
578 | "12767 articles for Learning Disorders\n",
579 | "1144 articles for Lesch-Nyhan Syndrome\n",
580 | "311 articles for Lipoid Proteinosis of Urbach and Wiethe\n",
581 | "15869 articles for Memory Disorders\n",
582 | "238 articles for Meningism\n",
583 | "47833 articles for Intellectual Disability\n",
584 | "781 articles for Monoclonal Gammopathy of Undetermined Significance\n",
585 | "2279 articles for Motion Sickness\n",
586 | "1119 articles for Mouth Breathing\n",
587 | "1948 articles for Muscle Cramp\n",
588 | "777 articles for Muscle Hypertonia\n",
589 | "2650 articles for Muscle Hypotonia\n",
590 | "1789 articles for Muscle Rigidity\n",
591 | "6967 articles for Muscle Spasticity\n",
592 | "8969 articles for Muscular Atrophy\n",
593 | "906 articles for Mutism\n",
594 | "4660 articles for Myoclonus\n",
595 | "1040 articles for Myotonia\n",
596 | "2842 articles for Narcolepsy\n",
597 | "13292 articles for Nausea\n",
598 | "9009 articles for Neuralgia\n",
599 | "7314 articles for Neurologic Manifestations\n",
600 | "1287 articles for Night Blindness\n",
601 | "132455 articles for Obesity\n",
602 | "12170 articles for Obesity, Morbid\n",
603 | "1037 articles for Oliguria\n",
604 | "7392 articles for Ophthalmoplegia\n",
605 | "4188 articles for Optical Illusions\n",
606 | "1484 articles for Oral Manifestations\n",
607 | "111258 articles for Pain\n",
608 | "5588 articles for Pain, Intractable\n",
609 | "28672 articles for Pain, Postoperative\n",
610 | "269 articles for Pallor\n",
611 | "18293 articles for Paralysis\n",
612 | "11424 articles for Paraplegia\n",
613 | "5291 articles for Paresis\n",
614 | "5248 articles for Paresthesia\n",
615 | "5499 articles for Perceptual Disorders\n",
616 | "1560 articles for Phantom Limb\n",
617 | "643 articles for Obesity Hypoventilation Syndrome\n",
618 | "1821 articles for Polyuria\n",
619 | "2356 articles for Prader-Willi Syndrome\n",
620 | "1165 articles for Presbycusis\n",
621 | "20966 articles for Proteinuria\n",
622 | "9095 articles for Pruritus\n",
623 | "366 articles for Pruritus Ani\n",
624 | "314 articles for Pruritus Vulvae\n",
625 | "3697 articles for Psychomotor Agitation\n",
626 | "4816 articles for Psychomotor Disorders\n",
627 | "17476 articles for Psychophysiologic Disorders\n",
628 | "766 articles for Pupil Disorders\n",
629 | "4655 articles for Purpura\n",
630 | "243 articles for Purpura, Hyperglobulinemic\n",
631 | "3611 articles for Purpura, Schoenlein-Henoch\n",
632 | "5798 articles for Purpura, Thrombocytopenic\n",
633 | "3792 articles for Purpura, Thrombotic Thrombocytopenic\n",
634 | "7154 articles for Quadriplegia\n",
635 | "393 articles for Hyperacusis\n",
636 | "4453 articles for Reflex, Abnormal\n",
637 | "1479 articles for Respiratory Paralysis\n",
638 | "7110 articles for Respiratory Sounds\n",
639 | "2744 articles for Restless Legs Syndrome\n",
640 | "4507 articles for Retinal Hemorrhage\n",
641 | "417 articles for Rubinstein-Taybi Syndrome\n",
642 | "4261 articles for Sciatica\n",
643 | "2517 articles for Scotoma\n",
644 | "42647 articles for Seizures\n",
645 | "4095 articles for Sensation Disorders\n",
646 | "0 articles for Signs and Symptoms, Digestive\n",
647 | "0 articles for Signs and Symptoms, Respiratory\n",
648 | "2723 articles for Skin Manifestations\n",
649 | "12373 articles for Sleep Apnea Syndromes\n",
650 | "7447 articles for Sleep Deprivation\n",
651 | "16262 articles for Sleep Disorders\n",
652 | "781 articles for Sneezing\n",
653 | "3366 articles for Snoring\n",
654 | "529 articles for Somnambulism\n",
655 | "6248 articles for Spasm\n",
656 | "10007 articles for Speech Disorders\n",
657 | "3113 articles for Stuttering\n",
658 | "1910 articles for Supranuclear Palsy, Progressive\n",
659 | "9376 articles for Syncope\n",
660 | "1377 articles for Taste Disorders\n",
661 | "2277 articles for Tetany\n",
662 | "3905 articles for Thinness\n",
663 | "1203 articles for Tinea Pedis\n",
664 | "6152 articles for Tinnitus\n",
665 | "2362 articles for Toothache\n",
666 | "3088 articles for Torticollis\n",
667 | "8147 articles for Tremor\n",
668 | "1300 articles for Trismus\n",
669 | "3711 articles for Unconsciousness\n",
670 | "18321 articles for Urinary Incontinence\n",
671 | "9276 articles for Urinary Incontinence, Stress\n",
672 | "8668 articles for Vertigo\n",
673 | "1893 articles for Virilism\n",
674 | "22639 articles for Vision Disorders\n",
675 | "1583 articles for Vitreous Hemorrhage\n",
676 | "5144 articles for Vocal Cord Paralysis\n",
677 | "4620 articles for Voice Disorders\n",
678 | "19894 articles for Vomiting\n",
679 | "221 articles for Vomiting, Anticipatory\n",
680 | "422 articles for Waterhouse-Friderichsen Syndrome\n",
681 | "349 articles for Wolfram Syndrome\n",
682 | "1908 articles for Hydrops Fetalis\n",
683 | "362 articles for Pyruvate Dehydrogenase Complex Deficiency Disease\n",
684 | "2343 articles for Vision, Low\n",
685 | "23709 articles for Weight Gain\n",
686 | "25680 articles for Weight Loss\n",
687 | "1947 articles for Rett Syndrome\n",
688 | "15002 articles for Abdominal Pain\n",
689 | "124 articles for Tonic Pupil\n",
690 | "265 articles for Anisocoria\n",
691 | "363 articles for Miosis\n",
692 | "514 articles for Mydriasis\n",
693 | "786 articles for Mucopolysaccharidosis II\n",
694 | "167 articles for Cardiac Output, High\n",
695 | "4732 articles for Purpura, Thrombocytopenic, Idiopathic\n",
696 | "762 articles for Hypocapnia\n",
697 | "1306 articles for Akathisia, Drug-Induced\n",
698 | "15043 articles for Low Back Pain\n",
699 | "441 articles for Ophthalmoplegia, Chronic Progressive External\n",
700 | "9826 articles for Pain Threshold\n",
701 | "897 articles for Microvascular Angina\n",
702 | "150 articles for Kleine-Levin Syndrome\n",
703 | "82 articles for WAGR Syndrome\n",
704 | "3734 articles for Pelvic Pain\n",
705 | "629 articles for Machado-Joseph Disease\n",
706 | "268 articles for Brown-Sequard Syndrome\n",
707 | "2572 articles for Persistent Vegetative State\n",
708 | "1046 articles for Hypokinesia\n",
709 | "189 articles for Space Motion Sickness\n",
710 | "2667 articles for Hyperoxia\n",
711 | "1228 articles for Gastroparesis\n",
712 | "22 articles for Sweating Sickness\n",
713 | "5442 articles for Arthralgia\n",
714 | "60 articles for Aphasia, Conduction\n",
715 | "409 articles for Aphasia, Primary Progressive\n",
716 | "11068 articles for Muscle Weakness\n",
717 | "1324 articles for Williams Syndrome\n",
718 | "318 articles for Cafe-au-Lait Spots\n",
719 | "1557 articles for Syncope, Vasovagal\n",
720 | "4803 articles for Neck Pain\n",
721 | "751 articles for Hemifacial Spasm\n",
722 | "457 articles for Blindness, Cortical\n",
723 | "2471 articles for Hot Flashes\n",
724 | "758 articles for Aging, Premature\n",
725 | "1678 articles for Pseudophakia\n",
726 | "148 articles for Schnitzler Syndrome\n",
727 | "129 articles for Neurobehavioral Manifestations\n",
728 | "3129 articles for Shoulder Pain\n",
729 | "498 articles for Neurogenic Inflammation\n",
730 | "20 articles for Chorea Gravidarum\n",
731 | "45 articles for Hypersomnolence, Idiopathic\n",
732 | "1410 articles for Sleep Disorders, Circadian Rhythm\n",
733 | "310 articles for Jet Lag Syndrome\n",
734 | "12292 articles for Sleep Apnea, Obstructive\n",
735 | "928 articles for Sleep Apnea, Central\n",
736 | "45 articles for Nocturnal Paroxysmal Dystonia\n",
737 | "112 articles for Night Terrors\n",
738 | "320 articles for Sleep Bruxism\n",
739 | "706 articles for REM Sleep Behavior Disorder\n",
740 | "103 articles for Sleep Paralysis\n",
741 | "493 articles for Nocturnal Myoclonus Syndrome\n",
742 | "94 articles for Coma, Post-Head Injury\n",
743 | "4298 articles for Gait Disorders, Neurologic\n",
744 | "394 articles for Gait Ataxia\n",
745 | "101 articles for Gait Apraxia\n",
746 | "316 articles for Amnesia, Transient Global\n",
747 | "75 articles for Alexia, Pure\n",
748 | "412 articles for Prosopagnosia\n",
749 | "179 articles for Apraxia, Ideomotor\n",
750 | "3000 articles for Postoperative Nausea and Vomiting\n",
751 | "225 articles for Alcohol Withdrawal Seizures\n",
752 | "671 articles for Tics\n",
753 | "221 articles for Amnesia, Anterograde\n",
754 | "608 articles for Paraparesis\n",
755 | "325 articles for Paraparesis, Spastic\n",
756 | "118 articles for Myokymia\n",
757 | "328 articles for Parasomnias\n",
758 | "1247 articles for Fetal Weight\n",
759 | "1610 articles for Spinocerebellar Ataxias\n",
760 | "245 articles for Amaurosis Fugax\n",
761 | "495 articles for Photophobia\n",
762 | "1384 articles for Dyskinesias\n",
763 | "95 articles for Pseudobulbar Palsy\n",
764 | "0 articles for Neuromuscular Manifestations\n",
765 | "991 articles for Somatosensory Disorders\n",
766 | "348 articles for Korsakoff Syndrome\n",
767 | "223 articles for Sleep Disorders, Intrinsic\n",
768 | "293 articles for Dyssomnias\n",
769 | "129 articles for Sleep Arousal Disorders\n",
770 | "123 articles for Sleep-Wake Transition Disorders\n",
771 | "73 articles for REM Sleep Parasomnias\n",
772 | "0 articles for Urological Manifestations\n",
773 | "428 articles for Flank Pain\n",
774 | "156 articles for Chills\n",
775 | "123 articles for Insomnia, Fatal Familial\n",
776 | "9079 articles for Hearing Loss\n",
777 | "142 articles for Metatarsalgia\n",
778 | "510 articles for Mental Retardation, X-Linked\n",
779 | "72 articles for Coffin-Lowry Syndrome\n",
780 | "2706 articles for Jaundice, Obstructive\n",
781 | "44 articles for Reticulocytosis\n",
782 | "436 articles for Hearing Loss, Unilateral\n",
783 | "222 articles for Hearing Loss, Mixed Conductive-Sensorineural\n",
784 | "110 articles for Synkinesis\n",
785 | "656 articles for Labor Pain\n",
786 | "116 articles for Morning Sickness\n",
787 | "13641 articles for Overweight\n",
788 | "2618 articles for Mobility Limitation\n",
789 | "687 articles for Neuralgia, Postherpetic\n",
790 | "91 articles for Glycogen Storage Disease Type IIb\n",
791 | "299 articles for Usher Syndromes\n",
792 | "475 articles for Nocturia\n",
793 | "224 articles for Dysuria\n",
794 | "2760 articles for Urinary Bladder, Overactive\n",
795 | "622 articles for Urinary Incontinence, Urge\n",
796 | "423 articles for Prostatism\n",
797 | "463 articles for Hypercalciuria\n",
798 | "166 articles for Opsoclonus-Myoclonus Syndrome\n",
799 | "74 articles for Urinoma\n",
800 | "231 articles for Pain, Referred\n",
801 | "97 articles for Stupor\n",
802 | "226 articles for Lethargy\n",
803 | "8724 articles for Acute Coronary Syndrome\n",
804 | "66 articles for Deaf-Blind Disorders\n",
805 | "165 articles for Livedo Reticularis\n",
806 | "121 articles for Mevalonate Kinase Deficiency\n",
807 | "48 articles for Systolic Murmurs\n",
808 | "99 articles for Classical Lissencephalies and Subcortical Band Heterotopias\n",
809 | "102 articles for Neuroacanthocytosis\n",
810 | "188 articles for Orthostatic Intolerance\n",
811 | "209 articles for Postural Orthostatic Tachycardia Syndrome\n",
812 | "154 articles for Failed Back Surgery Syndrome\n",
813 | "770 articles for Dysphonia\n",
814 | "136 articles for Purpura Fulminans\n",
815 | "1103 articles for Sarcopenia\n",
816 | "93 articles for Susac Syndrome\n",
817 | "57 articles for Piriformis Muscle Syndrome\n",
818 | "35 articles for Alien Hand Syndrome\n",
819 | "21 articles for Slit Ventricle Syndrome\n",
820 | "1770 articles for Obesity, Abdominal\n",
821 | "185 articles for Renal Colic\n",
822 | "166 articles for Ideal Body Weight\n",
823 | "65 articles for Primary Progressive Nonfluent Aphasia\n",
824 | "16 articles for Infantile Apparent Life-Threatening Event\n",
825 | "27 articles for Post-Exercise Hypotension\n",
826 | "67 articles for Striae Distensae\n",
827 | "293 articles for Eye Pain\n",
828 | "29 articles for Necrolytic Migratory Erythema\n",
829 | "280 articles for Nociceptive Pain\n",
830 | "32 articles for Transient Tachypnea of the Newborn\n",
831 | "65 articles for Tachypnea\n",
832 | "222 articles for Visceral Pain\n",
833 | "4756 articles for Chronic Pain\n",
834 | "1127 articles for Musculoskeletal Pain\n",
835 | "56 articles for Mastodynia\n",
836 | "46 articles for Pelvic Girdle Pain\n",
837 | "146 articles for Breakthrough Pain\n",
838 | "997 articles for Lower Urinary Tract Symptoms\n",
839 | "300 articles for Anhedonia\n",
840 | "65 articles for Polydipsia\n",
841 | "20 articles for Polydipsia, Psychogenic\n",
842 | "775 articles for Acute Pain\n",
843 | "542 articles for Angina, Stable\n",
844 | "18 articles for Ophthalmoplegic Migraine\n",
845 | "38 articles for Pudendal Neuralgia\n",
846 | "83 articles for Dyscalculia\n",
847 | "13 articles for Alice in Wonderland Syndrome\n",
848 | "441 articles for Prodromal Symptoms\n",
849 | "1354 articles for Pediatric Obesity\n",
850 | "327 articles for Myalgia\n",
851 | "12 articles for Hypertriglyceridemic Waist\n",
852 | "388 articles for Cerebrospinal Fluid Leak\n",
853 | "296 articles for Benign Paroxysmal Positional Vertigo\n",
854 | "15 articles for Hyperlactatemia\n",
855 | "0 articles for Allesthesia\n"
856 | ]
857 | }
858 | ],
859 | "source": [
860 | "rows_out = list()\n",
861 | "\n",
862 | "for i, row in symptom_df.iterrows():\n",
863 | "    term_query = '{symptom}[MeSH Terms:noexp]'.format(symptom=row.mesh_name.lower())\n",
864 | "    payload = {'db': 'pubmed', 'term': term_query}\n",
865 | "    pmids = eutility.esearch_query(payload, retmax=5000, sleep=2)\n",
866 | "    row['term_query'] = term_query\n",
867 | "    row['n_articles'] = len(pmids)\n",
868 | "    row['pubmed_ids'] = '|'.join(pmids)\n",
869 | "    rows_out.append(row)\n",
870 | "    print('{} articles for {}'.format(len(pmids), row.mesh_name))"
871 | ]
872 | },
873 | {
874 | "cell_type": "code",
875 | "execution_count": 7,
876 | "metadata": {
877 | "collapsed": false,
878 | "jupyter": {
879 | "outputs_hidden": false
880 | }
881 | },
882 | "outputs": [
883 | {
884 | "data": {
949 | "text/plain": [
950 | " mesh_id mesh_name in_hsdn \\\n",
951 | "0 D000006 Abdomen, Acute 1 \n",
952 | "1 D000270 Adie Syndrome 0 \n",
953 | "2 D000326 Adrenoleukodystrophy 0 \n",
954 | "3 D000334 Aerophagy 1 \n",
955 | "4 D000370 Ageusia 1 \n",
956 | "\n",
957 | " term_query n_articles \\\n",
958 | "0 abdomen, acute[MeSH Terms:noexp] 8496 \n",
959 | "1 adie syndrome[MeSH Terms:noexp] 313 \n",
960 | "2 adrenoleukodystrophy[MeSH Terms:noexp] 1508 \n",
961 | "3 aerophagy[MeSH Terms:noexp] 261 \n",
962 | "4 ageusia[MeSH Terms:noexp] 222 \n",
963 | "\n",
964 | " pubmed_ids \n",
965 | "0 25742249|25669229|25650451|25619050|25608417|2... \n",
966 | "1 25138821|24995781|24625775|24533698|24215593|2... \n",
967 | "2 25860611|25583825|25393703|25378668|25297370|2... \n",
968 | "3 25073665|24796405|24280810|23772202|23772201|2... \n",
969 | "4 24999669|24999665|24825557|24782205|24191925|2... "
970 | ]
971 | },
972 | "execution_count": 7,
973 | "metadata": {},
974 | "output_type": "execute_result"
975 | }
976 | ],
977 | "source": [
978 | "symptom_pmids_df = pandas.DataFrame(rows_out)\n",
979 | "\n",
980 | "# Open the gzip file in text mode ('wt') so pandas can write str data directly\n",
981 | "with gzip.open('data/symptom-pmids.tsv.gz', 'wt') as write_file:\n",
982 | "    symptom_pmids_df.to_csv(write_file, sep='\\t', index=False)\n",
983 | "\n",
984 | "symptom_pmids_df.head()"
985 | ]
986 | },
987 | {
988 | "cell_type": "markdown",
989 | "metadata": {},
990 | "source": [
991 | "# Cooccurrence"
992 | ]
993 | },
994 | {
995 | "cell_type": "code",
996 | "execution_count": 8,
997 | "metadata": {
998 | "collapsed": false,
999 | "jupyter": {
1000 | "outputs_hidden": false
1001 | }
1002 | },
1003 | "outputs": [],
1004 | "source": [
1005 | "symptom_df, symptom_to_pmids = cooccurrence.read_pmids_tsv('data/symptom-pmids.tsv.gz', key='mesh_id')\n",
1006 | "disease_df, disease_to_pmids = cooccurrence.read_pmids_tsv('data/disease-pmids.tsv.gz', key='doid_code')"
1007 | ]
1008 | },
1009 | {
1010 | "cell_type": "code",
1011 | "execution_count": 9,
1012 | "metadata": {
1013 | "collapsed": false,
1014 | "jupyter": {
1015 | "outputs_hidden": false
1016 | }
1017 | },
1018 | "outputs": [
1019 | {
1020 | "data": {
1021 | "text/plain": [
1022 | "1759475"
1023 | ]
1024 | },
1025 | "execution_count": 9,
1026 | "metadata": {},
1027 | "output_type": "execute_result"
1028 | }
1029 | ],
1030 | "source": [
1031 | "symptom_pmids = set.union(*symptom_to_pmids.values())\n",
1032 | "len(symptom_pmids)"
1033 | ]
1034 | },
1035 | {
1036 | "cell_type": "code",
1037 | "execution_count": 10,
1038 | "metadata": {
1039 | "collapsed": false,
1040 | "jupyter": {
1041 | "outputs_hidden": false
1042 | }
1043 | },
1044 | "outputs": [
1045 | {
1046 | "data": {
1047 | "text/plain": [
1048 | "3478558"
1049 | ]
1050 | },
1051 | "execution_count": 10,
1052 | "metadata": {},
1053 | "output_type": "execute_result"
1054 | }
1055 | ],
1056 | "source": [
1057 | "disease_pmids = set.union(*disease_to_pmids.values())\n",
1058 | "len(disease_pmids)"
1059 | ]
1060 | },
1061 | {
1062 | "cell_type": "code",
1063 | "execution_count": 11,
1064 | "metadata": {
1065 | "collapsed": false,
1066 | "jupyter": {
1067 | "outputs_hidden": false
1068 | }
1069 | },
1070 | "outputs": [
1071 | {
1072 | "name": "stdout",
1073 | "output_type": "stream",
1074 | "text": [
1075 | "Total articles containing a doid_code: 3478558\n",
1076 | "Total articles containing a mesh_id: 1759475\n",
1077 | "Total articles containing both a doid_code and mesh_id: 363928\n",
1078 | "\n",
1079 | "After removing terms without any cooccurences:\n",
1080 | "+ 133 doid_codes remain\n",
1081 | "+ 426 mesh_ids remain\n",
1082 | "\n",
1083 | "Cooccurrence scores calculated for 56658 doid_code -- mesh_id pairs\n"
1084 | ]
1085 | }
1086 | ],
1087 | "source": [
1088 | "cooc_df = cooccurrence.score_pmid_cooccurrence(disease_to_pmids, symptom_to_pmids, 'doid_code', 'mesh_id')"
1089 | ]
1090 | },
1091 | {
1092 | "cell_type": "code",
1093 | "execution_count": 12,
1094 | "metadata": {
1095 | "collapsed": false,
1096 | "jupyter": {
1097 | "outputs_hidden": false
1098 | }
1099 | },
1100 | "outputs": [
1101 | {
1102 | "data": {
1185 | "text/plain": [
1186 | " doid_code doid_name mesh_id mesh_name \\\n",
1187 | "30318 DOID:10652 Alzheimer's disease D004314 Down Syndrome \n",
1188 | "30408 DOID:10652 Alzheimer's disease D008569 Memory Disorders \n",
1189 | "30452 DOID:10652 Alzheimer's disease D011595 Psychomotor Agitation \n",
1190 | "30257 DOID:10652 Alzheimer's disease D000647 Amnesia \n",
1191 | "30381 DOID:10652 Alzheimer's disease D006816 Huntington Disease \n",
1192 | "\n",
1193 | " cooccurrence expected enrichment odds_ratio p_fisher \n",
1194 | "30318 800 35.619601 22.459544 39.918352 0.000000e+00 \n",
1195 | "30408 1593 76.580532 20.801631 41.885877 0.000000e+00 \n",
1196 | "30452 334 15.235665 21.922247 35.277329 0.000000e+00 \n",
1197 | "30257 307 14.061215 21.833106 34.890099 4.277452e-314 \n",
1198 | "30381 255 12.130614 21.021195 32.630035 8.215868e-256 "
1199 | ]
1200 | },
1201 | "execution_count": 12,
1202 | "metadata": {},
1203 | "output_type": "execute_result"
1204 | }
1205 | ],
1206 | "source": [
1207 | "cooc_df = symptom_df[['mesh_id', 'mesh_name']].drop_duplicates().merge(cooc_df)\n",
1208 | "cooc_df = disease_df[['doid_code', 'doid_name']].drop_duplicates().merge(cooc_df)\n",
1209 | "cooc_df = cooc_df.sort_values(by=['doid_name', 'p_fisher'])\n",
1210 | "cooc_df.to_csv('data/disease-symptom-cooccurrence.tsv', index=False, sep='\\t')\n",
1211 | "cooc_df.head()"
1212 | ]
1213 | },
1214 | {
1215 | "cell_type": "markdown",
1216 | "metadata": {},
1217 | "source": [
1218 | "## Visualization"
1219 | ]
1220 | },
1221 | {
1222 | "cell_type": "code",
1223 | "execution_count": 13,
1224 | "metadata": {
1225 | "collapsed": false,
1226 | "jupyter": {
1227 | "outputs_hidden": false
1228 | }
1229 | },
1230 | "outputs": [],
1231 | "source": [
1232 | "import numpy\n",
1233 | "import scipy\n",
1234 | "import seaborn\n",
1235 | "import matplotlib.pyplot as plt\n",
1236 | "\n",
1237 | "%matplotlib inline"
1238 | ]
1239 | },
1240 | {
1241 | "cell_type": "code",
1242 | "execution_count": 14,
1243 | "metadata": {
1244 | "collapsed": false,
1245 | "jupyter": {
1246 | "outputs_hidden": false
1247 | }
1248 | },
1249 | "outputs": [
1250 | {
1251 | "data": {
1253 | "text/plain": [
1254 | ""
1255 | ]
1256 | },
1257 | "metadata": {},
1258 | "output_type": "display_data"
1259 | }
1260 | ],
1261 | "source": [
1262 | "sig_df = cooc_df[cooc_df.p_fisher < 0.05]\n",
1263 | "plt.hist(list(numpy.log(sig_df.enrichment)), bins = 50);"
1264 | ]
1265 | }
1266 | ],
1267 | "metadata": {
1268 | "kernelspec": {
1269 | "display_name": "Python 3",
1270 | "language": "python",
1271 | "name": "python3"
1272 | },
1273 | "language_info": {
1274 | "codemirror_mode": {
1275 | "name": "ipython",
1276 | "version": 3
1277 | },
1278 | "file_extension": ".py",
1279 | "mimetype": "text/x-python",
1280 | "name": "python",
1281 | "nbconvert_exporter": "python",
1282 | "pygments_lexer": "ipython3",
1283 | "version": "3.9.2"
1284 | }
1285 | },
1286 | "nbformat": 4,
1287 | "nbformat_minor": 4
1288 | }
1289 |
--------------------------------------------------------------------------------
/tissues.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Compute anatomy-disease cooccurrence for Hetionet"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true,
15 | "jupyter": {
16 | "outputs_hidden": true
17 | }
18 | },
19 | "outputs": [],
20 | "source": [
21 | "import io\n",
22 | "import gzip\n",
23 | "\n",
24 | "import pandas\n",
25 | "import requests\n",
26 | "import networkx\n",
27 | "\n",
28 | "import eutility\n",
29 | "import cooccurrence"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "# Tissues"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 2,
42 | "metadata": {
43 | "collapsed": false,
44 | "jupyter": {
45 | "outputs_hidden": false
46 | }
47 | },
48 | "outputs": [
49 | {
50 | "data": {
103 | "text/plain": [
104 | " uberon_id uberon_name mesh_id mesh_name\n",
105 | "0 UBERON:0001716 secondary palate D010159 Palate\n",
106 | "1 UBERON:0001908 optic tract D014795 Visual Pathways\n",
107 | "2 UBERON:0002286 third ventricle D020542 Third Ventricle\n",
108 | "3 UBERON:0002349 myocardium D009206 Myocardium\n",
109 | "4 UBERON:0000978 leg D035002 Lower Extremity"
110 | ]
111 | },
112 | "execution_count": 2,
113 | "metadata": {},
114 | "output_type": "execute_result"
115 | }
116 | ],
117 | "source": [
118 | "# Read Uberon anatomical structures and their mapped MeSH terms\n",
119 | "url = 'https://raw.githubusercontent.com/dhimmel/uberon/86a9b754871e5ce7d91d2ef15bcc8f6a0ef6cda1/data/hetio-slim.tsv'\n",
120 | "uberon_df = pandas.read_table(url)\n",
121 | "uberon_df.head()"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 3,
127 | "metadata": {
128 | "collapsed": false,
129 | "jupyter": {
130 | "outputs_hidden": false
131 | }
132 | },
133 | "outputs": [
134 | {
135 | "name": "stdout",
136 | "output_type": "stream",
137 | "text": [
138 | "9284 articles for Palate\n",
139 | "15786 articles for Visual Pathways\n",
140 | "1359 articles for Third Ventricle\n",
141 | "139687 articles for Myocardium\n",
142 | "8596 articles for Lower Extremity\n",
143 | "39066 articles for Cerebellum\n",
144 | "2471 articles for Arachnoid\n",
145 | "382125 articles for Liver\n",
146 | "3960 articles for Dermis\n",
147 | "3295 articles for Sweat\n",
148 | "16125 articles for Optic Nerve\n",
149 | "13305 articles for Gallbladder\n",
150 | "11676 articles for Parotid Gland\n",
151 | "265 articles for Manubrium\n",
152 | "6438 articles for Vena Cava, Superior\n",
153 | "47991 articles for Arteries\n",
154 | "26790 articles for Arm\n",
155 | "28944 articles for Aorta, Thoracic\n",
156 | "60440 articles for Pancreas\n",
157 | "16194 articles for Mesencephalon\n",
158 | "9469 articles for Common Bile Duct\n",
159 | "4456 articles for Choroid Plexus\n",
160 | "5769 articles for Nails\n",
161 | "13016 articles for Joints\n",
162 | "399 articles for Bulbourethral Glands\n",
163 | "158666 articles for Skin\n",
164 | "530 articles for Incus\n",
165 | "14579 articles for Forearm\n",
166 | "8129 articles for Trigeminal Nerve\n",
167 | "1167 articles for Axillary Vein\n",
168 | "3943 articles for Peroneal Nerve\n",
169 | "465 articles for Stapedius\n",
170 | "20418 articles for Vagus Nerve\n",
171 | "24984 articles for Femoral Artery\n",
172 | "7003 articles for Ligaments\n",
173 | "19976 articles for Extremities\n",
174 | "7780 articles for Thumb\n",
175 | "32275 articles for Trachea\n",
176 | "4117 articles for Subclavian Vein\n",
177 | "11826 articles for Iris\n",
178 | "5612 articles for Epiphyses\n",
179 | "5318 articles for Hemolymph\n",
180 | "13982 articles for Wrist\n",
181 | "2056 articles for Sense Organs\n",
182 | "3271 articles for Sweat Glands\n",
183 | "2276 articles for Axillary Artery\n",
184 | "6069 articles for Basilar Artery\n",
185 | "12065 articles for Lymphoid Tissue\n",
186 | "16279 articles for Medulla Oblongata\n",
187 | "7256 articles for Gonads\n",
188 | "15748 articles for Penis\n",
189 | "33282 articles for Heart Atria\n",
190 | "1138 articles for Basilar Membrane\n",
191 | "212 articles for Zona Reticularis\n",
192 | "7331 articles for Femoral Vein\n",
193 | "15273 articles for Tongue\n",
194 | "9959 articles for Tunica Intima\n",
195 | "7380 articles for Seminal Vesicles\n",
196 | "17376 articles for Mouth\n",
197 | "16444 articles for Thoracic Vertebrae\n",
198 | "3420 articles for Semicircular Canals\n",
199 | "754 articles for Ulnar Artery\n",
200 | "1692 articles for Cranial Sutures\n",
201 | "12999 articles for Carotid Artery, Internal\n",
202 | "17868 articles for Cardiovascular System\n",
203 | "8207 articles for Vertebral Artery\n",
204 | "7214 articles for Papillary Muscles\n",
205 | "1566 articles for Palate, Hard\n",
206 | "11061 articles for Ankle Joint\n",
207 | "1838 articles for Ethmoid Bone\n",
208 | "4409 articles for Locus Coeruleus\n",
209 | "415 articles for Malleus\n",
210 | "138163 articles for Muscles\n",
211 | "24440 articles for Maxilla\n",
212 | "3333 articles for Sebaceous Glands\n",
213 | "5278 articles for Ganglia, Autonomic\n",
214 | "1753 articles for Uvea\n",
215 | "2250 articles for Pia Mater\n",
216 | "313 articles for Oval Window, Ear\n",
217 | "62855 articles for Retina\n",
218 | "2802 articles for Purkinje Fibers\n",
219 | "10795 articles for Ear, External\n",
220 | "7283 articles for Prosencephalon\n",
221 | "15487 articles for Head\n",
222 | "10011 articles for Submandibular Gland\n",
223 | "45 articles for Metencephalon\n",
224 | "11258 articles for Pulmonary Veins\n",
225 | "4959 articles for Neck Muscles\n",
226 | "4934 articles for Decidua\n",
227 | "1974 articles for Loop of Henle\n",
228 | "2487 articles for Zygoma\n",
229 | "2774 articles for Intestinal Secretions\n",
230 | "1151 articles for Eyelashes\n",
231 | "2388 articles for Brachiocephalic Trunk\n",
232 | "12103 articles for Umbilical Veins\n",
233 | "60128 articles for Testis\n",
234 | "22830 articles for Pulmonary Alveoli\n",
235 | "28862 articles for Skull\n",
236 | "368 articles for Geniculate Ganglion\n",
237 | "1560 articles for Cochlear Nucleus\n",
238 | "7456 articles for Ganglia\n",
239 | "53374 articles for Uterus\n",
240 | "5274 articles for Umbilical Arteries\n",
241 | "2980 articles for Ethmoid Sinus\n",
242 | "2110 articles for Stellate Ganglion\n",
243 | "3345 articles for Palate, Soft\n",
244 | "5092 articles for Scapula\n",
245 | "8415 articles for Median Nerve\n",
246 | "816 articles for Maxillary Artery\n",
247 | "50886 articles for Thyroid Gland\n",
248 | "70886 articles for Lymph Nodes\n",
249 | "9377 articles for Elbow Joint\n",
250 | "3943 articles for Nipples\n",
251 | "6072 articles for Atrioventricular Node\n",
252 | "594 articles for Anterior Cerebral Artery\n",
253 | "3299 articles for Skull Base\n",
254 | "38733 articles for Mandible\n",
255 | "38733 articles for Mandible\n",
256 | "8666 articles for Endocrine Glands\n",
257 | "13807 articles for Immune System\n",
258 | "444 articles for Vestibular Aqueduct\n",
259 | "22218 articles for Cartilage\n",
260 | "70968 articles for Spinal Cord\n",
261 | "2860 articles for Radial Nerve\n",
262 | "7956 articles for Vas Deferens\n",
263 | "4298 articles for Vestibulocochlear Nerve\n",
264 | "1018 articles for Sternoclavicular Joint\n",
265 | "17244 articles for Mammary Glands, Animal\n",
266 | "2818 articles for Exocrine Glands\n",
267 | "3322 articles for Fovea Centralis\n",
268 | "18490 articles for Aorta, Abdominal\n",
269 | "3535 articles for Endocrine System\n",
270 | "4302 articles for Ear Canal\n",
271 | "848 articles for Bile Canaliculi\n",
272 | "430 articles for Cochlear Duct\n",
273 | "49555 articles for Muscle, Smooth\n",
274 | "23210 articles for Endometrium\n",
275 | "882 articles for Acromion\n",
276 | "6664 articles for Lacrimal Apparatus\n",
277 | "39662 articles for Pulmonary Artery\n",
278 | "20823 articles for Foot\n",
279 | "1819 articles for Olfactory Nerve\n",
280 | "1000 articles for Medial Forebrain Bundle\n",
281 | "10831 articles for Parathyroid Glands\n",
282 | "11571 articles for Bile Ducts\n",
283 | "23513 articles for Neck\n",
284 | "72662 articles for Bone and Bones\n",
285 | "2449 articles for Nasal Bone\n",
286 | "6464 articles for Kidney Medulla\n",
287 | "5374 articles for Hepatic Veins\n",
288 | "3782 articles for Masseter Muscle\n",
289 | "121682 articles for Heart\n",
290 | "4958 articles for Endothelium, Corneal\n",
291 | "3739 articles for Tibial Nerve\n",
292 | "7079 articles for Tears\n",
293 | "18399 articles for Gastric Juice\n",
294 | "2475 articles for Musculoskeletal System\n",
295 | "3679 articles for Hematopoietic System\n",
296 | "4587 articles for Elastic Tissue\n",
297 | "1078 articles for Stapes\n",
298 | "856 articles for Zona Glomerulosa\n",
299 | "12726 articles for Pericardium\n",
300 | "7595 articles for Arterioles\n",
301 | "3811 articles for Myenteric Plexus\n",
302 | "2157 articles for Enteric Nervous System\n",
303 | "905 articles for Celiac Plexus\n",
304 | "21094 articles for Ureter\n",
305 | "4224 articles for Pulmonary Valve\n",
306 | "6662 articles for Pyramidal Tracts\n",
307 | "10362 articles for Kidney Cortex\n",
308 | "15782 articles for Renal Artery\n",
309 | "2332 articles for Petrous Bone\n",
310 | "66514 articles for Epithelium\n",
311 | "20764 articles for Hair\n",
312 | "2413 articles for Ophthalmic Artery\n",
313 | "10484 articles for Fallopian Tubes\n",
314 | "272 articles for Endolymphatic Duct\n",
315 | "4235 articles for Epithelium, Corneal\n",
316 | "10160 articles for Pons\n",
317 | "1674 articles for Bronchial Arteries\n",
318 | "5314 articles for Colostrum\n",
319 | "868 articles for Neuropil\n",
320 | "21982 articles for Digestive System\n",
321 | "11662 articles for Eyelids\n",
322 | "2474 articles for Lumbosacral Plexus\n",
323 | "158 articles for Truncus Arteriosus\n",
324 | "7471 articles for Macula Lutea\n",
325 | "13490 articles for Salivary Glands\n",
326 | "14568 articles for Elbow\n",
327 | "222 articles for Acoustic Maculae\n",
328 | "1188 articles for Sebum\n",
329 | "41482 articles for Pituitary Gland\n",
330 | "2389 articles for Femoral Nerve\n",
331 | "6717 articles for Renal Veins\n",
332 | "6673 articles for Subclavian Artery\n",
333 | "3161 articles for Trigeminal Ganglion\n",
334 | "42968 articles for Embryo, Mammalian\n",
335 | "18090 articles for Endothelium\n",
336 | "18347 articles for Portal Vein\n",
337 | "4601 articles for Clavicle\n",
338 | "50256 articles for Blood\n",
339 | "1219 articles for Zygapophyseal Joint\n",
340 | "1882 articles for Retinal Vein\n",
341 | "20645 articles for Urethra\n",
342 | "11286 articles for Synovial Fluid\n",
343 | "28147 articles for Cervical Vertebrae\n",
344 | "39275 articles for Central Nervous System\n",
345 | "887 articles for Sesamoid Bones\n",
346 | "1766 articles for Otolithic Membrane\n",
347 | "29346 articles for Prostate\n",
348 | "8731 articles for Brachial Artery\n",
349 | "6440 articles for Facial Muscles\n",
350 | "2580 articles for Cerebellar Nuclei\n",
351 | "486 articles for Ejaculatory Ducts\n",
352 | "1676 articles for Nasolacrimal Duct\n",
353 | "1035 articles for Round Window, Ear\n",
354 | "67285 articles for Heart Ventricles\n",
355 | "12568 articles for Ear, Middle\n",
356 | "20945 articles for Autonomic Nervous System\n",
357 | "1193 articles for Serous Membrane\n",
358 | "3208 articles for Cranial Nerves\n",
359 | "777 articles for Trochlear Nerve\n",
360 | "2901 articles for Occipital Bone\n",
361 | "5188 articles for Carotid Artery, Common\n",
362 | "5890 articles for Paranasal Sinuses\n",
363 | "2866 articles for Oculomotor Nerve\n",
364 | "2574 articles for Hepatic Duct, Common\n",
365 | "11292 articles for Ear, Inner\n",
366 | "2976 articles for Peripheral Nervous System\n",
367 | "28343 articles for Hindlimb\n",
368 | "3319 articles for Sacroiliac Joint\n",
369 | "7290 articles for Perineum\n",
370 | "700 articles for Cerumen\n",
371 | "7040 articles for Tricuspid Valve\n",
372 | "3474 articles for Seminiferous Tubules\n",
373 | "14570 articles for Vena Cava, Inferior\n",
374 | "21373 articles for Tendons\n",
375 | "3202 articles for Optic Chiasm\n",
376 | "8320 articles for Parasympathetic Nervous System\n",
377 | "2226 articles for Follicular Fluid\n",
378 | "4265 articles for Nephrons\n",
379 | "9982 articles for Hip\n",
380 | "8989 articles for Dura Mater\n",
381 | "33485 articles for Saliva\n",
382 | "4545 articles for Hair Follicle\n",
383 | "8669 articles for Kidney Pelvis\n",
384 | "253 articles for Ultimobranchial Body\n",
385 | "24382 articles for Cervix Uteri\n",
386 | "480 articles for Internal Capsule\n",
387 | "3271 articles for Vestibular Nuclei\n",
388 | "29115 articles for Eye\n",
389 | "3742 articles for Celiac Artery\n",
390 | "8402 articles for Pleura\n",
391 | "5731 articles for Sinoatrial Node\n",
392 | "1892 articles for Chromaffin Cells\n",
393 | "15068 articles for Synovial Membrane\n",
394 | "715 articles for Spinothalamic Tracts\n",
395 | "218 articles for Tensor Tympani\n",
396 | "7331 articles for Reticular Formation\n",
397 | "56200 articles for Ovary\n",
398 | "5004 articles for Cerebellar Cortex\n",
399 | "1606 articles for Trigeminal Nuclei\n",
400 | "2947 articles for Retinal Artery\n",
401 | "10206 articles for Shoulder\n",
402 | "7475 articles for Pancreatic Ducts\n",
403 | "19257 articles for Epidermis\n",
404 | "8584 articles for Maxillary Sinus\n",
405 | "18673 articles for Basement Membrane\n",
406 | "9330 articles for Nasal Cavity\n",
407 | "39454 articles for Cornea\n",
408 | "14737 articles for Respiratory System\n",
409 | "9830 articles for Temporal Bone\n",
410 | "10127 articles for Corpus Luteum\n",
411 | "1571 articles for Superior Cervical Ganglion\n",
412 | "32909 articles for Face\n",
413 | "6536 articles for Cheek\n",
414 | "11521 articles for Hepatic Artery\n",
415 | "35393 articles for Sympathetic Nervous System\n",
416 | "391093 articles for Brain\n",
417 | "16739 articles for Semen\n",
418 | "8922 articles for Sclera\n",
419 | "3358 articles for Frontal Sinus\n",
420 | "5619 articles for Biliary Tract\n",
421 | "5184 articles for Endocardium\n",
422 | "892 articles for Meibomian Glands\n",
423 | "118 articles for Diagonal Band of Broca\n",
424 | "1575 articles for Intercostal Muscles\n",
425 | "3514 articles for Sphenoid Bone\n",
426 | "10040 articles for Genitalia, Female\n",
427 | "1235 articles for Skeleton\n",
428 | "429 articles for Scala Tympani\n",
429 | "29363 articles for Blood Vessels\n",
430 | "9860 articles for Aqueous Humor\n",
431 | "33165 articles for Islets of Langerhans\n",
432 | "161 articles for Para-Aortic Bodies\n",
433 | "9149 articles for Ribs\n",
434 | "21783 articles for Hip Joint\n",
435 | "17297 articles for Sciatic Nerve\n",
436 | "5419 articles for Spinal Nerves\n",
437 | "23545 articles for Bile\n",
438 | "18713 articles for Nose\n",
439 | "14785 articles for Orbit\n",
440 | "1947 articles for Glossopharyngeal Nerve\n",
441 | "2651 articles for Turbinates\n",
442 | "1034 articles for Diaphyses\n",
443 | "7338 articles for Jaw\n",
444 | "23702 articles for Spine\n",
445 | "12877 articles for Adrenal Cortex\n",
446 | "7442 articles for Ciliary Body\n",
447 | "40260 articles for Adrenal Glands\n",
448 | "1439 articles for Azygos Vein\n",
449 | "3276 articles for Bundle of His\n",
450 | "8269 articles for Popliteal Artery\n",
451 | "823 articles for Stria Vascularis\n",
452 | "34461 articles for Urine\n",
453 | "4302 articles for Corneal Stroma\n",
454 | "23289 articles for Aortic Valve\n",
455 | "200 articles for Area Postrema\n",
456 | "387 articles for Neurilemma\n",
457 | "1886 articles for Axis\n",
458 | "932 articles for Salivary Ducts\n",
459 | "17828 articles for Diaphragm\n",
460 | "181304 articles for Lung\n",
461 | "2212 articles for Carotid Artery, External\n",
462 | "3792 articles for Oviducts\n",
463 | "9569 articles for Growth Plate\n",
464 | "8384 articles for Facial Nerve\n",
465 | "50094 articles for Knee\n",
466 | "4815 articles for Meninges\n",
467 | "4618 articles for Periosteum\n",
468 | "27105 articles for Veins\n",
469 | "7875 articles for Forelimb\n",
470 | "656 articles for Fourth Ventricle\n",
471 | "2817 articles for Mesenteric Artery, Superior\n",
472 | "1790 articles for Abducens Nerve\n",
473 | "24297 articles for Mitral Valve\n",
474 | "5586 articles for Telencephalon\n",
475 | "839 articles for Accessory Nerve\n",
476 | "1563 articles for Saccule and Utricle\n",
477 | "4045 articles for Vulva\n",
478 | "7473 articles for Brachial Plexus\n",
479 | "1133 articles for Endolymphatic Sac\n",
480 | "1710 articles for Hyoid Bone\n",
481 | "693 articles for Trigeminal Nucleus, Spinal\n",
482 | "234281 articles for Kidney\n",
483 | "14926 articles for Plasma\n",
484 | "20874 articles for Peripheral Nerves\n",
485 | "19503 articles for Nervous System\n",
486 | "2910 articles for Sphenoid Sinus\n",
487 | "398 articles for Mesenteric Artery, Inferior\n",
488 | "1316 articles for Sublingual Gland\n",
489 | "8982 articles for Ear\n",
490 | "5827 articles for Adipose Tissue, Brown\n",
491 | "6326 articles for Ulnar Nerve\n",
492 | "28976 articles for Brain Stem\n",
493 | "7950 articles for Pupil\n",
494 | "25665 articles for Fingers\n",
495 | "7456 articles for Heart Septum\n",
496 | "2905 articles for Hypoglossal Nerve\n",
497 | "11207 articles for Pineal Gland\n",
498 | "7808 articles for Optic Disk\n",
499 | "4657 articles for Cochlear Nerve\n",
500 | "7849 articles for Sternum\n",
501 | "3487 articles for Splenic Artery\n",
502 | "2059 articles for Juxtaglomerular Apparatus\n",
503 | "7755 articles for Scrotum\n",
504 | "14802 articles for Shoulder Joint\n",
505 | "1622 articles for Joint Capsule\n",
506 | "37166 articles for Bronchi\n",
507 | "34463 articles for Hand\n",
508 | "29131 articles for Vagina\n",
509 | "39872 articles for Knee Joint\n",
510 | "13964 articles for Larynx\n",
511 | "529 articles for Posterior Cerebral Artery\n",
512 | "3832 articles for Kidney Tubules, Collecting\n",
513 | "4097 articles for Tunica Media\n",
514 | "415 articles for Zona Fasciculata\n",
515 | "9454 articles for Choroid\n",
516 | "6330 articles for Myometrium\n",
517 | "1541 articles for Uvula\n",
518 | "1377 articles for Clitoris\n",
519 | "2054 articles for Temporal Muscle\n",
520 | "4811 articles for Heart Valves\n",
521 | "21040 articles for Kidney Tubules\n",
522 | "4609 articles for Radial Artery\n",
523 | "3766 articles for Middle Cerebral Artery\n",
524 | "8773 articles for Lip\n",
525 | "58594 articles for Bone Marrow\n",
526 | "63407 articles for Adipose Tissue\n",
527 | "9228 articles for Abdominal Muscles\n",
528 | "1739 articles for Rectus Abdominis\n",
529 | "40413 articles for Lumbar Vertebrae\n",
530 | "4744 articles for Glottis\n",
531 | "2472 articles for Rhombencephalon\n",
532 | "15689 articles for Ovarian Follicle\n",
533 | "72824 articles for Feces\n",
534 | "6049 articles for Phrenic Nerve\n",
535 | "25711 articles for Cochlea\n",
536 | "19935 articles for Connective Tissue\n",
537 | "1666 articles for Popliteal Vein\n",
538 | "1473 articles for Lateral Ventricles\n",
539 | "6089 articles for Urinary Tract\n"
540 | ]
541 | }
542 | ],
543 | "source": [
544 | "rows_out = list()\n",
545 | "# Query PubMed once per tissue; ':noexp' matches the exact MeSH term only (no descendant terms).\n",
546 | "for i, row in uberon_df.iterrows():\n",
547 | "    term_query = '{tissue}[MeSH Terms:noexp]'.format(tissue=row.mesh_name.lower())\n",
548 | "    payload = {'db': 'pubmed', 'term': term_query}\n",
549 | "    pmids = eutility.esearch_query(payload, retmax=5000, sleep=2)\n",
550 | "    row['term_query'] = term_query\n",
551 | "    row['n_articles'] = len(pmids)\n",
552 | "    row['pubmed_ids'] = '|'.join(pmids)\n",
553 | "    rows_out.append(row)\n",
554 | "    print('{} articles for {}'.format(len(pmids), row.mesh_name))\n",
555 | "\n",
556 | "uberon_pmids_df = pandas.DataFrame(rows_out)"
557 | ]
558 | },
559 | {
560 | "cell_type": "code",
561 | "execution_count": 4,
562 | "metadata": {
563 | "collapsed": false,
564 | "jupyter": {
565 | "outputs_hidden": false
566 | }
567 | },
568 | "outputs": [
569 | {
570 | "data": {
571 | "text/html": [
572 | "<div>\n",
573 | "<table border=\"1\" class=\"dataframe\">\n",
574 | "  <thead>\n",
575 | "    <tr style=\"text-align: right;\">\n",
576 | "      <th></th>\n",
577 | "      <th>uberon_id</th>\n",
578 | "      <th>uberon_name</th>\n",
579 | "      <th>mesh_id</th>\n",
580 | "      <th>mesh_name</th>\n",
581 | "      <th>term_query</th>\n",
582 | "      <th>n_articles</th>\n",
583 | "      <th>pubmed_ids</th>\n",
584 | "    </tr>\n",
585 | "  </thead>\n",
586 | "  <tbody>\n",
587 | "    <tr>\n",
588 | "      <th>0</th>\n",
589 | "      <td>UBERON:0001716</td>\n",
590 | "      <td>secondary palate</td>\n",
591 | "      <td>D010159</td>\n",
592 | "      <td>Palate</td>\n",
593 | "      <td>palate[MeSH Terms:noexp]</td>\n",
594 | "      <td>9284</td>\n",
595 | "      <td>26023113|25975064|25895319|25872295|25869559|2...</td>\n",
596 | "    </tr>\n",
597 | "    <tr>\n",
598 | "      <th>1</th>\n",
599 | "      <td>UBERON:0001908</td>\n",
600 | "      <td>optic tract</td>\n",
601 | "      <td>D014795</td>\n",
602 | "      <td>Visual Pathways</td>\n",
603 | "      <td>visual pathways[MeSH Terms:noexp]</td>\n",
604 | "      <td>15786</td>\n",
605 | "      <td>26113723|26089513|26080589|26080584|25972183|2...</td>\n",
606 | "    </tr>\n",
607 | "    <tr>\n",
608 | "      <th>2</th>\n",
609 | "      <td>UBERON:0002286</td>\n",
610 | "      <td>third ventricle</td>\n",
611 | "      <td>D020542</td>\n",
612 | "      <td>Third Ventricle</td>\n",
613 | "      <td>third ventricle[MeSH Terms:noexp]</td>\n",
614 | "      <td>1359</td>\n",
615 | "      <td>26120619|26023696|25723723|25723303|25723298|2...</td>\n",
616 | "    </tr>\n",
617 | "    <tr>\n",
618 | "      <th>3</th>\n",
619 | "      <td>UBERON:0002349</td>\n",
620 | "      <td>myocardium</td>\n",
621 | "      <td>D009206</td>\n",
622 | "      <td>Myocardium</td>\n",
623 | "      <td>myocardium[MeSH Terms:noexp]</td>\n",
624 | "      <td>139687</td>\n",
625 | "      <td>26072537|26062198|26040042|26040041|26039915|2...</td>\n",
626 | "    </tr>\n",
627 | "    <tr>\n",
628 | "      <th>4</th>\n",
629 | "      <td>UBERON:0000978</td>\n",
630 | "      <td>leg</td>\n",
631 | "      <td>D035002</td>\n",
632 | "      <td>Lower Extremity</td>\n",
633 | "      <td>lower extremity[MeSH Terms:noexp]</td>\n",
634 | "      <td>8596</td>\n",
635 | "      <td>26118216|26072540|26062181|26047150|26047149|2...</td>\n",
636 | "    </tr>\n",
637 | "  </tbody>\n",
638 | "</table>\n",
639 | "</div>"
640 | ],
641 | "text/plain": [
642 | "        uberon_id       uberon_name  mesh_id        mesh_name  \\\n",
643 | "0  UBERON:0001716  secondary palate  D010159           Palate   \n",
644 | "1  UBERON:0001908       optic tract  D014795  Visual Pathways   \n",
645 | "2  UBERON:0002286   third ventricle  D020542  Third Ventricle   \n",
646 | "3  UBERON:0002349        myocardium  D009206       Myocardium   \n",
647 | "4  UBERON:0000978               leg  D035002  Lower Extremity   \n",
648 | "\n",
649 | "                          term_query  n_articles  \\\n",
650 | "0           palate[MeSH Terms:noexp]        9284   \n",
651 | "1  visual pathways[MeSH Terms:noexp]       15786   \n",
652 | "2  third ventricle[MeSH Terms:noexp]        1359   \n",
653 | "3       myocardium[MeSH Terms:noexp]      139687   \n",
654 | "4  lower extremity[MeSH Terms:noexp]        8596   \n",
655 | "\n",
656 | "                                          pubmed_ids  \n",
657 | "0  26023113|25975064|25895319|25872295|25869559|2...  \n",
658 | "1  26113723|26089513|26080589|26080584|25972183|2...  \n",
659 | "2  26120619|26023696|25723723|25723303|25723298|2...  \n",
660 | "3  26072537|26062198|26040042|26040041|26039915|2...  \n",
661 | "4  26118216|26072540|26062181|26047150|26047149|2...  "
662 | ]
663 | },
664 | "execution_count": 4,
665 | "metadata": {},
666 | "output_type": "execute_result"
667 | }
668 | ],
669 | "source": [
670 | "# Open the gzip file in text mode so every buffered row is flushed when the block exits.\n",
671 | "with gzip.open('data/uberon-pmids.tsv.gz', 'wt') as write_file:\n",
672 | "    uberon_pmids_df.to_csv(write_file, sep='\\t', index=False)\n",
673 | "\n",
674 | "uberon_pmids_df.head()"
675 | ]
676 | },
677 | {
678 | "cell_type": "markdown",
679 | "metadata": {},
680 | "source": [
681 | "# Tissue-Disease Cooccurrence"
682 | ]
683 | },
684 | {
685 | "cell_type": "code",
686 | "execution_count": 9,
687 | "metadata": {
688 | "collapsed": false,
689 | "jupyter": {
690 | "outputs_hidden": false
691 | }
692 | },
693 | "outputs": [],
694 | "source": [
695 | "uberon_df, uberon_to_pmids = cooccurrence.read_pmids_tsv('data/uberon-pmids.tsv.gz', key='uberon_id')\n",
696 | "disease_df, disease_to_pmids = cooccurrence.read_pmids_tsv('data/disease-pmids.tsv.gz', key='doid_code')"
697 | ]
698 | },
699 | {
700 | "cell_type": "code",
701 | "execution_count": 10,
702 | "metadata": {
703 | "collapsed": false,
704 | "jupyter": {
705 | "outputs_hidden": false
706 | }
707 | },
708 | "outputs": [
709 | {
710 | "name": "stdout",
711 | "output_type": "stream",
712 | "text": [
713 | "Total articles containing a doid_code: 3686312\n",
714 | "Total articles containing a uberon_id: 4697277\n",
715 | "Total articles containing both a doid_code and uberon_id: 696252\n",
716 | "\n",
717 | "After removing terms without any cooccurences:\n",
718 | "+ 133 doid_codes remain\n",
719 | "+ 401 uberon_ids remain\n",
720 | "\n",
721 | "Cooccurrence scores calculated for 53333 doid_code -- uberon_id pairs\n"
722 | ]
723 | }
724 | ],
725 | "source": [
726 | "cooc_df = cooccurrence.score_pmid_cooccurrence(disease_to_pmids, uberon_to_pmids, 'doid_code', 'uberon_id')"
727 | ]
728 | },
729 | {
730 | "cell_type": "code",
731 | "execution_count": 11,
732 | "metadata": {
733 | "collapsed": false,
734 | "jupyter": {
735 | "outputs_hidden": false
736 | }
737 | },
738 | "outputs": [
739 | {
740 | "data": {
741 | "text/html": [
742 | "<div>\n",
743 | "<table border=\"1\" class=\"dataframe\">\n",
744 | "  <thead>\n",
745 | "    <tr style=\"text-align: right;\">\n",
746 | "      <th></th>\n",
747 | "      <th>doid_code</th>\n",
748 | "      <th>doid_name</th>\n",
749 | "      <th>uberon_id</th>\n",
750 | "      <th>uberon_name</th>\n",
751 | "      <th>cooccurrence</th>\n",
752 | "      <th>expected</th>\n",
753 | "      <th>enrichment</th>\n",
754 | "      <th>odds_ratio</th>\n",
755 | "      <th>p_fisher</th>\n",
756 | "    </tr>\n",
757 | "  </thead>\n",
758 | "  <tbody>\n",
759 | "    <tr>\n",
760 | "      <th>28748</th>\n",
761 | "      <td>DOID:10652</td>\n",
762 | "      <td>Alzheimer's disease</td>\n",
763 | "      <td>UBERON:0000955</td>\n",
764 | "      <td>brain</td>\n",
765 | "      <td>11209</td>\n",
766 | "      <td>1182.634069</td>\n",
767 | "      <td>9.477995</td>\n",
768 | "      <td>74.210761</td>\n",
769 | "      <td>0.000000e+00</td>\n",
770 | "    </tr>\n",
771 | "    <tr>\n",
772 | "      <th>28553</th>\n",
773 | "      <td>DOID:10652</td>\n",
774 | "      <td>Alzheimer's disease</td>\n",
775 | "      <td>UBERON:0001890</td>\n",
776 | "      <td>forebrain</td>\n",
777 | "      <td>114</td>\n",
778 | "      <td>7.326350</td>\n",
779 | "      <td>15.560272</td>\n",
780 | "      <td>21.733764</td>\n",
781 | "      <td>5.971023e-99</td>\n",
782 | "    </tr>\n",
783 | "    <tr>\n",
784 | "      <th>28476</th>\n",
785 | "      <td>DOID:10652</td>\n",
786 | "      <td>Alzheimer's disease</td>\n",
787 | "      <td>UBERON:0002037</td>\n",
788 | "      <td>cerebellum</td>\n",
789 | "      <td>303</td>\n",
790 | "      <td>86.548368</td>\n",
791 | "      <td>3.500933</td>\n",
792 | "      <td>3.740149</td>\n",
793 | "      <td>3.504584e-76</td>\n",
794 | "    </tr>\n",
795 | "    <tr>\n",
796 | "      <th>28541</th>\n",
797 | "      <td>DOID:10652</td>\n",
798 | "      <td>Alzheimer's disease</td>\n",
799 | "      <td>UBERON:0002148</td>\n",
800 | "      <td>locus ceruleus</td>\n",
801 | "      <td>97</td>\n",
802 | "      <td>8.450598</td>\n",
803 | "      <td>11.478477</td>\n",
804 | "      <td>14.449700</td>\n",
805 | "      <td>1.183699e-70</td>\n",
806 | "    </tr>\n",
807 | "    <tr>\n",
808 | "      <th>28708</th>\n",
809 | "      <td>DOID:10652</td>\n",
810 | "      <td>Alzheimer's disease</td>\n",
811 | "      <td>UBERON:0000011</td>\n",
812 | "      <td>parasympathetic nervous system</td>\n",
813 | "      <td>103</td>\n",
814 | "      <td>14.727650</td>\n",
815 | "      <td>6.993648</td>\n",
816 | "      <td>7.952412</td>\n",
817 | "      <td>3.985211e-53</td>\n",
818 | "    </tr>\n",
819 | "  </tbody>\n",
820 | "</table>\n",
821 | "</div>"
822 | ],
823 | "text/plain": [
824 | "        doid_code            doid_name       uberon_id  \\\n",
825 | "28748  DOID:10652  Alzheimer's disease  UBERON:0000955   \n",
826 | "28553  DOID:10652  Alzheimer's disease  UBERON:0001890   \n",
827 | "28476  DOID:10652  Alzheimer's disease  UBERON:0002037   \n",
828 | "28541  DOID:10652  Alzheimer's disease  UBERON:0002148   \n",
829 | "28708  DOID:10652  Alzheimer's disease  UBERON:0000011   \n",
830 | "\n",
831 | "                          uberon_name  cooccurrence     expected  enrichment  \\\n",
832 | "28748                           brain         11209  1182.634069    9.477995   \n",
833 | "28553                       forebrain           114     7.326350   15.560272   \n",
834 | "28476                      cerebellum           303    86.548368    3.500933   \n",
835 | "28541                  locus ceruleus            97     8.450598   11.478477   \n",
836 | "28708  parasympathetic nervous system           103    14.727650    6.993648   \n",
837 | "\n",
838 | "       odds_ratio      p_fisher  \n",
839 | "28748   74.210761  0.000000e+00  \n",
840 | "28553   21.733764  5.971023e-99  \n",
841 | "28476    3.740149  3.504584e-76  \n",
842 | "28541   14.449700  1.183699e-70  \n",
843 | "28708    7.952412  3.985211e-53  "
844 | ]
845 | },
846 | "execution_count": 11,
847 | "metadata": {},
848 | "output_type": "execute_result"
849 | }
850 | ],
851 | "source": [
852 | "cooc_df = uberon_df[['uberon_id', 'uberon_name']].drop_duplicates().merge(cooc_df)\n",
853 | "cooc_df = disease_df[['doid_code', 'doid_name']].drop_duplicates().merge(cooc_df)\n",
854 | "cooc_df = cooc_df.sort_values(by=['doid_name', 'p_fisher'])\n",
855 | "cooc_df.head()"
856 | ]
857 | },
858 | {
859 | "cell_type": "code",
860 | "execution_count": 12,
861 | "metadata": {
862 | "collapsed": true,
863 | "jupyter": {
864 | "outputs_hidden": true
865 | }
866 | },
867 | "outputs": [],
868 | "source": [
869 | "cooc_df.to_csv('data/disease-uberon-cooccurrence.tsv', index=False, sep='\\t')"
870 | ]
871 | }
872 | ],
873 | "metadata": {
874 | "kernelspec": {
875 | "display_name": "Python 3",
876 | "language": "python",
877 | "name": "python3"
878 | },
879 | "language_info": {
880 | "codemirror_mode": {
881 | "name": "ipython",
882 | "version": 3
883 | },
884 | "file_extension": ".py",
885 | "mimetype": "text/x-python",
886 | "name": "python",
887 | "nbconvert_exporter": "python",
888 | "pygments_lexer": "ipython3",
889 | "version": "3.9.2"
890 | }
891 | },
892 | "nbformat": 4,
893 | "nbformat_minor": 4
894 | }
895 |
--------------------------------------------------------------------------------
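The tissue-disease table in `tissues.ipynb` reports `cooccurrence`, `expected`, `enrichment`, and `odds_ratio` for each pair. The sketch below shows one way such columns could be derived from two sets of PubMed IDs. It is not the code in `cooccurrence.py`: the function name `score_pair`, the `universe` argument, and the choice of background set are assumptions made for illustration, and `p_fisher` is left out. The one relationship taken from the notebook output is that `enrichment` equals `cooccurrence / expected` (for example, 11209 / 1182.634069 ≈ 9.478 for Alzheimer's disease and brain).

```python
def score_pair(pmids_a, pmids_b, universe):
    """Score the overlap of two PMID sets within a background set of PMIDs.

    Hypothetical helper: `universe` is assumed to be the set of articles
    considered for scoring (for instance, articles carrying at least one
    term from each termset).
    """
    a = set(pmids_a) & universe
    b = set(pmids_b) & universe

    # Cells of the 2x2 contingency table.
    both = len(a & b)
    only_a = len(a) - both
    only_b = len(b) - both
    neither = len(universe) - both - only_a - only_b

    # Overlap expected if the two terms were assigned independently.
    expected = len(a) * len(b) / len(universe)
    enrichment = both / expected if expected else float('nan')

    # Sample odds ratio; infinite when an off-diagonal cell is empty.
    denominator = only_a * only_b
    odds_ratio = both * neither / denominator if denominator else float('inf')

    return {
        'cooccurrence': both,
        'expected': expected,
        'enrichment': enrichment,
        'odds_ratio': odds_ratio,
    }


# Toy data: ten articles, four tagged with term A and three with term B.
result = score_pair({1, 2, 3, 4}, {3, 4, 5}, universe=set(range(1, 11)))
print(result)
```

In the toy call, two articles carry both terms against an expectation of 4 × 3 / 10 = 1.2, so the pair is enriched. A Fisher's exact test on the same four cells would supply a p-value.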