OR
16 | CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
17 | CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
18 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
19 | ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
20 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
21 | ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
22 |
23 | The views and conclusions contained in the software and documentation are those of the
24 | authors and should not be interpreted as representing official policies, either expressed
25 | or implied, of The Board of Trustees of Leland Stanford Junior University.
26 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | This project takes a MySQL Unified Medical Language System (UMLS) database and converts the ontologies to RDF using OWL and SKOS as the main schemas.
2 |
3 | Virtual Appliance users can review the [documentation in the OntoPortal Administration Guide](https://ontoportal.github.io/documentation/administration/ontologies/handling_umls).
4 |
5 | To use it:
6 |
7 | * Specify your database connection in conf.py (conf_sample.py is a template)
8 | * Specify the SAB ontologies to export in umls.conf
9 |
10 | The umls.conf configuration file must contain one ontology per line. Each line is a comma-separated tuple with the following elements:
11 |
12 | Lines that start with `#` are skipped.
13 |
14 | (0) SAB. Optionally followed by `;` and the acronym to use in the ontology URI, for example `MSH;MESH`.
15 | (1) Output file name
16 | (2) Conversion strategy. Accepted values: load_on_codes, load_on_cuis.
17 |
18 |
19 |
20 | Note that 'CCS COSTAR DSM3R DSM4 DXP ICPC2ICD10ENG MCM MMSL MMX MTHCMSFRF MTHMST MTHSPL MTH NDFRT SNM' have no codes and should not be converted with the load_on_codes strategy.
21 |
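As a rough illustration of the line format, the sketch below parses one entry the way the main loop of umls2rdf.py appears to handle the shipped umls.conf: three comma-separated fields, with an optional `;`-separated alternative acronym in the first field. The names `ConfEntry` and `parse_conf_line` are hypothetical and not part of the project, and the whitespace stripping and strategy check are additions that the original loop does not perform.

```python
from typing import NamedTuple, Optional

# Hypothetical helper names; umls2rdf.py parses these lines inline.
STRATEGIES = {"load_on_codes", "load_on_cuis"}


class ConfEntry(NamedTuple):
    sab: str                 # UMLS source abbreviation, e.g. "MSH"
    alt_code: Optional[str]  # optional acronym used in the namespace URI
    output_file: str         # e.g. "MESH.ttl"
    strategy: str            # "load_on_codes" or "load_on_cuis"


def parse_conf_line(line: str) -> Optional[ConfEntry]:
    """Parse one umls.conf line; return None for blank or '#' lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Exactly three fields are expected; anything else raises ValueError.
    sab, output_file, strategy = [part.strip() for part in line.split(",")]
    alt_code = None
    if ";" in sab:
        sab, alt_code = sab.split(";")
    if strategy not in STRATEGIES:
        raise ValueError("unknown conversion strategy: %r" % strategy)
    return ConfEntry(sab, alt_code, output_file, strategy)


print(parse_conf_line("MSH;MESH,MESH.ttl,load_on_codes"))
print(parse_conf_line("#MDDB,MDDB.ttl,load_on_codes #retired"))
```

Rejecting unknown strategies up front is a design choice of this sketch; the script itself treats any value other than `load_on_cuis` as `load_on_codes`.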
22 | umls2rdf.py is designed to be an offline, run-once process.
23 | It's memory intensive and exports all of the default ontologies in umls.conf in 3h 30min.
24 | The ontologies listed in umls.conf are the UMLS ontologies accessible in [BioPortal](https://bioportal.bioontology.org/).
25 |
26 | If you get an error when installing the MySQL-python python library, https://stackoverflow.com/questions/12218229/my-config-h-file-not-found-when-intall-mysql-python-on-osx-10-8 may be of help.
27 |
28 | If running a Windows 10 OS with MySQL, the following tips may be of help.
29 |
30 | - Install [MySQL 5.5](https://dev.mysql.com/downloads/mysql/5.5.html#downloads) to avoid the InnoDB space [disclaimer](https://www.nlm.nih.gov/research/umls/implementation_resources/scripts/README_RRF_MySQL_Output_Stream.html) by NLM.
31 | - [Python 2.7.x](https://www.python.org/downloads/) should be used to avoid syntax errors on 'raise Attribute'
32 | - Installing the MySQLdb module with `python -m pip install MySQLdb` is error prone. Install it with the executable [MySQL-python-1.2.3.win-amd64-py2.7](http://www.codegood.com/archives/129) (last known location).
33 | - Create your RRF subset(s) using mmsys with the MySQL load option, load your database, edit conf.py and umls.conf to your specifications, then run umls2rdf.py
34 |
--------------------------------------------------------------------------------
/checkOutputSyntax.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #===============================================================================
3 | #
4 | # FILE: checkOutputSyntax.sh
5 | #
6 | # USAGE: ./checkOutputSyntax.sh [outputPath]
7 | #
8 | # DESCRIPTION: Apply rapper to check the syntax of turtle files created
9 | # by the umls2rdf.py conversion.
10 | #
11 | # OPTIONS: First argument is the output path (default = 'output')
12 | # REQUIREMENTS: ---
13 | # BUGS: ---
14 | # NOTES: ---
15 | # AUTHOR: Darren L. Weber, Ph.D. (), darren.weber@stanford.edu
16 | # COMPANY: Stanford University
17 | # VERSION: 1.0
18 | # CREATED: 03/29/2013 11:28:18 AM PDT
19 | # REVISION: ---
20 | #===============================================================================
21 |
22 | outputPath='output'
23 | if [ "$1" != "" ]; then
24 |     outputPath="$1"
25 | fi
26 |
27 | # Only run the checks when rapper is installed.
28 | if which rapper >/dev/null 2>&1; then
29 |     for ttlFile in "${outputPath}"/*.ttl; do
30 |         echo
31 |         rapper -i turtle -c --show-graphs --show-namespaces "$ttlFile"
32 |     done
33 |     echo
34 | fi
34 |
35 |
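For readers who prefer to drive the same check from Python, here is a minimal sketch that mirrors the shell script. It uses only the rapper flags that appear in the script; the function names are hypothetical, and the sketch assumes rapper signals a parse failure through a non-zero exit status.

```python
import glob
import os
import shutil
import subprocess


def rapper_command(ttl_file):
    """Build the same rapper invocation used by checkOutputSyntax.sh."""
    return ["rapper", "-i", "turtle", "-c",
            "--show-graphs", "--show-namespaces", ttl_file]


def check_output_syntax(output_path="output"):
    """Run rapper on every .ttl file; return a {file: exit status} dict."""
    if shutil.which("rapper") is None:
        return {}  # like the shell script, do nothing if rapper is missing
    results = {}
    for ttl_file in sorted(glob.glob(os.path.join(output_path, "*.ttl"))):
        results[ttl_file] = subprocess.call(rapper_command(ttl_file))
    return results


if __name__ == "__main__":
    for path, status in check_output_syntax().items():
        print("%s: %s" % (path, "ok" if status == 0 else "exit status %d" % status))
```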
--------------------------------------------------------------------------------
/conf_sample.py:
--------------------------------------------------------------------------------
1 | #Folder to dump the RDF files.
2 | OUTPUT_FOLDER = "output"
3 |
4 | #DB Config
5 | DB_HOST = "your-host"
6 | DB_NAME = "umls2015ab"
7 | DB_USER = "your db user"
8 | DB_PASS = "your db pass"
9 |
10 | UMLS_VERSION = "2015ab"
11 |
12 | # Define the base URI used to generate the concepts URI
13 | UMLS_BASE_URI = "http://purl.bioontology.org/ontology/"
14 |
15 | # Include the semantic type concepts for each Ontology file generated
16 | INCLUDE_SEMANTIC_TYPES = True
17 |
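To illustrate how `UMLS_BASE_URI` ends up in the generated RDF, the sketch below reproduces the two small URI helpers from umls2rdf.py, with the base URI inlined instead of read from `conf`. The acronym `ICD10CM` comes from umls.conf; the codes passed in are example values only.

```python
import urllib.parse

UMLS_BASE_URI = "http://purl.bioontology.org/ontology/"  # value from conf_sample.py


def get_umls_url(code):
    # Namespace of one ontology: base URI + acronym + "/"
    return "%s%s/" % (UMLS_BASE_URI, code)


def get_url_term(ns, code):
    # Class URI: namespace + percent-encoded term code
    if ns.endswith("/"):
        return ns + urllib.parse.quote(code)
    return "%s/%s" % (ns, urllib.parse.quote(code))


ns = get_umls_url("ICD10CM")
print(ns)                         # http://purl.bioontology.org/ontology/ICD10CM/
print(get_url_term(ns, "A00.0"))  # http://purl.bioontology.org/ontology/ICD10CM/A00.0
```

Because the namespace is derived from the acronym, changing `UMLS_BASE_URI` changes the URI of every class in every exported file.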
--------------------------------------------------------------------------------
/move_rdf_to_repo.py:
--------------------------------------------------------------------------------
1 | #Script utility to replace TTL files in BioPortal
2 | #Useful also for BioPortal appliances
3 |
4 | from os import listdir
5 | from os.path import isfile,isdir, join
6 | import pdb
7 | import glob
9 | import shutil
10 |
11 | REPO = "/srv/ncbo/repository"
12 | OUTPUT = "./output"
13 |
14 | ttl_files = glob.glob("%s/*.ttl"%(OUTPUT))
15 |
16 | # Map each ontology acronym (file name without ".ttl") to its TTL file.
17 | file_map = {}
18 | for ttl in ttl_files:
19 |     acronym = ttl.split("/")[-1][0:-4]
20 |     file_map[acronym] = ttl
21 |
22 | for acronym in file_map:
23 |     ttl = file_map[acronym]
24 |     dir_ont = join(REPO,acronym)
25 |     if isdir(dir_ont):
26 |         sub_dirs = glob.glob(join(dir_ont,"*"))
27 |         latest = 0
28 |         latest_subdir = None
29 |         for sub_dir in sub_dirs:
30 |             s = sub_dir.split("/")[-1]
31 |             try:
32 |                 i = int(s)
33 |                 if i > latest:
34 |                     latest = i
35 |                     latest_subdir = sub_dir
36 |             except ValueError:
37 |                 continue
38 |         print("Latest for " + acronym + " is " + str(latest))
39 |         if latest_subdir:
40 |             if isfile(join(latest_subdir,ttl.split("/")[-1])):
41 |                 shutil.copy2(ttl,latest_subdir)
42 |                 print("ttl found")
43 |             else:
44 |                 print("ttl file not found for " + acronym)
45 |     else:
46 |         print("NOT Found " + dir_ont)
46 |
47 |
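The core of the script above is picking the highest-numbered subdirectory of an ontology folder. A minimal sketch of that step in isolation follows; the helper name `latest_submission_dir` and the demo directory layout are hypothetical. Unlike the script, it also checks that each candidate is a directory.

```python
import os
import tempfile


def latest_submission_dir(ontology_dir):
    """Return the numerically highest subdirectory of ontology_dir, or None."""
    latest, latest_path = 0, None
    for name in os.listdir(ontology_dir):
        path = os.path.join(ontology_dir, name)
        try:
            number = int(name)
        except ValueError:
            continue  # ignore entries whose name is not a number
        if os.path.isdir(path) and number > latest:
            latest, latest_path = number, path
    return latest_path


# Demo on a throwaway tree laid out as <repo>/<ACRONYM>/<n>/ (assumed layout).
with tempfile.TemporaryDirectory() as repo:
    for n in ("1", "2", "10", "notes"):
        os.makedirs(os.path.join(repo, "MESH", n))
    found = latest_submission_dir(os.path.join(repo, "MESH"))
    print(os.path.basename(found))  # 10
```

Comparing the names as integers matters: a plain string sort would rank "2" above "10".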
--------------------------------------------------------------------------------
/output/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ncbo/umls2rdf/8d9f98a8ba2feed98c8c8667422c880349e853b0/output/.gitignore
--------------------------------------------------------------------------------
/umls.conf:
--------------------------------------------------------------------------------
1 | AIR,AI-RHEUM.ttl,load_on_codes
2 | ATC,ATC.ttl,load_on_codes
3 | #CPT,CPT.ttl,load_on_codes #disabled due to licensing restrictions in BioPortal
4 | CSP,CRISP.ttl,load_on_codes
5 | CST,COSTART.ttl,load_on_codes
6 | HCPCS,HCPCS.ttl,load_on_codes
7 | HL7V3.0;HL7,HL7.ttl,load_on_cuis
8 | ICD10,ICD10.ttl,load_on_codes
9 | ICD10CM,ICD10CM.ttl,load_on_codes
10 | ICD10PCS,ICD10PCS.ttl,load_on_codes
11 | ICD9CM;ICD9CM,ICD9CM.ttl,load_on_codes
12 | ICPC,ICPC.ttl,load_on_codes
13 | ICPC2P,ICPC2P.ttl,load_on_codes
14 | ICPCFRE,ICPCFRE.ttl,load_on_codes
15 | LNC,LOINC.ttl,load_on_codes
16 | #LOINC-VIEW,LOINC-VIEW.ttl, load_on_codes
17 | #MDDB,MDDB.ttl,load_on_codes #retired
18 | MDR;MEDDRA,MEDDRA.ttl,load_on_codes
19 | MDRFRE,MDRFRE.ttl,load_on_codes
20 | MDRGER,MDRGER.ttl,load_on_codes
21 | MEDLINEPLUS,MEDLINEPLUS.ttl,load_on_cuis
22 | MSH;MESH,MESH.ttl,load_on_codes
23 | MSHFRE,MSHFRE.ttl,load_on_codes
24 | MSHSPA,MSHSPA.ttl,load_on_codes
25 | MTHMST;MSTDE,MSTDE.ttl,load_on_codes
26 | MTHMSTFRE;MSTDE-FRE,MSTDE-FRE.ttl,load_on_codes
27 | NCBI;NCBITAXON,NCBITAXON.ttl,load_on_codes
28 | NDDF,NDDF.ttl,load_on_codes
29 | #NDFRT,NDFRT.ttl,load_on_codes #removed in 2018AA
30 | OMIM,OMIM.ttl,load_on_codes
31 | PDQ,PDQ.ttl,load_on_codes
32 | RCD,RCD.ttl,load_on_codes
33 | RXNORM,RXNORM.ttl,load_on_codes
34 | #RXNORM-VIEW,RXNORM-VIEW.ttl, load_on_codes
35 | SCTSPA,SCTSPA.ttl,load_on_codes
36 | SNMI,SNMI.ttl,load_on_codes
37 | #SNOMEDCT-CORE,SNOMEDCT-CORE.ttl, load_on_codes
38 | SNOMEDCT_US;SNOMEDCT,SNOMEDCT.ttl,load_on_codes
39 | VANDF,VANDF.ttl,load_on_codes
40 | WHO,WHO-ART.ttl,load_on_codes
41 | WHOFRE,WHOFRE.ttl,load_on_codes
42 |
--------------------------------------------------------------------------------
/umls2rdf.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 |
3 | DEBUG = False
4 |
5 | import codecs
6 | import sys
7 | import os
8 | import urllib.request, urllib.parse, urllib.error
9 | from string import Template
10 | import collections
11 | import pymysql
12 | import pdb
13 | from functools import reduce
14 | from itertools import groupby
15 |
16 | try:
17 | import conf
18 | except:
19 | raise
20 |
21 | PREFIXES = """
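22 | @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
23 | @prefix owl: <http://www.w3.org/2002/07/owl#> .
24 | @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
25 | @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
26 | @prefix umls: <http://bioportal.bioontology.org/ontologies/umls/> .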
27 | """
28 |
29 | ONTOLOGY_HEADER = Template("""
30 | <$uri>
31 | a owl:Ontology ;
32 | rdfs:comment "$comment" ;
33 | rdfs:label "$label" ;
34 | owl:imports <http://www.w3.org/2004/02/skos/core> ;
35 | owl:versionInfo "$versioninfo" .
36 | """)
37 |
38 | STY_URL = "http://bioportal.bioontology.org/ontologies/umls/sty/"
39 | HAS_STY = "umls:hasSTY"
40 | HAS_AUI = "umls:aui"
41 | HAS_CUI = "umls:cui"
42 | HAS_TUI = "umls:tui"
43 |
44 | MRCONSO_CODE = 13
45 | MRCONSO_AUI = 7
46 | MRCONSO_STR = 14
47 | MRCONSO_STT = 4
48 | MRCONSO_SCUI = 9
49 | MRCONSO_ISPREF = 6
50 | MRCONSO_TTY = 12
51 | MRCONSO_TS = 2
52 | MRCONSO_CUI = 0
53 |
54 | # http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/SNOMEDCT/relationships.html
55 | MRREL_AUI1 = 1
56 | MRREL_AUI2 = 5
57 | MRREL_CUI1 = 0
58 | MRREL_CUI2 = 4
59 | MRREL_REL = 3
60 | MRREL_RELA = 7
61 |
62 | MRDEF_AUI = 1
63 | MRDEF_DEF = 5
64 | MRDEF_CUI = 0
65 |
66 | MRSAT_CUI = 0
67 | MRSAT_CODE = 5
68 | MRSAT_ATV = 10
69 | MRSAT_ATN = 8
70 |
71 | MRDOC_DOCKEY = 0
72 | MRDOC_VALUE = 1
73 | MRDOC_TYPE = 2
74 | MRDOC_DESC = 3
75 |
76 | MRRANK_TTY = 2
77 | MRRANK_RANK = 0
78 |
79 | MRSTY_CUI = 0
80 | MRSTY_TUI = 1
81 |
82 | MRSAB_LAT = 19
83 |
84 | UMLS_LANGCODE_MAP = {"eng" : "en", "fre" : "fr", "cze" : "cs", "fin" : "fi", "ger" : "de", "ita" : "it", "jpn" : "ja", "pol" : "pl", "por" : "pt", "rus" : "ru", "spa" : "es", "swe" : "sv", "scr" : "hr", "dut" : "nl", "lav" : "lv", "hun" : "hu", "kor" : "ko", "dan" : "da", "nor" : "no", "heb" : "he", "baq" : "eu"}
85 |
86 | def get_umls_url(code):
87 | return "%s%s/"%(conf.UMLS_BASE_URI,code)
88 |
89 | def flatten(matrix):
90 | return reduce(lambda x,y: x+y,matrix)
91 |
92 | def escape(string):
93 | return string.replace("\\","\\\\").replace('"','\\"')
94 |
95 | def get_url_term(ns,code):
96 | if ns[-1] == '/':
97 | ret = ns + urllib.parse.quote(code)
98 | else:
99 | ret = "%s/%s"%(ns,urllib.parse.quote(code))
100 | return ret
101 |
102 | def get_rel_fragment(rel):
103 | return rel[MRREL_RELA] if rel[MRREL_RELA] else rel[MRREL_REL]
104 |
105 |
106 | # NOTE: See UmlsOntology.terms() for the reason these functions use -1 and -2
107 | # indices to obtain the source and target codes, respectively.
108 | def get_rel_code_source(rel,on_cuis):
109 | return rel[-1] if not on_cuis else rel[MRREL_CUI2]
110 | def get_rel_code_target(rel,on_cuis):
111 | return rel[-2] if not on_cuis else rel[MRREL_CUI1]
112 |
113 | def get_code(reg,load_on_cuis):
114 | if load_on_cuis:
115 | return reg[MRCONSO_CUI]
116 | if reg[MRCONSO_CODE]:
117 | return reg[MRCONSO_CODE]
118 | raise AttributeError("No code on reg [%s]"%("|".join(reg)))
119 |
120 | def __get_connection():
121 | return pymysql.connect(host=conf.DB_HOST,user=conf.DB_USER,
122 | passwd=conf.DB_PASS,db=conf.DB_NAME)
123 |
124 | def generate_semantic_types(con,with_roots=False):
125 | url = get_umls_url("STY")
126 | hierarchy = collections.defaultdict(lambda : list())
127 | all_nodes = list()
128 | mrsty = UmlsTable("MRSTY",con,
129 | load_select="SELECT DISTINCT TUI, STN, STY FROM MRSTY")
130 | ont = list()
131 |
132 | for stt in mrsty.scan():
133 | hierarchy[stt[1]].append(stt[0])
134 | sty_term = """<%s> a owl:Class ;
135 | \tskos:notation "%s"^^xsd:string ;
136 | \tskos:prefLabel "%s"@en .
137 | """%(url+stt[0],stt[0],stt[2])
138 | ont.append(sty_term)
139 | all_nodes.append(stt)
140 |
141 | for node in all_nodes:
142 | parent = node[1]
143 | if "." in parent:
144 | parent = ".".join(node[1].split(".")[0:-1])
145 | else:
146 | parent = parent[0:-1]
147 |
148 | rdfs_subclasses = []
149 | for x in hierarchy[parent]:
150 | if node[0] != x:
151 | rdfs_subclasses.append(
152 | "<%s> rdfs:subClassOf <%s> ."%(url+node[0],url+x))
153 |
154 | if len(rdfs_subclasses) == 0 and with_roots:
155 | rdfs_subclasses = ["<%s> rdfs:subClassOf owl:Thing ."%(url+node[0])]
156 |
157 | for sc in rdfs_subclasses:
158 | ont.append(sc)
159 | data_ont_ttl = "\n".join(ont)
160 | return data_ont_ttl
161 |
162 |
163 |
164 |
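The parent lookup inside `generate_semantic_types` relies on the shape of the semantic type tree numbers (STN): the parent's tree number is obtained by dropping the last dotted component, or the last character when there is no dot. A minimal sketch of that rule follows; the TUI labels in the demo rows are made up, and the sketch assumes one TUI per tree number, whereas the function above keeps a list.

```python
def parent_stn(stn):
    """Tree number of the parent semantic type ('' for a top-level node)."""
    if "." in stn:
        return ".".join(stn.split(".")[:-1])
    return stn[:-1]


for stn in ("A1.1.3", "A1.1", "A1", "A"):
    print(repr(stn), "->", repr(parent_stn(stn)))

# Made-up (TUI, STN) rows, only to show how child -> parent edges fall out.
rows = [("T-root", "A"), ("T-child", "A1"), ("T-grandchild", "A1.1")]
by_stn = {stn: tui for tui, stn in rows}
edges = [(tui, by_stn[parent_stn(stn)])
         for tui, stn in rows if parent_stn(stn) in by_stn]
print(edges)
```

A node whose computed parent tree number matches no row has no edge here; in the function above, that is the case where `with_roots=True` attaches the node to `owl:Thing`.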
165 | class UmlsTable(object):
166 | def __init__(self,table_name,conn,load_select=None):
167 | self.table_name = table_name
168 | self.conn = conn
169 | self.page_size = 500000
170 | self.load_select = load_select
171 |
172 | def mesh_tree(self):
173 | q = """select DISTINCT c1.code as parent, c2.code as child
174 | from MRREL r, MRCONSO c1, MRCONSO c2 where r.sab = 'MSH' and r.rel = 'CHD'
175 | and c1.cui = r.cui1
176 | and c2.cui = r.cui2
177 | and c2.code like 'D%'
178 | and c1.code like 'D%'
179 | and c1.sab = 'MSH'
180 | and c2.sab = 'MSH'
181 | """
182 | cursor = self.conn.cursor()
183 | cursor.execute(q)
184 | result = cursor.fetchall()
185 | edges = collections.defaultdict(set)
186 | for record in result:
187 | edges[record[1]].add(record[0])
188 | return edges
189 |
190 | def count(self):
191 | q = "SELECT count(*) FROM %s"%self.table_name
192 | cursor = self.conn.cursor()
193 | cursor.execute(q)
194 | result = cursor.fetchall()
195 | for record in result:
196 | cursor.close()
197 | return int(record[0])
198 |
199 | def scan(self,filt=None,limit=None):
200 | #c = self.count()
201 | i = 0
202 | page = 0
203 | cont = True
204 | cursor = self.conn.cursor()
205 | while cont:
206 | if self.load_select:
207 | q = self.load_select
208 | else:
209 | q = "SELECT * FROM %s WHERE %s LIMIT %s OFFSET %s"%(self.table_name,filt,self.page_size,page * self.page_size)
210 | if filt == None or len(filt) == 0:
211 | q = "SELECT * FROM %s LIMIT %s OFFSET %s"%(self.table_name,self.page_size,page * self.page_size)
212 | sys.stdout.write("[UMLS-Query] %s\n" % q)
213 | sys.stdout.flush()
214 | cursor.execute(q)
215 | result = cursor.fetchall()
216 | cont = False
217 | for record in result:
218 | cont = True
219 | i += 1
220 | yield record
221 | if limit and i >= limit:
222 | cont = False
223 | break
224 | # Do we already have all the rows available for the query?
225 | if self.load_select:
226 | cont = False
227 | elif not limit and i < self.page_size:
228 | cont = False
229 | page += 1
230 | cursor.close()
231 |
232 |
233 |
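`UmlsTable.scan` pages through a table with `LIMIT`/`OFFSET` and yields rows one at a time. The sketch below shows the same pattern in a self-contained form. It swaps in the standard-library `sqlite3` module with an in-memory table as a stand-in for the MySQL connection, uses a tiny page size so the paging is visible, and fills the table with a few illustrative rows; none of this is the project's own code.

```python
import sqlite3


def scan(conn, table, where=None, page_size=2):
    """Yield every row of `table`, fetching one LIMIT/OFFSET page at a time."""
    page = 0
    while True:
        # Table and filter are interpolated as in umls2rdf.py; this is only
        # safe because they come from trusted code, not from user input.
        query = "SELECT * FROM %s" % table
        if where:
            query += " WHERE %s" % where
        query += " LIMIT %d OFFSET %d" % (page_size, page * page_size)
        rows = conn.execute(query).fetchall()
        for row in rows:
            yield row
        if len(rows) < page_size:  # short or empty page: nothing left
            break
        page += 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MRDOC (DOCKEY TEXT, VALUE TEXT)")
conn.executemany(
    "INSERT INTO MRDOC VALUES (?, ?)",
    [("REL", "CHD"), ("REL", "PAR"), ("ATN", "MN"), ("REL", "RB"), ("ATN", "AQ")],
)
print(len(list(scan(conn, "MRDOC"))))                          # 5
print(len(list(scan(conn, "MRDOC", where="DOCKEY = 'REL'"))))  # 3
```

SQL does not guarantee a stable row order across pages without an `ORDER BY`, so a production version of this pattern would normally sort on a key.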
234 | class UmlsClass(object):
235 | def __init__(self,ns,atoms=None,rels=None,
236 | defs=None,atts=None,rank=None,
237 | rank_by_tty=None,sty=None,
238 | sty_by_cui=None,load_on_cuis=False,
239 | is_root=None):
240 | self.ns = ns
241 | self.atoms = atoms
242 | self.rels = rels
243 | self.defs = defs
244 | self.atts = atts
245 | self.rank = rank
246 | self.rank_by_tty = rank_by_tty
247 | self.sty = sty
248 | self.sty_by_cui = sty_by_cui
249 | self.load_on_cuis = load_on_cuis
250 | self.is_root = is_root
251 | self.class_properties = dict()
252 |
253 | def code(self):
254 | codes = set([get_code(x,self.load_on_cuis) for x in self.atoms])
255 | if len(codes) != 1:
256 | raise AttributeError("Only one code per term.")
257 | #if DEBUG:
258 | #sys.stderr.write(self.atoms)
259 | #sys.stderr.write(codes)
260 | return codes.pop()
261 |
262 | def getAltLabels(self,prefLabel):
263 | #is_pref_atoms = filter(lambda x: x[MRCONSO_ISPREF] == 'Y', self.atoms)
264 | return set([atom[MRCONSO_STR] for atom in self.atoms if atom[MRCONSO_STR] != prefLabel])
265 |
266 | def getPrefLabel(self):
267 | if self.load_on_cuis:
268 | if len(self.atoms) == 1:
269 | return self.atoms[0][MRCONSO_STR]
270 |
271 | labels = set([x[MRCONSO_STR] for x in self.atoms])
272 | if len(labels) == 1:
273 | return labels.pop()
274 |
275 | is_pref_atoms = [x for x in self.atoms if x[MRCONSO_ISPREF] == 'Y']
276 | if len(is_pref_atoms) == 0:
277 | return self.atoms[0][MRCONSO_STR]
278 | elif len(is_pref_atoms) == 1:
279 | return is_pref_atoms[0][MRCONSO_STR]
280 |
281 | is_pref_atoms = [x for x in is_pref_atoms if x[MRCONSO_STT] == 'PF']
282 | if len(is_pref_atoms) == 0:
283 | return self.atoms[0][MRCONSO_STR]
284 | elif len(is_pref_atoms) == 1:
285 | return is_pref_atoms[0][MRCONSO_STR]
286 |
287 | is_pref_atoms = [x for x in self.atoms if x[MRCONSO_TTY][0] == 'P']
288 | if len(is_pref_atoms) == 1:
289 | return is_pref_atoms[0][MRCONSO_STR]
290 | return self.atoms[0][MRCONSO_STR]
291 | else:
292 | #if ISPREF=Y is not 1 then we look into MRRANK.
293 | if len(self.rank) > 0:
294 | sort_key = \
295 | lambda x: int(self.rank[self.rank_by_tty[x[MRCONSO_TTY]][0]][MRRANK_RANK])
296 | mmrank_sorted_atoms = sorted(self.atoms,key=sort_key,reverse=True)
297 | return mmrank_sorted_atoms[0][MRCONSO_STR]
298 | #there is no rank to use
299 | else:
300 | pref_atom = [x for x in self.atoms if 'P' in x[MRCONSO_TTY]]
301 | if len(pref_atom) == 1:
302 | return pref_atom[0][MRCONSO_STR]
303 | raise AttributeError("Unable to select pref label")
304 |
305 | def getURLTerm(self,code):
306 | return get_url_term(self.ns,code)
307 |
308 | def properties(self):
309 | return self.class_properties
310 |
311 | def toRDF(self,fmt="Turtle",hierarchy=True,lang="en",tree=None):
312 | if not fmt == "Turtle":
313 | raise AttributeError("Only fmt='Turtle' is currently supported")
314 | term_code = self.code()
315 | url_term = self.getURLTerm(term_code)
316 | prefLabel = self.getPrefLabel()
317 | altLabels = self.getAltLabels(prefLabel)
318 | rdf_term = """<%s> a owl:Class ;
319 | \tskos:prefLabel \"\"\"%s\"\"\"@%s ;
320 | \tskos:notation \"\"\"%s\"\"\"^^xsd:string ;
321 | """%(url_term,escape(prefLabel),lang,escape(term_code))
322 |
323 | if len(altLabels) > 0:
324 | rdf_term += """\tskos:altLabel %s ;
325 | """%(" , ".join(['\"\"\"%s\"\"\"@%s'%(escape(x),lang) for x in set(altLabels)]))
326 |
327 | if self.is_root:
328 | rdf_term += '\trdfs:subClassOf owl:Thing ;\n'
329 |
330 | if len(self.defs) > 0:
331 | rdf_term += """\tskos:definition %s ;
332 | """%(" , ".join(['\"\"\"%s\"\"\"@%s'%(escape(x[MRDEF_DEF]),lang) for x in set(self.defs)]))
333 |
334 | count_parents = 0
335 | if tree:
336 | if term_code in tree:
337 | for parent in tree[term_code]:
338 | o = self.getURLTerm(parent)
339 | rdf_term += "\trdfs:subClassOf <%s> ;\n" % (o,)
340 | for rel in self.rels:
341 | source_code = get_rel_code_source(rel,self.load_on_cuis)
342 | target_code = get_rel_code_target(rel,self.load_on_cuis)
343 | if source_code != term_code:
344 | raise AttributeError("Inconsistent code in rel")
345 | # Map child relations to rdfs:subClassOf (skip parent relations).
346 | if rel[MRREL_REL] == 'PAR':
347 | continue
348 | if rel[MRREL_REL] == 'CHD' and hierarchy:
349 | o = self.getURLTerm(target_code)
350 | count_parents += 1
351 | if target_code == "ICD-10-CM":
352 | #skip bogus ICD10CM parent
353 | continue
354 | if target_code == "138875005":
355 | #skip bogus SNOMED root concept
356 | continue
357 | if target_code == "V-HL7V3.0" or target_code == "C1553931":
358 | #skip bogus HL7V3.0 root concept
359 | continue
360 | if not tree:
361 | rdf_term += "\trdfs:subClassOf <%s> ;\n" % (o,)
362 | else:
363 | p = self.getURLTerm(get_rel_fragment(rel))
364 | o = self.getURLTerm(target_code)
365 | rdf_term += "\t<%s> <%s> ;\n" % (p,o)
366 | if p not in self.class_properties:
367 | self.class_properties[p] = \
368 | UmlsAttribute(p,get_rel_fragment(rel))
369 |
370 | for att in self.atts:
371 | atn = att[MRSAT_ATN]
372 | atv = att[MRSAT_ATV]
373 | if atn == 'AQ':
374 | # Skip all these values (they are replicated in MRREL for
375 | # SNOMEDCT, unknown relationship for MSH).
376 | #if DEBUG:
377 | # sys.stderr.write("att: %s\n" % str(att))
378 | # sys.stderr.flush()
379 | continue
380 | #MESH ROOTS ONLY DESCRIPTORS
381 | if tree and atn == "MN" and term_code.startswith("D"):
382 | if len(atv.split(".")) == 1:
383 | rdf_term += "\trdfs:subClassOf owl:Thing;\n"
384 | p = self.getURLTerm(atn)
385 | rdf_term += "\t<%s> \"\"\"%s\"\"\"^^xsd:string ;\n"%(p, escape(atv))
386 | if p not in self.class_properties:
387 | self.class_properties[p] = UmlsAttribute(p,atn)
388 |
389 | #auis = set([x[MRCONSO_AUI] for x in self.atoms])
390 | cuis = set([x[MRCONSO_CUI] for x in self.atoms])
391 | sty_recs = flatten([indexes for indexes in [self.sty_by_cui[cui] for cui in cuis]])
392 | types = [self.sty[index][MRSTY_TUI] for index in sty_recs]
393 |
394 | #for t in auis:
395 | # rdf_term += """\t%s \"\"\"%s\"\"\"^^xsd:string ;\n"""%(HAS_AUI,t)
396 | for t in cuis:
397 | rdf_term += """\t%s \"\"\"%s\"\"\"^^xsd:string ;\n"""%(HAS_CUI,t)
398 | for t in set(types):
399 | rdf_term += """\t%s \"\"\"%s\"\"\"^^xsd:string ;\n"""%(HAS_TUI,t)
400 | for t in set(types):
401 | rdf_term += """\t%s <%s> ;\n"""%(HAS_STY,get_umls_url("STY")+t)
402 |
403 | return rdf_term + " .\n\n"
404 |
405 |
406 |
407 | class UmlsAttribute(object):
408 | def __init__(self,uri,att):
409 | self.uri = uri
410 | self.att = att
411 |
412 | def getURLTerm(self,code):
413 | return get_url_term(self.ns,code)
414 |
415 | def toRDFWithDesc(self,label,desc,_type):
416 | uri_rdf = self.uri
417 | if self.uri.startswith("http"):
418 | uri_rdf = "<%s>"%self.uri
419 | return """%s a owl:%s ;
420 | \trdfs:label \"\"\"%s\"\"\";
421 | \trdfs:comment \"\"\"%s\"\"\" .
422 | \n"""%(uri_rdf,_type,label,escape(desc))
423 |
424 | def toRDF(self,dockey,desc,fmt="Turtle"):
425 | if not fmt == "Turtle":
426 | raise AttributeError("Only fmt='Turtle' is currently supported")
427 | _type = ""
428 | if "REL" in dockey:
429 | _type = "ObjectProperty"
430 | elif dockey == "ATN":
431 | _type = "DatatypeProperty"
432 | else:
433 | raise AttributeError("Unknown DOCKEY " + dockey)
434 |
435 | label = self.att
436 | if len(desc) < 20:
437 | label = desc
438 | if "_" in label:
439 | label = " ".join(self.att.split("_"))
440 | label = label[0].upper() + label[1:]
441 |
442 | return """<%s> a owl:%s ;
443 | \trdfs:label \"\"\"%s\"\"\";
444 | \trdfs:comment \"\"\"%s\"\"\" .
445 | \n"""%(self.uri,_type,label,escape(desc))
446 |
447 |
448 |
449 | class UmlsOntology(object):
450 | def __init__(self,ont_code,ns,con,load_on_cuis=False):
451 | self.loaded = False
452 | self.ont_code = ont_code
453 | self.ns = ns
454 | self.con = con
455 | self.load_on_cuis = load_on_cuis
456 | #self.alt_uri_code = alt_uri_code
457 | self.atoms = list()
458 | self.atoms_by_code = collections.defaultdict(lambda : list())
459 | if not self.load_on_cuis:
460 | self.atoms_by_aui = collections.defaultdict(lambda : list())
461 | self.rels = list()
462 | self.rels_by_aui_src = collections.defaultdict(lambda : list())
463 | self.defs = list()
464 | self.defs_by_aui = collections.defaultdict(lambda : list())
465 | self.atts = list()
466 | self.atts_by_code = collections.defaultdict(lambda : list())
467 | self.rank = list()
468 | self.rank_by_tty = collections.defaultdict(lambda : list())
469 | self.sty = list()
470 | self.sty_by_cui = collections.defaultdict(lambda : list())
471 | self.cui_roots = set()
472 | self.lang = None
473 | self.ont_properties = dict()
474 |
475 | def load_tables(self,limit=None):
476 | mrconso = UmlsTable("MRCONSO",self.con)
477 | if self.ont_code == 'MSH':
478 | self.tree = mrconso.mesh_tree()
479 | else:
480 | self.tree = None
481 | mrsab = UmlsTable("MRSAB", self.con)
482 | for sab_rec in mrsab.scan(filt="RSAB = '" + self.ont_code + "'", limit=1):
483 | self.lang = sab_rec[MRSAB_LAT].lower()
484 | mrconso_filt = "SAB = '%s' AND lat = '%s' AND SUPPRESS = 'N'"%(
485 | self.ont_code,self.lang)
486 | for atom in mrconso.scan(filt=mrconso_filt,limit=limit):
487 | index = len(self.atoms)
488 | self.atoms_by_code[get_code(atom,self.load_on_cuis)].append(index)
489 | if not self.load_on_cuis:
490 | self.atoms_by_aui[atom[MRCONSO_AUI]].append(index)
491 | self.atoms.append(atom)
492 | if DEBUG:
493 | sys.stderr.write("length atoms: %d\n" % len(self.atoms))
494 | sys.stderr.write("length atoms_by_aui: %d\n" % len(self.atoms_by_aui))
495 | sys.stderr.write("atom example: %s\n\n" % str(self.atoms))
496 | sys.stderr.flush()
497 |
498 | mrconso_filt = "SAB = 'SRC' AND CODE = 'V-%s'"%self.ont_code
499 | for atom in mrconso.scan(filt=mrconso_filt,limit=limit):
500 | self.cui_roots.add(atom[MRCONSO_CUI])
501 | if DEBUG:
502 | sys.stderr.write("length cui_roots: %d\n\n" % len(self.cui_roots))
503 | sys.stderr.flush()
504 |
505 | #
506 | mrrel = UmlsTable("MRREL",self.con)
507 | mrrel_filt = "SAB = '%s' AND SUPPRESS = 'N'"%self.ont_code
508 | field = MRREL_AUI2 if not self.load_on_cuis else MRREL_CUI2
509 | for rel in mrrel.scan(filt=mrrel_filt,limit=limit):
510 | index = len(self.rels)
511 | self.rels_by_aui_src[rel[field]].append(index)
512 | self.rels.append(rel)
513 | if DEBUG:
514 | sys.stderr.write("length rels: %d\n\n" % len(self.rels))
515 | sys.stderr.flush()
516 | #
517 | mrdef = UmlsTable("MRDEF",self.con)
518 | mrdef_filt = "SAB = '%s'"%self.ont_code
519 | field = MRDEF_AUI if not self.load_on_cuis else MRDEF_CUI
520 | for defi in mrdef.scan(filt=mrdef_filt):
521 | index = len(self.defs)
522 | self.defs_by_aui[defi[field]].append(index)
523 | self.defs.append(defi)
524 | if DEBUG:
525 | sys.stderr.write("length defs: %d\n\n" % len(self.defs))
526 | sys.stderr.flush()
527 | #
528 | mrsat = UmlsTable("MRSAT",self.con)
529 | mrsat_filt = "SAB = '%s' AND CODE IS NOT NULL"%self.ont_code
530 | field = MRSAT_CODE if not self.load_on_cuis else MRSAT_CUI
531 | for att in mrsat.scan(filt=mrsat_filt):
532 | index = len(self.atts)
533 | self.atts_by_code[att[field]].append(index)
534 | self.atts.append(att)
535 | if DEBUG:
536 | sys.stderr.write("length atts: %d\n\n" % len(self.atts))
537 | sys.stderr.flush()
538 | #
539 | mrrank = UmlsTable("MRRANK",self.con)
540 | mrrank_filt = "SAB = '%s'"%self.ont_code
541 | for rank in mrrank.scan(filt=mrrank_filt):
542 | index = len(self.rank)
543 | self.rank_by_tty[rank[MRRANK_TTY]].append(index)
544 | self.rank.append(rank)
545 | if DEBUG:
546 | sys.stderr.write("length rank: %d\n\n" % len(self.rank))
547 | sys.stderr.flush()
548 | #
549 | load_mrsty = "SELECT sty.* FROM MRSTY sty, MRCONSO conso \
550 | WHERE conso.SAB = '%s' AND conso.cui = sty.cui AND conso.suppress = 'N'"
551 | load_mrsty %= self.ont_code
552 | mrsty = UmlsTable("MRSTY",self.con,load_select=load_mrsty)
553 | for sty in mrsty.scan(filt=None):
554 | index = len(self.sty)
555 | self.sty_by_cui[sty[MRSTY_CUI]].append(index)
556 | self.sty.append(sty)
557 | if DEBUG:
558 | sys.stderr.write("length sty: %d\n\n" % len(self.sty))
559 | sys.stderr.flush()
560 | # Track the loaded status
561 | self.loaded = True
562 | sys.stdout.write("%s tables loaded ...\n" % self.ont_code)
563 | sys.stdout.flush()
564 |
565 | def terms(self):
566 | if not self.loaded:
567 | self.load_tables()
568 | # Note: most UMLS ontologies are 'load_on_codes' (in the shipped umls.conf only HL7V3.0 and MEDLINEPLUS are load_on_cuis)
569 | for code in self.atoms_by_code:
570 | code_atoms = [self.atoms[row] for row in self.atoms_by_code[code]]
571 | field = MRCONSO_CUI if self.load_on_cuis else MRCONSO_AUI
572 | ids = [x[field] for x in code_atoms]
573 | rels = list()
574 | for _id in ids:
575 | rels += [self.rels[x] for x in self.rels_by_aui_src[_id]]
576 | rels_to_class = list()
577 | is_root = False
578 | if self.load_on_cuis:
579 | rels_to_class = rels
580 | for rel in rels_to_class:
581 | if rel[MRREL_CUI1] in self.cui_roots:
582 | is_root = True
583 | break
584 | else:
585 | for rel in rels:
586 | rel_with_codes = list(rel)
587 | aui_source = rel[MRREL_AUI2]
588 | aui_target = rel[MRREL_AUI1]
589 | code_source = [ get_code(self.atoms[x],self.load_on_cuis) \
590 | for x in self.atoms_by_aui[aui_source] ]
591 | code_target = [ get_code(self.atoms[x],self.load_on_cuis) \
592 | for x in self.atoms_by_aui[aui_target] ]
593 | # TODO: Check use of CUI1 (target) or CUI2 (source) here:
594 | if (rel[MRREL_CUI1] in self.cui_roots) and rel[MRREL_REL] == "CHD":
595 | is_root = True
596 | elif self.ont_code == "ICD10CM":
597 | # TODO: patch to fix ICD10-CM hierarchy.
598 | if rel[MRREL_CUI1] == "C3264380" and rel[MRREL_REL] == "CHD":
599 | is_root = True
600 |
601 | if len(code_source) != 1 or len(code_target) > 1:
602 | raise AttributeError("expected exactly one source code and at most one target code")
603 | if len(code_source) == 1 and len(code_target) == 1 and \
604 | code_source[0] != code_target[0]:
605 | code_source = code_source[0]
606 | code_target = code_target[0]
607 | # NOTE: the order of these append operations below is important.
608 | # get_rel_code_source() - it uses [-1]
609 | # get_rel_code_target() - it uses [-2]
610 | # which are called from UmlsClass.toRDF().
611 | rel_with_codes.append(code_target)
612 | rel_with_codes.append(code_source)
613 | rels_to_class.append(rel_with_codes)
614 | aui_codes_def = [self.defs_by_aui[_id] for _id in ids]
615 | aui_codes_def = [item for sublist in aui_codes_def for item in sublist]
616 | defs = [self.defs[index] for index in aui_codes_def]
617 | atts = [self.atts[x] for x in self.atts_by_code[code]]
618 |
619 | umls_class = UmlsClass(self.ns,atoms=code_atoms,rels=rels_to_class,
620 | defs=defs,atts=atts,rank=self.rank,rank_by_tty=self.rank_by_tty,
621 | sty=self.sty, sty_by_cui=self.sty_by_cui,
622 | load_on_cuis=self.load_on_cuis,is_root=is_root)
623 |
624 | # TODO: patch to fix roots in SNOMED-CT
625 | # suppress should fix this one
626 | #if self.ont_code == "SNOMEDCT_US":
627 | # umls_class.is_root = (umls_class.code() == "138875005")
628 |
629 | yield umls_class
630 |
631 | def write_into(self,file_path,hierarchy=True):
632 | sys.stdout.write("%s writing terms ... %s\n" % (self.ont_code, file_path))
633 | sys.stdout.flush()
634 | fout = codecs.open(file_path,"w","utf-8")
635 | #nterms = len(self.atoms_by_code)
636 | fout.write(PREFIXES)
637 | comment = "RDF Version of the UMLS ontology %s; " +\
638 | "converted with the UMLS2RDF tool " +\
639 | "(https://github.com/ncbo/umls2rdf), "+\
640 | "developed by the NCBO project."
641 | header_values = dict(
642 | label = self.ont_code,
643 | comment = comment % self.ont_code,
644 | versioninfo = conf.UMLS_VERSION,
645 | uri = self.ns
646 | )
647 | fout.write(ONTOLOGY_HEADER.substitute(header_values))
648 | for term in self.terms():
649 | try:
650 | rdf_text = term.toRDF(lang=UMLS_LANGCODE_MAP[self.lang],tree=self.tree)
651 | fout.write(rdf_text)
652 | except Exception as e:
653 | print("ERROR dumping term ", e)
654 |
655 | for att in term.properties():
656 | if att not in self.ont_properties:
657 | self.ont_properties[att] = term.properties()[att]
658 |
659 | return fout
660 |
661 | def properties(self):
662 | return self.ont_properties
663 |
664 | def write_properties(self,fout,property_docs):
665 | self.ont_properties["hasSTY"] =\
666 | UmlsAttribute(HAS_STY,"hasSTY")
667 | for p in self.ont_properties:
668 | prop = self.ont_properties[p]
669 | if "hasSTY" in p:
670 | fout.write(prop.toRDFWithDesc(
671 | "Semantic type UMLS property",
672 | "Semantic type UMLS property",
673 | "ObjectProperty"))
674 | continue
675 | doc = property_docs[prop.att]
676 | if "expanded_form" not in doc:
677 | raise AttributeError("expanded form not found in " + str(doc))
678 | _desc = doc["expanded_form"]
679 | if "inverse" in doc:
680 | _desc = "Inverse of " + doc["inverse"]
681 |
682 | _dockey = doc["dockey"]
683 | fout.write(prop.toRDF(_dockey,_desc))
684 |
685 | def write_semantic_types(self,sem_types,fout):
686 | fout.write(sem_types)
687 | fout.write("\n")
688 |
689 |
690 |
691 | if __name__ == "__main__":
692 |
693 | con = __get_connection()
694 |
695 | umls_conf = None
696 | with open("umls.conf","r") as fconf:
697 | umls_conf = [line.split(",") \
698 | for line in fconf.read().splitlines() \
699 | if len(line) > 0]
700 | umls_conf = [x for x in umls_conf if not x[0].startswith("#")]
701 | fconf.close()
702 |
703 | if not os.path.isdir(conf.OUTPUT_FOLDER):
704 | raise Exception("Output folder '%s' not found."%conf.OUTPUT_FOLDER)
705 |
706 | sem_types = generate_semantic_types(con,with_roots=True)
707 | output_file = os.path.join(conf.OUTPUT_FOLDER,"umls_semantictypes.ttl")
708 | with codecs.open(output_file,"w","utf-8") as semfile:
709 | semfile.write(PREFIXES)
710 | semfile.write(sem_types)
711 | semfile.flush()
712 | semfile.close()
713 |
714 | sem_types = generate_semantic_types(con,with_roots=False)
715 | mrdoc = UmlsTable("MRDOC", con)
716 | property_docs = dict()
717 | for doc_record in mrdoc.scan():
718 | _type = doc_record[MRDOC_TYPE]
719 | _expl = doc_record[MRDOC_DESC]
720 | _key = doc_record[MRDOC_VALUE]
721 | if _key not in property_docs:
722 | property_docs[_key] = dict()
723 | property_docs[_key]["dockey"] = doc_record[MRDOC_DOCKEY]
724 | if "inverse" in _type:
725 | _type = "inverse"
726 | property_docs[_key][_type] = _expl
727 |
728 | for (umls_code, file_out, load_on_field) in umls_conf:
729 | alt_uri_code = None
730 | if ";" in umls_code:
731 | umls_code,alt_uri_code = umls_code.split(";")
732 | if umls_code.startswith("#"):
733 | continue
734 | load_on_cuis = load_on_field == "load_on_cuis"
735 | output_file = os.path.join(conf.OUTPUT_FOLDER,file_out)
736 | sys.stdout.write("Generating %s (using '%s')\n" %
737 | (umls_code,load_on_field))
738 | sys.stdout.flush()
739 | ns = get_umls_url(umls_code if not alt_uri_code else alt_uri_code)
740 | ont = UmlsOntology(umls_code,ns,con,load_on_cuis=load_on_cuis)
741 | ont.load_tables()
742 | fout = ont.write_into(output_file,hierarchy=(ont.ont_code != "MSH"))
743 | ont.write_properties(fout,property_docs)
744 | if conf.INCLUDE_SEMANTIC_TYPES:
745 | ont.write_semantic_types(sem_types,fout)
746 | fout.close()
747 | sys.stdout.write("done!\n\n")
748 | sys.stdout.flush()
749 | sys.stdout.flush()
750 |
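The main block folds MRDOC rows into a `property_docs` dictionary that `write_properties` later uses to label each property and choose its OWL type. A minimal sketch of that folding step follows. The function names are hypothetical, and the sample rows are illustrative values modeled on the (DOCKEY, VALUE, TYPE, EXPL) column order used above, not data copied from a UMLS release.

```python
def build_property_docs(rows):
    """Group MRDOC-style rows by VALUE, as the main block does."""
    docs = {}
    for dockey, value, doc_type, expl in rows:
        entry = docs.setdefault(value, {})
        entry["dockey"] = dockey
        if "inverse" in doc_type:
            doc_type = "inverse"  # e.g. "rela_inverse" becomes "inverse"
        entry[doc_type] = expl
    return docs


def owl_property_type(dockey):
    """Mirror the DOCKEY test in UmlsAttribute.toRDF."""
    if "REL" in dockey:
        return "ObjectProperty"
    if dockey == "ATN":
        return "DatatypeProperty"
    raise ValueError("Unknown DOCKEY " + dockey)


# Illustrative rows only.
rows = [
    ("RELA", "isa", "expanded_form", "Is a"),
    ("RELA", "isa", "rela_inverse", "inverse_isa"),
    ("ATN", "MN", "expanded_form", "Example attribute description"),
]
docs = build_property_docs(rows)
print(docs["isa"])
print(owl_property_type(docs["isa"]["dockey"]))
print(owl_property_type(docs["MN"]["dockey"]))
```

Relationship keys such as REL and RELA map to object properties because their values are other classes, while ATN attributes carry literal values and map to datatype properties.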
--------------------------------------------------------------------------------