├── .gitignore ├── .project ├── .pydevproject ├── COPYING ├── README.md ├── checkOutputSyntax.sh ├── conf_sample.py ├── move_rdf_to_repo.py ├── output └── .gitignore ├── umls.conf └── umls2rdf.py /.gitignore: -------------------------------------------------------------------------------- 1 | conf.py 2 | *.pyc 3 | output/* 4 | 5 | *.swp 6 | sync* 7 | *.spf 8 | *.rsync 9 | .vscode 10 | -------------------------------------------------------------------------------- /.project: -------------------------------------------------------------------------------- 1 | 2 | 3 | UMLS2RDF 4 | 5 | 6 | 7 | 8 | 9 | org.python.pydev.PyDevBuilder 10 | 11 | 12 | 13 | 14 | 15 | org.python.pydev.pythonNature 16 | 17 | 18 | -------------------------------------------------------------------------------- /.pydevproject: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | /UMLS2RDF 7 | 8 | python 2.7 9 | Default 10 | 11 | -------------------------------------------------------------------------------- /COPYING: -------------------------------------------------------------------------------- 1 | Copyright 2005–2011, The Board of Trustees of Leland Stanford Junior University. All rights reserved. 2 | 3 | Redistribution and use in source and binary forms, with or without modification, are 4 | permitted provided that the following conditions are met: 5 | 6 | 1. Redistributions of source code must retain the above copyright notice, this list of 7 | conditions and the following disclaimer. 8 | 9 | 2. Redistributions in binary form must reproduce the above copyright notice, this list 10 | of conditions and the following disclaimer in the documentation and/or other materials 11 | provided with the distribution. 12 | 13 | THIS SOFTWARE IS PROVIDED BY ''AS IS'' AND ANY EXPRESS OR IMPLIED 14 | WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND 15 | FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL OR 16 | CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 17 | CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 18 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON 19 | ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 20 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 21 | ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 22 | 23 | The views and conclusions contained in the software and documentation are those of the 24 | authors and should not be interpreted as representing official policies, either expressed 25 | or implied, of The Board of Trustees of Leland Stanford Junior University. 26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | This project takes a MySQL Unified Medical Language System (UMLS) database and converts the ontologies to RDF using OWL and SKOS as the main schemas. 2 | 3 | Virtual Appliance users can review the [documentation in the OntoPortal Administration Guide](https://ontoportal.github.io/documentation/administration/ontologies/handling_umls). 4 | 5 | To use it: 6 | 7 | * Specify your database connection in conf.py (conf_sample.py provides a template) 8 | * Specify the SAB ontologies to export in umls.conf 9 | 10 | The umls.conf configuration file must contain one ontology per line. The lines are comma-separated tuples where the elements are: 11 | 12 | The following list needs updating. 13 |
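14 | (0) SAB. An alternate code for the ontology URI may follow after a semicolon (for example MSH;MESH).
15 | (1) BioPortal Virtual ID. Legacy field: umls2rdf.py reads exactly three fields per line (SAB, output file name, conversion strategy), so omit it, as the bundled umls.conf does.
16 | (2) Output file name
17 | (3) Conversion strategy. Accepted values: load_on_codes, load_on_cuis.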
18 | 
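The line format can be illustrated with a short Python sketch. It assumes the three-field layout used by the bundled umls.conf (SAB, output file name, conversion strategy) rather than the four elements listed above, and `parse_conf_line` is a hypothetical helper name, not a function in umls2rdf.py.

```python
def parse_conf_line(line):
    """Parse one umls.conf line.

    Returns (sab, alt_uri_code, output_file, strategy), or None for
    blank lines and lines commented out with '#'.
    Hypothetical helper: it mirrors the parsing done in umls2rdf.py
    but is not part of that script.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 3:
        raise ValueError("expected 3 comma-separated fields: %r" % line)
    sab, output_file, strategy = fields
    # "MSH;MESH" means: read SAB 'MSH', build URIs with 'MESH'.
    alt_uri_code = None
    if ";" in sab:
        sab, alt_uri_code = sab.split(";", 1)
    if strategy not in ("load_on_codes", "load_on_cuis"):
        raise ValueError("unknown conversion strategy: %r" % strategy)
    return sab, alt_uri_code, output_file, strategy


print(parse_conf_line("MSH;MESH,MESH.ttl,load_on_codes"))
print(parse_conf_line("HL7V3.0;HL7,HL7.ttl,load_on_cuis"))
```

Rejecting lines with the wrong number of fields up front gives a clearer error than the tuple-unpacking failure the main script would otherwise raise.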
19 | 20 | Note that 'CCS COSTAR DSM3R DSM4 DXP ICPC2ICD10ENG MCM MMSL MMX MTHCMSFRF MTHMST MTHSPL MTH NDFRT SNM' have no codes and should not be loaded with load_on_codes. 21 | 22 | umls2rdf.py is designed to be an offline, run-once process. 23 | It's memory intensive and exports all of the default ontologies in umls.conf in 3h 30min. 24 | The ontologies listed in umls.conf are the UMLS ontologies accessible in [BioPortal](https://bioportal.bioontology.org/). 25 | 26 | If you get an error when installing the MySQL-python Python library, https://stackoverflow.com/questions/12218229/my-config-h-file-not-found-when-intall-mysql-python-on-osx-10-8 may be of help. 27 | 28 | If running a Windows 10 OS with MySQL, the following tips may be of help. 29 | 30 | - Install [MySQL 5.5](https://dev.mysql.com/downloads/mysql/5.5.html#downloads) to avoid the InnoDB space [disclaimer](https://www.nlm.nih.gov/research/umls/implementation_resources/scripts/README_RRF_MySQL_Output_Stream.html) by NLM. 31 | - [Python 2.7.x](https://www.python.org/downloads/) should be used to avoid syntax errors on 'raise Attribute' 32 | - For installation of the MySQLdb module
python -m pip install MySQLdb
is error prone. Install with executable [MySQL-python-1.2.3.win-amd64-py2.7](http://www.codegood.com/archives/129) (last known location). 33 | - Create your RRF subset(s) using mmsys with the MySQL load option, load your database, edit conf.py and umls.conf to specifications, run umls2rdf.py 34 | -------------------------------------------------------------------------------- /checkOutputSyntax.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #=============================================================================== 3 | # 4 | # FILE: checkOutputSyntax.sh 5 | # 6 | # USAGE: ./checkOutputSyntax.sh 7 | # 8 | # DESCRIPTION: Apply rapper to check the syntax of turtle files created 9 | # by the umls2rdf.py conversion. 10 | # 11 | # OPTIONS: --- First argument is output path (default = 'output') 12 | # REQUIREMENTS: --- 13 | # BUGS: --- 14 | # NOTES: --- 15 | # AUTHOR: Darren L. Weber, Ph.D. (), darren.weber@stanford.edu 16 | # COMPANY: Stanford University 17 | # VERSION: 1.0 18 | # CREATED: 03/29/2013 11:28:18 AM PDT 19 | # REVISION: --- 20 | #=============================================================================== 21 | 22 | outputPath='output' 23 | if [ "$1" != "" ]; then 24 | outputPath="$1" 25 | fi 26 | 27 | if which rapper >/dev/null 2>&1; then 28 | for ttlFile in "${outputPath}"/*.ttl; do 29 | echo 30 | rapper -i turtle -c --show-graphs --show-namespaces "$ttlFile" 31 | done 32 | echo 33 | fi 34 | 35 | -------------------------------------------------------------------------------- /conf_sample.py: -------------------------------------------------------------------------------- 1 | #Folder to dump the RDF files.
2 | OUTPUT_FOLDER = "output" 3 | 4 | #DB Config 5 | DB_HOST = "your-host" 6 | DB_NAME = "umls2015ab" 7 | DB_USER = "your db user" 8 | DB_PASS = "your db pass" 9 | 10 | UMLS_VERSION = "2015ab" 11 | 12 | # Define the base URI used to generate the concepts URI 13 | UMLS_BASE_URI = "http://purl.bioontology.org/ontology/" 14 | 15 | # Include the semantic type concepts for each Ontology file generated 16 | INCLUDE_SEMANTIC_TYPES = True 17 | -------------------------------------------------------------------------------- /move_rdf_to_repo.py: -------------------------------------------------------------------------------- 1 | #Script utility to replace TTL files in BioPortal 2 | #Useful also for BioPortal appliances 3 | 4 | from os import listdir 5 | from os.path import isfile,isdir, join 6 | import pdb 7 | import glob 8 | import pdb 9 | import shutil 10 | 11 | REPO = "/srv/ncbo/repository" 12 | OUTPUT = "./output" 13 | 14 | ttl_files = glob.glob("%s/*.ttl"%(OUTPUT)) 15 | 16 | file_map = {} 17 | for ttl in ttl_files: 18 | acronym = ttl.split("/")[-1][0:-4] 19 | file_map[acronym] = ttl 20 | 21 | for acronym in file_map: 22 | ttl = file_map[acronym] 23 | dir_ont = join(REPO,acronym) 24 | if isdir(dir_ont): 25 | sub_dirs = glob.glob(join(dir_ont,"*")) 26 | latest = 0 27 | latest_subdir = None 28 | for sub_dir in sub_dirs: 29 | s = sub_dir.split("/")[-1] 30 | try: 31 | i = int(s) 32 | if i > latest: 33 | latest = i 34 | latest_subdir = sub_dir 35 | except ValueError: 36 | continue 37 | print("Latest for " + acronym + " is " + str(latest)) 38 | if latest_subdir: 39 | if isfile(join(latest_subdir,ttl.split("/")[-1])): 40 | shutil.copy2(ttl,latest_subdir) 41 | print("ttl found") 42 | else: 43 | print("ttl file not found for " + acronym) 44 | else: 45 | print("NOT Found " + dir_ont) 46 | 47 | -------------------------------------------------------------------------------- /output/.gitignore: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ncbo/umls2rdf/8d9f98a8ba2feed98c8c8667422c880349e853b0/output/.gitignore -------------------------------------------------------------------------------- /umls.conf: -------------------------------------------------------------------------------- 1 | AIR,AI-RHEUM.ttl,load_on_codes 2 | ATC,ATC.ttl,load_on_codes 3 | #CPT,CPT.ttl,load_on_codes. #disabled due to licensing restrictions in BioPortal 4 | CSP,CRISP.ttl,load_on_codes 5 | CST,COSTART.ttl,load_on_codes 6 | HCPCS,HCPCS.ttl,load_on_codes 7 | HL7V3.0;HL7,HL7.ttl,load_on_cuis 8 | ICD10,ICD10.ttl,load_on_codes 9 | ICD10CM,ICD10CM.ttl,load_on_codes 10 | ICD10PCS,ICD10PCS.ttl,load_on_codes 11 | ICD9CM;ICD9CM,ICD9CM.ttl,load_on_codes 12 | ICPC,ICPC.ttl,load_on_codes 13 | ICPC2P,ICPC2P.ttl,load_on_codes 14 | ICPCFRE,ICPCFRE.ttl,load_on_codes 15 | LNC,LOINC.ttl,load_on_codes 16 | #LOINC-VIEW,LOINC-VIEW.ttl, load_on_codes 17 | #MDDB,MDDB.ttl,load_on_codes #retired 18 | MDR;MEDDRA,MEDDRA.ttl,load_on_codes 19 | MDRFRE,MDRFRE.ttl,load_on_codes 20 | MDRGER,MDRGER.ttl,load_on_codes 21 | MEDLINEPLUS,MEDLINEPLUS.ttl,load_on_cuis 22 | MSH;MESH,MESH.ttl,load_on_codes 23 | MSHFRE,MSHFRE.ttl,load_on_codes 24 | MSHSPA,MSHSPA.ttl,load_on_codes 25 | MTHMST;MSTDE,MSTDE.ttl,load_on_codes 26 | MTHMSTFRE;MSTDE-FRE,MSTDE-FRE.ttl,load_on_codes 27 | NCBI;NCBITAXON,NCBITAXON.ttl,load_on_codes 28 | NDDF,NDDF.ttl,load_on_codes 29 | #NDFRT,NDFRT.ttl,load_on_codes #removed in 2018AA 30 | OMIM,OMIM.ttl,load_on_codes 31 | PDQ,PDQ.ttl,load_on_codes 32 | RCD,RCD.ttl,load_on_codes 33 | RXNORM,RXNORM.ttl,load_on_codes 34 | #RXNORM-VIEW,RXNORM-VIEW.ttl, load_on_codes 35 | SCTSPA,SCTSPA.ttl,load_on_codes 36 | SNMI,SNMI.ttl,load_on_codes 37 | #SNOMEDCT-CORE,SNOMEDCT-CORE.ttl, load_on_codes 38 | SNOMEDCT_US;SNOMEDCT,SNOMEDCT.ttl,load_on_codes 39 | VANDF,VANDF.ttl,load_on_codes 40 | WHO,WHO-ART.ttl,load_on_codes 41 | WHOFRE,WHOFRE.ttl,load_on_codes 42 | 
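A configuration like the one above can be checked against the README's list of sources that have no codes. The following Python sketch is an illustration only: `find_conflicts` is a hypothetical helper, the SAB list is copied from the README, and the check assumes that active lines carry no trailing comments.

```python
# Sources the README lists as having no codes.
NO_CODE_SABS = set(
    "CCS COSTAR DSM3R DSM4 DXP ICPC2ICD10ENG MCM MMSL MMX "
    "MTHCMSFRF MTHMST MTHSPL MTH NDFRT SNM".split()
)


def find_conflicts(conf_text):
    """Return the SABs configured with load_on_codes that the README
    says have no codes. Hypothetical helper, not part of umls2rdf.py."""
    conflicts = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and commented-out entries
        fields = line.split(",")
        if len(fields) != 3:
            continue  # malformed line; out of scope for this check
        sab = fields[0].split(";")[0].strip()
        strategy = fields[2].strip()
        if sab in NO_CODE_SABS and strategy == "load_on_codes":
            conflicts.append(sab)
    return conflicts


sample = """ATC,ATC.ttl,load_on_codes
MTHMST;MSTDE,MSTDE.ttl,load_on_codes
#NDFRT,NDFRT.ttl,load_on_codes #removed in 2018AA
"""
print(find_conflicts(sample))
```

With the entries listed above, this check would report MTHMST, which is configured with load_on_codes; whether that is intentional is not stated in the repository.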
-------------------------------------------------------------------------------- /umls2rdf.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | DEBUG = False 4 | 5 | import codecs 6 | import sys 7 | import os 8 | import urllib.request, urllib.parse, urllib.error 9 | from string import Template 10 | import collections 11 | import pymysql 12 | import pdb 13 | from functools import reduce 14 | from itertools import groupby 15 | 16 | try: 17 | import conf 18 | except: 19 | raise 20 | 21 | PREFIXES = """ 22 | @prefix skos: . 23 | @prefix owl: . 24 | @prefix rdfs: . 25 | @prefix xsd: . 26 | @prefix umls: . 27 | """ 28 | 29 | ONTOLOGY_HEADER = Template(""" 30 | <$uri> 31 | a owl:Ontology ; 32 | rdfs:comment "$comment" ; 33 | rdfs:label "$label" ; 34 | owl:imports ; 35 | owl:versionInfo "$versioninfo" . 36 | """) 37 | 38 | STY_URL = "http://bioportal.bioontology.org/ontologies/umls/sty/" 39 | HAS_STY = "umls:hasSTY" 40 | HAS_AUI = "umls:aui" 41 | HAS_CUI = "umls:cui" 42 | HAS_TUI = "umls:tui" 43 | 44 | MRCONSO_CODE = 13 45 | MRCONSO_AUI = 7 46 | MRCONSO_STR = 14 47 | MRCONSO_STT = 4 48 | MRCONSO_SCUI = 9 49 | MRCONSO_ISPREF = 6 50 | MRCONSO_TTY = 12 51 | MRCONSO_TS = 2 52 | MRCONSO_CUI = 0 53 | 54 | # http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/SNOMEDCT/relationships.html 55 | MRREL_AUI1 = 1 56 | MRREL_AUI2 = 5 57 | MRREL_CUI1 = 0 58 | MRREL_CUI2 = 4 59 | MRREL_REL = 3 60 | MRREL_RELA = 7 61 | 62 | MRDEF_AUI = 1 63 | MRDEF_DEF = 5 64 | MRDEF_CUI = 0 65 | 66 | MRSAT_CUI = 0 67 | MRSAT_CODE = 5 68 | MRSAT_ATV = 10 69 | MRSAT_ATN = 8 70 | 71 | MRDOC_DOCKEY = 0 72 | MRDOC_VALUE = 1 73 | MRDOC_TYPE = 2 74 | MRDOC_DESC = 3 75 | 76 | MRRANK_TTY = 2 77 | MRRANK_RANK = 0 78 | 79 | MRSTY_CUI = 0 80 | MRSTY_TUI = 1 81 | 82 | MRSAB_LAT = 19 83 | 84 | UMLS_LANGCODE_MAP = {"eng" : "en", "fre" : "fr", "cze" : "cz", "fin" : "fi", "ger" : "de", "ita" : "it", "jpn" : "jp", "pol" : "pl", "por" : "pt", "rus" 
: "ru", "spa" : "es", "swe" : "sw", "scr" : "hr", "dut" : "nl", "lav" : "lv", "hun" : "hu", "kor" : "kr", "dan" : "da", "nor" : "no", "heb" : "he", "baq" : "eu"} 85 | 86 | def get_umls_url(code): 87 | return "%s%s/"%(conf.UMLS_BASE_URI,code) 88 | 89 | def flatten(matrix): 90 | return reduce(lambda x,y: x+y,matrix) 91 | 92 | def escape(string): 93 | return string.replace("\\","\\\\").replace('"','\\"') 94 | 95 | def get_url_term(ns,code): 96 | if ns[-1] == '/': 97 | ret = ns + urllib.parse.quote(code) 98 | else: 99 | ret = "%s/%s"%(ns,urllib.parse.quote(code)) 100 | return ret 101 | 102 | def get_rel_fragment(rel): 103 | return rel[MRREL_RELA] if rel[MRREL_RELA] else rel[MRREL_REL] 104 | 105 | 106 | # NOTE: See UmlsOntology.terms() for the reason these functions use -1 and -2 107 | # indices to obtain the source and target codes, respectively. 108 | def get_rel_code_source(rel,on_cuis): 109 | return rel[-1] if not on_cuis else rel[MRREL_CUI2] 110 | def get_rel_code_target(rel,on_cuis): 111 | return rel[-2] if not on_cuis else rel[MRREL_CUI1] 112 | 113 | def get_code(reg,load_on_cuis): 114 | if load_on_cuis: 115 | return reg[MRCONSO_CUI] 116 | if reg[MRCONSO_CODE]: 117 | return reg[MRCONSO_CODE] 118 | raise AttributeError("No code on reg [%s]"%("|".join(reg))) 119 | 120 | def __get_connection(): 121 | return pymysql.connect(host=conf.DB_HOST,user=conf.DB_USER, 122 | passwd=conf.DB_PASS,db=conf.DB_NAME) 123 | 124 | def generate_semantic_types(con,with_roots=False): 125 | url = get_umls_url("STY") 126 | hierarchy = collections.defaultdict(lambda : list()) 127 | all_nodes = list() 128 | mrsty = UmlsTable("MRSTY",con, 129 | load_select="SELECT DISTINCT TUI, STN, STY FROM MRSTY") 130 | ont = list() 131 | 132 | for stt in mrsty.scan(): 133 | hierarchy[stt[1]].append(stt[0]) 134 | sty_term = """<%s> a owl:Class ; 135 | \tskos:notation "%s"^^xsd:string ; 136 | \tskos:prefLabel "%s"@en . 
137 | """%(url+stt[0],stt[0],stt[2]) 138 | ont.append(sty_term) 139 | all_nodes.append(stt) 140 | 141 | for node in all_nodes: 142 | parent = node[1] 143 | if "." in parent: 144 | parent = ".".join(node[1].split(".")[0:-1]) 145 | else: 146 | parent = parent[0:-1] 147 | 148 | rdfs_subclasses = [] 149 | for x in hierarchy[parent]: 150 | if node[0] != x: 151 | rdfs_subclasses.append( 152 | "<%s> rdfs:subClassOf <%s> ."%(url+node[0],url+x)) 153 | 154 | if len(rdfs_subclasses) == 0 and with_roots: 155 | rdfs_subclasses = ["<%s> rdfs:subClassOf owl:Thing ."%(url+node[0])] 156 | 157 | for sc in rdfs_subclasses: 158 | ont.append(sc) 159 | data_ont_ttl = "\n".join(ont) 160 | return data_ont_ttl 161 | 162 | 163 | 164 | 165 | class UmlsTable(object): 166 | def __init__(self,table_name,conn,load_select=None): 167 | self.table_name = table_name 168 | self.conn = conn 169 | self.page_size = 500000 170 | self.load_select = load_select 171 | 172 | def mesh_tree(self): 173 | q = """select DISTINCT c1.code as parent, c2.code as child 174 | from MRREL r, MRCONSO c1, MRCONSO c2 where r.sab = 'MSH' and r.rel = 'CHD' 175 | and c1.cui = r.cui1 176 | and c2.cui = r.cui2 177 | and c2.code like 'D%' 178 | and c1.code like 'D%' 179 | and c1.sab = 'MSH' 180 | and c2.sab = 'MSH' 181 | """ 182 | cursor = self.conn.cursor() 183 | cursor.execute(q) 184 | result = cursor.fetchall() 185 | edges = collections.defaultdict(set) 186 | for record in result: 187 | edges[record[1]].add(record[0]) 188 | return edges 189 | 190 | def count(self): 191 | q = "SELECT count(*) FROM %s"%self.table_name 192 | cursor = self.conn.cursor() 193 | cursor.execute(q) 194 | result = cursor.fetchall() 195 | for record in result: 196 | cursor.close() 197 | return int(record[0]) 198 | 199 | def scan(self,filt=None,limit=None): 200 | #c = self.count() 201 | i = 0 202 | page = 0 203 | cont = True 204 | cursor = self.conn.cursor() 205 | while cont: 206 | if self.load_select: 207 | q = self.load_select 208 | else: 209 | q = 
"SELECT * FROM %s WHERE %s LIMIT %s OFFSET %s"%(self.table_name,filt,self.page_size,page * self.page_size) 210 | if filt == None or len(filt) == 0: 211 | q = "SELECT * FROM %s LIMIT %s OFFSET %s"%(self.table_name,self.page_size,page * self.page_size) 212 | sys.stdout.write("[UMLS-Query] %s\n" % q) 213 | sys.stdout.flush() 214 | cursor.execute(q) 215 | result = cursor.fetchall() 216 | cont = False 217 | for record in result: 218 | cont = True 219 | i += 1 220 | yield record 221 | if limit and i >= limit: 222 | cont = False 223 | break 224 | # Do we already have all the rows available for the query? 225 | if self.load_select: 226 | cont = False 227 | elif not limit and i < self.page_size: 228 | cont = False 229 | page += 1 230 | cursor.close() 231 | 232 | 233 | 234 | class UmlsClass(object): 235 | def __init__(self,ns,atoms=None,rels=None, 236 | defs=None,atts=None,rank=None, 237 | rank_by_tty=None,sty=None, 238 | sty_by_cui=None,load_on_cuis=False, 239 | is_root=None): 240 | self.ns = ns 241 | self.atoms = atoms 242 | self.rels = rels 243 | self.defs = defs 244 | self.atts = atts 245 | self.rank = rank 246 | self.rank_by_tty = rank_by_tty 247 | self.sty = sty 248 | self.sty_by_cui = sty_by_cui 249 | self.load_on_cuis = load_on_cuis 250 | self.is_root = is_root 251 | self.class_properties = dict() 252 | 253 | def code(self): 254 | codes = set([get_code(x,self.load_on_cuis) for x in self.atoms]) 255 | if len(codes) != 1: 256 | raise AttributeError("Only one code per term.") 257 | #if DEBUG: 258 | #sys.stderr.write(self.atoms) 259 | #sys.stderr.write(codes) 260 | return codes.pop() 261 | 262 | def getAltLabels(self,prefLabel): 263 | #is_pref_atoms = filter(lambda x: x[MRCONSO_ISPREF] == 'Y', self.atoms) 264 | return set([atom[MRCONSO_STR] for atom in self.atoms if atom[MRCONSO_STR] != prefLabel]) 265 | 266 | def getPrefLabel(self): 267 | if self.load_on_cuis: 268 | if len(self.atoms) == 1: 269 | return self.atoms[0][MRCONSO_STR] 270 | 271 | labels = set([x[MRCONSO_STR] 
for x in self.atoms]) 272 | if len(labels) == 1: 273 | return labels.pop() 274 | 275 | is_pref_atoms = [x for x in self.atoms if x[MRCONSO_ISPREF] == 'Y'] 276 | if len(is_pref_atoms) == 0: 277 | return self.atoms[0][MRCONSO_STR] 278 | elif len(is_pref_atoms) == 1: 279 | return is_pref_atoms[0][MRCONSO_STR] 280 | 281 | is_pref_atoms = [x for x in is_pref_atoms if x[MRCONSO_STT] == 'PF'] 282 | if len(is_pref_atoms) == 0: 283 | return self.atoms[0][MRCONSO_STR] 284 | elif len(is_pref_atoms) == 1: 285 | return is_pref_atoms[0][MRCONSO_STR] 286 | 287 | is_pref_atoms = [x for x in self.atoms if x[MRCONSO_TTY][0] == 'P'] 288 | if len(is_pref_atoms) == 1: 289 | return is_pref_atoms[0][MRCONSO_STR] 290 | return self.atoms[0][MRCONSO_STR] 291 | else: 292 | #if ISPREF=Y is not 1 then we look into MRRANK. 293 | if len(self.rank) > 0: 294 | sort_key = \ 295 | lambda x: int(self.rank[self.rank_by_tty[x[MRCONSO_TTY]][0]][MRRANK_RANK]) 296 | mmrank_sorted_atoms = sorted(self.atoms,key=sort_key,reverse=True) 297 | return mmrank_sorted_atoms[0][MRCONSO_STR] 298 | #there is no rank to use 299 | else: 300 | pref_atom = [x for x in self.atoms if 'P' in x[MRCONSO_TTY]] 301 | if len(pref_atom) == 1: 302 | return pref_atom[0][MRCONSO_STR] 303 | raise AttributeError("Unable to select pref label") 304 | 305 | def getURLTerm(self,code): 306 | return get_url_term(self.ns,code) 307 | 308 | def properties(self): 309 | return self.class_properties 310 | 311 | def toRDF(self,fmt="Turtle",hierarchy=True,lang="en",tree=None): 312 | if not fmt == "Turtle": 313 | raise AttributeError("Only fmt='Turtle' is currently supported") 314 | term_code = self.code() 315 | url_term = self.getURLTerm(term_code) 316 | prefLabel = self.getPrefLabel() 317 | altLabels = self.getAltLabels(prefLabel) 318 | rdf_term = """<%s> a owl:Class ; 319 | \tskos:prefLabel \"\"\"%s\"\"\"@%s ; 320 | \tskos:notation \"\"\"%s\"\"\"^^xsd:string ; 321 | """%(url_term,escape(prefLabel),lang,escape(term_code)) 322 | 323 | if 
len(altLabels) > 0: 324 | rdf_term += """\tskos:altLabel %s ; 325 | """%(" , ".join(['\"\"\"%s\"\"\"@%s'%(escape(x),lang) for x in set(altLabels)])) 326 | 327 | if self.is_root: 328 | rdf_term += '\trdfs:subClassOf owl:Thing ;\n' 329 | 330 | if len(self.defs) > 0: 331 | rdf_term += """\tskos:definition %s ; 332 | """%(" , ".join(['\"\"\"%s\"\"\"@%s'%(escape(x[MRDEF_DEF]),lang) for x in set(self.defs)])) 333 | 334 | count_parents = 0 335 | if tree: 336 | if term_code in tree: 337 | for parent in tree[term_code]: 338 | o = self.getURLTerm(parent) 339 | rdf_term += "\trdfs:subClassOf <%s> ;\n" % (o,) 340 | for rel in self.rels: 341 | source_code = get_rel_code_source(rel,self.load_on_cuis) 342 | target_code = get_rel_code_target(rel,self.load_on_cuis) 343 | if source_code != term_code: 344 | raise AttributeError("Inconsistent code in rel") 345 | # Map child relations to rdf:subClassOf (skip parent relations). 346 | if rel[MRREL_REL] == 'PAR': 347 | continue 348 | if rel[MRREL_REL] == 'CHD' and hierarchy: 349 | o = self.getURLTerm(target_code) 350 | count_parents += 1 351 | if target_code == "ICD-10-CM": 352 | #skip bogus ICD10CM parent 353 | continue 354 | if target_code == "138875005": 355 | #skip bogus SNOMED root concept 356 | continue 357 | if target_code == "V-HL7V3.0" or target_code == "C1553931": 358 | #skip bogus HL7V3.0 root concept 359 | continue 360 | if not tree: 361 | rdf_term += "\trdfs:subClassOf <%s> ;\n" % (o,) 362 | else: 363 | p = self.getURLTerm(get_rel_fragment(rel)) 364 | o = self.getURLTerm(target_code) 365 | rdf_term += "\t<%s> <%s> ;\n" % (p,o) 366 | if p not in self.class_properties: 367 | self.class_properties[p] = \ 368 | UmlsAttribute(p,get_rel_fragment(rel)) 369 | 370 | for att in self.atts: 371 | atn = att[MRSAT_ATN] 372 | atv = att[MRSAT_ATV] 373 | if atn == 'AQ': 374 | # Skip all these values (they are replicated in MRREL for 375 | # SNOMEDCT, unknown relationship for MSH). 
376 | #if DEBUG: 377 | # sys.stderr.write("att: %s\n" % str(att)) 378 | # sys.stderr.flush() 379 | continue 380 | #MESH ROOTS ONLY DESCRIPTORS 381 | if tree and atn == "MN" and term_code.startswith("D"): 382 | if len(atv.split(".")) == 1: 383 | rdf_term += "\trdfs:subClassOf owl:Thing;\n" 384 | p = self.getURLTerm(atn) 385 | rdf_term += "\t<%s> \"\"\"%s\"\"\"^^xsd:string ;\n"%(p, escape(atv)) 386 | if p not in self.class_properties: 387 | self.class_properties[p] = UmlsAttribute(p,atn) 388 | 389 | #auis = set([x[MRCONSO_AUI] for x in self.atoms]) 390 | cuis = set([x[MRCONSO_CUI] for x in self.atoms]) 391 | sty_recs = flatten([indexes for indexes in [self.sty_by_cui[cui] for cui in cuis]]) 392 | types = [self.sty[index][MRSTY_TUI] for index in sty_recs] 393 | 394 | #for t in auis: 395 | # rdf_term += """\t%s \"\"\"%s\"\"\"^^xsd:string ;\n"""%(HAS_AUI,t) 396 | for t in cuis: 397 | rdf_term += """\t%s \"\"\"%s\"\"\"^^xsd:string ;\n"""%(HAS_CUI,t) 398 | for t in set(types): 399 | rdf_term += """\t%s \"\"\"%s\"\"\"^^xsd:string ;\n"""%(HAS_TUI,t) 400 | for t in set(types): 401 | rdf_term += """\t%s <%s> ;\n"""%(HAS_STY,get_umls_url("STY")+t) 402 | 403 | return rdf_term + " .\n\n" 404 | 405 | 406 | 407 | class UmlsAttribute(object): 408 | def __init__(self,uri,att): 409 | self.uri = uri 410 | self.att = att 411 | 412 | def getURLTerm(self,code): 413 | return get_url_term(self.ns,code) 414 | 415 | def toRDFWithDesc(self,label,desc,_type): 416 | uri_rdf = self.uri 417 | if self.uri.startswith("http"): 418 | uri_rdf = "<%s>"%self.uri 419 | return """%s a owl:%s ; 420 | \trdfs:label \"\"\"%s\"\"\"; 421 | \trdfs:comment \"\"\"%s\"\"\" . 
422 | \n"""%(uri_rdf,_type,label,escape(desc)) 423 | 424 | def toRDF(self,dockey,desc,fmt="Turtle"): 425 | if not fmt == "Turtle": 426 | raise AttributeError("Only fmt='Turtle' is currently supported") 427 | _type = "" 428 | if "REL" in dockey: 429 | _type = "ObjectProperty" 430 | elif dockey == "ATN": 431 | _type = "DatatypeProperty" 432 | else: 433 | raise AttributeError("Unknown DOCKEY" + dockey) 434 | 435 | label = self.att 436 | if len(desc) < 20: 437 | label = desc 438 | if "_" in label: 439 | label = " ".join(self.att.split("_")) 440 | label = label[0].upper() + label[1:] 441 | 442 | return """<%s> a owl:%s ; 443 | \trdfs:label \"\"\"%s\"\"\"; 444 | \trdfs:comment \"\"\"%s\"\"\" . 445 | \n"""%(self.uri,_type,label,escape(desc)) 446 | 447 | 448 | 449 | class UmlsOntology(object): 450 | def __init__(self,ont_code,ns,con,load_on_cuis=False): 451 | self.loaded = False 452 | self.ont_code = ont_code 453 | self.ns = ns 454 | self.con = con 455 | self.load_on_cuis = load_on_cuis 456 | #self.alt_uri_code = alt_uri_code 457 | self.atoms = list() 458 | self.atoms_by_code = collections.defaultdict(lambda : list()) 459 | if not self.load_on_cuis: 460 | self.atoms_by_aui = collections.defaultdict(lambda : list()) 461 | self.rels = list() 462 | self.rels_by_aui_src = collections.defaultdict(lambda : list()) 463 | self.defs = list() 464 | self.defs_by_aui = collections.defaultdict(lambda : list()) 465 | self.atts = list() 466 | self.atts_by_code = collections.defaultdict(lambda : list()) 467 | self.rank = list() 468 | self.rank_by_tty = collections.defaultdict(lambda : list()) 469 | self.sty = list() 470 | self.sty_by_cui = collections.defaultdict(lambda : list()) 471 | self.cui_roots = set() 472 | self.lang = None 473 | self.ont_properties = dict() 474 | 475 | def load_tables(self,limit=None): 476 | mrconso = UmlsTable("MRCONSO",self.con) 477 | if self.ont_code == 'MSH': 478 | self.tree = mrconso.mesh_tree() 479 | else: 480 | self.tree = None 481 | mrsab = 
UmlsTable("MRSAB", self.con) 482 | for sab_rec in mrsab.scan(filt="RSAB = '" + self.ont_code + "'", limit=1): 483 | self.lang = sab_rec[MRSAB_LAT].lower() 484 | mrconso_filt = "SAB = '%s' AND lat = '%s' AND SUPPRESS = 'N'"%( 485 | self.ont_code,self.lang) 486 | for atom in mrconso.scan(filt=mrconso_filt,limit=limit): 487 | index = len(self.atoms) 488 | self.atoms_by_code[get_code(atom,self.load_on_cuis)].append(index) 489 | if not self.load_on_cuis: 490 | self.atoms_by_aui[atom[MRCONSO_AUI]].append(index) 491 | self.atoms.append(atom) 492 | if DEBUG: 493 | sys.stderr.write("length atoms: %d\n" % len(self.atoms)) 494 | sys.stderr.write("length atoms_by_aui: %d\n" % len(self.atoms_by_aui)) 495 | sys.stderr.write("atom example: %s\n\n" % str(self.atoms)) 496 | sys.stderr.flush() 497 | 498 | mrconso_filt = "SAB = 'SRC' AND CODE = 'V-%s'"%self.ont_code 499 | for atom in mrconso.scan(filt=mrconso_filt,limit=limit): 500 | self.cui_roots.add(atom[MRCONSO_CUI]) 501 | if DEBUG: 502 | sys.stderr.write("length cui_roots: %d\n\n" % len(self.cui_roots)) 503 | sys.stderr.flush() 504 | 505 | # 506 | mrrel = UmlsTable("MRREL",self.con) 507 | mrrel_filt = "SAB = '%s' AND SUPPRESS = 'N'"%self.ont_code 508 | field = MRREL_AUI2 if not self.load_on_cuis else MRREL_CUI2 509 | for rel in mrrel.scan(filt=mrrel_filt,limit=limit): 510 | index = len(self.rels) 511 | self.rels_by_aui_src[rel[field]].append(index) 512 | self.rels.append(rel) 513 | if DEBUG: 514 | sys.stderr.write("length rels: %d\n\n" % len(self.rels)) 515 | sys.stderr.flush() 516 | # 517 | mrdef = UmlsTable("MRDEF",self.con) 518 | mrdef_filt = "SAB = '%s'"%self.ont_code 519 | field = MRDEF_AUI if not self.load_on_cuis else MRDEF_CUI 520 | for defi in mrdef.scan(filt=mrdef_filt): 521 | index = len(self.defs) 522 | self.defs_by_aui[defi[field]].append(index) 523 | self.defs.append(defi) 524 | if DEBUG: 525 | sys.stderr.write("length defs: %d\n\n" % len(self.defs)) 526 | sys.stderr.flush() 527 | # 528 | mrsat = 
UmlsTable("MRSAT",self.con) 529 | mrsat_filt = "SAB = '%s' AND CODE IS NOT NULL"%self.ont_code 530 | field = MRSAT_CODE if not self.load_on_cuis else MRSAT_CUI 531 | for att in mrsat.scan(filt=mrsat_filt): 532 | index = len(self.atts) 533 | self.atts_by_code[att[field]].append(index) 534 | self.atts.append(att) 535 | if DEBUG: 536 | sys.stderr.write("length atts: %d\n\n" % len(self.atts)) 537 | sys.stderr.flush() 538 | # 539 | mrrank = UmlsTable("MRRANK",self.con) 540 | mrrank_filt = "SAB = '%s'"%self.ont_code 541 | for rank in mrrank.scan(filt=mrrank_filt): 542 | index = len(self.rank) 543 | self.rank_by_tty[rank[MRRANK_TTY]].append(index) 544 | self.rank.append(rank) 545 | if DEBUG: 546 | sys.stderr.write("length rank: %d\n\n" % len(self.rank)) 547 | sys.stderr.flush() 548 | # 549 | load_mrsty = "SELECT sty.* FROM MRSTY sty, MRCONSO conso \ 550 | WHERE conso.SAB = '%s' AND conso.cui = sty.cui AND conso.suppress = 'N'" 551 | load_mrsty %= self.ont_code 552 | mrsty = UmlsTable("MRSTY",self.con,load_select=load_mrsty) 553 | for sty in mrsty.scan(filt=None): 554 | index = len(self.sty) 555 | self.sty_by_cui[sty[MRSTY_CUI]].append(index) 556 | self.sty.append(sty) 557 | if DEBUG: 558 | sys.stderr.write("length sty: %d\n\n" % len(self.sty)) 559 | sys.stderr.flush() 560 | # Track the loaded status 561 | self.loaded = True 562 | sys.stdout.write("%s tables loaded ...\n" % self.ont_code) 563 | sys.stdout.flush() 564 | 565 | def terms(self): 566 | if not self.loaded: 567 | self.load_tables() 568 | # Note: most UMLS ontologies are 'load_on_codes' (only HL7 is load_on_cuis) 569 | for code in self.atoms_by_code: 570 | code_atoms = [self.atoms[row] for row in self.atoms_by_code[code]] 571 | field = MRCONSO_CUI if self.load_on_cuis else MRCONSO_AUI 572 | ids = [x[field] for x in code_atoms] 573 | rels = list() 574 | for _id in ids: 575 | rels += [self.rels[x] for x in self.rels_by_aui_src[_id]] 576 | rels_to_class = list() 577 | is_root = False 578 | if self.load_on_cuis: 579 | 
rels_to_class = rels 580 | for rel in rels_to_class: 581 | if rel[MRREL_CUI1] in self.cui_roots: 582 | is_root = True 583 | break 584 | else: 585 | for rel in rels: 586 | rel_with_codes = list(rel) 587 | aui_source = rel[MRREL_AUI2] 588 | aui_target = rel[MRREL_AUI1] 589 | code_source = [ get_code(self.atoms[x],self.load_on_cuis) \ 590 | for x in self.atoms_by_aui[aui_source] ] 591 | code_target = [ get_code(self.atoms[x],self.load_on_cuis) \ 592 | for x in self.atoms_by_aui[aui_target] ] 593 | # TODO: Check use of CUI1 (target) or CUI2 (source) here: 594 | if (rel[MRREL_CUI1] in self.cui_roots) and rel[MRREL_REL] == "CHD": 595 | is_root = True 596 | elif self.ont_code == "ICD10CM": 597 | # TODO: patch to fix ICD10-CM hierachy. 598 | if rel[MRREL_CUI1] == "C3264380" and rel[MRREL_REL] == "CHD": 599 | is_root = True 600 | 601 | if len(code_source) != 1 or len(code_target) > 1: 602 | raise AttributeError("more than one or none codes") 603 | if len(code_source) == 1 and len(code_target) == 1 and \ 604 | code_source[0] != code_target[0]: 605 | code_source = code_source[0] 606 | code_target = code_target[0] 607 | # NOTE: the order of these append operations below is important. 608 | # get_rel_code_source() - it uses [-1] 609 | # get_rel_code_target() - it uses [-2] 610 | # which are called from UmlsClass.toRDF(). 
611 | rel_with_codes.append(code_target) 612 | rel_with_codes.append(code_source) 613 | rels_to_class.append(rel_with_codes) 614 | aui_codes_def = [self.defs_by_aui[_id] for _id in ids] 615 | aui_codes_def = [item for sublist in aui_codes_def for item in sublist] 616 | defs = [self.defs[index] for index in aui_codes_def] 617 | atts = [self.atts[x] for x in self.atts_by_code[code]] 618 | 619 | umls_class = UmlsClass(self.ns,atoms=code_atoms,rels=rels_to_class, 620 | defs=defs,atts=atts,rank=self.rank,rank_by_tty=self.rank_by_tty, 621 | sty=self.sty, sty_by_cui=self.sty_by_cui, 622 | load_on_cuis=self.load_on_cuis,is_root=is_root) 623 | 624 | # TODO: patch to fix roots in SNOMED-CT 625 | # suppress should fix this one 626 | #if self.ont_code == "SNOMEDCT_US": 627 | # umls_class.is_root = (umls_class.code() == "138875005") 628 | 629 | yield umls_class 630 | 631 | def write_into(self,file_path,hierarchy=True): 632 | sys.stdout.write("%s writing terms ... %s\n" % (self.ont_code, file_path)) 633 | sys.stdout.flush() 634 | fout = codecs.open(file_path,"w","utf-8") 635 | #nterms = len(self.atoms_by_code) 636 | fout.write(PREFIXES) 637 | comment = "RDF Version of the UMLS ontology %s; " +\ 638 | "converted with the UMLS2RDF tool " +\ 639 | "(https://github.com/ncbo/umls2rdf), "+\ 640 | "developed by the NCBO project." 
        header_values = dict(
            label = self.ont_code,
            comment = comment % self.ont_code,
            versioninfo = conf.UMLS_VERSION,
            uri = self.ns
        )
        fout.write(ONTOLOGY_HEADER.substitute(header_values))
        for term in self.terms():
            try:
                rdf_text = term.toRDF(lang=UMLS_LANGCODE_MAP[self.lang],tree=self.tree)
                fout.write(rdf_text)
            except Exception as e:
                print("ERROR dumping term ", e)

            for att in term.properties():
                if att not in self.ont_properties:
                    self.ont_properties[att] = term.properties()[att]

        # The file is returned open; the caller appends to it and closes it.
        return fout

    def properties(self):
        return self.ont_properties

    def write_properties(self,fout,property_docs):
        self.ont_properties["hasSTY"] =\
            UmlsAttribute(HAS_STY,"hasSTY")
        for p in self.ont_properties:
            prop = self.ont_properties[p]
            if "hasSTY" in p:
                fout.write(prop.toRDFWithDesc(
                    "Semantic type UMLS property",
                    "Semantic type UMLS property",
                    "ObjectProperty"))
                continue
            doc = property_docs[prop.att]
            if "expanded_form" not in doc:
                # doc is a dict, so convert it before concatenating.
                raise AttributeError("expanded form not found in " + str(doc))
            _desc = doc["expanded_form"]
            if "inverse" in doc:
                _desc = "Inverse of " + doc["inverse"]

            _dockey = doc["dockey"]
            fout.write(prop.toRDF(_dockey,_desc))

    def write_semantic_types(self,sem_types,fout):
        fout.write(sem_types)
        fout.write("\n")



if __name__ == "__main__":

    con = __get_connection()

    # umls.conf: one ontology per line, comma separated; "#" starts a comment line.
    umls_conf = None
    with open("umls.conf","r") as fconf:
        umls_conf = [line.split(",") \
                        for line in fconf.read().splitlines() \
                            if len(line) > 0]
        umls_conf = [x for x in umls_conf if not x[0].startswith("#")]

    if not os.path.isdir(conf.OUTPUT_FOLDER):
        raise Exception("Output folder '%s' not found."%conf.OUTPUT_FOLDER)

    sem_types = generate_semantic_types(con,with_roots=True)
    output_file = os.path.join(conf.OUTPUT_FOLDER,"umls_semantictypes.ttl")
    with codecs.open(output_file,"w","utf-8") as semfile:
        semfile.write(PREFIXES)
        semfile.write(sem_types)

    sem_types = generate_semantic_types(con,with_roots=False)
    mrdoc = UmlsTable("MRDOC", con)
    property_docs = dict()
    for doc_record in mrdoc.scan():
        _type = doc_record[MRDOC_TYPE]
        _expl = doc_record[MRDOC_DESC]
        _key = doc_record[MRDOC_VALUE]
        if _key not in property_docs:
            property_docs[_key] = dict()
            property_docs[_key]["dockey"] = doc_record[MRDOC_DOCKEY]
        if "inverse" in _type:
            _type = "inverse"
        property_docs[_key][_type] = _expl

    for (umls_code, file_out, load_on_field) in umls_conf:
        alt_uri_code = None
        if ";" in umls_code:
            umls_code,alt_uri_code = umls_code.split(";")
        if umls_code.startswith("#"):
            continue
        load_on_cuis = load_on_field == "load_on_cuis"
        output_file = os.path.join(conf.OUTPUT_FOLDER,file_out)
        sys.stdout.write("Generating %s (using '%s')\n" %
                (umls_code,load_on_field))
        sys.stdout.flush()
        ns = get_umls_url(umls_code if not alt_uri_code else alt_uri_code)
        ont = UmlsOntology(umls_code,ns,con,load_on_cuis=load_on_cuis)
        ont.load_tables()
        fout = ont.write_into(output_file,hierarchy=(ont.ont_code != "MSH"))
        ont.write_properties(fout,property_docs)
        if conf.INCLUDE_SEMANTIC_TYPES:
            ont.write_semantic_types(sem_types,fout)
        fout.close()
        sys.stdout.write("done!\n\n")
        sys.stdout.flush()

--------------------------------------------------------------------------------