├── BioC.dtd ├── COPYING ├── DISCLAIMER ├── README.md ├── src └── tmVarlib │ ├── BioCConverter.java │ ├── MentionRecognition.java │ ├── PostProcessing.java │ ├── PrefixTree.java │ └── tmVar.java └── tmBioC.key /BioC.dtd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | -------------------------------------------------------------------------------- /COPYING: -------------------------------------------------------------------------------- 1 | PUBLIC DOMAIN NOTICE 2 | National Center for Biotechnology Information 3 | 4 | This software/database is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as a United States Government employee and thus cannot be copyrighted. This software/database is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction. 5 | 6 | Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose. 7 | -------------------------------------------------------------------------------- /DISCLAIMER: -------------------------------------------------------------------------------- 1 | This tool shows the results of research conducted in the Computational Biology Branch, NCBI. 
The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # tmVar 3.0: an improved variant concept recognition and normalization tool 2 | 3 | We propose tmVar 3.0: an improved variant recognition and normalization system. Compared to its predecessors, tmVar 3.0 recognizes a wider spectrum of variant-related entities (e.g., allele and copy number variants) and groups different variant mentions belonging to the same genomic sequence position in an article for improved accuracy. Moreover, tmVar 3.0 provides advanced variant normalization options such as allele-specific identifiers from the ClinGen Allele Registry. tmVar 3.0 exhibits state-of-the-art performance, with over 90% F-measure for variant recognition and normalization, when evaluated on three independent benchmarking datasets. tmVar 3.0, as well as annotations for the entire PubMed and PMC datasets, is freely available for download on our FTP. 
4 | 5 | # tmVar 3.0 download from FTP 6 | 7 | - [tmVar 3.0 software](https://ftp.ncbi.nlm.nih.gov/pub/lu/tmVar3/tmVar3.tar.gz) 8 | - [tmVar 3.0 corpus](https://ftp.ncbi.nlm.nih.gov/pub/lu/tmVar3/tmVar3Corpus.txt) 9 | - [tmVar 3.0 annotation guideline](https://ftp.ncbi.nlm.nih.gov/pub/lu/tmVar3/AnnotationGuideline.rev.docx) 10 | 11 | # PubTator API to access tmVar 3.0 12 | 13 | We host a RESTful API (https://www.ncbi.nlm.nih.gov/research/pubtator/api.html) through which users can access the tmVar 3.0 results in PubMed/PMC. The "Process Raw Text" section of the API page also shows how to submit raw text for online processing. We provide sample code in Java, Python and Perl to help users quickly become familiar with the API service. 14 | 15 | # Acknowledgments 16 | 17 | This research was supported by the Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health. 18 | 19 | # Disclaimer 20 | 21 | This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available. 
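As a complement to the API section above, here is a minimal sketch of reading variant annotations from PubTator-format text, assuming the results you work with are in the layout that `BioCConverter` in this repository reads (`PMID|type|text` passage lines followed by tab-separated annotation rows). The class name `PubTatorSketch`, the `Annotation` holder, the PMID `12345` and the sample sentence are all hypothetical and only serve as illustration; this is not the parser shipped with tmVar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of tmVar: collects annotation rows from PubTator-format text.
public class PubTatorSketch {

    // Same passage pattern that BioCConverter uses: PMID|type|text
    private static final Pattern PASSAGE =
            Pattern.compile("^([^\\|\\t]+)\\|([^\\|\\t]+)\\|(.*)$");

    /** One tab-separated row: PMID, start, end, mention, type, optional identifier. */
    public static final class Annotation {
        public final String pmid;
        public final int start;
        public final int end;
        public final String mention;
        public final String type;
        public final String id;

        Annotation(String pmid, int start, int end, String mention, String type, String id) {
            this.pmid = pmid;
            this.start = start;
            this.end = end;
            this.mention = mention;
            this.type = type;
            this.id = id;
        }
    }

    public static List<Annotation> parseAnnotations(String pubtator) {
        List<Annotation> out = new ArrayList<>();
        for (String line : pubtator.split("\n")) {
            if (PASSAGE.matcher(line).find()) {
                continue; // title/abstract passage line
            }
            if (!line.contains("\t")) {
                continue; // blank line separating documents
            }
            String[] f = line.split("\t");
            if (f.length < 5) {
                continue; // malformed row, skipped in this sketch
            }
            String id = f.length > 5 ? f[5] : ""; // identifier column may be absent
            out.add(new Annotation(f[0], Integer.parseInt(f[1]), Integer.parseInt(f[2]),
                    f[3], f[4], id));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample document for illustration only.
        String sample = "12345|t|BRAF V600E in melanoma\n"
                + "12345\t5\t10\tV600E\tProteinMutation\tp|SUB|V|600|E\n"
                + "\n";
        for (Annotation a : parseAnnotations(sample)) {
            System.out.println(a.mention + " [" + a.type + "] " + a.start + "-" + a.end);
        }
    }
}
```

The offsets are kept exactly as they appear in the file; whether they are document-level or passage-level depends on how the file was produced, so check them against the passage text before relying on them.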
22 | -------------------------------------------------------------------------------- /src/tmVarlib/BioCConverter.java: -------------------------------------------------------------------------------- 1 | // 2 | // tmVar - Java version 3 | // BioC Format Converter 4 | // 5 | package tmVarlib; 6 | 7 | import bioc.BioCAnnotation; 8 | import bioc.BioCCollection; 9 | import bioc.BioCDocument; 10 | import bioc.BioCLocation; 11 | import bioc.BioCPassage; 12 | import bioc.io.BioCDocumentWriter; 13 | import bioc.io.BioCFactory; 14 | import bioc.io.woodstox.ConnectorWoodstox; 15 | 16 | import java.io.BufferedReader; 17 | import java.io.BufferedWriter; 18 | import java.io.FileInputStream; 19 | import java.io.FileNotFoundException; 20 | import java.io.FileOutputStream; 21 | import java.io.FileReader; 22 | import java.io.FileWriter; 23 | import java.io.IOException; 24 | import java.io.InputStreamReader; 25 | import java.io.OutputStreamWriter; 26 | import java.io.UnsupportedEncodingException; 27 | import java.time.LocalDate; 28 | import java.time.ZoneId; 29 | import java.util.Map; 30 | import java.util.regex.Matcher; 31 | import java.util.regex.Pattern; 32 | 33 | import javax.xml.stream.XMLStreamException; 34 | 35 | import java.util.ArrayList; 36 | import java.util.HashMap; 37 | 38 | public class BioCConverter 39 | { 40 | /* 41 | * Contexts in BioC file 42 | */ 43 | public ArrayList PMIDs=new ArrayList(); // Type: PMIDs 44 | public ArrayList> PassageNames = new ArrayList(); // PassageName 45 | public ArrayList> PassageOffsets = new ArrayList(); // PassageOffset 46 | public ArrayList> PassageContexts = new ArrayList(); // PassageContext 47 | public ArrayList>> Annotations = new ArrayList(); // Annotation - GNormPlus 48 | 49 | public String BioCFormatCheck(String InputFile) throws IOException 50 | { 51 | 52 | ConnectorWoodstox connector = new ConnectorWoodstox(); 53 | BioCCollection collection = new BioCCollection(); 54 | try 55 | { 56 | collection = connector.startRead(new 
InputStreamReader(new FileInputStream(InputFile), "UTF-8")); 57 | } 58 | catch (UnsupportedEncodingException | FileNotFoundException | XMLStreamException e) 59 | { 60 | BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(InputFile), "UTF-8")); 61 | String line=""; 62 | String status=""; 63 | String Pmid = ""; 64 | boolean tiabs=false; 65 | Pattern patt = Pattern.compile("^([^\\|\\t]+)\\|([^\\|\\t]+)\\|(.*)$"); 66 | while ((line = br.readLine()) != null) 67 | { 68 | Matcher mat = patt.matcher(line); 69 | if(mat.find()) //Title|Abstract 70 | { 71 | if(Pmid.equals("")) 72 | { 73 | Pmid = mat.group(1); 74 | } 75 | else if(!Pmid.equals(mat.group(1))) 76 | { 77 | return "[Error]: "+InputFile+" - A blank is needed between "+Pmid+" and "+mat.group(1)+"."; 78 | } 79 | status = "tiabs"; 80 | tiabs = true; 81 | } 82 | else if (line.contains("\t")) //Annotation 83 | { 84 | } 85 | else if(line.length()==0) //Processing 86 | { 87 | if(status.equals("")) 88 | { 89 | if(Pmid.equals("")) 90 | { 91 | return "[Error]: "+InputFile+" - It's not either BioC or PubTator format."; 92 | } 93 | else 94 | { 95 | return "[Error]: "+InputFile+" - A redundant blank is after "+Pmid+"."; 96 | } 97 | } 98 | Pmid=""; 99 | status=""; 100 | } 101 | } 102 | br.close(); 103 | if(tiabs == false) 104 | { 105 | return "[Error]: "+InputFile+" - It's not either BioC or PubTator format."; 106 | } 107 | if(status.equals("")) 108 | { 109 | return "PubTator"; 110 | } 111 | else 112 | { 113 | return "[Error]: "+InputFile+" - The last column missed a blank."; 114 | } 115 | } 116 | return "BioC"; 117 | } 118 | public void BioC2PubTator(String input,String output) throws IOException, XMLStreamException 119 | { 120 | /* 121 | * BioC2PubTator 122 | */ 123 | HashMap pmidlist = new HashMap(); // check if appear duplicate pmids 124 | boolean duplicate = false; 125 | BufferedWriter PubTatorOutputFormat = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output), "UTF-8")); 126 | 
ConnectorWoodstox connector = new ConnectorWoodstox(); 127 | BioCCollection collection = new BioCCollection(); 128 | collection = connector.startRead(new InputStreamReader(new FileInputStream(input), "UTF-8")); 129 | while (connector.hasNext()) 130 | { 131 | BioCDocument document = connector.next(); 132 | String PMID = document.getID(); 133 | if(pmidlist.containsKey(PMID)){System.out.println("\nError: duplicate pmid-"+PMID);duplicate = true;} 134 | else{pmidlist.put(PMID,"");} 135 | String Anno=""; 136 | int realpassage_offset=0; 137 | for (BioCPassage passage : document.getPassages()) 138 | { 139 | if(passage.getInfon("type").toLowerCase().equals("table")) 140 | { 141 | String temp=passage.getText().replaceAll(" ", ";"); 142 | temp=temp.replaceAll(";>;", " > "); 143 | PubTatorOutputFormat.write(PMID+"|"+passage.getInfon("type")+"|"+temp+"\n"); 144 | } 145 | else 146 | { 147 | String temp=passage.getText(); 148 | if(passage.getText().equals("")) 149 | { 150 | PubTatorOutputFormat.write(PMID+"|"+passage.getInfon("type")+"|"+"\n");//- No text - 151 | } 152 | else 153 | { 154 | PubTatorOutputFormat.write(PMID+"|"+passage.getInfon("type")+"|"+temp+"\n"); 155 | } 156 | } 157 | for (BioCAnnotation annotation : passage.getAnnotations()) 158 | { 159 | String Annoid = annotation.getInfon("identifier"); 160 | if(Annoid == null) 161 | { 162 | Annoid=annotation.getInfon("NCBI Gene"); 163 | } 164 | if(Annoid == null) 165 | { 166 | Annoid = annotation.getInfon("Identifier"); 167 | } 168 | String Annotype = annotation.getInfon("type"); 169 | int start = annotation.getLocations().get(0).getOffset(); 170 | start=start-(passage.getOffset()-realpassage_offset); 171 | int last = start + annotation.getLocations().get(0).getLength(); 172 | String AnnoMention=annotation.getText(); 173 | Anno=Anno+PMID+"\t"+start+"\t"+last+"\t"+AnnoMention+"\t"+Annotype+"\t"+Annoid+"\n"; 174 | } 175 | realpassage_offset=realpassage_offset+passage.getText().length()+1; 176 | } 177 | 
PubTatorOutputFormat.write(Anno+"\n"); 178 | } 179 | PubTatorOutputFormat.close(); 180 | if(duplicate == true){System.exit(0);} 181 | } 182 | public void PubTator2BioC(String input,String output) throws IOException, XMLStreamException 183 | { 184 | /* 185 | * PubTator2BioC 186 | */ 187 | String parser = BioCFactory.WOODSTOX; 188 | BioCFactory factory = BioCFactory.newFactory(parser); 189 | BioCDocumentWriter BioCOutputFormat = factory.createBioCDocumentWriter(new OutputStreamWriter(new FileOutputStream(output), "UTF-8")); 190 | BioCCollection biocCollection = new BioCCollection(); 191 | 192 | //time 193 | ZoneId zonedId = ZoneId.of( "America/Montreal" ); 194 | LocalDate today = LocalDate.now( zonedId ); 195 | biocCollection.setDate(today.toString()); 196 | 197 | biocCollection.setKey("BioC.key");//key 198 | biocCollection.setSource("tmVar");//source 199 | 200 | BioCOutputFormat.writeCollectionInfo(biocCollection); 201 | BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream(input), "UTF-8")); 202 | ArrayList<String> ParagraphType=new ArrayList<String>(); // Type: Title|Abstract 203 | ArrayList<String> ParagraphContent = new ArrayList<String>(); // Text 204 | ArrayList<String> annotations = new ArrayList<String>(); // Annotation 205 | String line; 206 | String Pmid=""; 207 | while ((line = inputfile.readLine()) != null) 208 | { 209 | Pattern patt = Pattern.compile("^([^\\|\\t]+)\\|([^\\|\\t]+)\\|(.*)$"); 210 | Matcher mat = patt.matcher(line); 211 | if(mat.find()) //Title|Abstract 212 | { 213 | Pmid = mat.group(1); ParagraphType.add(mat.group(2)); 214 | ParagraphContent.add(mat.group(3)); 215 | } 216 | else if (line.contains("\t")) //Annotation 217 | { 218 | String anno[]=line.split("\t"); 219 | annotations.add(anno[1]+"\t"+anno[2]+"\t"+anno[3]+"\t"+anno[4]+"\t"+anno[5]); 220 | } 221 | else if(line.length()==0) //Processing 222 | { 223 | BioCDocument biocDocument = new BioCDocument(); 224 | biocDocument.setID(Pmid); 225 | int startoffset=0; 226 | for(int i=0;i<ParagraphType.size();i++) 227 | { 228 | BioCPassage biocPassage = new BioCPassage(); 229 | Map<String, String> Infons = new HashMap<String, String>(); 230 | 
Infons.put("type", ParagraphType.get(i)); 231 | biocPassage.setInfons(Infons); 232 | biocPassage.setText(ParagraphContent.get(i)); 233 | biocPassage.setOffset(startoffset); 234 | startoffset=startoffset+ParagraphContent.get(i).length()+1; 235 | for(int j=0;j=startoffset-ParagraphContent.get(i).length()-1) 239 | { 240 | BioCAnnotation biocAnnotation = new BioCAnnotation(); 241 | Map AnnoInfons = new HashMap(); 242 | AnnoInfons.put("Identifier", anno[4]); 243 | AnnoInfons.put("type", anno[3]); 244 | biocAnnotation.setInfons(AnnoInfons); 245 | BioCLocation location = new BioCLocation(); 246 | location.setOffset(Integer.parseInt(anno[0])); 247 | location.setLength(Integer.parseInt(anno[1])-Integer.parseInt(anno[0])); 248 | biocAnnotation.setLocation(location); 249 | biocAnnotation.setText(anno[2]); 250 | biocPassage.addAnnotation(biocAnnotation); 251 | } 252 | } 253 | biocDocument.addPassage(biocPassage); 254 | } 255 | biocCollection.addDocument(biocDocument); 256 | ParagraphType.clear(); 257 | ParagraphContent.clear(); 258 | annotations.clear(); 259 | BioCOutputFormat.writeDocument(biocDocument); 260 | } 261 | } 262 | BioCOutputFormat.close(); 263 | inputfile.close(); 264 | } 265 | public void PubTator2BioC_AppendAnnotation(String inputPubTator,String inputBioc,String output) throws IOException, XMLStreamException 266 | { 267 | /* 268 | * PubTator2BioC 269 | */ 270 | 271 | //input: PubTator 272 | BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream(inputPubTator), "UTF-8")); 273 | HashMap ParagraphType_hash = new HashMap(); // Type: Title|Abstract 274 | HashMap ParagraphContent_hash = new HashMap(); // Text 275 | HashMap annotations_hash = new HashMap(); // Annotation 276 | String Annotation=""; 277 | String Pmid=""; 278 | String line=""; 279 | while ((line = inputfile.readLine()) != null) 280 | { 281 | Pattern patt = Pattern.compile("^([^\\|\\t]+)\\|([^\\|\\t]+)\\|(.*)$"); 282 | Matcher mat = patt.matcher(line); 283 | if(mat.find()) 
//Title|Abstract 284 | { 285 | Pmid=mat.group(1); 286 | ParagraphType_hash.put(Pmid,mat.group(2)); 287 | ParagraphContent_hash.put(Pmid,mat.group(3)); 288 | } 289 | else if (line.contains("\t")) //Annotation 290 | { 291 | if(Annotation.equals("")) 292 | { 293 | Annotation=line; 294 | } 295 | else 296 | { 297 | Annotation=Annotation+"\n"+line; 298 | } 299 | } 300 | else if(line.length()==0) //Processing 301 | { 302 | annotations_hash.put(Pmid,Annotation); 303 | Annotation=""; 304 | } 305 | } 306 | inputfile.close(); 307 | 308 | //output 309 | BioCDocumentWriter BioCOutputFormat = BioCFactory.newFactory(BioCFactory.WOODSTOX).createBioCDocumentWriter(new OutputStreamWriter(new FileOutputStream(output), "UTF-8")); 310 | BioCCollection biocCollection_input = new BioCCollection(); 311 | BioCCollection biocCollection_output = new BioCCollection(); 312 | 313 | //input: BioC 314 | ConnectorWoodstox connector = new ConnectorWoodstox(); 315 | biocCollection_input = connector.startRead(new InputStreamReader(new FileInputStream(inputBioc), "UTF-8")); 316 | BioCOutputFormat.writeCollectionInfo(biocCollection_input); 317 | while (connector.hasNext()) 318 | { 319 | int real_start_passage=0; 320 | BioCDocument document_output = new BioCDocument(); 321 | BioCDocument document_input = connector.next(); 322 | String PMID=document_input.getID(); 323 | document_output.setID(PMID); 324 | int annotation_count=0; 325 | for (BioCPassage passage_input : document_input.getPassages()) 326 | { 327 | String passage_input_Text = passage_input.getText(); 328 | 329 | BioCPassage passage_output = passage_input; 330 | //passage_output.clearAnnotations(); //clean the previous annotation 331 | for (BioCAnnotation annotation : passage_output.getAnnotations()) 332 | { 333 | annotation.setID(""+annotation_count); 334 | annotation_count++; 335 | } 336 | 337 | int start_passage=passage_input.getOffset(); 338 | int last_passage=passage_input.getOffset()+passage_input.getText().length(); 339 | 
if(annotations_hash.containsKey(PMID) && !annotations_hash.get(PMID).equals("")) 340 | { 341 | String Anno[]=annotations_hash.get(PMID).split("\\n"); 342 | for(int i=0;i5) 354 | { 355 | id=An[5]; 356 | } 357 | if((start>=start_passage && start=start_passage && last AnnoInfons = new HashMap(); 361 | AnnoInfons.put("Identifier", id); 362 | AnnoInfons.put("type", type); 363 | biocAnnotation.setInfons(AnnoInfons); 364 | 365 | /*redirect the offset*/ 366 | //location.setOffset(start); 367 | //location.setLength(last-start); 368 | String mention_tmp = mention.replaceAll("([^A-Za-z0-9@ ])", "\\\\$1"); 369 | Pattern patt = Pattern.compile("^(.*)("+mention_tmp+")(.*)$"); 370 | Matcher mat = patt.matcher(passage_input_Text); 371 | if(mat.find()) 372 | { 373 | String pre=mat.group(1); 374 | String men=mat.group(2); 375 | String post=mat.group(3); 376 | start=pre.length()+start_passage; 377 | BioCLocation location = new BioCLocation(); 378 | location.setOffset(start); 379 | location.setLength(men.length()); 380 | biocAnnotation.setLocation(location); 381 | biocAnnotation.setText(mention); 382 | biocAnnotation.setID(""+annotation_count); 383 | annotation_count++; 384 | passage_output.addAnnotation(biocAnnotation); 385 | men=men.replaceAll(".", "@"); 386 | passage_input_Text=pre+men+post; 387 | } 388 | } 389 | } 390 | } 391 | real_start_passage = real_start_passage + passage_input.getText().length() + 1; 392 | document_output.addPassage(passage_output); 393 | } 394 | biocCollection_output.addDocument(document_output); 395 | BioCOutputFormat.writeDocument(document_output); 396 | } 397 | BioCOutputFormat.close(); 398 | } 399 | public void BioCReaderWithAnnotation(String input) throws IOException, XMLStreamException 400 | { 401 | ConnectorWoodstox connector = new ConnectorWoodstox(); 402 | BioCCollection collection = new BioCCollection(); 403 | collection = connector.startRead(new InputStreamReader(new FileInputStream(input), "UTF-8")); 404 | 405 | /* 406 | * Per document 407 | */ 
408 | while (connector.hasNext()) 409 | { 410 | BioCDocument document = connector.next(); 411 | PMIDs.add(document.getID()); 412 | 413 | ArrayList PassageName= new ArrayList(); // array of Passage name 414 | ArrayList PassageOffset= new ArrayList(); // array of Passage offset 415 | ArrayList PassageContext= new ArrayList(); // array of Passage context 416 | ArrayList> AnnotationInPMID= new ArrayList(); // array of Annotations in the PassageName 417 | 418 | /* 419 | * Per Passage 420 | */ 421 | for (BioCPassage passage : document.getPassages()) 422 | { 423 | PassageName.add(passage.getInfon("type")); //Paragraph 424 | 425 | String txt = passage.getText(); 426 | if(txt.matches("[\t ]+")) 427 | { 428 | txt = txt.replaceAll(".","@"); 429 | } 430 | else 431 | { 432 | //if(passage.getInfon("type").toLowerCase().equals("table")) 433 | //{ 434 | // txt=txt.replaceAll(" ", "|"); 435 | //} 436 | txt = txt.replaceAll("ω","w"); 437 | txt = txt.replaceAll("μ","u"); 438 | txt = txt.replaceAll("κ","k"); 439 | txt = txt.replaceAll("α","a"); 440 | txt = txt.replaceAll("γ","g"); 441 | txt = txt.replaceAll("β","b"); 442 | txt = txt.replaceAll("×","x"); 443 | txt = txt.replaceAll("¹","1"); 444 | txt = txt.replaceAll("²","2"); 445 | txt = txt.replaceAll("°","o"); 446 | txt = txt.replaceAll("ö","o"); 447 | txt = txt.replaceAll("é","e"); 448 | txt = txt.replaceAll("à","a"); 449 | txt = txt.replaceAll("Á","A"); 450 | txt = txt.replaceAll("ε","e"); 451 | txt = txt.replaceAll("θ","O"); 452 | txt = txt.replaceAll("•","."); 453 | txt = txt.replaceAll("µ","u"); 454 | txt = txt.replaceAll("λ","r"); 455 | txt = txt.replaceAll("⁺","+"); 456 | txt = txt.replaceAll("ν","v"); 457 | txt = txt.replaceAll("ï","i"); 458 | txt = txt.replaceAll("ã","a"); 459 | txt = txt.replaceAll("≡","="); 460 | txt = txt.replaceAll("ó","o"); 461 | txt = txt.replaceAll("³","3"); 462 | txt = txt.replaceAll("〖","["); 463 | txt = txt.replaceAll("〗","]"); 464 | txt = txt.replaceAll("Å","A"); 465 | txt = 
txt.replaceAll("ρ","p"); 466 | txt = txt.replaceAll("ü","u"); 467 | txt = txt.replaceAll("ɛ","e"); 468 | txt = txt.replaceAll("č","c"); 469 | txt = txt.replaceAll("š","s"); 470 | txt = txt.replaceAll("ß","b"); 471 | txt = txt.replaceAll("═","="); 472 | txt = txt.replaceAll("£","L"); 473 | txt = txt.replaceAll("Ł","L"); 474 | txt = txt.replaceAll("ƒ","f"); 475 | txt = txt.replaceAll("ä","a"); 476 | txt = txt.replaceAll("–","-"); 477 | txt = txt.replaceAll("⁻","-"); 478 | txt = txt.replaceAll("〈","<"); 479 | txt = txt.replaceAll("〉",">"); 480 | txt = txt.replaceAll("χ","X"); 481 | txt = txt.replaceAll("Đ","D"); 482 | txt = txt.replaceAll("‰","%"); 483 | txt = txt.replaceAll("·","."); 484 | txt = txt.replaceAll("→",">"); 485 | txt = txt.replaceAll("←","<"); 486 | txt = txt.replaceAll("ζ","z"); 487 | txt = txt.replaceAll("π","p"); 488 | txt = txt.replaceAll("τ","t"); 489 | txt = txt.replaceAll("ξ","X"); 490 | txt = txt.replaceAll("η","h"); 491 | txt = txt.replaceAll("ø","0"); 492 | txt = txt.replaceAll("Δ","D"); 493 | txt = txt.replaceAll("∆","D"); 494 | txt = txt.replaceAll("∑","S"); 495 | txt = txt.replaceAll("Ω","O"); 496 | txt = txt.replaceAll("δ","d"); 497 | txt = txt.replaceAll("σ","s"); 498 | txt = txt.replaceAll("Φ","F"); 499 | //txt = txt.replaceAll("[^\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)\\_\\+\\{\\}\\|\\:\"\\<\\>\\?\\`\\-\\=\\[\\]\\;\\'\\,\\.\\/\\r\\n0-9a-zA-Z ]"," "); 500 | } 501 | if(passage.getText().equals("") || passage.getText().matches("[ ]+")) 502 | { 503 | PassageContext.add("-notext-"); //Context 504 | } 505 | else 506 | { 507 | PassageContext.add(txt); //Context 508 | } 509 | PassageOffset.add(passage.getOffset()); //Offset 510 | ArrayList AnnotationInPassage= new ArrayList(); // array of Annotations in the PassageName 511 | 512 | /* 513 | * Per Annotation : 514 | * start 515 | * last 516 | * mention 517 | * type 518 | * id 519 | */ 520 | for (BioCAnnotation Anno : passage.getAnnotations()) 521 | { 522 | int start = 
Anno.getLocations().get(0).getOffset()-passage.getOffset(); // start 523 | int last = start + Anno.getLocations().get(0).getLength(); // last 524 | String AnnoMention=Anno.getText(); // mention 525 | String Annotype = Anno.getInfon("type"); // type 526 | 527 | String Annoid=""; 528 | Map<String, String> Infons=Anno.getInfons(); 529 | for(String Infon :Infons.keySet()) 530 | { 531 | if(!Infon.toLowerCase().equals("type")) 532 | { 533 | if(Annoid.equals("")) 534 | { 535 | Annoid=Infons.get(Infon); 536 | } 537 | else 538 | { 539 | Annoid=Annoid+";"+Infons.get(Infon); 540 | } 541 | } 542 | } 543 | 544 | if(Annoid.equals("")) 545 | { 546 | AnnotationInPassage.add(start+"\t"+last+"\t"+AnnoMention+"\t"+Annotype); //paragraph 547 | } 548 | else 549 | { 550 | AnnotationInPassage.add(start+"\t"+last+"\t"+AnnoMention+"\t"+Annotype+"\t"+Annoid); //paragraph 551 | } 552 | } 553 | AnnotationInPMID.add(AnnotationInPassage); 554 | } 555 | PassageNames.add(PassageName); 556 | PassageContexts.add(PassageContext); 557 | PassageOffsets.add(PassageOffset); 558 | Annotations.add(AnnotationInPMID); 559 | } 560 | } 561 | public void BioCOutput(String input,String output) throws IOException, XMLStreamException 562 | { 563 | BioCDocumentWriter BioCOutputFormat = BioCFactory.newFactory(BioCFactory.WOODSTOX).createBioCDocumentWriter(new OutputStreamWriter(new FileOutputStream(output), "UTF-8")); 564 | BioCCollection biocCollection_input = new BioCCollection(); 565 | BioCCollection biocCollection_output = new BioCCollection(); 566 | 567 | //input: BioC 568 | ConnectorWoodstox connector = new ConnectorWoodstox(); 569 | biocCollection_input = connector.startRead(new InputStreamReader(new FileInputStream(input), "UTF-8")); 570 | BioCOutputFormat.writeCollectionInfo(biocCollection_input); 571 | int i=0; //count for pmid 572 | while (connector.hasNext()) 573 | { 574 | BioCDocument document_output = new BioCDocument(); 575 | BioCDocument document_input = connector.next(); 576 | 
document_output.setID(document_input.getID()); 577 | int annotation_count=0; 578 | int j=0; //count for paragraph 579 | for (BioCPassage passage_input : document_input.getPassages()) 580 | { 581 | BioCPassage passage_output = passage_input; 582 | passage_output.clearAnnotations(); //clean the previous annotation 583 | int passage_Offset = passage_input.getOffset(); 584 | String passage_Text = passage_input.getText(); 585 | ArrayList AnnotationInPassage = Annotations.get(i).get(j); 586 | for(int a=0;a AnnoInfons = new HashMap(); 595 | AnnoInfons.put("type", type); 596 | String identifier=""; 597 | if(Anno.length==5){identifier=Anno[4];} 598 | if(type.equals("Gene")) 599 | { 600 | AnnoInfons.put("NCBI Gene", identifier); 601 | } 602 | else if(type.equals("Species")) 603 | { 604 | AnnoInfons.put("NCBI Taxonomy", identifier); 605 | } 606 | else 607 | { 608 | AnnoInfons.put("Identifier", identifier); 609 | } 610 | biocAnnotation.setInfons(AnnoInfons); 611 | BioCLocation location = new BioCLocation(); 612 | location.setOffset(start+passage_Offset); 613 | location.setLength(last-start); 614 | biocAnnotation.setLocation(location); 615 | biocAnnotation.setText(mention); 616 | biocAnnotation.setID(""+annotation_count); 617 | annotation_count++; 618 | passage_output.addAnnotation(biocAnnotation); 619 | } 620 | document_output.addPassage(passage_output); 621 | j++; 622 | } 623 | biocCollection_output.addDocument(document_output); 624 | BioCOutputFormat.writeDocument(document_output); 625 | i++; 626 | } 627 | BioCOutputFormat.close(); 628 | } 629 | } -------------------------------------------------------------------------------- /src/tmVarlib/MentionRecognition.java: -------------------------------------------------------------------------------- 1 | // 2 | // tmVar - Java version 3 | // Feature Extraction 4 | // 5 | package tmVarlib; 6 | 7 | import java.io.*; 8 | import java.util.*; 9 | import java.util.regex.Matcher; 10 | import java.util.regex.Pattern; 11 | 12 | import 
org.tartarus.snowball.SnowballStemmer; 13 | import org.tartarus.snowball.ext.englishStemmer; 14 | 15 | import edu.stanford.nlp.tagger.maxent.MaxentTagger; 16 | 17 | public class MentionRecognition 18 | { 19 | public void FeatureExtraction(String Filename,String FilenameData,String FilenameLoca,String TrainTest) 20 | { 21 | /* 22 | * Feature Extraction 23 | */ 24 | try 25 | { 26 | //input 27 | BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream(Filename), "UTF-8")); 28 | //output 29 | BufferedWriter FileLocation = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(FilenameLoca), "UTF-8")); // .location 30 | BufferedWriter FileData = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(FilenameData), "UTF-8")); // .data 31 | //parameters 32 | String Pmid=""; 33 | ArrayList ParagraphType=new ArrayList(); // Type: Title|Abstract 34 | ArrayList ParagraphContent = new ArrayList(); // Text 35 | ArrayList annotations = new ArrayList(); // Annotation 36 | HashMap RegEx_HGVs_hash = new HashMap(); // RegEx_HGVs_hash 37 | HashMap character_hash = new HashMap(); 38 | String line; 39 | while ((line = inputfile.readLine()) != null) 40 | { 41 | 42 | Pattern pat = Pattern.compile("^([^\\|\\t]+)\\|([^\\|\\t]+)\\|(.*)$"); 43 | Matcher mat = pat.matcher(line); 44 | if(mat.find()) //Title|Abstract 45 | { 46 | Pmid = mat.group(1); 47 | ParagraphType.add(mat.group(2)); 48 | ParagraphContent.add(mat.group(3)); 49 | } 50 | else if (line.contains("\t")) //Annotation 51 | { 52 | String anno[]=line.split("\t"); 53 | if(anno.length>=6) 54 | { 55 | String mentiontype=anno[4]; 56 | if(mentiontype.equals("Gene")) 57 | { 58 | tmVar.GeneMention=true; 59 | } 60 | 61 | if(TrainTest.equals("Train")) 62 | { 63 | int start= Integer.parseInt(anno[1]); 64 | int last= Integer.parseInt(anno[2]); 65 | String mention=anno[3]; 66 | String component=anno[5]; 67 | 68 | Matcher m1 = tmVar.Pattern_Component_1.matcher(component); 69 | Matcher m2 = 
tmVar.Pattern_Component_2.matcher(component); 70 | Matcher m3 = tmVar.Pattern_Component_3.matcher(component); 71 | Matcher m4 = tmVar.Pattern_Component_4.matcher(component); 72 | Matcher m5 = tmVar.Pattern_Component_5.matcher(component); 73 | Matcher m6 = tmVar.Pattern_Component_6.matcher(component); 74 | 75 | for(int s=start;s"); 859 | Document_rev = Document_rev.replaceAll("χ","X"); 860 | Document_rev = Document_rev.replaceAll("Đ","D"); 861 | Document_rev = Document_rev.replaceAll("‰","%"); 862 | Document_rev = Document_rev.replaceAll("·","."); 863 | Document_rev = Document_rev.replaceAll("→",">"); 864 | Document_rev = Document_rev.replaceAll("←","<"); 865 | Document_rev = Document_rev.replaceAll("ζ","z"); 866 | Document_rev = Document_rev.replaceAll("π","p"); 867 | Document_rev = Document_rev.replaceAll("τ","t"); 868 | Document_rev = Document_rev.replaceAll("ξ","X"); 869 | Document_rev = Document_rev.replaceAll("η","h"); 870 | Document_rev = Document_rev.replaceAll("ø","0"); 871 | Document_rev = Document_rev.replaceAll("Δ","D"); 872 | Document_rev = Document_rev.replaceAll("∆","D"); 873 | Document_rev = Document_rev.replaceAll("∑","S"); 874 | Document_rev = Document_rev.replaceAll("Ω","O"); 875 | Document_rev = Document_rev.replaceAll("δ","d"); 876 | Document_rev = Document_rev.replaceAll("σ","s"); 877 | Document_rev = Document_rev.replaceAll("Φ","F"); 878 | //Document_rev = Document_rev.replaceAll("[^0-9A-Za-z\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)\\_\\+\\-\\=\\{\\}\\|\\[\\]\\\\\\:;'\\<\\>\\?\\,\\.\\/\\' ]+"," "); 879 | //tmVar.tagger = new MaxentTagger("lib/taggers/english-left3words-distsim.tagger"); 880 | String tagged=tmVar.tagger.tagString(Document_rev).replace("-LRB-", "(").replace("-RRB-", ")").replace("-LSB-", "[").replace("-RSB-", "]"); 881 | String tag_split[]=tagged.split(" "); 882 | HashMap POS=new HashMap(); 883 | for(int p=0;p0) 907 | { 908 | while(DocumentTmp.substring(0,1).matches("[\\t ]")) 909 | { 910 | DocumentTmp=DocumentTmp.substring(1); 911 | 
Offset++; 912 | } 913 | if(DocumentTmp.substring(0,TokensInDoc[i].length()).equals(TokensInDoc[i])) 914 | { 915 | if(TokensInDoc[i].length()>0) 916 | { 917 | DocumentTmp=DocumentTmp.substring(TokensInDoc[i].length()); 918 | 919 | /* 920 | * Feature Extration 921 | */ 922 | //PST 923 | String pos=POS.get(TokensInDoc[i]); 924 | if(pos == null || pos.equals("")) 925 | { 926 | pos = "_NULL_"; 927 | } 928 | 929 | //stemming 930 | tmVar.stemmer.setCurrent(TokensInDoc[i].toLowerCase()); 931 | tmVar.stemmer.stem(); 932 | String stem=tmVar.stemmer.getCurrent(); 933 | 934 | //Number of Numbers [0-9] 935 | String Num_num=""; 936 | String tmp=TokensInDoc[i]; 937 | tmp=tmp.replaceAll("[^0-9]",""); 938 | if(tmp.length()>3){Num_num="N:4+";}else{Num_num="N:"+ tmp.length();} 939 | 940 | //Number of Uppercase [A-Z] 941 | String Num_Uc=""; 942 | tmp=TokensInDoc[i]; 943 | tmp=tmp.replaceAll("[^A-Z]",""); 944 | if(tmp.length()>3){Num_Uc="U:4+";}else{Num_Uc="U:"+ tmp.length();} 945 | 946 | //Number of Lowercase [a-z] 947 | String Num_lc=""; 948 | tmp=TokensInDoc[i]; 949 | tmp=tmp.replaceAll("[^a-z]",""); 950 | if(tmp.length()>3){Num_lc="L:4+";}else{Num_lc="L:"+ tmp.length();} 951 | 952 | //Number of ALL char 953 | String Num_All=""; 954 | if(TokensInDoc[i].length()>3){Num_All="A:4+";}else{Num_All="A:"+ TokensInDoc[i].length();} 955 | 956 | //specific character (;:,.->+_) 957 | String SpecificC=""; 958 | tmp=TokensInDoc[i]; 959 | 960 | if(TokensInDoc[i].equals(";") || TokensInDoc[i].equals(":") || TokensInDoc[i].equals(",") || TokensInDoc[i].equals(".") || TokensInDoc[i].equals("-") || TokensInDoc[i].equals(">") || TokensInDoc[i].equals("+") || TokensInDoc[i].equals("_")) 961 | { 962 | SpecificC="-SpecificC1-"; 963 | } 964 | else if(TokensInDoc[i].equals("(") || TokensInDoc[i].equals(")")) 965 | { 966 | SpecificC="-SpecificC2-"; 967 | } 968 | else if(TokensInDoc[i].equals("{") || TokensInDoc[i].equals("}")) 969 | { 970 | SpecificC="-SpecificC3-"; 971 | } 972 | else 
                                    if(TokensInDoc[i].equals("[") || TokensInDoc[i].equals("]"))
                                    {
                                        SpecificC="-SpecificC4-";
                                    }
                                    else if(TokensInDoc[i].equals("\\") || TokensInDoc[i].equals("/"))
                                    {
                                        SpecificC="-SpecificC5-";
                                    }
                                    else
                                    {
                                        SpecificC="__nil__";
                                    }

                                    //chromosomal keytokens
                                    String ChroKey="";
                                    tmp=TokensInDoc[i];
                                    String pattern_ChroKey="^(q|p|qter|pter|XY|t)$";
                                    Pattern pattern_ChroKey_compile = Pattern.compile(pattern_ChroKey);
                                    Matcher pattern_ChroKey_compile_Matcher = pattern_ChroKey_compile.matcher(tmp);
                                    if(pattern_ChroKey_compile_Matcher.find())
                                    {
                                        ChroKey="-ChroKey-";
                                    }
                                    else
                                    {
                                        ChroKey="__nil__";
                                    }

                                    //Mutation type
                                    String MutatType="";
                                    tmp=TokensInDoc[i];
                                    tmp=tmp.toLowerCase();
                                    String pattern_MutatType="^(del|ins|dup|tri|qua|con|delins|indel)$";
                                    String pattern_FrameShiftType="(fs|fsX|fsx)";
                                    Pattern pattern_MutatType_compile = Pattern.compile(pattern_MutatType);
                                    Pattern pattern_FrameShiftType_compile = Pattern.compile(pattern_FrameShiftType);
                                    Matcher pattern_MutatType_compile_Matcher = pattern_MutatType_compile.matcher(tmp);
                                    Matcher pattern_FrameShiftType_compile_Matcher = pattern_FrameShiftType_compile.matcher(tmp);
                                    if(pattern_MutatType_compile_Matcher.find())
                                    {
                                        MutatType="-MutatType-";
                                    }
                                    else if(pattern_FrameShiftType_compile_Matcher.find())
                                    {
                                        MutatType="-FrameShiftType-";
                                    }
                                    else
                                    {
                                        MutatType="__nil__";
                                    }

                                    //Mutation word
                                    String MutatWord="";
                                    tmp=TokensInDoc[i];
                                    tmp=tmp.toLowerCase();
                                    String pattern_MutatWord="^(deletion|delta|elta|insertion|repeat|inversion|deletions|insertions|repeats|inversions)$";
                                    Pattern pattern_MutatWord_compile = Pattern.compile(pattern_MutatWord);
                                    Matcher pattern_MutatWord_compile_Matcher =
                                    pattern_MutatWord_compile.matcher(tmp);
                                    if(pattern_MutatWord_compile_Matcher.find())
                                    {
                                        MutatWord="-MutatWord-";
                                    }
                                    else
                                    {
                                        MutatWord="__nil__";
                                    }

                                    //Mutation article & basepair
                                    String MutatArticle="";
                                    tmp=TokensInDoc[i];
                                    tmp=tmp.toLowerCase();
                                    String pattern_base="^(single|a|one|two|three|four|five|six|seven|eight|nine|ten|[0-9]+)$";
                                    String pattern_Byte="^(kb|mb)$";
                                    String pattern_bp="(base|bases|pair|amino|acid|acids|codon|postion|postions|bp|nucleotide|nucleotides)";
                                    Pattern pattern_base_compile = Pattern.compile(pattern_base);
                                    Pattern pattern_Byte_compile = Pattern.compile(pattern_Byte);
                                    Pattern pattern_bp_compile = Pattern.compile(pattern_bp);
                                    Matcher pattern_base_compile_Matcher = pattern_base_compile.matcher(tmp);
                                    Matcher pattern_Byte_compile_Matcher = pattern_Byte_compile.matcher(tmp);
                                    Matcher pattern_bp_compile_Matcher = pattern_bp_compile.matcher(tmp);
                                    if(pattern_base_compile_Matcher.find())
                                    {
                                        MutatArticle="-Base-";
                                    }
                                    else if(pattern_Byte_compile_Matcher.find())
                                    {
                                        MutatArticle="-Byte-";
                                    }
                                    else if(pattern_bp_compile_Matcher.find())
                                    {
                                        MutatArticle="-bp-";
                                    }
                                    else
                                    {
                                        MutatArticle="__nil__";
                                    }

                                    //Type1
                                    String Type1="";
                                    tmp=TokensInDoc[i];
                                    tmp=tmp.toLowerCase();
                                    String pattern_Type1="^[cgrm]$";
                                    String pattern_Type1_2="^(ivs|ex|orf)$";
                                    Pattern pattern_Type1_compile = Pattern.compile(pattern_Type1);
                                    Pattern pattern_Type1_2_compile = Pattern.compile(pattern_Type1_2);
                                    Matcher pattern_Type1_compile_Matcher = pattern_Type1_compile.matcher(tmp);
                                    Matcher pattern_Type1_2_compile_Matcher = pattern_Type1_2_compile.matcher(tmp);
                                    if(pattern_Type1_compile_Matcher.find())
                                    {
                                        Type1="-Type1-";
                                    }
                                    else
                                    if(pattern_Type1_2_compile_Matcher.find())
                                    {
                                        Type1="-Type1_2-";
                                    }
                                    else
                                    {
                                        Type1="__nil__";
                                    }

                                    //Type2
                                    String Type2="";
                                    tmp=TokensInDoc[i];

                                    if(tmp.equals("p"))
                                    {
                                        Type2="-Type2-";
                                    }
                                    else
                                    {
                                        Type2="__nil__";
                                    }

                                    //DNA symbols
                                    String DNASym="";
                                    tmp=TokensInDoc[i];
                                    String pattern_DNASym="^[ATCGUatcgu]$";
                                    Pattern pattern_DNASym_compile = Pattern.compile(pattern_DNASym);
                                    Matcher pattern_DNASym_compile_Matcher = pattern_DNASym_compile.matcher(tmp);
                                    if(pattern_DNASym_compile_Matcher.find())
                                    {
                                        DNASym="-DNASym-";
                                    }
                                    else
                                    {
                                        DNASym="__nil__";
                                    }

                                    //Protein symbols
                                    String ProteinSym="";
                                    String lastToken="";
                                    tmp=TokensInDoc[i];
                                    if(i>0){lastToken=TokensInDoc[i-1];}
                                    String pattern_ProteinSymFull="(glutamine|glutamic|leucine|valine|isoleucine|lysine|alanine|glycine|aspartate|methionine|threonine|histidine|aspartic|asparticacid|arginine|asparagine|tryptophan|proline|phenylalanine|cysteine|serine|glutamate|tyrosine|stop|frameshift)";
                                    String pattern_ProteinSymTri="^(cys|ile|ser|gln|met|asn|pro|lys|asp|thr|phe|ala|gly|his|leu|arg|trp|val|glu|tyr|fs|fsx)$";
                                    String pattern_ProteinSymTriSub="^(ys|le|er|ln|et|sn|ro|ys|sp|hr|he|la|ly|is|eu|rg|rp|al|lu|yr)$";
                                    String pattern_ProteinSymChar="^[CISQMNPKDTFAGHLRWVEYX]$";
                                    String pattern_lastToken="^[CISQMNPKDTFAGHLRWVEYX]$";
                                    Pattern pattern_ProteinSymFull_compile = Pattern.compile(pattern_ProteinSymFull);
                                    Matcher pattern_ProteinSymFull_compile_Matcher = pattern_ProteinSymFull_compile.matcher(tmp);
                                    Pattern pattern_ProteinSymTri_compile = Pattern.compile(pattern_ProteinSymTri);
                                    Matcher pattern_ProteinSymTri_compile_Matcher = pattern_ProteinSymTri_compile.matcher(tmp);
                                    Pattern
                                    pattern_ProteinSymTriSub_compile = Pattern.compile(pattern_ProteinSymTriSub);
                                    Matcher pattern_ProteinSymTriSub_compile_Matcher = pattern_ProteinSymTriSub_compile.matcher(tmp);
                                    Pattern pattern_ProteinSymChar_compile = Pattern.compile(pattern_ProteinSymChar);
                                    Matcher pattern_ProteinSymChar_compile_Matcher = pattern_ProteinSymChar_compile.matcher(tmp);
                                    Pattern pattern_lastToken_compile = Pattern.compile(pattern_lastToken);
                                    Matcher pattern_lastToken_compile_Matcher = pattern_lastToken_compile.matcher(lastToken);

                                    if(pattern_ProteinSymFull_compile_Matcher.find())
                                    {
                                        ProteinSym="-ProteinSymFull-";
                                    }
                                    else if(pattern_ProteinSymTri_compile_Matcher.find())
                                    {
                                        ProteinSym="-ProteinSymTri-";
                                    }
                                    else if(pattern_ProteinSymTriSub_compile_Matcher.find() && pattern_lastToken_compile_Matcher.find() && !Document.substring(Offset-1,Offset).equals(" "))
                                    {
                                        ProteinSym="-ProteinSymTriSub-";
                                    }
                                    else if(pattern_ProteinSymChar_compile_Matcher.find())
                                    {
                                        ProteinSym="-ProteinSymChar-";
                                    }
                                    else
                                    {
                                        ProteinSym="__nil__";
                                    }

                                    //RS
                                    String RScode="";
                                    tmp=TokensInDoc[i];
                                    String pattern_RScode="^(rs|RS|Rs)$";
                                    Pattern pattern_RScode_compile = Pattern.compile(pattern_RScode);
                                    Matcher pattern_RScode_compile_Matcher = pattern_RScode_compile.matcher(tmp);
                                    if(pattern_RScode_compile_Matcher.find())
                                    {
                                        RScode="-RScode-";
                                    }
                                    else
                                    {
                                        RScode="__nil__";
                                    }

                                    //Patterns
                                    String Pattern1=TokensInDoc[i];
                                    String Pattern2=TokensInDoc[i];
                                    String Pattern3=TokensInDoc[i];
                                    String Pattern4=TokensInDoc[i];
                                    Pattern1=Pattern1.replaceAll("[A-Z]","A");
                                    Pattern1=Pattern1.replaceAll("[a-z]","a");
                                    Pattern1=Pattern1.replaceAll("[0-9]","0");
                                    Pattern1="P1:"+Pattern1;
                                    Pattern2=Pattern2.replaceAll("[A-Za-z]","a");
                                    Pattern2=Pattern2.replaceAll("[0-9]","0");
                                    Pattern2="P2:"+Pattern2;
                                    Pattern3=Pattern3.replaceAll("[A-Z]+","A");
                                    Pattern3=Pattern3.replaceAll("[a-z]+","a");
                                    Pattern3=Pattern3.replaceAll("[0-9]+","0");
                                    Pattern3="P3:"+Pattern3;
                                    Pattern4=Pattern4.replaceAll("[A-Za-z]+","a");
                                    Pattern4=Pattern4.replaceAll("[0-9]+","0");
                                    Pattern4="P4:"+Pattern4;

                                    //prefix
                                    String prefix="";
                                    tmp=TokensInDoc[i];
                                    if(tmp.length()>=1){ prefix=tmp.substring(0, 1);}else{prefix="__nil__";}
                                    if(tmp.length()>=2){ prefix=prefix+" "+tmp.substring(0, 2);}else{prefix=prefix+" __nil__";}
                                    if(tmp.length()>=3){ prefix=prefix+" "+tmp.substring(0, 3);}else{prefix=prefix+" __nil__";}
                                    if(tmp.length()>=4){ prefix=prefix+" "+tmp.substring(0, 4);}else{prefix=prefix+" __nil__";}
                                    if(tmp.length()>=5){ prefix=prefix+" "+tmp.substring(0, 5);}else{prefix=prefix+" __nil__";}


                                    //suffix
                                    String suffix="";
                                    tmp=TokensInDoc[i];
                                    if(tmp.length()>=1){ suffix=tmp.substring(tmp.length()-1, tmp.length());}else{suffix="__nil__";}
                                    if(tmp.length()>=2){ suffix=suffix+" "+tmp.substring(tmp.length()-2, tmp.length());}else{suffix=suffix+" __nil__";}
                                    if(tmp.length()>=3){ suffix=suffix+" "+tmp.substring(tmp.length()-3, tmp.length());}else{suffix=suffix+" __nil__";}
                                    if(tmp.length()>=4){ suffix=suffix+" "+tmp.substring(tmp.length()-4, tmp.length());}else{suffix=suffix+" __nil__";}
                                    if(tmp.length()>=5){ suffix=suffix+" "+tmp.substring(tmp.length()-5, tmp.length());}else{suffix=suffix+" __nil__";}

                                    /*
                                     * Print out: .data
                                     */
                                    FileData.write(TokensInDoc[i]+" "+stem+" "+pos+" "+Num_num+" "+Num_Uc+" "+Num_lc+" "+Num_All+" "+SpecificC+" "+ChroKey+" "+MutatType+" "+MutatWord+" "+MutatArticle+" "+Type1+" "+Type2+" "+DNASym+" "+ProteinSym+" "+RScode+" "+Pattern1+" "+Pattern2+" "+Pattern3+" "+Pattern4+" "+prefix+" "+suffix);
                                    if(RegEx_HGVs_hash.containsKey(Offset))
                                    {
                                        FileData.write(" "+RegEx_HGVs_hash.get(Offset));
                                    }
                                    else
                                    {
                                        FileData.write(" O");
                                    }
                                    if(TrainTest.equals("Train")) // Test
                                    {
                                        if(character_hash.containsKey(Offset))
                                        {
                                            FileData.write(" "+character_hash.get(Offset));
                                        }
                                        else
                                        {
                                            FileData.write(" O");
                                        }
                                    }
                                    FileData.write("\n");

                                    /*
                                     * Print out: .location
                                     */
                                    FileLocation.write(Pmid+"\t"+TokensInDoc[i]+"\t"+(Offset+1)+"\t"+(Offset+TokensInDoc[i].length())+"\n");

                                    Offset=Offset+TokensInDoc[i].length();
                                }
                            }
                            else
                            {
                                System.out.println("Error! String not match: '"+DocumentTmp.substring(0,TokensInDoc[i].length())+"'\t'"+TokensInDoc[i]+"'");
                            }
                        }
                    }

                    FileLocation.write("\n");
                    FileData.write("\n");

                    ParagraphType.clear();
                    ParagraphContent.clear();
                    annotations.clear();
                    RegEx_HGVs_hash.clear();
                    character_hash.clear();
                }
            }

            inputfile.close();
            FileLocation.close();
            FileData.close();
        }
        catch(IOException e1){ System.out.println("[MR]: Input file is not exist.");}
    }

    /*
     * Testing by CRF++
     */
    public void CRF_test(String FilenameData,String FilenameOutput,String TrainTest) throws IOException
    {
        File f = new File(FilenameOutput);
        BufferedWriter fr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(f), "UTF-8"));

        Runtime runtime = Runtime.getRuntime();

        String OS=System.getProperty("os.name").toLowerCase();
        String model="MentionExtractionUB.Model";

        String cmd="./CRF/crf_test -m CRF/"+model+" -o "+FilenameOutput+" "+FilenameData;
        if(TrainTest.equals("Test_FullText"))
        {
            model="MentionExtractionUB.fulltext.Model";
        }
        //Rebuild the command so that it uses the model selected above and the path style of the host OS.
        if(OS.contains("windows"))
        {
            cmd ="CRF/crf_test -m CRF/"+model+" -o "+FilenameOutput+" "+FilenameData;
        }
        else //if(OS.contains("nux")||OS.contains("nix"))
        {
            cmd ="./CRF/crf_test -m CRF/"+model+" -o "+FilenameOutput+" "+FilenameData;
        }

        try {
            Process process = runtime.exec(cmd);
            InputStream is = process.getInputStream();
            InputStreamReader isr = new InputStreamReader(is);
            BufferedReader br = new BufferedReader(isr);
            String line="";
            //Copy the standard output of crf_test into the output file.
            while ( (line = br.readLine()) != null)
            {
                fr.write(line);
                fr.newLine();
                fr.flush();
            }
            is.close();
            isr.close();
            br.close();
            fr.close();
        }
        catch (IOException e) {
            System.out.println(e);
            runtime.exit(0);
        }
    }

    /*
     * Learning model by CRF++
     */
    public void CRF_learn(String FilenameData) throws IOException
    {
        Process process = null;
        String line = null;
        InputStream is = null;
        InputStreamReader isr = null;
        BufferedReader br = null;

        Runtime runtime = Runtime.getRuntime();
        String OS=System.getProperty("os.name").toLowerCase();
        String cmd="./CRF/crf_learn -f 3 -c 4.0 CRF/template_UB "+FilenameData+" CRF/MentionExtractionUB.Model.new";
        if(OS.contains("windows"))
        {
            cmd ="CRF/crf_learn -f 3 -c 4.0 CRF/template_UB "+FilenameData+" CRF/MentionExtractionUB.Model.new";
        }
        else //if(OS.contains("nux")||OS.contains("nix"))
        {
            cmd ="./CRF/crf_learn -f 3 -c 4.0 CRF/template_UB "+FilenameData+" CRF/MentionExtractionUB.Model.new";
        }

        try {
            process = runtime.exec(cmd);
            is = process.getInputStream();
            isr = new InputStreamReader(is);
            br = new BufferedReader(isr);
            //Echo the standard output of crf_learn to the console.
            while ( (line = br.readLine()) !=
            null)
            {
                System.out.println(line);
                System.out.flush();
            }
            is.close();
            isr.close();
            br.close();
        }
        catch (IOException e) {
            System.out.println(e);
            runtime.exit(0);
        }
    }
}

--------------------------------------------------------------------------------
/src/tmVarlib/PrefixTree.java:
--------------------------------------------------------------------------------
/**
 * Project:
 * Function: Dictionary lookup by Prefix Tree
 */

package tmVarlib;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.*;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrefixTree
{
    private Tree Tr=new Tree();

    /**
     * Hash2Tree(HashMap<String, String> ID2Names)
     * Dictionary2Tree_Combine(String Filename,String StopWords,String MentionType)
     * Dictionary2Tree_UniqueGene(String Filename,String StopWords) //olr1831ps 10116:*405718
     * TreeFile2Tree(String Filename)
     *
     */
    public static HashMap<String, String> StopWord_hash = new HashMap<String, String>();

    public void Hash2Tree(HashMap<String, String> ID2Names)
    {
        for(String ID : ID2Names.keySet())
        {
            Tr.insertMention(ID2Names.get(ID),ID);
        }
    }

    /*
     * Type Identifier Names
     * Species 9606 ttdh3pv|igl027/99|igl027/98|sw 1463
     */
    public void Dictionary2Tree(String Filename,String StopWords)
    {
        try
        {
            /** Stop Word */
            BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(StopWords), "UTF-8"));
            String line="";
            while ((line = br.readLine()) != null)
            {
                StopWord_hash.put(line, "StopWord");
            }
            br.close();

            BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream(Filename), "UTF-8"));
            line="";
            while ((line =
inputfile.readLine()) != null) 57 | { 58 | line = line.replaceAll("ω","w");line = line.replaceAll("μ","u");line = line.replaceAll("κ","k");line = line.replaceAll("α","a");line = line.replaceAll("γ","r");line = line.replaceAll("β","b");line = line.replaceAll("×","x");line = line.replaceAll("¹","1");line = line.replaceAll("²","2");line = line.replaceAll("°","o");line = line.replaceAll("ö","o");line = line.replaceAll("é","e");line = line.replaceAll("à","a");line = line.replaceAll("Á","A");line = line.replaceAll("ε","e");line = line.replaceAll("θ","O");line = line.replaceAll("•",".");line = line.replaceAll("µ","u");line = line.replaceAll("λ","r");line = line.replaceAll("⁺","+");line = line.replaceAll("ν","v");line = line.replaceAll("ï","i");line = line.replaceAll("ã","a");line = line.replaceAll("≡","=");line = line.replaceAll("ó","o");line = line.replaceAll("³","3");line = line.replaceAll("〖","[");line = line.replaceAll("〗","]");line = line.replaceAll("Å","A");line = line.replaceAll("ρ","p");line = line.replaceAll("ü","u");line = line.replaceAll("ɛ","e");line = line.replaceAll("č","c");line = line.replaceAll("š","s");line = line.replaceAll("ß","b");line = line.replaceAll("═","=");line = line.replaceAll("£","L");line = line.replaceAll("Ł","L");line = line.replaceAll("ƒ","f");line = line.replaceAll("ä","a");line = line.replaceAll("–","-");line = line.replaceAll("⁻","-");line = line.replaceAll("〈","<");line = line.replaceAll("〉",">");line = line.replaceAll("χ","X");line = line.replaceAll("Đ","D");line = line.replaceAll("‰","%");line = line.replaceAll("·",".");line = line.replaceAll("→",">");line = line.replaceAll("←","<");line = line.replaceAll("ζ","z");line = line.replaceAll("π","p");line = line.replaceAll("τ","t");line = line.replaceAll("ξ","X");line = line.replaceAll("η","h");line = line.replaceAll("ø","0");line = line.replaceAll("Δ","D");line = line.replaceAll("∆","D");line = line.replaceAll("∑","S");line = line.replaceAll("Ω","O");line = line.replaceAll("δ","d");line 
= line.replaceAll("σ","s"); 59 | String Column[]=line.split("\t",-1); 60 | if(Column.length>2) 61 | { 62 | String ConceptType=Column[0]; 63 | String ConceptID=Column[1]; 64 | String ConceptNames=Column[2]; 65 | /* 66 | * Specific usage for Species 67 | */ 68 | if( ConceptType.equals("Species")) 69 | { 70 | ConceptID=ConceptID.replace("species:ncbi:",""); 71 | ConceptNames=ConceptNames.replaceAll(" strain=", " "); 72 | ConceptNames=ConceptNames.replaceAll("[\\W\\-\\_](str.|strain|substr.|substrain|var.|variant|subsp.|subspecies|pv.|pathovars|pathovar|br.|biovar)[\\W\\-\\_]", " "); 73 | ConceptNames=ConceptNames.replaceAll("[\\(\\)]", " "); 74 | } 75 | 76 | String NameColumn[]=ConceptNames.split("\\|"); 77 | for(int i=0;i=3) 88 | { 89 | String tmp_mention=NameColumn[i].toLowerCase(); 90 | if(!StopWord_hash.containsKey(tmp_mention)) 91 | { 92 | Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID); 93 | } 94 | } 95 | } 96 | /* 97 | * Specific usage for Gene & Cell 98 | */ 99 | else if ((ConceptType.equals("Gene") || ConceptType.equals("Cell")) ) 100 | { 101 | if ( (!NameColumn[i].substring(0, 1).matches("[\\W\\-\\_]")) && tmp.length()>=3) 102 | { 103 | String tmp_mention=NameColumn[i].toLowerCase(); 104 | if(!StopWord_hash.containsKey(tmp_mention)) 105 | { 106 | Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID); 107 | } 108 | } 109 | } 110 | /* 111 | * Other Concepts 112 | */ 113 | else 114 | { 115 | if ( (!NameColumn[i].equals("")) && (!NameColumn[i].substring(0, 1).matches("[\\W\\-\\_]")) ) 116 | { 117 | String tmp_mention=NameColumn[i].toLowerCase(); 118 | if(!StopWord_hash.containsKey(tmp_mention)) 119 | { 120 | Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID); 121 | } 122 | } 123 | } 124 | } 125 | } 126 | else 127 | { 128 | System.out.println("[Dictionary2Tree_Combine]: Lexicon format error! 
Please follow : Type | Identifier | Names (Identifier can be NULL)"); 129 | } 130 | } 131 | inputfile.close(); 132 | } 133 | catch(IOException e1){ System.out.println("[Dictionary2Tree_Combine]: Input file is not exist.");} 134 | } 135 | 136 | /* 137 | * Type Identifier Names 138 | * Species 9606 ttdh3pv|igl027/99|igl027/98|sw 1463 139 | * 140 | * @ Prefix 141 | */ 142 | public void Dictionary2Tree(String Filename,String StopWords,String Prefix) 143 | { 144 | try 145 | { 146 | /** Stop Word */ 147 | BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(StopWords), "UTF-8")); 148 | String line=""; 149 | while ((line = br.readLine()) != null) 150 | { 151 | StopWord_hash.put(line, "StopWord"); 152 | } 153 | br.close(); 154 | 155 | /** Parsing Input */ 156 | BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream(Filename), "UTF-8")); 157 | line=""; 158 | while ((line = inputfile.readLine()) != null) 159 | { 160 | line = line.replaceAll("ω","w");line = line.replaceAll("μ","u");line = line.replaceAll("κ","k");line = line.replaceAll("α","a");line = line.replaceAll("γ","r");line = line.replaceAll("β","b");line = line.replaceAll("×","x");line = line.replaceAll("¹","1");line = line.replaceAll("²","2");line = line.replaceAll("°","o");line = line.replaceAll("ö","o");line = line.replaceAll("é","e");line = line.replaceAll("à","a");line = line.replaceAll("Á","A");line = line.replaceAll("ε","e");line = line.replaceAll("θ","O");line = line.replaceAll("•",".");line = line.replaceAll("µ","u");line = line.replaceAll("λ","r");line = line.replaceAll("⁺","+");line = line.replaceAll("ν","v");line = line.replaceAll("ï","i");line = line.replaceAll("ã","a");line = line.replaceAll("≡","=");line = line.replaceAll("ó","o");line = line.replaceAll("³","3");line = line.replaceAll("〖","[");line = line.replaceAll("〗","]");line = line.replaceAll("Å","A");line = line.replaceAll("ρ","p");line = line.replaceAll("ü","u");line = 
line.replaceAll("ɛ","e");line = line.replaceAll("č","c");line = line.replaceAll("š","s");line = line.replaceAll("ß","b");line = line.replaceAll("═","=");line = line.replaceAll("£","L");line = line.replaceAll("Ł","L");line = line.replaceAll("ƒ","f");line = line.replaceAll("ä","a");line = line.replaceAll("–","-");line = line.replaceAll("⁻","-");line = line.replaceAll("〈","<");line = line.replaceAll("〉",">");line = line.replaceAll("χ","X");line = line.replaceAll("Đ","D");line = line.replaceAll("‰","%");line = line.replaceAll("·",".");line = line.replaceAll("→",">");line = line.replaceAll("←","<");line = line.replaceAll("ζ","z");line = line.replaceAll("π","p");line = line.replaceAll("τ","t");line = line.replaceAll("ξ","X");line = line.replaceAll("η","h");line = line.replaceAll("ø","0");line = line.replaceAll("Δ","D");line = line.replaceAll("∆","D");line = line.replaceAll("∑","S");line = line.replaceAll("Ω","O");line = line.replaceAll("δ","d");line = line.replaceAll("σ","s"); 161 | String Column[]=line.split("\t"); 162 | if(Column.length>2) 163 | { 164 | String ConceptType=Column[0]; 165 | String ConceptID=Column[1]; 166 | String ConceptNames=Column[2]; 167 | 168 | /* 169 | * Specific usage for Species 170 | */ 171 | if( ConceptType.equals("Species")) 172 | { 173 | ConceptID=ConceptID.replace("species:ncbi:",""); 174 | ConceptNames=ConceptNames.replaceAll(" strain=", " "); 175 | ConceptNames=ConceptNames.replaceAll("[\\W\\-\\_](str.|strain|substr.|substrain|var.|variant|subsp.|subspecies|pv.|pathovars|pathovar|br.|biovar)[\\W\\-\\_]", " "); 176 | ConceptNames=ConceptNames.replaceAll("[\\(\\)]", " "); 177 | } 178 | String NameColumn[]=ConceptNames.split("\\|"); 179 | 180 | for(int i=0;i=3 ) 191 | { 192 | String tmp_mention=NameColumn[i].toLowerCase(); 193 | if(!StopWord_hash.containsKey(tmp_mention)) 194 | { 195 | if(Prefix.equals("")) 196 | { 197 | Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID); 198 | } 199 | else if(Prefix.equals("Num") && 
                                    NameColumn[i].toLowerCase().matches("[0-9].*"))
                                    {
                                        Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
                                    }
                                    else if(NameColumn[i].length()>=2 && NameColumn[i].toLowerCase().substring(0,2).equals(Prefix))
                                    {
                                        Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
                                    }
                                    else if(Prefix.equals("Other")
                                            && (!NameColumn[i].toLowerCase().matches("[0-9].*"))
                                            && (!NameColumn[i].toLowerCase().matches("[a-z][a-z].*"))
                                            )
                                    {
                                        Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
                                    }
                                }
                            }
                        }
                        /*
                         * Specific usage for Gene & Cell
                         */
                        else if ((ConceptType.equals("Gene") || ConceptType.equals("Cell")))
                        {
                            if( (!NameColumn[i].substring(0, 1).matches("[\\W\\-\\_]")) && tmp.length()>=3 )
                            {
                                String tmp_mention=NameColumn[i].toLowerCase();
                                if(!StopWord_hash.containsKey(tmp_mention))
                                {
                                    if(Prefix.equals(""))
                                    {
                                        Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
                                    }
                                    else if(Prefix.equals("Num") && NameColumn[i].toLowerCase().matches("[0-9].*"))
                                    {
                                        Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
                                    }
                                    else if(NameColumn[i].length()>=2 && NameColumn[i].toLowerCase().substring(0,2).equals(Prefix))
                                    {
                                        Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
                                    }
                                    else if(Prefix.equals("Other")
                                            && (!NameColumn[i].toLowerCase().matches("[0-9].*"))
                                            && (!NameColumn[i].toLowerCase().matches("[a-z][a-z].*"))
                                            )
                                    {
                                        Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
                                    }
                                }
                            }
                        }
                        /*
                         * Other Concepts
                         */
                        else
                        {
                            if ( (!NameColumn[i].equals("")) && (!NameColumn[i].substring(0, 1).matches("[\\W\\-\\_]")) )
                            {
                                String tmp_mention=NameColumn[i].toLowerCase();
                                if(!StopWord_hash.containsKey(tmp_mention))
                                {
                                    if(Prefix.equals(""))
                                    {
                                        Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
                                    }
                                    else if(Prefix.equals("Num") && NameColumn[i].toLowerCase().matches("[0-9].*"))
                                    {
                                        Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
                                    }
                                    //note: this branch tests length()>2, while the Species and Gene/Cell branches test length()>=2
                                    else if(NameColumn[i].length()>2 && NameColumn[i].toLowerCase().substring(0,2).equals(Prefix))
                                    {
                                        Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
                                    }
                                    else if(Prefix.equals("Other")
                                            && (!NameColumn[i].toLowerCase().matches("[0-9].*"))
                                            && (!NameColumn[i].toLowerCase().matches("[a-z][a-z].*"))
                                            )
                                    {
                                        Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
                                    }
                                }
                            }
                        }
                    }
                }
                else
                {
                    System.out.println("[Dictionary2Tree_Combine]: Lexicon format error! Please follow : Type | Identifier | Names (Identifier can be NULL)");
                }
            }
            inputfile.close();
        }
        catch(IOException e1){ System.out.println("[Dictionary2Tree_Combine]: Input file is not exist.");}
    }
    public void TreeFile2Tree(String Filename)
    {
        try
        {
            BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream(Filename), "UTF-8"));
            String line="";
            int count=0;
            while ((line = inputfile.readLine()) != null)
            {
                line = line.replaceAll("ω","w");line = line.replaceAll("μ","u");line = line.replaceAll("κ","k");line = line.replaceAll("α","a");line = line.replaceAll("γ","r");line = line.replaceAll("β","b");line = line.replaceAll("×","x");line = line.replaceAll("¹","1");line = line.replaceAll("²","2");line = line.replaceAll("°","o");line = line.replaceAll("ö","o");line = line.replaceAll("é","e");line = line.replaceAll("à","a");line = line.replaceAll("Á","A");line = line.replaceAll("ε","e");line = line.replaceAll("θ","O");line = line.replaceAll("•",".");line = line.replaceAll("µ","u");line = line.replaceAll("λ","r");line = line.replaceAll("⁺","+");line =
line.replaceAll("ν","v");line = line.replaceAll("ï","i");line = line.replaceAll("ã","a");line = line.replaceAll("≡","=");line = line.replaceAll("ó","o");line = line.replaceAll("³","3");line = line.replaceAll("〖","[");line = line.replaceAll("〗","]");line = line.replaceAll("Å","A");line = line.replaceAll("ρ","p");line = line.replaceAll("ü","u");line = line.replaceAll("ɛ","e");line = line.replaceAll("č","c");line = line.replaceAll("š","s");line = line.replaceAll("ß","b");line = line.replaceAll("═","=");line = line.replaceAll("£","L");line = line.replaceAll("Ł","L");line = line.replaceAll("ƒ","f");line = line.replaceAll("ä","a");line = line.replaceAll("–","-");line = line.replaceAll("⁻","-");line = line.replaceAll("〈","<");line = line.replaceAll("〉",">");line = line.replaceAll("χ","X");line = line.replaceAll("Đ","D");line = line.replaceAll("‰","%");line = line.replaceAll("·",".");line = line.replaceAll("→",">");line = line.replaceAll("←","<");line = line.replaceAll("ζ","z");line = line.replaceAll("π","p");line = line.replaceAll("τ","t");line = line.replaceAll("ξ","X");line = line.replaceAll("η","h");line = line.replaceAll("ø","0");line = line.replaceAll("Δ","D");line = line.replaceAll("∆","D");line = line.replaceAll("∑","S");line = line.replaceAll("Ω","O");line = line.replaceAll("δ","d");line = line.replaceAll("σ","s"); 302 | String Anno[]=line.split("\t"); 303 | String LocationInTree = Anno[0]; 304 | String token = Anno[1]; 305 | String type=""; 306 | String identifier=""; 307 | if(Anno.length>2) 308 | { 309 | type = Anno[2]; 310 | } 311 | if(Anno.length>3) 312 | { 313 | identifier = Anno[3]; 314 | } 315 | 316 | String LocationsInTree[]=LocationInTree.split("-"); 317 | TreeNode tmp = Tr.root; 318 | for(int i=0;i location = new ArrayList(); 347 | String Menlist[]=Mentions.split("\\|"); 348 | for(int m=0;m=0) //Find Tokens in the links 362 | { 363 | int CheckChild_Num=tmp.CheckChild(Tkns[i],i,Tok_NumCharPartialMatch,PrefixTranslation); 364 | int 
CheckChild_Num_tmp=CheckChild_Num; 365 | if(CheckChild_Num>=10000000){CheckChild_Num_tmp=CheckChild_Num-10000000;} 366 | tmp=tmp.links.get(CheckChild_Num_tmp); //move point to the link 367 | Concept = tmp.Concept; 368 | if(CheckChild_Num>=10000000){Concept="Species _PartialMatch_";} 369 | find=true; 370 | i++; 371 | } 372 | if(find == true) 373 | { 374 | if(i==Tkns.length) 375 | { 376 | if(!Concept.equals("")) 377 | { 378 | return Concept; 379 | } 380 | else 381 | { 382 | return "-1"; //gene id is not found. 383 | } 384 | } 385 | else 386 | { 387 | return "-2"; //the gene mention matched a substring in PrefixTree. 388 | } 389 | } 390 | else 391 | { 392 | return "-3"; //mention is not found 393 | } 394 | } 395 | return "-3"; //mention is not found 396 | } 397 | 398 | /* 399 | * Search target mention in the Prefix Tree 400 | * ConceptType: Species|Genus|Cell|CTDGene 401 | */ 402 | public ArrayList SearchMentionLocation(String Doc,String Doc_org,String ConceptType, Integer PrefixTranslation, Integer Tok_NumCharPartialMatch) 403 | { 404 | ArrayList location = new ArrayList(); 405 | Doc=Doc.toLowerCase(); 406 | String Doc_lc=Doc; 407 | Doc = Doc.replaceAll("([0-9])([A-Za-z])", "$1 $2"); 408 | Doc = Doc.replaceAll("([A-Za-z])([0-9])", "$1 $2"); 409 | Doc = Doc.replaceAll("[\\W^;:,]+", " "); 410 | 411 | String DocTkns[]=Doc.split(" "); 412 | int Offset=0; 413 | int Start=0; 414 | int Last=0; 415 | int FirstTime=0; 416 | 417 | while(Doc_lc.length()>0 && Doc_lc.substring(0,1).matches("[\\W]")) //clean the forward whitespace 418 | { 419 | Doc_lc=Doc_lc.substring(1); 420 | Offset++; 421 | } 422 | 423 | for(int i=0;i=0 ) //Find Tokens in the links 432 | { 433 | int CheckChild_Num=tmp.CheckChild(DocTkns[i],Tokn_num,Tok_NumCharPartialMatch,PrefixTranslation); 434 | int CheckChild_Num_tmp=CheckChild_Num; 435 | if(CheckChild_Num>=10000000){CheckChild_Num_tmp=CheckChild_Num-10000000;} 436 | tmp=tmp.links.get(CheckChild_Num_tmp); //move point to the link 437 | Concept = tmp.Concept; 
438 | if(CheckChild_Num>=10000000){Concept="Species _PartialMatch_";} 439 | 440 | if(Start==0 && FirstTime>0){Start = Offset;} //Start <- Offset 441 | if(Doc_lc.length()>=DocTkns[i].length() && Doc_lc.substring(0,DocTkns[i].length()).equals(DocTkns[i])) 442 | { 443 | if(DocTkns[i].length()>0) 444 | { 445 | Doc_lc=Doc_lc.substring(DocTkns[i].length()); 446 | Offset=Offset+DocTkns[i].length(); 447 | } 448 | } 449 | Last = Offset; 450 | while(Doc_lc.length()>0 && Doc_lc.substring(0,1).matches("[\\W]")) //clean the forward whitespace 451 | { 452 | Doc_lc=Doc_lc.substring(1); 453 | Offset++; 454 | } 455 | i++; 456 | Tokn_num++; 457 | 458 | if(ConceptType.equals("Species")) 459 | { 460 | if(i0 && Doc_lc.substring(0,1).matches("[\\W]")) //clean the forward whitespace 466 | { 467 | Doc_lc=Doc_lc.substring(1); 468 | Offset++; 469 | } 470 | i++; 471 | } 472 | } 473 | 474 | if(!Concept.equals("") && (Last-Start>0)) //Keep found concept 475 | { 476 | ConceptFound=i; 477 | ConceptFound_STR=Start+"\t"+Last+"\t"+Doc_org.substring(Start, Last)+"\t"+Concept; 478 | } 479 | 480 | find=true; 481 | if(i>=DocTkns.length){break;} 482 | //else if(i==DocTkns.length-1){PrefixTranslation=2;} 483 | } 484 | 485 | if(find == true) 486 | { 487 | if(!Concept.equals("") && (Last-Start>0)) //the last matched token has concept id 488 | { 489 | location.add(Start+"\t"+Last+"\t"+Doc_org.substring(Start, Last)+"\t"+Concept); 490 | 491 | } 492 | else if(!ConceptFound_STR.equals("")) //Keep found concept 493 | { 494 | location.add(ConceptFound_STR); 495 | i = ConceptFound + 1; 496 | } 497 | Start=0; 498 | Last=0; 499 | if(i>0){i--;} 500 | ConceptFound=i; //Keep found concept 501 | ConceptFound_STR="";//Keep found concept 502 | } 503 | else //if(find == false) 504 | { 505 | if(Doc_lc.length()>=DocTkns[i].length() && Doc_lc.substring(0,DocTkns[i].length()).equals(DocTkns[i])) 506 | { 507 | if(DocTkns[i].length()>0) 508 | { 509 | Doc_lc=Doc_lc.substring(DocTkns[i].length()); 510 | 
Offset=Offset+DocTkns[i].length(); 511 | } 512 | } 513 | } 514 | 515 | while(Doc_lc.length()>0 && Doc_lc.substring(0,1).matches("[\\W]")) //clean the forward whitespace 516 | { 517 | Doc_lc=Doc_lc.substring(1); 518 | Offset++; 519 | } 520 | FirstTime++; 521 | } 522 | return location; 523 | } 524 | 525 | /* 526 | * Print out the Prefix Tree 527 | */ 528 | public String PrintTree() 529 | { 530 | return Tr.PrintTree_preorder(Tr.root,""); 531 | } 532 | } 533 | 534 | class Tree 535 | { 536 | /* 537 | * Prefix Tree - root node 538 | */ 539 | public TreeNode root; 540 | 541 | public Tree() 542 | { 543 | root = new TreeNode("-ROOT-"); 544 | } 545 | 546 | /* 547 | * Insert mention into the tree 548 | */ 549 | public void insertMention(String Mention, String Identifier) 550 | { 551 | Mention=Mention.toLowerCase(); 552 | Identifier = Identifier.replaceAll("\t$", ""); 553 | Mention = Mention.replaceAll("([0-9])([A-Za-z])", "$1 $2"); 554 | Mention = Mention.replaceAll("([A-Za-z])([0-9])", "$1 $2"); 555 | Mention = Mention.replaceAll("[\\W\\-\\_]+", " "); 556 | 557 | String Tokens[]=Mention.split(" "); 558 | TreeNode tmp = root; 559 | for(int i=0;i=0) 562 | { 563 | tmp=tmp.links.get( tmp.CheckChild(Tokens[i],i,0,0) ); //go through next generation (exist node) 564 | if(i == Tokens.length-1) 565 | { 566 | tmp.Concept=Identifier; 567 | } 568 | } 569 | else //not exist 570 | { 571 | if(i == Tokens.length-1) 572 | { 573 | tmp.InsertToken(Tokens[i],Identifier); 574 | } 575 | else 576 | { 577 | tmp.InsertToken(Tokens[i]); 578 | } 579 | tmp=tmp.links.get(tmp.NumOflinks-1); //go to the next generation (new node) 580 | } 581 | } 582 | } 583 | 584 | /* 585 | * Print the tree by pre-order 586 | */ 587 | public String PrintTree_preorder(TreeNode node, String LocationInTree) 588 | { 589 | String opt=""; 590 | if(!node.token.equals("-ROOT-"))//Ignore root 591 | { 592 | if(node.Concept.equals("")) 593 | { 594 | opt=opt+LocationInTree+"\t"+node.token+"\n"; 595 | } 596 | else 597 | { 598 | 
opt=opt+LocationInTree+"\t"+node.token+"\t"+node.Concept+"\n"; 599 | } 600 | } 601 | if(!LocationInTree.equals("")){LocationInTree=LocationInTree+"-";} 602 | for(int i=0;i links; 616 | 617 | public TreeNode(String Tok,String ID) 618 | { 619 | token = Tok; 620 | NumOflinks = 0; 621 | Concept = ID; 622 | links = new ArrayList(); 623 | } 624 | public TreeNode(String Tok) 625 | { 626 | token = Tok; 627 | NumOflinks = 0; 628 | Concept = ""; 629 | links = new ArrayList(); 630 | } 631 | public TreeNode() 632 | { 633 | token = ""; 634 | NumOflinks = 0; 635 | Concept = ""; 636 | links = new ArrayList(); 637 | } 638 | 639 | /* 640 | * Insert an new node under the target node 641 | */ 642 | public void InsertToken(String Tok) 643 | { 644 | TreeNode NewNode = new TreeNode(Tok); 645 | links.add(NewNode); 646 | NumOflinks++; 647 | } 648 | public void InsertToken(String Tok,String ID) 649 | { 650 | TreeNode NewNode = new TreeNode(Tok,ID); 651 | links.add(NewNode); 652 | NumOflinks++; 653 | } 654 | 655 | /* 656 | * Check the tokens of children 657 | * PrefixTranslation = 1 (SuffixTranslationMap) 658 | * PrefixTranslation = 2 (CTDGene; partial match for numbers) 659 | * PrefixTranslation = 3 (NCBI Taxonomy usage (IEB) : suffix partial match) 660 | */ 661 | public int CheckChild(String Tok,Integer Tok_num,Integer Tok_NumCharPartialMatch, Integer PrefixTranslation) 662 | { 663 | /** Suffix Translation */ 664 | ArrayList SuffixTranslationMap = new ArrayList(); 665 | SuffixTranslationMap.add("alpha-a"); 666 | SuffixTranslationMap.add("alpha-1"); 667 | SuffixTranslationMap.add("a-alpha"); 668 | //SuffixTranslationMap.add("a-1"); 669 | SuffixTranslationMap.add("1-alpha"); 670 | //SuffixTranslationMap.add("1-a"); 671 | SuffixTranslationMap.add("beta-b"); 672 | SuffixTranslationMap.add("beta-2"); 673 | SuffixTranslationMap.add("b-beta"); 674 | //SuffixTranslationMap.add("b-2"); 675 | SuffixTranslationMap.add("2-beta"); 676 | //SuffixTranslationMap.add("2-b"); 677 | 
SuffixTranslationMap.add("gamma-g"); 678 | SuffixTranslationMap.add("gamma-y"); 679 | SuffixTranslationMap.add("g-gamma"); 680 | SuffixTranslationMap.add("y-gamma"); 681 | SuffixTranslationMap.add("1-i"); 682 | SuffixTranslationMap.add("i-1"); 683 | SuffixTranslationMap.add("2-ii"); 684 | SuffixTranslationMap.add("ii-2"); 685 | SuffixTranslationMap.add("3-iii"); 686 | SuffixTranslationMap.add("iii-3"); 687 | SuffixTranslationMap.add("4-vi"); 688 | SuffixTranslationMap.add("vi-4"); 689 | SuffixTranslationMap.add("5-v"); 690 | SuffixTranslationMap.add("v-5"); 691 | 692 | for(int i=0;i=Tok_NumCharPartialMatch && Tok_num>=1) // for NCBI Taxonomy usage (IEB) : suffix partial match 721 | { 722 | for(int i=0;i=Tok_NumCharPartialMatch) 725 | { 726 | String Tok_Prefix=Tok.substring(0,Tok_NumCharPartialMatch); 727 | if((!links.get(i).Concept.equals("")) && links.get(i).token.substring(0,Tok_NumCharPartialMatch).equals(Tok_Prefix)) 728 | { 729 | return(i+10000000); 730 | } 731 | } 732 | } 733 | } 734 | 735 | return(-1); 736 | } 737 | } 738 | -------------------------------------------------------------------------------- /src/tmVarlib/tmVar.java: -------------------------------------------------------------------------------- 1 | package tmVarlib; 2 | // 3 | // tmVar - Java version 4 | // 5 | 6 | import java.io.BufferedReader; 7 | import java.io.File; 8 | import java.io.FileReader; 9 | import java.io.FileInputStream; 10 | import java.io.FileOutputStream; 11 | import java.io.InputStream; 12 | import java.io.InputStreamReader; 13 | import java.io.OutputStream; 14 | import java.io.OutputStreamWriter; 15 | import java.sql.SQLException; 16 | import java.io.IOException; 17 | import java.util.ArrayList; 18 | import java.util.HashMap; 19 | import java.util.regex.Matcher; 20 | import java.util.regex.Pattern; 21 | 22 | import javax.xml.stream.XMLStreamException; 23 | 24 | import org.tartarus.snowball.SnowballStemmer; 25 | import org.tartarus.snowball.ext.englishStemmer; 26 | 27 | 
import edu.stanford.nlp.tagger.maxent.MaxentTagger; 28 | 29 | public class tmVar 30 | { 31 | public static MaxentTagger tagger = new MaxentTagger(); 32 | public static SnowballStemmer stemmer = new englishStemmer(); 33 | public static ArrayList RegEx_DNAMutation_STR=new ArrayList(); 34 | public static ArrayList RegEx_ProteinMutation_STR=new ArrayList(); 35 | public static ArrayList RegEx_SNP_STR=new ArrayList(); 36 | public static ArrayList PAM_lowerScorePair = new ArrayList(); 37 | public static HashMap GeneVariantMention = new HashMap(); 38 | public static HashMap VariantFrequency = new HashMap(); 39 | public static Pattern Pattern_Component_1; 40 | public static Pattern Pattern_Component_2; 41 | public static Pattern Pattern_Component_3; 42 | public static Pattern Pattern_Component_4; 43 | public static Pattern Pattern_Component_5; 44 | public static Pattern Pattern_Component_6; 45 | public static boolean GeneMention = false; // will be turn to true if "ExtractFeature" can find gene mention 46 | public static HashMap nametothree = new HashMap(); 47 | public static HashMap threetone = new HashMap(); 48 | public static HashMap threetone_nu = new HashMap(); 49 | public static HashMap NTtoATCG = new HashMap(); 50 | public static ArrayList MF_Pattern = new ArrayList(); 51 | public static ArrayList MF_Type = new ArrayList(); 52 | public static HashMap filteringStr_hash = new HashMap(); 53 | public static HashMap Mutation_RS_Geneid_hash = new HashMap(); 54 | public static ArrayList RS_DNA_Protein = new ArrayList(); 55 | public static HashMap one2three = new HashMap(); 56 | public static PrefixTree PT_GeneVariantMention = new PrefixTree(); 57 | public static HashMap VariantType_hash = new HashMap(); 58 | public static HashMap Number_word2digit = new HashMap(); 59 | public static HashMap grouped_variants = new HashMap(); 60 | public static HashMap nu2aa_hash = new HashMap(); 61 | public static HashMap RS2Frequency_hash = new HashMap(); 62 | public static HashMap> 
variant_mention_to_filter_overlap_gene = new HashMap<>(); // gene mention: GCC
    public static HashMap<String,String> Gene2HumanGene_hash = new HashMap<String,String>();
    public static HashMap<String,String> Variant2MostCorrespondingGene_hash = new HashMap<String,String>();
    public static HashMap<String,String> RSandPosition2Seq_hash = new HashMap<String,String>();

    public static void main(String [] args) throws IOException, InterruptedException, XMLStreamException, SQLException
    {
        /*
         * Parameters
         */
        String InputFolder="input";
        String OutputFolder="output";
        String TrainTest="Test"; //Train|Train_Mention|Test|Test_FullText
        String DeleteTmp="True";
        String DisplayRSnumOnly="True"; // hide the types of the methods
        String DisplayChromosome="True"; // hide the chromosome mentions
        String DisplayRefSeq="True"; // hide the RefSeq mentions
        String DisplayGenomicRegion="True";
        String HideMultipleResult="True"; //L858R: 121434568|1057519847|1057519848 --> 121434568
        if(args.length<2)
        {
            // Too few arguments: print the usage and continue with the default folders
            System.out.println("\n$ java -Xmx5G -Xms5G -jar tmVar.jar [InputFolder] [OutputFolder]");
            System.out.println("[InputFolder] Default : input");
            System.out.println("[OutputFolder] Default : output\n\n");
        }
        else
        {
            InputFolder=args [0];
            OutputFolder=args [1];

            if(args.length>2 && args[2].toLowerCase().matches("(train|train_mention|test|test_fulltext)"))
            {
                TrainTest=args [2];
                if(args[2].toLowerCase().matches("(train|train_mention)"))
                {
                    DeleteTmp="False";
                }
            }
            // The flags are lower-cased before matching, so the pattern must be lower case too
            // (the original "(True|False)" could never match). Values are stored in the
            // same "True"/"False" spelling as the defaults above.
            if(args.length>3 && args[3].toLowerCase().matches("(true|false)"))
            {
                DeleteTmp=args[3].equalsIgnoreCase("true") ? "True" : "False";
            }
            if(args.length>4 && args[4].toLowerCase().matches("(true|false)"))
            {
                DisplayRSnumOnly=args[4].equalsIgnoreCase("true") ? "True" : "False";
            }
            if(args.length>5 && args[5].toLowerCase().matches("(true|false)"))
            {
                // read the sixth argument (the original read args[4] here)
                HideMultipleResult=args[5].equalsIgnoreCase("true") ? "True" : "False";
            }
        }

        double startTime,endTime,totTime;
        startTime = System.currentTimeMillis();//start
time 116 | BioCConverter BC= new BioCConverter(); 117 | 118 | /** 119 | * Import models and resource 120 | */ 121 | { 122 | /* 123 | * POSTagging: loading model 124 | */ 125 | tagger = new MaxentTagger("lib/taggers/english-left3words-distsim.tagger"); 126 | 127 | /* 128 | * Stemming : using Snowball 129 | */ 130 | stemmer = new englishStemmer(); 131 | 132 | /* 133 | * PAM 140 : <=-6 pairs 134 | */ 135 | BufferedReader PAM = new BufferedReader(new InputStreamReader(new FileInputStream("lib/PAM140-6.txt"), "UTF-8")); 136 | String line=""; 137 | while ((line = PAM.readLine()) != null) 138 | { 139 | String nt[]=line.split("\t"); 140 | PAM_lowerScorePair.add(nt[0]+"\t"+nt[1]); 141 | PAM_lowerScorePair.add(nt[1]+"\t"+nt[0]); 142 | } 143 | PAM.close(); 144 | 145 | /* 146 | * Variant frequency 147 | */ 148 | BufferedReader frequency = new BufferedReader(new InputStreamReader(new FileInputStream("Database/rs.rank.txt"), "UTF-8")); 149 | line=""; 150 | while ((line = frequency.readLine()) != null) 151 | { 152 | String nt[]=line.split("\t"); 153 | VariantFrequency.put(nt[1],Integer.parseInt(nt[0])); 154 | } 155 | frequency.close(); 156 | /* 157 | * HGVs nomenclature lookup - RegEx : DNAMutation 158 | */ 159 | BufferedReader RegEx_DNAMutation = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/DNAMutation.RegEx.txt"), "UTF-8")); 160 | line=""; 161 | while ((line = RegEx_DNAMutation.readLine()) != null) 162 | { 163 | RegEx_DNAMutation_STR.add(line); 164 | } 165 | RegEx_DNAMutation.close(); 166 | 167 | /* 168 | * HGVs nomenclature lookup - RegEx : ProteinMutation 169 | */ 170 | BufferedReader RegEx_ProteinMutation = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/ProteinMutation.RegEx.txt"), "UTF-8")); 171 | line=""; 172 | while ((line = RegEx_ProteinMutation.readLine()) != null) 173 | { 174 | RegEx_ProteinMutation_STR.add(line); 175 | } 176 | RegEx_ProteinMutation.close(); 177 | 178 | /* 179 | * HGVs nomenclature lookup - RegEx : 
SNP 180 | */ 181 | BufferedReader RegEx_SNP = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/SNP.RegEx.txt"), "UTF-8")); 182 | line=""; 183 | while ((line = RegEx_SNP.readLine()) != null) 184 | { 185 | RegEx_SNP_STR.add(line); 186 | } 187 | RegEx_SNP.close(); 188 | 189 | /* 190 | * Append-pattern 191 | */ 192 | BufferedReader RegEx_NL = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/MF.RegEx.2.txt"), "UTF-8")); 193 | while ((line = RegEx_NL.readLine()) != null) 194 | { 195 | String RegEx[]=line.split("\t"); 196 | MF_Pattern.add(RegEx[0]); 197 | MF_Type.add(RegEx[1]); 198 | } 199 | RegEx_NL.close(); 200 | 201 | /* 202 | * Append-pattern 203 | */ 204 | 205 | BufferedReader VariantType = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/VariantType.txt"), "UTF-8")); 206 | while ((line = VariantType.readLine()) != null) 207 | { 208 | String split[]=line.split("\t"); 209 | VariantType_hash.put(split[0],split[1]); 210 | } 211 | VariantType.close(); 212 | 213 | /* 214 | * nu2aa 215 | */ 216 | BufferedReader nu2aa = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/nu2aa.mapping.txt"), "UTF-8")); 217 | while ((line = nu2aa.readLine()) != null) 218 | { 219 | nu2aa_hash.put(line,""); 220 | } 221 | nu2aa.close(); 222 | 223 | 224 | //RegEx of component recognition 225 | Pattern_Component_1 = Pattern.compile("^([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)\\|(fs[^|]*)\\|([^|]*)$"); 226 | Pattern_Component_2 = Pattern.compile("^([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)\\|(fs[^|]*)$"); 227 | Pattern_Component_3 = Pattern.compile("^([^|]*)\\|([^|]*(ins|del|Del|dup|-)[^|]*)\\|([^|]*)\\|([^|]*)$"); 228 | Pattern_Component_4 = Pattern.compile("^([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$"); 229 | Pattern_Component_5 = Pattern.compile("^([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$"); 230 | Pattern_Component_6 = Pattern.compile("^((\\[rs\\]|rs|RS|Rs|reference SNP no[.] 
)[0-9]+)$"); 231 | 232 | nametothree.put("GLUTAMIC","GLU");nametothree.put("ASPARTIC","ASP");nametothree.put("THYMINE", "THY");nametothree.put("ALANINE", "ALA");nametothree.put("ARGININE", "ARG");nametothree.put("ASPARAGINE", "ASN");nametothree.put("ASPARTICACID", "ASP");nametothree.put("ASPARTATE", "ASP");nametothree.put("CYSTEINE", "CYS");nametothree.put("GLUTAMINE", "GLN");nametothree.put("GLUTAMICACID", "GLU");nametothree.put("GLUTAMATE", "GLU");nametothree.put("GLYCINE", "GLY");nametothree.put("HISTIDINE", "HIS");nametothree.put("ISOLEUCINE", "ILE");nametothree.put("LEUCINE", "LEU");nametothree.put("LYSINE", "LYS");nametothree.put("METHIONINE", "MET");nametothree.put("PHENYLALANINE", "PHE");nametothree.put("PROLINE", "PRO");nametothree.put("SERINE", "SER");nametothree.put("THREONINE", "THR");nametothree.put("TRYPTOPHAN", "TRP");nametothree.put("TYROSINE", "TYR");nametothree.put("VALINE", "VAL");nametothree.put("STOP", "XAA");nametothree.put("TERM", "XAA");nametothree.put("TERMINATION", "XAA");nametothree.put("STOP", "XAA");nametothree.put("TERM", "XAA");nametothree.put("TERMINATION", 
"XAA");nametothree.put("GLUTAMICCODON","GLU");nametothree.put("ASPARTICCODON","ASP");nametothree.put("THYMINECODON","THY");nametothree.put("ALANINECODON","ALA");nametothree.put("ARGININECODON","ARG");nametothree.put("ASPARAGINECODON","ASN");nametothree.put("ASPARTICACIDCODON","ASP");nametothree.put("ASPARTATECODON","ASP");nametothree.put("CYSTEINECODON","CYS");nametothree.put("GLUTAMINECODON","GLN");nametothree.put("GLUTAMICACIDCODON","GLU");nametothree.put("GLUTAMATECODON","GLU");nametothree.put("GLYCINECODON","GLY");nametothree.put("HISTIDINECODON","HIS");nametothree.put("ISOLEUCINECODON","ILE");nametothree.put("LEUCINECODON","LEU");nametothree.put("LYSINECODON","LYS");nametothree.put("METHIONINECODON","MET");nametothree.put("PHENYLALANINECODON","PHE");nametothree.put("PROLINECODON","PRO");nametothree.put("SERINECODON","SER");nametothree.put("THREONINECODON","THR");nametothree.put("TRYPTOPHANCODON","TRP");nametothree.put("TYROSINECODON","TYR");nametothree.put("VALINECODON","VAL");nametothree.put("STOPCODON","XAA");nametothree.put("TERMCODON","XAA");nametothree.put("TERMINATIONCODON","XAA");nametothree.put("STOPCODON","XAA");nametothree.put("TERMCODON","XAA");nametothree.put("TERMINATIONCODON","XAA");nametothree.put("TAG","XAA");nametothree.put("TAA","XAA");nametothree.put("UAG","XAA");nametothree.put("UAA","XAA"); 233 | threetone.put("ALA", "A");threetone.put("ARG", "R");threetone.put("ASN", "N");threetone.put("ASP", "D");threetone.put("CYS", "C");threetone.put("GLN", "Q");threetone.put("GLU", "E");threetone.put("GLY", "G");threetone.put("HIS", "H");threetone.put("ILE", "I");threetone.put("LEU", "L");threetone.put("LYS", "K");threetone.put("MET", "M");threetone.put("PHE", "F");threetone.put("PRO", "P");threetone.put("SER", "S");threetone.put("THR", "T");threetone.put("TRP", "W");threetone.put("TYR", "Y");threetone.put("VAL", "V");threetone.put("ASX", "B");threetone.put("GLX", "Z");threetone.put("XAA", "X");threetone.put("TER", "X"); 234 | 
threetone_nu.put("GCU","A");threetone_nu.put("GCC","A");threetone_nu.put("GCA","A");threetone_nu.put("GCG","A");threetone_nu.put("CGU","R");threetone_nu.put("CGC","R");threetone_nu.put("CGA","R");threetone_nu.put("CGG","R");threetone_nu.put("AGA","R");threetone_nu.put("AGG","R");threetone_nu.put("AAU","N");threetone_nu.put("AAC","N");threetone_nu.put("GAU","D");threetone_nu.put("GAC","D");threetone_nu.put("UGU","C");threetone_nu.put("UGC","C");threetone_nu.put("GAA","E");threetone_nu.put("GAG","E");threetone_nu.put("CAA","Q");threetone_nu.put("CAG","Q");threetone_nu.put("GGU","G");threetone_nu.put("GGC","G");threetone_nu.put("GGA","G");threetone_nu.put("GGG","G");threetone_nu.put("CAU","H");threetone_nu.put("CAC","H");threetone_nu.put("AUU","I");threetone_nu.put("AUC","I");threetone_nu.put("AUA","I");threetone_nu.put("CUU","L");threetone_nu.put("CUC","L");threetone_nu.put("CUA","L");threetone_nu.put("CUG","L");threetone_nu.put("UUA","L");threetone_nu.put("UUG","L");threetone_nu.put("AAA","K");threetone_nu.put("AAG","K");threetone_nu.put("AUG","M");threetone_nu.put("UUU","F");threetone_nu.put("UUC","F");threetone_nu.put("CCU","P");threetone_nu.put("CCC","P");threetone_nu.put("CCA","P");threetone_nu.put("CCG","P");threetone_nu.put("UCU","S");threetone_nu.put("UCC","S");threetone_nu.put("UCA","S");threetone_nu.put("UCG","S");threetone_nu.put("AGU","S");threetone_nu.put("AGC","S");threetone_nu.put("ACU","T");threetone_nu.put("ACC","T");threetone_nu.put("ACA","T");threetone_nu.put("ACG","T");threetone_nu.put("UGG","W");threetone_nu.put("UAU","Y");threetone_nu.put("UAC","Y");threetone_nu.put("GUU","V");threetone_nu.put("GUC","V");threetone_nu.put("GUA","V");threetone_nu.put("GUG","V");threetone_nu.put("UAA","X");threetone_nu.put("UGA","X");threetone_nu.put("UAG","X");threetone_nu.put("GCT","A");threetone_nu.put("GCC","A");threetone_nu.put("GCA","A");threetone_nu.put("GCG","A");threetone_nu.put("CGT","R");threetone_nu.put("CGC","R");threetone_nu.put("CGA","R");threetone_nu
.put("CGG","R");threetone_nu.put("AGA","R");threetone_nu.put("AGG","R");threetone_nu.put("AAT","N");threetone_nu.put("AAC","N");threetone_nu.put("GAT","D");threetone_nu.put("GAC","D");threetone_nu.put("TGT","C");threetone_nu.put("TGC","C");threetone_nu.put("GAA","E");threetone_nu.put("GAG","E");threetone_nu.put("CAA","Q");threetone_nu.put("CAG","Q");threetone_nu.put("GGT","G");threetone_nu.put("GGC","G");threetone_nu.put("GGA","G");threetone_nu.put("GGG","G");threetone_nu.put("CAT","H");threetone_nu.put("CAC","H");threetone_nu.put("ATT","I");threetone_nu.put("ATC","I");threetone_nu.put("ATA","I");threetone_nu.put("CTT","L");threetone_nu.put("CTC","L");threetone_nu.put("CTA","L");threetone_nu.put("CTG","L");threetone_nu.put("TTA","L");threetone_nu.put("TTG","L");threetone_nu.put("AAA","K");threetone_nu.put("AAG","K");threetone_nu.put("ATG","M");threetone_nu.put("TTT","F");threetone_nu.put("TTC","F");threetone_nu.put("CCT","P");threetone_nu.put("CCC","P");threetone_nu.put("CCA","P");threetone_nu.put("CCG","P");threetone_nu.put("TCT","S");threetone_nu.put("TCC","S");threetone_nu.put("TCA","S");threetone_nu.put("TCG","S");threetone_nu.put("AGT","S");threetone_nu.put("AGC","S");threetone_nu.put("ACT","T");threetone_nu.put("ACC","T");threetone_nu.put("ACA","T");threetone_nu.put("ACG","T");threetone_nu.put("TGG","W");threetone_nu.put("TAT","Y");threetone_nu.put("TAC","Y");threetone_nu.put("GTT","V");threetone_nu.put("GTC","V");threetone_nu.put("GTA","V");threetone_nu.put("GTG","V");threetone_nu.put("TAA","X");threetone_nu.put("TGA","X");threetone_nu.put("TAG","X"); 235 | NTtoATCG.put("ADENINE", "A");NTtoATCG.put("CYTOSINE", "C");NTtoATCG.put("GUANINE", "G");NTtoATCG.put("URACIL", "U");NTtoATCG.put("THYMINE", "T");NTtoATCG.put("ADENOSINE", "A");NTtoATCG.put("CYTIDINE", "C");NTtoATCG.put("THYMIDINE", "T");NTtoATCG.put("GUANOSINE", "G");NTtoATCG.put("URIDINE", "U"); 236 | 237 | 
            // Number words (upper case) to digits; "TEN" replaces the original key "TWN", which looks like a typo
            Number_word2digit.put("ZERO","0");Number_word2digit.put("SINGLE","1");Number_word2digit.put("ONE","1");Number_word2digit.put("TWO","2");
            Number_word2digit.put("THREE","3");Number_word2digit.put("FOUR","4");Number_word2digit.put("FIVE","5");Number_word2digit.put("SIX","6");
            Number_word2digit.put("SEVEN","7");Number_word2digit.put("EIGHT","8");Number_word2digit.put("NINE","9");Number_word2digit.put("TEN","10");

            //Filtering
            BufferedReader filterfile = new BufferedReader(new InputStreamReader(new FileInputStream("lib/filtering.txt"), "UTF-8"));
            while ((line = filterfile.readLine()) != null)
            {
                filteringStr_hash.put(line, "");
            }
            filterfile.close();

            /*one2three*/
            one2three.put("A", "Ala");
            one2three.put("R", "Arg");
            one2three.put("N", "Asn");
            one2three.put("D", "Asp");
            one2three.put("C", "Cys");
            one2three.put("Q", "Gln");
            one2three.put("E", "Glu");
            one2three.put("G", "Gly");
            one2three.put("H", "His");
            one2three.put("I", "Ile");
            one2three.put("L", "Leu");
            one2three.put("K", "Lys");
            one2three.put("M", "Met");
            one2three.put("F", "Phe");
            one2three.put("P", "Pro");
            one2three.put("S", "Ser");
            one2three.put("T", "Thr");
            one2three.put("W", "Trp");
            one2three.put("Y", "Tyr");
            one2three.put("V", "Val");
            one2three.put("B", "Asx");
            one2three.put("Z", "Glx");
            one2three.put("X", "Xaa");
            one2three.put("X", "Ter"); // same key as the line above, so "X" ends up mapped to "Ter"

            /*RS_DNA_Protein.txt - Pattern : PP[P]+[ ]*[\(\[][ ]*DD[D]+[ ]*[\)\]][ ]*[\(\[][ ]*SS[S]+[ ]*[\)\]]*/
            BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/RS_DNA_Protein.txt"), "UTF-8"));
            while ((line = inputfile.readLine()) != null)
            {
                if(!line.equals(""))
                {
                    RS_DNA_Protein.add(line);
                }
            }
            inputfile.close();

            /*PT_GeneVariantMention - BRAFV600E*/
PT_GeneVariantMention.TreeFile2Tree("lib/PT_GeneVariantMention.txt"); 286 | 287 | /*Mutation_RS_Geneid.txt - the patterns retrieved from PubMed result*/ 288 | inputfile = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/Mutation_RS_Geneid.txt"), "UTF-8")); 289 | while ((line = inputfile.readLine()) != null) 290 | { 291 | //Pattern c|SUB|C|1749|T 2071558 269 292 | Pattern pat = Pattern.compile("^(Pattern|Recognized|ManuallyAdded) ([^\t]+) ([0-9]+) ([0-9,]+)$"); 293 | Matcher mat = pat.matcher(line); 294 | if (mat.find()) 295 | { 296 | String geneids[]=mat.group(4).split(","); 297 | for(int i=0;i rs# 300 | Mutation_RS_Geneid_hash.put(mat.group(2)+"\t"+geneids[i], mat.group(3)); 301 | } 302 | } 303 | } 304 | inputfile.close(); 305 | /** tmVarForm2RSID2Freq - together with Mutation_RS_Geneid.txt (the patterns retrieved from PubMed result) */ 306 | BufferedReader tmVarForm2RSID2Freq = new BufferedReader(new InputStreamReader(new FileInputStream("lib/tmVarForm2RSID2Freq.txt"), "UTF-8")); 307 | line=""; 308 | while ((line = tmVarForm2RSID2Freq.readLine()) != null) 309 | { 310 | String nt[]=line.split("\t"); 311 | String tmVarForm=nt[0]; 312 | String rs_gene_freq=nt[1]; 313 | //RS:rs121913377|Gene:673|Freq:3 314 | Pattern pat = Pattern.compile("^RS:rs([0-9]+)\\|Gene:([0-9,]+)\\|Freq:([0-9]+)$"); 315 | Matcher mat = pat.matcher(rs_gene_freq); 316 | if (mat.find()) 317 | { 318 | String rs=mat.group(1); 319 | String gene=mat.group(2); 320 | Mutation_RS_Geneid_hash.put(tmVarForm+"\t"+gene,rs); 321 | } 322 | } 323 | tmVarForm2RSID2Freq.close(); 324 | 325 | /** RS2Frequency - rs# to its frequency in PTC) */ 326 | BufferedReader RS2Frequency = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RS2Frequency.txt"), "UTF-8")); 327 | line=""; 328 | while ((line = RS2Frequency.readLine()) != null) 329 | { 330 | String nt[]=line.split("\t"); 331 | String rs=nt[0]; 332 | int freq=Integer.parseInt(nt[1]); 333 | RS2Frequency_hash.put(rs,freq); 334 | } 
335 | RS2Frequency.close(); 336 | 337 | /** Homoid2HumanGene_hash **/ 338 | HashMap Homoid2HumanGene_hash = new HashMap(); 339 | BufferedReader Homoid2HumanGene = new BufferedReader(new InputStreamReader(new FileInputStream("Database/Homoid2HumanGene.txt"), "UTF-8")); 340 | line=""; 341 | while ((line = Homoid2HumanGene.readLine()) != null) 342 | { 343 | String nt[]=line.split("\t"); 344 | String homoid=nt[0]; 345 | String humangeneid=nt[1]; 346 | Homoid2HumanGene_hash.put(homoid,humangeneid); 347 | } 348 | Homoid2HumanGene.close(); 349 | 350 | /** Gene2Homoid.txt **/ 351 | BufferedReader Gene2Homoid = new BufferedReader(new InputStreamReader(new FileInputStream("Database/Gene2Homoid.txt"), "UTF-8")); 352 | line=""; 353 | while ((line = Gene2Homoid.readLine()) != null) 354 | { 355 | String nt[]=line.split("\t"); 356 | String geneid=nt[0]; 357 | String homoid=nt[1]; 358 | if(Homoid2HumanGene_hash.containsKey(homoid)) 359 | { 360 | if(!geneid.equals(Homoid2HumanGene_hash.get(homoid))) 361 | { 362 | Gene2HumanGene_hash.put(geneid,Homoid2HumanGene_hash.get(homoid)); 363 | } 364 | } 365 | } 366 | Gene2Homoid.close(); 367 | 368 | /** Variant2MostCorrespondingGene **/ 369 | BufferedReader Variant2MostCorrespondingGene = new BufferedReader(new InputStreamReader(new FileInputStream("Database/var2gene.txt"), "UTF-8")); 370 | line=""; 371 | while ((line = Variant2MostCorrespondingGene.readLine()) != null) 372 | { 373 | String nt[]=line.split("\t"); //4524 1801133 C677T 374 | String geneid=nt[0]; 375 | String rsid=nt[1]; 376 | String var=nt[2].toLowerCase(); 377 | Variant2MostCorrespondingGene_hash.put(var,geneid+"\t"+rsid); 378 | } 379 | Variant2MostCorrespondingGene.close(); 380 | 381 | BufferedReader RSandPosition2Seq = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RS2tmVarForm.txt"), "UTF-8")); 382 | line=""; 383 | while ((line = RSandPosition2Seq.readLine()) != null) 384 | { 385 | String nt[]=line.split("\t"); //121908752 c SUB T 617 G 386 | 
                if(nt.length>4) // nt[4] (the position) is read below, so at least five columns are required
                {
                    String rs=nt[0];
                    String seq=nt[1];
                    String P=nt[4];
                    RSandPosition2Seq_hash.put(rs+"\t"+P,seq);
                }
            }
            RSandPosition2Seq.close();

        }

        File folder = new File(InputFolder);
        File[] listOfFiles = folder.listFiles();
        if(listOfFiles == null) // listFiles() returns null when InputFolder is missing or is not a directory
        {
            System.out.println(InputFolder+" - the input folder cannot be read");
            return;
        }
        for (int i = 0; i < listOfFiles.length; i++)
        {
            if (listOfFiles[i].isFile())
            {
                String InputFile = listOfFiles[i].getName();

                File f = new File(OutputFolder+"/"+InputFile+".PubTator");
                File f_BioC = new File(OutputFolder+"/"+InputFile+".BioC.XML");

                if(f.exists() && !f.isDirectory())
                {
                    System.out.println(InputFolder+"/"+InputFile+" - Done. (The output file (PubTator) exists in output folder)");
                }
                else if(f_BioC.exists() && !f_BioC.isDirectory())
                {
                    System.out.println(InputFolder+"/"+InputFile+" - Done. (The output file (BioC) exists in output folder)");
                }
                else
                {
                    /*
                     * Mention recognition by CRF++
                     */
                    if(TrainTest.equals("Test") || TrainTest.equals("Test_FullText") || TrainTest.equals("Train"))
                    {
                        /*
                         * Format Check
                         */
                        String Format = "";
                        String checkR=BC.BioCFormatCheck(InputFolder+"/"+InputFile);
                        if(checkR.equals("BioC"))
                        {
                            Format = "BioC";
                        }
                        else if(checkR.equals("PubTator"))
                        {
                            Format = "PubTator";
                        }
                        else
                        {
                            // Neither format was recognized: print the checker's message and stop
                            System.out.println(checkR);
                            System.exit(0);
                        }

                        System.out.print(InputFolder+"/"+InputFile+" - ("+Format+" format) : Processing ... 
\r");

					/*
					 * Pre-processing
					 */
					MentionRecognition MR= new MentionRecognition();
					if(Format.equals("BioC"))
					{
						BC.BioC2PubTator(InputFolder+"/"+InputFile,"tmp/"+InputFile);
						MR.FeatureExtraction("tmp/"+InputFile,"tmp/"+InputFile+".data","tmp/"+InputFile+".location",TrainTest);
					}
					else if(Format.equals("PubTator"))
					{
						MR.FeatureExtraction(InputFolder+"/"+InputFile,"tmp/"+InputFile+".data","tmp/"+InputFile+".location",TrainTest);
					}
					if(TrainTest.equals("Test") || TrainTest.equals("Test_FullText"))
					{
						MR.CRF_test("tmp/"+InputFile+".data","tmp/"+InputFile+".output",TrainTest);
					}

					/*
					 * CRF++ output --> PubTator
					 */
					PostProcessing PP = new PostProcessing();
					{
						if(Format.equals("BioC"))
						{
							PP.toME("tmp/"+InputFile,"tmp/"+InputFile+".output","tmp/"+InputFile+".location","tmp/"+InputFile+".ME");
							PP.toPostME("tmp/"+InputFile+".ME","tmp/"+InputFile+".PostME");
							PP.toPostMEData("tmp/"+InputFile,"tmp/"+InputFile+".PostME","tmp/"+InputFile+".PostME.ml","tmp/"+InputFile+".PostME.data",TrainTest);
						}
						else if(Format.equals("PubTator"))
						{
							PP.toME(InputFolder+"/"+InputFile,"tmp/"+InputFile+".output","tmp/"+InputFile+".location","tmp/"+InputFile+".ME");
							PP.toPostME("tmp/"+InputFile+".ME","tmp/"+InputFile+".PostME");
							PP.toPostMEData(InputFolder+"/"+InputFile,"tmp/"+InputFile+".PostME","tmp/"+InputFile+".PostME.ml","tmp/"+InputFile+".PostME.data",TrainTest);
						}
						if(TrainTest.equals("Test") || TrainTest.equals("Test_FullText"))
						{
							PP.toPostMEoutput("tmp/"+InputFile+".PostME.data","tmp/"+InputFile+".PostME.output");
						}

						else if(TrainTest.equals("Train"))
						{
							PP.toPostMEModel("tmp/"+InputFile+".PostME.data");
						}


						/*
						 * Post-processing
						 */
						if(TrainTest.equals("Test") || TrainTest.equals("Test_FullText"))
						{
							GeneMention = true;
							if(GeneMention == true) // MentionRecognition detect Gene mentions
							{
								PP.output2PubTator("tmp/"+InputFile+".PostME.ml","tmp/"+InputFile+".PostME.output","tmp/"+InputFile+".PostME","tmp/"+InputFile+".PubTator");

								if(Format.equals("BioC"))
								{
									PP.Normalization("tmp/"+InputFile,"tmp/"+InputFile+".PubTator",OutputFolder+"/"+InputFile+".PubTator",DisplayRSnumOnly,HideMultipleResult,DisplayChromosome,DisplayRefSeq,DisplayGenomicRegion);
								}
								else if(Format.equals("PubTator"))
								{
									PP.Normalization(InputFolder+"/"+InputFile,"tmp/"+InputFile+".PubTator",OutputFolder+"/"+InputFile+".PubTator",DisplayRSnumOnly,HideMultipleResult,DisplayChromosome,DisplayRefSeq,DisplayGenomicRegion);
								}
							}
							else
							{
								PP.output2PubTator("tmp/"+InputFile+".PostME.ml","tmp/"+InputFile+".PostME.output","tmp/"+InputFile+".PostME",OutputFolder+"/"+InputFile+".PubTator");
							}

							if(Format.equals("BioC"))
							{
								BC.PubTator2BioC_AppendAnnotation(OutputFolder+"/"+InputFile+".PubTator",InputFolder+"/"+InputFile,OutputFolder+"/"+InputFile+".BioC.XML");
							}
						}
					}

					/*
					 * Time stamp - last
					 */
					endTime = System.currentTimeMillis();//ending time
					totTime = endTime - startTime;
					System.out.println(InputFolder+"/"+InputFile+" - ("+Format+" format) : Processing Time:"+totTime/1000+"sec");

					/*
					 * remove tmp files
					 */
					if(DeleteTmp.toLowerCase().equals("true"))
					{
						String path="tmp";
						File file = new File(path);
						File[] files = file.listFiles();
						for (File ftmp:files)
						{
							if (ftmp.isFile() && ftmp.exists())
							{
								if(ftmp.toString().matches("tmp."+InputFile+".*"))
								{
									ftmp.delete();
								}
							}
						}
					}
				}
				else if(TrainTest.equals("Train_Mention"))
				{
					System.out.print(InputFolder+"/"+InputFile+" - Processing ... \r");

					PostProcessing PP = new PostProcessing();
					PP.toPostMEData(InputFolder+"/"+InputFile,"tmp/"+InputFile+".PostME","tmp/"+InputFile+".PostME.ml","tmp/"+InputFile+".PostME.data","Train");

					/*
					 * Time stamp - last
					 */
					endTime = System.currentTimeMillis();//ending time
					totTime = endTime - startTime;
					System.out.println(InputFolder+"/"+InputFile+" - Processing Time:"+totTime/1000+"sec");
				}
			}
		}
	}
}
}

--------------------------------------------------------------------------------
/tmBioC.key:
--------------------------------------------------------------------------------
PubTator.key

A BioC format for PubTator and other NER tools (i.e., tmChem, DNorm, tmVar, SR4GN or GenNorm) developed at the Biomedical Text Mining group at NCBI
The goal of this collection is to provide easy access to the text and bio-concept annotations for PMC articles.

collection: a group of PubMed documents, each document is organized into title, abstract and other passages

	source: PubMed, PubMed Central, etc.

	date: Document download date

	document: abstract, full-text article, free-text document, etc.

		id: PubMed ID (or other ID in a given collection) of the document

		passage: Title, abstract and other passages

			infon["type"]: "title", "abstract" and other passages

			offset: Title has an offset of zero, while the other passages (e.g., abstract) are assumed to begin after the previous passages and one space

			text: Text of the passage

			annotation: One bio-concept of the passage as determined by the tmChem, DNorm, tmVar, SR4GN or GenNorm

				infon["type"]: The type of bioconcept, e.g. "Gene", "Species", "Disease", "Chemical" or "Mutation"

				infon["MeSH"]: The bio-concept identifier in MeSH as detected by DNorm or tmChem

				infon["OMIM"]: The bio-concept identifier in OMIM as detected by DNorm

				infon["NCBI_Gene"]: The bio-concept identifier in NCBI Gene as detected by GenNorm

				infon["NCBI_Taxonomy"]: The bio-concept identifier in NCBI Taxonomy as detected by SR4GN

				infon["ChEBI"]: The bio-concept identifier in ChEBI as detected by tmChem

				infon["tmVar"]: The intelligent key generated artificially for the mention detected by tmVar (||||)

				location: location of the mention including the global document "offset" where a bio-concept is located and the "length" of the mention

				text: Mention of the bio-concept
--------------------------------------------------------------------------------