├── BioC.dtd
├── COPYING
├── DISCLAIMER
├── README.md
├── src
    └── tmVarlib
    │   ├── BioCConverter.java
    │   ├── MentionRecognition.java
    │   ├── PostProcessing.java
    │   ├── PrefixTree.java
    │   └── tmVar.java
└── tmBioC.key


/BioC.dtd:
--------------------------------------------------------------------------------
  1 | <!-- Combination DTD that will work with any document so far. -->
  2 | 
  3 | <!--
  4 | 
  5 |     Some believe XML is easily read by humans and that should be
  6 |     supported by clearly formatting the elements. In the long run,
  7 |     this is distracting. While the only meaningful spaces are in text
  8 |     elements and the other spaces can be ignored, current tools add no
  9 |     additional space.  Formatters and editors may be used to make the
 10 |     XML file appear more readable.
 11 | 
 12 |     The possible variety of annotations that one might want to produce
 13 |     or use is nearly countless. There is no guarantee that these are
 14 |     organized in the nice nested structure required for XML
 15 |     elements. Even if they were, it would be nice to more easily
 16 |     ignore unwanted annotations.  So annotations are recorded in a
 17 |     stand off manner, external to the annotated text. The exceptions
 18 |     are passages and sentences because of their fundamental place in
 19 |     text.
 20 | 
 21 |     The text is expected to be encoded in Unicode, specifically
 22 |     utf-8. This is one of the encodings required to be implemented by
 23 |     XML tools, is portable between big-endian and little-endian
 24 |     machines and is a superset of 7-bit ASCII. Code points beyond 127
 25 |     may be expressed directly in utf-8 or indirectly using numeric
 26 |     entities.  Since many tools today still only directly process
 27 |     ASCII characters, conversion should be available and
 28 |     standardized.  Offsets should be in 8 bit code units (bytes) for
 29 |     easier processing by naive programs.
 30 | 
 31 |     Nothing final. Just current thoughts.
 32 | 
 33 |     collection:  Group of documents, usually from a larger corpus. If
 34 |     a group of documents is from several corpora, use several
 35 |     collections.
 36 | 
 37 |     source:  Name of the source corpus from which the documents were selected
 38 | 
 39 |     date:  Date documents extracted from original source. Can be as
 40 |     simple as yyyymmdd or an ISO timestamp.
 41 | 
 42 |     key: Separate file describing the types used and any other useful
 43 |     information about the data in the file. For example, if a file
 44 |     includes part-of-speech tags, this file should describe the
 45 |     part-of-speech tags used.
 46 | 
 47 |     infon: key-value pairs. Can record essentially arbitrary
 48 |     information. "type" will be a particular common key in the major
 49 |     sub elements below. For PubMed references, passage "type" might
 50 |     signal "title" or "abstract". For annotations, it might indicate
 51 |     "noun phrase", "gene", or "disease". In the programming language
 52 |     data structures, infons are typically represented as a map from
 53 |     strings to strings.  This means keys should be unique within each
 54 |     parent element.
 55 | 
 56 |     document:  A document in the collection. A single, complete
 57 |     stand-alone document as described by its parent source.
 58 | 
 59 |     id:  Typically, the id of the document in the parent
 60 |     source. Should at least be unique in the collection.
 61 | 
 62 |     passage:  One portion of the document.  For now PubMed documents
 63 |     have a title and an abstract. Structured abstracts could have
 64 |     additional passages. For a full text document, passages could be
 65 |     sections such as Introduction, Materials and Methods, or
 66 |     Conclusion. Another option would be paragraphs. Passages impose a
 67 |     linear structure on the document. Further structure in the
 68 |     document can be implied by the infon["type"] value.
 69 | 
 70 |     offset: Where the passage occurs in the parent document. Depending
 71 |     on the source corpus, this might be a very relevant number.  They
 72 |     should be sequential and identify a passage's position in
 73 |     the document.  Since PubMed is extracted from an XML file, the
 74 |     title has an offset of zero, while the abstract is assumed to
 75 |     begin after the title and one space.
 76 | 
 77 |     text: The original text of the passage.
 78 | 
 79 |     sentence:  One sentence of the passage.
 80 | 
 81 |     offset: A document offset to where the sentence begins in the
 82 |     passage. This value is the sum of the passage offset and the local
 83 |     offset within the passage.
 84 | 
 85 |     text: The original text of the sentence.
 86 | 
 87 |     annotation:  Stand-off annotation
 88 | 
 89 |     id: Used to refer to this annotation in relations.
 90 | 
 91 |     location: Location of the annotated text. Multiple locations
 92 |     indicate a multi-span annotation.
 93 | 
 94 |     offset: Document offset to where the annotated text begins in
 95 |     the passage or sentence. The value is the sum of the passage or
 96 |     sentence offset and the local offset within the passage or
 97 |     sentence.
 98 | 
 99 |     length: Length of the annotated text. While unlikely, this could
100 |     be zero to describe an annotation that belongs between two
101 |     characters.
102 | 
103 |     text:  Unless something else is defined one would expect the
104 |     annotated text. The length is redundant in this case. Other uses
105 |     for this text could be the normalized ID for a gene in a gene
106 |     database. 
107 | 
108 |     relation: Relationship between multiple annotations.
109 | 
110 |     id: Used to refer to this relation in other relationships.
111 | 
112 |     refid: Id of an annotated object or other relation.
113 | 
114 |     role: Describes how the referenced annotated object or other
115 |     relation participates in the current relationship. Has a default
116 |     value so can be left out if there is no meaningful value.
117 | 
118 | -->
119 | 
120 | <!ELEMENT collection ( source, date, key, infon*, document+ ) >
121 | <!ELEMENT source (#PCDATA)>
122 | <!ELEMENT date (#PCDATA)>
123 | <!ELEMENT key (#PCDATA)>
124 | <!ELEMENT infon (#PCDATA)>
125 | <!ATTLIST infon key CDATA #REQUIRED >
126 | 
127 | <!ELEMENT document ( id, infon*, passage+, relation* ) >
128 | <!ELEMENT id (#PCDATA)>
129 | 
130 | <!ELEMENT passage ( infon*, offset, ( ( text?, annotation* ) | sentence* ), relation* ) >
131 | <!ELEMENT offset (#PCDATA)>
132 | <!ELEMENT text (#PCDATA)>
133 | 
134 | <!ELEMENT sentence ( infon*, offset, text?, annotation*, relation* ) >
135 | 
136 | <!ELEMENT annotation ( infon*, location*, text ) >
137 | <!ATTLIST annotation id CDATA #IMPLIED >
138 | <!ELEMENT location EMPTY>
139 | <!ATTLIST location offset CDATA #REQUIRED >
140 | <!ATTLIST location length CDATA #REQUIRED >
141 | 
142 | <!ELEMENT relation ( infon*, node* ) >
143 | <!ATTLIST relation id CDATA #IMPLIED >
144 | <!ELEMENT node EMPTY>
145 | <!ATTLIST node refid CDATA #REQUIRED >
146 | <!ATTLIST node role CDATA "" >
147 | 
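
The DTD comment above states that a sentence or annotation offset is the sum of the passage offset and the local offset, and that offsets should be counted in UTF-8 bytes. The following is a minimal sketch of that arithmetic, not part of tmVar: the class name `BioCOffsetSketch`, both helper methods, and the sample title and abstract are hypothetical. It also shows that a Java `String` index (UTF-16 code units) and a UTF-8 byte offset diverge once the text contains a non-ASCII character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the offset rules described in the BioC.dtd comment.
public class BioCOffsetSketch {

    // Document offset = passage offset + local offset within the passage.
    public static int documentOffset(int passageOffset, int localOffset) {
        return passageOffset + localOffset;
    }

    // Number of UTF-8 bytes that precede the character at charIndex.
    // Assumes charIndex does not fall in the middle of a surrogate pair.
    public static int utf8ByteOffset(String text, int charIndex) {
        return text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Made-up sample passages; \u03b2 is the Greek letter beta.
        String title = "BRAF V600E in melanoma";
        String abstractText = "The \u03b2-catenin variant p.S45F was found.";

        // Per the DTD comment: the abstract begins after the title and one space.
        int abstractOffset = title.length() + 1;

        int local = abstractText.indexOf("p.S45F");
        System.out.println("document offset (chars): " + documentOffset(abstractOffset, local));
        System.out.println("local offset (chars):    " + local);
        System.out.println("local offset (bytes):    " + utf8ByteOffset(abstractText, local));
    }
}
```

In this sample the character and byte offsets of the mention differ by one because the beta before it occupies two bytes in UTF-8. Code that mixes the two conventions will misplace annotations in passages containing such characters.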


--------------------------------------------------------------------------------
/COPYING:
--------------------------------------------------------------------------------
1 | PUBLIC DOMAIN NOTICE
2 | National Center for Biotechnology Information
3 |  
4 | This software/database is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as a United States Government employee and thus cannot be copyrighted. This software/database is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.
5 |  
6 | Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose. 
7 | 


--------------------------------------------------------------------------------
/DISCLAIMER:
--------------------------------------------------------------------------------
1 | This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available. 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # tmVar 3.0: an improved variant concept recognition and normalization tool
 2 | 
 3 | We propose tmVar 3.0: an improved variant recognition and normalization system. Compared to its predecessors, tmVar 3.0 recognizes a wider spectrum of variant-related entities (e.g., allele and copy number variants) and groups different variant mentions belonging to the same genomic sequence position in an article for improved accuracy. Moreover, tmVar 3.0 provides advanced variant normalization options such as allele-specific identifiers from the ClinGen Allele Registry. tmVar 3.0 exhibits state-of-the-art performance with over 90% in F-measure for variant recognition and normalization, when evaluated on three independent benchmarking datasets. tmVar 3.0 as well as annotations for the entire PubMed and PMC datasets are freely available for download on our FTP.
 4 | 
 5 | # tmVar 3.0 download from FTP
 6 | 
 7 | - [tmVar 3.0 software](https://ftp.ncbi.nlm.nih.gov/pub/lu/tmVar3/tmVar3.tar.gz)
 8 | - [tmVar 3.0 corpus](https://ftp.ncbi.nlm.nih.gov/pub/lu/tmVar3/tmVar3Corpus.txt)
 9 | - [tmVar 3.0 annotation guideline](https://ftp.ncbi.nlm.nih.gov/pub/lu/tmVar3/AnnotationGuideline.rev.docx)
10 | 
11 | # PubTator API to access tmVar 3.0
12 | 
13 | We host a RESTful API (https://www.ncbi.nlm.nih.gov/research/pubtator/api.html) through which users can access the tmVar 3.0 results for PubMed/PMC. The "Process Raw Text" section of the API page also shows how to submit raw text for online processing. We provide sample code in Java, Python and Perl to help users quickly become familiar with the API service.
14 | 
15 | # Acknowledgments
16 | 
17 | This research was supported by the Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health.
18 | 
19 | # Disclaimer
20 | 
21 | This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.
22 | 


--------------------------------------------------------------------------------
/src/tmVarlib/BioCConverter.java:
--------------------------------------------------------------------------------
  1 | //
  2 | // tmVar - Java version
  3 | // BioC Format Converter
  4 | //
  5 | package tmVarlib;
  6 | 
  7 | import bioc.BioCAnnotation;
  8 | import bioc.BioCCollection;
  9 | import bioc.BioCDocument;
 10 | import bioc.BioCLocation;
 11 | import bioc.BioCPassage;
 12 | import bioc.io.BioCDocumentWriter;
 13 | import bioc.io.BioCFactory;
 14 | import bioc.io.woodstox.ConnectorWoodstox;
 15 | 
 16 | import java.io.BufferedReader;
 17 | import java.io.BufferedWriter;
 18 | import java.io.FileInputStream;
 19 | import java.io.FileNotFoundException;
 20 | import java.io.FileOutputStream;
 21 | import java.io.FileReader;
 22 | import java.io.FileWriter;
 23 | import java.io.IOException;
 24 | import java.io.InputStreamReader;
 25 | import java.io.OutputStreamWriter;
 26 | import java.io.UnsupportedEncodingException;
 27 | import java.time.LocalDate;
 28 | import java.time.ZoneId;
 29 | import java.util.Map;
 30 | import java.util.regex.Matcher;
 31 | import java.util.regex.Pattern;
 32 | 
 33 | import javax.xml.stream.XMLStreamException;
 34 | 
 35 | import java.util.ArrayList;
 36 | import java.util.HashMap;
 37 | 
 38 | public class BioCConverter 
 39 | {
 40 | 	/*
 41 | 	 * Contexts in BioC file
 42 | 	 */
 43 | 	public ArrayList<String> PMIDs=new ArrayList<String>(); // Type: PMIDs
 44 | 	public ArrayList<ArrayList<String>> PassageNames = new ArrayList<>(); // PassageName
 45 | 	public ArrayList<ArrayList<Integer>> PassageOffsets = new ArrayList<>(); // PassageOffset
 46 | 	public ArrayList<ArrayList<String>> PassageContexts = new ArrayList<>(); // PassageContext
 47 | 	public ArrayList<ArrayList<ArrayList<String>>> Annotations = new ArrayList<>(); // Annotation - GNormPlus
 48 | 	
 49 | 	public String BioCFormatCheck(String InputFile) throws IOException
 50 | 	{
 51 | 		
 52 | 		ConnectorWoodstox connector = new ConnectorWoodstox();
 53 | 		BioCCollection collection = new BioCCollection();
 54 | 		try
 55 | 		{
 56 | 			collection = connector.startRead(new InputStreamReader(new FileInputStream(InputFile), "UTF-8"));
 57 | 		}
 58 | 		catch (UnsupportedEncodingException | FileNotFoundException | XMLStreamException e) 
 59 | 		{
 60 | 			BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(InputFile), "UTF-8"));
 61 | 			String line="";
 62 | 			String status="";
 63 | 			String Pmid = "";
 64 | 			boolean tiabs=false;
 65 | 			Pattern patt = Pattern.compile("^([^\\|\\t]+)\\|([^\\|\\t]+)\\|(.*)$");
 66 | 			while ((line = br.readLine()) != null)  
 67 | 			{
 68 | 				Matcher mat = patt.matcher(line);
 69 | 				if(mat.find()) //Title|Abstract
 70 | 	        	{
 71 | 					if(Pmid.equals(""))
 72 | 					{
 73 | 						Pmid = mat.group(1);
 74 | 					}
 75 | 					else if(!Pmid.equals(mat.group(1)))
 76 | 					{
 77 | 						return "[Error]: "+InputFile+" - A blank is needed between "+Pmid+" and "+mat.group(1)+".";
 78 | 					}
 79 | 					status = "tiabs";
 80 | 					tiabs = true;
 81 | 	        	}
 82 | 				else if (line.contains("\t")) //Annotation
 83 | 	        	{
 84 | 	        	}
 85 | 				else if(line.length()==0) //Processing
 86 | 				{
 87 | 					if(status.equals(""))
 88 | 					{
 89 | 						if(Pmid.equals(""))
 90 | 						{
 91 | 							return "[Error]: "+InputFile+" - It's not either BioC or PubTator format.";
 92 | 						}
 93 | 						else
 94 | 						{
 95 | 							return "[Error]: "+InputFile+" - A redundant blank is after "+Pmid+".";
 96 | 						}
 97 | 					}
 98 | 					Pmid="";
 99 | 					status="";
100 | 				}
101 | 			}
102 | 			br.close();
103 | 			if(tiabs == false)
104 | 			{
105 | 				return "[Error]: "+InputFile+" - It's not either BioC or PubTator format.";
106 | 			}
107 | 			if(status.equals(""))
108 | 			{
109 | 				return "PubTator";
110 | 			}
111 | 			else
112 | 			{
113 | 				return "[Error]: "+InputFile+" - The last column missed a blank.";
114 | 			}
115 | 		}
116 | 		return "BioC";
117 | 	}
118 | 	public void BioC2PubTator(String input,String output) throws IOException, XMLStreamException
119 | 	{
120 | 		/*
121 | 		 * BioC2PubTator
122 | 		 */
123 | 		HashMap<String, String> pmidlist = new HashMap<String, String>(); // check if appear duplicate pmids
124 | 		boolean duplicate = false;
125 | 		BufferedWriter PubTatorOutputFormat = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output), "UTF-8"));
126 | 		ConnectorWoodstox connector = new ConnectorWoodstox();
127 | 		BioCCollection collection = new BioCCollection();
128 | 		collection = connector.startRead(new InputStreamReader(new FileInputStream(input), "UTF-8"));
129 | 		while (connector.hasNext()) 
130 | 		{
131 | 			BioCDocument document = connector.next();
132 | 			String PMID = document.getID();
133 | 			if(pmidlist.containsKey(PMID)){System.out.println("\nError: duplicate pmid-"+PMID);duplicate = true;}
134 | 			else{pmidlist.put(PMID,"");}
135 | 			String Anno="";
136 | 			int realpassage_offset=0;
137 | 			for (BioCPassage passage : document.getPassages()) 
138 | 			{
139 | 				if(passage.getInfon("type").toLowerCase().equals("table"))
140 | 				{
141 | 					String temp=passage.getText().replaceAll(" ", ";");
142 | 					temp=temp.replaceAll(";>;", " > ");
143 | 					PubTatorOutputFormat.write(PMID+"|"+passage.getInfon("type")+"|"+temp+"\n");
144 | 				}
145 | 				else
146 | 				{
147 | 					String temp=passage.getText();
148 | 					if(passage.getText().equals(""))
149 | 					{
150 | 						PubTatorOutputFormat.write(PMID+"|"+passage.getInfon("type")+"|"+"\n");//- No text -
151 | 					}
152 | 					else
153 | 					{
154 | 						PubTatorOutputFormat.write(PMID+"|"+passage.getInfon("type")+"|"+temp+"\n");
155 | 					}
156 | 				}
157 | 				for (BioCAnnotation annotation : passage.getAnnotations()) 
158 | 				{
159 | 					String Annoid = annotation.getInfon("identifier");
160 | 					if(Annoid == null)
161 | 					{
162 | 						Annoid=annotation.getInfon("NCBI Gene");
163 | 					}
164 | 					if(Annoid == null)
165 | 					{
166 | 						Annoid = annotation.getInfon("Identifier");
167 | 					}
168 | 					String Annotype = annotation.getInfon("type");
169 | 					int start = annotation.getLocations().get(0).getOffset();
170 | 					start=start-(passage.getOffset()-realpassage_offset);
171 | 					int last = start + annotation.getLocations().get(0).getLength();
172 | 					String AnnoMention=annotation.getText();
173 | 					Anno=Anno+PMID+"\t"+start+"\t"+last+"\t"+AnnoMention+"\t"+Annotype+"\t"+Annoid+"\n";
174 | 				}
175 | 				realpassage_offset=realpassage_offset+passage.getText().length()+1;
176 | 			}
177 | 			PubTatorOutputFormat.write(Anno+"\n");
178 | 		}
179 | 		PubTatorOutputFormat.close();
180 | 		if(duplicate == true){System.exit(0);}
181 | 	}
182 | 	public void PubTator2BioC(String input,String output) throws IOException, XMLStreamException
183 | 	{
184 | 		/*
185 | 		 *  PubTator2BioC
186 | 		 */
187 | 		String parser = BioCFactory.WOODSTOX;
188 | 		BioCFactory factory = BioCFactory.newFactory(parser);
189 | 		BioCDocumentWriter BioCOutputFormat = factory.createBioCDocumentWriter(new OutputStreamWriter(new FileOutputStream(output), "UTF-8"));
190 | 		BioCCollection biocCollection = new BioCCollection();
191 | 		
192 | 		//time
193 | 		ZoneId zonedId = ZoneId.of( "America/Montreal" );
194 | 		LocalDate today = LocalDate.now( zonedId );
195 | 		biocCollection.setDate(today.toString());
196 | 		
197 | 		biocCollection.setKey("BioC.key");//key
198 | 		biocCollection.setSource("tmVar");//source
199 | 		
200 | 		BioCOutputFormat.writeCollectionInfo(biocCollection);
201 | 		BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream(input), "UTF-8"));
202 | 		ArrayList<String> ParagraphType=new ArrayList<String>(); // Type: Title|Abstract
203 | 		ArrayList<String> ParagraphContent = new ArrayList<String>(); // Text
204 | 		ArrayList<String> annotations = new ArrayList<String>(); // Annotation
205 | 		String line;
206 | 		String Pmid="";
207 | 		while ((line = inputfile.readLine()) != null)  
208 | 		{
209 | 			Pattern patt = Pattern.compile("^([^\\|\\t]+)\\|([^\\|\\t]+)\\|(.*)$");
210 | 			Matcher mat = patt.matcher(line);
211 | 			if(mat.find()) //Title|Abstract
212 | 	        {
213 | 				Pmid=mat.group(1); ParagraphType.add(mat.group(2)); // also record the PMID, otherwise setID(Pmid) below always receives ""
214 | 				ParagraphContent.add(mat.group(3));
215 | 	        }
216 | 			else if (line.contains("\t")) //Annotation
217 |         	{
218 | 				String anno[]=line.split("\t");
219 | 				annotations.add(anno[1]+"\t"+anno[2]+"\t"+anno[3]+"\t"+anno[4]+"\t"+anno[5]);
220 |         	}
221 | 			else if(line.length()==0) //Processing
222 | 			{
223 | 				BioCDocument biocDocument = new BioCDocument();
224 | 				biocDocument.setID(Pmid);
225 | 				int startoffset=0;
226 | 				for(int i=0;i<ParagraphType.size();i++)
227 | 				{
228 | 					BioCPassage biocPassage = new BioCPassage();
229 | 					Map<String, String> Infons = new HashMap<String, String>();
230 | 					Infons.put("type", ParagraphType.get(i));
231 | 					biocPassage.setInfons(Infons);
232 | 					biocPassage.setText(ParagraphContent.get(i));
233 | 					biocPassage.setOffset(startoffset);
234 | 					startoffset=startoffset+ParagraphContent.get(i).length()+1;
235 | 					for(int j=0;j<annotations.size();j++)
236 | 					{
237 | 						String anno[]=annotations.get(j).split("\t");
238 | 						if(Integer.parseInt(anno[0])<startoffset && Integer.parseInt(anno[0])>=startoffset-ParagraphContent.get(i).length()-1)
239 | 						{
240 | 							BioCAnnotation biocAnnotation = new BioCAnnotation();
241 | 							Map<String, String> AnnoInfons = new HashMap<String, String>();
242 | 							AnnoInfons.put("Identifier", anno[4]);
243 | 							AnnoInfons.put("type", anno[3]);
244 | 							biocAnnotation.setInfons(AnnoInfons);
245 | 							BioCLocation location = new BioCLocation();
246 | 							location.setOffset(Integer.parseInt(anno[0]));
247 | 							location.setLength(Integer.parseInt(anno[1])-Integer.parseInt(anno[0]));
248 | 							biocAnnotation.setLocation(location);
249 | 							biocAnnotation.setText(anno[2]);
250 | 							biocPassage.addAnnotation(biocAnnotation);
251 | 						}
252 | 					}
253 | 					biocDocument.addPassage(biocPassage);
254 | 				}
255 | 				biocCollection.addDocument(biocDocument);
256 | 				ParagraphType.clear();
257 | 				ParagraphContent.clear();
258 | 				annotations.clear();
259 | 				BioCOutputFormat.writeDocument(biocDocument);
260 | 			}
261 | 		}
262 | 		BioCOutputFormat.close();
263 | 		inputfile.close();
264 | 	}
265 | 	public void PubTator2BioC_AppendAnnotation(String inputPubTator,String inputBioc,String output) throws IOException, XMLStreamException
266 | 	{
267 | 		/*
268 | 		 *  PubTator2BioC_AppendAnnotation: append PubTator annotations to the passages of an existing BioC file
269 | 		 */
270 | 		
271 | 		//input: PubTator
272 | 		BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream(inputPubTator), "UTF-8"));
273 | 		HashMap<String, String> ParagraphType_hash = new HashMap<String, String>(); // Type: Title|Abstract
274 | 		HashMap<String, String> ParagraphContent_hash = new HashMap<String, String>(); // Text
275 | 		HashMap<String, String> annotations_hash = new HashMap<String, String>(); // Annotation
276 | 		String Annotation="";
277 | 		String Pmid="";
278 | 		String line="";
279 | 		while ((line = inputfile.readLine()) != null)  
280 | 		{
281 | 			Pattern patt = Pattern.compile("^([^\\|\\t]+)\\|([^\\|\\t]+)\\|(.*)$");
282 | 			Matcher mat = patt.matcher(line);
283 | 			if(mat.find()) //Title|Abstract
284 | 	        {
285 | 				Pmid=mat.group(1);
286 | 				ParagraphType_hash.put(Pmid,mat.group(2));
287 | 				ParagraphContent_hash.put(Pmid,mat.group(3));
288 | 			}
289 | 			else if (line.contains("\t")) //Annotation
290 |         	{
291 | 				if(Annotation.equals(""))
292 | 				{
293 | 					Annotation=line;
294 | 				}
295 | 				else
296 | 				{
297 | 					Annotation=Annotation+"\n"+line;
298 | 				}
299 |         	}
300 | 			else if(line.length()==0) //Processing
301 | 			{
302 | 				annotations_hash.put(Pmid,Annotation);
303 | 				Annotation="";
304 | 			}
305 | 		}
306 | 		inputfile.close();
307 | 		
308 | 		//output
309 | 		BioCDocumentWriter BioCOutputFormat = BioCFactory.newFactory(BioCFactory.WOODSTOX).createBioCDocumentWriter(new OutputStreamWriter(new FileOutputStream(output), "UTF-8"));
310 | 		BioCCollection biocCollection_input = new BioCCollection();
311 | 		BioCCollection biocCollection_output = new BioCCollection();
312 | 		
313 | 		//input: BioC
314 | 		ConnectorWoodstox connector = new ConnectorWoodstox();
315 | 		biocCollection_input = connector.startRead(new InputStreamReader(new FileInputStream(inputBioc), "UTF-8"));
316 | 		BioCOutputFormat.writeCollectionInfo(biocCollection_input);
317 | 		while (connector.hasNext()) 
318 | 		{
319 | 			int real_start_passage=0;
320 | 			BioCDocument document_output = new BioCDocument();
321 | 			BioCDocument document_input = connector.next();
322 | 			String PMID=document_input.getID();
323 | 			document_output.setID(PMID);
324 | 			int annotation_count=0;
325 | 			for (BioCPassage passage_input : document_input.getPassages()) 
326 | 			{
327 | 				String passage_input_Text = passage_input.getText();
328 | 				
329 | 				BioCPassage passage_output = passage_input;
330 | 				//passage_output.clearAnnotations(); //clean the previous annotation
331 | 				for (BioCAnnotation annotation : passage_output.getAnnotations()) 
332 | 				{
333 | 					annotation.setID(""+annotation_count);
334 | 					annotation_count++;
335 | 				}
336 | 				
337 | 				int start_passage=passage_input.getOffset();
338 | 				int last_passage=passage_input.getOffset()+passage_input.getText().length();
339 | 				if(annotations_hash.containsKey(PMID) && !annotations_hash.get(PMID).equals(""))
340 | 				{
341 | 					String Anno[]=annotations_hash.get(PMID).split("\\n");
342 | 					for(int i=0;i<Anno.length;i++)
343 | 					{
344 | 						String An[]=Anno[i].split("\\t");
345 | 						int start=Integer.parseInt(An[1]);
346 | 						int last=Integer.parseInt(An[2]);
347 | 						start = start + start_passage - real_start_passage;
348 | 						last = last + start_passage - real_start_passage;
349 | 						String mention=An[3];
350 | 						String type=An[4];
351 | 						
352 | 						String id="";
353 | 						if(An.length>5)
354 | 						{
355 | 							id=An[5];
356 | 						}
357 | 						if((start>=start_passage && start<last_passage)||(last>=start_passage && last<last_passage))
358 | 						{
359 | 							BioCAnnotation biocAnnotation = new BioCAnnotation();
360 | 							Map<String, String> AnnoInfons = new HashMap<String, String>();
361 | 							AnnoInfons.put("Identifier", id);
362 | 							AnnoInfons.put("type", type);
363 | 							biocAnnotation.setInfons(AnnoInfons);
364 | 							
365 | 							/*redirect the offset*/
366 | 							//location.setOffset(start);
367 | 							//location.setLength(last-start);
368 | 							String mention_tmp = mention.replaceAll("([^A-Za-z0-9@ ])", "\\\\$1");
369 | 							Pattern patt = Pattern.compile("^(.*)("+mention_tmp+")(.*)$");
370 | 							Matcher mat = patt.matcher(passage_input_Text);
371 | 							if(mat.find())
372 | 							{
373 | 								String pre=mat.group(1);
374 | 								String men=mat.group(2);
375 | 								String post=mat.group(3);
376 | 								start=pre.length()+start_passage;
377 | 								BioCLocation location = new BioCLocation();
378 | 								location.setOffset(start);
379 | 								location.setLength(men.length());
380 | 								biocAnnotation.setLocation(location);
381 | 								biocAnnotation.setText(mention);
382 | 								biocAnnotation.setID(""+annotation_count);
383 | 								annotation_count++;
384 | 								passage_output.addAnnotation(biocAnnotation);
385 | 								men=men.replaceAll(".", "@");
386 | 								passage_input_Text=pre+men+post;
387 | 							}
388 | 						}
389 | 					}
390 | 				}
391 | 				real_start_passage = real_start_passage + passage_input.getText().length() + 1;
392 | 				document_output.addPassage(passage_output);
393 | 			}
394 | 			biocCollection_output.addDocument(document_output);
395 | 			BioCOutputFormat.writeDocument(document_output);
396 | 		}
397 | 		BioCOutputFormat.close();
398 | 	}
399 | 	public void BioCReaderWithAnnotation(String input) throws IOException, XMLStreamException
400 | 	{
401 | 		ConnectorWoodstox connector = new ConnectorWoodstox();
402 | 		BioCCollection collection = new BioCCollection();
403 | 		collection = connector.startRead(new InputStreamReader(new FileInputStream(input), "UTF-8"));
404 | 		
405 | 		/*
406 | 		 * Per document
407 | 		 */
408 | 		while (connector.hasNext()) 
409 | 		{
410 | 			BioCDocument document = connector.next();
411 | 			PMIDs.add(document.getID());
412 | 			
413 | 			ArrayList<String> PassageName= new ArrayList<String>(); // array of Passage name
414 | 			ArrayList<Integer> PassageOffset= new ArrayList<Integer>(); // array of Passage offset
415 | 			ArrayList<String> PassageContext= new ArrayList<String>(); // array of Passage context
416 | 			ArrayList<ArrayList<String>> AnnotationInPMID= new ArrayList<>(); // annotations of this document, one list per passage
417 | 			
418 | 			/*
419 | 			 * Per Passage
420 | 			 */
421 | 			for (BioCPassage passage : document.getPassages()) 
422 | 			{
423 | 				PassageName.add(passage.getInfon("type")); //Paragraph
424 | 				
425 | 				String txt = passage.getText();
426 | 				if(txt.matches("[\t ]+"))
427 | 				{
428 | 					txt = txt.replaceAll(".","@");
429 | 				}
430 | 				else
431 | 				{
432 | 					//if(passage.getInfon("type").toLowerCase().equals("table"))
433 | 					//{
434 | 					//	txt=txt.replaceAll(" ", "|");
435 | 					//}
436 | 					txt = txt.replaceAll("ω","w");
437 | 					txt = txt.replaceAll("μ","u");
438 | 					txt = txt.replaceAll("κ","k");
439 | 					txt = txt.replaceAll("α","a");
440 | 					txt = txt.replaceAll("γ","g");
441 | 					txt = txt.replaceAll("β","b");
442 | 					txt = txt.replaceAll("×","x");
443 | 					txt = txt.replaceAll("¹","1");
444 | 					txt = txt.replaceAll("²","2");
445 | 					txt = txt.replaceAll("°","o");
446 | 					txt = txt.replaceAll("ö","o");
447 | 					txt = txt.replaceAll("é","e");
448 | 					txt = txt.replaceAll("à","a");
449 | 					txt = txt.replaceAll("Á","A");
450 | 					txt = txt.replaceAll("ε","e");
451 | 					txt = txt.replaceAll("θ","O");
452 | 					txt = txt.replaceAll("•",".");
453 | 					txt = txt.replaceAll("µ","u");
454 | 					txt = txt.replaceAll("λ","r");
455 | 					txt = txt.replaceAll("⁺","+");
456 | 					txt = txt.replaceAll("ν","v");
457 | 					txt = txt.replaceAll("ï","i");
458 | 					txt = txt.replaceAll("ã","a");
459 | 					txt = txt.replaceAll("≡","=");
460 | 					txt = txt.replaceAll("ó","o");
461 | 					txt = txt.replaceAll("³","3");
462 | 					txt = txt.replaceAll("〖","[");
463 | 					txt = txt.replaceAll("〗","]");
464 | 					txt = txt.replaceAll("Å","A");
465 | 					txt = txt.replaceAll("ρ","p");
466 | 					txt = txt.replaceAll("ü","u");
467 | 					txt = txt.replaceAll("ɛ","e");
468 | 					txt = txt.replaceAll("č","c");
469 | 					txt = txt.replaceAll("š","s");
470 | 					txt = txt.replaceAll("ß","b");
471 | 					txt = txt.replaceAll("═","=");
472 | 					txt = txt.replaceAll("£","L");
473 | 					txt = txt.replaceAll("Ł","L");
474 | 					txt = txt.replaceAll("ƒ","f");
475 | 					txt = txt.replaceAll("ä","a");
476 | 					txt = txt.replaceAll("–","-");
477 | 					txt = txt.replaceAll("⁻","-");
478 | 					txt = txt.replaceAll("〈","<");
479 | 					txt = txt.replaceAll("〉",">");
480 | 					txt = txt.replaceAll("χ","X");
481 | 					txt = txt.replaceAll("Đ","D");
482 | 					txt = txt.replaceAll("‰","%");
483 | 					txt = txt.replaceAll("·",".");
484 | 					txt = txt.replaceAll("→",">");
485 | 					txt = txt.replaceAll("←","<");
486 | 					txt = txt.replaceAll("ζ","z");
487 | 					txt = txt.replaceAll("π","p");
488 | 					txt = txt.replaceAll("τ","t");
489 | 					txt = txt.replaceAll("ξ","X");
490 | 					txt = txt.replaceAll("η","h");
491 | 					txt = txt.replaceAll("ø","0");
492 | 					txt = txt.replaceAll("Δ","D");
493 | 					txt = txt.replaceAll("∆","D");
494 | 					txt = txt.replaceAll("∑","S");
495 | 					txt = txt.replaceAll("Ω","O");
496 | 					txt = txt.replaceAll("δ","d");
497 | 					txt = txt.replaceAll("σ","s");
498 | 					txt = txt.replaceAll("Φ","F");
499 | 					//txt = txt.replaceAll("[^\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)\\_\\+\\{\\}\\|\\:\"\\<\\>\\?\\`\\-\\=\\[\\]\\;\\'\\,\\.\\/\\r\\n0-9a-zA-Z ]"," ");
500 | 				}
501 | 				if(passage.getText().equals("") || passage.getText().matches("[ ]+"))
502 | 				{
503 | 					PassageContext.add("-notext-"); //Context
504 | 				}
505 | 				else
506 | 				{
507 | 					PassageContext.add(txt); //Context
508 | 				}
509 | 				PassageOffset.add(passage.getOffset()); //Offset
510 | 				ArrayList<String> AnnotationInPassage= new ArrayList<String>(); // array of Annotations in the PassageName
511 | 				
512 | 				/*
513 | 				 * Per Annotation :
514 | 				 * start
515 | 				 * last
516 | 				 * mention
517 | 				 * type
518 | 				 * id
519 | 				 */
520 | 				for (BioCAnnotation Anno : passage.getAnnotations()) 
521 | 				{
522 | 					int start = Anno.getLocations().get(0).getOffset()-passage.getOffset(); // start
523 | 					int last = start + Anno.getLocations().get(0).getLength(); // last
524 | 					String AnnoMention=Anno.getText(); // mention
525 | 					String Annotype = Anno.getInfon("type"); // type
526 | 
527 | 					String Annoid="";
528 | 					Map<String,String> Infons=Anno.getInfons();
529 | 					for(String Infon :Infons.keySet())
530 | 					{
531 | 						if(!Infon.toLowerCase().equals("type"))
532 | 						{
533 | 							if(Annoid.equals(""))
534 | 							{
535 | 								Annoid=Infons.get(Infon);
536 | 							}
537 | 							else
538 | 							{
539 | 								Annoid=Annoid+";"+Infons.get(Infon);
540 | 							}
541 | 						}
542 | 					}
543 | 					
 544 | 					if(Annoid.equals("")) // compare string contents; == only compares references
545 | 					{
546 | 						AnnotationInPassage.add(start+"\t"+last+"\t"+AnnoMention+"\t"+Annotype); //paragraph
547 | 					}
548 | 					else
549 | 					{
550 | 						AnnotationInPassage.add(start+"\t"+last+"\t"+AnnoMention+"\t"+Annotype+"\t"+Annoid); //paragraph
551 | 					}
552 | 				}
553 | 				AnnotationInPMID.add(AnnotationInPassage);
554 | 			}
555 | 			PassageNames.add(PassageName);
556 | 			PassageContexts.add(PassageContext);
557 | 			PassageOffsets.add(PassageOffset);
558 | 			Annotations.add(AnnotationInPMID);
559 | 		}	
560 | 	}
561 | 	public void BioCOutput(String input,String output) throws IOException, XMLStreamException
562 | 	{
563 | 		BioCDocumentWriter BioCOutputFormat = BioCFactory.newFactory(BioCFactory.WOODSTOX).createBioCDocumentWriter(new OutputStreamWriter(new FileOutputStream(output), "UTF-8"));
564 | 		BioCCollection biocCollection_input = new BioCCollection();
565 | 		BioCCollection biocCollection_output = new BioCCollection();
566 | 		
567 | 		//input: BioC
568 | 		ConnectorWoodstox connector = new ConnectorWoodstox();
569 | 		biocCollection_input = connector.startRead(new InputStreamReader(new FileInputStream(input), "UTF-8"));
570 | 		BioCOutputFormat.writeCollectionInfo(biocCollection_input);
571 | 		int i=0; //count for pmid
572 | 		while (connector.hasNext()) 
573 | 		{
574 | 			BioCDocument document_output = new BioCDocument();
575 | 			BioCDocument document_input = connector.next();
576 | 			document_output.setID(document_input.getID());
577 | 			int annotation_count=0;
578 | 			int j=0; //count for paragraph
579 | 			for (BioCPassage passage_input : document_input.getPassages()) 
580 | 			{
581 | 				BioCPassage passage_output = passage_input;
582 | 				passage_output.clearAnnotations(); //clean the previous annotation
583 | 				int passage_Offset = passage_input.getOffset();
584 | 				String passage_Text = passage_input.getText();
585 | 				ArrayList<String> AnnotationInPassage = Annotations.get(i).get(j);
586 | 				for(int a=0;a<AnnotationInPassage.size();a++)
587 | 				{
588 | 					String Anno[]=AnnotationInPassage.get(a).split("\\t",-1);
589 | 					int start = Integer.parseInt(Anno[0]);
590 | 					int last = Integer.parseInt(Anno[1]);
591 | 					String mention = Anno[2];
592 | 					String type = Anno[3];
593 | 					BioCAnnotation biocAnnotation = new BioCAnnotation();
594 | 					Map<String, String> AnnoInfons = new HashMap<String, String>();
595 | 					AnnoInfons.put("type", type);
596 | 					String identifier="";
597 | 					if(Anno.length==5){identifier=Anno[4];}
598 | 					if(type.equals("Gene"))
599 | 					{
600 | 						AnnoInfons.put("NCBI Gene", identifier);
601 | 					}
602 | 					else if(type.equals("Species"))
603 | 					{
604 | 						AnnoInfons.put("NCBI Taxonomy", identifier);
605 | 					}
606 | 					else
607 | 					{
608 | 						AnnoInfons.put("Identifier", identifier);
609 | 					}
610 | 					biocAnnotation.setInfons(AnnoInfons);
611 | 					BioCLocation location = new BioCLocation();
612 | 					location.setOffset(start+passage_Offset);
613 | 					location.setLength(last-start);
614 | 					biocAnnotation.setLocation(location);
615 | 					biocAnnotation.setText(mention);
616 | 					biocAnnotation.setID(""+annotation_count);
617 | 					annotation_count++;
618 | 					passage_output.addAnnotation(biocAnnotation);
619 | 				}
620 | 				document_output.addPassage(passage_output);
621 | 				j++;
622 | 			}
623 | 			biocCollection_output.addDocument(document_output);
624 | 			BioCOutputFormat.writeDocument(document_output);
625 | 			i++;
626 | 		}
627 | 		BioCOutputFormat.close();
628 | 	}	
629 | }
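
Note on the character substitutions in BioCConverter above: each replaceAll call swaps one non-ASCII character for exactly one ASCII character, so the text keeps its length and the annotation offsets stay valid. The same idea can be sketched with a single lookup table and one pass over the text. This is only an illustration, not the project's code: the class name CharFoldSketch, the method fold, and the six-entry table are assumptions, and the table covers only a subset of the substitutions listed above.

```java
import java.util.HashMap;
import java.util.Map;

public class CharFoldSketch {
    // Hypothetical subset of the one-to-one substitutions used in BioCConverter.
    // Unicode escapes keep this file independent of the compiler's source encoding.
    private static final Map<Character, Character> FOLD = new HashMap<>();
    static {
        FOLD.put('\u03b1', 'a'); // Greek small alpha
        FOLD.put('\u03b2', 'b'); // Greek small beta
        FOLD.put('\u03bc', 'u'); // Greek small mu
        FOLD.put('\u00b5', 'u'); // micro sign
        FOLD.put('\u00d7', 'x'); // multiplication sign
        FOLD.put('\u2013', '-'); // en dash
    }

    // Replaces every mapped character with its ASCII stand-in in one pass.
    // Each mapping is one char to one char, so the length never changes.
    static String fold(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character mapped = FOLD.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "TGF-\u03b21 at 5 \u03bcM";
        String folded = fold(original);
        System.out.println(folded);
        System.out.println(folded.length() == original.length());
    }
}
```

A table like this avoids compiling one regular expression per character, and it makes the length-preserving property easy to check in a test.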

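Note on the component loops in MentionRecognition.FeatureExtraction (next file): every loop over type, W, P, M, F, S, T and D performs the same three steps. It finds the first occurrence of the component in the working copy of the mention, labels those character positions in character_hash, and overwrites the matched characters with '@' so a later component cannot match them again. A minimal sketch of one shared helper follows. The class and method names are hypothetical, and it swaps the escaped regular expression for a literal indexOf search, which is assumed to be equivalent here because the original escapes every non-alphanumeric character before matching.

```java
import java.util.HashMap;
import java.util.Map;

public class ComponentLabelSketch {
    /**
     * Labels the first occurrence of each component inside the mention and
     * masks it with '@'. Returns the masked mention for the next call.
     * 'start' is the offset of the mention within the passage text.
     */
    static String labelComponents(String[] components, String label, String mention,
                                  int start, Map<Integer, String> characterHash) {
        StringBuilder masked = new StringBuilder(mention);
        for (String component : components) {
            int at = masked.indexOf(component); // first literal occurrence
            if (at < 0) {
                System.out.println("Error! Cannot find component: " + label + "\t" + mention);
                continue;
            }
            for (int j = at; j < at + component.length(); j++) {
                characterHash.put(j + start, label); // label by absolute offset
                masked.setCharAt(j, '@');            // hide from later components
            }
        }
        return masked.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> characterHash = new HashMap<>();
        String masked = "c.123A>G"; // hypothetical mention starting at offset 10
        masked = labelComponents(new String[] {"c"}, "A", masked, 10, characterHash);
        masked = labelComponents(new String[] {"A"}, "W", masked, 10, characterHash);
        masked = labelComponents(new String[] {"123"}, "P", masked, 10, characterHash);
        masked = labelComponents(new String[] {"G"}, "M", masked, 10, characterHash);
        System.out.println(masked);
        System.out.println(characterHash);
    }
}
```

With a helper of this shape, each branch (m1 to m6) would reduce to one call per component group, passing the masked mention returned by the previous call.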

--------------------------------------------------------------------------------
/src/tmVarlib/MentionRecognition.java:
--------------------------------------------------------------------------------
   1 | //
   2 | // tmVar - Java version
   3 | // Feature Extraction
   4 | //
   5 | package tmVarlib;
   6 | 
   7 | import java.io.*;
   8 | import java.util.*;
   9 | import java.util.regex.Matcher;
  10 | import java.util.regex.Pattern;
  11 | 
  12 | import org.tartarus.snowball.SnowballStemmer;
  13 | import org.tartarus.snowball.ext.englishStemmer;
  14 | 
  15 | import edu.stanford.nlp.tagger.maxent.MaxentTagger;
  16 | 
  17 | public class MentionRecognition 
  18 | {
  19 | 	public void FeatureExtraction(String Filename,String FilenameData,String FilenameLoca,String TrainTest)
  20 | 	{
  21 | 		/*
  22 | 		 * Feature Extraction
  23 | 		 */
  24 | 		try 
  25 | 		{
  26 | 			//input
  27 | 			BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream(Filename), "UTF-8"));
  28 | 			//output
  29 | 			BufferedWriter FileLocation = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(FilenameLoca), "UTF-8")); // .location
  30 | 			BufferedWriter FileData = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(FilenameData), "UTF-8")); // .data
  31 | 			//parameters
  32 | 			String Pmid="";
  33 | 			ArrayList<String> ParagraphType=new ArrayList<String>(); // Type: Title|Abstract
  34 | 			ArrayList<String> ParagraphContent = new ArrayList<String>(); // Text
  35 | 			ArrayList<String> annotations = new ArrayList<String>(); // Annotation
  36 | 			HashMap<Integer, String> RegEx_HGVs_hash = new HashMap<Integer, String>(); // RegEx_HGVs_hash
  37 | 			HashMap<Integer, String> character_hash = new HashMap<Integer, String>();
  38 | 			Pattern pat = Pattern.compile("^([^\\|\\t]+)\\|([^\\|\\t]+)\\|(.*)$"); // Title|Abstract line; compiled once and reused for every input line
  39 | 			String line;
  40 | 			while ((line = inputfile.readLine()) != null)  
  41 | 			{
  42 | 				
  43 | 				Matcher mat = pat.matcher(line);
  44 | 				if(mat.find()) //Title|Abstract
  45 | 	        	{
  46 | 					Pmid = mat.group(1);
  47 | 					ParagraphType.add(mat.group(2));
  48 | 					ParagraphContent.add(mat.group(3));
  49 | 				}
  50 | 				else if (line.contains("\t")) //Annotation
  51 | 	        	{
  52 | 					String anno[]=line.split("\t");
  53 | 	        		if(anno.length>=6)
  54 | 	        		{
  55 | 	        			String mentiontype=anno[4];
  56 | 	        			if(mentiontype.equals("Gene"))
  57 | 	        			{
  58 | 	        				tmVar.GeneMention=true;
  59 | 	        			}
  60 | 	        			
  61 | 	        			if(TrainTest.equals("Train"))
  62 | 	        			{
  63 | 	        				int start= Integer.parseInt(anno[1]);
  64 | 		        			int last= Integer.parseInt(anno[2]);
  65 | 		        			String mention=anno[3];
  66 | 		        			String component=anno[5];
  67 | 		        			
  68 | 		        			Matcher m1 = tmVar.Pattern_Component_1.matcher(component);
  69 | 		        			Matcher m2 = tmVar.Pattern_Component_2.matcher(component);
  70 | 		        			Matcher m3 = tmVar.Pattern_Component_3.matcher(component);
  71 | 		        			Matcher m4 = tmVar.Pattern_Component_4.matcher(component);
  72 | 		        			Matcher m5 = tmVar.Pattern_Component_5.matcher(component);
  73 | 		        			Matcher m6 = tmVar.Pattern_Component_6.matcher(component);
  74 | 		        			
  75 | 		        			for(int s=start;s<last;s++)
  76 | 		        			{
  77 | 		        				character_hash.put(s,"I");
  78 | 		        			}
  79 | 		        			
  80 | 		        			if(m1.find())
  81 | 		        			{
  82 | 		        				String type[]=m1.group(1).split(",");
  83 | 		        				String W[]=m1.group(2).split(",");
  84 | 		        				String P[]=m1.group(3).split(",");
  85 | 		        				String M[]=m1.group(4).split(",");
  86 | 		        				String F[]=m1.group(5).split(",");
  87 | 		        				String S[]=m1.group(6).split(",");
  88 | 		        				String mention_tmp=mention;
  89 | 		        				for(int i=0;i<type.length;i++)
  90 | 		        				{
  91 | 		        					type[i]=type[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
  92 | 		        					String patt="^(.*?)("+type[i]+")(.*?)$";
  93 | 		        					Pattern ptmp = Pattern.compile(patt);
  94 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
  95 | 		        					if(mtmp.find())
  96 | 		        					{
  97 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
  98 | 		        						{
  99 | 		        							character_hash.put(j+start,"A");
 100 | 		        						}
 101 | 		        						String mtmp2_tmp="";
 102 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 103 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 104 | 		        					}
 105 | 		        					else
 106 | 		        					{
 107 | 		        						System.out.println("Error! Cannot find component: m1\ttype\t"+Pmid+"\t"+mention);
 108 | 		        					}
 109 | 		        				}
 110 | 		        				for(int i=0;i<W.length;i++)
 111 | 		        				{
 112 | 		        					W[i]=W[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 113 | 		        					String patt="^(.*?)("+W[i]+")(.*?)$";
 114 | 		        					Pattern ptmp = Pattern.compile(patt);
 115 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 116 | 		        					if(mtmp.find())
 117 | 		        					{
 118 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 119 | 		        						{
 120 | 		        							character_hash.put(j+start,"W");
 121 | 		        						}
 122 | 		        						String mtmp2_tmp="";
 123 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 124 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 125 | 		        					}
 126 | 		        					else
 127 | 		        					{
 128 | 		        						System.out.println("Error! Cannot find component: m1\tW\t"+Pmid+"\t"+mention);
 129 | 		        					}
 130 | 		        				}
 131 | 		        				for(int i=0;i<P.length;i++)
 132 | 		        				{
 133 | 		        					P[i]=P[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 134 | 		        					String patt="^(.*?)("+P[i]+")(.*?)$";
 135 | 		        					Pattern ptmp = Pattern.compile(patt);
 136 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 137 | 		        					if(mtmp.find())
 138 | 		        					{
 139 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 140 | 		        						{
 141 | 		        							character_hash.put(j+start,"P");
 142 | 		        						}
 143 | 		        						String mtmp2_tmp="";
 144 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 145 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 146 | 		        					}
 147 | 		        					else
 148 | 		        					{
 149 | 		        						System.out.println("Error! Cannot find component: m1\tP\t"+Pmid+"\t"+mention);
 150 | 		        					}
 151 | 		        				}
 152 | 		        				for(int i=0;i<M.length;i++)
 153 | 		        				{
 154 | 		        					M[i]=M[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 155 | 		        					String patt="^(.*?)("+M[i]+")(.*?)$";
 156 | 		        					Pattern ptmp = Pattern.compile(patt);
 157 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 158 | 		        					if(mtmp.find())
 159 | 		        					{
 160 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 161 | 		        						{
 162 | 		        							character_hash.put(j+start,"M");
 163 | 		        						}
 164 | 		        						String mtmp2_tmp="";
 165 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 166 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 167 | 		        					}
 168 | 		        					else
 169 | 		        					{
 170 | 		        						System.out.println("Error! Cannot find component: m1\tM\t"+Pmid+"\t"+mention);
 171 | 		        					}
 172 | 		        				}
 173 | 		        				for(int i=0;i<F.length;i++)
 174 | 		        				{
 175 | 		        					F[i]=F[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 176 | 		        					String patt="^(.*?)("+F[i]+")(.*?)$";
 177 | 		        					Pattern ptmp = Pattern.compile(patt);
 178 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 179 | 		        					if(mtmp.find())
 180 | 		        					{
 181 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 182 | 		        						{
 183 | 		        							character_hash.put(j+start,"F");
 184 | 		        						}
 185 | 		        						String mtmp2_tmp="";
 186 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 187 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 188 | 		        					}
 189 | 		        					else
 190 | 		        					{
 191 | 		        						System.out.println("Error! Cannot find component: m1\tF\t"+Pmid+"\t"+mention);
 192 | 		        					}
 193 | 		        				}
 194 | 		        				for(int i=0;i<S.length;i++)
 195 | 		        				{
 196 | 		        					S[i]=S[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 197 | 		        					String patt="^(.*?)("+S[i]+")(.*?)$";
 198 | 		        					Pattern ptmp = Pattern.compile(patt);
 199 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 200 | 		        					if(mtmp.find())
 201 | 		        					{
 202 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 203 | 		        						{
 204 | 		        							character_hash.put(j+start,"S");
 205 | 		        						}
 206 | 		        						String mtmp2_tmp="";
 207 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 208 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 209 | 		        					}
 210 | 		        					else
 211 | 		        					{
 212 | 		        						System.out.println("Error! Cannot find component: m1\tS\t"+Pmid+"\t"+mention);
 213 | 		        					}
 214 | 		        				}
 215 | 		        			}
 216 | 		        			else if(m2.find())
 217 | 		        			{
 218 | 		        				String type[]=m2.group(1).split(",");
 219 | 		        				String W[]=m2.group(2).split(",");
 220 | 		        				String P[]=m2.group(3).split(",");
 221 | 		        				String M[]=m2.group(4).split(",");
 222 | 		        				String F[]=m2.group(5).split(",");
 223 | 		        				String mention_tmp=mention;
 224 | 		        				for(int i=0;i<type.length;i++)
 225 | 		        				{
 226 | 		        					type[i]=type[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 227 | 		        					String patt="^(.*?)("+type[i]+")(.*?)$";
 228 | 		        					Pattern ptmp = Pattern.compile(patt);
 229 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 230 | 		        					if(mtmp.find())
 231 | 		        					{
 232 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 233 | 		        						{
 234 | 		        							character_hash.put(j+start,"A");
 235 | 		        						}
 236 | 		        						String mtmp2_tmp="";
 237 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 238 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 239 | 		        					}
 240 | 		        					else
 241 | 		        					{
 242 | 		        						System.out.println("Error! Cannot find component: m2\tType\t"+Pmid+"\t"+mention);
 243 | 		        					}
 244 | 		        				}
 245 | 		        				for(int i=0;i<W.length;i++)
 246 | 		        				{
 247 | 		        					W[i]=W[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 248 | 		        					String patt="^(.*?)("+W[i]+")(.*?)$";
 249 | 		        					Pattern ptmp = Pattern.compile(patt);
 250 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 251 | 		        					if(mtmp.find())
 252 | 		        					{
 253 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 254 | 		        						{
 255 | 		        							character_hash.put(j+start,"W");
 256 | 		        						}
 257 | 		        						String mtmp2_tmp="";
 258 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 259 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 260 | 		        					}
 261 | 		        					else
 262 | 		        					{
 263 | 		        						System.out.println("Error! Cannot find component: m2\tW\t"+Pmid+"\t"+mention);
 264 | 		        					}
 265 | 		        				}
 266 | 		        				for(int i=0;i<P.length;i++)
 267 | 		        				{
 268 | 		        					P[i]=P[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 269 | 		        					String patt="^(.*?)("+P[i]+")(.*?)$";
 270 | 		        					Pattern ptmp = Pattern.compile(patt);
 271 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 272 | 		        					if(mtmp.find())
 273 | 		        					{
 274 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 275 | 		        						{
 276 | 		        							character_hash.put(j+start,"P");
 277 | 		        						}
 278 | 		        						String mtmp2_tmp="";
 279 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 280 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 281 | 		        					}
 282 | 		        					else
 283 | 		        					{
 284 | 		        						System.out.println("Error! Cannot find component: m2\tP\t"+Pmid+"\t"+mention);
 285 | 		        					}
 286 | 		        				}
 287 | 		        				for(int i=0;i<M.length;i++)
 288 | 		        				{
 289 | 		        					M[i]=M[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 290 | 		        					String patt="^(.*?)("+M[i]+")(.*?)$";
 291 | 		        					Pattern ptmp = Pattern.compile(patt);
 292 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 293 | 		        					if(mtmp.find())
 294 | 		        					{
 295 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 296 | 		        						{
 297 | 		        							character_hash.put(j+start,"M");
 298 | 		        						}
 299 | 		        						String mtmp2_tmp="";
 300 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 301 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 302 | 		        					}
 303 | 		        					else
 304 | 		        					{
 305 | 		        						System.out.println("Error! Cannot find component: m2\tM\t"+Pmid+"\t"+mention);
 306 | 		        					}
 307 | 		        				}
 308 | 		        				for(int i=0;i<F.length;i++)
 309 | 		        				{
 310 | 		        					F[i]=F[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 311 | 		        					String patt="^(.*?)("+F[i]+")(.*?)$";
 312 | 		        					Pattern ptmp = Pattern.compile(patt);
 313 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 314 | 		        					if(mtmp.find())
 315 | 		        					{
 316 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 317 | 		        						{
 318 | 		        							character_hash.put(j+start,"F");
 319 | 		        						}
 320 | 		        						String mtmp2_tmp="";
 321 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 322 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 323 | 		        					}
 324 | 		        					else
 325 | 		        					{
 326 | 		        						System.out.println("Error! Cannot find component: m2\tF\t"+Pmid+"\t"+mention);
 327 | 		        					}
 328 | 		        				}
 329 | 		        			}
 330 | 		        			else if(m3.find())
 331 | 		        			{
 332 | 		        				String type[]=m3.group(1).split(",");
 333 | 		        				String T[]=m3.group(2).split(",");
 334 | 		        				String P[]=m3.group(4).split(",");
 335 | 		        				String M[]=m3.group(5).split(",");
 336 | 		        				String mention_tmp=mention;
 337 | 		        				for(int i=0;i<type.length;i++)
 338 | 		        				{
 339 | 		        					type[i]=type[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 340 | 		        					String patt="^(.*?)("+type[i]+")(.*?)$";
 341 | 		        					Pattern ptmp = Pattern.compile(patt);
 342 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 343 | 		        					if(mtmp.find())
 344 | 		        					{
 345 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 346 | 		        						{
 347 | 		        							character_hash.put(j+start,"A");
 348 | 		        						}
 349 | 		        						String mtmp2_tmp="";
 350 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 351 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 352 | 		        					}
 353 | 		        					else
 354 | 		        					{
 355 | 		        						System.out.println("Error! Cannot find component: m3\tType\t"+Pmid+"\t"+mention);
 356 | 		        					}
 357 | 		        				}
 358 | 		        				for(int i=0;i<P.length;i++)
 359 | 		        				{
 360 | 		        					P[i]=P[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 361 | 		        					String patt="^(.*?)("+P[i]+")(.*?)$";
 362 | 		        					Pattern ptmp = Pattern.compile(patt);
 363 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 364 | 		        					if(mtmp.find())
 365 | 		        					{
 366 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 367 | 		        						{
 368 | 		        							character_hash.put(j+start,"P");
 369 | 		        						}
 370 | 		        						String mtmp2_tmp="";
 371 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 372 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 373 | 		        					}
 374 | 		        					else
 375 | 		        					{
 376 | 		        						System.out.println("Error! Cannot find component: m3\tP\t"+Pmid+"\t"+mention);
 377 | 		        					}
 378 | 		        				}
 379 | 		        				for(int i=0;i<T.length;i++)
 380 | 		        				{
 381 | 		        					T[i]=T[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 382 | 		        					String patt="^(.*?)("+T[i]+")(.*?)$";
 383 | 		        					Pattern ptmp = Pattern.compile(patt);
 384 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 385 | 		        					if(mtmp.find())
 386 | 		        					{
 387 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 388 | 		        						{
 389 | 		        							character_hash.put(j+start,"T");
 390 | 		        						}
 391 | 		        						String mtmp2_tmp="";
 392 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 393 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 394 | 		        					}
 395 | 		        					else
 396 | 		        					{
 397 | 		        						System.out.println("Error! Cannot find component: m3\tT\t"+Pmid+"\t"+mention);
 398 | 		        					}
 399 | 		        				}
 400 | 		        				for(int i=0;i<M.length;i++)
 401 | 		        				{
 402 | 		        					M[i]=M[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 403 | 		        					String patt="^(.*?)("+M[i]+")(.*?)$";
 404 | 		        					Pattern ptmp = Pattern.compile(patt);
 405 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 406 | 		        					if(mtmp.find())
 407 | 		        					{
 408 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 409 | 		        						{
 410 | 		        							character_hash.put(j+start,"M");
 411 | 		        						}
 412 | 		        						String mtmp2_tmp="";
 413 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 414 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 415 | 		        					}
 416 | 		        					else
 417 | 		        					{
 418 | 		        						System.out.println("Error! Cannot find component: m3\tM\t"+Pmid+"\t"+mention);
 419 | 		        					}
 420 | 		        				}
 421 | 		        			}
 422 | 		        			else if(m4.find())
 423 | 		        			{
 424 | 		        				String type[]=m4.group(1).split(",");
 425 | 		        				String W[]=m4.group(2).split(",");
 426 | 		        				String P[]=m4.group(3).split(",");
 427 | 		        				String M[]=m4.group(4).split(",");
 428 | 		        				String mention_tmp=mention;
 429 | 		        				for(int i=0;i<type.length;i++)
 430 | 		        				{
 431 | 		        					type[i]=type[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 432 | 		        					String patt="^(.*?)("+type[i]+")(.*?)$";
 433 | 		        					Pattern ptmp = Pattern.compile(patt);
 434 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 435 | 		        					if(mtmp.find())
 436 | 		        					{
 437 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 438 | 		        						{
 439 | 		        							character_hash.put(j+start,"A");
 440 | 		        						}
 441 | 		        						String mtmp2_tmp="";
 442 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 443 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 444 | 		        					}
 445 | 		        					else
 446 | 		        					{
 447 | 		        						System.out.println("Error! Cannot find component: m4\tType\t"+Pmid+"\t"+mention);
 448 | 		        					}
 449 | 		        				}
 450 | 		        				for(int i=0;i<W.length;i++)
 451 | 		        				{
 452 | 		        					W[i]=W[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 453 | 		        					String patt="^(.*?)("+W[i]+")(.*?)$";
 454 | 		        					Pattern ptmp = Pattern.compile(patt);
 455 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 456 | 		        					if(mtmp.find())
 457 | 		        					{
 458 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 459 | 		        						{
 460 | 		        							character_hash.put(j+start,"W");
 461 | 		        						}
 462 | 		        						String mtmp2_tmp="";
 463 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 464 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 465 | 		        					}
 466 | 		        					else
 467 | 		        					{
 468 | 		        						System.out.println("Error! Cannot find component: m4\tW\t"+Pmid+"\t"+mention);
 469 | 		        					}
 470 | 		        				}
 471 | 		        				for(int i=0;i<P.length;i++)
 472 | 		        				{
 473 | 		        					P[i]=P[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 474 | 		        					String patt="^(.*?)("+P[i]+")(.*?)$";
 475 | 		        					Pattern ptmp = Pattern.compile(patt);
 476 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 477 | 		        					if(mtmp.find())
 478 | 		        					{
 479 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 480 | 		        						{
 481 | 		        							character_hash.put(j+start,"P");
 482 | 		        						}
 483 | 		        						String mtmp2_tmp="";
 484 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 485 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 486 | 		        					}
 487 | 		        					else
 488 | 		        					{
 489 | 		        						System.out.println("Error! Cannot find component: m4\tP\t"+Pmid+"\t"+mention);
 490 | 		        					}
 491 | 		        				}
 492 | 		        				for(int i=0;i<M.length;i++)
 493 | 		        				{
 494 | 		        					M[i]=M[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 495 | 		        					String patt="^(.*?)("+M[i]+")(.*?)$";
 496 | 		        					Pattern ptmp = Pattern.compile(patt);
 497 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 498 | 		        					if(mtmp.find())
 499 | 		        					{
 500 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 501 | 		        						{
 502 | 		        							character_hash.put(j+start,"M");
 503 | 		        						}
 504 | 		        						String mtmp2_tmp="";
 505 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 506 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 507 | 		        					}
 508 | 		        					else
 509 | 		        					{
 510 | 		        						System.out.println("Error! Cannot find component: m4\tM\t"+Pmid+"\t"+mention);
 511 | 		        					}
 512 | 		        				}
 513 | 		        			}
 514 | 		        			else if(m5.find())
 515 | 		        			{
 516 | 		        				String type[]=m5.group(1).split(",");
 517 | 		        				String T[]=m5.group(2).split(",");
 518 | 		        				String P[]=m5.group(3).split(",");
 519 | 		        				String M[]=m5.group(4).split(",");
 520 | 		        				String D[]=m5.group(5).split(",");
 521 | 		        				String mention_tmp=mention;
 522 | 		        				for(int i=0;i<type.length;i++)
 523 | 		        				{
 524 | 		        					type[i]=type[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 525 | 		        					String patt="^(.*?)("+type[i]+")(.*?)$";
 526 | 		        					Pattern ptmp = Pattern.compile(patt);
 527 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 528 | 		        					if(mtmp.find())
 529 | 		        					{
 530 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 531 | 		        						{
 532 | 		        							character_hash.put(j+start,"A");
 533 | 		        						}
 534 | 		        						String mtmp2_tmp="";
 535 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 536 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 537 | 		        					}
 538 | 		        					else
 539 | 		        					{
 540 | 		        						System.out.println("Error! Cannot find component: m5\tType\t"+Pmid+"\t"+mention);
 541 | 		        					}
 542 | 		        				}
 543 | 		        				for(int i=0;i<T.length;i++)
 544 | 		        				{
 545 | 		        					T[i]=T[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 546 | 		        					String patt="^(.*?)("+T[i]+")(.*?)$";
 547 | 		        					Pattern ptmp = Pattern.compile(patt);
 548 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 549 | 		        					if(mtmp.find())
 550 | 		        					{
 551 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 552 | 		        						{
 553 | 		        							character_hash.put(j+start,"T");
 554 | 		        						}
 555 | 		        						String mtmp2_tmp="";
 556 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 557 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 558 | 		        					}
 559 | 		        					else
 560 | 		        					{
 561 | 		        						System.out.println("Error! Cannot find component: m5\tT\t"+Pmid+"\t"+mention);
 562 | 		        					}
 563 | 		        				}
 564 | 		        				for(int i=0;i<P.length;i++)
 565 | 		        				{
 566 | 		        					P[i]=P[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 567 | 		        					String patt="^(.*?)("+P[i]+")(.*?)$";
 568 | 		        					Pattern ptmp = Pattern.compile(patt);
 569 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 570 | 		        					if(mtmp.find())
 571 | 		        					{
 572 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 573 | 		        						{
 574 | 		        							character_hash.put(j+start,"P");
 575 | 		        						}
 576 | 		        						String mtmp2_tmp="";
 577 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 578 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 579 | 		        					}
 580 | 		        					else
 581 | 		        					{
 582 | 		        						System.out.println("Error! Cannot find component: m5\tP\t"+Pmid+"\t"+mention);
 583 | 		        					}
 584 | 		        				}
 585 | 		        				for(int i=0;i<M.length;i++)
 586 | 		        				{
 587 | 		        					M[i]=M[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 588 | 		        					String patt="^(.*?)("+M[i]+")(.*?)$";
 589 | 		        					Pattern ptmp = Pattern.compile(patt);
 590 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 591 | 		        					if(mtmp.find())
 592 | 		        					{
 593 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 594 | 		        						{
 595 | 		        							character_hash.put(j+start,"M");
 596 | 		        						}
 597 | 		        						String mtmp2_tmp="";
 598 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 599 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 600 | 		        					}
 601 | 		        					else
 602 | 		        					{
 603 | 		        						System.out.println("Error! Cannot find component: m5\tM\t"+Pmid+"\t"+mention);
 604 | 		        					}
 605 | 		        				}
 606 | 		        				for(int i=0;i<D.length;i++)
 607 | 		        				{
 608 | 		        					D[i]=D[i].replaceAll("([^A-Za-z0-9@])", "\\\\$1");
 609 | 		        					String patt="^(.*?)("+D[i]+")(.*?)$";
 610 | 		        					Pattern ptmp = Pattern.compile(patt);
 611 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 612 | 		        					if(mtmp.find())
 613 | 		        					{
 614 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 615 | 		        						{
 616 | 		        							character_hash.put(j+start,"D");
 617 | 		        						}
 618 | 		        						String mtmp2_tmp="";
 619 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 620 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 621 | 		        					}
 622 | 		        					else
 623 | 		        					{
 624 | 		        						System.out.println("Error! Cannot find component: m5\tD\t"+Pmid+"\t"+mention);
 625 | 		        					}
 626 | 		        				}
 627 | 		        			}
 628 | 		        			else if(m6.find())
 629 | 		        			{
 630 | 		        				String RS[]=m6.group(1).split(",");
 631 | 		        				String mention_tmp=mention;
 632 | 		        				for(int i=0;i<RS.length;i++)
 633 | 		        				{
 634 | 		        					RS[i]=RS[i].replaceAll("([\\[\\]])", "\\\\$1");
 635 | 		        					String patt="^(.*?)("+RS[i]+")(.*?)$";
 636 | 		        					Pattern ptmp = Pattern.compile(patt);
 637 | 		        					Matcher mtmp = ptmp.matcher(mention_tmp);
 638 | 		        					if(mtmp.find())
 639 | 		        					{
 640 | 		        						for(int j=mtmp.group(1).length();j<(mtmp.group(1).length()+mtmp.group(2).length());j++)
 641 | 		        						{
 642 | 		        							character_hash.put(j+start,"R");
 643 | 		        						}
 644 | 		        						String mtmp2_tmp="";
 645 | 		        						for(int j=0;j<mtmp.group(2).length();j++){mtmp2_tmp=mtmp2_tmp+"@";}
 646 | 		        						mention_tmp=mtmp.group(1)+mtmp2_tmp+mtmp.group(3);
 647 | 		        					}
 648 | 		        					else
 649 | 		        					{
 650 | 		        						System.out.println("Error! Cannot find component: m6\tRS\t"+Pmid+"\t"+mention);
 651 | 		        					}
 652 | 		        				}
 653 | 		        			}
 654 | 		        			else
 655 | 			        		{
 656 | 			        			System.out.println("Error! Annotation component cannot match RegEx. " + mention);
 657 | 			        		}
 658 | 		        			annotations.add(anno[1]+"\t"+anno[2]+"\t"+anno[3]+"\t"+anno[4]+"\t"+anno[5]);
 659 | 	        			}
 660 | 	        		}
 661 | 	        		//else
 662 | 	        		//{
 663 | 	        		//	System.out.println("Error! annotation column is less than 6: "+line);
 664 | 	        		//}
 665 | 	        	}
 666 | 				else if(line.length()==0) //Processing
 667 | 				{
 668 | 					String Document="";
 669 | 					for(int i=0;i<ParagraphContent.size();i++)
 670 | 		        	{
 671 | 						Document=Document+ParagraphContent.get(i)+" ";
 672 | 		        	}
 673 | 					String Document_rev=Document;
 674 | 					
 675 | 					/*
 676 | 					 * RegEx result of ProteinMutation - Post is group(5)
 677 | 					 */
 678 | 					for(int i=0;i<tmVar.RegEx_ProteinMutation_STR.size();i++)
 679 | 					{
 680 | 						Pattern PATTERN_RegEx_HGVs = Pattern.compile(tmVar.RegEx_ProteinMutation_STR.get(i));
 681 | 						Matcher m = PATTERN_RegEx_HGVs.matcher(Document_rev);
 682 | 						while (m.find()) 
 683 | 						{
 684 | 							String pre = m.group(1);
 685 | 							String mention = m.group(2);
 686 | 							String post = m.group(5);
 687 | 							
 688 | 							{
 689 | 								Pattern ptmp = Pattern.compile("^(.+)([ ;.,:]+)$");
 690 | 								Matcher mtmp = ptmp.matcher(mention);
 691 | 								if(mtmp.find())
 692 | 								{
 693 | 									mention=mtmp.group(1);
 694 | 									post=mtmp.group(2)+post;
 695 | 								}
 696 | 								
 697 | 								if((!mention.contains("(")) && mention.endsWith(")")){	mention=mention.substring(0,mention.length()-1);post=")"+post;}
 698 | 								if((!mention.contains("[")) && mention.endsWith("]")){	mention=mention.substring(0,mention.length()-1);post="]"+post;}
 699 | 								if((!mention.contains("{")) && mention.endsWith("}")){	mention=mention.substring(0,mention.length()-1);post="}"+post;}
 700 | 								
 701 | 								int start=pre.length();
 702 | 								int last=pre.length()+mention.length();
 703 | 								for(int j=start;j<last;j++)
 704 | 								{
 705 | 									RegEx_HGVs_hash.put(j,"ProteinMutation");
 706 | 								}
 707 | 								String mention_rev="";
 708 | 								for(int j=0;j<mention.length();j++){mention_rev=mention_rev+"@";}
 709 | 								Document_rev=pre+mention_rev+post;
 710 | 							}
 711 | 							PATTERN_RegEx_HGVs = Pattern.compile(tmVar.RegEx_ProteinMutation_STR.get(i));
 712 | 							m = PATTERN_RegEx_HGVs.matcher(Document_rev);
 713 | 						}
 714 | 					}
 715 | 					/*
 716 | 					 * RegEx result of DNAMutation - Post is group(4)
 717 | 					 */
 718 | 					for(int i=0;i<tmVar.RegEx_DNAMutation_STR.size();i++)
 719 | 					{
 720 | 						Pattern PATTERN_RegEx_HGVs = Pattern.compile(tmVar.RegEx_DNAMutation_STR.get(i));
 721 | 						Matcher m = PATTERN_RegEx_HGVs.matcher(Document_rev);
 722 | 						while (m.find()) 
 723 | 						{
 724 | 							String pre = m.group(1);
 725 | 							String mention = m.group(2);
 726 | 							String post = m.group(4);
 727 | 							
 728 | 							//Pattern pCheck = Pattern.compile("^[cgr][^0-9a-zA-Z]*.[^0-9a-zA-Z]*[0-9]+[^0-9a-zA-Z]*$");
 729 | 							//Matcher mCheck = pCheck.matcher(mention);
 730 | 							//if(!mCheck.find())
 731 | 							{
 732 | 								Pattern ptmp = Pattern.compile("^(.+)([ ;.,:]+)$");
 733 | 								Matcher mtmp = ptmp.matcher(mention);
 734 | 								if(mtmp.find())
 735 | 								{
 736 | 									mention=mtmp.group(1);
 737 | 									post=mtmp.group(2)+post;
 738 | 								}
 739 | 								
 740 | 								if((!mention.contains("(")) && mention.endsWith(")")){	mention=mention.substring(0,mention.length()-1);post=")"+post;}
 741 | 								if((!mention.contains("[")) && mention.endsWith("]")){	mention=mention.substring(0,mention.length()-1);post="]"+post;}
 742 | 								if((!mention.contains("{")) && mention.endsWith("}")){	mention=mention.substring(0,mention.length()-1);post="}"+post;}
 743 | 								
 744 | 								int start=pre.length();
 745 | 								int last=pre.length()+mention.length();
 746 | 								for(int j=start;j<last;j++)
 747 | 								{
 748 | 									RegEx_HGVs_hash.put(j,"DNAMutation");
 749 | 								}
 750 | 								String mention_rev="";
 751 | 								for(int j=0;j<mention.length();j++){mention_rev=mention_rev+"@";}
 752 | 								Document_rev=pre+mention_rev+post;
 753 | 							}
 754 | 							PATTERN_RegEx_HGVs = Pattern.compile(tmVar.RegEx_DNAMutation_STR.get(i));
 755 | 							m = PATTERN_RegEx_HGVs.matcher(Document_rev);
 756 | 						}
 757 | 					}
 758 | 					/*
 759 | 					 * RegEx result of SNP
 760 | 					 */
 761 | 					for(int i=0;i<tmVar.RegEx_SNP_STR.size();i++)
 762 | 					{
 763 | 						Pattern PATTERN_RegEx_HGVs = Pattern.compile(tmVar.RegEx_SNP_STR.get(i));
 764 | 						Matcher m = PATTERN_RegEx_HGVs.matcher(Document_rev);
 765 | 						while (m.find()) 
 766 | 						{
 767 | 							String pre = m.group(1);
 768 | 							String mention = m.group(2);
 769 | 							String post = m.group(4);
 770 | 							
 771 | 							{
 772 | 								Pattern ptmp = Pattern.compile("^(.+)([ ;.,:]+)$");
 773 | 								Matcher mtmp = ptmp.matcher(mention);
 774 | 								if(mtmp.find())
 775 | 								{
 776 | 									mention=mtmp.group(1);
 777 | 									post=mtmp.group(2)+post;
 778 | 								}
 779 | 								
 780 | 								if((!mention.contains("(")) && mention.endsWith(")")){	mention=mention.substring(0,mention.length()-1);post=")"+post;}
 781 | 								if((!mention.contains("[")) && mention.endsWith("]")){	mention=mention.substring(0,mention.length()-1);post="]"+post;}
 782 | 								if((!mention.contains("{")) && mention.endsWith("}")){	mention=mention.substring(0,mention.length()-1);post="}"+post;}
 783 | 								
 784 | 								int start=pre.length();
 785 | 								int last=pre.length()+mention.length();
 786 | 								for(int j=start;j<last;j++)
 787 | 								{
 788 | 									RegEx_HGVs_hash.put(j,"SNP");
 789 | 								}
 790 | 								String mention_rev="";
 791 | 								for(int j=0;j<mention.length();j++){mention_rev=mention_rev+"@";}
 792 | 								Document_rev=pre+mention_rev+post;
 793 | 							}
 794 | 							PATTERN_RegEx_HGVs = Pattern.compile(tmVar.RegEx_SNP_STR.get(i));
 795 | 							m = PATTERN_RegEx_HGVs.matcher(Document_rev);
 796 | 						}
 797 | 					}
 798 | 					
 799 | 					/*
 800 | 					 * Tokenization for .location
 801 | 					 */
 802 | 					Document_rev=Document;
 803 | 					Document_rev = Document_rev.replaceAll("([A-Z][A-Z])([A-Z][0-9][0-9]+[A-Z][\\W\\-\\_])", "$1 $2"); //PTENK289E
 804 | 					Document_rev = Document_rev.replaceAll("([0-9])([A-Za-z])", "$1 $2");
 805 | 					Document_rev = Document_rev.replaceAll("([A-Za-z])([0-9])", "$1 $2");
 806 | 					Document_rev = Document_rev.replaceAll("([A-Z])([a-z])", "$1 $2");
 807 | 					Document_rev = Document_rev.replaceAll("([a-z])([A-Z])", "$1 $2");
 808 | 					Document_rev = Document_rev.replaceAll("(.+)fs", "$1 fs");
 809 | 					Document_rev = Document_rev.replaceAll("[\t ]+", " ");
 810 | 					String regex="\\s+|(?=\\p{Punct})|(?<=\\p{Punct})"; //split on whitespace and immediately before/after every punctuation character
 811 | 					String TokensInDoc[]=Document_rev.split(regex);
 812 | 					String DocumentTmp=Document;
 813 | 					int Offset=0;
 814 | 					
 815 | 					Document_rev = Document_rev.replaceAll("ω","w");
 816 | 					Document_rev = Document_rev.replaceAll("μ","u");
 817 | 					Document_rev = Document_rev.replaceAll("κ","k");
 818 | 					Document_rev = Document_rev.replaceAll("α","a");
 819 | 					Document_rev = Document_rev.replaceAll("γ","r");
 820 | 					Document_rev = Document_rev.replaceAll("β","b");
 821 | 					Document_rev = Document_rev.replaceAll("×","x");
 822 | 					Document_rev = Document_rev.replaceAll("¹","1");
 823 | 					Document_rev = Document_rev.replaceAll("²","2");
 824 | 					Document_rev = Document_rev.replaceAll("°","o");
 825 | 					Document_rev = Document_rev.replaceAll("ö","o");
 826 | 					Document_rev = Document_rev.replaceAll("é","e");
 827 | 					Document_rev = Document_rev.replaceAll("à","a");
 828 | 					Document_rev = Document_rev.replaceAll("Á","A");
 829 | 					Document_rev = Document_rev.replaceAll("ε","e");
 830 | 					Document_rev = Document_rev.replaceAll("θ","O");
 831 | 					Document_rev = Document_rev.replaceAll("•",".");
 832 | 					Document_rev = Document_rev.replaceAll("µ","u");
 833 | 					Document_rev = Document_rev.replaceAll("λ","r");
 834 | 					Document_rev = Document_rev.replaceAll("⁺","+");
 835 | 					Document_rev = Document_rev.replaceAll("ν","v");
 836 | 					Document_rev = Document_rev.replaceAll("ï","i");
 837 | 					Document_rev = Document_rev.replaceAll("ã","a");
 838 | 					Document_rev = Document_rev.replaceAll("≡","=");
 839 | 					Document_rev = Document_rev.replaceAll("ó","o");
 840 | 					Document_rev = Document_rev.replaceAll("³","3");
 841 | 					Document_rev = Document_rev.replaceAll("〖","[");
 842 | 					Document_rev = Document_rev.replaceAll("〗","]");
 843 | 					Document_rev = Document_rev.replaceAll("Å","A");
 844 | 					Document_rev = Document_rev.replaceAll("ρ","p");
 845 | 					Document_rev = Document_rev.replaceAll("ü","u");
 846 | 					Document_rev = Document_rev.replaceAll("ɛ","e");
 847 | 					Document_rev = Document_rev.replaceAll("č","c");
 848 | 					Document_rev = Document_rev.replaceAll("š","s");
 849 | 					Document_rev = Document_rev.replaceAll("ß","b");
 850 | 					Document_rev = Document_rev.replaceAll("═","=");
 851 | 					Document_rev = Document_rev.replaceAll("£","L");
 852 | 					Document_rev = Document_rev.replaceAll("Ł","L");
 853 | 					Document_rev = Document_rev.replaceAll("ƒ","f");
 854 | 					Document_rev = Document_rev.replaceAll("ä","a");
 855 | 					Document_rev = Document_rev.replaceAll("–","-");
 856 | 					Document_rev = Document_rev.replaceAll("⁻","-");
 857 | 					Document_rev = Document_rev.replaceAll("〈","<");
 858 | 					Document_rev = Document_rev.replaceAll("〉",">");
 859 | 					Document_rev = Document_rev.replaceAll("χ","X");
 860 | 					Document_rev = Document_rev.replaceAll("Đ","D");
 861 | 					Document_rev = Document_rev.replaceAll("‰","%");
 862 | 					Document_rev = Document_rev.replaceAll("·",".");
 863 | 					Document_rev = Document_rev.replaceAll("→",">");
 864 | 					Document_rev = Document_rev.replaceAll("←","<");
 865 | 					Document_rev = Document_rev.replaceAll("ζ","z");
 866 | 					Document_rev = Document_rev.replaceAll("π","p");
 867 | 					Document_rev = Document_rev.replaceAll("τ","t");
 868 | 					Document_rev = Document_rev.replaceAll("ξ","X");
 869 | 					Document_rev = Document_rev.replaceAll("η","h");
 870 | 					Document_rev = Document_rev.replaceAll("ø","0");
 871 | 					Document_rev = Document_rev.replaceAll("Δ","D");
 872 | 					Document_rev = Document_rev.replaceAll("∆","D");
 873 | 					Document_rev = Document_rev.replaceAll("∑","S");
 874 | 					Document_rev = Document_rev.replaceAll("Ω","O");
 875 | 					Document_rev = Document_rev.replaceAll("δ","d");
 876 | 					Document_rev = Document_rev.replaceAll("σ","s");
 877 | 					Document_rev = Document_rev.replaceAll("Φ","F");
 878 | 					//Document_rev = Document_rev.replaceAll("[^0-9A-Za-z\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)\\_\\+\\-\\=\\{\\}\\|\\[\\]\\\\\\:;'\\<\\>\\?\\,\\.\\/\\' ]+"," ");
 879 | 					//tmVar.tagger = new MaxentTagger("lib/taggers/english-left3words-distsim.tagger");
 880 | 					String tagged=tmVar.tagger.tagString(Document_rev).replace("-LRB-", "(").replace("-RRB-", ")").replace("-LSB-", "[").replace("-RSB-", "]");
 881 | 					String tag_split[]=tagged.split(" ");
 882 | 	        		HashMap<String,String> POS=new HashMap<String,String>();
 883 | 	        		for(int p=0;p<tag_split.length;p++)
 884 | 	                {
 885 | 	                	String tmp[]=tag_split[p].split("_");
 886 | 	                	String tmp2[]=tmp[0].replaceAll("\\s+(?=\\p{Punct})", "").split(regex);
 887 | 	                	for(int q=0;q<tmp2.length;q++)
 888 | 	                	{
 889 | 	                		if(tmp2[q].matches("[^0-9a-zA-Z]"))
 890 | 	                		{
 891 | 	                			POS.put(tmp2[q], tmp2[q]);
 892 | 	                		}
 893 | 	                		else if(tmp2[q].matches("[0-9a-zA-Z]+"))
 894 | 	                		{
 895 | 	                			POS.put(tmp2[q], tmp[1]);
 896 | 	                		}
 897 | 	                		else
 898 | 	                		{
 899 | 	                			POS.put(tmp2[q], tmp2[q]);
 900 | 	                		}
 901 | 	                	}
 902 | 	                }
 903 | 					
 904 | 					for(int i=0;i<TokensInDoc.length;i++)
 905 | 					{
 906 | 						if(DocumentTmp.length()>0)
 907 | 						{
 908 | 							while(DocumentTmp.length()>0 && DocumentTmp.substring(0,1).matches("[\\t ]")) //stop once the remaining text is exhausted
 909 | 							{
 910 | 								DocumentTmp=DocumentTmp.substring(1);
 911 | 								Offset++;
 912 | 							}
 913 | 							if(DocumentTmp.startsWith(TokensInDoc[i])) //safe even when the token is longer than the remaining text
 914 | 							{
 915 | 								if(TokensInDoc[i].length()>0)
 916 | 								{
 917 | 									DocumentTmp=DocumentTmp.substring(TokensInDoc[i].length());
 918 | 									
 919 | 									/*
 920 | 									 * Feature Extraction
 921 | 									 */
 922 | 									//POS (part-of-speech tag)
 923 | 									String pos=POS.get(TokensInDoc[i]);
 924 | 									if(pos == null || pos.equals(""))
 925 | 									{
 926 | 										pos = "_NULL_";
 927 | 									}
 928 | 									
 929 | 									//stemming
 930 | 									tmVar.stemmer.setCurrent(TokensInDoc[i].toLowerCase());
 931 | 									tmVar.stemmer.stem();
 932 | 									String stem=tmVar.stemmer.getCurrent();
 933 | 									
 934 | 									//Number of Numbers [0-9]
 935 | 									String Num_num="";
 936 | 									String tmp=TokensInDoc[i];
 937 | 									tmp=tmp.replaceAll("[^0-9]","");
 938 | 									if(tmp.length()>3){Num_num="N:4+";}else{Num_num="N:"+ tmp.length();}
 939 | 									
 940 | 									//Number of Uppercase [A-Z]
 941 | 									String Num_Uc="";
 942 | 									tmp=TokensInDoc[i];
 943 | 									tmp=tmp.replaceAll("[^A-Z]","");
 944 | 									if(tmp.length()>3){Num_Uc="U:4+";}else{Num_Uc="U:"+ tmp.length();}
 945 | 									
 946 | 									//Number of Lowercase [a-z]
 947 | 									String Num_lc="";
 948 | 									tmp=TokensInDoc[i];
 949 | 									tmp=tmp.replaceAll("[^a-z]","");
 950 | 									if(tmp.length()>3){Num_lc="L:4+";}else{Num_lc="L:"+ tmp.length();}
 951 | 									
 952 | 									//Number of ALL char
 953 | 									String Num_All="";
 954 | 									if(TokensInDoc[i].length()>3){Num_All="A:4+";}else{Num_All="A:"+ TokensInDoc[i].length();}
 955 | 									
 956 | 									//specific character (;:,.->+_)
 957 | 									String SpecificC="";
 958 | 									tmp=TokensInDoc[i];
 959 | 									
 960 | 									if(TokensInDoc[i].equals(";") || TokensInDoc[i].equals(":") || TokensInDoc[i].equals(",") || TokensInDoc[i].equals(".") || TokensInDoc[i].equals("-") || TokensInDoc[i].equals(">") || TokensInDoc[i].equals("+") || TokensInDoc[i].equals("_"))
 961 | 									{
 962 | 										SpecificC="-SpecificC1-";
 963 | 									}
 964 | 									else if(TokensInDoc[i].equals("(") || TokensInDoc[i].equals(")"))
 965 | 									{
 966 | 										SpecificC="-SpecificC2-";
 967 | 									}
 968 | 									else if(TokensInDoc[i].equals("{") || TokensInDoc[i].equals("}"))
 969 | 									{
 970 | 										SpecificC="-SpecificC3-";
 971 | 									}
 972 | 									else if(TokensInDoc[i].equals("[") || TokensInDoc[i].equals("]"))
 973 | 									{
 974 | 										SpecificC="-SpecificC4-";
 975 | 									}
 976 | 									else if(TokensInDoc[i].equals("\\") || TokensInDoc[i].equals("/"))
 977 | 									{
 978 | 										SpecificC="-SpecificC5-";
 979 | 									}
 980 | 									else
 981 | 									{
 982 | 										SpecificC="__nil__";
 983 | 									}
 984 | 	
 985 | 									//chromosomal keytokens
 986 | 									String ChroKey="";
 987 | 									tmp=TokensInDoc[i];
 988 | 									String pattern_ChroKey="^(q|p|qter|pter|XY|t)$";
 989 | 									Pattern pattern_ChroKey_compile = Pattern.compile(pattern_ChroKey);
 990 | 									Matcher pattern_ChroKey_compile_Matcher = pattern_ChroKey_compile.matcher(tmp);
 991 | 									if(pattern_ChroKey_compile_Matcher.find())
 992 | 									{
 993 | 										ChroKey="-ChroKey-";
 994 | 									}
 995 | 									else
 996 | 									{
 997 | 										ChroKey="__nil__";
 998 | 									}
 999 | 									
1000 | 									//Mutation type
1001 | 									String MutatType="";
1002 | 									tmp=TokensInDoc[i];
1003 | 									tmp=tmp.toLowerCase();
1004 | 									String pattern_MutatType="^(del|ins|dup|tri|qua|con|delins|indel)$";
1005 | 									String pattern_FrameShiftType="(fs|fsX|fsx)";
1006 | 									Pattern pattern_MutatType_compile = Pattern.compile(pattern_MutatType);
1007 | 									Pattern pattern_FrameShiftType_compile = Pattern.compile(pattern_FrameShiftType);
1008 | 									Matcher pattern_MutatType_compile_Matcher = pattern_MutatType_compile.matcher(tmp);
1009 | 									Matcher pattern_FrameShiftType_compile_Matcher = pattern_FrameShiftType_compile.matcher(tmp);
1010 | 									if(pattern_MutatType_compile_Matcher.find())
1011 | 									{
1012 | 										MutatType="-MutatType-";
1013 | 									}
1014 | 									else if(pattern_FrameShiftType_compile_Matcher.find())
1015 | 									{
1016 | 										MutatType="-FrameShiftType-";
1017 | 									}
1018 | 									else
1019 | 									{
1020 | 										MutatType="__nil__";
1021 | 									}
1022 | 									
1023 | 									//Mutation word
1024 | 									String MutatWord="";
1025 | 									tmp=TokensInDoc[i];
1026 | 									tmp=tmp.toLowerCase();
1027 | 									String pattern_MutatWord="^(deletion|delta|elta|insertion|repeat|inversion|deletions|insertions|repeats|inversions)$";
1028 | 									Pattern pattern_MutatWord_compile = Pattern.compile(pattern_MutatWord);
1029 | 									Matcher pattern_MutatWord_compile_Matcher = pattern_MutatWord_compile.matcher(tmp);
1030 | 									if(pattern_MutatWord_compile_Matcher.find())
1031 | 									{
1032 | 										MutatWord="-MutatWord-";
1033 | 									}
1034 | 									else
1035 | 									{
1036 | 										MutatWord="__nil__";
1037 | 									}
1038 | 									
1039 | 									//Mutation article & basepair
1040 | 									String MutatArticle="";
1041 | 									tmp=TokensInDoc[i];
1042 | 									tmp=tmp.toLowerCase();
1043 | 									String pattern_base="^(single|a|one|two|three|four|five|six|seven|eight|nine|ten|[0-9]+)$";
1044 | 									String pattern_Byte="^(kb|mb)$";
1045 | 									String pattern_bp="(base|bases|pair|amino|acid|acids|codon|postion|postions|bp|nucleotide|nucleotides)";
1046 | 									Pattern pattern_base_compile = Pattern.compile(pattern_base);
1047 | 									Pattern pattern_Byte_compile = Pattern.compile(pattern_Byte);
1048 | 									Pattern pattern_bp_compile = Pattern.compile(pattern_bp);
1049 | 									Matcher pattern_base_compile_Matcher = pattern_base_compile.matcher(tmp);
1050 | 									Matcher pattern_Byte_compile_Matcher = pattern_Byte_compile.matcher(tmp);
1051 | 									Matcher pattern_bp_compile_Matcher = pattern_bp_compile.matcher(tmp);
1052 | 									if(pattern_base_compile_Matcher.find())
1053 | 									{
1054 | 										MutatArticle="-Base-";
1055 | 									}
1056 | 									else if(pattern_Byte_compile_Matcher.find())
1057 | 									{
1058 | 										MutatArticle="-Byte-";
1059 | 									}
1060 | 									else if(pattern_bp_compile_Matcher.find())
1061 | 									{
1062 | 										MutatArticle="-bp-";
1063 | 									}
1064 | 									else
1065 | 									{
1066 | 										MutatArticle="__nil__";
1067 | 									}
1068 | 									
1069 | 									//Type1
1070 | 									String Type1="";
1071 | 									tmp=TokensInDoc[i];
1072 | 									tmp=tmp.toLowerCase();
1073 | 									String pattern_Type1="^[cgrm]$";
1074 | 									String pattern_Type1_2="^(ivs|ex|orf)$";
1075 | 									Pattern pattern_Type1_compile = Pattern.compile(pattern_Type1);
1076 | 									Pattern pattern_Type1_2_compile = Pattern.compile(pattern_Type1_2);
1077 | 									Matcher pattern_Type1_compile_Matcher = pattern_Type1_compile.matcher(tmp);
1078 | 									Matcher pattern_Type1_2_compile_Matcher = pattern_Type1_2_compile.matcher(tmp);
1079 | 									if(pattern_Type1_compile_Matcher.find())
1080 | 									{
1081 | 										Type1="-Type1-";
1082 | 									}
1083 | 									else if(pattern_Type1_2_compile_Matcher.find())
1084 | 									{
1085 | 										Type1="-Type1_2-";
1086 | 									}
1087 | 									else
1088 | 									{
1089 | 										Type1="__nil__";
1090 | 									}
1091 | 									
1092 | 									//Type2
1093 | 									String Type2="";
1094 | 									tmp=TokensInDoc[i];
1095 | 									
1096 | 									if(tmp.equals("p"))
1097 | 									{
1098 | 										Type2="-Type2-";
1099 | 									}
1100 | 									else
1101 | 									{
1102 | 										Type2="__nil__";
1103 | 									}
1104 | 									
1105 | 									//DNA symbols
1106 | 									String DNASym="";
1107 | 									tmp=TokensInDoc[i];
1108 | 									String pattern_DNASym="^[ATCGUatcgu]$";
1109 | 									Pattern pattern_DNASym_compile = Pattern.compile(pattern_DNASym);
1110 | 									Matcher pattern_DNASym_compile_Matcher = pattern_DNASym_compile.matcher(tmp);
1111 | 									if(pattern_DNASym_compile_Matcher.find())
1112 | 									{
1113 | 										DNASym="-DNASym-";
1114 | 									}
1115 | 									else
1116 | 									{
1117 | 										DNASym="__nil__";
1118 | 									}
1119 | 									
1120 | 									//Protein symbols
1121 | 									String ProteinSym="";
1122 | 									String lastToken="";
1123 | 									tmp=TokensInDoc[i];
1124 | 									if(i>0){lastToken=TokensInDoc[i-1];}
1125 | 									String pattern_ProteinSymFull="(glutamine|glutamic|leucine|valine|isoleucine|lysine|alanine|glycine|aspartate|methionine|threonine|histidine|aspartic|asparticacid|arginine|asparagine|tryptophan|proline|phenylalanine|cysteine|serine|glutamate|tyrosine|stop|frameshift)";
1126 | 									String pattern_ProteinSymTri="^(cys|ile|ser|gln|met|asn|pro|lys|asp|thr|phe|ala|gly|his|leu|arg|trp|val|glu|tyr|fs|fsx)$";
1127 | 									String pattern_ProteinSymTriSub="^(ys|le|er|ln|et|sn|ro|ys|sp|hr|he|la|ly|is|eu|rg|rp|al|lu|yr)$";
1128 | 									String pattern_ProteinSymChar="^[CISQMNPKDTFAGHLRWVEYX]$";
1129 | 									String pattern_lastToken="^[CISQMNPKDTFAGHLRWVEYX]$";
1130 | 									Pattern pattern_ProteinSymFull_compile = Pattern.compile(pattern_ProteinSymFull);
1131 | 									Matcher pattern_ProteinSymFull_compile_Matcher = pattern_ProteinSymFull_compile.matcher(tmp);
1132 | 									Pattern pattern_ProteinSymTri_compile = Pattern.compile(pattern_ProteinSymTri);
1133 | 									Matcher pattern_ProteinSymTri_compile_Matcher = pattern_ProteinSymTri_compile.matcher(tmp);
1134 | 									Pattern pattern_ProteinSymTriSub_compile = Pattern.compile(pattern_ProteinSymTriSub);
1135 | 									Matcher pattern_ProteinSymTriSub_compile_Matcher = pattern_ProteinSymTriSub_compile.matcher(tmp);
1136 | 									Pattern pattern_ProteinSymChar_compile = Pattern.compile(pattern_ProteinSymChar);
1137 | 									Matcher pattern_ProteinSymChar_compile_Matcher = pattern_ProteinSymChar_compile.matcher(tmp);
1138 | 									Pattern pattern_lastToken_compile = Pattern.compile(pattern_lastToken);
1139 | 									Matcher pattern_lastToken_compile_Matcher = pattern_lastToken_compile.matcher(lastToken);
1140 | 									
1141 | 									if(pattern_ProteinSymFull_compile_Matcher.find())
1142 | 									{
1143 | 										ProteinSym="-ProteinSymFull-";
1144 | 									}
1145 | 									else if(pattern_ProteinSymTri_compile_Matcher.find())
1146 | 									{
1147 | 										ProteinSym="-ProteinSymTri-";
1148 | 									}
1149 | 									else if(pattern_ProteinSymTriSub_compile_Matcher.find() && pattern_lastToken_compile_Matcher.find() && !Document.substring(Offset-1,Offset).equals(" "))
1150 | 									{
1151 | 										ProteinSym="-ProteinSymTriSub-";
1152 | 									}
1153 | 									else if(pattern_ProteinSymChar_compile_Matcher.find())
1154 | 									{
1155 | 										ProteinSym="-ProteinSymChar-";
1156 | 									}
1157 | 									else
1158 | 									{
1159 | 										ProteinSym="__nil__";
1160 | 									}
1161 | 									
1162 | 									//RS
1163 | 									String RScode="";
1164 | 									tmp=TokensInDoc[i];
1165 | 									String pattern_RScode="^(rs|RS|Rs)$";
1166 | 									Pattern pattern_RScode_compile = Pattern.compile(pattern_RScode);
1167 | 									Matcher pattern_RScode_compile_Matcher = pattern_RScode_compile.matcher(tmp);
1168 | 									if(pattern_RScode_compile_Matcher.find())
1169 | 									{
1170 | 										RScode="-RScode-";
1171 | 									}
1172 | 									else
1173 | 									{
1174 | 										RScode="__nil__";
1175 | 									}
1176 | 									
1177 | 									//Patterns
1178 | 									String Pattern1=TokensInDoc[i];
1179 | 									String Pattern2=TokensInDoc[i];
1180 | 									String Pattern3=TokensInDoc[i];
1181 | 									String Pattern4=TokensInDoc[i];
1182 | 									Pattern1=Pattern1.replaceAll("[A-Z]","A");
1183 | 									Pattern1=Pattern1.replaceAll("[a-z]","a");
1184 | 									Pattern1=Pattern1.replaceAll("[0-9]","0");
1185 | 									Pattern1="P1:"+Pattern1;
1186 | 									Pattern2=Pattern2.replaceAll("[A-Za-z]","a");
1187 | 									Pattern2=Pattern2.replaceAll("[0-9]","0");
1188 | 									Pattern2="P2:"+Pattern2;
1189 | 									Pattern3=Pattern3.replaceAll("[A-Z]+","A");
1190 | 									Pattern3=Pattern3.replaceAll("[a-z]+","a");
1191 | 									Pattern3=Pattern3.replaceAll("[0-9]+","0");
1192 | 									Pattern3="P3:"+Pattern3;
1193 | 									Pattern4=Pattern4.replaceAll("[A-Za-z]+","a");
1194 | 									Pattern4=Pattern4.replaceAll("[0-9]+","0");
1195 | 									Pattern4="P4:"+Pattern4;
1196 | 									
1197 | 									//prefix
1198 | 									String prefix="";
1199 | 									tmp=TokensInDoc[i];
1200 | 									if(tmp.length()>=1){ prefix=tmp.substring(0, 1);}else{prefix="__nil__";}
1201 | 									if(tmp.length()>=2){ prefix=prefix+" "+tmp.substring(0, 2);}else{prefix=prefix+" __nil__";}
1202 | 									if(tmp.length()>=3){ prefix=prefix+" "+tmp.substring(0, 3);}else{prefix=prefix+" __nil__";}
1203 | 									if(tmp.length()>=4){ prefix=prefix+" "+tmp.substring(0, 4);}else{prefix=prefix+" __nil__";}
1204 | 									if(tmp.length()>=5){ prefix=prefix+" "+tmp.substring(0, 5);}else{prefix=prefix+" __nil__";}
1205 | 									
1206 | 									
1207 | 									//suffix
1208 | 									String suffix="";
1209 | 									tmp=TokensInDoc[i];
1210 | 									if(tmp.length()>=1){ suffix=tmp.substring(tmp.length()-1, tmp.length());}else{suffix="__nil__";}
1211 | 									if(tmp.length()>=2){ suffix=suffix+" "+tmp.substring(tmp.length()-2, tmp.length());}else{suffix=suffix+" __nil__";}
1212 | 									if(tmp.length()>=3){ suffix=suffix+" "+tmp.substring(tmp.length()-3, tmp.length());}else{suffix=suffix+" __nil__";}
1213 | 									if(tmp.length()>=4){ suffix=suffix+" "+tmp.substring(tmp.length()-4, tmp.length());}else{suffix=suffix+" __nil__";}
1214 | 									if(tmp.length()>=5){ suffix=suffix+" "+tmp.substring(tmp.length()-5, tmp.length());}else{suffix=suffix+" __nil__";}
1215 | 									
1216 | 									/*
1217 | 									 * Print out: .data
1218 | 									 */
1219 | 									FileData.write(TokensInDoc[i]+" "+stem+" "+pos+" "+Num_num+" "+Num_Uc+" "+Num_lc+" "+Num_All+" "+SpecificC+" "+ChroKey+" "+MutatType+" "+MutatWord+" "+MutatArticle+" "+Type1+" "+Type2+" "+DNASym+" "+ProteinSym+" "+RScode+" "+Pattern1+" "+Pattern2+" "+Pattern3+" "+Pattern4+" "+prefix+" "+suffix);
1220 | 									if(RegEx_HGVs_hash.containsKey(Offset))
1221 | 									{
1222 | 										FileData.write(" "+RegEx_HGVs_hash.get(Offset));
1223 | 									}
1224 | 									else
1225 | 									{
1226 | 										FileData.write(" O");
1227 | 									}
1228 | 									if(TrainTest.equals("Train")) // the label column from character_hash is written only for training data
1229 | 									{
1230 | 										if(character_hash.containsKey(Offset))
1231 | 										{
1232 | 											FileData.write(" "+character_hash.get(Offset));
1233 | 										}
1234 | 										else
1235 | 										{
1236 | 											FileData.write(" O");
1237 | 										}
1238 | 									}
1239 | 									FileData.write("\n");
1240 | 									
1241 | 									/*
1242 | 									 * Print out: .location
1243 | 									 */
1244 | 									FileLocation.write(Pmid+"\t"+TokensInDoc[i]+"\t"+(Offset+1)+"\t"+(Offset+TokensInDoc[i].length())+"\n");
1245 | 									
1246 | 									Offset=Offset+TokensInDoc[i].length();
1247 | 								}
1248 | 							}
1249 | 							else
1250 | 							{
1251 | 								System.out.println("Error! Strings do not match: '"+DocumentTmp.substring(0,TokensInDoc[i].length())+"'\t'"+TokensInDoc[i]+"'");
1252 | 							}
1253 | 						}
1254 | 					}
1255 | 	
1256 | 					FileLocation.write("\n"); 
1257 | 					FileData.write("\n");
1258 | 					
1259 | 					ParagraphType.clear();
1260 | 					ParagraphContent.clear();
1261 | 					annotations.clear();
1262 | 					RegEx_HGVs_hash.clear();
1263 | 					character_hash.clear();
1264 | 				}
1265 | 			}
1266 | 			
1267 | 			inputfile.close();	
1268 | 			FileLocation.close();
1269 | 			FileData.close();
1270 | 		}
1271 | 		catch(IOException e1){ System.out.println("[MR]: Input file does not exist.");}
1272 | 	}
1273 | 
1274 | 	/*
1275 | 	 * Testing by CRF++
1276 | 	 */
1277 | 	public void CRF_test(String FilenameData,String FilenameOutput,String TrainTest) throws IOException 
1278 | 	{
1279 | 		File f = new File(FilenameOutput);
1280 |         BufferedWriter fr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(f), "UTF-8"));
1281 | 		
1282 | 		Runtime runtime = Runtime.getRuntime();
1283 | 	    
1284 | 		String OS=System.getProperty("os.name").toLowerCase();
1285 | 		String model="MentionExtractionUB.Model";
1286 | 		
1287 | 		String cmd="./CRF/crf_test -m CRF/"+model+" -o "+FilenameOutput+" "+FilenameData;
1288 | 		if(TrainTest.equals("Test_FullText"))
1289 | 		{
1290 | 			model="MentionExtractionUB.fulltext.Model";
1291 | 		}
1292 | 	    if(OS.contains("windows"))
1293 | 	    {
1294 | 	    	cmd ="CRF/crf_test -m CRF/"+model+" -o "+FilenameOutput+" "+FilenameData;
1295 | 	    }
1296 | 	    else //if(OS.contains("nux")||OS.contains("nix"))
1297 | 	    {
1298 | 	    	cmd ="./CRF/crf_test -m CRF/"+model+" -o "+FilenameOutput+" "+FilenameData;
1299 | 	    }
1300 | 	    
1301 | 	    try {
1302 | 	    	Process process = runtime.exec(cmd);
1303 | 	    	InputStream is = process.getInputStream();
1304 | 	    	InputStreamReader isr = new InputStreamReader(is);
1305 | 	    	BufferedReader br = new BufferedReader(isr);
1306 | 	    	String line="";
1307 | 		    while ( (line = br.readLine()) != null) 
1308 | 		    {
1309 | 		    	fr.write(line);
1310 | 		    	fr.newLine();
1311 | 		        fr.flush();
1312 | 		    }
1313 | 		    is.close();
1314 | 		    isr.close();
1315 | 		    br.close();
1316 | 		    fr.close();
1317 | 	    }
1318 | 	    catch (IOException e) {
1319 | 	    	System.out.println(e);
1320 | 	    	runtime.exit(1); // non-zero status: the CRF++ call failed
1321 | 	    }
1322 | 	}
1323 | 	
1324 | 	/*
1325 | 	 * Learning model by CRF++
1326 | 	 */
1327 | 	public void CRF_learn(String FilenameData) throws IOException 
1328 | 	{
1329 | 		Process process = null;
1330 | 	    String line = null;
1331 | 	    InputStream is = null;
1332 | 	    InputStreamReader isr = null;
1333 | 	    BufferedReader br = null;
1334 | 	    
1335 | 	    Runtime runtime = Runtime.getRuntime();
1336 | 	    String OS=System.getProperty("os.name").toLowerCase();
1337 | 		String cmd="./CRF/crf_learn -f 3 -c 4.0 CRF/template_UB "+FilenameData+" CRF/MentionExtractionUB.Model.new"; 
1338 | 	    if(OS.contains("windows"))
1339 | 	    {
1340 | 	    	cmd ="CRF/crf_learn -f 3 -c 4.0 CRF/template_UB "+FilenameData+" CRF/MentionExtractionUB.Model.new"; 
1341 | 	    }
1342 | 	    else //if(OS.contains("nux")||OS.contains("nix"))
1343 | 	    {
1344 | 	    	cmd ="./CRF/crf_learn -f 3 -c 4.0 CRF/template_UB "+FilenameData+" CRF/MentionExtractionUB.Model.new"; 
1345 | 	    }
1346 | 	    
1347 | 	    try {
1348 | 	    	process = runtime.exec(cmd);
1349 | 		    is = process.getInputStream();
1350 | 		    isr = new InputStreamReader(is);
1351 | 		    br = new BufferedReader(isr);
1352 | 		    while ( (line = br.readLine()) != null) 
1353 | 		    {
1354 | 		    	System.out.println(line);
1355 | 		        System.out.flush();
1356 | 		    }
1357 | 		    is.close();
1358 | 		    isr.close();
1359 | 		    br.close();
1360 | 	    }
1361 | 	    catch (IOException e) {
1362 | 	    	System.out.println(e);
1363 | 	    	runtime.exit(1); // non-zero status: the CRF++ call failed
1364 | 	    }
1365 | 	}
1366 | }
1367 | 
1368 | 
1369 | 
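
The feature-extraction loop above writes four word-shape columns (P1 to P4) and five prefix and five suffix columns per token. The following is a minimal sketch of just those two feature groups, isolated so they can be checked on a single token. The class `WordShapeSketch`, its method names, and the example tokens are hypothetical and not part of tmVar; only the regular expressions and the `__nil__` padding are taken from the listing.

```java
// Hypothetical sketch (not part of tmVar): recomputes the P1-P4 and affix columns for one token.
final class WordShapeSketch {

    /** P1: each character becomes its class symbol (A, a or 0). */
    static String p1(String token) {
        return "P1:" + token.replaceAll("[A-Z]", "A").replaceAll("[a-z]", "a").replaceAll("[0-9]", "0");
    }

    /** P2: like P1, but letter case is ignored. */
    static String p2(String token) {
        return "P2:" + token.replaceAll("[A-Za-z]", "a").replaceAll("[0-9]", "0");
    }

    /** P3: like P1, but each run of one class collapses to a single symbol. */
    static String p3(String token) {
        return "P3:" + token.replaceAll("[A-Z]+", "A").replaceAll("[a-z]+", "a").replaceAll("[0-9]+", "0");
    }

    /** P4: like P2, with runs collapsed. */
    static String p4(String token) {
        return "P4:" + token.replaceAll("[A-Za-z]+", "a").replaceAll("[0-9]+", "0");
    }

    /** Prefixes (fromStart) or suffixes of length 1..5, padded with __nil__ when the token is too short. */
    static String affixes(String token, boolean fromStart) {
        StringBuilder sb = new StringBuilder();
        for (int n = 1; n <= 5; n++) {
            if (n > 1) {
                sb.append(' ');
            }
            if (token.length() >= n) {
                sb.append(fromStart ? token.substring(0, n) : token.substring(token.length() - n));
            } else {
                sb.append("__nil__");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String token = "c.435G>A"; // made-up example token
        System.out.println(p1(token) + " " + p2(token) + " " + p3(token) + " " + p4(token));
        System.out.println(affixes("rs12", true));
        System.out.println(affixes("rs12", false));
    }
}
```

Keeping each feature in its own small method makes it possible to unit-test the columns without running the whole tokenization loop.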

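
`CRF_test` above builds the `crf_test` command as one string and passes it to `Runtime.exec(String)`, which splits the string on whitespace, so a data or output path containing a space would be broken into several arguments. A possible alternative, sketched here under assumptions, is to build the command as an argument list that could be handed to `ProcessBuilder`. The class `CrfCommandSketch`, the method `crfTestCommand`, and the file names in `main` are hypothetical; the binary path, model names, and flags are copied from the listing. The sketch only prints the command and does not launch CRF++.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not part of tmVar): builds the crf_test command as an argument list.
final class CrfCommandSketch {

    /** Mirrors the model and OS selection of CRF_test, but keeps every argument separate. */
    static List<String> crfTestCommand(String osName, String trainTest, String dataFile, String outputFile) {
        String model = trainTest.equals("Test_FullText")
                ? "MentionExtractionUB.fulltext.Model"
                : "MentionExtractionUB.Model";
        String binary = osName.toLowerCase().contains("windows") ? "CRF/crf_test" : "./CRF/crf_test";

        List<String> cmd = new ArrayList<>();
        cmd.add(binary);
        cmd.add("-m");
        cmd.add("CRF/" + model);
        cmd.add("-o");
        cmd.add(outputFile);
        cmd.add(dataFile);
        return cmd;
    }

    public static void main(String[] args) {
        // File names are made up for illustration.
        List<String> cmd = crfTestCommand(System.getProperty("os.name"), "Test", "tmp/in.data", "tmp/in.output");
        System.out.println(String.join(" ", cmd));
        // To actually run it (requires the CRF++ binary and model to be present):
        // Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
    }
}
```

With `redirectErrorStream(true)`, messages that CRF++ writes to standard error would arrive on the same stream that the Java side reads, which the original code does not consume.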

--------------------------------------------------------------------------------
/src/tmVarlib/PrefixTree.java:
--------------------------------------------------------------------------------
  1 | /**
  2 |  * Project: 
  3 |  * Function: Dictionary lookup by Prefix Tree
  4 |  */
  5 | 
  6 | package tmVarlib;
  7 | 
  8 | import java.io.BufferedReader;
  9 | import java.io.FileReader;
 10 | import java.io.IOException;
 11 | import java.io.*;
 12 | import java.util.*;
 13 | import java.util.regex.Matcher;
 14 | import java.util.regex.Pattern;
 15 | 
 16 | public class PrefixTree
 17 | {
 18 | 	private Tree Tr=new Tree();
 19 | 	
 20 | 	/**
 21 | 	 * Hash2Tree(HashMap<String, String> ID2Names)
 22 | 	 * Dictionary2Tree_Combine(String Filename,String StopWords,String MentionType)
 23 | 	 * Dictionary2Tree_UniqueGene(String Filename,String StopWords)	 //olr1831ps	10116:*405718
 24 | 	 * TreeFile2Tree(String Filename)
 25 | 	 * 
 26 | 	 */
 27 | 	public static HashMap<String, String> StopWord_hash = new HashMap<String, String>();
 28 | 	
 29 | 	public void Hash2Tree(HashMap<String, String> ID2Names)
 30 | 	{
 31 | 		for(String ID : ID2Names.keySet())  
 32 | 		{
 33 | 			Tr.insertMention(ID2Names.get(ID),ID);
 34 | 		}
 35 | 	}
 36 | 	
 37 | 	/*
 38 | 	 * Type	Identifier	Names
 39 | 	 * Species	9606	ttdh3pv|igl027/99|igl027/98|sw 1463
 40 | 	 */
 41 | 	public void Dictionary2Tree(String Filename,String StopWords)	
 42 | 	{
 43 | 		try 
 44 | 		{
 45 | 			/** Stop Word */
 46 | 			BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(StopWords), "UTF-8"));
 47 | 			String line="";
 48 | 			while ((line = br.readLine()) != null)  
 49 | 			{
 50 | 				StopWord_hash.put(line, "StopWord");
 51 | 			}
 52 | 			br.close();	
 53 | 			
 54 | 			BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream(Filename), "UTF-8"));
 55 | 			line="";
 56 | 			while ((line = inputfile.readLine()) != null)  
 57 | 			{
 58 | 				line = line.replaceAll("ω","w");line = line.replaceAll("μ","u");line = line.replaceAll("κ","k");line = line.replaceAll("α","a");line = line.replaceAll("γ","r");line = line.replaceAll("β","b");line = line.replaceAll("×","x");line = line.replaceAll("¹","1");line = line.replaceAll("²","2");line = line.replaceAll("°","o");line = line.replaceAll("ö","o");line = line.replaceAll("é","e");line = line.replaceAll("à","a");line = line.replaceAll("Á","A");line = line.replaceAll("ε","e");line = line.replaceAll("θ","O");line = line.replaceAll("•",".");line = line.replaceAll("µ","u");line = line.replaceAll("λ","r");line = line.replaceAll("⁺","+");line = line.replaceAll("ν","v");line = line.replaceAll("ï","i");line = line.replaceAll("ã","a");line = line.replaceAll("≡","=");line = line.replaceAll("ó","o");line = line.replaceAll("³","3");line = line.replaceAll("〖","[");line = line.replaceAll("〗","]");line = line.replaceAll("Å","A");line = line.replaceAll("ρ","p");line = line.replaceAll("ü","u");line = line.replaceAll("ɛ","e");line = line.replaceAll("č","c");line = line.replaceAll("š","s");line = line.replaceAll("ß","b");line = line.replaceAll("═","=");line = line.replaceAll("£","L");line = line.replaceAll("Ł","L");line = line.replaceAll("ƒ","f");line = line.replaceAll("ä","a");line = line.replaceAll("–","-");line = line.replaceAll("⁻","-");line = line.replaceAll("〈","<");line = line.replaceAll("〉",">");line = line.replaceAll("χ","X");line = line.replaceAll("Đ","D");line = line.replaceAll("‰","%");line = line.replaceAll("·",".");line = line.replaceAll("→",">");line = line.replaceAll("←","<");line = line.replaceAll("ζ","z");line = line.replaceAll("π","p");line = line.replaceAll("τ","t");line = line.replaceAll("ξ","X");line = line.replaceAll("η","h");line = line.replaceAll("ø","0");line = line.replaceAll("Δ","D");line = line.replaceAll("∆","D");line = line.replaceAll("∑","S");line = line.replaceAll("Ω","O");line = line.replaceAll("δ","d");line = line.replaceAll("σ","s");
 59 | 				String Column[]=line.split("\t",-1);
 60 | 				if(Column.length>2)
 61 | 				{
 62 | 					String ConceptType=Column[0];
 63 | 					String ConceptID=Column[1];
 64 | 					String ConceptNames=Column[2];
 65 | 					/*
 66 | 					 * Specific usage for Species
 67 | 					 */
 68 | 					if(	ConceptType.equals("Species"))
 69 | 					{
 70 | 						ConceptID=ConceptID.replace("species:ncbi:","");
 71 | 						ConceptNames=ConceptNames.replaceAll(" strain=", " ");
 72 | 						ConceptNames=ConceptNames.replaceAll("[\\W\\-\\_](str.|strain|substr.|substrain|var.|variant|subsp.|subspecies|pv.|pathovars|pathovar|br.|biovar)[\\W\\-\\_]", " ");
 73 | 						ConceptNames=ConceptNames.replaceAll("[\\(\\)]", " ");
 74 | 					}
 75 | 					
 76 | 					String NameColumn[]=ConceptNames.split("\\|");
 77 | 					for(int i=0;i<NameColumn.length;i++)
 78 | 					{
 79 | 						String tmp = NameColumn[i];
 80 | 						tmp=tmp.replaceAll("[\\W\\-\\_0-9]", "");
 81 | 						
 82 | 						/*
 83 | 						 * Specific usage for Species
 84 | 						 */
 85 | 						if(	ConceptType.equals("Species"))
 86 | 						{
 87 | 							if ( tmp.length()>=3 && (!NameColumn[i].substring(0, 1).matches("[\\W\\-\\_]")) && (!NameColumn[i].matches("a[\\W\\-\\_].*")) ) //length check first: substring(0,1) throws on an empty name
 88 | 							{
 89 | 								String tmp_mention=NameColumn[i].toLowerCase();
 90 | 								if(!StopWord_hash.containsKey(tmp_mention))
 91 | 								{
 92 | 									Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
 93 | 								}
 94 | 							}
 95 | 						}
 96 | 						/*
 97 | 						 * Specific usage for Gene & Cell
 98 | 						 */
 99 | 						else if ((ConceptType.equals("Gene") || ConceptType.equals("Cell")) )
100 | 						{
101 | 							if ( tmp.length()>=3 && (!NameColumn[i].substring(0, 1).matches("[\\W\\-\\_]")) ) //length check first: substring(0,1) throws on an empty name
102 | 							{
103 | 								String tmp_mention=NameColumn[i].toLowerCase();
104 | 								if(!StopWord_hash.containsKey(tmp_mention))
105 | 								{
106 | 									Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
107 | 								}
108 | 							}
109 | 						}
110 | 						/*
111 | 						 * Other Concepts
112 | 						 */
113 | 						else
114 | 						{
115 | 							if ( (!NameColumn[i].equals("")) &&	(!NameColumn[i].substring(0, 1).matches("[\\W\\-\\_]"))	)
116 | 							{
117 | 								String tmp_mention=NameColumn[i].toLowerCase();
118 | 								if(!StopWord_hash.containsKey(tmp_mention))
119 | 								{
120 | 									Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
121 | 								}
122 | 							}
123 | 						}
124 | 					}
125 | 				}
126 | 				else
127 | 				{
128 | 					System.out.println("[Dictionary2Tree_Combine]: Lexicon format error! Please follow : Type | Identifier | Names (Identifier can be NULL)");
129 | 				}
130 | 			}
131 | 			inputfile.close();	
132 | 		}
133 | 		catch(IOException e1){ System.out.println("[Dictionary2Tree_Combine]: Input file does not exist.");}
134 | 	}
135 | 	
136 | 	/*
137 | 	 * Type	Identifier	Names
138 | 	 * Species	9606	ttdh3pv|igl027/99|igl027/98|sw 1463
139 | 	 * 
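140 | 	 * @ Prefix: "" inserts every name; "Num" only names starting with a digit; a two-letter lowercase prefix only names starting with it; "Other" only names starting with neither a digit nor two letters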
141 | 	 */
142 | 	public void Dictionary2Tree(String Filename,String StopWords,String Prefix)	
143 | 	{
144 | 		try 
145 | 		{
146 | 			/** Stop Word */
147 | 			BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(StopWords), "UTF-8"));
148 | 			String line="";
149 | 			while ((line = br.readLine()) != null)  
150 | 			{
151 | 				StopWord_hash.put(line, "StopWord");
152 | 			}
153 | 			br.close();	
154 | 			
155 | 			/** Parsing Input */
156 | 			BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream(Filename), "UTF-8"));
157 | 			line="";
158 | 			while ((line = inputfile.readLine()) != null)  
159 | 			{
160 | 				line = line.replaceAll("ω","w");line = line.replaceAll("μ","u");line = line.replaceAll("κ","k");line = line.replaceAll("α","a");line = line.replaceAll("γ","r");line = line.replaceAll("β","b");line = line.replaceAll("×","x");line = line.replaceAll("¹","1");line = line.replaceAll("²","2");line = line.replaceAll("°","o");line = line.replaceAll("ö","o");line = line.replaceAll("é","e");line = line.replaceAll("à","a");line = line.replaceAll("Á","A");line = line.replaceAll("ε","e");line = line.replaceAll("θ","O");line = line.replaceAll("•",".");line = line.replaceAll("µ","u");line = line.replaceAll("λ","r");line = line.replaceAll("⁺","+");line = line.replaceAll("ν","v");line = line.replaceAll("ï","i");line = line.replaceAll("ã","a");line = line.replaceAll("≡","=");line = line.replaceAll("ó","o");line = line.replaceAll("³","3");line = line.replaceAll("〖","[");line = line.replaceAll("〗","]");line = line.replaceAll("Å","A");line = line.replaceAll("ρ","p");line = line.replaceAll("ü","u");line = line.replaceAll("ɛ","e");line = line.replaceAll("č","c");line = line.replaceAll("š","s");line = line.replaceAll("ß","b");line = line.replaceAll("═","=");line = line.replaceAll("£","L");line = line.replaceAll("Ł","L");line = line.replaceAll("ƒ","f");line = line.replaceAll("ä","a");line = line.replaceAll("–","-");line = line.replaceAll("⁻","-");line = line.replaceAll("〈","<");line = line.replaceAll("〉",">");line = line.replaceAll("χ","X");line = line.replaceAll("Đ","D");line = line.replaceAll("‰","%");line = line.replaceAll("·",".");line = line.replaceAll("→",">");line = line.replaceAll("←","<");line = line.replaceAll("ζ","z");line = line.replaceAll("π","p");line = line.replaceAll("τ","t");line = line.replaceAll("ξ","X");line = line.replaceAll("η","h");line = line.replaceAll("ø","0");line = line.replaceAll("Δ","D");line = line.replaceAll("∆","D");line = line.replaceAll("∑","S");line = line.replaceAll("Ω","O");line = line.replaceAll("δ","d");line = line.replaceAll("σ","s");
161 | 				String Column[]=line.split("\t");
162 | 				if(Column.length>2)
163 | 				{
164 | 					String ConceptType=Column[0];
165 | 					String ConceptID=Column[1];
166 | 					String ConceptNames=Column[2];
167 | 					
168 | 					/*
169 | 					 * Specific usage for Species
170 | 					 */
171 | 					if(	ConceptType.equals("Species"))
172 | 					{
173 | 						ConceptID=ConceptID.replace("species:ncbi:","");
174 | 						ConceptNames=ConceptNames.replaceAll(" strain=", " ");
175 | 						ConceptNames=ConceptNames.replaceAll("[\\W\\-\\_](str.|strain|substr.|substrain|var.|variant|subsp.|subspecies|pv.|pathovars|pathovar|br.|biovar)[\\W\\-\\_]", " ");
176 | 						ConceptNames=ConceptNames.replaceAll("[\\(\\)]", " ");
177 | 					}
178 | 					String NameColumn[]=ConceptNames.split("\\|");
179 | 					
180 | 					for(int i=0;i<NameColumn.length;i++)
181 | 					{
182 | 						String tmp = NameColumn[i];
183 | 						tmp=tmp.replaceAll("[\\W\\-\\_0-9]", "");
184 | 						
185 | 						/*
186 | 						 * Specific usage for Species
187 | 						 */
188 | 						if(	ConceptType.equals("Species") )
189 | 						{
190 | 							if ( tmp.length()>=3 && (!NameColumn[i].substring(0, 1).matches("[\\W\\-\\_]")) && (!NameColumn[i].matches("a[\\W\\-\\_].*")) ) //length check first: substring(0,1) throws on an empty name
191 | 							{
192 | 								String tmp_mention=NameColumn[i].toLowerCase();
193 | 								if(!StopWord_hash.containsKey(tmp_mention))
194 | 								{
195 | 									if(Prefix.equals(""))
196 | 									{
197 | 										Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
198 | 									}
199 | 									else if(Prefix.equals("Num") && NameColumn[i].toLowerCase().matches("[0-9].*"))
200 | 									{
201 | 										Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
202 | 									}
203 | 									else if(NameColumn[i].length()>=2 && NameColumn[i].toLowerCase().substring(0,2).equals(Prefix))
204 | 									{
205 | 										Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
206 | 									}
207 | 									else if(Prefix.equals("Other")
208 | 											 && (!NameColumn[i].toLowerCase().matches("[0-9].*"))
209 | 											 && (!NameColumn[i].toLowerCase().matches("[a-z][a-z].*"))
210 | 											)
211 | 									{
212 | 										Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
213 | 									}
214 | 								}
215 | 							}
216 | 						}
217 | 						/*
218 | 						 * Specific usage for Gene & Cell
219 | 						 */
220 | 						else if ((ConceptType.equals("Gene") || ConceptType.equals("Cell")))
221 | 						{
222 | 							if(	tmp.length()>=3 && (!NameColumn[i].substring(0, 1).matches("[\\W\\-\\_]"))	) //length check first: substring(0,1) throws on an empty name
223 | 							{
224 | 								String tmp_mention=NameColumn[i].toLowerCase();
225 | 								if(!StopWord_hash.containsKey(tmp_mention))
226 | 								{
227 | 									if(Prefix.equals(""))
228 | 									{
229 | 										Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
230 | 									}
231 | 									else if(Prefix.equals("Num") && NameColumn[i].toLowerCase().matches("[0-9].*"))
232 | 									{
233 | 										Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
234 | 									}
235 | 									else if(NameColumn[i].length()>=2 && NameColumn[i].toLowerCase().substring(0,2).equals(Prefix))
236 | 									{
237 | 										Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
238 | 									}
239 | 									else if(Prefix.equals("Other")
240 | 											 && (!NameColumn[i].toLowerCase().matches("[0-9].*"))
241 | 											 && (!NameColumn[i].toLowerCase().matches("[a-z][a-z].*"))
242 | 											)
243 | 									{
244 | 										Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
245 | 									}
246 | 								}
247 | 							}
248 | 						}
249 | 						/*
250 | 						 * Other Concepts
251 | 						 */
252 | 						else
253 | 						{
254 | 							if ( (!NameColumn[i].equals("")) && (!NameColumn[i].substring(0, 1).matches("[\\W\\-\\_]"))	)
255 | 							{
256 | 								String tmp_mention=NameColumn[i].toLowerCase();
257 | 								if(!StopWord_hash.containsKey(tmp_mention))
258 | 								{
259 | 									if(Prefix.equals(""))
260 | 									{
261 | 										Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
262 | 									}
263 | 									else if(Prefix.equals("Num") && NameColumn[i].toLowerCase().matches("[0-9].*"))
264 | 									{
265 | 										Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
266 | 									}
267 | 									else if(NameColumn[i].length()>=2 && NameColumn[i].toLowerCase().substring(0,2).equals(Prefix))
268 | 									{
269 | 										Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
270 | 									}
271 | 									else if(Prefix.equals("Other")
272 | 											 && (!NameColumn[i].toLowerCase().matches("[0-9].*"))
273 | 											 && (!NameColumn[i].toLowerCase().matches("[a-z][a-z].*"))
274 | 											)
275 | 									{
276 | 										Tr.insertMention(NameColumn[i],ConceptType+"\t"+ConceptID);
277 | 									}
278 | 								}
279 | 							}
280 | 						}
281 | 					}
282 | 				}
283 | 				else
284 | 				{
285 | 					System.out.println("[Dictionary2Tree_Combine]: Lexicon format error! Please follow : Type | Identifier | Names (Identifier can be NULL)");
286 | 				}
287 | 			}
288 | 			inputfile.close();	
289 | 		}
290 | 		catch(IOException e1){ System.out.println("[Dictionary2Tree_Combine]: Input file does not exist.");}
291 | 	}
292 | 	public void TreeFile2Tree(String Filename)	
293 | 	{
294 | 		try 
295 | 		{
296 | 			BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream(Filename), "UTF-8"));
297 | 			String line="";
298 | 			int count=0;
299 | 			while ((line = inputfile.readLine()) != null)  
300 | 			{
301 | 				line = line.replaceAll("ω","w");line = line.replaceAll("μ","u");line = line.replaceAll("κ","k");line = line.replaceAll("α","a");line = line.replaceAll("γ","r");line = line.replaceAll("β","b");line = line.replaceAll("×","x");line = line.replaceAll("¹","1");line = line.replaceAll("²","2");line = line.replaceAll("°","o");line = line.replaceAll("ö","o");line = line.replaceAll("é","e");line = line.replaceAll("à","a");line = line.replaceAll("Á","A");line = line.replaceAll("ε","e");line = line.replaceAll("θ","O");line = line.replaceAll("•",".");line = line.replaceAll("µ","u");line = line.replaceAll("λ","r");line = line.replaceAll("⁺","+");line = line.replaceAll("ν","v");line = line.replaceAll("ï","i");line = line.replaceAll("ã","a");line = line.replaceAll("≡","=");line = line.replaceAll("ó","o");line = line.replaceAll("³","3");line = line.replaceAll("〖","[");line = line.replaceAll("〗","]");line = line.replaceAll("Å","A");line = line.replaceAll("ρ","p");line = line.replaceAll("ü","u");line = line.replaceAll("ɛ","e");line = line.replaceAll("č","c");line = line.replaceAll("š","s");line = line.replaceAll("ß","b");line = line.replaceAll("═","=");line = line.replaceAll("£","L");line = line.replaceAll("Ł","L");line = line.replaceAll("ƒ","f");line = line.replaceAll("ä","a");line = line.replaceAll("–","-");line = line.replaceAll("⁻","-");line = line.replaceAll("〈","<");line = line.replaceAll("〉",">");line = line.replaceAll("χ","X");line = line.replaceAll("Đ","D");line = line.replaceAll("‰","%");line = line.replaceAll("·",".");line = line.replaceAll("→",">");line = line.replaceAll("←","<");line = line.replaceAll("ζ","z");line = line.replaceAll("π","p");line = line.replaceAll("τ","t");line = line.replaceAll("ξ","X");line = line.replaceAll("η","h");line = line.replaceAll("ø","0");line = line.replaceAll("Δ","D");line = line.replaceAll("∆","D");line = line.replaceAll("∑","S");line = line.replaceAll("Ω","O");line = line.replaceAll("δ","d");line = line.replaceAll("σ","s");
302 | 				String Anno[]=line.split("\t"); if(Anno.length<2){continue;} //skip malformed lines: Anno[1] is read below
303 | 				String LocationInTree = Anno[0];
304 | 				String token = Anno[1];
305 | 				String type="";
306 | 				String identifier="";
307 | 				if(Anno.length>2)
308 | 				{
309 | 					type = Anno[2];
310 | 				}
311 | 				if(Anno.length>3)
312 | 				{
313 | 					identifier = Anno[3];
314 | 				}
315 | 				
316 | 				String LocationsInTree[]=LocationInTree.split("-");
317 | 				TreeNode tmp = Tr.root;
318 | 				for(int i=0;i<LocationsInTree.length-1;i++)
319 | 				{
320 | 					tmp=tmp.links.get(Integer.parseInt(LocationsInTree[i])-1);
321 | 				}
322 | 				if(type.equals("") && identifier.equals(""))
323 | 				{
324 | 					tmp.InsertToken(token,"");
325 | 				}
326 | 				else if(identifier.equals(""))
327 | 				{
328 | 					tmp.InsertToken(token,type);
329 | 				}
330 | 				else
331 | 				{
332 | 					tmp.InsertToken(token,type+"\t"+identifier);
333 | 				}
334 | 				count++;
335 | 			}
336 | 			inputfile.close();	
337 | 		}
338 | 		catch(IOException e1){ System.out.println("[TreeFile2Tree]: Input file does not exist.");}
339 | 	}
340 | 	
341 | 	/*
342 | 	 * Search target mention in the Prefix Tree
343 | 	 */
344 | 	public String MentionMatch(String Mentions,Integer PrefixTranslation, Integer Tok_NumCharPartialMatch)
345 | 	{
346 | 		ArrayList<String> location = new ArrayList<String>();
347 | 		String Menlist[]=Mentions.split("\\|");
348 | 		for(int m=0;m<Menlist.length;m++)
349 | 		{
350 | 			String Mention=Menlist[m];
351 | 			String Mention_lc=Mention.toLowerCase();
352 | 			Mention_lc = Mention_lc.replaceAll("[\\W\\-\\_]+", "");
353 | 			Mention_lc = Mention_lc.replaceAll("([0-9])([a-z])", "$1 $2");
354 | 			Mention_lc = Mention_lc.replaceAll("([a-z])([0-9])", "$1 $2");
355 | 			String Tkns[]=Mention_lc.split(" ");
356 | 			
357 | 			int i=0;
358 | 			boolean find=false;
359 | 			TreeNode tmp = Tr.root;
360 | 			String Concept="";
361 | 			while( i<Tkns.length && tmp.CheckChild(Tkns[i],i,Tok_NumCharPartialMatch,PrefixTranslation)>=0) //Find Tokens in the links
362 | 			{
363 | 				int CheckChild_Num=tmp.CheckChild(Tkns[i],i,Tok_NumCharPartialMatch,PrefixTranslation);
364 | 				int CheckChild_Num_tmp=CheckChild_Num;
365 | 				if(CheckChild_Num>=10000000){CheckChild_Num_tmp=CheckChild_Num-10000000;}
366 | 				tmp=tmp.links.get(CheckChild_Num_tmp); //move point to the link
367 | 				Concept = tmp.Concept;
368 | 				if(CheckChild_Num>=10000000){Concept="Species	_PartialMatch_";}
369 | 				find=true;
370 | 				i++;
371 | 			}
372 | 			if(find == true)
373 | 			{
374 | 				if(i==Tkns.length)
375 | 				{
376 | 					if(!Concept.equals(""))
377 | 					{
378 | 						return Concept;
379 | 					}
380 | 					else
381 | 					{
382 | 						return "-1"; //gene id is not found.
383 | 					}
384 | 				}
385 | 				else
386 | 				{
387 | 					return "-2"; //the gene mention matched a substring in PrefixTree.
388 | 				}
389 | 			}
390 | 			else
391 | 			{
392 | 				return "-3"; //mention is not found
393 | 			}
394 | 		}
395 | 		return "-3"; //mention is not found
396 | 	}
397 | 	
398 | 	/*
399 | 	 * Search target mention in the Prefix Tree
400 | 	 * ConceptType: Species|Genus|Cell|CTDGene
401 | 	 */
402 | 	public ArrayList<String> SearchMentionLocation(String Doc,String Doc_org,String ConceptType, Integer PrefixTranslation, Integer Tok_NumCharPartialMatch)
403 | 	{
404 | 		ArrayList<String> location = new ArrayList<String>();
405 | 		Doc=Doc.toLowerCase();
406 | 		String Doc_lc=Doc;
407 | 		Doc = Doc.replaceAll("([0-9])([A-Za-z])", "$1 $2");
408 | 		Doc = Doc.replaceAll("([A-Za-z])([0-9])", "$1 $2");
409 | 		Doc = Doc.replaceAll("[\\W^;:,]+", " ");
410 | 		
411 | 		String DocTkns[]=Doc.split(" ");
412 | 		int Offset=0;
413 | 		int Start=0;
414 | 		int Last=0;
415 | 		int FirstTime=0;
416 | 		
417 | 		while(Doc_lc.length()>0 && Doc_lc.substring(0,1).matches("[\\W]")) //clean the forward whitespace
418 | 		{
419 | 			Doc_lc=Doc_lc.substring(1);
420 | 			Offset++;
421 | 		}
422 | 		
423 | 		for(int i=0;i<DocTkns.length;i++)
424 | 		{
425 | 			TreeNode tmp = Tr.root;
426 | 			boolean find=false;
427 | 			int ConceptFound=i; //Keep found concept
428 | 			String ConceptFound_STR="";//Keep found concept
429 | 			int Tokn_num=0;
430 | 			String Concept="";
431 | 			while( tmp.CheckChild(DocTkns[i],Tokn_num,Tok_NumCharPartialMatch,PrefixTranslation)>=0 ) //Find Tokens in the links
432 | 			{
433 | 				int CheckChild_Num=tmp.CheckChild(DocTkns[i],Tokn_num,Tok_NumCharPartialMatch,PrefixTranslation);
434 | 				int CheckChild_Num_tmp=CheckChild_Num;
435 | 				if(CheckChild_Num>=10000000){CheckChild_Num_tmp=CheckChild_Num-10000000;}
436 | 				tmp=tmp.links.get(CheckChild_Num_tmp); //move point to the link
437 | 				Concept = tmp.Concept;
438 | 				if(CheckChild_Num>=10000000){Concept="Species	_PartialMatch_";}
439 | 				
440 | 				if(Start==0 && FirstTime>0){Start = Offset;} //Start <- Offset 
441 | 				if(Doc_lc.length()>=DocTkns[i].length() && Doc_lc.substring(0,DocTkns[i].length()).equals(DocTkns[i]))
442 | 				{
443 | 					if(DocTkns[i].length()>0)
444 | 					{
445 | 						Doc_lc=Doc_lc.substring(DocTkns[i].length());
446 | 						Offset=Offset+DocTkns[i].length();
447 | 					}
448 | 				}
449 | 				Last = Offset;
450 | 				while(Doc_lc.length()>0 && Doc_lc.substring(0,1).matches("[\\W]")) //clean the forward whitespace
451 | 				{
452 | 					Doc_lc=Doc_lc.substring(1);
453 | 					Offset++;
454 | 				}
455 | 				i++;
456 | 				Tokn_num++;
457 | 				
458 | 				if(ConceptType.equals("Species"))
459 | 				{
460 | 					if(i<DocTkns.length-2 && DocTkns[i].matches("(str|strain|substr|substrain|subspecies|subsp|var|variant|pathovars|pv|biovar|bv)"))
461 | 					{
462 | 						Doc_lc=Doc_lc.substring(DocTkns[i].length());
463 | 						Offset=Offset+DocTkns[i].length();
464 | 						Last = Offset;
465 | 						while(Doc_lc.length()>0 && Doc_lc.substring(0,1).matches("[\\W]")) //skip leading non-word characters (whitespace, punctuation)
466 | 						{
467 | 							Doc_lc=Doc_lc.substring(1);
468 | 							Offset++;
469 | 						}
470 | 						i++;
471 | 					}
472 | 				}
473 | 				
474 | 				if(!Concept.equals("") && (Last-Start>0)) //Keep found concept
475 | 				{
476 | 					ConceptFound=i;
477 | 					ConceptFound_STR=Start+"\t"+Last+"\t"+Doc_org.substring(Start, Last)+"\t"+Concept;
478 | 				}
479 | 				
480 | 				find=true;
481 | 				if(i>=DocTkns.length){break;}
482 | 				//else if(i==DocTkns.length-1){PrefixTranslation=2;}
483 | 			}
484 | 			
485 | 			if(find == true)
486 | 			{
487 | 				if(!Concept.equals("") && (Last-Start>0)) //the last matched token has concept id 
488 | 				{
489 | 					location.add(Start+"\t"+Last+"\t"+Doc_org.substring(Start, Last)+"\t"+Concept);
490 | 					
491 | 				}
492 | 				else if(!ConceptFound_STR.equals("")) //Keep found concept
493 | 				{
494 | 					location.add(ConceptFound_STR);
495 | 					i = ConceptFound + 1;
496 | 				}
497 | 				Start=0;
498 | 				Last=0;
499 | 				if(i>0){i--;}
500 | 				ConceptFound=i; //Keep found concept
501 | 				ConceptFound_STR="";//Keep found concept
502 | 			}
503 | 			else //if(find == false)
504 | 			{
505 | 				if(Doc_lc.length()>=DocTkns[i].length() && Doc_lc.substring(0,DocTkns[i].length()).equals(DocTkns[i]))
506 | 				{
507 | 					if(DocTkns[i].length()>0)
508 | 					{
509 | 						Doc_lc=Doc_lc.substring(DocTkns[i].length());
510 | 						Offset=Offset+DocTkns[i].length();
511 | 					}
512 | 				}
513 | 			}
514 | 			
515 | 			while(Doc_lc.length()>0 && Doc_lc.substring(0,1).matches("[\\W]")) //skip leading non-word characters (whitespace, punctuation)
516 | 			{
517 | 				Doc_lc=Doc_lc.substring(1);
518 | 				Offset++;
519 | 			}
520 | 			FirstTime++;
521 | 		}
522 | 		return location;
523 | 	}
524 | 	
525 | 	/*
526 | 	 * Print out the Prefix Tree
527 | 	 */
528 | 	public String PrintTree()
529 | 	{
530 | 		return Tr.PrintTree_preorder(Tr.root,"");
531 | 	}
532 | }
533 | 
534 | class Tree 
535 | {
536 | 	/*
537 | 	 * Prefix Tree - root node
538 | 	 */
539 | 	public TreeNode root;
540 | 	
541 | 	public Tree() 
542 | 	{ 
543 | 		root = new TreeNode("-ROOT-"); 
544 | 	}
545 | 	
546 | 	/*
547 | 	 * Insert mention into the tree
548 | 	 */
549 | 	public void insertMention(String Mention, String Identifier)
550 | 	{
551 | 		Mention=Mention.toLowerCase();
552 | 		Identifier = Identifier.replaceAll("\t$", "");
553 | 		Mention = Mention.replaceAll("([0-9])([A-Za-z])", "$1 $2");
554 | 		Mention = Mention.replaceAll("([A-Za-z])([0-9])", "$1 $2");
555 | 		Mention = Mention.replaceAll("[\\W\\-\\_]+", " ");
556 | 
557 | 		String Tokens[]=Mention.split(" ");
558 | 		TreeNode tmp = root;
559 | 		for(int i=0;i<Tokens.length;i++)
560 | 		{
561 | 			if(tmp.CheckChild(Tokens[i],i,0,0)>=0)
562 | 			{
563 | 				tmp=tmp.links.get( tmp.CheckChild(Tokens[i],i,0,0) ); //go through next generation (exist node)
564 | 				if(i == Tokens.length-1)
565 | 				{
566 | 					tmp.Concept=Identifier;
567 | 				}
568 | 			}
569 | 			else //not exist
570 | 			{
571 | 				if(i == Tokens.length-1)
572 | 				{
573 | 					tmp.InsertToken(Tokens[i],Identifier);
574 | 				}
575 | 				else
576 | 				{
577 | 					tmp.InsertToken(Tokens[i]);
578 | 				}
579 | 				tmp=tmp.links.get(tmp.NumOflinks-1); //go to the next generation (new node)
580 | 			}
581 | 		}
582 | 	}
583 | 	
584 | 	/*
585 | 	 * Print the tree by pre-order
586 | 	 */
587 | 	public String PrintTree_preorder(TreeNode node, String LocationInTree)
588 | 	{
589 | 		String opt="";
590 | 		if(!node.token.equals("-ROOT-"))//Ignore root
591 | 		{
592 | 			if(node.Concept.equals(""))
593 | 			{
594 | 				opt=opt+LocationInTree+"\t"+node.token+"\n";
595 | 			}
596 | 			else
597 | 			{
598 | 				opt=opt+LocationInTree+"\t"+node.token+"\t"+node.Concept+"\n";
599 | 			}
600 | 		} 
601 | 		if(!LocationInTree.equals("")){LocationInTree=LocationInTree+"-";}
602 | 		for(int i=0;i<node.NumOflinks;i++)
603 | 		{
604 | 			opt=opt+PrintTree_preorder(node.links.get(i),LocationInTree+(i+1));
605 | 		}
606 | 		return opt;
607 | 	}
608 | }
609 | 
610 | class TreeNode 
611 | {
612 | 	String token; //token of the node
613 | 	int NumOflinks; //Number of links
614 | 	public String Concept;
615 | 	ArrayList<TreeNode> links;
616 | 	
617 | 	public TreeNode(String Tok,String ID)
618 | 	{
619 | 		token = Tok;
620 | 		NumOflinks = 0;
621 | 		Concept = ID;
622 | 		links = new ArrayList<TreeNode>();
623 | 	}
624 | 	public TreeNode(String Tok)
625 | 	{
626 | 		token = Tok;
627 | 		NumOflinks = 0;
628 | 		Concept = "";
629 | 		links = new ArrayList<TreeNode>();
630 | 	}
631 | 	public TreeNode()
632 | 	{
633 | 		token = "";
634 | 		NumOflinks = 0;
635 | 		Concept = "";
636 | 		links = new ArrayList<TreeNode>();
637 | 	}
638 | 	
639 | 	/*
640 | 	 * Insert a new node under the target node
641 | 	 */
642 | 	public void InsertToken(String Tok)
643 | 	{
644 | 		TreeNode NewNode = new TreeNode(Tok);
645 | 		links.add(NewNode);
646 | 		NumOflinks++;
647 | 	}
648 | 	public void InsertToken(String Tok,String ID)
649 | 	{
650 | 		TreeNode NewNode = new TreeNode(Tok,ID);
651 | 		links.add(NewNode);
652 | 		NumOflinks++;
653 | 	}
654 | 	
655 | 	/*
656 | 	 * Check the tokens of children
657 | 	 * PrefixTranslation = 1 (SuffixTranslationMap)
658 | 	 * PrefixTranslation = 2 (CTDGene; partial match for numbers)
659 | 	 * PrefixTranslation = 3 (NCBI Taxonomy usage (IEB) : suffix partial match)
660 | 	 */
661 | 	public int CheckChild(String Tok,Integer Tok_num,Integer Tok_NumCharPartialMatch, Integer PrefixTranslation)
662 | 	{
663 | 		/** Suffix Translation */
664 | 		ArrayList<String> SuffixTranslationMap = new ArrayList<String>();
665 | 		SuffixTranslationMap.add("alpha-a");
666 | 		SuffixTranslationMap.add("alpha-1");
667 | 		SuffixTranslationMap.add("a-alpha");
668 | 		//SuffixTranslationMap.add("a-1");
669 | 		SuffixTranslationMap.add("1-alpha");
670 | 		//SuffixTranslationMap.add("1-a");
671 | 		SuffixTranslationMap.add("beta-b");
672 | 		SuffixTranslationMap.add("beta-2");
673 | 		SuffixTranslationMap.add("b-beta");
674 | 		//SuffixTranslationMap.add("b-2");
675 | 		SuffixTranslationMap.add("2-beta");
676 | 		//SuffixTranslationMap.add("2-b");
677 | 		SuffixTranslationMap.add("gamma-g");
678 | 		SuffixTranslationMap.add("gamma-y");
679 | 		SuffixTranslationMap.add("g-gamma");
680 | 		SuffixTranslationMap.add("y-gamma");
681 | 		SuffixTranslationMap.add("1-i");
682 | 		SuffixTranslationMap.add("i-1");
683 | 		SuffixTranslationMap.add("2-ii");
684 | 		SuffixTranslationMap.add("ii-2");
685 | 		SuffixTranslationMap.add("3-iii");
686 | 		SuffixTranslationMap.add("iii-3");
687 | 		SuffixTranslationMap.add("4-iv");
688 | 		SuffixTranslationMap.add("iv-4");
689 | 		SuffixTranslationMap.add("5-v");
690 | 		SuffixTranslationMap.add("v-5");
691 | 		
692 | 		for(int i=0;i<links.size();i++)
693 |         {
694 | 			if(Tok.equals(links.get(i).token))
695 | 			{
696 | 				return (i);
697 | 			}
698 |         }
699 | 		
700 | 		if(PrefixTranslation == 1 && Tok.matches("(alpha|beta|gamma|[abg]|[12])")) // SuffixTranslationMap
701 | 		{
702 | 			for(int i=0;i<links.size();i++)
703 | 	        {
704 | 				if(SuffixTranslationMap.contains(Tok+"-"+links.get(i).token))
705 | 				{
706 | 					return(i);
707 | 				}
708 | 	        }
709 | 		}
710 | 		else if(PrefixTranslation == 2 && Tok.matches("[1-5]")) // for CTDGene feature
711 | 		{
712 | 			for(int i=0;i<links.size();i++)
713 | 	        {
714 | 				if(links.get(i).token.matches("[1-5]"))
715 | 				{
716 | 					return(i);
717 | 				}
718 | 	        }
719 | 		}
720 | 		else if(PrefixTranslation == 3 && Tok.length()>=Tok_NumCharPartialMatch && Tok_num>=1) // for NCBI Taxonomy usage (IEB) : suffix partial match
721 | 		{
722 | 			for(int i=0;i<links.size();i++)
723 | 	        {
724 | 				if(links.get(i).token.length()>=Tok_NumCharPartialMatch)
725 | 				{
726 | 					String Tok_Prefix=Tok.substring(0,Tok_NumCharPartialMatch);
727 | 					if((!links.get(i).Concept.equals("")) && links.get(i).token.substring(0,Tok_NumCharPartialMatch).equals(Tok_Prefix))
728 | 					{
729 | 						return(i+10000000);
730 | 					}
731 | 				}
732 | 	        }
733 | 		}
734 | 		
735 | 		return(-1);
736 | 	}
737 | }
738 | 	
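
The lookup performed by `PrefixTree` can be illustrated with a much smaller structure. The sketch below is not the tmVar implementation: the names `TokenTrieSketch`, `insert`, and `search` are hypothetical, each node uses a `HashMap` instead of a scanned `ArrayList`, and the partial-match modes and character offsets are left out. It only shows the two ideas the code above relies on: mentions are normalized into tokens the same way `insertMention` does it, and a search keeps the longest match while remembering the last node that carried a concept so it can fall back to it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified token-level prefix tree (not the tmVar classes).
class TokenTrieSketch {
    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        String concept = ""; // empty string: no mention ends at this node
    }

    private final Node root = new Node();

    // Same normalization idea as insertMention: lower-case, split at
    // letter/digit boundaries, collapse non-word characters into spaces.
    static String[] tokenize(String text) {
        String s = text.toLowerCase();
        s = s.replaceAll("([0-9])([a-z])", "$1 $2");
        s = s.replaceAll("([a-z])([0-9])", "$1 $2");
        s = s.replaceAll("[\\W_]+", " ").trim();
        return s.isEmpty() ? new String[0] : s.split(" ");
    }

    void insert(String mention, String id) {
        Node cur = root;
        for (String tok : tokenize(mention)) {
            cur = cur.children.computeIfAbsent(tok, k -> new Node());
        }
        if (cur != root) {
            cur.concept = id; // mark the node where the mention ends
        }
    }

    // Returns "matched tokens<TAB>concept" for every longest match.
    List<String> search(String text) {
        String[] toks = tokenize(text);
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < toks.length) {
            Node cur = root;
            int j = i;
            int bestEnd = -1;   // end (exclusive) of the last node with a concept
            String bestId = "";
            while (j < toks.length && cur.children.containsKey(toks[j])) {
                cur = cur.children.get(toks[j]);
                j++;
                if (!cur.concept.isEmpty()) {
                    bestEnd = j;
                    bestId = cur.concept;
                }
            }
            if (bestEnd > 0) {
                hits.add(String.join(" ", Arrays.copyOfRange(toks, i, bestEnd)) + "\t" + bestId);
                i = bestEnd;    // resume after the accepted match
            } else {
                i++;            // no mention starts here
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TokenTrieSketch trie = new TokenTrieSketch();
        trie.insert("BRAF V600E", "id1");
        trie.insert("BRAF", "id0");
        System.out.println(trie.search("mutant BRAF V600E and BRAF alone"));
    }
}
```

Because "BRAF" and "BRAF V600E" share a prefix, the first occurrence resolves to the longer mention, and the second one falls back to the shorter entry. The real `SearchMentionLocation` additionally tracks `Start`/`Last` offsets into the original document.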


--------------------------------------------------------------------------------
/src/tmVarlib/tmVar.java:
--------------------------------------------------------------------------------
  1 | package tmVarlib;
  2 | //
  3 | // tmVar - Java version
  4 | //
  5 | 
  6 | import java.io.BufferedReader;
  7 | import java.io.File;
  8 | import java.io.FileReader;
  9 | import java.io.FileInputStream;
 10 | import java.io.FileOutputStream;
 11 | import java.io.InputStream;
 12 | import java.io.InputStreamReader;
 13 | import java.io.OutputStream;
 14 | import java.io.OutputStreamWriter;
 15 | import java.sql.SQLException;
 16 | import java.io.IOException;
 17 | import java.util.ArrayList;
 18 | import java.util.HashMap;
 19 | import java.util.regex.Matcher;
 20 | import java.util.regex.Pattern;
 21 | 
 22 | import javax.xml.stream.XMLStreamException;
 23 | 
 24 | import org.tartarus.snowball.SnowballStemmer;
 25 | import org.tartarus.snowball.ext.englishStemmer;
 26 | 
 27 | import edu.stanford.nlp.tagger.maxent.MaxentTagger;
 28 | 
 29 | public class tmVar
 30 | {
 31 | 	public static MaxentTagger tagger = new MaxentTagger();
 32 | 	public static SnowballStemmer stemmer = new englishStemmer();
 33 | 	public static ArrayList<String> RegEx_DNAMutation_STR=new ArrayList<String>(); 
 34 | 	public static ArrayList<String> RegEx_ProteinMutation_STR=new ArrayList<String>(); 
 35 | 	public static ArrayList<String> RegEx_SNP_STR=new ArrayList<String>();
 36 | 	public static ArrayList<String> PAM_lowerScorePair = new ArrayList<String>();
 37 | 	public static HashMap<String,String> GeneVariantMention = new HashMap<String,String>();
 38 | 	public static HashMap<String,Integer> VariantFrequency = new HashMap<String,Integer>();
 39 | 	public static Pattern Pattern_Component_1;
 40 | 	public static Pattern Pattern_Component_2;
 41 | 	public static Pattern Pattern_Component_3;
 42 | 	public static Pattern Pattern_Component_4;
 43 | 	public static Pattern Pattern_Component_5;
 44 | 	public static Pattern Pattern_Component_6;
 45 | 	public static boolean GeneMention = false; // set to true if "ExtractFeature" finds a gene mention
 46 | 	public static HashMap<String,String> nametothree = new HashMap<String,String>();
 47 | 	public static HashMap<String,String> threetone = new HashMap<String,String>();
 48 | 	public static HashMap<String,String> threetone_nu = new HashMap<String,String>();
 49 | 	public static HashMap<String,String> NTtoATCG = new HashMap<String,String>();
 50 | 	public static ArrayList<String> MF_Pattern = new ArrayList<String>();
 51 | 	public static ArrayList<String> MF_Type = new ArrayList<String>();
 52 | 	public static HashMap<String,String> filteringStr_hash = new HashMap<String,String>();
 53 | 	public static HashMap<String,String> Mutation_RS_Geneid_hash = new HashMap<String,String>();
 54 | 	public static ArrayList<String> RS_DNA_Protein = new ArrayList<String>();
 55 | 	public static HashMap<String,String> one2three = new HashMap<String,String>();
 56 | 	public static PrefixTree PT_GeneVariantMention = new PrefixTree();
 57 | 	public static HashMap<String,String> VariantType_hash = new HashMap<String,String>();
 58 | 	public static HashMap<String,String> Number_word2digit = new HashMap<String,String>();
 59 | 	public static HashMap<String,String> grouped_variants = new HashMap<String,String>();
 60 | 	public static HashMap<String,String> nu2aa_hash = new HashMap<String,String>();
 61 | 	public static HashMap<String,Integer> RS2Frequency_hash = new HashMap<String,Integer>();
 62 | 	public static HashMap<String,HashMap<Integer,String>> variant_mention_to_filter_overlap_gene = new HashMap<String,HashMap<Integer,String>>(); // gene mention: GCC
 63 | 	public static HashMap<String,String> Gene2HumanGene_hash = new HashMap<String,String>();
 64 | 	public static HashMap<String,String> Variant2MostCorrespondingGene_hash = new HashMap<String,String>();
 65 | 	public static HashMap<String,String> RSandPosition2Seq_hash = new HashMap<String,String>();
 66 | 	
 67 | 	public static void main(String [] args) throws IOException, InterruptedException, XMLStreamException, SQLException 
 68 | 	{
 69 | 		/*
 70 | 		 * Parameters
 71 | 		 */
 72 | 		String InputFolder="input";
 73 | 		String OutputFolder="output";
 74 | 		String TrainTest="Test"; //Train|Train_Mention|Test|Test_FullText
 75 | 		String DeleteTmp="True";
 76 | 		String DisplayRSnumOnly="True"; // hide the types of the methods
 77 | 		String DisplayChromosome="True"; // hide the chromosome mentions
 78 | 		String DisplayRefSeq="True"; // hide the RefSeq mentions
 79 | 		String DisplayGenomicRegion="True";
 80 | 		String HideMultipleResult="True"; //L858R: 121434568|1057519847|1057519848 --> 121434568
 81 | 		if(args.length<2)
 82 | 		{
 83 | 			System.out.println("\n$ java -Xmx5G -Xms5G -jar tmVar.jar [InputFolder] [OutputFolder]");
 84 | 			System.out.println("[InputFolder] Default : input");
 85 | 			System.out.println("[OutputFolder] Default : output\n\n");
 86 | 		}
 87 | 		else
 88 | 		{
 89 | 			InputFolder=args [0];
 90 | 			OutputFolder=args [1];
 91 | 			
 92 | 			if(args.length>2 && args[2].toLowerCase().matches("(train|train_mention|test|test_fulltext)"))
 93 | 			{
 94 | 				TrainTest=args [2];
 95 | 				if(args[2].toLowerCase().matches("(train|train_mention)"))
 96 | 				{
 97 | 					DeleteTmp="False";
 98 | 				}
 99 | 			}
100 | 			if(args.length>3 && args[3].matches("(True|False)")) // case-sensitive: a lower-cased argument could never match "True|False"
101 | 			{
102 | 				DeleteTmp=args [3];
103 | 			}
104 | 			if(args.length>4 && args[4].matches("(True|False)"))
105 | 			{
106 | 				DisplayRSnumOnly=args [4];
107 | 			}
108 | 			if(args.length>5 && args[5].matches("(True|False)"))
109 | 			{
110 | 				HideMultipleResult=args [5]; // sixth argument, not the fifth
111 | 			}
112 | 		}
113 | 		
114 | 		double startTime,endTime,totTime;
115 | 		startTime = System.currentTimeMillis();//start time
116 | 		BioCConverter BC= new BioCConverter();
117 | 		
118 | 		/**
119 | 		 * Import models and resource
120 | 		 */
121 | 		{
122 | 			/*
123 | 			 * POSTagging: loading model
124 | 			 */
125 | 			tagger = new MaxentTagger("lib/taggers/english-left3words-distsim.tagger");
126 | 			
127 | 			/*
128 | 			 * Stemming : using Snowball
129 | 			 */
130 | 			stemmer = new englishStemmer();
131 | 			
132 | 			/*
133 | 			 * PAM 140 : <=-6 pairs 
134 | 			 */
135 | 			BufferedReader PAM = new BufferedReader(new InputStreamReader(new FileInputStream("lib/PAM140-6.txt"), "UTF-8"));
136 | 			String line="";
137 | 			while ((line = PAM.readLine()) != null)  
138 | 			{
139 | 				String nt[]=line.split("\t");
140 | 				PAM_lowerScorePair.add(nt[0]+"\t"+nt[1]);
141 | 				PAM_lowerScorePair.add(nt[1]+"\t"+nt[0]);
142 | 			}
143 | 			PAM.close();
144 | 			
145 | 			/*
146 | 			 * Variant frequency
147 | 			 */
148 | 			BufferedReader frequency = new BufferedReader(new InputStreamReader(new FileInputStream("Database/rs.rank.txt"), "UTF-8"));
149 | 			line="";
150 | 			while ((line = frequency.readLine()) != null)  
151 | 			{
152 | 				String nt[]=line.split("\t");
153 | 				VariantFrequency.put(nt[1],Integer.parseInt(nt[0]));
154 | 			}
155 | 			frequency.close();
156 | 			/*
157 | 			 * HGVS nomenclature lookup - RegEx : DNAMutation
158 | 			 */
159 | 			BufferedReader RegEx_DNAMutation = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/DNAMutation.RegEx.txt"), "UTF-8"));
160 | 			line="";
161 | 			while ((line = RegEx_DNAMutation.readLine()) != null)  
162 | 			{
163 | 				RegEx_DNAMutation_STR.add(line);
164 | 			}
165 | 			RegEx_DNAMutation.close();
166 | 			
167 | 			/*
168 | 			 * HGVS nomenclature lookup - RegEx : ProteinMutation
169 | 			 */
170 | 			BufferedReader RegEx_ProteinMutation = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/ProteinMutation.RegEx.txt"), "UTF-8"));
171 | 			line="";
172 | 			while ((line = RegEx_ProteinMutation.readLine()) != null)  
173 | 			{
174 | 				RegEx_ProteinMutation_STR.add(line);
175 | 			}
176 | 			RegEx_ProteinMutation.close();
177 | 			
178 | 			/*
179 | 			 * HGVS nomenclature lookup - RegEx : SNP
180 | 			 */
181 | 			BufferedReader RegEx_SNP = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/SNP.RegEx.txt"), "UTF-8"));
182 | 			line="";
183 | 			while ((line = RegEx_SNP.readLine()) != null)  
184 | 			{
185 | 				RegEx_SNP_STR.add(line);
186 | 			}
187 | 			RegEx_SNP.close();
188 | 			
189 | 			/*
190 | 			 * Append-pattern
191 | 			 */
192 | 			BufferedReader RegEx_NL = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/MF.RegEx.2.txt"), "UTF-8"));
193 | 			while ((line = RegEx_NL.readLine()) != null)  
194 | 			{
195 | 				String RegEx[]=line.split("\t");
196 | 				MF_Pattern.add(RegEx[0]);
197 | 				MF_Type.add(RegEx[1]);
198 | 			}
199 | 			RegEx_NL.close();
200 | 			
201 | 			/*
202 | 			 * Variant type
203 | 			 */
204 | 			
205 | 			BufferedReader VariantType = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/VariantType.txt"), "UTF-8"));
206 | 			while ((line = VariantType.readLine()) != null)  
207 | 			{
208 | 				String split[]=line.split("\t");
209 | 				VariantType_hash.put(split[0],split[1]);
210 | 			}
211 | 			VariantType.close();
212 | 			
213 | 			/*
214 | 			 * nu2aa
215 | 			 */
216 | 			BufferedReader nu2aa = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/nu2aa.mapping.txt"), "UTF-8"));
217 | 			while ((line = nu2aa.readLine()) != null)  
218 | 			{
219 | 				nu2aa_hash.put(line,"");
220 | 			}
221 | 			nu2aa.close();
222 | 			
223 | 					
224 | 			//RegEx of component recognition
225 | 			Pattern_Component_1 = Pattern.compile("^([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)\\|(fs[^|]*)\\|([^|]*)$");
226 | 			Pattern_Component_2 = Pattern.compile("^([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)\\|(fs[^|]*)$");
227 | 			Pattern_Component_3 = Pattern.compile("^([^|]*)\\|([^|]*(ins|del|Del|dup|-)[^|]*)\\|([^|]*)\\|([^|]*)$");
228 | 			Pattern_Component_4 = Pattern.compile("^([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$");
229 | 			Pattern_Component_5 = Pattern.compile("^([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$");
230 | 			Pattern_Component_6 = Pattern.compile("^((\\[rs\\]|rs|RS|Rs|reference SNP no[.] )[0-9]+)$");
231 | 			
232 | 			nametothree.put("GLUTAMIC","GLU");nametothree.put("ASPARTIC","ASP");nametothree.put("THYMINE", "THY");nametothree.put("ALANINE", "ALA");nametothree.put("ARGININE", "ARG");nametothree.put("ASPARAGINE", "ASN");nametothree.put("ASPARTICACID", "ASP");nametothree.put("ASPARTATE", "ASP");nametothree.put("CYSTEINE", "CYS");nametothree.put("GLUTAMINE", "GLN");nametothree.put("GLUTAMICACID", "GLU");nametothree.put("GLUTAMATE", "GLU");nametothree.put("GLYCINE", "GLY");nametothree.put("HISTIDINE", "HIS");nametothree.put("ISOLEUCINE", "ILE");nametothree.put("LEUCINE", "LEU");nametothree.put("LYSINE", "LYS");nametothree.put("METHIONINE", "MET");nametothree.put("PHENYLALANINE", "PHE");nametothree.put("PROLINE", "PRO");nametothree.put("SERINE", "SER");nametothree.put("THREONINE", "THR");nametothree.put("TRYPTOPHAN", "TRP");nametothree.put("TYROSINE", "TYR");nametothree.put("VALINE", "VAL");nametothree.put("STOP", "XAA");nametothree.put("TERM", "XAA");nametothree.put("TERMINATION", "XAA");nametothree.put("STOP", "XAA");nametothree.put("TERM", "XAA");nametothree.put("TERMINATION", "XAA");nametothree.put("GLUTAMICCODON","GLU");nametothree.put("ASPARTICCODON","ASP");nametothree.put("THYMINECODON","THY");nametothree.put("ALANINECODON","ALA");nametothree.put("ARGININECODON","ARG");nametothree.put("ASPARAGINECODON","ASN");nametothree.put("ASPARTICACIDCODON","ASP");nametothree.put("ASPARTATECODON","ASP");nametothree.put("CYSTEINECODON","CYS");nametothree.put("GLUTAMINECODON","GLN");nametothree.put("GLUTAMICACIDCODON","GLU");nametothree.put("GLUTAMATECODON","GLU");nametothree.put("GLYCINECODON","GLY");nametothree.put("HISTIDINECODON","HIS");nametothree.put("ISOLEUCINECODON","ILE");nametothree.put("LEUCINECODON","LEU");nametothree.put("LYSINECODON","LYS");nametothree.put("METHIONINECODON","MET");nametothree.put("PHENYLALANINECODON","PHE");nametothree.put("PROLINECODON","PRO");nametothree.put("SERINECODON","SER");nametothree.put("THREONINECODON","THR");nametothree.put("TRYPTOPHANCODON","TRP");nametothree.put("TYROSINECODON","TYR");nametothree.put("VALINECODON","VAL");nametothree.put("STOPCODON","XAA");nametothree.put("TERMCODON","XAA");nametothree.put("TERMINATIONCODON","XAA");nametothree.put("STOPCODON","XAA");nametothree.put("TERMCODON","XAA");nametothree.put("TERMINATIONCODON","XAA");nametothree.put("TAG","XAA");nametothree.put("TAA","XAA");nametothree.put("UAG","XAA");nametothree.put("UAA","XAA");
233 | 			threetone.put("ALA", "A");threetone.put("ARG", "R");threetone.put("ASN", "N");threetone.put("ASP", "D");threetone.put("CYS", "C");threetone.put("GLN", "Q");threetone.put("GLU", "E");threetone.put("GLY", "G");threetone.put("HIS", "H");threetone.put("ILE", "I");threetone.put("LEU", "L");threetone.put("LYS", "K");threetone.put("MET", "M");threetone.put("PHE", "F");threetone.put("PRO", "P");threetone.put("SER", "S");threetone.put("THR", "T");threetone.put("TRP", "W");threetone.put("TYR", "Y");threetone.put("VAL", "V");threetone.put("ASX", "B");threetone.put("GLX", "Z");threetone.put("XAA", "X");threetone.put("TER", "X");
234 | 			threetone_nu.put("GCU","A");threetone_nu.put("GCC","A");threetone_nu.put("GCA","A");threetone_nu.put("GCG","A");threetone_nu.put("CGU","R");threetone_nu.put("CGC","R");threetone_nu.put("CGA","R");threetone_nu.put("CGG","R");threetone_nu.put("AGA","R");threetone_nu.put("AGG","R");threetone_nu.put("AAU","N");threetone_nu.put("AAC","N");threetone_nu.put("GAU","D");threetone_nu.put("GAC","D");threetone_nu.put("UGU","C");threetone_nu.put("UGC","C");threetone_nu.put("GAA","E");threetone_nu.put("GAG","E");threetone_nu.put("CAA","Q");threetone_nu.put("CAG","Q");threetone_nu.put("GGU","G");threetone_nu.put("GGC","G");threetone_nu.put("GGA","G");threetone_nu.put("GGG","G");threetone_nu.put("CAU","H");threetone_nu.put("CAC","H");threetone_nu.put("AUU","I");threetone_nu.put("AUC","I");threetone_nu.put("AUA","I");threetone_nu.put("CUU","L");threetone_nu.put("CUC","L");threetone_nu.put("CUA","L");threetone_nu.put("CUG","L");threetone_nu.put("UUA","L");threetone_nu.put("UUG","L");threetone_nu.put("AAA","K");threetone_nu.put("AAG","K");threetone_nu.put("AUG","M");threetone_nu.put("UUU","F");threetone_nu.put("UUC","F");threetone_nu.put("CCU","P");threetone_nu.put("CCC","P");threetone_nu.put("CCA","P");threetone_nu.put("CCG","P");threetone_nu.put("UCU","S");threetone_nu.put("UCC","S");threetone_nu.put("UCA","S");threetone_nu.put("UCG","S");threetone_nu.put("AGU","S");threetone_nu.put("AGC","S");threetone_nu.put("ACU","T");threetone_nu.put("ACC","T");threetone_nu.put("ACA","T");threetone_nu.put("ACG","T");threetone_nu.put("UGG","W");threetone_nu.put("UAU","Y");threetone_nu.put("UAC","Y");threetone_nu.put("GUU","V");threetone_nu.put("GUC","V");threetone_nu.put("GUA","V");threetone_nu.put("GUG","V");threetone_nu.put("UAA","X");threetone_nu.put("UGA","X");threetone_nu.put("UAG","X");threetone_nu.put("GCT","A");threetone_nu.put("GCC","A");threetone_nu.put("GCA","A");threetone_nu.put("GCG","A");threetone_nu.put("CGT","R");threetone_nu.put("CGC","R");threetone_nu.put("CGA","R");threetone_nu.put("CGG","R");threetone_nu.put("AGA","R");threetone_nu.put("AGG","R");threetone_nu.put("AAT","N");threetone_nu.put("AAC","N");threetone_nu.put("GAT","D");threetone_nu.put("GAC","D");threetone_nu.put("TGT","C");threetone_nu.put("TGC","C");threetone_nu.put("GAA","E");threetone_nu.put("GAG","E");threetone_nu.put("CAA","Q");threetone_nu.put("CAG","Q");threetone_nu.put("GGT","G");threetone_nu.put("GGC","G");threetone_nu.put("GGA","G");threetone_nu.put("GGG","G");threetone_nu.put("CAT","H");threetone_nu.put("CAC","H");threetone_nu.put("ATT","I");threetone_nu.put("ATC","I");threetone_nu.put("ATA","I");threetone_nu.put("CTT","L");threetone_nu.put("CTC","L");threetone_nu.put("CTA","L");threetone_nu.put("CTG","L");threetone_nu.put("TTA","L");threetone_nu.put("TTG","L");threetone_nu.put("AAA","K");threetone_nu.put("AAG","K");threetone_nu.put("ATG","M");threetone_nu.put("TTT","F");threetone_nu.put("TTC","F");threetone_nu.put("CCT","P");threetone_nu.put("CCC","P");threetone_nu.put("CCA","P");threetone_nu.put("CCG","P");threetone_nu.put("TCT","S");threetone_nu.put("TCC","S");threetone_nu.put("TCA","S");threetone_nu.put("TCG","S");threetone_nu.put("AGT","S");threetone_nu.put("AGC","S");threetone_nu.put("ACT","T");threetone_nu.put("ACC","T");threetone_nu.put("ACA","T");threetone_nu.put("ACG","T");threetone_nu.put("TGG","W");threetone_nu.put("TAT","Y");threetone_nu.put("TAC","Y");threetone_nu.put("GTT","V");threetone_nu.put("GTC","V");threetone_nu.put("GTA","V");threetone_nu.put("GTG","V");threetone_nu.put("TAA","X");threetone_nu.put("TGA","X");threetone_nu.put("TAG","X");
235 | 			NTtoATCG.put("ADENINE", "A");NTtoATCG.put("CYTOSINE", "C");NTtoATCG.put("GUANINE", "G");NTtoATCG.put("URACIL", "U");NTtoATCG.put("THYMINE", "T");NTtoATCG.put("ADENOSINE", "A");NTtoATCG.put("CYTIDINE", "C");NTtoATCG.put("THYMIDINE", "T");NTtoATCG.put("GUANOSINE", "G");NTtoATCG.put("URIDINE", "U");
236 | 
237 | 			Number_word2digit.put("ZERO","0");Number_word2digit.put("SINGLE","1");Number_word2digit.put("ONE","1");Number_word2digit.put("TWO","2");Number_word2digit.put("THREE","3");Number_word2digit.put("FOUR","4");Number_word2digit.put("FIVE","5");Number_word2digit.put("SIX","6");Number_word2digit.put("SEVEN","7");Number_word2digit.put("EIGHT","8");Number_word2digit.put("NINE","9");Number_word2digit.put("TEN","10");
238 | 			
239 | 			//Filtering
240 | 			BufferedReader filterfile = new BufferedReader(new InputStreamReader(new FileInputStream("lib/filtering.txt"), "UTF-8"));
241 | 			while ((line = filterfile.readLine()) != null)  
242 | 			{
243 | 				filteringStr_hash.put(line, "");
244 | 			}
245 | 			filterfile.close();
246 | 			
247 | 			/*one2three*/
248 | 			one2three.put("A", "Ala");
249 | 			one2three.put("R", "Arg");
250 | 			one2three.put("N", "Asn");
251 | 			one2three.put("D", "Asp");
252 | 			one2three.put("C", "Cys");
253 | 			one2three.put("Q", "Gln");
254 | 			one2three.put("E", "Glu");
255 | 			one2three.put("G", "Gly");
256 | 			one2three.put("H", "His");
257 | 			one2three.put("I", "Ile");
258 | 			one2three.put("L", "Leu");
259 | 			one2three.put("K", "Lys");
260 | 			one2three.put("M", "Met");
261 | 			one2three.put("F", "Phe");
262 | 			one2three.put("P", "Pro");
263 | 			one2three.put("S", "Ser");
264 | 			one2three.put("T", "Thr");
265 | 			one2three.put("W", "Trp");
266 | 			one2three.put("Y", "Tyr");
267 | 			one2three.put("V", "Val");
268 | 			one2three.put("B", "Asx");
269 | 			one2three.put("Z", "Glx");
270 | 			one2three.put("X", "Xaa");
271 | 			one2three.put("X", "Ter");
272 | 			
273 | 			/*RS_DNA_Protein.txt - Pattern : PP[P]+[ ]*[\(\[][ ]*DD[D]+[ ]*[\)\]][ ]*[\(\[][ ]*SS[S]+[ ]*[\)\]]*/
274 | 			BufferedReader inputfile = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/RS_DNA_Protein.txt"), "UTF-8"));
275 | 			while ((line = inputfile.readLine()) != null)  
276 | 			{
277 | 				if(!line.equals(""))
278 | 				{
279 | 					RS_DNA_Protein.add(line);
280 | 				}
281 | 			}
282 | 			inputfile.close();
283 | 			
284 | 			/*PT_GeneVariantMention - BRAFV600E*/
285 | 			PT_GeneVariantMention.TreeFile2Tree("lib/PT_GeneVariantMention.txt");
286 | 				
287 | 			/*Mutation_RS_Geneid.txt - the patterns retrieved from PubMed result*/
288 | 			inputfile = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RegEx/Mutation_RS_Geneid.txt"), "UTF-8"));
289 | 			while ((line = inputfile.readLine()) != null)  
290 | 			{
291 | 				//Pattern	c|SUB|C|1749|T	2071558	269
292 | 				Pattern pat = Pattern.compile("^(Pattern|Recognized|ManuallyAdded)	([^\t]+)	([0-9]+)	([0-9,]+)$");
293 | 				Matcher mat = pat.matcher(line);
294 | 				if (mat.find())
295 | 	        	{
296 | 					String geneids[]=mat.group(4).split(",");
297 | 					for(int i=0;i<geneids.length;i++)
298 | 					{
299 | 						//mutation id | geneid --> rs#
300 | 						Mutation_RS_Geneid_hash.put(mat.group(2)+"\t"+geneids[i], mat.group(3));
301 | 					}
302 | 	        	}
303 | 			}
304 | 			inputfile.close();
305 | 			/** tmVarForm2RSID2Freq - together with Mutation_RS_Geneid.txt (the patterns retrieved from PubMed result) */
306 | 			BufferedReader tmVarForm2RSID2Freq = new BufferedReader(new InputStreamReader(new FileInputStream("lib/tmVarForm2RSID2Freq.txt"), "UTF-8"));
307 | 			line="";
308 | 			while ((line = tmVarForm2RSID2Freq.readLine()) != null)  
309 | 			{
310 | 				String nt[]=line.split("\t");
311 | 				String tmVarForm=nt[0];
312 | 				String rs_gene_freq=nt[1];
313 | 				//RS:rs121913377|Gene:673|Freq:3
314 | 				Pattern pat = Pattern.compile("^RS:rs([0-9]+)\\|Gene:([0-9,]+)\\|Freq:([0-9]+)$");
315 | 				Matcher mat = pat.matcher(rs_gene_freq);
316 | 				if (mat.find())
317 | 	        	{
318 | 					String rs=mat.group(1);
319 | 					String gene=mat.group(2);
320 | 					Mutation_RS_Geneid_hash.put(tmVarForm+"\t"+gene,rs);
321 | 	        	}
322 | 			}
323 | 			tmVarForm2RSID2Freq.close();
324 | 
325 | 			/** RS2Frequency - rs# to its frequency in PTC */
326 | 			BufferedReader RS2Frequency = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RS2Frequency.txt"), "UTF-8"));
327 | 			line="";
328 | 			while ((line = RS2Frequency.readLine()) != null)  
329 | 			{
330 | 				String nt[]=line.split("\t");
331 | 				String rs=nt[0];
332 | 				int freq=Integer.parseInt(nt[1]);
333 | 				RS2Frequency_hash.put(rs,freq);
334 | 			}
335 | 			RS2Frequency.close();
336 | 			
337 | 			/** Homoid2HumanGene_hash **/
338 | 			HashMap<String,String> Homoid2HumanGene_hash = new HashMap<String,String>();
339 | 			BufferedReader Homoid2HumanGene = new BufferedReader(new InputStreamReader(new FileInputStream("Database/Homoid2HumanGene.txt"), "UTF-8"));
340 | 			line="";
341 | 			while ((line = Homoid2HumanGene.readLine()) != null)  
342 | 			{
343 | 				String nt[]=line.split("\t");
344 | 				String homoid=nt[0];
345 | 				String humangeneid=nt[1];
346 | 				Homoid2HumanGene_hash.put(homoid,humangeneid);
347 | 			}
348 | 			Homoid2HumanGene.close();
349 | 			
350 | 			/** Gene2Homoid.txt **/
351 | 			BufferedReader Gene2Homoid = new BufferedReader(new InputStreamReader(new FileInputStream("Database/Gene2Homoid.txt"), "UTF-8"));
352 | 			line="";
353 | 			while ((line = Gene2Homoid.readLine()) != null)  
354 | 			{
355 | 				String nt[]=line.split("\t");
356 | 				String geneid=nt[0];
357 | 				String homoid=nt[1];
358 | 				if(Homoid2HumanGene_hash.containsKey(homoid))
359 | 				{
360 | 					if(!geneid.equals(Homoid2HumanGene_hash.get(homoid)))
361 | 					{
362 | 						Gene2HumanGene_hash.put(geneid,Homoid2HumanGene_hash.get(homoid));
363 | 					}
364 | 				}
365 | 			}
366 | 			Gene2Homoid.close();
367 | 			
368 | 			/** Variant2MostCorrespondingGene **/
369 | 			BufferedReader Variant2MostCorrespondingGene = new BufferedReader(new InputStreamReader(new FileInputStream("Database/var2gene.txt"), "UTF-8"));
370 | 			line="";
371 | 			while ((line = Variant2MostCorrespondingGene.readLine()) != null)  
372 | 			{
373 | 				String nt[]=line.split("\t"); //4524	1801133 C677T
374 | 				String geneid=nt[0];
375 | 				String rsid=nt[1];
376 | 				String var=nt[2].toLowerCase();
377 | 				Variant2MostCorrespondingGene_hash.put(var,geneid+"\t"+rsid);
378 | 			}
379 | 			Variant2MostCorrespondingGene.close();
380 | 			
381 | 			BufferedReader RSandPosition2Seq = new BufferedReader(new InputStreamReader(new FileInputStream("lib/RS2tmVarForm.txt"), "UTF-8"));
382 | 			line="";
383 | 			while ((line = RSandPosition2Seq.readLine()) != null)  
384 | 			{
385 | 				String nt[]=line.split("\t"); //121908752	c	SUB	T	617	G
386 | 				if(nt.length>3)
387 | 				{
388 | 					String rs=nt[0];
389 | 					String seq=nt[1];
390 | 					String P=nt[4];
391 | 					RSandPosition2Seq_hash.put(rs+"\t"+P,seq);
392 | 				}
393 | 			}
394 | 			RSandPosition2Seq.close();
395 | 			
396 | 		}
397 | 		
398 | 		File folder = new File(InputFolder);
399 | 		File[] listOfFiles = folder.listFiles();
400 | 		for (int i = 0; i < listOfFiles.length; i++)
401 | 		{
402 | 			if (listOfFiles[i].isFile()) 
403 | 			{
404 | 				String InputFile = listOfFiles[i].getName();
405 | 				
406 | 				File f = new File(OutputFolder+"/"+InputFile+".PubTator");
407 | 				File f_BioC = new File(OutputFolder+"/"+InputFile+".BioC.XML");
408 | 
409 | 				if(f.exists() && !f.isDirectory()) 
410 | 				{ 
411 | 					System.out.println(InputFolder+"/"+InputFile+" - Done. (The output file (PubTator) exists in output folder)");
412 | 				}
413 | 				else if(f_BioC.exists() && !f_BioC.isDirectory()) 
414 | 				{ 
415 | 					System.out.println(InputFolder+"/"+InputFile+" - Done. (The output file (BioC) exists in output folder)");
416 | 				}
417 | 				else
418 | 				{
419 | 					/*
420 | 					 * Mention recognition by CRF++
421 | 					 */
422 | 					if(TrainTest.equals("Test") || TrainTest.equals("Test_FullText") || TrainTest.equals("Train"))
423 | 					{
424 | 						/*
425 | 						 * Format Check 
426 | 						 */
427 | 						String Format = "";
428 | 						String checkR=BC.BioCFormatCheck(InputFolder+"/"+InputFile);
429 | 						if(checkR.equals("BioC"))
430 | 						{
431 | 							Format = "BioC";
432 | 						}
433 | 						else if(checkR.equals("PubTator"))
434 | 						{
435 | 							Format = "PubTator";
436 | 						}
437 | 						else
438 | 						{
439 | 							System.out.println(checkR);
440 | 							System.exit(1); // non-zero status: the input format was not recognized
441 | 						}
442 | 						
443 | 						System.out.print(InputFolder+"/"+InputFile+" - ("+Format+" format) : Processing ... \r");
444 | 						 
445 | 						/*
446 | 						 * Pre-processing
447 | 						 */
448 | 						MentionRecognition MR= new MentionRecognition();
449 | 						if(Format.equals("BioC"))
450 | 						{
451 | 							BC.BioC2PubTator(InputFolder+"/"+InputFile,"tmp/"+InputFile);
452 | 							MR.FeatureExtraction("tmp/"+InputFile,"tmp/"+InputFile+".data","tmp/"+InputFile+".location",TrainTest);
453 | 						}
454 | 						else if(Format.equals("PubTator"))
455 | 						{
456 | 							MR.FeatureExtraction(InputFolder+"/"+InputFile,"tmp/"+InputFile+".data","tmp/"+InputFile+".location",TrainTest);
457 | 						}
458 | 						if(TrainTest.equals("Test") || TrainTest.equals("Test_FullText"))
459 | 						{
460 | 							MR.CRF_test("tmp/"+InputFile+".data","tmp/"+InputFile+".output",TrainTest);
461 | 						}
462 | 						
463 | 						/*
464 | 						 * CRF++ output --> PubTator
465 | 						 */
466 | 						PostProcessing PP = new PostProcessing();
467 | 						{
468 | 							if(Format.equals("BioC"))
469 | 							{
470 | 								PP.toME("tmp/"+InputFile,"tmp/"+InputFile+".output","tmp/"+InputFile+".location","tmp/"+InputFile+".ME");
471 | 								PP.toPostME("tmp/"+InputFile+".ME","tmp/"+InputFile+".PostME");
472 | 								PP.toPostMEData("tmp/"+InputFile,"tmp/"+InputFile+".PostME","tmp/"+InputFile+".PostME.ml","tmp/"+InputFile+".PostME.data",TrainTest);
473 | 							}
474 | 							else if(Format.equals("PubTator"))
475 | 							{
476 | 								PP.toME(InputFolder+"/"+InputFile,"tmp/"+InputFile+".output","tmp/"+InputFile+".location","tmp/"+InputFile+".ME");
477 | 								PP.toPostME("tmp/"+InputFile+".ME","tmp/"+InputFile+".PostME");
478 | 								PP.toPostMEData(InputFolder+"/"+InputFile,"tmp/"+InputFile+".PostME","tmp/"+InputFile+".PostME.ml","tmp/"+InputFile+".PostME.data",TrainTest);
479 | 							}
480 | 							if(TrainTest.equals("Test") || TrainTest.equals("Test_FullText"))
481 | 							{
482 | 								PP.toPostMEoutput("tmp/"+InputFile+".PostME.data","tmp/"+InputFile+".PostME.output");
483 | 							}
484 | 							
485 | 							else if(TrainTest.equals("Train"))
486 | 							{
487 | 								PP.toPostMEModel("tmp/"+InputFile+".PostME.data");
488 | 							}
489 | 							
490 | 							
491 | 							/*
492 | 							 * Post-processing
493 | 							 */
494 | 							if(TrainTest.equals("Test") || TrainTest.equals("Test_FullText"))
495 | 							{
496 | 								GeneMention = true;
497 | 								if(GeneMention == true) // MentionRecognition detect Gene mentions
498 | 								{
499 | 									PP.output2PubTator("tmp/"+InputFile+".PostME.ml","tmp/"+InputFile+".PostME.output","tmp/"+InputFile+".PostME","tmp/"+InputFile+".PubTator");
500 | 									
501 | 									if(Format.equals("BioC"))
502 | 									{
503 | 										PP.Normalization("tmp/"+InputFile,"tmp/"+InputFile+".PubTator",OutputFolder+"/"+InputFile+".PubTator",DisplayRSnumOnly,HideMultipleResult,DisplayChromosome,DisplayRefSeq,DisplayGenomicRegion);
504 | 									}
505 | 									else if(Format.equals("PubTator"))
506 | 									{
507 | 										PP.Normalization(InputFolder+"/"+InputFile,"tmp/"+InputFile+".PubTator",OutputFolder+"/"+InputFile+".PubTator",DisplayRSnumOnly,HideMultipleResult,DisplayChromosome,DisplayRefSeq,DisplayGenomicRegion);
508 | 									}
509 | 								}
510 | 								else
511 | 								{
512 | 									PP.output2PubTator("tmp/"+InputFile+".PostME.ml","tmp/"+InputFile+".PostME.output","tmp/"+InputFile+".PostME",OutputFolder+"/"+InputFile+".PubTator");
513 | 								}
514 | 								
515 | 								if(Format.equals("BioC"))
516 | 								{
517 | 									BC.PubTator2BioC_AppendAnnotation(OutputFolder+"/"+InputFile+".PubTator",InputFolder+"/"+InputFile,OutputFolder+"/"+InputFile+".BioC.XML");
518 | 								}			
519 | 							}
520 | 						}
521 | 						
522 | 						/*
523 | 						 * Time stamp - last
524 | 						 */
525 | 						endTime = System.currentTimeMillis();//ending time
526 | 						totTime = endTime - startTime;
527 | 						System.out.println(InputFolder+"/"+InputFile+" - ("+Format+" format) : Processing Time:"+totTime/1000+"sec");
528 | 						
529 | 						/*
530 | 						 * remove tmp files
531 | 						 */
532 | 						if(DeleteTmp.toLowerCase().equals("true"))
533 | 						{
534 | 							String path="tmp"; 
535 | 					        File file = new File(path);
536 | 					        File[] files = file.listFiles(); 
537 | 					        for (File ftmp:files) 
538 | 					        {
539 | 					        	if (ftmp.isFile() && ftmp.exists()) 
540 | 					            {
541 | 					        		if(ftmp.getName().startsWith(InputFile)) // plain prefix test; the file name is no longer interpreted as a regex
542 | 						        	{
543 | 					        			ftmp.delete();
544 | 						        	}
545 | 					        	}
546 | 					        }
547 | 						}
548 | 					}
549 | 					else if(TrainTest.equals("Train_Mention"))
550 | 					{
551 | 						System.out.print(InputFolder+"/"+InputFile+" - Processing ... \r");
552 | 						 
553 | 						PostProcessing PP = new PostProcessing();
554 | 						PP.toPostMEData(InputFolder+"/"+InputFile,"tmp/"+InputFile+".PostME","tmp/"+InputFile+".PostME.ml","tmp/"+InputFile+".PostME.data","Train");
555 | 						
556 | 						/*
557 | 						 * Time stamp - last
558 | 						 */
559 | 						endTime = System.currentTimeMillis();//ending time
560 | 						totTime = endTime - startTime;
561 | 						System.out.println(InputFolder+"/"+InputFile+" - Processing Time:"+totTime/1000+"sec");
562 | 					}
563 | 				}
564 | 			}
565 | 		}
566 | 	}
567 | }
568 | 
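
The lookup tables loaded in tmVar.java (for example Database/Homoid2HumanGene.txt) are all read with the same loop over a two-column, tab-separated file. As a minimal sketch, assuming that layout, the loop could be factored into one helper that closes the reader with try-with-resources and skips lines with fewer than two columns. The names `TsvMaps` and `loadTwoColumn` are hypothetical and do not exist in tmVar.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of tmVar: loads "key<TAB>value" lines into a map.
final class TsvMaps {
    static Map<String, String> loadTwoColumn(Reader source) throws IOException {
        Map<String, String> map = new HashMap<>();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] columns = line.split("\t");
                if (columns.length < 2) {
                    continue; // skip blank or malformed lines instead of throwing
                }
                map.put(columns[0], columns[1]);
            }
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        // In-memory input stands in for a real file in this sketch.
        Map<String, String> map = loadTwoColumn(new StringReader("h1\t100\n\nh2\t200\n"));
        System.out.println(map.get("h1") + " " + map.get("h2")); // prints "100 200"
    }
}
```

To read an actual file, the caller could pass `new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8)` as the `Reader`, which matches the UTF-8 decoding used above.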


--------------------------------------------------------------------------------
/tmBioC.key:
--------------------------------------------------------------------------------
 1 | PubTator.key
 2 | 
 3 | A BioC format for PubTator and other NER tools (i.e., tmChem, DNorm, tmVar, SR4GN or GenNorm) developed by the Biomedical Text Mining group at NCBI.
 4 | The goal of this collection is to provide easy access to the text and bio-concept annotations for PMC articles. 
 5 | 
 6 | 	collection:  A group of PubMed documents, each organized into title, abstract and other passages
 7 | 
 8 | 	source:  PubMed, PubMed Central, etc. 
 9 | 
10 | 	date:  Document download date
11 | 
12 | 	document:  abstract, full-text article, free-text document, etc.
13 | 	
14 | 	id:  PubMed ID (or other ID in a given collection) of the document 
15 | 
16 | 	passage:  Title, abstract and other passages 
17 | 
18 | 		infon["type"]:  "title", "abstract" and other passages
19 | 
20 | 		offset: The title has an offset of zero; each later passage (e.g., the abstract) is assumed to begin one space after the end of the previous passage
21 | 		
22 | 		text: Text of the passage 
23 | 
24 | 		annotation:  One bio-concept of the passage as determined by tmChem, DNorm, tmVar, SR4GN or GenNorm
25 | 				
26 | 			infon["type"]:  The type of bioconcept, e.g. "Gene", "Species", "Disease", "Chemical" or "Mutation"		
27 | 	
28 | 			infon["MeSH"]:  The bio-concept identifier in MeSH as detected by DNorm or tmChem
29 | 			
30 | 			infon["OMIM"]:  The bio-concept identifier in OMIM as detected by DNorm
31 | 			
32 | 			infon["NCBI_Gene"]:  The bio-concept identifier in NCBI Gene as detected by GenNorm
33 | 			
34 | 			infon["NCBI_Taxonomy"]:  The bio-concept identifier in NCBI Taxonomy as detected by SR4GN
35 | 			
36 | 			infon["ChEBI"]:  The bio-concept identifier in ChEBI as detected by tmChem
37 | 			
38 | 			infon["tmVar"]:  The intelligent key generated artificially for the mention detected by tmVar (<Sequence type>|<Mutation type>|<Wild type>|<Mutation position>|<Mutant>)
39 | 			
40 | 			location: Location of the mention, given as the global document "offset" at which the bio-concept begins and the "length" of the mention
41 | 
42 | 			text: Mention of the bio-concept
43 | 
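
The passage offset rule in tmBioC.key (the title starts at zero, and each later passage begins after the previous passages and one space) can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and the sketch counts Java `char` units, which is an assumption because the key file does not state the unit of an offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the passage offset rule; not part of tmVar.
final class PassageOffsets {
    /** Start offset of each passage: 0 for the first, then previous end plus one space. */
    static List<Integer> offsets(List<String> passages) {
        List<Integer> result = new ArrayList<>();
        int next = 0;
        for (String text : passages) {
            result.add(next);
            // Assumption: length is measured in Java chars (UTF-16 code units).
            next += text.length() + 1; // passage text plus the single separating space
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> passages = Arrays.asList("A title.", "An abstract.");
        System.out.println(offsets(passages)); // prints [0, 9]
    }
}
```

Under the same assumption, the global "offset" in an annotation's location would be the passage offset plus the mention's position inside the passage text.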


--------------------------------------------------------------------------------