├── BioC.dtd
├── CHANGES.txt
├── LICENSE.txt
├── README.txt
├── src
    ├── bioc
    │   ├── __init__.py
    │   ├── bioc_annotation.py
    │   ├── bioc_collection.py
    │   ├── bioc_document.py
    │   ├── bioc_location.py
    │   ├── bioc_node.py
    │   ├── bioc_passage.py
    │   ├── bioc_reader.py
    │   ├── bioc_relation.py
    │   ├── bioc_sentence.py
    │   ├── bioc_writer.py
    │   ├── compat
    │   │   ├── __init__.py
    │   │   └── _py2_next.py
    │   └── meta
    │       ├── __init__.py
    │       ├── _bioc_meta.py
    │       └── _iter.py
    ├── stemmer.py
    └── test_read+write.py
└── test_input
    ├── PMID-8557975-simplified-sentences-tokens.xml
    ├── PMID-8557975-simplified-sentences.xml
    ├── everything-sentence.xml
    ├── everything.xml
    └── example_input.xml


/BioC.dtd:
--------------------------------------------------------------------------------
  1 | <!-- BioC.dtd -->
  2 | 
  3 | <!--
  4 | 
  5 |     BioC is designed to allow programs that process text and
  6 |     annotations on that text to easily share data and work
  7 |     together. This DTD describes how that data is represented in XML
  8 |     files.
  9 | 
 10 |     Some believe XML is easily read by humans and that should be
 11 |     supported by clearly formatting the elements. In the long run,
 12 |     this is distracting. While the only meaningful spaces are in text
 13 |     elements and the other spaces can be ignored, current tools add no
 14 |     additional space.  Formatters and editors may be used to make the
 15 |     XML file appear more readable.
 16 | 
 17 |     The possible variety of annotations that one might want to produce
 18 |     or use is nearly countless. There is no guarantee that these are
 19 |     organized in the nice nested structure required for XML
 20 |     elements. Even if they were, it would be nice to easily ignore
 21 |     unwanted annotations.  So annotations are recorded in a stand off
 22 |     manner, external to the annotated text. The exceptions are
 23 |     passages and sentences because of their fundamental place in text.
 24 | 
 25 |     The text is expected to be encoded in Unicode, specifically
 26 |     UTF-8. This is one of the encodings required to be implemented by
 27 |     XML tools, is portable between big-endian and little-endian
 28 |     machines and is a superset of 7-bit ASCII. Code points beyond 127
 29 |     may be expressed directly in UTF-8 or indirectly using numeric
 30 |     entities.  Since many tools today still only directly process
 31 |     ASCII characters, conversion should be available and
 32 |     standardized.  Offsets should be in 8 bit code units (bytes) for
 33 |     easier processing by naive programs.
 34 | 
 35 |     collection:  Group of documents, usually from a larger corpus. If
 36 |     a group of documents is from several corpora, use several
 37 |     collections.
 38 | 
 39 |     source:  Name of the source corpus from which the documents were selected
 40 | 
 41 |     date:  Date documents extracted from original source. Can be as
 42 |     simple as yyyymmdd or an ISO timestamp.
 43 | 
 44 |     key: Separate file describing the infons used and any other useful
 45 |     information about the data in the file. For example, if a file
 46 |     includes part-of-speech tags, this file should describe the set of
 47 |     part-of-speech tags used.
 48 | 
 49 |     infon: key-value pairs. Can record essentially arbitrary
 50 |     information. "type" will be a particular common key in the major
 51 |     sub elements below. For PubMed references, passage "type" might
 52 |     signal "title" or "abstract". For annotations, it might indicate
 53 |     "noun phrase", "gene", or "disease". In the programming language
 54 |     data structures, infons are typically represented as a map from a
 55 |     string to a string.  This means keys should be unique within each
 56 |     parent element.
 57 | 
 58 |     document: A document in the collection. A single, complete
 59 |     stand-alone document as described by its parent source.
 60 | 
 61 |     id:  Typically, the id of the document in the parent
 62 |     source. Should at least be unique in the collection.
 63 | 
 64 |     passage: One portion of the document.  In the sample collection of
 65 |     PubMed documents, each document has a title and frequently an
 66 |     abstract. Structured abstracts could have additional passages. For
 67 |     a full text document, passages could be sections such as
 68 |     Introduction, Materials and Methods, or Conclusion. Another option
 69 |     would be paragraphs. Passages impose a linear structure on the
 70 |     document. Further structure in the document can be described by
 71 |     infon values.
 72 | 
 73 |     offset: Where the passage occurs in the parent document. Depending
 74 |     on the source corpus, this might be a very relevant number.  They
 75 |     should be sequential and identify a passage's position in the
 76 |     document.  Since the sample PubMed collection is extracted from an
 77 |     XML file, literal offsets have little value. The title is given an
 78 |     offset of zero, while the abstract is assumed to begin after the
 79 |     title and one space.
 80 | 
 81 |     text: The original text of the passage.
 82 | 
 83 |     sentence:  One sentence of the passage.
 84 | 
 85 |     offset: A document offset to where the sentence begins in the
 86 |     passage. This value is the sum of the passage offset and the local
 87 |     offset within the passage.
 88 | 
 89 |     text: The original text of the sentence.
 90 | 
 91 |     annotation:  Stand-off annotation
 92 | 
 93 |     id: Used to refer to this annotation in relations. Should be
 94 |     unique at whatever level relations appear. If relations appear
 95 |     at the sentence level, annotation ids need to be unique within
 96 |     each sentence. Similarly, if relations appear at the passage
 97 |     level, annotation ids need to be unique within each passage.
 98 | 
 99 |     location: Location of the annotated text. Multiple locations
100 |     indicate a multi-span annotation.
101 | 
102 |     offset: Document offset to where the annotated text begins in
103 |     the passage or sentence. The value is the sum of the passage or
104 |     sentence offset and the local offset within the passage or
105 |     sentence.
106 | 
107 |     length: Length of the annotated text. While unlikely, this could
108 |     be zero to describe an annotation that belongs between two
109 |     characters.
110 | 
111 |     text:  Typically the annotated text.
112 | 
113 |     relation: Relation between multiple annotations and / or other
114 |     relations. Relations are allowed to appear at several levels
115 |     (document, passage, and sentence). Typically they will all appear
116 |     at one level, the level at which they are determined.
117 |     Significantly different types of relations might appear at
118 |     different levels.
119 | 
120 |     id: Used to refer to this relation in other relations. This id
121 |     needs to be unique at whatever level relations appear. (See
122 |     discussion of annotation ids.)
123 | 
124 |     refid: Id of an annotation or another relation.
125 | 
126 |     role: Describes how the referenced annotation or other relation
127 |     participates in the current relation. Has a default value so it
128 |     can be left out if there is no meaningful value.
129 | 
130 | -->
131 | 
132 | <!ELEMENT collection ( source, date, key, infon*, document+ ) >
133 | <!ELEMENT source (#PCDATA)>
134 | <!ELEMENT date (#PCDATA)>
135 | <!ELEMENT key (#PCDATA)>
136 | <!ELEMENT infon (#PCDATA)>
137 | <!ATTLIST infon key CDATA #REQUIRED >
138 | 
139 | <!ELEMENT document ( id, infon*, passage+, relation* ) >
140 | <!ELEMENT id (#PCDATA)>
141 | 
142 | <!ELEMENT passage ( infon*, offset, ( ( text?, annotation* ) | sentence* ), relation* ) >
143 | <!ELEMENT offset (#PCDATA)>
144 | <!ELEMENT text (#PCDATA)>
145 | 
146 | <!ELEMENT sentence ( infon*, offset, text?, annotation*, relation* ) >
147 | 
148 | <!ELEMENT annotation ( infon*, location*, text ) >
149 | <!ATTLIST annotation id CDATA #IMPLIED >
150 | <!ELEMENT location EMPTY>
151 | <!ATTLIST location offset CDATA #REQUIRED >
152 | <!ATTLIST location length CDATA #REQUIRED >
153 | 
154 | <!ELEMENT relation ( infon*, node* ) >
155 | <!ATTLIST relation id CDATA #IMPLIED >
156 | <!ELEMENT node EMPTY>
157 | <!ATTLIST node refid CDATA #REQUIRED >
158 | <!ATTLIST node role CDATA "" >
159 | 
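The DTD comment asks for offsets in UTF-8 bytes and defines a document offset as the passage offset plus the local offset. The following is a minimal Python 3 sketch of that arithmetic, not part of PyBioC; the helper names and the sample title are assumptions made up for illustration.

```python
# Sketch only: helper names and sample text are hypothetical.

def byte_length(fragment):
    """Length of a text fragment in UTF-8 bytes (8-bit code units)."""
    return len(fragment.encode("utf-8"))


def byte_offset(text, char_index):
    """UTF-8 byte offset of the character at char_index within text."""
    return byte_length(text[:char_index])


# "\u03ba" (Greek kappa) occupies two bytes in UTF-8, so character
# indices and byte offsets diverge after it.
title = "I\u03baB kinase"
passage_offset = 0                          # the title passage starts at zero

char_index = title.index("kinase")          # 4 characters precede "kinase"
local_offset = byte_offset(title, char_index)
document_offset = passage_offset + local_offset
location = (document_offset, byte_length("kinase"))

# The abstract is assumed to begin after the title and one space.
abstract_offset = passage_offset + byte_length(title) + 1
```

A program that slices Python strings by these offsets must first encode the text to bytes, since string indices count characters rather than bytes.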


--------------------------------------------------------------------------------
/CHANGES.txt:
--------------------------------------------------------------------------------
 1 | 1.01
 2 | ----
 3 | Fix invalid handling of id attributes for annotation and relation tags.
 4 | (Thanks to Tilia Ellendorff <ellendorff@ifi.uzh.ch> for pointing it out.)
 5 | 
 6 | 1.00
 7 | ----
 8 | PyBioC library reaches usable state. The classes BioCReader and BioCWriter can 
 9 | be used to read in and write out PyBioC XML data, respectively.
10 | 
11 | 0.1
12 | ---
13 | Initial (incomplete) version created in analogy to BioC_Java_1.0.1.
14 | 


--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
 1 | Copyright (c) 2013, the OntoGene project at the University of Zurich (UZH).
 2 | All rights reserved.
 3 | 
 4 | Redistribution and use in source and binary forms, with or without
 5 | modification, are permitted provided that the following conditions are met: 
 6 | 
 7 | 1. Redistributions of source code must retain the above copyright notice, this
 8 |    list of conditions and the following disclaimer. 
 9 | 2. Redistributions in binary form must reproduce the above copyright notice,
10 |    this list of conditions and the following disclaimer in the documentation
11 |    and/or other materials provided with the distribution. 
12 | 
13 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
14 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
15 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
16 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
17 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
18 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
19 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
20 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
21 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
22 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
23 | 
24 | The views and conclusions contained in the software and documentation are those
25 | of the authors and should not be interpreted as representing official policies, 
26 | either expressed or implied, of the OntoGene project or the University of Zurich
27 | (UZH).
28 | 


--------------------------------------------------------------------------------
/README.txt:
--------------------------------------------------------------------------------
 1 | PyBioC is a native Python library to deal with BioCreative XML data,
 2 | i.e. to read from and to write to it.
 3 | 
 4 | Usage:
 5 | ------
 6 | Two example programs, test_read+write.py and stemmer.py, are shipped in the
 7 | src/ folder.
 8 | 
 9 | test_read+write.py shows the very basic reading and writing capability of the
10 | library.
11 | 
12 | stemmer.py uses the Python Natural Language Toolkit (NLTK) library to
13 | manipulate a BioC XML file that has been read in: it tokenizes the
14 | text, stems the tokens, and transforms the modified PyBioC objects
15 | back to valid BioC XML format.
16 | 


--------------------------------------------------------------------------------
/src/bioc/__init__.py:
--------------------------------------------------------------------------------
 1 | #
 2 | # Package for interoperability in BioCreative work
 3 | #
 4 | 
 5 | __version__ = '1.02'
 6 | 
 7 | __all__ = [
 8 |     'BioCAnnotation', 'BioCCollection', 'BioCDocument',
 9 |     'BioCLocation', 'BioCNode', 'BioCPassage', 'BioCRelation',
10 |     'BioCSentence', 'BioCReader', 'BioCWriter'
11 |     ]
12 | 
13 | __author__ = 'Hernani Marques (h2m@access.uzh.ch)'
14 | 
15 | from bioc_annotation import BioCAnnotation
16 | from bioc_collection import BioCCollection
17 | from bioc_document import BioCDocument
18 | from bioc_location import BioCLocation
19 | from bioc_node import BioCNode
20 | from bioc_passage import BioCPassage
21 | from bioc_relation import BioCRelation
22 | from bioc_sentence import BioCSentence
23 | from bioc_reader import BioCReader
24 | from bioc_writer import BioCWriter
25 | 


--------------------------------------------------------------------------------
/src/bioc/bioc_annotation.py:
--------------------------------------------------------------------------------
 1 | __all__ = ['BioCAnnotation']
 2 | 
 3 | from meta import _MetaId, _MetaInfons, _MetaText
 4 | 
 5 | class BioCAnnotation(_MetaId, _MetaInfons, _MetaText):
 6 | 
 7 |     def __init__(self, annotation=None):
 8 |         
 9 |         self.id = ''
10 |         self.infons = dict()
11 |         self.locations = list()
12 |         self.text = ''
13 | 
14 |         if annotation is not None:
15 |             self.id = annotation.id
16 |             self.infons = annotation.infons
17 |             self.locations = annotation.locations
 18 |             self.text = annotation.text
19 | 
20 |     def __str__(self):
21 |         s = 'id: ' + self.id + '\n'
22 |         s += str(self.infons) + '\n'
23 |         s += 'locations: ' + str(self.locations) + '\n'
24 |         s += 'text: ' + self.text + '\n'
25 | 
26 |         return s
27 | 
28 |     def clear_locations(self):
29 |         self.locations = list()
30 | 
31 |     def add_location(self, location):
32 |         self.locations.append(location)
33 | 
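The copy-style constructor above assigns the other object's `infons` and `locations` directly, so both objects end up sharing the same containers. The Python 3 sketch below shows that effect with a stripped-down stand-in class (hypothetical, not the real BioCAnnotation) and one possible way to get an independent copy using the standard library.

```python
import copy


class Annotation:
    """Hypothetical stand-in that mirrors the constructor pattern above."""

    def __init__(self, annotation=None):
        self.infons = {}
        self.locations = []
        if annotation is not None:
            self.infons = annotation.infons        # same dict object
            self.locations = annotation.locations  # same list object


original = Annotation()
alias = Annotation(original)
alias.infons["type"] = "gene"      # also visible through ``original``

# copy.deepcopy duplicates the containers, so later edits stay local.
independent = copy.deepcopy(original)
independent.infons["type"] = "disease"
```

Whether sharing is intended here is not stated in the source; callers who need isolation can copy explicitly as shown.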


--------------------------------------------------------------------------------
/src/bioc/bioc_collection.py:
--------------------------------------------------------------------------------
 1 | __all__ = ['BioCCollection']
 2 | 
 3 | from meta import _MetaInfons, _MetaIter
 4 | from compat import _Py2Next
 5 | 
 6 | class BioCCollection(_Py2Next, _MetaInfons, _MetaIter):
 7 | 
 8 |     def __init__(self, collection=None):
 9 |         
10 |         self.infons = dict()
11 |         self.source = ''
12 |         self.date = ''
13 |         self.key = ''
14 |         self.documents = list()
15 | 
16 |         if collection is not None:
17 |             self.infons = collection.infons
18 |             self.source = collection.source
19 |             self.date = collection.date
20 |             self.key = collection.key
21 |             self.documents = collection.documents
22 | 
23 |     def __str__(self):
24 |         s = 'source: ' + self.source + '\n'
25 |         s += 'date: ' + self.date + '\n'
26 |         s += 'key: ' + self.key + '\n'
27 |         s += str(self.infons) + '\n'
28 |         s += str(self.documents) + '\n'
29 | 
30 |         return s
31 | 
32 |     def _iterdata(self):
33 |         return self.documents
34 |        
35 |     def clear_documents(self):
36 |         self.documents = list()
37 | 
38 |     def get_document(self, doc_idx):
39 |         return self.documents[doc_idx] 
40 | 
41 |     def add_document(self, document):
42 |         self.documents.append(document)
43 | 
44 |     def remove_document(self, document):
45 |        if type(document) is int:
 46 |            self.documents.remove(self.documents[document])
47 |        else:
48 |            self.documents.remove(document) # TBC
49 | 
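The integer branch of `remove_document` looks up the element at the index and hands it to `list.remove`, which deletes the first element that compares equal. When equal items occur more than once, that may not be the element at the requested index. The Python 3 sketch below contrasts this with `del`, using a hypothetical stand-in class and plain strings in place of documents.

```python
class DocumentList:
    """Hypothetical stand-in for the document handling of BioCCollection."""

    def __init__(self):
        self.documents = []

    def add_document(self, document):
        self.documents.append(document)

    def remove_document(self, document):
        if isinstance(document, int):
            # Delete exactly the element at this position.
            del self.documents[document]
        else:
            self.documents.remove(document)


docs = DocumentList()
for name in ["a", "b", "a"]:
    docs.add_document(name)
docs.remove_document(2)            # drops the last "a"

# The lookup-then-remove pattern drops the first equal element instead.
by_value = ["a", "b", "a"]
by_value.remove(by_value[2])
```

The difference only shows up when a list holds duplicates or objects that define equality; for distinct objects both approaches remove the same element.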


--------------------------------------------------------------------------------
/src/bioc/bioc_document.py:
--------------------------------------------------------------------------------
 1 | __all__ = ['BioCDocument']
 2 | 
 3 | from compat import _Py2Next
 4 | from meta import _MetaId, _MetaInfons, _MetaRelations, _MetaIter
 5 | 
 6 | class BioCDocument(_MetaId, _MetaInfons, _MetaRelations, _MetaIter,
 7 |                    _Py2Next):
 8 | 
 9 |     def __init__(self, document=None):
10 | 
11 |         self.id = ''
12 |         self.infons = dict()
13 |         self.relations = list()
14 |         self.passages = list()
15 | 
16 |         if document is not None:
17 |             self.id = document.id
18 |             self.infons = document.infons
19 |             self.relations = document.relations
20 |             self.passages = document.passages
21 | 
22 |     def __str__(self):
23 |         s = 'id: ' + self.id + '\n'
24 |         s += 'infon: ' + str(self.infons) + '\n'
25 |         s += str(self.passages) + '\n'
26 |         s += 'relation: ' + str(self.relations) + '\n'
27 | 
28 |         return s
29 | 
30 |     def _iterdata(self):
31 |         return self.passages
32 | 
33 |     def get_size(self):
 34 |         return len(self.passages) # As in Java BioC
35 | 
36 |     def clear_passages(self):
37 |         self.passages = list()
38 | 
39 |     def add_passage(self, passage):
40 |         self.passages.append(passage)
41 | 
42 |     def remove_passage(self, passage):
43 |         if type(passage) is int:
44 |             self.passages.remove(self.passages[passage])
45 |         else:
46 |             self.passages.remove(passage) # TBC
47 | 


--------------------------------------------------------------------------------
/src/bioc/bioc_location.py:
--------------------------------------------------------------------------------
 1 | __all__ = ['BioCLocation']
 2 | 
 3 | from meta import _MetaOffset
 4 | 
 5 | class BioCLocation(_MetaOffset):
 6 | 
 7 |     def __init__(self, location=None):
 8 |         
 9 |         self.offset = '-1'
10 |         self.length = '0'
11 | 
12 |         if location is not None:
13 |              self.offset = location.offset
14 |              self.length = location.length 
15 | 
16 |     def __str__(self):
17 |         s = str(self.offset) + ':' + str(self.length)
18 | 
19 |         return s
20 | 


--------------------------------------------------------------------------------
/src/bioc/bioc_node.py:
--------------------------------------------------------------------------------
 1 | __all__ = ['BioCNode']
 2 | 
 3 | class BioCNode:
 4 | 
 5 |     def __init__(self, node=None, refid=None, role=None):
 6 |         
 7 |         self.refid = ''
 8 |         self.role = ''
 9 | 
10 |         # Use arg ``node'' if set
11 |         if node is not None:
12 |             self.refid = node.refid
13 |             self.role = node.role
14 |         # Use resting optional args only if both set
15 |         elif (refid is not None) and (role is not None):
16 |             self.refid = refid
17 |             self.role = role
18 | 
19 |     def __str__(self):
20 |          s = 'refid: ' + self.refid + '\n'
21 |          s += 'role: ' + self.role + '\n'
22 | 
23 |          return s
24 | 


--------------------------------------------------------------------------------
/src/bioc/bioc_passage.py:
--------------------------------------------------------------------------------
 1 | __all__ = ['BioCPassage']
 2 | 
 3 | from meta import _MetaAnnotations, _MetaInfons, _MetaOffset, \
 4 |                  _MetaRelations, _MetaText
 5 | 
 6 | class BioCPassage(_MetaAnnotations, _MetaOffset, _MetaText,
 7 |                   _MetaRelations, _MetaInfons):
 8 | 
 9 |     def __init__(self, passage=None):
10 |         
11 |         self.offset = '-1'
12 |         self.text = ''
13 |         self.infons = dict()
14 |         self.sentences = list()
15 |         self.annotations = list()
16 |         self.relations = list()
17 | 
18 |         if passage is not None:
19 |             self.offset = passage.offset
20 |             self.text = passage.text
21 |             self.infons = passage.infons
22 |             self.sentences = passage.sentences
23 |             self.annotations = passage.annotations
24 |             self.relations = passage.relations
25 | 
26 |     def size(self):
27 |         return len(self.sentences)
28 | 
 29 |     def has_sentences(self):
 30 |         # Always return a bool (previously fell through to None)
 31 |         return len(self.sentences) > 0
32 | 
33 |     def add_sentence(self, sentence):
34 |         self.sentences.append(sentence)
35 | 
36 |     def sentences_iterator(self):
 37 |         return iter(self.sentences)
38 | 
39 |     def clear_sentences(self):
 40 |         self.sentences = list()
41 | 
42 |     def remove_sentence(self, sentence): # int or obj
43 |         if type(sentence) is int:
44 |             self.sentences.remove(self.sentences[sentence])
45 |         else:
46 |             self.sentences.remove(sentence)
47 | 


--------------------------------------------------------------------------------
/src/bioc/bioc_reader.py:
--------------------------------------------------------------------------------
  1 | __all__ = ['BioCReader']
  2 | 
  3 | import StringIO
  4 | 
  5 | from lxml import etree
  6 | 
  7 | from bioc_annotation import BioCAnnotation
  8 | from bioc_collection import BioCCollection
  9 | from bioc_document import BioCDocument
 10 | from bioc_location import BioCLocation
 11 | from bioc_passage import BioCPassage
 12 | from bioc_sentence import BioCSentence
 13 | from bioc_node import BioCNode
 14 | from bioc_relation import BioCRelation
 15 | 
 16 | class BioCReader:
 17 |     """
 18 |     This class can be used to store BioC XML files in PyBioC objects, 
 19 |     for further manipulation.
 20 |     """
 21 | 
 22 |     def __init__(self, source, dtd_valid_file=None):
 23 |         """
 24 |         source:             File path to a BioC XML input document.
 25 |         dtd_valid_file:     File path to a BioC.dtd file. Using this
 26 |                             optional argument ensures DTD validation.
 27 |         """
 28 |         
 29 |         self.source = source
 30 |         self.collection = BioCCollection()
 31 |         self.xml_tree = etree.parse(source)
 32 |         
 33 |         if dtd_valid_file is not None:
 34 |             dtd = etree.DTD(dtd_valid_file)
 35 |             if dtd.validate(self.xml_tree) is False:
 36 |                 raise(Exception(dtd.error_log.filter_from_errors()[0]))
 37 |                 
 38 |     def read(self):
 39 |         """
 40 |         Invoke this method in order to read in the file provided by
 41 |         the source class variable. Only after this method has been
 42 |         called the BioCReader object gets populated.
 43 |         """
 44 |         self._read_collection()
 45 |             
 46 |     def _read_collection(self):
 47 |         collection_elem = self.xml_tree.xpath('/collection')[0]
 48 |         
 49 |         self.collection.source = collection_elem.xpath('source')[0].text
 50 |         self.collection.date = collection_elem.xpath('date')[0].text
 51 |         self.collection.key = collection_elem.xpath('key')[0].text
 52 |         
 53 |         infon_elem_list = collection_elem.xpath('infon')
 54 |         document_elem_list = collection_elem.xpath('document')
 55 |         
 56 |         self._read_infons(infon_elem_list, self.collection)
 57 |         self._read_documents(document_elem_list)
 58 |         
 59 |         
 60 |     def _read_infons(self, infon_elem_list, infons_parent_elem):
 61 |         for infon_elem in infon_elem_list:
 62 |             infons_parent_elem.put_infon(self._get_infon_key(infon_elem),
 63 |                                             infon_elem.text)
 64 | 
 65 |     def _read_documents(self, document_elem_list):
 66 |         for document_elem in document_elem_list:
 67 |             document = BioCDocument()
 68 |             document.id = document_elem.xpath('id')[0].text
 69 |             self._read_infons(document_elem.xpath('infon'), document)
 70 |             self._read_passages(document_elem.xpath('passage'),
 71 |                                 document)
 72 |             self._read_relations(document_elem.xpath('relation'),
 73 |                                 document)
 74 |             
 75 |             self.collection.add_document(document)
 76 | 
 77 |     def _read_passages(self, passage_elem_list, document_parent_elem):
 78 |         for passage_elem in passage_elem_list:
 79 |             passage = BioCPassage()
 80 |             self._read_infons(passage_elem.xpath('infon'), passage)
 81 |             passage.offset = passage_elem.xpath('offset')[0].text
 82 |             
 83 |             # Is this BioC document with <sentence>?
 84 |             if len(passage_elem.xpath('sentence')) > 0:
 85 |                 self._read_sentences(passage_elem.xpath('sentence'),
 86 |                                     passage)
 87 |             else:
 88 |                 # Is the (optional) text element available?
 89 |                 try:
 90 |                     passage.text = passage_elem.xpath('text')[0].text
 91 |                 except IndexError:
 92 |                     pass
 93 |                 self._read_annotations(passage_elem.xpath('annotation'),
 94 |                                     passage)
 95 |                                     
 96 |             self._read_relations(passage_elem.xpath('relation'),
 97 |                                     passage)
 98 |             
 99 |             document_parent_elem.add_passage(passage)
100 |     
101 |     def _read_sentences(self, sentence_elem_list, passage_parent_elem):
102 |         for sentence_elem in sentence_elem_list:
103 |             sentence = BioCSentence()
104 |             self._read_infons(sentence_elem.xpath('infon'), sentence)
105 |             sentence.offset = sentence_elem.xpath('offset')[0].text
106 |             sentence.text = sentence_elem.findtext('text', default='')
107 |             self._read_annotations(sentence_elem.xpath('annotation'),
108 |                                     sentence)
109 |             self._read_relations(sentence_elem.xpath('relation'),
110 |                                     sentence)
111 |             
112 |             passage_parent_elem.add_sentence(sentence)
113 |     
114 |     def _read_annotations(self, annotation_elem_list, 
115 |                             annotations_parent_elem):
116 |         for annotation_elem in annotation_elem_list:
117 |             annotation = BioCAnnotation()
118 |             # Attribute id is just #IMPLIED, not #REQUIRED
119 |             if 'id' in annotation_elem.attrib:
120 |                 annotation.id = annotation_elem.attrib['id']
121 |             self._read_infons(annotation_elem.xpath('infon'),
122 |                                 annotation)
123 |                                 
124 |             for location_elem in annotation_elem.xpath('location'):
125 |                 location = BioCLocation()
126 |                 location.offset = location_elem.attrib['offset']
127 |                 location.length = location_elem.attrib['length']
128 |                 
129 |                 annotation.add_location(location)
130 |                 
131 |             annotation.text = annotation_elem.xpath('text')[0].text
132 |             
133 |             annotations_parent_elem.add_annotation(annotation)
134 |         
135 |     def _read_relations(self, relation_elem_list, relations_parent_elem):
136 |         for relation_elem in relation_elem_list:
137 |             relation = BioCRelation()
138 |             # Attribute id is just #IMPLIED, not #REQUIRED
139 |             if 'id' in relation_elem.attrib:
140 |                 relation.id = relation_elem.attrib['id']
141 |             self._read_infons(relation_elem.xpath('infon'), relation)
142 | 
143 |             for node_elem in relation_elem.xpath('node'):
144 |                 node = BioCNode()
145 |                 node.refid = node_elem.attrib['refid']
146 |                 node.role = node_elem.attrib.get('role', '')  # DTD default
147 |                 
148 |                 relation.add_node(node)
149 |             
150 |             relations_parent_elem.add_relation(relation)
151 |  
152 |     def _get_infon_key(self, elem):
153 |         return elem.attrib['key']
154 | 
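To illustrate the element layout that the reader walks, here is a minimal Python 3 sketch that parses a small BioC-style document. It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies, and the sample XML is invented for the example. It treats `<text>` as optional and falls back to the DTD default of an empty `role`.

```python
import xml.etree.ElementTree as ET

# Invented sample data that follows the element order given in BioC.dtd.
SAMPLE = """<collection>
  <source>example</source><date>20130101</date><key>example.key</key>
  <document>
    <id>1</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>A short title</text>
      <annotation id="T1">
        <infon key="type">noun phrase</infon>
        <location offset="2" length="11"/>
        <text>short title</text>
      </annotation>
      <relation id="R1"><node refid="T1"/></relation>
    </passage>
  </document>
</collection>"""

root = ET.fromstring(SAMPLE)
passage = root.find("document/passage")

# Infons become a plain string-to-string map.
infons = {e.get("key"): e.text for e in passage.findall("infon")}

# <text> is optional in the DTD, so supply a default.
text = passage.findtext("text", default="")

locations = [(int(loc.get("offset")), int(loc.get("length")))
             for loc in passage.iter("location")]

# ``role`` may be absent when the DTD is not loaded; default to "".
roles = [node.get("role", "") for node in passage.iter("node")]
```

Converting offsets to `int` is a choice made for this sketch; the library code above keeps them as strings.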


--------------------------------------------------------------------------------
/src/bioc/bioc_relation.py:
--------------------------------------------------------------------------------
 1 | __all__ = ['BioCRelation']
 2 | 
 3 | from compat import _Py2Next
 4 | from meta import _MetaId, _MetaInfons, _MetaIter
 5 | from bioc_node import BioCNode
 6 | 
 7 | class BioCRelation(_MetaId, _MetaInfons, _Py2Next, _MetaIter):
 8 | 
 9 |     def __init__(self, relation=None):
10 |         
11 |         self.id = ''
12 |         self.nodes = list()
13 |         self.infons = dict()
14 | 
15 |         if relation is not None:
16 |             self.id = relation.id
17 |             self.nodes = relation.nodes
18 |             self.infons = relation.infons
19 | 
20 |     def __str__(self):
21 |         s = 'id: ' + self.id + '\n'
22 |         s += 'infons: ' + str(self.infons) + '\n'
23 |         s += 'nodes: ' + str(self.nodes) + '\n'
24 | 
25 |         return s
26 | 
27 |     def _iterdata(self):
28 |         return self.nodes
29 | 
 30 |     def add_node(self, node=None, refid=None, role=None):
 31 |         # Build a new node if both optional args are provided
 32 |         if (refid is not None) and (role is not None):
 33 |             self.nodes.append(BioCNode(refid=refid, role=role))
 34 |         else: # Otherwise fall back to the ``node'' argument
 35 |             self.nodes.append(node)
36 | 
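The intent of `add_node` appears to be that callers may pass either a ready-made node or a `refid`/`role` pair. The Python 3 sketch below shows one way to support both call styles using hypothetical stand-in classes; raising `ValueError` when neither form is supplied is an assumption of this sketch, not behavior taken from the source.

```python
class Node:
    """Hypothetical stand-in for BioCNode."""

    def __init__(self, refid="", role=""):
        self.refid = refid
        self.role = role


class Relation:
    """Hypothetical stand-in for BioCRelation."""

    def __init__(self):
        self.nodes = []

    def add_node(self, node=None, refid=None, role=None):
        # A complete refid/role pair takes precedence over ``node``.
        if refid is not None and role is not None:
            node = Node(refid=refid, role=role)
        if node is None:
            raise ValueError("pass a node, or both refid and role")
        self.nodes.append(node)


relation = Relation()
relation.add_node(Node("T1", "agent"))        # object form
relation.add_node(refid="T2", role="theme")   # keyword form
pairs = [(n.refid, n.role) for n in relation.nodes]
```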


--------------------------------------------------------------------------------
/src/bioc/bioc_sentence.py:
--------------------------------------------------------------------------------
 1 | __all__ = ['BioCSentence']
 2 | 
 3 | 
 4 | from meta import _MetaAnnotations, _MetaInfons, _MetaOffset, \
 5 |                       _MetaRelations, _MetaText
 6 |                       
 7 | 
 8 | class BioCSentence(_MetaAnnotations, _MetaInfons, _MetaOffset, 
 9 |                    _MetaRelations, _MetaText):
10 |     
11 |     def __init__(self, sentence=None):
12 |         
13 |         self.offset = '-1'
14 |         self.text = ''
15 |         self.infons = dict()
16 |         self.annotations = list()
17 |         self.relations = list()
18 | 
19 |         if sentence is not None:
20 |             self.offset = sentence.offset
21 |             self.text = sentence.text
22 |             self.infons = sentence.infons
23 |             self.annotations = sentence.annotations
24 |             self.relations = sentence.relations
25 | 
26 |     def __str__(self):
27 |         s = 'offset: ' + str(self.offset) + '\n'
28 |         s += 'infons: ' + str(self.infons) + '\n' # TBD
29 |         s += 'text: ' + str(self.text) + '\n' # TBD
30 |         s += str(self.annotations) + '\n' # TBD
31 |         s += str(self.relations) + '\n' # TBD
32 | 
33 |         return s
34 | 


--------------------------------------------------------------------------------
/src/bioc/bioc_writer.py:
--------------------------------------------------------------------------------
  1 | __all__ = ['BioCWriter']
  2 | 
  3 | from lxml.builder import E
  4 | from lxml.etree import tostring
  5 | 
  6 | class BioCWriter:
  7 |     
  8 |     def __init__(self, filename=None, collection=None):
  9 |         
 10 |         self.root_tree = None
 11 |                         
 12 |         self.collection = None
 13 |         self.doctype = '''<?xml version='1.0' encoding='UTF-8'?>'''
 14 |         self.doctype += '''<!DOCTYPE collection SYSTEM 'BioC.dtd'>'''
 15 |         self.filename = filename
 16 |         
 17 |         if collection is not None:
 18 |             self.collection = collection
 19 |         
 20 |         if filename is not None:
 21 |             self.filename = filename
 22 |         
 23 |     def __str__(self):
 24 |         """ A BioCWriter object can be printed as string.
 25 |         """
 26 |         self._check_for_data()
 27 |             
 28 |         self.build()
 29 |         s = tostring(self.root_tree, 
 30 |                     pretty_print=True, 
 31 |                     doctype=self.doctype)
 32 |                     
 33 |         return s
 34 |     
 35 |     def _check_for_data(self):
 36 |         if self.collection is None:
 37 |             raise Exception('No data available.')
 38 |     
 39 |     def write(self, filename=None):
 40 |         """ Use this method to write the data in the PyBioC objects
 41 |             to disk.
 42 |             
 43 |             filename:   Output file path (optional argument; filename
 44 |                         provided by __init__ used otherwise.)
 45 |         """
 46 |         if filename is not None:
 47 |             self.filename = filename
 48 |         
 49 |         if self.filename is None:
 50 |             raise Exception('No output file path provided.')
 51 |             
 52 |         with open(self.filename, 'w') as f:
 53 |             f.write(self.__str__())
 54 |         
 55 |     def build(self):
 56 |         self._build_collection()
 57 |         
 58 |     def _build_collection(self):
 59 |         self.root_tree = E('collection', 
 60 |                             E('source'), E('date'), E('key'))
 61 |         self.root_tree.xpath('source')[0].text = self.collection.source
 62 |         self.root_tree.xpath('date')[0].text = self.collection.date
 63 |         self.root_tree.xpath('key')[0].text = self.collection.key         
 64 |         collection_elem = self.root_tree.xpath('/collection')[0]
 65 |         # infon*
 66 |         self._build_infons(self.collection.infons, collection_elem)
 67 |         # document+
 68 |         self._build_documents(self.collection.documents, 
 69 |                                 collection_elem)
 70 |         
 71 |     def _build_infons(self, infons_dict, infons_parent_elem):
 72 |         for infon_key, infon_val in infons_dict.items():
 73 |             infons_parent_elem.append(E('infon'))
 74 |             infon_elem = infons_parent_elem.xpath('infon')[-1]
 75 |             
 76 |             infon_elem.attrib['key'] = infon_key
 77 |             infon_elem.text = infon_val
 78 |             
 79 |     def _build_documents(self, documents_list, collection_parent_elem):
 80 |         for document in documents_list:
 81 |             collection_parent_elem.append(E('document', E('id')))
 82 |             document_elem = collection_parent_elem.xpath('document')[-1]
 83 |             # id
 84 |             id_elem = document_elem.xpath('id')[0]
 85 |             id_elem.text = document.id
 86 |             # infon*
 87 |             self._build_infons(document.infons, document_elem)
 88 |             # passage+
 89 |             self._build_passages(document.passages, document_elem)
 90 |             # relation*
 91 |             self._build_relations(document.relations, document_elem)
 92 |             
 93 |     def _build_passages(self, passages_list, document_parent_elem):
 94 |         for passage in passages_list:
 95 |             document_parent_elem.append(E('passage'))
 96 |             passage_elem = document_parent_elem.xpath('passage')[-1]
 97 |             # infon*
 98 |             self._build_infons(passage.infons, passage_elem)
 99 |             # offset
100 |             passage_elem.append(E('offset'))
101 |             passage_elem.xpath('offset')[0].text = passage.offset
102 |             if passage.has_sentences():
103 |                 # sentence*
104 |                 self._build_sentences(passage.sentences, passage_elem)
105 |             else:
106 |                 # text?, annotation*
107 |                 passage_elem.append(E('text'))
108 |                 passage_elem.xpath('text')[0].text = passage.text
109 |                 self._build_annotations(passage.annotations, 
110 |                                         passage_elem)
111 |             # relation*
112 |             self._build_relations(passage.relations, passage_elem)
113 |         
114 |     def _build_relations(self, relations_list, relations_parent_elem):
115 |         for relation in relations_list:
116 |             relations_parent_elem.append(E('relation'))
117 |             relation_elem = relations_parent_elem.xpath('relation')[-1]
118 |             # infon*
119 |             self._build_infons(relation.infons, relation_elem)
120 |             # node*
121 |             for node in relation.nodes:
122 |                 relation_elem.append(E('node'))
123 |                 node_elem = relation_elem.xpath('node')[-1]
124 |                 node_elem.attrib['refid'] = node.refid
125 |                 node_elem.attrib['role'] = node.role
126 |             # id (just #IMPLIED)
127 |             if len(relation.id) > 0:
128 |                 relation_elem.attrib['id'] = relation.id
129 |         
130 |     def _build_annotations(self, annotations_list, 
131 |                             annotations_parent_elem):
132 |         for annotation in annotations_list:
133 |             annotations_parent_elem.append(E('annotation'))
134 |             annotation_elem = \
135 |                 annotations_parent_elem.xpath('annotation')[-1]
136 |             # infon*
137 |             self._build_infons(annotation.infons, annotation_elem)
138 |             # location*
139 |             for location in annotation.locations:
140 |                 annotation_elem.append(E('location'))
141 |                 location_elem = annotation_elem.xpath('location')[-1]
142 |                 location_elem.attrib['offset'] = location.offset
143 |                 location_elem.attrib['length'] = location.length
144 |             # text
145 |             annotation_elem.append(E('text'))
146 |             text_elem = annotation_elem.xpath('text')[0]
147 |             text_elem.text = annotation.text
148 |             # id (just #IMPLIED)
149 |             if len(annotation.id) > 0:
150 |                 annotation_elem.attrib['id'] = annotation.id
151 | 
152 |     def _build_sentences(self, sentences_list, passage_parent_elem):
153 |         for sentence in sentences_list:
154 |             passage_parent_elem.append(E('sentence'))
155 |             sentence_elem = passage_parent_elem.xpath('sentence')[-1]
156 |             # infon*
157 |             self._build_infons(sentence.infons, sentence_elem)
158 |             # offset
159 |             sentence_elem.append(E('offset'))
160 |             offset_elem = sentence_elem.xpath('offset')[0]
161 |             offset_elem.text = sentence.offset
162 |             # text?
163 |             if len(sentence.text) > 0:
164 |                 sentence_elem.append(E('text'))
165 |                 text_elem = sentence_elem.xpath('text')[0]
166 |                 text_elem.text = sentence.text
167 |             # annotation*
168 |             self._build_annotations(sentence.annotations, sentence_elem)
169 |             # relation*
170 |             self._build_relations(sentence.relations, sentence_elem)
171 | 
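
The `_build_*` methods above all follow one pattern: append a child element, look it up again, then set its attributes and text. A minimal sketch of that pattern for `<infon>` elements follows. It deliberately swaps lxml for the standard-library `xml.etree.ElementTree`, so it is not the writer's actual code; `build_infons` is a hypothetical helper name, and the keys are sorted only to make the output deterministic.

```python
import xml.etree.ElementTree as ET


def build_infons(infons, parent):
    """Append one <infon key="..."> child per dictionary entry."""
    for key, value in sorted(infons.items()):
        # SubElement returns the new child, so no second lookup is needed.
        infon = ET.SubElement(parent, 'infon')
        infon.set('key', key)
        infon.text = value


passage = ET.Element('passage')
build_infons({'type': 'abstract'}, passage)

# encoding='unicode' yields a text string without an XML declaration.
xml_text = ET.tostring(passage, encoding='unicode')
```

Because `SubElement` hands back the element it created, the sketch avoids the `xpath(...)[-1]` lookup that the lxml version performs after each `append`.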


--------------------------------------------------------------------------------
/src/bioc/compat/__init__.py:
--------------------------------------------------------------------------------
1 | __all__ = []
2 | 
3 | __author__ = 'Hernani Marques (h2m@access.uzh.ch)'
4 | 
5 | from _py2_next import _Py2Next
6 | 


--------------------------------------------------------------------------------
/src/bioc/compat/_py2_next.py:
--------------------------------------------------------------------------------
1 | __all__ = []
2 | 
3 | class _Py2Next:
4 |       def __next__(self):
5 |           return self.next()
6 | 
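
A minimal sketch of the compatibility idea behind `_Py2Next`, under the assumption that it is meant to let iterators written with a Python 2 style `next()` method work with Python 3's `__next__` protocol. `Py2NextSketch` and `Countdown` are hypothetical names used for illustration only; they are not part of PyBioC.

```python
class Py2NextSketch(object):
    """Hypothetical stand-in for _Py2Next."""

    def __next__(self):
        # The value must be returned; otherwise iteration yields only None.
        return self.next()


class Countdown(Py2NextSketch):
    """Hypothetical iterator that only defines the Python 2 style next()."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def next(self):
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1


values = list(Countdown(3))
```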


--------------------------------------------------------------------------------
/src/bioc/meta/__init__.py:
--------------------------------------------------------------------------------
1 | __all__ = []
2 | 
3 | __author__ = 'Hernani Marques (h2m@access.uzh.ch)'
4 | 
5 | from _bioc_meta import _MetaAnnotations, _MetaInfons, _MetaOffset, \
6 |                        _MetaRelations, _MetaText, _MetaId
7 | from _iter import _MetaIter
8 | 


--------------------------------------------------------------------------------
/src/bioc/meta/_bioc_meta.py:
--------------------------------------------------------------------------------
 1 | __all__ = []
 2 | 
 3 | class _MetaAnnotations:
 4 |     annotations = list()
 5 | 
 6 |     def annotation_iterator(self):
 7 |         return iter(self.annotations)
 8 | 
 9 |     def clear_annotations(self):
10 |         self.annotations = list()
11 | 
12 |     def add_annotation(self, annotation):
13 |         self.annotations.append(annotation)
14 | 
15 |     def remove_annotation(self, annotation): # Can be int or obj
16 |         if type(annotation) is int:
17 |             del self.annotations[annotation]
18 |         else:
19 |             self.annotations.remove(annotation) # TBC
20 | 
21 | class _MetaInfons:
22 |     infons = dict()
23 | 
24 |     def put_infon(self, key, val):
25 |         self.infons[key] = val 
26 | 
27 |     def remove_infon(self, key):
28 |         del(self.infons[key]) 
29 | 
30 |     def clear_infons(self):
31 |         self.infons = dict()
32 | 
33 | class _MetaOffset:
34 |     offset = '-1'
35 | 
36 | class _MetaRelations:
37 |     relations = list()
38 | 
39 |     def relation_iterator(self):
40 |         return iter(self.relations)
41 | 
42 |     def clear_relations(self):
43 |         self.relations = list()
44 | 
45 |     def add_relation(self, relation):
46 |         self.relations.append(relation)
47 | 
48 |     def remove_relation(self, relation): # Can be int or obj
49 |         if type(relation) is int:
50 |             del self.relations[relation]
51 |         else:
52 |             self.relations.remove(relation) # TBC
53 | 
54 | class _MetaText:
55 |     text = ''
56 | 
57 | class _MetaId:
58 |     id = ''
59 | 
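
The mixins above declare `annotations`, `infons`, and `relations` as class attributes. A class that mixes them in without rebinding those names in `__init__` would share one list or dict across all of its instances. A minimal sketch of that pitfall, using the hypothetical classes `SharedBox` and `OwnBox` rather than PyBioC code:

```python
class SharedBox(object):
    items = list()  # class attribute: one list shared by every instance

    def add(self, item):
        self.items.append(item)


class OwnBox(SharedBox):
    def __init__(self):
        self.items = list()  # instance attribute: one list per object


a, b = SharedBox(), SharedBox()
a.add('x')
shared_seen_by_b = list(b.items)  # b sees the item added through a

c, d = OwnBox(), OwnBox()
c.add('y')
own_seen_by_d = list(d.items)  # d is unaffected by c
```

`BioCSentence.__init__` rebinds all five attributes on the instance, which is what keeps separate sentences from sharing state.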


--------------------------------------------------------------------------------
/src/bioc/meta/_iter.py:
--------------------------------------------------------------------------------
1 | __all__ = []
2 | 
3 | class _MetaIter:
4 | 
5 |     def __iter__(self):
6 |         return self._iterdata().__iter__()
7 | 
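
`_MetaIter` relies on the class that mixes it in to provide `_iterdata()`. That method is not defined in this file, so the sketch below assumes it returns the container to iterate over. `MetaIterSketch` and `PassageListSketch` are hypothetical names for illustration.

```python
class MetaIterSketch(object):
    """Hypothetical stand-in for _MetaIter."""

    def __iter__(self):
        # Delegate iteration to whatever container the subclass exposes.
        return iter(self._iterdata())


class PassageListSketch(MetaIterSketch):
    """Hypothetical container that exposes its passages for iteration."""

    def __init__(self, passages):
        self.passages = list(passages)

    def _iterdata(self):
        return self.passages


collected = [p for p in PassageListSketch(['title', 'abstract'])]
```

This is the mechanism that would let a loop such as `for passage in document:` work on an object that is not itself a list.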


--------------------------------------------------------------------------------
/src/stemmer.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | # -*- coding: utf-8 -*- 
 3 | # h2m@access.uzh.ch
 4 | 
 5 | from os import sep
 6 | import sys
 7 | 
 8 | from nltk.tokenize import wordpunct_tokenize
 9 | from nltk import PorterStemmer
10 | 
11 | from bioc import BioCAnnotation
12 | from bioc import BioCReader
13 | from bioc import BioCWriter
14 | 
15 | BIOC_IN = '..' + sep + 'test_input' + sep + 'example_input.xml'
16 | BIOC_OUT = 'example_input_stemmed.xml'
17 | DTD_FILE = '..' + sep + 'BioC.dtd'
18 | 
19 | def main():
20 |     # Use file defined by BIOC_IN as default if no other provided
21 |     bioc_in = BIOC_IN
22 |     if len(sys.argv) >= 2:
23 |         bioc_in = sys.argv[1]
24 |     
25 |     # A BioCReader object is put in place to hold the example BioC XML
26 |     # document
27 |     bioc_reader = BioCReader(bioc_in, dtd_valid_file=DTD_FILE)
28 |     
29 |     # A BioCWriter object is prepared to write out the annotated data
30 |     bioc_writer = BioCWriter(BIOC_OUT)
31 |     
32 |     # The NLTK porter stemmer is used for stemming
33 |     stemmer = PorterStemmer()
34 |     
35 |     # The example input file given above (by BIOC_IN) is fed into
36 |     # a BioCReader object; validation is done by the BioC DTD
37 |     bioc_reader.read()
38 |     
39 |     # Pass over basic data
40 |     bioc_writer.collection = bioc_reader.collection
41 |     
42 |     # Get documents to manipulate
43 |     documents = bioc_writer.collection.documents
44 |     
45 |     # Go through each document
46 |     annotation_id = 0
47 |     for document in documents:
48 |         
49 |         # Go through each passage of the document
50 |         for passage in document:
51 |             #  Stem all the tokens found
52 |             stems = [stemmer.stem(token) for 
53 |                      token in wordpunct_tokenize(passage.text)]
54 |             # Add an annotation showing the stemmed version, in the
55 |             # given order
56 |             for stem in stems:
57 |                 annotation_id += 1
58 |                 
59 |                 # For each token an annotation is created, providing
60 |                 # the surface form of a 'stemmed token'.
61 |                 # (The annotations are collectively added following
62 |                 #  a document passage with a <text> tag.)
63 |                 bioc_annotation = BioCAnnotation()
64 |                 bioc_annotation.text = stem
65 |                 bioc_annotation.id = str(annotation_id)
66 |                 bioc_annotation.put_infon('surface form', 
67 |                                           'stemmed token')
68 |                 passage.add_annotation(bioc_annotation)
69 |     
70 |     # Print file to screen w/o trailing newline
71 |     # (Can be redirected into a file, e.g. output_bioc.xml)
72 |     sys.stdout.write(str(bioc_writer))
73 |     
74 |     # Write to disk
75 |     bioc_writer.write()
76 |     
77 | if  __name__ == '__main__':
78 |     main()
79 | 


--------------------------------------------------------------------------------
/src/test_read+write.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | from bioc import BioCReader
 4 | from bioc import BioCWriter
 5 | 
 6 | test_file = '../test_input/bcIVLearningCorpus.xml'
 7 | dtd_file = '../BioC.dtd'
 8 | 
 9 | def main():
10 |     bioc_reader = BioCReader(test_file, dtd_valid_file=dtd_file)
11 |     bioc_reader.read()
12 |     '''
13 |     sentences = bioc_reader.collection.documents[0].passages[0].sentences
14 |     for sentence in sentences:
15 |         print sentence.offset
16 |     '''
17 | 
18 |     bioc_writer = BioCWriter('output_bioc.xml')
19 |     bioc_writer.collection = bioc_reader.collection
20 |     bioc_writer.write()
21 |     print(bioc_writer)
22 | 
23 | if  __name__ == '__main__':
24 |     main()
25 | 


--------------------------------------------------------------------------------
/test_input/PMID-8557975-simplified-sentences-tokens.xml:
--------------------------------------------------------------------------------
  1 | <?xml version="1.0" encoding="utf-8"?>
  2 | <!DOCTYPE collection SYSTEM "BioC.dtd">
  3 | <collection>
  4 |   <source>PubMed</source>
  5 |   <date>20130316</date>
  6 |   <key>PMID-8557975-simplified-sentences-tokens.key</key>
  7 |   <document>
  8 |     <id>8557975</id>
  9 |     <passage>
 10 |       <infon key="type">abstract</infon>
 11 |       <offset>0</offset>
 12 |       <sentence>
 13 | 	<infon key="type">original sentence</infon>
 14 |         <offset>70</offset>
 15 |         <annotation id="t0">
 16 |           <infon key="type">token</infon>
 17 |           <location offset="70" length="6"></location>
 18 |           <text>Active</text>
 19 |         </annotation>
 20 |         <annotation id="t1">
 21 |           <infon key="type">token</infon>
 22 |           <location offset="77" length="5"></location>
 23 |           <text>Raf-1</text>
 24 |         </annotation>
 25 |         <annotation id="t2">
 26 |           <infon key="type">token</infon>
 27 |           <location offset="83" length="14"></location>
 28 |           <text>phosphorylates</text>
 29 |         </annotation>
 30 |         <annotation id="t3">
 31 |           <infon key="type">token</infon>
 32 |           <location offset="98" length="3"></location>
 33 |           <text>and</text>
 34 |         </annotation>
 35 |         <annotation id="t4">
 36 |           <infon key="type">token</infon>
 37 |           <location offset="102" length="9"></location>
 38 |           <text>activates</text>
 39 |         </annotation>
 40 |         <annotation id="t5">
 41 |           <infon key="type">token</infon>
 42 |           <location offset="112" length="3"></location>
 43 |           <text>the</text>
 44 |         </annotation>
 45 |         <annotation id="t6">
 46 |           <infon key="type">token</infon>
 47 |           <location offset="116" length="17"></location>
 48 |           <text>mitogen-activated</text>
 49 |         </annotation>
 50 |         <annotation id="t7">
 51 |           <infon key="type">token</infon>
 52 |           <location offset="134" length="7"></location>
 53 |           <text>protein</text>
 54 |         </annotation>
 55 |         <annotation id="t8">
 56 |           <infon key="type">token</infon>
 57 |           <location offset="142" length="1"></location>
 58 |           <text>(</text>
 59 |         </annotation>
 60 |         <annotation id="t9">
 61 |           <infon key="type">token</infon>
 62 |           <location offset="143" length="3"></location>
 63 |           <text>MAP</text>
 64 |         </annotation>
 65 |         <annotation id="t10">
 66 |           <infon key="type">token</infon>
 67 |           <location offset="146" length="1"></location>
 68 |           <text>)</text>
 69 |         </annotation>
 70 |         <annotation id="t11">
 71 |           <infon key="type">token</infon>
 72 |           <location offset="148" length="20"></location>
 73 |           <text>kinase/extracellular</text>
 74 |         </annotation>
 75 |         <annotation id="t12">
 76 |           <infon key="type">token</infon>
 77 |           <location offset="169" length="16"></location>
 78 |           <text>signal-regulated</text>
 79 |         </annotation>
 80 |         <annotation id="t13">
 81 |           <infon key="type">token</infon>
 82 |           <location offset="186" length="6"></location>
 83 |           <text>kinase</text>
 84 |         </annotation>
 85 |         <annotation id="t14">
 86 |           <infon key="type">token</infon>
 87 |           <location offset="193" length="6"></location>
 88 |           <text>kinase</text>
 89 |         </annotation>
 90 |         <annotation id="t15">
 91 |           <infon key="type">token</infon>
 92 |           <location offset="200" length="1"></location>
 93 |           <text>1</text>
 94 |         </annotation>
 95 |         <annotation id="t16">
 96 |           <infon key="type">token</infon>
 97 |           <location offset="202" length="1"></location>
 98 |           <text>(</text>
 99 |         </annotation>
100 |         <annotation id="t17">
101 |           <infon key="type">token</infon>
102 |           <location offset="203" length="4"></location>
103 |           <text>MEK1</text>
104 |         </annotation>
105 |         <annotation id="t18">
106 |           <infon key="type">token</infon>
107 |           <location offset="207" length="1"></location>
108 |           <text>)</text>
109 |         </annotation>
110 |         <annotation id="t19">
111 |           <infon key="type">token</infon>
112 |           <location offset="208" length="1"></location>
113 |           <text>,</text>
114 |         </annotation>
115 |         <annotation id="t20">
116 |           <infon key="type">token</infon>
117 |           <location offset="210" length="5"></location>
118 |           <text>which</text>
119 |         </annotation>
120 |         <annotation id="t21">
121 |           <infon key="type">token</infon>
122 |           <location offset="216" length="2"></location>
123 |           <text>in</text>
124 |         </annotation>
125 |         <annotation id="t22">
126 |           <infon key="type">token</infon>
127 |           <location offset="219" length="4"></location>
128 |           <text>turn</text>
129 |         </annotation>
130 |         <annotation id="t23">
131 |           <infon key="type">token</infon>
132 |           <location offset="224" length="14"></location>
133 |           <text>phosphorylates</text>
134 |         </annotation>
135 |         <annotation id="t24">
136 |           <infon key="type">token</infon>
137 |           <location offset="239" length="3"></location>
138 |           <text>and</text>
139 |         </annotation>
140 |         <annotation id="t25">
141 |           <infon key="type">token</infon>
142 |           <location offset="243" length="9"></location>
143 |           <text>activates</text>
144 |         </annotation>
145 |         <annotation id="t26">
146 |           <infon key="type">token</infon>
147 |           <location offset="253" length="3"></location>
148 |           <text>the</text>
149 |         </annotation>
150 |         <annotation id="t27">
151 |           <infon key="type">token</infon>
152 |           <location offset="257" length="3"></location>
153 |           <text>MAP</text>
154 |         </annotation>
155 |         <annotation id="t28">
156 |           <infon key="type">token</infon>
157 |           <location offset="261" length="21"></location>
158 |           <text>kinases/extracellular</text>
159 |         </annotation>
160 |         <annotation id="t29">
161 |           <infon key="type">token</infon>
162 |           <location offset="283" length="6"></location>
163 |           <text>signal</text>
164 |         </annotation>
165 |         <annotation id="t30">
166 |           <infon key="type">token</infon>
167 |           <location offset="290" length="9"></location>
168 |           <text>regulated</text>
169 |         </annotation>
170 |         <annotation id="t31">
171 |           <infon key="type">token</infon>
172 |           <location offset="300" length="7"></location>
173 |           <text>kinases</text>
174 |         </annotation>
175 |         <annotation id="t32">
176 |           <infon key="type">token</infon>
177 |           <location offset="307" length="1"></location>
178 |           <text>,</text>
179 |         </annotation>
180 |         <annotation id="t33">
181 |           <infon key="type">token</infon>
182 |           <location offset="309" length="4"></location>
183 |           <text>ERK1</text>
184 |         </annotation>
185 |         <annotation id="t34">
186 |           <infon key="type">token</infon>
187 |           <location offset="314" length="3"></location>
188 |           <text>and</text>
189 |         </annotation>
190 |         <annotation id="t35">
191 |           <infon key="type">token</infon>
192 |           <location offset="318" length="4"></location>
193 |           <text>ERK2</text>
194 |         </annotation>
195 |         <annotation id="t36">
196 |           <infon key="type">token</infon>
197 |           <location offset="322" length="1"></location>
198 |           <text>.</text>
199 |         </annotation>
200 |       </sentence>
201 |       <sentence>
202 | 	<infon key="type">simplified sentence</infon>
203 |         <offset>325</offset>
204 |         <annotation id="t37">
205 |           <infon key="type">token</infon>
206 |           <location offset="325" length="6"></location>
207 |           <text>Active</text>
208 |         </annotation>
209 |         <annotation id="t38">
210 |           <infon key="type">token</infon>
211 |           <location offset="332" length="5"></location>
212 |           <text>Raf-1</text>
213 |         </annotation>
214 |         <annotation id="t39">
215 |           <infon key="type">token</infon>
216 |           <location offset="338" length="14"></location>
217 |           <text>phosphorylates</text>
218 |         </annotation>
219 |         <annotation id="t40">
220 |           <infon key="type">token</infon>
221 |           <location offset="353" length="4"></location>
222 |           <text>MEK1</text>
223 |         </annotation>
224 |         <annotation id="t41">
225 |           <infon key="type">token</infon>
226 |           <location offset="357" length="1"></location>
227 |           <text>.</text>
228 |         </annotation>
229 |       </sentence>
230 |       <sentence>
231 | 	<infon key="type">simplified sentence</infon>
232 |         <offset>360</offset>
233 |         <annotation id="t42">
234 |           <infon key="type">token</infon>
235 |           <location offset="360" length="6"></location>
236 |           <text>Active</text>
237 |         </annotation>
238 |         <annotation id="t43">
239 |           <infon key="type">token</infon>
240 |           <location offset="367" length="5"></location>
241 |           <text>Raf-1</text>
242 |         </annotation>
243 |         <annotation id="t44">
244 |           <infon key="type">token</infon>
245 |           <location offset="373" length="9"></location>
246 |           <text>activates</text>
247 |         </annotation>
248 |         <annotation id="t45">
249 |           <infon key="type">token</infon>
250 |           <location offset="383" length="4"></location>
251 |           <text>MEK1</text>
252 |         </annotation>
253 |         <annotation id="t46">
254 |           <infon key="type">token</infon>
255 |           <location offset="387" length="1"></location>
256 |           <text>.</text>
257 |         </annotation>
258 |       </sentence>
259 |       <sentence>
260 | 	<infon key="type">simplified sentence</infon>
261 |         <offset>390</offset>
262 |         <annotation id="t47">
263 |           <infon key="type">token</infon>
264 |           <location offset="390" length="4"></location>
265 |           <text>MEK1</text>
266 |         </annotation>
267 |         <annotation id="t48">
268 |           <infon key="type">token</infon>
269 |           <location offset="395" length="2"></location>
270 |           <text>in</text>
271 |         </annotation>
272 |         <annotation id="t49">
273 |           <infon key="type">token</infon>
274 |           <location offset="398" length="4"></location>
275 |           <text>turn</text>
276 |         </annotation>
277 |         <annotation id="t50">
278 |           <infon key="type">token</infon>
279 |           <location offset="403" length="14"></location>
280 |           <text>phosphorylates</text>
281 |         </annotation>
282 |         <annotation id="t51">
283 |           <infon key="type">token</infon>
284 |           <location offset="418" length="4"></location>
285 |           <text>ERK1</text>
286 |         </annotation>
287 |         <annotation id="t52">
288 |           <infon key="type">token</infon>
289 |           <location offset="422" length="1"></location>
290 |           <text>.</text>
291 |         </annotation>
292 |       </sentence>
293 |       <sentence>
294 | 	<infon key="type">simplified sentence</infon>
295 |         <offset>425</offset>
296 |         <annotation id="t53">
297 |           <infon key="type">token</infon>
298 |           <location offset="425" length="4"></location>
299 |           <text>MEK1</text>
300 |         </annotation>
301 |         <annotation id="t54">
302 |           <infon key="type">token</infon>
303 |           <location offset="430" length="2"></location>
304 |           <text>in</text>
305 |         </annotation>
306 |         <annotation id="t55">
307 |           <infon key="type">token</infon>
308 |           <location offset="433" length="4"></location>
309 |           <text>turn</text>
310 |         </annotation>
311 |         <annotation id="t56">
312 |           <infon key="type">token</infon>
313 |           <location offset="438" length="14"></location>
314 |           <text>phosphorylates</text>
315 |         </annotation>
316 |         <annotation id="t57">
317 |           <infon key="type">token</infon>
318 |           <location offset="453" length="4"></location>
319 |           <text>ERK2</text>
320 |         </annotation>
321 |         <annotation id="t58">
322 |           <infon key="type">token</infon>
323 |           <location offset="457" length="1"></location>
324 |           <text>.</text>
325 |         </annotation>
326 |       </sentence>
327 |       <sentence>
328 | 	<infon key="type">simplified sentence</infon>
329 |         <offset>460</offset>
330 |         <annotation id="t59">
331 |           <infon key="type">token</infon>
332 |           <location offset="460" length="4"></location>
333 |           <text>MEK1</text>
334 |         </annotation>
335 |         <annotation id="t60">
336 |           <infon key="type">token</infon>
337 |           <location offset="465" length="2"></location>
338 |           <text>in</text>
339 |         </annotation>
340 |         <annotation id="t61">
341 |           <infon key="type">token</infon>
342 |           <location offset="468" length="4"></location>
343 |           <text>turn</text>
344 |         </annotation>
345 |         <annotation id="t62">
346 |           <infon key="type">token</infon>
347 |           <location offset="473" length="9"></location>
348 |           <text>activates</text>
349 |         </annotation>
350 |         <annotation id="t63">
351 |           <infon key="type">token</infon>
352 |           <location offset="483" length="4"></location>
353 |           <text>ERK1</text>
354 |         </annotation>
355 |         <annotation id="t64">
356 |           <infon key="type">token</infon>
357 |           <location offset="487" length="1"></location>
358 |           <text>.</text>
359 |         </annotation>
360 |       </sentence>
361 |       <sentence>
362 | 	<infon key="type">simplified sentence</infon>
363 |         <offset>489</offset>
364 |         <annotation id="t65">
365 |           <infon key="type">token</infon>
366 |           <location offset="489" length="4"></location>
367 |           <text>MEK1</text>
368 |         </annotation>
369 |         <annotation id="t66">
370 |           <infon key="type">token</infon>
371 |           <location offset="494" length="2"></location>
372 |           <text>in</text>
373 |         </annotation>
374 |         <annotation id="t67">
375 |           <infon key="type">token</infon>
376 |           <location offset="497" length="4"></location>
377 |           <text>turn</text>
378 |         </annotation>
379 |         <annotation id="t68">
380 |           <infon key="type">token</infon>
381 |           <location offset="502" length="9"></location>
382 |           <text>activates</text>
383 |         </annotation>
384 |         <annotation id="t69">
385 |           <infon key="type">token</infon>
386 |           <location offset="512" length="4"></location>
387 |           <text>ERK2</text>
388 |         </annotation>
389 |         <annotation id="t70">
390 |           <infon key="type">token</infon>
391 |           <location offset="516" length="1"></location>
392 |           <text>.</text>
393 |         </annotation>
394 |       </sentence>
395 |       <!-- equ -->
396 |       <!-- Active -->
397 |       <relation id="r0">
398 |         <infon key="type">equ</infon>
399 |         <node refid="t0" role="original"></node>
400 |         <node refid="t37" role="simplified"></node>
401 |         <node refid="t42" role="simplified"></node>
402 |       </relation>
403 |       <!-- RAF-1 -->
404 |       <relation id="r1">
405 |         <infon key="type">equ</infon>
406 |         <node refid="t1" role="original"></node>
407 |         <node refid="t38" role="simplified"></node>
408 |         <node refid="t43" role="simplified"></node>
409 |       </relation>
410 |       <!-- phosphorylates -->
411 |       <relation id="r2">
412 |         <infon key="type">equ</infon>
413 |         <node refid="t2" role="original"></node>
414 |         <node refid="t39" role="simplified"></node>
415 |       </relation>
416 |       <!-- MEK1 -->
417 |       <relation id="r3">
418 |         <infon key="type">equ</infon>
419 |         <node refid="t17" role="original"></node>
420 |         <node refid="t40" role="simplified"></node>
421 |         <node refid="t45" role="simplified"></node>
422 |         <node refid="t47" role="simplified"></node>
423 |         <node refid="t53" role="simplified"></node>
424 |         <node refid="t59" role="simplified"></node>
425 |         <node refid="t65" role="simplified"></node>
426 |       </relation>
427 |       <!-- . -->
428 |       <relation id="r4">
429 |         <infon key="type">equ</infon>
430 |         <node refid="t36" role="original"></node>
431 |         <node refid="t41" role="simplified"></node>
432 |         <node refid="t46" role="simplified"></node>
433 |         <node refid="t52" role="simplified"></node>
434 |         <node refid="t58" role="simplified"></node>
435 |         <node refid="t64" role="simplified"></node>
436 |         <node refid="t70" role="simplified"></node>
437 |       </relation>
438 |       <!-- activates -->
439 |       <relation id="r5">
440 |         <infon key="type">equ</infon>
441 |         <node refid="t4" role="original"></node>
442 |         <node refid="t44" role="simplified"></node>
443 |       </relation>
444 |       <!-- in -->
445 |       <relation id="r6">
446 |         <infon key="type">equ</infon>
447 |         <node refid="t21" role="original"></node>
448 |         <node refid="t48" role="simplified"></node>
449 |         <node refid="t54" role="simplified"></node>
450 |         <node refid="t60" role="simplified"></node>
451 |         <node refid="t66" role="simplified"></node>
452 |       </relation>
453 |       <!-- turn -->
454 |       <relation id="r7">
455 |         <infon key="type">equ</infon>
456 |         <node refid="t22" role="original"></node>
457 |         <node refid="t49" role="simplified"></node>
458 |         <node refid="t55" role="simplified"></node>
459 |         <node refid="t61" role="simplified"></node>
460 |         <node refid="t67" role="simplified"></node>
461 |       </relation>
462 |       <!-- phosphorylates -->
463 |       <relation id="r8">
464 |         <infon key="type">equ</infon>
465 |         <node refid="t23" role="original"></node>
466 |         <node refid="t50" role="simplified"></node>
467 |         <node refid="t56" role="simplified"></node>
468 |       </relation>
469 |       <!-- ERK1 -->
470 |       <relation id="r9">
471 |         <infon key="type">equ</infon>
472 |         <node refid="t33" role="original"></node>
473 |         <node refid="t51" role="simplified"></node>
474 |         <node refid="t63" role="simplified"></node>
475 |       </relation>
476 |       <!-- ERK2 -->
477 |       <relation id="r10">
478 |         <infon key="type">equ</infon>
479 |         <node refid="t35" role="original"></node>
480 |         <node refid="t57" role="simplified"></node>
481 |         <node refid="t69" role="simplified"></node>
482 |       </relation>
483 |       <!-- activates -->
484 |       <relation id="r11">
485 |         <infon key="type">equ</infon>
486 |         <node refid="t25" role="original"></node>
487 |         <node refid="t62" role="simplified"></node>
488 |         <node refid="t68" role="simplified"></node>
489 |       </relation>
490 |     </passage>
491 |   </document>
492 | </collection>
493 | 


--------------------------------------------------------------------------------
/test_input/PMID-8557975-simplified-sentences.xml:
--------------------------------------------------------------------------------
 1 | <?xml version="1.0" encoding="utf-8"?>
 2 | <!DOCTYPE collection SYSTEM "BioC.dtd">
 3 | <collection>
 4 |   <source>PubMed</source>
 5 |   <date>20130316</date>
 6 |   <key>PMID-8557975-simplified-sentences.key</key>
 7 |   <document>
 8 |     <id>8557975</id>
 9 |     <passage>
10 |       <infon key="type">abstract</infon>
11 |       <offset>0</offset>
12 |       <sentence>
13 |         <infon key="type">original sentence</infon>
14 |         <offset>70</offset>
15 |         <text>Active Raf-1 phosphorylates and activates the mitogen-activated protein (MAP) kinase/extracellular signal-regulated kinase kinase 1 (MEK1), which in turn phosphorylates and activates the MAP kinases/extracellular signal regulated kinases, ERK1 and ERK2.</text>
16 |       </sentence>
17 |       <sentence>
18 |         <infon key="type">simplified sentence</infon>
19 |         <offset>325</offset>
20 |         <text>Active Raf-1 phosphorylates MEK1.</text>
21 |       </sentence>
22 |       <sentence>
23 |         <infon key="type">simplified sentence</infon>
24 |         <offset>360</offset>
25 |         <text>Active Raf-1 activates MEK1.</text>
26 |       </sentence>
27 |       <sentence>
28 |         <infon key="type">simplified sentence</infon>
29 |         <offset>390</offset>
30 |         <text>MEK1 in turn phosphorylates ERK1.</text>
31 |       </sentence>
32 |       <sentence>
33 |         <infon key="type">simplified sentence</infon>
34 |         <offset>425</offset>
35 |         <text>MEK1 in turn phosphorylates ERK2.</text>
36 |       </sentence>
37 |       <sentence>
38 |         <infon key="type">simplified sentence</infon>
39 |         <offset>460</offset>
40 |         <text>MEK1 in turn activates ERK1.</text>
41 |       </sentence>
42 |       <sentence>
43 |         <infon key="type">simplified sentence</infon>
44 |         <offset>489</offset>
45 |         <text>MEK1 in turn activates ERK2.</text>
46 |       </sentence>
47 |     </passage>
48 |   </document>
49 | </collection>
50 | 


--------------------------------------------------------------------------------
/test_input/everything-sentence.xml:
--------------------------------------------------------------------------------
1 | <?xml version='1.0' encoding='UTF-8'?><!DOCTYPE collection SYSTEM "BioC.dtd"><collection><source>Made up file to test that everything is allowed and processed. Has text in the sentence.</source><date>20130426</date><key>everything.key</key><infon key="collection-infon-key">collection-infon-value</infon><document><id>1</id><infon key="document-infon-key">document-infon-value</infon><passage><infon key="passage-infon-key">passage-infon-value</infon><offset>0</offset><sentence><infon key="sentence-infon-key">sentence-infon-value</infon><offset>0</offset><text>text of sentence</text><annotation id="S1"><infon key="annotation-infon-key">annotation-infon-value</infon><location offset="1" length="2"/><text>annotation text</text></annotation><relation id="RS1"><infon key="sentence-relation-infon-key">sentence-relation-infon-value</infon><node refid="RS1" role="sentence-relation"/></relation></sentence><relation id="RP1"><infon key="passage-relation-infon-key">passage-relation-infon-value</infon><node refid="RP1" role="passage-relation"/></relation></passage><relation id="D1"><infon key="document-relation-infon-key">document-relation-infon-value</infon><node refid="RD1" role="document-relation"/></relation></document></collection>


--------------------------------------------------------------------------------
/test_input/everything.xml:
--------------------------------------------------------------------------------
1 | <?xml version='1.0' encoding='UTF-8'?><!DOCTYPE collection SYSTEM "BioC.dtd"><collection><source>Made up file to test that everything is allowed and processed. Has text in the passage.</source><date>20130426</date><key>everything.key</key><infon key="collection-infon-key">collection-infon-value</infon><document><id>1</id><infon key="document-infon-key">document-infon-value</infon><passage><infon key="passage-infon-key">passage-infon-value</infon><offset>0</offset><text>text of passage</text><annotation id="P1"><infon key="annotation-infon-key">annotation-infon-value</infon><location offset="1" length="2"/><text>annotation text</text></annotation><relation id="RP1"><infon key="passage-relation-infon-key">passage-relation-infon-value</infon><node refid="RP1" role="passage-relation"/></relation></passage><relation id="D1"><infon key="document-relation-infon-key">document-relation-infon-value</infon><node refid="RD1" role="document-relation"/></relation></document></collection>


--------------------------------------------------------------------------------
/test_input/example_input.xml:
--------------------------------------------------------------------------------
 1 | <?xml version='1.0' encoding='UTF-8'?>
 2 | <!DOCTYPE collection SYSTEM "BioC.dtd">
 3 | <collection>
 4 |     <source>PUBMED</source>
 5 |     <date>20130422</date>
 6 |     <key>ctdBCIVLearningDataSet.key</key>
 7 |     <document>
 8 |         <id>10617681</id>
 9 |         <passage>
10 |             <infon key="type">title</infon>
11 |             <offset>0</offset>
12 |             <text>Possible role of valvular serotonin 5-HT(2B) receptors in the cardiopathy associated with fenfluramine.</text>
13 |         </passage>
14 |         <passage>
15 |             <infon key="type">abstract</infon>
16 |             <offset>104</offset>
17 |             <text>Dexfenfluramine was approved in the United States for long-term use as an appetite suppressant until it was reported to be associated with valvular heart disease. The valvular changes (myofibroblast proliferation) are histopathologically indistinguishable from those observed in carcinoid disease or after long-term exposure to 5-hydroxytryptamine (5-HT)(2)-preferring ergot drugs (ergotamine, methysergide). 5-HT(2) receptor stimulation is known to cause fibroblast mitogenesis, which could contribute to this lesion. To elucidate the mechanism of "fen-phen"-associated valvular lesions, we examined the interaction of fenfluramine and its metabolite norfenfluramine with 5-HT(2) receptor subtypes and examined the expression of these receptors in human and porcine heart valves. Fenfluramine binds weakly to 5-HT(2A), 5-HT(2B), and 5-HT(2C) receptors. In contrast, norfenfluramine exhibited high affinity for 5-HT(2B) and 5-HT(2C) receptors and more moderate affinity for 5-HT(2A) receptors. In cells expressing recombinant 5-HT(2B) receptors, norfenfluramine potently stimulated the hydrolysis of inositol phosphates, increased intracellular Ca(2+), and activated the mitogen-activated protein kinase cascade, the latter of which has been linked to mitogenic actions of the 5-HT(2B) receptor. The level of 5-HT(2B) and 5-HT(2A) receptor transcripts in heart valves was at least 300-fold higher than the levels of 5-HT(2C) receptor transcript, which were barely detectable. We propose that preferential stimulation of valvular 5-HT(2B) receptors by norfenfluramine, ergot drugs, or 5-HT released from carcinoid tumors (with or without accompanying 5-HT(2A) receptor activation) may contribute to valvular fibroplasia in humans.</text>
18 |         </passage>
19 |     </document>
20 | </collection>
21 | 


--------------------------------------------------------------------------------