├── BioC.dtd ├── CHANGES.txt ├── LICENSE.txt ├── README.txt ├── src ├── bioc │ ├── __init__.py │ ├── bioc_annotation.py │ ├── bioc_collection.py │ ├── bioc_document.py │ ├── bioc_location.py │ ├── bioc_node.py │ ├── bioc_passage.py │ ├── bioc_reader.py │ ├── bioc_relation.py │ ├── bioc_sentence.py │ ├── bioc_writer.py │ ├── compat │ │ ├── __init__.py │ │ └── _py2_next.py │ └── meta │ │ ├── __init__.py │ │ ├── _bioc_meta.py │ │ └── _iter.py ├── stemmer.py └── test_read+write.py └── test_input ├── PMID-8557975-simplified-sentences-tokens.xml ├── PMID-8557975-simplified-sentences.xml ├── everything-sentence.xml ├── everything.xml └── example_input.xml /BioC.dtd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | -------------------------------------------------------------------------------- /CHANGES.txt: -------------------------------------------------------------------------------- 1 | 1.01 2 | ---- 3 | Fix invalid handling of id attributes for annotation and relation tags. 4 | (Thanks to Tilia Ellendorff for pointing it out.) 5 | 6 | 1.00 7 | ---- 8 | PyBioC library reaches usable state. The classes BioCReader and BioCWriter can 9 | be used to read in and write out PyBioC XML data, respectively. 10 | 11 | 0.1 12 | --- 13 | Initial (incomplete) version created in analogy to BioC_Java_1.0.1. 14 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2013, the OntoGene project at the University of Zurich (UZH). 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | 1. 
Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 2. Redistributions in binary form must reproduce the above copyright notice, 10 | this list of conditions and the following disclaimer in the documentation 11 | and/or other materials provided with the distribution. 12 | 13 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 14 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 15 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 16 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 17 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 18 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 19 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 20 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 21 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 22 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 23 | 24 | The views and conclusions contained in the software and documentation are those 25 | of the authors and should not be interpreted as representing official policies, 26 | either expressed or implied, of the OntoGene project or the University of Zurich 27 | (UZH). 28 | -------------------------------------------------------------------------------- /README.txt: -------------------------------------------------------------------------------- 1 | PyBioC is a native Python library to deal with BioCreative XML data, 2 | i.e. to read from and to write to it. 3 | 4 | Usage: 5 | ------ 6 | Two example programs, test_read+write.py and stemmer.py, are shipped in the 7 | src/ folder. 8 | 9 | test_read+write.py shows the very basic reading and writing capability of the 10 | library.
11 | 12 | stemmer.py uses the Python Natural Language Toolkit (NLTK) library to 13 | process a BioC XML file: it reads the file in, tokenizes the passage text, 14 | stems the tokens and writes the manipulated PyBioC objects back out 15 | as valid BioC XML. 16 | -------------------------------------------------------------------------------- /src/bioc/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Package for interoperability in BioCreative work 3 | # 4 | 5 | __version__ = '1.02' 6 | 7 | __all__ = [ 8 | 'BioCAnnotation', 'BioCCollection', 'BioCDocument', 9 | 'BioCLocation', 'BioCNode', 'BioCPassage', 'BioCRelation', 10 | 'BioCSentence', 'BioCReader', 'BioCWriter' 11 | ] 12 | 13 | __author__ = 'Hernani Marques (h2m@access.uzh.ch)' 14 | 15 | from bioc_annotation import BioCAnnotation 16 | from bioc_collection import BioCCollection 17 | from bioc_document import BioCDocument 18 | from bioc_location import BioCLocation 19 | from bioc_node import BioCNode 20 | from bioc_passage import BioCPassage 21 | from bioc_relation import BioCRelation 22 | from bioc_sentence import BioCSentence 23 | from bioc_reader import BioCReader 24 | from bioc_writer import BioCWriter 25 | -------------------------------------------------------------------------------- /src/bioc/bioc_annotation.py: -------------------------------------------------------------------------------- 1 | __all__ = ['BioCAnnotation'] 2 | 3 | from meta import _MetaId, _MetaInfons, _MetaText 4 | 5 | class BioCAnnotation(_MetaId, _MetaInfons, _MetaText): 6 | 7 | def __init__(self, annotation=None): 8 | 9 | self.id = '' 10 | self.infons = dict() 11 | self.locations = list() 12 | self.text = '' 13 | 14 | if annotation is not None: 15 | self.id = annotation.id 16 | self.infons = annotation.infons 17 | self.locations = annotation.locations 18 | self.text = annotation.text 19 | 20 | def __str__(self): 21 | s = 'id: ' + self.id + '\n' 22 | s +=
str(self.infons) + '\n' 23 | s += 'locations: ' + str(self.locations) + '\n' 24 | s += 'text: ' + self.text + '\n' 25 | 26 | return s 27 | 28 | def clear_locations(self): 29 | self.locations = list() 30 | 31 | def add_location(self, location): 32 | self.locations.append(location) 33 | -------------------------------------------------------------------------------- /src/bioc/bioc_collection.py: -------------------------------------------------------------------------------- 1 | __all__ = ['BioCCollection'] 2 | 3 | from meta import _MetaInfons, _MetaIter 4 | from compat import _Py2Next 5 | 6 | class BioCCollection(_Py2Next, _MetaInfons, _MetaIter): 7 | 8 | def __init__(self, collection=None): 9 | 10 | self.infons = dict() 11 | self.source = '' 12 | self.date = '' 13 | self.key = '' 14 | self.documents = list() 15 | 16 | if collection is not None: 17 | self.infons = collection.infons 18 | self.source = collection.source 19 | self.date = collection.date 20 | self.key = collection.key 21 | self.documents = collection.documents 22 | 23 | def __str__(self): 24 | s = 'source: ' + self.source + '\n' 25 | s += 'date: ' + self.date + '\n' 26 | s += 'key: ' + self.key + '\n' 27 | s += str(self.infons) + '\n' 28 | s += str(self.documents) + '\n' 29 | 30 | return s 31 | 32 | def _iterdata(self): 33 | return self.documents 34 | 35 | def clear_documents(self): 36 | self.documents = list() 37 | 38 | def get_document(self, doc_idx): 39 | return self.documents[doc_idx] 40 | 41 | def add_document(self, document): 42 | self.documents.append(document) 43 | 44 | def remove_document(self, document): 45 | if type(document) is int: 46 | self.documents.remove(self.documents[document]) 47 | else: 48 | self.documents.remove(document) # TBC 49 | -------------------------------------------------------------------------------- /src/bioc/bioc_document.py: -------------------------------------------------------------------------------- 1 | __all__ = ['BioCDocument'] 2 | 3 | from compat import
_Py2Next 4 | from meta import _MetaId, _MetaInfons, _MetaRelations, _MetaIter 5 | 6 | class BioCDocument(_MetaId, _MetaInfons, _MetaRelations, _MetaIter, 7 | _Py2Next): 8 | 9 | def __init__(self, document=None): 10 | 11 | self.id = '' 12 | self.infons = dict() 13 | self.relations = list() 14 | self.passages = list() 15 | 16 | if document is not None: 17 | self.id = document.id 18 | self.infons = document.infons 19 | self.relations = document.relations 20 | self.passages = document.passages 21 | 22 | def __str__(self): 23 | s = 'id: ' + self.id + '\n' 24 | s += 'infon: ' + str(self.infons) + '\n' 25 | s += str(self.passages) + '\n' 26 | s += 'relation: ' + str(self.relations) + '\n' 27 | 28 | return s 29 | 30 | def _iterdata(self): 31 | return self.passages 32 | 33 | def get_size(self): 34 | return len(self.passages) # Equivalent of size() in Java BioC 35 | 36 | def clear_passages(self): 37 | self.passages = list() 38 | 39 | def add_passage(self, passage): 40 | self.passages.append(passage) 41 | 42 | def remove_passage(self, passage): 43 | if type(passage) is int: 44 | self.passages.remove(self.passages[passage]) 45 | else: 46 | self.passages.remove(passage) # TBC 47 | -------------------------------------------------------------------------------- /src/bioc/bioc_location.py: -------------------------------------------------------------------------------- 1 | __all__ = ['BioCLocation'] 2 | 3 | from meta import _MetaOffset 4 | 5 | class BioCLocation(_MetaOffset): 6 | 7 | def __init__(self, location=None): 8 | 9 | self.offset = '-1' 10 | self.length = '0' 11 | 12 | if location is not None: 13 | self.offset = location.offset 14 | self.length = location.length 15 | 16 | def __str__(self): 17 | s = str(self.offset) + ':' + str(self.length) 18 | 19 | return s 20 | -------------------------------------------------------------------------------- /src/bioc/bioc_node.py: -------------------------------------------------------------------------------- 1 | __all__ = ['BioCNode'] 2 | 3 | class
BioCNode: 4 | 5 | def __init__(self, node=None, refid=None, role=None): 6 | 7 | self.refid = '' 8 | self.role = '' 9 | 10 | # Use arg ``node'' if set 11 | if node is not None: 12 | self.refid = node.refid 13 | self.role = node.role 14 | # Use remaining optional args only if both are set 15 | elif (refid is not None) and (role is not None): 16 | self.refid = refid 17 | self.role = role 18 | 19 | def __str__(self): 20 | s = 'refid: ' + self.refid + '\n' 21 | s += 'role: ' + self.role + '\n' 22 | 23 | return s 24 | -------------------------------------------------------------------------------- /src/bioc/bioc_passage.py: -------------------------------------------------------------------------------- 1 | __all__ = ['BioCPassage'] 2 | 3 | from meta import _MetaAnnotations, _MetaInfons, _MetaOffset, \ 4 | _MetaRelations, _MetaText 5 | 6 | class BioCPassage(_MetaAnnotations, _MetaOffset, _MetaText, 7 | _MetaRelations, _MetaInfons): 8 | 9 | def __init__(self, passage=None): 10 | 11 | self.offset = '-1' 12 | self.text = '' 13 | self.infons = dict() 14 | self.sentences = list() 15 | self.annotations = list() 16 | self.relations = list() 17 | 18 | if passage is not None: 19 | self.offset = passage.offset 20 | self.text = passage.text 21 | self.infons = passage.infons 22 | self.sentences = passage.sentences 23 | self.annotations = passage.annotations 24 | self.relations = passage.relations 25 | 26 | def size(self): 27 | return len(self.sentences) 28 | 29 | def has_sentences(self): 30 | # Always return a bool, also for an empty sentence list 31 | return len(self.sentences) > 0 32 | 33 | def add_sentence(self, sentence): 34 | self.sentences.append(sentence) 35 | 36 | def sentences_iterator(self): 37 | return iter(self.sentences) 38 | 39 | def clear_sentences(self): 40 | self.sentences = list() 41 | 42 | def remove_sentence(self, sentence): # int or obj 43 | if type(sentence) is int: 44 | self.sentences.remove(self.sentences[sentence]) 45 | else: 46 | self.sentences.remove(sentence) 47 |
-------------------------------------------------------------------------------- /src/bioc/bioc_reader.py: -------------------------------------------------------------------------------- 1 | __all__ = ['BioCReader'] 2 | 3 | import StringIO 4 | 5 | from lxml import etree 6 | 7 | from bioc_annotation import BioCAnnotation 8 | from bioc_collection import BioCCollection 9 | from bioc_document import BioCDocument 10 | from bioc_location import BioCLocation 11 | from bioc_passage import BioCPassage 12 | from bioc_sentence import BioCSentence 13 | from bioc_node import BioCNode 14 | from bioc_relation import BioCRelation 15 | 16 | class BioCReader: 17 | """ 18 | This class can be used to store BioC XML files in PyBioC objects, 19 | for further manipulation. 20 | """ 21 | 22 | def __init__(self, source, dtd_valid_file=None): 23 | """ 24 | source: File path to a BioC XML input document. 25 | dtd_valid_file: File path to a BioC.dtd file. Using this 26 | optional argument ensures DTD validation. 27 | """ 28 | 29 | self.source = source 30 | self.collection = BioCCollection() 31 | self.xml_tree = etree.parse(source) 32 | 33 | if dtd_valid_file is not None: 34 | dtd = etree.DTD(dtd_valid_file) 35 | if dtd.validate(self.xml_tree) is False: 36 | raise(Exception(dtd.error_log.filter_from_errors()[0])) 37 | 38 | def read(self): 39 | """ 40 | Invoke this method in order to read in the file provided by 41 | the source class variable. Only after this method has been 42 | called the BioCReader object gets populated. 
43 | """ 44 | self._read_collection() 45 | 46 | def _read_collection(self): 47 | collection_elem = self.xml_tree.xpath('/collection')[0] 48 | 49 | self.collection.source = collection_elem.xpath('source')[0].text 50 | self.collection.date = collection_elem.xpath('date')[0].text 51 | self.collection.key = collection_elem.xpath('key')[0].text 52 | 53 | infon_elem_list = collection_elem.xpath('infon') 54 | document_elem_list = collection_elem.xpath('document') 55 | 56 | self._read_infons(infon_elem_list, self.collection) 57 | self._read_documents(document_elem_list) 58 | 59 | 60 | def _read_infons(self, infon_elem_list, infons_parent_elem): 61 | for infon_elem in infon_elem_list: 62 | infons_parent_elem.put_infon(self._get_infon_key(infon_elem), 63 | infon_elem.text) 64 | 65 | def _read_documents(self, document_elem_list): 66 | for document_elem in document_elem_list: 67 | document = BioCDocument() 68 | document.id = document_elem.xpath('id')[0].text 69 | self._read_infons(document_elem.xpath('infon'), document) 70 | self._read_passages(document_elem.xpath('passage'), 71 | document) 72 | self._read_relations(document_elem.xpath('relation'), 73 | document) 74 | 75 | self.collection.add_document(document) 76 | 77 | def _read_passages(self, passage_elem_list, document_parent_elem): 78 | for passage_elem in passage_elem_list: 79 | passage = BioCPassage() 80 | self._read_infons(passage_elem.xpath('infon'), passage) 81 | passage.offset = passage_elem.xpath('offset')[0].text 82 | 83 | # Does this passage contain sentence elements? 84 | if len(passage_elem.xpath('sentence')) > 0: 85 | self._read_sentences(passage_elem.xpath('sentence'), 86 | passage) 87 | else: 88 | # Is the (optional) text element available?
89 | try: 90 | passage.text = passage_elem.xpath('text')[0].text 91 | except IndexError: 92 | pass 93 | self._read_annotations(passage_elem.xpath('annotation'), 94 | passage) 95 | 96 | self._read_relations(passage_elem.xpath('relation'), 97 | passage) 98 | 99 | document_parent_elem.add_passage(passage) 100 | 101 | def _read_sentences(self, sentence_elem_list, passage_parent_elem): 102 | for sentence_elem in sentence_elem_list: 103 | sentence = BioCSentence() 104 | self._read_infons(sentence_elem.xpath('infon'), sentence) 105 | sentence.offset = sentence_elem.xpath('offset')[0].text 106 | sentence.text = sentence_elem.xpath('text')[0].text 107 | self._read_annotations(sentence_elem.xpath('annotation'), 108 | sentence) 109 | self._read_relations(sentence_elem.xpath('relation'), 110 | sentence) 111 | 112 | passage_parent_elem.add_sentence(sentence) 113 | 114 | def _read_annotations(self, annotation_elem_list, 115 | annotations_parent_elem): 116 | for annotation_elem in annotation_elem_list: 117 | annotation = BioCAnnotation() 118 | # Attribute id is just #IMPLIED, not #REQUIRED 119 | if 'id' in annotation_elem.attrib: 120 | annotation.id = annotation_elem.attrib['id'] 121 | self._read_infons(annotation_elem.xpath('infon'), 122 | annotation) 123 | 124 | for location_elem in annotation_elem.xpath('location'): 125 | location = BioCLocation() 126 | location.offset = location_elem.attrib['offset'] 127 | location.length = location_elem.attrib['length'] 128 | 129 | annotation.add_location(location) 130 | 131 | annotation.text = annotation_elem.xpath('text')[0].text 132 | 133 | annotations_parent_elem.add_annotation(annotation) 134 | 135 | def _read_relations(self, relation_elem_list, relations_parent_elem): 136 | for relation_elem in relation_elem_list: 137 | relation = BioCRelation() 138 | # Attribute id is just #IMPLIED, not #REQUIRED 139 | if 'id' in relation_elem.attrib: 140 | relation.id = relation_elem.attrib['id'] 141 | self._read_infons(relation_elem.xpath('infon'), relation)
142 | 143 | for node_elem in relation_elem.xpath('node'): 144 | node = BioCNode() 145 | node.refid = node_elem.attrib['refid'] 146 | node.role = node_elem.attrib['role'] 147 | 148 | relation.add_node(node) 149 | 150 | relations_parent_elem.add_relation(relation) 151 | 152 | def _get_infon_key(self, elem): 153 | return elem.attrib['key'] 154 | -------------------------------------------------------------------------------- /src/bioc/bioc_relation.py: -------------------------------------------------------------------------------- 1 | __all__ = ['BioCRelation'] 2 | 3 | from compat import _Py2Next 4 | from meta import _MetaId, _MetaInfons, _MetaIter 5 | from bioc_node import BioCNode 6 | 7 | class BioCRelation(_MetaId, _MetaInfons, _Py2Next, _MetaIter): 8 | 9 | def __init__(self, relation=None): 10 | 11 | self.id = '' 12 | self.nodes = list() 13 | self.infons = dict() 14 | 15 | if relation is not None: 16 | self.id = relation.id 17 | self.nodes = relation.nodes 18 | self.infons = relation.infons 19 | 20 | def __str__(self): 21 | s = 'id: ' + self.id + '\n' 22 | s += 'infons: ' + str(self.infons) + '\n' 23 | s += 'nodes: ' + str(self.nodes) + '\n' 24 | 25 | return s 26 | 27 | def _iterdata(self): 28 | return self.nodes 29 | 30 | def add_node(self, node=None, refid=None, role=None): 31 | # Build a new node from refid and role if both are provided 32 | if (refid is not None) and (role is not None): 33 | self.nodes.append(BioCNode(refid=refid, role=role)) 34 | else: # Otherwise append the given node object 35 | self.nodes.append(node) 36 | -------------------------------------------------------------------------------- /src/bioc/bioc_sentence.py: -------------------------------------------------------------------------------- 1 | __all__ = ['BioCSentence'] 2 | 3 | 4 | from meta import _MetaAnnotations, _MetaInfons, _MetaOffset, \ 5 | _MetaRelations, _MetaText 6 | 7 | 8 | class BioCSentence(_MetaAnnotations, _MetaInfons, _MetaOffset, 9 | _MetaRelations, _MetaText): 10 | 11 | def __init__(self,
sentence=None): 12 | 13 | self.offset = '-1' 14 | self.text = '' 15 | self.infons = dict() 16 | self.annotations = list() 17 | self.relations = list() 18 | 19 | if sentence is not None: 20 | self.offset = sentence.offset 21 | self.text = sentence.text 22 | self.infons = sentence.infons 23 | self.annotations = sentence.annotations 24 | self.relations = sentence.relations 25 | 26 | def __str__(self): 27 | s = 'offset: ' + str(self.offset) + '\n' 28 | s += 'infons: ' + str(self.infons) + '\n' # TBD 29 | s += 'text: ' + str(self.text) + '\n' # TBD 30 | s += str(self.annotations) + '\n' # TBD 31 | s += str(self.relations) + '\n' # TBD 32 | 33 | return s 34 | -------------------------------------------------------------------------------- /src/bioc/bioc_writer.py: -------------------------------------------------------------------------------- 1 | __all__ = ['BioCWriter'] 2 | 3 | from lxml.builder import E 4 | from lxml.etree import tostring 5 | 6 | class BioCWriter: 7 | 8 | def __init__(self, filename=None, collection=None): 9 | 10 | self.root_tree = None 11 | 12 | self.collection = None 13 | self.doctype = '''''' 14 | self.doctype += '''''' 15 | self.filename = filename 16 | 17 | if collection is not None: 18 | self.collection = collection 19 | 20 | if filename is not None: 21 | self.filename = filename 22 | 23 | def __str__(self): 24 | """ A BioCWriter object can be printed as string. 25 | """ 26 | self._check_for_data() 27 | 28 | self.build() 29 | s = tostring(self.root_tree, 30 | pretty_print=True, 31 | doctype=self.doctype) 32 | 33 | return s 34 | 35 | def _check_for_data(self): 36 | if self.collection is None: 37 | raise(Exception('No data available.')) 38 | 39 | def write(self, filename=None): 40 | """ Use this method to write the data in the PyBioC objects 41 | to disk. 42 | 43 | filename: Output file path (optional argument; filename 44 | provided by __init__ used otherwise.) 
45 | """ 46 | if filename is not None: 47 | self.filename = filename 48 | 49 | if self.filename is None: 50 | raise(Exception('No output file path provided.')) 51 | 52 | with open(self.filename, 'w') as f: # Closes the file after writing 53 | f.write(str(self)) 54 | 55 | def build(self): 56 | self._build_collection() 57 | 58 | def _build_collection(self): 59 | self.root_tree = E('collection', 60 | E('source'), E('date'), E('key')) 61 | self.root_tree.xpath('source')[0].text = self.collection.source 62 | self.root_tree.xpath('date')[0].text = self.collection.date 63 | self.root_tree.xpath('key')[0].text = self.collection.key 64 | collection_elem = self.root_tree.xpath('/collection')[0] 65 | # infon* 66 | self._build_infons(self.collection.infons, collection_elem) 67 | # document+ 68 | self._build_documents(self.collection.documents, 69 | collection_elem) 70 | 71 | def _build_infons(self, infons_dict, infons_parent_elem): 72 | for infon_key, infon_val in infons_dict.items(): 73 | infons_parent_elem.append(E('infon')) 74 | infon_elem = infons_parent_elem.xpath('infon')[-1] 75 | 76 | infon_elem.attrib['key'] = infon_key 77 | infon_elem.text = infon_val 78 | 79 | def _build_documents(self, documents_list, collection_parent_elem): 80 | for document in documents_list: 81 | collection_parent_elem.append(E('document', E('id'))) 82 | document_elem = collection_parent_elem.xpath('document')[-1] 83 | # id 84 | id_elem = document_elem.xpath('id')[0] 85 | id_elem.text = document.id 86 | # infon* 87 | self._build_infons(document.infons, document_elem) 88 | # passage+ 89 | self._build_passages(document.passages, document_elem) 90 | # relation* 91 | self._build_relations(document.relations, document_elem) 92 | 93 | def _build_passages(self, passages_list, document_parent_elem): 94 | for passage in passages_list: 95 | document_parent_elem.append(E('passage')) 96 | passage_elem = document_parent_elem.xpath('passage')[-1] 97 | # infon* 98 | self._build_infons(passage.infons, passage_elem) 99 | # offset 100 |
passage_elem.append(E('offset')) 101 | passage_elem.xpath('offset')[0].text = passage.offset 102 | if passage.has_sentences(): 103 | # sentence* 104 | self._build_sentences(passage.sentences, passage_elem) 105 | else: 106 | # text?, annotation* 107 | passage_elem.append(E('text')) 108 | passage_elem.xpath('text')[0].text = passage.text 109 | self._build_annotations(passage.annotations, 110 | passage_elem) 111 | # relation* 112 | self._build_relations(passage.relations, passage_elem) 113 | 114 | def _build_relations(self, relations_list, relations_parent_elem): 115 | for relation in relations_list: 116 | relations_parent_elem.append(E('relation')) 117 | relation_elem = relations_parent_elem.xpath('relation')[-1] 118 | # infon* 119 | self._build_infons(relation.infons, relation_elem) 120 | # node* 121 | for node in relation.nodes: 122 | relation_elem.append(E('node')) 123 | node_elem = relation_elem.xpath('node')[-1] 124 | node_elem.attrib['refid'] = node.refid 125 | node_elem.attrib['role'] = node.role 126 | # id (just #IMPLIED) 127 | if len(relation.id) > 0: 128 | relation_elem.attrib['id'] = relation.id 129 | 130 | def _build_annotations(self, annotations_list, 131 | annotations_parent_elem): 132 | for annotation in annotations_list: 133 | annotations_parent_elem.append(E('annotation')) 134 | annotation_elem = \ 135 | annotations_parent_elem.xpath('annotation')[-1] 136 | # infon* 137 | self._build_infons(annotation.infons, annotation_elem) 138 | # location* 139 | for location in annotation.locations: 140 | annotation_elem.append(E('location')) 141 | location_elem = annotation_elem.xpath('location')[-1] 142 | location_elem.attrib['offset'] = location.offset 143 | location_elem.attrib['length'] = location.length 144 | # text 145 | annotation_elem.append(E('text')) 146 | text_elem = annotation_elem.xpath('text')[0] 147 | text_elem.text = annotation.text 148 | # id (just #IMPLIED) 149 | if len(annotation.id) > 0: 150 | annotation_elem.attrib['id'] = annotation.id 151 
| 152 | def _build_sentences(self, sentences_list, passage_parent_elem): 153 | for sentence in sentences_list: 154 | passage_parent_elem.append(E('sentence')) 155 | sentence_elem = passage_parent_elem.xpath('sentence')[-1] 156 | # infon* 157 | self._build_infons(sentence.infons, sentence_elem) 158 | # offset 159 | sentence_elem.append(E('offset')) 160 | offset_elem = sentence_elem.xpath('offset')[0] 161 | offset_elem.text = sentence.offset 162 | # text? 163 | if len(sentence.text) > 0: 164 | sentence_elem.append(E('text')) 165 | text_elem = sentence_elem.xpath('text')[0] 166 | text_elem.text = sentence.text 167 | # annotation* 168 | self._build_annotations(sentence.annotations, sentence_elem) 169 | # relation* 170 | self._build_relations(sentence.relations, sentence_elem) 171 | -------------------------------------------------------------------------------- /src/bioc/compat/__init__.py: -------------------------------------------------------------------------------- 1 | __all__ = [] 2 | 3 | __author__ = 'Hernani Marques (h2m@access.uzh.ch)' 4 | 5 | from _py2_next import _Py2Next 6 | -------------------------------------------------------------------------------- /src/bioc/compat/_py2_next.py: -------------------------------------------------------------------------------- 1 | __all__ = [] 2 | 3 | class _Py2Next: 4 | def __next__(self): 5 | return self.next() 6 | -------------------------------------------------------------------------------- /src/bioc/meta/__init__.py: -------------------------------------------------------------------------------- 1 | __all__ = [] 2 | 3 | __author__ = 'Hernani Marques (h2m@access.uzh.ch)' 4 | 5 | from _bioc_meta import _MetaAnnotations, _MetaInfons, _MetaOffset, \ 6 | _MetaRelations, _MetaText, _MetaId 7 | from _iter import _MetaIter 8 | -------------------------------------------------------------------------------- /src/bioc/meta/_bioc_meta.py: -------------------------------------------------------------------------------- 1 | __all__
= [] 2 | 3 | class _MetaAnnotations: 4 | annotations = list() 5 | 6 | def annotation_iterator(self): 7 | return iter(self.annotations) 8 | 9 | def clear_annotations(self): 10 | self.annotations = list() 11 | 12 | def add_annotation(self, annotation): 13 | self.annotations.append(annotation) 14 | 15 | def remove_annotation(self, annotation): # Can be int or obj 16 | if type(annotation) is int: 17 | self.annotations.remove(self.annotations[annotation]) 18 | else: 19 | self.annotations.remove(annotation) # TBC 20 | 21 | class _MetaInfons: 22 | infons = dict() 23 | 24 | def put_infon(self, key, val): 25 | self.infons[key] = val 26 | 27 | def remove_infon(self, key): 28 | del(self.infons[key]) 29 | 30 | def clear_infons(self): 31 | self.infons = dict() 32 | 33 | class _MetaOffset: 34 | offset = '-1' 35 | 36 | class _MetaRelations: 37 | relations = list() 38 | 39 | def relation_iterator(self): 40 | return iter(self.relations) 41 | 42 | def clear_relations(self): 43 | self.relations = list() 44 | 45 | def add_relation(self, relation): 46 | self.relations.append(relation) 47 | 48 | def remove_relation(self, relation): # Can be int or obj 49 | if type(relation) is int: 50 | self.relations.remove(self.relations[relation]) 51 | else: 52 | self.relations.remove(relation) # TBC 53 | 54 | class _MetaText: 55 | text = '' 56 | 57 | class _MetaId: 58 | id = '' 59 | -------------------------------------------------------------------------------- /src/bioc/meta/_iter.py: -------------------------------------------------------------------------------- 1 | __all__ = [] 2 | 3 | class _MetaIter: 4 | 5 | def __iter__(self): 6 | return self._iterdata().__iter__() 7 | -------------------------------------------------------------------------------- /src/stemmer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # h2m@access.uzh.ch 4 | 5 | from os import curdir, sep 6 | import sys
7 | 8 | from nltk.tokenize import wordpunct_tokenize 9 | from nltk import PorterStemmer 10 | 11 | from bioc import BioCAnnotation 12 | from bioc import BioCReader 13 | from bioc import BioCWriter 14 | 15 | BIOC_IN = '..' + sep + 'test_input' + sep + 'example_input.xml' 16 | BIOC_OUT = 'example_input_stemmed.xml' 17 | DTD_FILE = '..' + sep + 'BioC.dtd' 18 | 19 | def main(): 20 | # Use file defined by BIOC_IN as default if no other provided 21 | bioc_in = BIOC_IN 22 | if len(sys.argv) >= 2: 23 | bioc_in = sys.argv[1] 24 | 25 | # A BioCReader object is put in place to hold the example BioC XML 26 | # document 27 | bioc_reader = BioCReader(bioc_in, dtd_valid_file=DTD_FILE) 28 | 29 | # A BioCWriter object is prepared to write out the annotated data 30 | bioc_writer = BioCWriter(BIOC_OUT) 31 | 32 | # The NLTK Porter stemmer is used for stemming 33 | stemmer = PorterStemmer() 34 | 35 | # The example input file given above (by BIOC_IN) is fed into 36 | # a BioCReader object; validation is done by the BioC DTD 37 | bioc_reader.read() 38 | 39 | # Pass over basic data 40 | bioc_writer.collection = bioc_reader.collection 41 | 42 | # Get documents to manipulate 43 | documents = bioc_writer.collection.documents 44 | 45 | # Go through each document 46 | annotation_id = 0 47 | for document in documents: 48 | 49 | # Go through each passage of the document 50 | for passage in document: 51 | # Stem all the tokens found 52 | stems = [stemmer.stem(token) for 53 | token in wordpunct_tokenize(passage.text)] 54 | # Add an annotation showing the stemmed version, in the 55 | # given order 56 | for stem in stems: 57 | annotation_id += 1 58 | 59 | # For each token an annotation is created, providing 60 | # the surface form of a 'stemmed token'. 61 | # (The annotations are collectively added following 62 | # a document passage's text element.)
63 | bioc_annotation = BioCAnnotation() 64 | bioc_annotation.text = stem 65 | bioc_annotation.id = str(annotation_id) 66 | bioc_annotation.put_infon('surface form', 67 | 'stemmed token') 68 | passage.add_annotation(bioc_annotation) 69 | 70 | # Print file to screen w/o trailing newline 71 | # (Can be redirected into a file, e.g. output_bioc.xml) 72 | sys.stdout.write(str(bioc_writer)) 73 | 74 | # Write to disk 75 | bioc_writer.write() 76 | 77 | if __name__ == '__main__': 78 | main() 79 | -------------------------------------------------------------------------------- /src/test_read+write.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from bioc import BioCReader 4 | from bioc import BioCWriter 5 | 6 | test_file = '../test_input/bcIVLearningCorpus.xml' 7 | dtd_file = '../BioC.dtd' 8 | 9 | def main(): 10 | bioc_reader = BioCReader(test_file, dtd_valid_file=dtd_file) 11 | bioc_reader.read() 12 | ''' 13 | sentences = bioc_reader.collection.documents[0].passages[0].sentences 14 | for sentence in sentences: 15 | print sentence.offset 16 | ''' 17 | 18 | bioc_writer = BioCWriter('output_bioc.xml') 19 | bioc_writer.collection = bioc_reader.collection 20 | bioc_writer.write() 21 | print(bioc_writer) 22 | 23 | if __name__ == '__main__': 24 | main() 25 | -------------------------------------------------------------------------------- /test_input/PMID-8557975-simplified-sentences-tokens.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | PubMed 5 | 20130316 6 | PMID-8557975-simplified-sentences-tokens.key 7 | 8 | 8557975 9 | 10 | abstract 11 | 0 12 | 13 | original sentence 14 | 70 15 | 16 | token 17 | 18 | Active 19 | 20 | 21 | token 22 | 23 | Raf-1 24 | 25 | 26 | token 27 | 28 | phosphorylates 29 | 30 | 31 | token 32 | 33 | and 34 | 35 | 36 | token 37 | 38 | activates 39 | 40 | 41 | token 42 | 43 | the 44 | 45 | 46 | token 47 | 48 |
mitogen-activated 49 | 50 | 51 | token 52 | 53 | protein 54 | 55 | 56 | token 57 | 58 | ( 59 | 60 | 61 | token 62 | 63 | MAP 64 | 65 | 66 | token 67 | 68 | ) 69 | 70 | 71 | token 72 | 73 | kinase/extracellular 74 | 75 | 76 | token 77 | 78 | signal-regulated 79 | 80 | 81 | token 82 | 83 | kinase 84 | 85 | 86 | token 87 | 88 | kinase 89 | 90 | 91 | token 92 | 93 | 1 94 | 95 | 96 | token 97 | 98 | ( 99 | 100 | 101 | token 102 | 103 | MEK1 104 | 105 | 106 | token 107 | 108 | ) 109 | 110 | 111 | token 112 | 113 | , 114 | 115 | 116 | token 117 | 118 | which 119 | 120 | 121 | token 122 | 123 | in 124 | 125 | 126 | token 127 | 128 | turn 129 | 130 | 131 | token 132 | 133 | phosphorylates 134 | 135 | 136 | token 137 | 138 | and 139 | 140 | 141 | token 142 | 143 | activates 144 | 145 | 146 | token 147 | 148 | the 149 | 150 | 151 | token 152 | 153 | MAP 154 | 155 | 156 | token 157 | 158 | kinases/extracellular 159 | 160 | 161 | token 162 | 163 | signal 164 | 165 | 166 | token 167 | 168 | regulated 169 | 170 | 171 | token 172 | 173 | kinases 174 | 175 | 176 | token 177 | 178 | , 179 | 180 | 181 | token 182 | 183 | ERK1 184 | 185 | 186 | token 187 | 188 | and 189 | 190 | 191 | token 192 | 193 | ERK2 194 | 195 | 196 | token 197 | 198 | . 199 | 200 | 201 | 202 | simplified sentence 203 | 325 204 | 205 | token 206 | 207 | Active 208 | 209 | 210 | token 211 | 212 | Raf-1 213 | 214 | 215 | token 216 | 217 | phosphorylates 218 | 219 | 220 | token 221 | 222 | MEK1 223 | 224 | 225 | token 226 | 227 | . 228 | 229 | 230 | 231 | simplified sentence 232 | 360 233 | 234 | token 235 | 236 | Active 237 | 238 | 239 | token 240 | 241 | Raf-1 242 | 243 | 244 | token 245 | 246 | activates 247 | 248 | 249 | token 250 | 251 | MEK1 252 | 253 | 254 | token 255 | 256 | . 
257 | 258 | 259 | 260 | simplified sentence 261 | 390 262 | 263 | token 264 | 265 | MEK1 266 | 267 | 268 | token 269 | 270 | in 271 | 272 | 273 | token 274 | 275 | turn 276 | 277 | 278 | token 279 | 280 | phosphorylates 281 | 282 | 283 | token 284 | 285 | ERK1 286 | 287 | 288 | token 289 | 290 | . 291 | 292 | 293 | 294 | simplified sentence 295 | 425 296 | 297 | token 298 | 299 | MEK1 300 | 301 | 302 | token 303 | 304 | in 305 | 306 | 307 | token 308 | 309 | turn 310 | 311 | 312 | token 313 | 314 | phosphorylates 315 | 316 | 317 | token 318 | 319 | ERK2 320 | 321 | 322 | token 323 | 324 | . 325 | 326 | 327 | 328 | simplified sentence 329 | 460 330 | 331 | token 332 | 333 | MEK1 334 | 335 | 336 | token 337 | 338 | in 339 | 340 | 341 | token 342 | 343 | turn 344 | 345 | 346 | token 347 | 348 | activates 349 | 350 | 351 | token 352 | 353 | ERK1 354 | 355 | 356 | token 357 | 358 | . 359 | 360 | 361 | 362 | simplified sentence 363 | 489 364 | 365 | token 366 | 367 | MEK1 368 | 369 | 370 | token 371 | 372 | in 373 | 374 | 375 | token 376 | 377 | turn 378 | 379 | 380 | token 381 | 382 | activates 383 | 384 | 385 | token 386 | 387 | ERK2 388 | 389 | 390 | token 391 | 392 | . 
393 | 394 | 395 | 396 | 397 | 398 | equ 399 | 400 | 401 | 402 | 403 | 404 | 405 | equ 406 | 407 | 408 | 409 | 410 | 411 | 412 | equ 413 | 414 | 415 | 416 | 417 | 418 | equ 419 | 420 | 421 | 422 | 423 | 424 | 425 | 426 | 427 | 428 | 429 | equ 430 | 431 | 432 | 433 | 434 | 435 | 436 | 437 | 438 | 439 | 440 | equ 441 | 442 | 443 | 444 | 445 | 446 | equ 447 | 448 | 449 | 450 | 451 | 452 | 453 | 454 | 455 | equ 456 | 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | equ 465 | 466 | 467 | 468 | 469 | 470 | 471 | equ 472 | 473 | 474 | 475 | 476 | 477 | 478 | equ 479 | 480 | 481 | 482 | 483 | 484 | 485 | equ 486 | 487 | 488 | 489 | 490 | 491 | 492 | 493 | -------------------------------------------------------------------------------- /test_input/PMID-8557975-simplified-sentences.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | PubMed 5 | 20130316 6 | PMID-8557975-simplified-sentences.key 7 | 8 | 8557975 9 | 10 | abstract 11 | 0 12 | 13 | original sentence 14 | 70 15 | Active Raf-1 phosphorylates and activates the mitogen-activated protein (MAP) kinase/extracellular signal-regulated kinase kinase 1 (MEK1), which in turn phosphorylates and activates the MAP kinases/extracellular signal regulated kinases, ERK1 and ERK2. 16 | 17 | 18 | simplified sentence 19 | 325 20 | Active Raf-1 phosphorylates MEK1. 21 | 22 | 23 | simplified sentence 24 | 360 25 | Active Raf-1 activates MEK1. 26 | 27 | 28 | simplified sentence 29 | 390 30 | MEK1 in turn phosphorylates ERK1. 31 | 32 | 33 | simplified sentence 34 | 425 35 | MEK1 in turn phosphorylates ERK2. 36 | 37 | 38 | simplified sentence 39 | 460 40 | MEK1 in turn activates ERK1. 41 | 42 | 43 | simplified sentence 44 | 489 45 | MEK1 in turn activates ERK2. 
46 | 47 | 48 | 49 | 50 | -------------------------------------------------------------------------------- /test_input/everything-sentence.xml: -------------------------------------------------------------------------------- 1 | Made up file to test that everything is allowed and processed. Has text in the passage.20130426everything.keycollection-infon-value1document-infon-valuepassage-infon-value0sentence-infon-value0text of sentenceannotation-infon-valueannotation textsentence-relation-infon-valuepassage-relation-infon-valuedocument-relation-infon-value -------------------------------------------------------------------------------- /test_input/everything.xml: -------------------------------------------------------------------------------- 1 | Made up file to test that everything is allowed and processed. Has text in the passage.20130426everything.keycollection-infon-value1document-infon-valuepassage-infon-value0text of passageannotation-infon-valueannotation textpassage-relation-infon-valuedocument-relation-infon-value -------------------------------------------------------------------------------- /test_input/example_input.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | PUBMED 5 | 20130422 6 | ctdBCIVLearningDataSet.key 7 | 8 | 10617681 9 | 10 | title 11 | 0 12 | Possible role of valvular serotonin 5-HT(2B) receptors in the cardiopathy associated with fenfluramine. 13 | 14 | 15 | abstract 16 | 104 17 | Dexfenfluramine was approved in the United States for long-term use as an appetite suppressant until it was reported to be associated with valvular heart disease. The valvular changes (myofibroblast proliferation) are histopathologically indistinguishable from those observed in carcinoid disease or after long-term exposure to 5-hydroxytryptamine (5-HT)(2)-preferring ergot drugs (ergotamine, methysergide). 
5-HT(2) receptor stimulation is known to cause fibroblast mitogenesis, which could contribute to this lesion. To elucidate the mechanism of "fen-phen"-associated valvular lesions, we examined the interaction of fenfluramine and its metabolite norfenfluramine with 5-HT(2) receptor subtypes and examined the expression of these receptors in human and porcine heart valves. Fenfluramine binds weakly to 5-HT(2A), 5-HT(2B), and 5-HT(2C) receptors. In contrast, norfenfluramine exhibited high affinity for 5-HT(2B) and 5-HT(2C) receptors and more moderate affinity for 5-HT(2A) receptors. In cells expressing recombinant 5-HT(2B) receptors, norfenfluramine potently stimulated the hydrolysis of inositol phosphates, increased intracellular Ca(2+), and activated the mitogen-activated protein kinase cascade, the latter of which has been linked to mitogenic actions of the 5-HT(2B) receptor. The level of 5-HT(2B) and 5-HT(2A) receptor transcripts in heart valves was at least 300-fold higher than the levels of 5-HT(2C) receptor transcript, which were barely detectable. We propose that preferential stimulation of valvular 5-HT(2B) receptors by norfenfluramine, ergot drugs, or 5-HT released from carcinoid tumors (with or without accompanying 5-HT(2A) receptor activation) may contribute to valvular fibroplasia in humans. 18 | 19 | 20 | 21 | --------------------------------------------------------------------------------
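Note on passage offsets in example_input.xml: the title passage starts at offset 0 and the abstract at offset 104. The sketch below shows how such offsets relate to text lengths. It uses only the Python standard library, not PyBioC. The element names (collection, document, passage, infon, offset, text) and the infon key "type" are assumptions based on the BioC DTD, since the listing above shows only text values. The abstract text is shortened for brevity, and the helper name passage_spans is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written excerpt in BioC layout. Tag names and the infon key "type"
# are assumptions; the values are taken from example_input.xml, with the
# abstract text shortened to its first clause.
SAMPLE = """<collection>
  <source>PUBMED</source>
  <date>20130422</date>
  <key>ctdBCIVLearningDataSet.key</key>
  <document>
    <id>10617681</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>Possible role of valvular serotonin 5-HT(2B) receptors in the cardiopathy associated with fenfluramine.</text>
    </passage>
    <passage>
      <infon key="type">abstract</infon>
      <offset>104</offset>
      <text>Dexfenfluramine was approved in the United States for long-term use as an appetite suppressant.</text>
    </passage>
  </document>
</collection>"""


def passage_spans(xml_text):
    """Return (type, start, end) for every passage; end is exclusive."""
    root = ET.fromstring(xml_text)
    spans = []
    for passage in root.iter("passage"):
        start = int(passage.findtext("offset"))
        text = passage.findtext("text") or ""
        kind = passage.findtext("infon")  # first infon child of the passage
        spans.append((kind, start, start + len(text)))
    return spans


if __name__ == "__main__":
    for kind, start, end in passage_spans(SAMPLE):
        print(kind, start, end)
```

The title text is 103 characters long, so its span ends at 103 and the abstract begins one character later at 104; that single position is presumably the separator between the two passages.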