├── .gitignore
├── .travis.yml
├── LICENCE.txt
├── README.rst
├── setup.py
├── snapgene_reader
│   ├── __init__.py
│   └── snapgene_reader.py
└── tests
    ├── test_general.py
    └── test_samples
        ├── README.md
        ├── addgene-plasmid-16405-sequence-190476.dna
        ├── pIB2-SEC13-mEGFP.dna
        ├── test_sequence_1.dna
        └── test_with_a_single_feature_ie_no_features_list.dna

/.gitignore:
--------------------------------------------------------------------------------
/.vscode

*.py[cod]

# C extensions
*.so

# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64
__pycache__

# Installer logs
pip-log.txt

# Unit test / coverage reports
.coverage
.tox

nosetests.xml

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

# Temp files

*~

# Pipy codes

.pypirc
.cache

--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
language: python
python:
  - "3.6"

install:
  - pip install -e .
script:
  - python -m pytest

--------------------------------------------------------------------------------
/LICENCE.txt:
--------------------------------------------------------------------------------
The MIT License (MIT)
[OSI Approved License]

The MIT License (MIT)

Copyright (c) 2017 Isaac Luo and Cai Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
SnapGene Reader
===============

.. image:: https://travis-ci.org/Edinburgh-Genome-Foundry/SnapGeneReader.svg?branch=master
   :target: https://travis-ci.org/Edinburgh-Genome-Foundry/SnapGeneReader
   :alt: Travis CI build status

SnapGene Reader is a Python library to parse SnapGene ``*.dna`` files into dictionaries or Biopython SeqRecords:

.. code:: python

    from snapgene_reader import snapgene_file_to_dict, snapgene_file_to_seqrecord

    filepath = './snap_gene_file.dna'
    dictionary = snapgene_file_to_dict(filepath)
    seqrecord = snapgene_file_to_seqrecord(filepath)

Installation
------------

Install with pip:

.. code:: bash

    pip install snapgene_reader

Test with Pytest:

.. code:: bash

    python -m pytest
    # or simply: pytest

Licence = MIT
-------------

SnapGene Reader is open-source software originally written by Isaac Luo at the Cai Lab. This fork is released on GitHub under the MIT licence (© Isaac Luo and Cai Lab) and maintained by the Edinburgh Genome Foundry. Everyone is welcome to contribute.

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
'''
setup.py
install snapgene_reader by pip
'''
from setuptools import setup, find_packages

setup(name='snapgene_reader',
      version='0.1.15',
      author='yishaluo',
      author_email='yishaluo@gmail.com',
      maintainer='Zulko',
      description='Convert Snapgene *.dna files to dict/json/biopython.',
      long_description=open('README.rst').read(),
      license='MIT',
      keywords="DNA sequence design format converter",
      packages=find_packages(),
      install_requires=['biopython', 'xmltodict', 'html2text'])

--------------------------------------------------------------------------------
/snapgene_reader/__init__.py:
--------------------------------------------------------------------------------
"""
snapgene_reader module
usage:
    from snapgene_reader import snapgene_file_to_dict
    obj = snapgene_file_to_dict(filepath='test.dna')
"""
from .snapgene_reader import (snapgene_file_to_dict,
                              snapgene_file_to_seqrecord,
                              snapgene_file_to_gbk)

--------------------------------------------------------------------------------
/snapgene_reader/snapgene_reader.py:
--------------------------------------------------------------------------------
'''
snapgene reader main file
'''
import struct
import json
import xmltodict

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
# NOTE: Bio.Alphabet was removed in Biopython 1.78, so this import requires
# an earlier Biopython release.
from Bio.Alphabet import DNAAlphabet
from Bio.SeqFeature import SeqFeature, FeatureLocation
import html2text

# Shared HTML-to-text converter used to clean up note and qualifier values
HTML_PARSER = html2text.HTML2Text()
HTML_PARSER.ignore_emphasis = True
HTML_PARSER.ignore_links = True
HTML_PARSER.body_width = 0
HTML_PARSER.single_line_break = True


def parse(val):
    '''Convert an HTML string to single-line plain text.

    Non-string values are returned unchanged.
    '''
    if isinstance(val, str):
        return (HTML_PARSER.handle(val)
                .strip()
                .replace("\n", " ")
                .replace('"', "'"))
    else:
        return val

# def parse(val):
#     ss = re.sub(r'<br>', '\n', val)
#     ss = re.sub(r'<.*?>', '', ss)
#     return ss


def parse_dict(obj):
    """Apply ``parse`` in place to every string value of a (nested) dict."""
    if isinstance(obj, dict):
        for key in obj:
            if isinstance(obj[key], str):
                obj[key] = parse(obj[key])
            elif isinstance(obj[key], dict):
                parse_dict(obj[key])
    return obj


def snapgene_file_to_dict(filepath=None, fileobject=None):
    """Return a dictionary containing the data from a ``*.dna`` file.

    Parameters
    ----------
    filepath
      Path to a .dna file created with SnapGene
    fileobject
      A file-like object pointing to the data of a .dna file created with
      SnapGene

    """

    if filepath is not None:
        fileobject = open(filepath, 'rb')

    # The first byte of a SnapGene file is a tab character (0x09)
    if fileobject.read(1) != b'\t':
        raise ValueError('Wrong format for a SnapGene file !')

    def unpack(size, mode):
        """Read ``size`` bytes and decode them as one big-endian value."""
        return struct.unpack('>' + mode, fileobject.read(size))[0]

    # READ THE DOCUMENT PROPERTIES
    length = unpack(4, 'I')
    title = fileobject.read(8).decode('ascii')
    if length != 14 or title != 'SnapGene':
        raise ValueError('Wrong format for a SnapGene file !')

    data = dict(
        isDNA=unpack(2, 'H'),
        exportVersion=unpack(2, 'H'),
        importVersion=unpack(2, 'H'),
        features=[]
    )

    while True:
        # READ THE WHOLE FILE, BLOCK BY BLOCK, UNTIL THE END
        next_byte = fileobject.read(1)

        # next_byte table
        # 0: dna sequence
        # 1: compressed DNA
        # 2: unknown
        # 3: unknown
        # 5: primers
        # 6: notes
        # 7: history tree
        # 8: additional sequence properties segment
        # 9: file Description
        # 10: features
        # 11: history node
        # 13: unknown
        # 16: alignable sequence
        # 17: alignable sequence
        # 18: sequence trace
        # 19: Uracil Positions
        # 20: custom DNA colors

        if next_byte == b'':
            # END OF FILE
            break

        block_size = unpack(4, 'I')

        if ord(next_byte) == 0:
            # READ THE SEQUENCE AND ITS PROPERTIES
            # The first byte of the block is a bit field of properties
            props = unpack(1, 'b')
            data["dna"] = dict(
                topology="circular" if props & 0x01 else "linear",
                strandedness="double" if props & 0x02 > 0 else "single",
                damMethylated=props & 0x04 > 0,
                dcmMethylated=props & 0x08 > 0,
                ecoKIMethylated=props & 0x10 > 0,
                length=block_size - 1
            )
            data["seq"] = fileobject.read(block_size - 1).decode('ascii')

        elif ord(next_byte) == 6:
            # READ THE NOTES
            block_content = fileobject.read(block_size).decode('utf-8')
            note_data = parse_dict(xmltodict.parse(block_content))
            data['notes'] = note_data['Notes']

        elif ord(next_byte) == 10:
            # READ THE FEATURES
            strand_dict = {"0": ".", "1": "+", "2": "-", "3": "="}
            format_dict = {'@text': parse, '@int': int}
            features_data = xmltodict.parse(fileobject.read(block_size))
            features = features_data["Features"]["Feature"]
            # xmltodict returns a single dict (not a list) for a lone element
            if not isinstance(features, list):
                features = [features]
            for feature in features:
                segments = feature["Segment"]
                if not isinstance(segments, list):
                    segments = [segments]
                segments_ranges = [
                    sorted([int(e) for e in segment['@range'].split('-')])
                    for segment in segments
                ]
                qualifiers = feature.get('Q', [])
                if not isinstance(qualifiers, list):
                    qualifiers = [qualifiers]
                parsed_qualifiers = {}
                for qualifier in qualifiers:
                    if qualifier['V'] is None:
                        pass
                    elif isinstance(qualifier['V'], list):
                        if len(qualifier['V'][0].items()) == 1:
                            parsed_qualifiers[qualifier['@name']] = l_v = []
                            for e_v in qualifier['V']:
                                fmt, value = e_v.popitem()
                                fmt = format_dict.get(fmt, parse)
                                l_v.append(fmt(value))
                        else:
                            parsed_qualifiers[qualifier['@name']] = d_v = {}
                            for e_v in qualifier['V']:
                                (fmt1, value1), (_, value2) = e_v.items()
                                fmt = format_dict.get(fmt1, parse)
                                d_v[value2] = fmt(value1)
                    else:
                        fmt, value = qualifier['V'].popitem()
                        fmt = format_dict.get(fmt, parse)
                        parsed_qualifiers[qualifier['@name']] = fmt(value)

                if 'label' not in parsed_qualifiers:
                    parsed_qualifiers['label'] = feature['@name']
                if 'note' not in parsed_qualifiers:
                    parsed_qualifiers['note'] = []
                if not isinstance(parsed_qualifiers['note'], list):
                    parsed_qualifiers['note'] = [parsed_qualifiers['note']]
                color = segments[0]['@color']
                parsed_qualifiers['note'].append("color: " + color)

                data["features"].append(dict(
                    start=min([start - 1 for (start, end) in segments_ranges]),
                    end=max([end for (start, end) in segments_ranges]),
                    strand=strand_dict[feature.get('@directionality', "0")],
                    type=feature['@type'],
                    name=feature['@name'],
                    color=segments[0]['@color'],
                    textColor='black',
                    segments=segments,
                    row=0,
                    isOrf=False,
                    qualifiers=parsed_qualifiers
                ))

        else:
            # WE IGNORE THE WHOLE BLOCK
            fileobject.read(block_size)
            pass

    fileobject.close()

    return data


def snapgene_file_to_seqrecord(filepath=None, fileobject=None):
    """Return a BioPython SeqRecord from the data of a ``*.dna`` file.

    Parameters
    ----------
    filepath
      Path to a .dna file created with SnapGene
    fileobject
      A file-like object pointing to the data of a .dna file created with
      SnapGene
    """
    data = snapgene_file_to_dict(filepath=filepath, fileobject=fileobject)
    strand_dict = {'+': 1, '-': -1, '.': 0}

    return SeqRecord(
        seq=Seq(data['seq'], alphabet=DNAAlphabet()),
        features=[
            SeqFeature(
                location=FeatureLocation(
                    start=feature['start'],
                    end=feature['end'],
                    strand=strand_dict[feature['strand']]
                ),
                strand=strand_dict[feature['strand']],
                type=feature['type'],
                qualifiers=feature['qualifiers']
            )
            for feature in data['features']
        ],
        annotations=dict(data['notes'])
    )


def snapgene_file_to_gbk(read_file_object, write_file_object):
    '''
    Read a SnapGene file object and write its content in GenBank format
    to ``write_file_object``.
    '''
    def analyse_gs(dic, *args, **kwargs):
        '''Follow the keys in ``args`` through nested dicts.

        Return the ``default`` keyword (None if not given) when a key is
        missing.
        '''
        if 'default' not in kwargs:
            kwargs['default'] = None

        for arg in args:
            if arg in dic:
                dic = dic[arg]
            else:
                return kwargs['default']
        return dic


    data = snapgene_file_to_dict(fileobject=read_file_object)
    wfo = write_file_object
    wfo.write(
        ('LOCUS Exported {0:>6} bp ds-DNA {1:>8} SYN \
15-APR-2012\n').format(len(data['seq']), data['dna']['topology']))
    definition = analyse_gs(data, 'notes', 'Description',
                            default='.').replace('\n', '\n ')
    wfo.write('DEFINITION {}\n'.format(definition))
    wfo.write('ACCESSION .\n')
    wfo.write('VERSION .\n')
    wfo.write('KEYWORDS {}\n'.format(
        analyse_gs(data, 'notes', 'CustomMapLabel', default='.')))
    wfo.write('SOURCE .\n')
    wfo.write(' ORGANISM .\n')

    references = analyse_gs(data, 'notes', 'References')

    reference_count = 0
    if references:
        for key in references:
            reference_count += 1
            ref = references[key]
            wfo.write('REFERENCE {} (bases 1 to {} )\n'.format(
                reference_count, analyse_gs(data, 'dna', 'length')))
            for key2 in ref:
                gb_key = key2.replace('@', '').upper()
                wfo.write(' {} {}\n'.format(gb_key, ref[key2]))

    # generate special reference
    reference_count += 1
    wfo.write('REFERENCE {} (bases 1 to {} )\n'.format(
        reference_count, analyse_gs(data, 'dna', 'length')))
    wfo.write(' AUTHORS IssacLuo\'s SnapGeneReader\n')
    wfo.write(' TITLE Direct Submission\n')
    wfo.write((' JOURNAL Exported Monday, Nov 20, 2017 from SnapGene File\
 Reader\n'))
    wfo.write(' https://github.com/IsaacLuo/SnapGeneFileReader\n')

    wfo.write('COMMENT {}\n'.format(
        analyse_gs(data, 'notes', 'Comments', default='.').replace(
            '\n', '\n ').replace('\\', '')))
    wfo.write('FEATURES Location/Qualifiers\n')

    features = analyse_gs(data, 'features')
    for feature in features:
        strand = analyse_gs(feature, 'strand', default='')

        # only "standard" segments contribute to the feature location
        segments = analyse_gs(feature, 'segments', default=[])
        segments = [x for x in segments if x['@type'] == 'standard']
        if len(segments) > 1:
            line = 'join('
            for segment in segments:
                segment_range = analyse_gs(segment, '@range').replace('-', '..')
                if analyse_gs(segment, '@type') == 'standard':
                    line += segment_range
                    line += ','
            line = line[:-1] + ')'
        else:
            line = '{}..{}'.format(
                analyse_gs(feature, 'start', default=' '),
                analyse_gs(feature, 'end', default=' ')
            )

        if strand == '-':
            wfo.write(' {} complement({})\n'.format(
                analyse_gs(feature, 'type', default=' ').ljust(15),
                line,
            ))
        else:
            wfo.write(' {} {}\n'.format(
                analyse_gs(feature, 'type', default=' ').ljust(15),
                line,
            ))
        strand = analyse_gs(feature, 'strand', default='')
        # if strand == '-':
        #     wfo.write(' /direction=LEFT\n')
        # name
        wfo.write(' /note="{}"\n'.format(
            analyse_gs(feature, 'name', default='feature')
        ))
        # qualifiers
        for q_key in analyse_gs(feature, 'qualifiers', default={}):
            # do not write the label, because it has already been written
            if q_key == 'label':
                pass
            elif q_key == 'note':
                for note in analyse_gs(feature, 'qualifiers', q_key, default=[]):
                    # do not write the color, because it will be written later
                    if note[:6] != 'color:':
                        wfo.write(' /note="{}"\n'.format(
                            note))
            else:
                wfo.write(' /{}="{}"\n'.format(
                    q_key, analyse_gs(feature, 'qualifiers', q_key, default='')
                ))
        if len(segments) > 1:
            wfo.write((' /note="This feature \
has {} segments:').format(len(segments)))
            for seg_i, seg in enumerate(segments):
                segment_name = analyse_gs(seg, '@name', default='')
                if segment_name:
                    segment_name = ' / {}'.format(segment_name)
                wfo.write('\n {}: {} / {}{}'.format(
                    seg_i,
                    seg['@range'].replace('-', ' .. '),
                    seg['@color'],
                    segment_name,
                    )
                )
            wfo.write('"\n')
        else:
            # write colors and direction
            wfo.write(
                21 * ' ' + '/note="color: {}'.format(
                    analyse_gs(feature, 'color', default='#ffffff')))
            if strand == '-':
                wfo.write('; direction: LEFT"\n')
                # wfo.write('"\n')
            elif strand == '+':
                wfo.write('; direction: RIGHT"\n')
            else:
                wfo.write('"\n')

    # sequence
    wfo.write('ORIGIN\n')
    seq = analyse_gs(data, 'seq')
    # divide the sequence into rows of 60 bases, in groups of 10
    for i in range(0, len(seq), 60):
        wfo.write(str(i).rjust(9))
        for j in range(i, min(i + 60, len(seq)), 10):
            wfo.write(' {}'.format(seq[j:j + 10]))
        wfo.write('\n')
    wfo.write('//\n')

--------------------------------------------------------------------------------
/tests/test_general.py:
--------------------------------------------------------------------------------
from snapgene_reader import snapgene_file_to_seqrecord, snapgene_file_to_gbk
from Bio import SeqIO
import os

TEST_DIR = os.path.join('tests', 'test_samples')


def test_snapgene_file_to_seqrecord(tmpdir):
    all_files = [f for f in os.listdir(TEST_DIR) if f.endswith('.dna')]
    assert len(all_files)
    for fname in all_files:
        fpath = os.path.join(TEST_DIR, fname)
        record = snapgene_file_to_seqrecord(fpath)
        assert len(record.seq) > 10
        target = os.path.join(str(tmpdir), fname + '.gb')
        with open(target, 'w', encoding='utf-8') as fwrite:
            SeqIO.write([record, ], fwrite, 'genbank')

def test_snapgene_file_to_gbk(tmpdir):
    all_files = [f for f in os.listdir(TEST_DIR) if f.endswith('.dna')]
    assert len(all_files)
    for fname in all_files:
        print(fname)
        fpath = os.path.join(TEST_DIR, fname)
        target = os.path.join(str(tmpdir), 'testfile.gbk')
        with open(fpath, 'rb') as fsource:
            with open(target, 'w', encoding='utf-8') as ftarget:
                snapgene_file_to_gbk(fsource, ftarget)
        SeqIO.read(target, 'genbank')  # verification_reparsing

--------------------------------------------------------------------------------
/tests/test_samples/README.md:
--------------------------------------------------------------------------------
Origin of the test sequences
----------------------------

**test_sequence_1**: obtained at the EGF by cloning simulation of mock parts (using SnapGene 4.1 on 2017/11).

**pIB2-SEC13-mEGFP.dna**: provided by SnapGene as an example sequence

**addgene-plasmid-16405-sequence-190476.dna**: public sequence from [addgene](https://www.addgene.org/16405/sequences/)

--------------------------------------------------------------------------------
/tests/test_samples/addgene-plasmid-16405-sequence-190476.dna:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IsaacLuo/SnapGeneFileReader/aef54d90c406f3856876414afa911679c20d7c45/tests/test_samples/addgene-plasmid-16405-sequence-190476.dna
--------------------------------------------------------------------------------
/tests/test_samples/pIB2-SEC13-mEGFP.dna:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IsaacLuo/SnapGeneFileReader/aef54d90c406f3856876414afa911679c20d7c45/tests/test_samples/pIB2-SEC13-mEGFP.dna
--------------------------------------------------------------------------------
/tests/test_samples/test_sequence_1.dna:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IsaacLuo/SnapGeneFileReader/aef54d90c406f3856876414afa911679c20d7c45/tests/test_samples/test_sequence_1.dna
--------------------------------------------------------------------------------
/tests/test_samples/test_with_a_single_feature_ie_no_features_list.dna:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IsaacLuo/SnapGeneFileReader/aef54d90c406f3856876414afa911679c20d7c45/tests/test_samples/test_with_a_single_feature_ie_no_features_list.dna
--------------------------------------------------------------------------------
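
The block layout that snapgene_file_to_dict walks through (one type byte, a 4-byte big-endian size, then the payload) can be tried out without a real SnapGene file. The sketch below is a minimal illustration under stated assumptions: build_demo_file, iter_blocks and decode_sequence_block are hypothetical helper names that do not exist in snapgene_reader, the bytes are synthetic demo data, and treating the "SnapGene" header as an ordinary block of type 9 is a reading of the parsing code above, not a documented property of the format.

```python
import io
import struct


def build_demo_file(sequence, circular=True):
    """Build synthetic bytes laid out like the header and sequence block
    that snapgene_file_to_dict reads (demo data, not a real .dna file)."""
    # Header: b'\t' (0x09), size 14, b'SnapGene', then three uint16 fields.
    # The three values (1, 14, 14) are arbitrary placeholders.
    header_payload = b'SnapGene' + struct.pack('>HHH', 1, 14, 14)
    header = b'\t' + struct.pack('>I', len(header_payload)) + header_payload
    # Sequence block: type 0, one properties byte, then ASCII bases.
    props = 0x02 | (0x01 if circular else 0x00)  # double-stranded (+ circular)
    seq_payload = bytes([props]) + sequence.encode('ascii')
    seq_block = b'\x00' + struct.pack('>I', len(seq_payload)) + seq_payload
    # A block type (13) that the reader above would simply skip.
    skipped = b'\x0d' + struct.pack('>I', 3) + b'abc'
    return header + skipped + seq_block


def iter_blocks(fileobject):
    """Yield (block_type, payload) pairs until the stream is exhausted."""
    while True:
        type_byte = fileobject.read(1)
        if type_byte == b'':
            break  # end of file
        (size,) = struct.unpack('>I', fileobject.read(4))
        yield type_byte[0], fileobject.read(size)


def decode_sequence_block(payload):
    """Decode a type-0 payload using the same bit tests as the reader above."""
    props = payload[0]
    return {
        'topology': 'circular' if props & 0x01 else 'linear',
        'strandedness': 'double' if props & 0x02 else 'single',
        'length': len(payload) - 1,
        'seq': payload[1:].decode('ascii'),
    }


blocks = dict(iter_blocks(io.BytesIO(build_demo_file('ATGCATGC'))))
print(sorted(blocks))                    # block type codes found in the stream
print(decode_sequence_block(blocks[0]))  # decoded sequence properties
```

Reading blocks generically like this makes it easy to list which block types a file contains before deciding which ones to decode. The library code above decodes only types 0, 6 and 10 and skips every other block.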