├── LICENSE
├── README.md
├── NCBI_genebank_file_parser.py
└── test_genbank_file.txt


/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 dewshr

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# NCBI-GenBank-file-parser

############################# About the program ################################

This program takes an NCBI nucleotide GenBank file and parses its contents into a .csv file with one field per column. The file may have any extension, but its format must match that of a .gbff file. An example file, test_genbank_file.txt, is provided.
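
For orientation, a GenBank flat file is a sequence of records, each starting with a `LOCUS` line and ending with a line containing only `//`. The following is a minimal sketch of splitting such a file into records, assuming nothing beyond that layout. The function name `split_records` and the two placeholder records are hypothetical and not part of this repository. The script itself splits on the string 'LOCUS ' instead; splitting on the `//` terminator is a swapped-in alternative shown only for illustration.

```python
from __future__ import print_function  # lets the sketch run on Python 2.7 and 3


def split_records(text):
    """Split GenBank flat-file text into one string per record.

    A record is everything up to and including a line that contains
    only '//'. Trailing text without a terminator is ignored.
    """
    records = []
    current = []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == '//':
            # drop blank lines that may precede the LOCUS line
            records.append('\n'.join(current).strip('\n'))
            current = []
    # keep only chunks that really start with a LOCUS line
    return [r for r in records if r.startswith('LOCUS')]


# Two made-up placeholder records, not real NCBI entries
sample = (
    "LOCUS       REC_A   10 bp\n"
    "ORIGIN\n"
    "        1 acgtacgtac\n"
    "//\n"
    "LOCUS       REC_B   4 bp\n"
    "ORIGIN\n"
    "        1 acgt\n"
    "//\n"
)

records = split_records(sample)
print(len(records))
```

Each returned string can then be handed to a parser on its own, which is the same one-record-at-a-time approach the script takes with its temporary file.
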
When run, the program prompts for the file name, which must include the file extension. The program assumes the file is in the same directory as the script; if it is elsewhere, either edit the script to point to that location or copy the script into the file's directory.

You can provide a file with a single GenBank entry or with multiple entries. The output is written to 'result_ntgenbank.csv'. Running the program also creates a temporary text file called 'genbank.txt', which you can delete afterwards.
NOTE: the program only produces output for GenBank entries that have a CDS feature. Some columns may be empty if the corresponding information is not present in the GenBank file.


############################# Program Requirements ######################################
- Biopython (1.76 is the last release that supports Python 2.7)
- Python 2.7

#################################### NCBI gene bank file link ############################
ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/

--------------------------------------------------------------------------------
/NCBI_genebank_file_parser.py:
--------------------------------------------------------------------------------
#@Dewan Shrestha

import sys

from Bio import GenBank

#reading the gene bank file
def readfile():
    #for genbank file: ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/
    gb_file_name = str(raw_input("Enter your NCBI gene bank file name: "))

    try:
        #opens gene bank file and splits on 'LOCUS ' to create a list with one item per record
        genbank = open(gb_file_name).read().split('LOCUS ')
    except IOError:
        print "File not found. Please make sure you have the right file and also include the file extension"
        sys.exit(1) #nothing to parse, so stop here instead of returning an undefined variable

    return genbank


#function to parse the nucleotide reference sequence genebank file
def ntgenbank():
    #retrieving all genebank records in a list by calling another function
    nuc_genbank = readfile()
    #nuc_genbank = filter(None, nuc_genbank)
    print len(nuc_genbank)
    print nuc_genbank[0]
    length = len(nuc_genbank)
    print "\nParsing started"
    output = open('result_ntgenbank.csv','w') # opening a file to write the output

    #writing headings of the output file
    output.write('Name'+','+'NM'+','+ 'NM_version'+','+ 'Symbol'+','+'CDS_start'+','+ 'CDS_stop'+','+'HGNC'+','+\
    'MIM'+','+'EC_number'+','+ 'GeneID' +','+ 'NP'+','+'NP_version'+','+'gene_synonym'+','+'AA_seq'+','+\
    'AA_number'+','+'Chromosome'+ ','+'Chromosome_map'+','+ 'NT_seq'+','+'Organism'+'\n')

    # going through all the genes in the list
    for n in range(1,length): #0 index is empty
        print n
        test= 'LOCUS ' + nuc_genbank[n].lstrip('\n') #restoring the 'LOCUS ' prefix consumed by split() and dropping leading new lines
        query = open('genbank.txt','w') #temporary file holding the single record to be parsed
        query.write(test)
        query.close()
        parser = GenBank.RecordParser() #using biopython function for parsing
        record = parser.parse(open('genbank.txt'))


        ##########################################################################################
        nt_seq = (record.sequence).strip('\n') #stores nucleotide sequence
        nm_and_version = (record.version).strip('\n') #contains nm and nm_version
        nm = (nm_and_version.split('.')[0]).strip('\n')
        nm_version = (nm_and_version.split('.')[1]).strip('\n')


        ############################################################################################
        source = record.features[0] #contains all the fields of source
        try:
            #organism name and taxon id, looked up by qualifier position
            organism = source.qualifiers[0].value.strip('\n')+ ':'+source.qualifiers[2].value.strip('\n')
        except IndexError:
            organism =''

        try:
            chrm = (source.qualifiers[3].value).strip('\n') #stores chromosome number
        except IndexError:
            chrm = ''

        try:
            chrm_map = source.qualifiers[4].value.strip('\n')
        except IndexError:
            chrm_map = ''


        ############################################################################################
        gene = record.features[1] #contains all the fields of gene
        symbol = (gene.qualifiers[0].value).strip('\n') #symbol or gene


        #########################################################################################
        #finding the first CDS feature, if the record has one
        cds=''
        for feature in record.features:
            if 'CDS' in feature.key:
                cds = feature
                break

        if cds != '':

            cds_start_stop = (cds.location).strip('\n') #stores cds start and stop position
            cds_start = (cds_start_stop.split('..')[0]).strip('\n')
            cds_stop = (cds_start_stop.split('..')[1]).strip('\n')


            #creating an empty dictionary to go through the elements in the CDS and update later if present
            cds_dict = {"HGNC":'', "MIM:":'', "EC_number":'', "GeneID":'', "product":'',
                        "protein_id":'',"translation":'',"num_aa":'', "gene_synonym":''}

            for qualifier in cds.qualifiers: #going through all the elements in the cds
                for key in cds_dict: #checking whether a dictionary key occurs in the qualifier key or value
                    if ((key in qualifier.key) or (key in qualifier.value)):
                        cds_dict[key] = str(qualifier.value) #updating dictionary key with values
                        break
            np = cds_dict["protein_id"].split('.')[0]+'"'
            np_version = '"'+cds_dict["protein_id"].split('.')[1]
            hgnc=cds_dict["HGNC"]
            mim=cds_dict["MIM:"]
            geneid =cds_dict["GeneID"]
            name = cds_dict["product"]
            synonym = cds_dict["gene_synonym"]
            translation = cds_dict["translation"]
            #number of amino acids; surrounding quotes, if the parser kept them, are not counted
            num_aa = len(translation.strip('"')) if translation != '' else ''
            if len(hgnc) !=0:
                hgnc = '"'+hgnc.split(':')[2]
            if len(mim) !=0:
                mim = '"'+mim.split(':')[1]
            if len(geneid) !=0:
                geneid = '"'+geneid.split(':')[1]

            gvalue = name+','+nm+','+nm_version+','+symbol+','+cds_start+','+cds_stop+',' + hgnc +','+\
            mim+','+cds_dict["EC_number"]+','+geneid+ ','+np+','+np_version+','+synonym+','+\
            translation+','+str(num_aa) +','+str(chrm)+','+chrm_map+','+nt_seq+','+organism+'\n'
            output.write(gvalue)
    print "Parsing completed"
    output.close()

ntgenbank()

--------------------------------------------------------------------------------
/test_genbank_file.txt:
--------------------------------------------------------------------------------

LOCUS       NM_000016               1317 bp    mRNA    linear   PRI 09-APR-2017
DEFINITION  Homo sapiens N-acetyltransferase 2 (NAT2), mRNA.
ACCESSION   NM_000015
VERSION     NM_000016.2
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1317)
  AUTHORS   Ma C, Gu L, Yang M, Zhang Z, Zeng S, Song R, Xu C and Sun Y.
  TITLE     rs1495741 as a tag single nucleotide polymorphism of
            N-acetyltransferase 2 acetylator phenotype associates bladder
            cancer risk and interacts with smoking: A systematic review and
            meta-analysis
  JOURNAL   Medicine (Baltimore) 95 (31), E4417 (2016)
   PUBMED   27495060
  REMARK    GeneRIF: NAT2 single nucleotide polymorphism associated with the
            risk bladder cancer development and interacts with smoking.
            Review article
REFERENCE   2  (bases 1 to 1317)
  AUTHORS   Jimenez-Jimenez FJ, Alonso-Navarro H, Garcia-Martin E and Agundez
            JA.
  TITLE     NAT2 polymorphisms and risk for Parkinson's disease: a systematic
            review and meta-analysis
  JOURNAL   Expert Opin Drug Metab Toxicol 12 (8), 937-946 (2016)
   PUBMED   27216438
  REMARK    GeneRIF: Data on NAT2 gene polymorphisms obtained from the current
            meta-analysis do not support a major association with Parkinson's
            diease risk, except in Asian populations
            Review article
REFERENCE   3  (bases 1 to 1317)
  AUTHORS   Suarez-Kurtz G, Fuchshuber-Moraes M, Struchiner CJ and Parra EJ.
  TITLE     Single nucleotide polymorphism coverage and inference of
            N-acetyltransferase-2 acetylator phenotypes in wordwide population
            groups
  JOURNAL   Pharmacogenet. Genomics 26 (8), 363-369 (2016)
   PUBMED   27136043
  REMARK    GeneRIF: Single nucleotide polymorphism is associated with
            different N-acetyltransferase-2 acetylator phenotypes in wordwide
            population groups.
REFERENCE   4  (bases 1 to 1317)
  AUTHORS   Chamorro JG, Castagnino JP, Musella RM, Nogueras M, Frias A, Visca
            M, Aidar O, Costa L and de Larranaga GF.
  TITLE     tagSNP rs1495741 as a useful molecular marker to predict
            antituberculosis drug-induced hepatotoxicity
  JOURNAL   Pharmacogenet. Genomics 26 (7), 357-361 (2016)
   PUBMED   27104815
  REMARK    GeneRIF: NAT2 gene variants were associated with antituberculosis
            drug-induced hepatotoxicity.
REFERENCE   5  (bases 1 to 1317)
  AUTHORS   Vatsis KP, Weber WW, Bell DA, Dupret JM, Evans DA, Grant DM, Hein
            DW, Lin HJ, Meyer UA, Relling MV et al.
  TITLE     Nomenclature for N-acetyltransferases
  JOURNAL   Pharmacogenetics 5 (1), 1-17 (1995)
   PUBMED   7773298
  REMARK    Review article
REFERENCE   6  (bases 1 to 1317)
  AUTHORS   Hickman D, Risch A, Camilleri JP and Sim E.
  TITLE     Genotyping human polymorphic arylamine N-acetyltransferase:
            identification of new slow allotypic variants
  JOURNAL   Pharmacogenetics 2 (5), 217-226 (1992)
   PUBMED   1306121
REFERENCE   7  (bases 1 to 1317)
  AUTHORS   Deguchi T.
  TITLE     Sequences and expression of alleles of polymorphic arylamine
            N-acetyltransferase of human liver
  JOURNAL   J. Biol. Chem. 267 (25), 18140-18147 (1992)
   PUBMED   1381364
REFERENCE   8  (bases 1 to 1317)
  AUTHORS   Grant DM, Blum M and Meyer UA.
  TITLE     Polymorphisms of N-acetyltransferase genes
  JOURNAL   Xenobiotica 22 (9-10), 1073-1081 (1992)
   PUBMED   1441598
REFERENCE   9  (bases 1 to 1317)
  AUTHORS   Blum M, Grant DM, McBride W, Heim M and Meyer UA.
  TITLE     Human arylamine N-acetyltransferase genes: isolation, chromosomal
            localization, and functional expression
  JOURNAL   DNA Cell Biol. 9 (3), 193-203 (1990)
   PUBMED   2340091
REFERENCE   10 (bases 1 to 1317)
  AUTHORS   Grant DM, Lottspeich F and Meyer UA.
  TITLE     Evidence for two closely related isozymes of arylamine
            N-acetyltransferase in human liver
  JOURNAL   FEBS Lett. 244 (1), 203-207 (1989)
   PUBMED   2924904
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from AC025062.6, D90042.1 and
            AI460128.1.
            This sequence is a reference standard in the RefSeqGene project.

            Summary: This gene encodes an enzyme that functions to both
            activate and deactivate arylamine and hydrazine drugs and
            carcinogens. Polymorphisms in this gene are responsible for the
            N-acetylation polymorphism in which human populations segregate
            into rapid, intermediate, and slow acetylator phenotypes.
            Polymorphisms in this gene are also associated with higher
            incidences of cancer and drug toxicity. A second arylamine
            N-acetyltransferase gene (NAT1) is located near this gene (NAT2).
            [provided by RefSeq, Jul 2008].

            Publication Note: This RefSeq record includes a subset of the
            publications that are available for this gene. Please see the Gene
            record to access additional publications.

            ##Evidence-Data-START##
            Transcript exon combination :: D90042.1, BC067218.1 [ECO:0000332]
            RNAseq introns              :: single sample supports all introns
                                           SAMEA1966682, SAMEA1968540
                                           [ECO:0000348]
            ##Evidence-Data-END##
            COMPLETENESS: complete on the 3' end.
FEATURES             Location/Qualifiers
     source          1..1317
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="8"
                     /map="8p22"
     gene            1..1317
                     /gene="NAT2"
                     /gene_synonym="AAC2; NAT-2; PNAT"
                     /note="N-acetyltransferase 2"
                     /db_xref="GeneID:10"
                     /db_xref="HGNC:HGNC:7646"
                     /db_xref="MIM:612182"
     CDS             108..980
                     /gene="NAT2"
                     /gene_synonym="AAC2; NAT-2; PNAT"
                     /EC_number="2.3.1.5"
                     /note="arylamide acetylase 2; N-acetyltransferase 2
                     (arylamine N-acetyltransferase); N-acetyltransferase type
                     2"
                     /codon_start=1
                     /product="arylamine N-acetyltransferase 2"
                     /protein_id="NP_000006.2"
                     /db_xref="CCDS:CCDS6008.1"
                     /db_xref="GeneID:10"
                     /db_xref="HGNC:HGNC:7646"
                     /db_xref="MIM:612182"
                     /translation="MDIEAYFERIGYKNSRNKLDLETLTDILEHQIRAVPFENLNMHC
                     GQAMELGLEAIFDHIVRRNRGGWCLQVNQLLYWALTTIGFQTTMLGGYFYIPPVNKYS
                     TGMVHLLLQVTIDGRNYIVDAGSGSSSQMWQPLELISGKDQPQVPCIFCLTEERGIWY
                     LDQIRREQYITNKEFLNSHLLPKKKHQKIYLFTLEPRTIEDFESMNTYLQTSPTSSFI
                     TTSFCSLQTPEGVYCLVGFILTYRKFNYKDNTDLVEFKTLTEEEVEEVLRNIFKISLG
                     RNLVPKPGDGSLTI"
     misc_feature    423..428
                     /gene="NAT2"
                     /gene_synonym="AAC2; NAT-2; PNAT"
                     /experiment="experimental evidence, no additional details
                     recorded"
                     /note="propagated from UniProtKB/Swiss-Prot (P11245.1);
                     Region: Substrate binding.
                     {ECO:0000305}"
     regulatory      1301..1306
                     /regulatory_class="polyA_signal_sequence"
                     /gene="NAT2"
                     /gene_synonym="AAC2; NAT-2; PNAT"
BASE COUNT      418 a    252 c    263 g    384 t
ORIGIN
        1 tgagatcact tcccttgcag actttggaag ggagagcact ttattacaga ccttggaagc
       61 aagaggattg cattcagcct agttcctggt tgctggccaa agggatcatg gacattgaag
      121 catattttga aagaattggc tataagaact ctaggaacaa attggacttg gaaacattaa
      181 ctgacattct tgagcaccag atccgggctg ttccctttga gaaccttaac atgcattgtg
      241 ggcaagccat ggagttgggc ttagaggcta tttttgatca cattgtaaga agaaaccggg
      301 gtgggtggtg tctccaggtc aatcaacttc tgtactgggc tctgaccaca atcggttttc
      361 agaccacaat gttaggaggg tatttttaca tccctccagt taacaaatac agcactggca
      421 tggttcacct tctcctgcag gtgaccattg acggcaggaa ttacattgtc gatgctgggt
      481 ctggaagctc ctcccagatg tggcagcctc tagaattaat ttctgggaag gatcagcctc
      541 aggtgccttg cattttctgc ttgacagaag agagaggaat ctggtacctg gaccaaatca
      601 ggagagagca gtatattaca aacaaagaat ttcttaattc tcatctcctg ccaaagaaga
      661 aacaccaaaa aatatactta tttacgcttg aacctcgaac aattgaagat tttgagtcta
      721 tgaatacata cctgcagacg tctccaacat cttcatttat aaccacatca ttttgttcct
      781 tgcagacccc agaaggggtt tactgtttgg tgggcttcat cctcacctat agaaaattca
      841 attataaaga caatacagat ctggtcgagt ttaaaactct cactgaggaa gaggttgaag
      901 aagtgctgag aaatatattt aagatttcct tggggagaaa tctcgtgccc aaacctggtg
      961 atggatccct tactatttag aataaggaac aaaataaacc cttgtgtatg tatcacccaa
     1021 ctcactaatt atcaacttat gtgctatcag atatcctctc taccctcacg ttattttgaa
     1081 gaaaatccta aacatcaaat actttcatcc ataaaaatgt cagcatttat taaaaaacaa
     1141 taacttttta aagaaacata aggacacatt ttcaaattaa taaaaataaa ggcattttaa
     1201 ggatggcctg tgattatctt gggaagcaga gtgattcatg ctagaaaaca tttaatattg
     1261 atttattgtt gaattcatag taaattttta ctggtaaatg aataaagaat attgtgg
//

--------------------------------------------------------------------------------
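
The field handling that NCBI_genebank_file_parser.py applies to a record like the one above (splitting a versioned accession, pulling identifiers out of `db_xref` values, splitting the CDS location) can be sketched without Biopython. This is a minimal illustration under stated assumptions, not the script's actual code: the helper names `split_version`, `parse_db_xrefs` and `split_location` are hypothetical, and the sketch assumes qualifier values may arrive wrapped in double quotes. The input strings are copied from the test record.

```python
from __future__ import print_function  # lets the sketch run on Python 2.7 and 3


def split_version(versioned_id):
    """Split 'NM_000016.2' into ('NM_000016', '2'), ignoring surrounding quotes."""
    accession, _, version = versioned_id.strip('"').partition('.')
    return accession, version


def parse_db_xrefs(values):
    """Map database name to identifier for values such as 'HGNC:HGNC:7646'."""
    xrefs = {}
    for value in values:
        database, _, identifier = value.strip('"').partition(':')
        # HGNC identifiers repeat the prefix ('HGNC:7646'); keep the last part
        xrefs[database] = identifier.split(':')[-1]
    return xrefs


def split_location(location):
    """Split '108..980' into (108, 980). Handles plain start..stop ranges only."""
    start, stop = location.split('..')
    return int(start), int(stop)


# Values taken from the CDS feature and VERSION line of the test record
nm, nm_version = split_version('NM_000016.2')
np_id, np_version = split_version('"NP_000006.2"')
xrefs = parse_db_xrefs(['"CCDS:CCDS6008.1"', '"GeneID:10"',
                        '"HGNC:HGNC:7646"', '"MIM:612182"'])
cds_start, cds_stop = split_location('108..980')

print(nm, nm_version, np_id, np_version)
print(xrefs['GeneID'], xrefs['HGNC'], xrefs['MIM'])
print(cds_start, cds_stop)
```

Looking up identifiers by database name, rather than by substring match on every qualifier, avoids picking up a value from an unrelated qualifier that happens to contain the same text. Compound locations such as `join(...)` are outside the scope of this sketch.
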