├── LICENSE
├── README.md
├── NCBI_genebank_file_parser.py
└── test_genbank_file.txt


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2019 dewshr
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # NCBI-GenBank-file-parser
 2 | 
 3 | ############################# About the program ################################
 4 | 
 5 | This program takes an NCBI nucleotide GenBank file and parses it into a .csv file (result_ntgenbank.csv) with one field per column. The file may have any extension, but its format must match that of a .gbff file; an example file, test_genbank_file.txt, is provided. When run, the program asks for the file name, which must include the extension. The program assumes the file is in the same location as the script; if your file is elsewhere, either edit the script to point to its location or copy the script into the file's location.
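
As a rough illustration of the one-field-per-column output, the sketch below writes a header and a single row with Python's standard `csv` module, which quotes any field that contains a comma. It is not the program's own code (the script builds its rows by string concatenation), it targets Python 3 rather than Python 2.7, and the column subset and the `write_rows` helper are assumptions made for the example. The sample values come from the bundled example file.

```python
import csv
import io

# Illustrative subset of columns; the real program writes more fields.
COLUMNS = ["Name", "NM", "NM_version", "Symbol", "CDS_start", "CDS_stop"]

def write_rows(handle, rows):
    """Write a header plus one CSV line per row dictionary (hypothetical helper)."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Sample values copied from the example GenBank entry.
example = {
    "Name": "arylamine N-acetyltransferase 2",
    "NM": "NM_000016",
    "NM_version": "2",
    "Symbol": "NAT2",
    "CDS_start": "108",
    "CDS_stop": "980",
}

buffer = io.StringIO()
write_rows(buffer, [example])
csv_text = buffer.getvalue()
print(csv_text)
```

Writing through `csv` rather than joining strings by hand keeps the columns aligned even when a product name or synonym list contains commas.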
 6 | 
 7 | You can provide a file with a single GenBank entry or with multiple entries. Running the program also creates a temporary text file called 'genbank.txt', which you can delete afterwards.
 8 | NOTE: The program only produces output for GenBank entries that have a CDS feature. Some columns may be empty if the corresponding information is not present in the GenBank file.
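
The handling of multiple entries and the CDS-only rule can be sketched with the standard library alone. This is a simplified illustration, not the program's method (the script splits on `'LOCUS  '` and hands each entry to Biopython). The helper names `split_records` and `cds_range` are hypothetical, the second sample entry (`XX_000001`) is invented, and only simple `start..stop` locations are recognised.

```python
import re

def split_records(text):
    """Split GenBank flat-file text into one string per entry.

    An entry starts at a LOCUS line and ends at a line holding only '//'.
    Entries without the closing '//' are dropped.
    """
    records = []
    current = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            current = [line]
        elif current:
            current.append(line)
            if line.strip() == "//":
                records.append("\n".join(current))
                current = []
    return records

def cds_range(record):
    """Return (start, stop) of the first simple CDS feature, or None."""
    match = re.search(r"^\s{5}CDS\s+(\d+)\.\.(\d+)\s*$", record, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# First entry is abridged from the example file; the second is made up
# to show an entry without a CDS feature.
sample = """
LOCUS       NM_000016               1317 bp    mRNA    linear   PRI 09-APR-2017
FEATURES             Location/Qualifiers
     CDS             108..980
//
LOCUS       XX_000001                 10 bp    mRNA    linear   PRI 01-JAN-2000
FEATURES             Location/Qualifiers
     gene            1..10
//
"""

records = split_records(sample)
ranges = [cds_range(r) for r in records]
print(ranges)
```

An entry whose range comes back as `None` corresponds to the case in the NOTE above: it has no CDS feature, so it would contribute no row to the output.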
 9 | 
10 | 
11 | ############################# Program Requirements ######################################
12 | - Biopython (must be installed)
13 | - Python 2.7
14 | 
15 | #################################### NCBI gene bank file link ############################
16 | ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/
17 | 


--------------------------------------------------------------------------------
/NCBI_genebank_file_parser.py:
--------------------------------------------------------------------------------
  1 | #@Dewan Shrestha
  2 | 
  3 | from Bio import GenBank
  4 | 
  5 | #reading the gene bank file
  6 | def readfile():
  7 | 	#for genbank file: ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/
  8 | 	gb_file_name = str(raw_input("Enter your NCBI gene bank file name: "))
  9 | 	
 10 | 	try:
 11 | 		genbank = open(gb_file_name).read().split('LOCUS  ') #opens the gene bank file and splits on 'LOCUS  ' to create a list of entries
 12 | 	except IOError:
 13 | 		print "File not found. Please make sure you have the right file and also include the file extension"
 14 | 		raise SystemExit(1)	#nothing to parse, so stop instead of returning an undefined name
 15 | 	return genbank
 16 | 
 17 | 
 18 | #function to parse the nucleotide reference sequence genebank file
 19 | def ntgenbank():
 20 | 	#retrieving all genebank entries in a list by calling another function
 21 | 	nuc_genbank = readfile()
 22 | 	#nuc_genbank = filter(None, nuc_genbank)
 23 | 	print len(nuc_genbank)
 24 | 	print nuc_genbank[0]
 25 | 	length = len(nuc_genbank)
 26 | 	print "\nParsing started"
 27 | 	output = open('result_ntgenbank.csv','w') # opening a file to write the output
 28 | 
 29 | 	#writing headings of the output file
 30 | 	output.write('Name'+','+'NM'+','+ 'NM_version'+','+ 'Symbol'+','+'CDS_start'+','+ 'CDS_stop'+','+'HGNC'+','+\
 31 | 		    'MIM'+','+'EC_number'+','+ 'GeneID' +','+ 'NP'+','+'NP_version'+','+'gene_synonym'+','+'AA_seq'+','+\
 32 | 		     'AA_number'+','+'Chromosome'+ ','+'Chromosome_map'+','+ 'NT_seq'+','+'Organism'+'\n')
 33 | 	
 34 | 	# going through all the genes in the list
 35 | 	for n in range(1,length):	#0 index is empty
 36 | 		print n
 37 | 		test= 'LOCUS  ' + nuc_genbank[n].lstrip('\n')   #restoring the 'LOCUS  ' prefix removed by the split and dropping leading newlines
 38 | 		query = open('genbank.txt','w')	#writing the entry to a temporary file for the parser
 39 | 		query.write(test)
 40 | 		query.close()
 41 | 		parser = GenBank.RecordParser()		#using biopython function for parsing
 42 | 		record = parser.parse(open('genbank.txt'))
 43 | 		
 44 | 
 45 | 		##########################################################################################
 46 | 		nt_seq = (record.sequence).strip('\n') #stores nucleotide sequence
 47 | 		nm_and_version = (record.version).strip('\n') #contains nm and nm_version
 48 | 		nm = (nm_and_version.split('.')[0]).strip('\n')
 49 | 		nm_version = (nm_and_version.split('.')[1]).strip('\n')
 50 | 
 51 | 
 52 | 		############################################################################################
 53 | 		source = record.features[0]			#contains all the fields of source
 54 | 		#source qualifiers are read by position, so the lookups below are guarded in case a field is missing
 55 | 		try:
 56 | 			organism = source.qualifiers[0].value.strip('\n')+ ':'+source.qualifiers[2].value.strip('\n')
 57 | 		except:
 58 | 			organism =''
 59 | 		
 60 | 		try:
 61 | 			chrm = (source.qualifiers[3].value).strip('\n') #stores chromosome number
 62 | 		except:
 63 | 			chrm = ''
 64 | 			
 65 | 		try:
 66 | 			chrm_map = source.qualifiers[4].value.strip('\n')
 67 | 		except:
 68 | 			chrm_map = ''
 69 |                         
 70 | 
 71 | 		############################################################################################
 72 | 		gene = record.features[1]			#contains all the field of gene
 73 | 		symbol = (gene.qualifiers[0].value).strip('\n')     #symbol or gene
 74 | 
 75 | 
 76 |                 #########################################################################################
 77 | 		cds=''		
 78 | 		for c in range(0, len(record.features)):
 79 | 			if('CDS' in record.features[c].key):
 80 | 				cds = record.features[c]
 81 | 				break
 82 | 			else:
 83 | 				continue
 84 | 
 85 | 		if cds != '':
 86 | 			
 87 | 			cds_start_stop = (cds.location).strip('\n')		#stores cds start and stop position
 88 | 			cds_start = (cds_start_stop.split('..')[0]).strip('\n')
 89 | 			cds_stop = (cds_start_stop.split('..')[1]).strip('\n')
 90 | 			
 91 | 
 92 | 			#creating a empty dictionary to go through the elements in the CDS and update later if present
 93 | 			cds_dict = {"HGNC":'', "MIM:":'', "EC_number":'', "GeneID":'', "product":'', 
 94 | 					"protein_id":'',"translation":'',"num_aa":'', "gene_synonym":''}
 95 | 
 96 | 			for q in range(0, len(cds.qualifiers)):		#going through all the elements in the cds (q avoids reusing the outer loop variable n)
 97 | 				for key, value in cds_dict.iteritems():	#looping through the dictionary items to see if present in cds
 98 | 					if ((key in cds.qualifiers[q].key) or (key in cds.qualifiers[q].value)):
 99 | 						keys =str(key)					#storing dictionary key
100 | 						cds_dict[keys] = str(cds.qualifiers[q].value) #updating dictionary key with values
101 | 						break
102 | 					else:
103 | 						continue
104 | 			np = cds_dict["protein_id"].split('.')[0]+'"'
105 | 			np_version = '"'+cds_dict["protein_id"].split('.')[1]
106 | 			hgnc=cds_dict["HGNC"]
107 | 			mim=cds_dict["MIM:"]
108 | 			geneid =cds_dict["GeneID"]
109 | 			name = cds_dict["product"]
110 | 			synonym = cds_dict["gene_synonym"]
111 | 			translation = cds_dict["translation"]
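112 | 			num_aa = len(translation.strip('"')) if translation != '' else ''	#residue count without the surrounding quotes; empty when there is no translation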
113 | 			if len(hgnc) !=0:
114 | 				hgnc = '"'+hgnc.split(':')[2]
115 | 			if len(mim) !=0:
116 | 				mim = '"'+mim.split(':')[1]
117 | 			if len(geneid) !=0:
118 | 				geneid = '"'+geneid.split(':')[1]
119 | 
120 | 			gvalue = name+','+nm+','+nm_version+','+symbol+','+cds_start+','+cds_stop+',' + hgnc +','+\
121 | 				mim+','+cds_dict["EC_number"]+','+geneid+ ','+np+','+np_version+','+synonym+','+\
122 | 				translation+','+str(num_aa) +','+str(chrm)+','+chrm_map+','+nt_seq+','+organism+'\n'
123 | 			output.write(gvalue)
124 | 	print "Parsing completed"
125 | 	output.close()
126 | 	
127 | ntgenbank()
128 | 


--------------------------------------------------------------------------------
/test_genbank_file.txt:
--------------------------------------------------------------------------------
  1 | 
  2 | LOCUS       NM_000016               1317 bp    mRNA    linear   PRI 09-APR-2017
  3 | DEFINITION  Homo sapiens N-acetyltransferase 2 (NAT2), mRNA.
  4 | ACCESSION   NM_000015
  5 | VERSION     NM_000016.2
  6 | KEYWORDS    RefSeq.
  7 | SOURCE      Homo sapiens (human)
  8 |   ORGANISM  Homo sapiens
  9 |             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
 10 |             Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
 11 |             Catarrhini; Hominidae; Homo.
 12 | REFERENCE   1  (bases 1 to 1317)
 13 |   AUTHORS   Ma C, Gu L, Yang M, Zhang Z, Zeng S, Song R, Xu C and Sun Y.
 14 |   TITLE     rs1495741 as a tag single nucleotide polymorphism of
 15 |             N-acetyltransferase 2 acetylator phenotype associates bladder
 16 |             cancer risk and interacts with smoking: A systematic review and
 17 |             meta-analysis
 18 |   JOURNAL   Medicine (Baltimore) 95 (31), E4417 (2016)
 19 |    PUBMED   27495060
 20 |   REMARK    GeneRIF: NAT2 single nucleotide polymorphism associated with the
 21 |             risk bladder cancer development and interacts with smoking.
 22 |             Review article
 23 | REFERENCE   2  (bases 1 to 1317)
 24 |   AUTHORS   Jimenez-Jimenez FJ, Alonso-Navarro H, Garcia-Martin E and Agundez
 25 |             JA.
 26 |   TITLE     NAT2 polymorphisms and risk for Parkinson's disease: a systematic
 27 |             review and meta-analysis
 28 |   JOURNAL   Expert Opin Drug Metab Toxicol 12 (8), 937-946 (2016)
 29 |    PUBMED   27216438
 30 |   REMARK    GeneRIF: Data on NAT2 gene polymorphisms obtained from the current
 31 |             meta-analysis do not support a major association with Parkinson's
 32 |             diease risk, except in Asian populations
 33 |             Review article
 34 | REFERENCE   3  (bases 1 to 1317)
 35 |   AUTHORS   Suarez-Kurtz G, Fuchshuber-Moraes M, Struchiner CJ and Parra EJ.
 36 |   TITLE     Single nucleotide polymorphism coverage and inference of
 37 |             N-acetyltransferase-2 acetylator phenotypes in wordwide population
 38 |             groups
 39 |   JOURNAL   Pharmacogenet. Genomics 26 (8), 363-369 (2016)
 40 |    PUBMED   27136043
 41 |   REMARK    GeneRIF: Single nucleotide polymorphism is associated with
 42 |             different N-acetyltransferase-2 acetylator phenotypes in wordwide
 43 |             population groups.
 44 | REFERENCE   4  (bases 1 to 1317)
 45 |   AUTHORS   Chamorro JG, Castagnino JP, Musella RM, Nogueras M, Frias A, Visca
 46 |             M, Aidar O, Costa L and de Larranaga GF.
 47 |   TITLE     tagSNP rs1495741 as a useful molecular marker to predict
 48 |             antituberculosis drug-induced hepatotoxicity
 49 |   JOURNAL   Pharmacogenet. Genomics 26 (7), 357-361 (2016)
 50 |    PUBMED   27104815
 51 |   REMARK    GeneRIF: NAT2 gene variants were associated with antituberculosis
 52 |             drug-induced hepatotoxicity.
 53 | REFERENCE   5  (bases 1 to 1317)
 54 |   AUTHORS   Vatsis KP, Weber WW, Bell DA, Dupret JM, Evans DA, Grant DM, Hein
 55 |             DW, Lin HJ, Meyer UA, Relling MV et al.
 56 |   TITLE     Nomenclature for N-acetyltransferases
 57 |   JOURNAL   Pharmacogenetics 5 (1), 1-17 (1995)
 58 |    PUBMED   7773298
 59 |   REMARK    Review article
 60 | REFERENCE   6  (bases 1 to 1317)
 61 |   AUTHORS   Hickman D, Risch A, Camilleri JP and Sim E.
 62 |   TITLE     Genotyping human polymorphic arylamine N-acetyltransferase:
 63 |             identification of new slow allotypic variants
 64 |   JOURNAL   Pharmacogenetics 2 (5), 217-226 (1992)
 65 |    PUBMED   1306121
 66 | REFERENCE   7  (bases 1 to 1317)
 67 |   AUTHORS   Deguchi T.
 68 |   TITLE     Sequences and expression of alleles of polymorphic arylamine
 69 |             N-acetyltransferase of human liver
 70 |   JOURNAL   J. Biol. Chem. 267 (25), 18140-18147 (1992)
 71 |    PUBMED   1381364
 72 | REFERENCE   8  (bases 1 to 1317)
 73 |   AUTHORS   Grant DM, Blum M and Meyer UA.
 74 |   TITLE     Polymorphisms of N-acetyltransferase genes
 75 |   JOURNAL   Xenobiotica 22 (9-10), 1073-1081 (1992)
 76 |    PUBMED   1441598
 77 | REFERENCE   9  (bases 1 to 1317)
 78 |   AUTHORS   Blum M, Grant DM, McBride W, Heim M and Meyer UA.
 79 |   TITLE     Human arylamine N-acetyltransferase genes: isolation, chromosomal
 80 |             localization, and functional expression
 81 |   JOURNAL   DNA Cell Biol. 9 (3), 193-203 (1990)
 82 |    PUBMED   2340091
 83 | REFERENCE   10 (bases 1 to 1317)
 84 |   AUTHORS   Grant DM, Lottspeich F and Meyer UA.
 85 |   TITLE     Evidence for two closely related isozymes of arylamine
 86 |             N-acetyltransferase in human liver
 87 |   JOURNAL   FEBS Lett. 244 (1), 203-207 (1989)
 88 |    PUBMED   2924904
 89 | COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
 90 |             reference sequence was derived from AC025062.6, D90042.1 and
 91 |             AI460128.1.
 92 |             This sequence is a reference standard in the RefSeqGene project.
 93 |             
 94 |             Summary: This gene encodes an enzyme that functions to both
 95 |             activate and deactivate arylamine and hydrazine drugs and
 96 |             carcinogens. Polymorphisms in this gene are responsible for the
 97 |             N-acetylation polymorphism in which human populations segregate
 98 |             into rapid, intermediate, and slow acetylator phenotypes.
 99 |             Polymorphisms in this gene are also associated with higher
100 |             incidences of cancer and drug toxicity. A second arylamine
101 |             N-acetyltransferase gene (NAT1) is located near this gene (NAT2).
102 |             [provided by RefSeq, Jul 2008].
103 |             
104 |             Publication Note:  This RefSeq record includes a subset of the
105 |             publications that are available for this gene. Please see the Gene
106 |             record to access additional publications.
107 |             
108 |             ##Evidence-Data-START##
109 |             Transcript exon combination :: D90042.1, BC067218.1 [ECO:0000332]
110 |             RNAseq introns              :: single sample supports all introns
111 |                                            SAMEA1966682, SAMEA1968540
112 |                                            [ECO:0000348]
113 |             ##Evidence-Data-END##
114 |             COMPLETENESS: complete on the 3' end.
115 | FEATURES             Location/Qualifiers
116 |      source          1..1317
117 |                      /organism="Homo sapiens"
118 |                      /mol_type="mRNA"
119 |                      /db_xref="taxon:9606"
120 |                      /chromosome="8"
121 |                      /map="8p22"
122 |      gene            1..1317
123 |                      /gene="NAT2"
124 |                      /gene_synonym="AAC2; NAT-2; PNAT"
125 |                      /note="N-acetyltransferase 2"
126 |                      /db_xref="GeneID:10"
127 |                      /db_xref="HGNC:HGNC:7646"
128 |                      /db_xref="MIM:612182"
129 |      CDS             108..980
130 |                      /gene="NAT2"
131 |                      /gene_synonym="AAC2; NAT-2; PNAT"
132 |                      /EC_number="2.3.1.5"
133 |                      /note="arylamide acetylase 2; N-acetyltransferase 2
134 |                      (arylamine N-acetyltransferase); N-acetyltransferase type
135 |                      2"
136 |                      /codon_start=1
137 |                      /product="arylamine N-acetyltransferase 2"
138 |                      /protein_id="NP_000006.2"
139 |                      /db_xref="CCDS:CCDS6008.1"
140 |                      /db_xref="GeneID:10"
141 |                      /db_xref="HGNC:HGNC:7646"
142 |                      /db_xref="MIM:612182"
143 |                      /translation="MDIEAYFERIGYKNSRNKLDLETLTDILEHQIRAVPFENLNMHC
144 |                      GQAMELGLEAIFDHIVRRNRGGWCLQVNQLLYWALTTIGFQTTMLGGYFYIPPVNKYS
145 |                      TGMVHLLLQVTIDGRNYIVDAGSGSSSQMWQPLELISGKDQPQVPCIFCLTEERGIWY
146 |                      LDQIRREQYITNKEFLNSHLLPKKKHQKIYLFTLEPRTIEDFESMNTYLQTSPTSSFI
147 |                      TTSFCSLQTPEGVYCLVGFILTYRKFNYKDNTDLVEFKTLTEEEVEEVLRNIFKISLG
148 |                      RNLVPKPGDGSLTI"
149 |      misc_feature    423..428
150 |                      /gene="NAT2"
151 |                      /gene_synonym="AAC2; NAT-2; PNAT"
152 |                      /experiment="experimental evidence, no additional details
153 |                      recorded"
154 |                      /note="propagated from UniProtKB/Swiss-Prot (P11245.1);
155 |                      Region: Substrate binding. {ECO:0000305}"
156 |      regulatory      1301..1306
157 |                      /regulatory_class="polyA_signal_sequence"
158 |                      /gene="NAT2"
159 |                      /gene_synonym="AAC2; NAT-2; PNAT"
160 | BASE COUNT      418 a    252 c    263 g    384 t
161 | ORIGIN      
162 |         1 tgagatcact tcccttgcag actttggaag ggagagcact ttattacaga ccttggaagc
163 |        61 aagaggattg cattcagcct agttcctggt tgctggccaa agggatcatg gacattgaag
164 |       121 catattttga aagaattggc tataagaact ctaggaacaa attggacttg gaaacattaa
165 |       181 ctgacattct tgagcaccag atccgggctg ttccctttga gaaccttaac atgcattgtg
166 |       241 ggcaagccat ggagttgggc ttagaggcta tttttgatca cattgtaaga agaaaccggg
167 |       301 gtgggtggtg tctccaggtc aatcaacttc tgtactgggc tctgaccaca atcggttttc
168 |       361 agaccacaat gttaggaggg tatttttaca tccctccagt taacaaatac agcactggca
169 |       421 tggttcacct tctcctgcag gtgaccattg acggcaggaa ttacattgtc gatgctgggt
170 |       481 ctggaagctc ctcccagatg tggcagcctc tagaattaat ttctgggaag gatcagcctc
171 |       541 aggtgccttg cattttctgc ttgacagaag agagaggaat ctggtacctg gaccaaatca
172 |       601 ggagagagca gtatattaca aacaaagaat ttcttaattc tcatctcctg ccaaagaaga
173 |       661 aacaccaaaa aatatactta tttacgcttg aacctcgaac aattgaagat tttgagtcta
174 |       721 tgaatacata cctgcagacg tctccaacat cttcatttat aaccacatca ttttgttcct
175 |       781 tgcagacccc agaaggggtt tactgtttgg tgggcttcat cctcacctat agaaaattca
176 |       841 attataaaga caatacagat ctggtcgagt ttaaaactct cactgaggaa gaggttgaag
177 |       901 aagtgctgag aaatatattt aagatttcct tggggagaaa tctcgtgccc aaacctggtg
178 |       961 atggatccct tactatttag aataaggaac aaaataaacc cttgtgtatg tatcacccaa
179 |      1021 ctcactaatt atcaacttat gtgctatcag atatcctctc taccctcacg ttattttgaa
180 |      1081 gaaaatccta aacatcaaat actttcatcc ataaaaatgt cagcatttat taaaaaacaa
181 |      1141 taacttttta aagaaacata aggacacatt ttcaaattaa taaaaataaa ggcattttaa
182 |      1201 ggatggcctg tgattatctt gggaagcaga gtgattcatg ctagaaaaca tttaatattg
183 |      1261 atttattgtt gaattcatag taaattttta ctggtaaatg aataaagaat attgtgg
184 | //
185 | 


--------------------------------------------------------------------------------