├── LICENSE ├── README.md ├── batchRandomSequences.pl ├── calc_coverage_breadth.py ├── codon_freq.py ├── concat_alignments.py ├── filter_contigs.py ├── filter_contigs_by_blastp.py ├── gc_window.py ├── get_genomes.py ├── isoelectric.py ├── random_forest.py ├── simulate_assembly.py ├── top_10_contigs.py └── translate_all_frames.py /LICENSE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 2, June 1991 3 | 4 | Copyright (C) 1989, 1991 Free Software Foundation, Inc., 5 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA 6 | Everyone is permitted to copy and distribute verbatim copies 7 | of this license document, but changing it is not allowed. 8 | 9 | Preamble 10 | 11 | The licenses for most software are designed to take away your 12 | freedom to share and change it. By contrast, the GNU General Public 13 | License is intended to guarantee your freedom to share and change free 14 | software--to make sure the software is free for all its users. This 15 | General Public License applies to most of the Free Software 16 | Foundation's software and to any other program whose authors commit to 17 | using it. (Some other Free Software Foundation software is covered by 18 | the GNU Lesser General Public License instead.) You can apply it to 19 | your programs, too. 20 | 21 | When we speak of free software, we are referring to freedom, not 22 | price. Our General Public Licenses are designed to make sure that you 23 | have the freedom to distribute copies of free software (and charge for 24 | this service if you wish), that you receive source code or can get it 25 | if you want it, that you can change the software or use pieces of it 26 | in new free programs; and that you know you can do these things. 27 | 28 | To protect your rights, we need to make restrictions that forbid 29 | anyone to deny you these rights or to ask you to surrender the rights. 
30 | These restrictions translate to certain responsibilities for you if you 31 | distribute copies of the software, or if you modify it. 32 | 33 | For example, if you distribute copies of such a program, whether 34 | gratis or for a fee, you must give the recipients all the rights that 35 | you have. You must make sure that they, too, receive or can get the 36 | source code. And you must show them these terms so they know their 37 | rights. 38 | 39 | We protect your rights with two steps: (1) copyright the software, and 40 | (2) offer you this license which gives you legal permission to copy, 41 | distribute and/or modify the software. 42 | 43 | Also, for each author's protection and ours, we want to make certain 44 | that everyone understands that there is no warranty for this free 45 | software. If the software is modified by someone else and passed on, we 46 | want its recipients to know that what they have is not the original, so 47 | that any problems introduced by others will not reflect on the original 48 | authors' reputations. 49 | 50 | Finally, any free program is threatened constantly by software 51 | patents. We wish to avoid the danger that redistributors of a free 52 | program will individually obtain patent licenses, in effect making the 53 | program proprietary. To prevent this, we have made it clear that any 54 | patent must be licensed for everyone's free use or not licensed at all. 55 | 56 | The precise terms and conditions for copying, distribution and 57 | modification follow. 58 | 59 | GNU GENERAL PUBLIC LICENSE 60 | TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 61 | 62 | 0. This License applies to any program or other work which contains 63 | a notice placed by the copyright holder saying it may be distributed 64 | under the terms of this General Public License. 
The "Program", below, 65 | refers to any such program or work, and a "work based on the Program" 66 | means either the Program or any derivative work under copyright law: 67 | that is to say, a work containing the Program or a portion of it, 68 | either verbatim or with modifications and/or translated into another 69 | language. (Hereinafter, translation is included without limitation in 70 | the term "modification".) Each licensee is addressed as "you". 71 | 72 | Activities other than copying, distribution and modification are not 73 | covered by this License; they are outside its scope. The act of 74 | running the Program is not restricted, and the output from the Program 75 | is covered only if its contents constitute a work based on the 76 | Program (independent of having been made by running the Program). 77 | Whether that is true depends on what the Program does. 78 | 79 | 1. You may copy and distribute verbatim copies of the Program's 80 | source code as you receive it, in any medium, provided that you 81 | conspicuously and appropriately publish on each copy an appropriate 82 | copyright notice and disclaimer of warranty; keep intact all the 83 | notices that refer to this License and to the absence of any warranty; 84 | and give any other recipients of the Program a copy of this License 85 | along with the Program. 86 | 87 | You may charge a fee for the physical act of transferring a copy, and 88 | you may at your option offer warranty protection in exchange for a fee. 89 | 90 | 2. You may modify your copy or copies of the Program or any portion 91 | of it, thus forming a work based on the Program, and copy and 92 | distribute such modifications or work under the terms of Section 1 93 | above, provided that you also meet all of these conditions: 94 | 95 | a) You must cause the modified files to carry prominent notices 96 | stating that you changed the files and the date of any change. 
97 | 98 | b) You must cause any work that you distribute or publish, that in 99 | whole or in part contains or is derived from the Program or any 100 | part thereof, to be licensed as a whole at no charge to all third 101 | parties under the terms of this License. 102 | 103 | c) If the modified program normally reads commands interactively 104 | when run, you must cause it, when started running for such 105 | interactive use in the most ordinary way, to print or display an 106 | announcement including an appropriate copyright notice and a 107 | notice that there is no warranty (or else, saying that you provide 108 | a warranty) and that users may redistribute the program under 109 | these conditions, and telling the user how to view a copy of this 110 | License. (Exception: if the Program itself is interactive but 111 | does not normally print such an announcement, your work based on 112 | the Program is not required to print an announcement.) 113 | 114 | These requirements apply to the modified work as a whole. If 115 | identifiable sections of that work are not derived from the Program, 116 | and can be reasonably considered independent and separate works in 117 | themselves, then this License, and its terms, do not apply to those 118 | sections when you distribute them as separate works. But when you 119 | distribute the same sections as part of a whole which is a work based 120 | on the Program, the distribution of the whole must be on the terms of 121 | this License, whose permissions for other licensees extend to the 122 | entire whole, and thus to each and every part regardless of who wrote it. 123 | 124 | Thus, it is not the intent of this section to claim rights or contest 125 | your rights to work written entirely by you; rather, the intent is to 126 | exercise the right to control the distribution of derivative or 127 | collective works based on the Program. 
128 | 129 | In addition, mere aggregation of another work not based on the Program 130 | with the Program (or with a work based on the Program) on a volume of 131 | a storage or distribution medium does not bring the other work under 132 | the scope of this License. 133 | 134 | 3. You may copy and distribute the Program (or a work based on it, 135 | under Section 2) in object code or executable form under the terms of 136 | Sections 1 and 2 above provided that you also do one of the following: 137 | 138 | a) Accompany it with the complete corresponding machine-readable 139 | source code, which must be distributed under the terms of Sections 140 | 1 and 2 above on a medium customarily used for software interchange; or, 141 | 142 | b) Accompany it with a written offer, valid for at least three 143 | years, to give any third party, for a charge no more than your 144 | cost of physically performing source distribution, a complete 145 | machine-readable copy of the corresponding source code, to be 146 | distributed under the terms of Sections 1 and 2 above on a medium 147 | customarily used for software interchange; or, 148 | 149 | c) Accompany it with the information you received as to the offer 150 | to distribute corresponding source code. (This alternative is 151 | allowed only for noncommercial distribution and only if you 152 | received the program in object code or executable form with such 153 | an offer, in accord with Subsection b above.) 154 | 155 | The source code for a work means the preferred form of the work for 156 | making modifications to it. For an executable work, complete source 157 | code means all the source code for all modules it contains, plus any 158 | associated interface definition files, plus the scripts used to 159 | control compilation and installation of the executable. 
However, as a 160 | special exception, the source code distributed need not include 161 | anything that is normally distributed (in either source or binary 162 | form) with the major components (compiler, kernel, and so on) of the 163 | operating system on which the executable runs, unless that component 164 | itself accompanies the executable. 165 | 166 | If distribution of executable or object code is made by offering 167 | access to copy from a designated place, then offering equivalent 168 | access to copy the source code from the same place counts as 169 | distribution of the source code, even though third parties are not 170 | compelled to copy the source along with the object code. 171 | 172 | 4. You may not copy, modify, sublicense, or distribute the Program 173 | except as expressly provided under this License. Any attempt 174 | otherwise to copy, modify, sublicense or distribute the Program is 175 | void, and will automatically terminate your rights under this License. 176 | However, parties who have received copies, or rights, from you under 177 | this License will not have their licenses terminated so long as such 178 | parties remain in full compliance. 179 | 180 | 5. You are not required to accept this License, since you have not 181 | signed it. However, nothing else grants you permission to modify or 182 | distribute the Program or its derivative works. These actions are 183 | prohibited by law if you do not accept this License. Therefore, by 184 | modifying or distributing the Program (or any work based on the 185 | Program), you indicate your acceptance of this License to do so, and 186 | all its terms and conditions for copying, distributing or modifying 187 | the Program or works based on it. 188 | 189 | 6. 
Each time you redistribute the Program (or any work based on the 190 | Program), the recipient automatically receives a license from the 191 | original licensor to copy, distribute or modify the Program subject to 192 | these terms and conditions. You may not impose any further 193 | restrictions on the recipients' exercise of the rights granted herein. 194 | You are not responsible for enforcing compliance by third parties to 195 | this License. 196 | 197 | 7. If, as a consequence of a court judgment or allegation of patent 198 | infringement or for any other reason (not limited to patent issues), 199 | conditions are imposed on you (whether by court order, agreement or 200 | otherwise) that contradict the conditions of this License, they do not 201 | excuse you from the conditions of this License. If you cannot 202 | distribute so as to satisfy simultaneously your obligations under this 203 | License and any other pertinent obligations, then as a consequence you 204 | may not distribute the Program at all. For example, if a patent 205 | license would not permit royalty-free redistribution of the Program by 206 | all those who receive copies directly or indirectly through you, then 207 | the only way you could satisfy both it and this License would be to 208 | refrain entirely from distribution of the Program. 209 | 210 | If any portion of this section is held invalid or unenforceable under 211 | any particular circumstance, the balance of the section is intended to 212 | apply and the section as a whole is intended to apply in other 213 | circumstances. 214 | 215 | It is not the purpose of this section to induce you to infringe any 216 | patents or other property right claims or to contest validity of any 217 | such claims; this section has the sole purpose of protecting the 218 | integrity of the free software distribution system, which is 219 | implemented by public license practices. 
Many people have made 220 | generous contributions to the wide range of software distributed 221 | through that system in reliance on consistent application of that 222 | system; it is up to the author/donor to decide if he or she is willing 223 | to distribute software through any other system and a licensee cannot 224 | impose that choice. 225 | 226 | This section is intended to make thoroughly clear what is believed to 227 | be a consequence of the rest of this License. 228 | 229 | 8. If the distribution and/or use of the Program is restricted in 230 | certain countries either by patents or by copyrighted interfaces, the 231 | original copyright holder who places the Program under this License 232 | may add an explicit geographical distribution limitation excluding 233 | those countries, so that distribution is permitted only in or among 234 | countries not thus excluded. In such case, this License incorporates 235 | the limitation as if written in the body of this License. 236 | 237 | 9. The Free Software Foundation may publish revised and/or new versions 238 | of the General Public License from time to time. Such new versions will 239 | be similar in spirit to the present version, but may differ in detail to 240 | address new problems or concerns. 241 | 242 | Each version is given a distinguishing version number. If the Program 243 | specifies a version number of this License which applies to it and "any 244 | later version", you have the option of following the terms and conditions 245 | either of that version or of any later version published by the Free 246 | Software Foundation. If the Program does not specify a version number of 247 | this License, you may choose any version ever published by the Free Software 248 | Foundation. 249 | 250 | 10. If you wish to incorporate parts of the Program into other free 251 | programs whose distribution conditions are different, write to the author 252 | to ask for permission. 
For software which is copyrighted by the Free 253 | Software Foundation, write to the Free Software Foundation; we sometimes 254 | make exceptions for this. Our decision will be guided by the two goals 255 | of preserving the free status of all derivatives of our free software and 256 | of promoting the sharing and reuse of software generally. 257 | 258 | NO WARRANTY 259 | 260 | 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY 261 | FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN 262 | OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES 263 | PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED 264 | OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 265 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS 266 | TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE 267 | PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, 268 | REPAIR OR CORRECTION. 269 | 270 | 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 271 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR 272 | REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, 273 | INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING 274 | OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED 275 | TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY 276 | YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER 277 | PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE 278 | POSSIBILITY OF SUCH DAMAGES. 
279 | 280 | END OF TERMS AND CONDITIONS 281 | 282 | How to Apply These Terms to Your New Programs 283 | 284 | If you develop a new program, and you want it to be of the greatest 285 | possible use to the public, the best way to achieve this is to make it 286 | free software which everyone can redistribute and change under these terms. 287 | 288 | To do so, attach the following notices to the program. It is safest 289 | to attach them to the start of each source file to most effectively 290 | convey the exclusion of warranty; and each file should have at least 291 | the "copyright" line and a pointer to where the full notice is found. 292 | 293 | {description} 294 | Copyright (C) {year} {fullname} 295 | 296 | This program is free software; you can redistribute it and/or modify 297 | it under the terms of the GNU General Public License as published by 298 | the Free Software Foundation; either version 2 of the License, or 299 | (at your option) any later version. 300 | 301 | This program is distributed in the hope that it will be useful, 302 | but WITHOUT ANY WARRANTY; without even the implied warranty of 303 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 304 | GNU General Public License for more details. 305 | 306 | You should have received a copy of the GNU General Public License along 307 | with this program; if not, write to the Free Software Foundation, Inc., 308 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 309 | 310 | Also add information on how to contact you by electronic and paper mail. 311 | 312 | If the program is interactive, make it output a short notice like this 313 | when it starts in an interactive mode: 314 | 315 | Gnomovision version 69, Copyright (C) year name of author 316 | Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 317 | This is free software, and you are welcome to redistribute it 318 | under certain conditions; type `show c' for details. 
319 | 320 | The hypothetical commands `show w' and `show c' should show the appropriate 321 | parts of the General Public License. Of course, the commands you use may 322 | be called something other than `show w' and `show c'; they could even be 323 | mouse-clicks or menu items--whatever suits your program. 324 | 325 | You should also get your employer (if you work as a programmer) or your 326 | school, if any, to sign a "copyright disclaimer" for the program, if 327 | necessary. Here is a sample; alter the names: 328 | 329 | Yoyodyne, Inc., hereby disclaims all copyright interest in the program 330 | `Gnomovision' (which makes passes at compilers) written by James Hacker. 331 | 332 | {signature of Ty Coon}, 1 April 1989 333 | Ty Coon, President of Vice 334 | 335 | This General Public License does not permit incorporating your program into 336 | proprietary programs. If your program is a subroutine library, you may 337 | consider it more useful to permit linking proprietary applications with the 338 | library. If this is what you want to do, use the GNU Lesser General 339 | Public License instead of this License. 340 | 341 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # MicrobialGenomicsScripts 2 | A selection of (very) short scripts for analyzing microbial genomes. I have created this repository as a place to store and share any useful scripts for analyzing microbial genomes. 3 | 4 | ###Usage: 5 | 6 | *calc_coverage_breadth.py* 7 | 8 | Calculates number of reads, coverage, and breadth of coverage of reads mapping to a set of regions (presumably genes). 
Called like so (the final argument is the read length): 9 | 10 | `python calc_coverage_breadth.py genes.pos example.sorted.bam 200` 11 | 12 | Where genes.pos is a file formatted like: 13 | ``` 14 | 14_0903_02_20cm_scaffold_13826 7320 7544 15 | 14_0903_02_20cm_scaffold_5262 27 1703 16 | 14_0903_02_20cm_scaffold_5308 3879 5147 17 | 14_0903_02_20cm_scaffold_5308 5156 5650 18 | ``` 19 | 20 | *gc_window.py* 21 | 22 | Calculates GC% for a window size across a genome. Default is 10 kbp, and can be changed in the code if needed. 23 | 24 | `python gc_window.py genome.fna` 25 | 26 | *translate_all_frames.py* 27 | 28 | Translates a FASTA file of DNA sequences in all 6 reading frames, ignoring start / stop codons. 29 | 30 | `python translate_all_frames.py sequences.fna` 31 | 32 | *isoelectric.py* 33 | 34 | Calculates the isoelectric points for all proteins in a proteome and outputs them as a list that can be easily imported into R. Useful for showing the isoelectric point distribution of a new halophilic proteome. 35 | 36 | `python isoelectric.py proteins.faa` 37 | 38 | 39 | *codon_freq.py* 40 | 41 | Calculates codon frequencies for a set of sequences in a FASTA file (e.g. gene sequences, such as in a .ffn file). 42 | 43 | `python codon_freq.py genes.ffn` 44 | 45 | *simulate_assembly.py* 46 | 47 | Naively simulates a metagenomic assembly by taking in a directory of genomes and generating randomly sized and spaced contigs from those genomes. 48 | 49 | `python simulate_assembly.py ./directory_with_fasta_genomes/` 50 | 51 | *batchRandomSequences.pl* 52 | 53 | Generates random subsets of N reads for a batch of fasta files in a directory. Useful if you want to quickly get a random sample of reads for a large number of samples. 54 | 55 | `perl batchRandomSequences.pl DIRECTORYNAME N` 56 | 57 | *filter_contigs.py* 58 | 59 | Filters a contig file to a minimum contig length.
60 | 61 | `python filter_contigs.py contigs.fna min_contig_length` 62 | 63 | *random_forest.py* 64 | 65 | Basic demonstration of how to import training data and create a random forest classifier using sklearn. 66 | 67 | *concat_alignments.py* 68 | 69 | Concatenates multiple gene / protein alignments. Takes the directory containing the `*.fa` alignment files as its argument. 70 | 71 | `python concat_alignments.py ./directory_with_alignments/` 72 | 73 | *get_genomes.py* 74 | 75 | Automatically / recursively downloads genomes from NCBI's FTP. 76 | 77 | *filter_contigs_by_blastp.py* 78 | 79 | A script for filtering contigs by comparing them to specific known reference proteomes. The inputs are the BLAST results of the Prodigal-predicted proteins of the contigs against a database of the known reference proteomes, and the predicted protein FASTA file itself. 80 | -------------------------------------------------------------------------------- /batchRandomSequences.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl -w 2 | #usage: perl batchRandomSequences.pl DIRECTORYNAME N 3 | #where DIRECTORYNAME is the directory of fasta sequences, and N is the number of sequences that should be chosen randomly from each file. 4 | #The results are written to a new directory called DIRECTORYNAMERandomizedOutput.
5 | #Selects N random sequences from all fasta files in directory B 6 | use List::Util qw(first max maxstr min minstr reduce shuffle sum); 7 | 8 | #Directory to target 9 | $dir = $ARGV[0]; 10 | 11 | #Number of sequences to get 12 | $seqnum = $ARGV[1]; 13 | 14 | #opens the directory 15 | opendir(DIR, $dir) or die $!; 16 | 17 | my @seqFiles; 18 | 19 | #Loops through all files in the directory 20 | while (my $file = readdir(DIR)) { 21 | # Use a regular expression to ignore files beginning with a period 22 | next if ($file =~ m/^\./); 23 | 24 | #Check for FASTA or fas file format 25 | next if ($file !~ /.fasta/ && $file !~ /.fas/); 26 | #Add filename to array 27 | push @seqFiles, $file; 28 | } 29 | 30 | #Make output directory 31 | system("mkdir -p " . $dir . "RandomizedOutput"); 32 | 33 | 34 | foreach my $seqf (@seqFiles){ 35 | #open the sequence file 36 | open(my $in, "<", $dir . "/" . $seqf) or die "Can't find that sequence file!"; 37 | @lines = <$in>; 38 | close($in); 39 | 40 | #Convert the sequence files lines to string 41 | $seqFile = ""; 42 | foreach (@lines){ 43 | $seqFile .= $_; 44 | } 45 | 46 | #Split into individual sequences 47 | @seqs = split(/>/, $seqFile); 48 | 49 | #Add back in the > symbol to each sequence 50 | foreach(@seqs){ 51 | $_ = ">$_" 52 | } 53 | 54 | #Shuffle the sequences randomly 55 | @seqs = shuffle @seqs; 56 | 57 | #Choose the first N sequences 58 | @seqs = splice(@seqs, 0, $seqnum); 59 | 60 | #Get filename without extension 61 | #Print them to output.fasta 62 | open (MYFILE, ">>", $dir . "RandomizedOutput/" . 
$seqf); 63 | print MYFILE @seqs; 64 | close (MYFILE); 65 | 66 | } 67 | 68 | 69 | closedir(DIR); 70 | -------------------------------------------------------------------------------- /calc_coverage_breadth.py: -------------------------------------------------------------------------------- 1 | import pysam 2 | import sys 3 | 4 | 5 | gene_list = sys.argv[1] 6 | bam = sys.argv[2] 7 | read_length = sys.argv[3] 8 | 9 | f = open(gene_list) 10 | positions = [] 11 | for line in f.readlines(): 12 | positions.append([line.split()[0], int(line.split()[1]), int(line.split()[2].strip()), line.split()[3]]) 13 | f.close() 14 | 15 | samfile = pysam.AlignmentFile(bam, "rb") 16 | 17 | for pos in positions: 18 | iter = samfile.fetch(pos[0], pos[1], pos[2]) 19 | i = 0 20 | for x in iter: 21 | i += 1 22 | 23 | b = 0 24 | b2 = 0 25 | for pileupcolumn in samfile.pileup(pos[0], pos[1], pos[2], stepper = 'nofilter'): 26 | 27 | if pileupcolumn.reference_pos >= pos[1] and pileupcolumn.reference_pos < pos[2]: 28 | b2 += 1 29 | if pileupcolumn.nsegments > 0: 30 | b += 1 31 | print(pos[0] + "\t" + pos[3] + "\t" + str(pos[1]) + "\t" + str(pos[2]) + "\t" + str(i) + "\t" + str(float(i)*int(read_length) / (pos[2]-pos[1])) + "\t" + str(float(b) / (pos[2]-pos[1]))) 32 | -------------------------------------------------------------------------------- /codon_freq.py: -------------------------------------------------------------------------------- 1 | #Predicts codon frequencies for a FASTA file with one or more sequences (e.g. 
contigs) 2 | 3 | from Bio import SeqIO 4 | import sys 5 | 6 | def predict(seqs): 7 | 8 | CodonsDict = {'TTT': 0, 'TTC': 0, 'TTA': 0, 'TTG': 0, 'CTT': 0, 9 | 'CTC': 0, 'CTA': 0, 'CTG': 0, 'ATT': 0, 'ATC': 0, 10 | 'ATA': 0, 'ATG': 0, 'GTT': 0, 'GTC': 0, 'GTA': 0, 11 | 'GTG': 0, 'TAT': 0, 'TAC': 0, 'TAA': 0, 'TAG': 0, 12 | 'CAT': 0, 'CAC': 0, 'CAA': 0, 'CAG': 0, 'AAT': 0, 13 | 'AAC': 0, 'AAA': 0, 'AAG': 0, 'GAT': 0, 'GAC': 0, 14 | 'GAA': 0, 'GAG': 0, 'TCT': 0, 'TCC': 0, 'TCA': 0, 15 | 'TCG': 0, 'CCT': 0, 'CCC': 0, 'CCA': 0, 'CCG': 0, 16 | 'ACT': 0, 'ACC': 0, 'ACA': 0, 'ACG': 0, 'GCT': 0, 17 | 'GCC': 0, 'GCA': 0, 'GCG': 0, 'TGT': 0, 'TGC': 0, 18 | 'TGA': 0, 'TGG': 0, 'CGT': 0, 'CGC': 0, 'CGA': 0, 19 | 'CGG': 0, 'AGT': 0, 'AGC': 0, 'AGA': 0, 'AGG': 0, 20 | 'GGT': 0, 'GGC': 0, 'GGA': 0, 'GGG': 0} 21 | 22 | # this dictionary shows which codons encode the same AA 23 | SynonymousCodons = { 24 | 'CYS': ['TGT', 'TGC'], 25 | 'ASP': ['GAT', 'GAC'], 26 | 'SER': ['TCT', 'TCG', 'TCA', 'TCC', 'AGC', 'AGT'], 27 | 'GLN': ['CAA', 'CAG'], 28 | 'MET': ['ATG'], 29 | 'ASN': ['AAC', 'AAT'], 30 | 'PRO': ['CCT', 'CCG', 'CCA', 'CCC'], 31 | 'LYS': ['AAG', 'AAA'], 32 | 'STOP': ['TAG', 'TGA', 'TAA'], 33 | 'THR': ['ACC', 'ACA', 'ACG', 'ACT'], 34 | 'PHE': ['TTT', 'TTC'], 35 | 'ALA': ['GCA', 'GCC', 'GCG', 'GCT'], 36 | 'GLY': ['GGT', 'GGG', 'GGA', 'GGC'], 37 | 'ILE': ['ATC', 'ATA', 'ATT'], 38 | 'LEU': ['TTA', 'TTG', 'CTC', 'CTT', 'CTG', 'CTA'], 39 | 'HIS': ['CAT', 'CAC'], 40 | 'ARG': ['CGA', 'CGC', 'CGG', 'CGT', 'AGG', 'AGA'], 41 | 'TRP': ['TGG'], 42 | 'VAL': ['GTA', 'GTC', 'GTG', 'GTT'], 43 | 'GLU': ['GAG', 'GAA'], 44 | 'TYR': ['TAT', 'TAC'] 45 | } 46 | 47 | 48 | # Count codons 49 | CodonsNormalized = CodonsDict 50 | for seq in seqs: 51 | start = 0 52 | end = 3 53 | while end <= len(str(seq.seq)): 54 | codon = str(seq.seq[start:end]) 55 | if codon in CodonsDict.keys(): 56 | CodonsDict[codon] += 1 57 | start += 3 58 | end += 3 59 | 60 | #Normalize by AA 61 | for aa in SynonymousCodons.keys(): 62 | 
total = 0 63 | for codon in SynonymousCodons[aa]: 64 | total += CodonsDict[codon] 65 | for codon in SynonymousCodons[aa]: 66 | CodonsNormalized[codon] = float(CodonsDict[codon]) / float(total) 67 | 68 | print "{", 69 | i = 0 70 | for codon in sorted(CodonsNormalized.keys()): 71 | if i != 0: 72 | print ", '" + codon + "':" + str(round(CodonsNormalized[codon],3)), 73 | else: 74 | print "'" + codon + "':" + str(round(CodonsNormalized[codon],3)), 75 | i += 1 76 | print "}", 77 | if __name__ == "__main__": 78 | handle = open(sys.argv[1], "rU") 79 | records = list(SeqIO.parse(handle, "fasta")) 80 | handle.close() 81 | predict(records) 82 | -------------------------------------------------------------------------------- /concat_alignments.py: -------------------------------------------------------------------------------- 1 | #A quick python script for concatenating multiple gene alignments. 2 | #Useful for creating concatenated protein phylogenies. 3 | #Usage: python concat_alignments.py ./directory_with_alignments/ 4 | 5 | import glob, os, sys 6 | from Bio import SeqIO 7 | 8 | os.chdir(sys.argv[1]) 9 | concat = {} 10 | 11 | for file in glob.glob("*.fa"): 12 | input_handle = open(file, "rU") 13 | for record in SeqIO.parse(input_handle, "fasta") : 14 | if record.id in concat.keys(): 15 | concat[record.id] += str(record.seq) 16 | else: 17 | concat[record.id] = str(record.seq) 18 | 19 | for c in concat: 20 | print ">" + c 21 | print concat[c] 22 | -------------------------------------------------------------------------------- /filter_contigs.py: -------------------------------------------------------------------------------- 1 | #Usage: python filter_contigs.py filename min_contig_length 2 | from Bio import SeqIO 3 | import sys 4 | handle = open(sys.argv[1], "rU") 5 | l = SeqIO.parse(handle, "fasta") 6 | counter = 0 7 | for s in l: 8 | if len(s.seq) >= int(sys.argv[2]): 9 | print ">" + s.id 10 | print s.seq 11 | counter += 1 12 | 
-------------------------------------------------------------------------------- /filter_contigs_by_blastp.py: -------------------------------------------------------------------------------- 1 | with open('./results.blast') as f: 2 | nano = [] 3 | natro = [] 4 | cyano = [] 5 | contigs = {} 6 | contig_genes = [] 7 | for line in f: 8 | gene = line.split()[0] 9 | contig = int(gene.split("_")[0]) 10 | taxa = line.split("[")[1].split()[0] 11 | if contig not in contigs.keys(): 12 | contigs[contig] = [] 13 | if gene not in contig_genes: 14 | contigs[contig].append(taxa) 15 | contig_genes.append(gene) 16 | 17 | 18 | all_genes = {} 19 | with open('./all_genes') as f: 20 | for line in f: 21 | c = int(line.split(">")[1].split("_")[0]) 22 | g = line.split("_")[1] 23 | all_genes[c] = g 24 | 25 | 26 | seqs = {} 27 | from Bio import SeqIO 28 | handle = open("halite_bins.009.fasta", "rU") 29 | for record in SeqIO.parse(handle, "fasta") : 30 | seqs[int(record.id)] = record.seq 31 | handle.close() 32 | 33 | from collections import Counter 34 | 35 | for contig in sorted(contigs.keys()): 36 | c = Counter(contigs[contig]) 37 | 38 | ## Conditions for accepting a contig. 39 | ## Here it's a greater number of blastp products on that contig to the specified genome 40 | ## compared to the others, and at least 25% of all products on that contig match to that genome. 
41 | if c['Halothece'] > (c['Natronomonas'] + c['Candidatus']): 42 | if c['Halothece'] > 0.25 * int(all_genes[contig]): 43 | print ">" + str(contig) 44 | print seqs[contig] 45 | -------------------------------------------------------------------------------- /gc_window.py: -------------------------------------------------------------------------------- 1 | #test 2 | from Bio import SeqIO 3 | from Bio.SeqRecord import SeqRecord 4 | from random import randint 5 | from Bio.SeqUtils import GC 6 | import sys 7 | #There should be one and only one record, the entire genome: 8 | print "reading" 9 | seq_record = SeqIO.read(open(sys.argv[1]), "fasta") 10 | 11 | gcs = [] 12 | 13 | #k is the window size 14 | k = 10000 15 | j = 0 16 | print "Read" 17 | for i in range(0, len(seq_record.seq)) : 18 | if j <= len(seq_record.seq): 19 | start=j 20 | # 21 | end=j+k 22 | window_frag=seq_record.seq[start:end] 23 | print str(round(GC(window_frag),2)) + ",", 24 | j += k 25 | else: 26 | break 27 | 28 | #How to visualize results in R with ggplot2: 29 | #m + geom_density() + geom_vline(xintercept=41.5, col='red') + xlim(40,50) + xlab("GC Content (%)") + ylab("Density") + theme_minimal() 30 | -------------------------------------------------------------------------------- /get_genomes.py: -------------------------------------------------------------------------------- 1 | import urllib2 2 | import os 3 | with open('./.listing') as f: 4 | for line in f: 5 | name = line.split()[8].strip() 6 | url = "ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/" + name + "/latest_assembly_versions" 7 | try: 8 | response = urllib2.urlopen(url, timeout=5) 9 | newurl = url + "/" + response.read().split()[8] 10 | print newurl 11 | os.system("wget -r " + newurl + "/* -A '*.faa.gz' -O " + name + ".faa.gz") 12 | except: 13 | print "ERROR" 14 | -------------------------------------------------------------------------------- /isoelectric.py: 
--------------------------------------------------------------------------------
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from Bio import SeqIO
import sys
handle = open(sys.argv[1], 'rU')
records = list(SeqIO.parse(handle, "fasta"))
for record in records:
    prot = ProteinAnalysis(str(record.seq))
    print prot.isoelectric_point()
--------------------------------------------------------------------------------
/random_forest.py:
--------------------------------------------------------------------------------
# Import the random forest package
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score
import numpy as np
import pickle
dataset = np.loadtxt('training_data.csv', delimiter=",")

# Create the random forest object which will include all the parameters
# for the fit
forest = RandomForestClassifier(n_estimators = 100)

# Fit the training data (columns 1 onward) to the class labels (column 0)
# and create the decision trees
forest_fit = forest.fit(dataset[0::,1::],dataset[0::,0])

importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)

# Take the same decision trees and run it on the test data
#output = forest_fit.predict(test_data)

# 5-fold cross-validation
scores = cross_val_score(forest_fit, dataset[0::,1::], dataset[0::,0], cv=5)

# Serialize the fitted model to a string (pickle.dump would need a file object)
s = pickle.dumps(forest_fit)
--------------------------------------------------------------------------------
/simulate_assembly.py:
--------------------------------------------------------------------------------
import os
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from random import randint

j = 1
import sys
for root, subdirs, files in os.walk(sys.argv[1]):
    for f in files:
        seq = os.path.join(root,f)
        seqr = SeqIO.read(open(seq), "fasta")
        limit=len(seqr.seq)
        #generate one random fragment from this genome (genomes of 10,000 bp or less are skipped)
        #random size: at least 1,000 bp and at most 0.75 of the genome length
        if limit > 10000:
            #Maximum is 100,000 bp or 0.75 * genome size, whichever is smaller
            max_s = int(0.75*limit)
            if max_s > 100000:
                max_s = 100000
            size = randint(1000,max_s)
            #random start- anywhere from 0 to genome_length - size
            start = randint(0,limit-size)
            end = start + size

            fragment = seqr.seq[start:end]

            print ">" + str(j)
            print fragment

            j += 1
--------------------------------------------------------------------------------
/top_10_contigs.py:
--------------------------------------------------------------------------------
#Usage: python top_10_contigs.py filename
from Bio import SeqIO
import sys
import operator

handle = open(sys.argv[1], "rU")
l = SeqIO.parse(handle, "fasta")
counter = 0
contig_lengths = {}
contigs = {}
c = []
#Keep only contigs longer than 5,000 bp
for s in l:
    if len(str(s.seq)) > 5000:
        contigs[s.id] = s.seq
        contig_lengths[s.id] = len(str(s.seq))
        c.append(len(s.seq))
sorted_x = sorted(contig_lengths.items(), key=lambda x:x[1], reverse=True)

#Print the 10 longest contigs (or fewer, if fewer than 10 passed the length filter)
for contig_id, length in sorted_x[:10]:
    print str(contig_id) + ":" + str(length)
--------------------------------------------------------------------------------
/translate_all_frames.py:
--------------------------------------------------------------------------------
#Translates DNA sequences in all 6 reading frames, ignoring start / stop codons.

from Bio import SeqIO
from Bio.Seq import Seq
import sys

input_handle = open(sys.argv[1], "rU")
for record in SeqIO.parse(input_handle, "fasta") :
    #Frame 1
    original = record.seq
    print ">" + str(record.id) + "_1"
    print str(record.seq.translate()).replace("*","")
    #Frame 2
    print ">" + str(record.id) + "_2"
    record.seq = Seq(str(record.seq)[1:])
    print str(record.seq.translate()).replace("*","")
    #Frame 3
    print ">" + str(record.id) + "_3"
    record.seq = Seq(str(record.seq)[1:])
    print str(record.seq.translate()).replace("*","")

    record.seq = original.reverse_complement()

    #Frame -1
    print ">" + str(record.id) + "_-1"
    print str(record.seq.translate()).replace("*","")
    #Frame -2
    record.seq = Seq(str(record.seq)[1:])
    print ">" + str(record.id) + "_-2"
    print str(record.seq.translate()).replace("*","")
    #Frame -3
    record.seq = Seq(str(record.seq)[1:])
    print ">" + str(record.id) + "_-3"
    print str(record.seq.translate()).replace("*","")
--------------------------------------------------------------------------------
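Illustrative sketch (not a file in this repository): the six-frame translation done by translate_all_frames.py can be reproduced without Biopython, which may help when checking what each frame should contain. The version below is a minimal Python 3 sketch under stated assumptions: it uses the standard genetic code (NCBI table 1), drops trailing bases that do not fill a codon, writes "X" for codons containing ambiguous bases, and removes stop codons ("*") as the script does. The names CODON_TABLE, reverse_complement, translate and six_frames are hypothetical helpers introduced here, not part of the repository or of Biopython.

```python
# Standard genetic code (NCBI table 1), built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; trailing bases that do not fill a codon are dropped."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    # Codons with characters outside ACGT are reported as "X" (an assumption of this sketch).
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def six_frames(dna):
    """Map frame labels "1".."3" and "-1".."-3" to translations with stop codons removed."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[str(offset + 1)] = translate(dna[offset:]).replace("*", "")
        frames["-" + str(offset + 1)] = translate(rc[offset:]).replace("*", "")
    return frames


if __name__ == "__main__":
    # Hypothetical record id and sequence, printed in the same ">id_frame" layout as the script.
    for label, protein in sorted(six_frames("ATGGCCTAA").items()):
        print(">example_" + label)
        print(protein)
```

Frames 2 and 3 (and -2, -3) start one and two bases into the strand, which matches the script's repeated `[1:]` slicing of the sequence.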