├── LICENSE ├── README.md ├── calc_fastq-stats ├── README.md └── calc_fastq-stats.pl ├── cat_seq ├── README.md └── cat_seq.pl ├── cdd2cog ├── README.md └── cdd2cog.pl ├── cds_extractor ├── README.md └── cds_extractor.pl ├── ecoli_mlst ├── ADK.fas ├── FUMC.fas ├── GYRB.fas ├── ICD.fas ├── MDH.fas ├── PURA.fas ├── README.md ├── RECA.fas ├── ecoli_mlst.pl └── publicSTs.txt ├── genomes_feature_table ├── README.md └── genomes_feature_table.pl ├── ncbi_ftp_download ├── README.md ├── ncbi_ftp_concat_unpack.pl └── ncbi_ftp_download.sh ├── order_fastx ├── README.md └── order_fastx.pl ├── po2anno ├── README.md └── po2anno.pl ├── po2group_stats ├── README.md ├── pics │ ├── README.md │ ├── venn_diagram_logics.png │ └── venn_diagram_logics.svg └── po2group_stats.pl ├── prot_finder ├── README.md ├── binary_group_stats.pl ├── prot_binary_matrix.pl ├── prot_finder.pl ├── prot_finder_pipe.sh └── transpose_matrix.pl ├── rename_fasta_id ├── README.md └── rename_fasta_id.pl ├── revcom_seq ├── README.md └── revcom_seq.pl ├── rod_finder ├── README.md ├── blast_rod_finder.pl └── blast_rod_finder_legacy.sh ├── sam_insert-size ├── README.md └── sam_insert-size.pl ├── sample_fastx-txt ├── README.md └── sample_fastx-txt.pl ├── seq_format-converter ├── README.md └── seq_format-converter.pl ├── tbl2tab ├── README.md ├── example.tbl ├── example2.tab └── tbl2tab.pl └── trunc_seq ├── README.md └── trunc_seq.pl /README.md: -------------------------------------------------------------------------------- 1 | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.215824.svg)](http://dx.doi.org/10.5281/zenodo.215824) 2 | 3 | bac-genomics-scripts 4 | ==================== 5 | 6 | A collection of scripts intended for **bacterial genomics** (some might also be useful for eukaryotes) from **high-throughput sequencing** (aka next-generation sequencing). 
7 | 8 | * [Summary](#summary) 9 | * [Introduction](#introduction) 10 | * [Installation recommendations](#installation-recommendations) 11 | * [Dependencies](#dependencies) 12 | * [UNIX loops](#unix-loops) 13 | * [Windows - UNIX linebreak problems](#windows---unix-linebreak-problems) 14 | * [Citation](#citation) 15 | * [License](#license) 16 | * [Author - contact](#author---contact) 17 | 18 | ## Summary 19 | 20 | * Basic stats for bases and reads in FASTQ files: [`calc_fastq-stats`](/calc_fastq-stats) 21 | * Concatenate multi-sequence files (RichSeq EMBL or GENBANK format, or FASTA format) to a single artificial file: [`cat_seq`](/cat_seq) 22 | * COG ([cluster of orthologous groups](http://www.ncbi.nlm.nih.gov/COG/)) classification of proteins: [`cdd2cog`](/cdd2cog) 23 | * Extraction of protein/nucleotide sequences from CDSs: [`cds_extractor`](/cds_extractor) 24 | * MLST (multilocus sequence typing) assignment and allele extraction for *Escherichia coli* ([Achtman scheme](http://mlst.warwick.ac.uk/mlst/)): [`ecoli_mlst`](/ecoli_mlst) 25 | * Create a feature table for all annotated primary features in RichSeq (EMBL or GENBANK format) files: [`genomes_feature_table`](/genomes_feature_table) 26 | * **Deprecated!** Batch downloading of sequences from NCBI's FTP server: [`ncbi_ftp_download`](/ncbi_ftp_download) 27 | * Order sequence entries in FASTA/FASTQ files according to an ID list: [`order_fastx`](/order_fastx) 28 | * Create an ortholog/paralog annotation comparison matrix from [*Proteinortho5*](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output: [`po2anno`](/po2anno) 29 | * Calculate stats and plot venn diagrams for genome groups according to orthologs/paralogs from [*Proteinortho5*](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output, i.e. 
overall presence/absence statistics for groups of genomes and not simply single genomes: [`po2group_stats`](/po2group_stats) 30 | * Strain panel query protein search with **BLASTP** plus concise hit summary, optional alignment, and presence/absence matrix. Also included, scripts to transpose the matrix and calculate overall presence/absence statistics for groups of columns in the matrix: [`prot_finder`](/prot_finder) 31 | * Rename FASTA ID lines and optionally numerate them: [`rename_fasta_id`](/rename_fasta_id) 32 | * Reverse complement (multi-)sequence files (RichSeq EMBL or GENBANK format, or FASTA format): [`revcom_seq`](/revcom_seq) 33 | * Regions of difference (ROD) detection in genomes with **BLASTN**: [`rod_finder`](/rod_finder) 34 | * NGS paired-end library insert size estimation from BAM/SAM: [`sam_insert-size`](/sam_insert-size) 35 | * Randomly subsample FASTA, FASTQ, or TEXT files with [*reservoir sampling*](https://en.wikipedia.org/wiki/Reservoir_sampling): [`sample_fastx-txt`](/sample_fastx-txt) 36 | * Convert a sequence file to another format with [BioPerl](http://www.bioperl.org): [`seq_format-converter`](/seq_format-converter) 37 | * Manual curation of annotation in NCBI's TBL format (e.g. from [Prokka](http://www.vicbioinformatics.com/software.prokka.shtml) automatic annotation) in a spreadsheet software: [`tbl2tab`](/tbl2tab) 38 | * Truncate sequence files (RichSeq EMBL or GENBANK format, or FASTA format) according to given coordinates: [`trunc_seq`](/trunc_seq) 39 | * And an assortment of smaller scripts for tasks like (not yet uploaded to GitHub): alignment format converters, dnadiff, GC% calculation etc. 40 | 41 | ## Introduction 42 | 43 | All the scripts here are written in [**Perl**](https://www.perl.org/) (some include bash shell wrappers). 44 | 45 | Each script is hosted in its own folder, so that a separate *README.md* can be included for more information. 
However, all Perl scripts additionally include a usage/help text or a comprehensive [POD](http://perldoc.perl.org/perlpod.html) (Plain Old Documentation), which is shown by calling the script either without arguments/options or with option **-h|-help**. 46 | 47 | The scripts are only tested under UNIX; some won't run in a Windows environment (because of included UNIX commands). If you are on Windows, an alternative might be [Cygwin](http://cygwin.com/). 48 | 49 | ## Installation recommendations 50 | 51 | To download the repository, either use the '[Download ZIP](https://github.com/aleimba/bac-genomics-scripts/archive/master.zip)' link after clicking the green 'Clone or download' button at the top, or clone the repository with `git`: 52 | 53 | git clone https://github.com/aleimba/bac-genomics-scripts.git 54 | 55 | If there is an update to this GitHub repository (see above [commits](https://github.com/aleimba/bac-genomics-scripts/commits/master) and [releases](https://github.com/aleimba/bac-genomics-scripts/releases)), you can refresh your **local** repository by using the following command **inside** the local folder: 56 | 57 | git pull 58 | 59 | To install the scripts, copy them e.g. to a home */bin* folder in your *PATH* and make them executable: 60 | 61 | $ find . \( -name '*.pl' -o -name '*.sh' -o -name '*.fas' -o -name '*.txt' \) -exec cp {} ~/bin \; 62 | $ chmod u+x ~/bin/*.pl ~/bin/*.sh 63 | 64 | The scripts can then be run from anywhere on your system. Alternatively, call them directly by prefixing the command with `perl`, or with './' for bash wrappers: 65 | 66 | $ perl /path/to/script/script.pl 67 | 68 | or 69 | 70 | $ ./script.sh 71 | 72 | **Single** scripts can be downloaded as well. For this purpose click on the folder you're interested in and then on the link of the script. There click on the **Raw** button and save this page to a file (without **Raw** you'll get an unusable html file). This is also true for other files (e.g. PDFs etc.).
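Whether the copied scripts can be run from anywhere depends on the install folder actually being listed in *PATH*. A minimal sketch to check this, assuming the install folder is `~/bin` as in the commands above; `in_path` and `BIN_DIR` are hypothetical names used only for illustration and are not part of the repository:

```shell
# in_path DIR: succeed if DIR is one of the colon-separated entries of PATH
# (hypothetical helper, not part of bac-genomics-scripts)
in_path() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumed install location; adjust if the scripts were copied elsewhere
BIN_DIR="${BIN_DIR:-$HOME/bin}"

if in_path "$BIN_DIR"; then
  echo "$BIN_DIR is in PATH"
else
  echo "$BIN_DIR is not in PATH; consider adding: export PATH=\"$BIN_DIR:\$PATH\""
fi
```

If the folder is reported as missing, the suggested `export` line would typically go into a shell startup file such as `~/.bashrc`.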
73 | 74 | ## Dependencies 75 | 76 | All scripts are tested with Perl v5.22.1. 77 | 78 | Most of the Perl scripts use modules from [BioPerl](http://www.bioperl.org), as stated in their respective *README.md* or POD, so BioPerl has to be installed on your system. For BioPerl installation instructions see the website ([**Installation**](http://bioperl.org/INSTALL.html)). 79 | 80 | Some scripts need additional Perl modules, which are stated in the associated *README.md* or POD. If they're not yet installed on your system, get them from [CPAN](http://www.cpan.org/) (installation instructions can be found on the website, see e.g. [**Getting Started...Installing Perl Modules**](http://www.cpan.org/modules/INSTALL.html) or [**FAQ**](http://www.cpan.org/misc/cpan-faq.html#How_install_Perl_modules)). 81 | 82 | Furthermore, some scripts call the statistical computing language [**R**](http://www.r-project.org/) and dependent packages for plotting purposes (again see the respective *README.md* or POD). 83 | 84 | ## UNIX loops 85 | 86 | A handy tip: if you want to run a script on all files in the current working directory, you can use a **loop** in UNIX, e.g.: 87 | 88 | $ for file in *.fasta; do perl script.pl "$file"; done 89 | 90 | ## Windows - UNIX linebreak problems 91 | 92 | Finally, some of the scripts don't handle Windows-formatted line breaks; consider running such input files through the UNIX utility [dos2unix](http://dos2unix.sourceforge.net/): 93 | 94 | $ dos2unix input 95 | 96 | ## Citation 97 | For now cite the latest major release (tag: [***bovine_ecoli_mastitis***](https://github.com/aleimba/bac-genomics-scripts/releases)) hosted on [Zenodo](https://zenodo.org/): 98 | 99 | **Leimbach A**. 2016. bac-genomics-scripts: Bovine *E. coli* mastitis comparative genomics edition. Zenodo. <http://dx.doi.org/10.5281/zenodo.215824>. 100 | 101 | Also, all scripts have a version number (see option **-v**), which might be included in a materials and methods section.
102 | 103 | ## License 104 | 105 | All scripts are licensed under GPLv3 which is contained in the file [*LICENSE*](./LICENSE). 106 | 107 | ## Author - contact 108 | For help, suggestions, bugs etc. use the GitHub issues or write an email to aleimba [at] gmx [dot] de. 109 | 110 | Andreas Leimbach (Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 111 | -------------------------------------------------------------------------------- /calc_fastq-stats/README.md: -------------------------------------------------------------------------------- 1 | calc_fastq-stats 2 | ================ 3 | 4 | `calc_fastq-stats.pl` is a script to calculate basic statistics for bases and reads in a FASTQ file. 5 | 6 | * [Synopsis](#synopsis) 7 | * [Description](#description) 8 | * [Usage](#usage) 9 | * [Options](#options) 10 | * [Mandatory options](#mandatory-options) 11 | * [Optional options](#optional-options) 12 | * [Output](#output) 13 | * [Run environment](#run-environment) 14 | * [Dependencies](#dependencies) 15 | * [Author - contact](#author---contact) 16 | * [Citation, installation, and license](#citation-installation-and-license) 17 | * [Changelog](#changelog) 18 | 19 | ## Synopsis 20 | 21 | perl calc_fastq-stats.pl -i reads.fastq 22 | 23 | **or** 24 | 25 | gzip -dc reads.fastq.gz | perl calc_fastq-stats.pl -i - 26 | 27 | ## Description 28 | 29 | The script calculates some simple statistics, like individual and total base 30 | counts, GC content, and basic stats for the read lengths, and 31 | read/base qualities in a FASTQ file. The GC content calculation does 32 | not include 'N's. Stats are printed to *STDOUT* and optionally to an 33 | output file. 
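To illustrate what excluding 'N's from the GC content means, here is a rough command-line sketch. It is not the script's implementation: it assumes four-line FASTQ records, counts only A/C/G/T (case-insensitive) and ignores every other character, and `gc_content` is a hypothetical helper name:

```shell
# gc_content FILE: GC% over A/C/G/T bases only, so 'N's do not enter the denominator
# (hypothetical helper; assumes each read occupies exactly four lines)
gc_content() {
  awk 'NR % 4 == 2 {
         seq = toupper($0)
         gc += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C bases removed
         at += gsub(/[AT]/, "", seq)   # same for A/T bases
       }
       END {
         if (gc + at > 0) printf "%.2f\n", 100 * gc / (gc + at)
         else print "NA"
       }' "$1"
}
```

For example, `gc_content reads.fastq` prints a single percentage; a read `GCNNAT` contributes two G/C and two A/T bases, while the two 'N's are ignored.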
34 | 35 | Because the quality of a read degrades over its length with all NGS 36 | machines, it is advisable to also plot the quality for each cycle as 37 | implemented in tools like 38 | [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) 39 | or the [fastx-toolkit](http://hannonlab.cshl.edu/fastx_toolkit/). 40 | 41 | If the sequence and the quality values are interrupted by line 42 | breaks (i.e. a read is **not** represented by four lines), please fix 43 | with Heng Li's [seqtk](https://github.com/lh3/seqtk): 44 | 45 | seqtk seq -l 0 infile.fastq > outfile.fastq 46 | 47 | An alternative tool, which is a lot faster, is **fastq-stats** from 48 | [ea-utils](https://code.google.com/p/ea-utils/). 49 | 50 | ## Usage 51 | 52 | zcat reads.fastq.gz | perl calc_fastq-stats.pl -i - -q 64 -c 175000000 -n 3000000 53 | 54 | ## Options 55 | 56 | ### Mandatory options 57 | 58 | - -i, -input 59 | 60 | Input FASTQ file or piped STDIN (-) from a gzipped file 61 | 62 | - -q, -qual_offset 63 | 64 | ASCII quality offset of the Phred (Sanger) quality values [default 33] 65 | 66 | ### Optional options 67 | 68 | - -h, -help: 69 | 70 | Help (perldoc POD) 71 | 72 | - -c, -coverage_limit 73 | 74 | Number of bases to sample from the top of the file 75 | 76 | - -n, -num_read 77 | 78 | Number of reads to sample from the top of the file 79 | 80 | - -o, -output 81 | 82 | Print stats in addition to *STDOUT* to the specified output file 83 | 84 | - -v, -version 85 | 86 | Print version number to *STDERR* 87 | 88 | ## Output 89 | 90 | - *STDOUT* 91 | 92 | Calculated stats are printed to *STDOUT* 93 | 94 | - (outfile) 95 | 96 | Optional outfile for stats 97 | 98 | ## Run environment 99 | 100 | The Perl script runs under Windows and UNIX flavors. 
101 | 102 | ## Dependencies 103 | 104 | If the following modules are not installed get them from 105 | [CPAN](http://www.cpan.org/): 106 | 107 | - `Statistics::Descriptive` 108 | 109 | Perl module to calculate basic descriptive statistics 110 | 111 | - `Statistics::Descriptive::Discrete` 112 | 113 | Perl module to calculate descriptive statistics for discrete data sets 114 | 115 | - `Statistics::Descriptive::Weighted` 116 | 117 | Perl module to calculate descriptive statistics for weighted variates 118 | 119 | ## Author - contact 120 | 121 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 122 | 123 | ## Citation, installation, and license 124 | 125 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 126 | 127 | ## Changelog 128 | 129 | - v0.1 (28.10.2014) 130 | -------------------------------------------------------------------------------- /cat_seq/README.md: -------------------------------------------------------------------------------- 1 | cat_seq 2 | ======= 3 | 4 | A script to merge multi-sequence RichSeq files into one single-entry 'artificial' sequence file. 
5 | 6 | * [Synopsis](#synopsis) 7 | * [Description](#description) 8 | * [Usage](#usage) 9 | * [Merge multi-sequence file](#merge-multi-sequence-file) 10 | * [Merge multi-sequence file and specify different output format](#merge-multi-sequence-file-and-specify-different-output-format) 11 | * [UNIX loop to concatenate each multi-sequence file in the current working directory](#unix-loop-to-concatenate-each-multi-sequence-file-in-the-current-working-directory) 12 | * [Concatenate multi-sequence fasta files faster with UNIX's `grep`](#concatenate-multi-sequence-fasta-files-faster-with-unixs-grep) 13 | * [Output](#output) 14 | * [Dependencies](#dependencies) 15 | * [Run environment](#run-environment) 16 | * [Alternative software](#alternative-software) 17 | * [Author - contact](#author---contact) 18 | * [Citation, installation, and license](#citation-installation-and-license) 19 | * [Changelog](#changelog) 20 | 21 | ## Synopsis 22 | 23 | perl cat_seq.pl multi-seq_file.embl 24 | 25 | ## Description 26 | 27 | This script concatenates multiple sequences in a RichSeq file (embl or genbank, but also fasta) to a single artificial sequence. The first sequence in the file is used as a foundation to add the subsequent sequences, along with all features and annotations. 28 | 29 | Optionally, a different output file format can be specified (fasta/embl/genbank). 30 | 31 | ## Usage 32 | 33 | ### Merge multi-sequence file 34 | 35 | perl cat_seq.pl multi-seq_file.gbk 36 | 37 | ### Merge multi-sequence file and specify different output format 38 | 39 | perl cat_seq.pl multi-seq_file.embl [fasta|genbank] 40 | 41 | ### UNIX loop to concatenate each multi-sequence file in the current working directory 42 | 43 | for i in *.[embl|fasta|gbk]; do perl cat_seq.pl $i [embl|fasta|genbank]; done 44 | 45 | ### Concatenate multi-sequence fasta files faster with UNIXs *grep* 46 | If you're working only with fasta files UNIX's `grep` is a faster choice to concatenate sequences. 
47 | 48 | grep -v ">" seq.fasta > seq_artificial.fasta 49 | 50 | Subsequently add as a first line a fasta ID (starting with '>') with an editor. 51 | 52 | ## Output 53 | 54 | * *\_artificial.[embl|fasta|genbank] 55 | 56 | Concatenated artificial sequence in the input format, or optionally the specified output sequence format. 57 | 58 | ## Dependencies 59 | 60 | * BioPerl (tested with version 1.006901) 61 | 62 | ## Run environment 63 | 64 | The Perl script runs under Windows and UNIX flavors. 65 | 66 | ## Alternative software 67 | 68 | The EMBOSS (The European Molecular Biology Open Software Suite) application ***union*** can also be used for this task (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/union.html). 69 | 70 | ## Author - contact 71 | 72 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 73 | 74 | ## Citation, installation, and license 75 | 76 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 77 | 78 | ## Changelog 79 | 80 | * v0.1 (08.02.2013) 81 | -------------------------------------------------------------------------------- /cat_seq/cat_seq.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | 3 | use warnings; 4 | use strict; 5 | use Bio::SeqIO; # bioperl module to handle sequence input/output 6 | use Bio::Seq; # bioperl module to handle sequences with features 7 | use Bio::SeqUtils; # bioperl module with additional methods (including features) for Bio::Seq objects 8 | 9 | my $usage = "\n". 10 | "\t#################################################################\n". 
11 | "\t# $0 multi-seq_file [outfile-format] #\n". #$0 = program name 12 | "\t# #\n". 13 | "\t# The script merges RichSeq sequences (embl or genbank, but #\n". 14 | "\t# also fasta) in a multi-sequence file to one artificial #\n". 15 | "\t# sequence. The first sequence in the file is used as a #\n". 16 | "\t# foundation to add the subsequent sequences (along with #\n". 17 | "\t# features and annotations). Optionally, a different output #\n". 18 | "\t# file format can be specified (fasta/embl/genbank). #\n". 19 | "\t# The script uses bioperl (www.bioperl.org). #\n". 20 | "\t# #\n". 21 | "\t# Adjust unix loop to run the script with all multi-seq files #\n". 22 | "\t# in the current working directory, e.g.: #\n". 23 | "\t# for i in *.embl; do cat_seq.pl \$i genbank; done #\n". 24 | "\t# #\n". 25 | "\t# version 0.1 A Leimbach #\n". 26 | "\t# 08.02.2013 aleimba[at]gmx[dot]de #\n". 27 | "\t#################################################################\n\n"; 28 | 29 | ### Shift arguments from @ARGV or give usage 30 | my $multi_seq = shift or die $usage; 31 | my $format = shift; 32 | if ($multi_seq =~ /^--?h(elp)?$/) { # only a real help flag, not file names that merely contain '-h' 33 | die $usage; 34 | } 35 | 36 | 37 | ### Bio::SeqIO/Seq objects to concat the seqs 38 | print "\nConcatenating multi-sequence file \"$multi_seq\" to an artificial sequence file ...\n"; 39 | my $seqin = Bio::SeqIO->new(-file => "<$multi_seq"); # Bio::SeqIO object; no '-format' given, leave it to bioperl guessing 40 | my @seqs; # store Bio::Seq objects for each seq in the multi-seq file 41 | while (my $seq = $seqin->next_seq) { # Bio::Seq object 42 | push(@seqs, $seq); 43 | } 44 | Bio::SeqUtils->cat(@seqs); 45 | my $cat_seq = shift @seqs; # the first sequence in the array ($seqs[0]) was modified!
46 | 47 | 48 | ### Write the artificial/concatenated sequence (with its features) to output Bio::SeqIO object 49 | my $seqout; # Bio::SeqIO object 50 | if ($format) { # true if defined 51 | $multi_seq =~ s/^(.+)\.\w+$/$1_artificial\.$format/; 52 | $seqout = Bio::SeqIO->new(-file => ">$multi_seq", -format => "$format"); 53 | } else { 54 | $multi_seq =~ s/^(.+)(\.\w+)$/$1_artificial$2/; 55 | $seqout = Bio::SeqIO->new(-file => ">$multi_seq"); 56 | } 57 | $seqout->write_seq($cat_seq); 58 | print "Created new file \"$multi_seq\"!\n\n"; 59 | 60 | exit; 61 | -------------------------------------------------------------------------------- /cdd2cog/README.md: -------------------------------------------------------------------------------- 1 | cdd2cog 2 | ======= 3 | 4 | `cdd2cog.pl` is a script to assign COG categories to query protein sequences. 5 | 6 | * [Synopsis](#synopsis) 7 | * [Description](#description) 8 | * [Usage](#usage) 9 | * [RPS-BLAST+](#rps-blast) 10 | * [cdd2cog](#cdd2cog) 11 | * [Options](#options) 12 | * [Mandatory options](#mandatory-options) 13 | * [Optional options](#optional-options) 14 | * [Output](#output) 15 | * [Run environment](#run-environment) 16 | * [Author - contact](#author---contact) 17 | * [Acknowledgements](#acknowledgements) 18 | * [Citation, installation, and license](#citation-installation-and-license) 19 | * [Changelog](#changelog) 20 | 21 | ## Synopsis 22 | 23 | perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog 24 | 25 | ## Description 26 | For troubleshooting and a working example please see issue [#1](https://github.com/aleimba/bac-genomics-scripts/issues/1). 27 | 28 | The script assigns COG ([cluster of orthologous 29 | groups](http://www.ncbi.nlm.nih.gov/COG/)) categories to proteins. 
30 | For this purpose, the query proteins need to be blasted with 31 | RPS-BLAST+ ([Reverse Position-Specific BLAST](http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download)) 32 | against NCBI's Conserved Domain Database 33 | ([CDD](http://www.ncbi.nlm.nih.gov/cdd)). Use 34 | [`cds_extractor.pl`](/cds_extractor) beforehand to extract multi-fasta protein 35 | files from GENBANK or EMBL files. 36 | 37 | Both tab-delimited RPS-BLAST+ outformats, **-outfmt 6** and **-outfmt 38 | 7**, can be processed by `cdd2cog.pl`. By default, RPS-BLAST+ hits 39 | for each query protein are filtered for the best hit (lowest 40 | e-value). Use option **-a|all\_hits** to assign COGs to all BLAST hits 41 | and e.g. do a downstream filtering in a spreadsheet application. 42 | Results are written to tab-delimited files in the './results' 43 | folder, overall assignment statistics are printed to *STDOUT*. 44 | 45 | Several files are needed from NCBI's FTP server to run the RPS-BLAST+ and `cdd2cog.pl`: 46 | 47 | 1. **CDD** (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/) 48 | 49 | More information about the files in the CDD FTP archive can be found in the respective 'README' file. 50 | 51 | 1. 'cddid.tbl.gz' 52 | 53 | The file needs to be unpacked: 54 | 55 | `gunzip cddid.tbl.gz` 56 | 57 | Contains summary information about the CD models in a tab-delimited format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short name, CD description, and PSSM (position-specific scoring matrices) length. 58 | 59 | 2. './little_endian/Cog_LE.tar.gz' 60 | 61 | Unpack and untar via: 62 | 63 | `tar xvfz Cog_LE.tar.gz` 64 | 65 | Preformatted RPS-BLAST+ database of the CDD COG distribution for Intel CPUs and Unix/Windows architectures. 66 | 67 | 2. **COG** (ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/) 68 | 69 | Read 'readme' for more information about the respective files in the COG FTP archive. 70 | 71 | 1. 
'fun.txt' 72 | 73 | One-letter functional classification used in the COG database. 74 | 75 | 2. 'whog' 76 | 77 | Name, description, and corresponding functional classification of each COG. 78 | 79 | ## Usage 80 | 81 | ### RPS-BLAST+ 82 | 83 | rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6 84 | rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs' 85 | 86 | ### cdd2cog 87 | 88 | perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog -a 89 | 90 | ## Options 91 | 92 | ### Mandatory options 93 | 94 | - -r, -rps\_report 95 | 96 | Path to RPS-BLAST+ report/output, outfmt 6 or 7 97 | 98 | - -c, -cddid 99 | 100 | Path to CDD's 'cddid.tbl' file 101 | 102 | - -f, -fun 103 | 104 | Path to COG's 'fun.txt' file 105 | 106 | - -w, -whog 107 | 108 | Path to COG's 'whog' file 109 | 110 | ### Optional options 111 | 112 | - -h, -help 113 | 114 | Help (perldoc POD) 115 | 116 | - -a, -all\_hits 117 | 118 | Don't filter RPS-BLAST+ output for the best hit, rather assign COGs to all hits 119 | 120 | - -v, -version 121 | 122 | Print version number to *STDERR* 123 | 124 | ## Output 125 | 126 | - *STDOUT* 127 | 128 | Overall assignment statistics 129 | 130 | - ./results 131 | 132 | All tab-delimited output files are stored in this result folder 133 | 134 | - rps-blast_cog.txt 135 | 136 | COG assignments concatenated to the RPS-BLAST+ results for filtering 137 | 138 | - protein-id_cog.txt 139 | 140 | Slimmed down 'rps-blast_cog.txt' only including query id (first BLAST report column), COGs, and functional categories 141 | 142 | - cog_stats.txt 143 | 144 | Assignment counts for each used COG 145 | 146 | - func_stats.txt 147 | 148 | Assignment counts for single-letter functional categories 149 | 150 | ## Run environment 151 | 152 | The Perl script runs under UNIX flavors. 
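The default best-hit filtering can be previewed on the command line before running `cdd2cog.pl`. A minimal sketch, assuming a tab-delimited report (**-outfmt 6** or **7**) in which the best hit of each query is listed first; `best_hits` is a hypothetical helper, not part of the repository:

```shell
# best_hits FILE: keep only the first reported hit per query ID (column 1),
# skipping the '#' comment lines of outfmt 7
# (hypothetical helper; assumes the lowest e-value hit is listed first for each query)
best_hits() {
  awk -F'\t' '/^#/ { next } !seen[$1]++' "$1"
}
```

Running `best_hits rps-blast.out | cut -f1,2` then lists each query ID next to its top subject hit. With **-a**, all report lines would be used instead.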
153 | 154 | ## Author - contact 155 | 156 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 157 | 158 | ## Acknowledgements 159 | 160 | I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's [IMG/ER annotation system](http://img.jgi.doe.gov/), which employs the same technique. 161 | 162 | ## Citation, installation, and license 163 | 164 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 165 | 166 | ## Changelog 167 | 168 | * v0.2 (2017-02-16) 169 | * Adapted to new NCBI FASTA header format for CDD RPS-BLAST+ output 170 | * v0.1 (2013-08-01) 171 | -------------------------------------------------------------------------------- /cdd2cog/cdd2cog.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | 3 | ####### 4 | # POD # 5 | ####### 6 | 7 | =pod 8 | 9 | =head1 NAME 10 | 11 | C<cdd2cog.pl> - assign COG categories to protein sequences 12 | 13 | =head1 SYNOPSIS 14 | 15 | C<perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog> 16 | 17 | =head1 DESCRIPTION 18 | 19 | The script assigns COG (L<cluster of orthologous groups|http://www.ncbi.nlm.nih.gov/COG/>) categories to proteins. 21 | For this purpose, the query proteins need to be blasted with 22 | RPS-BLAST+ (L<Reverse Position-Specific BLAST|http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download>) 23 | against NCBI's Conserved Domain Database 24 | (L<CDD|http://www.ncbi.nlm.nih.gov/cdd>). Use 25 | L<C<cds_extractor.pl>|/cds_extractor> beforehand to extract multi-fasta 26 | protein files from GENBANK or EMBL files. 27 | 28 | Both tab-delimited RPS-BLAST+ outformats, B<-outfmt 6> and B<-outfmt 29 | 7>, can be processed by C<cdd2cog.pl>. By default, RPS-BLAST+ hits 30 | for each query protein are filtered for the best hit (lowest 31 | e-value). Use option B<-a|all_hits> to assign COGs to all BLAST hits 32 | and e.g.
do a downstream filtering in a spreadsheet application. 33 | Results are written to tab-delimited files in the F<./results> 34 | folder, overall assignment statistics are printed to C<STDOUT>. 35 | 36 | Several files are needed from NCBI's FTP server to run the RPS-BLAST+ 37 | and C<cdd2cog.pl>: 38 | 39 | =over 40 | 41 | =item 1.) L<CDD|ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/> 42 | 43 | More information about the files in the CDD FTP archive can be found 44 | in the respective F<README> file. 45 | 46 | =item 1.1.) F<cddid.tbl.gz> 47 | 48 | The file needs to be unpacked: 49 | 50 | C<gunzip cddid.tbl.gz> 51 | 52 | Contains summary information about the CD models in a tab-delimited 53 | format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short 54 | name, CD description, and PSSM (position-specific scoring matrices) 55 | length. 56 | 57 | =item 1.2.) F<./little_endian/Cog_LE.tar.gz> 58 | 59 | Unpack and untar via: 60 | 61 | C<tar xvfz Cog_LE.tar.gz> 62 | 63 | Preformatted RPS-BLAST+ database of the CDD COG distribution for 64 | Intel CPUs and Unix/Windows architectures. 65 | 66 | =item 2.) L<COG|ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/> 67 | 68 | Read F<readme> for more information about the respective files in 69 | the COG FTP archive. 70 | 71 | =item 2.1.) F<fun.txt> 72 | 73 | One-letter functional classification used in the COG database. 74 | 75 | =item 2.2.) F<whog> 76 | 77 | Name, description, and corresponding functional classification of 78 | each COG.
79 | 80 | =back 81 | 82 | =head1 OPTIONS 83 | 84 | =head2 Mandatory options 85 | 86 | =over 20 87 | 88 | =item B<-r>=I<str>, B<-rps_report>=I<str> 89 | 90 | Path to RPS-BLAST+ report/output, outfmt 6 or 7 91 | 92 | =item B<-c>=I<str>, B<-cddid>=I<str> 93 | 94 | Path to CDD's F<cddid.tbl> file 95 | 96 | =item B<-f>=I<str>, B<-fun>=I<str> 97 | 98 | Path to COG's F<fun.txt> file 99 | 100 | =item B<-w>=I<str>, B<-whog>=I<str> 101 | 102 | Path to COG's F<whog> file 103 | 104 | =back 105 | 106 | =head2 Optional options 107 | 108 | =over 20 109 | 110 | =item B<-h>, B<-help> 111 | 112 | Help (perldoc POD) 113 | 114 | =item B<-a>, B<-all_hits> 115 | 116 | Don't filter RPS-BLAST+ output for the best hit, rather assign COGs 117 | to all hits 118 | 119 | =item B<-v>, B<-version> 120 | 121 | Print version number to C<STDERR> 122 | 123 | =back 124 | 125 | =head1 OUTPUT 126 | 127 | =over 20 128 | 129 | =item C<STDOUT> 130 | 131 | Overall assignment statistics 132 | 133 | =item F<./results> 134 | 135 | All tab-delimited output files are stored in this result folder 136 | 137 | =item F<rps-blast_cog.txt> 138 | 139 | COG assignments concatenated to the RPS-BLAST+ results for filtering 140 | 141 | =item F<protein-id_cog.txt> 142 | 143 | Slimmed down F<rps-blast_cog.txt> only including query id (first 144 | BLAST report column), COGs, and functional categories 145 | 146 | =item F<cog_stats.txt> 147 | 148 | Assignment counts for each used COG 149 | 150 | =item F<func_stats.txt> 151 | 152 | Assignment counts for single-letter functional categories 153 | 154 | =back 155 | 156 | =head1 EXAMPLES 157 | 158 | =head2 RPS-BLAST+ 159 | 160 | =over 161 | 162 | =item C<rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6> 164 | 165 | =item C<rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'> 168 | 169 | =back 170 | 171 | =head2 C<cdd2cog> 172 | 173 | =over 174 | 175 | =item C<perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog -a> 177 | 178 | =back 179 | 180 | =head1 VERSION 181 | 182 | 0.2 update: 2017-02-16 183 | 0.1 2013-08-01 184 | 185 | =head1 AUTHOR 186 | 187 | Andreas Leimbach aleimba[at]gmx[dot]de 188 | 189 | =head1 ACKNOWLEDGEMENTS 190 | 191 | I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's L<IMG/ER annotation system|http://img.jgi.doe.gov/>, which employs the same technique.
193 | 194 | 195 | =head1 LICENSE 196 | 197 | This program is free software: you can redistribute it and/or modify 198 | it under the terms of the GNU General Public License as published by 199 | the Free Software Foundation; either version 3 (GPLv3) of the 200 | License, or (at your option) any later version. 201 | 202 | This program is distributed in the hope that it will be useful, but 203 | WITHOUT ANY WARRANTY; without even the implied warranty of 204 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 205 | General Public License for more details. 206 | 207 | You should have received a copy of the GNU General Public License 208 | along with this program. If not, see L<http://www.gnu.org/licenses/>. 209 | 210 | =cut 211 | 212 | 213 | ######## 214 | # MAIN # 215 | ######## 216 | 217 | use strict; 218 | use warnings; 219 | use autodie; 220 | use Getopt::Long; 221 | use Pod::Usage; 222 | 223 | 224 | ### Get the options with Getopt::Long 225 | my $Rps_Report; # path to the rps-blast report/output 226 | my $CDDid_File; # path to the CDD 'cddid.tbl' file 227 | my $Fun_File; # path to the COG 'fun.txt' file 228 | my $Whog_File; # path to the COG 'whog' file 229 | my $Opt_All_Hits; # give all blast hits for a query, not just the best (lowest evalue) 230 | my $VERSION = 0.2; # keep in sync with the POD VERSION section and the changelog 231 | my ($Opt_Version, $Opt_Help); 232 | GetOptions ('rps_report=s' => \$Rps_Report, 233 | 'cddid=s' => \$CDDid_File, 234 | 'fun=s' => \$Fun_File, 235 | 'whog=s' => \$Whog_File, 236 | 'all_hits' => \$Opt_All_Hits, 237 | 'version' => \$Opt_Version, 238 | 'help|?'
=> \$Opt_Help); 239 | 240 | 241 | 242 | ### Run perldoc on POD 243 | pod2usage(-verbose => 2) if ($Opt_Help); 244 | die "$0 $VERSION\n" if ($Opt_Version); 245 | if (!$Rps_Report || !$CDDid_File || !$Fun_File || !$Whog_File) { 246 | my $warning = "\n### Fatal error: Option(s) or arguments for '-r', '-c', '-f', or '-w' are missing!\n"; 247 | pod2usage(-verbose => 1, -message => $warning, -exitval => 2); 248 | } 249 | 250 | 251 | 252 | ### Parse the 'cddid.tbl', 'fun.txt' and 'whog' file contents and store info in hash structures 253 | my (%CDDid, %Fun, %Whog); # global hashes 254 | parse_cdd_cog(); # subroutine 255 | 256 | 257 | 258 | ### Create results directory for output files 259 | my $Out_Dir = './results/'; 260 | if (-e $Out_Dir) { 261 | print "###Directory '$Out_Dir' already exists! Replace the directory and all its contents [y|n]? "; 262 | my $user_ask = <STDIN>; # read the user's answer from the terminal 263 | if ($user_ask =~ /y/i) { 264 | unlink glob "$Out_Dir*"; # remove all files in results directory 265 | rmdir $Out_Dir; # remove the empty directory 266 | } else { 267 | die "Script aborted!\n"; 268 | } 269 | } 270 | mkdir $Out_Dir or die "Can't create directory \"$Out_Dir\": $!\n"; 271 | 272 | 273 | 274 | ### Parse the rps-blast report/output file and assign COGs 275 | my %Cog_Stats; # store the total number of query protein hits for each COG, written to '$Cogstats_Out' below 276 | 277 | my $Blast_Out = 'rps-blast_cog.txt'; # output file for COG assignments appended to RPS-BLAST results 278 | open (my $Blast_Out_Fh, ">", "$Out_Dir"."$Blast_Out"); 279 | print $Blast_Out_Fh "query id\tsubject id\t% identity\talignment length\tmismatches\tgap opens\tq. start\tq. end\ts. start\ts.
end\tevalue\tbit score\tCOG#\tfunctional categories\t\t\t\t\tCOG protein description\n"; # header for $Blast_Out 280 | 281 | my $Locus_Cog = "protein-id_cog.txt"; # slimmed down $Blast_Out only including locus_tags, COGs, and functional categories 282 | open (my $Locus_Cog_Fh, ">", "$Out_Dir"."$Locus_Cog"); 283 | 284 | print "Parsing RPS-BLAST report ...\n"; # status message 285 | my $Skip = ''; # only keep best blast hit per query (lowest e-value), except option 'all_hits' is given 286 | open (my $Rps_Report_Fh, "<", "$Rps_Report"); 287 | while (<$Rps_Report_Fh>) { 288 | if (/^#/) { # skip comment lines in blast report for BLAST+ "outfmt 7" 289 | next; 290 | } 291 | chomp; 292 | 293 | my @line = split(/\t/, $_); # split tab-separated RPS-BLAST report 294 | 295 | if ($Skip eq $line[0] && !$Opt_All_Hits) { 296 | # only keep best blast hit per query, only if option 'all_hits' is NOT set 297 | # $line[0] is query id, and should be locus_tag or specific ID from multi-fasta protein query file 298 | next; 299 | } 300 | $Skip = $line[0]; 301 | 302 | my $pssm_id = $1 if $line[1] =~ /^CDD\:(\d+)/; # get PSSM-Id from the subject hit 303 | my $cog = $CDDid{$pssm_id}; # get the COG# according to the PSSM-Id as listed in 'cddid.tbl' 304 | $Cog_Stats{$cog}++; # increment hit-number for specific COG 305 | 306 | ### Collect functional categories stats 307 | my @functions = split('', $Whog{$cog}->{'function'}); # split the single-letter functional categories to count them and join them as tab-separated below 308 | foreach (@functions) { 309 | $Fun{$_}->{'count'}++; # increment hit-number for specific functional category 310 | } 311 | 312 | ### Print to result files 313 | my $functions = join("\t", @functions); # join functional categories tab-separated 314 | print $Locus_Cog_Fh "$line[0]\t$cog\t$functions\n"; # locus_tag\tCOG\tfunctional categories 315 | $functions .= "\t" x (5 - @functions); # add additional tabs for COGs with fewer than five functions (which is the maximum 
number) 316 | print $Blast_Out_Fh "$_\t$cog\t$functions\t$Whog{$cog}->{'desc'}\n"; # $_ = RPS-BLAST line 317 | } 318 | 319 | close $Rps_Report_Fh; 320 | close $Blast_Out_Fh; 321 | close $Locus_Cog_Fh; 322 | 323 | 324 | 325 | ### Total COG and functional categories stats 326 | print "Writing assignment statistic files in '$Out_Dir' folder ...\n"; # status message 327 | 328 | my $Cogstats_Out = 'cog_stats.txt'; # output file for assignment numbers for each COG 329 | open (my $Cog_Stats_Fh, ">", "$Out_Dir"."$Cogstats_Out"); 330 | my $prot_stats = 0; # store total number of query proteins, which have a COG assignment 331 | foreach my $cog (sort keys %Cog_Stats) { 332 | print $Cog_Stats_Fh "$cog\t$Whog{$cog}->{'desc'}\t$Cog_Stats{$cog}\n"; # COG protein descriptions stored in %Whog 333 | $prot_stats += $Cog_Stats{$cog}; # sum up total COG assignments 334 | } 335 | close $Cog_Stats_Fh; 336 | 337 | my $Funcstats_Out = 'func_stats.txt'; # output file for assignment numbers for each functional category 338 | open (my $Func_Stats_Fh, ">", "$Out_Dir"."$Funcstats_Out"); 339 | my $func_cats = 0; # store total number of assigned functional categories 340 | foreach my $func (sort keys %Fun) { 341 | print $Func_Stats_Fh "$func\t$Fun{$func}->{'desc'}\t$Fun{$func}->{'count'}\n"; 342 | $func_cats += $Fun{$func}->{'count'}; # sum up total functional category assignments 343 | } 344 | close $Func_Stats_Fh; 345 | 346 | 347 | 348 | ### State which files were created and print overall statistics 349 | print "\n############################################################################\n"; 350 | print "The following tab-delimited files were created in the '$Out_Dir' directory:\n"; 351 | print "- $Blast_Out: COG assignments concatenated to the RPS-BLAST results for filtering\n"; 352 | print "- $Locus_Cog: Slimmed down '$Blast_Out' only including query id (first BLAST report column), COGs, and functional categories\n"; 353 | print "- $Cogstats_Out: COG assignment counts\n"; 354 | print "- 
$Funcstats_Out: Functional category assignment counts\n"; 355 | print "##############################################################################\n"; 356 | print "Overall assignment statistics:\n"; 357 | print "~ Total query proteins categorized into COGs: $prot_stats\n"; 358 | print "~ Total COGs used for the query proteins [of ", scalar keys %CDDid, " overall]: ", scalar keys %Cog_Stats, "\n"; 359 | print "~ Total number of assigned functional categories: $func_cats\n"; 360 | print "~ Total functional categories used for the query proteins [of ", scalar keys %Fun, " overall]: ", scalar grep ($Fun{$_}->{'count'} > 0, keys %Fun), "\n\n"; # grep for functional categories with a count > 0 to get the ones with assigned query proteins 361 | 362 | exit; 363 | 364 | 365 | 366 | ############### 367 | # Subroutines # 368 | ############### 369 | 370 | ### Subroutine to parse the 'cddid.tbl', 'fun' and 'whog' file contents and store in hash structures 371 | sub parse_cdd_cog { 372 | 373 | ### 'cddid.tbl' 374 | open (my $cddid_fh, "<", "$CDDid_File"); 375 | print "\nParsing CDDs '$CDDid_File' file ...\n"; # status message 376 | while (<$cddid_fh>) { 377 | chomp; 378 | my @line = split(/\t/, $_); # split line at the tabs 379 | if ($line[1] =~ /^COG\d{4}$/) { # search for COG CD accessions in cddid 380 | $CDDid{$line[0]} = $line[1]; # hash to store info; $line[0] = PSSM-Id 381 | } 382 | } 383 | close $cddid_fh; 384 | 385 | ### 'fun.txt' 386 | open (my $fun_fh, "<", "$Fun_File"); 387 | print "Parsing COGs '$Fun_File' file ...\n"; # status message 388 | while (<$fun_fh>) { 389 | chomp; 390 | $_ =~ s/^\s*|\s+$//g; # get rid of all leading and trailing whitespaces 391 | if (/^\[(\w)\]\s*(.+)$/) { 392 | $Fun{$1} = {'desc' => $2, 'count' => 0}; # anonymous hash in hash 393 | # $1 = single-letter functional category, $2 = description of functional category 394 | # count used to find functional categories not present in the query proteins for final overall assignment statistics 395 
| } 396 | } 397 | close $fun_fh; 398 | 399 | ### 'whog' 400 | open (my $whog_fh, "<", "$Whog_File"); 401 | print "Parsing COGs '$Whog_File' file ...\n"; # status message 402 | while (<$whog_fh>) { 403 | chomp; 404 | $_ =~ s/^\s*|\s+$//g; # get rid of all leading and trailing whitespaces 405 | if (/^\[(\w+)\]\s*(COG\d{4})\s+(.+)$/) { 406 | $Whog{$2} = {'function' => $1, 'desc' => $3}; # anonymous hash in hash 407 | # $1 = single-letter functional categories, maximal five per COG (only COG5032 with five) 408 | # $2 = COG#, $3 = COG protein description 409 | } 410 | } 411 | close $whog_fh; 412 | 413 | return 1; 414 | } 415 | -------------------------------------------------------------------------------- /cds_extractor/README.md: -------------------------------------------------------------------------------- 1 | cds_extractor 2 | ============= 3 | 4 | `cds_extractor.pl` is a script to extract amino acid or nucleotide sequences from coding sequence (CDS) features in annotated genomes. 5 | 6 | * [Synopsis](#synopsis) 7 | * [Description](#description) 8 | * [Usage](#usage) 9 | * [Extract amino acid sequences](#extract-amino-acid-sequences) 10 | * [Extract nucleotide sequences](#extract-nucleotide-sequences) 11 | * [UNIX loop to extract sequences from all files in the current working directory](#unix-loop-to-extract-sequences-from-all-files-in-the-current-working-directory) 12 | * [Options](#options) 13 | * [Mandatory options](#mandatory-options) 14 | * [Optional options](#optional-options) 15 | * [Output](#output) 16 | * [Dependencies](#dependencies) 17 | * [Run environment](#run-environment) 18 | * [Author - contact](#author---contact) 19 | * [Citation, installation, and license](#citation-installation-and-license) 20 | * [Changelog](#changelog) 21 | 22 | ## Synopsis 23 | 24 | perl cds_extractor.pl -i seq_file.[embl|gbk] -p 25 | 26 | ## Description 27 | 28 | This script extracts protein or DNA sequences of CDS features from a (multi)-RichSeq file (e.g. 
EMBL or GENBANK format) and writes them to a multi-FASTA file. The FASTA header of each CDS uses as identifier the locus tag or, if that is not available, the protein ID, the gene, or an internal CDS counter (in this order). The organism info also includes plasmid names, if present. Pseudogenes (tagged by **/pseudo**) are not included (except in the CDS counter). 29 | 30 | In addition to the identifier, FASTA headers include gene (**g=**), product (**p=**), organism (**o=**), and EC numbers (**ec=**), if these are present for a CDS. Individual EC numbers are separated by **semicolons**. The location/position (**l=** start..stop) of a CDS will always be included. If the gene is used as the FASTA header ID, '**g=** gene' is only included with option **-f**. 31 | 32 | Fuzzy locations in the feature table of a sequence file are not taken into consideration for **l=**. If you set options **-u** and/or **-d** and the feature location overlaps a **circular** replicon boundary, positions are marked with '<' or '>' in the direction of the exceeded boundary. Features whose locations overlap the boundary of a **linear** sequence (e.g. a contig) will be skipped and are **not** included in the output! A CDS feature is on the lagging strand if start > stop in the location. In the special case of overlapping circular sequence boundaries this is reversed. 33 | 34 | The **l=** positions are counted separately for each sequence in a multi-sequence file. Thus, if you want continuous positions for the CDSs, run these files through [`cat_seq.pl`](/cat_seq) first. 35 | 36 | Optionally, a file with locus tags can be given with option **-l** to extract only these CDS features (one locus tag per line).
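
As a minimal sketch of the locus tag list expected by option **-l**, the following shell lines write one tag per line to a file. The tags `b0001` to `b0003` and the file name `locus_tags.txt` are only placeholders for illustration; use the locus tags from your own annotation.

```shell
# Placeholder locus tags for illustration; replace them with tags from your annotation
printf '%s\n' b0001 b0002 b0003 > locus_tags.txt

# Sanity check: count the non-empty lines, i.e. the number of locus tags in the list
grep -c . locus_tags.txt

# The list can then be passed with -l (file names as in the usage examples):
# perl cds_extractor.pl -i Ecoli_MG1655.gbk -p -l locus_tags.txt
```

Keeping the list free of empty lines and trailing text makes it easier to compare the number of requested tags with the number of sequences in the resulting multi-FASTA file.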
37 | 38 | ## Usage 39 | 40 | ### Extract amino acid sequences 41 | 42 | perl cds_extractor.pl -i Ecoli_MG1655.gbk -p [-l locus_tags.txt -c MG1655 -f] 43 | 44 | ### Extract nucleotide sequences 45 | 46 | perl cds_extractor.pl -i Banthracis_Ames.embl -n [-l locus_tags.txt -u 100 -d 20 -c Ames -f] 47 | 48 | ### UNIX loop to extract sequences from all files in the current working directory 49 | 50 | for file in *.embl; do perl cds_extractor.pl -i "$file" -p [-l locus_tags.txt]; done 51 | 52 | ## Options 53 | 54 | ### Mandatory options 55 | 56 | * **-i**=_str_, **-input**=_str_ 57 | 58 | Input RichSeq sequence file including CDS annotation (e.g. EMBL or GENBANK format) 59 | 60 | * **-p**, **-protein** 61 | 62 | Extract **protein** sequence for each CDS feature, excludes option **-n** 63 | 64 | **or** 65 | 66 | * **-n**, **-nucleotide** 67 | 68 | Extract **nucleotide** sequence for each CDS feature, excludes option **-p** 69 | 70 | ### Optional options 71 | 72 | * **-h**, **-help** 73 | 74 | Help (perldoc POD) 75 | 76 | * **-u**=_int_, **-upstream**=_int_ 77 | 78 | Include given number of flanking nucleotides upstream of each CDS feature, forces option **-n** 79 | 80 | * **-d**=_int_, **-downstream**=_int_ 81 | 82 | Include given number of flanking nucleotides downstream of each CDS feature, forces option **-n** 83 | 84 | * **-c**=_str_, **-cds_prefix**=_str_ 85 | 86 | Prefix for the internal CDS counter [default = 'CDS'] 87 | 88 | * **-l**=_str_, **-locustag_list**=_str_ 89 | 90 | List of locus tags to extract only those (each locus tag on a new line) 91 | 92 | * **-f**, **-full_header** 93 | 94 | If gene is used as ID include additionally '**g=** gene' in FASTA headers, so downstream analyses can recognize the gene tag (e.g. [`prot_finder.pl`](/prot_finder)). 
95 | 96 | * **-v**, **-version** 97 | 98 | Print version number to *STDERR* 99 | 100 | ## Output 101 | 102 | * \*.faa 103 | 104 | Multi-FASTA file of CDS protein sequences 105 | 106 | **or** 107 | 108 | * \*.ffn 109 | 110 | Multi-FASTA file of CDS DNA sequences 111 | 112 | * (no_annotation_err.txt) 113 | 114 | Lists input files missing CDS annotation, script exited with **fatal error** i.e. no FASTA output file 115 | 116 | * (double_id_err.txt) 117 | 118 | Lists input files with ambiguous FASTA IDs, script exited with **fatal error** i.e. no FASTA output file 119 | 120 | * (locus_tag_missing_err.txt) 121 | 122 | Lists CDS features without locus tags 123 | 124 | * (linear_seq_cds_overlap_err.txt) 125 | 126 | Lists CDS features overlapping sequence border of a **linear** molecule, which are **not** included in the result multi-FASTA file 127 | 128 | ## Dependencies 129 | 130 | * [BioPerl](http://www.bioperl.org) (tested with version 1.006923) 131 | 132 | ## Run environment 133 | 134 | The Perl script runs under Windows and UNIX flavors. 135 | 136 | ## Author - contact 137 | 138 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 139 | 140 | ## Citation, installation, and license 141 | 142 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 
143 | 144 | ## Changelog 145 | 146 | * v0.7.1 (26.10.2015) 147 | - changed output file extensions from **\_cds\_aa.fasta* or **\_cds\_nuc.fasta* to **.faa* or **.ffn*, respectively 148 | - minor syntax changes in README, included TOC 149 | - minor syntax changes in POD 150 | * v0.7 (31.03.2014) 151 | - location (l=) and EC numbers (ec=) for CDS features are included in the FASTA header 152 | - 'ec=', 'g=', 'p=', and 'o=' only included in FASTA header if these tags are present for a CDS feature, or additionally for 'g=' with option **-f** 153 | - if, with options '-u' and/or '-d', the location of a CDS feature overlaps a sequence boundary, the positions are marked with '<' or '>' in 'l=' 154 | - additionally, CDS features whose location overlaps the sequence boundary of a linear molecule will not be included in the output, but IDs written to an error file 155 | - new option **-c** to choose prefix for internal CDS counter 156 | - /product feature value will not be used as FASTA ID anymore, skip directly to internal CDS counter, if /locus_tag, /protein_id, or /gene is missing for a CDS (too many 'hypothetical proteins') 157 | - internal CDS counter counts all CDSs of multi-sequence files sequentially (doesn't restart with each new sequence in the multi-sequence file) 158 | - 'control_double' subroutine also called if /gene is used as FASTA ID 159 | - fixed bug introduced in v0.6 to exit if no CDS primary features found, because a draft multi-sequence file might have unannotated small contigs 160 | - new error files: no_annotation_err.txt, double_id_err.txt, linear_seq_cds_overlap_err.txt (the first two come in handy if you run `cds_extractor.pl` in a UNIX loop with many files) 161 | - included 'use autodie' 162 | - included version switch 163 | - included pod2usage with Pod::Usage 164 | - reorganized code into more subroutines to remove redundant code (which also contained some bugs) and to make the script more concise 165 | - minor changes to Perl syntax 166
| * v0.6 (06.06.2013) 167 | - exit with error if no CDS primary features present in input file, as /translation feature only present in CDS features (some GENBANK files are only annotated with 'gene') 168 | - included Bio::SeqFeatureI's method *spliced-seq* for CDS with split nucleotide sequences (CDS position indicated by 'join') 169 | - minor changes to how the optional list of locus tags is handled 170 | * v0.5 (03.06.2013) 171 | - included a POD 172 | - options with Getopt::Long 173 | - option **-n** to alternatively extract nucleotide sequences for CDS features (optionally with upstream and downstream sequences) 174 | - option to include full FASTA ID header for downstream [`prot_finder.pl`](/prot_finder) analysis 175 | - exit with error if the values for two (or more) /locus_tag or /protein_id tags are ambiguous 176 | - print message to *STDOUT* if and which locus tags were not found in a given locus tag list (option **-l**) 177 | * v0.4 (06.02.2013) 178 | - replace whitespaces of /product values with underscores 179 | * v0.3 (06.09.2012) 180 | - internal CDS counter to use in FASTA ID for CDS features without a /locus_tag, /protein_id, /gene, or /product tag 181 | - include also organism (and possible plasmid) information in FASTA ID lines 182 | - give a warning to *STDOUT* if a CDS feature without a /locus_tag is found (but only for the first occurrence) 183 | - additionally, *locus_tag_errors.txt* error file to list all CDSs without locus tags 184 | - catch errors with *eval* if a tag is missing 185 | * v0.2 (04.09.2012) 186 | - if a CDS feature does not have a /locus_tag, then use the value for /protein_id, /gene, or /product (in this order) in the FASTA ID lines of the result file 187 | - optionally extract only CDSs with locus tags given in a file 188 | * v0.1 (24.05.2012) 189 | -------------------------------------------------------------------------------- /ecoli_mlst/README.md:
-------------------------------------------------------------------------------- 1 | ecoli_mlst 2 | ========== 3 | 4 | `ecoli_mlst` is a script to determine MLST sequence types for *E. coli* genomes and extract allele sequences. 5 | 6 | * [Synopsis](#synopsis) 7 | * [Description](#description) 8 | * [Usage](#usage) 9 | * [Options](#options) 10 | * [Mandatory options](#mandatory-options) 11 | * [Optional options](#optional-options) 12 | * [Output](#output) 13 | * [Run environment](#run-environment) 14 | * [Author - contact](#author---contact) 15 | * [Citation, installation, and license](#citation-installation-and-license) 16 | * [Changelog](#changelog) 17 | 18 | # Synopsis 19 | 20 | perl ecoli_mlst.pl -a fas -g fasta 21 | 22 | # Description 23 | 24 | The script searches for multilocus sequence type (MLST) alleles in *E. coli* genomes according to 25 | Mark Achtman's scheme with seven house-keeping genes (*adk*, *fumC*, *gyrB*, 26 | *icd*, *mdh*, *purA*, and *recA*) [Wirth et al., 2006]. *NUCmer* from the 27 | [*MUMmer package*](http://mummer.sourceforge.net/) is used to compare the given allele 28 | sequences to bacterial genomes via nucleotide alignments. 29 | 30 | Download the allele files (adk.fas ...) and the sequence type file 31 | ('publicSTs.txt') from this website: 32 | http://mlst.ucc.ie/mlst/dbs/Ecoli 33 | 34 | To run `ecoli_mlst.pl` include all *E. coli* genome files (file 35 | extension e.g. 'fasta'), all allele sequence files (file extension 36 | 'fas') and 'publicSTs.txt' in the current working directory. The 37 | allele profiles are parsed from the created \*.coord files and written 38 | to a result file, plus additional information from the file 39 | 'publicSTs.txt'. Also, the corresponding allele sequences (obtained 40 | from the allele input files) are concatenated for each *E. coli* genome 41 | into a result multi-fasta file. 
Option **-c** can be used to initiate 42 | an alignment for this multi-fasta file with [*ClustalW*](http://www.clustal.org/clustal2/) (standard 43 | alignment parameters; has to be in the `$PATH` or change variable 44 | `$clustal_call`). The alignment fasta output file can be used 45 | directly for [*RAxML*](http://sco.h-its.org/exelixis/web/software/raxml/index.html). Careful: the Phylip alignment format from 46 | *ClustalW* allows only 10 characters per strain ID. 47 | 48 | `ecoli_mlst.pl` works with complete and draft genomes. However, each input file must contain only a single genome! 49 | 50 | Obviously, results can only be obtained for genomes whose allele 51 | sequences have been deposited in Achtman's allele database. If an 52 | allele is not found in a genome it is marked by a '?' in the result 53 | profile file and a place holder 'XXX' in the result fasta file. For 54 | these cases a manual *NUCmer* or *BLASTN* might be useful to fill the 55 | gaps and [`run_sub_seq.pl`](/run_sub_seq) to get the corresponding 'new' allele 56 | sequences. 57 | 58 | Non-NCBI fasta headers for the genome files have to have a 59 | unique ID directly following the '>' (e.g. 'Sakai', '55989' ...). 60 | 61 | # Usage 62 | 63 | perl ecoli_mlst.pl -a fas -g fasta -c 64 | 65 | # Options 66 | 67 | ## Mandatory options 68 | 69 | - -a, -alleles 70 | 71 | File extension of the MLST allele fasta files, e.g. 'fas' (<=> **-g**). 72 | 73 | - -g, -genomes 74 | 75 | File extension of the *E. coli* genome fasta files, e.g. 'fasta' (<=> **-a**). 76 | 77 | ## Optional options 78 | 79 | - -h, -help 80 | 81 | Help (perldoc POD) 82 | 83 | - -c, -clustalw 84 | 85 | Call [*ClustalW*](http://www.clustal.org/clustal2/) for alignment 86 | 87 | # Output 88 | 89 | - ecoli_mlst_profile.txt 90 | 91 | Tab-separated allele profiles for the *E.
coli* genomes, plus additional info from 'publicSTs.txt' 92 | 93 | - ecoli_mlst_seq.fasta 94 | 95 | Multi-fasta file of all concatenated allele sequences for each genome 96 | 97 | - *.coord 98 | 99 | Text files that contain the coordinates of the *NUCmer* hits for each genome and allele 100 | 101 | - (errors.txt) 102 | 103 | Error file, summarizing number of not found alleles or unclear *NUCmer* hits 104 | 105 | - (ecoli_mlst_seq_aln.fasta) 106 | 107 | Optional, [*ClustalW*](http://www.clustal.org/clustal2/) alignment in Phylip format 108 | 109 | - (ecoli_mlst_seq_aln.dnd) 110 | 111 | Optional, *ClustalW* alignment guide tree 112 | 113 | ## Run environment 114 | 115 | The Perl script runs only under UNIX flavors. 116 | 117 | ## Author - contact 118 | 119 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 120 | 121 | ## Citation, installation, and license 122 | 123 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 124 | 125 | ## Changelog 126 | 127 | * v0.3 (30.01.2013) 128 | - additional info in POD 129 | - check if result files already exist and ask user what to do 130 | - changed script name from `ecoli_mlst_alleles.pl` to `ecoli_mlst.pl` 131 | * v0.2 (20.10.2012) 132 | - included a POD 133 | - options with Getopt::Long 134 | - don't consider input *E. coli* genome query files, which are too big (set cutoff at 9 MB for a fasta *E. coli* file) 135 | - draft *E. 
coli* genomes can now be used as input query files 136 | - additional info in 'publicSTs.txt' now associated to found ST types in output 137 | - give text to STDOUT which files were created 138 | - new option **-c** to align the resulting allele sequences via *ClustalW* 139 | * v0.1 (25.10.2011) 140 | -------------------------------------------------------------------------------- /ecoli_mlst/publicSTs.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aleimba/bac-genomics-scripts/1b2388fb9f5870a4aafa3e070823f9286178d3b1/ecoli_mlst/publicSTs.txt -------------------------------------------------------------------------------- /genomes_feature_table/README.md: -------------------------------------------------------------------------------- 1 | genomes_feature_table 2 | ===================== 3 | 4 | `genomes_feature_table.pl` is a script to create a feature table for genomes in EMBL and GENBANK format. 5 | 6 | * [Synopsis](#synopsis) 7 | * [Description](#description) 8 | * [Usage](#usage) 9 | * [Options](#options) 10 | * [Output](#output) 11 | * [Run environment](#run-environment) 12 | * [Dependencies](#dependencies) 13 | * [Author - contact](#author---contact) 14 | * [Citation, installation, and license](#citation-installation-and-license) 15 | * [Changelog](#changelog) 16 | 17 | ## Synopsis 18 | 19 | perl genomes_feature_table.pl path/to/genome_dir > feature_table.tsv 20 | 21 | ## Description 22 | 23 | A genome feature table lists basic stats/info (e.g. genome size, GC 24 | content, coding percentage, accession number(s)) and the numbers of 25 | annotated primary features (e.g. CDS, genes, RNAs) of genomes. It 26 | can be used to have an overview of these features in different 27 | genomes, e.g. in comparative genomics publications. 
28 | 29 | `genomes_feature_table.pl` is designed to extract (or calculate) 30 | these basic stats and **all** annotated primary features from RichSeq 31 | files (**EMBL** or **GENBANK** format) in a specified directory (with the 32 | correct file extension, see option **-e**). The **default** directory 33 | is the current working directory. The primary features are 34 | counted and the results for each genome printed in tab-separated 35 | format. It is a requirement that each file contains **only one** 36 | genome (complete or draft, with or without plasmids). 37 | 38 | The most important features will be listed first, like genome 39 | description, genome size, GC content, coding percentage (calculated 40 | based on non-pseudo CDS annotation), CDS and gene numbers, accession 41 | number(s) (first..last in the sequence file), RNAs (rRNA, tRNA, 42 | tmRNA, ncRNA), and unresolved bases (IUPAC code 'N'). If plasmids are 43 | annotated in a sequence file, the number of plasmids are 44 | counted and listed as well (needs a */plasmid="plasmid_name"* tag in the 45 | *source* primary tag, see e.g. Genbank accession number 46 | [CP009167](http://www.ncbi.nlm.nih.gov/nuccore/CP009167)). Use option **-p** 47 | to list plasmids as separate entries (lines) in the feature table. 48 | 49 | For draft genomes the number of contigs/scaffolds are counted. All 50 | contigs/scaffolds of draft genomes should be marked with the *WGS* 51 | keyword (see e.g. draft NCBI Genbank entry 52 | [JSAY00000000](http://www.ncbi.nlm.nih.gov/nuccore/JSAY00000000)). If this is 53 | not the case for your file(s) you can add those keywords to each 54 | sequence entry with the following Perl one-liners (will 55 | edit files in place). For files in **GENBANK** format if 'KEYWORDS    .' 
is present 56 | 57 | perl -i -pe 's/^KEYWORDS(\s+)\./KEYWORDS$1WGS\./' file 58 | 59 | or if 'KEYWORDS' isn't present at all 60 | 61 | perl -i -ne 'if(/^ACCESSION/){ print; print "KEYWORDS WGS.\n";} else{ print;}' file 62 | 63 | For files in **EMBL** format if 'KW   .' is present 64 | 65 | perl -i -pe 's/^KW(\s+)\./KW$1WGS\./' file 66 | 67 | or if 'KW' isn't present at all 68 | 69 | perl -i -ne 'if(/^DE/){ $dw=1; print;} elsif(/^XX/ && $dw){ print; $dw=0; print "KW WGS.\n";} else{ print;}' file 70 | 71 | ## Usage 72 | 73 | perl genomes_feature_table.pl -p -e gb,gbk > feature_table_plasmids.tsv 74 | 75 | perl genomes_feature_table.pl path/to/genome_dir/ -e gbf -e embl > feature_table.tsv 76 | 77 | ## Options 78 | 79 | - -h, -help 80 | 81 | Help (perldoc POD) 82 | 83 | - -e, -extensions 84 | 85 | File extensions to include in the analysis (EMBL or GENBANK format), 86 | either comma-separated list or multiple occurrences of the option 87 | [default = ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank] 88 | 89 | - -p, -plasmids 90 | 91 | Optionally list plasmids as extra entries in the feature table, if 92 | they are annotated with a */plasmid="plasmid_name"* tag in the 93 | *source* primary tag 94 | 95 | - -v, -version 96 | 97 | Print version number to *STDERR* 98 | 99 | ## Output 100 | 101 | - *STDOUT* 102 | 103 | The resulting feature table is printed to *STDOUT*. Redirect or 104 | pipe into another tool as needed (e.g. `cut`, `grep`, or `head`). 105 | 106 | ## Run environment 107 | 108 | The Perl script runs under Windows and UNIX flavors.
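
On a UNIX shell, the in-place keyword one-liners from the description can be tried on a throwaway file before editing real sequence files. The sketch below uses the GENBANK one-liner for an existing 'KEYWORDS' line. The three header lines are made up for illustration and are not a real record, and the four spaces after `KEYWORDS` are an assumption of the mock file.

```shell
# Throwaway file with a minimal, made-up GenBank header (not a real record)
tmp=$(mktemp)
printf 'LOCUS       contig_1\nACCESSION   XXXX01000001\nKEYWORDS    .\n' > "$tmp"

# One-liner from the description: insert the WGS keyword in place,
# keeping the original whitespace captured in $1
perl -i -pe 's/^KEYWORDS(\s+)\./KEYWORDS$1WGS\./' "$tmp"

# Show the edited keyword line
grep '^KEYWORDS' "$tmp"
```

Because `perl -i` overwrites the file, working on a copy first is a cheap way to confirm that the pattern matches the keyword line of your files.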
109 | 110 | ## Dependencies 111 | 112 | - [BioPerl](http://www.bioperl.org) (tested version 1.006923) 113 | 114 | ## Author - contact 115 | 116 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 117 | 118 | ## Citation, installation, and license 119 | 120 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 121 | 122 | ## Changelog 123 | 124 | - v0.5 (14.09.2015) 125 | - changed script name to `genomes_feature_table.pl` 126 | - included a POD 127 | - options with Getopt::Long 128 | - included `pod2usage` with Pod::Usage 129 | - major code overhaul with restructuring (removing code redundancy, print out without temp file etc.) 
and Perl syntax changes 130 | - changed input options to get folder path from STDIN 131 | - as a consequence new option **-e|-extensions** 132 | - accession numbers not essential anymore, changed hash key to filename; but requires now only one genome per file 133 | - draft genomes should include 'WGS' keyword (warning if not) 134 | - option **-p|-plasmids** works now correctly with complete and draft genomes 135 | - count plasmids without option **-p** 136 | - v0.4 (11.08.2013) 137 | - included 'use autodie;' pragma 138 | - included version switch 139 | - v0.3 (05.11.2012) 140 | - new option **p** to report plasmid features in multi-sequence draft files separately 141 | - v0.2 (19.09.2012) 142 | - v0.1 (25.11.2011) 143 | - **original** script name: `get_genome_features.pl` 144 | -------------------------------------------------------------------------------- /ncbi_ftp_download/README.md: -------------------------------------------------------------------------------- 1 | ncbi_ftp_download 2 | ================= 3 | 4 | **This pipeline is NOT working at the moment, as NCBI reorganized the structure of their [FTP server for genomes](https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/). As an alternative way to fetch bacterial genomes from NCBI I recommend [`ncbi-genome-download`](https://github.com/kblin/ncbi-genome-download) from @kblin, or [`Bio-RetrieveAssemblies`](https://github.com/andrewjpage/Bio-RetrieveAssemblies) from @andrewjpage from the Wellcome Trust Sanger Institute.** 5 | 6 | Scripts to batch download all bacterial genomes of a genus/species from NCBI's FTP site (RefSeq and GenBank) for easy access. 7 | 8 | ## Synopsis 9 | 10 | ncbi_ftp_download.sh Genus_species 11 | 12 | ## Description 13 | 14 | These scripts are intended to download all bacterial genomes for a particular genus or species from NCBI's FTP site (http://www.ncbi.nlm.nih.gov/Ftp/ and ftp://ftp.ncbi.nlm.nih.gov/) and copy them to result folders for easy access.

`ncbi_ftp_download.sh` is a bash shell wrapper script that employs UNIX's `wget` to download microbial genomes in genbank (\*.gbk) and fasta (\*.fna) format from the GenBank and RefSeq databases (NCBI Reference Sequence Database, http://www.ncbi.nlm.nih.gov/refseq/) on NCBI's FTP server, which can be accessed anonymously. As its first argument it takes the bacterial genus or species name you want to download (it uses that name with a glob inside the script, e.g. Escherichia_coli will be used as Escherichia_coli\*), see examples below in [usage](#usage). Have a look at the NCBI FTP server to get the correct name (either with your browser or e.g. with FileZilla, http://filezilla-project.org/). If you want to download genomes for several distinct species, just run the script repeatedly with different arguments.

The `wget` parameters are specified to keep the FTP server folder structure and mirror it locally downstream from the current working directory (folder 'ftp.ncbi.nlm.nih.gov' will be the top folder of the new folder structure). If you update an already existing folder structure, `wget` will only download and replace files if a newer version exists on NCBI's FTP server. **But** be aware that NCBI shuffles files around (including new ones, deleting old ones etc.), thus it might be useful to remove 'ftp.ncbi.nlm.nih.gov' and download everything anew.

After the download with `wget`, `ncbi_ftp_download.sh` will run the Perl script `ncbi_ftp_concat_unpack.pl`. This script unpacks (draft genomes are stored as tarballs, \*.tgz) and concatenates all complete and draft genomes, which are present in the folder 'ftp.ncbi.nlm.nih.gov' in the current working directory. The script traverses the downloaded NCBI ftp-folder structure and thus has to be called from the top level (containing the folder 'ftp.ncbi.nlm.nih.gov').
`ncbi_ftp_download.sh` runs `ncbi_ftp_concat_unpack.pl` with both **genbank** and **refseq** options, as well as option **y** to overwrite the old result folders (see below [options](#options)). Both scripts have to be in the same directory (or in the path) to run `ncbi_ftp_download.sh`.

For **complete** genomes **plasmids** are concatenated to the **chromosomes** to create multi-genbank/-fasta files (script `split_multi-seq_file.pl` can be used to split the multi-sequence file to single-sequence files).

In **draft** genomes, **scaffold** and/or **contig** files, designated by 'draft_scaf' or 'draft_con', are checked for annotation (i.e. if gene primary feature tags exist); usually only one of those contains annotations. The one with annotation is then used to create multi-genbank files. Multi-fasta files are created for the corresponding genbank file or, if no annotation exists, for the file which contains more sequence information (either contigs or scaffolds). If the sequence information is equal, scaffold files are preferred. If sequence size discrepancies between a genbank and its corresponding fasta file are found, the error file 'seq_errors.txt' will be created and list the offending files.

As a suggestion, pick the genomes you're looking for **first** out of './refseq' and the rest out of './genbank'. RefSeq genomes have a higher annotation quality, while GenBank includes more genomes.

Depending on the amount of data to download, the whole process can take quite a while. Also keep the space requirements in mind, e.g. all *E. coli*/*Shigella* genomes (March 2014) have a final total space requirement of ~58 GB ('ftp.ncbi.nlm.nih.gov' = ~18 GB; ./genbank = ~25 GB; ./refseq = ~16 GB)!
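
The rule for choosing between scaffold and contig files in draft genomes can be sketched in a few lines of bash. This is only an illustration of the decision described above, not the code used in `ncbi_ftp_concat_unpack.pl`: the function name `pick_draft_file` and its four arguments are hypothetical, and the handling of the case where both files are annotated (fall back to the size comparison) is an assumption, since the description only says that usually one of them carries annotation.

```shell
#!/bin/bash
# Hypothetical helper, NOT part of ncbi_ftp_concat_unpack.pl.
# Arguments: scaf_annot con_annot scaf_bp con_bp
#   *_annot: 1 if the file contains gene features, otherwise 0
#   *_bp:    total sequence length of the file in bp
pick_draft_file() {
    local scaf_annot=$1 con_annot=$2 scaf_bp=$3 con_bp=$4
    # Exactly one file is annotated: keep that one.
    if [ "$scaf_annot" -eq 1 ] && [ "$con_annot" -eq 0 ]; then
        echo scaffold
    elif [ "$con_annot" -eq 1 ] && [ "$scaf_annot" -eq 0 ]; then
        echo contig
    # Neither (or, by assumption, both) annotated: more sequence wins,
    # and scaffolds are preferred if both are equal in size.
    elif [ "$con_bp" -gt "$scaf_bp" ]; then
        echo contig
    else
        echo scaffold
    fi
}

pick_draft_file 0 1 5200000 5100000   # only the contigs are annotated
pick_draft_file 0 0 5200000 5200000   # no annotation, equal size
```

The first call keeps the contig file because it is the annotated one, the second falls through to the tie rule and keeps the scaffolds.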

If you're new to the NCBI FTP site you should read an excellent overview for microbial RefSeq genomes on NCBI's FTP site on Torsten Seemann's blog: http://thegenomefactory.blogspot.de/2012/07/navigating-microbial-genomes-on-ncbi.html.

You can also access an introductory talk for the microbial NCBI FTP resources at figshare (http://figshare.com/articles/Introduction_to_NCBI_s_FTP_server_for_bacterial_genomes/972893). It might be a good idea to read the blog post and have a look in the PDF to have a general idea what's going on, but of course you can just run the scripts and work with the genome files.

## Usage

### 1.) Manual consecutively

#### 1.1.) `wget`

Download RefSeq complete genomes (in fasta and genbank format). The accept list given to **-A** is quoted, so that the shell does not expand the patterns against files in the current directory:

    wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Genus_species*" -P .

Download RefSeq draft genomes as tarballs:

    wget -cNrv -t 45 -A "*.gbk.tgz,*.fna.tgz" "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria_DRAFT/Genus_species*" -P .

The same procedure has to be followed for GenBank files, here complete genomes:

    wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Genus_species*" -P .

And finally download GenBank draft genomes:

    wget -cNrv -t 45 -A "*.gbk.tgz,*.fna.tgz" "ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/Genus_species*" -P .

#### 1.2.) `ncbi_ftp_concat_unpack.pl`

    perl ncbi_ftp_concat_unpack.pl refseq y
    perl ncbi_ftp_concat_unpack.pl genbank y

### 2.) With one command: `ncbi_ftp_download.sh` wrapper script

Some examples of how you can use the shell script, e.g. download all *E. coli* genomes from NCBI's ftp server:

    ncbi_ftp_download.sh Escherichia_coli

Download all *B.
cereus* genomes:

    ncbi_ftp_download.sh Bacillus_cereus

Download all *Paenibacillus* genomes:

    ncbi_ftp_download.sh Paenibacillus

## Options

### *ncbi_ftp_concat_unpack.pl*

* genbank (as first argument)

    Copy GenBank genomes (from './ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria\*') as (multi-)sequence files in the result folder './genbank'.

* refseq (as first argument)

    Copy RefSeq genomes (from './ftp.ncbi.nlm.nih.gov/genomes/Bacteria\*') as (multi-)sequence files in the result folder './refseq'.

* y (as second argument)

    Will delete previous result folders and create new ones (otherwise, the script will ask the user whether to proceed with overwriting)

## Output

### `ncbi_ftp_download.sh`

* './ftp.ncbi.nlm.nih.gov/'

    Mirrors NCBI's FTP server structure and downloads the wanted bacterial genome files in this folder with subfolders

### `ncbi_ftp_concat_unpack.pl`

* './genbank'

    Result folder for all **GenBank** genomes

* './refseq'

    Result folder for all **RefSeq** genomes

* (seq_errors.txt)

    Lists \*.gbk and corresponding \*.fasta files with sequence size discrepancies.

## Run environment

Both the Perl script and the bash-shell script run only under UNIX flavors.

## Dependencies (not in the core Perl modules)

* no extra dependencies

## Authors/contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

### *ncbi_ftp_concat_unpack.pl*

* v0.2.1 (13.07.2015)
    - Adapted all scripts to the new NCBI FTP server address: 'ftp://ftp.ncbi.nlm.nih.gov/'
* v0.2 (21.02.2013)
    - 'seq_errors.txt' error file if sequence size discrepancies between genbank and corresponding fasta file found
    - die with error if 'genbank|refseq' not given as first argument
    - print status message which genome is being processed and what file is kept for draft genomes (e.g. scaffold or contig etc.)
    - bug fixes to test for file existence before running code
    - changed usage to HERE document
* v0.1 (15.09.2012)

--------------------------------------------------------------------------------
/ncbi_ftp_download/ncbi_ftp_download.sh:
--------------------------------------------------------------------------------
#!/bin/bash
# Download/update RefSeq complete genomes
# (the '-A' accept lists are quoted to keep the shell from expanding the globs locally)
echo "#### Updating RefSeq complete $1 genomes"
wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/$1*" -P .
# Download/update RefSeq draft genomes
echo "#### Updating RefSeq draft $1 genomes"
wget -cNrv -t 45 -A "*.gbk.tgz,*.fna.tgz" "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria_DRAFT/$1*" -P .
# Download/update GenBank complete genomes
echo "#### Updating GenBank complete $1 genomes"
wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/$1*" -P .
# Download/update GenBank draft genomes
echo "#### Updating GenBank draft $1 genomes"
wget -cNrv -t 45 -A "*.gbk.tgz,*.fna.tgz" "ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/$1*" -P .
# Run script 'ncbi_ftp_concat_unpack.pl' to fill the result folders './refseq' and './genbank'
echo "#### Copying files to result folder './refseq'"
perl ncbi_ftp_concat_unpack.pl refseq y
echo "#### Copying files to result folder './genbank'"
perl ncbi_ftp_concat_unpack.pl genbank y

--------------------------------------------------------------------------------
/order_fastx/README.md:
--------------------------------------------------------------------------------
order_fastx
===========

`order_fastx.pl` is a script to order sequences in FASTA or FASTQ files.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
* [Options](#options)
  * [Mandatory options](#mandatory-options)
  * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)


## Synopsis

    perl order_fastx.pl -i infile.fasta -l order_id_list.txt > ordered.fasta

## Description

Order sequence entries in FASTA or FASTQ sequence files according to
an ID list with a given order. Beware, the IDs in the order list
have to be **identical** to the entire IDs in the sequence file.

However, the ">" or "@" ID identifiers of FASTA or FASTQ files,
respectively, can be omitted in the ID list.

The file type is detected automatically.
But, you can set the file
type manually with option **-f**. FASTQ format assumes **four** lines
per read, if this is not the case run the FASTQ file through
[`fastx_fix.pl`](/fastx_fix) or use Heng Li's
[`seqtk seq`](https://github.com/lh3/seqtk):

    seqtk seq -l 0 infile.fq > outfile.fq

The script can also be used to pull a subset of sequences in the ID
list from the sequence file. Probably best to set option flag **-s**
in this case, see [Optional options](#optional-options) below. But, rather use
[`filter_fastx.pl`](/filter_fastx).

## Usage

    perl order_fastx.pl -i infile.fq -l order_id_list.txt -s -f fastq > ordered.fq

    perl order_fastx.pl -i infile.fasta -l order_id_list.txt -e > ordered.fasta

## Options

### Mandatory options

- -i, -input

    Input FASTA or FASTQ file

- -l, -list

    List with sequence IDs in specified order

### Optional options

- -h, -help

    Help (perldoc POD)

- -f, -file_type

    Set the file type manually [fasta|fastq]

- -e, -error_files

    Write missing IDs in the seq file or the order ID list without an equivalent in the other to error files instead of *STDERR* (see [Output](#output) below)

- -s, -skip_errors

    Skip missing ID error statements, excludes option **-e**

- -v, -version

    Print version number to *STDERR*

## Output

- *STDOUT*

    The newly ordered sequences are printed to *STDOUT*. Redirect or pipe into another tool as needed.

- (order_ids_missing.txt)

    If IDs in the order list are missing in the sequence file with option **-e**

- (seq_ids_missing.txt)

    If IDs in the sequence file are missing in the order ID list with option **-e**

## Run environment

The Perl script runs under Windows and UNIX flavors.
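
For readers who only want to see the idea behind the ordering described in the Description, the following bash/`awk` sketch reorders a FASTA file by an ID list. It is not how `order_fastx.pl` works internally (the Perl script rereads the sequence file for every ID, validates FASTQ reads, and reports missing IDs); the function name `order_fasta_sketch` is hypothetical, it handles FASTA only, it silently skips IDs without a match, and it assumes a non-empty ID list.

```shell
#!/bin/bash
# Hypothetical sketch, NOT a replacement for order_fastx.pl.
# Usage: order_fasta_sketch ID_LIST FASTA
order_fasta_sketch() {
    awk '
        NR == FNR {                          # first file: the ID list
            sub(/^[>@]/, "")                 # a leading ">" or "@" is optional
            if ($0 !~ /^[ \t]*$/) order[++n] = $0
            next
        }
        /^>/       { id = substr($0, 2); next }   # the entire header line is the ID
        /^[ \t]*$/ { next }                       # skip empty lines
                   { seq[id] = seq[id] $0 }       # join multi-line sequences
        END {
            for (i = 1; i <= n; i++)
                if (order[i] in seq)
                    print ">" order[i] "\n" seq[order[i]]
        }
    ' "$1" "$2"
}

# Small demonstration with temporary files
demo_dir=$(mktemp -d)
printf '>seq1 first entry\nACGT\nACGT\n>seq2\nTTTT\n' > "$demo_dir/in.fasta"
printf 'seq2\n>seq1 first entry\n' > "$demo_dir/ids.txt"
order_fasta_sketch "$demo_dir/ids.txt" "$demo_dir/in.fasta"
```

As in the Perl script, the IDs have to match the entire header line; holding all sequences in memory keeps the sketch short, but is less suitable for very large files.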

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

- v0.1 (20.11.2014)

--------------------------------------------------------------------------------
/order_fastx/order_fastx.pl:
--------------------------------------------------------------------------------
#!/usr/bin/perl

#######
# POD #
#######

=pod

=head1 NAME

C<order_fastx.pl> - order sequences in FASTA or FASTQ files

=head1 SYNOPSIS

C<perl order_fastx.pl -i infile.fasta -l order_id_list.txt E<gt>
ordered.fasta>

=head1 DESCRIPTION

Order sequence entries in FASTA or FASTQ sequence files according to
an ID list with a given order. Beware, the IDs in the order list
have to be B<identical> to the entire IDs in the sequence file.

However, the ">" or "@" ID identifiers of FASTA or FASTQ files,
respectively, can be omitted in the ID list.

The file type is detected automatically. But, you can set the file
type manually with option B<-f>. FASTQ format assumes B<four> lines
per read, if this is not the case run the FASTQ file through
L<C<fastx_fix.pl>|/fastx_fix> or use Heng Li's
L<C<seqtk seq>|https://github.com/lh3/seqtk>:

C<seqtk seq -l 0 infile.fq E<gt> outfile.fq>

The script can also be used to pull a subset of sequences in the ID
list from the sequence file. Probably best to set option flag B<-s>
in this case, see L<"Optional options"> below. But, rather use
L<C<filter_fastx.pl>|/filter_fastx>.

=head1 OPTIONS

=head2 Mandatory options

=over 20

=item B<-i>=I<str>, B<-input>=I<str>

Input FASTA or FASTQ file

=item B<-l>=I<str>, B<-list>=I<str>

List with sequence IDs in specified order

=back

=head2 Optional options

=over 20

=item B<-h>, B<-help>

Help (perldoc POD)

=item B<-f>=I<fasta|fastq>, B<-file_type>=I<fasta|fastq>

Set the file type manually

=item B<-e>, B<-error_files>

Write missing IDs in the seq file or the order ID list without an
equivalent in the other to error files instead of C<STDERR> (see
L<"OUTPUT"> below)

=item B<-s>, B<-skip_errors>

Skip missing ID error statements, excludes option B<-e>

=item B<-v>, B<-version>

Print version number to C<STDERR>

=back

=head1 OUTPUT

=over 20

=item C<STDOUT>

The newly ordered sequences are printed to C<STDOUT>. Redirect or
pipe into another tool as needed.

=item (F<order_ids_missing.txt>)

If IDs in the order list are missing in the sequence file with
option B<-e>

=item (F<seq_ids_missing.txt>)

If IDs in the sequence file are missing in the order ID list with
option B<-e>

=back

=head1 EXAMPLES

=over

=item C<perl order_fastx.pl -i infile.fq -l order_id_list.txt -s -f
fastq E<gt> ordered.fq>

=item C<perl order_fastx.pl -i infile.fasta -l order_id_list.txt -e
E<gt> ordered.fasta>

=back

=head1 VERSION

0.1                                                       20-11-2014

=head1 AUTHOR

Andreas Leimbach                               aleimba[at]gmx[dot]de

=head1 LICENSE

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 (GPLv3) of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see L<http://www.gnu.org/licenses/>.

=cut


########
# MAIN #
########

use strict;
use warnings;
use autodie;
use Getopt::Long;
use Pod::Usage;

### Get the options with Getopt::Long
my $Seq_File; # sequence file to order sequences in
my $Order_List; # order ID list for seq file
my $File_Type; # set file type; otherwise detect file type by file extension
my $Opt_Error_Files; # print missing IDs not found in order list or seq file to error files instead of STDERR
my $Opt_Skip_Errors; # skip missing IDs error statements/files
my $VERSION = 0.1;
my ($Opt_Version, $Opt_Help);
GetOptions ('input=s' => \$Seq_File,
            'list=s' => \$Order_List,
            'file_type=s' => \$File_Type,
            'error_files' => \$Opt_Error_Files,
            'skip_errors' => \$Opt_Skip_Errors,
            'version' => \$Opt_Version,
            'help|?' => \$Opt_Help);



### Run perldoc on POD
pod2usage(-verbose => 2) if ($Opt_Help);
die "$0 $VERSION\n" if ($Opt_Version);
if (!$Seq_File || !$Order_List) {
    my $warning = "\n### Fatal error: Options '-i' or '-l' or their arguments are missing!\n";
    pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
}



### Enforce mandatory or optional options
die "\n### Fatal error:\nUnknown file type '$File_Type' given with option '-f'.
Please choose from either 'fasta' or 'fastq'!\n" if ($File_Type && $File_Type !~ /^(fasta|fastq)$/i); # anchored, only the exact words are accepted
warn "\n### Warning:\nIgnoring option flag '-e', because option '-s' set at the same time!\n\n" if ($Opt_Error_Files && $Opt_Skip_Errors);



### Order input FASTA/FASTQ file according to a given list
open (my $Order_List_Fh, "<", "$Order_List");
open (my $Input_Fh, "<", "$Seq_File"); # pipe from STDIN not working because of 'seek' on filehandle
get_file_type() if (!$File_Type); # subroutine to determine file type by file extension

my %Order_List_IDs; # store order IDs and indicate if found in seq file
my %Seq_File_IDs; # store seq file IDs and indicate if present in order list

my $Next_Fasta_ID; # for multi-line FASTA input files to store next entry header/ID line while parsing in subroutine 'get_fastx_entry'
my $Parse_Run = 1; # indicate FIRST parsing cycle through seq file to collect all seq IDs

while (my $ord_id = <$Order_List_Fh>) {
    chomp $ord_id;
    next if ($ord_id =~ /^\s*$/); # skip empty lines in order list
    $ord_id =~ s/^(>|@)//; # remove ">/@" for WHOLE string match -> ID in order list can be given with ">/@" or without (will be appended again in print)

    if ($Order_List_IDs{$ord_id}) {
        die "\n### Fatal error:\n'$ord_id' exists several times in '$Order_List' and IDs should be unique!\n";
    } else {
        $Order_List_IDs{$ord_id} = 1; # changes to 2 if ID was found in seq file
    }

    while (<$Input_Fh>) {
        if (/^\s*$/) { # skip empty lines in input
            warn "\n### Warning:\nFASTQ file includes empty lines, which is unusual. Parsing the FASTQ reads might fail so check the output file afterwards if the script didn't quit with a fatal error.
However, consider running the input FASTQ file through 'fix_fastx.pl'!\n\n" if ($File_Type =~ /fastq/i);
            next;
        }
        chomp;

        # FASTA file
        if ($File_Type =~ /fasta/i) {
            $_ = get_fastx_entry($_); # subroutine to read one FASTA sequence entry (seq in multi-line or not), returns anonymous array

        # FASTQ file
        } elsif ($File_Type =~ /fastq/i) {
            $_ = get_fastx_entry($_); # subroutine to read one FASTQ read composed of FOUR mandatory lines, returns reference to array
        }

        if ($Seq_File_IDs{$_->[0]} && $Parse_Run == 1) { # only for first parse cycle, subsequent parsings of course will find the same IDs
            die "\n### Fatal error:\n'$_->[0]' exists several times in '$Seq_File' and IDs should be unique!\n";
        } elsif (!$Seq_File_IDs{$_->[0]}) {
            $Seq_File_IDs{$_->[0]} = 1; # changes to 2 if present in order list
        }

        if ($ord_id eq $_->[0]) { # order ID hit in seq file with the WHOLE string; 'eq' instead of a regex, so that metacharacters in IDs (e.g. '|' or '.') are taken literally
            $Order_List_IDs{$ord_id} = 2; # set to ID found
            $Seq_File_IDs{$_->[0]} = 2;

            print ">$_->[0]\n$_->[1]\n\n" if ($File_Type =~ /fasta/i); # print seq entry to STDOUT
            print "\@$_->[0]\n$_->[1]\n$_->[2]\n$_->[3]\n" if ($File_Type =~ /fastq/i);

            next if ($Parse_Run == 1); # parse the complete seq file once (skip 'last' below) to collect all seq IDs
            last; # jump out of seq file 'while'
        }
    }

    # rewind seq file for next order list ID
    $Next_Fasta_ID = '';
    seek $Input_Fh, 0, 0;
    $. = 0; # set line number of seq file to 0 (seek doesn't do it automatically)
    $Parse_Run = 0;
}
close $Input_Fh;
close $Order_List_Fh;



### Print order and seq IDs that were not found in seq file or in order list, resp.
if (!$Opt_Skip_Errors) {
    # order IDs not found in seq file
    missing_IDs(\%Order_List_IDs, 'order_ids_missing.txt', 'order'); # subroutine to identify and print missing IDs

    # seq file IDs not found in order list
    missing_IDs(\%Seq_File_IDs, 'seq_ids_missing.txt', 'sequence'); # subroutine
}

exit;


#############
#Subroutines#
#############

### Test for output file existence and give warning to STDERR
sub file_exist {
    my $file = shift;
    if (-e $file) {
        warn "\n### Warning:\nThe error file '$file' exists already, the current errors will be appended to the existing file!\n";
        return 1;
    }
    return 0;
}



### Get sequence entries from FASTA/Q file
sub get_fastx_entry {
    my $line = shift;

    # possible multi-line seq in FASTA
    if ($File_Type =~ /fasta/i) {
        my ($seq, $header);
        if ($. == 1) { # first line of file
            die "\n### Fatal error:\nNot a FASTA input file, first line of file should be a FASTA ID/header line and start with a '>':\n$line\n" if ($line !~ /^>/);
            $header = $line;
        } elsif ($Next_Fasta_ID) {
            $header = $Next_Fasta_ID;
            $seq = $line;
        }
        while (<$Input_Fh>) {
            chomp;
            $Next_Fasta_ID = $_ if (/^>/); # store ID/header for next seq entry
            $header =~ s/^>//; # remove '>' for WHOLE string regex match in MAIN
            return [$header, $seq] if (/^>/); # return anonymous array with current header and seq
            $seq .= $_; # concatenate multi-line seq
        }
        $header =~ s/^>//; # see above
        return [$header, $seq] if eof;

    # FASTQ: FOUR lines for each FASTQ read (seq-ID, sequence, qual-ID [optional], qual)
    } elsif ($File_Type =~ /fastq/i) {
        my @fastq_read;

        # read name/ID, line 1
        my $seq_id = $line;
        die "\n### Fatal error:\nThis read doesn't have a sequence identifier/read
name according to FASTQ specs, it should begin with a '\@':\n$seq_id\n" if ($seq_id !~ /^@.+/);
        $seq_id =~ s/^@//; # remove '@' to make comparable to $qual_id and for WHOLE string regex match in MAIN
        push(@fastq_read, $seq_id);

        # sequence, line 2
        chomp (my $seq = <$Input_Fh>);
        die "\n### Fatal error:\nRead '$seq_id' has a whitespace in its sequence, which is not allowed according to FASTQ specs:\n$seq\n" if ($seq =~ /\s+/);
        die "\n### Fatal error:\nRead '$seq_id' has a IUPAC degenerate base (except for 'N') or non-nucleotide character in its sequence, which is not allowed according to FASTQ specs:\n$seq\n" if ($seq =~ /[^acgtun]/i);
        push(@fastq_read, $seq);

        # optional quality ID, line 3
        chomp (my $qual_id = <$Input_Fh>);
        die "\n### Fatal error:\nThe optional sequence identifier/read name for the quality line of read '$seq_id' is not according to FASTQ specs, it should begin with a '+':\n$qual_id\n" if ($qual_id !~ /^\+/);
        push(@fastq_read, $qual_id);
        $qual_id =~ s/^\+//; # if optional ID is present check if equal to $seq_id in line 1
        die "\n### Fatal error:\nThe sequence identifier/read name of read '$seq_id' doesn't fit to the optional ID in the quality line:\n$qual_id\n" if ($qual_id && $qual_id ne $seq_id);

        # quality, line 4
        chomp (my $qual = <$Input_Fh>);
        die "\n### Fatal error:\nRead '$seq_id' has a whitespace in its quality values, which is not allowed according to FASTQ specs:\n$qual\n" if ($qual =~ /\s+/);
        die "\n### Fatal error:\nRead '$seq_id' has a non-ASCII character in its quality values, which is not allowed according to FASTQ specs:\n$qual\n" if ($qual =~ /[^[:ascii:]]/); # POSIX class needs both colons: [:ascii:]
        die "\n### Fatal error:\nThe quality line of read '$seq_id' doesn't have the same number of symbols as letters in the sequence:\n$seq\n$qual\n" if (length $qual != length $seq);
        push(@fastq_read, $qual);

        return \@fastq_read; # return
        # array-ref
    }
    return 0;
}



### Determine file type via file extension (FASTA or FASTQ)
sub get_file_type {
    if ($Seq_File =~ /.+\.(fa|fas|fasta|ffn|fna|frn|fsa)$/) { # use "|fsa)(\.gz)*$" if unzip inside script
        $File_Type = 'fasta';
    } elsif ($Seq_File =~ /.+\.(fastq|fq)$/) {
        $File_Type = 'fastq';
    }

    die "\n### Fatal error:\nFile type could not be automatically detected. Sure this is a FASTA/Q file? If yes, you can force the file type by setting option '-f' to either 'fasta' or 'fastq'!\n" if (!$File_Type);
    print STDERR "Detected file type: $File_Type\n";
    return 1;
}



### Identify and print IDs with no hit in order list or seq file
sub missing_IDs {
    my ($hash_ref, $error_file, $mode) = @_;

    my @missed = grep ($hash_ref->{$_} == 1, keys %$hash_ref); # set to 2 if hit, 1 if "only" present

    if (@missed) {
        file_exist($error_file) if ($Opt_Error_Files); # subroutine
        open (my $error_fh, ">>", $error_file) if ($Opt_Error_Files);

        print STDERR "\n### Warning:\nSome $mode IDs were not found in '";
        if ($mode eq 'order') {
            print STDERR "$Seq_File";
        } elsif ($mode eq 'sequence') {
            print STDERR "$Order_List";
        }
        print STDERR "', listed ";
        print STDERR "below:\n" if (!$Opt_Error_Files);
        print STDERR "in error file '$error_file'!\n" if ($Opt_Error_Files);

        foreach (sort @missed) {
            if (!$Opt_Error_Files) {
                print STDERR "$_\t"; # separated by tab
            } elsif ($Opt_Error_Files) {
                print $error_fh "$_\n"; # separated by newline
            }
        }
        print STDERR "\n" if (!$Opt_Error_Files); # final newline for STDERR print

        close $error_fh if ($Opt_Error_Files);
    }
    return 1;
}

--------------------------------------------------------------------------------
/po2anno/README.md:
--------------------------------------------------------------------------------
po2anno
=======

`po2anno.pl` is a script to create an annotation comparison matrix from [Proteinortho5](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
  * [cds_extractor](#cds_extractor)
  * [Proteinortho5](#proteinortho5)
  * [po2anno](#po2anno)
* [Options](#options)
  * [Mandatory options](#mandatory-options)
  * [Optional options](#optional-options)
* [Output](#output)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl po2anno.pl -i matrix.proteinortho -d genome_fasta_dir/ -l -a > annotation_comparison.tsv

## Description

Supplement an ortholog/paralog output matrix from a
[**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)
calculation with annotation information. The resulting tab-separated
annotation comparison matrix (ACM) is mainly intended for the
transfer of high quality annotations from reference genomes to
homologs (orthologs and co-orthologs/paralogs) in a query genome
(e.g. in conjunction with [`tbl2tab.pl`](/tbl2tab)). But of course
it can also be used to have a quick glance at the annotation of
genes present only in a couple of input genomes in comparison to the
others.

Annotation is retrieved from multi-FASTA files created with
[`cds_extractor.pl`](/cds_extractor). See
[`cds_extractor.pl`](/cds_extractor) for a description of the
format. These files are used as input for the PO analysis and option
**-d** for `po2anno.pl`.

**Proteinortho5** (PO) has to be run with option **-singles** to include
also genes without orthologs, so-called singletons/ORFans, for each
genome in the PO matrix (see the
[PO manual](http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html)).
Additionally, option **-selfblast** is recommended to enhance paralog
detection by PO.

Each orthologous group (OG) is listed in a row of the resulting ACM,
the first column holds the OG numbers from the PO input matrix (i.e.
line number minus one). The following columns specify the
orthologous CDS for each input genome. For each CDS the ID,
optionally the length in bp (option **-l**), gene, EC number(s), and
product are shown depending on their presence in the CDS's
annotation. The ID is in most cases the locus tag (see
[`cds_extractor.pl`](/cds_extractor)). If several EC numbers exist
for a single CDS they're separated by ';'. If an OG includes
paralogs, i.e. co-orthologs from a single genome, these will be
printed in the following row(s) **without** a new OG number in the
first column. The order of paralogous CDSs within an OG is
arbitrary.

The OGs are sorted numerically via the query ID (see option **-q**).
If option **-a** is set, the non-query OGs are appended to the output
after the query OGs, sorted numerically via OG number.
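
Because paralog rows carry no OG number, simple line counts of the ACM overestimate the number of OGs. The following bash sketch separates the two row types; it is only an illustration under the assumption that the first column of a paralog continuation row is left empty, which should be checked against real `po2anno.pl` output. The function name `acm_row_counts` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reads a tab-separated ACM from STDIN.
# Assumption: rows opening an OG have a number in column 1,
# paralog continuation rows have an empty column 1.
acm_row_counts() {
    awk -F'\t' '
        $1 ~ /^[0-9]+$/    { og++ }     # row starts a new OG
        $1 == "" && NF > 1 { para++ }   # continuation row with paralogs
        END { printf "OGs=%d paralog_rows=%d\n", og, para }
    '
}

# Made-up three-row example (plus a header line that is counted as neither type)
printf 'OG\tgenomeA\n1\tcdsA1\n\tcdsA2\n2\tcdsA3\n' | acm_row_counts
```

With a real matrix the call would be `acm_row_counts < annotation_comparison.tsv`.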

## Usage

### [`cds_extractor`](/cds_extractor)

    for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done

### [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)

    proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn]

### po2anno

    perl po2anno.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -q query.[faa|ffn] -l -a > annotation_comparison.tsv

## Options

### Mandatory options

- **-i**=_str_, **-input**=_str_

    Proteinortho (PO) result matrix (\*.proteinortho or \*.poff), or piped *STDIN* (-)

- **-d**=_str_, **-dir\_genome**=_str_

    Path to the directory including the genome multi-FASTA PO input files (\*.faa or \*.ffn), created with [`cds_extractor.pl`](/cds_extractor)

### Optional options

- **-h**, **-help**

    Help (perldoc POD)

- **-q**=_str_, **-query**=_str_

    Query genome (has to be identical to the string in the PO matrix) [default = first one in alphabetical order]

- **-l**, **-length**

    Include length of each CDS in bp

- **-a**, **-all**

    Append non-query orthologous groups (OGs) to the output

- **-v**, **-version**

    Print version number to *STDERR*

## Output

- *STDOUT*

    The resulting tab-delimited ACM is printed to *STDOUT*. Redirect or pipe into another tool as needed (e.g. `cut`, `grep`, `head`, or `tail`).

## Run environment

The Perl script runs under Windows and UNIX flavors.
126 | 127 | ## Author - contact 128 | 129 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 130 | 131 | ## Citation, installation, and license 132 | 133 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 134 | 135 | ## Changelog 136 | 137 | * v0.2.2 (23.10.2015) 138 | * minor syntax changes to `po2anno.pl` and README 139 | * changed option **-g|-genome_dir** to **-d|-dir_genome** for consistency with [`po2group_stats.pl`](/po2group_stats) 140 | * v0.2.1 (07.09.2015) 141 | * get rid of underscores in product annotation strings (from [`cds_extractor.pl`](/cds_extractor)) 142 | * debugged hard-coded relative path for `$genome_file_path` 143 | * v0.2 (15.01.2015) 144 | * give number of query-specific OGs and total query singletons/ORFans in final stat output 145 | * changed final stat output to an easier readable format 146 | * fixed bug: %Query_ID_Seen included also non-query IDs, which luckily had no consequences 147 | * v0.1 (18.12.2014) 148 | -------------------------------------------------------------------------------- /po2group_stats/README.md: -------------------------------------------------------------------------------- 1 | po2group_stats 2 | ============== 3 | 4 | `po2group_stats.pl` is a script to categorize orthologs from [Proteinortho5](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output according to genome groups. In the [prot_finder](/prot_finder) workflow is a script, `binary_group_stats.pl`, which does the same thing for column groups in a delimited TEXT binary matrix. 
5 | 6 | * [Synopsis](#synopsis) 7 | * [Description](#description) 8 | * [Usage](#usage) 9 | * [cds_extractor](#cds_extractor) 10 | * [Proteinortho5](#proteinortho5) 11 | * [po2group_stats](#po2group_stats) 12 | * [Options](#options) 13 | * [Mandatory options](#mandatory-options) 14 | * [Optional options](#optional-options) 15 | * [Output](#output) 16 | * [Dependencies](#dependencies) 17 | * [Run environment](#run-environment) 18 | * [Author - contact](#author---contact) 19 | * [Citation, installation, and license](#citation-installation-and-license) 20 | * [Changelog](#changelog) 21 | 22 | ## Synopsis 23 | 24 | perl po2group_stats.pl -i matrix.proteinortho -d genome_fasta_dir/ -g group_file.tsv -p > overall_stats.tsv 25 | 26 | ## Description 27 | 28 | Categorize the genomes in an ortholog/paralog output matrix (option **-i**) from a 29 | [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) 30 | calculation according to group affiliations. The group 31 | affiliations of the genomes are intended to get overall 32 | presence/absence statistics for groups of genomes and not simply 33 | single genomes (e.g. comparing 'marine', 'earth', 'commensal', 34 | 'pathogenic' etc. genome groups). Percentage inclusion (option 35 | **-cut\_i**) and exclusion (option **-cut\_e**) cutoffs can be set to 36 | control how strictly the presence/absence of genome groups within an 37 | orthologous group (OG) is defined. Of course groups can also hold 38 | only single genomes to get single genome statistics. Group 39 | affiliations are defined in a mandatory **tab-delimited** group input 40 | file (option **-g**) with a **minimum of two** and a **maximum of four** groups. 41 | 42 | Only alphanumeric (a-z, A-Z, 0-9), underscore (\_), dash (-), and 43 | period (.) characters are allowed for the **group names** in the 44 | group file to avoid downstream problems with the operating/file 45 | system. As a consequence, whitespace is not allowed in these either! 
46 | Additionally, **group names**, **genome filenames** (should be 47 | enforced by the file system), and **FASTA IDs** considering **all** 48 | genome files (mostly locus tags; should be enforced by Proteinortho5) 49 | need to be **unique**. 50 | 51 | **Proteinortho5** (PO) has to be run with option **-singles** to 52 | include also genes without orthologs, so-called singletons/ORFans, 53 | for each genome in the PO matrix (see the 54 | [PO manual](http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html)). 55 | Additionally, option **-selfblast** is recommended to enhance 56 | paralog detection by PO. 57 | 58 | To explain the logic behind the categorization, the following 59 | annotation for example groups will be used. A '1' exemplifies a 60 | group genome count in a respective OG >= the rounded inclusion 61 | cutoff, a '0' a group genome count <= the rounded exclusion cutoff. 62 | The presence and absence of OGs for the group affiliations are 63 | structured in different categories depending on the number of 64 | groups. For **two groups** (e.g. A and B) there are five categories: 65 | 'A specific' (A:B = 1:0), 'B specific' (0:1), 'cutoff core' (1:1), 66 | 'underrepresented' (0:0), and 'unspecific'. Unspecific OGs have a 67 | genome count for at least **one** group outside the cutoffs 68 | (exclusion cutoff < genome count < inclusion cutoff) and 69 | thus cannot be categorized. These 'unspecific' OGs will only be 70 | printed to a final annotation result file with option **-u**. Overall 71 | stats for all categories are printed to *STDOUT* in a final 72 | tab-delimited output matrix. 73 | 74 | **Three groups** (A, B, and C) have the following nine categories: 'A 75 | specific' (A:B:C = 1:0:0), 'B specific' (0:1:0), 'C specific' 76 | (0:0:1), 'A absent' (0:1:1), 'B absent' (1:0:1), 'C absent' (1:1:0), 77 | 'cutoff core' (1:1:1), 'underrepresented' (0:0:0), and 'unspecific'. 
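The '1'/'0' notation and the categories for two and three groups described above can be sketched in a few lines of bash. This is an illustration only and not code taken from `po2group_stats.pl`: the function names are hypothetical, and the rounding of the cutoffs (round half up of cutoff times group size) is an assumption, since the exact rounding is not spelled out here.

```bash
#!/usr/bin/env bash
# Illustrative sketch only, not code from po2group_stats.pl.

# Round a fractional cutoff to a whole genome count for one group.
# ASSUMPTION: round half up; the script's actual rounding may differ.
round_cutoff() {  # usage: round_cutoff <cutoff_fraction> <genomes_in_group>
    awk -v c="$1" -v n="$2" 'BEGIN { printf "%d\n", int(c * n + 0.5) }'
}

# Translate the genome count of one group in one OG into a state:
# 1 = count >= rounded inclusion cutoff, 0 = count <= rounded exclusion
# cutoff, u = in between (which makes the whole OG 'unspecific').
group_state() {  # usage: group_state <count> <group_size> <cut_i> <cut_e>
    local incl excl
    incl=$(round_cutoff "$3" "$2")
    excl=$(round_cutoff "$4" "$2")
    if (( $1 >= incl )); then
        echo 1
    elif (( $1 <= excl )); then
        echo 0
    else
        echo u
    fi
}

# Map per-group states to a category name, for two or three groups only.
categorize() {  # usage: categorize A:1 B:0 [C:1]
    (( $# == 2 || $# == 3 )) || { echo "expects two or three groups" >&2; return 1; }
    local pair name state ones=0 zeros=0 present="" absent=""
    for pair in "$@"; do
        name=${pair%%:*}    # group name before the ':'
        state=${pair##*:}   # state after the ':'
        case $state in
            1) ones=$((ones + 1)); present=$name ;;
            0) zeros=$((zeros + 1)); absent=$name ;;
            *) echo "unspecific"; return 0 ;;
        esac
    done
    if (( zeros == 0 )); then
        echo "cutoff core"
    elif (( ones == 0 )); then
        echo "underrepresented"
    elif (( ones == 1 )); then
        echo "$present specific"
    else
        echo "$absent absent"   # three groups with exactly one group at '0'
    fi
}
```

With the default cutoffs (0.9 and 0.1) and a group of ten genomes, `group_state 9 10 0.9 0.1` yields `1`, and `categorize A:0 B:1 C:1` yields `A absent`.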
78 | 79 | **Four groups** (A, B, C, and D) are classified into 17 categories: 'A 80 | specific' (A:B:C:D = 1:0:0:0), 'B specific' (0:1:0:0), 'C specific' 81 | (0:0:1:0), 'D specific' (0:0:0:1), 'A-B specific' (1:1:0:0), 'A-C 82 | specific' (1:0:1:0), 'A-D specific' (1:0:0:1), 'B-C specific' 83 | (0:1:1:0), 'B-D specific' (0:1:0:1), 'C-D specific' (0:0:1:1), 'A 84 | absent' (0:1:1:1), 'B absent' (1:0:1:1), 'C absent' (1:1:0:1), 'D 85 | absent' (1:1:1:0), 'cutoff core' (1:1:1:1), 'underrepresented' 86 | (0:0:0:0), and 'unspecific'. 87 | 88 | The resulting group presence/absence (according to the cutoffs) can 89 | also be printed to a binary matrix (option **-b**) in the result 90 | directory (option **-r**), excluding the 'unspecific' category. Since 91 | the categories are the logics underlying venn diagrams, you can also 92 | plot the results in a venn diagram using the binary matrix (option 93 | **-p**). The 'underrepresented' category is excluded from the venn 94 | diagram, because it lies outside of venn diagram logics. 95 | 96 | Here are venn diagrams illustrating the logic categories (see folder ['pics'](./pics)): 97 | 98 |

99 | venn diagram logics 100 |

101 | 102 | There are two optional categories (which are only considered for the 103 | final print outs and in the final stats matrix, not for the binary 104 | matrix and the venn diagram): 'strict core' (option **-co**) for 105 | OGs where **all** genomes have an ortholog, independent of the 106 | cutoffs. Of course all the 'strict core' OGs are also included in 107 | the 'cutoff\_core' category ('strict core' is identical to 'cutoff 108 | core' with **-cut\_i** 1 and **-cut\_e** 0). Option **-s** activates the 109 | detection of 'singleton/ORFan' OGs present in only **one** genome. 110 | Depending on the cutoffs and number of genomes in the groups, 111 | category 'underrepresented' includes most of these singletons. 112 | 113 | Additionally, annotation is retrieved from multi-FASTA files created 114 | with [`cds_extractor.pl`](/cds_extractor). See 115 | [`cds_extractor.pl`](/cds_extractor) for a description of the 116 | format. These files are used as input for the PO analysis and with 117 | option **-d** for `po2group_stats.pl`. The annotations are printed 118 | in category output files in the result directory. 119 | 120 | Annotations are only pulled from one representative genome for each 121 | category present in the current OG. With option **-co** you can set a 122 | specific genome for the representative annotation for category 123 | 'strict core'. For all other categories the representative genome is 124 | chosen according to the order of the genomes in the group files, 125 | depending on the presence in each OG. Thus, the best annotated 126 | genome should be in the first group at the topmost position 127 | (especially for 'cutoff core'), as well as the best annotated ones 128 | at the top in all other groups. 129 | 130 | In the result files, each orthologous group (OG) is listed in a row 131 | of the resulting category files, the first column holds the OG 132 | numbers from the PO input matrix (i.e. line number minus one). 
The 133 | following columns specify the ID for each CDS; gene, EC number(s), 134 | product, and organism are shown depending on their presence in the 135 | CDS's annotation. The ID is in most cases the locus tag (see 136 | [`cds_extractor.pl`](/cds_extractor)). If several EC numbers exist 137 | for a single CDS they are separated by a ';'. If the representative 138 | genome within an OG includes paralogs (co-orthologs) these will be 139 | printed in the following row(s) **without** a new OG number in the 140 | first column. 141 | 142 | The number of OGs in the category annotation result files is the 143 | same as listed in the venn diagram and the final stats matrix. 144 | However, since only the annotation from one representative genome is 145 | used, the CDS number will differ from the final stats. The final 146 | stats include **all** the CDSs in this category in **all** genomes 147 | present in the OG in groups >= the inclusion cutoff (i.e. for 148 | 'strict core' the CDSs for all genomes in this OG are counted). Two 149 | categories are different: for 'unspecific' all unspecific groups are 150 | included, for 'underrepresented' all groups <= the exclusion 151 | cutoff. This is also the reason why the 'pangenome' CDS count is 152 | greater than the 'included in categories' CDS count in the final 153 | stats matrix, as genomes in excluded groups are left out of the CDS 154 | counts for most categories. 'Included in categories' is the OG/CDS 155 | sum of all non-optional categories ('\*specific', '\*absent', 'cutoff 156 | core', 'underrepresented', and 'unspecific'), since the optional 157 | categories are included in the non-optional ones. Exceptions to the 158 | difference in CDS counts are the 'singletons' category, where OG and 159 | CDS counts are identical in the result files and in the overall 160 | final output matrix (as there is only one genome), as well as 161 | group-'specific' categories for groups including only one genome. 
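To see the difference between OG and representative CDS counts in a single category result file, the rows can simply be tallied. The following is a hedged sketch rather than part of `po2group_stats.pl`: the function name `count_og_and_cds` is hypothetical, and it assumes that comment/header lines start with '#' and that paralog rows have an empty first column, as described above.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of po2group_stats.pl: count OGs and
# representative CDS rows in one category result file.
# ASSUMPTIONS: comment/header lines start with '#'; paralog rows leave
# the first (OG number) column empty.
count_og_and_cds() {  # usage: count_og_and_cds result_file.tsv
    awk -F'\t' '
        /^#/ || NF == 0 { next }   # skip comments and empty lines
        $1 != ""        { ogs++ }  # a new OG starts with an OG number
                        { cds++ }  # every data row is one representative CDS
        END { printf "OGs\t%d\nCDSs\t%d\n", ogs, cds }' "$@"
}
```

The OG count should match the respective category in the final stats matrix, while the CDS count only reflects the representative genome.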
162 | 163 | Finally, if you want the respective representative sequences for a 164 | category you can first filter the locus tags from the result file 165 | with Unix command-line tools: 166 | 167 | grep -v "^#" result_file.tsv | cut -f 2 > locus_tags.txt 168 | 169 | And then feed the locus tag list to 170 | [`cds_extractor.pl`](/cds_extractor) with option **-l**. 171 | 172 | As a final note, the [prot_finder](/prot_finder) workflow includes a 173 | script, `binary_group_stats.pl`, based upon `po2group_stats.pl`, 174 | which can calculate overall presence/absence statistics for column 175 | groups in a delimited TEXT binary matrix (as with genomes here). 176 | 177 | ## Usage 178 | 179 | ### [`cds_extractor`](/cds_extractor) 180 | 181 | for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done 182 | 183 | ### [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) 184 | 185 | proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn] 186 | 187 | ### po2group_stats 188 | 189 | perl po2group_stats.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -g group_file.tsv -r result_dir -cut_i 0.7 -cut_e 0.2 -b -p -co genome4.[faa|ffn] -s -u -a > overall_stats.tsv 190 | 191 | ## Options 192 | 193 | ### Mandatory options 194 | 195 | - **-i**=_str_, **-input**=_str_ 196 | 197 | Proteinortho (PO) result matrix (\*.proteinortho or \*.poff) 198 | 199 | - **-d**=_str_, **-dir\_genome**=_str_ 200 | 201 | Path to the directory including the genome multi-FASTA PO input files (\*.faa or \*.ffn), created with [`cds_extractor.pl`](/cds_extractor) 202 | 203 | - **-g**=_str_, **-groups\_file**=_str_ 204 | 205 | Tab-delimited file with group affiliation for the genomes with a **minimum of two** and a **maximum of four** groups (easiest to create in a spreadsheet software and save in tab-separated format). **All** genomes from the PO matrix need to be included. 
Group names can only include alphanumeric (a-z, A-Z, 0-9), underscore (\_), dash (-), and period (.) characters (no whitespaces allowed either). Example format with two genomes in group A, three genomes in group B and D, and one genome in group C: 206 | 207 | group\_A group\_B group\_C group\_D
208 | genome1.faa genome2.faa genome3.faa genome4.faa
209 | genome5.faa genome6.faa  genome7.faa
210 |  genome8.faa  genome9.faa 211 | 212 | ### Optional options 213 | 214 | - **-h**, **-help** 215 | 216 | Help (perldoc POD) 217 | 218 | - **-r**=_str_, **-result\_dir**=_str_ 219 | 220 | Path to result folder \[default = inclusion and exclusion percentage cutoff, './results\_i#\_e#'\] 221 | 222 | - **-cut\_i**=_float_, **-cut\_inclusion**=_float_ 223 | 224 | Percentage inclusion cutoff for genomes in a group per OG, has to be > 0 and <= 1. Cutoff will be rounded according to the genome number in each group and has to be > the rounded exclusion cutoff in this group. \[default = 0.9\] 225 | 226 | - **-cut\_e**=_float_, **-cut\_exclusion**=_float_ 227 | 228 | Percentage exclusion cutoff, has to be >= 0 and < 1. Rounded cutoff has to be < rounded inclusion cutoff. \[default = 0.1\] 229 | 230 | - **-b**, **-binary\_matrix** 231 | 232 | Print a binary matrix with the presence/absence genome group results according to the cutoffs (excluding 'unspecific' category OGs) 233 | 234 | - **-p**, **-plot\_venn** 235 | 236 | Plot venn diagram from the binary matrix (except 'unspecific' and 'underrepresented' categories) with function `venn` from **R** package **gplots**, requires option **-b** 237 | 238 | - **-co**=(_str_), **-core_strict**=(_str_) 239 | 240 | Include 'strict core' category in output. Optionally, give a genome name from the PO matrix to use for the representative output annotation. 
\[default = topmost genome in first group\] 241 | 242 | - **-s**, **-singletons** 243 | 244 | Include singletons/ORFans for each genome in the output, activates also overall genome OG/CDS stats in final stats matrix for genomes with singletons 245 | 246 | - **-u**, **-unspecific** 247 | 248 | Include 'unspecific' category representative annotation file in result directory 249 | 250 | - **-a**, **-all\_genomes\_overall** 251 | 252 | Report overall stats for all genomes (appended to the final stats matrix), also those without singletons; will include all overall genome stats without option **-s** 253 | 254 | - **-v**, **-version** 255 | 256 | Print version number to *STDERR* 257 | 258 | ## Output 259 | 260 | - *STDOUT* 261 | 262 | The tab-delimited final stats matrix is printed to *STDOUT*. Redirect or pipe into another tool as needed. 263 | 264 | - ./results_i#_e# 265 | 266 | All output files are stored in a results folder 267 | 268 | - ./results_i#_e#/[\*_specific|\*_absent|cutoff_core|underrepresented]_OGs.tsv 269 | 270 | Tab-delimited files with OG annotation from a representative genome for non-optional categories 271 | 272 | - (./results_i#_e#/[\*_singletons|strict_core|unspecific]_OGs.tsv) 273 | 274 | Optional category tab-delimited output files with representative annotation 275 | 276 | - (./results_i#_e#/binary_matrix.tsv) 277 | 278 | Tab-delimited binary matrix of group presence/absence results according to cutoffs (excluding 'unspecific' category) 279 | 280 | - (./results_i#_e#/venn_diagram.pdf) 281 | 282 | Venn diagram for non-optional categories (except 'unspecific' and 'underrepresented' categories) 283 | 284 | ## Dependencies 285 | 286 | - **Statistical computing language [R](http://www.r-project.org/)** 287 | 288 | `Rscript` is needed to plot the venn diagram with option **-p**, tested with version 3.2.2 289 | 290 | - **gplots (https://cran.r-project.org/web/packages/gplots/index.html)** 291 | 292 | Package needed for **R** to plot the venn diagram 
with function `venn`. Tested with **gplots** version 2.17.0. 293 | 294 | ## Run environment 295 | 296 | The Perl script runs under UNIX and Windows flavors. 297 | 298 | ## Author - contact 299 | 300 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 301 | 302 | ## Citation, installation, and license 303 | 304 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 305 | 306 | ## Changelog 307 | 308 | * v0.1.3 (06.06.2016) 309 | * included check for file system conformity for group names 310 | * some minor syntax changes and additions to error messages, basically adapting to [`binary_group_stats.pl`](/prot_finder) 311 | * v0.1.2 (19.11.2015) 312 | * added `pod2usage`-die for Getopts::Long call 313 | * minor POD/README change 314 | * v0.1.1 (30.10.2015) 315 | * fixed bug for representative annotation in output files, the representative genome was not chosen according to genome order in the groups file 316 | * v0.1 (23.10.2015) 317 | -------------------------------------------------------------------------------- /po2group_stats/pics/README.md: -------------------------------------------------------------------------------- 1 | Venn diagram logics for po2group_stats 2 | ====================================== 3 | 4 | These venn diagrams were made to illustrate the logics behind the genome group classification of [`po2group_stats`](/po2group_stats). 
They were created with function `venn` of the [**gplots**](https://cran.r-project.org/web/packages/gplots/index.html) R package (version 2.17.0), as implemented in [`po2group_stats`](/po2group_stats), and edited with [**Inkscape**](https://inkscape.org) version 0.48. 5 | 6 | The diagrams are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). -------------------------------------------------------------------------------- /po2group_stats/pics/venn_diagram_logics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aleimba/bac-genomics-scripts/1b2388fb9f5870a4aafa3e070823f9286178d3b1/po2group_stats/pics/venn_diagram_logics.png -------------------------------------------------------------------------------- /prot_finder/prot_binary_matrix.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | 3 | ####### 4 | # POD # 5 | ####### 6 | 7 | =pod 8 | 9 | =head1 NAME 10 | 11 | C - create a presence/absence matrix from 12 | C output 13 | 14 | =head1 SYNOPSIS 15 | 16 | C binary_matrix.tsv> 17 | 18 | B 19 | 20 | C binary_matrix.tsv> 21 | 22 | =head1 DESCRIPTION 23 | 24 | This script is intended to create a presence/absence matrix from the 25 | significant C 26 | L|http://blast.ncbi.nlm.nih.gov/Blast.cgi>) hits (or the 27 | companion bash pipe C). The tab-separated 28 | C output can be given directly via C or as a 29 | file. By default a tab-delimited binary presence/absence matrix for 30 | query hits per subject organism will be printed to C. Use 31 | option B<-t> to count all query hits per subject organism, not just 32 | the binary presence/absence. You can transpose the presence/absence 33 | binary matrix with the script C (see its help 34 | with B<-h>). 35 | 36 | The resulting matrix can be used to associate the presence/absence 37 | data with a phylogenetic tree, e.g. 
use the Interactive Tree Of Life 38 | website (L|http://itol.embl.de/>). B likes individual 39 | comma-separated input files, thus use options B<-s -c> for this 40 | purpose. 41 | 42 | For B the organism names have to have identical names to the 43 | leaves of the phylogenetic tree, thus manual adaptation, e.g. in a 44 | spreadsheet software, might be needed. B, subject organisms 45 | without a significant B hit won't be included in the 46 | tab-separated C result table and hence can't be 47 | included by C. If needed add them manually to 48 | the result matri(x|ces). 49 | 50 | Additionally, you can give the presence/absence binary matrix to 51 | C to calculate presence/absence statistics 52 | for groups of columns and not simply single columns of the matrix. 53 | C also has a comprehensive manual with its 54 | option B<-h>. 55 | 56 | =head1 OPTIONS 57 | 58 | =over 20 59 | 60 | =item B<-h>, B<-help> 61 | 62 | Help (perldoc POD) 63 | 64 | =item B<-s>, B<-separate> 65 | 66 | Separate presence/absence files for each query protein printed to 67 | the result directory [default without B<-s> = C matrix for 68 | all query proteins combined] 69 | 70 | =item B<-d>=I, B<-dir_result>=I 71 | 72 | Path to result folder, requires option B<-s> [default = 73 | './binary_matrix_results'] 74 | 75 | =item B<-t>, B<-total> 76 | 77 | Count total occurrences of query proteins, not just binary 78 | presence/absence 79 | 80 | =item B<-c>, B<-csv> 81 | 82 | Output matri(x|ces) in comma-separated format (*.csv) instead of 83 | tab-delimited format (*.tsv) 84 | 85 | =item B<-l>, B<-locus_tag> 86 | 87 | Use the locus_tag B in the subject_ID column of the 88 | C output (instead of the subject_organism columns) as 89 | organism IDs to associate query hits to organisms. The subject_ID 90 | column will include locus_tags if they're annotated for a genome 91 | (see the L|/cds_extractor> format description). 
92 | Useful if the L|/cds_extractor> output doesn't 93 | include strain names for 'o=' in the FASTA IDs, because the prefix 94 | of a locus_tag should be unique for a genome (see 95 | L). 96 | 97 | =item B<-v>, B<-version> 98 | 99 | Print version number to C 100 | 101 | =back 102 | 103 | =head1 OUTPUT 104 | 105 | =over 17 106 | 107 | =item C 108 | 109 | The resulting presence/absence matrix is printed to C 110 | without option B<-s>. Redirect or pipe into another tool as needed. 111 | 112 | =item (F<./binary_matrix_results>) 113 | 114 | Separate query presence/absence files are stored in a result folder 115 | with option B<-s> 116 | 117 | =item (F<./binary_matrix_results/query-ID_binary_matrix.(tsv|csv)>) 118 | 119 | Separate query presence/absence files with option B<-s> 120 | 121 | =back 122 | 123 | =head1 EXAMPLES 124 | 125 | =over 126 | 127 | =item C 128 | 129 | =back 130 | 131 | B 132 | 133 | =over 134 | 135 | =item C binary_matrix.csv> 136 | 137 | =back 138 | 139 | B 140 | 141 | =over 142 | 143 | =item C binary_matrix.tsv> 144 | 145 | =back 146 | 147 | =head1 VERSION 148 | 149 | 0.6 update: 23-11-2015 150 | 0.1 25-10-2012 151 | 152 | =head1 AUTHOR 153 | 154 | Andreas Leimbach aleimba[at]gmx[dot]de 155 | 156 | =head1 LICENSE 157 | 158 | This program is free software: you can redistribute it and/or modify 159 | it under the terms of the GNU General Public License as published by 160 | the Free Software Foundation; either version 3 (GPLv3) of the 161 | License, or (at your option) any later version. 162 | 163 | This program is distributed in the hope that it will be useful, but 164 | WITHOUT ANY WARRANTY; without even the implied warranty of 165 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 166 | General Public License for more details. 167 | 168 | You should have received a copy of the GNU General Public License 169 | along with this program. If not, see L. 
170 | 171 | =cut 172 | 173 | 174 | ######## 175 | # MAIN # 176 | ######## 177 | 178 | use strict; 179 | use warnings; 180 | use autodie; 181 | use Getopt::Long; 182 | use Pod::Usage; 183 | 184 | ### Get options with Getopt::Long 185 | my $Opt_Separate; # separate presence/absence files for each query printed to result_dir (default: single presence/absence file for all queries printed to STDOUT) 186 | my $Result_Dir; # path to result folder, requires option '-s'; default set below to 'binary_matrix_results' 187 | my $Opt_Total; # count total occurrences of query proteins not just presence/absence binary 188 | my $Opt_Csv; # output in csv format (default: tsv) 189 | my $Opt_Locus_Tag; # use locus_tag prefixes (from subject_ID column, see cds_exractor) instead of subject_organism as ID to count query hits 190 | my $VERSION = 0.6; 191 | my ($Opt_Version, $Opt_Help); 192 | GetOptions ('separate' => \$Opt_Separate, 193 | 'dir_result=s' => \$Result_Dir, 194 | 'total' => \$Opt_Total, 195 | 'csv' => \$Opt_Csv, 196 | 'locus_tag' => \$Opt_Locus_Tag, 197 | 'version' => \$Opt_Version, 198 | 'help|?' => \$Opt_Help) 199 | or pod2usage(-verbose => 1, -exitval => 2); 200 | 201 | 202 | 203 | ### Run perldoc on POD and set option defaults 204 | pod2usage(-verbose => 2) if ($Opt_Help); 205 | die "$0 $VERSION\n" if ($Opt_Version); 206 | if ($Result_Dir && !$Opt_Separate) { 207 | warn "### Warning: Option '-d' given but not its required option '-s', forcing option '-s'!\n"; 208 | $Opt_Separate = 1; 209 | } 210 | 211 | my $Separator = "\t"; 212 | $Separator = "," if ($Opt_Csv); # optional csv output format 213 | 214 | 215 | 216 | ### Check input 217 | if (-t STDIN && ! 
@ARGV) { 218 | my $warning = "\n### Fatal error: No STDIN and no input file given as argument, please supply one of them and/or see help with '-h'!\n"; 219 | pod2usage(-verbose => 0, -message => $warning, -exitval => 2); 220 | } elsif (!-t STDIN && @ARGV) { 221 | my $warning = "\n### Fatal error: Both STDIN and an input file given as argument, please supply only either one and/or see help with '-h'!\n"; 222 | pod2usage(-verbose => 0, -message => $warning, -exitval => 2); 223 | } 224 | die "\n### Fatal error: Too many arguments given, only STDIN or one input file allowed as argument! Please see the usage with option '-h' if unclear!\n" if (@ARGV > 1); 225 | die "\n### Fatal error: File '$ARGV[0]' does not exist!\n" if (@ARGV && $ARGV[0] ne '-' && !-e $ARGV[0]); 226 | 227 | 228 | 229 | ### Create result folder, only for option '-s' 230 | if ($Opt_Separate) { 231 | $Result_Dir = 'binary_matrix_results' if (!$Result_Dir); 232 | $Result_Dir =~ s/\/$//; # get rid of a potential '/' at the end of $Result_Dir path 233 | if (-e $Result_Dir) { 234 | empty_dir($Result_Dir); # subroutine to empty a directory with user interaction 235 | } else { 236 | mkdir $Result_Dir; 237 | } 238 | } 239 | 240 | 241 | 242 | ### Parse the input from 'prot_finder.pl' to associate organism with query hit 243 | my @Queries; # store all query proteins 244 | my %Hits; # hash of hash to associate subject_organism/subject_ID with query hit 245 | 246 | while (<>) { # read STDIN or file input 247 | chomp; 248 | if ($. == 1) { # $. 
= check only first line of input (works with STDIN and file input) 249 | die "\n### Fatal error: Input doesn't have the correct format, it has to be the output of 'prot_finder.pl' with the following header:\n# subject_organism\tsubject_ID\tsubject_gene\tsubject_protein_desc\tquery_ID\tquery_desc\tquery_coverage [%]\tquery_identities [%]\tsubject/hit_coverage [%]\te-value of best HSP\tbit-score of best HSP\n" if (!/# subject_organism\tsubject_ID\tsubject_gene\tsubject_protein_desc\tquery_ID\tquery_desc/); 250 | next; # skip header line 251 | } 252 | 253 | my @line = split (/\t/, $_); # $line[0] = subject_organism; $line[1] = subject_ID (mostly locus_tag, see cds_extractor); $line[4] = query_ID 254 | my $query = $line[4]; 255 | my $id; 256 | if ($Opt_Locus_Tag) { # use subject_ID 257 | die "\n### Fatal error: The subject_ID of the following line doesn't look like an NCBI locus tag (see: http://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation). Column subject_ID needs to include only locus_tags for option '-l'!\n$_\n" if ($line[1] !~ /^([a-zA-Z][0-9a-zA-Z]{2,11})_[0-9a-zA-Z]+$/); # check if subject_ID is locus_tag ('\w' not used because allows alphanumeric and '_') 258 | # excerpt: The locus_tag prefix must be 3-12 alphanumeric characters and the first character may not be a digit. All chromosomes and plasmids of an individual genome must use the exactly same locus_tag prefix followed by an underscore and then an alphanumeric identification number that is unique within the given genome. Other than the single underscore used to separate the prefix from the identification number no other special characters can be used in the locus_tag. 
259 | $id = $1; # locus_tag prefix, unique for each genome 260 | } else { # use subject_organism as ID 261 | $id = $line[0]; 262 | } 263 | 264 | if ($Opt_Total) { # count total occurrences of query proteins 265 | $Hits{$id}{$query}++; 266 | 267 | } else { # only binary output (0 or 1) 268 | $Hits{$id}{$query} = 1; 269 | } 270 | 271 | push (@Queries, $query) if (!grep($_ eq $query, @Queries)); # push each query only once in @Queries 272 | } 273 | 274 | 275 | 276 | ### Print binary data to a joined or separate (for each query; as needed by iTOL) file(s) 277 | if (!$Opt_Separate) { # joined output 278 | # print header 279 | if ($Opt_Locus_Tag) { 280 | print "locus_tag"; 281 | } else { 282 | print "organism"; 283 | } 284 | print "$Separator"; 285 | print join("$Separator", sort @Queries), "\n"; 286 | 287 | # print data to STDOUT 288 | foreach my $id (sort keys %Hits) { 289 | print "$id"; 290 | foreach my $query (sort @Queries) { 291 | if ($Hits{$id}->{$query}) { 292 | print "$Separator", "$Hits{$id}->{$query}"; 293 | } else { 294 | print "$Separator", "0"; 295 | } 296 | } 297 | print "\n"; 298 | } 299 | 300 | } elsif ($Opt_Separate) { # separated output for each query 301 | foreach my $query (sort @Queries) { 302 | my $file = "$Result_Dir/$query\_binary\_matrix."; 303 | if ($Opt_Csv) { 304 | $file .= "csv"; 305 | } else { 306 | $file .= "tsv"; 307 | } 308 | open (my $binary_matrix_fh, ">", "$file"); 309 | foreach my $id (sort keys %Hits) { 310 | print $binary_matrix_fh "$id"; 311 | if ($Hits{$id}->{$query}) { 312 | print $binary_matrix_fh "$Separator", "$Hits{$id}->{$query}\n"; 313 | } else { 314 | print $binary_matrix_fh "$Separator", "0\n"; 315 | } 316 | } 317 | close $binary_matrix_fh; 318 | } 319 | } 320 | 321 | exit; 322 | 323 | 324 | ############### 325 | # Subroutines # 326 | ############### 327 | 328 | ### Subroutine to empty a directory with user interaction 329 | sub empty_dir { 330 | my $dir = shift; 331 | print STDERR "\nDirectory '$dir' already exists! 
You can use either option '-d' to set a different output result directory name, or do you want to replace the directory and all its contents [y|n]? "; 332 | my $user_ask = <STDIN>; 333 | if ($user_ask =~ /y/i) { 334 | unlink glob "$dir/*"; # remove all files in results directory 335 | } else { 336 | die "\nScript aborted!\n"; 337 | } 338 | return 1; 339 | } 340 | -------------------------------------------------------------------------------- /prot_finder/prot_finder_pipe.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -e 3 | 4 | ############# 5 | # Functions # 6 | ############# 7 | 8 | usage () { 9 | cat 1>&2 << EOF # ${0##*/} parameter expansion substitution with variable '0' to get shell script filename without path 10 | Usage: ${0##*/} [OPTION] -q query.faa -f (embl|gbk) > blast_hits.tsv 11 | or: ${0##*/} [OPTION] -q query.faa -s subject.faa -d result_dir \\ 12 | > result_dir/blast_hits.tsv 13 | 14 | Bash wrapper script to run a pipeline consisting of optional 15 | 'cds_extractor.pl' (with its options '-p -f'), BLASTP, 'prot_finder.pl', 16 | and optional Clustal Omega. 'cds_extractor.pl' (only for shell script 17 | option '-f') and 'prot_finder.pl' either have to be installed in the 18 | global PATH or present in the current working directory. BLASTP is run 19 | with disabled query filtering, locally optimal Smith-Waterman alignments, 20 | and increasing the number of database sequences to show alignments 21 | to 500 for BioPerl parsing (legacy: '-F F -s T -b 500', plus: '-seg 22 | no -use_sw_tback -num_alignments 500'). 23 | 24 | The script ends with the STDERR message 'Pipeline finished!', if this 25 | is not the case have a look at the log files in the result directory 26 | for errors. 27 | 28 | Mandatory options: 29 | -q Path to query protein multi-FASTA file (*.faa) 30 | with unique FASTA IDs 31 | -f File extension for files in the current working 32 | directory to use for 'cds_extractor.pl' (e.g. 
33 | 'embl' or 'gbk'); excludes shell script option '-s' 34 | or 35 | -s Path to subject protein multi-FASTA file (*.faa) 36 | already created with 'cds_extractor.pl' (and its 37 | options '-p -f'), will not run 'cds_extractor.pl'; 38 | excludes shell script option '-f' 39 | 40 | Optional options: 41 | -h Print usage 42 | -d Path to result folder [default = results_i#_cq#] 43 | -p (legacy|plus) BLASTP suite to use [default = plus] 44 | -e E-value for BLASTP [default = 1e-10] 45 | -t Number of threads to be used for BLASTP and 46 | Clustal Omega [default = all processors on 47 | system] 48 | -i Query identity cutoff for significant hits 49 | [default = 70] 50 | -c Query coverage cutoff [default = 70] 51 | -k Subject coverage cutoff [default = 0] 52 | -b Give only best hit (highest identity) for each 53 | subject sequence 54 | -a Multiple alignment of each multi-FASTA result 55 | file with Clustal Omega 56 | -o Path to executable Clustal Omega binary if not 57 | in global PATH; requires shell script option '-a' 58 | -m Clean up all non-essential files 59 | 60 | Author: Andreas Leimbach 61 | EOF 62 | } 63 | 64 | 65 | ### Check external dependencies 66 | check_commands () { 67 | which "$1" > /dev/null || err "Required executable '$1' not found in global PATH, please install.$2" 68 | } 69 | 70 | ### Check cutoff options input 71 | check_cutoff_options () { 72 | local message="Option '-$2' requires an integer number >= 0 or <= 100 as value, not '$1'!" 
73 | [[ $1 =~ ^[0-9]+$ ]] || err "$message" 74 | (( $1 <= 100 )) || err "$message" # arithmetic expression (can only handle integer math, not float) 75 | } 76 | 77 | 78 | ### Error messages 79 | err () { 80 | echo -e "\n### Fatal error: $*" 1>&2 81 | exit 1 82 | } 83 | 84 | 85 | ### Run status of script to STDERR instead of STDOUT 86 | msg () { 87 | echo -e "# $*" 1>&2 88 | } 89 | 90 | 91 | ######## 92 | # MAIN # 93 | ######## 94 | 95 | shopt -s extglob # enable extended globs for bash 96 | 97 | Cmdline="$*" 98 | 99 | ### Getopts 100 | Blastp_Suite="plus" 101 | Evalue="1e-10" 102 | Threads="$(nproc --all)" # get max number of processors on system 103 | Ident_Cut=70 104 | Cov_Query_Cutoff=70 105 | Cov_Subject_Cutoff=0 106 | 107 | while getopts ':q:f:s:d:p:e:t:i:c:k:bao:mh' opt; do # beginning ':' indicates silent mode, trailing ':' after each option requires value 108 | case $opt in 109 | q) Query_File=$OPTARG 110 | [[ -r $Query_File ]] || err "Cannot read query file '$Query_File'!" 111 | ;; 112 | f) Subject_Ext=$OPTARG 113 | [[ -n "$(find . -maxdepth 1 -name "*.${Subject_Ext}" -print -quit)" ]] || err "No files with the option '-f' specified file extension '$Subject_Ext' found in the current working directory!" 114 | ;; 115 | s) Subject_File=$OPTARG 116 | [[ -r $Subject_File ]] || err "Cannot read subject file '$Subject_File'!" 117 | ;; 118 | d) Result_Dir=$OPTARG;; # checked below 119 | p) Blastp_Suite=$OPTARG 120 | [[ $Blastp_Suite = @(plus|legacy) ]] || err "Option '-p' only allows 'plus' for BLASTP+ or 'legacy' for legacy BLASTP as value, not '$Blastp_Suite'!" # extended glob (regex more expensive) 121 | ;; 122 | e) Evalue=$OPTARG 123 | [[ $Evalue =~ ^([0-9][0-9]*|[0-9]+e-[0-9]+)$ ]] || err "Option '-e' requires a real number (either integer or scientific exponential notation) as value, not '$Evalue'!" 124 | ;; 125 | t) Threads=$OPTARG 126 | [[ $Threads =~ ^[1-9][0-9]*$ ]] || err "Option '-t' requires an integer > 0 as value, not '$Threads'!" 
127 | ;; 128 | i) Ident_Cut=$OPTARG 129 | check_cutoff_options "$Ident_Cut" "i" 130 | ;; 131 | c) Cov_Query_Cutoff=$OPTARG 132 | check_cutoff_options "$Cov_Query_Cutoff" "c" 133 | ;; 134 | k) Cov_Subject_Cutoff=$OPTARG 135 | check_cutoff_options "$Cov_Subject_Cutoff" "k" 136 | ;; 137 | b) Opt_Best_Hit=1;; 138 | a) Opt_Align=1;; 139 | o) Clustal_Path=$OPTARG 140 | [[ -x $Clustal_Path ]] || err "Option '-o' requires the path to an executable Clustal Omega binary as value, not '$Clustal_Path'!" 141 | ;; 142 | m) Opt_Clean_Up=1;; 143 | h) usage; exit;; # usage function, exit code zero 144 | \?) err "Invalid option '-$OPTARG'. See usage with '-h'!";; 145 | :) err "Option '-$OPTARG' requires a value. See usage with '-h'!";; 146 | esac 147 | done 148 | 149 | 150 | ### Check options and enforce mandatory options 151 | [[ $Query_File && ($Subject_Ext || $Subject_File) ]] || err "Mandatory options '-q' and '-f' or '-s' are missing!" 152 | 153 | [[ $Subject_Ext && $Subject_File ]] && err "Options '-f' and '-s' given which exclude themselves. Choose either '-f' OR '-s'!" 154 | 155 | (( Threads <= $(nproc) )) || err "Number of threads for option '-t', '$Threads', exceeds the maximum $(nproc) processors on the system!" 156 | 157 | [[ ! $Opt_Align && $Clustal_Path ]] && Opt_Align=1 && msg "Option '-o' requires option '-a', forcing option '-a'!" 158 | 159 | 160 | ### Check external dependencies 161 | echo 1>&2 # newline 162 | msg "Checking pipeline dependencies" 163 | [[ $Opt_Align && ! $Clustal_Path ]] && check_commands "clustalo" " Or use option '-o' to give the path to the binary!" 
164 | 165 | for exe in cds_extractor.pl formatdb blastall makeblastdb blastp prot_finder.pl; do 166 | [[ $Subject_File && $exe == cds_extractor.pl ]] && continue 167 | [[ $Blastp_Suite == legacy && $exe = @(makeblastdb|blastp) ]] && continue # extended glob 168 | [[ $Blastp_Suite == plus && $exe = @(formatdb|blastall) ]] && continue 169 | if [[ $exe = *.pl ]]; then # glob 170 | if [[ -r "./$exe" ]]; then # present in current wd 171 | [[ $exe =~ ^cds ]] && Cds_Extractor_Cmd="perl cds_extractor.pl" 172 | [[ $exe =~ ^prot ]] && Prot_Finder_Cmd="perl prot_finder.pl" 173 | continue 174 | else 175 | [[ $exe =~ ^cds ]] && Cds_Extractor_Cmd="cds_extractor.pl" 176 | [[ $exe =~ ^prot ]] && Prot_Finder_Cmd="prot_finder.pl" 177 | check_commands "$exe" " Or copy the Perl script in the current working directory." 178 | fi 179 | continue 180 | fi 181 | check_commands "$exe" 182 | done 183 | 184 | msg "Script call command: ${0##*/} $Cmdline" 185 | 186 | 187 | ### Create result folder 188 | if [[ ! $Result_Dir ]]; then # can't give default before 'getopts' in case cutoffs are set by the user 189 | Result_Dir="results_i${Ident_Cut}_cq${Cov_Query_Cutoff}" 190 | else 191 | Result_Dir="${Result_Dir%/}" # parameter expansion substitution to get rid of a potential '/' at the end of Result_Dir path 192 | fi 193 | 194 | if [[ -d $Result_Dir ]]; then # make possible to redirect STDOUT output into result_dir (corresponding to option '-f' in 'protein_finder.pl' script) 195 | skip=0 196 | for file in "$Result_Dir"/*; do 197 | if [[ -s $file || $skip -eq 1 ]]; then # die if a file with size > 0 or more than one file already in result_dir 198 | err "Result directory '$Result_Dir' already exists! You can use option '-d' to set a different result directory name." 
199 | fi 200 | skip=1 201 | done 202 | else 203 | mkdir -pv "$Result_Dir" 1>&2 204 | fi 205 | 206 | 207 | ### Run cds_extractor.pl 208 | if [[ $Subject_Ext ]]; then 209 | msg "Running cds_extractor.pl on all '*.$Subject_Ext' files in the current working directory" 210 | for file in *."$Subject_Ext"; do 211 | file_no_ext="${file%.${Subject_Ext}}.faa" # parameter expansion substitution to get rid of file extension and replace with new one (*.faa are the output files from cds_extractor) 212 | File_Names+=("$file_no_ext") # append to array 213 | eval "$Cds_Extractor_Cmd -i $file -p -f &>> $Result_Dir/cds_extractor.log" # '&>' instead of '/dev/null' for error catching 214 | done 215 | Subject_File="$Result_Dir/prot_finder.faa" # for creating BLASTP db below 216 | cat "${File_Names[@]}" > "$Subject_File" # concatenate files stored in the array, "${array[@]}" expands to list of array elements (words) 217 | fi 218 | 219 | 220 | ### Run BLASTP 221 | msg "Running BLASTP '$Blastp_Suite' with subject '$Subject_File', query '$Query_File', evalue '$Evalue', and $Threads threads" 222 | Blast_Report="$Result_Dir/prot_finder.blastp" 223 | if [[ $Blastp_Suite == legacy ]]; then 224 | formatdb -p T -i "$Subject_File" -n prot_finder_db 225 | blastall -p blastp -d prot_finder_db -i "$Query_File" -o "$Blast_Report" -e "$Evalue" -F F -s T -b 500 -a "$Threads" 226 | elif [[ $Blastp_Suite == plus ]]; then 227 | makeblastdb -in "$Subject_File" -input_type fasta -dbtype prot -out prot_finder_db &> "$Result_Dir/makeblastdb.log" # '&>' instead of '/dev/null' for error catching 228 | blastp -db prot_finder_db -query "$Query_File" -out "$Blast_Report" -evalue "$Evalue" -seg no -use_sw_tback -num_alignments 500 -num_threads "$Threads" 229 | fi 230 | 231 | 232 | ### Run prot_finder.pl 233 | msg "Running prot_finder.pl with identity cutoff '$Ident_Cut', query coverage cutoff '$Cov_Query_Cutoff', and subject coverage cutoff '$Cov_Subject_Cutoff'" 234 | Cmd="$Prot_Finder_Cmd -d $Result_Dir -f -q 
$Query_File -s $Subject_File -r $Blast_Report -i $Ident_Cut -cov_q $Cov_Query_Cutoff -cov_s $Cov_Subject_Cutoff" 235 | [[ $Opt_Best_Hit ]] && Cmd="$Cmd -b" # append to command 236 | [[ $Opt_Align ]] && Cmd="$Cmd -a -t $Threads" 237 | [[ $Clustal_Path ]] && Cmd="$Cmd -p $Clustal_Path" 238 | eval "$Cmd" 2> "$Result_Dir/prot_finder.log" # '2>' instead of '/dev/null' for error catching 239 | 240 | msg "All result files stored in directory '$Result_Dir'" 241 | 242 | 243 | ### Clean up non-essential files 244 | if [[ $Opt_Clean_Up ]]; then 245 | msg "Removing non-essential output files, option '-m'" 246 | for file in "${File_Names[@]}"; do # remove output files from cds_extractor 247 | rm -v "$file" 1>&2 248 | done 249 | [[ $Subject_Ext ]] && rm -v "$Subject_File" 1>&2 # 'cat' from cds_extractor 250 | if [[ $Blastp_Suite == legacy ]]; then 251 | rm -v formatdb.log 1>&2 252 | [[ -r error.log ]] && rm -v error.log 1>&2 # no idea where this guy is coming from or what is its trigger 253 | fi 254 | rm -v prot_finder_db.p* "$Blast_Report" "$Result_Dir"/*.log "${Subject_File}.idx" 1>&2 255 | fi 256 | 257 | msg "Pipeline finished!" 258 | -------------------------------------------------------------------------------- /prot_finder/transpose_matrix.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | 3 | ####### 4 | # POD # 5 | ####### 6 | 7 | =pod 8 | 9 | =head1 NAME 10 | 11 | C - transpose a delimited TEXT matrix 12 | 13 | =head1 SYNOPSIS 14 | 15 | C 16 | input_matrix_transposed.tsv> 17 | 18 | B 19 | 20 | C binary_matrix_transposed.tsv> 22 | 23 | =head1 DESCRIPTION 24 | 25 | This script transposes a delimited TEXT input matrix, i.e. rows will 26 | become columns and columns rows. Use option B<-d> to set the 27 | delimiter of the input and output matrix, default is set to 28 | tab-delimited/separated matrices. Input matrices can be given 29 | directly via C or as a file. 
The script is intended for the 30 | resulting presence/absence binary matrices of 31 | C<prot_binary_matrix.pl>, but can be used for any TEXT matrix. 32 | 33 | The binary matrix of C<prot_binary_matrix.pl> has the query protein 34 | IDs as column headers and the subject genomes as row headers. Thus, 35 | C<transpose_matrix.pl> is very useful to transpose the 36 | C<prot_binary_matrix.pl> matrix for the usage with 37 | C<binary_group_stats.pl> to calculate presence/absence statistics 38 | for groups of columns/genomes (and not simply single columns of the 39 | matrix). C<binary_group_stats.pl> also has a comprehensive manual 40 | with its option B<-h>. 41 | 42 | Additionally, option B<-e> can be used to fill empty cells of the 43 | input matrix with a value in the transposed matrix (e.g. 'NA', '0' 44 | etc.). 45 | 46 | =head1 OPTIONS 47 | 48 | =over 20 49 | 50 | =item B<-h>, B<-help> 51 | 52 | Help (perldoc POD) 53 | 54 | =item B<-d>=I<str>, B<-delimiter>=I<str> 55 | 56 | Set delimiter of input and output matrix (e.g. comma ',', single 57 | space ' ' etc.) [default = tab-delimited/separated] 58 | 59 | =item B<-e>=I<str>, B<-empty>=I<str> 60 | 61 | Fill empty cells of the input matrix with a value in the transposed 62 | matrix (e.g. 'NA', '0' etc.) 63 | 64 | =item B<-v>, B<-version> 65 | 66 | Print version number to C<STDERR> 67 | 68 | =back 69 | 70 | =head1 OUTPUT 71 | 72 | =over 20 73 | 74 | =item C<STDOUT> 75 | 76 | The transposed matrix is printed to C<STDOUT>. Redirect or pipe into 77 | another tool as needed. 
78 | 79 | =back 80 | 81 | =head1 EXAMPLES 82 | 83 | =over 84 | 85 | =item C input_matrix_space-delimit_transposed.txt> 86 | 87 | =back 88 | 89 | B 90 | 91 | =over 92 | 93 | =item C "${matrix%.*}_transposed.tsv"; done> 94 | 95 | =back 96 | 97 | B 98 | 99 | =over 100 | 101 | =item C binary_matrix_transposed.csv> 102 | 103 | =back 104 | 105 | B 106 | 107 | =over 108 | 109 | =item C result_dir/binary_matrix_transposed.tsv> 110 | 111 | =back 112 | 113 | =head1 VERSION 114 | 115 | 0.1 12-04-2016 116 | 117 | =head1 AUTHOR 118 | 119 | Andreas Leimbach aleimba[at]gmx[dot]de 120 | 121 | =head1 ACKNOWLEDGEMENT 122 | 123 | The Perl implementation for transposing a matrix on Stack Overflow 124 | was very useful: 125 | L 126 | 127 | =head1 LICENSE 128 | 129 | This program is free software: you can redistribute it and/or modify 130 | it under the terms of the GNU General Public License as published by 131 | the Free Software Foundation; either version 3 (GPLv3) of the 132 | License, or (at your option) any later version. 133 | 134 | This program is distributed in the hope that it will be useful, but 135 | WITHOUT ANY WARRANTY; without even the implied warranty of 136 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 137 | General Public License for more details. 138 | 139 | You should have received a copy of the GNU General Public License 140 | along with this program. If not, see L. 141 | 142 | =cut 143 | 144 | 145 | ######## 146 | # MAIN # 147 | ######## 148 | 149 | use strict; 150 | use warnings; 151 | use autodie; 152 | use Getopt::Long; 153 | use Pod::Usage; 154 | 155 | ### Get the options with Getopt::Long 156 | my $Delimiter = "\t"; # set separator/delimiter of input/output matrix 157 | my $Empty; # optionally, fill empty cells with a value 158 | my $VERSION = 0.1; 159 | my ($Opt_Version, $Opt_Help); 160 | GetOptions ('delimiter=s' => \$Delimiter, 161 | 'empty=s' => \$Empty, 162 | 'version' => \$Opt_Version, 163 | 'help|?' 
=> \$Opt_Help) 164 | or pod2usage(-verbose => 1, -exitval => 2); 165 | 166 | 167 | ### Run perldoc on POD and set option defaults 168 | pod2usage(-verbose => 2) if ($Opt_Help); 169 | die "$0 $VERSION\n" if ($Opt_Version); 170 | 171 | 172 | ### Check input 173 | if (-t STDIN && ! @ARGV) { 174 | my $warning = "\n### Fatal error: No STDIN and no input file given as argument, please supply one of them and/or see help with '-h'!\n"; 175 | pod2usage(-verbose => 0, -message => $warning, -exitval => 2); 176 | } elsif (!-t STDIN && @ARGV) { 177 | my $warning = "\n### Fatal error: Both STDIN and an input file given as argument, please supply only either one and/or see help with '-h'!\n"; 178 | pod2usage(-verbose => 0, -message => $warning, -exitval => 2); 179 | } 180 | die "\n### Fatal error: Too many arguments given, only STDIN or one input file allowed as argument! Please see the usage with option '-h' if unclear!\n\n" if (@ARGV > 1); 181 | die "\n### Fatal error: File '$ARGV[0]' does not exist!\n\n" if (@ARGV && $ARGV[0] ne '-' && !-e $ARGV[0]); 182 | 183 | 184 | ### Parse input matrix 185 | my %Input_Matrix; # hash of hash to store the input matrix 186 | my $Max_Columns = 0; # maximum number of columns, needed in case not every row of input matrix has the same number of columns 187 | my $Row_Num = 0; # count input matrix number of rows 188 | while (<>) { 189 | chomp; 190 | warn "### Warning: Set separator/delimiter '$Delimiter' (option '-d') not found in the following first line/header of input matrix, sure the correct one is set?\n$_\n\n" if ($_ !~ /$Delimiter/ && $. 
== 1); 191 | 192 | my $col_num = 0; # count number of columns for each row 193 | foreach my $cell (split(/$Delimiter/)) { # split each row for the cells 194 | $cell = $Empty if ($cell =~ /^$/); # needed for empty cells in between cells with values, because for these $cell is defined in print out below 195 | $Input_Matrix{$Row_Num}{$col_num++} = $cell; 196 | } 197 | 198 | $Max_Columns = $col_num if ($col_num > $Max_Columns); 199 | $Row_Num++; 200 | } 201 | 202 | 203 | ### Print out transposed matrix 204 | my $Max_Rows = $Row_Num; 205 | for (my $col_num = 0; $col_num < $Max_Columns; $col_num++) { 206 | for ($Row_Num = 0; $Row_Num < $Max_Rows; $Row_Num++) { # repurposing $Row_Num 207 | print "$Delimiter" if ($Row_Num > 0); # separator only after the first transposed column 208 | if (defined $Input_Matrix{$Row_Num}{$col_num}) { # 'defined' needed, in case $cell has '0' as value 209 | print $Input_Matrix{$Row_Num}{$col_num}; 210 | } elsif (defined $Empty) { # for rows of the input matrix with columns < $Max_Columns; 'defined' needed, in case $Empty is set to '0' 211 | print $Empty; 212 | } 213 | } 214 | print "\n"; 215 | } 216 | 217 | exit; 218 | -------------------------------------------------------------------------------- /rename_fasta_id/README.md: -------------------------------------------------------------------------------- 1 | rename_fasta_id 2 | =============== 3 | 4 | `rename_fasta_id.pl` is a script to rename fasta IDs according to regular expressions. 
5 | 6 | * [Synopsis](#synopsis) 7 | * [Description](#description) 8 | * [Usage](#usage) 9 | * [Options](#options) 10 | * [Mandatory options](#mandatory-options) 11 | * [Optional options](#optional-options) 12 | * [Output](#output) 13 | * [Run environment](#run-environment) 14 | * [Author - contact](#author---contact) 15 | * [Citation, installation, and license](#citation-installation-and-license) 16 | * [Changelog](#changelog) 17 | 18 | ## Synopsis 19 | 20 | perl rename_fasta_id.pl -i file.fasta -p "NODE_.+$" -r "K-12_" -n -a c > out.fasta 21 | 22 | **or** 23 | 24 | zcat file.fasta.gz | perl rename_fasta_id.pl -i - -p "coli" -r "" -o > out.fasta 25 | 26 | ## Description 27 | 28 | This script uses the built-in Perl substitution operator `s///` to 29 | replace strings in FASTA IDs. To do this, a **pattern** and a 30 | **replacement** have to be provided (Perl regular expression syntax 31 | can be used). The leading '>' character for the FASTA ID will be 32 | removed before the substitution and added again afterwards. FASTA 33 | IDs will be searched for matches with the **pattern**, and if found 34 | the **pattern** will be replaced by the **replacement**. 35 | 36 | **IMPORTANT**: Enclose the **pattern** and the **replacement** in 37 | quotation marks (' or ") if they contain characters that would be 38 | interpreted by the shell (e.g. pipes '|', brackets etc.). 
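As a rough illustration of the behaviour described above (not the script's actual implementation), the following shell sketch mimics an ID substitution combined with the options **-n** and **-a**. The function name `rename_ids` is hypothetical, and it relies on `awk`, whose POSIX extended regular expressions differ from Perl's, so only simple patterns carry over. Note the single quotes around the pattern, which keep the shell from interpreting the `$`:

```shell
# Hypothetical helper for illustration only; not part of rename_fasta_id.pl.
# Usage: rename_ids PATTERN REPLACEMENT APPEND < in.fasta
rename_ids() {
    awk -v pat="$1" -v rep="$2" -v app="$3" '
        /^>/ {
            id = substr($0, 2)            # drop the leading ">"
            if (id ~ pat) {
                sub(pat, rep, id)         # replace the first match only (no -g)
                n++                       # count the IDs that were substituted
                id = id n app             # append the count, then the suffix
            }
            print ">" id                  # re-add ">" to every ID line
            next
        }
        { print }                         # sequence lines pass through unchanged
    '
}

printf '>NODE_1_length_500\nACGT\n>plasmid\nTTGA\n' | rename_ids 'NODE_.+$' 'K-12_' 'c'
```

With this input the first ID becomes `>K-12_1c`, while `>plasmid` and the sequence lines are printed unchanged. For real data, use the script itself; the sketch does not cover case-insensitive or global matching.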
39 | 40 | For substitutions without any appendices in a UNIX OS you can of 41 | course just use the great 42 | [`sed`](https://www.gnu.org/software/sed/manual/sed.html) (see 43 | `man sed`), e.g.: 44 | 45 | sed 's/^>pattern/>replacement/' file.fasta 46 | 47 | ## Usage 48 | 49 | perl rename_fasta_id.pl -i file.fasta -p "T" -r "a" -c -g -o 50 | 51 | ## Options 52 | 53 | ### Mandatory options 54 | 55 | - -i, -input 56 | 57 | Input FASTA file or piped STDIN (-) from a gzipped file 58 | 59 | - -p, -pattern 60 | 61 | Pattern to be replaced in FASTA ID 62 | 63 | - -r, -replacement 64 | 65 | Replacement to replace the pattern with. To entirely remove the pattern use '' or "" as input for **-r**. 66 | 67 | ### Optional options 68 | 69 | - -h, -help 70 | 71 | Help (perldoc POD) 72 | 73 | - -c, -case-insensitive 74 | 75 | Match pattern case-insensitive 76 | 77 | - -g, -global 78 | 79 | Replace pattern globally in the string 80 | 81 | - -n, -numerate 82 | 83 | Append a numeration/the count of the pattern hits to the replacement. This is e.g. useful to number contigs consecutively in a draft genome. 84 | 85 | - -a, -append 86 | 87 | Append a string after the numeration, e.g. 'c' for chromosome 88 | 89 | - -o, -output 90 | 91 | Verbose output of the substitutions that were carried out, printed to *STDERR* 92 | 93 | - -v, -version 94 | 95 | Print version number to *STDERR* 96 | 97 | ## Output 98 | 99 | - *STDOUT* 100 | 101 | The FASTA file with substituted ID lines is printed to *STDOUT*. Redirect or pipe into another tool as needed. 102 | 103 | ## Run environment 104 | 105 | The Perl script runs under Windows and UNIX flavors. 
106 | 107 | ## Author - contact 108 | 109 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 110 | 111 | ## Citation, installation, and license 112 | 113 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 114 | 115 | ## Changelog 116 | 117 | - v0.1 (09.11.2014) 118 | -------------------------------------------------------------------------------- /rename_fasta_id/rename_fasta_id.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | 3 | ####### 4 | # POD # 5 | ####### 6 | 7 | =pod 8 | 9 | =head1 NAME 10 | 11 | C - rename fasta IDs according to regular expressions 12 | 13 | =head1 SYNOPSIS 14 | 15 | C out.fasta> 16 | 17 | B 18 | 19 | C out.fasta> 20 | 21 | =head1 DESCRIPTION 22 | 23 | This script uses the built-in Perl substitution operator C to 24 | replace strings in FASTA IDs. To do this, a B and a 25 | B have to be provided (Perl regular expression syntax 26 | can be used). The leading '>' character for the FASTA ID will be 27 | removed before the substitution and added again afterwards. FASTA 28 | IDs will be searched for matches with the B, and if found 29 | the B will be replaced by the B. 30 | 31 | B: Enclose the B and the B in 32 | quotation marks (' or ") if they contain characters that would be 33 | interpreted by the shell (e.g. pipes '|', brackets etc.). 
34 | 35 | For substitutions without any appendices in a UNIX OS you can of 36 | course just use the great 37 | L|https://www.gnu.org/software/sed/manual/sed.html> (see 38 | C), e.g.: 39 | 40 | Cpattern/Ereplacement/' file.fasta> 41 | 42 | =head1 OPTIONS 43 | 44 | =head2 Mandatory options 45 | 46 | =over 20 47 | 48 | =item B<-i>=I, B<-input>=I 49 | 50 | Input FASTA file or piped STDIN (-) from a gzipped file 51 | 52 | =item B<-p>=I, B<-pattern>=I 53 | 54 | Pattern to be replaced in FASTA ID 55 | 56 | =item B<-r>=I, B<-replacement>=I 57 | 58 | Replacement to replace the pattern with. To entirely remove the 59 | pattern use '' or "" as input for B<-r>. 60 | 61 | =back 62 | 63 | =head2 Optional options 64 | 65 | =over 20 66 | 67 | =item B<-h>, B<-help> 68 | 69 | Help (perldoc POD) 70 | 71 | =item B<-c>, B<-case-insensitive> 72 | 73 | Match pattern case-insensitive 74 | 75 | =item B<-g>, B<-global> 76 | 77 | Replace pattern globally in the string 78 | 79 | =item B<-n>, B<-numerate> 80 | 81 | Append a numeration/the count of the pattern hits to the 82 | replacement. This is e.g. useful to number contigs consecutively in 83 | a draft genome. 84 | 85 | =item B<-a>=I, B<-append>=I 86 | 87 | Append a string after the numeration, e.g. 'c' for chromosome 88 | 89 | =item B<-o>, B<-output> 90 | 91 | Verbose output of the substitutions that were carried out, printed 92 | to C 93 | 94 | =item B<-v>, B<-version> 95 | 96 | Print version number to C 97 | 98 | =back 99 | 100 | =head1 OUTPUT 101 | 102 | =over 20 103 | 104 | =item C 105 | 106 | The FASTA file with substituted ID lines is printed to C. 107 | Redirect or pipe into another tool as needed. 
108 | 109 | =back 110 | 111 | =head1 EXAMPLES 112 | 113 | =over 114 | 115 | =item C 116 | 117 | =back 118 | 119 | =head1 VERSION 120 | 121 | 0.1 09-11-2014 122 | 123 | =head1 AUTHOR 124 | 125 | Andreas Leimbach aleimba[at]gmx[dot]de 126 | 127 | =head1 LICENSE 128 | 129 | This program is free software: you can redistribute it and/or modify 130 | it under the terms of the GNU General Public License as published by 131 | the Free Software Foundation; either version 3 (GPLv3) of the License, 132 | or (at your option) any later version. 133 | 134 | This program is distributed in the hope that it will be useful, but 135 | WITHOUT ANY WARRANTY; without even the implied warranty of 136 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 137 | General Public License for more details. 138 | 139 | You should have received a copy of the GNU General Public License 140 | along with this program. If not, see L. 141 | 142 | =cut 143 | 144 | 145 | ######## 146 | # MAIN # 147 | ######## 148 | 149 | use strict; 150 | use warnings; 151 | use autodie; 152 | use Getopt::Long; 153 | use Pod::Usage; 154 | 155 | ### Get the options with Getopt::Long 156 | my $Input_File; # input fasta file 157 | my $Pattern; # pattern to search for in the FASTA IDs 158 | my $Replacement; # regex to replace pattern with 159 | my $Opt_Case; # substitute case-insensitive 160 | my $Opt_Global; # substitute pattern globally in string 161 | my $Opt_Numerate; # append the count of the performed substitutions to each replacement regex 162 | my $Append; # append an additional string after $Opt_Numerate 163 | my $Opt_Output; # print substitutions to STDERR 164 | my $VERSION = 0.1; 165 | my ($Opt_Version, $Opt_Help); 166 | GetOptions ('input=s' => \$Input_File, 167 | 'pattern=s' => \$Pattern, 168 | 'replacement=s' => \$Replacement, 169 | 'case-insensitive' => \$Opt_Case, 170 | 'global' => \$Opt_Global, 171 | 'numerate' => \$Opt_Numerate, 172 | 'append:s' => \$Append, 173 | 'output' => \$Opt_Output, 174 | 
'version' => \$Opt_Version, 175 | 'help|?' => \$Opt_Help); 176 | 177 | 178 | 179 | ### Run perldoc on POD 180 | pod2usage(-verbose => 2) if ($Opt_Help); 181 | die "$0 $VERSION\n" if ($Opt_Version); 182 | if (!$Input_File || !$Pattern || !defined $Replacement) { # '-r' is mandatory as well, but may be an empty string 183 | my $warning = "\n### Fatal error: Options '-i', '-p', or '-r' or their arguments are missing!\n"; 184 | pod2usage(-verbose => 1, -message => $warning, -exitval => 2); 185 | } 186 | 187 | 188 | 189 | ### Pipe input from STDIN or open input file 190 | my $Input_Fh; 191 | if ($Input_File eq '-') { # file input via STDIN 192 | $Input_Fh = *STDIN; # capture typeglob of STDIN 193 | } else { # input via input file 194 | open ($Input_Fh, "<", "$Input_File"); 195 | } 196 | 197 | 198 | 199 | ### Parse FASTA file 200 | my $Substitution_Count = 0; # count substitutions 201 | while (<$Input_Fh>) { 202 | chomp; 203 | 204 | # only substitute in FASTA ID lines 205 | if (/^>/) { 206 | # only substitute if pattern found, case-sensitive or case-INsensitive 207 | if (/$Pattern/ || (/$Pattern/i && $Opt_Case)) { 208 | $_ = substitute_string($_); # subroutine 209 | 210 | # "reprint" FASTA IDs, which don't fit the pattern 211 | } else { 212 | print "$_\n"; 213 | } 214 | 215 | # "reprint" sequence/non-ID lines of FASTA files 216 | } else { 217 | print "$_\n"; 218 | } 219 | } 220 | print STDERR "$Substitution_Count substitutions have been carried out\n"; 221 | 222 | exit; 223 | 224 | 225 | ############# 226 | #Subroutines# 227 | ############# 228 | 229 | ### Subroutine to rename headers/ID lines of the FASTA file 230 | sub substitute_string { 231 | my $string = shift; 232 | $string =~ s/^>//; # get rid of '>', re-add afterwards 233 | 234 | print STDERR "$string " if ($Opt_Output); # optional verbose output to STDERR 235 | $Substitution_Count++; # count occurrences of carried out substitutions 236 | 237 | # substitutions 238 | if ($Opt_Global && $Opt_Case) { 239 | $string =~ s/$Pattern/$Replacement/gi; 240 | } elsif ($Opt_Case) { 241 | $string =~ 
s/$Pattern/$Replacement/i; 242 | } elsif ($Opt_Global) { 243 | $string =~ s/$Pattern/$Replacement/g; 244 | } else { 245 | $string =~ s/$Pattern/$Replacement/; 246 | } 247 | 248 | # output to STDOUT, optionally STDERR 249 | print ">$string"; 250 | print STDERR "-> $string" if ($Opt_Output); 251 | if ($Opt_Numerate) { 252 | print "$Substitution_Count"; 253 | print STDERR "$Substitution_Count" if ($Opt_Output); 254 | } 255 | 256 | if ($Append) { 257 | print "$Append"; 258 | print STDERR "$Append" if ($Opt_Output); 259 | } 260 | 261 | print "\n"; 262 | print STDERR "\n" if ($Opt_Output); 263 | 264 | return 1; 265 | } 266 | -------------------------------------------------------------------------------- /revcom_seq/README.md: -------------------------------------------------------------------------------- 1 | revcom_seq 2 | ========== 3 | 4 | `revcom_seq.pl` is a script to reverse complement (multi-)sequence files. 5 | 6 | * [Synopsis](#synopsis) 7 | * [Description](#description) 8 | * [Usage](#usage) 9 | * [Options](#options) 10 | * [Output](#output) 11 | * [Run environment](#run-environment) 12 | * [Dependencies](#dependencies) 13 | * [Author - contact](#author---contact) 14 | * [Citation, installation, and license](#citation-installation-and-license) 15 | * [Changelog](#changelog) 16 | 17 | ## Synopsis 18 | 19 | perl revcom_seq.pl seq-file.embl > seq-file_revcom.embl 20 | 21 | **or** 22 | 23 | perl cat_seq.pl multi-seq_file.embl | perl revcom_seq.pl -i embl > seq_file_cat_revcom.embl 24 | 25 | ## Description 26 | 27 | This script reverse complements (multi-)sequence files. The 28 | features/annotations in RichSeq files (e.g. EMBL or GENBANK format) 29 | will also be adapted accordingly. Use option **-o** to specify a 30 | different output sequence format. Input files can be given directly via 31 | *STDIN* or as a file. If *STDIN* is used, the input sequence file 32 | format has to be given with option **-i**. Be careful to set the 33 | correct input format. 
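For plain nucleotide strings, the core operation can be sketched in a few lines of shell. This is only an illustration of what reverse complementing means; it is not how `revcom_seq.pl` works internally (the script relies on BioPerl), and the hypothetical function `revcom` below ignores sequence headers, IUPAC ambiguity codes, and annotated features:

```shell
# Hypothetical helper for illustration only; not part of revcom_seq.pl.
# Reverse complements one plain DNA string (A, C, G, T; case is preserved).
revcom() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGTacgt' 'TGCAtgca'    # complement each base of the reversed string
}

revcom "ATGCAA"    # prints TTGCAT
```

For real sequence files, keep using the script, since the feature coordinates in RichSeq files have to be adapted as well.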
34 | 35 | ## Usage 36 | 37 | perl revcom_seq.pl -o gbk seq-file.embl > seq-file_revcom.gbk 38 | 39 | **or** reverse complement all sequence files in the current working directory: 40 | 41 | for file in *.embl; do perl revcom_seq.pl -o fasta "$file" > "${file%.embl}"_revcom.fasta; done 42 | 43 | ## Options 44 | 45 | - **-h**, **-help** 46 | 47 | Help (perldoc POD) 48 | 49 | - **-o**=*str*, **-outformat**=*str* 50 | 51 | Specify different sequence format for the output [fasta, embl, or gbk] 52 | 53 | - **-i**=*str*, **-informat**=*str* 54 | 55 | Specify the input sequence file format, only needed for *STDIN* input 56 | 57 | - **-v**, **-version** 58 | 59 | Print version number to *STDOUT* 60 | 61 | ## Output 62 | 63 | - *STDOUT* 64 | 65 | The reverse complemented sequence file is printed to *STDOUT*. 66 | Redirect or pipe into another tool as needed. 67 | 68 | ## Run environment 69 | 70 | The Perl script runs under Windows and UNIX flavors. 71 | 72 | ## Dependencies 73 | 74 | - [**BioPerl**](http://www.bioperl.org) (tested version 1.007001) 75 | 76 | ## Author - contact 77 | 78 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 79 | 80 | ## Citation, installation, and license 81 | 82 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 
83 | 84 | ## Changelog 85 | 86 | * v0.2 (2015-12-10) 87 | * included a POD instead of a simple usage text 88 | * included `pod2usage` with Pod::Usage 89 | * included 'use autodie' pragma 90 | * options with Getopt::Long 91 | * output format now specified with option **-o** 92 | * included version switch, **-v** 93 | * allowed file and *STDIN* input, instead of only file; thus new option **-i** for input format 94 | * output printed to *STDOUT* now, instead of output file 95 | * fixed bug, that only first sequence in multi-sequence file is reverse complemented. Now all sequences in a multi-seq file are reverse complemented. 96 | * v0.1 (2013-02-08) 97 | -------------------------------------------------------------------------------- /revcom_seq/revcom_seq.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | 3 | ####### 4 | # POD # 5 | ####### 6 | 7 | =pod 8 | 9 | =head1 NAME 10 | 11 | C - reverse complement (multi-)sequence files 12 | 13 | =head1 SYNOPSIS 14 | 15 | C seq-file_revcom.embl> 16 | 17 | B 18 | 19 | C seq_file_cat_revcom.embl> 21 | 22 | =head1 DESCRIPTION 23 | 24 | This script reverse complements (multi-)sequence files. The 25 | features/annotations in RichSeq files (e.g. EMBL or GENBANK format) 26 | will also be adapted accordingly. Use option B<-o> to specify a 27 | different output sequence format. Input files can be given directly via 28 | C or as a file. If C is used, the input sequence file 29 | format has to be given with option B<-i>. Be careful to set the correct 30 | input format. 
31 | 32 | =head1 OPTIONS 33 | 34 | =over 20 35 | 36 | =item B<-h>, B<-help> 37 | 38 | Help (perldoc POD) 39 | 40 | =item B<-o>=I, B<-outformat>=I 41 | 42 | Specify different sequence format for the output [fasta, embl, or gbk] 43 | 44 | =item B<-i>=I, B<-informat>=I 45 | 46 | Specify the input sequence file format, only needed for C input 47 | 48 | =item B<-v>, B<-version> 49 | 50 | Print version number to C 51 | 52 | =back 53 | 54 | =head1 OUTPUT 55 | 56 | =over 20 57 | 58 | =item C 59 | 60 | The reverse complemented sequence file is printed to C. 61 | Redirect or pipe into another tool as needed. 62 | 63 | =back 64 | 65 | =head1 EXAMPLES 66 | 67 | =over 68 | 69 | =item C 70 | seq-file_revcom.gbk> 71 | 72 | =back 73 | 74 | B 75 | 76 | =over 77 | 78 | =item C "${file%.embl}"_revcom.fasta; done> 80 | 81 | =back 82 | 83 | =head1 DEPENDENCIES 84 | 85 | =over 86 | 87 | =item B> 88 | 89 | Tested with BioPerl version 1.007001 90 | 91 | =back 92 | 93 | =head1 VERSION 94 | 95 | 0.2 update: 2015-12-10 96 | 0.1 2013-08-02 97 | 98 | =head1 AUTHOR 99 | 100 | Andreas Leimbach aleimba[at]gmx[dot]de 101 | 102 | =head1 LICENSE 103 | 104 | This program is free software: you can redistribute it and/or modify 105 | it under the terms of the GNU General Public License as published by 106 | the Free Software Foundation; either version 3 (GPLv3) of the 107 | License, or (at your option) any later version. 108 | 109 | This program is distributed in the hope that it will be useful, but 110 | WITHOUT ANY WARRANTY; without even the implied warranty of 111 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 112 | General Public License for more details. 113 | 114 | You should have received a copy of the GNU General Public License 115 | along with this program. If not, see L. 
116 | 117 | =cut 118 | 119 | 120 | ######## 121 | # MAIN # 122 | ######## 123 | 124 | use strict; 125 | use warnings; 126 | use autodie; 127 | use Getopt::Long; 128 | use Pod::Usage; 129 | use Bio::SeqIO; # bioperl module to handle sequence input/output 130 | #use Bio::Seq; # bioperl module to handle sequences with features ### apparently not needed, methods inherited 131 | #use Bio::SeqUtils; # bioperl module with additional methods (including features) for Bio::Seq objects ### apparently not needed, methods inherited 132 | 133 | ### Get options with Getopt::Long 134 | my $In_Format; # input seq file format needed for STDIN 135 | my $Out_Format; # optional different output seq file format 136 | my $VERSION = 0.2; 137 | my ($Opt_Version, $Opt_Help); 138 | GetOptions ('informat=s' => \$In_Format, 139 | 'outformat=s' => \$Out_Format, 140 | 'version' => \$Opt_Version, 141 | 'help|?' => \$Opt_Help) 142 | or pod2usage(-verbose => 1, -exitval => 2); 143 | 144 | 145 | 146 | ### Run perldoc on POD 147 | pod2usage(-verbose => 2) if ($Opt_Help); 148 | if ($Opt_Version) { 149 | print "$0 $VERSION\n"; 150 | exit; 151 | } 152 | 153 | 154 | 155 | ### Check input (@ARGV and STDIN) 156 | if (-t STDIN && ! @ARGV) { 157 | my $warning = "\n### Fatal error: No STDIN and no input file given as argument, please supply one of them and/or see help with '-h'!\n"; 158 | pod2usage(-verbose => 0, -message => $warning, -exitval => 2); 159 | } elsif (!-t STDIN && @ARGV) { 160 | my $warning = "\n### Fatal error: Both STDIN and an input file given as argument, please supply only either one and/or see help with '-h'!\n"; 161 | pod2usage(-verbose => 0, -message => $warning, -exitval => 2); 162 | } 163 | die "\n### Fatal error: Too many arguments given, only STDIN or one input file allowed as argument! 
Please see the usage with option '-h' if unclear!\n" if (@ARGV > 1); 164 | die "\n### Fatal error: File '$ARGV[0]' does not exist!\n" if (@ARGV && $ARGV[0] ne '-' && !-e $ARGV[0]); 165 | 166 | 167 | 168 | ### Bio::SeqIO objects for input and output 169 | print STDERR "\nReverse complementing"; 170 | my $Seqin; # Bio::SeqIO object 171 | if (-t STDIN) { # input from file 172 | warn "\n### Warning: Ignoring input file format ('-i $In_Format'), because input file given and not STDIN!\n\n" if ($In_Format); 173 | my $seq_file = shift; 174 | $Seqin = Bio::SeqIO->new(-file => "<$seq_file"); # Bio::SeqIO object; no '-format' given, leave it to bioperl guessing 175 | print STDERR " '$seq_file' "; 176 | } elsif (!-t STDIN) { # input from STDIN 177 | die "\n### Fatal error: Sequence file given as STDIN requires an input file format, please set one with option '-i' and/or see help with '-h'!\n" if (!$In_Format); 178 | $In_Format = 'genbank' if ($In_Format =~ /(gbk|gb)/i); # allow shorter format string for 'genbank' 179 | $Seqin = Bio::SeqIO->new(-fh => \*STDIN, -format => $In_Format); # capture typeglob of STDIN, requires '-format' 180 | print STDERR " input file "; 181 | } 182 | print STDERR "...\n"; 183 | 184 | my $Seqout; # Bio::SeqIO object 185 | if ($Out_Format) { 186 | $Out_Format = 'genbank' if ($Out_Format =~ /(gbk|gb)/i); 187 | } else { # same format as input file 188 | if (!-t STDIN) { 189 | $Out_Format = $In_Format; 190 | } else { 191 | if (ref($Seqin) =~ /Bio::SeqIO::(genbank|embl|fasta)/) { # from bioperl guessing 192 | $Out_Format = $1; 193 | } else { 194 | die "\n### Fatal error: Could not determine input file format, please set an output file format with option '-o'!\n"; 195 | } 196 | } 197 | } 198 | $Seqout = Bio::SeqIO->new(-fh => \*STDOUT, -format => $Out_Format); # printing to STDOUT requires '-format' 199 | 200 | 201 | ### Write reverse complemented sequence (and its features) to STDOUT 202 | while (my $seq_obj = $Seqin->next_seq) { # Bio::Seq object; for 
multi-seq files 203 | my $revcom = Bio::SeqUtils->revcom_with_features($seq_obj); 204 | $Seqout->write_seq($revcom); 205 | } 206 | 207 | exit; 208 | -------------------------------------------------------------------------------- /rod_finder/README.md: -------------------------------------------------------------------------------- 1 | rod_finder 2 | ========== 3 | 4 | A script to find regions of difference (RODs) between a query genome and reference genome(s). 5 | 6 | ## Synopsis 7 | 8 | blast_rod_finder_legacy.sh subject.fasta query.fasta query.[embl|gbk|fasta] 5000 9 | 10 | or 11 | 12 | perl blast_rod_finder.pl -q query.embl -r blastn.out -m 2000 13 | 14 | ## Description 15 | 16 | This script is intended to identify RODs between a nucleotide query and a nucleotide subject/reference sequence. In order to do so, a *blastn* (http://blast.ncbi.nlm.nih.gov/Blast.cgi) needs to be performed beforehand with the query and the subject sequences (see also *blast_rod_finder_legacy.sh* below). *blast_rod_finder.pl* is mainly designed to work with bacterial genomes, while a query genome can be blasted against several subject sequences to detect RODs over a number of references. Although the results are optimized towards a complete query genome, both the reference(s) as well as the query can be used in draft form. To create artificial genomes via concatenation use *cat_seq.pl* or the EMBOSS application union (http://emboss.sourceforge.net/). 17 | 18 | The *blastn* report file, the query sequence file (preferably in RichSeq format, see below) and a minimum size for ROD detection have to be provided. Subsequently, RODs are summarized in a tab-separated summary file, a gff3 (usable e.g. in Artemis/DNAPlotter, http://www.sanger.ac.uk/resources/software/artemis/) and a BRIG (BLAST Ring Image Generator, http://brig.sourceforge.net/) output file. Nucleotide sequences of each ROD are written to a multi-fasta file. 
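To make the notion of a ROD more concrete, here is a minimal shell/awk sketch. It rests on an assumption that this README does not spell out: that a ROD is a stretch of the query not covered by any *blastn* hit and at least as long as the minimum size. The function name `uncovered_regions` and its input layout (one `start end` pair of query coordinates per line) are made up for this illustration and are not part of *blast_rod_finder.pl*.

```shell
# Hypothetical helper, NOT part of this repository.
# Assumption: a ROD is a query stretch without any BLAST hit, >= min_size bp.
# stdin: one "start end" pair per line (1-based, inclusive, start <= end)
# args:  $1 = query length, $2 = minimum region size
uncovered_regions() {
    sort -k1,1n -k2,2n | awk -v len="$1" -v min="$2" '
        BEGIN { covered_to = 0 }             # rightmost covered position so far
        {
            if ($1 > covered_to + 1) {       # gap before this hit
                gap_start = covered_to + 1
                gap_end = $1 - 1
                if (gap_end - gap_start + 1 >= min) print gap_start, gap_end
            }
            if ($2 > covered_to) covered_to = $2
        }
        END {                                # trailing gap up to the query end
            if (covered_to < len && len - covered_to >= min) print covered_to + 1, len
        }'
}
```

For example, `printf '100 200\n150 400\n1000 1200\n' | uncovered_regions 2000 300` reports the two uncovered stretches that are at least 300 bp long; the short stretch before the first hit is dropped.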
19 | 20 | The query sequence can be provided in RichSeq format (embl or genbank), but has to correspond to the fasta file used in querying the BLAST database (the accession numbers have to correspond to the fasta headers). Use *seq_format-converter.pl* to create a corresponding fasta file from embl|genbank files for *blastn* if needed. With RichSeq query files additional info is given in the result summary and the amino acid sequences of all non-pseudo CDSs, which are contained or overlap a ROD, are written to a result file. Furthermore, all detected RODs are saved in individual sequence files in the corresponding query sequence format. 21 | 22 | Run *blastn* and the script *blast_rod_finder.pl* consecutively manually or use the bash shell wrapper script *blast_rod_finder_legacy.sh* (see usage below) to perform the pipeline with one command. The same folder has to contain the subject fasta file(s), the query fasta file, optionally the query RichSeq file and the script *blast_rod_finder.pl*! *blastn* is run **without** filtering of query sequences ('-F F') and an evalue cutoff of '2e-11' is set. 23 | 24 | ## Usage 25 | 26 | ### 1.) Manual consecutively 27 | 28 | #### 1.1.) *blastn* 29 | 30 | formatdb -p F -i subject.fasta -n blast_db 31 | blastall -p blastn -d blast_db -i query.fasta -o blastn.out -e 2e-11 -F F 32 | 33 | #### 1.2.) *blast_rod_finder.pl* 34 | 35 | perl blast_rod_finder.pl -q query.[embl|gbk|fasta] -r blastn.out -m 5000 36 | 37 | ### 2.) 
With one command: *blast_rod_finder_legacy.sh* pipeline 38 | 39 | blast_rod_finder_legacy.sh subject.fasta query.fasta query.[embl|gbk|fasta] 5000 40 | 41 | ## Options for *blast_rod_finder.pl* 42 | 43 | ### Mandatory options 44 | 45 | * -m, -min 46 | 47 | Minimum size of RODs that are reported 48 | 49 | * -q, -query 50 | 51 | Query sequence file [fasta, embl, or genbank format] 52 | 53 | * -r, -report 54 | 55 | *blastn* report/output file 56 | 57 | ### Optional options 58 | 59 | * -h, -help: Help (perldoc POD) 60 | 61 | ## Output 62 | 63 | ### a.) *blast_rod_finder_legacy.sh* or *blastn* 64 | 65 | * *blastn* database files for subject sequence(s) 66 | 67 | \*.nhr, \*.nin, \*.nsq 68 | 69 | * *blastn* report 70 | 71 | Text file named 'blastn.out' 72 | 73 | ### b.) *blast_rod_finder.pl* 74 | 75 | * ./results 76 | 77 | All output files are stored in this results folder 78 | 79 | * rod_summary.txt 80 | 81 | Summary of detected ROD regions (for embl/genbank queries includes annotation), tab-separated 82 | 83 | * rod.gff 84 | 85 | GFF3 file with ROD coordinates to use in Artemis/DNAPlotter etc. 86 | 87 | * rod_BRIG.txt 88 | 89 | ROD coordinates to use in BRIG (BLAST Ring Image Generator), tab-separated 90 | 91 | * rod_seq.fasta 92 | 93 | Nucleotide sequences of ROD regions (>ROD# size start..stop), multi-fasta 94 | 95 | * rod_aa_fasta.txt 96 | 97 | Only present if query is in RichSeq format. Amino acid sequences of all CDSs that are contained in or overlap a ROD region in multi-fasta format (>locus_tag gene product). RODs are separated in the file via '\~\~' (\~\~ROD# size start..stop). 98 | 99 | * ROD#.[embl|gbk] 100 | 101 | Only present if query is in RichSeq format. Each identified ROD is written to an individual sequence file (in the same format as the query). 102 | 103 | ## Run environment 104 | 105 | The Perl script runs under Windows and UNIX flavors, the bash shell script only under UNIX. 
106 | 107 | ## Dependencies (not in the core Perl modules) 108 | 109 | * Legacy blast (tested version blastall 2.2.18) 110 | * BioPerl (tested with version 1.006901) 111 | 112 | ## Authors/contact 113 | 114 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 115 | 116 | David Studholme (original code; D[dot]J[dot]Studholme[at]exeter[dot]ac[dot]uk; University of Exeter) 117 | 118 | ## Citation, installation, and license 119 | 120 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 121 | 122 | ## Changelog 123 | 124 | * v0.4 (13.02.2013) 125 | - included a POD 126 | - options with Getopt::Long 127 | - results directory for output files 128 | - include accession number column for multi-sequence files in 'rod_summary.txt' 129 | - include locus_tags (or alternatively gene, product, note ...) 
in 'rod_summary.txt' 130 | - feature positions according to leading or lagging strand 131 | - indicate if a primary feature overlaps ROD boundaries 132 | - output each ROD in the query RichSeq format with BioPerl's Bio::SeqUtils 133 | * v0.3 (23.11.2011) 134 | - status messages with autoflush 135 | - BRIG output file 136 | - extended primary tag output for RODs (in addition to CDS): tRNA, rRNA, tmRNA, ncRNA, misc_RNA, repeat_region, misc_binding, and mobile_element 137 | * v0.1 (07.11.2011) 138 | -------------------------------------------------------------------------------- /rod_finder/blast_rod_finder_legacy.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | echo "###Running legacy BLASTN with subject '$1' and query '$2'" 3 | formatdb -p F -i "$1" -n ROD 4 | blastall -p blastn -d ROD -i "$2" -o blastn.out -e 2e-11 -F F 5 | echo "###Running blast_rod_finder.pl with query '$3' and minimum ROD size '$4'" 6 | perl blast_rod_finder.pl -q "$3" -r blastn.out -m "$4" -------------------------------------------------------------------------------- /sam_insert-size/README.md: -------------------------------------------------------------------------------- 1 | sam_insert-size 2 | =============== 3 | 4 | `sam_insert-size.pl` is a script to calculate insert size and read length statistics for paired-end reads in SAM/BAM format. 
5 | 6 | * [Synopsis](#synopsis) 7 | * [Description](#description) 8 | * [Usage](#usage) 9 | * [Options](#options) 10 | * [Mandatory options](#mandatory-options) 11 | * [Optional options](#optional-options) 12 | * [Output](#output) 13 | * [Run environment](#run-environment) 14 | * [Dependencies](#dependencies) 15 | * [Author - contact](#author---contact) 16 | * [Acknowledgements](#acknowledgements) 17 | * [Citation, installation, and license](#citation-installation-and-license) 18 | * [Changelog](#changelog) 19 | 20 | ## Synopsis 21 | 22 | perl sam_insert-size.pl -i file.sam 23 | 24 | **or** 25 | 26 | samtools view -h file.bam | perl sam_insert-size.pl -i - 27 | 28 | ## Description 29 | 30 | Calculate insert size and read length statistics for paired-end reads 31 | in SAM/BAM alignment format. The program gives the arithmetic mean, 32 | median, and standard deviation (stdev) among other statistical values. 33 | 34 | Insert size is defined as the total length of the original fragment 35 | put into sequencing, i.e. the sequenced DNA fragment between the 36 | adaptors. The 16-bit FLAG of the SAM/BAM file is used to filter reads 37 | (see the [SAM specifications](http://samtools.sourceforge.net/SAM1.pdf)). 38 | 39 | **Read length** statistics are calculated for all mapped reads 40 | (irrespective of their pairing). 41 | 42 | **Insert size** statistics are calculated only for **paired reads**. 43 | Typically, the insert size is perturbed by artifacts, like chimeras, 44 | structural re-arrangements or alignment errors, which result in a 45 | very high maximum insert size measure. As a consequence the mean and 46 | stdev can be strongly misleading regarding the real distribution. To 47 | avoid this, two methods are implemented that first trim the insert 48 | size distribution to a 'core' to calculate the respective statistics. 
49 | Additionally, secondary alignments for multiple mapping reads and 50 | supplementary alignments for chimeric reads, as well as insert sizes 51 | of zero are not considered (option **-min_ins_cutoff** is set to 52 | **one** by default). 53 | 54 | The **-a|-align** method includes only proper/concordant paired reads 55 | in the statistical calculations (as determined by the mapper and the 56 | options for insert size minimum and maximum used for mapping). This 57 | is the **default** method. 58 | 59 | The **-p|-percentile** method first calculates insert size statistics 60 | for all read pairs, where the read and the mate are mapped ('raw 61 | data'). Subsequently, the 10th and the 90th percentile are discarded 62 | to calculate the 10% truncated mean and stdev. Discarding the lowest 63 | and highest 10% of insert sizes gives the advantage of robustness 64 | (insensitivity to outliers) and higher efficiency in heavy-tailed 65 | distributions. 66 | 67 | Alternative tools, which are a lot faster, are [`CollectInsertSizeMetrics`](https://broadinstitute.github.io/picard/command-line-overview.html#CollectInsertSizeMetrics) 68 | from [Picard Tools](https://broadinstitute.github.io/picard/) and 69 | [`sam-stats`](https://code.google.com/p/ea-utils/wiki/SamStats) from 70 | [ea-utils](https://code.google.com/p/ea-utils/). 71 | 72 | ## Usage 73 | 74 | samtools view -h file.bam | perl sam_insert-size.pl -i - -p -d -f -min 50 -max 500 -n 2000000 -xlim_i 350 -xlim_r 200 75 | 76 | ## Options 77 | 78 | ### Mandatory options 79 | 80 | - -i, -input 81 | 82 | Input SAM file or piped *STDIN* (-) from a BAM file e.g. with [`samtools view`](http://www.htslib.org/doc/samtools-1.1.html) from [Samtools](http://www.htslib.org/) 83 | 84 | - -a, -align 85 | 86 | **Default method:** Align method to calculate insert size statistics, includes only reads which are mapped in a proper/concordant pair (as determined by the mapper). Excludes option **-p**. 
87 | 88 | **or** 89 | 90 | - -p, -percentile 91 | 92 | Percentile method to calculate insert size statistics, includes only read pairs with an insert size within the 10th and the 90th percentile range of all mapped read pairs. However, the frequency distribution as well as the histogram will be plotted with the 'raw' insert size data before percentile filtering. Excludes option **-a**. 93 | 94 | ### Optional options 95 | 96 | - -h, -help 97 | 98 | Help (perldoc POD) 99 | 100 | - -d, -distro 101 | 102 | Create distribution histograms for the insert sizes and read lengths with [R](http://www.r-project.org/). The calculated median and mean (that are printed to *STDOUT*) are plotted as vertical lines into the histograms. Use it to control the correctness of the statistical calculations. 103 | 104 | - -f, -frequencies 105 | 106 | Print the frequencies of the insert sizes and read lengths to tab-delimited files 'ins_frequency.txt' and 'read_frequency.txt', respectively. 107 | 108 | - -max, -max_ins_cutoff 109 | 110 | Set a maximal insert size cutoff, all insert sizes above this cutoff will be discarded (doesn't affect read length). With **-min** and **-max** you can basically run both methods, by first running the script with **-p** and then using the 10th and 90th percentile of the 'raw data' as **-min** and **-max** for option **-a**. 111 | 112 | - -min, -min_ins_cutoff 113 | 114 | Set a minimal insert size cutoff [default = 1] 115 | 116 | - -n, -num_read 117 | 118 | Number of reads to sample for the calculations from the start of the SAM/BAM file. Significant statistics can usually be calculated from a fraction of the total SAM/BAM alignment file. 119 | 120 | - -xlim_i, -xlim_ins 121 | 122 | Set an upper limit for the x-axis of the **'R' insert size** histogram, overriding automatic truncation of the histogram tail. The default cutoff is one and a half times the third quartile Q3 (75th percentile) value. 
The minimal cutoff is set to the lowest insert size automatically. Forces option **-d**. 123 | 124 | - -xlim_r, -xlim_read 125 | 126 | Set an upper limit for the x-axis of the optional **'R' read length** histogram. Default value is as in **-xlim_i**. Forces option **-d**. 127 | 128 | - -v, -version 129 | 130 | Print version number to *STDERR* 131 | 132 | ## Output 133 | 134 | - *STDOUT* 135 | 136 | Calculated stats are printed to *STDOUT* 137 | 138 | - ./results 139 | 140 | All **optional** output files are stored in this results folder 141 | 142 | - (./results/ins_frequency.txt) 143 | 144 | Frequencies of insert size 'raw data', tab-delimited 145 | 146 | - (./results/ins_histo.pdf) 147 | 148 | Distribution histogram for the insert size 'raw data' 149 | 150 | - (./results/read_frequency.txt) 151 | 152 | Frequencies of read lengths, tab-delimited 153 | 154 | - (./results/read_histo.pdf) 155 | 156 | Distribution histogram for the read lengths. Not informative if there's no variation in the read lengths. 157 | 158 | ## Run environment 159 | 160 | The Perl script runs under Windows and UNIX flavors. 
161 | 162 | ## Dependencies 163 | 164 | - `Statistics::Descriptive` 165 | 166 | Perl module to calculate descriptive statistics, if not installed already get it from [CPAN](http://www.cpan.org/) 167 | 168 | - Statistical computing language [R](http://www.r-project.org/) 169 | 170 | `Rscript` is needed to plot the histograms with option **-d** 171 | 172 | ## Author - contact 173 | 174 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 175 | 176 | ## Acknowledgements 177 | 178 | References/thanks go to: 179 | 180 | - Tobias Rausch's online courses/workshops (EMBL Heidelberg) on the introduction to SAM files and flags (http://www.embl.de/~rausch/) 181 | 182 | - The CBS NGS Analysis course for the percentile filtering idea (http://www.cbs.dtu.dk/courses/27626/programme.php) 183 | 184 | ## Citation, installation, and license 185 | 186 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 
187 | 188 | ## Changelog 189 | 190 | - v0.2 (29.10.2014) 191 | - Fixed bug for options '-min_ins_size' and '-max_ins_size' 192 | - warn if result files already exist 193 | - simplify prints to R script with Perl function 'select' 194 | - minor Perl syntax changes so all Perl scripts conform to the same syntax 195 | - minor changes to POD 196 | - finally included README.md 197 | - v0.1 (27.11.2013) 198 | -------------------------------------------------------------------------------- /sample_fastx-txt/README.md: -------------------------------------------------------------------------------- 1 | sample_fastx-txt 2 | ================ 3 | 4 | `sample_fastx-txt.pl` is a script to randomly subsample FASTA, FASTQ, or TEXT files. 5 | 6 | * [Synopsis](#synopsis) 7 | * [Description](#description) 8 | * [Usage](#usage) 9 | * [Subsample paired-end read data and retain pairing](#subsample-paired-end-read-data-and-retain-pairing) 10 | * [Subsample TEXT file and skip three header lines during subsampling](#subsample-text-file-and-skip-three-header-lines-during-subsampling) 11 | * [Subsample TEXT file and remove two header lines for final output](#subsample-text-file-and-remove-two-header-lines-for-final-output) 12 | * [Options](#options) 13 | * [Mandatory options](#mandatory-options) 14 | * [Optional options](#optional-options) 15 | * [Output](#output) 16 | * [Run environment](#run-environment) 17 | * [Author - contact](#author---contact) 18 | * [Acknowledgements](#acknowledgements) 19 | * [Citation, installation, and license](#citation-installation-and-license) 20 | * [Changelog](#changelog) 21 | 22 | ## Synopsis 23 | 24 | perl sample_fastx-txt.pl -i infile.fasta -n 100 > subsample.fasta 25 | 26 | **or** 27 | 28 | zcat reads.fastq.gz | perl sample_fastx-txt.pl -i - -n 100000 > subsample.fastq 29 | 30 | ## Description 31 | 32 | Randomly subsample FASTA, FASTQ, and TEXT files. 33 | 34 | Empty lines in the input files will be skipped and not included in 35 | sampling. 
Format TEXT assumes one entry per single line. FASTQ 36 | format assumes **four** lines per read, if this is not the case run 37 | the FASTQ file through [`fastx_fix.pl`](/fastx_fix) or use Heng 38 | Li's [`seqtk seq`](https://github.com/lh3/seqtk): 39 | 40 | seqtk seq -l 0 infile.fq > outfile.fq 41 | 42 | The file type is detected automatically. However, if automatic 43 | detection fails, TEXT format is assumed. As a last resort, you can 44 | set the file type manually with option **-f**. 45 | 46 | This script is an implementation of the *reservoir sampling* 47 | algorithm (or *Algorithm R (3.4.2)*) described in Donald Knuth's 48 | [*The Art of Computer Programming*](https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming). 49 | It is designed to randomly pull a small sample size from a 50 | (potential) huge input file of indeterminate size, which 51 | (potentially) doesn't fit into main memory. The beauty of reservoir 52 | sampling is that it requires only one pass through the input file. 53 | The memory consumption of the algorithm is proportional to the 54 | sample size, thus large sample sizes will consume lots of memory as 55 | the whole sample will be held in memory. On the other hand, the size 56 | of the initial file is irrelevant. 57 | 58 | An alternative tool, which is a lot faster, is `seqtk sample` from 59 | the [*seqtk toolkit*](https://github.com/lh3/seqtk>). 
60 | 61 | ## Usage 62 | 63 | ### Subsample paired-end read data and retain pairing 64 | 65 | perl sample_fastx-txt.pl -i read-pair_1.fq -n 1000000 -s 123 > sub-pair_1.fq 66 | 67 | perl sample_fastx-txt.pl -i read-pair_2.fq -n 1000000 -s 123 > sub-pair_2.fq 68 | 69 | ### Subsample TEXT file and skip three header lines during subsampling 70 | 71 | perl sample_fastx-txt.pl -i infile.txt -n 100 -f text -t 3 > subsample.txt 72 | 73 | ### Subsample TEXT file and remove two header lines for final output 74 | 75 | perl sample_fastx-txt.pl -i infile.txt -n 350 -t 2 | sed '1,2d' > sub_no-header.txt 76 | 77 | ## Options 78 | 79 | ### Mandatory options 80 | 81 | - -i, -input 82 | 83 | Input FASTA/Q or TEXT file, or piped *STDIN* (-) 84 | 85 | - -n, -num 86 | 87 | Number of entries/reads to subsample 88 | 89 | ### Optional options 90 | 91 | - -h, -help 92 | 93 | Help (perldoc POD) 94 | 95 | - -f, -file_type 96 | 97 | Set the file type manually [fasta|fastq|text] 98 | 99 | - -s, -seed 100 | 101 | Set starting random seed. For **paired-end** read data use the **same random seed** for both FASTQ files with option **-s** to retain pairing (see [Subsample paired-end read data and retain pairing](#subsample-paired-end-read-data-and-retain-pairing) above). 102 | 103 | - -t, -title_skip 104 | 105 | Skip the specified number of header lines in TEXT files before subsampling and append them again afterwards. If you want to get rid of the header as well, pipe the subsample output to [`sed`](https://www.gnu.org/software/sed/manual/sed.html) (see `man sed` and [Subsample TEXT file and remove two header lines for final output](#subsample-text-file-and-remove-two-header-lines-for-final-output) above). 106 | 107 | - -v, -version 108 | 109 | Print version number to *STDERR* 110 | 111 | ## Output 112 | 113 | - *STDOUT* 114 | 115 | The subsample of the input file is printed to *STDOUT*. Redirect or pipe into another tool as needed. 
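After subsampling paired-end data with the same seed, it can be reassuring to confirm that both output files still list the same reads in the same order. The following is a minimal sketch of such a check, assuming strict four-line FASTQ records and read names that differ at most by a trailing `/1` or `/2`; the helper names `fastq_ids` and `pairs_in_sync` are made up for this example and are not part of the script.

```shell
# Hypothetical helpers, NOT part of this repository.

# Print the read ID of every 4-line FASTQ record: header without the
# leading '@', without the description, and without a trailing /1 or /2.
fastq_ids() {
    awk 'NR % 4 == 1 { id = substr($1, 2); sub(/\/[12]$/, "", id); print id }' "$1"
}

# Succeed (exit status 0) only if both files list the same read IDs
# in the same order.
pairs_in_sync() {
    [ "$(fastq_ids "$1")" = "$(fastq_ids "$2")" ]
}
```

With the file names from the usage example above, `pairs_in_sync sub-pair_1.fq sub-pair_2.fq && echo "pairing retained"` prints the message only if the pairing survived. The ID lists are held in memory, so this is meant for subsamples rather than full read sets.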
116 | 117 | ## Run environment 118 | 119 | The Perl script runs under Windows and UNIX flavors. 120 | 121 | ## Author - contact 122 | 123 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 124 | 125 | ## Acknowledgements 126 | 127 | I got the idea for reservoir sampling from Sean Eddy's keynote at 128 | the Janelia meeting on [*High Throughput Sequencing for Neuroscience*](http://cryptogenomicon.wordpress.com/2014/11/01/high-throughput-sequencing-for-neuroscience/) 129 | which he posted in his blog 130 | [*Cryptogenomicon*](http://cryptogenomicon.wordpress.com/). The [*Wikipedia article*](https://en.wikipedia.org/wiki/Reservoir_sampling) and the 131 | [*PerlMonks*](http://www.perlmonks.org/index.pl?node_id=177092) implementation helped a lot, as well. 132 | 133 | ## Citation, installation, and license 134 | 135 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 136 | 137 | ## Changelog 138 | 139 | - v0.1 (18.11.2014) 140 | -------------------------------------------------------------------------------- /seq_format-converter/README.md: -------------------------------------------------------------------------------- 1 | seq_format-converter 2 | ==================== 3 | 4 | A script to convert a sequence file to another format. 5 | 6 | ## Synopsis 7 | 8 | perl seq_format-converter.pl -i seq_file.gbk -f gbk -o embl 9 | 10 | ## Description 11 | 12 | This script converts a (multi-)sequence file of a specific format to a differently formatted output file. The most common sequence formats are: **embl**, **fasta**, and **gbk** (genbank). 
13 | 14 | Since sequence formats change from time to time, BioPerl is not always up to date. For all available BioPerl sequence formats see: http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats. **Warning**: The *bioperl-ext* package and the *io_lib* library from the **Staden** package (http://staden.sourceforge.net/) need to be installed in order to read the scf, abi, alf, pln, exp, ctf, ztr formats. 15 | 16 | ## Usage 17 | 18 | perl seq_format-converter.pl -i seq_file -f in_format -o out_format 19 | 20 | ### UNIX loop to reformat all sequence files in the current working directory 21 | 22 | for i in *.[embl|gbk]; do perl seq_format-converter.pl -i "$i" -f [embl|gbk] -o [embl|fasta|gbk]; done 23 | 24 | ## Options for *seq_format-converter.pl* 25 | 26 | ### Mandatory options 27 | 28 | * -i, -input 29 | 30 | Input sequence file 31 | 32 | * -f, -format 33 | 34 | Input sequence format (e.g. 'embl' or 'gbk') 35 | 36 | * -o, -out_format 37 | 38 | Output sequence format (e.g. 'embl', 'fasta' or 'gbk') 39 | 40 | ### Optional options 41 | 42 | * -h, -help 43 | 44 | Print usage 45 | 46 | * -v, -version 47 | 48 | Print version number 49 | 50 | ## Output 51 | 52 | * seq_file.[embl|fasta|gbk] 53 | 54 | Output sequence file in the specified format 55 | 56 | ## Run environment 57 | 58 | The Perl script runs under Windows and UNIX flavors. 
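The output file name listed under Output is derived from the input name: the script swaps the last file extension for the requested output format. The sketch below mimics that substitution in shell so the resulting name can be previewed before a batch run. The function name `converted_name` is made up for this illustration, and the `sed` expression only approximates the Perl substitution used in the script.

```shell
# Hypothetical helper, NOT part of this repository: preview the output
# file name for a given input file and output format.
# Usage: converted_name <infile> <out_format>
converted_name() {
    # Replace the final ".extension" (letters, digits, underscore) with
    # ".<out_format>"; names without an extension are left unchanged.
    printf '%s\n' "$1" | sed -E "s/\.[A-Za-z0-9_]+$/.$2/"
}
```

Note the last case in the comment: for an input file without any extension the name stays the same, so it seems safest to give input files an extension before converting them.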
59 | 60 | ## Dependencies (not in the core Perl modules) 61 | 62 | * BioPerl (tested with version 1.006901) 63 | 64 | ## Author/contact 65 | 66 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 67 | 68 | ## Citation, installation, and license 69 | 70 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 71 | 72 | ## Changelog 73 | 74 | * v0.2 (03.02.2014) 75 | - allow short 'gbk' format instead of 'genbank' 76 | - also short 'gbk' file-extension for output file 77 | - included 'use autodie' 78 | - usage as HERE document 79 | - options with Getopt::Long 80 | - version switch 81 | * v0.1 (10.11.2011) 82 | -------------------------------------------------------------------------------- /seq_format-converter/seq_format-converter.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | 3 | use warnings; 4 | use strict; 5 | use autodie; 6 | use Getopt::Long; 7 | use Bio::SeqIO; # bioperl module to handle sequence input/output 8 | 9 | my $usage = << "USAGE"; 10 | 11 | ################################################################## 12 | # $0 -i seq_file -f in_format -o out_format # 13 | # # 14 | # Converts a (multi-)sequence file of a specific format to a # 15 | # differently formatted output file, with the help of BioPerl # 16 | # (www.bioperl.org). # 17 | # Formats are e.g. embl, fasta, gbk. 
# 18 | # # 19 | # Mandatory options: # 20 | # -i, -input input sequence file # 21 | # -f, -format input format # 22 | # -o, -out_format output format # 23 | # Optional options: # 24 | # -h, -help print usage # 25 | # -v, -version print version number # 26 | # # 27 | # Adjust unix loop to run the script with all files in the # 28 | # current working directory, e.g.: # 29 | # for i in *.gbk; do perl seq_format-converter.pl -i \$i -f gbk \\ # 30 | # -o embl; done # 31 | # # 32 | # version 0.2, update: 03-02-2014 A Leimbach # 33 | # 10-11-2011 aleimba[at]gmx[dot]de # 34 | ################################################################## 35 | 36 | USAGE 37 | ; 38 | 39 | ### Get options with Getopt::Long 40 | my $infile; # input sequence file 41 | my $in_format; # input sequence file format 42 | my $out_format; # desired output file format 43 | my $version = 0.2; 44 | my ($opt_version, $opt_help); 45 | GetOptions ('input=s' => \$infile, 46 | 'format=s' => \$in_format, 47 | 'out_format=s' => \$out_format, 48 | 'version' => \$opt_version, 49 | 'help|?'
=> \$opt_help); 50 | 51 | 52 | ### Print usage 53 | if ($opt_help) { 54 | die $usage; 55 | } elsif ($opt_version) { 56 | die "$0 $version\n"; 57 | } elsif (!$infile || !$in_format || !$out_format) { 58 | die $usage, "### Fatal error: Option(s) or argument(s) for \'-i\', \'-f\', \'-o\' are missing!\n\n"; 59 | } 60 | 61 | 62 | ### Allow shorter format string for 'genbank' 63 | $in_format = 'genbank' if ($in_format =~ /gbk/i); 64 | my $outfile = $infile; 65 | $outfile =~ s/\.\w+$/\.$out_format/; # remove file extension from infile and append out_format 66 | $out_format = 'genbank' if ($out_format =~ /gbk/i); 67 | 68 | 69 | ### SeqIO objects for input and output 70 | my $seq_in = Bio::SeqIO->new(-file => "<$infile", -format => $in_format); # a Bio::SeqIO object 71 | my $seq_out = Bio::SeqIO->new(-file => ">$outfile", -format => $out_format); # a Bio::SeqIO object 72 | 73 | 74 | ### Write sequence to different format 75 | while (my $seqobj = $seq_in->next_seq) { # a Bio::Seq object 76 | $seq_out->write_seq($seqobj); 77 | } 78 | print "\n\tCreated new file $outfile!\n\n"; 79 | 80 | exit; 81 | -------------------------------------------------------------------------------- /tbl2tab/README.md: -------------------------------------------------------------------------------- 1 | tbl2tab 2 | ======= 3 | 4 | `tbl2tab.pl` is a script to convert tbl to tab-separated format and back. 
5 | 6 | * [Synopsis](#synopsis) 7 | * [Description](#description) 8 | * [Usage](#usage) 9 | * [Options](#options) 10 | * [Mandatory options](#mandatory-options) 11 | * [Optional options](#optional-options) 12 | * [Output](#output) 13 | * [Run environment](#run-environment) 14 | * [Author - contact](#author---contact) 15 | * [Citation, installation, and license](#citation-installation-and-license) 16 | * [Changelog](#changelog) 17 | 18 | ## Synopsis 19 | 20 | perl tbl2tab.pl -m tbl2tab -i feature_table.tbl -s -l locus_prefix 21 | 22 | **or** 23 | 24 | perl tbl2tab.pl -m tab2tbl -i feature_table.tab -g -l locus_prefix -p "gnl|dbname|" 25 | 26 | ## Description 27 | 28 | NCBI's feature table (**tbl**) format is needed for the submission of genomic data to GenBank with the NCBI tools [Sequin](http://www.ncbi.nlm.nih.gov/Sequin/) or [tbl2asn](http://www.ncbi.nlm.nih.gov/genbank/tbl2asn2). tbl files can be created with automatic annotation systems like [Prokka](http://www.vicbioinformatics.com/software.prokka.shtml). `tbl2tab.pl` can convert a tbl file to a tab-separated format (tab) and back to the tbl format. The tab-delimited format is useful to manipulate the data more comfortably in a spreadsheet software (e.g. LibreOffice or MS Excel). For a conversion back to tbl format save the file in the spreadsheet software as a tab-delimited text file. The script is intended for microbial genomes, but might also be useful for eukaryotes. 29 | 30 | Regular expressions are applied in mode '**tbl2tab**' to correct gene names and words in '/product' values to lowercase initials (with the exception of 'Rossman' and 'Willebrand'). The resulting tab file can then be used to check for possible errors. 31 | 32 | The first four header columns of the **tab** format are mandatory, 'seq_id' for the SeqID, and for each primary tag/feature (e.g. CDS, RNAs, repeat_region etc.), 'start', 'stop', and 'primary_tag'. These mandatory columns have to be filled in every row in the tab file. 
All the following columns will be included as tags/qualifiers (e.g. '/locus_tag', '/product', '/EC_number', '/note' etc.) in the conversion to the tbl file if a value is present. 33 | 34 | There are three special cases: 35 | 36 | **First**, '/pseudo' will be included as a tag if *any* value (the script uses 'T' for true) is present in the **tab** format. If a primary tag is indicated as pseudo both the primary tag and the accessory 'gene' primary tag (for CDS/RNA features with option **-g**) will include a '/pseudo' qualifier in the resulting **tbl** file. *Pseudo-genes* are indicated by 'pseudo' in the 'primary_tag' column, thus the 'pseudo' column is ignored in these cases. 37 | 38 | **Second**, tag '/gene_desc' is reserved for the 'product' values of pseudo-genes, thus a 'gene_desc' column in a tab file will be ignored in the conversion to tbl. 39 | 40 | **Third**, column 'protein_id' in a tab file will also be ignored in the conversion. '/protein_id' values are created from option **-p** and the locus_tag for each CDS primary feature. 41 | 42 | Furthermore, with option **-s** G2L-style spreadsheet formulas ([Goettingen Genomics Laboratory](http://appmibio.uni-goettingen.de/)) can be included with additional columns, 'spreadsheet_locus_tag', 'position', 'distance', 'gene_number', and 'contig_order'. These columns will not be included in a conversion to the tbl format. Thus, if you want to include e.g. the locus_tags from the formula in column 'spreadsheet_locus_tag' in the resulting tbl file copy the *values* to the column 'locus_tag'! 43 | 44 | To illustrate the process two example files are included in the repository, 'example.tbl' and 'example2.tab', which are interconvertible (see "[USAGE](#usage)" below). 45 | 46 | **Warning**, be aware of possible errors introduced by automatic format conversions using a spreadsheet software like MS Excel, see e.g. Zeeberg *et al.* 2004 (http://www.ncbi.nlm.nih.gov/pubmed/15214961). 
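Since the mandatory columns have to survive spreadsheet editing intact, a tab file can be sanity-checked before it is converted back. The following is a minimal sketch, not part of `tbl2tab.pl`; the helper name `check_tab` is hypothetical, and it only tests the rules stated above (the first four header columns and non-empty values in every row).

```shell
# check_tab: hypothetical helper that reports violations of the mandatory
# column rules in a tab file; exit status is non-zero if any were found.
check_tab() {
    awk -F '\t' '
        NR == 1 {
            if ($1 != "seq_id" || $2 != "start" || $3 != "stop" || $4 != "primary_tag") {
                print "header: first four columns must be seq_id, start, stop, primary_tag"
                bad = 1
            }
            next
        }
        /^$/ { next }    # ignore completely empty lines
        $1 == "" || $2 == "" || $3 == "" || $4 == "" {
            printf "row %d: a mandatory column is empty\n", NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

For example, `check_tab example2.tab && perl tbl2tab.pl -m tab2tbl -i example2.tab -g -l EPE` only starts the conversion if the check passes.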
47 | 48 | For more information regarding the feature table and the submission process see NCBI's [prokaryotic annotation guide](http://www.ncbi.nlm.nih.gov/genbank/genomesubmit) and the [bacterial genome submission guide](http://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation). 49 | 50 | ## Usage 51 | 52 | ### Conversion from tbl to tab format 53 | 54 | perl tbl2tab.pl -m tbl2tab -i example.tbl -s -l EPE 55 | 56 | ### Conversion from tab to tbl format 57 | 58 | perl tbl2tab.pl -m tab2tbl -i example2.tab -g -l EPE 59 | 60 | ## Options 61 | 62 | ### Mandatory options 63 | 64 | * -m, -mode 65 | 66 | Conversion mode, either 'tbl2tab' or 'tab2tbl' [default = 'tbl2tab'] 67 | 68 | * -i, -input 69 | 70 | Input tbl or tab file to be converted to the other format 71 | 72 | ### Optional options 73 | 74 | * -h, -help 75 | 76 | Help (perldoc POD) 77 | 78 | * -v, -version 79 | 80 | Print version number to *STDERR* 81 | 82 | #### Mode *tbl2tab* 83 | 84 | * -l, -locus_prefix 85 | 86 | Only in combination with option **-s** and there mandatory to include the locus_tag prefix in the formula for column 'spreadsheet_locus_tag' 87 | 88 | * -c, -concat 89 | 90 | Concatenate values of identical tags within one primary tag with '~' (e.g. several '/EC_number' or '/inference' tags) 91 | 92 | * -e, -empty 93 | 94 | String used for primary features without value for a tag [default = ''] 95 | 96 | * -s, -spreadsheet 97 | 98 | Include formulas for spreadsheet editing 99 | 100 | * -f, -formula_lang 101 | 102 | Syntax language of the spreadsheet formulas, either 'English' or 'German'. If you're still encountering problems with the formulas set the decimal and thousands separator manually in the options of the spreadsheet software (instead of using the operating system separators). 
[default = 'e'] 103 | 104 | #### Mode *tab2tbl* 105 | 106 | * -l, -locus_prefix 107 | 108 | Prefix to the SeqID if not present already in the SeqID 109 | 110 | * -g, -gene 111 | 112 | Include accessory 'gene' primary tags (with '/gene', '/locus_tag' and possibly '/pseudo' tags) for 'CDS/RNA' primary tags; NCBI standard 113 | 114 | * -t, -tags_full 115 | 116 | Only in combination with option **-g**, include '/gene' and '/locus_tag' tags additionally in primary tag, not only in accessory 'gene' primary tag 117 | 118 | * -p, -protein_id_prefix 119 | 120 | Prefix for '/protein_id' tags; don't forget the double quotes for the string, otherwise the shell will interpret the '|' characters as pipes [default = 'gnl|goetting|'] 121 | 122 | ## Output 123 | 124 | * *.tab|tbl 125 | 126 | Result file in the opposite format 127 | 128 | * (hypo_putative_genes.txt) 129 | 130 | Created in mode **tab2tbl**, indicates if CDSs are annotated as 131 | 'hypothetical/putative/predicted protein' but still have a gene name 132 | 133 | ## Run environment 134 | 135 | The Perl script runs under Windows and UNIX flavors. 136 | 137 | ## Author - contact 138 | 139 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 140 | 141 | ## Citation, installation, and license 142 | 143 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
144 | 145 | ## Changelog 146 | 147 | * v0.2 (29.10.2014) 148 | * fixed bug: message which file was created was mixed up 149 | * *hypo_putative_genes.txt* includes now also 'predicted protein' annotations 150 | * additions and syntax changes to POD and README.md 151 | * v0.1 (24.06.2014) 152 | -------------------------------------------------------------------------------- /tbl2tab/example.tbl: -------------------------------------------------------------------------------- 1 | >Feature EPE_c 2 | 191 310 gene 3 | locus_tag EPE_c00010 4 | 191 310 misc_RNA 5 | inference COORDINATES:profile:Infernal:1.1 6 | product Thr_leader 7 | 336 2798 gene 8 | locus_tag EPE_c00020 9 | gene thrA 10 | 336 2798 CDS 11 | protein_id gnl|goetting|EPE_c00020 12 | EC_number 1.1.1.3 13 | EC_number 2.7.2.4 14 | inference ab initio prediction:Prodigal:2.60 15 | inference similar to AA sequence:K-12_MG1655:NP_414543.1 16 | product bifunctional aspartokinase/homoserine dehydrogenase 1 17 | 3168 4697 gene 18 | locus_tag EPE_c00030 19 | gene rrsH 20 | 3168 4697 rRNA 21 | inference COORDINATES:profile:RNAmmer:1.2 22 | product 16S ribosomal RNA 23 | 4771 4847 gene 24 | locus_tag EPE_c00040 25 | pseudo 26 | 4771 4847 tRNA 27 | inference COORDINATES:profile:Aragorn:1.2 28 | product tRNA-Pseudo-xxx 29 | pseudo 30 | 7154 5010 gene 31 | locus_tag EPE_c00050 32 | gene fadJ 33 | 7154 5010 CDS 34 | protein_id gnl|goetting|EPE_c00050 35 | EC_number 1.1.1.35 36 | EC_number 4.2.1.17 37 | EC_number 5.1.2.3 38 | EC_number 5.3.3.8 39 | inference ab initio prediction:Prodigal:2.60 40 | inference similar to AA sequence:K-12_MG1655:NP_416843.1 41 | product fused enoyl-CoA hydratase and epimerase and isomerase/3-hydroxyacyl-CoA dehydrogenase 42 | 7068 7430 gene 43 | locus_tag EPE_c00060 44 | gene ssrA 45 | 7068 7430 tmRNA 46 | inference COORDINATES:profile:Aragorn:1.2 47 | product transfer-messenger RNA, SsrA 48 | 7513 8883 repeat_region 49 | rpt_family CRISPR 50 | score 23 51 | 8979 9275 gene 52 | locus_tag 
EPE_c00080 53 | 8979 9275 CDS 54 | protein_id gnl|goetting|EPE_c00080 55 | inference ab initio prediction:Prodigal:2.60 56 | inference similar to AA sequence:K-12_MG1655:NP_414546.1 57 | note DUF2502 family putative periplasmic protein 58 | product hypothetical protein 59 | >Feature EPE_89p 60 | 61 369 gene 61 | pseudo 62 | locus_tag EPE_89p00010 63 | gene ydhA 64 | gene_desc hypothetical protein fragment 65 | -------------------------------------------------------------------------------- /tbl2tab/example2.tab: -------------------------------------------------------------------------------- 1 | seq_id start stop primary_tag locus_tag EC_number EC_number EC_number EC_number gene inference inference note product protein_id pseudo rpt_family score spreadsheet_locus_tag position distance gene_number contig_order 2 | c 191 310 misc_RNA EPE_c00010 COORDINATES:profile:Infernal:1.1 Thr_leader EPE_c00010 191 26 1 1 3 | c 336 2798 CDS EPE_c00020 1.1.1.3 2.7.2.4 thrA ab initio prediction:Prodigal:2.60 similar to AA sequence:K-12_MG1655:NP_414543.1 bifunctional aspartokinase/homoserine dehydrogenase 1 gnl|SmithUCSD|EPE_c00020 EPE_c00020 336 370 2 1 4 | c 3168 4697 rRNA EPE_c00030 rrsH COORDINATES:profile:RNAmmer:1.2 16S ribosomal RNA EPE_c00030 3168 74 3 1 5 | c 4771 4847 tRNA EPE_c00040 COORDINATES:profile:Aragorn:1.2 tRNA-Pseudo-xxx T EPE_c00040 4771 163 4 1 6 | c 7154 5010 CDS EPE_c00050 1.1.1.35 4.2.1.17 5.1.2.3 5.3.3.8 fadJ ab initio prediction:Prodigal:2.60 similar to AA sequence:K-12_MG1655:NP_416843.1 fused enoyl-CoA hydratase and epimerase and isomerase/3-hydroxyacyl-CoA dehydrogenase gnl|SmithUCSD|EPE_c00050 EPE_c00050 5010 -86 5 1 7 | c 7068 7430 tmRNA EPE_c00060 ssrA COORDINATES:profile:Aragorn:1.2 "transfer-messenger RNA, SsrA" EPE_c00060 7068 83 6 1 8 | c 7513 8883 repeat_region CRISPR 23 7513 96 7 1 9 | c 8979 9275 CDS EPE_c00080 ab initio prediction:Prodigal:2.60 similar to AA sequence:K-12_MG1655:NP_414546.1 DUF2502 family putative periplasmic protein 
hypothetical protein gnl|SmithUCSD|EPE_00070 EPE_c00080 8979 -9214 8 1 10 | 89p 61 369 pseudo EPE_89p00010 ydhA hypothetical protein fragment EPE_89p00010 61 -369 1 2 11 | -------------------------------------------------------------------------------- /trunc_seq/README.md: -------------------------------------------------------------------------------- 1 | trunc_seq 2 | ========= 3 | 4 | `trunc_seq.pl` is a script to truncate sequence files. 5 | 6 | * [Synopsis](#synopsis) 7 | * [Description](#description) 8 | * [Usage](#usage) 9 | * [Options](#options) 10 | * [Output](#output) 11 | * [Run environment](#run-environment) 12 | * [Dependencies](#dependencies) 13 | * [Author - contact](#author---contact) 14 | * [Citation, installation, and license](#citation-installation-and-license) 15 | * [Changelog](#changelog) 16 | 17 | ## Synopsis 18 | 19 | perl trunc_seq.pl 20 3500 seq-file.embl > seq-file_trunc_20_3500.embl 20 | 21 | **or** 22 | 23 | perl trunc_seq.pl file_of_filenames_and_coords.tsv 24 | 25 | ## Description 26 | 27 | This script truncates sequence files according to the given 28 | coordinates. The features/annotations in RichSeq files (e.g. EMBL or 29 | GENBANK format) will also be adapted accordingly. Use option **-o** to 30 | specify a different output sequence format. Input can be given directly 31 | as a file and truncation coordinates to the script, with the start 32 | position as the first argument, stop as the second and (the path to) 33 | the sequence file as the third. In this case the truncated sequence 34 | entry is printed to *STDOUT*. Input sequence files should contain only 35 | one sequence entry, if a multi-sequence file is used as input only the 36 | **first** sequence entry is truncated. 37 | 38 | Alternatively, a file of filenames (fof) with respective coordinates 39 | and sequence files in the following **tab-separated** format can be 40 | given to the script (the header is optional): 41 | 42 | \#start stop seq-file
43 | 300 9000 (path/to/)seq-file
44 | 50 1300 (path/to/)seq-file2
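The fof layout shown above can be checked before the script is run. The following is a minimal sketch, not part of `trunc_seq.pl`; the helper name `check_fof` is hypothetical. It mirrors the rules the script applies to each line: empty lines and lines starting with `#` are skipped, and every other line needs a numeric start, a numeric stop, and a filename without whitespace, separated by tabs.

```shell
# check_fof: hypothetical helper that reports fof lines the script would
# reject; exit status is non-zero if any were found.
check_fof() {
    awk -F '\t' '
        /^[[:space:]]*$/ || /^#/ { next }    # skip empty and comment lines
        NF != 3 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ || $3 !~ /^[^[:space:]]+$/ {
            printf "line %d: expected start<TAB>stop<TAB>seq-file\n", NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

For example, `check_fof file_of_filenames_and_coords.tsv` prints the offending line numbers, which is mainly useful when the file was exported from a spreadsheet and may contain spaces instead of tabs.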
45 | 46 | With a fof the resulting truncated sequence files are printed into a 47 | results directory. Use option **-r** to specify a different results 48 | directory than the default. 49 | 50 | It is also possible to truncate a RichSeq sequence file loaded into the 51 | [Artemis](http://www.sanger.ac.uk/science/tools/artemis) genome browser 52 | from the Sanger Institute: Select a subsequence and then go to Edit -> 53 | Subsequence (and Features) 54 | 55 | ## Usage 56 | 57 | perl trunc_seq.pl -o gbk 120 30000 seq-file.embl > seq-file_trunc_120_30000.gbk 58 | 59 | **or** 60 | 61 | perl trunc_seq.pl -o fasta 5300 18500 seq-file.gbk | perl revcom_seq.pl -i fasta > seq-file_trunc_revcom.fasta 62 | 63 | **or** 64 | 65 | perl trunc_seq.pl -r path/to/trunc_embl_dir -o embl file_of_filenames_and_coords.tsv 66 | 67 | ## Options 68 | 69 | - **-h**, **-help** 70 | 71 | Help (perldoc POD) 72 | 73 | - **-o**=*str*, **-outformat**=*str* 74 | 75 | Specify different sequence format for the output (files) [fasta, embl, or gbk] 76 | 77 | - **-r**=*str*, **-result\_dir**=*str* 78 | 79 | Path to result folder for fof input \[default = './trunc\_seq\_results'\] 80 | 81 | - **-v**, **-version** 82 | 83 | Print version number to *STDOUT* 84 | 85 | ## Output 86 | 87 | - *STDOUT* 88 | 89 | If a single sequence file is given to the script the truncated sequence 90 | file is printed to *STDOUT*. Redirect or pipe into another tool as 91 | needed. 92 | 93 | **or** 94 | 95 | - ./trunc_seq_results 96 | 97 | If a fof is given to the script, all output files are stored in a 98 | results folder 99 | 100 | - ./trunc_seq_results/seq-file_trunc_start_stop.format 101 | 102 | Truncated output sequence files are named by appending 'trunc' and the 103 | corresponding start and stop positions to the input filename 104 | 105 | ## Run environment 106 | 107 | The Perl script runs under Windows and UNIX flavors.
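The output naming from the Output section can be reproduced in the shell, e.g. to predict result filenames for downstream steps. This is only a sketch under assumptions: `trunc_name` is a hypothetical helper, the format argument is used directly as the file extension (so pass `gbk`, not `genbank`), and input filenames are expected to carry an extension.

```shell
# trunc_name: hypothetical helper that prints the expected result filename
# for fof input: <result_dir>/<basename>_trunc_<start>_<stop>.<ext>
trunc_name() {
    start=$1
    stop=$2
    file=$3
    ext=$4
    dir=${5:-./trunc_seq_results}    # default results directory
    base=${file##*/}                 # drop the input path
    printf '%s/%s_trunc_%s_%s.%s\n' "${dir%/}" "${base%.*}" "$start" "$stop" "$ext"
}
```

For example, `trunc_name 300 9000 path/to/seq-file.embl embl` yields the name used for the first line of the fof example in the Description section.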
108 | 109 | ## Dependencies 110 | 111 | - [**BioPerl**](http://www.bioperl.org) (tested version 1.007001) 112 | 113 | ## Author - contact 114 | 115 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) 116 | 117 | ## Citation, installation, and license 118 | 119 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). 120 | 121 | ## Changelog 122 | 123 | * v0.2 (2015-12-07) 124 | * Merged functionality of `trunc_seq.pl` and `run_trunc_seq.pl` in one single script 125 | * Allows now single file and file of filenames (fof) with coordinates input 126 | * output for single file input printed to *STDOUT* now 127 | * output for fof input printed into files in a result directory, new option **-r** to specify result directory 128 | * included a POD instead of a simple usage text 129 | * included `pod2usage` with Pod::Usage 130 | * included 'use autodie' pragma 131 | * options with Getopt::Long 132 | * output format now specified with option **-o** 133 | * included version switch, **-v** 134 | * fixed bug to remove input filepaths from fof input for output files 135 | * skip empty or comment lines (/^#/) in fof input 136 | * check and warn if input seq file has more than one seq entries 137 | * v0.1 (2013-02-08) 138 | * In v0.1 `trunc_seq.pl` only for single sequence input, but included additional wrapper script `run_trunc_seq.pl` for a fof input 139 | -------------------------------------------------------------------------------- /trunc_seq/trunc_seq.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | 3 | ####### 4 | # POD # 5 | ####### 6 | 7 |
=pod 8 | 9 | =head1 NAME 10 | 11 | C<trunc_seq.pl> - truncate sequence files 12 | 13 | =head1 SYNOPSIS 14 | 15 | C<perl trunc_seq.pl 20 3500 seq-file.embl E<gt> 16 | seq-file_trunc_20_3500.embl> 17 | 18 | B<or> 19 | 20 | C<perl trunc_seq.pl file_of_filenames_and_coords.tsv> 21 | 22 | =head1 DESCRIPTION 23 | 24 | This script truncates sequence files according to the given 25 | coordinates. The features/annotations in RichSeq files (e.g. EMBL or 26 | GENBANK format) will also be adapted accordingly. Use option B<-o> to 27 | specify a different output sequence format. Input can be given directly 28 | as a file and truncation coordinates to the script, with the start 29 | position as the first argument, stop as the second and (the path to) 30 | the sequence file as the third. In this case the truncated sequence 31 | entry is printed to C<STDOUT>. Input sequence files should contain only 32 | one sequence entry, if a multi-sequence file is used as input only the 33 | B<first> sequence entry is truncated. 34 | 35 | Alternatively, a file of filenames (fof) with respective coordinates 36 | and sequence files in the following B<tab-separated> format can be 37 | given to the script (the header is optional): 38 | 39 | #start\tstop\tseq-file 40 | 300\t9000\t(path/to/)seq-file 41 | 50\t1300\t(path/to/)seq-file2 42 | 43 | With a fof the resulting truncated sequence files are printed into a 44 | results directory. Use option B<-r> to specify a different results 45 | directory than the default.
46 | 47 | It is also possible to truncate a RichSeq sequence file loaded into the 48 | L<Artemis|http://www.sanger.ac.uk/science/tools/artemis> genome browser 49 | from the Sanger Institute: Select a subsequence and then go to Edit 50 | -E<gt> Subsequence (and Features) 51 | 52 | =head1 OPTIONS 53 | 54 | =over 20 55 | 56 | =item B<-h>, B<-help> 57 | 58 | Help (perldoc POD) 59 | 60 | =item B<-o>=I<str>, B<-outformat>=I<str> 61 | 62 | Specify different sequence format for the output (files) [fasta, embl, 63 | or gbk] 64 | 65 | =item B<-r>=I<str>, B<-result_dir>=I<str> 66 | 67 | Path to result folder for fof input [default = './trunc_seq_results'] 68 | 69 | =item B<-v>, B<-version> 70 | 71 | Print version number to C<STDOUT> 72 | 73 | =back 74 | 75 | =head1 OUTPUT 76 | 77 | =over 20 78 | 79 | =item C<STDOUT> 80 | 81 | If a single sequence file is given to the script the truncated sequence 82 | file is printed to C<STDOUT>. Redirect or pipe into another tool as 83 | needed. 84 | 85 | =back 86 | 87 | B<or> 88 | 89 | =over 20 90 | 91 | =item F<./trunc_seq_results> 92 | 93 | If a fof is given to the script, all output files are stored in a 94 | results folder 95 | 96 | =item F<./trunc_seq_results/seq-file_trunc_start_stop.format> 97 | 98 | Truncated output sequence files are named by appending 'trunc' and the 99 | corresponding start and stop positions to the input filename 100 | 101 | =back 102 | 103 | =head1 EXAMPLES 104 | 105 | =over 106 | 107 | =item C<perl trunc_seq.pl -o gbk 120 30000 seq-file.embl E<gt> 108 | seq-file_trunc_120_30000.gbk> 109 | 110 | =back 111 | 112 | B<or> 113 | 114 | =over 115 | 116 | =item C<perl trunc_seq.pl -o fasta 5300 18500 seq-file.gbk | 117 | perl revcom_seq.pl -i fasta E<gt> seq-file_trunc_revcom.fasta> 118 | 119 | =back 120 | 121 | B<or> 122 | 123 | =over 124 | 125 | =item C<perl trunc_seq.pl -r path/to/trunc_embl_dir -o embl 126 | file_of_filenames_and_coords.tsv> 127 | 128 | =back 129 | 130 | =head1 DEPENDENCIES 131 | 132 | =over 133 | 134 | =item B<L<BioPerl|http://www.bioperl.org>> 135 | 136 | Tested with BioPerl version 1.007001 137 | 138 | =back 139 | 140 | =head1 VERSION 141 | 142 | 0.2 update: 2015-12-07 143 | 0.1 2013-08-02 144 | 145 | =head1 AUTHOR 146 | 147 | Andreas Leimbach aleimba[at]gmx[dot]de 148 | 149 | =head1 LICENSE 150 | 151 | This program is free software: you can redistribute it and/or modify 152 | it under the terms
of the GNU General Public License as published by 153 | the Free Software Foundation; either version 3 (GPLv3) of the 154 | License, or (at your option) any later version. 155 | 156 | This program is distributed in the hope that it will be useful, but 157 | WITHOUT ANY WARRANTY; without even the implied warranty of 158 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 159 | General Public License for more details. 160 | 161 | You should have received a copy of the GNU General Public License 162 | along with this program. If not, see L<http://www.gnu.org/licenses/>. 163 | 164 | =cut 165 | 166 | 167 | ######## 168 | # MAIN # 169 | ######## 170 | 171 | use strict; 172 | use warnings; 173 | use autodie; 174 | use Getopt::Long; 175 | use Pod::Usage; 176 | use Bio::SeqIO; # bioperl module to handle sequence input/output 177 | #use Bio::Seq; # bioperl module to handle sequences with features ### apparently not needed, methods inherited 178 | use Bio::SeqUtils; # bioperl module with additional methods (including features) for Bio::Seq objects; loaded explicitly because trunc_with_features() is called below 179 | 180 | ### Get options with Getopt::Long 181 | my $Out_Format_Opt; # optional different output seq file format 182 | my $Result_Dir = 'trunc_seq_results'; # path to result folder for fof input 183 | my $VERSION = 0.2; 184 | my ($Opt_Version, $Opt_Help); 185 | GetOptions ('outformat=s' => \$Out_Format_Opt, 186 | 'result_dir=s' => \$Result_Dir, 187 | 'version' => \$Opt_Version, 188 | 'help|?' => \$Opt_Help) 189 | or pod2usage(-verbose => 1, -exitval => 2); 190 | 191 | 192 | 193 | ### Run perldoc on POD 194 | pod2usage(-verbose => 2) if ($Opt_Help); 195 | if ($Opt_Version) { 196 | print "$0 $VERSION\n"; 197 | exit; 198 | } 199 | 200 | 201 | 202 | ### Check input (@ARGV); didn't include STDIN as input option, too complicated here with fof etc.
203 | my $Fof; # file of filenames (fof) with truncation coords 204 | my $Start; 205 | my $Stop; 206 | my $Seq_File; 207 | if (@ARGV < 1 || @ARGV == 2 || @ARGV > 3) { 208 | my $warning = "\n### Fatal error: Give either three arguments,\n$0\tstart\tstop\tseq-file\nor one file of sequence filenames with truncation coordinates as argument! Please see the usage with option '-h' if unclear!\n"; 209 | pod2usage(-verbose => 0, -message => $warning, -exitval => 2); 210 | } elsif (@ARGV == 1) { # fof 211 | check_file_exists($ARGV[0]); # subroutine to check for file existence 212 | $Fof = shift; 213 | } elsif (@ARGV == 3) { 214 | check_file_exists($ARGV[2]); # subroutine 215 | if ($ARGV[0] !~ /^\d+$/ || $ARGV[1] !~ /^\d+$/) { 216 | my $warning = "\n### Fatal error: With a single sequence file input the first and second arguments are the start and stop positions for truncation, and need to include ONLY digits:\n$0\tstart\tstop\tseq-file\nPlease see the usage with option '-h' if unclear!\n"; 217 | pod2usage(-verbose => 0, -message => $warning, -exitval => 2); 218 | } 219 | ($Start, $Stop, $Seq_File) = @ARGV; 220 | } 221 | 222 | 223 | 224 | ### Truncate the sequence and write either to STDOUT for single seq file input or output files for fof 225 | if ($Fof) { 226 | open (my $fof_fh, "<", "$Fof"); 227 | 228 | # create result folder 229 | $Result_Dir =~ s/\/$//; # get rid of a potential '/' at the end of $Result_Dir path 230 | if (-e $Result_Dir) { 231 | empty_dir($Result_Dir); # subroutine to empty a directory with user interaction 232 | } else { 233 | mkdir $Result_Dir; 234 | } 235 | 236 | while (my $line = <$fof_fh>) { 237 | chomp $line; 238 | next if ($line =~ /^\s*$/ || $line =~ /^#/); # skip empty or comment lines 239 | 240 | die "\n### Fatal error: Line '$.' 
of the '$Fof' file of sequence filenames plus truncation coordinates does not include the mandatory tab-separated two NUMERICAL start and stop truncation positions, and the sequence file (without any other whitespaces):\nstart\tstop\tpath/to/seq-file\n" if ($line !~ /^\d+\t\d+\t\S+$/); 241 | ($Start, $Stop, $Seq_File) = split(/\t/, $line); 242 | check_file_exists($Seq_File); # subroutine 243 | 244 | my ($seqin, $truncseq) = trunc_seq($Start, $Stop, $Seq_File); # subroutine to create a Bio::SeqIO input object and truncate the respective Bio::Seq object 245 | my $seqout = seq_out($seqin, $Start, $Stop, $Seq_File); # subroutine to create a Bio::SeqIO output object, $seqin needed for format guessing, $Start/$Stop/$Seq_File needed for output filenames 246 | $seqout->write_seq($truncseq); 247 | } 248 | close $fof_fh; 249 | 250 | } else { # single seq file, @ARGV == 3 251 | my ($seqin, $truncseq) = trunc_seq($Start, $Stop, $Seq_File); # subroutine 252 | my $seqout = seq_out($seqin); # subroutine, without $Start/$Stop/$Seq_file for STDOUT output 253 | $seqout->write_seq($truncseq); 254 | } 255 | 256 | exit; 257 | 258 | 259 | 260 | ############### 261 | # Subroutines # 262 | ############### 263 | 264 | ### Subroutine to check if file exists 265 | sub check_file_exists { 266 | my $file = shift; 267 | die "\n### Fatal error: File '$file' does not exist: $!\n" if (!-e $file); 268 | } 269 | 270 | 271 | 272 | ### Subroutine to empty a directory with user interaction 273 | sub empty_dir { 274 | my $dir = shift; 275 | print STDERR "\nDirectory '$dir' already exists! You can use either option '-r' to set a different output result directory name, or do you want to replace the directory and all its contents [y|n]? 
"; 276 | my $user_ask = <STDIN>; 277 | if ($user_ask =~ /y/i) { 278 | unlink glob "$dir/*"; # remove all files in results directory 279 | } else { 280 | die "\nScript aborted!\n"; 281 | } 282 | return 1; 283 | } 284 | 285 | 286 | 287 | ### Subroutine to create a Bio::SeqIO output object 288 | sub seq_out { 289 | my ($seqin, $start, $stop, $seq_file) = @_; 290 | 291 | my $out_format; # need to keep $Out_Format_Opt for several seq files with fof 292 | if ($Out_Format_Opt) { 293 | $Out_Format_Opt = 'genbank' if ($Out_Format_Opt =~ /(gbk|gb)/i); # allow shorter input for GENBANK format 294 | $out_format = $Out_Format_Opt; 295 | } else { # same format as input file 296 | if (ref($seqin) =~ /Bio::SeqIO::(genbank|embl|fasta)/) { # from bioperl guessing 297 | $out_format = $1; 298 | } else { 299 | die "\n### Fatal error: Could not determine input file format, please set an output file format with option '-o'!\n"; 300 | } 301 | } 302 | 303 | my $seqout; # Bio::SeqIO object 304 | if ($seq_file) { # fof 305 | $seq_file =~ s/\S+(\/|\\)//; # remove input filepaths, aka 'basename' ('/' for Unix and '\' for Windows) 306 | my $file_ext; 307 | if ($out_format eq 'genbank') { 308 | $file_ext = 'gbk'; # back to shorter file extension for GENBANK format 309 | } else { 310 | $file_ext = $out_format; 311 | } 312 | $seq_file =~ s/^(\S+)\.\w+$/$Result_Dir\/$1\_trunc_$start\_$stop\.$file_ext/; # append also result directory to output filename 313 | $seqout = Bio::SeqIO->new(-file => ">$seq_file", -format => $out_format); 314 | 315 | } else { # single seq file input 316 | $seqout = Bio::SeqIO->new(-fh => \*STDOUT, -format => $out_format); # printing to STDOUT requires '-format' 317 | } 318 | 319 | return $seqout; 320 | } 321 | 322 | 323 | 324 | ### Subroutine to create a Bio::SeqIO input object and truncate the respective Bio::Seq object 325 | sub trunc_seq { 326 | my ($start, $stop, $seq_file) = @_; 327 | print STDERR "\nTruncating \"$seq_file\" to coordinates $start..$stop ...\n"; 328 | my $seqin
= Bio::SeqIO->new(-file => "<$seq_file"); # Bio::SeqIO object; no '-format' given, leave it to bioperl guessing 329 | my $count = 0; 330 | my $truncseq; 331 | while (my $seq_obj = $seqin->next_seq) { # Bio::Seq object 332 | $count++; 333 | if ($count > 1) { 334 | warn "\n### Warning: More than one sequence entry in sequence file '$seq_file', but only the FIRST sequence entry will be truncated and printed to STDOUT or a result file!\n\n"; 335 | last; 336 | } 337 | $truncseq = Bio::SeqUtils->trunc_with_features($seq_obj, $start, $stop); 338 | } 339 | return ($seqin, $truncseq); # $seqin needed for outformat guessing in subroutine seqout 340 | } 341 | --------------------------------------------------------------------------------