├── LICENSE
├── README.md
├── calc_fastq-stats
    ├── README.md
    └── calc_fastq-stats.pl
├── cat_seq
    ├── README.md
    └── cat_seq.pl
├── cdd2cog
    ├── README.md
    └── cdd2cog.pl
├── cds_extractor
    ├── README.md
    └── cds_extractor.pl
├── ecoli_mlst
    ├── ADK.fas
    ├── FUMC.fas
    ├── GYRB.fas
    ├── ICD.fas
    ├── MDH.fas
    ├── PURA.fas
    ├── README.md
    ├── RECA.fas
    ├── ecoli_mlst.pl
    └── publicSTs.txt
├── genomes_feature_table
    ├── README.md
    └── genomes_feature_table.pl
├── ncbi_ftp_download
    ├── README.md
    ├── ncbi_ftp_concat_unpack.pl
    └── ncbi_ftp_download.sh
├── order_fastx
    ├── README.md
    └── order_fastx.pl
├── po2anno
    ├── README.md
    └── po2anno.pl
├── po2group_stats
    ├── README.md
    ├── pics
    │   ├── README.md
    │   ├── venn_diagram_logics.png
    │   └── venn_diagram_logics.svg
    └── po2group_stats.pl
├── prot_finder
    ├── README.md
    ├── binary_group_stats.pl
    ├── prot_binary_matrix.pl
    ├── prot_finder.pl
    ├── prot_finder_pipe.sh
    └── transpose_matrix.pl
├── rename_fasta_id
    ├── README.md
    └── rename_fasta_id.pl
├── revcom_seq
    ├── README.md
    └── revcom_seq.pl
├── rod_finder
    ├── README.md
    ├── blast_rod_finder.pl
    └── blast_rod_finder_legacy.sh
├── sam_insert-size
    ├── README.md
    └── sam_insert-size.pl
├── sample_fastx-txt
    ├── README.md
    └── sample_fastx-txt.pl
├── seq_format-converter
    ├── README.md
    └── seq_format-converter.pl
├── tbl2tab
    ├── README.md
    ├── example.tbl
    ├── example2.tab
    └── tbl2tab.pl
└── trunc_seq
    ├── README.md
    └── trunc_seq.pl


/README.md:
--------------------------------------------------------------------------------
  1 | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.215824.svg)](http://dx.doi.org/10.5281/zenodo.215824)
  2 | 
  3 | bac-genomics-scripts
  4 | ====================
  5 | 
  6 | A collection of scripts intended for **bacterial genomics** (some might also be useful for eukaryotes) from **high-throughput sequencing** (aka next-generation sequencing).
  7 | 
  8 | * [Summary](#summary)
  9 | * [Introduction](#introduction)
 10 | * [Installation recommendations](#installation-recommendations)
 11 | * [Dependencies](#dependencies)
 12 | * [UNIX loops](#unix-loops)
 13 | * [Windows - UNIX linebreak problems](#windows---unix-linebreak-problems)
 14 | * [Citation](#citation)
 15 | * [License](#license)
 16 | * [Author - contact](#author---contact)
 17 | 
 18 | ## Summary
 19 | 
 20 | * Basic stats for bases and reads in FASTQ files: [`calc_fastq-stats`](/calc_fastq-stats)
 21 | * Concatenate multi-sequence files (RichSeq EMBL or GENBANK format, or FASTA format) to a single artificial file: [`cat_seq`](/cat_seq)
 22 | * COG ([cluster of orthologous groups](http://www.ncbi.nlm.nih.gov/COG/)) classification of proteins: [`cdd2cog`](/cdd2cog)
 23 | * Extraction of protein/nucleotide sequences from CDSs: [`cds_extractor`](/cds_extractor)
 24 | * MLST (multilocus sequence typing) assignment and allele extraction for *Escherichia coli* ([Achtman scheme](http://mlst.warwick.ac.uk/mlst/)): [`ecoli_mlst`](/ecoli_mlst)
 25 | * Create a feature table for all annotated primary features in RichSeq (EMBL or GENBANK format) files: [`genomes_feature_table`](/genomes_feature_table)
 26 | * **Deprecated!** Batch downloading of sequences from NCBI's FTP server: [`ncbi_ftp_download`](/ncbi_ftp_download)
 27 | * Order sequence entries in FASTA/FASTQ files according to an ID list: [`order_fastx`](/order_fastx)
 28 | * Create an ortholog/paralog annotation comparison matrix from [*Proteinortho5*](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output: [`po2anno`](/po2anno)
 29 | * Calculate stats and plot venn diagrams for genome groups according to orthologs/paralogs from [*Proteinortho5*](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output, i.e. overall presence/absence statistics for groups of genomes and not simply single genomes: [`po2group_stats`](/po2group_stats)
 30 | * Strain panel query protein search with **BLASTP** plus concise hit summary, optional alignment, and presence/absence matrix. Also included, scripts to transpose the matrix and calculate overall presence/absence statistics for groups of columns in the matrix: [`prot_finder`](/prot_finder)
 31 | * Rename FASTA ID lines and optionally numerate them: [`rename_fasta_id`](/rename_fasta_id)
 32 | * Reverse complement (multi-)sequence files (RichSeq EMBL or GENBANK format, or FASTA format): [`revcom_seq`](/revcom_seq)
 33 | * Regions of difference (ROD) detection in genomes with **BLASTN**: [`rod_finder`](/rod_finder)
 34 | * NGS paired-end library insert size estimation from BAM/SAM: [`sam_insert-size`](/sam_insert-size)
 35 | * Randomly subsample FASTA, FASTQ, or TEXT files with [*reservoir sampling*](https://en.wikipedia.org/wiki/Reservoir_sampling): [`sample_fastx-txt`](/sample_fastx-txt)
 36 | * Convert a sequence file to another format with [BioPerl](http://www.bioperl.org): [`seq_format-converter`](/seq_format-converter)
 37 | * Manual curation of annotation in NCBI's TBL format (e.g. from [Prokka](http://www.vicbioinformatics.com/software.prokka.shtml) automatic annotation) in a spreadsheet software: [`tbl2tab`](/tbl2tab)
 38 | * Truncate sequence files (RichSeq EMBL or GENBANK format, or FASTA format) according to given coordinates: [`trunc_seq`](/trunc_seq)
 39 | * And an assortment of smaller scripts (not yet uploaded to GitHub) for tasks like alignment format conversion, dnadiff, GC% calculation etc.
 40 | 
 41 | ## Introduction
 42 | 
 43 | All the scripts here are written in [**Perl**](https://www.perl.org/) (some include bash shell wrappers).
 44 | 
 45 | Each script is hosted in its own folder, so that a separate *README.md* can be included for more information. In addition, all Perl scripts include a usage/help text or a comprehensive [POD](http://perldoc.perl.org/perlpod.html) (Plain Old Documentation), which is shown by calling the script either without arguments/options or with option **-h|-help**.
 46 | 
 47 | The scripts are only tested under UNIX; some won't run in a Windows environment (because of included UNIX commands). If you are on Windows, an alternative might be [Cygwin](http://cygwin.com/).
 48 | 
 49 | ## Installation recommendations
 50 | 
 51 | To download the repository, use either the '[Download ZIP](https://github.com/aleimba/bac-genomics-scripts/archive/master.zip)' link after clicking the green 'Clone or download' button at the top or clone the repository with `git`:
 52 | 
 53 |     git clone https://github.com/aleimba/bac-genomics-scripts.git
 54 | 
 55 | If there is an update to this GitHub repository (see above [commits](https://github.com/aleimba/bac-genomics-scripts/commits/master) and [releases](https://github.com/aleimba/bac-genomics-scripts/releases)), you can refresh your **local** repository by using the following command **inside** the local folder:
 56 | 
 57 |     git pull
 58 | 
 59 | To install the scripts, copy them e.g. to a home */bin* folder in your *PATH* and make them executable:
 60 | 
 61 |     $ find . \( -name '*.pl' -o -name '*.sh' -o -name '*.fas' -o -name '*.txt' \) -exec cp {} ~/bin \;
 62 |     $ chmod u+x ~/bin/*.pl ~/bin/*.sh
 63 | 
 64 | The scripts can then be run from anywhere on your system. Alternatively, call them directly by prefixing `perl` to the command, or './' for bash wrappers:
 65 | 
 66 |     $ perl /path/to/script/script.pl <options>
 67 | 
 68 | or
 69 | 
 70 |     $ ./script.sh <arguments>
 71 | 
 72 | **Single** scripts can be downloaded as well. For this purpose click on the folder you're interested in and then on the link of the script. There click on the **Raw** button and save this page to a file (without **Raw** you'll get an unusable html file). This is also true for other files (e.g. PDFs etc.).
 73 | 
 74 | ## Dependencies
 75 | 
 76 | All scripts are tested with Perl v5.22.1.
 77 | 
 78 | Most of the Perl scripts include modules from [BioPerl](http://www.bioperl.org) as stated in their respective *README.md* or POD, which as a consequence has to be installed on your system. For BioPerl installation instructions see the website ([**Installation**](http://bioperl.org/INSTALL.html)).
 79 | 
 80 | Some scripts need additional Perl modules, which will be stated in the associated *README.md* or POD. If they're not installed yet on your system get them from [CPAN](http://www.cpan.org/) (installation instructions can be found on the website, see e.g. [**Getting Started...Installing Perl Modules**](http://www.cpan.org/modules/INSTALL.html) or [**FAQ**](http://www.cpan.org/misc/cpan-faq.html#How_install_Perl_modules)).
 81 | 
 82 | Furthermore, some scripts call upon the statistical computing language [**R**](http://www.r-project.org/) and dependent packages for plotting purposes (again see the respective *README.md* or POD).
 83 | 
 84 | ## UNIX loops
 85 | 
 86 | A very handy tip: if you want to run a script on all files in the current working directory, you can use a **loop** in UNIX, e.g.:
 87 | 
 88 |     $ for file in *.fasta; do perl script.pl "$file"; done
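A slightly more defensive variant of the loop above can be sketched as a small shell function. The name `run_on_all` and its argument order are hypothetical and not part of this repository; the function only skips the case where no file matches the pattern, which the plain loop would otherwise pass to the script as a literal `*.fasta`.

```shell
# Hypothetical helper (not part of the repository):
# run_on_all DIR SUFFIX COMMAND [ARGS...]
# runs COMMAND once for every file DIR/*.SUFFIX, appending the file name.
run_on_all() {
    roa_dir=$1
    roa_suffix=$2
    shift 2
    for roa_file in "$roa_dir"/*."$roa_suffix"; do
        # If nothing matches, the shell leaves the pattern unexpanded; skip it.
        [ -e "$roa_file" ] || continue
        "$@" "$roa_file"
    done
}

# Example call (assumes script.pl takes the input file as last argument):
#   run_on_all . fasta perl script.pl
```

Quoting `"$roa_file"` keeps file names with spaces intact, as in the original loop.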
 89 | 
 90 | ## Windows - UNIX linebreak problems
 91 | 
 92 | Finally, some of the scripts don't handle Windows-formatted line breaks. Consider running such input files through a nifty UNIX utility called [dos2unix](http://dos2unix.sourceforge.net/):
 93 | 
 94 |     $ dos2unix input
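To find out whether a file needs this treatment at all, one option is to test it for carriage return characters first. The following is a minimal sketch; the function name `has_crlf` is an assumption made for illustration and is not part of the repository.

```shell
# Hypothetical helper: succeeds (exit status 0) if FILE contains at least
# one carriage return, i.e. probably has Windows line breaks.
has_crlf() {
    hc_cr=$(printf '\r')      # a literal carriage return character
    grep -q "$hc_cr" "$1"
}

# Example: convert only the files that need it (requires dos2unix):
#   for file in *.fasta; do
#       if has_crlf "$file"; then dos2unix "$file"; fi
#   done
```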
 95 | 
 96 | ## Citation
 97 | For now cite the latest major release (tag: [***bovine_ecoli_mastitis***](https://github.com/aleimba/bac-genomics-scripts/releases)) hosted on [Zenodo](https://zenodo.org/):
 98 | 
 99 | **Leimbach A**. 2016. bac-genomics-scripts: Bovine *E. coli* mastitis comparative genomics edition. Zenodo. <http://dx.doi.org/10.5281/zenodo.215824>.
100 | 
101 | Also, all scripts have a version number (see option **-v**), which might be included in a materials and methods section.
102 | 
103 | ## License
104 | 
105 | All scripts are licensed under GPLv3 which is contained in the file [*LICENSE*](./LICENSE).
106 | 
107 | ## Author - contact
108 | For help, suggestions, bugs etc. use the GitHub issues or write an email to aleimba [at] gmx [dot] de.
109 | 
110 | Andreas Leimbach (Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
111 | 


--------------------------------------------------------------------------------
/calc_fastq-stats/README.md:
--------------------------------------------------------------------------------
  1 | calc_fastq-stats
  2 | ================
  3 | 
  4 | `calc_fastq-stats.pl` is a script to calculate basic statistics for bases and reads in a FASTQ file.
  5 | 
  6 | * [Synopsis](#synopsis)
  7 | * [Description](#description)
  8 | * [Usage](#usage)
  9 | * [Options](#options)
 10 |   * [Mandatory options](#mandatory-options)
 11 |   * [Optional options](#optional-options)
 12 | * [Output](#output)
 13 | * [Run environment](#run-environment)
 14 | * [Dependencies](#dependencies)
 15 | * [Author - contact](#author---contact)
 16 | * [Citation, installation, and license](#citation-installation-and-license)
 17 | * [Changelog](#changelog)
 18 | 
 19 | ## Synopsis
 20 | 
 21 |     perl calc_fastq-stats.pl -i reads.fastq
 22 | 
 23 | **or**
 24 | 
 25 |     gzip -dc reads.fastq.gz | perl calc_fastq-stats.pl -i -
 26 | 
 27 | ## Description
 28 | 
 29 | The script calculates some simple statistics, like individual and total base
 30 | counts, GC content, and basic stats for the read lengths, and
 31 | read/base qualities in a FASTQ file. The GC content calculation does
 32 | not include 'N's. Stats are printed to *STDOUT* and optionally to an
 33 | output file.
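To illustrate what "GC content without 'N's" can mean, here is a minimal `awk` sketch. It assumes four-line FASTQ records and takes the GC percentage as (G+C) / (A+C+G+T); the exact formula used by `calc_fastq-stats.pl` may differ, and the function name `gc_content` is hypothetical.

```shell
# Hypothetical helper: print the GC percentage of all reads in a
# four-line-per-record FASTQ file, ignoring 'N' (and any other non-ACGT) bases.
gc_content() {
    awk 'NR % 4 == 2 {
             seq = toupper($0)
             gc += gsub(/[GC]/, "", seq)   # gsub returns the number of replacements
             at += gsub(/[AT]/, "", seq)
         }
         END {
             if (gc + at > 0) printf "%.2f\n", 100 * gc / (gc + at)
         }' "$1"
}

# Example: gc_content reads.fastq
```

For the read `GCNNAT` this counts two G/C and two A/T bases and skips both 'N's, giving 50.00.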
 34 | 
 35 | Because the quality of a read degrades over its length with all NGS
 36 | machines, it is advisable to also plot the quality for each cycle as
 37 | implemented in tools like
 38 | [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
 39 | or the [fastx-toolkit](http://hannonlab.cshl.edu/fastx_toolkit/).
 40 | 
 41 | If the sequence and the quality values are interrupted by line
 42 | breaks (i.e. a read is **not** represented by four lines), please fix
 43 | with Heng Li's [seqtk](https://github.com/lh3/seqtk):
 44 | 
 45 |     seqtk seq -l 0 infile.fastq > outfile.fastq
 46 | 
 47 | An alternative tool, which is a lot faster, is **fastq-stats** from
 48 | [ea-utils](https://code.google.com/p/ea-utils/).
 49 | 
 50 | ## Usage
 51 | 
 52 |     zcat reads.fastq.gz | perl calc_fastq-stats.pl -i - -q 64 -c 175000000 -n 3000000
 53 | 
 54 | ## Options
 55 | 
 56 | ### Mandatory options
 57 | 
 58 | - -i, -input
 59 | 
 60 | Input FASTQ file or piped STDIN (-) from a gzipped file
 61 | 
 62 | - -q, -qual_offset
 63 | 
 64 | ASCII quality offset of the Phred (Sanger) quality values [default 33]
 65 | 
 66 | ### Optional options
 67 | 
 68 | - -h, -help
 69 | 
 70 | Help (perldoc POD)
 71 | 
 72 | - -c, -coverage_limit
 73 | 
 74 | Number of bases to sample from the top of the file
 75 | 
 76 | - -n, -num_read
 77 | 
 78 | Number of reads to sample from the top of the file
 79 | 
 80 | - -o, -output
 81 | 
 82 | Print stats in addition to *STDOUT* to the specified output file
 83 | 
 84 | - -v, -version
 85 | 
 86 | Print version number to *STDERR*
 87 | 
 88 | ## Output
 89 | 
 90 | - *STDOUT*
 91 | 
 92 | Calculated stats are printed to *STDOUT*
 93 | 
 94 | - (outfile)
 95 | 
 96 | Optional outfile for stats
 97 | 
 98 | ## Run environment
 99 | 
100 | The Perl script runs under Windows and UNIX flavors.
101 | 
102 | ## Dependencies
103 | 
104 | If the following modules are not installed get them from
105 | [CPAN](http://www.cpan.org/):
106 | 
107 | - `Statistics::Descriptive`
108 | 
109 | Perl module to calculate basic descriptive statistics
110 | 
111 | - `Statistics::Descriptive::Discrete`
112 | 
113 | Perl module to calculate descriptive statistics for discrete data sets
114 | 
115 | - `Statistics::Descriptive::Weighted`
116 | 
117 | Perl module to calculate descriptive statistics for weighted variates
118 | 
119 | ## Author - contact
120 | 
121 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
122 | 
123 | ## Citation, installation, and license
124 | 
125 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
126 | 
127 | ## Changelog
128 | 
129 | - v0.1 (28.10.2014)
130 | 


--------------------------------------------------------------------------------
/cat_seq/README.md:
--------------------------------------------------------------------------------
 1 | cat_seq
 2 | =======
 3 | 
 4 | A script to merge multi-sequence RichSeq files into one single-entry 'artificial' sequence file.
 5 | 
 6 | * [Synopsis](#synopsis)
 7 | * [Description](#description)
 8 | * [Usage](#usage)
 9 |   * [Merge multi-sequence file](#merge-multi-sequence-file)
10 |   * [Merge multi-sequence file and specify different output format](#merge-multi-sequence-file-and-specify-different-output-format)
11 |   * [UNIX loop to concatenate each multi-sequence file in the current working directory](#unix-loop-to-concatenate-each-multi-sequence-file-in-the-current-working-directory)
12 |   * [Concatenate multi-sequence fasta files faster with UNIX's `grep`](#concatenate-multi-sequence-fasta-files-faster-with-unixs-grep)
13 | * [Output](#output)
14 | * [Dependencies](#dependencies)
15 | * [Run environment](#run-environment)
16 | * [Alternative software](#alternative-software)
17 | * [Author - contact](#author---contact)
18 | * [Citation, installation, and license](#citation-installation-and-license)
19 | * [Changelog](#changelog)
20 | 
21 | ## Synopsis
22 | 
23 |     perl cat_seq.pl multi-seq_file.embl
24 | 
25 | ## Description
26 | 
27 | This script concatenates multiple sequences in a RichSeq file (embl or genbank, but also fasta) to a single artificial sequence. The first sequence in the file is used as a foundation to add the subsequent sequences, along with all features and annotations.
28 | 
29 | Optionally, a different output file format can be specified (fasta/embl/genbank).
30 | 
31 | ## Usage
32 | 
33 | ### Merge multi-sequence file
34 | 
35 |     perl cat_seq.pl multi-seq_file.gbk
36 | 
37 | ### Merge multi-sequence file and specify different output format
38 | 
39 |     perl cat_seq.pl multi-seq_file.embl [fasta|genbank]
40 | 
41 | ### UNIX loop to concatenate each multi-sequence file in the current working directory
42 | 
43 |     for i in *.[embl|fasta|gbk]; do perl cat_seq.pl $i [embl|fasta|genbank]; done
44 | 
 45 | ### Concatenate multi-sequence fasta files faster with UNIX's `grep`
 46 | If you're working only with fasta files, UNIX's `grep` is a faster choice to concatenate sequences.
47 | 
48 |     grep -v ">" seq.fasta > seq_artificial.fasta
49 | 
50 | Subsequently add as a first line a fasta ID (starting with '>') with an editor.
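The two manual steps above (strip the ID lines, then add a new one) can also be combined in a short shell function. This is only a sketch; the name `cat_fasta` and the way the new ID is passed are assumptions, not part of `cat_seq.pl`.

```shell
# Hypothetical helper: cat_fasta INPUT.fasta NEW_ID
# prints a single FASTA record: the header ">NEW_ID" followed by all
# sequence lines of INPUT.fasta in their original order.
cat_fasta() {
    printf '>%s\n' "$2"
    grep -v '^>' "$1"     # drop only lines that start with '>'
}

# Example:
#   cat_fasta seq.fasta seq_artificial > seq_artificial.fasta
```

Anchoring the pattern with `^` restricts the filter to real ID lines, so a '>' elsewhere in a line does not remove it.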
51 | 
52 | ## Output
53 | 
54 | * *\_artificial.[embl|fasta|genbank]
55 | 
56 | Concatenated artificial sequence in the input format, or optionally the specified output sequence format.
57 | 
58 | ## Dependencies
59 | 
60 | * BioPerl (tested with version 1.006901)
61 | 
62 | ## Run environment
63 | 
64 | The Perl script runs under Windows and UNIX flavors.
65 | 
66 | ## Alternative software
67 | 
68 | The EMBOSS (The European Molecular Biology Open Software Suite) application ***union*** can also be used for this task (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/union.html).
69 | 
70 | ## Author - contact
71 | 
72 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
73 | 
74 | ## Citation, installation, and license
75 | 
76 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
77 | 
78 | ## Changelog
79 | 
80 | * v0.1 (08.02.2013)
81 | 


--------------------------------------------------------------------------------
/cat_seq/cat_seq.pl:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/perl
 2 | 
 3 | use warnings;
 4 | use strict;
 5 | use Bio::SeqIO; # bioperl module to handle sequence input/output
 6 | use Bio::Seq; # bioperl module to handle sequences with features
 7 | use Bio::SeqUtils; # bioperl module with additional methods (including features) for Bio::Seq objects
 8 | 
 9 | my $usage = "\n".
10 |    "\t#################################################################\n".
11 |    "\t# $0 multi-seq_file [outfile-format]                    #\n". #$0 = program name
12 |    "\t#                                                               #\n".
13 |    "\t# The script merges RichSeq sequences (embl or genbank, but     #\n".
14 |    "\t# also fasta) in a multi-sequence file to one artificial        #\n".
15 |    "\t# sequence. The first sequence in the file is used as a         #\n".
16 |    "\t# foundation to add the subsequent sequences (along with        #\n".
17 |    "\t# features and annotations). Optionally, a different output     #\n".
18 |    "\t# file format can be specified (fasta/embl/genbank).            #\n".
19 |    "\t# The script uses bioperl (www.bioperl.org).                    #\n".
20 |    "\t#                                                               #\n".
21 |    "\t# Adjust unix loop to run the script with all multi-seq files   #\n".
22 |    "\t# in the current working directory, e.g.:                       #\n".
23 |    "\t# for i in *.embl; do cat_seq.pl \$i genbank; done               #\n".
24 |    "\t#                                                               #\n".
25 |    "\t# version 0.1                                        A Leimbach #\n".
26 |    "\t# 08.02.2013                              aleimba[at]gmx[dot]de #\n".
27 |    "\t#################################################################\n\n";
28 | 
29 | ### Shift arguments from @ARGV or give usage
30 | my $multi_seq = shift or die $usage;
31 | my $format = shift;
 32 | if ($multi_seq =~ /^-{1,2}h(?:elp)?$/) { # only treat '-h', '-help' etc. as help request, not file names containing '-h'
33 |     die $usage;
34 | }
35 | 
36 | 
37 | ### Bio::SeqIO/Seq objects to concat the seqs
38 | print "\nConcatenating multi-sequence file \"$multi_seq\" to an artificial sequence file ...\n";
39 | my $seqin = Bio::SeqIO->new(-file => "<$multi_seq"); # Bio::SeqIO object; no '-format' given, leave it to bioperl guessing
40 | my @seqs; # store Bio::Seq objects for each seq in the multi-seq file
41 | while (my $seq = $seqin->next_seq) { # Bio::Seq object
42 |     push(@seqs, $seq);
43 | }
44 | Bio::SeqUtils->cat(@seqs);
45 | my $cat_seq = shift @seqs; # the first sequence in the array ($seqs[0]) was modified!
46 | 
47 | 
48 | ### Write the artificial/concatenated sequence (with its features) to output Bio::SeqIO object
49 | my $seqout; # Bio::SeqIO object
50 | if ($format) { # true if defined
51 |     $multi_seq =~ s/^(.+)\.\w+$/$1_artificial\.$format/;
52 |     $seqout = Bio::SeqIO->new(-file => ">$multi_seq", -format => "$format");
53 | } else {
54 |     $multi_seq =~ s/^(.+)(\.\w+)$/$1_artificial$2/;
55 |     $seqout = Bio::SeqIO->new(-file => ">$multi_seq");
56 | }
57 | $seqout->write_seq($cat_seq);
58 | print "Created new file \"$multi_seq\"!\n\n";
59 | 
60 | exit;
61 | 


--------------------------------------------------------------------------------
/cdd2cog/README.md:
--------------------------------------------------------------------------------
  1 | cdd2cog
  2 | =======
  3 | 
  4 | `cdd2cog.pl` is a script to assign COG categories to query protein sequences.
  5 | 
  6 | * [Synopsis](#synopsis)
  7 | * [Description](#description)
  8 | * [Usage](#usage)
  9 |   * [RPS-BLAST+](#rps-blast)
 10 |   * [cdd2cog](#cdd2cog)
 11 | * [Options](#options)
 12 |   * [Mandatory options](#mandatory-options)
 13 |   * [Optional options](#optional-options)
 14 | * [Output](#output)
 15 | * [Run environment](#run-environment)
 16 | * [Author - contact](#author---contact)
 17 | * [Acknowledgements](#acknowledgements)
 18 | * [Citation, installation, and license](#citation-installation-and-license)
 19 | * [Changelog](#changelog)
 20 | 
 21 | ## Synopsis
 22 | 
 23 |     perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog
 24 | 
 25 | ## Description
 26 | For troubleshooting and a working example please see issue [#1](https://github.com/aleimba/bac-genomics-scripts/issues/1).
 27 | 
 28 | The script assigns COG ([cluster of orthologous
 29 | groups](http://www.ncbi.nlm.nih.gov/COG/)) categories to proteins.
 30 | For this purpose, the query proteins need to be blasted with
 31 | RPS-BLAST+ ([Reverse Position-Specific BLAST](http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download))
 32 | against NCBI's Conserved Domain Database
 33 | ([CDD](http://www.ncbi.nlm.nih.gov/cdd)). Use
 34 | [`cds_extractor.pl`](/cds_extractor) beforehand to extract multi-fasta protein
 35 | files from GENBANK or EMBL files.
 36 | 
 37 | Both tab-delimited RPS-BLAST+ outformats, **-outfmt 6** and **-outfmt
 38 | 7**, can be processed by `cdd2cog.pl`. By default, RPS-BLAST+ hits
 39 | for each query protein are filtered for the best hit (lowest
 40 | e-value). Use option **-a|all\_hits** to assign COGs to all BLAST hits
 41 | and e.g. do a downstream filtering in a spreadsheet application.
 42 | Results are written to tab-delimited files in the './results'
 43 | folder; overall assignment statistics are printed to *STDOUT*.
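The best-hit filtering described above can be pictured with a small `awk` sketch. It assumes the default 12-column **-outfmt 6** layout, where the query id is column 1 and the e-value is column 11, and it keeps the first hit on ties. It is not the code used by `cdd2cog.pl`, and the function name `best_hits` is hypothetical.

```shell
# Hypothetical helper: keep, for each query (column 1), the tab-delimited
# hit line with the lowest e-value (column 11). Comment lines starting
# with '#' (as in -outfmt 7) are skipped. Queries keep their input order.
best_hits() {
    awk -F '\t' '
        /^#/ { next }
        !($1 in best) || $11 + 0 < best[$1] {
            if (!($1 in best)) order[++n] = $1   # remember first appearance
            best[$1] = $11 + 0                   # force numeric comparison
            line[$1] = $0
        }
        END { for (i = 1; i <= n; i++) print line[order[i]] }' "$1"
}

# Example: best_hits rps-blast.out > rps-blast_best.out
```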
 44 | 
 45 | Several files are needed from NCBI's FTP server to run RPS-BLAST+ and `cdd2cog.pl`:
 46 | 
 47 | 1. **CDD** (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/)
 48 | 
 49 |     More information about the files in the CDD FTP archive can be found in the respective 'README' file.
 50 | 
 51 |   1. 'cddid.tbl.gz'
 52 | 
 53 |     The file needs to be unpacked:
 54 | 
 55 |     `gunzip cddid.tbl.gz`
 56 | 
 57 |     Contains summary information about the CD models in a tab-delimited format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short name, CD description, and PSSM (position-specific scoring matrices) length.
 58 | 
 59 |   2. './little_endian/Cog_LE.tar.gz'
 60 | 
 61 |     Unpack and untar via:
 62 | 
 63 |     `tar xvfz Cog_LE.tar.gz`
 64 | 
 65 |     Preformatted RPS-BLAST+ database of the CDD COG distribution for Intel CPUs and Unix/Windows architectures.
 66 | 
 67 | 2. **COG** (ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/)
 68 | 
 69 |     Read 'readme' for more information about the respective files in the COG FTP archive.
 70 | 
 71 |   1. 'fun.txt'
 72 | 
 73 |     One-letter functional classification used in the COG database.
 74 | 
 75 |   2. 'whog'
 76 | 
 77 |     Name, description, and corresponding functional classification of each COG.
 78 | 
 79 | ## Usage
 80 | 
 81 | ### RPS-BLAST+
 82 | 
 83 |     rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6
 84 |     rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'
 85 | 
 86 | ### cdd2cog
 87 | 
 88 |     perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog -a
 89 | 
 90 | ## Options
 91 | 
 92 | ### Mandatory options
 93 | 
 94 | - -r, -rps\_report
 95 | 
 96 |     Path to RPS-BLAST+ report/output, outfmt 6 or 7
 97 | 
 98 | - -c, -cddid
 99 | 
100 |     Path to CDD's 'cddid.tbl' file
101 | 
102 | - -f, -fun
103 | 
104 |     Path to COG's 'fun.txt' file
105 | 
106 | - -w, -whog
107 | 
108 |     Path to COG's 'whog' file
109 | 
110 | ### Optional options
111 | 
112 | - -h, -help
113 | 
114 |     Help (perldoc POD)
115 | 
116 | - -a, -all\_hits
117 | 
118 |     Don't filter RPS-BLAST+ output for the best hit, rather assign COGs to all hits
119 | 
120 | - -v, -version
121 | 
122 |     Print version number to *STDERR*
123 | 
124 | ## Output
125 | 
126 | - *STDOUT*
127 | 
128 |     Overall assignment statistics
129 | 
130 | - ./results
131 | 
132 |     All tab-delimited output files are stored in this result folder
133 | 
134 | - rps-blast_cog.txt
135 | 
136 |     COG assignments concatenated to the RPS-BLAST+ results for filtering
137 | 
138 | - protein-id_cog.txt
139 | 
140 |     Slimmed down 'rps-blast_cog.txt' only including query id (first BLAST report column), COGs, and functional categories
141 | 
142 | - cog_stats.txt
143 | 
144 |     Assignment counts for each used COG
145 | 
146 | - func_stats.txt
147 | 
148 |     Assignment counts for single-letter functional categories
149 | 
150 | ## Run environment
151 | 
152 | The Perl script runs under UNIX flavors.
153 | 
154 | ## Author - contact
155 | 
156 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
157 | 
158 | ## Acknowledgements
159 | 
160 | I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's [IMG/ER annotation system](http://img.jgi.doe.gov/), which employs the same technique.
161 | 
162 | ## Citation, installation, and license
163 | 
164 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
165 | 
166 | ## Changelog
167 | 
168 | * v0.2 (2017-02-16)
169 |     * Adapted to new NCBI FASTA header format for CDD RPS-BLAST+ output
170 | * v0.1 (2013-08-01)
171 | 


--------------------------------------------------------------------------------
/cdd2cog/cdd2cog.pl:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/perl
  2 | 
  3 | #######
  4 | # POD #
  5 | #######
  6 | 
  7 | =pod
  8 | 
  9 | =head1 NAME
 10 | 
 11 | C<cdd2cog.pl> - assign COG categories to protein sequences
 12 | 
 13 | =head1 SYNOPSIS
 14 | 
 15 | C<perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog>
 16 | 
 17 | =head1 DESCRIPTION
 18 | 
 19 | The script assigns COG (L<cluster of orthologous
 20 | groups|http://www.ncbi.nlm.nih.gov/COG/>) categories to proteins.
 21 | For this purpose, the query proteins need to be blasted with
 22 | RPS-BLAST+ (L<Reverse Position-Specific BLAST|http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download>)
 23 | against NCBI's Conserved Domain Database
 24 | (L<CDD|http://www.ncbi.nlm.nih.gov/cdd>). Use
 25 | L<C<cds_extractor.pl>|/cds_extractor> beforehand to extract multi-fasta
 26 | protein files from GENBANK or EMBL files.
 27 | 
 28 | Both tab-delimited RPS-BLAST+ outformats, B<-outfmt 6> and B<-outfmt
 29 | 7>, can be processed by C<cdd2cog.pl>. By default, RPS-BLAST+ hits
 30 | for each query protein are filtered for the best hit (lowest
 31 | e-value). Use option B<-a|all_hits> to assign COGs to all BLAST hits
 32 | and e.g. do a downstream filtering in a spreadsheet application.
 33 | Results are written to tab-delimited files in the F<./results>
 34 | folder; overall assignment statistics are printed to C<STDOUT>.
 35 | 
 36 | Several files are needed from NCBI's FTP server to run the RPS-BLAST+
 37 | and C<cdd2cog.pl>:
 38 | 
 39 | =over
 40 | 
 41 | =item 1.) L<CDD|ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/>
 42 | 
 43 | More information about the files in the CDD FTP archive can be found
 44 | in the respective F<README> file.
 45 | 
 46 | =item 1.1.) F<cddid.tbl.gz>
 47 | 
 48 | The file needs to be unpacked:
 49 | 
 50 | C<gunzip cddid.tbl.gz>
 51 | 
 52 | Contains summary information about the CD models in a tab-delimited
 53 | format. The columns are: PSSM-Id, CD accession (e.g. COG#), CD short
 54 | name, CD description, and PSSM (position-specific scoring matrices)
 55 | length.
 56 | 
 57 | =item 1.2.) F<./little_endian/Cog_LE.tar.gz>
 58 | 
 59 | Unpack and untar via:
 60 | 
 61 | C<tar xvfz Cog_LE.tar.gz>
 62 | 
 63 | Preformatted RPS-BLAST+ database of the CDD COG distribution for
 64 | Intel CPUs and Unix/Windows architectures.
 65 | 
 66 | =item 2.) L<COG|ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/>
 67 | 
 68 | Read F<readme> for more information about the respective files in
 69 | the COG FTP archive.
 70 | 
 71 | =item 2.1.) F<fun.txt>
 72 | 
 73 | One-letter functional classification used in the COG database.
 74 | 
 75 | =item 2.2.) F<whog>
 76 | 
 77 | Name, description, and corresponding functional classification of
 78 | each COG.
 79 | 
 80 | =back
 81 | 
 82 | =head1 OPTIONS
 83 | 
 84 | =head2 Mandatory options
 85 | 
 86 | =over 20
 87 | 
 88 | =item B<-r>=I<str>, B<-rps_report>=I<str>
 89 | 
 90 | Path to RPS-BLAST+ report/output, outfmt 6 or 7
 91 | 
 92 | =item B<-c>=I<str>, B<-cddid>=I<str>
 93 | 
 94 | Path to CDD's F<cddid.tbl> file
 95 | 
 96 | =item B<-f>=I<str>, B<-fun>=I<str>
 97 | 
 98 | Path to COG's F<fun.txt> file
 99 | 
100 | =item B<-w>=I<str>, B<-whog>=I<str>
101 | 
102 | Path to COG's F<whog> file
103 | 
104 | =back
105 | 
106 | =head2 Optional options
107 | 
108 | =over 20
109 | 
110 | =item B<-h>, B<-help>
111 | 
112 | Help (perldoc POD)
113 | 
114 | =item B<-a>, B<-all_hits>
115 | 
116 | Don't filter RPS-BLAST+ output for the best hit, rather assign COGs
117 | to all hits
118 | 
119 | =item B<-v>, B<-version>
120 | 
121 | Print version number to C<STDERR>
122 | 
123 | =back
124 | 
125 | =head1 OUTPUT
126 | 
127 | =over 20
128 | 
129 | =item C<STDOUT>
130 | 
131 | Overall assignment statistics
132 | 
133 | =item F<./results>
134 | 
135 | All tab-delimited output files are stored in this result folder
136 | 
137 | =item F<rps-blast_cog.txt>
138 | 
139 | COG assignments concatenated to the RPS-BLAST+ results for filtering
140 | 
141 | =item F<protein-id_cog.txt>
142 | 
143 | Slimmed down F<rps-blast_cog.txt> only including query id (first
144 | BLAST report column), COGs, and functional categories
145 | 
146 | =item F<cog_stats.txt>
147 | 
148 | Assignment counts for each used COG
149 | 
150 | =item F<func_stats.txt>
151 | 
152 | Assignment counts for single-letter functional categories
153 | 
154 | =back
155 | 
156 | =head1 EXAMPLES
157 | 
158 | =head2 RPS-BLAST+
159 | 
160 | =over
161 | 
162 | =item C<rpsblast -query protein.fasta -db Cog -out rps-blast.out
163 | -evalue 1e-2 -outfmt 6>
164 | 
165 | =item C<rpsblast -query protein.fasta -db Cog -out rps-blast.out
166 | -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen
167 | qstart qend sstart send evalue bitscore qcovs'>
168 | 
169 | =back
170 | 
171 | =head2 C<cdd2cog.pl>
172 | 
173 | =over
174 | 
175 | =item C<perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt
176 | -w whog -a>
177 | 
178 | =back
179 | 
180 | =head1 VERSION
181 | 
182 |  0.2                                               update: 2017-02-16
183 |  0.1                                                       2013-08-01
184 | 
185 | =head1 AUTHOR
186 | 
187 |  Andreas Leimbach                         aleimba[at]gmx[dot]de
188 | 
189 | =head1 ACKNOWLEDGEMENTS
190 | 
191 | I got the idea for using NCBI's CDD PSSMs for COG assignment from JGI's L<IMG/ER annotation
192 | system|http://img.jgi.doe.gov/>, which employs the same technique.
193 | 
194 | 
195 | =head1 LICENSE
196 | 
197 | This program is free software: you can redistribute it and/or modify
198 | it under the terms of the GNU General Public License as published by
199 | the Free Software Foundation; either version 3 (GPLv3) of the
200 | License, or (at your option) any later version.
201 | 
202 | This program is distributed in the hope that it will be useful, but
203 | WITHOUT ANY WARRANTY; without even the implied warranty of
204 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
205 | General Public License for more details.
206 | 
207 | You should have received a copy of the GNU General Public License
208 | along with this program. If not, see L<http://www.gnu.org/licenses/>.
209 | 
210 | =cut
211 | 
212 | 
213 | ########
214 | # MAIN #
215 | ########
216 | 
217 | use strict;
218 | use warnings;
219 | use autodie;
220 | use Getopt::Long;
221 | use Pod::Usage;
222 | 
223 | 
224 | ### Get the options with Getopt::Long
225 | my $Rps_Report; # path to the rps-blast report/output
226 | my $CDDid_File; # path to the CDD 'cddid.tbl' file
227 | my $Fun_File; # path to the COG 'fun.txt' file
228 | my $Whog_File; # path to the COG 'whog' file
229 | my $Opt_All_Hits; # give all blast hits for a query, not just the best (lowest evalue)
230 | my $VERSION = 0.2;
231 | my ($Opt_Version, $Opt_Help);
232 | GetOptions ('rps_report=s' => \$Rps_Report,
233 |             'cddid=s' => \$CDDid_File,
234 |             'fun=s' => \$Fun_File,
235 |             'whog=s' => \$Whog_File,
236 |             'all_hits' => \$Opt_All_Hits,
237 |             'version' => \$Opt_Version,
238 |             'help|?' => \$Opt_Help);
239 | 
240 | 
241 | 
242 | ### Run perldoc on POD
243 | pod2usage(-verbose => 2) if ($Opt_Help);
244 | die "$0 $VERSION\n" if ($Opt_Version);
245 | if (!$Rps_Report || !$CDDid_File || !$Fun_File || !$Whog_File) {
246 |     my $warning = "\n### Fatal error: Option(s) or arguments for '-r', '-c', '-f', or '-w' are missing!\n";
247 |     pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
248 | }
249 | 
250 | 
251 | 
252 | ### Parse the 'cddid.tbl', 'fun.txt' and 'whog' file contents and store info in hash structures
253 | my (%CDDid, %Fun, %Whog); # global hashes
254 | parse_cdd_cog(); # subroutine
255 | 
256 | 
257 | 
258 | ### Create results directory for output files
259 | my $Out_Dir = './results/';
260 | if (-e $Out_Dir) {
261 |     print "### Directory '$Out_Dir' already exists! Replace the directory and all its contents [y|n]? ";
262 |     my $user_ask = <STDIN>;
263 |     if ($user_ask =~ /^\s*y/i) { # only accept answers starting with 'y', not any input containing a 'y'
264 |         unlink glob "$Out_Dir*"; # remove all files in results directory
265 |         rmdir $Out_Dir; # remove the empty directory
266 |     } else {
267 |         die "Script aborted!\n";
268 |     }
269 | }
270 | mkdir $Out_Dir or die "Can't create directory \"$Out_Dir\": $!\n";
271 | 
272 | 
273 | 
274 | ### Parse the rps-blast report/output file and assign COGs
275 | my %Cog_Stats; # store the total number of query protein hits for each COG, written to '$Cogstats_Out' below
276 | 
277 | my $Blast_Out = 'rps-blast_cog.txt'; # output file for COG assignments appended to RPS-BLAST results
278 | open (my $Blast_Out_Fh, ">", "$Out_Dir"."$Blast_Out");
279 | print $Blast_Out_Fh "query id\tsubject id\t% identity\talignment length\tmismatches\tgap opens\tq. start\tq. end\ts. start\ts. end\tevalue\tbit score\tCOG#\tfunctional categories\t\t\t\t\tCOG protein description\n"; # header for $Blast_Out
280 | 
281 | my $Locus_Cog = "protein-id_cog.txt"; # slimmed down $Blast_Out only including locus_tags, COGs, and functional categories
282 | open (my $Locus_Cog_Fh, ">", "$Out_Dir"."$Locus_Cog");
283 | 
284 | print "Parsing RPS-BLAST report ...\n"; # status message
285 | my $Skip = ''; # only keep best blast hit per query (lowest e-value), unless option 'all_hits' is given
286 | open (my $Rps_Report_Fh, "<", "$Rps_Report");
287 | while (<$Rps_Report_Fh>) {
288 |     if (/^#/) { # skip comment lines in blast report for BLAST+ "outfmt 7"
289 |         next;
290 |     }
291 |     chomp;
292 | 
293 |     my @line = split(/\t/, $_); # split tab-separated RPS-BLAST report
294 | 
295 |     if ($Skip eq $line[0] && !$Opt_All_Hits) {
296 |         # only keep best blast hit per query, only if option 'all_hits' is NOT set
297 |         # $line[0] is query id, and should be locus_tag or specific ID from multi-fasta protein query file
298 |         next;
299 |     }
300 |     $Skip = $line[0];
301 | 
302 |     my ($pssm_id) = $line[1] =~ /^CDD\:(\d+)/; # get PSSM-Id from the subject hit (list assignment avoids the 'my ... if' pitfall)
303 |     my $cog = $CDDid{$pssm_id}; # get the COG# according to the PSSM-Id as listed in 'cddid.tbl'
304 |     $Cog_Stats{$cog}++; # increment hit-number for specific COG
305 | 
306 |     ### Collect functional categories stats
307 |     my @functions = split('', $Whog{$cog}->{'function'}); # split the single-letter functional categories to count them and join them as tab-separated below
308 |     foreach (@functions) {
309 |         $Fun{$_}->{'count'}++; # increment hit-number for specific functional category
310 |     }
311 | 
312 |     ### Print to result files
313 |     my $functions = join("\t", @functions); # join functional categories tab-separated
314 |     print $Locus_Cog_Fh "$line[0]\t$cog\t$functions\n"; # locus_tag\tCOG\tfunctional categories
315 |     $functions .= "\t" x (5 - @functions); # add additional tabs for COGs with fewer than five functions (which is the maximum number)
316 |     print $Blast_Out_Fh "$_\t$cog\t$functions\t$Whog{$cog}->{'desc'}\n"; # $_ = RPS-BLAST line
317 | }
318 | 
319 | close $Rps_Report_Fh;
320 | close $Blast_Out_Fh;
321 | close $Locus_Cog_Fh;
322 | 
323 | 
324 | 
325 | ### Total COG and functional categories stats
326 | print "Writing assignment statistic files in '$Out_Dir' folder ...\n"; # status message
327 | 
328 | my $Cogstats_Out = 'cog_stats.txt'; # output file for assignment numbers for each COG
329 | open (my $Cog_Stats_Fh, ">", "$Out_Dir"."$Cogstats_Out");
330 | my $prot_stats = 0; # store total number of query proteins, which have a COG assignment
331 | foreach my $cog (sort keys %Cog_Stats) {
332 |     print $Cog_Stats_Fh "$cog\t$Whog{$cog}->{'desc'}\t$Cog_Stats{$cog}\n"; # COG protein descriptions stored in %Whog
333 |     $prot_stats += $Cog_Stats{$cog}; # sum up total COG assignments
334 | }
335 | close $Cog_Stats_Fh;
336 | 
337 | my $Funcstats_Out = 'func_stats.txt'; # output file for assignment numbers for each functional category
338 | open (my $Func_Stats_Fh, ">", "$Out_Dir"."$Funcstats_Out");
339 | my $func_cats = 0; # store total number of assigned functional categories
340 | foreach my $func (sort keys %Fun) {
341 |     print $Func_Stats_Fh "$func\t$Fun{$func}->{'desc'}\t$Fun{$func}->{'count'}\n";
342 |     $func_cats += $Fun{$func}->{'count'}; # sum up total functional category assignments
343 | }
344 | close $Func_Stats_Fh;
345 | 
346 | 
347 | 
348 | ### State which files were created and print overall statistics
349 | print "\n############################################################################\n";
350 | print "The following tab-delimited files were created in the '$Out_Dir' directory:\n";
351 | print "- $Blast_Out: COG assignments concatenated to the RPS-BLAST results for filtering\n";
352 | print "- $Locus_Cog: Slimmed down '$Blast_Out' only including query id (first BLAST report column), COGs, and functional categories\n";
353 | print "- $Cogstats_Out: COG assignment counts\n";
354 | print "- $Funcstats_Out: Functional category assignment counts\n";
355 | print "##############################################################################\n";
356 | print "Overall assignment statistics:\n";
357 | print "~ Total query proteins categorized into COGs: $prot_stats\n";
358 | print "~ Total COGs used for the query proteins [of ", scalar keys %CDDid, " overall]: ", scalar keys %Cog_Stats, "\n";
359 | print "~ Total number of assigned functional categories: $func_cats\n";
360 | print "~ Total functional categories used for the query proteins [of ", scalar keys %Fun, " overall]: ", scalar grep ($Fun{$_}->{'count'} > 0, keys %Fun), "\n\n"; # grep for functional categories with a count > 0 to get the ones with assigned query proteins
361 | 
362 | exit;
363 | 
364 | 
365 | 
366 | ###############
367 | # Subroutines #
368 | ###############
369 | 
370 | ### Subroutine to parse the 'cddid.tbl', 'fun.txt' and 'whog' file contents and store in hash structures
371 | sub parse_cdd_cog {
372 | 
373 |     ### 'cddid.tbl'
374 |     open (my $cddid_fh, "<", "$CDDid_File");
375 |     print "\nParsing CDDs '$CDDid_File' file ...\n"; # status message
376 |     while (<$cddid_fh>) {
377 |         chomp;
378 |         my @line = split(/\t/, $_); # split line at the tabs
379 |         if ($line[1] =~ /^COG\d{4}$/) { # search for COG CD accessions in cddid
380 |             $CDDid{$line[0]} = $line[1]; # hash to store info; $line[0] = PSSM-Id
381 |         }
382 |     }
383 |     close $cddid_fh;
384 | 
385 |     ### 'fun.txt'
386 |     open (my $fun_fh, "<", "$Fun_File");
387 |     print "Parsing COGs '$Fun_File' file ...\n"; # status message
388 |     while (<$fun_fh>) {
389 |         chomp;
390 |         $_ =~ s/^\s*|\s+$//g; # get rid of all leading and trailing whitespaces
391 |         if (/^\[(\w)\]\s*(.+)$/) {
392 |             $Fun{$1} = {'desc' => $2, 'count' => 0}; # anonymous hash in hash
393 |             # $1 = single-letter functional category, $2 = description of functional category
394 |             # count used to find functional categories not present in the query proteins for final overall assignment statistics
395 |         }
396 |     }
397 |     close $fun_fh;
398 | 
399 |     ### 'whog'
400 |     open (my $whog_fh, "<", "$Whog_File");
401 |     print "Parsing COGs '$Whog_File' file ...\n"; # status message
402 |     while (<$whog_fh>) {
403 |         chomp;
404 |         $_ =~ s/^\s*|\s+$//g; # get rid of all leading and trailing whitespaces
405 |         if (/^\[(\w+)\]\s*(COG\d{4})\s+(.+)$/) {
406 |             $Whog{$2} = {'function' => $1, 'desc' => $3}; # anonymous hash in hash
407 |             # $1 = single-letter functional categories, maximal five per COG (only COG5032 with five)
408 |             # $2 = COG#, $3 = COG protein description
409 |         }
410 |     }
411 |     close $whog_fh;
412 | 
413 |     return 1;
414 | }
415 | 


--------------------------------------------------------------------------------
/cds_extractor/README.md:
--------------------------------------------------------------------------------
  1 | cds_extractor
  2 | =============
  3 | 
  4 | `cds_extractor.pl` is a script to extract amino acid or nucleotide sequences from coding sequence (CDS) features in annotated genomes.
  5 | 
  6 | * [Synopsis](#synopsis)
  7 | * [Description](#description)
  8 | * [Usage](#usage)
  9 |   * [Extract amino acid sequences](#extract-amino-acid-sequences)
 10 |   * [Extract nucleotide sequences](#extract-nucleotide-sequences)
 11 |   * [UNIX loop to extract sequences from all files in the current working directory](#unix-loop-to-extract-sequences-from-all-files-in-the-current-working-directory)
 12 | * [Options](#options)
 13 |   * [Mandatory options](#mandatory-options)
 14 |   * [Optional options](#optional-options)
 15 | * [Output](#output)
 16 | * [Dependencies](#dependencies)
 17 | * [Run environment](#run-environment)
 18 | * [Author - contact](#author---contact)
 19 | * [Citation, installation, and license](#citation-installation-and-license)
 20 | * [Changelog](#changelog)
 21 | 
 22 | ## Synopsis
 23 | 
 24 |     perl cds_extractor.pl -i seq_file.[embl|gbk] -p
 25 | 
 26 | ## Description
 27 | 
 28 | This script extracts protein or DNA sequences of CDS features from a (multi)-RichSeq file (e.g. EMBL or GENBANK format) and writes them to a multi-FASTA file. The FASTA header of each CDS uses the locus tag as identifier or, if that is not available, the protein ID, the gene, or an internal CDS counter (in this order). The organism info also includes possible plasmid names. Pseudogenes (tagged by **/pseudo**) are not included (except in the CDS counter).
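
The identifier fallback described above can be sketched as follows. This is a minimal illustration, not the code of `cds_extractor.pl`: the sub name `choose_fasta_id`, the hash layout of the tags, and the `_` separator between prefix and counter are assumptions made for the example.

```perl
use strict;
use warnings;

# Pick a FASTA ID for one CDS from its annotation tags (hypothetical hash layout).
# Falls back in the documented order: locus_tag, protein_id, gene, internal counter.
sub choose_fasta_id {
    my ($tags, $prefix, $counter) = @_;
    foreach my $tag (qw(locus_tag protein_id gene)) {
        return $tags->{$tag} if defined $tags->{$tag} && length $tags->{$tag};
    }
    return $prefix . '_' . $counter; # the '_' separator is an assumption
}

# No locus tag present: the protein ID wins over the gene name
print choose_fasta_id({ gene => 'thrA', protein_id => 'NP_414543.1' }, 'CDS', 7), "\n";
# No usable tag at all: fall back to prefix plus counter
print choose_fasta_id({}, 'CDS', 7), "\n";
```

The prefix argument corresponds to what option **-c** sets (default 'CDS').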
 29 | 
 30 | In addition to the identifier, FASTA headers include gene (**g=**), product (**p=**), organism (**o=**), and EC numbers (**ec=**), if these are present for a CDS. Individual EC numbers are separated by **semicolons**. The location/position (**l=** start..stop) of a CDS will always be included. If gene is used as FASTA header ID, '**g=** gene' will only be included with option **-f**.
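
For downstream processing, such headers can be split back into their fields. The sketch below assumes a header layout of `>ID g=... p=... o=... ec=...;... l=start..stop` with whitespace between fields; the exact layout written by `cds_extractor.pl` is not shown here, so treat the example header and the sub name `parse_cds_header` as hypothetical.

```perl
use strict;
use warnings;

# Split an (assumed) cds_extractor-style FASTA header into its tagged fields.
sub parse_cds_header {
    my ($header) = @_;
    $header =~ s/^>//;
    # split only at whitespace that is followed by one of the known tags,
    # so values containing spaces (e.g. organism names) stay intact
    my ($id, @fields) = split /\s+(?=(?:g|p|o|ec|l)=)/, $header;
    my %info = (id => $id);
    foreach my $field (@fields) {
        my ($key, $value) = split /=/, $field, 2;
        $info{$key} = $value;
    }
    # EC numbers are semicolon-separated
    $info{ec} = [ split /;/, $info{ec} ] if defined $info{ec};
    # location as start..stop, tolerating '<' or '>' boundary markers
    @info{qw(start stop)} = ($1, $2)
        if defined $info{l} && $info{l} =~ /(\d+)\.\.[<>]?(\d+)/;
    return \%info;
}

my $cds = parse_cds_header('>b0002 g=thrA p=aspartokinase o=Escherichia coli ec=2.7.2.4;1.1.1.3 l=337..2799');
print "$cds->{id}\t$cds->{g}\t$cds->{start}-$cds->{stop}\t", join(',', @{ $cds->{ec} }), "\n";
```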
 31 | 
 32 | Fuzzy locations in the feature table of a sequence file are not taken into consideration for **l=**. If you set options **-u** and/or **-d** and the feature location overlaps a **circular** replicon boundary, positions are marked with '<' or '>' in the direction of the exceeded boundary. Features with overlapping locations in **linear** sequences (e.g. contigs) will be skipped and are **not** included in the output! A CDS feature is on the lagging strand if start > stop in the location. In the special case of overlapping circular sequence boundaries this is reversed.
 33 | 
 34 | Of course, the **l=** positions are separate for each sequence in a multi-sequence file. Thus, if you want continuous positions for the CDSs, run these files first through [`cat_seq.pl`](/cat_seq).
 35 | 
 36 | Optionally, a file with locus tags can be given to extract only these CDS features with option **-l** (each locus tag in a new line).
 37 | 
 38 | ## Usage
 39 | 
 40 | ### Extract amino acid sequences
 41 | 
 42 |     perl cds_extractor.pl -i Ecoli_MG1655.gbk -p [-l locus_tags.txt -c MG1655 -f]
 43 | 
 44 | ### Extract nucleotide sequences
 45 | 
 46 |     perl cds_extractor.pl -i Banthracis_Ames.embl -n [-l locus_tags.txt -u 100 -d 20 -c Ames -f]
 47 | 
 48 | ### UNIX loop to extract sequences from all files in the current working directory
 49 | 
 50 |     for file in *.embl; do perl cds_extractor.pl -i "$file" -p [-l locus_tags.txt]; done
 51 | 
 52 | ## Options
 53 | 
 54 | ### Mandatory options
 55 | 
 56 | * **-i**=_str_, **-input**=_str_
 57 | 
 58 |     Input RichSeq sequence file including CDS annotation (e.g. EMBL or GENBANK format)
 59 | 
 60 | * **-p**, **-protein**
 61 | 
 62 |     Extract **protein** sequence for each CDS feature, excludes option **-n**
 63 | 
 64 | **or**
 65 | 
 66 | * **-n**, **-nucleotide**
 67 | 
 68 |     Extract **nucleotide** sequence for each CDS feature, excludes option **-p**
 69 | 
 70 | ### Optional options
 71 | 
 72 | * **-h**, **-help**
 73 | 
 74 |     Help (perldoc POD)
 75 | 
 76 | * **-u**=_int_, **-upstream**=_int_
 77 | 
 78 |     Include given number of flanking nucleotides upstream of each CDS feature, forces option **-n**
 79 | 
 80 | * **-d**=_int_, **-downstream**=_int_
 81 | 
 82 |     Include given number of flanking nucleotides downstream of each CDS feature, forces option **-n**
 83 | 
 84 | * **-c**=_str_, **-cds_prefix**=_str_
 85 | 
 86 |     Prefix for the internal CDS counter [default = 'CDS']
 87 | 
 88 | * **-l**=_str_, **-locustag_list**=_str_
 89 | 
 90 |     List of locus tags to extract only those (each locus tag on a new line)
 91 | 
 92 | * **-f**, **-full_header**
 93 | 
 94 |     If gene is used as ID include additionally '**g=** gene' in FASTA headers, so downstream analyses can recognize the gene tag (e.g. [`prot_finder.pl`](/prot_finder)).
 95 | 
 96 | * **-v**, **-version**
 97 | 
 98 |     Print version number to *STDERR*
 99 | 
100 | ## Output
101 | 
102 | * \*.faa
103 | 
104 |     Multi-FASTA file of CDS protein sequences
105 | 
106 | **or**
107 | 
108 | * \*.ffn
109 | 
110 |     Multi-FASTA file of CDS DNA sequences
111 | 
112 | * (no_annotation_err.txt)
113 | 
114 |     Lists input files missing CDS annotation, script exited with **fatal error** i.e. no FASTA output file
115 | 
116 | * (double_id_err.txt)
117 | 
118 |     Lists input files with ambiguous FASTA IDs, script exited with **fatal error** i.e. no FASTA output file
119 | 
120 | * (locus_tag_missing_err.txt)
121 | 
122 |     Lists CDS features without locus tags
123 | 
124 | * (linear_seq_cds_overlap_err.txt)
125 | 
126 |     Lists CDS features overlapping sequence border of a **linear** molecule, which are **not** included in the result multi-FASTA file
127 | 
128 | ## Dependencies
129 | 
130 | * [BioPerl](http://www.bioperl.org) (tested with version 1.006923)
131 | 
132 | ## Run environment
133 | 
134 | The Perl script runs under Windows and UNIX flavors.
135 | 
136 | ## Author - contact
137 | 
138 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
139 | 
140 | ## Citation, installation, and license
141 | 
142 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
143 | 
144 | ## Changelog
145 | 
146 | * v0.7.1 (26.10.2015)
147 |     - changed output file extensions from **\_cds\_aa.fasta* or **\_cds\_nuc.fasta* to **.faa* or **.ffn*, respectively
148 |     - minor syntax changes in README, included TOC
149 |     - minor syntax changes in POD
150 | * v0.7 (31.03.2014)
151 |     - location (l=) and EC numbers (ec=) for CDS features are included in the FASTA header
152 |     - 'ec=', 'g=', 'p=', and 'o=' only included in FASTA header if these tags are present for a CDS feature, or additionally for 'g=' with option **-f**
153 |     - if, with options '-u' and/or '-d', the location of a CDS feature overlaps a sequence boundary, the positions are marked with '<' or '>' in 'l='
154 |     - additionally, CDS features whose location overlaps the sequence boundary of a linear molecule will not be included in the output, but IDs written to an error file
155 |     - new option **-c** to choose prefix for internal CDS counter
156 |     - /product feature value will not be used as FASTA ID anymore, skip directly to internal CDS counter, if /locus_tag, /protein_id, or /gene is missing for a CDS (too many 'hypothetical proteins')
157 |     - internal CDS counter counts all CDSs of multi-sequence files sequentially (doesn't restart with each new sequence in the multi-sequence file)
158 |     - 'control_double' subroutine also called if /gene is used as FASTA ID
159 |     - fixed bug introduced in v0.6 to exit if no CDS primary features found, because a draft multi-sequence file might have unannotated small contigs
160 |     - new error files: no_annotation_err.txt, double_id_err.txt, linear_seq_cds_overlap_err.txt (the first two come in handy if you run `cds_extractor.pl` in a UNIX loop with many files)
161 |     - included 'use autodie'
162 |     - included version switch
163 |     - included pod2usage with Pod::Usage
164 |     - reorganized code into more subroutines to remove useless double codings (which contained also some bugs) and to make the script more concise
165 |     - minor changes to Perl syntax
166 | * v0.6 (06.06.2013)
167 |     - exit with error if no CDS primary features present in input file, as /translation feature only present in CDS features (some GENBANK files are only annotated with 'gene')
168 |     - included Bio::SeqFeatureI's method *spliced-seq* for CDS with split nucleotide sequences (CDS position indicated by 'join')
169 |     - minor changes how the optional list of locus tags is handled
170 | * v0.5 (03.06.2013)
171 |     - included a POD
172 |     - options with Getopt::Long
173 |     - option **-n** to alternatively extract nucleotide sequences for CDS features (optionally with upstream and downstream sequences)
174 |     - option to include full FASTA ID header for downstream [`prot_finder.pl`](/prot_finder) analysis
175 |     - exit with error if the values for two (or more) /locus_tag or /protein_id tags are not unambiguous
176 |     - print message to *STDOUT* if and which locus tags were not found in a given locus tag list (option **-l**)
177 | * v0.4 (06.02.2013)
178 |     - replace whitespaces of /product values with underscores
179 | * v0.3 (06.09.2012)
180 |     - internal CDS counter to use in FASTA ID for CDS features without a /locus_tag, /protein_id, /gene, or /product tag
181 |     - include also organism (and possible plasmid) information in FASTA ID lines
182 |     - give a warning to *STDOUT* if a CDS feature without a /locus_tag is found (but only for the first occurrence)
183 |     - additionally, *locus_tag_errors.txt* error file to list all CDSs without locus tags
184 |     - catch errors with *eval* if a tag is missing
185 | * v0.2 (04.09.2012)
186 |     - if a CDS feature does not have a /locus_tag, then use the value for /protein_id, /gene, or /product (in this order) in the FASTA ID lines of the result file
187 |     - optional extract only CDSs with locus tags given in a file
188 | * v0.1 (24.05.2012)
189 | 


--------------------------------------------------------------------------------
/ecoli_mlst/README.md:
--------------------------------------------------------------------------------
  1 | ecoli_mlst
  2 | ==========
  3 | 
  4 | `ecoli_mlst.pl` is a script to determine MLST sequence types for *E. coli* genomes and extract allele sequences.
  5 | 
  6 | * [Synopsis](#synopsis)
  7 | * [Description](#description)
  8 | * [Usage](#usage)
  9 | * [Options](#options)
 10 |   * [Mandatory options](#mandatory-options)
 11 |   * [Optional options](#optional-options)
 12 | * [Output](#output)
 13 | * [Run environment](#run-environment)
 14 | * [Author - contact](#author---contact)
 15 | * [Citation, installation, and license](#citation-installation-and-license)
 16 | * [Changelog](#changelog)
 17 | 
 18 | # Synopsis
 19 | 
 20 |     perl ecoli_mlst.pl -a fas -g fasta
 21 | 
 22 | # Description
 23 | 
 24 | The script searches for multilocus sequence type (MLST) alleles in *E. coli* genomes according to
 25 | Mark Achtman's scheme with seven house-keeping genes (*adk*, *fumC*, *gyrB*,
 26 | *icd*, *mdh*, *purA*, and *recA*) [Wirth et al., 2006]. *NUCmer* from the
 27 | [*MUMmer package*](http://mummer.sourceforge.net/) is used to compare the given allele
 28 | sequences to bacterial genomes via nucleotide alignments.
 29 | 
 30 | Download the allele files (adk.fas ...) and the sequence type file
 31 | ('publicSTs.txt') from this website:
 32 |     http://mlst.ucc.ie/mlst/dbs/Ecoli
 33 | 
 34 | To run `ecoli_mlst.pl` include all *E. coli* genome files (file
 35 | extension e.g. 'fasta'), all allele sequence files (file extension
 36 | 'fas') and 'publicSTs.txt' in the current working directory. The
 37 | allele profiles are parsed from the created \*.coord files and written
 38 | to a result file, plus additional information from the file
 39 | 'publicSTs.txt'. Also, the corresponding allele sequences (obtained
 40 | from the allele input files) are concatenated for each *E. coli* genome
 41 | into a result multi-fasta file. Option **-c** can be used to initiate
 42 | an alignment for this multi-fasta file with [*ClustalW*](http://www.clustal.org/clustal2/) (standard
 43 | alignment parameters; has to be in the `$PATH` or change variable
 44 | `$clustal_call`). The alignment fasta output file can be used
 45 | directly for [*RAxML*](http://sco.h-its.org/exelixis/web/software/raxml/index.html). CAREFUL: the Phylip alignment format from
 46 | *ClustalW* allows only 10 characters per strain ID.
 47 | 
 48 | `ecoli_mlst.pl` works with complete and draft genomes. However, several genomes cannot be included in a single input file!
 49 | 
 50 | Obviously, results can only be obtained for those genomes whose allele
 51 | sequences have been deposited in Achtman's allele database. If an
 52 | allele is not found in a genome it is marked by a '?' in the result
 53 | profile file and a place holder 'XXX' in the result fasta file. For
 54 | these cases a manual *NUCmer* or *BLASTN* might be useful to fill the
 55 | gaps and [`run_sub_seq.pl`](/run_sub_seq) to get the corresponding 'new' allele
 56 | sequences.
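
How the placeholders end up in the two result files can be sketched as follows. This is an illustration under assumptions, not the logic of `ecoli_mlst.pl` itself (which parses the *NUCmer* \*.coord files): the sub name `build_profile`, the hash of found alleles, and the allele numbers and sequences are made up for the example.

```perl
use strict;
use warnings;

# The seven loci of the scheme, in profile order
my @loci = qw(adk fumC gyrB icd mdh purA recA);

# Build the tab-separated allele profile and the concatenated allele sequence
# for one genome; $found is a hypothetical hash ref: locus => [allele number, sequence].
sub build_profile {
    my ($found) = @_;
    my @profile;
    my $concat = '';
    foreach my $locus (@loci) {
        if (my $hit = $found->{$locus}) {
            push @profile, $hit->[0];
            $concat .= $hit->[1];
        } else {
            push @profile, '?'; # allele not found: '?' in the profile file
            $concat .= 'XXX';   # and placeholder 'XXX' in the fasta file
        }
    }
    return (join("\t", @profile), $concat);
}

my ($profile, $seq) = build_profile({ adk => [10, 'AAA'], recA => [2, 'CCC'] });
print "$profile\n$seq\n";
```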
 57 | 
 58 | Non-NCBI fasta headers for the genome files have to have a
 59 | unique ID directly following the '>' (e.g. 'Sakai', '55989' ...).
 60 | 
 61 | # Usage
 62 | 
 63 |     perl ecoli_mlst.pl -a fas -g fasta -c
 64 | 
 65 | # Options
 66 | 
 67 | ## Mandatory options
 68 | 
 69 | - -a, -alleles
 70 | 
 71 |     File extension of the MLST allele fasta files, e.g. 'fas' (<=> **-g**).
 72 | 
 73 | - -g, -genomes
 74 | 
 75 |     File extension of the *E. coli* genome fasta files, e.g. 'fasta' (<=> **-a**).
 76 | 
 77 | ## Optional options
 78 | 
 79 | - -h, -help
 80 | 
 81 |     Help (perldoc POD)
 82 | 
 83 | - -c, -clustalw
 84 | 
 85 |     Call [*ClustalW*](http://www.clustal.org/clustal2/) for alignment
 86 | 
 87 | # Output
 88 | 
 89 | - ecoli_mlst_profile.txt
 90 | 
 91 |     Tab-separated allele profiles for the *E. coli* genomes, plus additional info from 'publicSTs.txt'
 92 | 
 93 | - ecoli_mlst_seq.fasta
 94 | 
 95 |     Multi-fasta file of all concatenated allele sequences for each genome
 96 | 
 97 | - *.coord
 98 | 
 99 |     Text files that contain the coordinates of the *NUCmer* hits for each genome and allele
100 | 
101 | - (errors.txt)
102 | 
103 |     Error file, summarizing number of not found alleles or unclear *NUCmer* hits
104 | 
105 | - (ecoli_mlst_seq_aln.fasta)
106 | 
107 |     Optional, [*ClustalW*](http://www.clustal.org/clustal2/) alignment in Phylip format
108 | 
109 | - (ecoli_mlst_seq_aln.dnd)
110 | 
111 |     Optional, *ClustalW* alignment guide tree
112 | 
113 | # Run environment
114 | 
115 | The Perl script runs only under UNIX flavors.
116 | 
117 | # Author - contact
118 | 
119 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
120 | 
121 | # Citation, installation, and license
122 | 
123 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
124 | 
125 | # Changelog
126 | 
127 | * v0.3 (30.01.2013)
128 |     - additional info in POD
129 |     - check if result files already exist and ask user what to do
130 |     - changed script name from `ecoli_mlst_alleles.pl` to `ecoli_mlst.pl`
131 | * v0.2 (20.10.2012)
132 |     - included a POD
133 |     - options with Getopt::Long
134 |     - don't consider input *E. coli* genome query files, which are too big (set cutoff at 9 MB for a fasta *E. coli* file)
135 |     - draft *E. coli* genomes can now be used as input query files
136 |     - additional info in 'publicSTs.txt' now associated to found ST types in output
137 |     - give text to STDOUT which files were created
138 |     - new option **-c** to align the resulting allele sequences via *ClustalW*
139 | * v0.1 (25.10.2011)
140 | 


--------------------------------------------------------------------------------
/ecoli_mlst/publicSTs.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aleimba/bac-genomics-scripts/1b2388fb9f5870a4aafa3e070823f9286178d3b1/ecoli_mlst/publicSTs.txt


--------------------------------------------------------------------------------
/genomes_feature_table/README.md:
--------------------------------------------------------------------------------
  1 | genomes_feature_table
  2 | =====================
  3 | 
  4 | `genomes_feature_table.pl` is a script to create a feature table for genomes in EMBL and GENBANK format.
  5 | 
  6 | * [Synopsis](#synopsis)
  7 | * [Description](#description)
  8 | * [Usage](#usage)
  9 | * [Options](#options)
 10 | * [Output](#output)
 11 | * [Run environment](#run-environment)
 12 | * [Dependencies](#dependencies)
 13 | * [Author - contact](#author---contact)
 14 | * [Citation, installation, and license](#citation-installation-and-license)
 15 | * [Changelog](#changelog)
 16 | 
 17 | ## Synopsis
 18 | 
 19 |     perl genomes_feature_table.pl path/to/genome_dir > feature_table.tsv
 20 | 
 21 | ## Description
 22 | 
 23 | A genome feature table lists basic stats/info (e.g. genome size, GC
 24 | content, coding percentage, accession number(s)) and the numbers of
 25 | annotated primary features (e.g. CDS, genes, RNAs) of genomes. It
 26 | can be used to have an overview of these features in different
 27 | genomes, e.g. in comparative genomics publications.
 28 | 
 29 | `genomes_feature_table.pl` is designed to extract (or calculate)
 30 | these basic stats and **all** annotated primary features from RichSeq
 31 | files (**EMBL** or **GENBANK** format) in a specified directory (with the
 32 | correct file extension, see option **-e**). The **default** directory
 33 | is the current working directory. The primary features are
 34 | counted and the results for each genome printed in tab-separated
 35 | format. It is a requirement that each file contains **only one**
 36 | genome (complete or draft, with or without plasmids).
 37 | 
 38 | The most important features will be listed first, like genome
 39 | description, genome size, GC content, coding percentage (calculated
 40 | based on non-pseudo CDS annotation), CDS and gene numbers, accession
 41 | number(s) (first..last in the sequence file), RNAs (rRNA, tRNA,
 42 | tmRNA, ncRNA), and unresolved bases (IUPAC code 'N'). If plasmids are
 43 | annotated in a sequence file, the number of plasmids is
 44 | counted and listed as well (needs a */plasmid="plasmid_name"* tag in the
 45 | *source* primary tag, see e.g. Genbank accession number
 46 | [CP009167](http://www.ncbi.nlm.nih.gov/nuccore/CP009167)). Use option **-p**
 47 | to list plasmids as separate entries (lines) in the feature table.
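
As a rough illustration of three of the listed stats (genome size, GC content, unresolved bases), the following sketch computes them for a plain nucleotide string. The sub name `seq_stats` is hypothetical, and relating GC content to the total length including 'N' is an assumption; the script itself reads RichSeq files and may calculate this differently.

```perl
use strict;
use warnings;

# Basic sequence stats for one genome given as a plain nucleotide string.
sub seq_stats {
    my ($seq) = @_;
    my $length = length $seq;
    my $gc = ($seq =~ tr/GCgc//); # tr with an empty replacement only counts
    my $n  = ($seq =~ tr/Nn//);   # unresolved bases (IUPAC code 'N')
    my $gc_percent = $length ? 100 * $gc / $length : 0; # assumption: N counted in the length
    return ($length, $gc_percent, $n);
}

my ($size, $gc, $unresolved) = seq_stats('ATGCGCNNAT');
printf "size=%d bp, GC=%.1f%%, N=%d\n", $size, $gc, $unresolved;
```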
 48 | 
 49 | For draft genomes the number of contigs/scaffolds is counted. All
 50 | contigs/scaffolds of draft genomes should be marked with the *WGS*
 51 | keyword (see e.g. draft NCBI Genbank entry
 52 | [JSAY00000000](http://www.ncbi.nlm.nih.gov/nuccore/JSAY00000000)). If this is
 53 | not the case for your file(s) you can add those keywords to each
 54 | sequence entry with the following Perl one-liners (will
 55 | edit files in place). For files in **GENBANK** format if 'KEYWORDS&nbsp;&nbsp;&nbsp;&nbsp;.' is present
 56 | 
 57 |     perl -i -pe 's/^KEYWORDS(\s+)\./KEYWORDS$1WGS\./' file
 58 | 
 59 | or if 'KEYWORDS' isn't present at all
 60 | 
 61 |     perl -i -ne 'if(/^ACCESSION/){ print; print "KEYWORDS    WGS.\n";} else{ print;}' file
 62 | 
 63 | For files in **EMBL** format if 'KW&nbsp;&nbsp;&nbsp;.' is present
 64 | 
 65 |     perl -i -pe 's/^KW(\s+)\./KW$1WGS\./' file
 66 | 
 67 | or if 'KW' isn't present at all
 68 | 
 69 |     perl -i -ne 'if(/^DE/){ $dw=1; print;} elsif(/^XX/ && $dw){ print; $dw=0; print "KW   WGS.\n";} else{ print;}' file
 70 | 
 71 | ## Usage
 72 | 
 73 |     perl genomes_feature_table.pl -p -e gb,gbk > feature_table_plasmids.tsv
 74 | 
 75 |     perl genomes_feature_table.pl path/to/genome_dir/ -e gbf -e embl > feature_table.tsv
 76 | 
 77 | ## Options
 78 | 
 79 | - -h, -help
 80 | 
 81 |     Help (perldoc POD)
 82 | 
 83 | - -e, -extensions
 84 | 
 85 |     File extensions to include in the analysis (EMBL or GENBANK format),
 86 |     either comma-separated list or multiple occurrences of the option
 87 |     [default = ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank]
 88 | 
 89 | - -p, -plasmids
 90 | 
 91 |     Optionally list plasmids as extra entries in the feature table, if
 92 |     they are annotated with a */plasmid="plasmid_name"* tag in the
 93 |     *source* primary tag
 94 | 
 95 | - -v, -version
 96 | 
 97 |     Print version number to *STDERR*
 98 | 
 99 | ## Output
100 | 
101 | - *STDOUT*
102 | 
103 |     The resulting feature table is printed to *STDOUT*. Redirect or
104 |     pipe into another tool as needed (e.g. `cut`, `grep`, or `head`).
105 | 
106 | ## Run environment
107 | 
108 | The Perl script runs under Windows and UNIX flavors.
109 | 
110 | ## Dependencies
111 | 
112 | - [BioPerl](http://www.bioperl.org) (tested version 1.006923)
113 | 
114 | ## Author - contact
115 | 
116 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
117 | 
118 | ## Citation, installation, and license
119 | 
120 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
121 | 
122 | ## Changelog
123 | 
124 | - v0.5 (14.09.2015)
125 |     - changed script name to `genomes_feature_table.pl`
126 |     - included a POD
127 |     - options with Getopt::Long
128 |     - included `pod2usage` with Pod::Usage
129 |     - major code overhaul with restructuring (removing code redundancy, print out without temp file etc.) and Perl syntax changes
130 |     - changed input options to get folder path from STDIN
131 |     - as a consequence new option **-e|-extensions**
132 |     - accession numbers not essential anymore, changed hash key to filename; but requires now only one genome per file
133 |     - draft genomes should include 'WGS' keyword (warning if not)
134 |     - option **-p|-plasmids** works now correctly with complete and draft genomes
135 |     - count plasmids without option **-p**
136 | - v0.4 (11.08.2013)
137 |     - included 'use autodie;' pragma
138 |     - included version switch
139 | - v0.3 (05.11.2012)
140 |     - new option **p** to report plasmid features in multi-sequence draft files separately
141 | - v0.2 (19.09.2012)
142 | - v0.1 (25.11.2011)
143 |     - **original** script name: `get_genome_features.pl`
144 | 


--------------------------------------------------------------------------------
/ncbi_ftp_download/README.md:
--------------------------------------------------------------------------------
  1 | ncbi_ftp_download
  2 | =================
  3 | 
  4 | **This pipeline is NOT working at the moment, as NCBI reorganized the structure of their [FTP server for genomes](https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/). As an alternative way to fetch bacterial genomes from NCBI I recommend [`ncbi-genome-download`](https://github.com/kblin/ncbi-genome-download) from @kblin, or [`Bio-RetrieveAssemblies`](https://github.com/andrewjpage/Bio-RetrieveAssemblies) from @andrewjpage from the Wellcome Trust Sanger Institute.**
  5 | 
  6 | Scripts to batch download all bacterial genomes of a genus/species from NCBI's FTP site (RefSeq and GenBank) for easy access.
  7 | 
  8 | ## Synopsis
  9 | 
 10 |     ncbi_ftp_download.sh Genus_species
 11 | 
 12 | ## Description
 13 | 
 14 | These scripts are intended to download all bacterial genomes for a particular genus or species from NCBI's FTP site (http://www.ncbi.nlm.nih.gov/Ftp/ and ftp://ftp.ncbi.nlm.nih.gov/) and copy them to result folders for easy access.
 15 | 
 16 | `ncbi_ftp_download.sh` is a bash shell wrapper script that employs UNIX's `wget` to download microbial genomes in genbank (\*.gbk) and fasta (\*.fna) format from the GenBank and RefSeq databases (NCBI Reference Sequence Database, http://www.ncbi.nlm.nih.gov/refseq/) on NCBI's FTP server, which can be accessed anonymously. As first argument it takes the bacterial genus or species name you want to download (it uses that name with a glob inside the script, e.g. Escherichia_coli will be used as Escherichia_coli\*), see examples below in [usage](#usage). Have a look at the NCBI FTP server to get the correct name (either with your browser or e.g. with FileZilla, http://filezilla-project.org/). If you want to download genomes for several distinct species just run the script with different arguments repeatedly.
 17 | 
 18 | The `wget` parameters are specified to keep the FTP server folder structure and mirror it locally downstream from the current working directory (folder 'ftp.ncbi.nlm.nih.gov' will be the top folder of the new folder structure). If you update an already existing folder structure, `wget` will only download and replace files if they are in a newer version on NCBI's FTP server. **But** be aware that NCBI shuffles files around (including new ones, deleting old ones etc.), thus it might be useful to remove 'ftp.ncbi.nlm.nih.gov' and download everything new.
 19 | 
 20 | After the download with `wget`, `ncbi_ftp_download.sh` will run the Perl script `ncbi_ftp_concat_unpack.pl`. This script unpacks (draft genomes are stored as tarballs, \*.tgz) and concatenates all complete and draft genomes, which are present in the folder 'ftp.ncbi.nlm.nih.gov' in the current working directory. The script traverses the downloaded NCBI ftp-folder structure and thus has to be called from the top level (containing the folder 'ftp.ncbi.nlm.nih.gov'). `ncbi_ftp_download.sh` runs `ncbi_ftp_concat_unpack.pl` with both **genbank** and **refseq** options, as well as option **y** to overwrite the old result folders (see below [options](#options)). Both scripts have to be in the same directory (or in the path) to run `ncbi_ftp_download.sh`.
 21 | 
 22 | For **complete** genomes **plasmids** are concatenated to the **chromosomes** to create multi-genbank/-fasta files (script `split_multi-seq_file.pl` can be used to split the multi-sequence file to single-sequence files).
 23 | 
 24 | In **draft** genomes, **scaffold** and/or **contig** files, designated by 'draft_scaf' or 'draft_con', are checked for annotation (i.e. if gene primary feature tags exist); usually only one of those contains annotations. The one with annotation is then used to create multi-genbank files. Multi-fasta files are created for the corresponding genbank file or, if no annotation exists, for the file which contains more sequence information (either contigs or scaffolds). If the sequence information is equal, scaffold files are preferred. If sequence size discrepancies between a genbank and its corresponding fasta file are found, the error file 'seq_errors.txt' is created and lists the affected files.
 25 | 
 26 | As a suggestion, pick the genomes you're looking for **first** out of './refseq' and the rest out of './genbank'. RefSeq genomes have a higher annotation quality, while GenBank includes more genomes.
 27 | 
 28 | Depending on the amount of data to download, the whole process can take quite a while. Also keep space requirements in mind, e.g. all *E. coli*/*Shigella* genomes (March 2014) have a final total space requirement of ~58 GB ('ftp.ncbi.nlm.nih.gov' = ~18 GB; ./genbank = ~25 GB; ./refseq = ~16 GB)!
 29 | 
 30 | If you're new to the NCBI FTP site you should read an excellent overview for microbial RefSeq genomes on NCBI's FTP site on Torsten Seemann's blog: http://thegenomefactory.blogspot.de/2012/07/navigating-microbial-genomes-on-ncbi.html.
 31 | 
 32 | You can also access an introductory talk for the microbial NCBI FTP resources at figshare (http://figshare.com/articles/Introduction_to_NCBI_s_FTP_server_for_bacterial_genomes/972893). It might be a good idea to read the blog post and have a look in the PDF to have a general idea what's going on, but of course you can just run the scripts and work with the genome files.
 33 | 
 34 | ## Usage
 35 | 
 36 | ### 1.) Manual consecutively
 37 | 
 38 | #### 1.1.) `wget`
 39 | 
 40 | Download RefSeq complete genomes (in fasta and genbank format):
 41 | 
 42 |     wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Genus_species*" -P .
 43 | 
 44 | Download RefSeq draft genomes as tarballs:
 45 | 
 46 |     wget -cNrv -t 45 -A "*.gbk.tgz,*.fna.tgz" "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria_DRAFT/Genus_species*" -P .
 47 | 
 48 | The same procedure has to be followed for GenBank files, here complete genomes:
 49 | 
 50 |     wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Genus_species*" -P .
 51 | 
 52 | And finally download GenBank draft genomes:
 53 | 
 54 |     wget -cNrv -t 45 -A "*.gbk.tgz,*.fna.tgz" "ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/Genus_species*" -P .
 55 | 
 56 | #### 1.2.) `ncbi_ftp_concat_unpack.pl`
 57 | 
 58 |     perl ncbi_ftp_concat_unpack.pl refseq y
 59 |     perl ncbi_ftp_concat_unpack.pl genbank y
 60 | 
 61 | ### 2.) With one command: `ncbi_ftp_download.sh` wrapper script
 62 | 
 63 | Some examples of how you can use the shell script, e.g. download all *E. coli* genomes from NCBI's ftp server:
 64 | 
 65 |     ncbi_ftp_download.sh Escherichia_coli
 66 | 
 67 | Download all *B. cereus* genomes:
 68 | 
 69 |     ncbi_ftp_download.sh Bacillus_cereus
 70 | 
 71 | Download all *Paenibacillus* genomes:
 72 | 
 73 |     ncbi_ftp_download.sh Paenibacillus
 74 | 
 75 | ## Options
 76 | 
 77 | ### *ncbi_ftp_concat_unpack.pl*
 78 | 
 79 | * genbank (as first argument)
 80 | 
 81 | Copy GenBank genomes (from './ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria\*') as (multi-)sequence files in the result folder './genbank'.
 82 | 
 83 | * refseq (as first argument)
 84 | 
 85 | Copy RefSeq genomes (from './ftp.ncbi.nlm.nih.gov/genomes/Bacteria\*') as (multi-)sequence files in the result folder './refseq'.
 86 | 
 87 | * y (as second argument)
 88 | 
 89 | Will delete previous result folders and create new ones (otherwise, the script asks the user whether to proceed with overwriting)
 90 | 
 91 | ## Output
 92 | 
 93 | ### `ncbi_ftp_download.sh`
 94 | 
 95 | * './ftp.ncbi.nlm.nih.gov/'
 96 | 
 97 | Mirrors NCBI's FTP server structure and downloads the wanted bacterial genome files in this folder with subfolders
 98 | 
 99 | ### `ncbi_ftp_concat_unpack.pl`
100 | 
101 | * './genbank'
102 | 
103 | Result folder for all **GenBank** genomes
104 | 
105 | * './refseq'
106 | 
107 | Result folder for all **RefSeq** genomes
108 | 
109 | * (seq_errors.txt)
110 | 
111 | Lists \*.gbk and corresponding \*.fasta files with sequence size discrepancies.
112 | 
113 | ## Run environment
114 | 
115 | Both the Perl script and the bash-shell script run only under UNIX flavors.
116 | 
117 | ## Dependencies (not in the core Perl modules)
118 | 
119 | * no extra dependencies
120 | 
121 | ## Authors/contact
122 | 
123 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
124 | 
125 | ## Citation, installation, and license
126 | 
127 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
128 | 
129 | ## Changelog
130 | 
131 | ### *ncbi_ftp_concat_unpack.pl*
132 | 
133 | * v0.2.1 (13.07.2015)
134 |     - Adapted all scripts to the new NCBI FTP server address: 'ftp://ftp.ncbi.nlm.nih.gov/'
135 | * v0.2 (21.02.2013)
136 |     - 'seq_errors.txt' error file if sequence size discrepancies between genbank and corresponding fasta file found
137 |     - die with error if 'genbank|refseq' not given as first argument
138 |     - print status message which genome is being processed and what file is kept for draft genomes (e.g. scaffold or contig etc.)
139 |     - bug fixes to test for file existence before running code
140 |     - changed usage to HERE document
141 | * v0.1 (15.09.2012)
142 | 
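The selection rule for draft genomes given in the Description (prefer the annotated file, otherwise the one with more sequence, with scaffolds winning a tie) can be summarised as a small shell function. This is only a sketch of the rule as the README words it, not the code of `ncbi_ftp_concat_unpack.pl`. The function name `pick_draft_file` and its arguments are hypothetical, and the fallback to a size comparison when both files are annotated is an assumption, since the README does not cover that case.

```shell
#!/bin/bash
# Hypothetical helper: pick_draft_file SCAF_ANNOTATED CON_ANNOTATED SCAF_BASES CON_BASES
#   SCAF_ANNOTATED / CON_ANNOTATED: "yes" if the scaffold/contig genbank file has gene features
#   SCAF_BASES / CON_BASES:         total sequence length of the scaffold/contig file
pick_draft_file() {
    local scaf_anno=$1 con_anno=$2 scaf_bases=$3 con_bases=$4

    if [ "$scaf_anno" = yes ] && [ "$con_anno" != yes ]; then
        echo scaffold            # only the scaffold file is annotated
    elif [ "$con_anno" = yes ] && [ "$scaf_anno" != yes ]; then
        echo contig              # only the contig file is annotated
    elif [ "$con_bases" -gt "$scaf_bases" ]; then
        echo contig              # no decision by annotation: more sequence wins
    else
        echo scaffold            # equal sequence information: scaffolds are preferred
    fi
}

pick_draft_file no no 4900000 5100000
```

The call in the last line describes a draft genome without annotation in either file, where the contig file holds more sequence.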


--------------------------------------------------------------------------------
/ncbi_ftp_download/ncbi_ftp_download.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | # Download/update RefSeq complete genomes
 3 | echo "#### Updating RefSeq complete $1 genomes"
 4 | wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/$1*" -P .
 5 | # Download/update RefSeq draft genomes
 6 | echo "#### Updating RefSeq draft $1 genomes"
 7 | wget -cNrv -t 45 -A "*.gbk.tgz,*.fna.tgz" "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria_DRAFT/$1*" -P .
 8 | # Download/update GenBank complete genomes
 9 | echo "#### Updating GenBank complete $1 genomes"
10 | wget -cNrv -t 45 -A "*.gbk,*.fna" "ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/$1*" -P .
11 | # Download/update GenBank draft genomes
12 | echo "#### Updating GenBank draft $1 genomes"
13 | wget -cNrv -t 45 -A "*.gbk.tgz,*.fna.tgz" "ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/$1*" -P .
14 | # Run script 'ncbi_ftp_concat_unpack.pl' to fill the result folders './refseq' and './genbank'
15 | echo "#### Copying files to result folder './refseq'"
16 | perl ncbi_ftp_concat_unpack.pl refseq y
17 | echo "#### Copying files to result folder './genbank'"
18 | perl ncbi_ftp_concat_unpack.pl genbank y
19 | 


--------------------------------------------------------------------------------
/order_fastx/README.md:
--------------------------------------------------------------------------------
  1 | order_fastx
  2 | ===========
  3 | 
  4 | `order_fastx.pl` is a script to order sequences in FASTA or FASTQ files.
  5 | 
  6 | * [Synopsis](#synopsis)
  7 | * [Description](#description)
  8 | * [Usage](#usage)
  9 | * [Options](#options)
 10 |   * [Mandatory options](#mandatory-options)
 11 |   * [Optional options](#optional-options)
 12 | * [Output](#output)
 13 | * [Run environment](#run-environment)
 14 | * [Author - contact](#author---contact)
 15 | * [Citation, installation, and license](#citation-installation-and-license)
 16 | * [Changelog](#changelog)
 17 | 
 18 | 
 19 | ## Synopsis
 20 | 
 21 |     perl order_fastx.pl -i infile.fasta -l order_id_list.txt > ordered.fasta
 22 | 
 23 | ## Description
 24 | 
 25 | Order sequence entries in FASTA or FASTQ sequence files according to
 26 | an ID list with a given order. Beware, the IDs in the order list
 27 | have to be **identical** to the entire IDs in the sequence file.
 28 | 
 29 | However, the ">" or "@" ID identifiers of FASTA or FASTQ files,
 30 | respectively, can be omitted in the ID list.
 31 | 
 32 | The file type is detected automatically. But, you can set the file
 33 | type manually with option **-f**. FASTQ format assumes **four** lines
 34 | per read; if this is not the case, run the FASTQ file through
 35 | [`fastx_fix.pl`](/fastx_fix) or use Heng Li's [`seqtk
 36 | seq`](https://github.com/lh3/seqtk):
 37 | 
 38 |     seqtk seq -l 0 infile.fq > outfile.fq
 39 | 
 40 | The script can also be used to pull a subset of sequences in the ID
 41 | list from the sequence file. Probably best to set option flag **-s**
 42 | in this case, see [Optional options](#optional-options) below. But, rather use
 43 | [`filter_fastx.pl`](/filter_fastx).
 44 | 
 45 | ## Usage
 46 | 
 47 |     perl order_fastx.pl -i infile.fq -l order_id_list.txt -s -f fastq > ordered.fq
 48 | 
 49 |     perl order_fastx.pl -i infile.fasta -l order_id_list.txt -e > ordered.fasta
 50 | 
 51 | ## Options
 52 | 
 53 | ### Mandatory options
 54 | 
 55 | - -i, -input
 56 | 
 57 |     Input FASTA or FASTQ file
 58 | 
 59 | - -l, -list
 60 | 
 61 |     List with sequence IDs in specified order
 62 | 
 63 | ### Optional options
 64 | 
 65 | - -h, -help
 66 | 
 67 |     Help (perldoc POD)
 68 | 
 69 | - -f, -file_type
 70 | 
 71 |     Set the file type manually [fasta|fastq]
 72 | 
 73 | - -e, -error_files
 74 | 
 75 |     Write missing IDs in the seq file or the order ID list without an equivalent in the other to error files instead of *STDERR* (see [Output](#output) below)
 76 | 
 77 | - -s, -skip_errors
 78 | 
 79 |     Skip missing ID error statements, excludes option **-e**
 80 | 
 81 | - -v, -version
 82 | 
 83 |     Print version number to *STDERR*
 84 | 
 85 | ## Output
 86 | 
 87 | - *STDOUT*
 88 | 
 89 |     The newly ordered sequences are printed to *STDOUT*. Redirect or pipe into another tool as needed.
 90 | 
 91 | - (order_ids_missing.txt)
 92 | 
 93 |     If IDs in the order list are missing in the sequence file with option **-e**
 94 | 
 95 | - (seq_ids_missing.txt)
 96 | 
 97 |     If IDs in the sequence file are missing in the order ID list with option **-e**
 98 | 
 99 | ## Run environment
100 | 
101 | The Perl script runs under Windows and UNIX flavors.
102 | 
103 | ## Author - contact
104 | 
105 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
106 | 
107 | ## Citation, installation, and license
108 | 
109 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
110 | 
111 | ## Changelog
112 | 
113 | - v0.1 (20.11.2014)
114 | 
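Because the IDs in the order list have to match the entire ID/header lines of the sequence file, one convenient way to obtain a valid list is to pull the headers out of the FASTA file itself and then reorder them. The sketch below sorts them alphabetically. The file names and sequences are invented for illustration, and the final `order_fastx.pl` call is left as a comment because it depends on where the script is installed.

```shell
#!/bin/bash
# Hypothetical two-entry FASTA file in a temporary working directory.
tmpdir=$(mktemp -d)
printf '>seq_b desc two\nACGT\n>seq_a desc one\nTTGA\nCCAT\n' > "$tmpdir/infile.fasta"

# Take the complete header lines, drop the leading '>' (optional in the ID list),
# and sort them to define the new order.
grep '^>' "$tmpdir/infile.fasta" | sed 's/^>//' | LC_ALL=C sort > "$tmpdir/order_id_list.txt"

first_id=$(head -n 1 "$tmpdir/order_id_list.txt")
id_count=$(grep -c '' "$tmpdir/order_id_list.txt")
cat "$tmpdir/order_id_list.txt"

# The list could then be used as described in the Synopsis:
# perl order_fastx.pl -i "$tmpdir/infile.fasta" -l "$tmpdir/order_id_list.txt" > ordered.fasta

rm -rf "$tmpdir"
```

Any other ordering works the same way, as long as each line of the list is left identical to a header in the sequence file.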


--------------------------------------------------------------------------------
/order_fastx/order_fastx.pl:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/perl
  2 | 
  3 | #######
  4 | # POD #
  5 | #######
  6 | 
  7 | =pod
  8 | 
  9 | =head1 NAME
 10 | 
 11 | C<order_fastx.pl> - order sequences in FASTA or FASTQ files
 12 | 
 13 | =head1 SYNOPSIS
 14 | 
 15 | C<perl order_fastx.pl -i infile.fasta -l order_id_list.txt
 16 | E<gt> ordered.fasta>
 17 | 
 18 | =head1 DESCRIPTION
 19 | 
 20 | Order sequence entries in FASTA or FASTQ sequence files according to
 21 | an ID list with a given order. Beware, the IDs in the order list
 22 | have to be B<identical> to the entire IDs in the sequence file.
 23 | 
 24 | However, the ">" or "@" ID identifiers of FASTA or FASTQ files,
 25 | respectively, can be omitted in the ID list.
 26 | 
 27 | The file type is detected automatically. But, you can set the file
 28 | type manually with option B<-f>. FASTQ format assumes B<four> lines
 29 | per read; if this is not the case, run the FASTQ file through
 30 | L<C<fastx_fix.pl>|/fastx_fix> or use Heng Li's L<C<seqtk
 31 | seq>|https://github.com/lh3/seqtk>:
 32 | 
 33 | C<seqtk seq -l 0 infile.fq E<gt> outfile.fq>
 34 | 
 35 | The script can also be used to pull a subset of sequences in the ID
 36 | list from the sequence file. Probably best to set option flag B<-s>
 37 | in this case, see L<"Optional options"> below. But, rather use
 38 | L<C<filter_fastx.pl>|/filter_fastx>.
 39 | 
 40 | =head1 OPTIONS
 41 | 
 42 | =head2 Mandatory options
 43 | 
 44 | =over 20
 45 | 
 46 | =item B<-i>=I<str>, B<-input>=I<str>
 47 | 
 48 | Input FASTA or FASTQ file
 49 | 
 50 | =item B<-l>=I<str>, B<-list>=I<str>
 51 | 
 52 | List with sequence IDs in specified order
 53 | 
 54 | =back
 55 | 
 56 | =head2 Optional options
 57 | 
 58 | =over 20
 59 | 
 60 | =item B<-h>, B<-help>
 61 | 
 62 | Help (perldoc POD)
 63 | 
 64 | =item B<-f>=I<fasta|fastq>, B<-file_type>=I<fasta|fastq>
 65 | 
 66 | Set the file type manually
 67 | 
 68 | =item B<-e>, B<-error_files>
 69 | 
 70 | Write missing IDs in the seq file or the order ID list without an
 71 | equivalent in the other to error files instead of C<STDERR> (see
 72 | L<"OUTPUT"> below)
 73 | 
 74 | =item B<-s>, B<-skip_errors>
 75 | 
 76 | Skip missing ID error statements, excludes option B<-e>
 77 | 
 78 | =item B<-v>, B<-version>
 79 | 
 80 | Print version number to C<STDERR>
 81 | 
 82 | =back
 83 | 
 84 | =head1 OUTPUT
 85 | 
 86 | =over 20
 87 | 
 88 | =item C<STDOUT>
 89 | 
 90 | The newly ordered sequences are printed to C<STDOUT>. Redirect or
 91 | pipe into another tool as needed.
 92 | 
 93 | =item (F<order_ids_missing.txt>)
 94 | 
 95 | If IDs in the order list are missing in the sequence file with
 96 | option B<-e>
 97 | 
 98 | =item (F<seq_ids_missing.txt>)
 99 | 
100 | If IDs in the sequence file are missing in the order ID list with
101 | option B<-e>
102 | 
103 | =back
104 | 
105 | =head1 EXAMPLES
106 | 
107 | =over
108 | 
109 | =item C<perl order_fastx.pl -i infile.fq -l order_id_list.txt -s -f
110 | fastq E<gt> ordered.fq>
111 | 
112 | =item C<perl order_fastx.pl -i infile.fasta -l order_id_list.txt -e
113 | E<gt> ordered.fasta>
114 | 
115 | =back
116 | 
117 | =head1 VERSION
118 | 
119 |  0.1                                                       20-11-2014
120 | 
121 | =head1 AUTHOR
122 | 
123 |  Andreas Leimbach                               aleimba[at]gmx[dot]de
124 | 
125 | =head1 LICENSE
126 | 
127 | This program is free software: you can redistribute it and/or modify
128 | it under the terms of the GNU General Public License as published by
129 | the Free Software Foundation; either version 3 (GPLv3) of the
130 | License, or (at your option) any later version.
131 | 
132 | This program is distributed in the hope that it will be useful, but
133 | WITHOUT ANY WARRANTY; without even the implied warranty of
134 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
135 | General Public License for more details.
136 | 
137 | You should have received a copy of the GNU General Public License
138 | along with this program. If not, see L<http://www.gnu.org/licenses/>.
139 | 
140 | =cut
141 | 
142 | 
143 | ########
144 | # MAIN #
145 | ########
146 | 
147 | use strict;
148 | use warnings;
149 | use autodie;
150 | use Getopt::Long;
151 | use Pod::Usage;
152 | 
153 | ### Get the options with Getopt::Long
154 | my $Seq_File; # sequence file to order sequences in
155 | my $Order_List; # order ID list for seq file
156 | my $File_Type; # set file type; otherwise detect file type by file extension
157 | my $Opt_Error_Files; # print missing IDs not found in order list or seq file to error files instead of STDERR
158 | my $Opt_Skip_Errors; # skip missing IDs error statements/files
159 | my $VERSION = 0.1;
160 | my ($Opt_Version, $Opt_Help);
161 | GetOptions ('input=s' => \$Seq_File,
162 |             'list=s' => \$Order_List,
163 |             'file_type=s' => \$File_Type,
164 |             'error_files' => \$Opt_Error_Files,
165 |             'skip_errors' => \$Opt_Skip_Errors,
166 |             'version' => \$Opt_Version,
167 |             'help|?' => \$Opt_Help);
168 | 
169 | 
170 | 
171 | ### Run perldoc on POD
172 | pod2usage(-verbose => 2) if ($Opt_Help);
173 | die "$0 $VERSION\n" if ($Opt_Version);
174 | if (!$Seq_File || !$Order_List) {
175 |     my $warning = "\n### Fatal error: Options '-i' or '-l' or their arguments are missing!\n";
176 |     pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
177 | }
178 | 
179 | 
180 | 
181 | ### Enforce mandatory or optional options
182 | die "\n### Fatal error:\nUnknown file type '$File_Type' given with option '-f'. Please choose from either 'fasta' or 'fastq'!\n" if ($File_Type && $File_Type !~ /^(fasta|fastq)$/i); # anchored, so only the exact words 'fasta' or 'fastq' are accepted
183 | warn "\n### Warning:\nIgnoring option flag '-e', because option '-s' set at the same time!\n\n" if ($Opt_Error_Files && $Opt_Skip_Errors);
184 | 
185 | 
186 | 
187 | ### Order input FASTA/FASTQ file according to a given list
188 | open (my $Order_List_Fh, "<", "$Order_List");
189 | open (my $Input_Fh, "<", "$Seq_File"); # pipe from STDIN not working because of 'seek' on filehandle
190 | get_file_type() if (!$File_Type); # subroutine to determine file type by file extension
191 | 
192 | my %Order_List_IDs; # store order IDs and indicate if found in seq file
193 | my %Seq_File_IDs; # store seq file IDs and indicate if present in order list
194 | 
195 | my $Next_Fasta_ID; # for multi-line FASTA input files to store next entry header/ID line while parsing in subroutine 'get_fastx_entry'
196 | my $Parse_Run = 1; # indicate FIRST parsing cycle through seq file to collect all seq IDs
197 | 
198 | while (my $ord_id = <$Order_List_Fh>) {
199 |     chomp $ord_id;
200 |     next if ($ord_id =~ /^\s*$/); # skip empty lines in order list
201 |     $ord_id =~ s/^(>|@)//; # remove ">/@" for WHOLE string regex match -> ID in order list can be given with ">/@" or without (will be appended again in print)
202 | 
203 |     if ($Order_List_IDs{$ord_id}) {
204 |         die "\n### Fatal error:\n'$ord_id' exists several times in '$Order_List' and IDs should be unique!\n";
205 |     } else {
206 |         $Order_List_IDs{$ord_id} = 1; # changes to 2 if ID was found in seq file
207 |     }
208 | 
209 |     while (<$Input_Fh>) {
210 |         if (/^\s*$/) { # skip empty lines in input
211 |             warn "\n### Warning:\nFASTQ file includes empty lines, which is unusual. Parsing the FASTQ reads might fail so check the output file afterwards if the script didn't quit with a fatal error. However, consider running the input FASTQ file through 'fix_fastx.pl'!\n\n" if ($File_Type =~ /fastq/i);
212 |             next;
213 |         }
214 |         chomp;
215 | 
216 |         # FASTA file
217 |         if ($File_Type =~ /fasta/i) {
218 |             $_ = get_fastx_entry($_); # subroutine to read one FASTA sequence entry (seq in multi-line or not), returns anonymous array
219 | 
220 |         # FASTQ file
221 |         } elsif ($File_Type =~ /fastq/i) {
222 |             $_ = get_fastx_entry($_); # subroutine to read one FASTQ read composed of FOUR mandatory lines, returns reference to array
223 |         }
224 | 
225 |         if ($Seq_File_IDs{$_->[0]} && $Parse_Run == 1) { # only for first parse cycle, subsequent parsings of course will find the same IDs
226 |             die "\n### Fatal error:\n'$_->[0]' exists several times in '$Seq_File' and IDs should be unique!\n";
227 |         } elsif (!$Seq_File_IDs{$_->[0]}) {
228 |             $Seq_File_IDs{$_->[0]} = 1; # changes to 2 if present in order list
229 |         }
230 | 
231 |         if ($ord_id eq $_->[0]) { # order ID hit in seq file with the WHOLE string; exact string comparison, so regex metacharacters in IDs (e.g. '|' or '.') are taken literally; de-reference array
232 |             $Order_List_IDs{$ord_id} = 2; # set to ID found
233 |             $Seq_File_IDs{$_->[0]} = 2;
234 | 
235 |             print ">$_->[0]\n$_->[1]\n\n" if ($File_Type =~ /fasta/i); # print seq entry to STDOUT
236 |             print "\@$_->[0]\n$_->[1]\n$_->[2]\n$_->[3]\n" if ($File_Type =~ /fastq/i);
237 | 
238 |             next if ($Parse_Run == 1); # parse the complete seq file once (skip 'last' below) to collect all seq IDs
239 |             last; # jump out of seq file 'while'
240 |         }
241 |     }
242 | 
243 |     # rewind seq file for next order list ID
244 |     $Next_Fasta_ID = '';
245 |     seek $Input_Fh, 0, 0;
246 |     $. = 0; # set line number of seq file to 0 (seek doesn't do it automatically)
247 |     $Parse_Run = 0;
248 | }
249 | close $Input_Fh;
250 | close $Order_List_Fh;
251 | 
252 | 
253 | 
254 | ### Print order and seq IDs that were not found in seq file or in order list, resp.
255 | if (!$Opt_Skip_Errors) {
256 |     # order IDs not found in seq file
257 |     missing_IDs(\%Order_List_IDs, 'order_ids_missing.txt', 'order'); # subroutine to identify and print missing IDs
258 | 
259 |     # seq file IDs not found in order list
260 |     missing_IDs(\%Seq_File_IDs, 'seq_ids_missing.txt', 'sequence'); # subroutine
261 | }
262 | 
263 | exit;
264 | 
265 | 
266 | #############
267 | #Subroutines#
268 | #############
269 | 
270 | ### Test for output file existence and give warning to STDERR
271 | sub file_exist {
272 |     my $file = shift;
273 |     if (-e $file) {
274 |         warn "\n### Warning:\nThe error file '$file' exists already, the current errors will be appended to the existing file!\n";
275 |         return 1;
276 |     }
277 |     return 0;
278 | }
279 | 
280 | 
281 | 
282 | ### Get sequence entries from FASTA/Q file
283 | sub get_fastx_entry {
284 |     my $line = shift;
285 | 
286 |     # possible multi-line seq in FASTA
287 |     if ($File_Type =~ /fasta/i) {
288 |         my ($seq, $header);
289 |         if ($. == 1) { # first line of file
290 |             die "\n### Fatal error:\nNot a FASTA input file, first line of file should be a FASTA ID/header line and start with a '>':\n$line\n" if ($line !~ /^>/);
291 |             $header = $line;
292 |         } elsif ($Next_Fasta_ID) {
293 |             $header = $Next_Fasta_ID;
294 |             $seq = $line;
295 |         }
296 |         while (<$Input_Fh>) {
297 |             chomp;
298 |             $Next_Fasta_ID = $_ if (/^>/); # store ID/header for next seq entry
299 |             $header =~ s/^>//; # remove '>' for WHOLE string regex match in MAIN
300 |             return [$header, $seq] if (/^>/); # return anonymous array with current header and seq
301 |             $seq .= $_; # concatenate multi-line seq
302 |         }
303 |         $header =~ s/^>//; # see above
304 |         return [$header, $seq] if eof;
305 | 
306 |     # FASTQ: FOUR lines for each FASTQ read (seq-ID, sequence, qual-ID [optional], qual)
307 |     } elsif ($File_Type =~ /fastq/i) {
308 |         my @fastq_read;
309 | 
310 |         # read name/ID, line 1
311 |         my $seq_id = $line;
312 |         die "\n### Fatal error:\nThis read doesn't have a sequence identifier/read name according to FASTQ specs, it should begin with a '\@':\n$seq_id\n" if ($seq_id !~ /^@.+/);
313 |         $seq_id =~ s/^@//; # remove '@' to make comparable to $qual_id and for WHOLE string regex match in MAIN
314 |         push(@fastq_read, $seq_id);
315 | 
316 |         # sequence, line 2
317 |         chomp (my $seq = <$Input_Fh>);
318 |         die "\n### Fatal error:\nRead '$seq_id' has a whitespace in its sequence, which is not allowed according to FASTQ specs:\n$seq\n" if ($seq =~ /\s+/);
319 |         die "\n### Fatal error:\nRead '$seq_id' has an IUPAC degenerate base (except for 'N') or non-nucleotide character in its sequence, which is not allowed according to FASTQ specs:\n$seq\n" if ($seq =~ /[^acgtun]/i);
320 |         push(@fastq_read, $seq);
321 | 
322 |         # optional quality ID, line 3
323 |         chomp (my $qual_id = <$Input_Fh>);
324 |         die "\n### Fatal error:\nThe optional sequence identifier/read name for the quality line of read '$seq_id' is not according to FASTQ specs, it should begin with a '+':\n$qual_id\n" if ($qual_id !~ /^\+/);
325 |         push(@fastq_read, $qual_id);
326 |         $qual_id =~ s/^\+//; # if optional ID is present check if equal to $seq_id in line 1
327 |         die "\n### Fatal error:\nThe sequence identifier/read name of read '$seq_id' doesn't fit to the optional ID in the quality line:\n$qual_id\n" if ($qual_id && $qual_id ne $seq_id);
328 | 
329 |         # quality, line 4
330 |         chomp (my $qual = <$Input_Fh>);
331 |         die "\n### Fatal error:\nRead '$seq_id' has a whitespace in its quality values, which is not allowed according to FASTQ specs:\n$qual\n" if ($qual =~ /\s+/);
332 |         die "\n### Fatal error:\nRead '$seq_id' has a non-ASCII character in its quality values, which is not allowed according to FASTQ specs:\n$qual\n" if ($qual =~ /[^[:ascii:]]/);
333 |         die "\n### Fatal error:\nThe quality line of read '$seq_id' doesn't have the same number of symbols as letters in the sequence:\n$seq\n$qual\n" if (length $qual != length $seq);
334 |         push(@fastq_read, $qual);
335 | 
336 |         return \@fastq_read; # return array-ref
337 |     }
338 |     return 0;
339 | }
340 | 
341 | 
342 | 
343 | ### Determine file type via file extension (FASTA or FASTQ)
344 | sub get_file_type {
345 |     if ($Seq_File =~ /.+\.(fa|fas|fasta|ffn|fna|frn|fsa)$/) { # use "|fsa)(\.gz)*$" if unzip inside script
346 |         $File_Type = 'fasta';
347 |     } elsif ($Seq_File =~ /.+\.(fastq|fq)$/) {
348 |         $File_Type = 'fastq';
349 |     }
350 | 
351 |     die "\n### Fatal error:\nFile type could not be automatically detected. Sure this is a FASTA/Q file? If yes, you can force the file type by setting option '-f' to either 'fasta' or 'fastq'!\n" if (!$File_Type);
352 |     print STDERR "Detected file type: $File_Type\n";
353 |     return 1;
354 | }
355 | 
356 | 
357 | 
358 | ### Identify and print IDs with no hit in order list or seq file
359 | sub missing_IDs {
360 |     my ($hash_ref, $error_file, $mode) = @_;
361 | 
362 |     my @missed = grep ($hash_ref->{$_} == 1, keys %$hash_ref); # set to 2 if hit, 1 if "only" present
363 | 
364 |     if (@missed) {
365 |         file_exist($error_file) if ($Opt_Error_Files); # subroutine
366 |         open (my $error_fh, ">>", $error_file) or die "\n### Fatal error:\nCan't open error file '$error_file': $!\n" if ($Opt_Error_Files);
367 | 
368 |         print STDERR "\n### Warning:\nSome $mode IDs were not found in '";
369 |         if ($mode eq 'order') {
370 |             print STDERR "$Seq_File";
371 |         } elsif ($mode eq 'sequence') {
372 |             print STDERR "$Order_List";
373 |         }
374 |         print STDERR "', listed ";
375 |         print STDERR "below:\n" if (!$Opt_Error_Files);
376 |         print STDERR "in error file '$error_file'!\n" if ($Opt_Error_Files);
377 | 
378 |         foreach (sort @missed) {
379 |             if (!$Opt_Error_Files) {
380 |                 print STDERR "$_\t"; # separated by tab
381 |             } elsif ($Opt_Error_Files) {
382 |                 print $error_fh "$_\n"; # separated by newline
383 |             }
384 |         }
385 |         print STDERR "\n" if (!$Opt_Error_Files); # final newline for STDERR print
386 | 
387 |         close $error_fh if ($Opt_Error_Files);
388 |     }
389 |     return 1;
390 | }
391 | 


--------------------------------------------------------------------------------
/po2anno/README.md:
--------------------------------------------------------------------------------
  1 | po2anno
  2 | =======
  3 | 
  4 | `po2anno.pl` is a script to create an annotation comparison matrix from [Proteinortho5](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output.
  5 | 
  6 | * [Synopsis](#synopsis)
  7 | * [Description](#description)
  8 | * [Usage](#usage)
  9 |   * [cds_extractor](#cds_extractor)
 10 |   * [Proteinortho5](#proteinortho5)
 11 |   * [po2anno](#po2anno)
 12 | * [Options](#options)
 13 |   * [Mandatory options](#mandatory-options)
 14 |   * [Optional options](#optional-options)
 15 | * [Output](#output)
 16 | * [Run environment](#run-environment)
 17 | * [Author - contact](#author---contact)
 18 | * [Citation, installation, and license](#citation-installation-and-license)
 19 | * [Changelog](#changelog)
 20 | 
 21 | ## Synopsis
 22 | 
 23 |     perl po2anno.pl -i matrix.proteinortho -d genome_fasta_dir/ -l -a > annotation_comparison.tsv
 24 | 
 25 | ## Description
 26 | 
 27 | Supplement an ortholog/paralog output matrix from a
 28 | [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)
 29 | calculation with annotation information. The resulting tab-separated
 30 | annotation comparison matrix (ACM) is mainly intended for the
 31 | transfer of high quality annotations from reference genomes to
 32 | homologs (orthologs and co-orthologs/paralogs) in a query genome
 33 | (e.g. in conjunction with [`tbl2tab.pl`](/tbl2tab)). But of course
 34 | it can also be used to have a quick glance at the annotation of
 35 | genes present only in a couple of input genomes in comparison to the
 36 | others.
 37 | 
 38 | Annotation is retrieved from multi-FASTA files created with
 39 | [`cds_extractor.pl`](/cds_extractor). See
 40 | [`cds_extractor.pl`](/cds_extractor) for a description of the
 41 | format. These files are used as input for the PO analysis and with
 42 | option **-d** for `po2anno.pl`.
 43 | 
 44 | **Proteinortho5** (PO) has to be run with option **-singles** to include
 45 | also genes without orthologs, so-called singletons/ORFans, for each
 46 | genome in the PO matrix (see the
 47 | [PO manual](http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html)).
 48 | Additionally, option **-selfblast** is recommended to enhance paralog
 49 | detection by PO.
 50 | 
 51 | Each orthologous group (OG) is listed in a row of the resulting ACM;
 52 | the first column holds the OG numbers from the PO input matrix (i.e.
 53 | line number minus one). The following columns specify the
 54 | orthologous CDS for each input genome. For each CDS the ID,
 55 | optionally the length in bp (option **-l**), gene, EC number(s), and
 56 | product are shown depending on their presence in the CDS's
 57 | annotation. The ID is in most cases the locus tag (see
 58 | [`cds_extractor.pl`](/cds_extractor)). If several EC numbers exist
 59 | for a single CDS they're separated by ';'. If an OG includes
 60 | paralogs, i.e. co-orthologs from a single genome, these will be
 61 | printed in the following row(s) **without** a new OG number in the
 62 | first column. The order of paralogous CDSs within an OG is
 63 | arbitrary.
 64 | 
 65 | The OGs are sorted numerically via the query ID (see option **-q**).
 66 | If option **-a** is set, the non-query OGs are appended to the output
 67 | after the query OGs, sorted numerically via OG number.
 68 | 
 69 | ## Usage
 70 | 
 71 | ### [`cds_extractor`](/cds_extractor)
 72 | 
 73 |     for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done
 74 | 
 75 | ### [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)
 76 | 
 77 |     proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn]
 78 | 
 79 | ### po2anno
 80 | 
 81 |     perl po2anno.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -q query.[faa|ffn] -l -a > annotation_comparison.tsv
 82 | 
 83 | ## Options
 84 | 
 85 | ### Mandatory options
 86 | 
 87 | - **-i**=_str_, **-input**=_str_
 88 | 
 89 |     Proteinortho (PO) result matrix (\*.proteinortho or \*.poff), or piped *STDIN* (-)
 90 | 
 91 | - **-d**=_str_, **-dir\_genome**=_str_
 92 | 
 93 |     Path to the directory including the genome multi-FASTA PO input files (\*.faa or \*.ffn), created with [`cds_extractor.pl`](/cds_extractor)
 94 | 
 95 | ### Optional options
 96 | 
 97 | - **-h**, **-help**
 98 | 
 99 |     Help (perldoc POD)
100 | 
101 | - **-q**=_str_, **-query**=_str_
102 | 
103 |     Query genome (has to be identical to the string in the PO matrix) [default = first one in alphabetical order]
104 | 
105 | - **-l**, **-length**
106 | 
107 |     Include length of each CDS in bp
108 | 
109 | - **-a**, **-all**
110 | 
111 |     Append non-query orthologous groups (OGs) to the output
112 | 
113 | - **-v**, **-version**
114 | 
115 |     Print version number to *STDERR*
116 | 
117 | ## Output
118 | 
119 | - *STDOUT*
120 | 
121 |     The resulting tab-delimited ACM is printed to *STDOUT*. Redirect or pipe into another tool as needed (e.g. `cut`, `grep`, `head`, or `tail`).
122 | 
123 | ## Run environment
124 | 
125 | The Perl script runs under Windows and UNIX flavors.
126 | 
127 | ## Author - contact
128 | 
129 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
130 | 
131 | ## Citation, installation, and license
132 | 
133 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
134 | 
135 | ## Changelog
136 | 
137 | * v0.2.2 (23.10.2015)
138 |     * minor syntax changes to `po2anno.pl` and README
139 |     * changed option **-g|-genome_dir** to **-d|-dir_genome** for consistency with [`po2group_stats.pl`](/po2group_stats)
140 | * v0.2.1 (07.09.2015)
141 |     * get rid of underscores in product annotation strings (from [`cds_extractor.pl`](/cds_extractor))
142 |     * debugged hard-coded relative path for `$genome_file_path`
143 | * v0.2 (15.01.2015)
144 |     * give number of query-specific OGs and total query singletons/ORFans in final stat output
145 |     * changed final stat output to an easier readable format
146 |     * fixed bug: %Query_ID_Seen included also non-query IDs, which luckily had no consequences
147 | * v0.1 (18.12.2014)
148 | 


--------------------------------------------------------------------------------
/po2group_stats/README.md:
--------------------------------------------------------------------------------
  1 | po2group_stats
  2 | ==============
  3 | 
  4 | `po2group_stats.pl` is a script to categorize orthologs from [Proteinortho5](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) output according to genome groups. The [prot_finder](/prot_finder) workflow includes a script, `binary_group_stats.pl`, which does the same thing for column groups in a delimited TEXT binary matrix.
  5 | 
  6 | * [Synopsis](#synopsis)
  7 | * [Description](#description)
  8 | * [Usage](#usage)
  9 |   * [cds_extractor](#cds_extractor)
 10 |   * [Proteinortho5](#proteinortho5)
 11 |   * [po2group_stats](#po2group_stats)
 12 | * [Options](#options)
 13 |   * [Mandatory options](#mandatory-options)
 14 |   * [Optional options](#optional-options)
 15 | * [Output](#output)
 16 | * [Dependencies](#dependencies)
 17 | * [Run environment](#run-environment)
 18 | * [Author - contact](#author---contact)
 19 | * [Citation, installation, and license](#citation-installation-and-license)
 20 | * [Changelog](#changelog)
 21 | 
 22 | ## Synopsis
 23 | 
 24 |     perl po2group_stats.pl -i matrix.proteinortho -d genome_fasta_dir/ -g group_file.tsv -p > overall_stats.tsv
 25 | 
 26 | ## Description
 27 | 
 28 | Categorize the genomes in an ortholog/paralog output matrix (option **-i**) from a
 29 | [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)
 30 | calculation according to group affiliations. The group
 31 | affiliations of the genomes are intended to get overall
 32 | presence/absence statistics for groups of genomes and not simply
 33 | single genomes (e.g. comparing 'marine', 'earth', 'commensal',
 34 | 'pathogenic' etc. genome groups). Percentage inclusion (option
 35 | **-cut\_i**) and exclusion (option **-cut\_e**) cutoffs can be set to
 36 | define how strictly the presence/absence of genome groups within an
 37 | orthologous group (OG) is determined. Of course groups can also hold
 38 | only single genomes to get single genome statistics. Group
 39 | affiliations are defined in a mandatory **tab-delimited** group input
 40 | file (option **-g**) with a **minimum of two** and a **maximum of four** groups.
 41 | 
 42 | Only alphanumeric (a-z, A-Z, 0-9), underscore (\_), dash (-), and
 43 | period (.) characters are allowed for the **group names** in the
 44 | group file to avoid downstream problems with the operating/file
 45 | system. Consequently, whitespace is not allowed in them either!
 46 | Additionally, **group names**, **genome filenames** (should be
 47 | enforced by the file system), and **FASTA IDs** considering **all**
 48 | genome files (mostly locus tags; should be enforced by Proteinortho5)
 49 | need to be **unique**.
 50 | 
 51 | **Proteinortho5** (PO) has to be run with option **-singles** to
 52 | include also genes without orthologs, so-called singletons/ORFans,
 53 | for each genome in the PO matrix (see the
 54 | [PO manual](http://www.bioinf.uni-leipzig.de/Software/proteinortho/manual.html)).
 55 | Additionally, option **-selfblast** is recommended to enhance
 56 | paralog detection by PO.
 57 | 
 58 | To explain the logic behind the categorization, the following
 59 | annotation for example groups will be used. A '1' exemplifies a
 60 | group genome count in a respective OG >= the rounded inclusion
 61 | cutoff, a '0' a group genome count <= the rounded exclusion cutoff.
 62 | The presence and absence of OGs for the group affiliations are
 63 | structured in different categories depending on the number of
 64 | groups. For **two groups** (e.g. A and B) there are five categories:
 65 | 'A specific' (A:B = 1:0), 'B specific' (0:1), 'cutoff core' (1:1),
 66 | 'underrepresented' (0:0), and 'unspecific'. Unspecific OGs have a
 67 | genome count for at least **one** group outside the cutoffs
 68 | (exclusion cutoff < genome count < inclusion cutoff) and
 69 | thus cannot be categorized. These 'unspecific' OGs will only be
 70 | printed to a final annotation result file with option **-u**. Overall
 71 | stats for all categories are printed to *STDOUT* in a final
 72 | tab-delimited output matrix.
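
The cutoff logic for a single group can be sketched in Perl as follows. This is only an illustration under stated assumptions, not the code from `po2group_stats.pl`: the sub name `group_state` is hypothetical, and rounding the cutoffs to the nearest whole genome count is an assumption, since the exact rounding rule is not described here. The cutoff values in the example are the documented defaults of options **-cut\_i** and **-cut\_e**.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one group's genome count within an OG.
# Returns 1 (present), 0 (absent), or undef ('unspecific').
# Assumption: cutoffs are rounded to the nearest whole genome count.
sub group_state {
    my ($genome_count, $group_size, $cut_i, $cut_e) = @_;
    my $incl = int($group_size * $cut_i + 0.5); # rounded inclusion cutoff
    my $excl = int($group_size * $cut_e + 0.5); # rounded exclusion cutoff
    die "Rounded inclusion cutoff has to be > rounded exclusion cutoff\n" if ($incl <= $excl);
    return 1 if ($genome_count >= $incl);
    return 0 if ($genome_count <= $excl);
    return; # exclusion cutoff < genome count < inclusion cutoff
}

# Example: a group of 10 genomes with the default cutoffs 0.9 and 0.1
foreach my $count (10, 5, 1) {
    my $state = group_state($count, 10, 0.9, 0.1);
    print "$count of 10 genomes: ", (defined $state ? $state : 'unspecific'), "\n";
}
```

An OG is 'unspecific' as soon as one of its groups yields an undefined state; otherwise the per-group states form the 1/0 pattern used for the categories.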
 73 | 
 74 | **Three groups** (A, B, and C) have the following nine categories: 'A
 75 | specific' (A:B:C = 1:0:0), 'B specific' (0:1:0), 'C specific'
 76 | (0:0:1), 'A absent' (0:1:1), 'B absent' (1:0:1), 'C absent' (1:1:0),
 77 | 'cutoff core' (1:1:1), 'underrepresented' (0:0:0), and 'unspecific'.
 78 | 
 79 | **Four groups** (A, B, C, and D) are classified in 17 categories: 'A
 80 | specific' (A:B:C:D = 1:0:0:0), 'B specific' (0:1:0:0), 'C specific'
 81 | (0:0:1:0), 'D specific' (0:0:0:1), 'A-B specific' (1:1:0:0), 'A-C
 82 | specific' (1:0:1:0), 'A-D specific' (1:0:0:1), 'B-C specific'
 83 | (0:1:1:0), 'B-D specific' (0:1:0:1), 'C-D specific' (0:0:1:1), 'A
 84 | absent' (0:1:1:1), 'B absent' (1:0:1:1), 'C absent' (1:1:0:1), 'D
 85 | absent' (1:1:1:0), 'cutoff core' (1:1:1:1), 'underrepresented'
 86 | (0:0:0:0), and 'unspecific'.
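
The category names listed above follow a regular pattern, which the following Perl sketch tries to capture. The sub name `category_name` is hypothetical and the code is not taken from `po2group_stats.pl`; it assumes that 'unspecific' OGs have already been filtered out, so every group state is either 1 or 0.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the category name from the group names and
# their 1/0 states (given in the same order), following the lists above.
# Assumption: 'unspecific' OGs were removed before, states are only 1 or 0.
sub category_name {
    my ($names, $states) = @_; # two array-refs
    my @present = grep { $states->[$_] == 1 } 0 .. $#$states;
    my @absent  = grep { $states->[$_] == 0 } 0 .. $#$states;
    return 'cutoff core'      if (!@absent);
    return 'underrepresented' if (!@present);
    # with three or four groups, a single missing group is 'X absent'
    return $names->[$absent[0]] . ' absent' if (@absent == 1 && @$names > 2);
    return join('-', map { $names->[$_] } @present) . ' specific';
}

print category_name([qw(A B C D)], [1, 1, 0, 0]), "\n"; # A-B specific
print category_name([qw(A B C)],   [0, 1, 1]), "\n";    # A absent
print category_name([qw(A B)],     [1, 0]), "\n";       # A specific
```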
 87 | 
 88 | The resulting group presence/absence (according to the cutoffs) can
 89 | also be printed to a binary matrix (option **-b**) in the result
 90 | directory (option **-r**), excluding the 'unspecific' category. Since
 91 | the categories are the logics underlying venn diagrams, you can also
 92 | plot the results in a venn diagram using the binary matrix (option
 93 | **-p**). The 'underrepresented' category is exempt from the venn
 94 | diagram, because it is outside of venn diagram logics.
 95 | 
 96 | Here are venn diagrams illustrating the logic categories (see folder ['pics'](./pics)):
 97 | 
 98 | <p align="center">
 99 |   <img alt="venn diagram logics" src="https://github.com/aleimba/bac-genomics-scripts/raw/master/po2group_stats/pics/venn_diagram_logics.png">
100 | </p>
101 | 
102 | There are two optional categories (which are only considered for the
103 | final print outs and in the final stats matrix, not for the binary
104 | matrix and the venn diagram): 'strict core' (option **-co**) for
105 | OGs where **all** genomes have an ortholog, independent of the
106 | cutoffs. Of course all the 'strict core' OGs are also included in
107 | the 'cutoff core' category ('strict core' is identical to 'cutoff
108 | core' with **-cut\_i** 1 and **-cut\_e** 0). Option **-s** activates the
109 | detection of 'singleton/ORFan' OGs present in only **one** genome.
110 | Depending on the cutoffs and number of genomes in the groups,
111 | category 'underrepresented' includes most of these singletons.
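
As a rough illustration of the two optional categories, the Perl sketch below flags an OG as 'strict core' or 'singleton' from its per-genome CDS counts. The sub name `optional_categories` and the input hash layout are assumptions made for this example and do not reflect the internal data structures of `po2group_stats.pl`.

```perl
use strict;
use warnings;

# Hypothetical helper: flag the optional categories for one OG.
# %$cds_count maps genome name => number of CDSs of that genome in the OG.
sub optional_categories {
    my ($cds_count) = @_;
    my @with_cds = grep { $cds_count->{$_} > 0 } keys %$cds_count;
    my @cats;
    push(@cats, 'strict core') if (@with_cds == keys %$cds_count); # all genomes have an ortholog
    push(@cats, 'singleton')   if (@with_cds == 1);                # present in only one genome
    return @cats;
}

# Example: every genome has at least one CDS in this OG
my @cats = optional_categories({ 'genome1.faa' => 1, 'genome2.faa' => 2, 'genome3.faa' => 1 });
print "@cats\n"; # strict core
```

Both checks are independent of the inclusion and exclusion cutoffs, which is why these categories overlap with the non-optional ones.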
112 | 
113 | Additionally, annotation is retrieved from multi-FASTA files created
114 | with [`cds_extractor.pl`](/cds_extractor). See
115 | [`cds_extractor.pl`](/cds_extractor) for a description of the
116 | format. These files are used as input for the PO analysis and with
117 | option **-d** for `po2group_stats.pl`. The annotations are printed
118 | in category output files in the result directory.
119 | 
120 | Annotations are only pulled from one representative genome for each
121 | category present in the current OG. With option **-co** you can set a
122 | specific genome for the representative annotation for category
123 | 'strict core'. For all other categories the representative genome is
124 | chosen according to the order of the genomes in the group files,
125 | depending on the presence in each OG. Thus, the best annotated
126 | genome should be in the first group at the topmost position
127 | (especially for 'cutoff core'), as well as the best annotated ones
128 | at the top in all other groups.
129 | 
130 | In the category result files, each orthologous group (OG) is
131 | listed in a row; the first column holds the OG
132 | numbers from the PO input matrix (i.e. line number minus one). The
133 | following columns specify the ID of each CDS and, depending on their
134 | presence in the CDS's annotation, the gene, EC number(s), product,
135 | and organism. The ID is in most cases the locus tag (see
136 | [`cds_extractor.pl`](/cds_extractor)). If several EC numbers exist
137 | for a single CDS they are separated by a ';'. If the representative
138 | genome within an OG includes paralogs (co-orthologs) these will be
139 | printed in the following row(s) **without** a new OG number in the
140 | first column.
141 | 
142 | The number of OGs in the category annotation result files is the
143 | same as listed in the venn diagram and the final stats matrix.
144 | However, since only annotation from one representative genome is
145 | used, the CDS number will differ from the final stats. The final
146 | stats include **all** the CDS in this category in **all** genomes
147 | present in the OG in groups >= the inclusion cutoff (i.e. for
148 | 'strict core' the CDS for all genomes in this OG are counted). Two
149 | categories are different: for 'unspecific' all unspecific groups are
150 | included, for 'underrepresented' all groups <= the exclusion
151 | cutoffs. This is also the reason why the 'pangenome' CDS count is
152 | greater than the 'included in categories' CDS count in the final
153 | stats matrix, as genomes in excluded groups are exempt from the CDS
154 | counts for most categories. 'Included in categories' is the OG/CDS
155 | sum of all non-optional categories ('\*specific', '\*absent', 'cutoff
156 | core', 'underrepresented', and 'unspecific'), since the optional
157 | categories are included in non-optionals. An exception to the
158 | difference in CDS counts are the 'singletons' category where OG and
159 | CDS counts are identical in the result files and in the overall
160 | final output matrix (as there is only one genome), as well as in
161 | group-'specific' categories for groups including only one genome.
162 | 
163 | Finally, if you want the respective representative sequences for a
164 | category you can first filter the locus tags from the result file
165 | with Unix command-line tools:
166 | 
167 |     grep -v "^#" result_file.tsv | cut -f 2 > locus_tags.txt
168 | 
169 | And then feed the locus tag list to
170 | [`cds_extractor.pl`](/cds_extractor) with option **-l**.
171 | 
172 | As a final note, the [prot_finder](/prot_finder) workflow includes a
173 | script, `binary_group_stats.pl`, based upon `po2group_stats.pl`,
174 | which can calculate overall presence/absence statistics for column
175 | groups in a delimited TEXT binary matrix (as with genomes here).
176 | 
177 | ## Usage
178 | 
179 | ### [`cds_extractor`](/cds_extractor)
180 | 
181 |     for i in *.[gbk|embl]; do perl cds_extractor.pl -i $i [-p|-n]; done
182 | 
183 | ### [**Proteinortho5**](http://www.bioinf.uni-leipzig.de/Software/proteinortho/)
184 | 
185 |     proteinortho5.pl -graph [-synteny] -cpus=# -selfblast -singles -identity=50 -cov=50 -blastParameters='-use_sw_tback [-seg no|-dust no]' *.[faa|ffn]
186 | 
187 | ### po2group_stats
188 | 
189 |     perl po2group_stats.pl -i matrix.[proteinortho|poff] -d genome_fasta_dir/ -g group_file.tsv -r result_dir -cut_i 0.7 -cut_e 0.2 -b -p -co genome4.[faa|ffn] -s -u -a > overall_stats.tsv
190 | 
191 | ## Options
192 | 
193 | ### Mandatory options
194 | 
195 | - **-i**=_str_, **-input**=_str_
196 | 
197 |     Proteinortho (PO) result matrix (\*.proteinortho or \*.poff)
198 | 
199 | - **-d**=_str_, **-dir\_genome**=_str_
200 | 
201 |     Path to the directory including the genome multi-FASTA PO input files (\*.faa or \*.ffn), created with [`cds_extractor.pl`](/cds_extractor)
202 | 
203 | - **-g**=_str_, **-groups\_file**=_str_
204 | 
205 |     Tab-delimited file with group affiliation for the genomes with a **minimum of two** and a **maximum of four** groups (easiest to create in spreadsheet software and save in tab-separated format). **All** genomes from the PO matrix need to be included. Group names can only include alphanumeric (a-z, A-Z, 0-9), underscore (\_), dash (-), and period (.) characters (no whitespace allowed either). Example format with two genomes in group A, three genomes each in groups B and D, and one genome in group C:
206 | 
207 |     group\_A&emsp;group\_B&emsp;group\_C&emsp;group\_D<br>
208 |     genome1.faa&emsp;genome2.faa&emsp;genome3.faa&emsp;genome4.faa<br>
209 |     genome5.faa&emsp;genome6.faa&emsp;&emsp;genome7.faa<br>
210 |     &emsp;genome8.faa&emsp;&emsp;genome9.faa
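
To make the layout of the groups file more concrete, here is a small Perl sketch that reads such a file into a hash of arrays. It is a hypothetical parser written for this README, not the routine used in `po2group_stats.pl`; the sub name `parse_groups` and the error messages are assumptions. It treats the first line as the group names and every following line as one genome per group column, with empty cells allowed.

```perl
use strict;
use warnings;

# Hypothetical parser for the tab-delimited groups file.
# Returns a hash-ref: group name => array-ref of genome filenames.
sub parse_groups {
    my ($fh) = @_;
    chomp(my $head = <$fh>);
    my @names = split(/\t/, $head);
    die "Group file needs two to four groups\n" if (@names < 2 || @names > 4);
    my (%seen, %groups);
    foreach my $name (@names) {
        die "Invalid group name '$name'\n" if ($name !~ /^[A-Za-z0-9_.\-]+$/);
        die "Group name '$name' is not unique\n" if ($seen{$name}++);
        $groups{$name} = [];
    }
    while (my $line = <$fh>) {
        chomp $line;
        my @cells = split(/\t/, $line, -1); # -1 keeps trailing empty cells
        foreach my $i (0 .. $#names) {
            push(@{ $groups{$names[$i]} }, $cells[$i]) if (defined $cells[$i] && length $cells[$i]);
        }
    }
    return \%groups;
}

# Example with the layout shown above, read from an in-memory filehandle
my $example = join("\n",
    "group_A\tgroup_B\tgroup_C\tgroup_D",
    "genome1.faa\tgenome2.faa\tgenome3.faa\tgenome4.faa",
    "genome5.faa\tgenome6.faa\t\tgenome7.faa",
    "\tgenome8.faa\t\tgenome9.faa") . "\n";
open(my $fh, '<', \$example) or die "Cannot open in-memory file: $!\n";
my $groups = parse_groups($fh);
close $fh;
print "$_: @{ $groups->{$_} }\n" foreach (sort keys %$groups);
```

Empty cells are skipped, so groups of different sizes can share the same rows, as in the example format.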
211 | 
212 | ### Optional options
213 | 
214 | - **-h**, **-help**
215 | 
216 |     Help (perldoc POD)
217 | 
218 | - **-r**=_str_, **-result\_dir**=_str_
219 | 
220 |     Path to result folder \[default = inclusion and exclusion percentage cutoff, './results\_i#\_e#'\]
221 | 
222 | - **-cut\_i**=_float_, **-cut\_inclusion**=_float_
223 | 
224 |     Percentage inclusion cutoff for genomes in a group per OG, has to be > 0 and <= 1. Cutoff will be rounded according to the genome number in each group and has to be > the rounded exclusion cutoff in this group. \[default = 0.9\]
225 | 
226 | - **-cut\_e**=_float_, **-cut\_exclusion**=_float_
227 | 
228 |     Percentage exclusion cutoff, has to be >= 0 and < 1. Rounded cutoff has to be < rounded inclusion cutoff. \[default = 0.1\]
229 | 
230 | - **-b**, **-binary\_matrix**
231 | 
232 |     Print a binary matrix with the presence/absence genome group results according to the cutoffs (excluding 'unspecific' category OGs)
233 | 
234 | - **-p**, **-plot\_venn**
235 | 
236 |     Plot venn diagram from the binary matrix (except 'unspecific' and 'underrepresented' categories) with function `venn` from **R** package **gplots**, requires option **-b**
237 | 
238 | - **-co**=(_str_), **-core_strict**=(_str_)
239 | 
240 |     Include 'strict core' category in output. Optionally, give a genome name from the PO matrix to use for the representative output annotation. \[default = topmost genome in first group\]
241 | 
242 | - **-s**, **-singletons**
243 | 
244 |     Include singletons/ORFans for each genome in the output, activates also overall genome OG/CDS stats in final stats matrix for genomes with singletons
245 | 
246 | - **-u**, **-unspecific**
247 | 
248 |     Include 'unspecific' category representative annotation file in result directory
249 | 
250 | - **-a**, **-all\_genomes\_overall**
251 | 
252 |     Report overall stats for all genomes (appended to the final stats matrix), also those without singletons; will include all overall genome stats without option **-s**
253 | 
254 | - **-v**, **-version**
255 | 
256 |     Print version number to *STDERR*
257 | 
258 | ## Output
259 | 
260 | - *STDOUT*
261 | 
262 |     The tab-delimited final stats matrix is printed to *STDOUT*. Redirect or pipe into another tool as needed.
263 | 
264 | - ./results_i#_e#
265 | 
266 |     All output files are stored in a results folder
267 | 
268 | - ./results_i#_e#/[\*_specific|\*_absent|cutoff_core|underrepresented]_OGs.tsv
269 | 
270 |     Tab-delimited files with OG annotation from a representative genome for non-optional categories
271 | 
272 | - (./results_i#_e#/[\*_singletons|strict_core|unspecific]_OGs.tsv)
273 | 
274 |     Optional category tab-delimited output files with representative annotation
275 | 
276 | - (./results_i#_e#/binary_matrix.tsv)
277 | 
278 |     Tab-delimited binary matrix of group presence/absence results according to cutoffs (excluding 'unspecific' category)
279 | 
280 | - (./results_i#_e#/venn_diagram.pdf)
281 | 
282 |     Venn diagram for non-optional categories (except 'unspecific' and 'underrepresented' categories)
283 | 
284 | ## Dependencies
285 | 
286 | - **Statistical computing language [R](http://www.r-project.org/)**
287 | 
288 |     `Rscript` is needed to plot the venn diagram with option **-p**, tested with version 3.2.2
289 | 
290 | - **[gplots](https://cran.r-project.org/web/packages/gplots/index.html)**
291 | 
292 |     Package needed for **R** to plot the venn diagram with function `venn`. Tested with **gplots** version 2.17.0.
293 | 
294 | ## Run environment
295 | 
296 | The Perl script runs under UNIX and Windows flavors.
297 | 
298 | ## Author - contact
299 | 
300 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
301 | 
302 | ## Citation, installation, and license
303 | 
304 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
305 | 
306 | ## Changelog
307 | 
308 | * v0.1.3 (06.06.2016)
309 |     * included check for file system conformity for group names
310 |     * some minor syntax changes and additions to error messages, basically adapting to [`binary_group_stats.pl`](/prot_finder)
311 | * v0.1.2 (19.11.2015)
312 |     * added `pod2usage`-die for Getopts::Long call
313 |     * minor POD/README change
314 | * v0.1.1 (30.10.2015)
315 |     * fixed bug for representative annotation in output files, the representative genome was not chosen according to genome order in the groups file
316 | * v0.1 (23.10.2015)
317 | 


--------------------------------------------------------------------------------
/po2group_stats/pics/README.md:
--------------------------------------------------------------------------------
1 | Venn diagram logics for po2group_stats
2 | ======================================
3 | 
4 | These venn diagrams were made to illustrate the logics behind the genome group classification of [`po2group_stats`](/po2group_stats). They were created with function `venn` of the [**gplots**](https://cran.r-project.org/web/packages/gplots/index.html) R package (version 2.17.0), as implemented in [`po2group_stats`](/po2group_stats), and edited with [**Inkscape**](https://inkscape.org) version 0.48.
5 | 
6 | The diagrams are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).


--------------------------------------------------------------------------------
/po2group_stats/pics/venn_diagram_logics.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aleimba/bac-genomics-scripts/1b2388fb9f5870a4aafa3e070823f9286178d3b1/po2group_stats/pics/venn_diagram_logics.png


--------------------------------------------------------------------------------
/prot_finder/prot_binary_matrix.pl:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/perl
  2 | 
  3 | #######
  4 | # POD #
  5 | #######
  6 | 
  7 | =pod
  8 | 
  9 | =head1 NAME
 10 | 
 11 | C<prot_binary_matrix.pl> - create a presence/absence matrix from
 12 | C<prot_finder.pl> output
 13 | 
 14 | =head1 SYNOPSIS
 15 | 
 16 | C<perl prot_binary_matrix.pl blast_hits.tsv E<gt> binary_matrix.tsv>
 17 | 
 18 | B<or>
 19 | 
 20 | C<perl prot_finder.pl -r report.blastp -s subject.faa | perl prot_binary_matrix.pl E<gt> binary_matrix.tsv>
 21 | 
 22 | =head1 DESCRIPTION
 23 | 
 24 | This script is intended to create a presence/absence matrix from the
 25 | significant C<prot_finder.pl>
 26 | (L<B<BLASTP>|http://blast.ncbi.nlm.nih.gov/Blast.cgi>) hits (or the
 27 | companion bash pipe C<prot_finder_pipe.sh>). The tab-separated
 28 | C<prot_finder.pl> output can be given directly via C<STDIN> or as a
 29 | file. By default a tab-delimited binary presence/absence matrix for
 30 | query hits per subject organism will be printed to C<STDOUT>. Use
 31 | option B<-t> to count all query hits per subject organism, not just
 32 | the binary presence/absence. You can transpose the presence/absence
 33 | binary matrix with the script C<transpose_matrix.pl> (see its help
 34 | with B<-h>).
 35 | 
 36 | The resulting matrix can be used to associate the presence/absence
 37 | data with a phylogenetic tree, e.g. use the Interactive Tree Of Life
 38 | website (L<B<iTOL>|http://itol.embl.de/>). B<iTOL> likes individual
 39 | comma-separated input files, thus use options B<-s -c> for this
 40 | purpose.
 41 | 
 42 | For B<iTOL> the organism names have to be identical to the names of the
 43 | leaves of the phylogenetic tree, thus manual adaptation, e.g. in a
 44 | spreadsheet software, might be needed. B<Careful>, subject organisms
 45 | without a significant B<BLASTP> hit won't be included in the
 46 | tab-separated C<prot_finder.pl> result table and hence can't be
 47 | included by C<prot_binary_matrix.pl>. If needed add them manually to
 48 | the result matri(x|ces).
 49 | 
 50 | Additionally, you can give the presence/absence binary matrix to
 51 | C<binary_group_stats.pl> to calculate presence/absence statistics
 52 | for groups of columns and not simply single columns of the matrix.
 53 | C<binary_group_stats.pl> also has a comprehensive manual with its
 54 | option B<-h>.
 55 | 
 56 | =head1 OPTIONS
 57 | 
 58 | =over 20
 59 | 
 60 | =item B<-h>, B<-help>
 61 | 
 62 | Help (perldoc POD)
 63 | 
 64 | =item B<-s>, B<-separate>
 65 | 
 66 | Separate presence/absence files for each query protein printed to
 67 | the result directory [default without B<-s> = C<STDOUT> matrix for
 68 | all query proteins combined]
 69 | 
 70 | =item B<-d>=I<str>, B<-dir_result>=I<str>
 71 | 
 72 | Path to result folder, requires option B<-s> [default =
 73 | './binary_matrix_results']
 74 | 
 75 | =item B<-t>, B<-total>
 76 | 
 77 | Count total occurrences of query proteins, not just binary
 78 | presence/absence
 79 | 
 80 | =item B<-c>, B<-csv>
 81 | 
 82 | Output matri(x|ces) in comma-separated format (*.csv) instead of
 83 | tab-delimited format (*.tsv)
 84 | 
 85 | =item B<-l>, B<-locus_tag>
 86 | 
 87 | Use the locus_tag B<prefixes> in the subject_ID column of the
 88 | C<prot_finder.pl> output (instead of the subject_organism columns) as
 89 | organism IDs to associate query hits to organisms. The subject_ID
 90 | column will include locus_tags if they're annotated for a genome
 91 | (see the L<C<cds_extractor.pl>|/cds_extractor> format description).
 92 | Useful if the L<C<cds_extractor.pl>|/cds_extractor> output doesn't
 93 | include strain names for 'o=' in the FASTA IDs, because the prefix
 94 | of a locus_tag should be unique for a genome (see
 95 | L<http://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation>).
 96 | 
 97 | =item B<-v>, B<-version>
 98 | 
 99 | Print version number to C<STDERR>
100 | 
101 | =back
102 | 
103 | =head1 OUTPUT
104 | 
105 | =over 17
106 | 
107 | =item C<STDOUT>
108 | 
109 | The resulting presence/absence matrix is printed to C<STDOUT>
110 | without option B<-s>. Redirect or pipe into another tool as needed.
111 | 
112 | =item (F<./binary_matrix_results>)
113 | 
114 | Separate query presence/absence files are stored in a result folder
115 | with option B<-s>
116 | 
117 | =item (F<./binary_matrix_results/query-ID_binary_matrix.(tsv|csv)>)
118 | 
119 | Separate query presence/absence files with option B<-s>
120 | 
121 | =back
122 | 
123 | =head1 EXAMPLES
124 | 
125 | =over
126 | 
127 | =item C<perl prot_binary_matrix.pl -s -d result_dir -t blast_hits.tsv>
128 | 
129 | =back
130 | 
131 | B<or>
132 | 
133 | =over
134 | 
135 | =item C<perl prot_finder.pl -r report.blastp -s subject.faa | perl prot_binary_matrix.pl -l -c E<gt> binary_matrix.csv>
136 | 
137 | =back
138 | 
139 | B<or>
140 | 
141 | =over
142 | 
143 | =item C<mkdir result_dir && ./prot_finder_pipe.sh -q query.faa -s subject.faa -d result_dir -m | tee result_dir/blast_hits.tsv | perl prot_binary_matrix.pl E<gt> binary_matrix.tsv>
144 | 
145 | =back
146 | 
147 | =head1 VERSION
148 | 
149 |  0.6                                               update: 23-11-2015
150 |  0.1                                                       25-10-2012
151 | 
152 | =head1 AUTHOR
153 | 
154 |  Andreas Leimbach                               aleimba[at]gmx[dot]de
155 | 
156 | =head1 LICENSE
157 | 
158 | This program is free software: you can redistribute it and/or modify
159 | it under the terms of the GNU General Public License as published by
160 | the Free Software Foundation; either version 3 (GPLv3) of the
161 | License, or (at your option) any later version.
162 | 
163 | This program is distributed in the hope that it will be useful, but
164 | WITHOUT ANY WARRANTY; without even the implied warranty of
165 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
166 | General Public License for more details.
167 | 
168 | You should have received a copy of the GNU General Public License
169 | along with this program. If not, see L<http://www.gnu.org/licenses/>.
170 | 
171 | =cut
172 | 
173 | 
174 | ########
175 | # MAIN #
176 | ########
177 | 
178 | use strict;
179 | use warnings;
180 | use autodie;
181 | use Getopt::Long;
182 | use Pod::Usage;
183 | 
184 | ### Get options with Getopt::Long
185 | my $Opt_Separate; # separate presence/absence files for each query printed to result_dir (default: single presence/absence file for all queries printed to STDOUT)
186 | my $Result_Dir; # path to result folder, requires option '-s'; default set below to 'binary_matrix_results'
187 | my $Opt_Total; # count total occurrences of query proteins not just presence/absence binary
188 | my $Opt_Csv; # output in csv format (default: tsv)
189 | my $Opt_Locus_Tag; # use locus_tag prefixes (from subject_ID column, see cds_extractor) instead of subject_organism as ID to count query hits
190 | my $VERSION = 0.6;
191 | my ($Opt_Version, $Opt_Help);
192 | GetOptions ('separate' => \$Opt_Separate,
193 |             'dir_result=s' => \$Result_Dir,
194 |             'total' => \$Opt_Total,
195 |             'csv' => \$Opt_Csv,
196 |             'locus_tag' => \$Opt_Locus_Tag,
197 |             'version' => \$Opt_Version,
198 |             'help|?' => \$Opt_Help)
199 |             or pod2usage(-verbose => 1, -exitval => 2);
200 | 
201 | 
202 | 
203 | ### Run perldoc on POD and set option defaults
204 | pod2usage(-verbose => 2) if ($Opt_Help);
205 | die "$0 $VERSION\n" if ($Opt_Version);
206 | if ($Result_Dir && !$Opt_Separate) {
207 |     warn "### Warning: Option '-d' given but not its required option '-s', forcing option '-s'!\n";
208 |     $Opt_Separate = 1;
209 | }
210 | 
211 | my $Separator = "\t";
212 | $Separator = "," if ($Opt_Csv); # optional csv output format
213 | 
214 | 
215 | 
216 | ### Check input
217 | if (-t STDIN && ! @ARGV) {
218 |     my $warning = "\n### Fatal error: No STDIN and no input file given as argument, please supply one of them and/or see help with '-h'!\n";
219 |     pod2usage(-verbose => 0, -message => $warning, -exitval => 2);
220 | } elsif (!-t STDIN && @ARGV) {
221 |     my $warning = "\n### Fatal error: Both STDIN and an input file given as argument, please supply only either one and/or see help with '-h'!\n";
222 |     pod2usage(-verbose => 0, -message => $warning, -exitval => 2);
223 | }
224 | die "\n### Fatal error: Too many arguments given, only STDIN or one input file allowed as argument! Please see the usage with option '-h' if unclear!\n" if (@ARGV > 1);
225 | die "\n### Fatal error: File '$ARGV[0]' does not exist!\n" if (@ARGV && $ARGV[0] ne '-' && !-e $ARGV[0]);
226 | 
227 | 
228 | 
229 | ### Create result folder, only for option '-s'
230 | if ($Opt_Separate) {
231 |     $Result_Dir = 'binary_matrix_results' if (!$Result_Dir);
232 |     $Result_Dir =~ s/\/$//; # get rid of a potential '/' at the end of $Result_Dir path
233 |     if (-e $Result_Dir) {
234 |         empty_dir($Result_Dir); # subroutine to empty a directory with user interaction
235 |     } else {
236 |         mkdir $Result_Dir;
237 |     }
238 | }
239 | 
240 | 
241 | 
242 | ### Parse the input from 'prot_finder.pl' to associate organism with query hit
243 | my @Queries; # store all query proteins
244 | my %Hits; # hash of hash to associate subject_organism/subject_ID with query hit
245 | 
246 | while (<>) { # read STDIN or file input
247 |     chomp;
248 |     if ($. == 1) { # $. = check only first line of input (works with STDIN and file input)
249 |         die "\n### Fatal error: Input doesn't have the correct format, it has to be the output of 'prot_finder.pl' with the following header:\n# subject_organism\tsubject_ID\tsubject_gene\tsubject_protein_desc\tquery_ID\tquery_desc\tquery_coverage [%]\tquery_identities [%]\tsubject/hit_coverage [%]\te-value of best HSP\tbit-score of best HSP\n" if (!/# subject_organism\tsubject_ID\tsubject_gene\tsubject_protein_desc\tquery_ID\tquery_desc/);
250 |         next; # skip header line
251 |     }
252 | 
253 |     my @line = split (/\t/, $_); # $line[0] = subject_organism; $line[1] = subject_ID (mostly locus_tag, see cds_extractor); $line[4] = query_ID
254 |     my $query = $line[4];
255 |     my $id;
256 |     if ($Opt_Locus_Tag) { # use subject_ID
257 |         die "\n### Fatal error: The subject_ID of the following line doesn't look like an NCBI locus tag (see: http://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation). Column subject_ID needs to include only locus_tags for option '-l'!\n$_\n" if ($line[1] !~ /^([a-zA-Z][0-9a-zA-Z]{2,11})_[0-9a-zA-Z]+$/); # check if subject_ID is locus_tag ('\w' not used because allows alphanumeric and '_')
258 |         # excerpt: The locus_tag prefix must be 3-12 alphanumeric characters and the first character may not be a digit. All chromosomes and plasmids of an individual genome must use the exactly same locus_tag prefix followed by an underscore and then an alphanumeric identification number that is unique within the given genome. Other than the single underscore used to separate the prefix from the identification number no other special characters can be used in the locus_tag.
259 |         $id = $1; # locus_tag prefix, unique for each genome
260 |     } else { # use subject_organism as ID
261 |         $id = $line[0];
262 |     }
263 | 
264 |     if ($Opt_Total) { # count total occurrences of query proteins
265 |         $Hits{$id}{$query}++;
266 | 
267 |     } else { # only binary output (0 or 1)
268 |         $Hits{$id}{$query} = 1;
269 |     }
270 | 
271 |     push (@Queries, $query) if (!grep($_ eq $query, @Queries)); # push each query only once in @Queries
272 | }
273 | 
274 | 
275 | 
276 | ### Print binary data to a joined or separate (for each query; as needed by iTOL) file(s)
277 | if (!$Opt_Separate) { # joined output
278 |     # print header
279 |     if ($Opt_Locus_Tag) {
280 |         print "locus_tag";
281 |     } else {
282 |         print "organism";
283 |     }
284 |     print "$Separator";
285 |     print join("$Separator", sort @Queries), "\n";
286 | 
287 |     # print data to STDOUT
288 |     foreach my $id (sort keys %Hits) {
289 |         print "$id";
290 |         foreach my $query (sort @Queries) {
291 |             if ($Hits{$id}->{$query}) {
292 |                 print "$Separator", "$Hits{$id}->{$query}";
293 |             } else {
294 |                 print "$Separator", "0";
295 |             }
296 |         }
297 |         print "\n";
298 |     }
299 | 
300 | } elsif ($Opt_Separate) { # separated output for each query
301 |     foreach my $query (sort @Queries) {
302 |         my $file = "$Result_Dir/$query\_binary\_matrix.";
303 |         if ($Opt_Csv) {
304 |             $file .= "csv";
305 |         } else {
306 |             $file .= "tsv";
307 |         }
308 |         open (my $binary_matrix_fh, ">", "$file");
309 |         foreach my $id (sort keys %Hits) {
310 |             print $binary_matrix_fh "$id";
311 |             if ($Hits{$id}->{$query}) {
312 |                 print $binary_matrix_fh "$Separator", "$Hits{$id}->{$query}\n";
313 |             } else {
314 |                 print $binary_matrix_fh "$Separator", "0\n";
315 |             }
316 |         }
317 |         close $binary_matrix_fh;
318 |     }
319 | }
320 | 
321 | exit;
322 | 
323 | 
324 | ###############
325 | # Subroutines #
326 | ###############
327 | 
328 | ### Subroutine to empty a directory with user interaction
329 | sub empty_dir {
330 |     my $dir = shift;
331 |     print STDERR "\nDirectory '$dir' already exists! Use option '-d' to set a different result directory name, or do you want to delete all files in the existing directory [y|n]? ";
332 |     my $user_ask = <STDIN>;
333 |     if ($user_ask =~ /^\s*y/i) { # accept only answers starting with 'y' or 'Y'
334 |         unlink glob "$dir/*"; # remove all files in results directory
335 |     } else {
336 |         die "\nScript aborted!\n";
337 |     }
338 |     return 1;
339 | }
340 | 
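
The core bookkeeping of `prot_binary_matrix.pl` for the joined output (without options `-s`, `-t`, `-c`, or `-l`) can be approximated with a short shell/awk function. This is only a sketch and not part of the repository: the names `binary_matrix` and `sorted_keys` are made up for illustration, the header line is skipped without being validated, and the input is assumed to be the tab-separated `prot_finder.pl` table with the organism in column 1 and the query ID in column 5.

```shell
# Hypothetical helper: read a prot_finder.pl-style TSV on STDIN and print
# a presence/absence matrix (organisms as rows, query IDs as columns).
binary_matrix () {
    awk -F '\t' -v OFS='\t' '
        # copy the keys of "set" into "out", sorted as strings (insertion sort)
        function sorted_keys(set, out,    k, n, i, j, tmp) {
            n = 0
            for (k in set) out[++n] = k
            for (i = 2; i <= n; i++) {
                tmp = out[i]
                for (j = i - 1; j >= 1 && out[j] > tmp; j--) out[j + 1] = out[j]
                out[j + 1] = tmp
            }
            return n
        }
        NR == 1 { next }                         # skip the header line
        { orgs[$1]; queries[$5]; hit[$1, $5] = 1 }
        END {
            n_org = sorted_keys(orgs, org_list)
            n_qry = sorted_keys(queries, qry_list)
            line = "organism"
            for (q = 1; q <= n_qry; q++) line = line OFS qry_list[q]
            print line
            for (o = 1; o <= n_org; o++) {
                line = org_list[o]
                for (q = 1; q <= n_qry; q++)
                    line = line OFS (((org_list[o], qry_list[q]) in hit) ? 1 : 0)
                print line
            }
        }'
}

# Toy input: strainA hits q1 twice, strainB hits q2 once.
printf '# subject_organism\tsubject_ID\tsubject_gene\tsubject_protein_desc\tquery_ID\tquery_desc\nstrainB\tB_001\tg\td\tq2\tx\nstrainA\tA_001\tg\td\tq1\tx\nstrainA\tA_002\tg\td\tq1\tx\n' | binary_matrix
```

As in the Perl script, an organism that never appears in the input table cannot show up as a row, which is why such organisms have to be added by hand.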


--------------------------------------------------------------------------------
/prot_finder/prot_finder_pipe.sh:
--------------------------------------------------------------------------------
  1 | #!/bin/bash
  2 | set -e
  3 | 
  4 | #############
  5 | # Functions #
  6 | #############
  7 | 
  8 | usage () {
  9 |     cat 1>&2 << EOF # ${0##*/} parameter expansion substitution with variable '0' to get shell script filename without path
 10 | Usage: ${0##*/} [OPTION] -q query.faa -f (embl|gbk) > blast_hits.tsv
 11 | or:    ${0##*/} [OPTION] -q query.faa -s subject.faa -d result_dir \\
 12 |        > result_dir/blast_hits.tsv
 13 | 
 14 | Bash wrapper script to run a pipeline consisting of optional
 15 | 'cds_extractor.pl' (with its options '-p -f'), BLASTP, 'prot_finder.pl',
 16 | and optional Clustal Omega. 'cds_extractor.pl' (only for shell script
 17 | option '-f') and 'prot_finder.pl' either have to be installed in the
 18 | global PATH or present in the current working directory. BLASTP is run
 19 | with disabled query filtering, locally optimal Smith-Waterman alignments,
 20 | and the number of database sequences to show alignments for raised
 21 | to 500 for BioPerl parsing (legacy: '-F F -s T -b 500', plus: '-seg
 22 | no -use_sw_tback -num_alignments 500').
 23 | 
 24 | The script ends with the STDERR message 'Pipeline finished!'. If this
 25 | is not the case, have a look at the log files in the result directory
 26 | for errors.
 27 | 
 28 | Mandatory options:
 29 |     -q <str>           Path to query protein multi-FASTA file (*.faa)
 30 |                        with unique FASTA IDs
 31 |     -f <str>           File extension for files in the current working
 32 |                        directory to use for 'cds_extractor.pl' (e.g.
 33 |                        'embl' or 'gbk'); excludes shell script option '-s'
 34 |     or
 35 |     -s <str>           Path to subject protein multi-FASTA file (*.faa)
 36 |                        already created with 'cds_extractor.pl' (and its
 37 |                        options '-p -f'), will not run 'cds_extractor.pl';
 38 |                        excludes shell script option '-f'
 39 | 
 40 | Optional options:
 41 |     -h                 Print usage
 42 |     -d <str>           Path to result folder [default = results_i#_cq#]
 43 |     -p (legacy|plus)   BLASTP suite to use [default = plus]
 44 |     -e <real>          E-value for BLASTP [default = 1e-10]
 45 |     -t <int>           Number of threads to be used for BLASTP and
 46 |                        Clustal Omega [default = all processors on
 47 |                        system]
 48 |     -i <int>           Query identity cutoff for significant hits
 49 |                        [default = 70]
 50 |     -c <int>           Query coverage cutoff [default = 70]
 51 |     -k <int>           Subject coverage cutoff [default = 0]
 52 |     -b                 Give only best hit (highest identity) for each
 53 |                        subject sequence
 54 |     -a                 Multiple alignment of each multi-FASTA result
 55 |                        file with Clustal Omega
 56 |     -o <str>           Path to executable Clustal Omega binary if not
 57 |                        in global PATH; requires shell script option '-a'
 58 |     -m                 Clean up all non-essential files
 59 | 
 60 | Author: Andreas Leimbach <aleimba[at]gmx[dot]de>
 61 | EOF
 62 | }
 63 | 
 64 | 
 65 | ### Check external dependencies
 66 | check_commands () {
 67 |     which "$1" > /dev/null || err "Required executable '$1' not found in global PATH, please install.$2"
 68 | }
 69 | 
 70 | ### Check cutoff options input
 71 | check_cutoff_options () {
 72 |     local message="Option '-$2' requires an integer number >= 0 and <= 100 as value, not '$1'!"
 73 |     [[ $1 =~ ^[0-9]+$ ]] || err "$message"
 74 |     (( $1 <= 100 )) || err "$message" # arithmetic expression (can only handle integer math, not float)
 75 | }
 76 | 
 77 | 
 78 | ### Error messages
 79 | err () {
 80 |     echo -e "\n### Fatal error: $*" 1>&2
 81 |     exit 1
 82 | }
 83 | 
 84 | 
 85 | ### Run status of script to STDERR instead of STDOUT
 86 | msg () {
 87 |     echo -e "# $*" 1>&2
 88 | }
 89 | 
 90 | 
 91 | ########
 92 | # MAIN #
 93 | ########
 94 | 
 95 | shopt -s extglob # enable extended globs for bash
 96 | 
 97 | Cmdline="$*"
 98 | 
 99 | ### Getopts
100 | Blastp_Suite="plus"
101 | Evalue="1e-10"
102 | Threads="$(nproc --all)" # get max number of processors on system
103 | Ident_Cut=70
104 | Cov_Query_Cutoff=70
105 | Cov_Subject_Cutoff=0
106 | 
107 | while getopts ':q:f:s:d:p:e:t:i:c:k:bao:mh' opt; do # beginning ':' indicates silent mode, trailing ':' after each option requires value
108 |     case $opt in
109 |         q) Query_File=$OPTARG
110 |            [[ -r $Query_File ]] || err "Cannot read query file '$Query_File'!"
111 |            ;;
112 |         f) Subject_Ext=$OPTARG
113 |            [[ -n "$(find . -maxdepth 1 -name "*.${Subject_Ext}" -print -quit)" ]] || err "No files with the option '-f' specified file extension '$Subject_Ext' found in the current working directory!"
114 |            ;;
115 |         s) Subject_File=$OPTARG
116 |            [[ -r $Subject_File ]] || err "Cannot read subject file '$Subject_File'!"
117 |            ;;
118 |         d) Result_Dir=$OPTARG;; # checked below
119 |         p) Blastp_Suite=$OPTARG
120 |            [[ $Blastp_Suite = @(plus|legacy) ]] || err "Option '-p' only allows 'plus' for BLASTP+ or 'legacy' for legacy BLASTP as value, not '$Blastp_Suite'!" # extended glob (regex more expensive)
121 |            ;;
122 |         e) Evalue=$OPTARG
123 |            [[ $Evalue =~ ^([0-9][0-9]*|[0-9]+e-[0-9]+)$ ]] || err "Option '-e' requires a real number (either integer or scientific exponential notation) as value, not '$Evalue'!"
124 |            ;;
125 |         t) Threads=$OPTARG
126 |            [[ $Threads =~ ^[1-9][0-9]*$ ]] || err "Option '-t' requires an integer > 0 as value, not '$Threads'!"
127 |            ;;
128 |         i) Ident_Cut=$OPTARG
129 |            check_cutoff_options "$Ident_Cut" "i"
130 |            ;;
131 |         c) Cov_Query_Cutoff=$OPTARG
132 |            check_cutoff_options "$Cov_Query_Cutoff" "c"
133 |            ;;
134 |         k) Cov_Subject_Cutoff=$OPTARG
135 |            check_cutoff_options "$Cov_Subject_Cutoff" "k"
136 |            ;;
137 |         b) Opt_Best_Hit=1;;
138 |         a) Opt_Align=1;;
139 |         o) Clustal_Path=$OPTARG
140 |            [[ -x $Clustal_Path ]] || err "Option '-o' requires the path to an executable Clustal Omega binary as value, not '$Clustal_Path'!"
141 |            ;;
142 |         m) Opt_Clean_Up=1;;
143 |         h) usage; exit;; # usage function, exit code zero
144 |         \?) err "Invalid option '-$OPTARG'. See usage with '-h'!";;
145 |         :) err "Option '-$OPTARG' requires a value. See usage with '-h'!";;
146 |     esac
147 | done
148 | 
149 | 
150 | ### Check options and enforce mandatory options
151 | [[ $Query_File && ($Subject_Ext || $Subject_File) ]] || err "Mandatory options '-q' and '-f' or '-s' are missing!"
152 | 
153 | [[ $Subject_Ext && $Subject_File ]] && err "Options '-f' and '-s' are mutually exclusive. Choose either '-f' OR '-s'!"
154 | 
155 | (( Threads <= $(nproc --all) )) || err "Number of threads for option '-t', '$Threads', exceeds the maximum $(nproc --all) processors on the system!"
156 | 
157 | [[ ! $Opt_Align && $Clustal_Path ]] && Opt_Align=1 && msg "Option '-o' requires option '-a', forcing option '-a'!"
158 | 
159 | 
160 | ### Check external dependencies
161 | echo 1>&2 # newline
162 | msg "Checking pipeline dependencies"
163 | [[ $Opt_Align && ! $Clustal_Path ]] && check_commands "clustalo" " Or use option '-o' to give the path to the binary!"
164 | 
165 | for exe in cds_extractor.pl formatdb blastall makeblastdb blastp prot_finder.pl; do
166 |     [[ $Subject_File && $exe == cds_extractor.pl ]] && continue
167 |     [[ $Blastp_Suite == legacy && $exe = @(makeblastdb|blastp) ]] && continue # extended glob
168 |     [[ $Blastp_Suite == plus && $exe = @(formatdb|blastall) ]] && continue
169 |     if [[ $exe = *.pl ]]; then # glob
170 |         if [[ -r "./$exe" ]]; then # present in current wd
171 |             [[ $exe =~ ^cds ]] && Cds_Extractor_Cmd="perl cds_extractor.pl"
172 |             [[ $exe =~ ^prot ]] && Prot_Finder_Cmd="perl prot_finder.pl"
173 |             continue
174 |         else
175 |             [[ $exe =~ ^cds ]] && Cds_Extractor_Cmd="cds_extractor.pl"
176 |             [[ $exe =~ ^prot ]] && Prot_Finder_Cmd="prot_finder.pl"
177 |             check_commands "$exe" " Or copy the Perl script in the current working directory."
178 |         fi
179 |         continue
180 |     fi
181 |     check_commands "$exe"
182 | done
183 | 
184 | msg "Script call command: ${0##*/} $Cmdline"
185 | 
186 | 
187 | ### Create result folder
188 | if [[ ! $Result_Dir ]]; then # can't give default before 'getopts' in case cutoffs are set by the user
189 |     Result_Dir="results_i${Ident_Cut}_cq${Cov_Query_Cutoff}"
190 | else
191 |     Result_Dir="${Result_Dir%/}" # parameter expansion substitution to get rid of a potential '/' at the end of Result_Dir path
192 | fi
193 | 
194 | if [[ -d $Result_Dir ]]; then # makes it possible to redirect STDOUT output into result_dir (corresponding to option '-f' in 'prot_finder.pl' script)
195 |     skip=0
196 |     for file in "$Result_Dir"/*; do
197 |         if [[ -s $file || $skip -eq 1 ]]; then # die if a file with size > 0 or more than one file already in result_dir
198 |             err "Result directory '$Result_Dir' already exists! You can use option '-d' to set a different result directory name."
199 |         fi
200 |         skip=1
201 |     done
202 | else
203 |     mkdir -pv "$Result_Dir" 1>&2
204 | fi
205 | 
206 | 
207 | ### Run cds_extractor.pl
208 | if [[ $Subject_Ext ]]; then
209 |     msg "Running cds_extractor.pl on all '*.$Subject_Ext' files in the current working directory"
210 |     for file in *."$Subject_Ext"; do
211 |         file_no_ext="${file%.${Subject_Ext}}.faa" # parameter expansion substitution to get rid of file extension and replace with new one (*.faa are the output files from cds_extractor)
212 |         File_Names+=("$file_no_ext") # append to array
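213 |         eval "$Cds_Extractor_Cmd -i \"$file\" -p -f &>> \"$Result_Dir/cds_extractor.log\"" # paths quoted to survive whitespace; '&>>' appends STDOUT and STDERR to the log instead of discarding them in '/dev/null', for error catching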
214 |     done
215 |     Subject_File="$Result_Dir/prot_finder.faa" # for creating BLASTP db below
216 |     cat "${File_Names[@]}" > "$Subject_File" # concatenate files stored in the array, "${array[@]}" expands to list of array elements (words)
217 | fi
218 | 
219 | 
220 | ### Run BLASTP
221 | msg "Running BLASTP '$Blastp_Suite' with subject '$Subject_File', query '$Query_File', evalue '$Evalue', and $Threads threads"
222 | Blast_Report="$Result_Dir/prot_finder.blastp"
223 | if [[ $Blastp_Suite == legacy ]]; then
224 |     formatdb -p T -i "$Subject_File" -n prot_finder_db
225 |     blastall -p blastp -d prot_finder_db -i "$Query_File" -o "$Blast_Report" -e "$Evalue" -F F -s T -b 500 -a "$Threads"
226 | elif [[ $Blastp_Suite == plus ]]; then
227 |     makeblastdb -in "$Subject_File" -input_type fasta -dbtype prot -out prot_finder_db &> "$Result_Dir/makeblastdb.log" # '&>' instead of '/dev/null' for error catching
228 |     blastp -db prot_finder_db -query "$Query_File" -out "$Blast_Report" -evalue "$Evalue" -seg no -use_sw_tback -num_alignments 500 -num_threads "$Threads"
229 | fi
230 | 
231 | 
232 | ### Run prot_finder.pl
233 | msg "Running prot_finder.pl with identity cutoff '$Ident_Cut', query coverage cutoff '$Cov_Query_Cutoff', and subject coverage cutoff '$Cov_Subject_Cutoff'"
234 | Cmd="$Prot_Finder_Cmd -d $Result_Dir -f -q $Query_File -s $Subject_File -r $Blast_Report -i $Ident_Cut -cov_q $Cov_Query_Cutoff -cov_s $Cov_Subject_Cutoff"
235 | [[ $Opt_Best_Hit ]] && Cmd="$Cmd -b" # append to command
236 | [[ $Opt_Align ]] && Cmd="$Cmd -a -t $Threads"
237 | [[ $Clustal_Path ]] && Cmd="$Cmd -p $Clustal_Path"
238 | eval "$Cmd" 2> "$Result_Dir/prot_finder.log" # '2>' instead of '/dev/null' for error catching
239 | 
240 | msg "All result files stored in directory '$Result_Dir'"
241 | 
242 | 
243 | ### Clean up non-essential files
244 | if [[ $Opt_Clean_Up ]]; then
245 |     msg "Removing non-essential output files, option '-m'"
246 |     for file in "${File_Names[@]}"; do # remove output files from cds_extractor
247 |         rm -v "$file" 1>&2
248 |     done
249 |     [[ $Subject_Ext ]] && rm -v "$Subject_File" 1>&2 # 'cat' from cds_extractor
250 |     if [[ $Blastp_Suite == legacy ]]; then
251 |         rm -v formatdb.log 1>&2
252 |         [[ -r error.log ]] && rm -v error.log 1>&2 # no idea where this guy is coming from or what is its trigger
253 |     fi
254 |     rm -v prot_finder_db.p* "$Blast_Report" "$Result_Dir"/*.log "${Subject_File}.idx" 1>&2
255 | fi
256 | 
257 | msg "Pipeline finished!"
258 | 
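
The cutoff validation done by `check_cutoff_options` (options `-i`, `-c`, and `-k`) can be tried in isolation with a small function that returns a status instead of exiting. This is a sketch under assumptions, not code from the pipeline: the name `valid_cutoff` is hypothetical, and it uses `case` and `[ ... -le ... ]` rather than the script's `[[ =~ ]]` and `(( ))` so that it also runs in a plain POSIX shell.

```shell
# Hypothetical helper: succeed only for an integer between 0 and 100.
valid_cutoff () {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit (sign, dot, letter)
    esac
    [ "$1" -le 100 ]               # upper bound; digits-only already implies >= 0
}

for value in 0 70 100 101 7.5 -1; do
    if valid_cutoff "$value"; then
        echo "$value: ok"
    else
        echo "$value: rejected"
    fi
done
```

Only whole numbers pass, which matches the integer-only arithmetic noted in the comment of `check_cutoff_options`.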


--------------------------------------------------------------------------------
/prot_finder/transpose_matrix.pl:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/perl
  2 | 
  3 | #######
  4 | # POD #
  5 | #######
  6 | 
  7 | =pod
  8 | 
  9 | =head1 NAME
 10 | 
 11 | C<transpose_matrix.pl> - transpose a delimited TEXT matrix
 12 | 
 13 | =head1 SYNOPSIS
 14 | 
 15 | C<perl transpose_matrix.pl input_matrix.tsv E<gt>
 16 | input_matrix_transposed.tsv>
 17 | 
 18 | B<or>
 19 | 
 20 | C<perl prot_binary_matrix.pl blast_hits.tsv | perl
 21 | transpose_matrix.pl E<gt> binary_matrix_transposed.tsv>
 22 | 
 23 | =head1 DESCRIPTION
 24 | 
 25 | This script transposes a delimited TEXT input matrix, i.e. rows will
 26 | become columns and columns rows. Use option B<-d> to set the
 27 | delimiter of the input and output matrix, default is set to
 28 | tab-delimited/separated matrices. Input matrices can be given
 29 | directly via C<STDIN> or as a file. The script is intended for the
 30 | resulting presence/absence binary matrices of
 31 | C<prot_binary_matrix.pl>, but can be used for any TEXT matrix.
 32 | 
 33 | The binary matrix of C<prot_binary_matrix.pl> has the query protein
 34 | IDs as column headers and the subject genomes as row headers. Thus,
 35 | C<transpose_matrix.pl> is very useful to transpose the
 36 | C<prot_binary_matrix.pl> matrix for the usage with
 37 | C<binary_group_stats.pl> to calculate presence/absence statistics
 38 | for groups of columns/genomes (and not simply single columns of the
 39 | matrix). C<binary_group_stats.pl> also has a comprehensive manual
 40 | with its option B<-h>.
 41 | 
 42 | Additionally, option B<-e> can be used to fill empty cells of the
 43 | input matrix with a value in the transposed matrix (e.g. 'NA', '0'
 44 | etc.).
 45 | 
 46 | =head1 OPTIONS
 47 | 
 48 | =over 20
 49 | 
 50 | =item B<-h>, B<-help>
 51 | 
 52 | Help (perldoc POD)
 53 | 
 54 | =item B<-d>=I<str>, B<-delimiter>=I<str>
 55 | 
 56 | Set delimiter of input and output matrix (e.g. comma ',', single
 57 | space ' ' etc.) [default = tab-delimited/separated]
 58 | 
 59 | =item B<-e>=I<str>, B<-empty>=I<str>
 60 | 
 61 | Fill empty cells of the input matrix with a value in the transposed
 62 | matrix (e.g. 'NA', '0' etc.)
 63 | 
 64 | =item B<-v>, B<-version>
 65 | 
 66 | Print version number to C<STDERR>
 67 | 
 68 | =back
 69 | 
 70 | =head1 OUTPUT
 71 | 
 72 | =over 20
 73 | 
 74 | =item C<STDOUT>
 75 | 
 76 | The transposed matrix is printed to C<STDOUT>. Redirect or pipe into
 77 | another tool as needed.
 78 | 
 79 | =back
 80 | 
 81 | =head1 EXAMPLES
 82 | 
 83 | =over
 84 | 
 85 | =item C<perl transpose_matrix.pl -d ' ' -e NA input_matrix_space-delimit.txt E<gt> input_matrix_space-delimit_transposed.txt>
 86 | 
 87 | =back
 88 | 
 89 | B<or>
 90 | 
 91 | =over
 92 | 
 93 | =item C<for matrix in *.tsv; do perl transpose_matrix.pl "$matrix" E<gt> "${matrix%.*}_transposed.tsv"; done>
 94 | 
 95 | =back
 96 | 
 97 | B<or>
 98 | 
 99 | =over
100 | 
101 | =item C<perl prot_finder.pl -r report.blastp -s subject.faa | perl prot_binary_matrix.pl -l -c | perl transpose_matrix.pl -d , E<gt> binary_matrix_transposed.csv>
102 | 
103 | =back
104 | 
105 | B<or>
106 | 
107 | =over
108 | 
109 | =item C<mkdir result_dir && ./prot_finder_pipe.sh -q query.faa -s subject.faa -d result_dir -m | tee result_dir/blast_hits.tsv | perl prot_binary_matrix.pl | tee result_dir/binary_matrix.tsv | perl transpose_matrix.pl E<gt> result_dir/binary_matrix_transposed.tsv>
110 | 
111 | =back
112 | 
113 | =head1 VERSION
114 | 
115 |  0.1                                                       12-04-2016
116 | 
117 | =head1 AUTHOR
118 | 
119 |  Andreas Leimbach                               aleimba[at]gmx[dot]de
120 | 
121 | =head1 ACKNOWLEDGEMENT
122 | 
123 | The Perl implementation for transposing a matrix on Stack Overflow
124 | was very useful:
125 | L<https://stackoverflow.com/questions/1729824/transpose-a-file-in-bash>
126 | 
127 | =head1 LICENSE
128 | 
129 | This program is free software: you can redistribute it and/or modify
130 | it under the terms of the GNU General Public License as published by
131 | the Free Software Foundation; either version 3 (GPLv3) of the
132 | License, or (at your option) any later version.
133 | 
134 | This program is distributed in the hope that it will be useful, but
135 | WITHOUT ANY WARRANTY; without even the implied warranty of
136 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
137 | General Public License for more details.
138 | 
139 | You should have received a copy of the GNU General Public License
140 | along with this program. If not, see L<http://www.gnu.org/licenses/>.
141 | 
142 | =cut
143 | 
144 | 
145 | ########
146 | # MAIN #
147 | ########
148 | 
149 | use strict;
150 | use warnings;
151 | use autodie;
152 | use Getopt::Long;
153 | use Pod::Usage;
154 | 
155 | ### Get the options with Getopt::Long
156 | my $Delimiter = "\t"; # set separator/delimiter of input/output matrix
157 | my $Empty; # optionally, fill empty cells with a value
158 | my $VERSION = 0.1;
159 | my ($Opt_Version, $Opt_Help);
160 | GetOptions ('delimiter=s' => \$Delimiter,
161 |             'empty=s' => \$Empty,
162 |             'version' => \$Opt_Version,
163 |             'help|?' => \$Opt_Help)
164 |             or pod2usage(-verbose => 1, -exitval => 2);
165 | 
166 | 
167 | ### Run perldoc on POD and set option defaults
168 | pod2usage(-verbose => 2) if ($Opt_Help);
169 | die "$0 $VERSION\n" if ($Opt_Version);
170 | 
171 | 
172 | ### Check input
173 | if (-t STDIN && ! @ARGV) {
174 |     my $warning = "\n### Fatal error: No STDIN and no input file given as argument, please supply one of them and/or see help with '-h'!\n";
175 |     pod2usage(-verbose => 0, -message => $warning, -exitval => 2);
176 | } elsif (!-t STDIN && @ARGV) {
177 |     my $warning = "\n### Fatal error: Both STDIN and an input file given as argument, please supply only either one and/or see help with '-h'!\n";
178 |     pod2usage(-verbose => 0, -message => $warning, -exitval => 2);
179 | }
180 | die "\n### Fatal error: Too many arguments given, only STDIN or one input file allowed as argument! Please see the usage with option '-h' if unclear!\n\n" if (@ARGV > 1);
181 | die "\n### Fatal error: File '$ARGV[0]' does not exist!\n\n" if (@ARGV && $ARGV[0] ne '-' && !-e $ARGV[0]);
182 | 
183 | 
184 | ### Parse input matrix
185 | my %Input_Matrix; # hash of hash to store the input matrix
186 | my $Max_Columns = 0; # maximum number of columns, needed in case not every row of input matrix has the same number of columns
187 | my $Row_Num = 0; # count input matrix number of rows
188 | while (<>) {
189 |     chomp;
190 |     warn "### Warning: Set separator/delimiter '$Delimiter' (option '-d') not found in the following first line/header of input matrix, sure the correct one is set?\n$_\n\n" if ($_ !~ /\Q$Delimiter\E/ && $. == 1); # '\Q...\E' to match the delimiter literally
191 | 
192 |     my $col_num = 0; # count number of columns for each row
193 |     foreach my $cell (split(/\Q$Delimiter\E/)) { # split each row for the cells; '\Q...\E' so that delimiters with regex metacharacters (e.g. '|' or '.') are matched literally
194 |         $cell = $Empty if ($cell =~ /^$/); # needed for empty cells in between cells with values, because for these $cell is defined in print out below
195 |         $Input_Matrix{$Row_Num}{$col_num++} = $cell;
196 |     }
197 | 
198 |     $Max_Columns = $col_num if ($col_num > $Max_Columns);
199 |     $Row_Num++;
200 | }
201 | 
202 | 
203 | ### Print out transposed matrix
204 | my $Max_Rows = $Row_Num;
205 | for (my $col_num = 0; $col_num < $Max_Columns; $col_num++) {
206 |     for ($Row_Num = 0; $Row_Num < $Max_Rows; $Row_Num++) { # repurposing $Row_Num
207 |         print "$Delimiter" if ($Row_Num > 0); # separator only after the first transposed column
208 |         if (defined $Input_Matrix{$Row_Num}{$col_num}) { # 'defined' needed, in case $cell has '0' as value
209 |             print $Input_Matrix{$Row_Num}{$col_num};
210 |         } elsif (defined $Empty) { # for rows of the input matrix with columns < $Max_Columns; 'defined' needed, in case $Empty is set to '0'
211 |             print $Empty;
212 |         }
213 |     }
214 |     print "\n";
215 | }
216 | 
217 | exit;
218 | 


--------------------------------------------------------------------------------
/rename_fasta_id/README.md:
--------------------------------------------------------------------------------
  1 | rename_fasta_id
  2 | ===============
  3 | 
  4 | `rename_fasta_id.pl` is a script to rename fasta IDs according to regular expressions.
  5 | 
  6 | * [Synopsis](#synopsis)
  7 | * [Description](#description)
  8 | * [Usage](#usage)
  9 | * [Options](#options)
 10 |   * [Mandatory options](#mandatory-options)
 11 |   * [Optional options](#optional-options)
 12 | * [Output](#output)
 13 | * [Run environment](#run-environment)
 14 | * [Author - contact](#author---contact)
 15 | * [Citation, installation, and license](#citation-installation-and-license)
 16 | * [Changelog](#changelog)
 17 | 
 18 | ## Synopsis
 19 | 
 20 |     perl rename_fasta_id.pl -i file.fasta -p "NODE_.+$" -r "K-12_" -n -a c > out.fasta
 21 | 
 22 | **or**
 23 | 
 24 |     zcat file.fasta.gz | perl rename_fasta_id.pl -i - -p "coli" -r "" -o > out.fasta
 25 | 
 26 | ## Description
 27 | 
 28 | This script uses the built-in Perl substitution operator `s///` to
 29 | replace strings in FASTA IDs. To do this, a **pattern** and a
 30 | **replacement** have to be provided (Perl regular expression syntax
 31 | can be used). The leading '>' character for the FASTA ID will be
 32 | removed before the substitution and added again afterwards. FASTA
 33 | IDs will be searched for matches with the **pattern**, and if found
 34 | the **pattern** will be replaced by the **replacement**.
 35 | 
 36 | **IMPORTANT**: Enclose the **pattern** and the **replacement** in
 37 | quotation marks (' or ") if they contain characters that would be
 38 | interpreted by the shell (e.g. pipes '|', brackets etc.).
 39 | 
 40 | For substitutions without any appendices in a UNIX OS you can of
 41 | course just use the great
 42 | [`sed`](https://www.gnu.org/software/sed/manual/sed.html) (see
 43 | `man sed`), e.g.:
 44 | 
 45 |     sed 's/^>pattern/>replacement/' file.fasta
 46 | 
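When the renamed IDs also need a running number, plain `sed` is not enough. As a rough illustration of what **-n** combined with **-a** produces, the sketch below mimics `-p "NODE_.+$" -r "K-12_" -n -a c` with `awk`. It is an approximation under stated assumptions, not the script itself: the two-record input is made up, and `awk` uses POSIX extended regular expressions rather than Perl syntax.

```shell
# Hypothetical two-contig draft assembly (made-up IDs and sequences).
fasta='>NODE_1_length_100
ACGT
>NODE_2_length_50
TTGA'

# Replace the pattern in each matching ID line, then append a running
# count and the suffix "c" to the end of the ID, as -n and -a do.
renamed=$(printf '%s\n' "$fasta" |
  awk '/^>/ && /NODE_.+$/ { sub(/NODE_.+$/, "K-12_"); n++; print $0 n "c"; next }
       { print }')

printf '%s\n' "$renamed"
```

The count and the appended string are placed at the end of the whole ID, so the two IDs become `>K-12_1c` and `>K-12_2c`, while sequence lines pass through unchanged.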
 47 | ## Usage
 48 | 
 49 |     perl rename_fasta_id.pl -i file.fasta -p "T" -r "a" -c -g -o
 50 | 
 51 | ## Options
 52 | 
 53 | ### Mandatory options
 54 | 
 55 | - -i, -input
 56 | 
 57 | Input FASTA file or piped STDIN (-) from a gzipped file
 58 | 
 59 | - -p, -pattern
 60 | 
 61 | Pattern to be replaced in FASTA ID
 62 | 
 63 | - -r, -replacement
 64 | 
 65 | Replacement to replace the pattern with. To entirely remove the pattern use '' or "" as input for **-r**.
 66 | 
 67 | ### Optional options
 68 | 
 69 | - -h, -help
 70 | 
 71 | Help (perldoc POD)
 72 | 
 73 | - -c, -case-insensitive
 74 | 
 75 | Match the pattern case-insensitively
 76 | 
 77 | - -g, -global
 78 | 
 79 | Replace pattern globally in the string
 80 | 
 81 | - -n, -numerate
 82 | 
 83 | Append a numeration/the count of the pattern hits to the replacement. This is e.g. useful to number contigs consecutively in a draft genome.
 84 | 
 85 | - -a, -append
 86 | 
 87 | Append a string after the numeration, e.g. 'c' for chromosome
 88 | 
 89 | - -o, -output
 90 | 
 91 | Verbose output of the substitutions that were carried out, printed to *STDERR*
 92 | 
 93 | - -v, -version
 94 | 
 95 | Print version number to *STDERR*
 96 | 
 97 | ## Output
 98 | 
 99 | - *STDOUT*
100 | 
101 | The FASTA file with substituted ID lines is printed to *STDOUT*. Redirect or pipe into another tool as needed.
102 | 
103 | ## Run environment
104 | 
105 | The Perl script runs under Windows and UNIX flavors.
106 | 
107 | ## Author - contact
108 | 
109 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
110 | 
111 | ## Citation, installation, and license
112 | 
113 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
114 | 
115 | ## Changelog
116 | 
117 | - v0.1 (09.11.2014)
118 | 


--------------------------------------------------------------------------------
/rename_fasta_id/rename_fasta_id.pl:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/perl
  2 | 
  3 | #######
  4 | # POD #
  5 | #######
  6 | 
  7 | =pod
  8 | 
  9 | =head1 NAME
 10 | 
 11 | C<rename_fasta_id.pl> - rename fasta IDs according to regular expressions
 12 | 
 13 | =head1 SYNOPSIS
 14 | 
 15 | C<perl rename_fasta_id.pl -i file.fasta -p "NODE_.+$" -r "K-12_" -n -a c E<gt> out.fasta>
 16 | 
 17 | B<or>
 18 | 
 19 | C<zcat file.fasta.gz | perl rename_fasta_id.pl -i - -p "coli" -r "" -o E<gt> out.fasta>
 20 | 
 21 | =head1 DESCRIPTION
 22 | 
 23 | This script uses the built-in Perl substitution operator C<s///> to
 24 | replace strings in FASTA IDs. To do this, a B<pattern> and a
 25 | B<replacement> have to be provided (Perl regular expression syntax
 26 | can be used). The leading '>' character for the FASTA ID will be
 27 | removed before the substitution and added again afterwards. FASTA
 28 | IDs will be searched for matches with the B<pattern>, and if found
 29 | the B<pattern> will be replaced by the B<replacement>.
 30 | 
 31 | B<IMPORTANT>: Enclose the B<pattern> and the B<replacement> in
 32 | quotation marks (' or ") if they contain characters that would be
 33 | interpreted by the shell (e.g. pipes '|', brackets etc.).
 34 | 
 35 | For substitutions without any appendices in a UNIX OS you can of
 36 | course just use the great
 37 | L<C<sed>|https://www.gnu.org/software/sed/manual/sed.html> (see
 38 | C<man sed>), e.g.:
 39 | 
 40 | C<sed 's/^E<gt>pattern/E<gt>replacement/' file.fasta>
 41 | 
 42 | =head1 OPTIONS
 43 | 
 44 | =head2 Mandatory options
 45 | 
 46 | =over 20
 47 | 
 48 | =item B<-i>=I<str>, B<-input>=I<str>
 49 | 
 50 | Input FASTA file or piped STDIN (-) from a gzipped file
 51 | 
 52 | =item B<-p>=I<str>, B<-pattern>=I<str>
 53 | 
 54 | Pattern to be replaced in FASTA ID
 55 | 
 56 | =item B<-r>=I<str>, B<-replacement>=I<str>
 57 | 
 58 | Replacement to replace the pattern with. To entirely remove the
 59 | pattern use '' or "" as input for B<-r>.
 60 | 
 61 | =back
 62 | 
 63 | =head2 Optional options
 64 | 
 65 | =over 20
 66 | 
 67 | =item B<-h>, B<-help>
 68 | 
 69 | Help (perldoc POD)
 70 | 
 71 | =item B<-c>, B<-case-insensitive>
 72 | 
 73 | Match the pattern case-insensitively
 74 | 
 75 | =item B<-g>, B<-global>
 76 | 
 77 | Replace pattern globally in the string
 78 | 
 79 | =item B<-n>, B<-numerate>
 80 | 
 81 | Append a numeration/the count of the pattern hits to the
 82 | replacement. This is e.g. useful to number contigs consecutively in
 83 | a draft genome.
 84 | 
 85 | =item B<-a>=I<str>, B<-append>=I<str>
 86 | 
 87 | Append a string after the numeration, e.g. 'c' for chromosome
 88 | 
 89 | =item B<-o>, B<-output>
 90 | 
 91 | Verbose output of the substitutions that were carried out, printed
 92 | to C<STDERR>
 93 | 
 94 | =item B<-v>, B<-version>
 95 | 
 96 | Print version number to C<STDERR>
 97 | 
 98 | =back
 99 | 
100 | =head1 OUTPUT
101 | 
102 | =over 20
103 | 
104 | =item C<STDOUT>
105 | 
106 | The FASTA file with substituted ID lines is printed to C<STDOUT>.
107 | Redirect or pipe into another tool as needed.
108 | 
109 | =back
110 | 
111 | =head1 EXAMPLES
112 | 
113 | =over
114 | 
115 | =item C<perl rename_fasta_id.pl -i file.fasta -p "T" -r "a" -c -g -o>
116 | 
117 | =back
118 | 
119 | =head1 VERSION
120 | 
121 |  0.1                                                 09-11-2014
122 | 
123 | =head1 AUTHOR
124 | 
125 |  Andreas Leimbach                                    aleimba[at]gmx[dot]de
126 | 
127 | =head1 LICENSE
128 | 
129 | This program is free software: you can redistribute it and/or modify
130 | it under the terms of the GNU General Public License as published by
131 | the Free Software Foundation; either version 3 (GPLv3) of the License,
132 | or (at your option) any later version.
133 | 
134 | This program is distributed in the hope that it will be useful, but
135 | WITHOUT ANY WARRANTY; without even the implied warranty of
136 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
137 | General Public License for more details.
138 | 
139 | You should have received a copy of the GNU General Public License
140 | along with this program. If not, see L<http://www.gnu.org/licenses/>.
141 | 
142 | =cut
143 | 
144 | 
145 | ########
146 | # MAIN #
147 | ########
148 | 
149 | use strict;
150 | use warnings;
151 | use autodie;
152 | use Getopt::Long;
153 | use Pod::Usage;
154 | 
155 | ### Get the options with Getopt::Long
156 | my $Input_File; # input fasta file
157 | my $Pattern; # pattern to search for in the FASTA IDs
158 | my $Replacement; # replacement string to substitute for the pattern
159 | my $Opt_Case; # substitute case-insensitive
160 | my $Opt_Global; # substitute pattern globally in string
161 | my $Opt_Numerate; # append the count of the performed substitutions to each replacement
162 | my $Append; # append an additional string after $Opt_Numerate
163 | my $Opt_Output; # print substitutions to STDERR
164 | my $VERSION = 0.1;
165 | my ($Opt_Version, $Opt_Help);
166 | GetOptions ('input=s' => \$Input_File,
167 |             'pattern=s' => \$Pattern,
168 |             'replacement=s' => \$Replacement,
169 |             'case-insensitive' => \$Opt_Case,
170 |             'global' => \$Opt_Global,
171 |             'numerate' => \$Opt_Numerate,
172 |             'append:s' => \$Append,
173 |             'output' => \$Opt_Output,
174 |             'version' => \$Opt_Version,
175 |             'help|?' => \$Opt_Help);
176 | 
177 | 
178 | 
179 | ### Run perldoc on POD
180 | pod2usage(-verbose => 2) if ($Opt_Help);
181 | die "$0 $VERSION\n" if ($Opt_Version);
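182 | if (!$Input_File || !defined $Pattern || $Pattern eq '' || !defined $Replacement) { # '-r' is mandatory, but may be an empty string
183 |     my $warning = "\n### Fatal error: Options '-i', '-p', or '-r' or their arguments are missing!\n";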
184 |     pod2usage(-verbose => 1, -message => $warning, -exitval => 2);
185 | }
186 | 
187 | 
188 | 
189 | ### Pipe input from STDIN or open input file
190 | my $Input_Fh;
191 | if ($Input_File eq '-') { # file input via STDIN
192 |     $Input_Fh = *STDIN; # capture typeglob of STDIN
193 | } else { # input via input file
194 |     open ($Input_Fh, "<", "$Input_File");
195 | }
196 | 
197 | 
198 | 
199 | ### Parse FASTA file
200 | my $Substitution_Count = 0; # count substitutions
201 | while (<$Input_Fh>) {
202 |     chomp;
203 | 
204 |     # only substitute in FASTA ID lines
205 |     if (/^>/) {
206 |         # only substitute if pattern found, case-sensitive or case-INsensitive
207 |         if (/$Pattern/ || (/$Pattern/i && $Opt_Case)) {
208 |             substitute_string($_); # subroutine prints the renamed ID line itself
209 | 
210 |         # "reprint" FASTA IDs, which don't fit the pattern
211 |         } else {
212 |             print "$_\n";
213 |         }
214 | 
215 |     # "reprint" sequence/non-ID lines of FASTA files
216 |     } else {
217 |         print "$_\n";
218 |     }
219 | }
220 | print STDERR "$Substitution_Count substitutions have been carried out\n";
221 | 
222 | exit;
223 | 
224 | 
225 | #############
226 | #Subroutines#
227 | #############
228 | 
229 | ### Subroutine to rename headers/ID lines of the FASTA file
230 | sub substitute_string {
231 |     my $string = shift;
232 |     $string =~ s/^>//; # get rid of '>', append afterwards
233 | 
234 |     print STDERR "$string " if ($Opt_Output); # optional verbose output to STDERR
235 |     $Substitution_Count++; # count occurrences of carried out substitutions
236 | 
237 |     # substitutions
238 |     if ($Opt_Global && $Opt_Case) {
239 |         $string =~ s/$Pattern/$Replacement/gi;
240 |     } elsif ($Opt_Case) {
241 |         $string =~ s/$Pattern/$Replacement/i;
242 |     } elsif ($Opt_Global) {
243 |         $string =~ s/$Pattern/$Replacement/g;
244 |     } else {
245 |         $string =~ s/$Pattern/$Replacement/;
246 |     }
247 | 
248 |     # output to STDOUT, optionally STDERR
249 |     print ">$string";
250 |     print STDERR "-> $string" if ($Opt_Output);
251 |     if ($Opt_Numerate) {
252 |         print "$Substitution_Count";
253 |         print STDERR "$Substitution_Count" if ($Opt_Output);
254 |     }
255 | 
256 |     if ($Append) {
257 |         print "$Append";
258 |         print STDERR "$Append" if ($Opt_Output);
259 |     }
260 | 
261 |     print "\n";
262 |     print STDERR "\n" if ($Opt_Output);
263 | 
264 |     return 1;
265 | }
266 | 


--------------------------------------------------------------------------------
/revcom_seq/README.md:
--------------------------------------------------------------------------------
 1 | revcom_seq
 2 | ==========
 3 | 
 4 | `revcom_seq.pl` is a script to reverse complement (multi-)sequence files.
 5 | 
 6 | * [Synopsis](#synopsis)
 7 | * [Description](#description)
 8 | * [Usage](#usage)
 9 | * [Options](#options)
10 | * [Output](#output)
11 | * [Run environment](#run-environment)
12 | * [Dependencies](#dependencies)
13 | * [Author - contact](#author---contact)
14 | * [Citation, installation, and license](#citation-installation-and-license)
15 | * [Changelog](#changelog)
16 | 
17 | ## Synopsis
18 | 
19 |     perl revcom_seq.pl seq-file.embl > seq-file_revcom.embl
20 | 
21 | **or**
22 | 
23 |     perl cat_seq.pl multi-seq_file.embl | perl revcom_seq.pl -i embl > seq_file_cat_revcom.embl
24 | 
25 | ## Description
26 | 
27 | This script reverse complements (multi-)sequence files. The
28 | features/annotations in RichSeq files (e.g. EMBL or GENBANK format)
29 | will also be adapted accordingly. Use option **-o** to specify a
30 | different output sequence format. Input files can be given directly via
31 | *STDIN* or as a file. If *STDIN* is used, the input sequence file
32 | format has to be given with option **-i**. Be careful to set the
33 | correct input format.
34 | 
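For a plain nucleotide string the operation itself is simple. The sketch below only illustrates the idea of reverse complementation and is no substitute for the script, which relies on BioPerl and also adapts feature coordinates in RichSeq files. The function name `revcom` and the test sequence are made up for illustration, and only the unambiguous bases A, C, G, and T are handled.

```shell
# Reverse complement a single sequence string (hypothetical helper).
revcom() {
  printf '%s\n' "$1" |
    tr 'ACGTacgt' 'TGCAtgca' |   # complement each base
    awk '{ out = ""
           for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
           print out }'          # then reverse the string
}

revcom 'AAGCT'   # prints AGCTT
```

Complementing and reversing can be done in either order; the result is the same.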
35 | ## Usage
36 | 
37 |     perl revcom_seq.pl -o gbk seq-file.embl > seq-file_revcom.gbk
38 | 
 39 | **or** reverse complement all EMBL files (`*.embl`) in the current working directory:
40 | 
41 |     for file in *.embl; do perl revcom_seq.pl -o fasta "$file" > "${file%.embl}"_revcom.fasta; done
42 | 
43 | ## Options
44 | 
45 | - **-h**, **-help**
46 | 
47 |     Help (perldoc POD)
48 | 
49 | - **-o**=*str*, **-outformat**=*str*
50 | 
51 |     Specify different sequence format for the output [fasta, embl, or gbk]
52 | 
53 | - **-i**=*str*, **-informat**=*str*
54 | 
55 |     Specify the input sequence file format, only needed for *STDIN* input
56 | 
57 | - **-v**, **-version**
58 | 
59 |     Print version number to *STDOUT*
60 | 
61 | ## Output
62 | 
63 | - *STDOUT*
64 | 
65 |     The reverse complemented sequence file is printed to *STDOUT*.
66 |     Redirect or pipe into another tool as needed.
67 | 
68 | ## Run environment
69 | 
70 | The Perl script runs under Windows and UNIX flavors.
71 | 
72 | ## Dependencies
73 | 
74 | - [**BioPerl**](http://www.bioperl.org) (tested version 1.007001)
75 | 
76 | ## Author - contact
77 | 
78 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
79 | 
80 | ## Citation, installation, and license
81 | 
82 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
83 | 
84 | ## Changelog
85 | 
86 | * v0.2 (2015-12-10)
87 |     * included a POD instead of a simple usage text
88 |     * included `pod2usage` with Pod::Usage
89 |     * included 'use autodie' pragma
90 |     * options with Getopt::Long
91 |     * output format now specified with option **-o**
92 |     * included version switch, **-v**
93 |     * allowed file and *STDIN* input, instead of only file; thus new option **-i** for input format
94 |     * output printed to *STDOUT* now, instead of output file
95 |     * fixed bug, that only first sequence in multi-sequence file is reverse complemented. Now all sequences in a multi-seq file are reverse complemented.
96 | * v0.1 (2013-02-08)
97 | 


--------------------------------------------------------------------------------
/revcom_seq/revcom_seq.pl:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/perl
  2 | 
  3 | #######
  4 | # POD #
  5 | #######
  6 | 
  7 | =pod
  8 | 
  9 | =head1 NAME
 10 | 
 11 | C<revcom_seq.pl> - reverse complement (multi-)sequence files
 12 | 
 13 | =head1 SYNOPSIS
 14 | 
 15 | C<perl revcom_seq.pl seq-file.embl E<gt> seq-file_revcom.embl>
 16 | 
 17 | B<or>
 18 | 
 19 | C<perl cat_seq.pl multi-seq_file.embl | perl revcom_seq.pl -i embl
 20 | E<gt> seq_file_cat_revcom.embl>
 21 | 
 22 | =head1 DESCRIPTION
 23 | 
 24 | This script reverse complements (multi-)sequence files. The
 25 | features/annotations in RichSeq files (e.g. EMBL or GENBANK format)
 26 | will also be adapted accordingly. Use option B<-o> to specify a
 27 | different output sequence format. Input files can be given directly via
 28 | C<STDIN> or as a file. If C<STDIN> is used, the input sequence file
 29 | format has to be given with option B<-i>. Be careful to set the correct
 30 | input format.
 31 | 
 32 | =head1 OPTIONS
 33 | 
 34 | =over 20
 35 | 
 36 | =item B<-h>, B<-help>
 37 | 
 38 | Help (perldoc POD)
 39 | 
 40 | =item B<-o>=I<str>, B<-outformat>=I<str>
 41 | 
 42 | Specify different sequence format for the output [fasta, embl, or gbk]
 43 | 
 44 | =item B<-i>=I<str>, B<-informat>=I<str>
 45 | 
 46 | Specify the input sequence file format, only needed for C<STDIN> input
 47 | 
 48 | =item B<-v>, B<-version>
 49 | 
 50 | Print version number to C<STDOUT>
 51 | 
 52 | =back
 53 | 
 54 | =head1 OUTPUT
 55 | 
 56 | =over 20
 57 | 
 58 | =item C<STDOUT>
 59 | 
 60 | The reverse complemented sequence file is printed to C<STDOUT>.
 61 | Redirect or pipe into another tool as needed.
 62 | 
 63 | =back
 64 | 
 65 | =head1 EXAMPLES
 66 | 
 67 | =over
 68 | 
 69 | =item C<perl revcom_seq.pl -o gbk seq-file.embl E<gt>
 70 | seq-file_revcom.gbk>
 71 | 
 72 | =back
 73 | 
 74 | B<or>
 75 | 
 76 | =over
 77 | 
 78 | =item C<for file in *.embl; do perl revcom_seq.pl -o fasta "$file"
 79 | E<gt> "${file%.embl}"_revcom.fasta; done>
 80 | 
 81 | =back
 82 | 
 83 | =head1 DEPENDENCIES
 84 | 
 85 | =over
 86 | 
 87 | =item B<L<BioPerl|http://www.bioperl.org>>
 88 | 
 89 | Tested with BioPerl version 1.007001
 90 | 
 91 | =back
 92 | 
 93 | =head1 VERSION
 94 | 
 95 |  0.2                                               update: 2015-12-10
 96 |  0.1                                                       2013-08-02
 97 | 
 98 | =head1 AUTHOR
 99 | 
100 |  Andreas Leimbach                               aleimba[at]gmx[dot]de
101 | 
102 | =head1 LICENSE
103 | 
104 | This program is free software: you can redistribute it and/or modify
105 | it under the terms of the GNU General Public License as published by
106 | the Free Software Foundation; either version 3 (GPLv3) of the
107 | License, or (at your option) any later version.
108 | 
109 | This program is distributed in the hope that it will be useful, but
110 | WITHOUT ANY WARRANTY; without even the implied warranty of
111 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
112 | General Public License for more details.
113 | 
114 | You should have received a copy of the GNU General Public License
115 | along with this program. If not, see L<http://www.gnu.org/licenses/>.
116 | 
117 | =cut
118 | 
119 | 
120 | ########
121 | # MAIN #
122 | ########
123 | 
124 | use strict;
125 | use warnings;
126 | use autodie;
127 | use Getopt::Long;
128 | use Pod::Usage;
129 | use Bio::SeqIO; # bioperl module to handle sequence input/output
130 | #use Bio::Seq; # bioperl module to handle sequences with features ### apparently not needed, methods inherited
131 | use Bio::SeqUtils; # bioperl module with additional methods (including features) for Bio::Seq objects; loaded explicitly because 'Bio::SeqUtils->revcom_with_features' is called below
132 | 
133 | ### Get options with Getopt::Long
134 | my $In_Format; # input seq file format needed for STDIN
135 | my $Out_Format; # optional different output seq file format
136 | my $VERSION = 0.2;
137 | my ($Opt_Version, $Opt_Help);
138 | GetOptions ('informat=s' => \$In_Format,
139 |             'outformat=s' => \$Out_Format,
140 |             'version' => \$Opt_Version,
141 |             'help|?' => \$Opt_Help)
142 |             or pod2usage(-verbose => 1, -exitval => 2);
143 | 
144 | 
145 | 
146 | ### Run perldoc on POD
147 | pod2usage(-verbose => 2) if ($Opt_Help);
148 | if ($Opt_Version) {
149 |     print "$0 $VERSION\n";
150 |     exit;
151 | }
152 | 
153 | 
154 | 
155 | ### Check input (@ARGV and STDIN)
156 | if (-t STDIN && ! @ARGV) {
157 |     my $warning = "\n### Fatal error: No STDIN and no input file given as argument, please supply one of them and/or see help with '-h'!\n";
158 |     pod2usage(-verbose => 0, -message => $warning, -exitval => 2);
159 | } elsif (!-t STDIN && @ARGV) {
160 |     my $warning = "\n### Fatal error: Both STDIN and an input file given as argument, please supply only either one and/or see help with '-h'!\n";
161 |     pod2usage(-verbose => 0, -message => $warning, -exitval => 2);
162 | }
163 | die "\n### Fatal error: Too many arguments given, only STDIN or one input file allowed as argument! Please see the usage with option '-h' if unclear!\n" if (@ARGV > 1);
164 | die "\n### Fatal error: File '$ARGV[0]' does not exist!\n" if (@ARGV && $ARGV[0] ne '-' && !-e $ARGV[0]);
165 | 
166 | 
167 | 
168 | ### Bio::SeqIO objects for input and output
169 | print STDERR "\nReverse complementing";
170 | my $Seqin; # Bio::SeqIO object
171 | if (-t STDIN) { # input from file
172 |     warn "\n### Warning: Ignoring input file format ('-i $In_Format'), because input file given and not STDIN!\n\n" if ($In_Format);
173 |     my $seq_file = shift;
174 |     $Seqin = Bio::SeqIO->new(-file => "<$seq_file"); # Bio::SeqIO object; no '-format' given, leave it to bioperl guessing
175 |     print STDERR " '$seq_file' ";
176 | } elsif (!-t STDIN) { # input from STDIN
177 |     die "\n### Fatal error: Sequence file given as STDIN requires an input file format, please set one with option '-i' and/or see help with '-h'!\n" if (!$In_Format);
178 |     $In_Format = 'genbank' if ($In_Format =~ /(gbk|gb)/i); # allow shorter format string for 'genbank'
179 |     $Seqin = Bio::SeqIO->new(-fh => \*STDIN, -format => $In_Format); # capture typeglob of STDIN, requires '-format'
180 |     print STDERR " input file ";
181 | }
182 | print STDERR "...\n";
183 | 
184 | my $Seqout; # Bio::SeqIO object
185 | if ($Out_Format) {
186 |     $Out_Format = 'genbank' if ($Out_Format =~ /(gbk|gb)/i);
187 | } else { # same format as input file
188 |     if (!-t STDIN) {
189 |         $Out_Format = $In_Format;
190 |     } else {
191 |         if (ref($Seqin) =~ /Bio::SeqIO::(genbank|embl|fasta)/) { # from bioperl guessing
192 |             $Out_Format = $1;
193 |         } else {
194 |             die "\n### Fatal error: Could not determine input file format, please set an output file format with option '-o'!\n";
195 |         }
196 |     }
197 | }
198 | $Seqout = Bio::SeqIO->new(-fh => \*STDOUT, -format => $Out_Format); # printing to STDOUT requires '-format'
199 | 
200 | 
201 | ### Write reverse complemented sequence (and its features) to STDOUT
202 | while (my $seq_obj = $Seqin->next_seq) { # Bio::Seq object; for multi-seq files
203 |     my $revcom = Bio::SeqUtils->revcom_with_features($seq_obj);
204 |     $Seqout->write_seq($revcom);
205 | }
206 | 
207 | exit;
208 | 


--------------------------------------------------------------------------------
/rod_finder/README.md:
--------------------------------------------------------------------------------
  1 | rod_finder
  2 | ==========
  3 | 
  4 | A script to find regions of difference (RODs) between a query genome and reference genome(s).
  5 | 
  6 | ## Synopsis
  7 | 
  8 |     blast_rod_finder_legacy.sh subject.fasta query.fasta query.[embl|gbk|fasta] 5000
  9 | 
 10 | or
 11 | 
 12 |     perl blast_rod_finder.pl -q query.embl -r blastn.out -m 2000
 13 | 
 14 | ## Description
 15 | 
 16 | This script is intended to identify RODs between a nucleotide query and a nucleotide subject/reference sequence. To do so, a *blastn* search (http://blast.ncbi.nlm.nih.gov/Blast.cgi) of the query against the subject sequences has to be run beforehand (see also *blast_rod_finder_legacy.sh* below). *blast_rod_finder.pl* is mainly designed to work with bacterial genomes, and a query genome can be blasted against several subject sequences to detect RODs over a number of references. Although the results are optimized towards a complete query genome, both the reference(s) and the query can be used in draft form. To create artificial genomes via concatenation use *cat_seq.pl* or the EMBOSS application union (http://emboss.sourceforge.net/).
 17 | 
 18 | The *blastn* report file, the query sequence file (preferably in RichSeq format, see below) and a minimum size for ROD detection have to be provided. Subsequently, RODs are summarized in a tab-separated summary file, a gff3 (usable e.g. in Artemis/DNAPlotter, http://www.sanger.ac.uk/resources/software/artemis/) and a BRIG (BLAST Ring Image Generator, http://brig.sourceforge.net/) output file. Nucleotide sequences of each ROD are written to a multi-fasta file.
 19 | 
 20 | The query sequence can be provided in RichSeq format (embl or genbank), but has to correspond to the fasta file used in querying the BLAST database (the accession numbers have to correspond to the fasta headers). Use *seq_format-converter.pl* to create a corresponding fasta file from embl|genbank files for *blastn* if needed. With RichSeq query files additional info is given in the result summary, and the amino acid sequences of all non-pseudo CDSs that are contained in or overlap a ROD are written to a result file. Furthermore, all detected RODs are saved in individual sequence files in the corresponding query sequence format.
 21 | 
 22 | Either run *blastn* and then the script *blast_rod_finder.pl* manually, one after the other, or use the bash shell wrapper script *blast_rod_finder_legacy.sh* (see usage below) to perform the pipeline with one command. The same folder has to contain the subject fasta file(s), the query fasta file, optionally the query RichSeq file and the script *blast_rod_finder.pl*! *blastn* is run **without** filtering of query sequences ('-F F') and an evalue cutoff of '2e-11' is set.
 23 | 
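Conceptually, a ROD is a stretch of the query that no *blastn* hit covers and that reaches the minimum size. The sketch below illustrates only that idea and is not the implementation in *blast_rod_finder.pl*: it assumes the hit coordinates on the query have already been extracted into two tab-separated columns (start, stop), and the function name `find_gaps` and the example coordinates are made up.

```shell
# Report uncovered query regions of at least MIN_SIZE bases.
# usage: find_gaps QUERY_LENGTH MIN_SIZE < hits.tsv   (hypothetical helper)
find_gaps() {
  sort -k1,1n | awk -v len="$1" -v min="$2" '
    BEGIN { covered_to = 0 }              # last query position covered so far
    {
      gap = $1 - covered_to - 1           # uncovered bases before this hit
      if (gap >= min)
        printf "%d\t%d\t%d\n", covered_to + 1, $1 - 1, gap
      if ($2 > covered_to) covered_to = $2
    }
    END {                                 # uncovered tail of the query
      if (len - covered_to >= min)
        printf "%d\t%d\t%d\n", covered_to + 1, len, len - covered_to
    }'
}

# Made-up hits on a 15,000 bp query, minimum ROD size 2000.
printf '800\t3000\n1\t1000\n9000\t12000\n' | find_gaps 15000 2000
```

Each output line gives start, stop, and size of an uncovered region. With these example hits the regions 3001..8999 and 12001..15000 are reported; raising the minimum size to 5000 would leave only the first.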
 24 | ## Usage
 25 | 
 26 | ### 1.) Manually, step by step
 27 | 
 28 | #### 1.1.) *blastn*
 29 | 
 30 |     formatdb -p F -i subject.fasta -n blast_db
 31 |     blastall -p blastn -d blast_db -i query.fasta -o blastn.out -e 2e-11 -F F
 32 | 
 33 | #### 1.2.) *blast_rod_finder.pl*
 34 | 
 35 |     perl blast_rod_finder.pl -q query.[embl|gbk|fasta] -r blastn.out -m 5000
 36 | 
 37 | ### 2.) With one command: *blast_rod_finder_legacy.sh* pipeline
 38 | 
 39 |     blast_rod_finder_legacy.sh subject.fasta query.fasta query.[embl|gbk|fasta] 5000
 40 | 
 41 | ## Options for *blast_rod_finder.pl*
 42 | 
 43 | ### Mandatory options
 44 | 
 45 | * -m, -min
 46 | 
 47 | Minimum size of RODs that are reported
 48 | 
 49 | * -q, -query
 50 | 
 51 | Query sequence file [fasta, embl, or genbank format]
 52 | 
 53 | * -r, -report
 54 | 
 55 | *blastn* report/output file
 56 | 
 57 | ### Optional options
 58 | 
 59 | * -h, -help:   Help (perldoc POD)
 60 | 
 61 | ## Output
 62 | 
 63 | ### a.) *blast_rod_finder_legacy.sh* or *blastn*
 64 | 
 65 | * *blastn* database files for subject sequence(s)
 66 | 
 67 | \*.nhr, \*.nin, \*.nsq
 68 | 
 69 | * *blastn* report
 70 | 
 71 | Text file named 'blastn.out'
 72 | 
 73 | ### b.) *blast_rod_finder.pl*
 74 | 
 75 | * ./results
 76 | 
 77 | All output files are stored in this result folder
 78 | 
 79 | * rod_summary.txt
 80 | 
 81 | Summary of detected ROD regions (for embl/genbank queries includes annotation), tab-separated
 82 | 
 83 | * rod.gff
 84 | 
 85 | GFF3 file with ROD coordinates to use in Artemis/DNAPlotter etc.
 86 | 
 87 | * rod_BRIG.txt
 88 | 
 89 | ROD coordinates to use in BRIG (BLAST Ring Image Generator), tab-separated
 90 | 
 91 | * rod_seq.fasta
 92 | 
 93 | Nucleotide sequences of ROD regions (>ROD# size start..stop), multi-fasta
 94 | 
 95 | * rod_aa_fasta.txt
 96 | 
 97 | Only present if query is in RichSeq format. Amino acid sequences of all CDSs that are contained in or overlap a ROD region in multi-fasta format (>locus_tag gene product). RODs are separated in the file via '\~\~' (\~\~ROD# size start..stop).
 98 | 
 99 | * ROD#.[embl|gbk]
100 | 
101 | Only present if query is in RichSeq format. Each identified ROD is written to an individual sequence file (in the same format as the query).
102 | 
103 | ## Run environment
104 | 
105 | The Perl script runs under Windows and UNIX flavors; the bash shell script runs only under UNIX.
106 | 
107 | ## Dependencies (not in the core Perl modules)
108 | 
109 | * Legacy blast (tested version blastall 2.2.18)
110 | * BioPerl (tested with version 1.006901)
111 | 
112 | ## Authors/contact
113 | 
114 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
115 | 
116 | David Studholme (original code; D[dot]J[dot]Studholme[at]exeter[dot]ac[dot]uk; University of Exeter)
117 | 
118 | ## Citation, installation, and license
119 | 
120 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
121 | 
122 | ## Changelog
123 | 
124 | * v0.4 (13.02.2013)
125 |     - included a POD
126 |     - options with Getopt::Long
127 |     - results directory for output files
128 |     - include accession number column for multi-sequence files in 'rod_summary.txt'
129 |     - include locus_tags (or alternatively gene, product, note ...) in 'rod_summary.txt'
130 |     - feature positions according to leading or lagging strand
131 |     - indicate if a primary feature overlaps ROD boundaries
132 |     - output each ROD in the query RichSeq format with BioPerl's Bio::SeqUtils
133 | * v0.3 (23.11.2011)
134 |     - status messages with autoflush
135 |     - BRIG output file
136 |     - extended primary tag output for RODs (in addition to CDS): tRNA, rRNA, tmRNA, ncRNA, misc_RNA, repeat_region, misc_binding, and mobile_element
137 | * v0.1 (07.11.2011)
138 | 


--------------------------------------------------------------------------------
/rod_finder/blast_rod_finder_legacy.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | echo "###Running legacy BLASTN with subject '$1' and query '$2'"
3 | formatdb -p F -i "$1" -n ROD
4 | blastall -p blastn -d ROD -i "$2" -o blastn.out -e 2e-11 -F F
5 | echo "###Running blast_rod_finder.pl with query '$3' and minimum ROD size '$4'"
6 | perl blast_rod_finder.pl -q "$3" -r blastn.out -m "$4"


--------------------------------------------------------------------------------
/sam_insert-size/README.md:
--------------------------------------------------------------------------------
  1 | sam_insert-size
  2 | ===============
  3 | 
  4 | `sam_insert-size.pl` is a script to calculate insert size and read length statistics for paired-end reads in SAM/BAM format.
  5 | 
  6 | * [Synopsis](#synopsis)
  7 | * [Description](#description)
  8 | * [Usage](#usage)
  9 | * [Options](#options)
 10 |   * [Mandatory options](#mandatory-options)
 11 |   * [Optional options](#optional-options)
 12 | * [Output](#output)
 13 | * [Run environment](#run-environment)
 14 | * [Dependencies](#dependencies)
 15 | * [Author - contact](#author---contact)
 16 | * [Acknowledgements](#acknowledgements)
 17 | * [Citation, installation, and license](#citation-installation-and-license)
 18 | * [Changelog](#changelog)
 19 | 
 20 | ## Synopsis
 21 | 
 22 |     perl sam_insert-size.pl -i file.sam
 23 | 
 24 | **or**
 25 | 
 26 |     samtools view -h file.bam | perl sam_insert-size.pl -i -
 27 | 
 28 | ## Description
 29 | 
 30 | Calculate insert size and read length statistics for paired-end reads
 31 | in SAM/BAM alignment format. The program gives the arithmetic mean,
 32 | median, and standard deviation (stdev) among other statistical values.
 33 | 
 34 | Insert size is defined as the total length of the original fragment
 35 | put into sequencing, i.e. the sequenced DNA fragment between the
 36 | adaptors. The 16-bit FLAG of the SAM/BAM file is used to filter reads
 37 | (see the [SAM specifications](http://samtools.sourceforge.net/SAM1.pdf)).
 38 | 
 39 | **Read length** statistics are calculated for all mapped reads
 40 | (irrespective of their pairing).
 41 | 
 42 | **Insert size** statistics are calculated only for **paired reads**.
 43 | Typically, the insert size is perturbed by artifacts, like chimeras,
 44 | structural re-arrangements or alignment errors, which result in a
 45 | very high maximum insert size measure. As a consequence the mean and
 46 | stdev can be strongly misleading regarding the real distribution. To
 47 | avoid this, two methods are implemented that first trim the insert
 48 | size distribution to a 'core' to calculate the respective statistics.
 49 | Additionally, secondary alignments for multiple mapping reads and
 50 | supplementary alignments for chimeric reads, as well as insert sizes
 51 | of zero are not considered (option **-min_ins_cutoff** is set to
 52 | **one** by default).
 53 | 
 54 | The **-a|-align** method includes only proper/concordant paired reads
 55 | in the statistical calculations (as determined by the mapper and the
 56 | options for insert size minimum and maximum used for mapping). This
 57 | is the **default** method.
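
The FLAG-based filtering can be illustrated with a short Python sketch. It is not the Perl implementation: the function name `insert_size` and its parameters are hypothetical, and it assumes the insert size is taken from the TLEN field (column 9), which the script may handle differently. The bit values themselves come from the SAM specification.

```python
# Bit masks of the SAM FLAG field (column 2), as defined in the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def insert_size(sam_line, method="align", min_ins=1, max_ins=None):
    """Return the insert size of one alignment line, or None if it is filtered out.

    Assumption: the insert size is abs(TLEN), column 9 of the SAM line.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag, tlen = int(fields[1]), abs(int(fields[8]))
    # Only paired reads count; skip secondary and supplementary alignments.
    if not flag & PAIRED or flag & (SECONDARY | SUPPLEMENTARY):
        return None
    if method == "align":
        # Align method: keep proper/concordant pairs only.
        if not flag & PROPER_PAIR:
            return None
    elif flag & (UNMAPPED | MATE_UNMAPPED):
        # Percentile method: read and mate must both be mapped.
        return None
    # Apply the minimal (default 1, drops zero) and optional maximal cutoff.
    if tlen < min_ins or (max_ins is not None and tlen > max_ins):
        return None
    return tlen


# Hypothetical alignment line: FLAG 99 = paired, proper pair, mate reverse, first in pair.
example = "read1\t99\tchr\t100\t60\t50M\t=\t300\t250\tACGT\tIIII"
size = insert_size(example)
```

Note that both mates of a pair carry the same absolute TLEN, so a real implementation has to decide whether to count each pair once or twice.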
 58 | 
 59 | The **-p|-percentile** method first calculates insert size statistics
 60 | for all read pairs, where the read and the mate are mapped ('raw
 61 | data'). Subsequently, insert sizes below the 10th and above the 90th
 62 | percentile are discarded to calculate the 10% truncated mean and
 63 | stdev. Discarding the lowest and highest 10% of insert sizes gives
 64 | the advantage of robustness (insensitivity to outliers) and higher
 65 | efficiency in heavy-tailed distributions.
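
To see why trimming matters, here is a minimal Python sketch of a 10% truncated mean and stdev. It is an illustration only: the function name `truncated_stats` is hypothetical, and the percentile definition (the default of Python's `statistics.quantiles`) is an assumption that may differ from the one used by the Perl script.

```python
import statistics


def truncated_stats(insert_sizes):
    """Mean and stdev of the values within the 10th to 90th percentile range."""
    # quantiles(n=10) returns the nine decile cut points; the first and
    # the last one are the 10th and the 90th percentile.
    deciles = statistics.quantiles(insert_sizes, n=10)
    low, high = deciles[0], deciles[-1]
    core = [x for x in insert_sizes if low <= x <= high]
    return statistics.mean(core), statistics.stdev(core)


# Toy data: 100 small values plus a single extreme outlier (e.g. a chimera).
raw = list(range(1, 101)) + [10000]
raw_mean = statistics.mean(raw)
core_mean, core_stdev = truncated_stats(raw)
```

With this toy data the single outlier pulls the raw mean far above the bulk of the values, while the truncated mean stays in the middle of the core distribution.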
 66 | 
 67 | Alternative tools, which are a lot faster, are [`CollectInsertSizeMetrics`](https://broadinstitute.github.io/picard/command-line-overview.html#CollectInsertSizeMetrics)
 68 | from [Picard Tools](https://broadinstitute.github.io/picard/) and
 69 | [`sam-stats`](https://code.google.com/p/ea-utils/wiki/SamStats) from
 70 | [ea-utils](https://code.google.com/p/ea-utils/).
 71 | 
 72 | ## Usage
 73 | 
 74 |     samtools view -h file.bam | perl sam_insert-size.pl -i - -p -d -f -min 50 -max 500 -n 2000000 -xlim_i 350 -xlim_r 200
 75 | 
 76 | ## Options
 77 | 
 78 | ### Mandatory options
 79 | 
 80 | - -i, -input
 81 | 
 82 |     Input SAM file or piped *STDIN* (-) from a BAM file e.g. with [`samtools view`](http://www.htslib.org/doc/samtools-1.1.html) from [Samtools](http://www.htslib.org/)
 83 | 
 84 | - -a, -align
 85 | 
 86 |     **Default method:** Align method to calculate insert size statistics, includes only reads which are mapped in a proper/concordant pair (as determined by the mapper). Excludes option **-p**.
 87 | 
 88 | **or**
 89 | 
 90 | - -p, -percentile
 91 | 
 92 |     Percentile method to calculate insert size statistics, includes only read pairs with an insert size within the 10th and the 90th percentile range of all mapped read pairs. However, the frequency distribution as well as the histogram will be plotted with the 'raw' insert size data before percentile filtering. Excludes option **-a**.
 93 | 
 94 | ### Optional options
 95 | 
 96 | - -h, -help
 97 | 
 98 |     Help (perldoc POD)
 99 | 
100 | - -d, -distro
101 | 
102 |     Create distribution histograms for the insert sizes and read lengths with [R](http://www.r-project.org/). The calculated median and mean (that are printed to *STDOUT*) are plotted as vertical lines into the histograms. Use them to check the correctness of the statistical calculations.
103 | 
104 | - -f, -frequencies
105 | 
106 |     Print the frequencies of the insert sizes and read lengths to tab-delimited files 'ins_frequency.txt' and 'read_frequency.txt', respectively.
107 | 
108 | - -max, -max_ins_cutoff
109 | 
110 |     Set a maximal insert size cutoff, all insert sizes above this cutoff will be discarded (doesn't affect read length). With **-min** and **-max** you can basically run both methods, by first running the script with **-p** and then using the 10th and 90th percentile of the 'raw data' as **-min** and **-max** for option **-a**.
111 | 
112 | - -min, -min_ins_cutoff
113 | 
114 |     Set a minimal insert size cutoff [default = 1]
115 | 
116 | - -n, -num_read
117 | 
118 |     Number of reads to sample for the calculations from the start of the SAM/BAM file. Representative statistics can usually be calculated from a fraction of the total SAM/BAM alignment file.
119 | 
120 | - -xlim_i, -xlim_ins
121 | 
122 |     Set an upper limit for the x-axis of the **'R' insert size** histogram, overriding automatic truncation of the histogram tail. The default cutoff is one and a half times the third quartile Q3 (75th percentile) value. The minimal cutoff is set to the lowest insert size automatically. Forces option **-d**.
123 | 
124 | - -xlim_r, -xlim_read
125 | 
126 |     Set an upper limit for the x-axis of the optional **'R' read length** histogram. Default value is as in **-xlim_i**. Forces option **-d**.
127 | 
128 | - -v, -version
129 | 
130 |     Print version number to *STDERR*
131 | 
132 | ## Output
133 | 
134 | - *STDOUT*
135 | 
136 |     Calculated stats are printed to *STDOUT*
137 | 
138 | - ./results
139 | 
140 |     All **optional** output files are stored in this results folder
141 | 
142 | - (./results/ins_frequency.txt)
143 | 
144 |     Frequencies of insert size 'raw data', tab-delimited
145 | 
146 | - (./results/ins_histo.pdf)
147 | 
148 |     Distribution histogram for the insert size 'raw data'
149 | 
150 | - (./results/read_frequency.txt)
151 | 
152 |     Frequencies of read lengths, tab-delimited
153 | 
154 | - (./results/read_histo.pdf)
155 | 
156 |     Distribution histogram for the read lengths. Not informative if there's no variation in the read lengths.
157 | 
158 | ## Run environment
159 | 
160 | The Perl script runs under Windows and UNIX flavors.
161 | 
162 | ## Dependencies
163 | 
164 | - `Statistics::Descriptive`
165 | 
166 |     Perl module to calculate descriptive statistics, if not installed already get it from [CPAN](http://www.cpan.org/)
167 | 
168 | - Statistical computing language [R](http://www.r-project.org/)
169 | 
170 |     `Rscript` is needed to plot the histograms with option **-d**
171 | 
172 | ## Author - contact
173 | 
174 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
175 | 
176 | ## Acknowledgements
177 | 
178 | References/thanks go to:
179 | 
180 | - Tobias Rausch's online courses/workshops (EMBL Heidelberg) on the introduction to SAM files and flags (http://www.embl.de/~rausch/)
181 | 
182 | - The CBS NGS Analysis course for the percentile filtering idea (http://www.cbs.dtu.dk/courses/27626/programme.php)
183 | 
184 | ## Citation, installation, and license
185 | 
186 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
187 | 
188 | ## Changelog
189 | 
190 | - v0.2 (29.10.2014)
191 |     - Fixed bug for options '-min_ins_size' and '-max_ins_size'
192 |     - warn if result files already exist
193 |     - simplify prints to R script with Perl function 'select'
194 |     - minor Perl syntax changes so all Perl scripts conform to the same syntax
195 |     - minor changes to POD
196 |     - finally included README.md
197 | - v0.1 (27.11.2013)
198 | 


--------------------------------------------------------------------------------
/sample_fastx-txt/README.md:
--------------------------------------------------------------------------------
  1 | sample_fastx-txt
  2 | ================
  3 | 
  4 | `sample_fastx-txt.pl` is a script to randomly subsample FASTA, FASTQ, or TEXT files.
  5 | 
  6 | * [Synopsis](#synopsis)
  7 | * [Description](#description)
  8 | * [Usage](#usage)
  9 |   * [Subsample paired-end read data and retain pairing](#subsample-paired-end-read-data-and-retain-pairing)
 10 |   * [Subsample TEXT file and skip three header lines during subsampling](#subsample-text-file-and-skip-three-header-lines-during-subsampling)
 11 |   * [Subsample TEXT file and remove two header lines for final output](#subsample-text-file-and-remove-two-header-lines-for-final-output)
 12 | * [Options](#options)
 13 |   * [Mandatory options](#mandatory-options)
 14 |   * [Optional options](#optional-options)
 15 | * [Output](#output)
 16 | * [Run environment](#run-environment)
 17 | * [Author - contact](#author---contact)
 18 | * [Acknowledgements](#acknowledgements)
 19 | * [Citation, installation, and license](#citation-installation-and-license)
 20 | * [Changelog](#changelog)
 21 | 
 22 | ## Synopsis
 23 | 
 24 |     perl sample_fastx-txt.pl -i infile.fasta -n 100 > subsample.fasta
 25 | 
 26 | **or**
 27 | 
 28 |     zcat reads.fastq.gz | perl sample_fastx-txt.pl -i - -n 100000 > subsample.fastq
 29 | 
 30 | ## Description
 31 | 
 32 | Randomly subsample FASTA, FASTQ, and TEXT files.
 33 | 
 34 | Empty lines in the input files will be skipped and not included in
 35 | sampling. Format TEXT assumes one entry per single line. FASTQ
 36 | format assumes **four** lines per read, if this is not the case run
 37 | the FASTQ file through [`fastx_fix.pl`](/fastx_fix) or use Heng
 38 | Li's [`seqtk seq`](https://github.com/lh3/seqtk):
 39 | 
 40 |     seqtk seq -l 0 infile.fq > outfile.fq
 41 | 
 42 | The file type is detected automatically. However, if automatic
 43 | detection fails, TEXT format is assumed. As a last resort, you can
 44 | set the file type manually with option **-f**.
 45 | 
 46 | This script is an implementation of the *reservoir sampling*
 47 | algorithm (or *Algorithm R (3.4.2)*) described in Donald Knuth's
 48 | [*The Art of Computer Programming*](https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming).
 49 | It is designed to randomly draw a small sample from a
 50 | (potentially) huge input file of indeterminate size, which
 51 | (potentially) doesn't fit into main memory. The beauty of reservoir
 52 | sampling is that it requires only one pass through the input file.
 53 | The memory consumption of the algorithm is proportional to the
 54 | sample size, thus large sample sizes will consume lots of memory as
 55 | the whole sample will be held in memory. On the other hand, the size
 56 | of the initial file is irrelevant.
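
The idea behind reservoir sampling can be sketched in a few lines of Python. This is a generic illustration of Algorithm R, not a transcription of the Perl script; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(entries, k, seed=None):
    """Draw k entries uniformly at random from an iterable in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, entry in enumerate(entries):
        if i < k:
            # Fill the reservoir with the first k entries.
            reservoir.append(entry)
        else:
            # Entry i (0-based) replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = entry
    return reservoir


# Two equally long streams sampled with the same seed keep their pairing,
# because the random draws depend only on the entry count.
mates_1 = reservoir_sample((f"read{i}/1" for i in range(1000)), 10, seed=123)
mates_2 = reservoir_sample((f"read{i}/2" for i in range(1000)), 10, seed=123)
```

Only the `k` sampled entries are held in memory at any time, which is why the size of the input does not matter. For FASTQ input an entry would be a complete four-line record rather than a single line.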
 57 | 
 58 | An alternative tool, which is a lot faster, is `seqtk sample` from
 59 | the [*seqtk toolkit*](https://github.com/lh3/seqtk).
 60 | 
 61 | ## Usage
 62 | 
 63 | ### Subsample paired-end read data and retain pairing
 64 | 
 65 |     perl sample_fastx-txt.pl -i read-pair_1.fq -n 1000000 -s 123 > sub-pair_1.fq
 66 | 
 67 |     perl sample_fastx-txt.pl -i read-pair_2.fq -n 1000000 -s 123 > sub-pair_2.fq
 68 | 
 69 | ### Subsample TEXT file and skip three header lines during subsampling
 70 | 
 71 |     perl sample_fastx-txt.pl -i infile.txt -n 100 -f text -t 3 > subsample.txt
 72 | 
 73 | ### Subsample TEXT file and remove two header lines for final output
 74 | 
 75 |     perl sample_fastx-txt.pl -i infile.txt -n 350 -t 2 | sed '1,2d' > sub_no-header.txt
 76 | 
 77 | ## Options
 78 | 
 79 | ### Mandatory options
 80 | 
 81 | - -i, -input
 82 | 
 83 |     Input FASTA/Q or TEXT file, or piped *STDIN* (-)
 84 | 
 85 | - -n, -num
 86 | 
 87 |     Number of entries/reads to subsample
 88 | 
 89 | ### Optional options
 90 | 
 91 | - -h, -help
 92 | 
 93 |     Help (perldoc POD)
 94 | 
 95 | - -f, -file_type
 96 | 
 97 |     Set the file type manually [fasta|fastq|text]
 98 | 
 99 | - -s, -seed
100 | 
101 |     Set starting random seed. For **paired-end** read data use the **same random seed** for both FASTQ files with option **-s** to retain pairing (see [Subsample paired-end read data and retain pairing](#subsample-paired-end-read-data-and-retain-pairing) above).
102 | 
103 | - -t, -title_skip
104 | 
105 |     Skip the specified number of header lines in TEXT files before subsampling and append them again afterwards. If you want to get rid of the header as well, pipe the subsample output to [`sed`](https://www.gnu.org/software/sed/manual/sed.html) (see `man sed` and [Subsample TEXT file and remove two header lines for final output](#subsample-text-file-and-remove-two-header-lines-for-final-output) above).
106 | 
107 | - -v, -version
108 | 
109 |     Print version number to *STDERR*
110 | 
111 | ## Output
112 | 
113 | - *STDOUT*
114 | 
115 |     The subsample of the input file is printed to *STDOUT*. Redirect or pipe into another tool as needed.
116 | 
117 | ## Run environment
118 | 
119 | The Perl script runs under Windows and UNIX flavors.
120 | 
121 | ## Author - contact
122 | 
123 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
124 | 
125 | ## Acknowledgements
126 | 
127 | I got the idea for reservoir sampling from Sean Eddy's keynote at
128 | the Janelia meeting on [*High Throughput Sequencing for Neuroscience*](http://cryptogenomicon.wordpress.com/2014/11/01/high-throughput-sequencing-for-neuroscience/)
129 | which he posted in his blog
130 | [*Cryptogenomicon*](http://cryptogenomicon.wordpress.com/). The [*Wikipedia article*](https://en.wikipedia.org/wiki/Reservoir_sampling) and the
131 | [*PerlMonks*](http://www.perlmonks.org/index.pl?node_id=177092) implementation helped a lot, as well.
132 | 
133 | ## Citation, installation, and license
134 | 
135 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
136 | 
137 | ## Changelog
138 | 
139 | - v0.1 (18.11.2014)
140 | 


--------------------------------------------------------------------------------
/seq_format-converter/README.md:
--------------------------------------------------------------------------------
 1 | seq_format-converter
 2 | ====================
 3 | 
 4 | A script to convert a sequence file to another format.
 5 | 
 6 | ## Synopsis
 7 | 
 8 |     perl seq_format-converter.pl -i seq_file.gbk -f gbk -o embl
 9 | 
10 | ## Description
11 | 
12 | This script converts a (multi-)sequence file of a specific format to a differently formatted output file. The most common sequence formats are: **embl**, **fasta**, and **gbk** (genbank).
13 | 
14 | Since sequence formats change from time to time, BioPerl is not always up to date. For all available BioPerl sequence formats see: http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats. **Warning**: The *bioperl-ext* package and the *io_lib* library from the **Staden** package (http://staden.sourceforge.net/) need to be installed in order to read the scf, abi, alf, pln, exp, ctf, ztr formats.
15 | 
16 | ## Usage
17 | 
18 |     perl seq_format-converter.pl -i seq_file -f in_format -o out_format
19 | 
20 | ### UNIX loop to reformat all sequence files in the current working directory
21 | 
22 |     for i in *.gbk; do perl seq_format-converter.pl -i "$i" -f gbk -o embl; done
23 | 
24 | ## Options for *seq_format-converter.pl*
25 | 
26 | ### Mandatory options
27 | 
28 | * -i, -input
29 | 
30 | Input sequence file
31 | 
32 | * -f, -format
33 | 
34 | Input sequence format (e.g. 'embl' or 'gbk')
35 | 
36 | * -o, -out_format
37 | 
38 | Output sequence format (e.g. 'embl', 'fasta' or 'gbk')
39 | 
40 | ### Optional options
41 | 
42 | * -h, -help
43 | 
44 | Print usage
45 | 
46 | * -v, -version
47 | 
48 | Print version number
49 | 
50 | ## Output
51 | 
52 | * seq_file.[embl|fasta|gbk]
53 | 
54 | Output sequence file in the specified format
55 | 
56 | ## Run environment
57 | 
58 | The Perl script runs under Windows and UNIX flavors.
59 | 
60 | ## Dependencies (not in the core Perl modules)
61 | 
62 | * BioPerl (tested with version 1.006901)
63 | 
64 | ## Author/contact
65 | 
66 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
67 | 
68 | ## Citation, installation, and license
69 | 
70 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
71 | 
72 | ## Changelog
73 | 
74 | * v0.2 (03.02.2014)
75 |     - allow short 'gbk' format instead of 'genbank'
76 |     - also short 'gbk' file-extension for output file
77 |     - included 'use autodie'
78 |     - usage as HERE document
79 |     - options with Getopt::Long
80 |     - version switch
81 | * v0.1 (10.11.2011)
82 | 


--------------------------------------------------------------------------------
/seq_format-converter/seq_format-converter.pl:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/perl
 2 | 
 3 | use warnings;
 4 | use strict;
 5 | use autodie;
 6 | use Getopt::Long;
 7 | use Bio::SeqIO; # bioperl module to handle sequence input/output
 8 | 
 9 | my $usage = << "USAGE";
10 | 
11 |   ##################################################################
12 |   # $0 -i seq_file -f in_format -o out_format #
13 |   #                                                                #
14 |   # Converts a (multi-)sequence file of a specific format to a     #
15 |   # differently formatted output file, with the help of BioPerl    #
16 |   # (www.bioperl.org).                                             #
17 |   # Formats are e.g. embl, fasta, gbk.                             #
18 |   #                                                                #
19 |   # Mandatory options:                                             #
20 |   # -i, -input       input sequence file                           #
21 |   # -f, -format      input format                                  #
22 |   # -o, -out_format  output format                                 #
23 |   # Optional options:                                              #
24 |   # -h, -help        print usage                                   #
25 |   # -v, -version     print version number                          #
26 |   #                                                                #
27 |   # Adjust unix loop to run the script with all files in the       #
28 |   # current working directory, e.g.:                               #
29 |   # for i in *.gbk; do perl seq_format-converter.pl -i \$i -f gbk \\ #
30 |   # -o embl; done                                                  #
31 |   #                                                                #
32 |   # version 0.2, update: 03-02-2014                     A Leimbach #
33 |   # 10-11-2011                               aleimba[at]gmx[dot]de #
34 |   ##################################################################
35 | 
36 | USAGE
37 | ;
38 | 
39 | ### Get options with Getopt::Long
40 | my $infile; # input sequence file
41 | my $in_format; # input sequence file format
42 | my $out_format; # desired output file format
43 | my $version = 0.2;
44 | my ($opt_version, $opt_help);
45 | GetOptions ('input=s' => \$infile,
46 |             'format=s' => \$in_format,
47 |             'out_format=s' => \$out_format,
48 |             'version' => \$opt_version,
49 |             'help|?' => \$opt_help);
50 | 
51 | 
52 | ### Print usage
53 | if ($opt_help) {
54 |     die $usage;
55 | } elsif ($opt_version) {
56 |     die "$0 $version\n";
57 | } elsif (!$infile || !$in_format || !$out_format) {
58 |     die $usage, "### Fatal error: Option(s) or argument(s) for \'-i\', \'-f\', \'-o\' are missing!\n\n";
59 | }
60 | 
61 | 
62 | ### Allow shorter format string for 'genbank'
63 | $in_format = 'genbank' if ($in_format =~ /gbk/i);
64 | my $outfile = $infile;
65 | $outfile =~ s/\.\w+$/\.$out_format/; # remove file extension from infile and append out_format
66 | $out_format = 'genbank' if ($out_format =~ /gbk/i);
67 | 
68 | 
69 | ### SeqIO objects for input and output
70 | my $seq_in = Bio::SeqIO->new(-file => "<$infile", -format => $in_format); # a Bio::SeqIO object
71 | my $seq_out = Bio::SeqIO->new(-file => ">$outfile", -format => $out_format); # a Bio::SeqIO object
72 | 
73 | 
74 | ### Write sequence to different format
75 | while (my $seqobj = $seq_in->next_seq) { # a Bio::Seq object
76 |     $seq_out->write_seq($seqobj);
77 | }
78 | print "\n\tCreated new file $outfile!\n\n";
79 | 
80 | exit;
81 | 


--------------------------------------------------------------------------------
/tbl2tab/README.md:
--------------------------------------------------------------------------------
  1 | tbl2tab
  2 | =======
  3 | 
  4 | `tbl2tab.pl` is a script to convert tbl to tab-separated format and back.
  5 | 
  6 | * [Synopsis](#synopsis)
  7 | * [Description](#description)
  8 | * [Usage](#usage)
  9 | * [Options](#options)
 10 |   * [Mandatory options](#mandatory-options)
 11 |   * [Optional options](#optional-options)
 12 | * [Output](#output)
 13 | * [Run environment](#run-environment)
 14 | * [Author - contact](#author---contact)
 15 | * [Citation, installation, and license](#citation-installation-and-license)
 16 | * [Changelog](#changelog)
 17 | 
 18 | ## Synopsis
 19 | 
 20 |     perl tbl2tab.pl -m tbl2tab -i feature_table.tbl -s -l locus_prefix
 21 | 
 22 | **or**
 23 | 
 24 |     perl tbl2tab.pl -m tab2tbl -i feature_table.tab -g -l locus_prefix -p "gnl|dbname|"
 25 | 
 26 | ## Description
 27 | 
 28 | NCBI's feature table (**tbl**) format is needed for the submission of genomic data to GenBank with the NCBI tools [Sequin](http://www.ncbi.nlm.nih.gov/Sequin/) or [tbl2asn](http://www.ncbi.nlm.nih.gov/genbank/tbl2asn2). tbl files can be created with automatic annotation systems like [Prokka](http://www.vicbioinformatics.com/software.prokka.shtml). `tbl2tab.pl` can convert a tbl file to a tab-separated format (tab) and back to the tbl format. The tab-delimited format is useful to manipulate the data more comfortably in a spreadsheet software (e.g. LibreOffice or MS Excel). For a conversion back to tbl format save the file in the spreadsheet software as a tab-delimited text file. The script is intended for microbial genomes, but might also be useful for eukaryotes.
 29 | 
 30 | Regular expressions are applied in mode '**tbl2tab**' to correct gene names and words in '/product' values  to lowercase initials (with the exception of 'Rossman' and 'Willebrand'). The resulting tab file can then be used to check for possible errors.
 31 | 
 32 | The first four header columns of the **tab** format are mandatory, 'seq_id' for the SeqID, and for each primary tag/feature (e.g. CDS, RNAs, repeat_region etc.), 'start', 'stop', and 'primary_tag'. These mandatory columns have to be filled in every row in the tab file. All the following columns will be included as tags/qualifiers (e.g. '/locus_tag', '/product', '/EC_number', '/note' etc.) in the conversion to the tbl file if a value is present.
 33 | 
 34 | There are three special cases:
 35 | 
 36 | **First**, '/pseudo' will be included as a tag if *any* value (the script uses 'T' for true) is present in the **tab** format. If a primary tag is indicated as pseudo both the primary tag and the accessory 'gene' primary tag (for CDS/RNA features with option **-g**) will include a '/pseudo' qualifier in the resulting **tbl** file. *Pseudo-genes* are indicated by 'pseudo' in the 'primary_tag' column, thus the 'pseudo' column is ignored in these cases.
 37 | 
 38 | **Second**, tag '/gene_desc' is reserved for the 'product' values of pseudo-genes, thus a 'gene_desc' column in a tab file will be ignored in the conversion to tbl.
 39 | 
 40 | **Third**, column 'protein_id' in a tab file will also be ignored in the conversion. '/protein_id' values are created from option **-p** and the locus_tag for each CDS primary feature.
 41 | 
 42 | Furthermore, with option **-s** G2L-style spreadsheet formulas ([Goettingen Genomics Laboratory](http://appmibio.uni-goettingen.de/)) can be included with additional columns, 'spreadsheet_locus_tag', 'position', 'distance', 'gene_number', and 'contig_order'. These columns will not be included in a conversion to the tbl format. Thus, if you want to include e.g. the locus_tags from the formula in column 'spreadsheet_locus_tag' in the resulting tbl file copy the *values* to the column 'locus_tag'!
 43 | 
 44 | To illustrate the process two example files are included in the repository, 'example.tbl' and 'example2.tab', which are interconvertible (see "[USAGE](#usage)" below).
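
The layout of a tbl file such as 'example.tbl' can be read with a small parser. The following Python sketch is a simplified illustration and not how `tbl2tab.pl` works internally: the function name `parse_tbl` is hypothetical, it returns one record per primary tag without merging accessory 'gene' features into their CDS/RNA rows, and it does not handle partial (`<`/`>`) positions or multi-interval features.

```python
def parse_tbl(lines):
    """Turn tbl lines into one dict per primary tag (feature)."""
    rows, seq_id = [], None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith(">Feature"):
            # Header line: '>Feature SeqID'
            seq_id = line.split()[1]
            continue
        fields = line.split("\t")
        if fields[0]:
            # Feature line: start, stop, primary_tag
            start, stop, primary_tag = fields[:3]
            rows.append({"seq_id": seq_id, "start": int(start),
                         "stop": int(stop), "primary_tag": primary_tag,
                         "tags": []})
        else:
            # Qualifier line: three empty columns, then tag and optional value.
            tag = fields[3]
            value = fields[4] if len(fields) > 4 else "T"  # e.g. bare 'pseudo'
            rows[-1]["tags"].append((tag, value))
    return rows


# A few lines in the style of 'example.tbl'.
tbl_text = (
    ">Feature EPE_c\n"
    "336\t2798\tCDS\n"
    "\t\t\tEC_number\t1.1.1.3\n"
    "\t\t\tEC_number\t2.7.2.4\n"
    "4771\t4847\ttRNA\n"
    "\t\t\tpseudo\n"
)
features = parse_tbl(tbl_text.splitlines())
```

Tags are kept as a list of pairs rather than a dict because a tag such as '/EC_number' may occur several times within one primary tag.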
 45 | 
 46 | **Warning**, be aware of possible errors introduced by automatic format conversions using a spreadsheet software like MS Excel, see e.g. Zeeberg *et al.* 2004 (http://www.ncbi.nlm.nih.gov/pubmed/15214961).
 47 | 
 48 | For more information regarding the feature table and the submission process see NCBI's [prokaryotic annotation guide](http://www.ncbi.nlm.nih.gov/genbank/genomesubmit) and the [bacterial genome submission guide](http://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation).
 49 | 
 50 | ## Usage
 51 | 
 52 | ### Conversion from tbl to tab format
 53 | 
 54 |     perl tbl2tab.pl -m tbl2tab -i example.tbl -s -l EPE
 55 | 
 56 | ### Conversion from tab to tbl format
 57 | 
 58 |     perl tbl2tab.pl -m tab2tbl -i example2.tab -g -l EPE
 59 | 
 60 | ## Options
 61 | 
 62 | ### Mandatory options
 63 | 
 64 | * -m, -mode
 65 | 
 66 | Conversion mode, either 'tbl2tab' or 'tab2tbl' [default = 'tbl2tab']
 67 | 
 68 | * -i, -input
 69 | 
 70 | Input tbl or tab file to be converted to the other format
 71 | 
 72 | ### Optional options
 73 | 
 74 | * -h, -help
 75 | 
 76 | Help (perldoc POD)
 77 | 
 78 | * -v, -version
 79 | 
 80 | Print version number to *STDERR*
 81 | 
 82 | #### Mode *tbl2tab*
 83 | 
 84 | * -l, -locus_prefix
 85 | 
 86 | Only in combination with option **-s**, where it is mandatory: locus_tag prefix to include in the formula for column 'spreadsheet_locus_tag'
 87 | 
 88 | * -c, -concat
 89 | 
 90 | Concatenate values of identical tags within one primary tag with '~' (e.g. several '/EC_number' or '/inference' tags)
 91 | 
 92 | * -e, -empty
 93 | 
 94 | String used for primary features without value for a tag [default = '']
 95 | 
 96 | * -s, -spreadsheet
 97 | 
 98 | Include formulas for spreadsheet editing
 99 | 
100 | * -f, -formula_lang
101 | 
102 | Syntax language of the spreadsheet formulas, either 'English' or 'German'. If you're still encountering problems with the formulas set the decimal and thousands separator manually in the options of the spreadsheet software (instead of using the operating system separators). [default = 'e']
103 | 
104 | #### Mode *tab2tbl*
105 | 
106 | * -l, -locus_prefix
107 | 
108 | Prefix to the SeqID if not present already in the SeqID
109 | 
110 | * -g, -gene
111 | 
112 | Include accessory 'gene' primary tags (with '/gene', '/locus_tag' and possibly '/pseudo' tags) for 'CDS/RNA' primary tags; NCBI standard
113 | 
114 | * -t, -tags_full
115 | 
116 | Only in combination with option **-g**, include '/gene' and '/locus_tag' tags additionally in primary tag, not only in accessory 'gene' primary tag
117 | 
118 | * -p, -protein_id_prefix
119 | 
120 | Prefix for '/protein_id' tags; don't forget the double quotes for the string, otherwise the shell will interpret the '|' as a pipe [default = 'gnl|goetting|']
121 | 
122 | ## Output
123 | 
124 | * *.tab|tbl
125 | 
126 | Result file in the opposite format
127 | 
128 | * (hypo_putative_genes.txt)
129 | 
130 | Created in mode **tab2tbl**, indicates if CDSs are annotated as
131 | 'hypothetical/putative/predicted protein' but still have a gene name
132 | 
133 | ## Run environment
134 | 
135 | The Perl script runs under Windows and UNIX flavors.
136 | 
137 | ## Author - contact
138 | 
139 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
140 | 
141 | ## Citation, installation, and license
142 | 
143 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
144 | 
145 | ## Changelog
146 | 
147 | * v0.2 (29.10.2014)
148 |     * fixed bug: message which file was created was mixed up
149 |     * *hypo_putative_genes.txt* now also includes 'predicted protein' annotations
150 |     * additions and syntax changes to POD and README.md
151 | * v0.1 (24.06.2014)
152 | 


--------------------------------------------------------------------------------
/tbl2tab/example.tbl:
--------------------------------------------------------------------------------
 1 | >Feature EPE_c
 2 | 191	310	gene
 3 | 			locus_tag	EPE_c00010
 4 | 191	310	misc_RNA
 5 | 			inference	COORDINATES:profile:Infernal:1.1
 6 | 			product	Thr_leader
 7 | 336	2798	gene
 8 | 			locus_tag	EPE_c00020
 9 | 			gene	thrA
10 | 336	2798	CDS
11 | 			protein_id	gnl|goetting|EPE_c00020
12 | 			EC_number	1.1.1.3
13 | 			EC_number	2.7.2.4
14 | 			inference	ab initio prediction:Prodigal:2.60
15 | 			inference	similar to AA sequence:K-12_MG1655:NP_414543.1
16 | 			product	bifunctional aspartokinase/homoserine dehydrogenase 1
17 | 3168	4697	gene
18 | 			locus_tag	EPE_c00030
19 | 			gene	rrsH
20 | 3168	4697	rRNA
21 | 			inference	COORDINATES:profile:RNAmmer:1.2
22 | 			product	16S ribosomal RNA
23 | 4771	4847	gene
24 | 			locus_tag	EPE_c00040
25 | 			pseudo
26 | 4771	4847	tRNA
27 | 			inference	COORDINATES:profile:Aragorn:1.2
28 | 			product	tRNA-Pseudo-xxx
29 | 			pseudo
30 | 7154	5010	gene
31 | 			locus_tag	EPE_c00050
32 | 			gene	fadJ
33 | 7154	5010	CDS
34 | 			protein_id	gnl|goetting|EPE_c00050
35 | 			EC_number	1.1.1.35
36 | 			EC_number	4.2.1.17
37 | 			EC_number	5.1.2.3
38 | 			EC_number	5.3.3.8
39 | 			inference	ab initio prediction:Prodigal:2.60
40 | 			inference	similar to AA sequence:K-12_MG1655:NP_416843.1
41 | 			product	fused enoyl-CoA hydratase and epimerase and isomerase/3-hydroxyacyl-CoA dehydrogenase
42 | 7068	7430	gene
43 | 			locus_tag	EPE_c00060
44 | 			gene	ssrA
45 | 7068	7430	tmRNA
46 | 			inference	COORDINATES:profile:Aragorn:1.2
47 | 			product	transfer-messenger RNA, SsrA
48 | 7513	8883	repeat_region
49 | 			rpt_family	CRISPR
50 | 			score	23
51 | 8979	9275	gene
52 | 			locus_tag	EPE_c00080
53 | 8979	9275	CDS
54 | 			protein_id	gnl|goetting|EPE_c00080
55 | 			inference	ab initio prediction:Prodigal:2.60
56 | 			inference	similar to AA sequence:K-12_MG1655:NP_414546.1
57 | 			note	DUF2502 family putative periplasmic protein
58 | 			product	hypothetical protein
59 | >Feature EPE_89p
60 | 61	369	gene
61 | 			pseudo
62 | 			locus_tag	EPE_89p00010
63 | 			gene	ydhA
64 | 			gene_desc	hypothetical protein fragment
65 | 


--------------------------------------------------------------------------------
/tbl2tab/example2.tab:
--------------------------------------------------------------------------------
 1 | seq_id	start	stop	primary_tag	locus_tag	EC_number	EC_number	EC_number	EC_number	gene	inference	inference	note	product	protein_id	pseudo	rpt_family	score	spreadsheet_locus_tag	position	distance	gene_number	contig_order
 2 | c	191	310	misc_RNA	EPE_c00010						COORDINATES:profile:Infernal:1.1			Thr_leader					EPE_c00010	191	26	1	1
 3 | c	336	2798	CDS	EPE_c00020	1.1.1.3	2.7.2.4			thrA	ab initio prediction:Prodigal:2.60	similar to AA sequence:K-12_MG1655:NP_414543.1		bifunctional aspartokinase/homoserine dehydrogenase 1	gnl|SmithUCSD|EPE_c00020				EPE_c00020	336	370	2	1
 4 | c	3168	4697	rRNA	EPE_c00030					rrsH	COORDINATES:profile:RNAmmer:1.2			16S ribosomal RNA					EPE_c00030	3168	74	3	1
 5 | c	4771	4847	tRNA	EPE_c00040						COORDINATES:profile:Aragorn:1.2			tRNA-Pseudo-xxx		T			EPE_c00040	4771	163	4	1
 6 | c	7154	5010	CDS	EPE_c00050	1.1.1.35	4.2.1.17	5.1.2.3	5.3.3.8	fadJ	ab initio prediction:Prodigal:2.60	similar to AA sequence:K-12_MG1655:NP_416843.1		fused enoyl-CoA hydratase and epimerase and isomerase/3-hydroxyacyl-CoA dehydrogenase	gnl|SmithUCSD|EPE_c00050				EPE_c00050	5010	-86	5	1
 7 | c	7068	7430	tmRNA	EPE_c00060					ssrA	COORDINATES:profile:Aragorn:1.2			"transfer-messenger RNA, SsrA"					EPE_c00060	7068	83	6	1
 8 | c	7513	8883	repeat_region													CRISPR	23		7513	96	7	1
 9 | c	8979	9275	CDS	EPE_c00080						ab initio prediction:Prodigal:2.60	similar to AA sequence:K-12_MG1655:NP_414546.1	DUF2502 family putative periplasmic protein	hypothetical protein	gnl|SmithUCSD|EPE_00070				EPE_c00080	8979	-9214	8	1
10 | 89p	61	369	pseudo	EPE_89p00010					ydhA				hypothetical protein fragment					EPE_89p00010	61	-369	1	2
11 | 


--------------------------------------------------------------------------------
/trunc_seq/README.md:
--------------------------------------------------------------------------------
  1 | trunc_seq
  2 | =========
  3 | 
  4 | `trunc_seq.pl` is a script to truncate sequence files.
  5 | 
  6 | * [Synopsis](#synopsis)
  7 | * [Description](#description)
  8 | * [Usage](#usage)
  9 | * [Options](#options)
 10 | * [Output](#output)
 11 | * [Run environment](#run-environment)
 12 | * [Dependencies](#dependencies)
 13 | * [Author - contact](#author---contact)
 14 | * [Citation, installation, and license](#citation-installation-and-license)
 15 | * [Changelog](#changelog)
 16 | 
 17 | ## Synopsis
 18 | 
 19 |     perl trunc_seq.pl 20 3500 seq-file.embl > seq-file_trunc_20_3500.embl
 20 | 
 21 | **or**
 22 | 
 23 |     perl trunc_seq.pl file_of_filenames_and_coords.tsv
 24 | 
 25 | ## Description
 26 | 
 27 | This script truncates sequence files according to the given
 28 | coordinates. The features/annotations in RichSeq files (e.g. EMBL or
 29 | GENBANK format) will also be adapted accordingly. Use option **-o** to
 30 | specify a different output sequence format. Input can be given directly
 31 | as a file and truncation coordinates to the script, with the start
 32 | position as the first argument, stop as the second and (the path to)
 33 | the sequence file as the third. In this case the truncated sequence
 34 | entry is printed to *STDOUT*. Input sequence files should contain only
 35 | one sequence entry; if a multi-sequence file is used as input, only the
 36 | **first** sequence entry is truncated.
 37 | 
 38 | Alternatively, a file of filenames (fof) with respective coordinates
 39 | and sequence files in the following **tab-separated** format can be
 40 | given to the script (the header is optional):
 41 | 
 42 | \#start&emsp;stop&emsp;seq-file<br>
 43 | 300&emsp;9000&emsp;(path/to/)seq-file<br>
 44 | 50&emsp;1300&emsp;(path/to/)seq-file2<br>
 45 | 
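A fof needs real tab characters between the three columns, which are easy to lose when copying from a rendered page. The following is a minimal sketch, assuming a POSIX shell with `printf` and `awk`; the file names are placeholders, and the `awk` check only mirrors the validation described above (numeric start and stop, no other whitespace) and is not part of `trunc_seq.pl`.

```shell
# Write a fof with real tab characters (printf expands \t).
# File names below are placeholders for illustration.
fof=file_of_filenames_and_coords.tsv
printf '#start\tstop\tseq-file\n' > "$fof"
printf '300\t9000\tseq-file.embl\n' >> "$fof"
printf '50\t1300\tseq-file2.gbk\n' >> "$fof"

# Count data lines that are NOT "digits<TAB>digits<TAB>path-without-whitespace".
# Empty lines and comment lines starting with '#' are skipped.
bad_lines=$(awk -F'\t' '
    /^[[:space:]]*$/ || /^#/ { next }
    !(NF == 3 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[^[:space:]]+$/) { n++ }
    END { print n + 0 }' "$fof")
echo "malformed lines: $bad_lines"
```

If the count is not zero, the usual culprits are spaces instead of tabs or whitespace in a file path.
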
 46 | With a fof the resulting truncated sequence files are printed into a
 47 | results directory. Use option **-r** to specify a different results
 48 | directory than the default.
 49 | 
 50 | It is also possible to truncate a RichSeq sequence file loaded into the
 51 | [Artemis](http://www.sanger.ac.uk/science/tools/artemis) genome browser
 52 | from the Sanger Institute: Select a subsequence and then go to Edit ->
 53 | Subsequence (and Features)
 54 | 
 55 | ## Usage
 56 | 
 57 |     perl trunc_seq.pl -o gbk 120 30000 seq-file.embl > seq-file_trunc_120_30000.gbk
 58 | 
 59 | **or**
 60 | 
 61 |     perl trunc_seq.pl -o fasta 5300 18500 seq-file.gbk | perl revcom_seq.pl -i fasta > seq-file_trunc_revcom.fasta
 62 | 
 63 | **or**
 64 | 
 65 |     perl trunc_seq.pl -r path/to/trunc_embl_dir -o embl file_of_filenames_and_coords.tsv
 66 | 
 67 | ## Options
 68 | 
 69 | - **-h**, **-help**
 70 | 
 71 |     Help (perldoc POD)
 72 | 
 73 | - **-o**=*str*, **-outformat**=*str*
 74 | 
 75 |     Specify different sequence format for the output (files) [fasta, embl, or gbk]
 76 | 
 77 | - **-r**=*str*, **-result\_dir**=*str*
 78 | 
 79 |     Path to result folder for fof input \[default = './trunc\_seq\_results'\]
 80 | 
 81 | - **-v**, **-version**
 82 | 
 83 |     Print version number to *STDOUT*
 84 | 
 85 | ## Output
 86 | 
 87 | - *STDOUT*
 88 | 
 89 |     If a single sequence file is given to the script the truncated sequence
 90 |     file is printed to *STDOUT*. Redirect or pipe into another tool as
 91 |     needed.
 92 | 
 93 | **or**
 94 | 
 95 | - ./trunc_seq_results
 96 | 
 97 |     If a fof is given to the script, all output files are stored in a
 98 |     results folder
 99 | 
100 | - ./trunc_seq_results/seq-file_trunc_start_stop.format
101 | 
102 |     Truncated output sequence files are named after the input file, with
103 |     'trunc' and the corresponding start and stop positions appended
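
The naming scheme for fof output can be sketched in plain shell. This is an illustration only, assuming a POSIX shell; the variable names and the input path are hypothetical, and the script itself builds the name with Perl substitutions rather than with these parameter expansions.

```shell
# Hypothetical input values for illustration.
in=path/to/seq-file.embl
start=300
stop=9000
ext=gbk                       # extension of the chosen output format
result_dir=trunc_seq_results  # default results directory

base=${in##*/}   # drop the directory part of the input path
stem=${base%.*}  # drop the last file extension
out="$result_dir/${stem}_trunc_${start}_${stop}.${ext}"
echo "$out"
```

With these values the sketch prints `trunc_seq_results/seq-file_trunc_300_9000.gbk`.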
104 | 
105 | ## Run environment
106 | 
107 | The Perl script runs under Windows and UNIX flavors.
108 | 
109 | ## Dependencies
110 | 
111 | - [**BioPerl**](http://www.bioperl.org) (tested version 1.007001)
112 | 
113 | ## Author - contact
114 | 
115 | Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
116 | 
117 | ## Citation, installation, and license
118 | 
119 | For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
120 | 
121 | ## Changelog
122 | 
123 | * v0.2 (2015-12-07)
124 |     * Merged functionality of `trunc_seq.pl` and `run_trunc_seq.pl` into a single script
125 |         * Allows now single file and file of filenames (fof) with coordinates input
126 |         * output for single file input printed to *STDOUT* now
127 |         * output for fof input printed into files in a result directory, new option **-r** to specify result directory
128 |     * included a POD instead of a simple usage text
129 |     * included `pod2usage` with Pod::Usage
130 |     * included 'use autodie' pragma
131 |     * options with Getopt::Long
132 |     * output format now specified with option **-o**
133 |     * included version switch, **-v**
134 |     * fixed bug to remove input filepaths from fof input for output files
135 |     * skip empty or comment lines (/^#/) in fof input
136 |     * check and warn if input seq file has more than one seq entries
137 | * v0.1 (2013-02-08)
138 |     * `trunc_seq.pl` handled only single sequence file input; the additional wrapper script `run_trunc_seq.pl` handled fof input
139 | 


--------------------------------------------------------------------------------
/trunc_seq/trunc_seq.pl:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/perl
  2 | 
  3 | #######
  4 | # POD #
  5 | #######
  6 | 
  7 | =pod
  8 | 
  9 | =head1 NAME
 10 | 
 11 | C<trunc_seq.pl> - truncate sequence files
 12 | 
 13 | =head1 SYNOPSIS
 14 | 
 15 | C<perl trunc_seq.pl 20 3500 seq-file.embl E<gt>
 16 | seq-file_trunc_20_3500.embl>
 17 | 
 18 | B<or>
 19 | 
 20 | C<perl trunc_seq.pl file_of_filenames_and_coords.tsv>
 21 | 
 22 | =head1 DESCRIPTION
 23 | 
 24 | This script truncates sequence files according to the given
 25 | coordinates. The features/annotations in RichSeq files (e.g. EMBL or
 26 | GENBANK format) will also be adapted accordingly. Use option B<-o> to
 27 | specify a different output sequence format. Input can be given directly
 28 | as a file and truncation coordinates to the script, with the start
 29 | position as the first argument, stop as the second and (the path to)
 30 | the sequence file as the third. In this case the truncated sequence
 31 | entry is printed to C<STDOUT>. Input sequence files should contain only
 32 | one sequence entry; if a multi-sequence file is used as input, only the
 33 | B<first> sequence entry is truncated.
 34 | 
 35 | Alternatively, a file of filenames (fof) with respective coordinates
 36 | and sequence files in the following B<tab-separated> format can be
 37 | given to the script (the header is optional):
 38 | 
 39 |  #start\tstop\tseq-file
 40 |  300\t9000\t(path/to/)seq-file
 41 |  50\t1300\t(path/to/)seq-file2
 42 | 
 43 | With a fof the resulting truncated sequence files are printed into a
 44 | results directory. Use option B<-r> to specify a different results
 45 | directory than the default.
 46 | 
 47 | It is also possible to truncate a RichSeq sequence file loaded into the
 48 | L<Artemis|http://www.sanger.ac.uk/science/tools/artemis> genome browser
 49 | from the Sanger Institute: Select a subsequence and then go to Edit
 50 | -E<gt> Subsequence (and Features)
 51 | 
 52 | =head1 OPTIONS
 53 | 
 54 | =over 20
 55 | 
 56 | =item B<-h>, B<-help>
 57 | 
 58 | Help (perldoc POD)
 59 | 
 60 | =item B<-o>=I<str>, B<-outformat>=I<str>
 61 | 
 62 | Specify different sequence format for the output (files) [fasta, embl,
 63 | or gbk]
 64 | 
 65 | =item B<-r>=I<str>, B<-result_dir>=I<str>
 66 | 
 67 | Path to result folder for fof input [default = './trunc_seq_results']
 68 | 
 69 | =item B<-v>, B<-version>
 70 | 
 71 | Print version number to C<STDOUT>
 72 | 
 73 | =back
 74 | 
 75 | =head1 OUTPUT
 76 | 
 77 | =over 20
 78 | 
 79 | =item C<STDOUT>
 80 | 
 81 | If a single sequence file is given to the script the truncated sequence
 82 | file is printed to C<STDOUT>. Redirect or pipe into another tool as
 83 | needed.
 84 | 
 85 | =back
 86 | 
 87 | B<or>
 88 | 
 89 | =over 20
 90 | 
 91 | =item F<./trunc_seq_results>
 92 | 
 93 | If a fof is given to the script, all output files are stored in a
 94 | results folder
 95 | 
 96 | =item F<./trunc_seq_results/seq-file_trunc_start_stop.format>
 97 | 
 98 | Truncated output sequence files are named after the input file, with
 99 | 'trunc' and the corresponding start and stop positions appended
100 | 
101 | =back
102 | 
103 | =head1 EXAMPLES
104 | 
105 | =over
106 | 
107 | =item C<perl trunc_seq.pl -o gbk 120 30000 seq-file.embl E<gt>
108 | seq-file_trunc_120_30000.gbk>
109 | 
110 | =back
111 | 
112 | B<or>
113 | 
114 | =over
115 | 
116 | =item C<perl trunc_seq.pl -o fasta 5300 18500 seq-file.gbk | perl
117 | revcom_seq.pl -i fasta E<gt> seq-file_trunc_revcom.fasta>
118 | 
119 | =back
120 | 
121 | B<or>
122 | 
123 | =over
124 | 
125 | =item C<perl trunc_seq.pl -r path/to/trunc_embl_dir -o embl
126 | file_of_filenames_and_coords.tsv>
127 | 
128 | =back
129 | 
130 | =head1 DEPENDENCIES
131 | 
132 | =over
133 | 
134 | =item B<L<BioPerl|http://www.bioperl.org>>
135 | 
136 | Tested with BioPerl version 1.007001
137 | 
138 | =back
139 | 
140 | =head1 VERSION
141 | 
142 |  0.2                                               update: 2015-12-07
143 |  0.1                                                       2013-08-02
144 | 
145 | =head1 AUTHOR
146 | 
147 |  Andreas Leimbach                               aleimba[at]gmx[dot]de
148 | 
149 | =head1 LICENSE
150 | 
151 | This program is free software: you can redistribute it and/or modify
152 | it under the terms of the GNU General Public License as published by
153 | the Free Software Foundation; either version 3 (GPLv3) of the
154 | License, or (at your option) any later version.
155 | 
156 | This program is distributed in the hope that it will be useful, but
157 | WITHOUT ANY WARRANTY; without even the implied warranty of
158 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
159 | General Public License for more details.
160 | 
161 | You should have received a copy of the GNU General Public License
162 | along with this program. If not, see L<http://www.gnu.org/licenses/>.
163 | 
164 | =cut
165 | 
166 | 
167 | ########
168 | # MAIN #
169 | ########
170 | 
171 | use strict;
172 | use warnings;
173 | use autodie;
174 | use Getopt::Long;
175 | use Pod::Usage;
176 | use Bio::SeqIO; # bioperl module to handle sequence input/output
177 | #use Bio::Seq; # bioperl module to handle sequences with features ### apparently not needed, methods inherited
178 | use Bio::SeqUtils; # bioperl module with additional methods (including features) for Bio::Seq objects; loaded explicitly because 'Bio::SeqUtils->trunc_with_features' is called in subroutine 'trunc_seq'
179 | 
180 | ### Get options with Getopt::Long
181 | my $Out_Format_Opt; # optional different output seq file format
182 | my $Result_Dir = 'trunc_seq_results'; # path to result folder for fof input
183 | my $VERSION = 0.2;
184 | my ($Opt_Version, $Opt_Help);
185 | GetOptions ('outformat=s' => \$Out_Format_Opt,
186 |             'result_dir=s' => \$Result_Dir,
187 |             'version' => \$Opt_Version,
188 |             'help|?' => \$Opt_Help)
189 |             or pod2usage(-verbose => 1, -exitval => 2);
190 | 
191 | 
192 | 
193 | ### Run perldoc on POD
194 | pod2usage(-verbose => 2) if ($Opt_Help);
195 | if ($Opt_Version) {
196 |     print "$0 $VERSION\n";
197 |     exit;
198 | }
199 | 
200 | 
201 | 
202 | ### Check input (@ARGV); didn't include STDIN as input option, too complicated here with fof etc.
203 | my $Fof; # file of filenames (fof) with truncation coords
204 | my $Start;
205 | my $Stop;
206 | my $Seq_File;
207 | if (@ARGV < 1 || @ARGV == 2 || @ARGV > 3) {
208 |     my $warning = "\n### Fatal error: Give either three arguments,\n$0\tstart\tstop\tseq-file\nor one file of sequence filenames with truncation coordinates as argument! Please see the usage with option '-h' if unclear!\n";
209 |     pod2usage(-verbose => 0, -message => $warning, -exitval => 2);
210 | } elsif (@ARGV == 1) { # fof
211 |     check_file_exists($ARGV[0]); # subroutine to check for file existence
212 |     $Fof = shift;
213 | } elsif (@ARGV == 3) {
214 |     check_file_exists($ARGV[2]); # subroutine
215 |     if ($ARGV[0] !~ /^\d+$/ || $ARGV[1] !~ /^\d+$/) {
216 |         my $warning = "\n### Fatal error: With a single sequence file input the first and second arguments are the start and stop positions for truncation, and need to include ONLY digits:\n$0\tstart\tstop\tseq-file\nPlease see the usage with option '-h' if unclear!\n";
217 |         pod2usage(-verbose => 0, -message => $warning, -exitval => 2);
218 |     }
219 |     ($Start, $Stop, $Seq_File) = @ARGV;
220 | }
221 | 
222 | 
223 | 
224 | ### Truncate the sequence and write either to STDOUT for single seq file input or output files for fof
225 | if ($Fof) {
226 |     open (my $fof_fh, "<", "$Fof");
227 | 
228 |     # create result folder
229 |     $Result_Dir =~ s/\/$//; # get rid of a potential '/' at the end of $Result_Dir path
230 |     if (-e $Result_Dir) {
231 |         empty_dir($Result_Dir); # subroutine to empty a directory with user interaction
232 |     } else {
233 |         mkdir $Result_Dir;
234 |     }
235 | 
236 |     while (my $line = <$fof_fh>) {
237 |         chomp $line;
238 |         next if ($line =~ /^\s*$/ || $line =~ /^#/); # skip empty or comment lines
239 | 
240 |         die "\n### Fatal error: Line '$.' of the '$Fof' file of sequence filenames plus truncation coordinates does not include the mandatory tab-separated two NUMERICAL start and stop truncation positions, and the sequence file (without any other whitespaces):\nstart\tstop\tpath/to/seq-file\n" if ($line !~ /^\d+\t\d+\t\S+$/);
241 |         ($Start, $Stop, $Seq_File) = split(/\t/, $line);
242 |         check_file_exists($Seq_File); # subroutine
243 | 
244 |         my ($seqin, $truncseq) = trunc_seq($Start, $Stop, $Seq_File); # subroutine to create a Bio::SeqIO input object and truncate the respective Bio::Seq object
245 |         my $seqout = seq_out($seqin, $Start, $Stop, $Seq_File); # subroutine to create a Bio::SeqIO output object, $seqin needed for format guessing, $Start/$Stop/$Seq_File needed for output filenames
246 |         $seqout->write_seq($truncseq);
247 |     }
248 |     close $fof_fh;
249 | 
250 | } else { # single seq file, @ARGV == 3
251 |     my ($seqin, $truncseq) = trunc_seq($Start, $Stop, $Seq_File); # subroutine
252 |     my $seqout = seq_out($seqin); # subroutine, without $Start/$Stop/$Seq_file for STDOUT output
253 |     $seqout->write_seq($truncseq);
254 | }
255 | 
256 | exit;
257 | 
258 | 
259 | 
260 | ###############
261 | # Subroutines #
262 | ###############
263 | 
264 | ### Subroutine to check if file exists
265 | sub check_file_exists {
266 |     my $file = shift;
267 |     die "\n### Fatal error: File '$file' does not exist: $!\n" if (!-e $file);
268 | }
269 | 
270 | 
271 | 
272 | ### Subroutine to empty a directory with user interaction
273 | sub empty_dir {
274 |     my $dir = shift;
275 |     print STDERR "\nDirectory '$dir' already exists! You can use either option '-r' to set a different output result directory name, or do you want to replace the directory and all its contents [y|n]? ";
276 |     my $user_ask = <STDIN>;
277 |     if ($user_ask =~ /^\s*y/i) { # only answers starting with 'y' confirm the deletion
278 |         unlink glob "$dir/*"; # remove all files in results directory
279 |     } else {
280 |         die "\nScript aborted!\n";
281 |     }
282 |     return 1;
283 | }
284 | 
285 | 
286 | 
287 | ### Subroutine to create a Bio::SeqIO output object
288 | sub seq_out {
289 |     my ($seqin, $start, $stop, $seq_file) = @_;
290 | 
291 |     my $out_format; # need to keep $Out_Format_Opt for several seq files with fof
292 |     if ($Out_Format_Opt) {
293 |         $Out_Format_Opt = 'genbank' if ($Out_Format_Opt =~ /(gbk|gb)/i); # allow shorter input for GENBANK format
294 |         $out_format = $Out_Format_Opt;
295 |     } else { # same format as input file
296 |         if (ref($seqin) =~ /Bio::SeqIO::(genbank|embl|fasta)/) { # from bioperl guessing
297 |             $out_format = $1;
298 |         } else {
299 |             die "\n### Fatal error: Could not determine input file format, please set an output file format with option '-o'!\n";
300 |         }
301 |     }
302 | 
303 |     my $seqout; # Bio::SeqIO object
304 |     if ($seq_file) { # fof
305 |         $seq_file =~ s/\S+(\/|\\)//; # remove input filepaths, aka 'basename' ('/' for Unix and '\' for Windows)
306 |         my $file_ext;
307 |         if ($out_format eq 'genbank') {
308 |             $file_ext = 'gbk'; # back to shorter file extension for GENBANK format
309 |         } else {
310 |             $file_ext = $out_format;
311 |         }
312 |         $seq_file =~ s/^(\S+)\.\w+$/$Result_Dir\/$1\_trunc_$start\_$stop\.$file_ext/; # append also result directory to output filename
313 |         $seqout = Bio::SeqIO->new(-file => ">$seq_file", -format => $out_format);
314 | 
315 |     } else { # single seq file input
316 |         $seqout = Bio::SeqIO->new(-fh => \*STDOUT, -format => $out_format); # printing to STDOUT requires '-format'
317 |     }
318 | 
319 |     return $seqout;
320 | }
321 | 
322 | 
323 | 
324 | ### Subroutine to create a Bio::SeqIO input object and truncate the respective Bio::Seq object
325 | sub trunc_seq {
326 |     my ($start, $stop, $seq_file) = @_;
327 |     print STDERR "\nTruncating \"$seq_file\" to coordinates $start..$stop ...\n";
328 |     my $seqin = Bio::SeqIO->new(-file => "<$seq_file"); # Bio::SeqIO object; no '-format' given, leave it to bioperl guessing
329 |     my $count = 0;
330 |     my $truncseq;
331 |     while (my $seq_obj = $seqin->next_seq) { # Bio::Seq object
332 |         $count++;
333 |         if ($count > 1) {
334 |             warn "\n### Warning: More than one sequence entry in sequence file '$seq_file', but only the FIRST sequence entry will be truncated and printed to STDOUT or a result file!\n\n";
335 |             last;
336 |         }
337 |         $truncseq = Bio::SeqUtils->trunc_with_features($seq_obj, $start, $stop);
338 |     }
339 |     return ($seqin, $truncseq); # $seqin needed for outformat guessing in subroutine 'seq_out'
340 | }
341 | 


--------------------------------------------------------------------------------