├── .gitattributes ├── .gitignore ├── README.md ├── Snakefile ├── config.yaml ├── envs └── environment.yaml ├── scripts ├── ._callSNPs.py ├── ._coreSNPs2fasta.py ├── ._findCoreSNPs.py ├── ._getCoverage.py ├── ._pairwiseDist.py ├── ._renderTree.R ├── .snakemake.7g49ubxl.coreSNPs2fasta.py ├── .snakemake.9wtzh4ps.callSNPs.py ├── .snakemake.gej6roz5.callSNPs.py ├── .snakemake.jrhv5jyf.callSNPs.py ├── .snakemake.k53chwr8.callSNPs.py ├── .snakemake.mgp_kol5.callSNPs.py ├── .snakemake.sgrw01xs.callSNPs.py ├── callSNPs.py ├── coreSNPs2fasta.py ├── findCoreSNPs.py ├── getCoverage.py ├── pairwiseDist.py └── renderTree.R └── tutorial ├── fastq ├── isolate1_1.fq ├── isolate1_2.fq ├── isolate2_1.fq ├── isolate2_2.fq ├── meta1_1.fq ├── meta1_2.fq ├── meta2_1.fq ├── meta2_2.fq ├── meta3_1.fq ├── meta3_2.fq ├── meta4_1.fq └── meta4_2.fq └── reference ├── E_coli_K12.fna ├── E_coli_K12.fna.amb ├── E_coli_K12.fna.ann ├── E_coli_K12.fna.bwt ├── E_coli_K12.fna.fai ├── E_coli_K12.fna.pac └── E_coli_K12.fna.sa /.gitattributes: -------------------------------------------------------------------------------- 1 | *.fq binary 2 | *.fastq binary 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | ._Snakefile 2 | ._config.yaml 3 | .snakemake 4 | tutorial 5 | config_old.yaml 6 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # StrainSifter 2 | 3 | A straightforward bioinformatic pipeline for detecting the presence of a bacterial strain in one or more metagenome(s). 4 | 5 | StrainSifter is based on [Snakemake](https://snakemake.readthedocs.io/en/stable/). This pipeline allows you to output phylogenetic trees showing strain relatedness of input strains, as well as pairwise counts of single-nucleotide variants (SNVs) between input samples. 
6 | 7 | ## Installation 8 | 9 | To run StrainSifter, you must have miniconda3 and Snakemake installed. 10 | 11 | #### Install instructions (One time only) 12 | 1. Download and install [miniconda3](https://conda.io/miniconda.html): 13 | 14 | For Linux: 15 | 16 | wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh 17 | bash Miniconda3-latest-Linux-x86_64.sh 18 | 19 | 2. Clone the StrainSifter workflow to the directory where you wish to run the pipeline: 20 | 21 | git clone https://github.com/bhattlab/strainsifter 22 | 23 | 3. Create the new conda environment: 24 | 25 | cd strainsifter 26 | conda env create -f envs/environment.yaml 27 | 28 | 4. Install Snakemake: 29 | 30 | conda install snakemake -c bioconda -c conda-forge 31 | 32 | 33 | StrainSifter has been developed and tested with Snakemake version 5.1.4 or higher. Check your version by typing: 34 | 35 | snakemake --version 36 | 37 | If you are running a version of Snakemake prior to 5.1.4, update to the latest version: 38 | 39 | conda update snakemake -c bioconda -c conda-forge 40 | 41 | #### Activate the conda environment (Every time you use StrainSifter) 42 | 43 | source activate ssift 44 | 45 | ### Dependencies 46 | 47 | We recommend running StrainSifter in the provided conda environment. 
If you wish to run StrainSifter without using the conda environment, the following tools must be installed and in your system PATH: 48 | * [Burrows-Wheeler Aligner (BWA)](http://bio-bwa.sourceforge.net) 49 | * [Samtools](http://www.htslib.org) 50 | * [Bamtools](https://github.com/pezmaster31/bamtools) 51 | * [Bedtools](http://bedtools.readthedocs.io/en/latest/) 52 | * [MUSCLE](https://www.drive5.com/muscle/) 53 | * [FastTree](http://www.microbesonline.org/fasttree/) 54 | * [Python3](https://www.python.org/downloads/) 55 | * [R](https://www.r-project.org) 56 | 57 | ## Running StrainSifter 58 | 59 | Due to the computing demands of the StrainSifter pipeline, we recommend running on a computing cluster if possible. 60 | Instructions to enable Snakemake to schedule cluster jobs with SLURM can be found at https://github.com/bhattlab/slurm 61 | 62 | ### Input files 63 | 64 | * Reference genome assembly in fasta format (can be a draft genome or a finished reference genome) 65 | Acceptable file extensions: ".fasta", ".fa", ".fna" 66 | 67 | * Two or more short read datasets in fastq format (metagenomic reads or isolate reads), optionally gzipped 68 | Acceptable file extensions: ".fq", ".fastq", ".fq.gz", ".fastq.gz" 69 | 70 | Short read data can be paired- or single-end. 71 | 72 | You will need to indicate input files in the config file for each sample you wish to run StrainSifter on. This is described below in the *Config* section: 73 | 74 | ### Config 75 | 76 | You must update the config.yaml file as follows: 77 | 78 | **reference:** Path to reference genome (fasta format) 79 | 80 | **reads:** Samples and the file path(s) to the input reads. 81 | 82 |
83 | 84 | Optionally, you can update the following parameters: 85 | 86 | **prefix:** (optional) desired filename prefix for output files. If blank, the name of the reference genome will be used. 87 | 88 | **mapq:** minimum mapping quality score to evaluate a read alignment 89 | 90 | **n_mismatches:** consider reads with this many mismatches or fewer 91 | 92 | **min_cvg:** minimum read depth to determine the nucleotide at any given position 93 | 94 | **min_genome_percent:** the minimum fraction of bases that must be covered at min_cvg or greater to process a sample 95 | 96 | **base_freq:** minimum frequency of a nucleotide to call a base at any position 97 | 98 |
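To make the variant calling parameters more concrete, here is a minimal Python sketch of how thresholds of this kind can be applied. It loosely follows the logic of the `filter_samples` rule and `scripts/callSNPs.py`, but it is a simplification and not the pipeline's actual code: the function names `sample_passes` and `call_base` are hypothetical, base-quality filtering is omitted, and the defaults simply mirror the example config values.

```python
# Simplified sketch of sample-level and position-level thresholds.
# Function names are hypothetical; defaults mirror the example config.yaml.

def sample_passes(avg_cvg, frac_covered, min_cvg=5, min_genome_percent=0.5):
    """Sample-level filter: keep a sample only if its average coverage is at
    least min_cvg and the covered fraction of the genome exceeds
    min_genome_percent."""
    return avg_cvg >= min_cvg and frac_covered > min_genome_percent


def call_base(base_counts, min_cvg=5, base_freq=0.8):
    """Position-level call: return the most frequent base if depth and
    frequency thresholds are met, otherwise "N"."""
    total = sum(base_counts.values())
    if total == 0 or total < min_cvg:
        return "N"  # too few reads to call this position
    top_base, top_count = max(base_counts.items(), key=lambda kv: kv[1])
    if top_count / total >= base_freq:
        return top_base
    return "N"  # no base is frequent enough


if __name__ == "__main__":
    print(sample_passes(12.0, 0.9))
    print(call_base({"A": 9, "C": 1, "G": 0, "T": 0}))
    print(call_base({"A": 6, "C": 4, "G": 0, "T": 0}))
    print(call_base({"A": 3, "C": 0, "G": 0, "T": 0}))
```

Raising `base_freq` makes calls more conservative at mixed positions, which matters for metagenomes where several strains may be present; lowering `min_cvg` admits more positions at the cost of noisier calls.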
99 | 100 | Example config.yaml: 101 | 102 | ##### input files ##### 103 | 104 | # reference genome (required) 105 | reference: /path/to/ref.fna 106 | 107 | # short read data (at least two samples required) 108 | reads: 109 | sample1: 110 | [/path/to/sample1_R1.fq, 111 | /path/to/sample1_R2.fq] 112 | sample2: 113 | [/path/to/sample2_R1.fq, 114 | /path/to/sample2_R2.fq] 115 | sample3: /path/to/sample3.fq 116 | 117 | # prefix for output files (optional - can leave blank) 118 | prefix: 119 | 120 | 121 | ##### workflow parameters ##### 122 | 123 | # alignment parameters: 124 | mapq: 60 125 | n_mismatches: 5 126 | 127 | # variant calling parameters: 128 | min_cvg: 5 129 | min_genome_percent: 0.5 130 | base_freq: 0.8 131 | 132 | 133 | ### Running StrainSifter 134 | 135 | To run StrainSifter, the config file must be present in the directory in which you wish to run the workflow. 136 | You should then be able to run StrainSifter as follows, replacing {prefix} with the prefix set in config.yaml (or the name of the reference genome if prefix is left blank): 137 | 138 | #### Phylogeny 139 | 140 | To generate a phylogenetic tree showing all of the input samples that contain your strain of interest at sufficient coverage to profile: 141 | 142 | snakemake {prefix}.tree.pdf 143 | 144 | #### SNV counts 145 | 146 | To generate a list of pairwise SNV counts between all input samples: 147 | 148 | snakemake {prefix}.dist.tsv 149 | 150 | ### FAQ 151 | 152 | Q: Can StrainSifter be used for non-bacterial genomes (e.g. yeast)? 153 | 154 | A: At present, we recommend StrainSifter for bacteria only. In theory, StrainSifter should work for yeast if a haploid reference genome is provided. 
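The `{prefix}` in the target names above is a placeholder. As a rough illustration, the sketch below mirrors the way the Snakefile derives the prefix from the reference path when `prefix` is left blank. The helper name `output_targets` is hypothetical and exists only for this example.

```python
import re


def output_targets(reference, prefix=None):
    """Return the tree and distance target names for a given config.

    Mirrors the Snakefile's fallback: if no prefix is set, split the
    reference path on "/" and "." and take the second-to-last piece.
    """
    if prefix is None:
        prefix = re.split(r"/|\.", reference)[-2]
    return {
        "tree": "{}.tree.pdf".format(prefix),
        "dist": "{}.dist.tsv".format(prefix),
    }


if __name__ == "__main__":
    print(output_targets("/path/to/ref.fna"))
    print(output_targets("/path/to/ref.fna", prefix="my_strain"))
```

Because the fallback splits on every dot, a reference named `ref.v2.fasta` would yield the prefix `v2`; setting `prefix` explicitly in config.yaml avoids that surprise.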
155 | -------------------------------------------------------------------------------- /Snakefile: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | from snakemake.utils import min_version 4 | 5 | ##### set minimum snakemake version ##### 6 | min_version("5.1.4") 7 | 8 | ##### load config file and sample list ##### 9 | configfile: "config.yaml" 10 | 11 | ##### list of input samples ##### 12 | samples = [key for key in config['reads']] 13 | 14 | ##### prefix for phylogenetic tree and SNV distance files ##### 15 | if config['prefix'] is None: 16 | prefix = re.split("/|\.", config['reference'])[-2] 17 | else: 18 | prefix = config['prefix'] 19 | 20 | ##### rules ##### 21 | 22 | # index reference genome for bwa alignment 23 | rule bwa_index: 24 | input: config['reference'] 25 | output: 26 | "{ref}.amb".format(ref=config['reference']), 27 | "{ref}.ann".format(ref=config['reference']), 28 | "{ref}.bwt".format(ref=config['reference']), 29 | "{ref}.pac".format(ref=config['reference']), 30 | "{ref}.sa".format(ref=config['reference']) 31 | resources: 32 | mem=2, 33 | time=1 34 | shell: 35 | "bwa index {input}" 36 | 37 | # align reads to reference genome with bwa 38 | rule bwa_align: 39 | input: 40 | ref_index = rules.bwa_index.output, 41 | r = lambda wildcards: config["reads"][wildcards.sample] 42 | output: 43 | "filtered_bam/{sample}.filtered.bam" 44 | resources: 45 | mem=32, 46 | time=6 47 | threads: 8 48 | params: 49 | ref = config['reference'], 50 | qual=config['mapq'], 51 | nm=config['n_mismatches'] 52 | shell: 53 | "bwa mem -t {threads} {params.ref} {input.r} | "\ 54 | "samtools view -b -q {params.qual} | "\ 55 | "bamtools filter -tag 'NM:<={params.nm}' | "\ 56 | "samtools sort --threads {threads} -o {output}" 57 | 58 | # count base read coverage 59 | rule genomecov: 60 | input: 61 | rules.bwa_align.output 62 | output: 63 | "genomecov/{sample}.tsv" 64 | resources: 65 | mem=8, 66 | time=1, 67 | threads: 1 68 
| shell: 69 | "bedtools genomecov -ibam {input} > {output}" 70 | 71 | # calculate average coverage across the genome 72 | rule calc_coverage: 73 | input: 74 | rules.genomecov.output 75 | output: 76 | "coverage/{sample}.cvg" 77 | resources: 78 | mem=8, 79 | time=1, 80 | threads: 1 81 | params: 82 | cvg=config['min_cvg'] 83 | script: 84 | "scripts/getCoverage.py" 85 | 86 | # filter samples that meet coverage requirements 87 | rule filter_samples: 88 | input: expand("coverage/{sample}.cvg", sample = samples) 89 | output: 90 | dynamic("passed_samples/{sample}.bam") 91 | resources: 92 | mem=1, 93 | time=1 94 | threads: 1 95 | params: 96 | min_cvg=config['min_cvg'], 97 | min_perc=config['min_genome_percent'] 98 | run: 99 | samps = input 100 | for samp in samps: 101 | with open(samp) as s: 102 | cvg, perc = s.readline().rstrip('\n').split('\t') 103 | if (float(cvg) >= params.min_cvg and float(perc) > params.min_perc): 104 | shell("ln -s $PWD/filtered_bam/{s}.filtered.bam passed_samples/{s}.bam".format(s=os.path.splitext(os.path.basename(samp))[0])) 105 | 106 | # index reference genome for pileup 107 | rule faidx: 108 | input: config['reference'] 109 | output: "{ref}.fai".format(ref=config['reference']) 110 | resources: 111 | mem=2, 112 | time=1 113 | shell: 114 | "samtools faidx {input}" 115 | 116 | # create pileup from bam files 117 | rule pileup: 118 | input: 119 | bam="passed_samples/{sample}.bam", 120 | ref=config['reference'], 121 | index=rules.faidx.output 122 | output: "pileup/{sample}.pileup" 123 | resources: 124 | mem=32, 125 | time=1 126 | threads: 16 127 | shell: 128 | "samtools mpileup -f {input.ref} -B -aa -o {output} {input.bam}" 129 | 130 | # call SNPs from pileup 131 | rule call_snps: 132 | input: rules.pileup.output 133 | output: "snp_calls/{sample}.tsv" 134 | resources: 135 | mem=32, 136 | time=2 137 | threads: 16 138 | params: 139 | min_cvg=5, 140 | min_freq=0.8, 141 | min_qual=20 142 | script: 143 | "scripts/callSNPs.py" 144 | 145 | # get consensus sequence 
from pileup 146 | rule snp_consensus: 147 | input: rules.call_snps.output 148 | output: "consensus/{sample}.txt" 149 | resources: 150 | mem=2, 151 | time=2 152 | threads: 1 153 | shell: 154 | "echo {wildcards.sample} > {output}; cut -f4 {input} >> {output}" 155 | 156 | # combine consensus sequences into one file 157 | rule combine: 158 | input: 159 | dynamic("consensus/{sample}.txt") 160 | output: "{name}.cns.tsv".format(name = prefix) 161 | resources: 162 | mem=2, 163 | time=1 164 | threads: 1 165 | shell: 166 | "paste {input} > {output}" 167 | 168 | # find positions that have a base call in each input genome and at least 169 | # one variant in the set of input genomes 170 | rule core_snps: 171 | input: rules.combine.output 172 | output: "{name}.core_snps.tsv".format(name = prefix) 173 | resources: 174 | mem=16, 175 | time=1 176 | threads: 1 177 | script: 178 | "scripts/findCoreSNPs.py" 179 | 180 | # convert core SNPs file to fasta format 181 | rule core_snps_to_fasta: 182 | input: rules.core_snps.output 183 | output: "{name}.fasta".format(name = prefix) 184 | resources: 185 | mem=16, 186 | time=1 187 | threads: 1 188 | script: 189 | "scripts/coreSNPs2fasta.py" 190 | 191 | # perform multiple sequence alignment of fasta file 192 | rule multi_align: 193 | input: rules.core_snps_to_fasta.output 194 | output: "{name}.afa".format(name = prefix) 195 | resources: 196 | mem=200, 197 | time=12 198 | threads: 1 199 | shell: 200 | "muscle -in {input} -out {output}" 201 | 202 | # calculate phylogenetic tree from multiple sequence alignment 203 | rule build_tree: 204 | input: rules.multi_align.output 205 | output: "{name}.tree".format(name = prefix) 206 | resources: 207 | mem=8, 208 | time=1 209 | threads: 1 210 | shell: 211 | "fasttree -nt {input} > {output}" 212 | 213 | # plot phylogenetic tree 214 | rule plot_tree: 215 | input: rules.build_tree.output 216 | output: "{name}.tree.pdf".format(name = prefix) 217 | resources: 218 | mem=8, 219 | time=1 220 | threads: 1 221 | 
script: 222 | "scripts/renderTree.R" 223 | 224 | # count pairwise SNVs between input samples 225 | rule pairwise_snvs: 226 | input: dynamic("consensus/{sample}.txt") 227 | output: "{name}.dist.tsv".format(name = prefix) 228 | resources: 229 | mem=8, 230 | time=1 231 | threads: 1 232 | script: 233 | "scripts/pairwiseDist.py" 234 | -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | ##### input files ##### 2 | 3 | # reference genome (required) 4 | reference: /path/to/ref.fna 5 | 6 | # short read data (at least two samples required) 7 | reads: 8 | sample1: 9 | [/path/to/sample1_R1.fq, 10 | /path/to/sample1_R2.fq] 11 | sample2: 12 | [/path/to/sample2_R1.fq, 13 | /path/to/sample2_R2.fq] 14 | sample3: /path/to/sample3.fq 15 | 16 | # prefix for output files (optional - can leave blank) 17 | prefix: 18 | 19 | 20 | ##### workflow parameters ##### 21 | 22 | # alignment parameters: 23 | mapq: 60 24 | n_mismatches: 5 25 | 26 | # variant calling parameters: 27 | min_cvg: 5 28 | min_genome_percent: 0.5 29 | base_freq: 0.8 30 | -------------------------------------------------------------------------------- /envs/environment.yaml: -------------------------------------------------------------------------------- 1 | name: ssift 2 | channels: 3 | - bioconda 4 | - defaults 5 | - conda-forge 6 | dependencies: 7 | - bwa 8 | - samtools 9 | - ncurses 10 | - bamtools 11 | - bedtools 12 | - MUSCLE 13 | - FastTree 14 | - r-ggplot2 15 | - bioconductor-ggtree 16 | - r-phangorn 17 | -------------------------------------------------------------------------------- /scripts/._callSNPs.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/scripts/._callSNPs.py -------------------------------------------------------------------------------- 
/scripts/._coreSNPs2fasta.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/scripts/._coreSNPs2fasta.py -------------------------------------------------------------------------------- /scripts/._findCoreSNPs.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/scripts/._findCoreSNPs.py -------------------------------------------------------------------------------- /scripts/._getCoverage.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/scripts/._getCoverage.py -------------------------------------------------------------------------------- /scripts/._pairwiseDist.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/scripts/._pairwiseDist.py -------------------------------------------------------------------------------- /scripts/._renderTree.R: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/scripts/._renderTree.R -------------------------------------------------------------------------------- /scripts/.snakemake.7g49ubxl.coreSNPs2fasta.py: -------------------------------------------------------------------------------- 1 | 2 | ######## Snakemake header ######## 3 | import sys; sys.path.append("/home/tamburin/tools/miniconda3/lib/python3.6/site-packages"); import pickle; snakemake = 
pickle.loads(b'\x80\x03csnakemake.script\nSnakemake\nq\x00)\x81q\x01}q\x02(X\x05\x00\x00\x00inputq\x03csnakemake.io\nInputFiles\nq\x04)\x81q\x05X\x14\x00\x00\x00E_coli.core_snps.tsvq\x06a}q\x07X\x06\x00\x00\x00_namesq\x08}q\tsbX\x06\x00\x00\x00outputq\ncsnakemake.io\nOutputFiles\nq\x0b)\x81q\x0cX\x0c\x00\x00\x00E_coli.fastaq\ra}q\x0eh\x08}q\x0fsbX\x06\x00\x00\x00paramsq\x10csnakemake.io\nParams\nq\x11)\x81q\x12}q\x13h\x08}q\x14sbX\t\x00\x00\x00wildcardsq\x15csnakemake.io\nWildcards\nq\x16)\x81q\x17}q\x18h\x08}q\x19sbX\x07\x00\x00\x00threadsq\x1aK\x01X\t\x00\x00\x00resourcesq\x1bcsnakemake.io\nResources\nq\x1c)\x81q\x1d(K\x01K\x01K\x10K\x01e}q\x1e(h\x08}q\x1f(X\x06\x00\x00\x00_coresq K\x00N\x86q!X\x06\x00\x00\x00_nodesq"K\x01N\x86q#X\x03\x00\x00\x00memq$K\x02N\x86q%X\x04\x00\x00\x00timeq&K\x03N\x86q\'uh K\x01h"K\x01h$K\x10h&K\x01ubX\x03\x00\x00\x00logq(csnakemake.io\nLog\nq))\x81q*}q+h\x08}q,sbX\x06\x00\x00\x00configq-}q.(X\t\x00\x00\x00referenceq/X\x0e\x00\x00\x00isolate2.fastaq0X\x05\x00\x00\x00readsq1}q2(X\x05\x00\x00\x00meta1q3]q4(X\x16\x00\x00\x00reads/meta1_1.fastq.gzq5X\x16\x00\x00\x00reads/meta1_1.fastq.gzq6eX\x05\x00\x00\x00meta2q7]q8(X\x16\x00\x00\x00reads/meta2_1.fastq.gzq9X\x16\x00\x00\x00reads/meta2_1.fastq.gzq:eX\x05\x00\x00\x00meta3q;]q<(X\x16\x00\x00\x00reads/meta3_1.fastq.gzq=X\x16\x00\x00\x00reads/meta3_1.fastq.gzq>eX\x08\x00\x00\x00isolate1q?]q@(X\x19\x00\x00\x00reads/isolate1_1.fastq.gzqAX\x19\x00\x00\x00reads/isolate1_2.fastq.gzqBeX\x08\x00\x00\x00isolate2qC]qD(X\x19\x00\x00\x00reads/isolate2_1.fastq.gzqEX\x19\x00\x00\x00reads/isolate2_2.fastq.gzqFeuX\x06\x00\x00\x00prefixqGX\x06\x00\x00\x00E_coliqHX\x04\x00\x00\x00mapqqIK" + fastas[f][0]) 37 | print(fastas[f][1]) 38 | -------------------------------------------------------------------------------- /scripts/.snakemake.9wtzh4ps.callSNPs.py: -------------------------------------------------------------------------------- 1 | 2 | ######## Snakemake header ######## 3 | import sys; 
sys.path.append("/home/tamburin/tools/miniconda3/lib/python3.6/site-packages"); import pickle; snakemake = pickle.loads(b'\x80\x03csnakemake.script\nSnakemake\nq\x00)\x81q\x01}q\x02(X\x05\x00\x00\x00inputq\x03csnakemake.io\nInputFiles\nq\x04)\x81q\x05X\x1c\x00\x00\x00pileup/4308_7-26-2016.pileupq\x06a}q\x07X\x06\x00\x00\x00_namesq\x08}q\tsbX\x06\x00\x00\x00outputq\ncsnakemake.io\nOutputFiles\nq\x0b)\x81q\x0cX\x1c\x00\x00\x00snp_calls/4308_7-26-2016.tsvq\ra}q\x0eh\x08}q\x0fsbX\x06\x00\x00\x00paramsq\x10csnakemake.io\nParams\nq\x11)\x81q\x12(K\x05G?\xe9\x99\x99\x99\x99\x99\x9aK\x14e}q\x13(h\x08}q\x14(X\x07\x00\x00\x00min_cvgq\x15K\x00N\x86q\x16X\x08\x00\x00\x00min_freqq\x17K\x01N\x86q\x18X\x08\x00\x00\x00min_qualq\x19K\x02N\x86q\x1auh\x15K\x05h\x17G?\xe9\x99\x99\x99\x99\x99\x9ah\x19K\x14ubX\t\x00\x00\x00wildcardsq\x1bcsnakemake.io\nWildcards\nq\x1c)\x81q\x1dX\x0e\x00\x00\x004308_7-26-2016q\x1ea}q\x1f(h\x08}q X\x06\x00\x00\x00sampleq!K\x00N\x86q"sX\x06\x00\x00\x00sampleq#h\x1eubX\x07\x00\x00\x00threadsq$K\x01X\t\x00\x00\x00resourcesq%csnakemake.io\nResources\nq&)\x81q\'(K\x01K\x01K K\x02e}q((h\x08}q)(X\x06\x00\x00\x00_coresq*K\x00N\x86q+X\x06\x00\x00\x00_nodesq,K\x01N\x86q-X\x03\x00\x00\x00memq.K\x02N\x86q/X\x04\x00\x00\x00timeq0K\x03N\x86q1uh*K\x01h,K\x01h.K h0K\x02ubX\x03\x00\x00\x00logq2csnakemake.io\nLog\nq3)\x81q4}q5h\x08}q6sbX\x06\x00\x00\x00configq7}q8(X\t\x00\x00\x00referenceq9XO\x00\x00\x00/home/tamburin/fiona/bacteremia/1.assemble_trimmed/filtered_assemblies/L3.fastaq:X\x07\x00\x00\x00samplesq;X0\x00\x00\x00/home/tamburin/fiona/crassphage/all_samples.listqX\n\x00\x00\x00paired_endq?X\x01\x00\x00\x00Yq@X\x07\x00\x00\x00min_cvgqAK\nuX\x04\x00\x00\x00ruleqBX\t\x00\x00\x00call_snpsqCub.'); from snakemake.logging import logger; logger.printshellcmds = False 4 | ######## Original script ######### 5 | #!/usr/bin/env python3 6 | # callSNPS.py 7 | # call SNPs from a samtools pileup file 8 | 9 | import sys 10 | import gzip 11 | import re 12 | 13 | DEBUG = 0 14 | 15 | 
ASCII_OFFSET = ord("!") 16 | INDEL_PATTERN = re.compile(r"[+-](\d+)") 17 | INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}" 18 | ROUND = 3 # places to right of decimal point 19 | 20 | min_coverage, min_proportion, min_qual = snakemake.params 21 | pileup_file = snakemake.input[0] 22 | out_file = snakemake.output[0] 23 | 24 | min_coverage = int(min_coverage) 25 | min_proportion = float(min_proportion) 26 | min_qual = int(min_qual) 27 | 28 | ## function to process each line of pileup 29 | 30 | def parse_pileup(line): 31 | 32 | # read fields from line of pileup 33 | chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t") 34 | 35 | # proportion is 0 if no consensus base, top base is N by default 36 | parsed = { 37 | "proportion": 0.0, "chromosome": chromosome, 38 | "position": int(position), "reference": reference, 39 | "coverage": int(coverage), "pileup": pileup, 40 | "quality": quality, "top_base": "N"} 41 | 42 | # if the base coverage is below the limit or above our acceptable max, call N 43 | if parsed["coverage"] < min_coverage: 44 | return parsed 45 | 46 | # uppercase pileup string for processing 47 | pileup = pileup.upper() 48 | 49 | # Remove start and stop characters from pileup string 50 | pileup = re.sub(r"\^.|\$", "", pileup) 51 | 52 | # Remove indels from pileup string 53 | start = 0 54 | 55 | while True: 56 | match = INDEL_PATTERN.search(pileup, start) 57 | 58 | if match: 59 | integer = match.group(1) 60 | pileup = re.sub(INDEL_STRING.replace("INTEGER", integer), "", pileup) 61 | start = match.start() 62 | else: 63 | break 64 | 65 | # get total base count and top base count 66 | total = 0 67 | top_base = "N" 68 | top_base_count = 0 69 | 70 | # uppercase reference base for comparison 71 | reference = reference.upper() 72 | 73 | base_counts = {"A": 0, "C": 0, "G": 0, "T": 0} 74 | 75 | quality_length = len(quality) 76 | 77 | for i in range(quality_length): 78 | 79 | # convert ASCII character to phred base quality 80 | 
base_quality = ord(quality[i]) - ASCII_OFFSET 81 | 82 | # only count high-quality bases 83 | if base_quality >= min_qual: 84 | 85 | currBase = pileup[i] 86 | 87 | if currBase in base_counts: 88 | base_counts[currBase] += 1 89 | 90 | else: 91 | base_counts[reference] += 1 92 | 93 | total += 1 94 | 95 | parsed["total"] = total 96 | 97 | # find top base 98 | for base in base_counts: 99 | if base_counts[base] > top_base_count: 100 | top_base = base 101 | top_base_count = base_counts[base] 102 | 103 | # if more that 0 bases processed 104 | if total > 0: 105 | prop = top_base_count / float(total) 106 | 107 | if prop >= min_proportion: 108 | parsed["proportion"] = prop 109 | parsed["top_base"] = top_base 110 | 111 | return parsed 112 | 113 | ## process pileup file 114 | 115 | # read in input file 116 | if not DEBUG: 117 | out_file = open(out_file, "w") 118 | 119 | with open(pileup_file, "rt") as pileup_file: 120 | for pileup_file_line in pileup_file: 121 | 122 | parsed = parse_pileup(pileup_file_line) 123 | 124 | proportion = parsed["proportion"] 125 | 126 | # print to output file 127 | if DEBUG: 128 | print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)), 129 | parsed["pileup"], parsed["quality"]])) 130 | else: 131 | print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)), 132 | parsed["pileup"], parsed["quality"]]), file=out_file) 133 | 134 | if not DEBUG: 135 | out_file.close() 136 | -------------------------------------------------------------------------------- /scripts/.snakemake.gej6roz5.callSNPs.py: -------------------------------------------------------------------------------- 1 | 2 | ######## Snakemake header ######## 3 | import sys; sys.path.append("/home/tamburin/tools/miniconda3/lib/python3.6/site-packages"); import pickle; snakemake = 
pickle.loads(b'\x80\x03csnakemake.script\nSnakemake\nq\x00)\x81q\x01}q\x02(X\x05\x00\x00\x00inputq\x03csnakemake.io\nInputFiles\nq\x04)\x81q\x05X\x1c\x00\x00\x00pileup/4757_2-16-2016.pileupq\x06a}q\x07X\x06\x00\x00\x00_namesq\x08}q\tsbX\x06\x00\x00\x00outputq\ncsnakemake.io\nOutputFiles\nq\x0b)\x81q\x0cX\x1c\x00\x00\x00snp_calls/4757_2-16-2016.tsvq\ra}q\x0eh\x08}q\x0fsbX\x06\x00\x00\x00paramsq\x10csnakemake.io\nParams\nq\x11)\x81q\x12(K\x05G?\xe9\x99\x99\x99\x99\x99\x9aK\x14e}q\x13(h\x08}q\x14(X\x07\x00\x00\x00min_cvgq\x15K\x00N\x86q\x16X\x08\x00\x00\x00min_freqq\x17K\x01N\x86q\x18X\x08\x00\x00\x00min_qualq\x19K\x02N\x86q\x1auh\x15K\x05h\x17G?\xe9\x99\x99\x99\x99\x99\x9ah\x19K\x14ubX\t\x00\x00\x00wildcardsq\x1bcsnakemake.io\nWildcards\nq\x1c)\x81q\x1dX\x0e\x00\x00\x004757_2-16-2016q\x1ea}q\x1f(h\x08}q X\x06\x00\x00\x00sampleq!K\x00N\x86q"sX\x06\x00\x00\x00sampleq#h\x1eubX\x07\x00\x00\x00threadsq$K\x01X\t\x00\x00\x00resourcesq%csnakemake.io\nResources\nq&)\x81q\'(K\x01K\x01K K\x02e}q((h\x08}q)(X\x06\x00\x00\x00_coresq*K\x00N\x86q+X\x06\x00\x00\x00_nodesq,K\x01N\x86q-X\x03\x00\x00\x00memq.K\x02N\x86q/X\x04\x00\x00\x00timeq0K\x03N\x86q1uh*K\x01h,K\x01h.K h0K\x02ubX\x03\x00\x00\x00logq2csnakemake.io\nLog\nq3)\x81q4}q5h\x08}q6sbX\x06\x00\x00\x00configq7}q8(X\t\x00\x00\x00referenceq9X?\x00\x00\x00/home/tamburin/fiona/crassphage/strainsifter/ref/B_vulgatus.fnaq:X\x07\x00\x00\x00samplesq;X0\x00\x00\x00/home/tamburin/fiona/crassphage/all_samples.listqX\n\x00\x00\x00paired_endq?X\x01\x00\x00\x00Yq@X\x07\x00\x00\x00min_cvgqAK\nuX\x04\x00\x00\x00ruleqBX\t\x00\x00\x00call_snpsqCub.'); from snakemake.logging import logger; logger.printshellcmds = False 4 | ######## Original script ######### 5 | #!/usr/bin/env python3 6 | # callSNPS.py 7 | # call SNPs from a samtools pileup file 8 | 9 | import sys 10 | import gzip 11 | import re 12 | 13 | DEBUG = 0 14 | 15 | ASCII_OFFSET = ord("!") 16 | INDEL_PATTERN = re.compile(r"[+-](\d+)") 17 | INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}" 18 
| ROUND = 3 # places to right of decimal point 19 | 20 | min_coverage, min_proportion, min_qual = snakemake.params 21 | pileup_file = snakemake.input[0] 22 | out_file = snakemake.output[0] 23 | 24 | min_coverage = int(min_coverage) 25 | min_proportion = float(min_proportion) 26 | min_qual = int(min_qual) 27 | 28 | ## function to process each line of pileup 29 | 30 | def parse_pileup(line): 31 | 32 | # read fields from line of pileup 33 | chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t") 34 | 35 | # proportion is 0 if no consensus base, top base is N by default 36 | parsed = { 37 | "proportion": 0.0, "chromosome": chromosome, 38 | "position": int(position), "reference": reference, 39 | "coverage": int(coverage), "pileup": pileup, 40 | "quality": quality, "top_base": "N"} 41 | 42 | # if the base coverage is below the limit or above our acceptable max, call N 43 | if parsed["coverage"] < min_coverage: 44 | return parsed 45 | 46 | # uppercase pileup string for processing 47 | pileup = pileup.upper() 48 | 49 | # Remove start and stop characters from pileup string 50 | pileup = re.sub(r"\^.|\$", "", pileup) 51 | 52 | # Remove indels from pileup string 53 | start = 0 54 | 55 | while True: 56 | match = INDEL_PATTERN.search(pileup, start) 57 | 58 | if match: 59 | integer = match.group(1) 60 | pileup = re.sub(INDEL_STRING.replace("INTEGER", integer), "", pileup) 61 | start = match.start() 62 | else: 63 | break 64 | 65 | # get total base count and top base count 66 | total = 0 67 | top_base = "N" 68 | top_base_count = 0 69 | 70 | # uppercase reference base for comparison 71 | reference = reference.upper() 72 | 73 | base_counts = {"A": 0, "C": 0, "G": 0, "T": 0} 74 | 75 | quality_length = len(quality) 76 | 77 | for i in range(quality_length): 78 | 79 | # convert ASCII character to phred base quality 80 | base_quality = ord(quality[i]) - ASCII_OFFSET 81 | 82 | # only count high-quality bases 83 | if base_quality >= min_qual: 84 | 85 | 
currBase = pileup[i] 86 | 87 | if currBase in base_counts: 88 | base_counts[currBase] += 1 89 | 90 | else: 91 | base_counts[reference] += 1 92 | 93 | total += 1 94 | 95 | parsed["total"] = total 96 | 97 | # find top base 98 | for base in base_counts: 99 | if base_counts[base] > top_base_count: 100 | top_base = base 101 | top_base_count = base_counts[base] 102 | 103 | # if more that 0 bases processed 104 | if total > 0: 105 | prop = top_base_count / float(total) 106 | 107 | if prop >= min_proportion: 108 | parsed["proportion"] = prop 109 | parsed["top_base"] = top_base 110 | 111 | return parsed 112 | 113 | ## process pileup file 114 | 115 | # read in input file 116 | if not DEBUG: 117 | out_file = open(out_file, "w") 118 | 119 | with open(pileup_file, "rt") as pileup_file: 120 | for pileup_file_line in pileup_file: 121 | 122 | parsed = parse_pileup(pileup_file_line) 123 | 124 | proportion = parsed["proportion"] 125 | 126 | # print to output file 127 | if DEBUG: 128 | print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)), 129 | parsed["pileup"], parsed["quality"]])) 130 | else: 131 | print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)), 132 | parsed["pileup"], parsed["quality"]]), file=out_file) 133 | 134 | if not DEBUG: 135 | out_file.close() 136 | --------------------------------------------------------------------------------
/scripts/callSNPs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # callSNPS.py 3 | # call SNPs from a samtools pileup file 4 | 5 | import sys 6 | import gzip 7 | import re 8 | 9 | DEBUG = 0 10 | 11 | ASCII_OFFSET = ord("!") 12 | INDEL_PATTERN = re.compile(r"[+-](\d+)") 13 | INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}" 14 | ROUND = 3 # places to right of decimal point 15 | 16 | min_coverage, min_proportion, min_qual
= snakemake.params 17 | pileup_file = snakemake.input[0] 18 | out_file = snakemake.output[0] 19 | 20 | min_coverage = int(min_coverage) 21 | min_proportion = float(min_proportion) 22 | min_qual = int(min_qual) 23 | 24 | ## function to process each line of pileup 25 | 26 | def parse_pileup(line): 27 | 28 | # read fields from line of pileup 29 | chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t") 30 | 31 | # proportion is 0 if no consensus base, top base is N by default 32 | parsed = { 33 | "proportion": 0.0, 34 | "chromosome": chromosome, 35 | "position": int(position), 36 | "reference": reference, 37 | "coverage": int(coverage), 38 | "pileup": pileup, 39 | "quality": quality, 40 | "top_base": "N"} 41 | 42 | # if the base coverage is below the minimum threshold, call N 43 | if parsed["coverage"] < min_coverage: 44 | return parsed 45 | 46 | # uppercase pileup string for processing 47 | pileup = pileup.upper() 48 | 49 | # Remove start and stop characters from pileup string 50 | pileup = re.sub(r"\^.|\$", "", pileup) 51 | 52 | # Remove indels from pileup string 53 | start = 0 54 | 55 | while True: 56 | match = INDEL_PATTERN.search(pileup, start) 57 | 58 | if match: 59 | integer = match.group(1) 60 | pileup = re.sub(INDEL_STRING.replace("INTEGER", integer), "", pileup) 61 | start = match.start() 62 | else: 63 | break 64 | 65 | # get total base count and top base count 66 | total = 0 67 | top_base = "N" 68 | top_base_count = 0 69 | 70 | # uppercase reference base for comparison 71 | reference = reference.upper() 72 | 73 | base_counts = {"A": 0, "C": 0, "G": 0, "T": 0} 74 | 75 | quality_length = len(quality) 76 | 77 | for i in range(quality_length): 78 | 79 | # convert ASCII character to phred base quality 80 | base_quality = ord(quality[i]) - ASCII_OFFSET 81 | 82 | # only count high-quality bases 83 | if base_quality >= min_qual: 84 | 85 | currBase = pileup[i] 86 | 87 | if currBase in base_counts: 88 | 
base_counts[currBase] += 1 89 | 90 | else: 91 | base_counts[reference] += 1 92 | 93 | total += 1 94 | 95 | parsed["total"] = total 96 | 97 | # find top base 98 | for base in base_counts: 99 | if base_counts[base] > top_base_count: 100 | top_base = base 101 | top_base_count = base_counts[base] 102 | 103 | # if more than 0 bases processed 104 | if total > 0: 105 | prop = top_base_count / float(total) 106 | 107 | if prop >= min_proportion: 108 | parsed["proportion"] = prop 109 | parsed["top_base"] = top_base 110 | 111 | return parsed 112 | 113 | ## process pileup file 114 | 115 | # read in input file 116 | if not DEBUG: 117 | out_file = open(out_file, "w") 118 | 119 | with open(pileup_file, "rt") as pileup_file: 120 | for pileup_file_line in pileup_file: 121 | 122 | parsed = parse_pileup(pileup_file_line) 123 | 124 | proportion = parsed["proportion"] 125 | 126 | # print to output file 127 | if DEBUG: 128 | print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)), 129 | parsed["pileup"], parsed["quality"]])) 130 | else: 131 | print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)), 132 | parsed["pileup"], parsed["quality"]]), file=out_file) 133 | 134 | if not DEBUG: 135 | out_file.close() 136 | -------------------------------------------------------------------------------- /scripts/coreSNPs2fasta.py: -------------------------------------------------------------------------------- 1 | import sys 2 | 3 | HEADER_POS = 0 4 | FASTA_POS = 1 5 | #coreSNPs_file = sys.argv[1] 6 | 7 | snp_file = snakemake.input[0] 8 | out_file = snakemake.output[0] 9 | 10 | # hold sequences as we go 11 | fastas = [] 12 | 13 | with open(snp_file, "r") as core_snps: 14 | 15 | # process header 16 | header = core_snps.readline().rstrip("\n").split("\t") 17 | 18 | # add each header to the list, paired with an empty sequence string 19 | for h in range(len(header)): 20 | 
fastas += [[header[h], ""]] 21 | 22 | for line in core_snps: 23 | line = line.rstrip("\n").split("\t") 24 | 25 | for pos in range(len(line)): 26 | # print(fastas[pos]) 27 | # print(fastas[pos][FASTA_POS]) 28 | fastas[pos][FASTA_POS] += line[pos] 29 | 30 | with open(out_file, "w") as sys.stdout: 31 | for f in range(len(fastas)): 32 | print(">" + fastas[f][0]) 33 | print(fastas[f][1]) 34 | -------------------------------------------------------------------------------- /scripts/findCoreSNPs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # find core snvs shared between samples 3 | 4 | import sys 5 | 6 | snp_file = snakemake.input[0] 7 | out_file = snakemake.output[0] 8 | 9 | # read each line and print if base is called for every sample and 10 | # bases are not the same in every sample 11 | with open(snp_file, "r") as snps: 12 | with open(out_file, "w") as sys.stdout: 13 | 14 | # print header 15 | print(snps.readline().rstrip('\n')) 16 | 17 | # process each line and output positions with no unknown bases ("N") 18 | # and where all samples do not have same base 19 | for line in snps: 20 | positions = line.rstrip('\n').split('\t') 21 | if (len(set(positions)) > 1) and ("N" not in positions): 22 | print("\t".join(positions)) 23 | -------------------------------------------------------------------------------- /scripts/getCoverage.py: -------------------------------------------------------------------------------- 1 | # Fiona Tamburini 2 | # Jan 26 2018 3 | 4 | # calculate average coverage and % bases covered above min threshold from bedtools genomecov output 5 | # usage python3 getCoverage.py 5 sample < depth.tsv 6 | 7 | import sys 8 | 9 | # snakemake input and output files 10 | sample_file = snakemake.input[0] 11 | out_file = snakemake.output[0] 12 | 13 | # minimum coverage threshold -- report percentage of bases covered at or 14 | # beyond this depth 15 | minCvg = int(snakemake.params[0]) 16 | 17 | 
totalBases = 0 18 | coveredBases = 0 19 | weightedAvg = 0 20 | with open(sample_file, 'r') as sample: 21 | for line in sample: 22 | if line.startswith("genome"): 23 | chr, depth, numBases, size, fraction = line.rstrip('\n').split('\t') 24 | 25 | depth = int(depth) 26 | numBases = int(numBases) 27 | 28 | totalBases += numBases 29 | weightedAvg += depth * numBases 30 | 31 | if depth >= minCvg: 32 | coveredBases += numBases 33 | avgCvg = 0 34 | percCovered = 0 35 | 36 | if totalBases > 0: 37 | avgCvg = float(weightedAvg) / float(totalBases) 38 | percCovered = float(coveredBases) / float(totalBases) 39 | 40 | with open(out_file, 'w') as out: 41 | print("\t".join([str(round(avgCvg, 2)), str(round(percCovered, 2))]), file = out) 42 | -------------------------------------------------------------------------------- /scripts/pairwiseDist.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import sys 4 | import itertools 5 | import subprocess 6 | import re 7 | 8 | # input and output files from snakemake 9 | in_files = snakemake.input 10 | out_file = snakemake.output[0] 11 | 12 | cmd = " ".join(["cat", in_files[0], "| wc -l"]) 13 | totalBases = subprocess.check_output(cmd, stdin=subprocess.PIPE, shell=True ).decode('ascii').rstrip('\n') 14 | totalBases = int(totalBases) - 1 15 | 16 | # for every pairwise combination of files, check SNV distance 17 | with open(out_file, "w") as sys.stdout: 18 | 19 | # print header 20 | print('\t'.join(["Sample1", "Sample2", "SNVs", "BasesCompared", "TotalBases"])) 21 | 22 | # get all pairwise combinations of input files 23 | for element in itertools.combinations(in_files, 2): 24 | file1, file2 = element 25 | 26 | cmd1 = " ".join(["paste", file1, file2, "| sed '1d' | grep -v N | wc -l"]) 27 | totalPos = subprocess.check_output(cmd1, stdin=subprocess.PIPE, shell=True ).decode('ascii').rstrip('\n') 28 | 29 | cmd2 = " ".join(["paste", file1, file2, "| sed '1d' | grep -v N | awk 
'$1 != $2 {print $0}' | wc -l"]) 30 | diffPos = subprocess.check_output(cmd2, stdin=subprocess.PIPE, shell=True ).decode('ascii').rstrip('\n') 31 | 32 | fname1 = re.findall(r'consensus/(.+)\.', file1)[0] 33 | fname2 = re.findall(r'consensus/(.+)\.', file2)[0] 34 | 35 | # dist = subprocess.check_output(['./count_snvs.sh', file1, file2]).decode('ascii').rstrip('\n') 36 | print('\t'.join([fname1, fname2, str(diffPos), str(totalPos), str(totalBases)])) 37 | -------------------------------------------------------------------------------- /scripts/renderTree.R: -------------------------------------------------------------------------------- 1 | library(ggtree) 2 | library(phangorn) 3 | library(ggplot2) 4 | 5 | # input files 6 | tree_file <- snakemake@input[[1]] 7 | out_file <- snakemake@output[[1]] 8 | 9 | # read tree file 10 | tree <- read.tree(tree_file) 11 | 12 | # midpoint root tree if there is more than 1 node 13 | if (tree$Nnode > 1){ 14 | tree <- midpoint(tree) 15 | } 16 | 17 | # draw tree 18 | p <- ggtree(tree, branch.length = 1) + 19 | geom_tiplab(offset = .05) + 20 | geom_tippoint(size=3) + 21 | xlim(0, 4) + 22 | geom_treescale(width = 0.1, offset=0.25) 23 | 24 | # save output file 25 | ggsave(out_file, plot = p, height = 1*length(tree$tip.label)) 26 | -------------------------------------------------------------------------------- /tutorial/fastq/isolate1_1.fq: -------------------------------------------------------------------------------- 1 | /home/tamburin/fiona/bacteremia/isolate_reads_all/L2_PE1.fq -------------------------------------------------------------------------------- /tutorial/fastq/isolate1_2.fq: -------------------------------------------------------------------------------- 1 | /home/tamburin/fiona/bacteremia/isolate_reads_all/L2_PE2.fq -------------------------------------------------------------------------------- /tutorial/fastq/isolate2_1.fq: -------------------------------------------------------------------------------- 1 | 
/home/tamburin/fiona/bacteremia/isolate_reads_all/L5_PE1.fq -------------------------------------------------------------------------------- /tutorial/fastq/isolate2_2.fq: -------------------------------------------------------------------------------- 1 | /home/tamburin/fiona/bacteremia/isolate_reads_all/L5_PE2.fq -------------------------------------------------------------------------------- /tutorial/fastq/meta1_1.fq: -------------------------------------------------------------------------------- 1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S37_PE1.fq -------------------------------------------------------------------------------- /tutorial/fastq/meta1_2.fq: -------------------------------------------------------------------------------- 1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S37_PE2.fq -------------------------------------------------------------------------------- /tutorial/fastq/meta2_1.fq: -------------------------------------------------------------------------------- 1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S38_PE1.fq -------------------------------------------------------------------------------- /tutorial/fastq/meta2_2.fq: -------------------------------------------------------------------------------- 1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S38_PE2.fq -------------------------------------------------------------------------------- /tutorial/fastq/meta3_1.fq: -------------------------------------------------------------------------------- 1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S10_PE1.fq -------------------------------------------------------------------------------- /tutorial/fastq/meta3_2.fq: -------------------------------------------------------------------------------- 1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S10_PE2.fq -------------------------------------------------------------------------------- /tutorial/fastq/meta4_1.fq: 
--------------------------------------------------------------------------------
/home/tamburin/fiona/bacteremia/stool_reads_all/S6_PE1.fq
--------------------------------------------------------------------------------
/tutorial/fastq/meta4_2.fq:
--------------------------------------------------------------------------------
/home/tamburin/fiona/bacteremia/stool_reads_all/S6_PE2.fq
--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.amb:
--------------------------------------------------------------------------------
4641652 1 0
--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.ann:
--------------------------------------------------------------------------------
4641652 1 11
0 NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
0 4641652 0
--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.bwt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/tutorial/reference/E_coli_K12.fna.bwt
--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.fai:
--------------------------------------------------------------------------------
NC_000913.3 4641652 72 80 81
--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.pac:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/tutorial/reference/E_coli_K12.fna.pac
--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.sa:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/tutorial/reference/E_coli_K12.fna.sa
--------------------------------------------------------------------------------