├── .gitattributes
├── .gitignore
├── README.md
├── Snakefile
├── config.yaml
├── envs
    └── environment.yaml
├── scripts
    ├── ._callSNPs.py
    ├── ._coreSNPs2fasta.py
    ├── ._findCoreSNPs.py
    ├── ._getCoverage.py
    ├── ._pairwiseDist.py
    ├── ._renderTree.R
    ├── .snakemake.7g49ubxl.coreSNPs2fasta.py
    ├── .snakemake.9wtzh4ps.callSNPs.py
    ├── .snakemake.gej6roz5.callSNPs.py
    ├── .snakemake.jrhv5jyf.callSNPs.py
    ├── .snakemake.k53chwr8.callSNPs.py
    ├── .snakemake.mgp_kol5.callSNPs.py
    ├── .snakemake.sgrw01xs.callSNPs.py
    ├── callSNPs.py
    ├── coreSNPs2fasta.py
    ├── findCoreSNPs.py
    ├── getCoverage.py
    ├── pairwiseDist.py
    └── renderTree.R
└── tutorial
    ├── fastq
        ├── isolate1_1.fq
        ├── isolate1_2.fq
        ├── isolate2_1.fq
        ├── isolate2_2.fq
        ├── meta1_1.fq
        ├── meta1_2.fq
        ├── meta2_1.fq
        ├── meta2_2.fq
        ├── meta3_1.fq
        ├── meta3_2.fq
        ├── meta4_1.fq
        └── meta4_2.fq
    └── reference
        ├── E_coli_K12.fna
        ├── E_coli_K12.fna.amb
        ├── E_coli_K12.fna.ann
        ├── E_coli_K12.fna.bwt
        ├── E_coli_K12.fna.fai
        ├── E_coli_K12.fna.pac
        └── E_coli_K12.fna.sa


/.gitattributes:
--------------------------------------------------------------------------------
1 | *.fq binary
2 | *.fastq binary
3 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | ._Snakefile
2 | ._config.yaml
3 | .snakemake
4 | tutorial
5 | config_old.yaml
6 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # StrainSifter
  2 | 
  3 | A straightforward bioinformatic pipeline for detecting the presence of a bacterial strain in one or more metagenome(s).
  4 | 
  5 | StrainSifter is based on [Snakemake](https://snakemake.readthedocs.io/en/stable/). This pipeline allows you to output phylogenetic trees showing strain relatedness of input strains, as well as pairwise counts of single-nucleotide variants (SNVs) between input samples.
  6 | 
  7 | ## Installation
  8 | 
  9 | To run StrainSifter, you must have miniconda3 and Snakemake installed.
 10 | 
 11 | #### Install instructions (One time only)
 12 | 1. Download and install [miniconda3](https://conda.io/miniconda.html):
 13 | 
 14 |     For Linux:
 15 | 
 16 |         wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
 17 |         bash Miniconda3-latest-Linux-x86_64.sh
 18 | 
 19 | 2. Clone the StrainSifter workflow to the directory where you wish to run the pipeline:
 20 | 
 21 |         git clone https://github.com/bhattlab/strainsifter
 22 | 
 23 | 3. Create the new conda environment:
 24 | 
 25 |         cd strainsifter
 26 |         conda env create -f envs/environment.yaml
 27 | 
 28 | 4. Install Snakemake:
 29 | 
 30 |         conda install snakemake -c bioconda -c conda-forge
 31 | 
 32 | 
 33 |   StrainSifter has been developed and tested with Snakemake version 5.1.4 or higher. Check your version by typing:
 34 | 
 35 |     snakemake --version
 36 | 
 37 |   If you are running a version of Snakemake prior to 5.1.4, update to the latest version:
 38 | 
 39 |     conda update snakemake -c bioconda -c conda-forge
 40 | 
 41 | #### Activate the conda environment (Every time you use StrainSifter)
 42 | 
 43 |     source activate ssift
 44 | 
 45 | ### Dependencies
 46 | 
 47 | We recommend running StrainSifter in the provided conda environment. If you wish to run StrainSifter without using the conda environment, the following tools must be installed and in your system PATH:
 48 | * [Burrows-Wheeler Aligner (BWA)](http://bio-bwa.sourceforge.net)
 49 | * [Samtools](http://www.htslib.org)
 50 | * [Bamtools](https://github.com/pezmaster31/bamtools)
 51 | * [Bedtools](http://bedtools.readthedocs.io/en/latest/)
 52 | * [MUSCLE](https://www.drive5.com/muscle/)
 53 | * [FastTree](http://www.microbesonline.org/fasttree/)
 54 | * [Python3](https://www.python.org/downloads/)
 55 | * [R](https://www.r-project.org)
 56 | 
 57 | ## Running StrainSifter
 58 | 
 59 | Due to the computing demands of the StrainSifter pipeline, we recommend running on a computing cluster if possible.
 60 | Instructions to enable Snakemake to schedule cluster jobs with SLURM can be found at https://github.com/bhattlab/slurm
 61 | 
 62 | ### Input files
 63 | 
 64 | * Reference genome assembly in fasta format (can be a draft genome or a finished reference genome)
 65 | Acceptable file extensions: ".fasta", ".fa", ".fna"
 66 | 
 67 | * Two or more short read datasets in fastq format (metagenomic reads or isolate reads), optionally gzipped
 68 | Acceptable file extensions: ".fq", ".fastq", ".fq.gz", ".fastq.gz"
 69 | 
 70 | Short read data can be paired- or single-end.
 71 | 
 72 | You will need to indicate input files in the config file for each sample you wish to run StrainSifter on. This is described below in the *Config* section:
 73 | 
 74 | ### Config
 75 | 
 76 | You must update the config.yaml file as follows:
 77 | 
 78 | **reference:** Path to reference genome (fasta format)
 79 | 
 80 | **reads:** Samples and the file path(s) to the input reads.
 81 | 
 82 | <br>
 83 | 
 84 | Optionally, you can update the following parameters:
 85 | 
 86 | **prefix:** (optional) desired filename for output files. If blank, the name of the reference genome will be used.
 87 | 
 88 | **mapq:** minimum mapping quality score to evaluate a read alignment
 89 | 
 90 | **n_mismatches:** consider reads with this many mismatches or fewer
 91 | 
 92 | **min_cvg:** minimum read depth to determine the nucleotide at any given position
 93 | 
 94 | **min_genome_percent:** the minimum fraction of bases that must be covered at min_cvg or greater to process a sample
 95 | 
 96 | **base_freq:** minimum frequency of a nucleotide to call a base at any position
 97 | 
 98 | <br>
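As a rough illustration of how `min_cvg` and `base_freq` interact, the sketch below calls a consensus base at one position from the bases observed there. `call_base` is a hypothetical helper written for this example and is not part of StrainSifter; it ignores base quality and assumes that "N" stands for "no confident call".

```python
from collections import Counter


def call_base(bases, min_cvg=5, base_freq=0.8):
    """Return the consensus base for one position, or "N" if there is no confident call."""
    # too few reads cover this position to make a call
    if len(bases) < min_cvg:
        return "N"
    # most frequent base and how many reads support it
    base, count = Counter(bases).most_common(1)[0]
    # call the base only if it reaches the required frequency
    return base if count / len(bases) >= base_freq else "N"


print(call_base("AAAAA"))  # depth 5, all A
print(call_base("AAAAC"))  # depth 5, A at frequency 0.8
print(call_base("AAACC"))  # depth 5, A at frequency 0.6
print(call_base("AAAA"))   # depth 4, below min_cvg
```

With the default thresholds, the first two positions are called as "A" and the last two as "N".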
 99 | 
100 | Example config.yaml:
101 | 
102 |     ##### input files #####
103 | 
104 |     # reference genome (required)
105 |     reference: /path/to/ref.fna
106 | 
107 |     # short read data (at least two samples required)
108 |     reads:
109 |       sample1:
110 |         [/path/to/sample1_R1.fq,
111 |          /path/to/sample1_R2.fq]
112 |       sample2:
113 |         [/path/to/sample2_R1.fq,
114 |          /path/to/sample2_R2.fq]
115 |       sample3: /path/to/sample3.fq
116 | 
117 |     # prefix for output files (optional - can leave blank)
118 |     prefix:
119 | 
120 | 
121 |     ##### workflow parameters #####
122 | 
123 |     # alignment parameters:
124 |     mapq: 60
125 |     n_mismatches: 5
126 | 
127 |     # variant calling parameters:
128 |     min_cvg: 5
129 |     min_genome_percent: 0.5
130 |     base_freq: 0.8
131 | 
132 | 
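Before launching a long run, it can help to sanity-check the config. The sketch below works on an already-loaded config dictionary (for example, what a YAML parser would return for the file above). `normalize_reads` is a hypothetical helper, not part of StrainSifter. The rule that each sample has one or two read files is an assumption drawn from the paired- or single-end description.

```python
def normalize_reads(config):
    """Check required keys and return reads as {sample: [path, ...]}."""
    for key in ("reference", "reads"):
        if not config.get(key):
            raise ValueError(f"config.yaml is missing required key: {key}")

    reads = {}
    for sample, paths in config["reads"].items():
        # a single-end sample may be given as a bare string
        paths = [paths] if isinstance(paths, str) else list(paths)
        if len(paths) not in (1, 2):
            raise ValueError(f"{sample}: expected 1 or 2 read files, got {len(paths)}")
        reads[sample] = paths

    if len(reads) < 2:
        raise ValueError("at least two samples are required")
    return reads


example = {
    "reference": "/path/to/ref.fna",
    "reads": {
        "sample1": ["/path/to/sample1_R1.fq", "/path/to/sample1_R2.fq"],
        "sample3": "/path/to/sample3.fq",
    },
    "prefix": None,
}
print(normalize_reads(example))
```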
133 | ### Running StrainSifter
134 | 
135 | To run StrainSifter, the config file must be present in the directory in which you wish to run the workflow.
136 | You should then be able to run StrainSifter as follows:
137 | 
138 | #### Phylogeny
139 | 
140 | To generate a phylogenetic tree showing all of the input samples that contain your strain of interest at sufficient coverage to profile:
141 | 
142 |     snakemake {prefix}.tree.pdf
143 | 
144 | #### SNV counts
145 | 
146 | To generate a list of pairwise SNV counts between all input samples:
147 | 
148 |     snakemake {prefix}.dist.tsv
149 | 
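To make "pairwise SNV counts" concrete, here is a minimal sketch that compares per-sample consensus sequences position by position. It assumes that "N" marks a position without a confident call and that such positions are skipped. `snv_count` and `pairwise_snvs` are hypothetical names, and the actual `scripts/pairwiseDist.py` may count differently.

```python
from itertools import combinations


def snv_count(seq_a, seq_b):
    """Count positions where both samples have a call and the calls differ."""
    return sum(
        1 for a, b in zip(seq_a, seq_b)
        if a != "N" and b != "N" and a != b
    )


def pairwise_snvs(consensus):
    """Return {(sample1, sample2): SNV count} for every pair of samples."""
    return {
        (s1, s2): snv_count(consensus[s1], consensus[s2])
        for s1, s2 in combinations(sorted(consensus), 2)
    }


# toy consensus sequences, one character per reference position
consensus = {"isolate1": "ACGTN", "meta1": "ACGAA", "meta2": "TCGAN"}
for pair, count in pairwise_snvs(consensus).items():
    print(pair, count)
```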
150 | ### FAQ
151 | 
152 | Q: Can StrainSifter be used for non-bacterial genomes (e.g. yeast)?
153 | 
154 | A: At present, we recommend StrainSifter for bacteria only. In theory, StrainSifter should work for yeast if a haploid reference genome is provided.
155 | 


--------------------------------------------------------------------------------
/Snakefile:
--------------------------------------------------------------------------------
  1 | import os
  2 | import re
  3 | from snakemake.utils import min_version
  4 | 
  5 | ##### set minimum snakemake version #####
  6 | min_version("5.1.4")
  7 | 
  8 | ##### load config file and sample list #####
  9 | configfile: "config.yaml"
 10 | 
 11 | ##### list of input samples #####
 12 | samples = [key for key in config['reads']]
 13 | 
 14 | ##### prefix for phylogenetic tree and SNV distance files #####
 15 | if config['prefix'] is None:
 16 | 	prefix = re.split(r"/|\.", config['reference'])[-2]
 17 | else:
 18 | 	prefix = config['prefix']
 19 | 
 20 | ##### rules #####
 21 | 
 22 | # index reference genome for bwa alignment
 23 | rule bwa_index:
 24 | 	input: config['reference']
 25 | 	output:
 26 | 		"{ref}.amb".format(ref=config['reference']),
 27 | 		"{ref}.ann".format(ref=config['reference']),
 28 | 		"{ref}.bwt".format(ref=config['reference']),
 29 | 		"{ref}.pac".format(ref=config['reference']),
 30 | 		"{ref}.sa".format(ref=config['reference'])
 31 | 	resources:
 32 | 		mem=2,
 33 | 		time=1
 34 | 	shell:
 35 | 		"bwa index {input}"
 36 | 
 37 | # align reads to reference genome with bwa
 38 | rule bwa_align:
 39 | 	input:
 40 | 		ref_index = rules.bwa_index.output,
 41 | 		r = lambda wildcards: config["reads"][wildcards.sample]
 42 | 	output:
 43 | 		"filtered_bam/{sample}.filtered.bam"
 44 | 	resources:
 45 | 		mem=32,
 46 | 		time=6
 47 | 	threads: 8
 48 | 	params:
 49 | 		ref = config['reference'],
 50 | 		qual=config['mapq'],
 51 | 		nm=config['n_mismatches']
 52 | 	shell:
 53 | 		"bwa mem -t {threads} {params.ref} {input.r} | "\
 54 | 		"samtools view -b -q {params.qual} | "\
 55 | 		"bamtools filter -tag 'NM:<={params.nm}' | "\
 56 | 		"samtools sort --threads {threads} -o {output}"
 57 | 
 58 | # count base read coverage
 59 | rule genomecov:
 60 | 	input:
 61 | 		rules.bwa_align.output
 62 | 	output:
 63 | 		"genomecov/{sample}.tsv"
 64 | 	resources:
 65 | 		mem=8,
 66 | 		time=1,
 67 | 	threads: 1
 68 | 	shell:
 69 | 		"bedtools genomecov -ibam {input} > {output}"
 70 | 
 71 | # calculate average coverage across the genome
 72 | rule calc_coverage:
 73 | 	input:
 74 | 		rules.genomecov.output
 75 | 	output:
 76 | 		"coverage/{sample}.cvg"
 77 | 	resources:
 78 | 		mem=8,
 79 | 		time=1,
 80 | 	threads: 1
 81 | 	params:
 82 | 		cvg=config['min_cvg']
 83 | 	script:
 84 | 		"scripts/getCoverage.py"
 85 | 
 86 | # filter samples that meet coverage requirements
 87 | rule filter_samples:
 88 | 	input: expand("coverage/{sample}.cvg", sample = samples)
 89 | 	output:
 90 | 		dynamic("passed_samples/{sample}.bam")
 91 | 	resources:
 92 | 		mem=1,
 93 | 		time=1
 94 | 	threads: 1
 95 | 	params:
 96 | 		min_cvg=config['min_cvg'],
 97 | 		min_perc=config['min_genome_percent']
 98 | 	run:
 99 | 		samps = input
100 | 		for samp in samps:
101 | 			with open(samp) as s:
102 | 				cvg, perc = s.readline().rstrip('\n').split('\t')
103 | 			if (float(cvg) >= params.min_cvg and float(perc) > params.min_perc):
104 | 				shell("ln -s $PWD/filtered_bam/{s}.filtered.bam passed_samples/{s}.bam".format(s=os.path.splitext(os.path.basename(samp))[0]))
105 | 
106 | # index reference genome for pileup
107 | rule faidx:
108 | 	input: config['reference']
109 | 	output: "{ref}.fai".format(ref=config['reference'])
110 | 	resources:
111 | 		mem=2,
112 | 		time=1
113 | 	shell:
114 | 		"samtools faidx {input}"
115 | 
116 | # create pileup from bam files
117 | rule pileup:
118 | 	input:
119 | 		bam="passed_samples/{sample}.bam",
120 | 		ref=config['reference'],
121 | 		index=rules.faidx.output
122 | 	output: "pileup/{sample}.pileup"
123 | 	resources:
124 | 		mem=32,
125 | 		time=1
126 | 	threads: 16
127 | 	shell:
128 | 		"samtools mpileup -f {input.ref} -B -aa -o {output} {input.bam}"
129 | 
130 | # call SNPs from pileup
131 | rule call_snps:
132 | 	input: rules.pileup.output
133 | 	output: "snp_calls/{sample}.tsv"
134 | 	resources:
135 | 		mem=32,
136 | 		time=2
137 | 	threads: 16
138 | 	params:
139 | 		min_cvg=config['min_cvg'],
140 | 		min_freq=config['base_freq'],
141 | 		min_qual=20
142 | 	script:
143 | 		"scripts/callSNPs.py"
144 | 
145 | # get consensus sequence from pileup
146 | rule snp_consensus:
147 | 	input: rules.call_snps.output
148 | 	output: "consensus/{sample}.txt"
149 | 	resources:
150 | 		mem=2,
151 | 		time=2
152 | 	threads: 1
153 | 	shell:
154 | 		"echo {wildcards.sample} > {output}; cut -f4 {input} >> {output}"
155 | 
156 | # combine consensus sequences into one file
157 | rule combine:
158 | 	input:
159 | 		dynamic("consensus/{sample}.txt")
160 | 	output: "{name}.cns.tsv".format(name = prefix)
161 | 	resources:
162 | 		mem=2,
163 | 		time=1
164 | 	threads: 1
165 | 	shell:
166 | 		"paste {input} > {output}"
167 | 
168 | # find positions that have a base call in each input genome and at least
169 | # one variant in the set of input genomes
170 | rule core_snps:
171 | 	input: rules.combine.output
172 | 	output: "{name}.core_snps.tsv".format(name = prefix)
173 | 	resources:
174 | 		mem=16,
175 | 		time=1
176 | 	threads: 1
177 | 	script:
178 | 		"scripts/findCoreSNPs.py"
179 | 
180 | # convert core SNPs file to fasta format
181 | rule core_snps_to_fasta:
182 | 	input: rules.core_snps.output
183 | 	output: "{name}.fasta".format(name = prefix)
184 | 	resources:
185 | 		mem=16,
186 | 		time=1
187 | 	threads: 1
188 | 	script:
189 | 		"scripts/coreSNPs2fasta.py"
190 | 
191 | # perform multiple sequence alignment of fasta file
192 | rule multi_align:
193 | 	input: rules.core_snps_to_fasta.output
194 | 	output: "{name}.afa".format(name = prefix)
195 | 	resources:
196 | 		mem=200,
197 | 		time=12
198 | 	threads: 1
199 | 	shell:
200 | 		"muscle -in {input} -out {output}"
201 | 
202 | # calculate phylogenetic tree from multiple sequence alignment
203 | rule build_tree:
204 | 	input: rules.multi_align.output
205 | 	output: "{name}.tree".format(name = prefix)
206 | 	resources:
207 | 		mem=8,
208 | 		time=1
209 | 	threads: 1
210 | 	shell:
211 | 		"fasttree -nt {input} > {output}"
212 | 
213 | # plot phylogenetic tree
214 | rule plot_tree:
215 | 	input: rules.build_tree.output
216 | 	output: "{name}.tree.pdf".format(name = prefix)
217 | 	resources:
218 | 		mem=8,
219 | 		time=1
220 | 	threads: 1
221 | 	script:
222 | 		"scripts/renderTree.R"
223 | 
224 | # count pairwise SNVs between input samples
225 | rule pairwise_snvs:
226 | 	input: dynamic("consensus/{sample}.txt")
227 | 	output: "{name}.dist.tsv".format(name = prefix)
228 | 	resources:
229 | 		mem=8,
230 | 		time=1
231 | 	threads: 1
232 | 	script:
233 | 		"scripts/pairwiseDist.py"
234 | 
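The workflow derives two kinds of names from file paths: the output prefix from the reference path, and sample names from `coverage/{sample}.cvg`. The sketch below shows how the split-based prefix behaves, and why `os.path.splitext` is a safer way to drop a suffix than `str.rstrip`, which removes a set of characters rather than a suffix. `prefix_by_splitext` and `sample_from_cvg` are hypothetical alternatives for illustration, not code from the workflow.

```python
import os
import re


def prefix_by_split(reference):
    """Second-to-last piece after splitting the path on '/' and '.'."""
    return re.split(r"/|\.", reference)[-2]


def prefix_by_splitext(reference):
    """Alternative (an assumption): basename minus its final extension."""
    return os.path.splitext(os.path.basename(reference))[0]


def sample_from_cvg(path):
    """Sample name from a coverage/{sample}.cvg path."""
    return os.path.splitext(os.path.basename(path))[0]


print(prefix_by_split("/path/to/ref.fna"))            # ref
print(prefix_by_split("genomes/E_coli.K12.fna"))      # K12 (only the last dotted piece)
print(prefix_by_splitext("genomes/E_coli.K12.fna"))   # E_coli.K12

# rstrip(".cvg") strips any trailing '.', 'c', 'v' or 'g' characters
print("sampleg.cvg".rstrip(".cvg"))                   # sample
print(sample_from_cvg("coverage/sampleg.cvg"))        # sampleg
```

If a reference file name contains extra dots, setting `prefix` explicitly in the config avoids the surprise.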


--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
 1 | ##### input files #####
 2 | 
 3 | # reference genome (required)
 4 | reference: /path/to/ref.fna
 5 | 
 6 | # short read data (at least two samples required)
 7 | reads:
 8 |   sample1:
 9 |     [/path/to/sample1_R1.fq,
10 |      /path/to/sample1_R2.fq]
11 |   sample2:
12 |     [/path/to/sample2_R1.fq,
13 |      /path/to/sample2_R2.fq]
14 |   sample3: /path/to/sample3.fq
15 | 
16 | # prefix for output files (optional - can leave blank)
17 | prefix:
18 | 
19 | 
20 | ##### workflow parameters #####
21 | 
22 | # alignment parameters:
23 | mapq: 60
24 | n_mismatches: 5
25 | 
26 | # variant calling parameters:
27 | min_cvg: 5
28 | min_genome_percent: 0.5
29 | base_freq: 0.8
30 | 
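The `min_cvg` and `min_genome_percent` settings decide which samples continue to variant calling. As a sketch of the underlying arithmetic, the function below summarises the histogram that `bedtools genomecov -ibam` writes by default (tab-separated: chromosome or "genome", depth, bases at that depth, chromosome size, fraction). `coverage_summary` is a hypothetical helper; `scripts/getCoverage.py` is not shown here and may compute its two values differently.

```python
def coverage_summary(lines, min_cvg):
    """Return (mean depth, fraction of bases with depth >= min_cvg)."""
    total_bases = 0
    weighted_depth = 0
    covered = 0
    for line in lines:
        chrom, depth, count, _size, _fraction = line.rstrip("\n").split("\t")
        # use only the genome-wide summary rows
        if chrom != "genome":
            continue
        depth, count = int(depth), int(count)
        total_bases += count
        weighted_depth += depth * count
        if depth >= min_cvg:
            covered += count
    if total_bases == 0:
        return 0.0, 0.0
    return weighted_depth / total_bases, covered / total_bases


# toy histogram for a 100 bp genome
histogram = [
    "genome\t0\t20\t100\t0.2",
    "genome\t4\t30\t100\t0.3",
    "genome\t10\t50\t100\t0.5",
]
mean_depth, fraction = coverage_summary(histogram, min_cvg=5)
print(mean_depth, fraction)
```

In this toy case the mean depth is (0·20 + 4·30 + 10·50) / 100 = 6.2, and half of the genome is covered at depth 5 or more.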


--------------------------------------------------------------------------------
/envs/environment.yaml:
--------------------------------------------------------------------------------
 1 | name: ssift
 2 | channels:
 3 |   - bioconda
 4 |   - defaults
 5 |   - conda-forge
 6 | dependencies:
 7 |   - bwa
 8 |   - samtools
 9 |   - ncurses
10 |   - bamtools
11 |   - bedtools
12 |   - MUSCLE
13 |   - FastTree
14 |   - r-ggplot2
15 |   - bioconductor-ggtree
16 |   - r-phangorn
17 | 


--------------------------------------------------------------------------------
/scripts/._callSNPs.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/scripts/._callSNPs.py


--------------------------------------------------------------------------------
/scripts/._coreSNPs2fasta.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/scripts/._coreSNPs2fasta.py


--------------------------------------------------------------------------------
/scripts/._findCoreSNPs.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/scripts/._findCoreSNPs.py


--------------------------------------------------------------------------------
/scripts/._getCoverage.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/scripts/._getCoverage.py


--------------------------------------------------------------------------------
/scripts/._pairwiseDist.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/scripts/._pairwiseDist.py


--------------------------------------------------------------------------------
/scripts/._renderTree.R:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/scripts/._renderTree.R


--------------------------------------------------------------------------------
/scripts/.snakemake.7g49ubxl.coreSNPs2fasta.py:
--------------------------------------------------------------------------------
 1 | 
 2 | ######## Snakemake header ########
 3 | import sys; sys.path.append("/home/tamburin/tools/miniconda3/lib/python3.6/site-packages"); import pickle; snakemake = pickle.loads(b'\x80\x03csnakemake.script\nSnakemake\nq\x00)\x81q\x01}q\x02(X\x05\x00\x00\x00inputq\x03csnakemake.io\nInputFiles\nq\x04)\x81q\x05X\x14\x00\x00\x00E_coli.core_snps.tsvq\x06a}q\x07X\x06\x00\x00\x00_namesq\x08}q\tsbX\x06\x00\x00\x00outputq\ncsnakemake.io\nOutputFiles\nq\x0b)\x81q\x0cX\x0c\x00\x00\x00E_coli.fastaq\ra}q\x0eh\x08}q\x0fsbX\x06\x00\x00\x00paramsq\x10csnakemake.io\nParams\nq\x11)\x81q\x12}q\x13h\x08}q\x14sbX\t\x00\x00\x00wildcardsq\x15csnakemake.io\nWildcards\nq\x16)\x81q\x17}q\x18h\x08}q\x19sbX\x07\x00\x00\x00threadsq\x1aK\x01X\t\x00\x00\x00resourcesq\x1bcsnakemake.io\nResources\nq\x1c)\x81q\x1d(K\x01K\x01K\x10K\x01e}q\x1e(h\x08}q\x1f(X\x06\x00\x00\x00_coresq K\x00N\x86q!X\x06\x00\x00\x00_nodesq"K\x01N\x86q#X\x03\x00\x00\x00memq$K\x02N\x86q%X\x04\x00\x00\x00timeq&K\x03N\x86q\'uh K\x01h"K\x01h$K\x10h&K\x01ubX\x03\x00\x00\x00logq(csnakemake.io\nLog\nq))\x81q*}q+h\x08}q,sbX\x06\x00\x00\x00configq-}q.(X\t\x00\x00\x00referenceq/X\x0e\x00\x00\x00isolate2.fastaq0X\x05\x00\x00\x00readsq1}q2(X\x05\x00\x00\x00meta1q3]q4(X\x16\x00\x00\x00reads/meta1_1.fastq.gzq5X\x16\x00\x00\x00reads/meta1_1.fastq.gzq6eX\x05\x00\x00\x00meta2q7]q8(X\x16\x00\x00\x00reads/meta2_1.fastq.gzq9X\x16\x00\x00\x00reads/meta2_1.fastq.gzq:eX\x05\x00\x00\x00meta3q;]q<(X\x16\x00\x00\x00reads/meta3_1.fastq.gzq=X\x16\x00\x00\x00reads/meta3_1.fastq.gzq>eX\x08\x00\x00\x00isolate1q?]q@(X\x19\x00\x00\x00reads/isolate1_1.fastq.gzqAX\x19\x00\x00\x00reads/isolate1_2.fastq.gzqBeX\x08\x00\x00\x00isolate2qC]qD(X\x19\x00\x00\x00reads/isolate2_1.fastq.gzqEX\x19\x00\x00\x00reads/isolate2_2.fastq.gzqFeuX\x06\x00\x00\x00prefixqGX\x06\x00\x00\x00E_coliqHX\x04\x00\x00\x00mapqqIK<X\x0c\x00\x00\x00n_mismatchesqJK\x05X\x07\x00\x00\x00min_cvgqKK\x05X\x12\x00\x00\x00min_genome_percentqLG?\xe0\x00\x00\x00\x00\x00\x00X\t\x00\x00\x00base_freqqMG?\xe9\x99\x99\x99\x99\x99\x9auX\x04\x00\x00\
x00ruleqNX\x12\x00\x00\x00core_snps_to_fastaqOub.'); from snakemake.logging import logger; logger.printshellcmds = False
 4 | ######## Original script #########
 5 | import sys
 6 | 
 7 | HEADER_POS = 0
 8 | FASTA_POS = 1
 9 | #coreSNPs_file = sys.argv[1]
10 | 
11 | snp_file = snakemake.input[0]
12 | out_file = snakemake.output[0]
13 | 
14 | # hold sequences as we go
15 | fastas = []
16 | 
17 | with open(snp_file, "r") as core_snps:
18 | 
19 |     # process header
20 |     header = core_snps.readline().rstrip("\n").split("\t")
21 | 
 22 |     # start one [name, sequence] entry per sample, with an empty sequence
23 |     for h in range(len(header)):
24 |         fastas += [[header[h], ""]]
25 | 
26 |     for line in core_snps:
27 |         line = line.rstrip("\n").split("\t")
28 | 
29 |         for pos in range(len(line)):
30 |             # print(fastas[pos])
31 |             # print(fastas[pos][FASTA_POS])
32 |             fastas[pos][FASTA_POS] += line[pos]
33 | 
 34 | with open(out_file, "w") as fasta_out:
 35 |     for f in range(len(fastas)):
 36 |         print(">" + fastas[f][0], file=fasta_out)
 37 |         print(fastas[f][1], file=fasta_out)
38 | 
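The Snakefile describes core SNPs as positions that have a base call in every input genome and at least one variant among them. The sketch below applies that definition to a small table and then transposes the surviving rows into FASTA records, one per sample. It assumes "N" marks a missing call. `core_snp_rows` and `to_fasta` are hypothetical helpers, and `scripts/findCoreSNPs.py` may differ in detail.

```python
def core_snp_rows(rows):
    """Keep positions called in every sample where the calls are not all identical."""
    return [row for row in rows if "N" not in row and len(set(row)) > 1]


def to_fasta(header, rows):
    """Turn per-position rows into one FASTA record per sample (column)."""
    lines = []
    # zip(*rows) transposes positions x samples into samples x positions
    for name, column in zip(header, zip(*rows)):
        lines.append(">" + name)
        lines.append("".join(column))
    return "\n".join(lines)


header = ["s1", "s2", "s3"]
rows = [
    ["A", "A", "A"],  # no variant: dropped
    ["A", "C", "A"],  # core SNP
    ["N", "C", "A"],  # missing call in s1: dropped
    ["G", "G", "T"],  # core SNP
]
print(to_fasta(header, core_snp_rows(rows)))
```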


--------------------------------------------------------------------------------
/scripts/.snakemake.9wtzh4ps.callSNPs.py:
--------------------------------------------------------------------------------
  1 | 
  2 | ######## Snakemake header ########
  3 | import sys; sys.path.append("/home/tamburin/tools/miniconda3/lib/python3.6/site-packages"); import pickle; snakemake = pickle.loads(b'\x80\x03csnakemake.script\nSnakemake\nq\x00)\x81q\x01}q\x02(X\x05\x00\x00\x00inputq\x03csnakemake.io\nInputFiles\nq\x04)\x81q\x05X\x1c\x00\x00\x00pileup/4308_7-26-2016.pileupq\x06a}q\x07X\x06\x00\x00\x00_namesq\x08}q\tsbX\x06\x00\x00\x00outputq\ncsnakemake.io\nOutputFiles\nq\x0b)\x81q\x0cX\x1c\x00\x00\x00snp_calls/4308_7-26-2016.tsvq\ra}q\x0eh\x08}q\x0fsbX\x06\x00\x00\x00paramsq\x10csnakemake.io\nParams\nq\x11)\x81q\x12(K\x05G?\xe9\x99\x99\x99\x99\x99\x9aK\x14e}q\x13(h\x08}q\x14(X\x07\x00\x00\x00min_cvgq\x15K\x00N\x86q\x16X\x08\x00\x00\x00min_freqq\x17K\x01N\x86q\x18X\x08\x00\x00\x00min_qualq\x19K\x02N\x86q\x1auh\x15K\x05h\x17G?\xe9\x99\x99\x99\x99\x99\x9ah\x19K\x14ubX\t\x00\x00\x00wildcardsq\x1bcsnakemake.io\nWildcards\nq\x1c)\x81q\x1dX\x0e\x00\x00\x004308_7-26-2016q\x1ea}q\x1f(h\x08}q X\x06\x00\x00\x00sampleq!K\x00N\x86q"sX\x06\x00\x00\x00sampleq#h\x1eubX\x07\x00\x00\x00threadsq$K\x01X\t\x00\x00\x00resourcesq%csnakemake.io\nResources\nq&)\x81q\'(K\x01K\x01K K\x02e}q((h\x08}q)(X\x06\x00\x00\x00_coresq*K\x00N\x86q+X\x06\x00\x00\x00_nodesq,K\x01N\x86q-X\x03\x00\x00\x00memq.K\x02N\x86q/X\x04\x00\x00\x00timeq0K\x03N\x86q1uh*K\x01h,K\x01h.K h0K\x02ubX\x03\x00\x00\x00logq2csnakemake.io\nLog\nq3)\x81q4}q5h\x08}q6sbX\x06\x00\x00\x00configq7}q8(X\t\x00\x00\x00referenceq9XO\x00\x00\x00/home/tamburin/fiona/bacteremia/1.assemble_trimmed/filtered_assemblies/L3.fastaq:X\x07\x00\x00\x00samplesq;X0\x00\x00\x00/home/tamburin/fiona/crassphage/all_samples.listq<X\t\x00\x00\x00reads_dirq=X)\x00\x00\x00/home/tamburin/fiona/crassphage/readlinksq>X\n\x00\x00\x00paired_endq?X\x01\x00\x00\x00Yq@X\x07\x00\x00\x00min_cvgqAK\nuX\x04\x00\x00\x00ruleqBX\t\x00\x00\x00call_snpsqCub.'); from snakemake.logging import logger; logger.printshellcmds = False
  4 | ######## Original script #########
  5 | #!/usr/bin/env python3
  6 | # callSNPS.py
  7 | # call SNPs from a samtools pileup file
  8 | 
  9 | import sys
 10 | import gzip
 11 | import re
 12 | 
 13 | DEBUG = 0
 14 | 
 15 | ASCII_OFFSET = ord("!")
 16 | INDEL_PATTERN = re.compile(r"[+-](\d+)")
 17 | INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}"
 18 | ROUND = 3  # places to right of decimal point
 19 | 
 20 | min_coverage, min_proportion, min_qual = snakemake.params
 21 | pileup_file = snakemake.input[0]
 22 | out_file = snakemake.output[0]
 23 | 
 24 | min_coverage = int(min_coverage)
 25 | min_proportion = float(min_proportion)
 26 | min_qual = int(min_qual)
 27 | 
 28 | ## function to process each line of pileup
 29 | 
 30 | def parse_pileup(line):
 31 | 
 32 |     # read fields from line of pileup
 33 |     chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t")
 34 | 
 35 |     # proportion is 0 if no consensus base, top base is N by default
 36 |     parsed = {
 37 |         "proportion": 0.0, "chromosome": chromosome,
 38 |         "position": int(position), "reference": reference,
 39 |         "coverage": int(coverage), "pileup": pileup,
 40 |         "quality": quality, "top_base": "N"}
 41 | 
 42 |     # if the base coverage is below the minimum, call N
 43 |     if parsed["coverage"] < min_coverage:
 44 |         return parsed
 45 | 
 46 |     # uppercase pileup string for processing
 47 |     pileup = pileup.upper()
 48 | 
 49 |     # Remove start and stop characters from pileup string
 50 |     pileup = re.sub(r"\^.|\$", "", pileup)
 51 | 
 52 |     # Remove indels from pileup string
 53 |     start = 0
 54 | 
 55 |     while True:
 56 |         match = INDEL_PATTERN.search(pileup, start)
 57 | 
 58 |         if match:
 59 |             integer = match.group(1)
 60 |             pileup = re.sub(INDEL_STRING.replace("INTEGER", integer), "", pileup)
 61 |             start = match.start()
 62 |         else:
 63 |             break
 64 | 
 65 |     # get total base count and top base count
 66 |     total = 0
 67 |     top_base = "N"
 68 |     top_base_count = 0
 69 | 
 70 |     # uppercase reference base for comparison
 71 |     reference = reference.upper()
 72 | 
 73 |     base_counts = {"A": 0, "C": 0, "G": 0, "T": 0}
 74 | 
 75 |     quality_length = len(quality)
 76 | 
 77 |     for i in range(quality_length):
 78 | 
 79 |         # convert ASCII character to phred base quality
 80 |         base_quality = ord(quality[i]) - ASCII_OFFSET
 81 | 
 82 |         # only count high-quality bases
 83 |         if base_quality >= min_qual:
 84 | 
 85 |             currBase = pileup[i]
 86 | 
 87 |             if currBase in base_counts:
 88 |                 base_counts[currBase] += 1
 89 | 
 90 |             else:
 91 |                 base_counts[reference] += 1
 92 | 
 93 |             total += 1
 94 | 
 95 |     parsed["total"] = total
 96 | 
 97 |     # find top base
 98 |     for base in base_counts:
 99 |         if base_counts[base] > top_base_count:
100 |             top_base = base
101 |             top_base_count = base_counts[base]
102 | 
103 |     # if more than 0 bases processed
104 |     if total > 0:
105 |         prop = top_base_count / float(total)
106 | 
107 |         if prop >= min_proportion:
108 |             parsed["proportion"] = prop
109 |             parsed["top_base"] = top_base
110 | 
111 |     return parsed
112 | 
113 | ## process pileup file
114 | 
115 | # read in input file
116 | if not DEBUG:
117 |     out_file = open(out_file, "w")
118 | 
119 | with open(pileup_file, "rt") as pileup_file:
120 |     for pileup_file_line in pileup_file:
121 | 
122 |         parsed = parse_pileup(pileup_file_line)
123 | 
124 |         proportion = parsed["proportion"]
125 | 
126 |         # print to output file
127 |         if DEBUG:
128 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
129 |                 parsed["pileup"], parsed["quality"]]))
130 |         else:
131 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
132 |                 parsed["pileup"], parsed["quality"]]), file=out_file)
133 | 
134 | if not DEBUG:
135 |     out_file.close()
136 | 
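To see what the clean-up steps in `parse_pileup` accomplish, here is a small standalone sketch that strips read start/end markers and indel descriptions from a samtools pileup string, then tallies the remaining bases. It is a simplified variant, not the script's exact logic: it skips each indel in a single pass using the declared length, and it ignores characters other than `.`, `,` and A/C/G/T instead of counting them toward the reference base. The function names are hypothetical.

```python
import re
from collections import Counter


def strip_pileup_markup(pileup):
    """Remove '^x' read starts, '$' read ends, and '+nSEQ' / '-nSEQ' indels."""
    pileup = re.sub(r"\^.|\$", "", pileup)
    out = []
    i = 0
    while i < len(pileup):
        ch = pileup[i]
        length = re.match(r"\d+", pileup[i + 1:]) if ch in "+-" else None
        if length:
            # skip the sign, the digits, and the inserted or deleted bases
            i += 1 + len(length.group()) + int(length.group())
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def count_bases(cleaned, reference):
    """Tally bases; '.' and ',' mean the read matches the reference."""
    counts = Counter()
    for ch in cleaned.upper():
        if ch in ".,":
            counts[reference.upper()] += 1
        elif ch in "ACGT":
            counts[ch] += 1
    return counts


raw = "^I.,+2AG.T-1c,$"
cleaned = strip_pileup_markup(raw)
print(cleaned)                    # .,.T,
print(count_bases(cleaned, "A"))  # four reads support A, one supports T
```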


--------------------------------------------------------------------------------
/scripts/.snakemake.gej6roz5.callSNPs.py:
--------------------------------------------------------------------------------
  1 | 
  2 | ######## Snakemake header ########
  3 | import sys; sys.path.append("/home/tamburin/tools/miniconda3/lib/python3.6/site-packages"); import pickle; snakemake = pickle.loads(b'\x80\x03csnakemake.script\nSnakemake\nq\x00)\x81q\x01}q\x02(X\x05\x00\x00\x00inputq\x03csnakemake.io\nInputFiles\nq\x04)\x81q\x05X\x1c\x00\x00\x00pileup/4757_2-16-2016.pileupq\x06a}q\x07X\x06\x00\x00\x00_namesq\x08}q\tsbX\x06\x00\x00\x00outputq\ncsnakemake.io\nOutputFiles\nq\x0b)\x81q\x0cX\x1c\x00\x00\x00snp_calls/4757_2-16-2016.tsvq\ra}q\x0eh\x08}q\x0fsbX\x06\x00\x00\x00paramsq\x10csnakemake.io\nParams\nq\x11)\x81q\x12(K\x05G?\xe9\x99\x99\x99\x99\x99\x9aK\x14e}q\x13(h\x08}q\x14(X\x07\x00\x00\x00min_cvgq\x15K\x00N\x86q\x16X\x08\x00\x00\x00min_freqq\x17K\x01N\x86q\x18X\x08\x00\x00\x00min_qualq\x19K\x02N\x86q\x1auh\x15K\x05h\x17G?\xe9\x99\x99\x99\x99\x99\x9ah\x19K\x14ubX\t\x00\x00\x00wildcardsq\x1bcsnakemake.io\nWildcards\nq\x1c)\x81q\x1dX\x0e\x00\x00\x004757_2-16-2016q\x1ea}q\x1f(h\x08}q X\x06\x00\x00\x00sampleq!K\x00N\x86q"sX\x06\x00\x00\x00sampleq#h\x1eubX\x07\x00\x00\x00threadsq$K\x01X\t\x00\x00\x00resourcesq%csnakemake.io\nResources\nq&)\x81q\'(K\x01K\x01K K\x02e}q((h\x08}q)(X\x06\x00\x00\x00_coresq*K\x00N\x86q+X\x06\x00\x00\x00_nodesq,K\x01N\x86q-X\x03\x00\x00\x00memq.K\x02N\x86q/X\x04\x00\x00\x00timeq0K\x03N\x86q1uh*K\x01h,K\x01h.K h0K\x02ubX\x03\x00\x00\x00logq2csnakemake.io\nLog\nq3)\x81q4}q5h\x08}q6sbX\x06\x00\x00\x00configq7}q8(X\t\x00\x00\x00referenceq9X?\x00\x00\x00/home/tamburin/fiona/crassphage/strainsifter/ref/B_vulgatus.fnaq:X\x07\x00\x00\x00samplesq;X0\x00\x00\x00/home/tamburin/fiona/crassphage/all_samples.listq<X\t\x00\x00\x00reads_dirq=X)\x00\x00\x00/home/tamburin/fiona/crassphage/readlinksq>X\n\x00\x00\x00paired_endq?X\x01\x00\x00\x00Yq@X\x07\x00\x00\x00min_cvgqAK\nuX\x04\x00\x00\x00ruleqBX\t\x00\x00\x00call_snpsqCub.'); from snakemake.logging import logger; logger.printshellcmds = False
  4 | ######## Original script #########
  5 | #!/usr/bin/env python3
  6 | # callSNPS.py
  7 | # call SNPs from a samtools pileup file
  8 | 
  9 | import sys
 10 | import gzip
 11 | import re
 12 | 
 13 | DEBUG = 0
 14 | 
 15 | ASCII_OFFSET = ord("!")
 16 | INDEL_PATTERN = re.compile(r"[+-](\d+)")
 17 | INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}"
 18 | ROUND = 3  # places to right of decimal point
 19 | 
 20 | min_coverage, min_proportion, min_qual = snakemake.params
 21 | pileup_file = snakemake.input[0]
 22 | out_file = snakemake.output[0]
 23 | 
 24 | min_coverage = int(min_coverage)
 25 | min_proportion = float(min_proportion)
 26 | min_qual = int(min_qual)
 27 | 
 28 | ## function to process each line of pileup
 29 | 
 30 | def parse_pileup(line):
 31 | 
 32 |     # read fields from line of pileup
 33 |     chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t")
 34 | 
 35 |     # proportion is 0 if no consensus base, top base is N by default
 36 |     parsed = {
 37 |         "proportion": 0.0, "chromosome": chromosome,
 38 |         "position": int(position), "reference": reference,
 39 |         "coverage": int(coverage), "pileup": pileup,
 40 |         "quality": quality, "top_base": "N"}
 41 | 
 42 |     # if the base coverage is below the minimum, call N
 43 |     if parsed["coverage"] < min_coverage:
 44 |         return parsed
 45 | 
 46 |     # uppercase pileup string for processing
 47 |     pileup = pileup.upper()
 48 | 
 49 |     # Remove start and stop characters from pileup string
 50 |     pileup = re.sub(r"\^.|\$", "", pileup)
 51 | 
 52 |     # Remove indels from pileup string
 53 |     start = 0
 54 | 
 55 |     while True:
 56 |         match = INDEL_PATTERN.search(pileup, start)
 57 | 
 58 |         if match:
 59 |             integer = match.group(1)
 60 |             pileup = re.sub(INDEL_STRING.replace("INTEGER", integer), "", pileup)
 61 |             start = match.start()
 62 |         else:
 63 |             break
 64 | 
 65 |     # get total base count and top base count
 66 |     total = 0
 67 |     top_base = "N"
 68 |     top_base_count = 0
 69 | 
 70 |     # uppercase reference base for comparison
 71 |     reference = reference.upper()
 72 | 
 73 |     base_counts = {"A": 0, "C": 0, "G": 0, "T": 0}
 74 | 
 75 |     quality_length = len(quality)
 76 | 
 77 |     for i in range(quality_length):
 78 | 
 79 |         # convert ASCII character to phred base quality
 80 |         base_quality = ord(quality[i]) - ASCII_OFFSET
 81 | 
 82 |         # only count high-quality bases
 83 |         if base_quality >= min_qual:
 84 | 
 85 |             currBase = pileup[i]
 86 | 
 87 |             if currBase in base_counts:
 88 |                 base_counts[currBase] += 1
 89 | 
 90 |             else:
 91 |                 base_counts[reference] += 1
 92 | 
 93 |             total += 1
 94 | 
 95 |     parsed["total"] = total
 96 | 
 97 |     # find top base
 98 |     for base in base_counts:
 99 |         if base_counts[base] > top_base_count:
100 |             top_base = base
101 |             top_base_count = base_counts[base]
102 | 
103 |     # if more than 0 bases processed
104 |     if total > 0:
105 |         prop = top_base_count / float(total)
106 | 
107 |         if prop >= min_proportion:
108 |             parsed["proportion"] = prop
109 |             parsed["top_base"] = top_base
110 | 
111 |     return parsed
112 | 
113 | ## process pileup file
114 | 
115 | # read in input file
116 | if not DEBUG:
117 |     out_file = open(out_file, "w")
118 | 
119 | with open(pileup_file, "rt") as pileup_file:
120 |     for pileup_file_line in pileup_file:
121 | 
122 |         parsed = parse_pileup(pileup_file_line)
123 | 
124 |         proportion = parsed["proportion"]
125 | 
126 |         # print to output file
127 |         if DEBUG:
128 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
129 |                 parsed["pileup"], parsed["quality"]]))
130 |         else:
131 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
132 |                 parsed["pileup"], parsed["quality"]]), file=out_file)
133 | 
134 | if not DEBUG:
135 |     out_file.close()
136 | 


--------------------------------------------------------------------------------
/scripts/.snakemake.jrhv5jyf.callSNPs.py:
--------------------------------------------------------------------------------
  1 | 
  2 | ######## Snakemake header ########
  3 | import sys; sys.path.append("/home/tamburin/tools/miniconda3/lib/python3.6/site-packages"); import pickle; snakemake = pickle.loads(b'\x80\x03csnakemake.script\nSnakemake\nq\x00)\x81q\x01}q\x02(X\x05\x00\x00\x00inputq\x03csnakemake.io\nInputFiles\nq\x04)\x81q\x05X\x1d\x00\x00\x00pileup/6336_11-10-2015.pileupq\x06a}q\x07X\x06\x00\x00\x00_namesq\x08}q\tsbX\x06\x00\x00\x00outputq\ncsnakemake.io\nOutputFiles\nq\x0b)\x81q\x0cX\x1d\x00\x00\x00snp_calls/6336_11-10-2015.tsvq\ra}q\x0eh\x08}q\x0fsbX\x06\x00\x00\x00paramsq\x10csnakemake.io\nParams\nq\x11)\x81q\x12(K\x05G?\xe9\x99\x99\x99\x99\x99\x9aK\x14e}q\x13(h\x08}q\x14(X\x07\x00\x00\x00min_cvgq\x15K\x00N\x86q\x16X\x08\x00\x00\x00min_freqq\x17K\x01N\x86q\x18X\x08\x00\x00\x00min_qualq\x19K\x02N\x86q\x1auh\x15K\x05h\x17G?\xe9\x99\x99\x99\x99\x99\x9ah\x19K\x14ubX\t\x00\x00\x00wildcardsq\x1bcsnakemake.io\nWildcards\nq\x1c)\x81q\x1dX\x0f\x00\x00\x006336_11-10-2015q\x1ea}q\x1f(h\x08}q X\x06\x00\x00\x00sampleq!K\x00N\x86q"sX\x06\x00\x00\x00sampleq#h\x1eubX\x07\x00\x00\x00threadsq$K\x01X\t\x00\x00\x00resourcesq%csnakemake.io\nResources\nq&)\x81q\'(K\x01K\x01K K\x02e}q((h\x08}q)(X\x06\x00\x00\x00_coresq*K\x00N\x86q+X\x06\x00\x00\x00_nodesq,K\x01N\x86q-X\x03\x00\x00\x00memq.K\x02N\x86q/X\x04\x00\x00\x00timeq0K\x03N\x86q1uh*K\x01h,K\x01h.K h0K\x02ubX\x03\x00\x00\x00logq2csnakemake.io\nLog\nq3)\x81q4}q5h\x08}q6sbX\x06\x00\x00\x00configq7}q8(X\t\x00\x00\x00referenceq9X<\x00\x00\x00/home/tamburin/fiona/crassphage/strainsifter/ref/B_dorei.fnaq:X\x07\x00\x00\x00samplesq;X0\x00\x00\x00/home/tamburin/fiona/crassphage/all_samples.listq<X\t\x00\x00\x00reads_dirq=X)\x00\x00\x00/home/tamburin/fiona/crassphage/readlinksq>X\n\x00\x00\x00paired_endq?X\x01\x00\x00\x00Yq@X\x07\x00\x00\x00min_cvgqAK\x05uX\x04\x00\x00\x00ruleqBX\t\x00\x00\x00call_snpsqCub.'); from snakemake.logging import logger; logger.printshellcmds = False
  4 | ######## Original script #########
  5 | #!/usr/bin/env python3
  6 | # callSNPs.py
  7 | # call SNPs from a samtools pileup file
  8 | 
  9 | import sys
 10 | import gzip
 11 | import re
 12 | 
 13 | DEBUG = 0
 14 | 
 15 | ASCII_OFFSET = ord("!")
 16 | INDEL_PATTERN = re.compile(r"[+-](\d+)")
 17 | INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}"
 18 | ROUND = 3  # places to right of decimal point
 19 | 
 20 | min_coverage, min_proportion, min_qual = snakemake.params
 21 | pileup_file = snakemake.input[0]
 22 | out_file = snakemake.output[0]
 23 | 
 24 | min_coverage = int(min_coverage)
 25 | min_proportion = float(min_proportion)
 26 | min_qual = int(min_qual)
 27 | 
 28 | ## function to process each line of pileup
 29 | 
 30 | def parse_pileup(line):
 31 | 
 32 |     # read fields from line of pileup
 33 |     chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t")
 34 | 
 35 |     # proportion is 0 if no consensus base, top base is N by default
 36 |     parsed = {
 37 |         "proportion": 0.0, "chromosome": chromosome,
 38 |         "position": int(position), "reference": reference,
 39 |         "coverage": int(coverage), "pileup": pileup,
 40 |         "quality": quality, "top_base": "N"}
 41 | 
 42 |     # if the base coverage is below the minimum, leave the call as N
 43 |     if parsed["coverage"] < min_coverage:
 44 |         return parsed
 45 | 
 46 |     # uppercase pileup string for processing
 47 |     pileup = pileup.upper()
 48 | 
 49 |     # Remove start and stop characters from pileup string
 50 |     pileup = re.sub(r"\^.|\$", "", pileup)
 51 | 
 52 |     # Remove indels from pileup string
 53 |     start = 0
 54 | 
 55 |     while True:
 56 |         match = INDEL_PATTERN.search(pileup, start)
 57 | 
 58 |         if match:
 59 |             integer = match.group(1)
 60 |             pileup = re.sub(INDEL_STRING.replace("INTEGER", integer), "", pileup)
 61 |             start = match.start()
 62 |         else:
 63 |             break
 64 | 
 65 |     # get total base count and top base count
 66 |     total = 0
 67 |     top_base = "N"
 68 |     top_base_count = 0
 69 | 
 70 |     # uppercase reference base for comparison
 71 |     reference = reference.upper()
 72 | 
 73 |     base_counts = {"A": 0, "C": 0, "G": 0, "T": 0}
 74 | 
 75 |     quality_length = len(quality)
 76 | 
 77 |     for i in range(quality_length):
 78 | 
 79 |         # convert ASCII character to phred base quality
 80 |         base_quality = ord(quality[i]) - ASCII_OFFSET
 81 | 
 82 |         # only count high-quality bases
 83 |         if base_quality >= min_qual:
 84 | 
 85 |             currBase = pileup[i]
 86 | 
 87 |             if currBase in base_counts:
 88 |                 base_counts[currBase] += 1
 89 | 
 90 |             else:
 91 |                 base_counts[reference] += 1
 92 | 
 93 |             total += 1
 94 | 
 95 |     parsed["total"] = total
 96 | 
 97 |     # find top base
 98 |     for base in base_counts:
 99 |         if base_counts[base] > top_base_count:
100 |             top_base = base
101 |             top_base_count = base_counts[base]
102 | 
103 |     # if more than 0 bases processed
104 |     if total > 0:
105 |         prop = top_base_count / float(total)
106 | 
107 |         if prop >= min_proportion:
108 |             parsed["proportion"] = prop
109 |             parsed["top_base"] = top_base
110 | 
111 |     return parsed
112 | 
113 | ## process pileup file
114 | 
115 | # read in input file
116 | if not DEBUG:
117 |     out_file = open(out_file, "w")
118 | 
119 | with open(pileup_file, "rt") as pileup_file:
120 |     for pileup_file_line in pileup_file:
121 | 
122 |         parsed = parse_pileup(pileup_file_line)
123 | 
124 |         proportion = parsed["proportion"]
125 | 
126 |         # print to output file
127 |         if DEBUG:
128 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
129 |                 parsed["pileup"], parsed["quality"]]))
130 |         else:
131 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
132 |                 parsed["pileup"], parsed["quality"]]), file=out_file)
133 | 
134 | if not DEBUG:
135 |     out_file.close()
136 | 


--------------------------------------------------------------------------------
/scripts/.snakemake.k53chwr8.callSNPs.py:
--------------------------------------------------------------------------------
  1 | 
  2 | ######## Snakemake header ########
  3 | import sys; sys.path.append("/home/tamburin/tools/miniconda3/lib/python3.6/site-packages"); import pickle; snakemake = pickle.loads(b'\x80\x03csnakemake.script\nSnakemake\nq\x00)\x81q\x01}q\x02(X\x05\x00\x00\x00inputq\x03csnakemake.io\nInputFiles\nq\x04)\x81q\x05X\x1d\x00\x00\x00pileup/6387_11-13-2015.pileupq\x06a}q\x07X\x06\x00\x00\x00_namesq\x08}q\tsbX\x06\x00\x00\x00outputq\ncsnakemake.io\nOutputFiles\nq\x0b)\x81q\x0cX\x1d\x00\x00\x00snp_calls/6387_11-13-2015.tsvq\ra}q\x0eh\x08}q\x0fsbX\x06\x00\x00\x00paramsq\x10csnakemake.io\nParams\nq\x11)\x81q\x12(K\x05G?\xe9\x99\x99\x99\x99\x99\x9aK\x14e}q\x13(h\x08}q\x14(X\x07\x00\x00\x00min_cvgq\x15K\x00N\x86q\x16X\x08\x00\x00\x00min_freqq\x17K\x01N\x86q\x18X\x08\x00\x00\x00min_qualq\x19K\x02N\x86q\x1auh\x15K\x05h\x17G?\xe9\x99\x99\x99\x99\x99\x9ah\x19K\x14ubX\t\x00\x00\x00wildcardsq\x1bcsnakemake.io\nWildcards\nq\x1c)\x81q\x1dX\x0f\x00\x00\x006387_11-13-2015q\x1ea}q\x1f(h\x08}q X\x06\x00\x00\x00sampleq!K\x00N\x86q"sX\x06\x00\x00\x00sampleq#h\x1eubX\x07\x00\x00\x00threadsq$K\x01X\t\x00\x00\x00resourcesq%csnakemake.io\nResources\nq&)\x81q\'(K\x01K\x01K K\x02e}q((h\x08}q)(X\x06\x00\x00\x00_coresq*K\x00N\x86q+X\x06\x00\x00\x00_nodesq,K\x01N\x86q-X\x03\x00\x00\x00memq.K\x02N\x86q/X\x04\x00\x00\x00timeq0K\x03N\x86q1uh*K\x01h,K\x01h.K h0K\x02ubX\x03\x00\x00\x00logq2csnakemake.io\nLog\nq3)\x81q4}q5h\x08}q6sbX\x06\x00\x00\x00configq7}q8(X\t\x00\x00\x00referenceq9XO\x00\x00\x00/home/tamburin/fiona/bacteremia/1.assemble_trimmed/filtered_assemblies/L3.fastaq:X\x07\x00\x00\x00samplesq;X0\x00\x00\x00/home/tamburin/fiona/crassphage/all_samples.listq<X\t\x00\x00\x00reads_dirq=X)\x00\x00\x00/home/tamburin/fiona/crassphage/readlinksq>X\n\x00\x00\x00paired_endq?X\x01\x00\x00\x00Yq@X\x07\x00\x00\x00min_cvgqAK\nuX\x04\x00\x00\x00ruleqBX\t\x00\x00\x00call_snpsqCub.'); from snakemake.logging import logger; logger.printshellcmds = False
  4 | ######## Original script #########
  5 | #!/usr/bin/env python3
  6 | # callSNPs.py
  7 | # call SNPs from a samtools pileup file
  8 | 
  9 | import sys
 10 | import gzip
 11 | import re
 12 | 
 13 | DEBUG = 0
 14 | 
 15 | ASCII_OFFSET = ord("!")
 16 | INDEL_PATTERN = re.compile(r"[+-](\d+)")
 17 | INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}"
 18 | ROUND = 3  # places to right of decimal point
 19 | 
 20 | min_coverage, min_proportion, min_qual = snakemake.params
 21 | pileup_file = snakemake.input[0]
 22 | out_file = snakemake.output[0]
 23 | 
 24 | min_coverage = int(min_coverage)
 25 | min_proportion = float(min_proportion)
 26 | min_qual = int(min_qual)
 27 | 
 28 | ## function to process each line of pileup
 29 | 
 30 | def parse_pileup(line):
 31 | 
 32 |     # read fields from line of pileup
 33 |     chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t")
 34 | 
 35 |     # proportion is 0 if no consensus base, top base is N by default
 36 |     parsed = {
 37 |         "proportion": 0.0, "chromosome": chromosome,
 38 |         "position": int(position), "reference": reference,
 39 |         "coverage": int(coverage), "pileup": pileup,
 40 |         "quality": quality, "top_base": "N"}
 41 | 
 42 |     # if the base coverage is below the minimum, leave the call as N
 43 |     if parsed["coverage"] < min_coverage:
 44 |         return parsed
 45 | 
 46 |     # uppercase pileup string for processing
 47 |     pileup = pileup.upper()
 48 | 
 49 |     # Remove start and stop characters from pileup string
 50 |     pileup = re.sub(r"\^.|\$", "", pileup)
 51 | 
 52 |     # Remove indels from pileup string
 53 |     start = 0
 54 | 
 55 |     while True:
 56 |         match = INDEL_PATTERN.search(pileup, start)
 57 | 
 58 |         if match:
 59 |             integer = match.group(1)
 60 |             pileup = re.sub(INDEL_STRING.replace("INTEGER", integer), "", pileup)
 61 |             start = match.start()
 62 |         else:
 63 |             break
 64 | 
 65 |     # get total base count and top base count
 66 |     total = 0
 67 |     top_base = "N"
 68 |     top_base_count = 0
 69 | 
 70 |     # uppercase reference base for comparison
 71 |     reference = reference.upper()
 72 | 
 73 |     base_counts = {"A": 0, "C": 0, "G": 0, "T": 0}
 74 | 
 75 |     quality_length = len(quality)
 76 | 
 77 |     for i in range(quality_length):
 78 | 
 79 |         # convert ASCII character to phred base quality
 80 |         base_quality = ord(quality[i]) - ASCII_OFFSET
 81 | 
 82 |         # only count high-quality bases
 83 |         if base_quality >= min_qual:
 84 | 
 85 |             currBase = pileup[i]
 86 | 
 87 |             if currBase in base_counts:
 88 |                 base_counts[currBase] += 1
 89 | 
 90 |             else:
 91 |                 base_counts[reference] += 1
 92 | 
 93 |             total += 1
 94 | 
 95 |     parsed["total"] = total
 96 | 
 97 |     # find top base
 98 |     for base in base_counts:
 99 |         if base_counts[base] > top_base_count:
100 |             top_base = base
101 |             top_base_count = base_counts[base]
102 | 
103 |     # if more than 0 bases processed
104 |     if total > 0:
105 |         prop = top_base_count / float(total)
106 | 
107 |         if prop >= min_proportion:
108 |             parsed["proportion"] = prop
109 |             parsed["top_base"] = top_base
110 | 
111 |     return parsed
112 | 
113 | ## process pileup file
114 | 
115 | # read in input file
116 | if not DEBUG:
117 |     out_file = open(out_file, "w")
118 | 
119 | with open(pileup_file, "rt") as pileup_file:
120 |     for pileup_file_line in pileup_file:
121 | 
122 |         parsed = parse_pileup(pileup_file_line)
123 | 
124 |         proportion = parsed["proportion"]
125 | 
126 |         # print to output file
127 |         if DEBUG:
128 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
129 |                 parsed["pileup"], parsed["quality"]]))
130 |         else:
131 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
132 |                 parsed["pileup"], parsed["quality"]]), file=out_file)
133 | 
134 | if not DEBUG:
135 |     out_file.close()
136 | 


--------------------------------------------------------------------------------
/scripts/.snakemake.mgp_kol5.callSNPs.py:
--------------------------------------------------------------------------------
  1 | 
  2 | ######## Snakemake header ########
  3 | import sys; sys.path.append("/home/tamburin/tools/miniconda3/lib/python3.6/site-packages"); import pickle; snakemake = pickle.loads(b'\x80\x03csnakemake.script\nSnakemake\nq\x00)\x81q\x01}q\x02(X\x05\x00\x00\x00inputq\x03csnakemake.io\nInputFiles\nq\x04)\x81q\x05X\x1c\x00\x00\x00pileup/5160_8-16-2016.pileupq\x06a}q\x07X\x06\x00\x00\x00_namesq\x08}q\tsbX\x06\x00\x00\x00outputq\ncsnakemake.io\nOutputFiles\nq\x0b)\x81q\x0cX\x1c\x00\x00\x00snp_calls/5160_8-16-2016.tsvq\ra}q\x0eh\x08}q\x0fsbX\x06\x00\x00\x00paramsq\x10csnakemake.io\nParams\nq\x11)\x81q\x12(K\x05G?\xe9\x99\x99\x99\x99\x99\x9aK\x14e}q\x13(h\x08}q\x14(X\x07\x00\x00\x00min_cvgq\x15K\x00N\x86q\x16X\x08\x00\x00\x00min_freqq\x17K\x01N\x86q\x18X\x08\x00\x00\x00min_qualq\x19K\x02N\x86q\x1auh\x15K\x05h\x17G?\xe9\x99\x99\x99\x99\x99\x9ah\x19K\x14ubX\t\x00\x00\x00wildcardsq\x1bcsnakemake.io\nWildcards\nq\x1c)\x81q\x1dX\x0e\x00\x00\x005160_8-16-2016q\x1ea}q\x1f(h\x08}q X\x06\x00\x00\x00sampleq!K\x00N\x86q"sX\x06\x00\x00\x00sampleq#h\x1eubX\x07\x00\x00\x00threadsq$K\x01X\t\x00\x00\x00resourcesq%csnakemake.io\nResources\nq&)\x81q\'(K\x01K\x01K K\x02e}q((h\x08}q)(X\x06\x00\x00\x00_coresq*K\x00N\x86q+X\x06\x00\x00\x00_nodesq,K\x01N\x86q-X\x03\x00\x00\x00memq.K\x02N\x86q/X\x04\x00\x00\x00timeq0K\x03N\x86q1uh*K\x01h,K\x01h.K h0K\x02ubX\x03\x00\x00\x00logq2csnakemake.io\nLog\nq3)\x81q4}q5h\x08}q6sbX\x06\x00\x00\x00configq7}q8(X\t\x00\x00\x00referenceq9X<\x00\x00\x00/home/tamburin/fiona/crassphage/strainsifter/ref/B_dorei.fnaq:X\x07\x00\x00\x00samplesq;X0\x00\x00\x00/home/tamburin/fiona/crassphage/all_samples.listq<X\t\x00\x00\x00reads_dirq=X)\x00\x00\x00/home/tamburin/fiona/crassphage/readlinksq>X\n\x00\x00\x00paired_endq?X\x01\x00\x00\x00Yq@X\x07\x00\x00\x00min_cvgqAK\x05uX\x04\x00\x00\x00ruleqBX\t\x00\x00\x00call_snpsqCub.'); from snakemake.logging import logger; logger.printshellcmds = False
  4 | ######## Original script #########
  5 | #!/usr/bin/env python3
  6 | # callSNPs.py
  7 | # call SNPs from a samtools pileup file
  8 | 
  9 | import sys
 10 | import gzip
 11 | import re
 12 | 
 13 | DEBUG = 0
 14 | 
 15 | ASCII_OFFSET = ord("!")
 16 | INDEL_PATTERN = re.compile(r"[+-](\d+)")
 17 | INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}"
 18 | ROUND = 3  # places to right of decimal point
 19 | 
 20 | min_coverage, min_proportion, min_qual = snakemake.params
 21 | pileup_file = snakemake.input[0]
 22 | out_file = snakemake.output[0]
 23 | 
 24 | min_coverage = int(min_coverage)
 25 | min_proportion = float(min_proportion)
 26 | min_qual = int(min_qual)
 27 | 
 28 | ## function to process each line of pileup
 29 | 
 30 | def parse_pileup(line):
 31 | 
 32 |     # read fields from line of pileup
 33 |     chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t")
 34 | 
 35 |     # proportion is 0 if no consensus base, top base is N by default
 36 |     parsed = {
 37 |         "proportion": 0.0, "chromosome": chromosome,
 38 |         "position": int(position), "reference": reference,
 39 |         "coverage": int(coverage), "pileup": pileup,
 40 |         "quality": quality, "top_base": "N"}
 41 | 
 42 |     # if the base coverage is below the minimum, leave the call as N
 43 |     if parsed["coverage"] < min_coverage:
 44 |         return parsed
 45 | 
 46 |     # uppercase pileup string for processing
 47 |     pileup = pileup.upper()
 48 | 
 49 |     # Remove start and stop characters from pileup string
 50 |     pileup = re.sub(r"\^.|\$", "", pileup)
 51 | 
 52 |     # Remove indels from pileup string
 53 |     start = 0
 54 | 
 55 |     while True:
 56 |         match = INDEL_PATTERN.search(pileup, start)
 57 | 
 58 |         if match:
 59 |             integer = match.group(1)
 60 |             pileup = re.sub(INDEL_STRING.replace("INTEGER", integer), "", pileup)
 61 |             start = match.start()
 62 |         else:
 63 |             break
 64 | 
 65 |     # get total base count and top base count
 66 |     total = 0
 67 |     top_base = "N"
 68 |     top_base_count = 0
 69 | 
 70 |     # uppercase reference base for comparison
 71 |     reference = reference.upper()
 72 | 
 73 |     base_counts = {"A": 0, "C": 0, "G": 0, "T": 0}
 74 | 
 75 |     quality_length = len(quality)
 76 | 
 77 |     for i in range(quality_length):
 78 | 
 79 |         # convert ASCII character to phred base quality
 80 |         base_quality = ord(quality[i]) - ASCII_OFFSET
 81 | 
 82 |         # only count high-quality bases
 83 |         if base_quality >= min_qual:
 84 | 
 85 |             currBase = pileup[i]
 86 | 
 87 |             if currBase in base_counts:
 88 |                 base_counts[currBase] += 1
 89 | 
 90 |             else:
 91 |                 base_counts[reference] += 1
 92 | 
 93 |             total += 1
 94 | 
 95 |     parsed["total"] = total
 96 | 
 97 |     # find top base
 98 |     for base in base_counts:
 99 |         if base_counts[base] > top_base_count:
100 |             top_base = base
101 |             top_base_count = base_counts[base]
102 | 
103 |     # if more than 0 bases processed
104 |     if total > 0:
105 |         prop = top_base_count / float(total)
106 | 
107 |         if prop >= min_proportion:
108 |             parsed["proportion"] = prop
109 |             parsed["top_base"] = top_base
110 | 
111 |     return parsed
112 | 
113 | ## process pileup file
114 | 
115 | # read in input file
116 | if not DEBUG:
117 |     out_file = open(out_file, "w")
118 | 
119 | with open(pileup_file, "rt") as pileup_file:
120 |     for pileup_file_line in pileup_file:
121 | 
122 |         parsed = parse_pileup(pileup_file_line)
123 | 
124 |         proportion = parsed["proportion"]
125 | 
126 |         # print to output file
127 |         if DEBUG:
128 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
129 |                 parsed["pileup"], parsed["quality"]]))
130 |         else:
131 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
132 |                 parsed["pileup"], parsed["quality"]]), file=out_file)
133 | 
134 | if not DEBUG:
135 |     out_file.close()
136 | 


--------------------------------------------------------------------------------
/scripts/.snakemake.sgrw01xs.callSNPs.py:
--------------------------------------------------------------------------------
  1 | 
  2 | ######## Snakemake header ########
  3 | import sys; sys.path.append("/home/tamburin/tools/miniconda3/lib/python3.6/site-packages"); import pickle; snakemake = pickle.loads(b'\x80\x03csnakemake.script\nSnakemake\nq\x00)\x81q\x01}q\x02(X\x05\x00\x00\x00inputq\x03csnakemake.io\nInputFiles\nq\x04)\x81q\x05X\x1c\x00\x00\x00pileup/2769_5-17-2016.pileupq\x06a}q\x07X\x06\x00\x00\x00_namesq\x08}q\tsbX\x06\x00\x00\x00outputq\ncsnakemake.io\nOutputFiles\nq\x0b)\x81q\x0cX\x1c\x00\x00\x00snp_calls/2769_5-17-2016.tsvq\ra}q\x0eh\x08}q\x0fsbX\x06\x00\x00\x00paramsq\x10csnakemake.io\nParams\nq\x11)\x81q\x12(K\x05G?\xe9\x99\x99\x99\x99\x99\x9aK\x14e}q\x13(h\x08}q\x14(X\x07\x00\x00\x00min_cvgq\x15K\x00N\x86q\x16X\x08\x00\x00\x00min_freqq\x17K\x01N\x86q\x18X\x08\x00\x00\x00min_qualq\x19K\x02N\x86q\x1auh\x15K\x05h\x17G?\xe9\x99\x99\x99\x99\x99\x9ah\x19K\x14ubX\t\x00\x00\x00wildcardsq\x1bcsnakemake.io\nWildcards\nq\x1c)\x81q\x1dX\x0e\x00\x00\x002769_5-17-2016q\x1ea}q\x1f(h\x08}q X\x06\x00\x00\x00sampleq!K\x00N\x86q"sX\x06\x00\x00\x00sampleq#h\x1eubX\x07\x00\x00\x00threadsq$K\x01X\t\x00\x00\x00resourcesq%csnakemake.io\nResources\nq&)\x81q\'(K\x01K\x01K K\x02e}q((h\x08}q)(X\x06\x00\x00\x00_coresq*K\x00N\x86q+X\x06\x00\x00\x00_nodesq,K\x01N\x86q-X\x03\x00\x00\x00memq.K\x02N\x86q/X\x04\x00\x00\x00timeq0K\x03N\x86q1uh*K\x01h,K\x01h.K h0K\x02ubX\x03\x00\x00\x00logq2csnakemake.io\nLog\nq3)\x81q4}q5h\x08}q6sbX\x06\x00\x00\x00configq7}q8(X\t\x00\x00\x00referenceq9X?\x00\x00\x00/home/tamburin/fiona/crassphage/strainsifter/ref/B_vulgatus.fnaq:X\x07\x00\x00\x00samplesq;X0\x00\x00\x00/home/tamburin/fiona/crassphage/all_samples.listq<X\t\x00\x00\x00reads_dirq=X)\x00\x00\x00/home/tamburin/fiona/crassphage/readlinksq>X\n\x00\x00\x00paired_endq?X\x01\x00\x00\x00Yq@X\x07\x00\x00\x00min_cvgqAK\nuX\x04\x00\x00\x00ruleqBX\t\x00\x00\x00call_snpsqCub.'); from snakemake.logging import logger; logger.printshellcmds = False
  4 | ######## Original script #########
  5 | #!/usr/bin/env python3
  6 | # callSNPs.py
  7 | # call SNPs from a samtools pileup file
  8 | 
  9 | import sys
 10 | import gzip
 11 | import re
 12 | 
 13 | DEBUG = 0
 14 | 
 15 | ASCII_OFFSET = ord("!")
 16 | INDEL_PATTERN = re.compile(r"[+-](\d+)")
 17 | INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}"
 18 | ROUND = 3  # places to right of decimal point
 19 | 
 20 | min_coverage, min_proportion, min_qual = snakemake.params
 21 | pileup_file = snakemake.input[0]
 22 | out_file = snakemake.output[0]
 23 | 
 24 | min_coverage = int(min_coverage)
 25 | min_proportion = float(min_proportion)
 26 | min_qual = int(min_qual)
 27 | 
 28 | ## function to process each line of pileup
 29 | 
 30 | def parse_pileup(line):
 31 | 
 32 |     # read fields from line of pileup
 33 |     chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t")
 34 | 
 35 |     # proportion is 0 if no consensus base, top base is N by default
 36 |     parsed = {
 37 |         "proportion": 0.0, "chromosome": chromosome,
 38 |         "position": int(position), "reference": reference,
 39 |         "coverage": int(coverage), "pileup": pileup,
 40 |         "quality": quality, "top_base": "N"}
 41 | 
 42 |     # if the base coverage is below the minimum, leave the call as N
 43 |     if parsed["coverage"] < min_coverage:
 44 |         return parsed
 45 | 
 46 |     # uppercase pileup string for processing
 47 |     pileup = pileup.upper()
 48 | 
 49 |     # Remove start and stop characters from pileup string
 50 |     pileup = re.sub(r"\^.|\$", "", pileup)
 51 | 
 52 |     # Remove indels from pileup string
 53 |     start = 0
 54 | 
 55 |     while True:
 56 |         match = INDEL_PATTERN.search(pileup, start)
 57 | 
 58 |         if match:
 59 |             integer = match.group(1)
 60 |             pileup = re.sub(INDEL_STRING.replace("INTEGER", integer), "", pileup)
 61 |             start = match.start()
 62 |         else:
 63 |             break
 64 | 
 65 |     # get total base count and top base count
 66 |     total = 0
 67 |     top_base = "N"
 68 |     top_base_count = 0
 69 | 
 70 |     # uppercase reference base for comparison
 71 |     reference = reference.upper()
 72 | 
 73 |     base_counts = {"A": 0, "C": 0, "G": 0, "T": 0}
 74 | 
 75 |     quality_length = len(quality)
 76 | 
 77 |     for i in range(quality_length):
 78 | 
 79 |         # convert ASCII character to phred base quality
 80 |         base_quality = ord(quality[i]) - ASCII_OFFSET
 81 | 
 82 |         # only count high-quality bases
 83 |         if base_quality >= min_qual:
 84 | 
 85 |             currBase = pileup[i]
 86 | 
 87 |             if currBase in base_counts:
 88 |                 base_counts[currBase] += 1
 89 | 
 90 |             else:
 91 |                 base_counts[reference] += 1
 92 | 
 93 |             total += 1
 94 | 
 95 |     parsed["total"] = total
 96 | 
 97 |     # find top base
 98 |     for base in base_counts:
 99 |         if base_counts[base] > top_base_count:
100 |             top_base = base
101 |             top_base_count = base_counts[base]
102 | 
103 |     # if more than 0 bases processed
104 |     if total > 0:
105 |         prop = top_base_count / float(total)
106 | 
107 |         if prop >= min_proportion:
108 |             parsed["proportion"] = prop
109 |             parsed["top_base"] = top_base
110 | 
111 |     return parsed
112 | 
113 | ## process pileup file
114 | 
115 | # read in input file
116 | if not DEBUG:
117 |     out_file = open(out_file, "w")
118 | 
119 | with open(pileup_file, "rt") as pileup_file:
120 |     for pileup_file_line in pileup_file:
121 | 
122 |         parsed = parse_pileup(pileup_file_line)
123 | 
124 |         proportion = parsed["proportion"]
125 | 
126 |         # print to output file
127 |         if DEBUG:
128 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
129 |                 parsed["pileup"], parsed["quality"]]))
130 |         else:
131 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
132 |                 parsed["pileup"], parsed["quality"]]), file=out_file)
133 | 
134 | if not DEBUG:
135 |     out_file.close()
136 | 


--------------------------------------------------------------------------------
/scripts/callSNPs.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | # callSNPs.py
  3 | # call SNPs from a samtools pileup file
  4 | 
  5 | import sys
  6 | import gzip
  7 | import re
  8 | 
  9 | DEBUG = 0
 10 | 
 11 | ASCII_OFFSET = ord("!")
 12 | INDEL_PATTERN = re.compile(r"[+-](\d+)")
 13 | INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}"
 14 | ROUND = 3  # places to right of decimal point
 15 | 
 16 | min_coverage, min_proportion, min_qual = snakemake.params
 17 | pileup_file = snakemake.input[0]
 18 | out_file = snakemake.output[0]
 19 | 
 20 | min_coverage = int(min_coverage)
 21 | min_proportion = float(min_proportion)
 22 | min_qual = int(min_qual)
 23 | 
 24 | ## function to process each line of pileup
 25 | 
 26 | def parse_pileup(line):
 27 | 
 28 |     # read fields from line of pileup
 29 |     chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t")
 30 | 
 31 |     # proportion is 0 if no consensus base, top base is N by default
 32 |     parsed = {
 33 |         "proportion": 0.0,
 34 |         "chromosome": chromosome,
 35 |         "position": int(position),
 36 |         "reference": reference,
 37 |         "coverage": int(coverage),
 38 |         "pileup": pileup,
 39 |         "quality": quality,
 40 |         "top_base": "N"}
 41 | 
 42 |     # if the base coverage is below the minimum, leave the call as N
 43 |     if parsed["coverage"] < min_coverage:
 44 |         return parsed
 45 | 
 46 |     # uppercase pileup string for processing
 47 |     pileup = pileup.upper()
 48 | 
 49 |     # Remove start and stop characters from pileup string
 50 |     pileup = re.sub(r"\^.|\$", "", pileup)
 51 | 
 52 |     # Remove indels from pileup string
 53 |     start = 0
 54 | 
 55 |     while True:
 56 |         match = INDEL_PATTERN.search(pileup, start)
 57 | 
 58 |         if match:
 59 |             integer = match.group(1)
 60 |             pileup = pileup[:match.start()] + pileup[match.end() + int(integer):]  # drop this indel only; always makes progress
 61 |             start = match.start()
 62 |         else:
 63 |             break
 64 | 
 65 |     # get total base count and top base count
 66 |     total = 0
 67 |     top_base = "N"
 68 |     top_base_count = 0
 69 | 
 70 |     # uppercase reference base for comparison
 71 |     reference = reference.upper()
 72 | 
 73 |     base_counts = {"A": 0, "C": 0, "G": 0, "T": 0}
 74 | 
 75 |     quality_length = len(quality)
 76 | 
 77 |     for i in range(quality_length):
 78 | 
 79 |         # convert ASCII character to phred base quality
 80 |         base_quality = ord(quality[i]) - ASCII_OFFSET
 81 | 
 82 |         # only count high-quality bases
 83 |         if base_quality >= min_qual:
 84 | 
 85 |             currBase = pileup[i]
 86 | 
 87 |             if currBase in base_counts:
 88 |                 base_counts[currBase] += 1
 89 | 
 90 |             elif reference in base_counts:  # avoid KeyError when the reference base is N
 91 |                 base_counts[reference] += 1
 92 | 
 93 |             total += 1
 94 | 
 95 |     parsed["total"] = total
 96 | 
 97 |     # find top base
 98 |     for base in base_counts:
 99 |         if base_counts[base] > top_base_count:
100 |             top_base = base
101 |             top_base_count = base_counts[base]
102 | 
103 |     # if more than 0 bases processed
104 |     if total > 0:
105 |         prop = top_base_count / float(total)
106 | 
107 |         if prop >= min_proportion:
108 |             parsed["proportion"] = prop
109 |             parsed["top_base"] = top_base
110 | 
111 |     return parsed
112 | 
113 | ## process pileup file
114 | 
115 | # read in input file
116 | if not DEBUG:
117 |     out_file = open(out_file, "w")
118 | 
119 | with open(pileup_file, "rt") as pileup_file:
120 |     for pileup_file_line in pileup_file:
121 | 
122 |         parsed = parse_pileup(pileup_file_line)
123 | 
124 |         proportion = parsed["proportion"]
125 | 
126 |         # print to output file
127 |         if DEBUG:
128 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
129 |                 parsed["pileup"], parsed["quality"]]))
130 |         else:
131 |             print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
132 |                 parsed["pileup"], parsed["quality"]]), file=out_file)
133 | 
134 | if not DEBUG:
135 |     out_file.close()
136 | 
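
The base-calling logic in callSNPs.py can be sketched as a standalone function, which makes it easier to check on small inputs. This is a minimal sketch, not the script itself: the names `strip_indels` and `call_base` and the default thresholds are assumptions chosen for illustration. Unlike the script, it counts only `.` and `,` as reference matches and skips other symbols such as `*`.

```python
import re

# matches the "+<n>" / "-<n>" prefix that introduces an indel in a pileup string
INDEL_PREFIX = re.compile(r"[+-](\d+)")


def strip_indels(pileup):
    """Drop every indel annotation (+<n><seq> or -<n><seq>) from a pileup string."""
    kept = []
    i = 0
    while i < len(pileup):
        match = INDEL_PREFIX.match(pileup, i)
        if match:
            # skip the sign, the digits, and the <n> inserted/deleted bases
            i = match.end() + int(match.group(1))
        else:
            kept.append(pileup[i])
            i += 1
    return "".join(kept)


def call_base(reference, pileup, quality, min_qual=20, min_proportion=0.8, offset=33):
    """Return (top_base, proportion); top_base is "N" when no base is confident."""
    reference = reference.upper()
    # remove read-start markers (^ plus mapping quality) and read-end markers ($)
    bases = re.sub(r"\^.|\$", "", pileup.upper())
    bases = strip_indels(bases)

    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    total = 0
    for base, qual_char in zip(bases, quality):
        # convert the ASCII character to a phred quality and skip low-quality bases
        if ord(qual_char) - offset < min_qual:
            continue
        if base in counts:
            counts[base] += 1
        elif base in ".," and reference in counts:
            counts[reference] += 1
        else:
            continue  # assumption: ignore deletions (*) and ambiguous symbols
        total += 1

    if total == 0:
        return "N", 0.0
    top_base = max(counts, key=counts.get)
    proportion = counts[top_base] / total
    if proportion >= min_proportion:
        return top_base, proportion
    return "N", proportion
```

For example, `call_base("A", "..,,T", "IIIII")` sees four reference matches and one `T`, so the top base `A` has proportion 0.8 and is reported under the assumed default threshold.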


--------------------------------------------------------------------------------
/scripts/coreSNPs2fasta.py:
--------------------------------------------------------------------------------
 1 | import sys
 2 | 
 3 | HEADER_POS = 0
 4 | FASTA_POS = 1
 5 | #coreSNPs_file = sys.argv[1]
 6 | 
 7 | snp_file = snakemake.input[0]
 8 | out_file = snakemake.output[0]
 9 | 
10 | # hold sequences as we go
11 | fastas = []
12 | 
13 | with open(snp_file, "r") as core_snps:
14 | 
15 |     # process header
16 |     header = core_snps.readline().rstrip("\n").split("\t")
17 | 
18 |     # start each sample as a [name, sequence] pair with an empty sequence
19 |     for h in range(len(header)):
20 |         fastas += [[header[h], ""]]
21 | 
22 |     for line in core_snps:
23 |         line = line.rstrip("\n").split("\t")
24 | 
25 |         for pos in range(len(line)):
26 |             # print(fastas[pos])
27 |             # print(fastas[pos][FASTA_POS])
28 |             fastas[pos][FASTA_POS] += line[pos]
29 | 
30 | with open(out_file, "w") as out:
31 |     for f in range(len(fastas)):
32 |         print(">" + fastas[f][HEADER_POS], file=out)
33 |         print(fastas[f][FASTA_POS], file=out)
34 | 
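
The conversion above is a column-wise transpose: each column of the core SNP table becomes one FASTA record. A minimal sketch of that idea follows, assuming a tab-separated table whose first line holds the sample names. The function name `core_snps_to_fasta` is hypothetical.

```python
import io


def core_snps_to_fasta(table):
    """Turn a tab-separated SNP table (one column per sample) into FASTA text."""
    header = table.readline().rstrip("\n").split("\t")
    # one growing list of bases per sample column
    columns = [[] for _ in header]

    for line in table:
        if not line.strip():
            continue  # skip blank lines
        for column, base in zip(columns, line.rstrip("\n").split("\t")):
            column.append(base)

    return "".join(
        ">{}\n{}\n".format(name, "".join(column))
        for name, column in zip(header, columns)
    )


example = io.StringIO("s1\ts2\nA\tG\nC\tC\n")
fasta_text = core_snps_to_fasta(example)
```

Collecting bases in lists and joining once per sample avoids repeated string concatenation when the table has many positions.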


--------------------------------------------------------------------------------
/scripts/findCoreSNPs.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | # find core snvs shared between samples
 3 | 
 4 | import sys
 5 | 
 6 | snp_file = snakemake.input[0]
 7 | out_file = snakemake.output[0]
 8 | 
 9 | # read each line and print if base is called for every sample and
10 | # bases are not the same in every sample
11 | with open(snp_file, "r") as snps:
12 |     with open(out_file, "w") as out:
13 | 
14 |         # print header
15 |         print(snps.readline().rstrip('\n'), file=out)
16 | 
17 |         # process each line and output positions with no unknown bases ("N")
18 |         # and where the samples do not all share the same base
19 |         for line in snps:
20 |             positions = line.rstrip('\n').split('\t')
21 |             if (len(set(positions)) > 1) and ("N" not in positions):
22 |                 print("\t".join(positions), file=out)
23 | 
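
The filter above keeps a position only when every sample has a called base and at least two samples disagree. A small sketch of that test in isolation, using the hypothetical helper names `is_core_snp` and `filter_core_snps`:

```python
def is_core_snp(bases):
    """True if no sample is "N" and the samples do not all share one base."""
    return "N" not in bases and len(set(bases)) > 1


def filter_core_snps(rows):
    """Yield only the rows (lists of per-sample bases) that are core SNPs."""
    for row in rows:
        if is_core_snp(row):
            yield row


rows = [
    ["A", "A", "A"],  # invariant: dropped
    ["A", "G", "A"],  # variable and fully called: kept
    ["A", "N", "G"],  # contains an uncalled base: dropped
]
kept = list(filter_core_snps(rows))
```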


--------------------------------------------------------------------------------
/scripts/getCoverage.py:
--------------------------------------------------------------------------------
 1 | # Fiona Tamburini
 2 | # Jan 26 2018
 3 | 
 4 | # calculate average coverage and % bases covered above min threshold from bedtools genomecov output
 5 | # usage: run through a Snakemake "script:" rule (input, output, and the minimum coverage come from the snakemake object)
 6 | 
 7 | import sys
 8 | 
 9 | # snakemake input and output files
10 | sample_file = snakemake.input[0]
11 | out_file = snakemake.output[0]
12 | 
13 | # minimum coverage threshold -- report percentage of bases covered at or
14 | # beyond this depth
15 | minCvg = int(snakemake.params[0])
16 | 
17 | totalBases = 0
18 | coveredBases = 0
19 | weightedAvg = 0
20 | with open(sample_file, 'r') as sample:
21 |     for line in sample:
22 |         if line.startswith("genome"):
23 |             chr, depth, numBases, size, fraction = line.rstrip('\n').split('\t')
24 | 
25 |             depth = int(depth)
26 |             numBases = int(numBases)
27 | 
28 |             totalBases += numBases
29 |             weightedAvg += depth * numBases
30 | 
31 |             if depth >= minCvg:
32 |                 coveredBases += numBases
33 | avgCvg = 0
34 | percCovered = 0
35 | 
36 | if totalBases > 0:
37 |     avgCvg = float(weightedAvg) / float(totalBases)
38 |     percCovered = float(coveredBases) / float(totalBases)
39 | 
40 | with open(out_file, 'w') as out:
41 |     print("\t".join([str(round(avgCvg, 2)), str(round(percCovered, 2))]), file = out)
42 | 
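
The coverage summary above is a weighted average over the `genome` rows of the bedtools genomecov histogram, whose columns are name, depth, number of bases at that depth, total size, and fraction. A minimal sketch of the same arithmetic as a pure function, with the hypothetical name `summarize_coverage`:

```python
def summarize_coverage(lines, min_cvg):
    """Return (average depth, fraction of bases at depth >= min_cvg)."""
    total_bases = 0
    covered_bases = 0
    weighted_depth = 0

    for line in lines:
        # only the genome-wide summary rows are used
        if not line.startswith("genome"):
            continue
        _, depth, num_bases, _, _ = line.rstrip("\n").split("\t")
        depth = int(depth)
        num_bases = int(num_bases)

        total_bases += num_bases
        weighted_depth += depth * num_bases
        if depth >= min_cvg:
            covered_bases += num_bases

    if total_bases == 0:
        return 0.0, 0.0
    return weighted_depth / total_bases, covered_bases / total_bases


histogram = [
    "chr1\t0\t10\t100\t0.1\n",
    "genome\t0\t20\t100\t0.2\n",
    "genome\t5\t30\t100\t0.3\n",
    "genome\t10\t50\t100\t0.5\n",
]
avg_cvg, frac_covered = summarize_coverage(histogram, 5)
```

The second value is a fraction between 0 and 1, not a percentage; multiply by 100 if a percentage is needed downstream.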


--------------------------------------------------------------------------------
/scripts/pairwiseDist.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | import sys
 4 | import itertools
 5 | import subprocess
 6 | import re
 7 | 
 8 | # input and output files from snakemake
 9 | in_files = snakemake.input
10 | out_file = snakemake.output[0]
11 | 
12 | cmd = " ".join(["cat", in_files[0], "| wc -l"])
13 | totalBases = subprocess.check_output(cmd, stdin=subprocess.PIPE, shell=True ).decode('ascii').rstrip('\n')
14 | totalBases = int(totalBases) - 1
15 | 
16 | # for every pairwise combination of files, check SNV distance
17 | with open(out_file, "w") as out:
18 | 
19 |     # print header
20 |     print('\t'.join(["Sample1", "Sample2", "SNVs", "BasesCompared", "TotalBases"]), file=out)
21 | 
22 |     # get all pairwise combinations of input files
23 |     for element in itertools.combinations(in_files, 2):
24 |         file1, file2 = element
25 | 
26 |         cmd1 = " ".join(["paste", file1, file2, "| sed '1d' | grep -v N | wc -l"])
27 |         totalPos = subprocess.check_output(cmd1, stdin=subprocess.PIPE, shell=True ).decode('ascii').rstrip('\n')
28 | 
29 |         cmd2 = " ".join(["paste", file1, file2, "| sed '1d' | grep -v N | awk '$1 != $2 {print $0}' | wc -l"])
30 |         diffPos = subprocess.check_output(cmd2, stdin=subprocess.PIPE, shell=True ).decode('ascii').rstrip('\n')
31 | 
32 |         # sample name is the part of the path between "consensus/" and the last "."
33 |         fname1 = re.findall(r'consensus/(.+)\.', file1)[0]
34 |         fname2 = re.findall(r'consensus/(.+)\.', file2)[0]
35 | 
36 |         print('\t'.join([fname1, fname2, str(diffPos), str(totalPos), str(totalBases)]), file=out)
37 | 
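
The shell pipelines above count, for each pair of samples, the positions where both bases are called and the positions where those bases differ. The same counts can be sketched in pure Python, assuming each consensus holds a single base call per position. The names `snv_distance` and `pairwise_distances` are hypothetical.

```python
from itertools import combinations


def snv_distance(bases1, bases2):
    """Return (differing positions, compared positions), ignoring any "N"."""
    compared = 0
    diffs = 0
    for a, b in zip(bases1, bases2):
        if "N" in (a, b):
            continue  # position is uncalled in at least one sample
        compared += 1
        if a != b:
            diffs += 1
    return diffs, compared


def pairwise_distances(samples):
    """Map each unordered pair of sample names to its (diffs, compared) counts."""
    return {
        (name1, name2): snv_distance(samples[name1], samples[name2])
        for name1, name2 in combinations(samples, 2)
    }


samples = {"s1": "ACGT", "s2": "ACGA", "s3": "NCGT"}
distances = pairwise_distances(samples)
```

Keeping the comparison in Python avoids spawning a shell for every pair, at the cost of holding each consensus in memory.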


--------------------------------------------------------------------------------
/scripts/renderTree.R:
--------------------------------------------------------------------------------
 1 | library(ggtree)
 2 | library(phangorn)
 3 | library(ggplot2)
 4 | 
 5 | # input files
 6 | tree_file <- snakemake@input[[1]]
 7 | out_file <- snakemake@output[[1]]
 8 | 
 9 | # read tree file
10 | tree <- read.tree(tree_file)
11 | 
12 | # midpoint root tree if there is more than 1 node
13 | if (tree$Nnode > 1){
14 |   tree <- midpoint(tree)
15 | }
16 | 
17 | # draw tree
18 | p <- ggtree(tree, branch.length = 1) +
19 |   geom_tiplab(offset = .05) +
20 |   geom_tippoint(size=3) +
21 |   xlim(0, 4) +
22 |   geom_treescale(width = 0.1, offset=0.25)
23 | 
24 | # save output file
25 | ggsave(out_file, plot = p, height = 1*length(tree$tip.label))
26 | 


--------------------------------------------------------------------------------
/tutorial/fastq/isolate1_1.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/isolate_reads_all/L2_PE1.fq


--------------------------------------------------------------------------------
/tutorial/fastq/isolate1_2.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/isolate_reads_all/L2_PE2.fq


--------------------------------------------------------------------------------
/tutorial/fastq/isolate2_1.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/isolate_reads_all/L5_PE1.fq


--------------------------------------------------------------------------------
/tutorial/fastq/isolate2_2.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/isolate_reads_all/L5_PE2.fq


--------------------------------------------------------------------------------
/tutorial/fastq/meta1_1.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S37_PE1.fq


--------------------------------------------------------------------------------
/tutorial/fastq/meta1_2.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S37_PE2.fq


--------------------------------------------------------------------------------
/tutorial/fastq/meta2_1.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S38_PE1.fq


--------------------------------------------------------------------------------
/tutorial/fastq/meta2_2.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S38_PE2.fq


--------------------------------------------------------------------------------
/tutorial/fastq/meta3_1.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S10_PE1.fq


--------------------------------------------------------------------------------
/tutorial/fastq/meta3_2.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S10_PE2.fq


--------------------------------------------------------------------------------
/tutorial/fastq/meta4_1.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S6_PE1.fq


--------------------------------------------------------------------------------
/tutorial/fastq/meta4_2.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S6_PE2.fq


--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.amb:
--------------------------------------------------------------------------------
1 | 4641652 1 0
2 | 


--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.ann:
--------------------------------------------------------------------------------
1 | 4641652 1 11
2 | 0 NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
3 | 0 4641652 0
4 | 


--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.bwt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/tutorial/reference/E_coli_K12.fna.bwt


--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.fai:
--------------------------------------------------------------------------------
1 | NC_000913.3	4641652	72	80	81
2 | 


--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.pac:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/tutorial/reference/E_coli_K12.fna.pac


--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.sa:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/tutorial/reference/E_coli_K12.fna.sa


--------------------------------------------------------------------------------