├── .gitattributes
├── .gitignore
├── README.md
├── Snakefile
├── config.yaml
├── envs
│   └── environment.yaml
├── scripts
│   ├── callSNPs.py
│   ├── coreSNPs2fasta.py
│   ├── findCoreSNPs.py
│   ├── getCoverage.py
│   ├── pairwiseDist.py
│   └── renderTree.R
└── tutorial
    ├── fastq
    │   ├── isolate1_1.fq
    │   ├── isolate1_2.fq
    │   ├── isolate2_1.fq
    │   ├── isolate2_2.fq
    │   ├── meta1_1.fq
    │   ├── meta1_2.fq
    │   ├── meta2_1.fq
    │   ├── meta2_2.fq
    │   ├── meta3_1.fq
    │   ├── meta3_2.fq
    │   ├── meta4_1.fq
    │   └── meta4_2.fq
    └── reference
        ├── E_coli_K12.fna
        ├── E_coli_K12.fna.amb
        ├── E_coli_K12.fna.ann
        ├── E_coli_K12.fna.bwt
        ├── E_coli_K12.fna.fai
        ├── E_coli_K12.fna.pac
        └── E_coli_K12.fna.sa
/.gitattributes:
--------------------------------------------------------------------------------
1 | *.fq binary
2 | *.fastq binary
3 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | ._Snakefile
2 | ._config.yaml
3 | .snakemake
4 | tutorial
5 | config_old.yaml
6 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # StrainSifter
2 |
3 | A straightforward bioinformatic pipeline for detecting the presence of a bacterial strain in one or more metagenome(s).
4 |
5 | StrainSifter is based on [Snakemake](https://snakemake.readthedocs.io/en/stable/). This pipeline allows you to output phylogenetic trees showing strain relatedness of input strains, as well as pairwise counts of single-nucleotide variants (SNVs) between input samples.
6 |
7 | ## Installation
8 |
9 | To run StrainSifter, you must have miniconda3 and Snakemake installed.
10 |
11 | #### Install instructions (One time only)
12 | 1. Download and install [miniconda3](https://conda.io/miniconda.html):
13 |
14 | For Linux:
15 |
16 | wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
17 | bash Miniconda3-latest-Linux-x86_64.sh
18 |
19 | 2. Clone the StrainSifter workflow to the directory where you wish to run the pipeline:
20 |
21 | git clone https://github.com/bhattlab/strainsifter
22 |
23 | 3. Create the new conda environment:
24 |
25 | cd strainsifter
26 | conda env create -f envs/environment.yaml
27 |
28 | 4. Install Snakemake:
29 |
30 | conda install snakemake -c bioconda -c conda-forge
31 |
32 |
33 | StrainSifter requires Snakemake version 5.1.4 or higher (it was developed and tested with 5.1.4). Check your version:
34 |
35 | snakemake --version
36 |
37 | If you are running a version of Snakemake prior to 5.1.4, update to the latest version:
38 |
39 | conda update snakemake -c bioconda -c conda-forge
40 |
41 | #### Activate the conda environment (Every time you use StrainSifter)
42 |
43 | source activate ssift
44 |
45 | ### Dependencies
46 |
47 | We recommend running StrainSifter in the provided conda environment. If you wish to run StrainSifter without using the conda environment, the following tools must be installed and in your system PATH:
48 | * [Burrows-Wheeler Aligner (BWA)](http://bio-bwa.sourceforge.net)
49 | * [Samtools](http://www.htslib.org)
50 | * [Bamtools](https://github.com/pezmaster31/bamtools)
51 | * [Bedtools](http://bedtools.readthedocs.io/en/latest/)
52 | * [MUSCLE](https://www.drive5.com/muscle/)
53 | * [FastTree](http://www.microbesonline.org/fasttree/)
54 | * [Python3](https://www.python.org/downloads/)
55 | * [R](https://www.r-project.org)
56 |
57 | ## Running StrainSifter
58 |
59 | Due to the computing demands of the StrainSifter pipeline, we recommend running on a computing cluster if possible.
60 | Instructions for configuring Snakemake to submit cluster jobs via SLURM can be found at https://github.com/bhattlab/slurm.
61 |
62 | ### Input files
63 |
64 | * Reference genome assembly in fasta format (can be a draft genome or a finished reference genome).
65 |   Acceptable file extensions: ".fasta", ".fa", ".fna"
66 | 
67 | * Two or more short read datasets in fastq format (metagenomic reads or isolate reads), optionally gzipped.
68 |   Acceptable file extensions: ".fq", ".fastq", ".fq.gz", ".fastq.gz"
69 |
70 | Short read data can be paired- or single-end.
71 |
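As a quick sanity check, input filenames can be validated against the accepted extensions. The helper below is illustrative only (`is_valid_reads_file` is not part of StrainSifter):

```python
# Illustrative helper: check whether a reads file matches the
# extensions StrainSifter accepts (.fq/.fastq, optionally gzipped).
VALID_SUFFIXES = (".fq", ".fastq", ".fq.gz", ".fastq.gz")

def is_valid_reads_file(path: str) -> bool:
    return path.lower().endswith(VALID_SUFFIXES)

print(is_valid_reads_file("sample1_R1.fq.gz"))  # True
print(is_valid_reads_file("sample1_R1.bam"))    # False
```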
72 | You will need to list the input files for each sample in the config file. This is described in the *Config* section below.
73 |
74 | ### Config
75 |
76 | You must update the config.yaml file as follows:
77 |
78 | **reference:** Path to reference genome (fasta format)
79 |
80 | **reads:** Samples and the file path(s) to the input reads.
81 |
82 |
83 |
84 | Optionally, you can update the following parameters:
85 |
86 | **prefix:** (optional) Desired prefix for output filenames. If left blank, the name of the reference genome will be used.
87 | 
88 | **mapq:** minimum mapping quality score required to evaluate a read alignment
89 | 
90 | **n_mismatches:** consider only reads with this many mismatches or fewer
91 | 
92 | **min_cvg:** minimum read depth required to determine the nucleotide at any given position
93 | 
94 | **min_genome_percent:** the minimum fraction of bases that must be covered at min_cvg or greater for a sample to be processed
95 |
96 | **base_freq:** minimum frequency of a nucleotide to call a base at any position
97 |
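When **prefix** is left blank, the Snakefile derives a default from the reference filename by splitting the path on `/` and `.` and taking the second-to-last token; a minimal sketch of that behavior:

```python
import re

def default_prefix(reference: str) -> str:
    # Mirrors the Snakefile: split on "/" and "." and take the
    # second-to-last token, i.e. the reference basename without
    # its extension.
    return re.split(r"/|\.", reference)[-2]

print(default_prefix("/path/to/ref.fna"))  # ref
```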
98 |
99 |
100 | Example config.yaml:
101 |
102 | ##### input files #####
103 |
104 | # reference genome (required)
105 | reference: /path/to/ref.fna
106 |
107 | # short read data (at least two samples required)
108 | reads:
109 | sample1:
110 | [/path/to/sample1_R1.fq,
111 |      /path/to/sample1_R2.fq]
112 | sample2:
113 | [/path/to/sample2_R1.fq,
114 | /path/to/sample2_R2.fq]
115 | sample3: /path/to/sample3.fq
116 |
117 | # prefix for output files (optional - can leave blank)
118 | prefix:
119 |
120 |
121 | ##### workflow parameters #####
122 |
123 | # alignment parameters:
124 | mapq: 60
125 | n_mismatches: 5
126 |
127 | # variant calling parameters:
128 | min_cvg: 5
129 | min_genome_percent: 0.5
130 | base_freq: 0.8
131 |
132 |
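Taken together, the variant-calling parameters act per genome position roughly as sketched below. `call_base` is an illustrative stand-in, not the pipeline's actual code: it mirrors the thresholding in `scripts/callSNPs.py` using the defaults above (base frequency 0.8, minimum base quality 20):

```python
# Toy consensus call for one position of a samtools pileup:
# count high-quality bases and report the top base only if it
# reaches the minimum frequency; otherwise call "N".
ASCII_OFFSET = ord("!")  # phred qualities are ASCII-offset by 33

def call_base(pileup: str, quality: str, reference: str,
              min_freq: float = 0.8, min_qual: int = 20) -> str:
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    total = 0
    for base, q in zip(pileup.upper(), quality):
        if ord(q) - ASCII_OFFSET < min_qual:
            continue  # skip low-quality bases
        # "." and "," in a pileup mean "matches the reference"
        counts[base if base in counts else reference.upper()] += 1
        total += 1
    if total == 0:
        return "N"
    top = max(counts, key=counts.get)
    return top if counts[top] / total >= min_freq else "N"

print(call_base("AAAAA", "IIIII", "a"))  # A ("I" is phred 40)
print(call_base("AATTG", "IIIII", "a"))  # N (no base reaches 0.8)
```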
133 | ### Running StrainSifter
134 |
135 | To run StrainSifter, the config file must be present in the directory in which you wish to run the workflow.
136 | You should then be able to run StrainSifter as follows:
137 |
138 | #### Phylogeny
139 |
140 | To generate a phylogenetic tree showing all of the input samples that contain your strain of interest at sufficient coverage to profile:
141 |
142 | snakemake {prefix}.tree.pdf
143 |
144 | #### SNV counts
145 |
146 | To generate a list of pairwise SNV counts between all input samples:
147 |
148 | snakemake {prefix}.dist.tsv
149 |
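Conceptually, the pairwise distance counts positions where two samples' consensus bases disagree and neither is ambiguous. A toy sketch of that idea (assumed behavior for illustration, not the actual `pairwiseDist.py`):

```python
# Illustrative pairwise SNV count between two consensus sequences:
# count positions where the bases differ and neither is "N".
def snv_distance(cns_a: str, cns_b: str) -> int:
    return sum(1 for a, b in zip(cns_a, cns_b)
               if a != "N" and b != "N" and a != b)

print(snv_distance("ACGTA", "ACCTN"))  # 1 (third position differs)
```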
150 | ### FAQ
151 |
152 | Q: Can StrainSifter be used for non-bacterial genomes (e.g. yeast)?
153 |
154 | A: At present, we recommend StrainSifter for bacteria only. In theory, StrainSifter should work for yeast if a haploid reference genome is provided.
155 |
--------------------------------------------------------------------------------
/Snakefile:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | from snakemake.utils import min_version
4 |
5 | ##### set minimum snakemake version #####
6 | min_version("5.1.4")
7 |
8 | ##### load config file and sample list #####
9 | configfile: "config.yaml"
10 |
11 | ##### list of input samples #####
12 | samples = [key for key in config['reads']]
13 |
14 | ##### prefix for phylogenetic tree and SNV distance files #####
15 | if config['prefix'] is None:
16 |     prefix = re.split(r"/|\.", config['reference'])[-2]
17 | else:
18 | prefix = config['prefix']
19 |
20 | ##### rules #####
21 |
22 | # index reference genome for bwa alignment
23 | rule bwa_index:
24 | input: config['reference']
25 | output:
26 | "{ref}.amb".format(ref=config['reference']),
27 | "{ref}.ann".format(ref=config['reference']),
28 | "{ref}.bwt".format(ref=config['reference']),
29 | "{ref}.pac".format(ref=config['reference']),
30 | "{ref}.sa".format(ref=config['reference'])
31 | resources:
32 | mem=2,
33 | time=1
34 | shell:
35 | "bwa index {input}"
36 |
37 | # align reads to reference genome with bwa
38 | rule bwa_align:
39 | input:
40 | ref_index = rules.bwa_index.output,
41 | r = lambda wildcards: config["reads"][wildcards.sample]
42 | output:
43 | "filtered_bam/{sample}.filtered.bam"
44 | resources:
45 | mem=32,
46 | time=6
47 | threads: 8
48 | params:
49 | ref = config['reference'],
50 | qual=config['mapq'],
51 | nm=config['n_mismatches']
52 | shell:
53 | "bwa mem -t {threads} {params.ref} {input.r} | "\
54 | "samtools view -b -q {params.qual} | "\
55 | "bamtools filter -tag 'NM:<={params.nm}' | "\
56 | "samtools sort --threads {threads} -o {output}"
57 |
58 | # count base read coverage
59 | rule genomecov:
60 | input:
61 | rules.bwa_align.output
62 | output:
63 | "genomecov/{sample}.tsv"
64 | resources:
65 | mem=8,
66 |         time=1
67 | threads: 1
68 | shell:
69 | "bedtools genomecov -ibam {input} > {output}"
70 |
71 | # calculate average coverage across the genome
72 | rule calc_coverage:
73 | input:
74 | rules.genomecov.output
75 | output:
76 | "coverage/{sample}.cvg"
77 | resources:
78 | mem=8,
79 |         time=1
80 | threads: 1
81 | params:
82 | cvg=config['min_cvg']
83 | script:
84 | "scripts/getCoverage.py"
85 |
86 | # filter samples that meet coverage requirements
87 | rule filter_samples:
88 | input: expand("coverage/{sample}.cvg", sample = samples)
89 | output:
90 | dynamic("passed_samples/{sample}.bam")
91 | resources:
92 | mem=1,
93 | time=1
94 | threads: 1
95 | params:
96 | min_cvg=config['min_cvg'],
97 | min_perc=config['min_genome_percent']
98 | run:
99 | samps = input
100 | for samp in samps:
101 | with open(samp) as s:
102 | cvg, perc = s.readline().rstrip('\n').split('\t')
103 | if (float(cvg) >= params.min_cvg and float(perc) > params.min_perc):
104 |                     # strip the ".cvg" extension (rstrip would remove any trailing c/v/g characters)
105 |                     shell("ln -s $PWD/filtered_bam/{s}.filtered.bam passed_samples/{s}.bam".format(s=os.path.splitext(os.path.basename(samp))[0]))
105 |
106 | # index reference genome for pileup
107 | rule faidx:
108 | input: config['reference']
109 | output: "{ref}.fai".format(ref=config['reference'])
110 | resources:
111 | mem=2,
112 | time=1
113 | shell:
114 | "samtools faidx {input}"
115 |
116 | # create pileup from bam files
117 | rule pileup:
118 | input:
119 | bam="passed_samples/{sample}.bam",
120 | ref=config['reference'],
121 | index=rules.faidx.output
122 | output: "pileup/{sample}.pileup"
123 | resources:
124 | mem=32,
125 | time=1
126 | threads: 16
127 | shell:
128 | "samtools mpileup -f {input.ref} -B -aa -o {output} {input.bam}"
129 |
130 | # call SNPs from pileup
131 | rule call_snps:
132 | input: rules.pileup.output
133 | output: "snp_calls/{sample}.tsv"
134 | resources:
135 | mem=32,
136 | time=2
137 | threads: 16
138 | params:
139 |         min_cvg=config['min_cvg'],
140 |         min_freq=config['base_freq'],
141 | min_qual=20
142 | script:
143 | "scripts/callSNPs.py"
144 |
145 | # get consensus sequence from pileup
146 | rule snp_consensus:
147 | input: rules.call_snps.output
148 | output: "consensus/{sample}.txt"
149 | resources:
150 | mem=2,
151 | time=2
152 | threads: 1
153 | shell:
154 | "echo {wildcards.sample} > {output}; cut -f4 {input} >> {output}"
155 |
156 | # combine consensus sequences into one file
157 | rule combine:
158 | input:
159 | dynamic("consensus/{sample}.txt")
160 | output: "{name}.cns.tsv".format(name = prefix)
161 | resources:
162 | mem=2,
163 | time=1
164 | threads: 1
165 | shell:
166 | "paste {input} > {output}"
167 |
168 | # find positions that have a base call in each input genome and at least
169 | # one variant in the set of input genomes
170 | rule core_snps:
171 | input: rules.combine.output
172 | output: "{name}.core_snps.tsv".format(name = prefix)
173 | resources:
174 | mem=16,
175 | time=1
176 | threads: 1
177 | script:
178 | "scripts/findCoreSNPs.py"
179 |
180 | # convert core SNPs file to fasta format
181 | rule core_snps_to_fasta:
182 | input: rules.core_snps.output
183 | output: "{name}.fasta".format(name = prefix)
184 | resources:
185 | mem=16,
186 | time=1
187 | threads: 1
188 | script:
189 | "scripts/coreSNPs2fasta.py"
190 |
191 | # perform multiple sequence alignment of fasta file
192 | rule multi_align:
193 | input: rules.core_snps_to_fasta.output
194 | output: "{name}.afa".format(name = prefix)
195 | resources:
196 | mem=200,
197 | time=12
198 | threads: 1
199 | shell:
200 | "muscle -in {input} -out {output}"
201 |
202 | # calculate phylogenetic tree from multiple sequence alignment
203 | rule build_tree:
204 | input: rules.multi_align.output
205 | output: "{name}.tree".format(name = prefix)
206 | resources:
207 | mem=8,
208 | time=1
209 | threads: 1
210 | shell:
211 | "fasttree -nt {input} > {output}"
212 |
213 | # plot phylogenetic tree
214 | rule plot_tree:
215 | input: rules.build_tree.output
216 | output: "{name}.tree.pdf".format(name = prefix)
217 | resources:
218 | mem=8,
219 | time=1
220 | threads: 1
221 | script:
222 | "scripts/renderTree.R"
223 |
224 | # count pairwise SNVs between input samples
225 | rule pairwise_snvs:
226 | input: dynamic("consensus/{sample}.txt")
227 | output: "{name}.dist.tsv".format(name = prefix)
228 | resources:
229 | mem=8,
230 | time=1
231 | threads: 1
232 | script:
233 | "scripts/pairwiseDist.py"
234 |
--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
1 | ##### input files #####
2 |
3 | # reference genome (required)
4 | reference: /path/to/ref.fna
5 |
6 | # short read data (at least two samples required)
7 | reads:
8 | sample1:
9 | [/path/to/sample1_R1.fq,
10 |      /path/to/sample1_R2.fq]
11 | sample2:
12 | [/path/to/sample2_R1.fq,
13 | /path/to/sample2_R2.fq]
14 | sample3: /path/to/sample3.fq
15 |
16 | # prefix for output files (optional - can leave blank)
17 | prefix:
18 |
19 |
20 | ##### workflow parameters #####
21 |
22 | # alignment parameters:
23 | mapq: 60
24 | n_mismatches: 5
25 |
26 | # variant calling parameters:
27 | min_cvg: 5
28 | min_genome_percent: 0.5
29 | base_freq: 0.8
30 |
--------------------------------------------------------------------------------
/envs/environment.yaml:
--------------------------------------------------------------------------------
1 | name: ssift
2 | channels:
3 | - bioconda
4 | - defaults
5 | - conda-forge
6 | dependencies:
7 | - bwa
8 | - samtools
9 | - ncurses
10 | - bamtools
11 | - bedtools
12 |   - muscle
13 |   - fasttree
14 | - r-ggplot2
15 | - bioconductor-ggtree
16 | - r-phangorn
17 |
--------------------------------------------------------------------------------
/scripts/.snakemake.gej6roz5.callSNPs.py:
--------------------------------------------------------------------------------
1 |
2 | ######## Snakemake header ########
3 | import sys; sys.path.append("/home/tamburin/tools/miniconda3/lib/python3.6/site-packages"); import pickle; snakemake = pickle.loads(b'\x80\x03csnakemake.script\nSnakemake\nq\x00)\x81q\x01}q\x02(X\x05\x00\x00\x00inputq\x03csnakemake.io\nInputFiles\nq\x04)\x81q\x05X\x1c\x00\x00\x00pileup/4757_2-16-2016.pileupq\x06a}q\x07X\x06\x00\x00\x00_namesq\x08}q\tsbX\x06\x00\x00\x00outputq\ncsnakemake.io\nOutputFiles\nq\x0b)\x81q\x0cX\x1c\x00\x00\x00snp_calls/4757_2-16-2016.tsvq\ra}q\x0eh\x08}q\x0fsbX\x06\x00\x00\x00paramsq\x10csnakemake.io\nParams\nq\x11)\x81q\x12(K\x05G?\xe9\x99\x99\x99\x99\x99\x9aK\x14e}q\x13(h\x08}q\x14(X\x07\x00\x00\x00min_cvgq\x15K\x00N\x86q\x16X\x08\x00\x00\x00min_freqq\x17K\x01N\x86q\x18X\x08\x00\x00\x00min_qualq\x19K\x02N\x86q\x1auh\x15K\x05h\x17G?\xe9\x99\x99\x99\x99\x99\x9ah\x19K\x14ubX\t\x00\x00\x00wildcardsq\x1bcsnakemake.io\nWildcards\nq\x1c)\x81q\x1dX\x0e\x00\x00\x004757_2-16-2016q\x1ea}q\x1f(h\x08}q X\x06\x00\x00\x00sampleq!K\x00N\x86q"sX\x06\x00\x00\x00sampleq#h\x1eubX\x07\x00\x00\x00threadsq$K\x01X\t\x00\x00\x00resourcesq%csnakemake.io\nResources\nq&)\x81q\'(K\x01K\x01K K\x02e}q((h\x08}q)(X\x06\x00\x00\x00_coresq*K\x00N\x86q+X\x06\x00\x00\x00_nodesq,K\x01N\x86q-X\x03\x00\x00\x00memq.K\x02N\x86q/X\x04\x00\x00\x00timeq0K\x03N\x86q1uh*K\x01h,K\x01h.K h0K\x02ubX\x03\x00\x00\x00logq2csnakemake.io\nLog\nq3)\x81q4}q5h\x08}q6sbX\x06\x00\x00\x00configq7}q8(X\t\x00\x00\x00referenceq9X?\x00\x00\x00/home/tamburin/fiona/crassphage/strainsifter/ref/B_vulgatus.fnaq:X\x07\x00\x00\x00samplesq;X0\x00\x00\x00/home/tamburin/fiona/crassphage/all_samples.listqX\n\x00\x00\x00paired_endq?X\x01\x00\x00\x00Yq@X\x07\x00\x00\x00min_cvgqAK\nuX\x04\x00\x00\x00ruleqBX\t\x00\x00\x00call_snpsqCub.'); from snakemake.logging import logger; logger.printshellcmds = False
4 | ######## Original script #########
5 | #!/usr/bin/env python3
6 | # callSNPS.py
7 | # call SNPs from a samtools pileup file
8 |
9 | import sys
10 | import gzip
11 | import re
12 |
13 | DEBUG = 0
14 |
15 | ASCII_OFFSET = ord("!")
16 | INDEL_PATTERN = re.compile(r"[+-](\d+)")
17 | INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}"
18 | ROUND = 3 # places to right of decimal point
19 |
20 | min_coverage, min_proportion, min_qual = snakemake.params
21 | pileup_file = snakemake.input[0]
22 | out_file = snakemake.output[0]
23 |
24 | min_coverage = int(min_coverage)
25 | min_proportion = float(min_proportion)
26 | min_qual = int(min_qual)
27 |
28 | ## function to process each line of pileup
29 |
30 | def parse_pileup(line):
31 |
32 | # read fields from line of pileup
33 | chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t")
34 |
35 | # proportion is 0 if no consensus base, top base is N by default
36 | parsed = {
37 | "proportion": 0.0, "chromosome": chromosome,
38 | "position": int(position), "reference": reference,
39 | "coverage": int(coverage), "pileup": pileup,
40 | "quality": quality, "top_base": "N"}
41 |
42 | # if the base coverage is below the limit or above our acceptable max, call N
43 | if parsed["coverage"] < min_coverage:
44 | return parsed
45 |
46 | # uppercase pileup string for processing
47 | pileup = pileup.upper()
48 |
49 | # Remove start and stop characters from pileup string
50 | pileup = re.sub(r"\^.|\$", "", pileup)
51 |
52 | # Remove indels from pileup string
53 | start = 0
54 |
55 | while True:
56 | match = INDEL_PATTERN.search(pileup, start)
57 |
58 | if match:
59 | integer = match.group(1)
60 | pileup = re.sub(INDEL_STRING.replace("INTEGER", integer), "", pileup)
61 | start = match.start()
62 | else:
63 | break
64 |
65 | # get total base count and top base count
66 | total = 0
67 | top_base = "N"
68 | top_base_count = 0
69 |
70 | # uppercase reference base for comparison
71 | reference = reference.upper()
72 |
73 | base_counts = {"A": 0, "C": 0, "G": 0, "T": 0}
74 |
75 | quality_length = len(quality)
76 |
77 | for i in range(quality_length):
78 |
79 | # convert ASCII character to phred base quality
80 | base_quality = ord(quality[i]) - ASCII_OFFSET
81 |
82 | # only count high-quality bases
83 | if base_quality >= min_qual:
84 |
85 | currBase = pileup[i]
86 |
87 | if currBase in base_counts:
88 | base_counts[currBase] += 1
89 |
90 | else:
91 | base_counts[reference] += 1
92 |
93 | total += 1
94 |
95 | parsed["total"] = total
96 |
97 | # find top base
98 | for base in base_counts:
99 | if base_counts[base] > top_base_count:
100 | top_base = base
101 | top_base_count = base_counts[base]
102 |
103 | # if more that 0 bases processed
104 | if total > 0:
105 | prop = top_base_count / float(total)
106 |
107 | if prop >= min_proportion:
108 | parsed["proportion"] = prop
109 | parsed["top_base"] = top_base
110 |
111 | return parsed
112 |
113 | ## process pileup file
114 |
115 | # read in input file
116 | if not DEBUG:
117 | out_file = open(out_file, "w")
118 |
119 | with open(pileup_file, "rt") as pileup_file:
120 | for pileup_file_line in pileup_file:
121 |
122 | parsed = parse_pileup(pileup_file_line)
123 |
124 | proportion = parsed["proportion"]
125 |
126 | # print to output file
127 | if DEBUG:
128 | print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
129 | parsed["pileup"], parsed["quality"]]))
130 | else:
131 | print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
132 | parsed["pileup"], parsed["quality"]]), file=out_file)
133 |
134 | if not DEBUG:
135 | out_file.close()
136 |
--------------------------------------------------------------------------------
/scripts/callSNPs.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # callSNPs.py
3 | # call SNPs from a samtools pileup file
4 |
5 | import sys
6 | import gzip
7 | import re
8 |
9 | DEBUG = 0
10 |
11 | ASCII_OFFSET = ord("!")
12 | INDEL_PATTERN = re.compile(r"[+-](\d+)")
13 | INDEL_STRING = "[+-]INTEGER[ACGTN]{INTEGER}"
14 | ROUND = 3 # places to right of decimal point
15 |
16 | min_coverage, min_proportion, min_qual = snakemake.params
17 | pileup_file = snakemake.input[0]
18 | out_file = snakemake.output[0]
19 |
20 | min_coverage = int(min_coverage)
21 | min_proportion = float(min_proportion)
22 | min_qual = int(min_qual)
23 |
24 | ## function to process each line of pileup
25 |
26 | def parse_pileup(line):
27 |
28 | # read fields from line of pileup
29 | chromosome, position, reference, coverage, pileup, quality = line.rstrip("\r\n").split("\t")
30 |
31 | # proportion is 0 if no consensus base, top base is N by default
32 | parsed = {
33 | "proportion": 0.0,
34 | "chromosome": chromosome,
35 | "position": int(position),
36 | "reference": reference,
37 | "coverage": int(coverage),
38 | "pileup": pileup,
39 | "quality": quality,
40 | "top_base": "N"}
41 |
42 |     # if coverage is below the minimum threshold, leave the call as N
43 | if parsed["coverage"] < min_coverage:
44 | return parsed
45 |
46 | # uppercase pileup string for processing
47 | pileup = pileup.upper()
48 |
49 | # Remove start and stop characters from pileup string
50 | pileup = re.sub(r"\^.|\$", "", pileup)
51 |
52 | # Remove indels from pileup string
53 | start = 0
54 |
55 | while True:
56 | match = INDEL_PATTERN.search(pileup, start)
57 |
58 | if match:
59 | integer = match.group(1)
60 | pileup = re.sub(INDEL_STRING.replace("INTEGER", integer), "", pileup)
61 | start = match.start()
62 | else:
63 | break
64 |
65 | # get total base count and top base count
66 | total = 0
67 | top_base = "N"
68 | top_base_count = 0
69 |
70 | # uppercase reference base for comparison
71 | reference = reference.upper()
72 |
73 | base_counts = {"A": 0, "C": 0, "G": 0, "T": 0}
74 |
75 | quality_length = len(quality)
76 |
77 | for i in range(quality_length):
78 |
79 | # convert ASCII character to phred base quality
80 | base_quality = ord(quality[i]) - ASCII_OFFSET
81 |
82 | # only count high-quality bases
83 | if base_quality >= min_qual:
84 |
85 |             curr_base = pileup[i]
86 | 
87 |             if curr_base in base_counts:
88 |                 base_counts[curr_base] += 1
89 |
90 | else:
91 | base_counts[reference] += 1
92 |
93 | total += 1
94 |
95 | parsed["total"] = total
96 |
97 | # find top base
98 | for base in base_counts:
99 | if base_counts[base] > top_base_count:
100 | top_base = base
101 | top_base_count = base_counts[base]
102 |
103 |     # if more than 0 bases were processed
104 | if total > 0:
105 | prop = top_base_count / float(total)
106 |
107 | if prop >= min_proportion:
108 | parsed["proportion"] = prop
109 | parsed["top_base"] = top_base
110 |
111 | return parsed
112 |
113 | ## process pileup file
114 |
115 | # read in input file
116 | if not DEBUG:
117 | out_file = open(out_file, "w")
118 |
119 | with open(pileup_file, "rt") as pileup_file:
120 | for pileup_file_line in pileup_file:
121 |
122 | parsed = parse_pileup(pileup_file_line)
123 |
124 | proportion = parsed["proportion"]
125 |
126 | # print to output file
127 | if DEBUG:
128 | print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
129 | parsed["pileup"], parsed["quality"]]))
130 | else:
131 | print("\t".join([parsed["chromosome"], str(parsed["position"]), parsed["reference"], parsed["top_base"], str(round(proportion, ROUND)),
132 | parsed["pileup"], parsed["quality"]]), file=out_file)
133 |
134 | if not DEBUG:
135 | out_file.close()
136 |
--------------------------------------------------------------------------------
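The consensus rule in callSNPs.py above can be exercised without the `snakemake` object. A minimal standalone sketch with fabricated data (the function name `call_consensus` and all sample values are illustrative, not part of the pipeline): count bases whose phred quality meets the threshold, then accept the majority base only if its proportion reaches `min_proportion`, otherwise call "N".

```python
# Sketch of the consensus-calling rule: quality-filter the pileup column,
# tally bases, and require a majority proportion. Sample data is fabricated.
ASCII_OFFSET = ord("!")  # phred+33 quality encoding

def call_consensus(pileup, quality, reference, min_qual=20, min_proportion=0.8):
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    total = 0
    for base, q in zip(pileup.upper(), quality):
        if ord(q) - ASCII_OFFSET < min_qual:
            continue  # skip low-quality bases
        # '.' and ',' in a pileup column mean "matches the reference"
        counts[base if base in counts else reference.upper()] += 1
        total += 1
    top = max(counts, key=counts.get)
    prop = counts[top] / total if total else 0.0
    return (top, prop) if prop >= min_proportion else ("N", 0.0)

print(call_consensus("..T.T", "IIIII", "t"))  # -> ('T', 1.0)
```

Like the script, this sends pileup match symbols to the reference's tally and falls back to "N" when no base clears the proportion cutoff.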
/scripts/coreSNPs2fasta.py:
--------------------------------------------------------------------------------
1 | import sys
2 |
3 | HEADER_POS = 0
4 | FASTA_POS = 1
5 | #coreSNPs_file = sys.argv[1]
6 |
7 | snp_file = snakemake.input[0]
8 | out_file = snakemake.output[0]
9 |
10 | # hold sequences as we go
11 | fastas = []
12 |
13 | with open(snp_file, "r") as core_snps:
14 |
15 | # process header
16 | header = core_snps.readline().rstrip("\n").split("\t")
17 |
18 |     # record each sample header in the list with an empty sequence string
19 | for h in range(len(header)):
20 | fastas += [[header[h], ""]]
21 |
22 | for line in core_snps:
23 | line = line.rstrip("\n").split("\t")
24 |
25 | for pos in range(len(line)):
26 | # print(fastas[pos])
27 | # print(fastas[pos][FASTA_POS])
28 | fastas[pos][FASTA_POS] += line[pos]
29 |
30 | with open(out_file, "w") as out:
31 |     for f in range(len(fastas)):
32 |         print(">" + fastas[f][0], file=out)
33 |         print(fastas[f][1], file=out)
34 |
--------------------------------------------------------------------------------
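The transposition performed by coreSNPs2fasta.py above (each column of the core-SNP table becomes one per-sample FASTA record) can be demonstrated in isolation; the table rows below are fabricated.

```python
# Sketch of the column-to-FASTA transposition: the header names the samples,
# each subsequent row holds one base per sample. Fabricated input.
rows = [
    "sampleA\tsampleB",  # header: one column per sample
    "A\tG",
    "C\tC",
    "T\tA",
]
samples = rows[0].split("\t")
seqs = {s: "" for s in samples}
for line in rows[1:]:
    for sample, base in zip(samples, line.split("\t")):
        seqs[sample] += base  # append this position's call to the sample

fasta = "\n".join(">%s\n%s" % (s, seqs[s]) for s in samples)
print(fasta)
```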
/scripts/findCoreSNPs.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # find core snvs shared between samples
3 |
4 | import sys
5 |
6 | snp_file = snakemake.input[0]
7 | out_file = snakemake.output[0]
8 |
9 | # read each line and print if base is called for every sample and
10 | # bases are not the same in every sample
11 | with open(snp_file, "r") as snps:
12 |     with open(out_file, "w") as out:
13 | 
14 |         # print header
15 |         print(snps.readline().rstrip('\n'), file=out)
16 | 
17 |         # process each line and output positions with no unknown bases ("N")
18 |         # and where not all samples share the same base
19 |         for line in snps:
20 |             positions = line.rstrip('\n').split('\t')
21 |             if (len(set(positions)) > 1) and ("N" not in positions):
22 |                 print("\t".join(positions), file=out)
23 |
--------------------------------------------------------------------------------
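The filter in findCoreSNPs.py above keeps a position only when every sample has a confident call (no "N") and the samples do not all agree. A minimal sketch on fabricated rows:

```python
# Sketch of the core-SNP filter: drop invariant positions and positions
# with any uncalled base. Input rows are fabricated.
rows = ["A\tA\tA",   # invariant -> dropped
        "A\tG\tA",   # variant, fully called -> kept
        "A\tN\tG"]   # contains an uncalled base -> dropped

def is_core_snp(line):
    positions = line.split("\t")
    return len(set(positions)) > 1 and "N" not in positions

kept = [r for r in rows if is_core_snp(r)]
print(kept)  # -> ['A\tG\tA']
```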
/scripts/getCoverage.py:
--------------------------------------------------------------------------------
1 | # Fiona Tamburini
2 | # Jan 26 2018
3 |
4 | # calculate average coverage and % bases covered above min threshold from bedtools genomecov output
5 | # runs as a Snakemake script: input, output, and min coverage come from the snakemake object
6 |
7 | import sys
8 |
9 | # snakemake input and output files
10 | sample_file = snakemake.input[0]
11 | out_file = snakemake.output[0]
12 |
13 | # minimum coverage threshold -- report percentage of bases covered at or
14 | # beyond this depth
15 | minCvg = int(snakemake.params[0])
16 |
17 | totalBases = 0
18 | coveredBases = 0
19 | weightedAvg = 0
20 | with open(sample_file, 'r') as sample:
21 | for line in sample:
22 | if line.startswith("genome"):
23 |             chrom, depth, numBases, size, fraction = line.rstrip('\n').split('\t')
24 |
25 | depth = int(depth)
26 | numBases = int(numBases)
27 |
28 | totalBases += numBases
29 | weightedAvg += depth * numBases
30 |
31 | if depth >= minCvg:
32 | coveredBases += numBases
33 | avgCvg = 0
34 | percCovered = 0
35 |
36 | if totalBases > 0:
37 | avgCvg = float(weightedAvg) / float(totalBases)
38 | percCovered = float(coveredBases) / float(totalBases)
39 |
40 | with open(out_file, 'w') as out:
41 | print("\t".join([str(round(avgCvg, 2)), str(round(percCovered, 2))]), file = out)
42 |
--------------------------------------------------------------------------------
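The summary computed by getCoverage.py above can be checked by hand on a tiny `bedtools genomecov` histogram (fields: chrom, depth, number of bases at that depth, chrom size, fraction). The three lines below are fabricated:

```python
# Sketch of the coverage summary: depth-weighted mean coverage, plus the
# fraction of bases covered at or above a minimum depth. Fabricated input.
lines = [
    "genome\t0\t50\t100\t0.5",
    "genome\t5\t30\t100\t0.3",
    "genome\t10\t20\t100\t0.2",
]
min_cvg = 5
total_bases = covered_bases = weighted = 0
for line in lines:
    _, depth, num_bases, _, _ = line.split("\t")
    depth, num_bases = int(depth), int(num_bases)
    total_bases += num_bases
    weighted += depth * num_bases
    if depth >= min_cvg:
        covered_bases += num_bases

avg_cvg = weighted / total_bases            # (0*50 + 5*30 + 10*20) / 100 = 3.5
perc_covered = covered_bases / total_bases  # (30 + 20) / 100 = 0.5
print(round(avg_cvg, 2), round(perc_covered, 2))
```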
/scripts/pairwiseDist.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import sys
4 | import itertools
5 | import subprocess
6 | import re
7 |
8 | # input and output files from snakemake
9 | in_files = snakemake.input
10 | out_file = snakemake.output[0]
11 |
12 | cmd = " ".join(["cat", in_files[0], "| wc -l"])
13 | totalBases = subprocess.check_output(cmd, stdin=subprocess.PIPE, shell=True ).decode('ascii').rstrip('\n')
14 | totalBases = int(totalBases) - 1  # subtract the header line
15 |
16 | # for every pairwise combination of files, check SNV distance
17 | with open(out_file, "w") as out:
18 | 
19 |     # print header
20 |     print('\t'.join(["Sample1", "Sample2", "SNVs", "BasesCompared", "TotalBases"]), file=out)
21 | 
22 |     # get all pairwise combinations of input files
23 |     for file1, file2 in itertools.combinations(in_files, 2):
24 | 
25 |         # positions confidently called in both samples
26 |         cmd1 = " ".join(["paste", file1, file2, "| sed '1d' | grep -v N | wc -l"])
27 |         totalPos = subprocess.check_output(cmd1, shell=True).decode('ascii').rstrip('\n')
28 | 
29 |         # positions where the two samples disagree
30 |         cmd2 = " ".join(["paste", file1, file2, "| sed '1d' | grep -v N | awk '$1 != $2 {print $0}' | wc -l"])
31 |         diffPos = subprocess.check_output(cmd2, shell=True).decode('ascii').rstrip('\n')
32 | 
33 |         fname1 = re.findall(r'consensus/(.+)\.', file1)[0]
34 |         fname2 = re.findall(r'consensus/(.+)\.', file2)[0]
35 | 
36 |         print('\t'.join([fname1, fname2, str(diffPos), str(totalPos), str(totalBases)]), file=out)
37 |
--------------------------------------------------------------------------------
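The per-pair comparison that pairwiseDist.py above delegates to a paste/grep/awk pipeline (skip the header, ignore positions where either sample is "N", count disagreements) can be written in pure Python. The sample names and base calls below are fabricated:

```python
# Pure-Python sketch of the pairwise SNV count: ignore any position where
# either sample has an uncalled base, then count disagreements.
from itertools import combinations

consensus = {
    "s1": ["A", "C", "T", "N"],
    "s2": ["A", "G", "T", "A"],
    "s3": ["A", "C", "T", "A"],
}

def snv_distance(calls1, calls2):
    # positions confidently called in both samples
    pairs = [(a, b) for a, b in zip(calls1, calls2) if "N" not in (a, b)]
    diffs = sum(1 for a, b in pairs if a != b)
    return diffs, len(pairs)

for s1, s2 in combinations(sorted(consensus), 2):
    diffs, compared = snv_distance(consensus[s1], consensus[s2])
    print(s1, s2, diffs, compared)
```

This mirrors the shell pipeline's behavior of dropping a position whenever either column contains an "N".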
/scripts/renderTree.R:
--------------------------------------------------------------------------------
1 | library(ggtree)
2 | library(phangorn)
3 | library(ggplot2)
4 |
5 | # input files
6 | tree_file <- snakemake@input[[1]]
7 | out_file <- snakemake@output[[1]]
8 |
9 | # read tree file
10 | tree <- read.tree(tree_file)
11 |
12 | # midpoint root tree if there is more than 1 node
13 | if (tree$Nnode > 1){
14 | tree <- midpoint(tree)
15 | }
16 |
17 | # draw tree
18 | p <- ggtree(tree, branch.length = 1) +
19 | geom_tiplab(offset = .05) +
20 | geom_tippoint(size=3) +
21 | xlim(0, 4) +
22 | geom_treescale(width = 0.1, offset=0.25)
23 |
24 | # save output file
25 | ggsave(out_file, plot = p, height = 1*length(tree$tip.label))
26 |
--------------------------------------------------------------------------------
/tutorial/fastq/isolate1_1.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/isolate_reads_all/L2_PE1.fq
--------------------------------------------------------------------------------
/tutorial/fastq/isolate1_2.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/isolate_reads_all/L2_PE2.fq
--------------------------------------------------------------------------------
/tutorial/fastq/isolate2_1.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/isolate_reads_all/L5_PE1.fq
--------------------------------------------------------------------------------
/tutorial/fastq/isolate2_2.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/isolate_reads_all/L5_PE2.fq
--------------------------------------------------------------------------------
/tutorial/fastq/meta1_1.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S37_PE1.fq
--------------------------------------------------------------------------------
/tutorial/fastq/meta1_2.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S37_PE2.fq
--------------------------------------------------------------------------------
/tutorial/fastq/meta2_1.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S38_PE1.fq
--------------------------------------------------------------------------------
/tutorial/fastq/meta2_2.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S38_PE2.fq
--------------------------------------------------------------------------------
/tutorial/fastq/meta3_1.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S10_PE1.fq
--------------------------------------------------------------------------------
/tutorial/fastq/meta3_2.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S10_PE2.fq
--------------------------------------------------------------------------------
/tutorial/fastq/meta4_1.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S6_PE1.fq
--------------------------------------------------------------------------------
/tutorial/fastq/meta4_2.fq:
--------------------------------------------------------------------------------
1 | /home/tamburin/fiona/bacteremia/stool_reads_all/S6_PE2.fq
--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.amb:
--------------------------------------------------------------------------------
1 | 4641652 1 0
2 |
--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.ann:
--------------------------------------------------------------------------------
1 | 4641652 1 11
2 | 0 NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
3 | 0 4641652 0
4 |
--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.bwt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/tutorial/reference/E_coli_K12.fna.bwt
--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.fai:
--------------------------------------------------------------------------------
1 | NC_000913.3 4641652 72 80 81
2 |
--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.pac:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/tutorial/reference/E_coli_K12.fna.pac
--------------------------------------------------------------------------------
/tutorial/reference/E_coli_K12.fna.sa:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bhattlab/StrainSifter/4a354e5a0ae25bfe3d57a6b66fa27a36b5daf748/tutorial/reference/E_coli_K12.fna.sa
--------------------------------------------------------------------------------