├── HaploABBABABA_multithreaded.jar
├── Phix_Removal.sh
├── README.md
├── addRecombRates.r
├── allelicBalance.py
├── convertIndsFineSTR.sh
├── convertVCFtoEigenstrat.sh
├── createRADmappingReport.sh
├── createuniformrecmap.r
├── demultiplexing.sh
├── filter_vcf.py
├── ldPruning.sh
├── makeRecombMap4fineSTRUCTURE.sh
├── makeRelateMap.r
├── removeNonsignDsuite.r
├── runBowtie2RADID.sh
├── vcf2fineRADstructure.sh
├── vcf2fineSTR.lsf
└── vcf2phylip.py

/HaploABBABABA_multithreaded.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/joanam/scripts/3625dbf3ccec3abbc3f1079259f967ba0a4e952d/HaploABBABABA_multithreaded.jar
--------------------------------------------------------------------------------
/Phix_Removal.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Map the reads against the PhiX genome creating:
4 | # - a sam file (_phix.sam) with all PhiX reads aligned
5 | # - a fastq file (.noPhiXreads.fastq) with all non-PhiX reads
6 | # written by Joana Meier (Nov 2016)
7 |
8 | # Note, this line needs to be adjusted to show the correct path and prefix of the bowtie2 index of the PhiX genome
9 | phix_reference="/cluster/project/gdc/shared/p129/ref-genome/phix174"
10 |
11 | # This script requires that bowtie2 is installed and executable with the command 'bowtie2'.
12 | # Else, the script must be modified with the correct path and/or name of bowtie2.
13 |
14 | # Print the usage if help is requested
15 | if [[ $1 = "-h" ]]; then
16 |   echo "Usage: PhiX_removal.sh <raw reads file> <output files prefix>"
17 |   exit 0
18 | fi
19 | # Check if two arguments were given
20 | if (( "$#" != 2 )); then
21 |   echo "Please provide 2 arguments: "
22 |   echo "Usage: PhiX_removal.sh <raw reads file> <output files prefix>"
23 |   exit 1
24 | fi
25 |
26 |
27 | file=$1
28 | prefix=$2
29 | pu=`awk -F: '{print $1":"$2}' $file | head -1 | sed 's/@//'`
30 |
31 | # Run bowtie2 to write out the PhiX reads into a sam file and generate a fastq file without PhiX reads
32 | bowtie2 -x ${phix_reference} \
33 |   -q $file --phred33 --end-to-end \
34 |   -p 4 -N 1 --no-unal \
35 |   --un $prefix".noPhiXreads.fastq" \
36 |   -S $prefix"_phix.sam" \
37 |   --rg-id $prefix"_phix" \
38 |   --rg "ID:"$prefix"_phix" \
39 |   --rg "PG:Bowtie2_end-to-end_default_settings" \
40 |   --rg "PL:ILLUMINA" \
41 |   --rg "PU:"$pu \
42 |   --rg "SM:"$prefix"_phix"
43 |
44 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Scripts
2 | Java, R, Python and Bash scripts to handle NGS data and other biological data
3 |
4 | Also check out other speciation genomics scripts (https://github.com/speciationgenomics/scripts/
5 | ) of the online tutorial (https://speciationgenomics.github.io/)!
6 |
7 |
8 | ## Filtering scripts
9 |
10 | ### ldPruning.sh
11 | Bash script to prune SNPs in high linkage disequilibrium from a vcf file. This is important for all downstream analyses that assume no linkage among SNPs, i.e. that require independent SNPs, e.g. STRUCTURE, PCA, treemix...
12 |
13 | Requires vcftools and plink
14 |
15 | ```
16 | Usage:
17 | ldPruning.sh [optional: ]
18 | ```
19 |
20 | ### PhiX_removal.sh
21 | Bash script to remove PhiX (virus) reads from NGS data. This script generates a sam file with the aligned PhiX reads and a fastq file with all non-PhiX reads. This script may be useful to remove PhiX reads from RAD libraries.
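
Phix_Removal.sh fills the read-group platform unit (`PU`) from the first two colon-separated fields of the first FASTQ header. The following Python sketch mirrors only that one step, for illustration; the function name `platform_unit` and the example header are assumptions and not part of the repository.

```python
def platform_unit(fastq_header):
    """Return '<field1>:<field2>' of a FASTQ header line without the leading '@'.

    Hypothetical helper mirroring the shell pipeline
    awk -F: '{print $1":"$2}' | head -1 | sed 's/@//'
    """
    fields = fastq_header.rstrip("\n").split(":")
    first = fields[0]
    # awk prints an empty string for a missing second field
    second = fields[1] if len(fields) > 1 else ""
    # sed 's/@//' removes only the first '@' it finds
    return (first + ":" + second).replace("@", "", 1)


# Made-up Illumina-style header, used only as an example
print(platform_unit("@HWI-ST1234:100:C1ABCACXX:1:1101:1000:2000 1:N:0:ATCACG"))
```

For typical Illumina headers this yields the instrument name and run number, which is what ends up in the `PU` tag of the sam header.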
22 |
23 | ```
24 | Usage: PhiX_removal.sh <raw reads file> <output files prefix>
25 | ```
26 |
27 | ### allelicBalance.py
28 | Python 2 script that handles genotypes with strong allelic imbalance, i.e. heterozygote genotypes with one allele covered by most reads and the other allele covered by far fewer reads. This is usually an indication of contamination, but it can also occur if some reads are PCR duplicates. The user can specify whether imbalanced heterozygote genotypes should be set as missing data or turned into homozygote genotypes of the allele with more reads.
29 |
30 | ```
31 | usage: allelicBalance.py [-h] -i I -o O (-hom | -excl | -two) [-p P] [-p2 P2]
32 |                          [-r R]
33 |
34 | optional arguments:
35 |   -h, --help          show this help message and exit
36 |   -i I, --input I     input file in vcf format [required]
37 |   -o O, --output O    output file [required]
38 |   -hom, --homozygote  set failing genotypes as homozygous
39 |   -excl, --exclude    set failing genotypes as missing
40 |   -two, --twoSteps    set failing genotypes as missing if pvalue

2 alleles) SNPs
89 |                         will be filtered
90 |   -rrsnp, --removeRefSNPs
91 |                         if -rrsnp is specified, SNPs to the reference only
92 |                         (=monomorphic within sample) will be filtered
93 |   -np MININDPOP, --minIndPop MININDPOP
94 |                         minimum number of individuals per population-code
95 |   -e [E [E ...]], --exclude [E [E ...]]
96 |                         exclude popcodes containing the following string(s)
97 |                         from minIndPop-filtering
98 | ```
99 |
100 | ## File conversion scripts
101 |
102 | ### vcf2phylip.py
103 | Python2 script to convert a vcf(.gz) file to phylip format, e.g. for RAxML
104 |
105 | ```
106 | Usage: vcf2phylip.py -i -o [optional: -r -f -e -m]
107 |   if -r is specified, the reference sequence will be included in the phylip file
108 |   if -f is specified, all sites not in the vcf file will be printed as missing (N)
109 |   if -e is specified, indels are not printed (else replaced by N)
110 |   if -m is specified, haploid genotypes as e.g. for mitochondrial DNA are expected
111 |   (use e.g. when genotype calling was performed with a GATK tool and the option --sample_ploidy 1)
112 |
113 | If -f is specified, there will be no frameshifts if a site in the vcf file was filtered out due to low
114 | quality and also indels will be handled so that the positions and the length of the sequence is the
115 | same as the reference sequence length. This is very useful for data partitioning (e.g. specifying first,
116 | second and third codons) without having to adjust the positions because of sites lost during filtering
117 | steps or due to indels.
118 | ```
119 |
120 |
121 | ### convertVCFtoEigenstrat.sh
122 | Bash script which converts a vcf file to eigenstrat format for ADMIXTOOLS.
123 |
124 | Requires vcftools and convertf
125 |
126 | ```
127 | Usage: convertVCFtoEigenstrat.sh
128 | ```
129 |
130 | ### vcf2fineSTR.lsf
131 |
132 | Script that can be submitted to the Euler cluster (ETHZ) generating the fineSTRUCTURE input files from a vcf file and a linkage map.
This script is not intended to be useful for people without Euler access but rather as an example of how one could generate a fineSTRUCTURE file from a vcf file using vcftools and the makeRecombMap4fineSTRUCTURE.sh and convertIndsFineSTR.sh scripts.
133 |
134 | ```
135 | Usage: bsub < vcf2fineSTR.lsf
136 | ```
137 |
138 | ### addRecombRates.r
139 |
140 | R script to generate a map in plink format, e.g. for phasing with BEAGLE. It requires a vcf file and a linkage map as input.
141 | ```
142 | Usage: addRecombRates.r
143 | ```
144 |
145 | If you do not have a linkage map, you can use this script to generate uniform recombination maps:
146 | ### createuniformrecmap.r
147 | ```
148 | Usage: createuniformrecmap.r
149 | ```
150 |
151 | ### vcf2fineRADstructure.sh
152 | Bash script that converts a vcf(.gz) file into the input format required for fineRADstructure (http://cichlid.gurdon.cam.ac.uk/fineRADstructure.html). It is rather slow as it uses vcftools to extract each RAD locus from the vcf file and then converts it to the correct format. It is best to use it only on vcf files containing only SNPs, as monomorphic sites and indels are not used by fineRADstructure anyway. This script uses createRADmappingReport.sh to define the RAD loci. This script should be run with a phased vcf file as fineRADstructure needs haplotypes. For RAD data, I would use e.g. GATK ReadBackedPhasing (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_phasing_ReadBackedPhasing.php).
153 |
154 | ```
155 | Usage: vcf2fineRADstructure.sh
156 | ```
157 |
158 |
159 |
160 | ## Scripts for RAD data handling
161 |
162 | ### demultiplexing.sh
163 | Bash script to demultiplex RAD reads with barcodes of different lengths. It uses process_radtags from Stacks. It also trims the reads if the trimming length is shorter than the read length.
This can be useful to remove read ends of bad sequencing quality and
164 | to ensure that samples with different barcode lengths end up with the same demultiplexed read length. Note, the barcodes file needs to have one line per barcode giving the sample name and then, tab-delimited, the barcode.
165 |
166 | ```
167 | Usage: demultiplexing.sh
168 | ```
169 |
170 | ### runBowtie2RADID.sh
171 | This script is meant for RAD data mostly for the fish ecology and evolution group at EAWAG as it requires a naming scheme used by that group. However, it may also be useful to other users as a guide to set up a script that automatically aligns all fastq files in a directory.
172 |
173 |
174 | ### createRADmappingReport.sh
175 | Bash script that generates a mapping report of RAD data in bam files including for each individual the number of RAD loci sequenced, the mean sequencing depth, the RAD loci covered by at least 10 reads, and the mean sequencing depth of these RAD loci. It also generates a file with one line per RAD locus and individual giving the sequencing depth for each. The script needs to be run in the directory containing the bam files. The report can be used to test how good a RAD library is. If samples with few reads have few loci sequenced at high depth instead of many loci at low depth, this is a strong indication of high duplication levels.
176 |
177 | ```
178 | Usage: cd ; createRADmappingReport.sh
179 | ```
180 |
181 |
182 | ## Data analysis scripts
183 |
184 | ### HaploABBABABA_multithreaded # Attention: I am working on a bug fix, please do not currently use this software
185 | Java program to calculate ABBA-BABA (D-statistics) to infer gene flow and the five-population test to infer the direction of gene flow.
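
As background, the standard D-statistic compares the counts of the two discordant site patterns among three ingroup populations (P1, P2, P3) and an outgroup (O): D = (nABBA - nBABA) / (nABBA + nBABA). The sketch below illustrates only this textbook definition on one haploid allele per population; it is not the implementation in the jar, and the function names and toy data are assumptions made for the example.

```python
def count_patterns(sites):
    """Count ABBA and BABA sites.

    sites: iterable of (p1, p2, p3, outgroup) allele tuples,
    one haploid allele per population (toy input format, an assumption).
    """
    abba = baba = 0
    for p1, p2, p3, out in sites:
        if p1 == p2:
            continue  # not a discordant pattern between P1 and P2
        if p1 == out and p2 == p3:
            abba += 1  # P2 shares the derived allele with P3
        elif p2 == out and p1 == p3:
            baba += 1  # P1 shares the derived allele with P3
    return abba, baba


def d_statistic(n_abba, n_baba):
    """D = (nABBA - nBABA) / (nABBA + nBABA); NaN if there are no informative sites."""
    total = n_abba + n_baba
    if total == 0:
        return float("nan")
    return float(n_abba - n_baba) / total


# Toy data: two ABBA sites, one BABA site, one uninformative site
toy_sites = [("A", "G", "G", "A"), ("G", "A", "G", "A"),
             ("A", "G", "G", "A"), ("A", "A", "A", "A")]
n_abba, n_baba = count_patterns(toy_sites)
print(n_abba, n_baba, d_statistic(n_abba, n_baba))
```

A positive D indicates an excess of allele sharing between P2 and P3, a negative D an excess between P1 and P3. Assessing significance (for example with a block jackknife) is not covered by this sketch.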
186 |
187 | For more info, use: java -jar HaploABBABABA_multithreaded.jar -h
188 | ```
189 | Usage: java -jar HaploABBABABA_multithreaded.jar \
190 |   -i \
191 |   -p -c \
192 |   -o -t -b
193 | ```
194 |
195 |
196 |
197 |
--------------------------------------------------------------------------------
/addRecombRates.r:
--------------------------------------------------------------------------------
1 | #!/cluster/apps/r/3.1.2_openblas/x86_64/bin/Rscript --vanilla
2 |
3 | # Written by Joana Meier, May-2017
4 | # usage: addRecombRates.r
5 | # Generates a map file in plink format for the positions provided in the vcf file
6 |
7 | # Load the required packages
8 | require(data.table)
9 | require(gtools)
10 |
11 | # Read in the vcf file name and linkage map name (including path) from the command line argument
12 | args<-commandArgs(TRUE)
13 | file<-args[1]
14 | map<-args[2]
15 | prefix<-sub('\\.vcf.*', '', file)
16 |
17 | # fread(cmd=...) expects a shell command: gunzip gzipped files, otherwise stream the file with cat
18 | if(grepl("\\.gz$",file)) file=paste("gunzip -c ",file,sep="") else file=paste("cat ",file,sep="")
19 |
20 | # Read in the data (just the first 5 columns)
21 | data<-fread(cmd=file,header=T,skip="#CHR",select=c(1:5),data.table=F)
22 |
23 | # Add a column for recombination distances to the five columns with position information
24 | data<-cbind(data,"Morgan"=vector(length = length(data[,1])))
25 |
26 | # Read in the table of physical vs recombination distance
27 | recomb<-read.table(map,header=T,sep="\t")
28 |
29 | # For each chromosome separately, add the recombination distance
30 | # For unmapped scaffolds, use the empirical mean of 2cM/Mb
31 |
32 | chrom<-mixedsort(levels(as.factor(data[,1])))
33 |
34 | for(i in 1:length(chrom)){
35 |
36 |   # get the name of the ith chromosome/scaffold
37 |   chr=chrom[i]
38 |
39 |   # get the lines of the dataset of this specific chromosome
40 |   dataChr<-data[data[,1]==chr,]
41 |
42 |   # Get physical and recombination distances for that chromosome:
43 |   recChr<-recomb[recomb$CHROM==chr,]
44 |
  if(length(recChr$CHROM)>0){
45 |     d<-smooth.spline(recChr$POS,recChr$cM,spar=0.7)
46 |     recPos<-predict(d,dataChr[,2])$y
47 |     recPos[recPos<0]<-0
48 |
49 |     # Make the recombination distance monotonically increasing:
50 |     for(i in 2:length(recPos)){
51 |       if(recPos[i]
            int(reads[1]):
108 |                 genotype[0]=alleles[0]+"/"+alleles[0]
109 |             else:
110 |                 genotype[0]=alleles[1]+"/"+alleles[1]
111 |
112 |             result.append(":".join(genotype))
113 |
114 |         # if -two is specified, set genotypes failing threshold as missing, and those failing threshold2 as homozygous
115 |         elif args.twoSteps:
116 |             if pval>threshold2:
117 |                 result.append("./.")
118 |             else:
119 |                 if int(reads[0])>int(reads[1]):
120 |                     genotype[0]=alleles[0]+"/"+alleles[0]
121 |                 else:
122 |                     genotype[0]=alleles[1]+"/"+alleles[1]
123 |                 result.append(":".join(genotype))
124 |         else:
125 |             exit("either -hom or -excl or -two is required, please specify how failing genotypes should be handled")
126 |
127 |     # If the binomial test is not significant but the allelic ratio is very small (possible at low depth)
128 |     elif float(reads[0])/float(reads[1])
> $file.fineSTRphase.$4
13 |     grep -v "^#" $file.vcf | cut -f $ind | cut -d"|" -f 2 | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/ //g' >> $file.fineSTRphase.$4
14 |   done
15 |
16 | elif [ -f $file.vcf.gz ]; then
17 |   for (( ind=$1+9; ind<=$2+9; ind++ ))
18 |   do
19 |     zgrep -v "^#" $file.vcf.gz | cut -f $ind | cut -d"|" -f 1 | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/ //g' >> $file.fineSTRphase.$4
20 |     zgrep -v "^#" $file.vcf.gz | cut -f $ind | cut -d"|" -f 2 | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/ //g' >> $file.fineSTRphase.$4
21 |   done
22 |
23 | else
24 |   echo "file $file.vcf[.gz] does not exist!"
25 |
26 | fi
27 |
--------------------------------------------------------------------------------
/convertVCFtoEigenstrat.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Script to convert vcf to eigenstrat format for ADMIXTOOLS
4 | # Written by Joana Meier
5 | # It takes a single argument: the vcf file (can be gzipped) and
6 | # optionally you can specify --renameScaff if you have scaffold names (not chr1, chr2...)
7 |
8 | # Here, you can change the recombination rate which is currently set to 2 cM/Mb
9 | rec=2
10 |
11 | # It requires vcftools and admixtools
12 |
13 | # on some clusters, these modules need to be loaded first:
14 | # module load gcc/4.8.2 vcftools openblas/0.2.13_seq perl/5.18.4 admixtools
15 |
16 | renameScaff="FALSE"
17 |
18 | # If help is requested
19 | if [[ $1 == "-h" ]]
20 | then
21 |   echo "Please provide the vcf file to parse, and optionally add --renameScaff if you have scaffolds instead of chromosomes"
22 |   echo "Usage: convertVCFtoEigenstrat.sh --renameScaff (note, the second argument is optional)"
23 |   exit 0
24 |
25 | # If the second argument --renameScaff is given, set renameScaff to TRUE
26 | elif [[ $2 == "--renameScaff" ]]
27 | then
28 |   renameScaff="TRUE"
29 |
30 | # If no argument is given or the second one is not --renameScaff, give information and quit
31 | elif [ $# -ne 1 ]
32 | then
33 |   echo "Please provide the vcf file to parse, and optionally add --renameScaff if you have scaffolds instead of chromosomes"
34 |   echo "Usage: ./convertVCFtoEigenstrat.sh --renameScaff (note, the second argument is optional)"
35 |   exit 1
36 | fi
37 |
38 | # Set the first argument to the file name
39 | file=$1
40 | file=${file%.gz}
41 | file=${file%.vcf}
42 |
43 |
44 | # if the vcf file is gzipped:
45 |
46 | if [ -s $file.vcf.gz ]
47 | then
48 |
49 |   # If renaming of scaffolds is requested, set all chromosome/scaffold names to 1
50 |   if [ $renameScaff == "TRUE" ]
51 |   then
52 |     echo "setting scaffold
names to 1 and positions to additive numbers"
53 |     zcat $file".vcf.gz" | awk 'BEGIN {OFS = "\t";add=0;lastPos=0;scaff=""}{
54 |       if($1!~/^#/){
55 |         if($1!=scaff){add=lastPos;scaff=$1}
56 |         $1="1"
57 |         $2=$2+add
58 |         lastPos=$2
59 |       }
60 |       print $0}' | gzip > $file.renamedScaff.vcf.gz
61 |
62 |     # Get a .map and .ped file (remove multiallelic SNPs, monomorphic sites and indels)
63 |     vcftools --gzvcf $file".renamedScaff.vcf.gz" \
64 |       --plink --mac 1.0 --remove-indels --max-alleles 2 --out $file
65 |
66 |   else
67 |     # Get a .map and .ped file (remove multiallelic SNPs, monomorphic sites and indels)
68 |     vcftools --gzvcf $file".vcf.gz" \
69 |       --plink --mac 1.0 --remove-indels --max-alleles 2 --out $file
70 |   fi
71 |
72 | # if the file is not gzipped
73 | else
74 |   # If renaming of scaffolds is requested, set all chromosome/scaffold names to 1
75 |   if [ $renameScaff == "TRUE" ]
76 |   then
77 |     echo "setting scaffold names to 1 and positions to additive numbers"
78 |     awk 'BEGIN {OFS = "\t";add=0;lastPos=0;scaff=""}{
79 |       if($1!~/^#/){
80 |         if($1!=scaff){add=lastPos;scaff=$1}
81 |         $1="1"
82 |         $2=$2+add
83 |         lastPos=$2
84 |       }
85 |       print $0}' $file.vcf | gzip > $file.renamedScaff.vcf.gz
86 |
87 |     # Get a .map and .ped file (remove multiallelic SNPs, monomorphic sites and indels)
88 |     vcftools --gzvcf $file".renamedScaff.vcf.gz" \
89 |       --plink --mac 1.0 --remove-indels --max-alleles 2 --out $file
90 |   else
91 |     vcftools --vcf $file".vcf" --plink --mac 1.0 --remove-indels --max-alleles 2 --out $file
92 |   fi
93 | fi
94 |
95 |
96 | # Change the .map file to match the requirements of ADMIXTOOLS by adding fake Morgan positions (assuming a recombination rate of 2 cM/Mbp)
97 | awk -F"\t" -v rec=$rec 'BEGIN{scaff="";add=0}{
98 |   split($2,newScaff,":")
99 |   if(!match(newScaff[1],scaff)){
100 |     scaff=newScaff[1]
101 |     add=lastPos
102 |   }
103 |   pos=add+$4
104 |   count+=0.00000001*rec*(pos-lastPos)
105 |   print newScaff[1]"\t"$2"\t"count"\t"pos
106 |   lastPos=pos
107 | }' ${file}.map | sed
's/^chr//' > better.map
108 | mv better.map ${file}.map
109 |
110 | # Change the .ped file to match the ADMIXTOOLS requirements
111 | awk 'BEGIN{ind=1}{printf ind"\t"$2"\t0\t0\t0\t1\t";
112 |   for(i=7;i<=NF;++i) printf $i"\t";ind++;printf "\n"}' ${file}.ped > tmp.ped
113 | mv tmp.ped ${file}.ped
114 |
115 |
116 | # create an inputfile for convertf
117 | echo "genotypename: ${file}.ped" > par.PED.EIGENSTRAT.${file}
118 | echo "snpname: ${file}.map" >> par.PED.EIGENSTRAT.${file}
119 | echo "indivname: ${file}.ped" >> par.PED.EIGENSTRAT.${file}
120 | echo "outputformat: EIGENSTRAT" >> par.PED.EIGENSTRAT.${file}
121 | echo "genotypeoutname: ${file}.eigenstratgeno" >> par.PED.EIGENSTRAT.${file}
122 | echo "snpoutname: ${file}.snp" >> par.PED.EIGENSTRAT.${file}
123 | echo "indivoutname: ${file}.ind" >> par.PED.EIGENSTRAT.${file}
124 | echo "familynames: NO" >> par.PED.EIGENSTRAT.${file}
125 |
126 |
127 | # Use CONVERTF to parse PED to eigenstrat
128 | convertf -p par.PED.EIGENSTRAT.${file}
129 |
130 | # change the snp file for ADMIXTOOLS:
131 | awk 'BEGIN{i=0}{i=i+1; print $1"\t"$2"\t"$3"\t"i"\t"$5"\t"$6}' $file.snp > $file.snp.tmp
132 | mv $file.snp.tmp $file.snp
133 |
--------------------------------------------------------------------------------
/createRADmappingReport.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # calculates the number of reads, number of loci, mean sequencing depth and number of loci with min 10 reads and their mean depth
4 |
5 | # create the report.txt file with a line for each individual (each bam file) with read counts
6 | echo -e "sample\tsampleLib\tmappedReads" > report.txt
7 | for i in *${1}*.bam
8 | do
9 |   echo "counting reads in "$i
10 |   totalReads=`samtools view $i | wc -l`
11 |   echo -e ${i%.GQI*}"\t"${i%.bam}"\t"$totalReads >> report.txt
12 | done
13 |
14 | # creates a file with a line for each locus with mapped reads in each contig for each individual
15 |
16 | echo "sample
marker scaffold locus orientation depth" > seq_depth.txt
17 | echo "the following samples will be included: "
18 | echo *${1}*.bam
19 |
20 | for i in *${1}*.bam
21 | do
22 |   echo "working on "$i
23 |
24 |   samtools view $i | awk \
25 |   'BEGIN{
26 |     markerold=""
27 |     marker=""
28 |     scaffold=""
29 |     locus=""
30 |     orientation=""
31 |     depth=""
32 |     cigar=""
33 |     first="false"
34 |     sample=""
35 |   }
36 |   {
37 |     if($5>30){ # mapping quality MAPQ>30 to filter out badly aligned reads
38 |
39 |       # for the first scaffold:
40 |       if(scaffold==""){
41 |
42 |         # get the sample name
43 |         for(i=1;i<=NF;i++) if($i ~ /RG:Z:.*/){sample=$i}
44 |
45 |         # if reverse, check for indels and use the last site - 4
46 |         if($2==16){
47 |           b=0
48 |           cigar=$6
49 |           gsub(/M/,"+",cigar);gsub(/D/,"+",cigar);gsub(/[0-9]+I/,"",cigar);sub(/$/,"0",cigar)
50 |           split(cigar,indexx,"+");for (i in indexx) b+=indexx[i]
51 |           locus=$4+b-4
52 |         }
53 |         # if forward, set locus to the first base position
54 |         else{
55 |           locus=$4
56 |         }
57 |
58 |         # set the variables marker, scaffold, orientation and depth
59 |         marker=$3"_"locus"_"$2
60 |         scaffold=$3
61 |         orientation=$2
62 |         depth=1
63 |       }
64 |
65 |       # if the locus is from the same scaffold as the last one:
66 |       else if(scaffold==$3){
67 |         markerold=marker
68 |         locusold=locus
69 |
70 |         # for reverse reads
71 |         if($2==16){
72 |           b=0
73 |           cigar=$6
74 |           gsub(/M/,"+",cigar);gsub(/D/,"+",cigar);gsub(/[0-9]+I/,"",cigar);sub(/$/,"0",cigar)
75 |           split(cigar,indexx,"+");for (i in indexx) b+=indexx[i]
76 |           locus=$4+b-4
77 |         }
78 |         # for forward reads
79 |         else{
80 |           locus=$4
81 |         }
82 |         marker=$3"_"locus"_"$2
83 |
84 |         # if it is the same locus increase the variable depth by 1
85 |         if(marker==markerold){
86 |           depth+=1
87 |         }
88 |
89 |         # if it is a new locus:
90 |         else{
91 |           # print the information about the previous locus:
92 |           print sample,markerold,scaffold,locusold,orientation,depth
93 |
94 |           scaffold=$3
95 |           if($2==16){
96 |             b=0
97 |             cigar=$6
98 |
            gsub(/M/,"+",cigar);gsub(/D/,"+",cigar);gsub(/[0-9]+I/,"",cigar);sub(/$/,"0",cigar)
99 |             split(cigar,indexx,"+");for (i in indexx) b+=indexx[i]
100 |             locus=$4+b-4
101 |           }
102 |           else{
103 |             locus=$4
104 |           }
105 |           marker=$3"_"locus"_"$2
106 |           orientation=$2
107 |           depth=1
108 |         }
109 |       }
110 |
111 |       # if it is a new scaffold:
112 |       else{
113 |         print sample,marker,scaffold,locus,orientation,depth
114 |         scaffold=$3
115 |         if($2==16){
116 |           b=0
117 |           cigar=$6
118 |           gsub(/M/,"+",cigar);gsub(/D/,"+",cigar);gsub(/[0-9]+I/,"",cigar);sub(/$/,"0",cigar)
119 |           split(cigar,indexx,"+");for (i in indexx) b+=indexx[i]
120 |           locus=$4+b-4
121 |         }
122 |         else{
123 |           locus=$4
124 |         }
125 |         marker=$3"_"locus"_"$2
126 |         orientation=$2
127 |         depth=1
128 |       }
129 |     }
130 |   }END{
131 |     # print the information of the last locus
132 |     print sample,marker,scaffold,locus,orientation,depth
133 |   }' | sed 's/RG:Z://g' >> seq_depth.txt
134 | done
135 |
136 | awk '{if($6>9) print $0}' seq_depth.txt > seq_depth_min10.txt
137 |
138 |
139 | # get per locus information
140 |
141 | echo -e "locus\ttotalInds\ttotalReads" > contigs.txt
142 | cat seq_depth.txt | \
143 | awk '{
144 |   counter[$2]++
145 |   totReads[$2]+=$6
146 | }
147 | END{
148 |   for (i in counter) print i"\t"counter[i]"\t"totReads[i]
149 | }' >> contigs.txt
150 |
151 | # get per locus information (count only individuals with at least 10 reads)
152 |
153 | echo -e "locus\ttotalInds\ttotalReads" > lociMin10reads.txt
154 | cat seq_depth_min10.txt | \
155 | awk '{
156 |   counter[$2]++
157 |   totReads[$2]+=$6
158 | }
159 | END{
160 |   for (i in counter) print i"\t"counter[i]"\t"totReads[i]
161 | }' >> lociMin10reads.txt
162 |
163 |
164 |
165 | # create a file with lociN and meanDepth per ind
166 | echo -e "sample lociN meanDepth" > indInfo.txt
167 | awk '{
168 |   counter[$1]++
169 |   depth[$1]+=$6
170 | }
171 | END{
172 |   for(i in counter) print i,counter[i],depth[i]/counter[i]
173 | }' seq_depth.txt >> indInfo.txt
174 |
175 |
176 | # create a file with
# lociN and meanDepth per ind counting loci with min 10 reads
177 | echo -e "sample lociN meanDepth" > indInfoMin10reads.txt
178 | awk '{
179 |   counter[$1]++
180 |   depth[$1]+=$6
181 | }
182 | END{
183 |   for(i in counter) print i,counter[i],depth[i]/counter[i]
184 | }' seq_depth_min10.txt >> indInfoMin10reads.txt
185 |
186 |
187 |
188 | # sort the stats files skipping the header line (careful: "sample" cannot be included in any sample name)
189 | sort indInfoMin10reads.txt | awk '!/sample/' > sorted_indInfoMin10reads.txt
190 | sort indInfo.txt | awk '!/sample/' > indInfoSorted
191 | sort report.txt | awk '!/sample/' > sortedReport
192 |
193 |
194 | # create a file with Nreads, lociN, meanDepth... for each individual
195 | join -a 1 -a 2 sortedReport indInfoSorted -1 1 -2 1 > indInfoAdv.txt
196 | sort indInfoAdv.txt > indInfoAdvSorted
197 | echo "sample sampleLib mappedReads lociN meanDepth lociNmin10reads meanDepthMin10reads" > mappingReport.txt
198 | join -a 1 -a 2 indInfoAdvSorted sorted_indInfoMin10reads.txt -1 1 -2 1 >> mappingReport.txt
199 |
200 |
201 | # Remove files not needed anymore:
202 | rm indInfoAdvSorted indInfoAdv.txt sortedReport indInfoSorted sorted_indInfoMin10reads.txt contigs.txt lociMin10reads.txt indInfoMin10reads.txt indInfo.txt report.txt
203 |
204 | # rm seq_depth*
205 |
--------------------------------------------------------------------------------
/createuniformrecmap.r:
--------------------------------------------------------------------------------
1 | #!/cluster/apps/r/3.1.2_openblas/x86_64/bin/Rscript --vanilla
2 |
3 | # Written by Joana Meier, May-2017
4 | # usage: createuniformrecmap.r
5 | # Generates a map file in plink format for the positions provided in the vcf file
6 | # if no linkage map is available, just write "none" instead and a uniform linkage map will be produced
7 |
8 | # Load the required packages
9 | require(data.table)
10 | require(gtools)
11 |
12 | # Read in the vcf file name and linkage map name (including path) from the
# command line argument
13 | args<-commandArgs(TRUE)
14 |
15 | file<-args[1]
16 | prefix<-sub('\\.vcf.*', '', file)
17 |
18 | # If a linkage map is provided, read it in
19 | if(file.exists(args[2])) map<-args[2]
20 |
21 | # fread(cmd=...) expects a shell command: gunzip gzipped files, otherwise stream the file with cat
22 | if(grepl("\\.gz$",file)) file=paste("gunzip -c ",file,sep="") else file=paste("cat ",file,sep="")
23 |
24 | # Read in the data (just the first 5 columns)
25 | data<-fread(cmd=file,header=T,skip="#CHR",select=c(1:5),data.table=F)
26 |
27 | # Add a column for recombination distances to the five columns with position information
28 | data<-cbind(data,"Morgan"=vector(length = length(data[,1])))
29 |
30 | # Read in the table of physical vs recombination distance
31 | recomb<-read.table(map,header=T,sep="\t")
32 |
33 | # For each chromosome separately, add the recombination distance
34 | # For unmapped scaffolds, use the empirical mean of 2cM/Mb
35 |
36 | chrom<-mixedsort(levels(as.factor(data[,1])))
37 |
38 | for(i in 1:length(chrom)){
39 |
40 |   # get the name of the ith chromosome/scaffold
41 |   chr=chrom[i]
42 |
43 |   # get the lines of the dataset of this specific chromosome
44 |   dataChr<-data[data[,1]==chr,]
45 |
46 |   # Get physical and recombination distances for that chromosome:
47 |   recChr<-recomb[recomb$CHROM==chr,]
48 |   if(length(recChr$CHROM)>0){
49 |     d<-smooth.spline(recChr$POS,recChr$cM,spar=0.7)
50 |     recPos<-predict(d,dataChr[,2])$y
51 |     recPos[recPos<0]<-0
52 |
53 |     # Make the recombination distance monotonically increasing:
54 |     for(i in 2:length(recPos)){
55 |       if(recPos[i]
"tab"
11 |
12 | # created by Joana Meier, July 2014
13 |
14 |
15 | # Load stacks (needs to be adjusted according to the cluster/server setup):
16 | module load gcc/4.8.2 gdc perl/5.18.4 stacks/1.40
17 |
18 | # This script assumes that process_radtags is executable from the current directory.
# If that is not the case, specify the path here:
19 | path=""
20 |
21 |
22 | #*******************************************
23 | # Check correct usage of the bash script
24 | #*******************************************
25 |
26 | # Check if help is requested
27 | if [[ $1 = "-h" ]]; then
28 |   echo -e "\nUsage: demultiplexing.sh readsFastqFile barcodesFile outputFolder trimmingLength"
29 |   echo -e "\nThe fastq file can be gzipped or not"
30 |   echo -e "\nNote: the barcodes file should contain per line a fishecRADid, then a tab, and a barcode of 5-8 bp length"
31 |   echo -e "\nNote: If you set the trimming value to a value exceeding the read length, it will discard all reads as low quality"
32 |   exit 0
33 | fi
34 |
35 | # Check if the right number of arguments is provided
36 | if (( "$#" != 4 )); then
37 |   echo -e "\nPlease provide 4 arguments: reads.fastq barcodes.file path/to/outputFiles trimmingLength"
38 |   exit 1
39 | fi
40 |
41 | # Check if the fastq file exists and contains data
42 | if [[ ! -s $1 ]]
43 | then
44 |   echo -e "\n"$1" not found or empty. Please provide file with reads to demultiplex"
45 |   exit 1
46 | fi
47 |
48 | # Check if the barcodes file exists and contains data
49 | if [[ ! -s $2 ]]
50 | then
51 |   echo -e "\n"$2" not found or empty. Please provide file with barcodes"
52 |   exit 1
53 | fi
54 |
55 | # Check if the given output path exists
56 | if [[ ! -d $3 ]]
57 | then
58 |   echo -e "\noutput path "$3" does not exist!
Please provide a valid path"
59 |   exit 1
60 | fi
61 |
62 | # Check that the barcodes file contains barcodes in the second column
63 | if [[ $(awk '/[[:graph:]]/ {if(!match($2,/[ATGC][ATGC][ATGC][ATGC]+/)) print $2}' $2 | wc -l ) -gt 0 ]]
64 | then
65 |   echo -e "\n$(awk '/[[:graph:]]/ {if(!match($2,/[ATGC][ATGC][ATGC][ATGC]+/)) print $2}' $2 | head -1) is not a barcode!\n"
66 |   echo "Please provide a barcode file in the following format: fishecRADid barcode"
67 |   exit 1
68 | fi
69 |
70 |
71 | #************************************
72 | # Check if the input file is zipped
73 | #************************************
74 |
75 | if [[ $1 == *"gz" ]]
76 | then
77 |   format=gzfastq
78 | else
79 |   format=fastq
80 | fi
81 |
82 |
83 | #**************************
84 | # Read in arguments
85 | #**************************
86 |
87 | readsFile=$1
88 | barcodes=$2
89 | out=$3
90 | # if present, remove slash at the end of the path name
91 | out=`echo $out | sed -e 's-/$--'`
92 | readsPath="."
93 | trim=$4
94 |
95 |
96 | #*******************************************************
97 | # in case that there are DOS line endings, remove them
98 | #*******************************************************
99 |
100 | dos2unix -q -k $barcodes
101 |
102 |
103 |
104 | #************************************************************
105 | # demultiplex the reads starting with the longest barcodes
106 | #************************************************************
107 |
108 | for i in 8 7 6 5
109 | do
110 |   # print all barcodes of length $i into a temporary barcodes file
111 |   awk -v len=$i '{if(length($2)==len) print $2}' $barcodes > $barcodes"_"$i".tmp"
112 |
113 |   # check if the temporary barcodes file is empty
114 |   if [[ !
  -s $barcodes"_"$i".tmp" ]]
115 |   then
116 |     rm $barcodes"_"$i".tmp"
117 |     echo -e "\nno barcodes of "$i"bp length"
118 |
119 |   # if the temporary barcodes file contains barcodes: extract reads with these barcodes
120 |   else
121 |     echo -e "\ndemultiplexing reads for "$i"bp long barcodes"
122 |
123 |     # demultiplex the reads with the temporary barcodes file
124 |     ${path}process_radtags \
125 |       -f $readsPath"/"$readsFile \
126 |       -o $out \
127 |       -t $trim \
128 |       -b $barcodes"_"$i".tmp" \
129 |       -i $format -r -e "sbfI" -E 'phred33' -D
130 |
131 |     # rename the process_radtags log file to prevent overwriting
132 |     mv $out/process_radtags.log $out/process_radtags_$i"bp.log"
133 |
134 |     # remove temporary barcodes file
135 |     rm $barcodes"_"$i".tmp"
136 |
137 |     # change the readsFileName and path
138 |     readsFile=$readsFile".discards"
139 |     readsPath=$out
140 |     format="fastq"
141 |
142 |   fi
143 | done
144 |
145 |
146 | #************************************************************
147 | # Suggest next steps to the user
148 | #********************************************************
149 |
150 | echo -e "\n\n finished!"
151 | echo "Check the process_radtags log files for additional barcodes"
152 | echo "If everything looks ok, delete the .discards file"
153 | echo "To rename the files and recover the header information, use: header_recoverfromstacks.sh on the Github page of David Marques"
154 |
--------------------------------------------------------------------------------
/filter_vcf.py:
--------------------------------------------------------------------------------
1 | #!
/usr/bin/env python 2 | 3 | # Authors: Samuel Wittwer, Joana Meier, David Marques, Irene Keller 4 | 5 | from sys import * 6 | import os, time, argparse, re, numpy 7 | from collections import defaultdict 8 | 9 | parser = argparse.ArgumentParser(description='Filter all sites in the given VCF file') 10 | 11 | parser.add_argument('-i', '--input', dest='i', help="input file in vcf format [required]", required=True) 12 | parser.add_argument('-o', '--output', dest='o', help="output file [required]", required=True) 13 | parser.add_argument('-q', '--QUAL', dest='qual', type=int, help="minimum quality of the SNP (QUAL field of vcf) [default 30]", default=30) 14 | parser.add_argument('-p', '--minGQ', dest='gq', type=int, help="minimum quality for each single genotype (GQ of genotype) [default 30]", default=30) 15 | parser.add_argument('-d', '--readDepth', dest='depth', type=int, help="minimal read depth per single genotype (DP of genotype) [default 1]", default=1) 16 | parser.add_argument('-n', '--minInd', dest='minInd', type=int, help="minimal number of individuals with genotype per site [default 1]", default=1) 17 | parser.add_argument('-m', '--minAllele', dest='minAllele', type=int, help="minimum number of times an allele must be observed [default 1]", default=1) 18 | parser.add_argument('-v', '--verbose', action='store_true', help="if -v is specified, progress info will be written to stdout (slows the program down)", default=True) 19 | parser.add_argument('-rindel', '--removeIndels', action='store_true', help="if -rindel is specified, indels will be filtered", default=False) 20 | parser.add_argument('-risnp', '--removeIndelSNPs', action='store_true', help="if -risnp is specified, SNPs around indels (distance specified by -idist) will be filtered", default=False) 21 | parser.add_argument('-idist', '--indelDist', type=int, help="SNPs at the given distance or smaller from indels will be filtered if -risnp given as additional argument [default 10]", default=10) 22 | 
parser.add_argument('-rmono', '--removeMono', action='store_true', help="if -rmono is specified, monomorphic sites (except SNPs to the reference only) will be filtered", default=False) 23 | parser.add_argument('-rmsnp', '--removeMultiallelicSNPs', action='store_true', help="if -rmsnp is specified, multiallelic (>2 alleles) SNPs will be filtered", default=False) 24 | parser.add_argument('-rrsnp', '--removeRefSNPs', action='store_true', help="if -rrsnp is specified, SNPs to the reference only (=monomorphic within sample) will be filtered", default=False) 25 | parser.add_argument('-np', '--minIndPop', dest='minIndPop', type=int, help="minimum number of individuals per population-code", default=0) 26 | parser.add_argument('-e', '--exclude', dest='e', help="exclude popcodes containing the following string(s) from minIndPop-filtering", required=False, default=["!!!!!!!!!"], nargs='*') 27 | 28 | args = parser.parse_args() 29 | 30 | if args.verbose: 31 | print "\n\tfiltering the VCF file with QUAL>=%d, GQ>=%d, DP>=%d, minInd>=%d, minAllele>=%d" % (args.qual,args.gq,args.depth,args.minInd,args.minAllele) 32 | if args.minIndPop>0 and args.e==["!!!!!!!!!"]: 33 | print"\t + sites with <%d genotypes per population-code" %(args.minIndPop) 34 | if args.minIndPop>0 and args.e!=["!!!!!!!!!"]: 35 | print"\t + sites with <%d genotypes per population-code (except for populations containing the string %s) will be filtered" %(args.minIndPop,args.e) 36 | if args.removeIndels: 37 | print"\t + all indels will be filtered" 38 | if args.removeIndelSNPs: 39 | print"\t + SNPs %d bp near indels will be filtered" %(args.indelDist) 40 | if args.removeMono: 41 | print"\t + all monomorphic sites (except SNPs to the reference only) will be filtered" 42 | if args.removeRefSNPs: 43 | print"\t + all SNPs to the reference only (monomorphic within sample) will be filtered" 44 | if args.removeMultiallelicSNPs: 45 | print"\t + all multiallelic (>2 alleles) SNPs will be filtered" 46 | 47 | 
filesize=os.stat(args.i)[6] 48 | readbytes=0 49 | 50 | inputF=open(args.i,'r') 51 | outputF=open(args.o, 'w') 52 | 53 | LineNumber=0 54 | relLineNumber=0 55 | prevScaffold="" 56 | pos=0 57 | prevPos=0 58 | sites=[] 59 | positions=[] 60 | indel=False 61 | snp=False 62 | exclu=0 63 | if args.minIndPop>0 and args.e!=["!!!!!!!!!"]: 64 | exclu=1 65 | SNPtomonoFail=False 66 | snpCounter=0 67 | refsnpCounter=0 68 | multiallelicCounter=0 69 | indelCounter=0 70 | monoCounter=0 71 | baseCounter=0 72 | printedBases=0 73 | indelPos=1000000000000 74 | indelDist=100000000000 75 | 76 | #failed sites counters: 77 | qualFail=0 78 | SNPtomonoCount=0 79 | minIndFail=0 80 | minIndPopFail=0 81 | minAlleleFail=0 82 | nearIndelSNPFail=0 83 | refsnpfiltered=0 84 | multiallelicfiltered=0 85 | monofiltered=0 86 | snpsPassed=0 87 | refsnpsPassed=0 88 | multiallelicPassed=0 89 | indelsretained=0 90 | 91 | for Line in inputF: 92 | # DATA SECTION: clause checks if the header section is over 93 | if re.match('^#',Line) is None: 94 | # REPORTING TO STDOUT: writes out every millionth line number and the percentage processed if -v selected 95 | relLineNumber+=1 96 | readbytes+=len(Line) 97 | done=str(int((float(readbytes)/filesize)*100)) 98 | if args.verbose and relLineNumber>9999: 99 | LineNumber+=10000 100 | stdout.write("Line %d, %s%% done %s"%(LineNumber,done,"\r")) 101 | stdout.flush() 102 | time.sleep(1) 103 | relLineNumber=0 104 | 105 | columns=Line.strip("\n").split("\t") 106 | scaffold=columns[0] 107 | pos=int(columns[1]) 108 | 109 | if scaffold == prevScaffold: 110 | indelDist+=(pos-prevPos) 111 | else: 112 | indelDist=10000000000000 # if it's a new scaffold prevent effects of indels in the previous scaffold 113 | 114 | # QUAL filtering - checks if overall quality of the site is good (-q) 115 | # If filtered, a site is not printed. 116 | if columns[5]=="." 
or float(columns[5])>=args.qual: 117 | result=columns[0:9] 118 | genotypecolumns=range(9,len(columns)) 119 | indCounter = 0 120 | alleleCounter = 0 121 | countOne=0 122 | countZero=0 123 | countTwo=0 124 | countThree=0 125 | refsnp = False 126 | snp = False 127 | indel = False 128 | SNPtomonoFail=False 129 | multiallelic=False 130 | minIndAlleleCheck = False 131 | 132 | # GQ/DP filtering - checks if genotype quality (-p) and depth (-d) are sufficient for each single genotype. 133 | # If filtered out, genotype is set to missing "./." 134 | for ind in genotypecolumns: 135 | genotype=columns[ind] 136 | genotype=genotype.split(":") 137 | 138 | # Case 1: Genotype missing "./." or "./.:.:4" or "0/0" without further information -> no filtering necessary 139 | if ('./.' in genotype) or (genotype==["0/0"]): 140 | result.append("./.") 141 | 142 | # Case 2: Non-variant site equal to reference, has only 2 fields (GT:DP) -> only depth filtering possible 143 | elif len(genotype)==2: 144 | if float(genotype[1])>=args.depth: 145 | result.append(columns[ind]) 146 | indCounter+=1 147 | countZero+=1 148 | else: 149 | result.append("./.") 150 | 151 | # Case 3: SNP/indel against on reference, but non-variant within samples, has only 3 fields (GT:AD:DP) -> only depth filtering possible 152 | elif len(genotype)==3: 153 | if float(genotype[2])>=args.depth: 154 | result.append(columns[ind]) 155 | indCounter+=1 156 | countOne+=1 157 | else: 158 | result.append("./.") 159 | 160 | # Case 4: SNP/indel with genotype qualities, > 3 fields present (GT:AD:DP:GQ:PL) -> genotype quality and depth filtering 161 | elif len(genotype)>3: 162 | if genotype[2]!=".": 163 | if float(genotype[2])>=args.depth: 164 | if float(genotype[3])>=args.gq: 165 | result.append(columns[ind]) 166 | indCounter+=1 167 | if "/" in genotype[0]: 168 | alleles=genotype[0].split("/") # counts the occurrences of each allele (up to 3 alt alleles -> LIMITATION) 169 | elif "|" in genotype[0]: 170 | alleles=genotype[0].split("|") # 
counts the occurrences of each allele (up to 3 alt alleles -> LIMITATION) 171 | if int(alleles[0])==0: 172 | countZero+=1 173 | elif int(alleles[0])==1: 174 | countOne+=1 175 | elif int(alleles[0])==2: 176 | countTwo+=1 177 | elif int(alleles[0])==3: 178 | countThree+=1 179 | if int(alleles[1])==0: 180 | countZero+=1 181 | elif int(alleles[1])==1: 182 | countOne+=1 183 | elif int(alleles[1])==2: 184 | countTwo+=1 185 | elif int(alleles[1])==3: 186 | countThree+=1 187 | else: 188 | result.append("./.") 189 | else: 190 | result.append("./.") 191 | else: 192 | result.append("./.") 193 | # allele counts after genotype-based filtering 194 | countArray=[countZero, countOne, countTwo, countThree] 195 | 196 | # Identification of indels, SNPs and refSNPs 197 | # also counts bases, indels, SNPs and refSNPs 198 | baseCounter+=1 199 | if len(columns[3])>1: # in case of a deletion (reference allele is more than one base) -> indel 200 | indel = True 201 | indelPos = pos 202 | elif len(columns[4])>1: # could be an indel or >1 alternative alleles 203 | alleles = columns[4].split(",") # if there are more than 1 alternative alleles, they are separated by commas 204 | for alt in alleles: 205 | if len(alt)>1: # in case of an insertion (more than one base in alt allele) -> indel 206 | indel = True 207 | indelPos = pos 208 | snp = False 209 | break 210 | if indel: 211 | indelCounter+=1 212 | elif (columns[4]!="."): 213 | snp = True 214 | snpCounter+=1 215 | if (len(columns[3])>1 or len(columns[4])>1): 216 | multiallelic = True 217 | multiallelicCounter+=1 218 | # In case variable genotypes got filtered (SNP->monomorphic) or SNP is a refSNP (SNP against reference, but monomorphic within sample) 219 | if (sum(i>0 for i in countArray)<2): 220 | if countArray[0]>0: 221 | snp = False 222 | SNPtomonoFail = True 223 | else: 224 | refsnp = True 225 | refsnpCounter+=1 226 | else: 227 | monoCounter+=1 228 | 229 | # minIND and minAllele filtering (-n & -m), only for non-INDELs 230 | # Checks that 
there are enough individuals that passed and that there are enough alternative allele counts 231 | if indel: 232 | minIndAlleleCheck = True 233 | elif indCounter >= args.minInd: 234 | if args.minIndPop>0: 235 | for m in range(exclu,len(NbrPerSpecies)): 236 | present=0 237 | absent=0 238 | for n in range(0,len(result)-9): 239 | if levels[n]==m: 240 | if './.' in result[n+9]: 241 | absent+=1 242 | else: 243 | present +=1 244 | if (present>=args.minIndPop) or (absent==0): 245 | continue 246 | else: 247 | break 248 | if m==(len(NbrPerSpecies)-1) and (present>=args.minIndPop or absent==0): # checks if the loop was exited early. Check of present/absent is necessary in case the last factor level is the only one with insufficiant number of genotypes 249 | if (not snp): 250 | minIndAlleleCheck = True 251 | elif (snp and (sum(i>=args.minAllele for i in countArray) >= sum(i>0 for i in countArray))): 252 | minIndAlleleCheck = True 253 | snpsPassed+=1 254 | if refsnp: 255 | refsnpsPassed+=1 256 | if multiallelic: 257 | multiallelicPassed+=1 258 | else: 259 | minAlleleFail+=1 260 | else: 261 | minIndAlleleCheck = False 262 | minIndPopFail+=1 263 | else: 264 | if (not snp): 265 | minIndAlleleCheck = True 266 | elif (snp and (sum(i>=args.minAllele for i in countArray) >= sum(i>0 for i in countArray))): 267 | minIndAlleleCheck = True 268 | snpsPassed+=1 269 | if refsnp: 270 | refsnpsPassed+=1 271 | if multiallelic: 272 | multiallelicPassed+=1 273 | 274 | else: 275 | minAlleleFail+=1 276 | else: 277 | minIndFail+=1 278 | 279 | # Counts the number of SNPs that were converted to monomorphic loci and that passed quality filtering. 
280 | if SNPtomonoFail and minIndAlleleCheck: 281 | SNPtomonoCount+=1 282 | 283 | # Direct OUTPUT without INDEL-proximity SNP filter: Writes locus to file, if -risnp option is not specified 284 | if(not args.removeIndelSNPs) and minIndAlleleCheck: 285 | if indel: 286 | if (not args.removeIndels): 287 | outputF.write('\t'.join(result)+"\n") 288 | printedBases+=1 289 | indelsretained+=1 290 | elif (not snp): 291 | if args.removeMono: 292 | monofiltered+=1 293 | else: 294 | outputF.write('\t'.join(result)+"\n") 295 | printedBases+=1 296 | elif snp: 297 | if refsnp and args.removeRefSNPs: 298 | refsnpfiltered+=1 299 | elif multiallelic and args.removeMultiallelicSNPs: 300 | multiallelicfiltered+=1 301 | else: 302 | outputF.write('\t'.join(result)+"\n") 303 | printedBases+=1 304 | 305 | # Looped OUTPUT with INDEL-proximity SNP filter: If option -ri is specified, INDEL-proximity SNPs are filtered 306 | elif args.removeIndelSNPs and minIndAlleleCheck: 307 | # start over if scaffold is new -> write out former sites & start memorizing new sites 308 | if scaffold!=prevScaffold: # if it's a new scaffold, write out the previous lines and empty the lists 309 | if len(sites)>0: # to not create an empty line if nothing contained in sites 310 | outputF.write('\n'.join(sites)+"\n") 311 | printedBases+=len(sites) 312 | del sites[:] 313 | del positions[:] 314 | if indel: 315 | indelDist=0 316 | if (not args.removeIndels): 317 | outputF.write('\t'.join(result)+"\n") #print '\t'.join(result)+"\n" 318 | printedBases+=1 319 | indelsretained+=1 320 | else: # if it's not an indel 321 | if (not args.removeMono) and (not args.removeRefSNPs) and (not args.removeMultiallelicSNPs): 322 | sites.append('\t'.join(result)) 323 | if snp: 324 | positions.append(pos) 325 | else: 326 | positions.append(-1000) # to force printing of monomorphic sites even near indels 327 | elif (not snp): 328 | if args.removeMono: 329 | monofiltered+=1 330 | else: 331 | sites.append('\t'.join(result)) 332 | 
positions.append(-1000) 333 | elif snp: 334 | if refsnp and args.removeRefSNPs: 335 | refsnpfiltered+=1 336 | elif multiallelic and args.removeMultiallelicSNPs: 337 | multiallelicfiltered+=1 338 | else: 339 | sites.append('\t'.join(result)) 340 | positions.append(pos) 341 | 342 | # If current site is an INDEL, writes out non-SNPs and SNPs outside the proximity limit 343 | elif indel: # if the current site is an indel: write out previous lines that are distant enough and empty the lists 344 | indelDist=0 345 | for index in range(len(positions)): 346 | if positions[index]<(indelPos-args.indelDist): 347 | outputF.write(sites[index]+"\n") #print sites[index]+"\n" # 348 | printedBases+=1 349 | else: 350 | nearIndelSNPFail+=1 351 | if (not args.removeIndels): 352 | outputF.write('\t'.join(result)+"\n") #print '\t'.join(result)+"\n" 353 | printedBases+=1 354 | indelsretained+=1 355 | del sites[:] 356 | del positions[:] 357 | 358 | # For current site outside proximity range of INDEL, appends to "sites" and "positions" 359 | elif indelDist>args.indelDist: # if the distance to the nearest indel is large enough 360 | if len(sites)>args.indelDist: # if there are enough sites in the list, write out the first site and remove it 361 | outputF.write(sites[0]+"\n") # print sites[0]+"\n" # 362 | printedBases+=1 363 | sites.pop(0) 364 | positions.pop(0) 365 | if (not args.removeMono) and (not args.removeRefSNPs) and (not args.removeMultiallelicSNPs): 366 | sites.append('\t'.join(result)) 367 | if snp: 368 | positions.append(pos) 369 | else: 370 | positions.append(-1000) # to force printing of monomorphic sites even near indels 371 | elif (not snp): 372 | if args.removeMono: 373 | monofiltered+=1 374 | else: 375 | sites.append('\t'.join(result)) 376 | positions.append(-1000) 377 | elif snp: 378 | if refsnp and args.removeRefSNPs: 379 | refsnpfiltered+=1 380 | elif multiallelic and args.removeMultiallelicSNPs: 381 | multiallelicfiltered+=1 382 | else: 383 | 
sites.append('\t'.join(result)) 384 | positions.append(pos) 385 | 386 | # right after the indel, prints only non-SNPs to the sites variable 387 | elif (not snp): 388 | if args.removeMono: 389 | monofiltered+=1 390 | else: 391 | sites.append('\t'.join(result)) 392 | positions.append(-1000) 393 | 394 | # Counts filtered SNPs because of INDEL proximity 395 | else: 396 | if refsnp and args.removeRefSNPs: 397 | refsnpfiltered+=1 398 | elif multiallelic and args.removeMultiallelicSNPs: 399 | multiallelicfiltered+=1 400 | else: 401 | nearIndelSNPFail+=1 402 | else: 403 | qualFail+=1 404 | 405 | prevScaffold=scaffold 406 | prevPos=pos 407 | 408 | 409 | # HEADER 410 | # check if it is a header line and if yes, write it to the output file 411 | else: # writes header information into outfile 412 | outputF.write(Line) 413 | if re.match('^##',Line) is None: 414 | if args.minIndPop>0: 415 | header=Line.strip("\n").split("\t") 416 | header=header[9:len(header)] # header now contains all the individual IDs 417 | species=[] 418 | excluded=[] 419 | for i in range(0,len(header)): 420 | ID=header[i] 421 | sub=ID.split('.')[1] # extracts popcode from our standard RAD naming convention, i.e. "FishecNumber"."popcode"."library" 422 | regex=re.compile("|".join(args.e)) # search pattern for excluded popcode (-e option) 423 | if regex.search(sub) is not None: 424 | species.append("!") # gives "!" value to excluded popuations -> this gives excluded popcodes the first factor position in the function categorical 425 | excluded.append(sub) 426 | else: 427 | species.append(sub) 428 | a=numpy.array(species) 429 | poplist=list(set(a)) 430 | levels=[dict([list(reversed(i)) for i in list(enumerate(sorted(list(set(a)))))])[j] for j in a] 431 | NbrPerSpecies=defaultdict(int) 432 | for k in levels: 433 | NbrPerSpecies[k]+=1 # NbrPerSpecies contains counts of how often each factor level is observed. There must be a simpler way to do this. 
(counting nbr of columns in design matrix produced by categorical should do the trick) 434 | if len(NbrPerSpecies)==1: # Aborts with warning message if all samples were excluded from population-based filtering by -e option 435 | outputF.close() 436 | os.remove(args.o) 437 | exit("\n\tWARNING: all samples were excluded from filtering for minimum number of individuals per population-code [-e option] \n\t\tor the sample names are not in standard format [fishecnumber.popcode.library]\n\tFiltering aborted! Disable population-code based filtering (-np 0) or choose another -e value!") 438 | del header 439 | del ID 440 | del sub 441 | del regex 442 | del a 443 | 444 | 445 | # Outputs the site variable after the last line in the input, if the Indel-associated SNP removal option was chosen 446 | if args.removeIndelSNPs: 447 | for index in range(len(positions)): 448 | outputF.write(sites[index]+"\n") #print sites[index]+"\n" # 449 | printedBases+=1 450 | 451 | if args.verbose: 452 | print "\n\tInput file contains: \n\t %d\ttotal sites with QUAL>=%d, consisting of:" %(baseCounter,args.qual) 453 | print "\t %d\tmonomorphic sites" %(monoCounter) 454 | print "\t %d\tSNPs (%d SNPs to the reference only and %d SNPs polymorphic in sample, incl. %d multiallelic SNPs)" %(snpCounter,refsnpCounter,snpCounter-refsnpCounter,multiallelicCounter) 455 | print "\t %d\tindels" %(indelCounter) 456 | 457 | print "\n\tFiltering report:" 458 | print "\t %d\tsites filtered out due to low quality (QUAL)" %(qualFail) 459 | print "\t %d\tsites filtered out because <%d individuals covered (-minInd)" %(minIndFail,args.minInd) 460 | if args.minIndPop>0: 461 | print "\t %d\tsites filtered out because <%d individuals covered per population-code (-minIndPop) " %(minIndPopFail,args.minIndPop) 462 | if args.e!=["!!!!!!!!!"]: 463 | print "\t \tfollowing population-codes not filtered out for min. 
number of individuals (-excluded):" 464 | print "\t \t "+", ".join(list(set(excluded))) 465 | print "\t %d\tSNPs filtered out due to <%d minor allele counts (-minAllele)" %(minAlleleFail,args.minAllele) 466 | print "\t %d\tSNPs converted to monomorphic sites due to filtered genotypes (GQ and/or DP)" %(SNPtomonoCount) 467 | if args.removeIndels: 468 | print "\t %d\tindels were filtered out" %(indelCounter) 469 | if args.removeIndelSNPs: 470 | print "\t %d\tvalid SNPs were filtered out due to within %d bp proximity to indels" %(nearIndelSNPFail,args.indelDist) 471 | if args.removeMono: 472 | print "\t %d\tvalid monomorphic sites were filtered out" %(monofiltered) 473 | if args.removeRefSNPs: 474 | print "\t %d\tvalid SNPs to the reference only (monomorphic within sample) were filtered out" %(refsnpfiltered) 475 | if args.removeMultiallelicSNPs: 476 | print "\t %d\tvalid multiallelic (>2 alleles) SNPs were filtered out" %(multiallelicfiltered) 477 | 478 | print "\n\tOutput file %s contains: \n\t %d\ttotal sites, consisting of:" %(args.o,printedBases) 479 | print "\t %d\tmonomorphic sites" %(printedBases-indelsretained-(snpsPassed-nearIndelSNPFail-refsnpfiltered-multiallelicfiltered)) 480 | print "\t %d\tSNPs (%d SNPs to the reference only and %d SNPs polymorphic in sample, incl. 
%d multiallelic SNPs)" %(snpsPassed-nearIndelSNPFail-refsnpfiltered-multiallelicfiltered,refsnpsPassed-refsnpfiltered,snpsPassed-refsnpsPassed-nearIndelSNPFail-multiallelicfiltered,multiallelicPassed-multiallelicfiltered) 481 | print "\t %d\tindels\n" %(indelsretained) 482 | 483 | inputF.close() 484 | outputF.close() 485 | -------------------------------------------------------------------------------- /ldPruning.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # script that prunes SNPs with too high linkage 4 | # requires vcftools and plink 5 | # Uses plink to remove SNPs with r2 above the threshold (default 0.1) in a window of 50 SNPs sliding by 10 SNPs 6 | 7 | #@author: Joana Meier 8 | #@date: December, 2016 9 | 10 | # Check if help is requested 11 | if [[ $1 = "-h" ]]; then 12 | echo -e "\nUsage: ldPruning.sh [optional: ]" 13 | exit 0 14 | fi 15 | 16 | # Read in the file name (remove potential file endings given) 17 | file=$1 18 | file=${file%.gz} 19 | file=${file%.vcf} 20 | 21 | # Set default values 22 | thresh=0.1 23 | format="vcf" 24 | 25 | # If the vcf file is zipped: 26 | gz1="" 27 | gz2="" 28 | if [[ $1 == *"vcf.gz" ]]; then 29 | gz1="gz" 30 | gz2=".gz" 31 | fi 32 | 33 | 34 | # Check if the provided file exists and contains data 35 | if [[ ! -s $file".vcf" ]] && [[ ! -s $file".vcf.gz" ]] 36 | then 37 | echo -e "\n"$1" not found or empty. 
Please provide a vcf file with data" 38 | echo -e "\nUsage: ldPruning.sh [optional: ]" 39 | exit 1 40 | fi 41 | 42 | 43 | # Check if more than one argument is provided which ones 44 | if (( "$#" > 1 )) 45 | then 46 | case $2 in 47 | 01) 48 | r="01" 49 | ;; 50 | vcf) 51 | format="vcf" 52 | ;; 53 | plink) 54 | format="plink" 55 | ;; 56 | *) 57 | thresh=$2 58 | ;; 59 | esac 60 | fi 61 | 62 | if (( "$#" > 2 )) 63 | then 64 | case $3 in 65 | 01) 66 | r="01" 67 | ;; 68 | vcf) 69 | format="vcf" 70 | ;; 71 | plink) 72 | format="plink" 73 | ;; 74 | *) 75 | thresh=$3 76 | ;; 77 | esac 78 | fi 79 | 80 | if (( "$#" > 3 )) 81 | then 82 | case $4 in 83 | 01) 84 | r="01" 85 | ;; 86 | vcf) 87 | format="vcf" 88 | ;; 89 | plink) 90 | format="plink" 91 | ;; 92 | *) 93 | thresh=$4 94 | ;; 95 | esac 96 | fi 97 | 98 | 99 | if (( "$#" > 4 )) 100 | then 101 | echo "Error: Too many arguments provided" 102 | echo -e "\nUsage: ldPruning.sh [optional: ]" 103 | exit 1 104 | fi 105 | 106 | # Let the user know, that it is working on it 107 | echo "working..." 
108 | 109 | # For running on the Euler cluster, load the required modules 110 | #module load plink/1.07 111 | #module load openblas/0.2.13_par 112 | #module load zlib/1.2.8 113 | #module load vcftools 114 | 115 | 116 | # Check which SNPs are in too high linkage and output a list of SNPs to be pruned out 117 | vcftools --${gz1}vcf ${file}.vcf${gz2} --plink --out ${file} 2> tmp 118 | sed -i 's/^0\t/1\t/g' ${file}.map 119 | plink --file $file --indep-pairwise 50 10 $thresh --out $file --noweb --silent 120 | 121 | sed -i 's/:/\t/g' ${file}.prune.in 122 | 123 | 124 | # Output pruned file either in vcf or plink format (if requested, 01 recoded) 125 | if [[ $format == "vcf" ]] 126 | then 127 | vcftools --${gz1}vcf ${file}.vcf${gz2} --out $file.pruning --positions $file.prune.in --stdout --recode | gzip > $file.LDpruned.vcf.gz 128 | else 129 | vcftools --${gz1}vcf ${file}.vcf${gz2} --positions $file.prune.in --out $file.LDpruned --plink 130 | if [[ $r == "01" ]] 131 | then 132 | plink --file $file.LDpruned --recode$r --out $file.LDpruned$r 133 | fi 134 | fi 135 | 136 | # Clean up unnecessary info file of vcftools which does not want to be silent at all 137 | rm tmp 138 | 139 | # Output info about number of pruned SNPs to the console 140 | echo "finished, new file "$file.LDpruned$r" filtered for LD in windows of 50 SNPs, shifting by 10 SNPs with LD threshold "$thresh 141 | echo `grep "After filtering" $file.log` 142 | 143 | -------------------------------------------------------------------------------- /makeRecombMap4fineSTRUCTURE.sh: -------------------------------------------------------------------------------- 1 | #!/cluster/apps/r/3.1.2_openblas/x86_64/bin/Rscript --vanilla 2 | 3 | # Written by Joana Meier, May-2017 4 | # usage: makeRecombMap4fineSTRUCTURE.sh 5 | # Generates a recomb file for fineSTRUCTURE with recombination rates in M/bp 6 | 7 | 8 | # Check if there were two arguments given (R syntax, as this file is run with Rscript) 9 | if(length(commandArgs(TRUE))<2){ 10 | cat("ERROR: Not enough arguments provided!\nUsage: makeRecombMap4fineSTRUCTURE.sh \n") 11 | quit(status=1) 12 | } 
13 | 14 | 15 | 16 | # Load the required packages 17 | require(data.table) 18 | require(gtools) 19 | 20 | # Read in the vcf file name and linkage map name (including path) from the command line argument 21 | args<-commandArgs(TRUE) 22 | file<-args[1] 23 | map<-args[2] 24 | prefix<-sub('\\.vcf.*', '', file) 25 | 26 | # If the file is gzipped, add info to gunzip 27 | if(grepl(file,pattern=".gz")) file=paste("gunzip -c ",file,sep="") 28 | 29 | # Read in the data (just the column with the positions) 30 | data<-fread(file,header=T,skip="#CHR",select=c(1,2),data.table=F) 31 | 32 | # add a column for recombination distances in Morgan as required by fineSTRUCTURE 33 | data<-cbind("chrom"=data[,1],"start.pos"=data[,2],"recom.rate.perbp"=rep(0.00,length(data[,1]))) 34 | data<-as.data.frame(data) 35 | data$start.pos<-as.integer(as.character(data$start.pos)) 36 | data$recom.rate.perbp<-as.double(data$recom.rate.perbp) 37 | 38 | # Read in the linkage map, i.e. 
table of physical vs recombination distance, tab-delimited 39 | recomb<-read.table(map,header=T,sep="\t") 40 | 41 | # For each chromosome separately, add the recombination distance 42 | # For unmapped scaffolds, use the empirical mean of 2 cM/Mb = 2e-8 M/bp 43 | 44 | chrom<-mixedsort(levels(as.factor(data[,1]))) 45 | add=0 46 | 47 | for(i in 1:length(chrom)){ 48 | 49 | # get the name of the ith chromosome/scaffold 50 | chr=chrom[i] 51 | 52 | # get the lines of the dataset of this specific chromosome 53 | dataChr<-data[data[,1]==chr,] 54 | n<-length(dataChr) 55 | 56 | # Get physical position and the local recombination rate, if no info in linkage map use mean of 2 cM/Mb = 2e-8 M/bp 57 | recChr<-recomb[recomb$CHROM==chr,] 58 | if(length(recChr$CHROM)>0){ 59 | d<-smooth.spline(recChr$POS,recChr$cM,spar=0.7) 60 | # predict the recombination rate for each position, divide by 100 to convert from cM to Morgan 61 | recRate<-stats:::predict.smooth.spline(d,as.integer(dataChr[,2]),deriv=1)$y/100 62 | recRate[recRate<0]<-0.000000000001 63 | } 64 | else{ 65 | recRate<-rep(0.00000002,times=length(dataChr[,1])) 66 | } 67 | recRate[length(recRate)]<-(-9.00) 68 | data[data[,1]==chr,3]<-recRate 69 | data[data[,1]==chr,2]<-data[data[,1]==chr,2]+add 70 | 71 | add=add+dataChr[length(dataChr[,1]),2] 72 | print(paste(chr," finished, starting at ",add,sep="")) 73 | } 74 | 75 | write.table(data[,2:3],paste(prefix,".recomb",sep=""),row.names=F,quote=F,sep=" ",col.names=T) 76 | -------------------------------------------------------------------------------- /makeRelateMap.r: -------------------------------------------------------------------------------- 1 | #!Rscript --vanilla 2 | 3 | # Written by Joana Meier, Oct 2019 4 | # usage: makeRelateMap.r 5 | # The linkage map needs to be tab-delimited and contain the following three columns (additional columns are no problem): CHROM, POS, cM 6 | # Generates a map file with the position, recomb rate, cM position as required by Relate 7 | 8 | # Load 
the required packages 9 | require(data.table) 10 | require(gtools) 11 | 12 | # Read in the vcf file name and linkage map name (including path) from the command line argument 13 | args<-commandArgs(TRUE) 14 | file<-args[1] 15 | map<-args[2] 16 | prefix<-sub('\\.vcf.*', '', file) 17 | 18 | # avoid scientific notation 19 | options(scipen=999) 20 | 21 | # If the file is gzipped, add info to gunzip 22 | if(grepl(file,pattern=".gz")) file=paste("gunzip -c ",file,sep="") 23 | 24 | # Read in the data (just the first 5 columns) 25 | data<-fread(file,header=T,skip="#CHR",select=c(1:5),data.table=F) 26 | 27 | # Extract the first five columns with position information and add a column for recombination distances 28 | data<-cbind(data,"rec"=vector(length = length(data[,1])),"cM"=vector(length = length(data[,1]))) 29 | 30 | # Read in the table of physical vs recombination distance 31 | recomb<-read.table(map,header=T,sep="\t") 32 | 33 | # For each chromosome separately, add the recombination distance 34 | # For unmapped scaffolds, use the empirical mean of 2 cM/Mb 35 | 36 | chrom<-mixedsort(levels(as.factor(data[,1]))) 37 | 38 | for(i in seq_along(chrom)){ 39 | 40 | # get the name of the ith chromosome/scaffold 41 | chr=chrom[i] 42 | 43 | # get the lines of the dataset of this specific chromosome 44 | dataChr<-data[data[,1]==chr,] 45 | 46 | # Get physical and recombination distances for that chromosome: 47 | recChr<-recomb[recomb$CHROM==chr,] 48 | if(length(recChr$CHROM)>0){ 49 | d<-smooth.spline(recChr$POS,recChr$cM,spar=0.7) 50 | recPos<-predict(d,dataChr[,2])$y 51 | recPos[recPos<0]<-0 52 | 53 | # Make the recombination distance monotonically increasing (j avoids reusing the outer loop variable i): 54 | for(j in seq_along(recPos)[-1]){ 55 | if(recPos[j]<=recPos[j-1]) recPos[j]=recPos[j-1]+0.000000001 56 | } 57 | # predict the recombination rate for each position 58 | recRate<-stats:::predict.smooth.spline(d,as.integer(dataChr[,2]),deriv=1)$y 59 | 60 | # Set 0 / negative recombination rates to tiny value 61 | 
recRate[recRate<=0]<-0.000000000001 62 | } 63 | else{ 64 | recPos<-dataChr$POS/0.5e6 65 | recRate<-rep(2,times=length(dataChr[,1])) 66 | } 67 | data[data[,1]==chr,length(names(data))-1]<-recPos 68 | data[data[,1]==chr,length(names(data))]<-recRate 69 | } 70 | 71 | toPrint<-cbind(data[,2],data[,length(names(data))],data[,length(names(data))-1]) 72 | 73 | write.table(toPrint,paste(prefix,".relate.map",sep=""),row.names=F,quote=F,sep=" ",col.names=F) 74 | 75 | 76 | -------------------------------------------------------------------------------- /removeNonsignDsuite.r: -------------------------------------------------------------------------------- 1 | #!~/bin/Rscript --vanilla 2 | 3 | # Usage: removeNonsignFbranch.r -i -z -o 4 | 5 | # Load libaries 6 | library(optparse) 7 | 8 | # Read input arguments 9 | option_list = list( 10 | make_option(c("-z", "--zthreshold"), type="double", default=3.0, 11 | help="z score threshold", metavar="double"), 12 | make_option(c("-i", "--infile"), type="character", default=NULL, 13 | help="file with fbranch scores with z scores produced by running Dsuite Fbranch with -Z True", metavar="character"), 14 | make_option(c("-o", "--outfile"), type="character", default=NULL, 15 | help="output file name", metavar="character") 16 | ) 17 | 18 | opt_parser = OptionParser(option_list=option_list); 19 | opt = parse_args(opt_parser); 20 | 21 | threshold<-opt$zthreshold 22 | 23 | # Get the line number where z scores start 24 | test<-read.table(file=opt$infile,sep=";",as.is=T,comment.char = ":") 25 | lines<-which(grepl(test[,1],pattern="#"))-2 26 | 27 | # Read in the fbranch values and z scores 28 | fbranch<-read.table(opt$infile,header=T,as.is=T,comment.char = "#",nrows=lines) 29 | zscore<-read.table(opt$infile,header=T,as.is=T,comment.char = "#",skip=lines+1) 30 | 31 | # Extract values entries 32 | fMatrix<-as.matrix(fbranch[-1,-c(1,2,3)]) 33 | zMatrix<-as.matrix(zscore[-1,-c(1,2,3)]) 34 | 35 | # Set fbranch values to zero if z is below threshold 36 
| fMatrix[apply(zMatrix,MARGIN=1:2,FUN = function(x) x" 3 | 4 | # This script runs bowtie2 on all files ending on .fastq in the current directory 5 | # This script assumes the usage of the standard naming scheme of the aquatic ecology and evolution group at EAWAG. 6 | # Users utilizing a different naming scheme must adjust the script 7 | 8 | # The alignment settings are default end-to-end alignment and 9 | # the .bam file contains only aligned sequences (option --no-unal) 10 | # Check bowtie2 manual for the explanation of the arguments and for parameter optimization 11 | # Check bowtie.log files after running the script 12 | # created by David Marques and Joana Meier, May 2014 13 | # Note runBowtie2RADID.sh now directly generates sorted and indexed bam files, samToBamSortIndex.sh is deprecated! 14 | 15 | # Load the modules (adjust depending on the cluster/server setup) 16 | module load bowtie2 samtools 17 | 18 | # This clause checks if there was a path to the bowtie2 index file given 19 | if [ $# -lt 1 ] 20 | then 21 | echo "Error: Please provide path to bowtie2 index like /cluster/project/gdc/ref-genome/someGenome (no suffix)" 22 | exit 1 23 | fi 24 | 25 | # This clause double-checks if the bowtie2-index is available where specified 26 | if [ -f $1.1.bt2 ] 27 | then 28 | echo "" 29 | else 30 | echo "Error: bowtie2-index not found, please add the correct file/path combination (or remove the file suffixes if given before)" 31 | exit 1 32 | fi 33 | 34 | # Loops over all files in the current directory with the ending GQI*.fastq 35 | for i in *GQI*.fastq 36 | do 37 | echo -e "\nbowtie2 aligning: "$i 38 | 39 | # These variables capture different informations from the RADID string 40 | # example: "12345.gasacuCHL1.GQI14.fastq" 41 | # f captures "12345.gasacuCHL1" and l captures "GQI14" 42 | f=$(echo $i | cut -f 1-2 -d ".") 43 | l=$(echo $i | cut -f 3 -d ".") 44 | ind=${i%.fastq} 45 | 46 | # The variable m captures the name of the sequencing machine 47 | # and the 
run-number (important for base-quality score recalibration) 48 | m=$(head -1 $i | cut -f 1-2 -d ":" | cut -b 2-) 49 | 50 | bowtie2 \ 51 | -x $1 \ 52 | -q $i \ 53 | --phred33 \ 54 | --end-to-end \ 55 | --no-unal \ 56 | -p 4 \ 57 | -N 1 \ 58 | --rg-id $f \ 59 | --rg "ID:"$f \ 60 | --rg "LB:"$l \ 61 | --rg "PG:bowtie2" \ 62 | --rg "PL:ILLUMINA" \ 63 | --rg "PU:"$m \ 64 | --rg "SM:"$f \ 65 | 2> $ind.bowtie2.log | samtools view -bS - > $ind.bowtie2.bam 66 | 67 | # sorting and indexing 68 | samtools sort -o $ind.bowtie2.bam -O bam -T $ind.tmp $ind.bowtie2.bam 69 | samtools index $ind.bowtie2.bam 70 | done 71 | -------------------------------------------------------------------------------- /vcf2fineRADstructure.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Shell script to convert the vcf file to an input format for fineRADstructure 4 | # Written by Joana Meier, 2018 5 | # Sorry, this is a very slow script as it extracts each RAD locus from the vcf file with vcftools 6 | 7 | # usage: vcf2fineRADstructure.sh 8 | 9 | # This script requires the createRADmappingReport.sh script which requires the folder with the bam files 10 | # It gets the RADloci present in at least 10 individuals with at least 10 reads each 11 | # It runs files of 10000 lines in parallel (increase the number of lines if it uses too many CPUs).
12 | 13 | # Define variables (modify as needed): 14 | prefix="X" # fineRADstructure does not allow sample names to start with a number, if they do, add X here 15 | minSites=10 # minimum number of sites that a RADlocus needs to contain to be considered 16 | # (if monomorphic sites are included, this number should be higher) 17 | minInds=10 # minimum number of individuals that must be sequenced for each RAD locus to be considered 18 | 19 | 20 | # get the vcf file name (without suffix) 21 | file=$1 22 | file=${file%.gz} 23 | file=${file%.vcf} 24 | bamfilesFolder=$2 25 | bamfilesFolder=${bamfilesFolder%/} 26 | 27 | # Extract the RADtags present in at least 10 inds (positions of RAD sites from mapping report) 28 | currentDir=`pwd` 29 | 30 | # If the mapping report does not exist yet in the directory containing the bam files, run createRADmappingReport 31 | if [ -s $bamfilesFolder/seq_depth_min10.txt ] 32 | then 33 | echo "Mapping report exists already. I will not regenerate it." 34 | else 35 | cd $bamfilesFolder; createRADmappingReport.sh; cd $currentDir 36 | fi 37 | 38 | # Extract RADloci with at least minInds individuals from seq_depth_min10.txt file 39 | cut -d" " -f 3,4 $bamfilesFolder/seq_depth_min10.txt | sort | uniq -c > RADpos.c 40 | awk -v minInds=$minInds '{if($1>=minInds) print $2,$3}' RADpos.c > RADpos 41 | sort -V RADpos > RADpos.sorted 42 | awk '{print $1" "$2-100" "$2"\n"$1" "$2" "$2+100}' RADpos.sorted | \ 43 | grep -v "locus" > RADloci 44 | 45 | # Delete temporary files 46 | rm RADpos RADpos.c RADpos.sorted 47 | 48 | 49 | # Get the number of individuals 50 | if [ -s $file.vcf.gz ] 51 | then 52 | nind=`zgrep ^#CH ${file}.vcf.gz | awk '{print NF-9}'` 53 | suff=".gz" # suffix 54 | vcfgz="gz" # for vcftools 55 | elif [ -s $file.vcf ] 56 | then 57 | nind=`grep ^#CH ${file}.vcf | awk '{print NF-9}'` 58 | suff="";vcfgz="" 59 | else 60 | echo -e "Error: file $file.vcf[.gz] not found!\nexiting..." 
61 | exit 1 62 | fi 63 | 64 | # Split the RADloci file into files of 10000 lines for parallelisation 65 | split -l 10000 RADloci RADloci. 66 | 67 | # Function to generate the $RADlocifile.file 68 | function generateFile { 69 | a=0 70 | while read i 71 | do 72 | # increment the index 73 | a=$((a+1)) 74 | 75 | # get the SNPs of the RADlocus 76 | vcftools --${vcfgz}vcf $file.vcf$suff --chr `echo $i | cut -d" " -f1` --plink-tped \ 77 | --from-bp `echo $i | cut -d" " -f2` --to-bp `echo $i | cut -d" " -f3` \ 78 | --out ${RADlocifile}.$a 79 | 80 | # if the RADlocus contains SNPs: append the info in the correct format to the ${RADlocifile}.file 81 | if [[ -s ${RADlocifile}.$a.tped ]] 82 | then 83 | echo -e "locus"$a"\t"`head -1 ${RADlocifile}.$a.tped | cut -f 1,2`"\t" | tr -d '\n' >> ${RADlocifile}.file 84 | for((c=5;c<($nind*2+5);c+=2)) 85 | do 86 | echo -e `cut -f $c ${RADlocifile}.$a.tped | \ 87 | tr -d '\n'`"/"`cut -f $((c+1)) ${RADlocifile}.$a.tped | \ 88 | tr -d '\n'`"\t" | tr -d '\n' >> ${RADlocifile}.file 89 | done 90 | echo "" >> ${RADlocifile}.file 91 | fi 92 | 93 | # remove temporary files 94 | rm ${RADlocifile}.$a.log ${RADlocifile}.$a.tfam ${RADlocifile}.$a.tped 95 | 96 | # Read in the RADlocifile given as argument when running the function 97 | done < $1 98 | 99 | } 100 | 101 | 102 | # run each file separately (in parallel) with the function above 103 | for RADlocifile in RADloci.* 104 | do 105 | generateFile $RADlocifile & 106 | done 107 | 108 | wait 109 | 110 | # Compute missing data for each individual (can later also be used to filter out bad individuals) 111 | vcftools --missing-indv --${vcfgz}vcf $file.vcf$suff --out $file 112 | 113 | # Get the number of individuals in the file: 114 | nind=`grep -v INDV $file.imiss -c` 115 | 116 | # Generate the final input file for fineRADstructure: 117 | # Note: If samples start with a number, add X to the beginning of each individual name by specifying prefix above 118 | awk -v prefix=$prefix 'BEGIN{printf "Chr\t"} 
!/INDV/ {printf prefix$1"\t"}END{print ""}' $file.imiss > ${file}_fineRADstructure 119 | 120 | # Add the RADloci data (only RADtags longer than $minSites) 121 | # Collapse the two haplotypes of an individual if they are identical 122 | # Remove both haplotypes if they contain missing data 123 | for RADlociFile in RADloci*file 124 | do 125 | cut -f 2-`echo ${nind}+2 | bc` $RADlociFile | \ 126 | awk '{split($2,cont,"[:]"); $1=""; $2=cont[1]; print $0}' | \ 127 | awk -v minSites=$minSites '{if(length($4)>=(minSites*2+1)) { 128 | for(i=2; i <= NF; i++){ 129 | split($i,genot,"/"); 130 | if(genot[1]==genot[2]) $i=genot[1] 131 | if($i~/0/) $i=""} print $0 132 | } 133 | }' | sed 's/ /\t/g' >> ${file}_fineRADstructure 134 | done 135 | 136 | # Replace blanks by tabs (fineRADstructure prefers that) 137 | sed -i 's/ /\t/g' ${file}_fineRADstructure 138 | -------------------------------------------------------------------------------- /vcf2fineSTR.lsf: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #BSUB -J "fineSTR[3-22]%10" 4 | #BSUB -R "rusage[mem=1000]" 5 | #BSUB -o log.%J.%I 6 | #BSUB -e err.%J.%I 7 | #BSUB -n 1 8 | #BSUB -W 24:00 9 | 10 | i=$LSB_JOBINDEX 11 | 12 | file=mattsSubset.minDP5.max35dp.maxHet1.54e-9.biallSNPs.Excl0.001p.excl0.2r.max0.25N.phased 13 | 14 | module load r 15 | 16 | # Extract the chromosome (if not already done), if vcf is not gzipped, adjust this part 17 | vcftools15 --gzvcf $file.vcf.gz --chr chr$i --recode --stdout | gzip > $file.chr$i.vcf.gz 18 | file=$file.chr$i 19 | 20 | # Make the recombination file 21 | makeRecombMap4fineSTRUCTURE.sh $file.vcf.gz puncross.gapsEstimated.lifted.pruned.bed 22 | 23 | # Start the chr finestructure phase with the number of haploid inds, sites and the positions of the sites 24 | # first line=number of haploid inds, second line=number of sites, third line=P SNPs 25 | 26 | # add number of individuals 27 | hnind=`zgrep "^#CH" $file.vcf.gz | awk '{print 
(NF-9)*2}'` 28 | echo $hnind > $file.fineSTRphase 29 | 30 | # add number of sites 31 | echo `grep -v "^#" <(zcat $file.vcf.gz) -c` >> $file.fineSTRphase 32 | 33 | # add sites 34 | echo "P " | tr -d '\n' >> $file.fineSTRphase 35 | cut -d" " -f 1 $file.recomb | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/start.pos //g' >> $file.fineSTRphase 36 | 37 | # Add the genotype information for each individual: 38 | dnind=`echo $hnind / 2 | bc` 39 | convertIndsFineSTR.sh 1 $dnind $file 1 40 | 41 | # add the data to the chr finestructure phase file 42 | cat $file.fineSTRphase.1 >> $file.fineSTRphase 43 | rm $file.fineSTRphase.1 44 | -------------------------------------------------------------------------------- /vcf2phylip.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | # python version 3 4 | # by Joana, script to convert vcf to phylip 5 | # Some functions are recycled from the RAD python script written by Sam Wittwer 6 | 7 | import sys, getpass, re, argparse, gzip 8 | 9 | # Define the necessary functions: 10 | 11 | # Function to extract header info: extractHeaderInfo 12 | # takes a vcf(.gz) file from line 1, goes through header, returns: 13 | # 0[string] header 14 | # 1[list[string]] individual IDs 15 | # 2[int] number of individuals in vcf file 16 | # 3[int] length of header 17 | def extractHeaderInfo(input): 18 | linecounter = 0 19 | header = "" 20 | for line in input: 21 | linecounter += 1 22 | if re.match(r"^##", line): 23 | header+=line 24 | else: 25 | header+=line 26 | n_individuals = len(str(line.strip('\n')).split('\t'))-9 27 | id_individuals = str(line.strip('\n')).split('\t')[9:] 28 | break 29 | return header, id_individuals, n_individuals, linecounter 30 | 31 | # Function to make all sample names of equal length (could be simplified) 32 | # FillUp: accepts list, fills up list with optional character (default is " ") until they are of equal length 33 | def fillUp(list, fill = " "): # Checks length of 
entries in a list, fills up with specified fill string. 34 | returnlist = [] 35 | for entry in list: 36 | if len(entry) < len(max(list, key=len)): 37 | returnlist.append(entry+(len(max(list, key=len))-len(entry))*fill) 38 | else: 39 | returnlist.append(entry) 40 | return returnlist 41 | 42 | 43 | # Function to write the lines in phylip format 44 | def writePhylipSequences(samplenames, sequences, outputdestination, writeref): 45 | if writeref: 46 | beginning = 0 47 | nsamples = str(len(samplenames)) 48 | else: 49 | beginning = 1 50 | nsamples = str(len(samplenames)-1) 51 | nbases = str(len(sequences[0])) 52 | outputdestination.write(nsamples +"\t"+nbases+"\n") 53 | outstring = "" 54 | for i in range(beginning, len(samplenames)): 55 | outstring += samplenames[i]+"".join(sequences[i]) 56 | outstring +="\n" 57 | outputdestination.write(outstring.strip("\n")) 58 | 59 | 60 | # Parse the arguments provided 61 | parser = argparse.ArgumentParser(description='Convert vcf file to phylip file format') 62 | 63 | parser.add_argument('-i', '--input', dest='i', help="input file in vcf(.gz) format [required]", required=True) 64 | parser.add_argument('-o', '--output', dest='o', help="output file [required]", required=True) 65 | parser.add_argument('-r', '--ref', action='store_true', help="if -r is specified, the reference sequence will be included in the phylip file", default=False) 66 | parser.add_argument('-f', '--fill', action='store_true', help="if -f is specified, all sites not in the vcf file will be printed as missing (N)", default=False) 67 | parser.add_argument('-e', '--exclIndels', action='store_true', help="if -e is specified, indels are not printed (else replaced by N)", default=False) 68 | parser.add_argument('-m', '--mtDNA', action='store_true', help="if -m is specified, haploid genotype calls are expected in the vcf", default=False) 69 | 70 | args = parser.parse_args() 71 | 72 | # Set the default values: 73 | if args.i.endswith('.gz'): 74 | input = 
gzip.open(args.i,'rt') 75 | else: 76 | input = open(args.i,'r') 77 | output = open(args.o,'w') 78 | writeref = args.ref 79 | fill = args.fill 80 | noIndels = args.exclIndels 81 | haploid = args.mtDNA 82 | prev=100000000000000000000000000000000000 83 | 84 | # How to convert vcf-style genotypes to single letters 85 | AmbiguityMatrix = [["A","M","R","W","N"],["M","C","S","Y","N"],["R","S","G","K","N"],["W","Y","K","T","N"],["N","N","N","N","N","N"]] 86 | CoordinatesDictionary = { "A":0 , "C":1, "G":2, "T":3, ".":4 } 87 | GetGenotype = lambda individual, altlist: AmbiguityMatrix[CoordinatesDictionary[altlist[int(individual[0])]]][CoordinatesDictionary[altlist[int(individual[1])]]] 88 | 89 | # If -f and -e are specified 90 | if noIndels and fill: 91 | print("-f and -e are incompatible! Please decide if you want all sites or not") 92 | sys.exit(2) 93 | 94 | # Get the header info 95 | headerinfo = extractHeaderInfo(input) #skips header, retains IDs in headerinfo 96 | 97 | # Get the sample labels 98 | IDs = [] #IDs holds all sample names without equal spacing 99 | IDs.append("reference ") 100 | resultsequences = [] #resultsequences holds all complete sequences to be written 101 | resultsequences.append([]) #resultsequences[0] holds reference 102 | for entry in headerinfo[1]: 103 | IDs.append(entry.replace("-",".")+" ") 104 | resultsequences.append([]) 105 | samplenames = fillUp(IDs) 106 | 107 | linecounter = 0 108 | print("\ngenerating phylip file with ",len(samplenames)-1," individuals") 109 | 110 | # Go through the lines to get the genotypes 111 | for line in input: 112 | site = line.strip('\n').split('\t') 113 | indel = False 114 | pos = int(site[1]) 115 | 116 | # If missing positions should be filled up with Ns (-f specified, e.g. 
sites of low quality that were filtered out) 117 | if pos > (prev + 1) and fill: 118 | addLine=pos-(prev+1) 119 | individualcounter = 1 120 | for individual in site[9:]: 121 | resultsequences[individualcounter]+= "N" * addLine 122 | individualcounter += 1 123 | linecounter += addLine 124 | resultsequences[0] += "N" * addLine # reference 125 | 126 | # site contains a deletion, replace by missing data 127 | if len(site[3])>1: 128 | indel=True 129 | 130 | else: 131 | alleles = site[4].split(",") # if there are more than 1 alternative alleles, they are separated by commas 132 | for alt in alleles: 133 | if len(alt)>1 or '*' in alt: # in case of an insertion 134 | indel=True 135 | break 136 | 137 | alternativeslist = [] 138 | alternativeslist.append(site[3]) 139 | if site[4] != ".": 140 | for entry in site[4].split(","): 141 | alternativeslist.append(entry) 142 | individualcounter = 1 143 | 144 | # If the site is not an indel, i.e. if it is a SNP or monomorphic site 145 | if not indel: 146 | for individual in site[9:]: 147 | indGeno = individual.split(":") 148 | if haploid: 149 | if '.' in indGeno[0]: 150 | resultsequences[individualcounter]+="N" 151 | else: 152 | resultsequences[individualcounter] += alternativeslist[int(indGeno[0])] 153 | else: 154 | if '.' 
in indGeno[0]: 155 | resultsequences[individualcounter]+="N" 156 | else: 157 | resultsequences[individualcounter] += GetGenotype(re.split(r'/|\|',indGeno[0]), alternativeslist) 158 | individualcounter += 1 159 | linecounter += 1 160 | resultsequences[0] += site[3] 161 | 162 | 163 | # If the site is an indel and noIndels is not specified, print as missing data (else not printed) 164 | elif not noIndels: 165 | for individual in site[9:]: 166 | resultsequences[individualcounter]+= "N" 167 | individualcounter += 1 168 | linecounter += 1 169 | resultsequences[0] += site[3][:1] # reference 170 | 171 | prev=int(site[1]) 172 | 173 | input.close() 174 | 175 | writePhylipSequences(samplenames, resultsequences, output, writeref) 176 | output.write("\n") 177 | output.close() 178 | 179 | 180 | --------------------------------------------------------------------------------
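The genotype-to-letter step in vcf2phylip.py (AmbiguityMatrix, CoordinatesDictionary and GetGenotype) can be hard to follow from the nested indexing alone. The sketch below illustrates the same idea on its own: a diploid GT field is mapped to a single IUPAC letter, with heterozygotes encoded as ambiguity codes and missing calls as N. It is not the script's implementation; the names `IUPAC` and `genotype_to_iupac` are hypothetical, and a set-based lookup is swapped in for the matrix indexing used in the script.

```python
import re

# Hypothetical lookup table: unordered base pair -> IUPAC code.
# The codes match the entries of AmbiguityMatrix in vcf2phylip.py.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def genotype_to_iupac(sample_field, ref, alts):
    """Return one IUPAC letter for a diploid VCF sample field.

    sample_field: e.g. "0/1:12:99" (only the GT part before ':' is used)
    ref:          reference base, e.g. "A"
    alts:         list of alternative bases, e.g. ["G"]
    """
    gt = sample_field.split(":")[0]
    if "." in gt:                      # missing call -> N
        return "N"
    alleles = [ref] + alts             # index 0 = REF, 1.. = ALT alleles
    # Accept both unphased (/) and phased (|) separators
    bases = frozenset(alleles[int(idx)] for idx in re.split(r"[/|]", gt))
    return IUPAC.get(bases, "N")       # anything unexpected -> N


if __name__ == "__main__":
    print(genotype_to_iupac("0/1:12", "A", ["G"]))   # heterozygous A/G
    print(genotype_to_iupac("1|1", "A", ["G"]))      # homozygous alternative
    print(genotype_to_iupac("./.", "A", ["G"]))      # missing genotype
```

Because the lookup key is an unordered set, 0/1 and 1/0 give the same letter, which mirrors the symmetry of the matrix in the script.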